VMworld 2012 Live Blog of INF-BCO2155 vCloud DR for Oxford University Computing Services – Real World Example

This is the live blog of VMworld 2012 session INF-BCO2155 – “vCloud DR for Oxford University Computing Services.” You’ll find my recap of the session below.

The presenters for this session are:

  • Aidan Dalgleish – VMware, Inc. Consulting Architect and a fellow VCDX :)
  • Gary Blake – VMware, Inc. Senior Consultant
  • Adrian Parks – Oxford University Computing Services Senior Systems Administrator

This session will discuss disaster recovery for provider vDCs in vCloud Director. The solution becomes more complex in the absence of L2-stretched networks (clusters). vCloud Director works with managed objects, so it is absolutely essential that everything in the infrastructure be restored in its current state. Any rebuilds will break the managed object relationships.

Adrian Parks will go over the challenges and requirements for Oxford Computing Services.

  • Over 300 independent units that each do their own thing
  • Devolved IT infrastructure
  • No central mandate

The University now operates a private cloud, offering Infrastructure As A Service to those 300 “independent units.”

Goals for the project:

  • Primary – Protect “Shared Datacenter Virtual Infrastructure Environment” from site failure
  • Secondary – Additional application availability
  • Secondary – Active / Active failover design for two Resource Clusters
  • Secondary – Automate the vCloud Director Resource Cluster(s) recovery
  • Secondary – Prioritize Organizational Virtual Datacenters during recovery
  • Secondary – Honor boot priorities within vApps during recovery

Workload Categories:

  • Virtual Datacenter Service
  • Hosted Virtual Machine Service
  • DaaS (Database as a Service)

Methodology:

  • Reorganize the clusters to provide a management cluster
  • Controlled recovery order:
    • Priority 1 VMs
      • DNS
      • vCenter / VUM DB
      • vCloud Director DB
    • Priority 2 VMs
      • vCenter
      • vShield Manager
      • vCloud Director Cell
    • Priority 3 VMs
      • Chargeback
  • Pauses and processes throughout SRM recovery plan
  • PowerShell (PowerCLI – Awesome!) was used to implement checks (e.g. Test Windows / Linux Services)
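
To make the controlled recovery order a bit more concrete, here is a minimal sketch of the idea in Python. This is not the presenters' code (they used SRM recovery plans and PowerCLI); the `power_on` and `is_healthy` callbacks are hypothetical stand-ins for the real power-on steps and service checks, and the tier contents are simply the names from the list above.

```python
from typing import Callable, Dict, List

# Priority tiers as listed in the session; the strings are labels only.
RECOVERY_TIERS: Dict[int, List[str]] = {
    1: ["DNS", "vCenter / VUM DB", "vCloud Director DB"],
    2: ["vCenter", "vShield Manager", "vCloud Director Cell"],
    3: ["Chargeback"],
}


def recover_in_order(
    tiers: Dict[int, List[str]],
    power_on: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Power on each tier in ascending priority and stop if a check fails.

    power_on and is_healthy are hypothetical callbacks standing in for
    the real recovery steps and the scripted service checks.
    """
    recovered: List[str] = []
    for priority in sorted(tiers):
        for vm in tiers[priority]:
            power_on(vm)
        # Pause point: every VM in this tier must pass its check
        # before the next tier is started.
        failed = [vm for vm in tiers[priority] if not is_healthy(vm)]
        if failed:
            raise RuntimeError(
                f"Priority {priority} check failed for: {failed}"
            )
        recovered.extend(tiers[priority])
    return recovered
```

The point of the pause between tiers is that a Priority 2 VM such as vCenter is of little use until its database in Priority 1 is actually answering, so the sketch refuses to move on until the checks pass.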

Infrastructure (2 Clusters)

  • Stretched vSphere Cluster across primary (USDC) and DR site (OUCS)
  • Workloads run in USDC with hosts in OUCS put in maintenance mode
  • Campus Resource Cluster is opposite (OUCS – primary, USDC – DR) with hosts at DR site in maintenance mode.

Failover Process

  • Break replication
  • Present LUNs to recovery hosts and force mount (cannot resignature)
  • PowerCLI was used to perform many of these functions
  • Remove recovery hosts from maintenance mode, setting HA to disabled with PowerCLI
  • Power on recovered workloads utilizing vSphere and vCloud API
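
As a rough illustration of how a failover like this can be sequenced, here is a small Python sketch that runs named steps in order and stops at the first failure. It is only an outline under my own assumptions: the step actions are placeholders that record that they ran, whereas the real work was done with PowerCLI and the vSphere and vCloud APIs, and `run_failover` and `build_steps` are names I made up for the example.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order; return the names completed, abort on first error."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"Failover stopped at step '{name}' after {completed}"
            ) from exc
        completed.append(name)
    return completed


def build_steps(log: List[str]) -> List[Step]:
    """Build placeholder steps mirroring the failover process above.

    Each action only appends its name to `log`; a real action would call
    the storage array, vSphere, or vCloud Director instead.
    """
    names = [
        "Break replication",
        "Present LUNs to recovery hosts and force mount",
        "Remove recovery hosts from maintenance mode",
        "Disable HA",
        "Power on recovered workloads",
    ]
    # Bind each name as a default argument so every lambda keeps its own.
    return [(n, (lambda n=n: log.append(n))) for n in names]
```

Calling `run_failover(build_steps(log))` with an empty `log` list runs the five placeholders in the order shown. Stopping at the first error matters here because later steps (powering on workloads) make no sense if the LUNs never mounted.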

The next few minutes are dedicated to going into the functions in detail. I won’t recap them here as they are pretty in depth and I likely won’t be able to keep up :). I will say that this is an excellent testament to the automation capabilities of PowerCLI and the vSphere and vCloud APIs.

Limitations of the solution:

  • No support for SRM test mode
  • Storage failover is manual in the event of a Failover (Recovery Mode)
  • Cannot currently identify previous running state of the vApps at the time of failure

These guys have come up with a pretty slick solution to recovering a vCloud Director infrastructure by digging in and automating certain aspects of vSphere. The automation tools (PowerCLI and other APIs) can take care of everything from mounting disks and registering VMs to disabling the “copy or move” question when the VMs are registered on the recovery side.
