Will Your Virtual Infrastructure Pass Its Health Check?

leading to performance concerns. At the operations level the ease and speed with which new applications can be deployed has resulted in many organisations resolving the issues of ‘server sprawl’, only to be faced with the new problem of ‘Virtual Machine sprawl’.
Listed below are 10 considerations for Virtualisation Best Practice:

1. Standardise
The main benefits of standardising across all aspects of the Virtual Infrastructure are ease of management and troubleshooting. This includes software revisions, hardware configurations, server build standards, naming conventions, and storage and network configuration. Management is easier because all components are interchangeable and of a known configuration; in addition, root-cause analysis is easier when the number of variables is kept to a minimum. Be aware that hosts with incompatible CPU types or stepping families can prevent VMware VMotion from working correctly.
Standards should be defined and documented during the planning process and subsequently adhered to during deployment. Proposed changes to the environment should be reviewed, agreed and documented in an enforced ‘Change Control Procedure’.
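
As an illustration of the CPU compatibility point, the following Python sketch groups hosts by CPU type and flags any that differ from the majority. The host names, the CPU labels and the idea of keeping the inventory in a dictionary are all hypothetical; in practice the data would come from your own inventory records.

```python
from collections import defaultdict

# Hypothetical inventory: host name -> (CPU vendor, CPU family/stepping label).
hosts = {
    "esx01": ("vendor-a", "family-1"),
    "esx02": ("vendor-a", "family-1"),
    "esx03": ("vendor-b", "family-7"),
}

def group_by_cpu(inventory):
    """Group host names by their (vendor, family) tuple."""
    groups = defaultdict(list)
    for name, cpu in sorted(inventory.items()):
        groups[cpu].append(name)
    return dict(groups)

def non_standard_hosts(inventory):
    """Return the hosts whose CPU type differs from the most common one."""
    groups = group_by_cpu(inventory)
    majority = max(groups, key=lambda cpu: len(groups[cpu]))
    return sorted(
        name
        for cpu, names in groups.items()
        if cpu != majority
        for name in names
    )

print(non_standard_hosts(hosts))  # ['esx03']
```

A host reported by such a check is a candidate for review before it is added to a cluster in which VMotion is expected to work.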

2. Optimise the Network
The network is crucial to the performance and resilience of the Virtual Infrastructure: in addition to carrying end-user traffic, it is the primary means by which the Virtual Infrastructure is managed (through VirtualCenter) and the means of fault tolerance (using VMotion). For many organisations the network is also the method by which they connect to their storage. VMware recommends a minimum of four Gigabit network adapters per ESX 3.x host: two attached to a vSwitch for the management network (service console, VMkernel, and VMotion), and two attached to a vSwitch for the VM network to support the virtual machines. In practice further segmentation is recommended. Whilst placing multiple NICs in a single vSwitch provides NIC redundancy and failover, placing all NICs on the same vSwitch restricts network segmentation, potentially leading to performance bottlenecks. An optimal balance therefore needs to be struck between network redundancy and traffic segmentation.
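
The minimum layout described above can be expressed as a simple check. This is a sketch only: the dictionary format, the `vmnic` names and the function name are assumptions made for illustration and are not a VMware API.

```python
# Hypothetical description of one host: vSwitch name -> list of physical NICs.
host_vswitches = {
    "vSwitch0": ["vmnic0", "vmnic1"],  # management: service console, VMkernel, VMotion
    "vSwitch1": ["vmnic2"],            # VM network
}

MIN_NICS_PER_HOST = 4     # minimum quoted in the text for an ESX 3.x host
MIN_NICS_PER_VSWITCH = 2  # two uplinks give NIC redundancy and failover

def check_network(vswitches):
    """Return a list of human-readable problems; an empty list means the layout passes."""
    problems = []
    total = sum(len(nics) for nics in vswitches.values())
    if total < MIN_NICS_PER_HOST:
        problems.append(f"host has {total} NICs, expected at least {MIN_NICS_PER_HOST}")
    for name, nics in sorted(vswitches.items()):
        if len(nics) < MIN_NICS_PER_VSWITCH:
            problems.append(f"{name} has no redundant uplink")
    return problems

for problem in check_network(host_vswitches):
    print(problem)
```

The check covers redundancy only; whether traffic is segmented sufficiently remains a design judgement for each environment.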

3. Optimise the Storage Configuration
Optimisation of the storage environment will depend upon the storage platform / protocols being used. All Virtual Hosts should be configured with multiple paths to the storage, to allow for failover in the event that an active path fails. ESX includes native multi-pathing support at the virtualisation layer. Multi-pathing allows an ESX host to maintain a constant connection between the host and a storage device in case of failure of a host bus adapter (HBA), switch, storage controller, storage processor, or a Fibre Channel/iSCSI network connection. All ESX hosts belonging to the same VMware DRS or VMware HA cluster (for VI3), and both end points of a VMotion migration, need to have access to the same shared storage.

SAN LUNs should be properly zoned so that each host can see the shared storage. If zoning is done improperly such that a host cannot see certain shared LUNs, this can cause problems with VMotion, VMware DRS and VMware HA (VI3). In order to improve performance and avoid the potential for storage access contention issues, LUNs should be zoned only to the hosts that need them.
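
A zoning problem of this kind can be spotted by comparing the LUNs visible to each host in a cluster. The sketch below assumes the visibility data has already been collected into a dictionary; the host names and LUN identifiers are hypothetical.

```python
# Hypothetical view of which LUN identifiers each host in a cluster can see.
visible_luns = {
    "esx01": {"lun-10", "lun-11", "lun-12"},
    "esx02": {"lun-10", "lun-11", "lun-12"},
    "esx03": {"lun-10", "lun-12"},
}

def missing_luns(visibility):
    """Map each host to the cluster LUNs it cannot see (hosts that see all are omitted)."""
    all_luns = set().union(*visibility.values())
    return {
        host: sorted(all_luns - luns)
        for host, luns in visibility.items()
        if all_luns - luns
    }

print(missing_luns(visible_luns))  # {'esx03': ['lun-11']}
```

Any host listed in the result could not take part in a VMotion, DRS or HA operation involving a Virtual Machine stored on the missing LUN.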

In cases where multiple Guest OSes need to be connected to an iSCSI SAN it may be preferable to use the software initiator built into ESX. Using a single iSCSI initiator at the host level may improve performance over multiple aggregated initiators at the Guest level.

4. Allocate Sufficient Storage Capacity for Snapshots
Snapshots allow point-in-time copies of Virtual Machines to be taken, which can subsequently be used for testing and/or recovery purposes. A snapshot consists of block-level deltas from the previous disk state: a base disk plus copy-on-write (COW) files that record the changes as a bitmap of all changed blocks on the base disk. Whilst snapshots can be very useful, care should be taken not to use too many VMware-based snapshots, as they consume a considerable amount of additional disk space. VMware recommends planning on providing at least 15-20% of free space for snapshots. Alternatively it may be preferable to use storage-based snapshots, which only consume capacity on incremental writes.
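
The free-space guideline translates into simple arithmetic. The sketch below assumes the percentage is applied to the total capacity of a datastore, which is one reasonable reading of the recommendation; the function name and the 500 GB figure are illustrative.

```python
def max_vm_allocation_gb(capacity_gb, snapshot_percent):
    """Capacity left for virtual disks after holding back free space for snapshots."""
    if not 0 <= snapshot_percent <= 100:
        raise ValueError("snapshot_percent must be between 0 and 100")
    reserved_gb = capacity_gb * snapshot_percent / 100
    return capacity_gb - reserved_gb

# A 500 GB datastore with 20% held back leaves 400 GB for virtual disks.
print(max_vm_allocation_gb(500, 20))  # 400.0
print(max_vm_allocation_gb(500, 15))  # 425.0
```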

5. Security
The security of the Virtual Infrastructure can be increased by restricting access to the ‘root’ user. The ‘root’ account can change any configuration setting within an ESX host, making it difficult to manage and audit the changes made. Remote access using the ‘root’ account should be disabled; instead users should log in remotely as a regular user in order to maintain an audit trail of user access, raising their access level to ‘root’ privileges if required.

VirtualCenter also has a number of ‘roles’ that can be assigned to users to refine the granularity of the security privileges assigned to individual users. In order to tighten security on the management network, close down TCP ports on the service console other than those used by ESX and VirtualCenter. Use secure shell (ssh) and secure copy (scp) for access and to transfer files to and from the service console rather than through lower security methods (telnet and ftp).

Increase the security of packets travelling over the network by using ‘VLAN tagging’ to segment the traffic that shares a physical NIC. VMware ESX supports IEEE 802.1Q VLAN tagging to take advantage of virtual LAN networks. VLAN tagging has little impact on performance and enables VMs to be more secure, since the network packets a VM sees are limited to those on its own VLAN. VLAN tagging can also reduce the number of physical NICs needed to support a larger number of network segments. VLANs group network ports logically, so that they behave as if they were on the same physical network, separated from other networks.
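
To show what the tag itself looks like, the sketch below builds the 4-byte IEEE 802.1Q tag that is inserted into an Ethernet frame: a 16-bit Tag Protocol Identifier (0x8100) followed by 16 bits of control information holding a 3-bit priority, a 1-bit drop-eligible flag and the 12-bit VLAN ID. The function names are illustrative; ESX and the physical switches perform this tagging themselves, so this is purely explanatory.

```python
import struct

TPID_8021Q = 0x8100  # Tag Protocol Identifier for IEEE 802.1Q

def build_vlan_tag(vlan_id, priority=0, dei=0):
    """Return the 4-byte 802.1Q tag: 16-bit TPID followed by the 16-bit TCI."""
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= priority <= 7:
        raise ValueError("priority must fit in 3 bits")
    if dei not in (0, 1):
        raise ValueError("DEI is a single bit")
    tci = (priority << 13) | (dei << 12) | vlan_id
    return struct.pack("!HH", TPID_8021Q, tci)

def parse_vlan_id(tag):
    """Extract the VLAN ID from a 4-byte 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return tci & 0x0FFF

tag = build_vlan_tag(100)
print(tag.hex())           # 81000064
print(parse_vlan_id(tag))  # 100
```

Because the VLAN ID travels with every frame, several logical networks can share one physical NIC while remaining separated from one another.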

6. Define a Standard Virtual Machine Provisioning Process
Have standard guidelines and procedures in place in order to control the Virtual Machine provisioning process. Defining guidelines for sizing Virtual Machines in terms of number of virtual CPUs and amount of RAM, based upon the Operating System and application workload, eases deployment and makes resource utilisation and forward capacity planning more predictable, i.e. it helps administrators to ensure that there are sufficient resources to meet the required workloads. Requests that exceed standard guidelines should be handled as exception cases requiring necessary approvals.

Virtual Machines should be defined based upon their anticipated actual requirements for CPU and RAM, not upon the resources available to them in the physical environment, which are often unused and wasted. ESX performs best when running Virtual Machines are configured with a single virtual CPU; Virtual Machines with two or four virtual CPUs (Virtual SMP) should only be used when necessary. Simply giving all virtual machines access to two or four virtual CPUs at a time on an ESX host will likely waste resources, without any demonstrable performance benefit. The reason is that very few applications actually require multiple CPUs, and many virtual machines can run fine with a single virtual CPU.

If the applications used within the virtual machine are not multithreaded and capable of taking advantage of the second CPU, having the extra virtual CPU does not provide any increase in performance. The ESX scheduler reserves two or four CPUs (cores) concurrently to run Virtual SMP virtual machines. If a dual CPU virtual machine could run fine as a single CPU virtual machine, consider that every time that virtual machine is running, a CPU is wasted and another single CPU virtual machine can be prevented from running.

Virtual machines should be sized appropriately for RAM. It is tempting with ESX to assign extra RAM to a virtual machine on the basis that, if it doesn’t need the additional RAM, the ESX host will share that RAM or force the virtual machine to give some up temporarily through the balloon driver. Unfortunately, the guest OS is likely to slowly fill that RAM with obsolete pages simply because it has the room. If all guests on an ESX host are sized this way they could continually swap out “unneeded” RAM with each other. Likewise, avoid deliberately starving a VM of RAM by giving it less than it needs in the hope of benefiting from ESX’s identical memory page sharing. RAM starvation can lead to poor VM Guest performance.

Consistent guidelines for sizing virtual disks based on Operating System and application workload type can help manage free disk space and make disk usage more predictable. Requests that exceed standard guidelines can be handled as exception cases requiring necessary approvals.

To save space, avoid creating virtual disks that are much larger than needed by the Guest. A virtual disk can be expanded after its initial creation (although a tool within the Guest is necessary to recognize the additional space) but shrinking a virtual disk is not supported. Sizing virtual disks properly helps conserve storage space.

Virtual machines should have by default a single virtual NIC. Having a second virtual NIC does not result in any gains unless the second virtual NIC is attached to a second vSwitch to provide redundancy at the vSwitch and physical adapter level.
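
The sizing guidelines described in this section lend themselves to an automated check of provisioning requests. The following sketch is one possible approach; the workload types, the limits in `STANDARDS` and the function name are hypothetical values chosen for illustration, and real figures will be specific to each organisation.

```python
# Hypothetical sizing standards per workload type; real values are site-specific.
STANDARDS = {
    "web":      {"vcpus": 1, "ram_mb": 1024, "disk_gb": 20},
    "database": {"vcpus": 2, "ram_mb": 4096, "disk_gb": 100},
}

def review_request(workload, vcpus, ram_mb, disk_gb):
    """Return the resources in a request that exceed the standard for its workload type.

    An empty list means the request is within guidelines; anything else should be
    routed through the exception and approval process.
    """
    limits = STANDARDS[workload]
    requested = {"vcpus": vcpus, "ram_mb": ram_mb, "disk_gb": disk_gb}
    return [name for name, value in requested.items() if value > limits[name]]

print(review_request("web", vcpus=1, ram_mb=1024, disk_gb=20))  # []
print(review_request("web", vcpus=2, ram_mb=1024, disk_gb=40))  # ['vcpus', 'disk_gb']
```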

7. Provision Virtual Machines from Templates
Creating Virtual Machines from scratch is time-consuming and increases the potential for introducing anomalies and errors. In order to facilitate the rapid deployment of new applications into the Virtual Infrastructure, administrators should create and maintain a number of standard Operating System / application ‘master installations’, stored as VirtualCenter templates. The use of such templates removes many of the common, time-consuming phases of the implementation process, reducing time-to-deployment, whilst ensuring that every new server has an identical configuration, i.e. reducing errors, minimising risk and management overhead.

8. Create and utilise Resource Pools to improve SLAs
Resource Pools enable administrators to improve the Service Levels they provide to their users by giving the Virtual Machines within a resource pool access to a guaranteed amount of CPU and RAM resources.

Resource pools are shaped by reservation amounts, limits, and shares. Reservations are guaranteed minimums. Limits define the boundaries of the resource pool and prevent the VMs within the resource pool from tapping additional resources. Shares are used to assign relative priorities. Resource pools allow proactive curtailing and control of user usage. Resource pools can be nested. In addition, reservations can be expandable, meaning that if a pool hits its reservation, it can try to reserve (“borrow”) more resources from a parent if they are available. Doing so takes away available resources for use or reservation by the parent or other entities. The total reservation can never exceed the limit of the resource pool regardless of how many resources are available to the parent. Resource pools can span multiple hosts. However, a VM can only run on a single host at a time and therefore cannot use more CPU cycles or RAM than that host has.
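
The interaction between reservations, limits and expandable reservations can be modelled in a few lines. The class below is a simplified sketch of the rules described above and is not how ESX or VirtualCenter implement admission control; the class name, attribute names and the figures used are assumptions, and shares are left out entirely.

```python
class ResourcePool:
    """Toy model of a resource pool for one resource (e.g. CPU in MHz)."""

    def __init__(self, name, reservation, limit, expandable=False, parent=None):
        self.name = name
        self.reservation = reservation  # guaranteed minimum
        self.limit = limit              # hard upper bound for the pool
        self.expandable = expandable    # may borrow from the parent when True
        self.parent = parent
        self.reserved = 0               # amount already handed out to VMs or child pools
        # A child pool's own reservation is carved out of its parent.
        if parent is not None and not parent.reserve(reservation):
            raise ValueError(f"parent cannot satisfy reservation for {name}")

    def reserve(self, amount):
        """Try to reserve `amount`; return True on success, False if refused."""
        if self.reserved + amount > self.limit:
            return False  # the limit is never exceeded, whatever the parent has spare
        shortfall = self.reserved + amount - self.reservation
        if shortfall > 0:
            # Only the part not already borrowed earlier needs to come from the parent.
            already_borrowed = max(0, self.reserved - self.reservation)
            needed = shortfall - already_borrowed
            if not (self.expandable and self.parent is not None
                    and self.parent.reserve(needed)):
                return False
        self.reserved += amount
        return True

root = ResourcePool("root", reservation=4000, limit=4000)
prod = ResourcePool("prod", reservation=1000, limit=2000, expandable=True, parent=root)

print(prod.reserve(800))  # True: fits within prod's own reservation
print(prod.reserve(500))  # True: 300 is borrowed from root
print(prod.reserve(800))  # False: would exceed prod's limit of 2000
```

In this model a pool created with `expandable=False` would refuse the second request, because it cannot borrow the extra 300 from its parent.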

9. Balance Workloads across Hosts using VMware DRS
VMware DRS (Distributed Resource Scheduler) enables an organisation to provide Service Level guarantees back to its users by dynamically balancing Virtual Machine workloads across multiple ESX Hosts configured in a cluster, in line with their resource requirements, i.e. in order to prevent Virtual Machines becoming constrained whilst other ESX Hosts stand comparatively idle.

VMware DRS aggregates CPU and RAM resources across a cluster of hosts. Pooling such resources together allows VirtualCenter to intelligently calculate and determine where resource loads are imbalanced, while keeping track of all the resource reservations, limits, and shares. VirtualCenter can make recommendations for the placement of running VMs or even automatically move workloads around using VMotion.
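
The idea of detecting an imbalance and recommending a migration can be illustrated with a deliberately simple model. This is not VMware's DRS algorithm: it considers CPU demand only, ignores reservations, limits and shares, and uses the standard deviation of host utilisation as an assumed measure of imbalance. All names and figures are hypothetical.

```python
from statistics import pstdev

# Hypothetical cluster: per-host CPU capacity and per-VM CPU demand, in MHz.
cluster = {
    "esx01": {"capacity_mhz": 10000, "vms": {"vm-a": 4000, "vm-b": 3500}},
    "esx02": {"capacity_mhz": 10000, "vms": {"vm-c": 1500}},
}

def utilisation(host):
    """Fraction of a host's CPU capacity demanded by its VMs."""
    return sum(host["vms"].values()) / host["capacity_mhz"]

def imbalance(hosts):
    """Spread of utilisation across hosts; 0 means perfectly balanced."""
    return pstdev([utilisation(h) for h in hosts.values()])

def recommend_move(hosts):
    """Suggest one VM move from the busiest to the idlest host, or None."""
    src = max(hosts, key=lambda n: utilisation(hosts[n]))
    dst = min(hosts, key=lambda n: utilisation(hosts[n]))
    best, best_score = None, imbalance(hosts)
    for vm, demand in hosts[src]["vms"].items():
        # Evaluate the move on a copy so the real cluster state is untouched.
        trial = {n: {"capacity_mhz": h["capacity_mhz"], "vms": dict(h["vms"])}
                 for n, h in hosts.items()}
        del trial[src]["vms"][vm]
        trial[dst]["vms"][vm] = demand
        score = imbalance(trial)
        if score < best_score:
            best, best_score = (vm, src, dst), score
    return best

print(recommend_move(cluster))  # ('vm-b', 'esx01', 'esx02')
```

Here moving vm-b leaves the hosts at 40% and 50% utilisation, a smaller spread than moving vm-a would achieve, so it is the move recommended.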

If an ESX Host has to be brought down in order to undertake hardware maintenance, patching or an upgrade, VMware DRS can also be used to automatically migrate Virtual Machine workloads off the affected server, minimising the impact on end-users.

10. Data Protection and High Availability
Having virtualised the physical server estate it is essential that a solution is in place to protect, backup and recover the environment in line with the organisation’s Service Level Agreements.
Utilise the inherent high availability functionality of VMware VI3 (i.e. VMware DRS and HA) to increase fault tolerance, load balance workloads and protect them against planned and unplanned downtime.

Understand the potential single points of failure within a VMware Infrastructure and plan for redundancy where possible. The VirtualCenter database, license server files residing on the license server, and datastores containing VMs are all single points of failure that should be routinely backed up. The rest of VMware Infrastructure can be architected for maximum redundancy through teaming or hot spares. For teaming, use multiple hosts with multiple vSwitches and multiple physical NICs. Use multi-pathing to storage with multiple HBAs, switches, and storage processors. Use identical host hardware wherever possible to facilitate quick restores or reinstallation. Have hot spares for the VirtualCenter Server and license server.

Have a process in place for restoring ESX hosts. Identify and back up customised files and partitions for each ESX host. In general, specific customisations to hosts should be avoided or minimised so that each host can be easily recreated through a simple reinstallation, and hosts can be easily replaced. Have standardised procedures or a ‘runbook’ in place so that an ESX Host can be reinstalled procedurally or through a script, in order to speed up recovery.

Have a process in place for backing-up/restoring the VirtualCenter database. The VirtualCenter database is a single repository of configuration information on ESX hosts and their Virtual Machines. There is also historical performance information that is logged. Backing up the database preserves the historical information and minimizes downtime in the event of disaster and recovery.

Have a process in place for backing up/restoring license server files. The license server for VMware Infrastructure 3 stores uploaded licenses in a local directory. Back up the files so that they are available in the event of disaster if the license server must be recreated or reinstalled elsewhere. Using a mapped drive to a network share to store the license files can be helpful. Alternatively, license files can be manually retrieved from the VMware website by logging in using a registered account. ESX, VirtualCenter, and Virtual Machines will continue to operate for a grace period of 14 days if the connection to the license server is severed. Certain abilities related to adding or removing hosts are disallowed during the grace period. After the grace period ends, running Virtual Machines remain powered on, but no further Virtual Machines can be powered on and VMotion migrations are disallowed.
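
When a license server outage occurs it is useful to know how much of the grace period remains. The sketch below is simple date arithmetic based on the 14-day figure above; the function names and dates are illustrative, and exactly when the period is counted from is an assumption.

```python
from datetime import date, timedelta

GRACE_PERIOD = timedelta(days=14)  # grace period quoted in the text

def grace_period_end(connection_lost):
    """Date on which the grace period expires, counted from the loss of connection."""
    return connection_lost + GRACE_PERIOD

def days_remaining(connection_lost, today):
    """Whole days of grace period left; never negative."""
    return max((grace_period_end(connection_lost) - today).days, 0)

lost = date(2024, 3, 1)  # illustrative date
print(grace_period_end(lost))                  # 2024-03-15
print(days_remaining(lost, date(2024, 3, 10)))  # 5
```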

Have a process in place for backing up/restoring Virtual Machines. Virtual Machines can be backed up using the conventional methods that apply to physical machines, by means of backup agents installed in the Guest OSes. However, the use of backup agents in each Virtual Machine is expensive; in addition, the aggregated network traffic of many Virtual Machines running on a single ESX host all being backed up at the same time can result in higher network usage than can be tolerated. In order to address these issues it is often beneficial to use a storage-based backup / recovery strategy, i.e. using available functionality from the storage vendor to provide ‘crash-consistent’ (or in the case of a database application ‘application-consistent’) snapshots of the Virtual Machines, which can then be backed up to tape or a disk-based library.

Have a Disaster Recovery Plan that provides protection against a complete site-level failure. A secondary Disaster Recovery site is needed to recover business operations. Due to the extenuating circumstances, these procedures focus on a shorter, prioritised list of essential services to restore, and lower than normal performance levels may often be tolerated. It may be desirable to prioritise applications based upon their criticality to the business, e.g. tier 1 for the most critical applications and tier 3 for the least critical. Service level agreements are especially important for disaster recovery because their definitions help bring order to chaotic situations after a disaster. A plan for how to restore partial business operations following the loss of a primary site should be developed, and the plan should be tested regularly. VMware Site Recovery Manager may be used to define and automate recovery of the Virtual Infrastructure at the secondary site.