Tag Archives: vi3

Moving a virtualized data center

Two weekends ago we moved our data center to a new location about 60 kilometers away. While we have experience with moving a data center (we moved it to its previous location only two years ago), this was the first time we had to move 150 servers contained in only 30 physical machines.

Of the 150 servers, about 120 reside on only six physical servers (DL585 G2s) that are our VI cluster hosts. We also have two DL585 G1s that hold about 50 virtual desktops, and a DL585 G2 that holds a dedicated SAP testing environment. So there was quite a risk that servers would not survive the transport and we would be left with too few cluster hosts to run all the servers.

Because all the cluster servers have 24/7 HP support contracts on them, we decided that the best we could do was to split the transport into two runs.

During the weekend itself the transportation and mounting of the servers went without incident. It was only when powering on the first three 585s that two of them reported critical hardware failures. One server (the dedicated SAP server) failed to boot altogether and needed to have its processor board replaced. On the other server the memory bank of processor 4 failed, and soon after the memory bank of processor 2 failed as well. While we were able to get the server to boot and complete its memory test with memory in banks 1 and 3 only, ESX expects memory in all four banks or it can't address it, so it refused to boot with a kernel error.

After trying several things on Saturday, I filed the support call with HP on Sunday around 09:00 in the morning. Within 15 minutes we had HP support on the line, and soon after a courier was dispatched with the necessary spares (among them a new processor board and several memory modules). The HP technician was on site around 13:30 and fixed the machine.
Luckily the other cluster hosts were undamaged and booted without incident. It was quite surprising, though, that such new servers would fail (they are less than a year old). We were very glad we had stuck to our performance criterion of expanding the cluster once utilization reaches 60-65 percent per host, so we had plenty of headroom to move the servers around.

Our first production HA failover

A quite unexpected event yesterday was our very first HA failover in production. Although we had tested it and seen it work a number of times in our testing environment, it was something else to see it in production. As a result, we weren't looking for an HA failover when all of a sudden 14 servers went down.

After reviewing the logs we found that the servers in question had been moved because of a failover. It turned out that one of the hosts lost its network connection for three seconds (we still don't know why) and that HA decided to power off and move all the servers as a precaution.

We can safely say that HA works.

New hosts online, project cluster complete

And with this I wish you all a very happy New Year and good luck with all your virtualization projects. May it work out as well for you as it has for us.

To continue the festive spirit, we received some gifts from HP just before Christmas: the new DL585 G2s we ordered. Yesterday and today they were fully configured and have found a place in one of our racks next to their older brothers.

Truly impressive machines, these G2s, and HP has made some great strides in their layout. They are also much more engineer-friendly for maintenance, as they don't have to come all the way out of the rack to replace a CPU or add memory. You can see something of it in the pictures below (click for a larger view) or have a look at the QuickSpecs @ HP.com. In effect the whole CPU/memory assembly slides forward out of the machine after removing two large coolers. This way you can pull the server out only halfway, instead of all the way as with the G1, which needs to be fully out of the rack to open the top lid; that also means less risk of stress on the rail kit.

[Pictures: 585 G2 open; DL585 open on case]

I know it was mentioned on the VMware forums: you cannot VMotion between a G1 and a G2 without CPU masking. We will work around this for a short time while we get another two 585 G2s and move the G1 servers to a VDI cluster that we are planning.
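For completeness: the masking is done per VM, either through the VM's advanced options in the VI client or directly in the .vmx file. The line below only shows the general form of such a mask, where '-' keeps a bit as the host reports it and '0' forces it off; the register used here (edx of CPUID leaf 80000001) and the position of the zeroed bit are placeholders, since the bits that actually need masking depend on the feature differences between the G1 and G2 Opterons. Check VMware's compatibility notes before applying anything like this.

    cpuid.80000001.edx = "----:----:---0:----:----:----:----:----"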

All is well and we still love ESX

Well, in the end it didn’t need an official response from VMware to solve the problem:

It seems that our troubleshooting efforts (rescanning, resignaturing and rebooting the servers) somehow fixed the problem. We checked /var/log/vmkernel and found that after setting LVM.EnableResignature to on and LVM.DisallowSnapshotLUN to off, the host already saw the LUNs as regular LUNs instead of snapshots, but we weren't able to rename them.
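For those who want to check the same thing from the service console, something like this should show the current state (a sketch only; the exact vmkernel messages vary between builds):

    # show the current values of the two advanced LVM options
    esxcfg-advcfg -g /LVM/EnableResignature
    esxcfg-advcfg -g /LVM/DisallowSnapshotLUN

    # look at the most recent LVM-related messages after a rescan
    grep -i lvm /var/log/vmkernel | tail -20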

At first we thought the two problems were related, but this morning (after sleeping on it) we started to look in another direction and found an orphaned Shared Storage resource that probably wasn't cleaned up correctly by ESX during the rescan of the storage.

After deleting this orphan we were able to rename all the Shared Storage resources. After a final reboot of the host all Shared Storage resources were detected as normal LUNs, so everything is fixed now. I have made a post on the VMware forums with more details.

The fun thing this morning was that we were running full production while fixing one host. We just VMotioned everything over to the other host! And when we were done we just VMotioned everything back. And we had no complaints from users while doing that!

I mean, taking 40 servers down for maintenance in the middle of the morning would normally be impossible. After hating it yesterday when it was giving us problems, I am loving it today.

SAN migration final

Well, as expected Murphy had to show up after everything had been going so well in the first part.

After connecting the ESX hosts to the new SAN we ran into what seems to be a documented problem: a LUN that is incorrectly detected as being a snapshot, with access to that LUN restricted as a result.

The problem this causes is that in the VI client you cannot see the Shared Storage resources. The solution is to set LVM.DisallowSnapshotLUN to zero and rescan the drives. The problem we had is that ESX still thought these were snapshot LUNs. After some discussion we decided to remove all VMs and re-add them to the inventory. This seems to work.
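For reference, on the service console that boils down to something like the following (a sketch only; the adapter name vmhba1 is just an example, and the same option can also be changed in the VI client under Advanced Settings):

    # allow access to LUNs that ESX has flagged as snapshots
    esxcfg-advcfg -s 0 /LVM/DisallowSnapshotLUN

    # rescan the HBA so the VMFS volumes are picked up again
    esxcfg-rescan vmhba1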

We will have to wait for the final word from VMware support on what to do about this.

SAN migration status

For those who are interested: we are now in our second day of the SAN migration from MSA1000s to an EMC CX3-40.

So far everything is going OK, apart from a few hiccups yesterday evening. The speed of the migration (we use EMC's SANCopy for the data move) has surprised us, with seven out of nine LUNs ready by six o'clock yesterday evening. Two LUNs from one MSA, with primarily fileserver data, didn't copy over normally; we suspect the controller had difficulties with the I/O load. After the EMC consultant rescheduled the copy jobs and moved them to another storage processor on the CX, they completed fine at nine o'clock yesterday.

All that is left now (…) is to patch the fibers to the new Cisco MDS9020 SAN switches, configure the zones for the servers and LUNs, and bring the servers online with their new storage.

SAN migration this weekend

If you were wondering why there seemed to be a lack of activity: we are migrating to our new SAN this weekend.

To have a relatively stable environment during our migration planning, we only converted two fileservers (600GB and 500GB respectively). Although they are the largest fileservers we have, and with this conversion all fileservers have now been virtualized, the impact on the hosts has been minimal.

Memory usage is now 54% and 50%, and CPU usage is 16% and 11% per host. Delivery of our new hosts has been confirmed for next week, and since the licenses have all been bought and activated already, we can quickly add these servers and carry on with the migrations.

VMware Whitepapers

I was browsing around the VMware website for inspiration for a presentation I had to give at an information session about our project and ran into some whitepapers I had missed on earlier visits.

I found them all quite interesting to read, so here are the links:

3.01 and MS Exchange

In earlier posts I commented on what we will not virtualize. One of the applications I mentioned was MS Exchange, due to the lack of 64-bit support (needed for Exchange 2007).

With the release of VI 3.0.1 today there is production support for Windows 2003 64-bit edition, but in the end we decided against virtualizing our Exchange environment. Apart from the extra licensing for VI3, which makes it more expensive than just replacing the servers, we feel it would make our environment more complex a bit too soon.

Because of our experiences with Exchange we would still cluster the servers with MS Cluster Services, but at the moment only a handful of SANs support VMotion for MS clustered servers, so we would need to take extra measures with regard to which host an Exchange cluster VM is located on.

Had the replacement of the Exchange servers come somewhere in 2007, we might have made a different decision, but at the moment we don't want to take that step yet.

The projected Exchange environment is as follows:

  • 2 Exchange front-end servers (IBM x3550) in a load-balanced cluster
  • 3 Exchange back-end servers (IBM x3650) in an active-active-passive configuration
  • 1 bridgehead server (IBM x336, will become available through virtualization)
  • 1 global catalog server (IBM x346, will also become available through virtualization)

All the new servers (the x3550s and x3650s) will have the new dual-core Xeons. This is an environment sized for heavy traffic on 4,000 mailboxes and 10,000 public folders.

ESX 2.5 to ESX 3.0

After I edited my earlier post about the PowerConvert licenses, I noticed that I hadn't talked about the existing ESX servers in our server park. To clarify: these are stand-alone ESX 2.5 servers that are managed by a group of admins dedicated to development servers. We have a number of in-house development teams, and some of their development environments are on those boxes. The rest is R&D-related.

Although we would like to get them into the VirtualCenter Management Server for management purposes, we will probably just upgrade them to VI3 and otherwise leave them alone.

This is something we feel is better decided toward the end of the migration project, as by then we will have more experience in separating different environments within the Virtual Infrastructure. For now they will be left alone.