Two weekends ago we moved our data center to a new location about 60 kilometers away. While we have experience with moving a data center (we moved it to its previous location only two years ago), this was the first time we had to move 150 servers contained in only 30 physical servers.
Of the 150 servers, about 120 reside on only six physical servers (DL585 G2) that are our VI cluster hosts. We also have two DL585 G1s that hold about 50 virtual desktops, and a DL585 G2 that holds a dedicated SAP testing environment. So there was a considerable risk that some servers would not survive the transport and we would be left with too few cluster hosts to run all the servers.
Because all the cluster servers have 24/7 HP support contracts, we decided that the best we could do was to split the transport in two.
During the weekend itself, the transportation and mounting of the servers went without incident. It was only when powering on the first three 585s that two of them reported critical hardware failures. One server (the dedicated SAP server) failed to boot altogether and needed to have its processor board replaced. On the other server the memory bank of processor 4 failed, and soon after the memory bank of processor 2 failed as well. While we were able to get the server to boot and complete its memtest with memory in banks 1 and 3 only, ESX expects memory in all four banks or else it can't address it. So ESX refused to boot and stopped with a kernel error.
After trying several things on Saturday, I filed the support call with HP on Sunday around 09:00. Within 15 minutes we had HP support on the line, and soon after a courier was dispatched with the necessary spares (including a new processor board and several memory modules). The HP technician was on site around 13:30 and fixed the machine.
Luckily the other cluster hosts were undamaged and booted without incident. It was quite surprising, though, that such new servers would fail (they are less than a year old). We were very glad that we had stuck to our performance criterion of expanding the cluster whenever utilization reaches 60-65 percent per host, so we had plenty of headroom to move the servers around.
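To put a number on that headroom, here is a rough back-of-the-envelope sketch in Python. It assumes the six cluster hosts are identical, that the load redistributes evenly over the surviving hosts, and that "utilization" can be treated as a single figure per host. Those are simplifications of mine, and the function names are made up for this illustration.

```python
import math


def utilization_after_failures(hosts: int, utilization: float, failed: int) -> float:
    """Per-host utilization once `failed` hosts are gone.

    Assumes identical hosts and a perfectly even spread of the load
    (a simplification; real clusters balance CPU and memory separately).
    """
    survivors = hosts - failed
    if survivors <= 0:
        raise ValueError("no surviving hosts left to carry the load")
    return hosts * utilization / survivors


def max_tolerable_failures(hosts: int, utilization: float, ceiling: float = 1.0) -> int:
    """How many hosts can fail before the survivors exceed `ceiling` utilization."""
    total_load = hosts * utilization  # load expressed in "full hosts"
    # Small epsilon guards against floating-point noise pushing ceil() up by one.
    needed = max(1, math.ceil(total_load / ceiling - 1e-9))
    return hosts - needed


if __name__ == "__main__":
    for failed in range(3):
        load = utilization_after_failures(6, 0.65, failed)
        print(f"{failed} host(s) down: {load:.1%} per surviving host")
    print("Tolerable failures at 65%:", max_tolerable_failures(6, 0.65))
```

Under these assumptions, six hosts at 65 percent can lose one host and run at 78 percent, or lose two and still just fit at 97.5 percent. That matches our experience of being able to run everything while one cluster host was waiting for spares.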