I came across a few issues upgrading a site from vCenter 5.5 Update 2 to Update 3b. I’m writing this article in the hope that, if you run into the same problem, it will save you hours of troubleshooting and support calls with VMware.
vCenter 5.5 Update 3b Upgrade
I had already taken a few instances of vCenter 5.5 up to Update 3b without any issue at all. From backing up the database and taking a few VM and storage snapshots through to completion, each upgrade took a little under an hour. However, this last site experienced a bizarre issue that had even some VMware techs stumped until it was escalated far enough up the ranks.
First Upgrade Attempt
This site had a first attempt at upgrading vCenter to 5.5 Update 3b. Upon successful completion of the vCenter Server upgrade, the vCenter service started and, not long after, crashed. It repeated this numerous times, creating endless vpxd.log files. We ended up reverting to the backups and back to the original Update 2.
A VMware support ticket was raised and the logs from the attempted upgrade were submitted. We were advised that the vCenter server contained corrupt virtual machines that needed to be removed from the back-end SQL database. The KB article (KB 1028750) for this issue can be found here:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1028750
Since we had already rolled back to Update 2, we could simply identify the virtual machine, shut it down and remove it from the inventory. Following this we would attempt to upgrade the vCenter server to Update 3b again.
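In our case VMware identified the problem VMs from the vpxd logs, but if a registration is also broken on the host side, a quick check from an SSH session on the host can help confirm which VM is the problem. A minimal sketch using the standard ESXi shell (the sample output below is illustrative, not taken from our logs):

vim-cmd vmsvc/getallvms
# healthy VMs are listed with a vmid, display name and vmx path;
# a registration the host can no longer parse typically shows up as a line such as:
#   Skipping invalid VM '42'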
Second Upgrade Attempt
The second upgrade attempt did not crash the vCenter service, and we thought the upgrade was complete, until a few hours later it began to crash again. As the vCenter server needed to be online for people to use, we once again rolled back to Update 2.
I repeated the same process of uploading the logs to VMware. They analyzed them, and the result was another corrupt virtual machine that needed to be removed from the inventory. Once again we removed the questionable virtual machine.
Third Upgrade Attempt
We could not keep rolling back, uploading logs, removing one virtual machine and then performing the upgrade again; it was far too time consuming. This time we scheduled a WebEx so the VMware engineer could be online while we performed the upgrade, which meant he could troubleshoot the issue as it happened.
The third upgrade attempt also did not see the vCenter service start crashing until about 24 hours later. From that point it continually started and crashed.
Of course, when it crashed it had to be after hours, but I managed to get hold of a new VMware engineer who helped me troubleshoot the issue for 9 hours non-stop (yes, it was a long night). We cleared out old tasks and events from the vCenter database, followed by a shrink task. He then changed the thread stack size to 1024 and then 2048 within the vpxd.cfg file. We checked for the known bug of a stale LDAP entry within ADSI Edit per VMware KB 2044680. We did a lot of other little things here and there, and even went as far as uninstalling and re-installing vCenter Server. Still the service continued to crash.
More logs were taken offline and analyzed. It wasn’t until the morning, somewhere between 9 and 10am, that we heard back from an escalation engineer in the US who had found the following problem.
The Problem
Within the vpxd.log file, there was an issue with a virtual machine’s vswap file. Basically, when the vCenter service starts up, it verifies that the vswap file for each virtual machine exists in the vswap location; in this instance that location is set to a specific vswap datastore. This verification step is apparently new to Update 3b, which explains why, when we reverted to Update 2, the service started successfully and continued running.
VPXD.LOG Entry
Resolved localPath /vmfs/volumes/9625aa71-a57f224f/Company-Web01-1d614347.vswp to URL ds:///vmfs/volumes/9625aa71-a57f224f/Company-Web01-1d614347.vswp
mem> 2016-02-05T13:05:42.928+11:00 [10860 panic 'Default' opID=HB-host-7902@618-32a4dff9] (Log recursion level 2) Win32 exception: Access Violation (0xc0000005)
Next we had to find the host this virtual machine was running on; I had to log into each ESXi host until we found it. We then enabled SSH, established a CLI session and stopped the vpxa service on the ESXi host by issuing the command:
/etc/init.d/vpxa stop
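As a side note, rather than clicking through each host in the vSphere client, you can also check from an SSH session whether a given host has the VM registered. A minimal sketch, assuming SSH is already enabled on the host:

vim-cmd vmsvc/getallvms | grep -i Company-Web01
# if the VM is registered on this host, its vmid, display name and vmx path are returned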
Once the vpxa service had stopped, we attempted to restart the vCenter service. This time it started and continued to run.
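If you are doing this on a Windows-based vCenter Server, the service can also be bounced from an elevated command prompt. A minimal sketch, assuming the default service name vpxd (the vCenter Server Appliance uses different commands):

rem vpxd is the service name behind the "VMware VirtualCenter Server" display name
net stop vpxd
net start vpxd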
To double-check whether there was a vswap file for this questionable virtual machine, we switched over to our ESXi CLI session and typed:
ls -lrth /vmfs/volumes/9625aa71-a57f224f | grep Company-Web01
This command did not return any vswap file in that location.
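If you want to confirm where a VM expects its swap file to live, the swap settings are recorded in its vmx file. A minimal sketch, with <vm-datastore> standing in for the datastore holding the VM’s home directory (the sched.swap entries, where present, show the path the host last derived for the vswap file):

grep -i sched.swap /vmfs/volumes/<vm-datastore>/Company-Web01/Company-Web01.vmx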
We headed back into the vSphere client attached directly to the ESXi host and shut down the virtual machine. Once it was shut down, we powered it on again. Now it was time to re-check whether the vswap file existed in the vswap datastore location. Issuing the same command again:
ls -lrth /vmfs/volumes/9625aa71-a57f224f | grep Company-Web01
now displayed the vswap file for the Company-Web01 virtual machine.
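For completeness, the same power cycle can also be done from the ESXi shell rather than the vSphere client. A minimal sketch, where <vmid> stands in for the numeric ID that getallvms returns for the VM:

vim-cmd vmsvc/getallvms | grep -i Company-Web01   # note the vmid in the first column
vim-cmd vmsvc/power.off <vmid>
vim-cmd vmsvc/power.on <vmid>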
Still within the ESXi CLI session, we started the vpxa service:
/etc/init.d/vpxa start
After a few seconds, the host re-established its connection to vCenter and re-populated all of its virtual machines (previously they had been displayed as orphaned while the ESXi host was disconnected).
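If you want to sanity-check the host agents from the same SSH session, the init scripts should also accept a status argument; this was just a quick check on our part rather than an official verification step:

/etc/init.d/vpxa status
/etc/init.d/hostd status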