Wednesday, April 29, 2009

ESX WRITE10 error

Recently a WRITE10 error on one of my ESX hosts caught my attention; it was showing up more than 10 times every second.

Apr 29 12:01:10 cla1011 vmkernel: 11:22:45:34.946 cpu4:1077)WARNING: VSCSI: 5291: WRITE10 past end of virtual device with 29365, length 128
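
A quick way to confirm how often the warning is firing is to count it in the live log; a minimal sketch, assuming the classic-ESX log path /var/log/vmkernel:

# count the WRITE10 warnings logged so far
grep -c "WRITE10 past end" /var/log/vmkernel
# or watch them arrive in real time
tail -f /var/log/vmkernel | grep WRITE10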

After searching Google and the VMware Communities, I still could not find detailed information or a solution, so I turned to VMware technical support. The support engineer sent me their internal KB.



Symptoms

Messages similar to the following are logged repeatedly in /var/log/vmkernel (or /var/log/messages on ESXi):

Feb 5 15:44:46 USPLVS02 vmkernel: 63:05:31:58.181 cpu3:1129)WARNING: VSCSI: 5292: WRITE10 past end of virtual device with 33554432 numBlocks, offset 33554351, length 128

Feb 12 17:03:04 pa-tse-h02 vmkernel: 156:05:50:47.889 cpu0:1174)WARNING: VSCSI: 3430: READ10 past end of virtual device with 20971520 numBlocks, offset 20980737, length 16

These messages indicate that I/O is being attempted that is outside the boundaries of the virtual device (virtual disk). In layman's terms, the VM has a list of ten items, and the guest OS is asking for the 12th item on the list.
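
Taking the numbers from the first message above, the arithmetic behind "past end" is straightforward:

last block touched  = offset + length - 1 = 33554351 + 128 - 1 = 33554478
highest valid block = numBlocks - 1       = 33554432 - 1       = 33554431

Since 33554478 is greater than 33554431, the final 47 blocks of that write fall beyond the end of the virtual disk.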

Resolution


To find out which VM is responsible for these messages, the World ID (WID) must be determined from them. The WID appears after the cpu specifier and before the WARNING in the messages above. For the WRITE10 message, the WID is 1129; for the READ10 message, it is 1174.
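
When the log is noisy, a rough one-liner along these lines (assuming the log path and the message format shown above) pulls out just the offending WIDs:

grep "past end of virtual device" /var/log/vmkernel | sed 's/.*cpu[0-9]*:\([0-9]*\)).*/\1/' | sort -u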

If the offending VM is still running, you can look in /proc/vmware/sched/cpu; the vcpu column (the first one) will list the number identified in the logs.
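
For example, filtering that file on the first column for the WID from the WRITE10 example (1129 here) shows the matching world entry; a sketch:

awk '$1 == 1129' /proc/vmware/sched/cpu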

To determine the VM responsible if it is not running:

cat `ls -rt vmkern*` | less

Find the first instance of the log message (searching with "/WRITE10" or "/READ10" in less will usually land on it). Then search backwards through the logs for the WID value (in less this is done with "?", e.g. ?1129; the search begins just above the top line on screen, and pressing 'n' finds the next match). Keep searching earlier in the logs until you find something similar to:

Feb 12 16:51:55 pa-tse-h02 vmkernel: 156:05:39:38.873 cpu2:1173)Sched: vm 1174: 4836: adding 'vmm0:ProblemVMName': group 'host/user/pool0': cpu: shares=2911 min=0 max=-1

The text will show you the name of the problematic VM after the vmm entry. In this case, "adding 'vmm0:ProblemVMName'" shows that the VM causing the issue is named ProblemVMName.
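
As an alternative to paging backwards in less, you can grep the rotated vmkernel logs directly for that Sched "adding" line; a sketch, assuming the logs live in /var/log and the WID is 1174 (older rotations may be compressed, in which case zgrep works the same way):

grep "Sched: vm 1174:" /var/log/vmkernel*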

If you look at the contents of the descriptor file for the offending Virtual Machine's disks, you will find an entry listing the number of cylinders for the virtual disk. As an example:

ddb.geometry.cylinders = "2088"

In this case, the virtual disk has 2088 cylinders. Running "fdisk -l" against the flat file of the virtual disk will return information similar to:

You must set cylinders.
You can do this from the extra functions menu.

Disk ANSGOOD-flat.vmdk: 0 MB, 0 bytes
255 heads, 63 sectors/track, 0 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

     Device Boot    Start    End      Blocks   Id  System
ANSGOOD-flat.vmdk1  *        1        2089     16779861  7  HPFS/NTFS

Partition 1 has different physical/logical endings:
phys=(1023, 254, 63) logical=(2088, 254, 63)
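
For reference, output like the above comes from commands along these lines, run from the virtual machine's directory on the datastore (the descriptor name ANSGOOD.vmdk is an assumption based on the flat file's name):

grep -i cylinders ANSGOOD.vmdk
fdisk -l ./ANSGOOD-flat.vmdk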


Note that in this case, the end value for the partition set in the partition table is 2089, exceeding the 2088 cylinders set in the descriptor file. If the geometry were consistent, the end value would be 2088 instead of 2089. The operating system, as a result of this incorrect partition table, issues I/O to blocks beyond the end of the virtual disk, which is what generates the warnings above.
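
The mismatch is easier to see in bytes, using the cylinder size fdisk reports above (16065 * 512 = 8225280 bytes per cylinder):

virtual disk per the descriptor: 2088 cylinders x 8225280 bytes = 17,174,384,640 bytes
space claimed by partition 1:    2089 cylinders x 8225280 bytes = 17,182,609,920 bytes

The partition claims one cylinder (about 8 MB) more than the virtual disk actually contains, so any I/O directed at that final cylinder lands past the end of the device.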

Extending the VMDK just enough to contain the size set in the partition table might fix this, but because it is an invalid combination of settings, it cannot be safely assumed that this will work. Another possible fix is to make the partition table fit within the disk by correcting its ending value. Moving the data to a new, properly configured disk, partition table, and file system is the best bet, because the state of the file system after modifying the VMDK or partition table is unknown; it may be damaged by the changes, or may already be damaged. Give the customer the options and let them choose how to handle the changes to their system, as they can best judge how they want to protect their data.
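
If extending is the option chosen, vmkfstools can grow the VMDK from the service console. A minimal sketch, assuming the VM is powered off and has no snapshots, and that 17 GB comfortably covers the 2089-cylinder partition (the datastore and directory names here are placeholders):

vmkfstools -X 17G /vmfs/volumes/datastore1/ProblemVMName/ANSGOOD.vmdk

Note that this only grows the container; as the KB text above cautions, the health of the file system inside still needs to be verified afterwards.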



The solution is either to extend the VMDK file or to shrink the partition, and extending seems safer than shrinking. The approach I chose was to use VMware Converter. By the way, VMware Converter 4 offers some nice features over the previous version, 3.0.3.

1 comment:

Byron Zhao said...

It seems VMware has put up a public KB for this: KB 1008661.