recovering from an unresponsive hostd after a datastore/storage goes PDL

Hostd crashes with the below:

Hostd.log
2018-12-17T22:37:50.138Z info hostd[9130B80] [Originator@6876 sub=Hostsvc] Storage data synchronization policy set to invalidate_change
2018-12-17T22:37:50.140Z info hostd[9130B80] [Originator@6876 sub=Libs] lib/ssl: OpenSSL using FIPS_drbg for RAND
2018-12-17T22:37:50.140Z info hostd[9130B80] [Originator@6876 sub=Libs] lib/ssl: protocol list tls1.2
2018-12-17T22:37:50.140Z info hostd[9130B80] [Originator@6876 sub=Libs] lib/ssl: protocol list tls1.2 (openssl flags 0x17000000)
2018-12-17T22:37:50.140Z info hostd[9130B80] [Originator@6876 sub=Libs] lib/ssl: cipher list !aNULL:kECDH+AESGCM:ECDH+AESGCM:RSA+AESGCM:kECDH+AES:ECDH+AES:RSA+AES
2018-12-17T22:37:50.141Z info hostd[9130B80] [Originator@6876 sub=Libs] GetTypedFileSystems: fstype vfat
2018-12-17T22:37:50.141Z info hostd[9130B80] [Originator@6876 sub=Libs] GetTypedFileSystems: uuid 579bba34-1440dc54-3308-70106f411e18   <-----volume that went offfline
2018-12-17T22:37:51.136Z info hostd[9AB8B70] [Originator@6876 sub=ThreadPool] Thread enlisted
 
VMkernel:
2018-12-17T22:19:15.397Z cpu18:65910)VMW_SATP_LOCAL: satp_local_updatePath:789: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retry
2018-12-17T22:19:18.801Z cpu14:65607)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "eui.00a0504658335330" state in doubt; requested fast path state update...
2018-12-17T22:19:18.801Z cpu14:65607)ScsiDeviceIO: 2968: Cmd(0x439d4981c540) 0x1a, CmdSN 0x6d2dc7 from world 0 to dev "eui.00a0504658335330" failed H:0x7 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2018-12-17T22:19:21.394Z cpu6:5268773)ScsiPath: 5115: Command 0x0 (cmdSN 0x0, World 0) to path vmhba32:C0:T0:L2 timed out: expiry time occurs 1002ms in the past
2018-12-17T22:19:21.394Z cpu6:5268773)VMW_SATP_LOCAL: satp_local_updatePath:789: Failed to update path "vmhba32:C0:T0:L2" state. Status=Transient storage condition, suggest retry
2018-12-17T22:19:22.892Z cpu22:65615)ScsiDeviceIO: 2968: Cmd(0x439d4981c540) 0x1a, CmdSN 0x6d2dc7 from world 0 to dev "eui.00a0504658335330" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2018-12-17T22:19:23.400Z cpu34:65627)NMP: nmp_ThrottleLogForDevice:3593: last error status from device eui.00a0504658335330 repeated 1 times
2018-12-17T22:19:23.400Z cpu34:65627)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x1a (0x439d4989a540, 0) to dev "eui.00a0504658335330" on path "vmhba32:C0:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2018-12-17T22:19:23.400Z cpu34:65627)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "eui.00a0504658335330" state in doubt; requested fast path state update...
2018-12-17T22:19:23.400Z cpu34:65627)ScsiDeviceIO: 2968: Cmd(0x439d4989a540) 0x1a, CmdSN 0x6d2dc8 from world 0 to dev "eui.00a0504658335330" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2018-12-17T22:19:26.798Z cpu14:65607)NMP: nmp_ThrottleLogForDevice:3647: Cmd 0x1a (0x439d4a8ed5c0, 0) to dev "eui.00a0504658335331" on path "vmhba32:C0:T0:L1" Failed: H:0x7 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2018-12-17T22:19:26.798Z cpu14:65607)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "eui.00a0504658335331" state in doubt; requested fast path state update...
2018-12-17T22:19:26.798Z cpu14:65607)ScsiDeviceIO: 2968: Cmd(0x439d4a8ed5c0) 0x1a, CmdSN 0x6d2dd8 from world 0 to dev "eui.00a0504658335331" failed H:0x7 D:0x0 P:0x0 Invalid sense data: 0x31 0x22 0x20.

Scsi Decoder: Link

In my case, The volume appeared to have gone offline because the host was aborting the commands to the HBA.

/etc/init.d/hostd status
hostd is not running.
However,
ps | grep hostd
2098894 2098894 hostdCgiServer
2105175 2105175 hostd
2105176 2105175 hostd-worker
2105177 2105175 hostd-worker
2105178 2105175 hostd-worker
2105179 2105175 hostd-worker
2105180 2105175 hostd-IO
2105181 2105175 hostd-IO
2105182 2105175 hostd-fair
2105183 2105175 hostd-worker
2105184 2105175 hostd-worker
2105185 2105175 hostd-worker
2105187 2105175 hostd-worker
2105191 2105175 hostd-worker
2105192 2105175 hostd-worker
2105193 2105175 hostd-worker
2105194 2105175 hostd-worker
2105251 2105175 hostd-poll
localcli storage core device world list
Device World ID Open Count World Name
------------------------------------------------------------------------------------------------------
mpx.vmhba32:C0:T0:L0 2099479 1 smartd
mpx.vmhba32:C0:T0:L0 2105105 1 vpxa
mpx.vmhba32:C0:T0:L0 2105175 1 hostd
naa.600508b1001c555e5048cfd74e058fdc 2097185 1 idle0
naa.600508b1001c555e5048cfd74e058fdc 2097403 1 OCFlush
naa.600508b1001c555e5048cfd74e058fdc 2098198 1 Res6AffinityMgrWorld
naa.600508b1001c555e5048cfd74e058fdc 2098325 1 Vol3JournalExtendMgrWorld
naa.600508b1001c555e5048cfd74e058fdc 2099479 1 smartd
naa.600508b1001c555e5048cfd74e058fdc 2105175 1 hostd
naa.6001405d7dc7524f3364522a27b7c508 2097185 1 idle0
naa.6001405d7dc7524f3364522a27b7c508 2099760 1 fdm
naa.6001405d7dc7524f3364522a27b7c508 2099766 1 worker
naa.6001405d7dc7524f3364522a27b7c508 2099771 1 worker
naa.6001405d7dc7524f3364522a27b7c508 2100189 1 J6AsyncReplayManager
naa.6001405d7dc7524f3364522a27b7c508 2100219 1 worker
naa.6001405d7dc7524f3364522a27b7c508 2105175 1 hostd
naa.6001405d7dc7524f3364522a27b7c508 2105286 1 hostd-worker
t10.NVMe____THNSN51T02DUK_NVMe_TOSHIBA_1024GB_______E3542500020D0800 2097185 1 idle0
t10.NVMe____THNSN51T02DUK_NVMe_TOSHIBA_1024GB_______E3542500020D0800 2097446 1 bcflushd
t10.NVMe____THNSN51T02DUK_NVMe_TOSHIBA_1024GB_______E3542500020D0800 2098539 1 J6AsyncReplayManager
t10.NVMe____THNSN51T02DUK_NVMe_TOSHIBA_1024GB_______E3542500020D0800 2105105 1 vpxa
t10.NVMe____THNSN51T02DUK_NVMe_TOSHIBA_1024GB_______E3542500020D0800 2105175 1 hostd
naa.600508b1001c5a5167700b7ae7160e91 2097185 1 idle0
naa.600508b1001c5a5167700b7ae7160e91 2097403 1 OCFlush
naa.600508b1001c5a5167700b7ae7160e91 2098198 1 Res6AffinityMgrWorld
naa.600508b1001c5a5167700b7ae7160e91 2098325 1 Vol3JournalExtendMgrWorld
naa.600508b1001c5a5167700b7ae7160e91 2099479 1 smartd
naa.600508b1001c5a5167700b7ae7160e91 2105175 1 hostd
naa.600508b1001ca2c68c28022b4447710f 2097185 1 idle0
naa.600508b1001ca2c68c28022b4447710f 2097403 1 OCFlush
naa.600508b1001ca2c68c28022b4447710f 2098198 1 Res6AffinityMgrWorld
naa.600508b1001ca2c68c28022b4447710f 2098325 1 Vol3JournalExtendMgrWorld
naa.600508b1001ca2c68c28022b4447710f 2099479 1 smartd
naa.600508b1001ca2c68c28022b4447710f 2105175 1 hostd

This shows that hostd still appears to be stuck as running in a zombie state.

To resolve this, we will need to

  • reset scsi commands to vmhba32 (vmkfstools -B /vmfs/volume/disk/naa.xxxx)
  • Rescan vmhba32 and wait for 3-5 min (esxcfg-rescan vmhba32)
  • confirm that hostd is no longer running for that device  (localcli storage core device world list and ps | grep hostd)
  • start hostd

Worst case scenario, Host reboot.

Leave a comment

Your email address will not be published. Required fields are marked *