Synology storage latency and disconnects on VMware

Symptoms

  • Synology SA3400 SAN connected to VMware vSphere ESXi 6.7 using iSCSI

  • This is a new install. The issues have been occurring since the Synology was first used for VMware datastores

  • Event (Warning): Device or filesystem with identifier x has entered the All Paths Down state.
  • Event (Error): Lost connectivity to storage device . Path vmhba64:C0:T1:L1 is down. Affected datastores: Synology_Datastore1.
  • Event (Information): Lost access to volume due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
  • Event (Information): Alarm ‘Cannot connect to storage’ on 10.41.89.34 triggered an action
  • Event (Error): Alarm ‘Cannot connect to storage’ on 10.41.89.34 triggered by event 615108 ‘Lost connectivity to storage device naa.6001405de561547da144d4199dac86d6. Path vmhba64:C0:T1:L1 is down. Affected datastores: Synology_Datastore1.’
  • VMware performance monitor shows regular, extreme disk latency spikes (for example 500 ms or 20,000 ms) every few minutes.

  • vCenter occasionally raises alarms showing that a host has lost connectivity to storage, normally affecting only one host and one iSCSI datastore LUN at a time.

  • On the Synology side, the Resource view shows no latency spikes.

Root cause

In my experience, this is caused by VMware attempting to perform ATS Heartbeat checking against the Synology (which does not support it).

This issue may also affect EMC and IBM storage providers.

ESXi 5.5 Update 2 changed how the VMFS datastore heartbeat works. On VMFS5 datastores the “ATS heartbeat” setting defaults to on, which uses ATS (atomic test-and-set) commands and offloads validation of the datastore heartbeat to the storage provider. According to the VMware KB article linked in the Solution section:

“This optimization results in a significant increase in the volume of ATS commands the ESXi kernel issues to the storage system and resulting increased load on the storage system. Under certain circumstances, VMFS heartbeat using ATS may fail with false ATS miscompare which causes the ESXi kernel to again verify its access to VMFS datastores. This leads to the Lost access to datastore messages.”

Storage providers such as IBM, HPE, and EMC are already asking their users to disable this feature on VMFS5 datastores because of the problems encountered:

https://www-304.ibm.com/support/docview.wss?uid=ssg1S1005201
http://h20565.www2.hpe.com/hpsc/doc/public/display?sp4ts.oid=75953&docId=mmr_sf-EN_US000005979&lang=en-us&cc=us&docLocale=en_US
https://community.emc.com/docs/DOC-52756
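
One way to check whether a host is hitting this behaviour is to search its vmkernel log for ATS miscompare messages. The following is a minimal sketch, not an official procedure. It assumes the log lives at /var/log/vmkernel.log (the usual ESXi location) and that the message contains the text "ATS Miscompare", which is how I recall the VMware KB quoting it; the helper name count_ats_miscompare is hypothetical.

```shell
# Hypothetical helper: count ATS miscompare lines in a vmkernel log.
# Assumes the message text contains "ATS Miscompare" (case-insensitive);
# adjust the pattern if your ESXi build words it differently.
count_ats_miscompare() {
    log="${1:-/var/log/vmkernel.log}"   # default to the usual ESXi log path
    # grep exits non-zero when nothing matches; "|| true" keeps the helper
    # usable in scripts that run with "set -e".
    grep -ci 'ATS Miscompare' "$log" || true
}
```

Run count_ats_miscompare over SSH on an affected host. A non-zero count around the time of the latency spikes is consistent with the root cause described above, though it does not prove it on its own.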

Solution

Follow the instructions in this VMware Knowledge Base article to disable the ATS heartbeat:

https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2113956

When I made this change, it took about 5 minutes, did not need a host reboot, and caused no impact. The latency spikes and storage disconnects stopped immediately.
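
For reference, the KB procedure comes down to changing one advanced setting on each ESXi host. The sketch below wraps the two esxcli calls in small shell functions. It assumes the option name /VMFS3/UseATSForHBOnVMFS5 from the KB still applies to your ESXi build; the function names ats_hb_show and ats_hb_disable are hypothetical. Check the KB for your exact version before running anything.

```shell
# Advanced option that controls ATS heartbeat on VMFS5 datastores
# (assumed from the VMware KB): 1 = enabled, 0 = disabled.
ATS_HB_OPTION=/VMFS3/UseATSForHBOnVMFS5

# Show the current setting; look at the "Int Value" field in the output.
ats_hb_show() {
    esxcli system settings advanced list -o "$ATS_HB_OPTION"
}

# Disable ATS heartbeat on VMFS5 datastores for this host.
ats_hb_disable() {
    esxcli system settings advanced set -i 0 -o "$ATS_HB_OPTION"
}
```

The setting is per host, so a typical sequence is ats_hb_show, ats_hb_disable, then ats_hb_show again on every host that mounts the Synology datastores.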
