Where do Z-Wave packets go to die?

I have been around the block for 14 months with no resolution on the random ZW 5-30+ second delays.
The last hope was the antenna upgrade. I now have better signal strength and a mostly DIRECT filled column of 47 devices. My hope was that, once DIRECT, the delay would be ruled out as there are no hops.
Wrong.
I still get random massive delays on DIRECT items.
The question is: when an event is triggered and a command is issued to the device, where does it go to sit and ponder? Does it wait in the ZW stack/buffer until, after many failed attempts, it gets discarded on a timeout? Is this protocol like TCP/IP, where there are ACKs, or is it simply UDP-like: here's a packet, hope it gets there?

Those symptoms of occasional (random) delays generally swirl around one basic constriction... the Z-Wave radio.

  • There's a bug that Silicon Labs is chasing that can cause this. It's a pretty specific bug but there's a possibility that you're hitting it.

  • The Z-Wave mesh is overwhelmed... this is what triggers the bug, but even when it doesn't, you can just have devices that send too many reports, leaving no airtime for new traffic.

  • The Z-Wave mesh is weak. There are devices that can't hear or can't be heard due to distance or interference.

The question is, which do you have? Or which two? :smiley:


I have a fairly strong mesh, especially with the new antennas.
No ghosts, yada yada. You can ignore the 3 red lines I think as they are Aeon Minimotes, end devices. 75 and 77 are in the back shed so most of the house can't see them but they hit the Hub via a repeater out there. They were on ST with HubConnect and the issue was still happening.
When I get really pi33ed off I do a shutdown and unplug. All comes back bright and chipper.

I have a couple of power reporting devices but they don't actively report. I poll them every 2 minutes as there is no exact driver for the GoControl (Nortek) power plugs. I have sourced an AEOTEC Switch Outlet 7 (hard to get anywhere) to try replacing the GoControl.
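
For a rough sanity check on whether that polling could be eating the airtime, here is a back-of-the-envelope sketch. Everything in it is an assumption made for illustration, not a measurement: the function name, 4 frames per poll (request, ACK, report, ACK), 64-byte frames, and a 40 kbit/s link (one of the standard Z-Wave data rates). It also ignores hops, retries, and any other traffic on the mesh.

```python
def poll_airtime_fraction(devices, interval_s, frames_per_poll=4,
                          frame_bytes=64, bitrate_bps=40_000):
    """Estimate the fraction of airtime consumed by periodic polling.

    All defaults are illustrative assumptions, not Z-Wave spec values
    for any particular device or driver.
    """
    # Time on air for one frame at the assumed size and bitrate.
    frame_s = frame_bytes * 8 / bitrate_bps
    # Radio-busy seconds per polling interval across all polled devices.
    busy_s = devices * frames_per_poll * frame_s
    return busy_s / interval_s


# Two plugs polled every 120 seconds (the "couple" is assumed to mean two).
print(f"{poll_airtime_fraction(2, 120):.4%}")
```

Under those assumptions the polling uses well under 1% of the airtime, so on its own it probably isn't what is starving the radio. A chatty device sending unsolicited reports every second or so would be a different story.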

My locks are a bit of a pain but they have good signal and don't repeat.
Wish I could find ZB modules for them (Kwikset).

Any suggestions are welcome. I'm sure there is something I haven't tried to eliminate. Maybe that HAM operator at 900 MHz with a 5 kW TX. :rofl:

Bit of a mystery, especially when you get up in the morning to a network that has been quiet all night and the bathroom light and fan just stay dark.




Out of curiosity, when this happens, are you just toggling one or two devices?

Are any of your devices power reporting devices?

I’m tempted to say they just go to the Great Data Dump in the sky…:joy:

WebCore is usually only toggling 1-2 switches max in the path.
I did try to put them in Built-in "Groups and Scenes" and set the metering but that didn't change anything.

Z-Wave routing uses singlecast w/ACK, routed singlecast w/ACK, and (in the case of route failures) explorer frames; each has an expected max timeout at the application level. In their Z-Wave routing tutorial, SiLabs shows typical application transmission timeouts (for a 'large route diverse network') in the table below. Note that 25 second delays are possible and symptomatic of multiple retry attempts of a series of stored working routes. Tack on another few seconds for the explorer frame fallback and you get to the 30 second delays you're evidently experiencing.

The short video tutorials are worth watching to understand why the 25-30 second delays are possible.
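
To make the arithmetic concrete, here is a minimal sketch of how failed attempts stack up before a command finally lands. The `Attempt` class, the function name, and all of the timeout figures are made up for illustration; they are not the values from the SiLabs table and not how the actual stack is implemented. The only thing taken from the description above is the order: direct, then stored routes, then the explorer frame fallback.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    kind: str         # "direct", "routed", or "explorer"
    timeout_s: float  # hypothetical wait for an ACK before giving up
    succeeds: bool    # whether an ACK comes back for this attempt


def delivery_delay(attempts):
    """Return (seconds spent on failed attempts, kind that got through).

    The transit time of the successful attempt itself is ignored; if
    nothing succeeds, the second element is None.
    """
    elapsed = 0.0
    for attempt in attempts:
        if attempt.succeeds:
            return elapsed, attempt.kind
        # No ACK: the sender waits out the timeout, then tries the next option.
        elapsed += attempt.timeout_s
    return elapsed, None


# Healthy case: the direct attempt is acknowledged immediately.
print(delivery_delay([Attempt("direct", 1.0, True)]))  # -> (0.0, 'direct')

# Stale-route case: direct and four stored routes all time out,
# and only the explorer frame gets through.
stale = [Attempt("direct", 1.0, False)]
stale += [Attempt("routed", 6.0, False) for _ in range(4)]
stale += [Attempt("explorer", 4.0, True)]
print(delivery_delay(stale))  # -> (25.0, 'explorer')
```

The point of the sketch is just that the command isn't lost or parked in a queue somewhere. It spends that time waiting out one ACK timeout after another.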

Unfortunately, knowing the 'why' doesn't lead to knowing 'how to fix'. Once you've followed the best practices you've got to rely on the protocol to do its job; a lot of stars have to align for things to work efficiently.

In a perfect Z-Wave world, the controller has an accurate inventory of all Z-Wave nodes and current knowledge of their in-range neighbors. It must also update that inventory (via neighbor discovery/report at inclusion or repair) every time a node is included or excluded, if the nodes get moved, or the RF environment changes significantly. If calculated/distributed routes aren't 'correct', they'll be used, retried, and fail in succession, leading to the explorer frame fallback and excessive response times.

The nodes themselves must accurately report their neighbors, have sufficient memory to store a table of last working routes (otherwise they'll need to be repeatedly re-discovered) and keep those tables up to date.

When listening to the tutorials (linked here: Z-Wave Mesh Performance; videos 8 and 13 are the most informative) I got the impression that only devices based on newer SDKs (4.5 and 6.X) are capable of doing this; even so, as network size increases the expected retry timeouts make this scheme delay prone unless transmissions are always error free.

Throw the 700-series' apparent teething problems into the mix and there's really not much an end user can do about it, other than subdivide a large mesh into a couple (or more) smaller ones; that would effectively reduce the maximum expected timeouts by reducing hop count and route complexity. The fact that so many routes are showing 'direct', yet the response times you're seeing reflect routing failure/retry timeouts, sure looks like there's a mismatch between the controller's view of the mesh and reality. Maybe that's at the root of the problems to be addressed by the expected firmware fixes.


Thanks for the detailed reply!
As suggested by the HE staff in a PM, I have signed up as a beta tester, given that the Z-Wave SiLabs firmware fix is being rolled out within the code.
If I am indeed a victim of the underlying bug they have "fixed", and it presents itself in the manner I describe, I should be quick to ascertain that and yell Hooray!
