What's up with Z-Wave these days? Devices can't seem to find their neighbors

erktrek · December 22, 2020, 2:38pm

I have to say the migration over to the C-7 from 2 C-4's has been "difficult". All my devices are now paired and most are working. At my clients the C-7 is running and doing a pretty good job as well but there are still some device issues.

It is confounding to try and understand how and why Z-Wave devices decide to repeat. Often switches of the same model/firmware right next to each other or separated by a few feet will have wildly different routes back to the Hub and wildly different throughput speeds - 100 vs 9.6 kbit/s for example. The "slower" devices can't seem to recognize their neighbors even in a room full of "powered" neighbors. Repeating through powered devices seems hit or miss, and dedicated "repeater" devices either don't seem to pick up routes or the devices that should repeat through them can't for some reason.

I'm not sure if this is an HE problem exactly - I'm suspicious of the new 700 series chip/firmware. Also the Zigbee stuff seems to work great and the range frustratingly appears to be better even with fewer devices and transmitting at a higher frequency (maybe better/faster error correction, dunno). Certainly the mesh bit seems more functional at this point.

Having said that some devices that are slow to respond / non-responsive in HE are often very fast with a secondary controller.

With Zigbee (on a C-5) I am getting a few poor responses related to device wakeup - I have a button in my office that turns on a set of lamps. In the morning I tap it once and nothing happens. I tap it again a few seconds later and it comes on. This never happened when the device was on the C-4 but of course I'm sure the radios are different.

Note: My Zigbee devices are on a C-5 and my Z-Wave+ devices are on a C-7

Also I realize that routing can be counter-intuitive at times and am completely okay with a 3rd floor switch routing through a basement repeater as long as the throughput is good. The issue is when there are devices with strong routes back to the hub surrounding a slow device - why can't that device find any of them? Why does the repair fail at the neighbor report step?

Thoughts on potential causes:

Devices could be faulty but include/exclude and secondary controller on/off tests seem to work okay.
Weak mesh - not likely due to the high number of devices including repeaters.
Device firmware out of date. Doesn't seem to make all that much of a difference for certain devices (Zooz switches are what I'm using).
S2/S0/No security issues with repeating? Most devices are paired with no security and only a few repeaters are paired S2 unauthenticated.
Outlet boxes are metal and interfere with signal propagation - at my clients, they have mostly plastic housings for their devices. Some slow devices are in the same box as ones that work.
Slightly related to housing, overheating issues maybe?
Power is bad or too much draw for the device for some reason - cannot confirm this but seems like there would be other glaring issues.
HE Firmware - not sure where HE's responsibility for Z-Wave control ends and the radio chip's begins but routing decisions seem like a chip function.
Z-Wave 700 series chipset firmware issues.
False/inaccurate speed reporting - maybe the routing report is inaccurate in some way. However slower devices usually are slow to respond and repair usually fails.
Too many hops for the device to properly establish a route through. 4 is the max I believe. This is one of the reasons a multihub configuration could be a good thing for larger houses.
Maybe some combination of the above?

It's a mystery...

rlithgow1 · December 22, 2020, 3:18pm

The Zooz firmware should be updated in your switches. Also the S0 issue is causing major headaches (especially from Zooz devices).

erktrek · December 22, 2020, 3:23pm

I've found it does not make a difference with the firmware from ?3.06? to 4.01. The only time it helps is after updating via a secondary controller - the device then reports as "is failed" and can easily be removed and re-included. One of my more problematic devices (2nd Floor Front Bedroom Light Switch) has the updated firmware.

Have never had an issue with S0 and the V3 & V4 Zen23/24s. Was able to pair all of those devices with no security. For S0-only (non-S2) devices like the Aeotec Recessed Door Sensor Gen 5, I use a secondary controller to pair with no security.

rlithgow1 · December 22, 2020, 3:32pm

I think the issue is mainly with the Zooz stuff. There are several threads here about it, with even Agnes from Zooz commenting.

erktrek · December 22, 2020, 3:36pm

I hear you! Yeah probably a good idea to do that anyway. Just that I have switches on the older firmware running fine while some on the newer have issues. I guess I should think about the chain of devices as well. Of course some firmware updated devices aren't recognizing repeaters nearby either.

My main reason for this post was to point out some apparent inconsistencies with Z-Wave routing: how devices can't seem to decide what to do and fail to find their neighbors even in a "mesh rich" environment.

rlithgow1 · December 22, 2020, 3:40pm

I hear ya. I have several mains repeaters and just based on their locations physically in the house, I don't understand the voodoo used to calculate optimized routes. Probably something for one of the techs like @bcopeland to answer.

erktrek · December 22, 2020, 3:42pm

I am not worried about crazy routes as long as the throughput is good. It is what it is..

Tony · December 22, 2020, 7:48pm

Z-Wave routing, rooted (pun intended) in a proprietary protocol, is a riddle wrapped in an enigma... In that light, the only thing that is known about its inner workings is what has been published and what has been inferred by watching it in action. The veil is (was?) supposedly due to lift and expose the network layer specification Q4 of this year. Until then, one of the clearest explanations I have seen (though in the context of the openHAB framework) is here:

Z Wave Routing Basics: Retry Strategies - Tutorials & Examples - openHAB Community

Stepping through the charts on that site gives a good explanation of how Z-Wave routes are explicitly sent by controller to slave devices (at inclusion and heal) and the explorer mechanism used as a backstop when they fail.

Some takeaways: Z-Wave repeater nodes (unlike Zigbee routers, which dynamically choose new neighbor links) are pretty dumb. They blindly retransmit as long as they recognize their own node ID in the routed frame. They have no role in route calculation (that is the Z-Wave controller's job), though they are a key part of the explorer mechanism used when calculated routes have failed.
Destination slaves also are pretty dumb. They don't participate in calculating routes either-- they rely on a small number of calculated routes sent to them at time of network inclusion (giving preference to whatever route worked most recently) and try them sequentially should one fail. If all stored routes fail, they flood the network with explorer frames and through that process recover a new working route to the destination; that route gets used until a better calculated route is set (or a new explorer-derived one is required after a subsequent failure).
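
To make the retry behavior above concrete, here is a minimal Python sketch of it. This is only an illustration of the mechanism as described in this post, not actual Z-Wave firmware logic: the names `repeater_forwards`, `send_with_retries`, `try_route` and `explore` are hypothetical, and routes are modeled simply as tuples of repeater node IDs.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model: a route is the ordered tuple of repeater node IDs
# between the sending node and its destination.
Route = Tuple[int, ...]


def repeater_forwards(my_id: int, route: Route, hop_index: int) -> bool:
    """A repeater makes no routing decision of its own: it retransmits
    only if the frame names it as the repeater at the current hop."""
    return hop_index < len(route) and route[hop_index] == my_id


def send_with_retries(
    stored_routes: List[Route],
    try_route: Callable[[Route], bool],
    explore: Callable[[], Optional[Route]],
) -> Tuple[Optional[Route], int]:
    """Try stored routes in order, then fall back to explorer discovery.

    Returns (route that was used or None, number of failed attempts).
    Failed routes stay in the list; nothing ever removes them.
    """
    failures = 0
    for route in list(stored_routes):
        if try_route(route):
            # Give preference to whatever route worked most recently.
            stored_routes.remove(route)
            stored_routes.insert(0, route)
            return route, failures
        failures += 1  # each failed attempt costs airtime and latency
    # All stored routes failed: flood explorer frames (assumed callback).
    discovered = explore()
    if discovered is not None:
        stored_routes.insert(0, discovered)
    return discovered, failures
```

The failure count is the point of the sketch: a node whose first stored routes are poor pays for every one of them before it reaches a working route or falls back to explorer frames.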

From this it is easy to see how things will deteriorate, performance-wise, if every node doesn't have a good set of stored working routes... lots of bandwidth-sapping sequential retries by a node that has encountered a transmission failure; even nodes with good stored routes will be sidelined while explorer frames are circulating should a new route discovery via explorer be underway on behalf of another node.

Quote from the author of that website:

"One last thing about routing in z-wave. There is no statistical improvement of the routes so the mechanisms described above are all that is used. The algorithm in the controller to calculate routes is not published so it is hard to know what is considered but certainly no past success beyond the simple mechanism here.
The initial return routes are calculated by the controller and sent to the slave devices so the underlying mechanism is the same but there is no option for a slave to calculate more routes on the fly.
If you read the above a few times you can see that it is perfectly possible for zwave to recycle poor routes as nothing ever fully removes them from the pool of possible.
Explorer discovered routes are the only routes that may be improvements to what was already in the table so these are the only routes that can bring fresh routes into the process.
If you start looking at the route in your network in zniffer expect to be surprised and bemused. When I first started looking it amazed me how the signal bounced around the house taking sometimes seemingly strange routes. While you can set preferred routes with some success remember if you do you are reducing the self healing capabilities so try with care.
I have become more circumspect with my attitude to these strange routes following a conversation with Peter @petergebruers over on the Fibaro forum. He pointed out to me that all sorts of things in your home block and amplify the signal in strange ways.
The example he gave me that drove it home to me was regarding how a module can get a signal in a metal back box behind a switch in a solid wall. The answer he gave me was that the mains cables can act as passive repeaters and probably brings the signal inside the box. He is a smart fellow and I am sure he is correct. He says to me RF is strange and I now just accept it and do not try to fight the tide."

erktrek · December 22, 2020, 9:32pm

Thanks for the info!!

It's not the nature of the odd routing that happens - I can totally live with that as long as the speed holds - as you mentioned it's kind of fun to see where things go (3D reality beats 2D thinking - Khaaaaannnn!). It's that devices of the same model in the same room, separated only by "air", seem to ignore each other and insist on maintaining a poor connection. Completely understand that it's "proprietary magic" coupled with physical barrier issues but my point is some of the decisions seem suspect, maybe more so with the 700 series.

Tony · December 22, 2020, 9:49pm

Yes, it's bizarre... even though those devices know they are neighbors (they report that fact to the controller at time of inclusion, and whenever they get 'repaired'), they're unable to set a calculated route themselves (at least in the up-front, non-explorer-frame scenario). They just hand all the topology info back to the controller, and it assimilates it all and doles out what it thinks are the optimal routes. Only when those routes flop can the devices try to generate their own with explorer frames, but all that traffic takes its toll on the rest of the network and adds more latency.
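
As a rough picture of that controller-centric scheme, the sketch below has a controller compute a route from the neighbor reports it has collected. The real route calculation is not published, so a plain breadth-first search is used here purely as a stand-in. The function name `shortest_route` is hypothetical, and the limit of 4 repeaters is taken from the hop limit mentioned earlier in this thread.

```python
from collections import deque
from typing import Dict, List, Optional, Set

MAX_REPEATERS = 4  # assumption: the hop limit mentioned earlier in the thread


def shortest_route(
    neighbors: Dict[int, Set[int]],
    src: int,
    dst: int,
    max_repeaters: int = MAX_REPEATERS,
) -> Optional[List[int]]:
    """Return the repeater node IDs between src and dst, or None.

    `neighbors` stands for the topology the controller has assembled
    from each node's neighbor report. An empty list means a direct link.
    """
    queue = deque([(src, [])])
    visited = {src}
    while queue:
        node, repeaters = queue.popleft()
        for nxt in sorted(neighbors.get(node, ())):
            if nxt == dst:
                return repeaters
            if nxt not in visited and len(repeaters) < max_repeaters:
                visited.add(nxt)
                queue.append((nxt, repeaters + [nxt]))
    return None


# Node 4 reaches the hub (node 1) through repeaters 3 and 2.
table = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(shortest_route(table, src=4, dst=1))
```

Note that in this model a node missing from another node's neighbor set can never be picked as a repeater, however close it physically is. That would match the symptom of a slow device surrounded by well-connected switches after a failed neighbor report.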

Likely (hopefully?) the controller is taking into account some RF quality metric (RSSI reporting) that isn't exposed elsewhere and coming up with the unexpected routes (the idea of a home's wiring acting as a passive radiator is intriguing). Probably this controller-centric scheme is an artifact of the age of the protocol... the 8-bit controllers used in all pre-700 devices just couldn't be expected to do much more. Maybe the legacy of all that is still dogging the latest devices.

erktrek · December 22, 2020, 10:01pm

Ahhh slowly starting to get it maybe - so it's easy to anthropomorphize the situation. Was wondering why these nearby devices just can't "see" each other when in reality they are "blind" as to physical location, with only data from the controller and an occasional exploratory burst of activity. They aren't wirelessly trying to reach out to each other like I had originally thought (does Zigbee do this?).

You would think that given the slow routing on some devices relative to the others that the controller would be "encouraging" those devices to retry and find better routes.

I've seen others request the ability to manually set routes and I can see why now.. mmm.

Tony · December 22, 2020, 10:13pm

Yes, it's not a great scheme.. but probably the best that could be done to maintain backward compatibility with the oldest devices. The explorer frame enhancement is pretty huge, but still lacking in comparison to how Zigbee routing works. I've sniffed my Zigbee network and purposely disabled a key parent repeater to my door lock just to see how long it takes for it to recover-- usually by the time I get back to my sniffer the door lock has already chosen another parent. Literally within seconds, transparently and without impacting the rest of the network..

There is a mechanism for the controller to set preferred routes ('Application Priority Route' -- it's detailed in the linked article). Doesn't seem to be widely used, and it adds another layer of try-fail-retry that slows things down when some transient condition disrupts the network.
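
A small sketch of why that extra layer can slow things down, assuming the priority route is simply tried ahead of the other stored routes. The function name `attempt_order` is hypothetical, and the real ordering inside the Z-Wave stack may differ.

```python
from typing import List, Optional, Tuple

# Hypothetical model: a route is a tuple of repeater node IDs.
Route = Tuple[int, ...]


def attempt_order(
    priority_route: Optional[Route],
    stored_routes: List[Route],
) -> List[Route]:
    """List the routes in the order a node would attempt them."""
    order: List[Route] = []
    if priority_route is not None:
        # Assumption: a priority route set by the application goes first.
        order.append(priority_route)
    # The remaining stored routes follow, without repeating the priority one.
    order.extend(r for r in stored_routes if r != priority_route)
    return order


# With a priority route set, one more attempt can fail before the
# node reaches a stored route that works.
print(attempt_order((2,), [(3,), (4, 5)]))
```

If the pinned route is hit by a transient problem, every transmission pays for that failed first attempt until the condition clears or the priority route is changed.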