Ran across this thread on Reddit that I found very interesting. It gets even more interesting when you start clicking on the links and reading further. It sounds like a lot of the same issues that not only I have seen at one point or another, but that others have reported in the forums as well.
Just felt like sharing to see what others think as well.
I highlighted a few interesting tidbits in that thread below. Some of this sounds like early C7 experience. And maybe it is related to what is happening to some here that seem to have issues when others don't. Especially with the Zooz sensors and other problematic devices.
Also funny how Silicon Labs is playing dumb: "first we ever heard of that...".
Note: the below are edited screenshots, and don't show the whole post.
I've seen the issue, have sniffer logs of it, and have been able to reproduce it 4 or 5 times semi-predictably.
The one thing they neglect to touch on in that short Reddit thread is that the issue also seems to be a lot more prevalent on non-US firmwares. I don't know why (although it could be related to the speculation below, with the different zwave frequencies in those countries triggering it faster or more frequently), but that is how the reported data stacks up.
The speculation is that the issue centers around the airtime fairness/throttling algorithm that is in zwave 700 that was not in zwave 500 controller firmware. But I don't know that anyone has verified that - it could be a wrong guess.
I hope this leads to a longer term fix from SiLabs. Historically they have NOT been fast at releasing firmware fixes for issues, so this could linger.
I haven't seen confirmation that SiLabs has even reproduced it yet - although I know for sure that a lot of sniffer logs showing the issue and steps to reproduce have been sent to them. There are at least 3 different companies/projects that have reported the issue to SiLabs at this point - and I'm sure others too - I know Aeotec is aware of it, but I haven't seen that they have submitted a ticket.
Side note - the z-way guys say their custom zwave 700 firmware doesn't have this issue, but multiple z-way users have confirmed it does, so take that with a grain of salt too.
We'll all have to wait for SiLabs to reproduce it, and release a new firmware or SDK workaround.
Oh, and my last thought on this... Just because the issue CAN happen on zwave 700 systems doesn't mean it WILL. There are many users that have large, and perfectly functioning zwave 700 networks.
It seems that the number of messages can exacerbate it, which is why reducing the message load (reducing power reporting, removing S0 devices, etc.) can help prevent it from happening.
Along those lines, from my testing it is also more likely to happen during ad hoc high traffic periods like a network-wide repair, or even device pairing.
When the issue is triggered by one-off events, as opposed to high baseline mesh traffic, restarting the zwave stack can resolve it and get things going again.
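If you want to see which devices contribute most to the message load, something like the following might help. This is only a minimal sketch: it assumes you have already pulled `(timestamp, node_id)` pairs out of whatever log your platform provides, and the function name `busiest_nodes` and the event format are made up for illustration, not part of any hub or SiLabs API.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def busiest_nodes(
    events: Iterable[Tuple[float, int]],
    window_s: float,
    top_n: int = 3,
) -> List[Tuple[int, float]]:
    """Return (node_id, messages_per_minute) for the chattiest nodes.

    `events` is a hypothetical list of (timestamp_seconds, node_id) pairs
    covering an observation window of `window_s` seconds.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # Count how many messages each node sent during the window.
    counts = Counter(node_id for _, node_id in events)
    # Normalize the raw counts to messages per minute.
    per_minute = {node: count * 60.0 / window_s for node, count in counts.items()}
    # Highest rate first, keep only the top_n entries.
    return sorted(per_minute.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    sample = [(0.0, 5), (1.0, 5), (2.0, 5), (3.0, 9)]
    print(busiest_nodes(sample, window_s=60.0, top_n=2))
```

Nodes that land at the top of that list are the first candidates for reduced reporting.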
I had my zwave network die the other day. I had just added a zooz vibration sensor on my dryer to replace a zigbee one I had previously. Looks like the number of messages it was sending during the drying cycle overloaded the zwave network. I have since taken it offline.
That's the first thing I thought of when I read this article.
I'm just glad that this is getting more visibility and hopefully will help things overall.
The Reddit thread saying the controller "locks up" isn't right either. The problem I've seen is that, for reasons unknown, the controller stops sending ACK messages back to devices. It isn't locked up, as it can still receive messages and sometimes send some commands. But the lack of ACKs wreaks havoc on the mesh.
Some devices will resend their message a few times and then stop when they don't get an ACK (which is good). But some other devices will keep resending the message more or less forever if they don't get an ACK (Zooz XS sensors for one, @agnes.zooz), which is very bad and compounds the issue drastically. Because of all the traffic, any commands that do make it out of the controller are often delayed, sometimes significantly (tens of seconds, even minutes).
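To illustrate the difference between the "good" and "bad" resend behavior, here is a rough sketch of a sender that gives up after a few unacknowledged attempts and backs off between tries. It is not how any actual device firmware or the Z-Wave protocol implements retries; `send_with_bounded_retries`, `send_once`, and the default values are all hypothetical names and numbers picked for the example.

```python
import time
from typing import Callable, Tuple


def send_with_bounded_retries(
    send_once: Callable[[], bool],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, int]:
    """Try to send a message, stopping after `max_attempts` tries.

    `send_once` is a hypothetical callable that transmits the message and
    returns True if an ACK came back. Returns (acked, attempts_made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        if send_once():
            return True, attempt
        if attempt < max_attempts:
            # Exponential backoff: wait longer after each missed ACK so a
            # controller that has gone quiet is not flooded with resends.
            sleep(base_delay_s * 2 ** (attempt - 1))
    # No ACK after the last attempt: give up instead of resending forever.
    return False, max_attempts


if __name__ == "__main__":
    # Simulate a controller that never ACKs; skip real sleeping for the demo.
    acked, attempts = send_with_bounded_retries(lambda: False, sleep=lambda s: None)
    print(acked, attempts)
```

A device that loops without the attempt limit is the case that keeps adding traffic while the controller is not ACKing.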
As it happens on different software stacks, the thought is that it has to be in the controller firmware, as that is about the only common factor between the different systems and platforms (not all of them use the SiLabs app layers).
I've been jinxed. Ever since reading this stuff I've had 3 zwave busy msgs hit.
To be fair, my zooz vibration sensor battery was dead and had a queue of msgs to be sent to it. As soon as I replaced the battery they were all sent and generated that event.
We've examined many logs from the HA/JS platform and have seen this on other battery powered devices, so it could be a protocol requirement or something specific to 700 series. We're currently checking if this is something that can be fixed on the firmware side of the device so we can implement it asap.
The 7.17.1 firmware from SiLabs is out, just FYI for those that have other zwave systems they test on.
(Like many other hubs with integrated zwave, end users can't update the Hubitat firmware except as allowed by Hubitat. I'm sure they will release an update if/when it is tested and worth them doing so. That's not a bad thing, and I'm not complaining. Just trying to clarify before the "How do I do this update on Hubitat" questions start.)