Z-Wave storms - looks like HE Hub's fault?

MFornander · July 30, 2020, 6:51pm

OK, Zniffer trace and hub log output below. It's the hub that is spamming loads of Basic Get commands, and the log just shows the responses to those Gets. I've been chasing phantom nodes and have spent $100 on debugging equipment by now.

Doesn't this look like HE is the culprit? What am I missing? I have turned off scenes and groups. This is just a Rule Machine 4.0 rule turning on a fan, triggered by a double-click of a switch.

Rxich · July 30, 2020, 6:57pm

Is this the new C7 hub?

MFornander · July 30, 2020, 7:07pm

It's a C5

csteele · July 30, 2020, 7:19pm

There's a pattern.. short/long...

The hub initiates a Get, then another 232 ms later, followed by another 1368 ms after that, for a total cycle of 1.6 seconds.
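For anyone who wants to check a pattern like this against their own trace, here is a minimal Python sketch of the arithmetic. The timestamps are hypothetical, built only from the 232 ms and 1368 ms gaps quoted above; they are not copied from the Zniffer capture, and the helper names are made up for illustration.

```python
# Hypothetical timestamps (ms) for consecutive Basic Get frames from the hub,
# constructed from the 232 ms / 1368 ms gaps quoted above; not real trace data.
get_times_ms = [0, 232, 1600, 1832, 3200, 3432]

def gaps(times):
    """Return the gap between each consecutive pair of timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

def cycle_length(gap_list, period=2):
    """Sum one short/long pair of gaps to get the length of one repeat cycle."""
    return sum(gap_list[:period])

frame_gaps = gaps(get_times_ms)
print(frame_gaps)                # [232, 1368, 232, 1368, 232]
print(cycle_length(frame_gaps))  # 1600
```

Exporting the real timestamps from the trace and feeding them to `gaps` would show whether the short/long rhythm holds across the whole storm or only in that one screen cap.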

Can't say I understand ANYTHING better as a result of knowing that, but that screen cap has a pattern.

[ lines 2734 - line 2649] & [ line 2649 - line 2635]

neonturbo · July 30, 2020, 7:49pm

I have no idea about the Zwave side of things, but is there a way to try this same thing without Rule Machine, maybe using Simple Automation app, Button Controller, or a community app like Advanced Button Controller?

At least that would narrow it down to either RM or Zwave side of things. And not that I can read these any better than you can, but what happens if you turn on all logging for device22 (Fan)?

Maybe post the rule that goes along with the switch?

And last silly question, what brand and model is that fan switch?

jp0550 · July 31, 2020, 12:26am

Do you have any type of Z-Wave polling going on for things like switches that don't support pushing changes? I used to see duplicates like that for a device when my polling interval was too short, or if I sent the refresh command out for a lot of devices at one time, rather than staggering the requests. From the hub's perspective, one refresh request would return multiple responses like that in the logs for me.
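By staggering I mean something like the following. This is only a Python sketch of the scheduling idea, not Hubitat code; the device names and the 2 second spacing are assumptions picked for illustration.

```python
def staggered_schedule(device_ids, spacing_s=2.0):
    """Return (delay_seconds, device_id) pairs so refreshes go out one at a time."""
    return [(index * spacing_s, device_id) for index, device_id in enumerate(device_ids)]

# Hypothetical device names, for illustration only.
schedule = staggered_schedule(["fan", "hall_dimmer", "porch_switch"], spacing_s=2.0)
for delay, device in schedule:
    print(f"t+{delay:.0f}s: refresh {device}")
```

Each device gets its own slot instead of every refresh hitting the radio in the same instant. The right spacing depends on the mesh, so treat the value as a tuning knob.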

MFornander · July 31, 2020, 1:41am

Thanks for chiming in. I removed all non-ZWave-Plus devices a few months ago along with misbehaving devices such as Schlage locks. No polling and only Z-Wave Plus now.

MFornander · July 31, 2020, 1:44am

Yeah, and it's actually a little more interleaved. The Acks are not right next to the Gets, so the latency is closer to 2 seconds. I've highlighted the two Z-Wave events that share the same sequence number below as an example:

MFornander · July 31, 2020, 1:50am

Turning it on directly in the Devices page works just fine, with no duplicates. I already removed Groups & Scenes, but I'll try some other apps to initiate the changes, good idea.

The storm is not limited to just the Fan. Most of the devices have 5-10 repeats. I'll turn on debug but it's not target device specific.

Leviton DZ15S-1BZ Z-Wave Plus Switch, but again many others show the storm, including HomeSeer Dimmers and GE Dimmers, all Z-Wave Plus.

csteele · July 31, 2020, 3:11am

You may want to try disabling All Apps for 10 seconds. Just go down the list and click everything in that column. Look at the logs. Did the 'storm' stop? Depending on just how active your house is, it's possible no one will even notice the outage.

If it did, enable them slowly with one eye on the Logs.

johnwick · July 31, 2020, 3:21am

Rule Machine is not doing anything different than a dashboard or the edit device page would be doing. Rule Machine is just executing the commands that are exposed by the driver. It's the driver that handles comms with the device. Rule Machine doesn't care what kind of device is executing the command that it sends; it just sends the command to the device and lets the driver handle all of that other stuff. So, I don't believe this is a Rule Machine or other app issue.

So, I think a better question is, what commands are you setting up and executing from the rule? You say that this is a fan device. Are you using a setLevel dimmer command with a fade time? For example, fade to 100% over 4 seconds? If so, then this is the exact pattern that I would expect to see. If you could show your rule, then we could tell what the problem might be. If you do have a fade time for a fan dimmer (like the GE fan dimmer), then I would recommend putting in a Zero for the fade time.

ericm · July 31, 2020, 3:38am

Is this something that started happening recently? I've had a few people mention similar repeat reports to me with our Z-Wave devices. What I saw was that the device was sending a packet to the hub, but not getting an ack response. So the Z-Wave SDK would send another packet . . . and another . . . until it got a response. You can see below the total of 3 reports sent before a response was received. Some users are having many more duplicate reports.

johnwick · July 31, 2020, 3:44am

But the OP reports that this doesn't happen when issuing a command from the edit device page. Wouldn't you expect this to happen regardless of what app issued the command (dashboard or rule or whatever) if that was the problem that was happening here? Or is this an intermittent problem which maybe just didn't happen the time that the OP was looking?

But there is a new Z-Wave stack in the firmware to support the 700 series chip on the C-7. So, if this has only been seen since the update to 2.2.2, that is a strong possibility.

lairdknox · July 31, 2020, 3:48am

I just resolved a z-wave issue. It was really kinda silly - some packages were leaning against a switch and creating a storm of commands. It was even causing most of the events not to be logged.

Apparently after the nightly maintenance was finished it would clear temporarily, but then it would quickly get saturated again to the point where the logs didn't show the traffic. Even after I cleared the issue, the Z-Wave mesh didn't start behaving until the next day.

It is possible that you have one device that is misbehaving and causing a cascade failure on your mesh.

ericm · July 31, 2020, 3:52am

This has mostly been when rule machine issues commands and seems to be more likely when multiple commands are sent at once. Sending single commands through the device page is much less likely to cause it to happen. Almost like the z-wave radio is getting overburdened by many requests and not able to respond in time before the device seems to think the packet is lost and re-sends it.
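To put rough numbers on that idea, here is a toy Python model of a sender that repeats a frame on a fixed timer until an ack shows up. It is a sketch of the general resend-until-acked pattern, not the Z-Wave SDK's actual retry logic; the function name, the 500 ms timeout, and the attempt cap are all assumptions.

```python
def transmissions_before_ack(ack_delay_ms, retry_timeout_ms, max_attempts=10):
    """Count the copies a sender emits if it resends every retry_timeout_ms
    until an ack arrives ack_delay_ms after the first transmission."""
    sent = 0
    elapsed = 0
    while sent < max_attempts:
        sent += 1  # transmit (or retransmit) the frame
        if ack_delay_ms <= elapsed + retry_timeout_ms:
            break  # the ack arrives before the next retry timer fires
        elapsed += retry_timeout_ms
    return sent

# Assumed timings, for illustration only.
print(transmissions_before_ack(ack_delay_ms=100, retry_timeout_ms=500))   # 1
print(transmissions_before_ack(ack_delay_ms=1200, retry_timeout_ms=500))  # 3
```

In this model a quick ack means a single report, while an ack delayed past two retry timers produces three copies of the same report. Once the radio falls behind, each late ack adds more traffic to the backlog.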

Almost positive that I have not seen this behavior until the last few weeks but I could have missed it.

lewis.heidrick · July 31, 2020, 3:55am

Does rebooting the hub temporarily resolve the issue?

MFornander · July 31, 2020, 4:09am

I have been battling this for months. I have added/removed all devices on the whole network once and moved the hub once, at a separate time. I have bought several ZSticks and ZWaveToolBox. I bought a 2nd C5 to test things out on a smaller scale (works). I just don't feel like Hubitat engineers are testing larger installations with 3 hops and 50+ devices. WAF is fading with "why does it take more than seconds to start the night mode?"

neonturbo · July 31, 2020, 4:14am

I have just over 50 Zwave devices, and probably at least 70 Zigbee. And I don't have any problems at all. Things run very smoothly here. I don't think I am unique on this forum in having that many devices. That is not meant to rub it in or anything, but I do think that this is something isolated to your particular hub.

MFornander · July 31, 2020, 4:16am

I'll try that tomorrow, thx!

No change since 2.2.2.

True, and I have selectively removed various devices for a day or so, but it doesn't seem to fix anything.

Yeah, I really think the Hub's multi-command implementation at the core has some timing issues on slower networks with many hops. OpenHAB had issues like this a few years ago and they had to tune the retry rate and timings.

No, but triggering several of these "scenarios" tends to have a multiplicative effect, i.e. the 2nd storm will be a lot worse than the 1st, and sometimes I do have to reboot. At that point I'm guessing it's just reissuing Gets and processing Acks over and over again.

neonturbo · July 31, 2020, 4:19am

No, but maybe if we can tackle one thing at a time, maybe we can see if there is some error, warning, or particular message that will help someone to figure it out. The debug logs have lots more information that could help diagnose this. I wouldn't turn on debug for everything, just pick a chatty device and log it for a bit and post it up here. Then turn debug logs back off.