Handling failure

I'm a few weeks into Hubitat. I generally love it, but I've observed spotty reliability when flipping several GE Z-Wave switches at once with a Rule Machine (RM) off action. This has opened up a can of worms for me around detecting and handling failure.

When I go to sleep, if my sleep rule's off command fails, the living room lights stay on all night. I've observed this happening with surprising frequency: every 3-4 days, all or some of the living area switches are on when I wake up even though the logs show that the sleep rule ran. The stakes are even higher when I leave the house (which might be for days or weeks, once we're allowed to vacation!).

I have a robust, recently repaired Z-Wave mesh (a dozen switches throughout a 1250 sq ft flat), and I tend to see this issue when I flip around 5 switches in one action, but almost never throughout the day as motion turns on and off a couple switches at a time.

My questions and discussion points:

  1. What provisions are built into the Z-Wave protocol to handle failures like this? I could totally be wrong, but I thought Z-Wave devices ACK commands, which implies some kind of confirmation logic, if not retry logic built into the protocol. Is that true, and if so, what does the Hubitat do with that ACK or lack thereof?

  2. What RM idioms have others developed to improve reliability at the application level? One idea I have is to wrap an off action in a repeat that checks that the switches are indeed off and tries again if not (how reliable is that signal?). Another is to explicitly list each switch separately in its own off action, possibly wrapping each in its own retry loop. I've also considered adding time-triggered rules that double-check critical invariants throughout the day. Example: every couple of hours, if the mode is Away, make sure the lights are indeed off.

  3. Corollary: The GE switches aren't in the list of devices I can poll or refresh in RM. Is that because the driver assumes the switches will always proactively report their own state changes (and make sure that state change is ACKed)? Problem is, I've observed Hubitat thinking a switch was off when it was on. If that state can possibly be wrong, am I dead in the water with respect to reliability? If I retry an off command repeatedly, is there a chance it will loop forever, even if it had already successfully turned off but failed to properly notify the hub?

  4. What assumptions can I make from the standpoint of a defensive rule developer? If I better understand the guarantees Z-Wave and Hubitat provide, I could better choose when to add clunky logic to repair the state at the application level, and when to instead raise an alert (e.g., text myself about a violation of an invariant).

I'm essentially looking for ways to handle the less common extreme of the trade-off: cases where it's fine to be clunky and slow as long as it definitely works, and that it alerts me the rare times that it doesn't (where rare means months, not days). I'd really appreciate any tips, suggestions, or just discussion!
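The retry idea from question 2, the loop-forever worry from question 3, and the "alert me when it doesn't work" goal can be combined in a bounded retry. The following is only a rough Python sketch of that logic, not Rule Machine syntax and not a Hubitat API: `send_off`, `reported_on`, and `alert` are hypothetical callbacks standing in for "send the off command", "read the state the hub currently reports", and "text me". The attempt limit is what keeps the loop from running forever if the hub's reported state is stale.

```python
import time
from typing import Callable


def turn_off_with_retry(
    send_off: Callable[[], None],        # hypothetical: sends one "off" command
    reported_on: Callable[[], bool],     # hypothetical: True if the hub still reports "on"
    alert: Callable[[str], None],        # hypothetical: notifies a human
    max_attempts: int = 3,
    settle_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send "off" at most max_attempts times, then alert instead of looping forever."""
    for _ in range(max_attempts):
        send_off()
        sleep(settle_seconds)  # give the device time to report its new state
        if not reported_on():
            return True  # the hub now reports the switch as off
    # Either the switch is really still on, or it turned off and the report was lost.
    alert(f"switch still reports on after {max_attempts} attempts")
    return False
```

Note that a `True` result only means the hub reports the switch as off. If the reported state itself can be wrong, as described in question 3, this cannot detect that case, which is why the fallback is an alert and not more retries.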

I have always been under the impression that the early GE/Jasco Z-Wave switches had issues reporting status. There's a Z-Wave poller app here that mentioned something about GE/Jasco, IIRC. Maybe I'm going crazy?

Rather than read any further into kludges for my two old GE/Jasco switches, I ordered replacement Inovelli Red 2 switches with scene control and ditched the old ones.

Good to know. I should have mentioned mine are all the newer 2nd gen GE switches, for what it's worth. And they have been mostly reliable and I'm happy with them.

But no matter how reliable they are, no switch (these included) will work 100% of the time. I'm interested in provisions and strategies for detecting and responding to whatever problems there are, which is slightly different from building a more reliable switch (since there will always be the possibility of error in a networked system).


Just a wild guess:

  • You might still be having mesh issues. A dozen switches isn't a lot, to be honest, especially across an area that "large".

  • I have seen something similar on this forum before, and the fix was to add some delays between turning on or off devices. So switch a couple, delay 2 seconds, switch a couple more, delay, and so on. That would not be ideal if you have a group of recessed lights or something, but for turning off everything in the house at bedtime, the delays would be acceptable.


Possibly mesh issues, though my hub is centrally located, there's no more than 6m between switches, and the farthest switch is only about 20m from the hub with 1-2 intervening walls.

Good idea re: delays. Almost by definition, the only "critical" cases are when I'm not around to observe whether it worked, so weird delays are totally fine.

My current idea is to create a new "turn off all the lights, no matter what!!" rule that gets called from just those critical places (when away, when sleeping, etc). That one rule can abstract away an arbitrarily complex dance to make sure they're off with small numbers of switches at a time, delays, and repeats if a switch isn't reporting off.

Ya... this is one of the annoyances of Z-Wave. In general with these devices you can send the command, but you can't guarantee that the command will reach the device and be executed on it. Zigbee at least has a way to send the command to multiple devices at once to avoid this type of thing.

Usually the way the drivers work is that they will send the command, and then eventually the device will receive it, execute it and then report back the state to the driver which in turn updates the device state on Hubitat. This can get a bit clunky. If you try to execute a bunch at a time, it will flood the network with data and cause delays. I've done tests where I sent an off command to 30 switches in my house and waited. It took minutes to execute them all. I could sit there and watch them slowly turning off.

One of the best ways around this as mentioned above is to execute the commands in small chunks with a delay between each.
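As a rough illustration of the "small chunks with a delay" approach, here is a Python sketch of the control flow only. It is not Rule Machine or Hubitat driver code: `send_off` is a hypothetical callback that sends the off command to one named switch, and the batch size of 2 and delay of 2 seconds are just the figures suggested earlier in the thread, not tuned values.

```python
import time
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def turn_off_in_batches(
    switches: Sequence[str],
    send_off: Callable[[str], None],     # hypothetical: sends "off" to one switch
    batch_size: int = 2,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Send "off" a few switches at a time, pausing between batches."""
    batches = chunked(switches, batch_size)
    for index, batch in enumerate(batches):
        for name in batch:
            send_off(name)
        if index < len(batches) - 1:
            sleep(delay_seconds)  # let the mesh drain before the next batch
    return batches
```

With five switches and the defaults, this sends three batches (2, 2, 1) with two pauses, so the whole routine takes about four seconds longer than sending everything at once.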

Another thing I would suggest is working on the Z-Wave mesh to ensure it is as efficient as possible. Some ideas: make sure any devices that repeat are Z-Wave Plus, keep non-Plus devices at the edge, get rid of useless chatter on the network such as power reporting, and run a repair whenever you make changes so the routes are all up to date. Also, if you have secure pairing turned on, I recommend turning it off on every device where that is possible. I think it must stay on for locks and garage door openers, but I would keep it off for everything else so you get the best Z-Wave Plus speeds. Keep in mind that any poller app you run adds extra traffic on the Z-Wave mesh that you don't want. If it is an option, I would recommend upgrading the devices so you don't need the poller app.

If you want to take it to the next level, you can get into writing your own app to handle all this. A scene app would be a nice start. I wrote my own scene app to address these issues. If a device was already off, it wouldn't send the off command again, which eliminated traffic. You have so much power and can do whatever you want once you learn how to do that, but I understand this also isn't for everybody.
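The "don't resend off to a device that is already off" idea can be sketched in a few lines. This is Python for illustration only, not the scene app described above and not Hubitat's app API: `reported` is a hypothetical mapping from switch name to the state the hub currently reports, and `send_off` is a hypothetical callback that sends the off command to one switch.

```python
from typing import Callable, Dict, List


def off_only_where_needed(
    reported: Dict[str, str],            # hypothetical: switch name -> "on" / "off" as reported
    send_off: Callable[[str], None],     # hypothetical: sends "off" to one switch
) -> List[str]:
    """Send "off" only to switches not already reported as off; return their names."""
    sent = []
    for name, state in reported.items():
        if state != "off":  # anything other than "off" (including unknown) gets a command
            send_off(name)
            sent.append(name)
    return sent
```

The trade-off is that this trusts the reported state. If the hub can believe a switch is off when it is actually on, as reported earlier in the thread, that switch would be skipped, so for critical cases it may be safer to send the command anyway.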
