Stuck on satisfied wait condition, even after wait timeout

peterwoggle · June 17, 2020, 7:08pm

Update: Found out once I collected more logs in post 6 that the wait action was the culprit.

This is a more specific followup to my earlier questions about handling failure and defensive rule-writing [1].

When I go to sleep, a rule runs that sets lights to 10% (so that they're dim if I turn them on in the middle of the night), then runs the actions of another rule to turn off all lights in the house. The rule that turns off all lights is super, super defensively written so that it attempts small subsets of lights (to prevent Z-Wave congestion) multiple times with delays between attempts (to sidestep transient issues). The goal of that overly verbose rule is to make absolutely sure that no lights stay on all night (or for weeks, if I'm away on vacation).

Yet, last night, I woke up to find all the living area lights on at 10%, and it appears that the rule to turn them off didn't run. Here's the sleep rule, which did run and set the lights to 10% (and the logs show that it only triggered once, so no concurrency issues):

That rule explicitly calls this verbose rule to turn off the lights:

Which, in turn, triggers this helper action (split up to serve as "functions" that other rules can use to reliably turn off subsets of lights):

Unfortunately the rules only logged triggers and events (I enabled logging actions as well after this incident). Here are the logs from last night, where it's clear that the sleep rule ran, but there are no events or switch changes to turn them back off (there's nothing relevant in the logs after this):

All I can figure is that the "run actions" action in the sleep init rule was dropped somehow. Have others seen that, or is there another explanation I might be missing? I'm running HE version 2.2.1.113.

[1] My more general thread on reliability best practices:

zarthan · June 17, 2020, 8:15pm

Without logs it is tough to say what happened. If it is a regular occurrence you might want to temporarily turn on as much logging as you can. I have a routine that turns things on that should be on and things off that should be off. I have that repeat later just in case. You might try @bptworld's Device Check app. Hubitat/Apps/Device Check Plus at master · bptworld/Hubitat · GitHub
There may be other apps available. I don't remember if @Cobra has an similar app.

bravenel · June 17, 2020, 9:32pm

Again, you have to turn on Action logging. It's the only way to know what is going on.

You're going over board with your rules, so much going on that isn't needed. For example, why repeat something 3 times? Are you thinking that maybe turning off a light doesn't work? If turning off a light doesn't work, there is a fundamental issue that needs to be rooted out, not a repetition of the off command. Please try making this simple and straight forward, instead of convoluted and complex.

If you are having problems with lights turning off, you should dig into that issue. This approach is just going to create confusion for you.

bravenel · June 17, 2020, 11:17pm

Let's look at this one: First you dim these lights to 10, and then you Wait for them to be on with a timeout of 1:00. Why wait? Why the timeout? You've given the command, it will be done. A short delay before the next steps would be OK, but there is no reason to Wait like this.

Again, if your devices are not responding as you expect, you wouldn't use a rule to overcome that. Instead, you'd dig into what's going on with the devices, and correct that. The hub is very reliable for sending commands to devices. What's happening with this rule is the needless adding of complexity for no improvement in what is happening with the devices.

Then you run the second rule actions. Are you also running the "Turn off all lights" rule from elsewhere, from other rules?

peterwoggle · June 18, 2020, 4:59am

Thank you for your help and attention. It's very cool that Hubitat has this forum and that people are so willing to help.

Point taken: I've enabled action logging, and I'll stop spamming the list with my partial logs!

RM coding style aside, something else must be going on here. However verbose or unorthodox my rules are, the flow of that particular jump from the first rule to the lights-off rule seems easy enough to reason about that I expected to see at least one off event in the logs (action logging wasn't enabled, but event logging was, and it should have reached at least one lights-off action).

I'll be more helpful if it happens again.

I still defend my wait and wait timeout, and, by the same token, at least pieces of the verbose lights-off rule.

Wait: As far as I can tell, when the switches receive the off command at least (or are turned off physically), they fade all the way down before they send the "I'm off" state change (even the LED on the switch doesn't indicate it's off until then). Even if they ACK that they received the command (I don't know the Z-Wave protocol), the state doesn't seem to change synchronously, so any later actions that assume they're already on (such as turning them back off), had better make sure these actions "take" in the right order. It's the idiom I've adopted to turn async requests into sync requests. I was bitten by it several times in multiple rules before I discovered this race between components. (GE 2nd gen dimmers, in case that matters.)

As for the timeout: I would never issue a synchronous RPC without a timeout, especially if it went over the air or touched other systems. Here, I don't know anything about the Hubitat stack or hardware, the drivers, the Z-Wave protocol itself, or the Jasco hardware or firmware. I do know that every switch must have to ack that it has been turned on before the wait condition could possibly be satisfied. If one switch misbehaves (or is just unreachable) and prevents getting passed the wait, it'll hang indefinitely. The timeout seems like a cheap and straightforward way to gain some failure isolation in the one place I have control.

I have, in fact, observed Hubitat reporting the wrong state for a switch (quite possibly because of the switch). I'm fine with that because it's as rare as it is inevitable (the only variable being the actual failure rate). But this error compounds with more switches that a single action relies on, maybe doubly so if Z-Wave interferes with itself or floods the Hubitat with updates. I haven't found documentation on this, but I have anecdotal evidence that flipping many switches at once is less reliable than flipping one. Writing rules defensively seems like the responsible thing to do because this is a complex networked system.

The other relevant angle is that I really don't care about the 10%--it's a nice-to-have--but I do really want the lights to turn off, and some off is better than none. So even a small possibility of dislodging from a hung wait is probably worth it.

This philosophy applies to the nasty repeats as well. I totally agree it's ugly, and this is the only place I do anything like that because it's the only place I absolutely, for sure want to make sure the lights go off because I'm not there to observe them (and may not even be home).

In addition to the very occasional no-ACK scenario above (with compounding error), every now and then the Hubitat gets a little slower to respond to events, then it recovers. It's rare and not a huge deal at all, but it adds transient issues to the list of things I should design around (again, only in these "critical sections") by backing off and trying again later if something doesn't work. So, given that it could help in this one critical place as long as I can reason through my rule and iterate fixing any bugs, it seems like a reasonable trade-off.

I should point out that I'm not talking about huge errors rates here by any means, but enough that I'd like to address it. The lights have only still been on in the morning once since I added the extra loops ~50 days ago. But if I wake up more than, say, one morning per year and the lights are still on, that's outside my tolerance, and that could seemingly happen with a very, very small per-switch flip error rate.

So while I agree that there very well could be a bigger problem elsewhere, no networked system is perfect, and it seems that as long as I carefully audit that one rule (treat it like a critical section), it can only help isolate and work around certain kinds of failures.

Yes, I "call" that lights-off rule and, independently, its living lights off helper rule from elsewhere. I use it for various many-lights-off scenarios where it's okay for it to be slow and methodical (e.g., when I'm asleep or away).

peterwoggle · June 18, 2020, 6:51pm

Great news: it happened again! Data!

It got stuck on the wait indefinitely. As expected, the condition was not yet satisfied when entering the wait. Soon after, all the lights were on, which should have satisfied the wait condition. Then the timeout passed, which should have unblocked it again. It never made it beyond there.

Action logs from just that rule (same init sleeping rule posted above):

All logs from the same time:

I'll get rid of the wait action tonight since, as you said, I shouldn't need to worry about synchronization here.

Until then, let me know if I can help by pulling more logs or diagnostic info.

peterwoggle · August 1, 2020, 1:35am

Adding a cross-reference to another occurrence of this issue. Different rule, different type of wait, but same symptoms: