Seeking Better Certainty of Device States

robert.bruce · June 9, 2023, 1:45pm

I apologize for the following post. It is very long. The issue is complicated.

I use a C-7 Hub Mesh and Z-wave devices to control the physical environments in rooms of my house that I use for my Indigo Snake breeding operation. If lights stay ON when they are supposed to be OFF, or if heaters/air conditioners are in the wrong state, there is the potential that I could lose animals, eggs, and an entire year of income or more if I don't catch it in time.

Over the past three years, I estimate that I've spent 1% of my rule-writing time on basic rules to control device states in the manner that I need, and 99% of my time writing much more complicated rules to achieve reliability of those device states. So far, although I have been able to improve the reliability of my control immensely, I haven't been able to achieve as reliable a setup as I know is possible. What's missing, as I see it, could be solved with a minimal change to some logic within Rule Machine (or maybe I'm just missing something).

The problem arises from the fact that radio-based automations are never perfect. Sometimes a hub can issue a command, and the device doesn't receive the command. Sometimes the device receives the command and carries it out, then sends a response back to the hub telling the hub that the command was carried out, but the hub doesn't receive the response (called an "ACK" as I understand it). There are other scenarios that can cause this. The bottom line is that a hub sometimes thinks that a device is ON when it is OFF, and vice-versa.

The following is stripped-down code to show an example of the best I seem to be able to do so far:

In the first image is simple code showing that instead of just turning a heater switch ON, I run a rule action to do that.

In the second image is a rule action that runs a Repeat Loop which turns the heater switch ON and waits for a global variable called "ValidState" to indicate that the switch was actually turned ON correctly (ValidState = 1).

In the third image is code from a third rule that sets ValidState to 1 when it is triggered by the event of the switch actually turning ON. ValidState = 0 is the situation for Heater OFF btw.

These three rules (in reality it takes at least four or five rules to do it properly and a lot more code than I'm showing here) prevent 99.9% of the logic-based failures that I am likely to encounter with only a simple ON/OFF rule.

The remaining problem lies in the meaning to Rule Manager of the event trigger "EggRm Heater turns ON." This event fires when the state of the heater changes from OFF to ON. If the hub thinks that the heater is ON (but by error it is actually OFF), when my rules call for the heater to turn ON, the heater may turn ON but the repeat loop will go on indefinitely because the hub thought the heater was already ON and therefore won't detect a change of device state. The only sure-fire method to verify that the heater is actually ON in this circumstance is to turn it OFF, detect a valid OFF state, and then turn it back ON again to detect a valid ON state.
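To make this failure mode concrete, here is a minimal Python sketch. It is not Rule Machine code, and the names `Hub`, `run_repeat_loop`, and `valid_state` are invented for illustration. It assumes the hub fires an event only when a report differs from what it already believes, which is the behavior described above.

```python
class Hub:
    """Toy hub that only fires events on a change in the *believed* state."""

    def __init__(self, believed_state):
        self.believed_state = believed_state
        self.listeners = []

    def on_report(self, reported_state):
        # A report that matches the believed state is not treated as an event.
        if reported_state != self.believed_state:
            self.believed_state = reported_state
            for listener in self.listeners:
                listener(reported_state)


def run_repeat_loop(hub, device_state, max_tries=5):
    """Send ON up to max_tries times; return (confirmed, tries_used)."""
    valid_state = {"value": 0}  # stands in for the ValidState global variable

    def handler(state):
        if state == "on":
            valid_state["value"] = 1

    hub.listeners.append(handler)

    for attempt in range(1, max_tries + 1):
        device_state["switch"] = "on"          # the device obeys the command
        hub.on_report(device_state["switch"])  # the device reports back
        if valid_state["value"] == 1:
            return True, attempt
    return False, max_tries
```

With `Hub("off")` the loop confirms on the first try. With `Hub("on")` and a device that is really off, the heater turns on every time but the loop never sees a confirmation, because no report ever counts as a change.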

What is needed is a "Heater Reports ON" event, not a "Heater Turns ON" event. I've seen log entries like that from some Z-wave devices. Most Z-wave devices ACK what I call a "duplicate" command. That is, if the device is sent a command to turn ON and the device is already ON, the device will still send back a response to the hub, either "EggRm Heater is ON" or "EggRm Heater reports ON." The problem is that I can't find any way to trigger an event based on the ACK to the duplicate command that I know the hub is receiving.

I thought I had found my holy grail when in searching, I stumbled upon the "custom" event trigger for switch changes. With that event, the language used is "device reports ON."

Unfortunately, with the testing I've done, even though the devices are actually reporting their state following a duplicate command (as can be seen in the hub logs), the custom switch event seems to have the same issue that I consider a logical failure: "device reports ON" actually means "device reports ON and it was changed to ON from OFF."

There is also a digital switch event that can be accessed for devices on the Hubitat hub, but it seems to have the same shortcoming.

To truly know the state of a device from within a rule following a command being issued to the device (including a Refresh command btw) the writer of the rule has to test for the report of the device state sent by the device. That means there has to be an event trigger that responds to the ACK that good devices send when they are commanded to go to a state that they are already in.

Either I'm missing something or this is a serious logical failure in Rule Manager logic. If "device reports ON" actually meant that, my entire automation setup would be more reliable, would give me far fewer false alarms, and it would take less code to get there.

aaiyar · June 9, 2023, 2:43pm

This is fundamentally impossible. At best the hub knows the last reported state of the device.

sburke781 · June 9, 2023, 2:50pm

I'll admit I haven't read all of your post... but if I have got the gist of your issue right...

I feel like you are missing where the problem likely lies... Or at least where to spend your time to fix it. If your devices are reporting inconsistently, then I expect you need to focus your time on either or both the strength and reliability of your mesh network and/or the quality of your devices on that network. Hubitat can only work with what it is provided and if the devices used or their interactions are not consistent and reliable, then there is only so much you can do programmatically to compensate.

FriedCheese2006 · June 9, 2023, 2:52pm

If your devices and hub are communicating properly, none of this is an issue.

To focus on the same device, when the device page is showing the errant status, does a poll or refresh command (if available) update the status properly? Are these all Z-Wave devices or ZigBee or a mix? If Z-Wave, posting your Z-Wave table from the settings menu might show some reasoning for it.

bertabcd1234 · June 9, 2023, 3:01pm

Logs don't tell you what events are actually generated, though with "Enable descriptionText logging" turned on, you'll conventionally get the description text for events that the driver sends to the hub for processing -- but for most drivers, this is regardless of any default platform filtering that will happen during that processing. The end result is (normally--though the driver can control this for some events where it makes sense, like button presses) that events with the previous value as the existing value are discarded, as they do not represent a state change.

So what you'd really need to look at is "Events" on the device detail page for the device in question, not logs.
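A small Python sketch of why the two views can differ, under the assumption that the driver logs description text for every report while the platform records an event only when the value changes. The function name `process_reports` is made up for this illustration and is not part of any Hubitat API.

```python
def process_reports(reports, initial):
    """Return (log_lines, events) for a sequence of reported switch values."""
    logs, events = [], []
    current = initial
    for value in reports:
        logs.append(f"switch is {value}")  # the driver logs every report
        if value != current:               # the platform keeps only changes
            events.append(value)
            current = value
    return logs, events


logs, events = process_reports(["on", "on", "off", "off", "on"], initial="off")
# logs has five lines, while events holds only the three real changes.
```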

Likely because of the above, how events and app subscriptions to those events work.

But I agree with the above: figuring out why this is happening on your network is likely to be a better use of time than figuring out a way to make a workaround like this do something that you want. Some devices are certainly finicky, and you might want to do this with them, but in most cases, this isn't necessary, and addressing the underlying problem will make things better all-around.

robert.bruce · June 9, 2023, 3:02pm

@aaiyar of course you are right. If you take a device outside of the range of the hub, or if you put it inside a metal box, and the device state changes, the hub will not know that.

It sometimes happens, no matter how good the mesh is, that a device reports its state and the hub doesn't get the report. If a Hubitat hub doesn't get a response to a command, it does nothing. The rules that I've written remedy this to a great extent. My rules don't do nothing. My point is that there is a level of certainty that is better than what I can do with the current logic in Rule Manager. Better certainty is better.

aaiyar · June 9, 2023, 3:16pm

These can be easily avoided. But there are other situations that are harder to avoid. Most sensors use lithium chemistry batteries, whose discharge characteristics make a sudden drop in communication likely. I’ve seen sensors work with a battery level of 2.73V, but an identical sensor fail at 2.8V.

jtp10181 · June 9, 2023, 3:16pm

You would need to make a custom app then. You can make more customized and powerful "rules" directly in app code. There are some users who make custom apps instead of using RM.

robert.bruce · June 9, 2023, 3:24pm

@bertabcd1234 thanks for your desire to help. Yes, I have description text turned on for my devices. I can click ON from within the device page when the device is already in the ON state and in the Hub Logs I will see "device is turned ON" or "device reports ON." This log entry shows that the hub is receiving the ACK.

I can't sit in front of my computer and watch the logs 24 hours a day. I need to be able to get an event trigger on these duplicate ACKs from within rules in order to improve the reliability of knowing what's going on. Then a rule can notify me if there is a problem. The problem would be indicated by the Repeat Loop going on continuously, which would set a "FaultCount" variable that you can see in my code. If the FaultCount is increasing and not being reset to 0, the device is not doing what it should be doing. A Hubitat hub will do nothing if it gets no response that a command was correctly performed.

Unfortunately my rules are limited by a logical limitation within Rule Manager, as best as I can determine. I can see the duplicate ACKs in Logs, but I can't get an event trigger for them that I can use in rules. As a result, I can't discriminate between an actual device or hub failure, and this logical problem combined with a single error in hub knowledge of device state.

I can solve this with a gimmicky workaround, turning the device OFF and then back ON (or vice-versa) but there are some instances when I don't want to do that. For example, I don't want to turn lights ON at night for animals in order to verify they are OFF, because that could affect their day/night cycle and their breeding fertility etc.

robert.bruce · June 9, 2023, 3:26pm

@aaiyar I use duplicate sensors, battery and AC powered

robert.bruce · June 9, 2023, 3:41pm

@sburke781 yes of course, thank you for your desire to help. From my perspective, I know, you have to attack the problem from all angles. There are other things affecting reliability (hub crashes, Zwave crashes, power outages, equipment failures of many sorts, poor mesh setup), and reducing those things is extremely important. I'll probably post about some of my experience with those things in other threads. Here though, in this thread, is one specific thing that I need for my rules to do a better job knowing what is happening. Zwave devices ACK when they get duplicate commands, but like @bertabcd1234 said, these ACKs are sometimes discarded. In Rule Manager, I can't find any way to trigger on those ACKs even though I can see them with my eyes in Logs. For most people, the level of certainty I need is not important. But equipment engineers have designed these devices to ACK even when they get a repeat command. Some people need this. The devices do this, yet Hubitat has no way to detect it within rules, even though you can see these reports by sitting in front of a computer and watching Logs.

bertabcd1234 · June 9, 2023, 3:42pm

This wasn't really a question, just mentioning how it works. It won't affect the outcome. The authoritative source for device events is "Events" on the device detail page for the specific device, even if most drivers will also write something to logs with this option enabled.

Sort of. These logs work at a much higher level. You'd need a Z-Wave sniffer to see the ACK. But yes, you're unlikely to see something in there for a Z-Wave device unless a command was actually acknowledged, in which case you could probably infer this much.

Rule Machine, as in your screenshot, I assume? Not trying to be pedantic, but if you're using a different app, all bets are off. Again, this behavior is due to default platform event filtering, as I described above, and the way that app subscriptions to these events work (as a trigger or wait in RM normally creates).

As @jtp10181 notes, you might be able to do more with a custom app. For example, you could set a flag on the subscription to ignore filters, which would help if the driver is the kind that sends the event, regardless of current state, and lets the platform handle the rest. (This is probably how most Zigbee and Z-Wave drivers work; a lot of polling-based drivers, which some LAN and cloud integrations rely on, might not -- but this all depends on specific implementations.)
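As a rough illustration of the idea, here is a toy Python model of per-subscription filtering. It is not Hubitat code: the `Platform` class and the `ignore_filter` flag name are invented for this sketch, so check the developer documentation for the actual subscription option. It also assumes the driver sends an event for every report and leaves filtering to the platform.

```python
class Platform:
    """Toy event bus: filters unchanged values unless a subscriber opts out."""

    def __init__(self, initial):
        self.current = initial
        self.subscriptions = []  # list of (handler, ignore_filter) pairs

    def subscribe(self, handler, ignore_filter=False):
        self.subscriptions.append((handler, ignore_filter))

    def send_event(self, value):
        is_change = value != self.current
        self.current = value
        for handler, ignore_filter in self.subscriptions:
            # Default subscribers hear only changes; opted-out ones hear all.
            if is_change or ignore_filter:
                handler(value, is_change)
```

A handler subscribed with `ignore_filter=True` would receive a duplicate "on" report with `is_change` set to `False`, which is the "device reports ON" signal being asked for in this thread.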

The developer documentation has some information on the general environment and how to get started with apps, including one that demonstrates how to create event subscriptions (a typical feature of pretty much any automation app). This is writing code and is where you can get more customization. Rule Machine is just a point-and-click GUI so you don't have to write custom apps if you don't want to (the summary of the rule actions is displayed in a sort of pseudo-code, likely because that is how the author believes the actions are best summarized, but it's not code per se).

Many people create the kind of rule you're asking about by using "Wait for expression," where the rule will proceed if the device is already in the specified state or wait for an event that causes it to become that state if not (if you just sent a command to the device in the previous action, you may want to wait a second or two to give it time to respond and for the hub to process the event). If you want to use "Wait for event," you'll need an actual event, a state change (or at least one as far as the hub can tell, which generally only happens if the new state is different than the previous reported state). This is even more unusual than the previous kind of rule, which is generally only needed for extremely picky devices (and in other cases would be better served by addressing whatever network or device problem might be causing this behavior in the first place). If you need something else, maybe take another look at the custom app idea...
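The difference between the two waits can be sketched in a few lines of Python. This is a simplified model and not how Rule Machine is implemented; the function names and the `incoming_events` list are hypothetical.

```python
def wait_for_event(incoming_events, target):
    """Proceeds only if an event for `target` arrives (a change the hub saw)."""
    return any(value == target for value in incoming_events)


def wait_for_expression(believed_state, incoming_events, target):
    """Proceeds at once if the hub already believes `target`, else waits."""
    if believed_state == target:
        return True
    return wait_for_event(incoming_events, target)


# The hub already believes the switch is on, so no change event will arrive.
wait_for_event([], "on")             # False: nothing to trigger on
wait_for_expression("on", [], "on")  # True: proceeds on the believed state
```

Note that the expression form proceeds on the hub's last reported state, so it avoids the endless wait but does not by itself prove the device is really in that state.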

robert.bruce · June 9, 2023, 3:58pm

@bertabcd1234 personally what I would like to see is that "device reports ON/OFF" has a different meaning than "switch turns ON/OFF." Right now they are the same. I already spend too much time with automation and I could better use that to focus on my animals etc. Somewhere in Rule Manager there should be a "device reports" event that doesn't discard instances where the device state doesn't change. If a device reports it, you should be able to trigger on it. The reason this discarded information is important is because Hubitat, like any other hub, doesn't always know the correct current state of a given device. So what Hubitat thinks is not a state change could actually be a state change. I've seen it happen enough times already. Discarding those ACKs is discarding valuable information, and IMO, extremely valuable information.

bertabcd1234 · June 9, 2023, 4:02pm

I'm guessing a change will not be made to Rule Machine to accommodate one particular, unusual use case, but since you've asked, I guess we can see what they say.

But in the meantime, a custom app might work if written in the above manner and if the driver works as would be required for this (again, this is still all at the application level for Z-Wave; you are not dealing with ACKs per se--no app or driver on the hub does). That is the only option at the moment ... or re-thinking something about the above setup.

robert.bruce · June 9, 2023, 4:33pm

@bertabcd1234 , @aaiyar , if we could ask this question of Hubitat users - have you ever found your garage door open when it should be closed, or your Zwave front door lock unlocked when it should be locked, or water running when it shouldn't be...

How many people would that be? A lot. Probably the majority.

My examples of rule code at the beginning of this post can be used to immediately eliminate 95% of those situations. I can get that number higher if this tool is available that I need.

marktheknife · June 9, 2023, 10:13pm

This is also true of course, but I think what others are getting at is the notion that for most users, this isn’t a frequent enough event to justify a lot of time and effort spent working around it; hence the suggestions to look further into the root cause of the dropped signals in your case.

aaiyar · June 9, 2023, 11:29pm

Count me in the minority - I haven't had any of those happen in 3+ years. I did spend a fair deal of time re-designing both my z-wave and zigbee mesh in early 2020. Time hanging heavy on my hands due to pandemic-associated shutdowns.

ritchierich · June 9, 2023, 11:39pm

Agree. Occasionally I will have a switch on that reports off but that is rare. I had this problem more often years ago with older Zwave devices.

Believe it was asked above but I didn’t see an answer: what make and model are these devices? Seeing your income depends on these devices you should make sure they are current and up to date.

coreystup · June 10, 2023, 12:02am

I think what you're requesting is that a given device have both a logical state and a physical state. The rules/app would set and query the logical state, and a device driver would be in charge of getting the physical state to match the logical state. It would have all the retry and timeout and error handling logic within it to do so. So your rules are free from worrying about the physical. You could do this in a couple different ways, one is virtual devices that act as intermediaries between the logical and physical devices. Or write new device drivers that encapsulate that same logic.
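A minimal Python sketch of that split, assuming a hypothetical device that silently drops some commands and a readback that can be trusted (which is exactly the part that is hard over radio). `FlakyHeater` and `reconcile` are illustrative names, not an existing driver API.

```python
class FlakyHeater:
    """Hypothetical device that drops the first `drops` commands it is sent."""

    def __init__(self, drops):
        self.state = "off"
        self.drops = drops

    def send(self, desired):
        if self.drops > 0:
            self.drops -= 1  # command lost over the air
            return
        self.state = desired

    def read(self):
        return self.state


def reconcile(device, desired, max_retries=3):
    """Drive the physical state toward the logical one; return tries used."""
    for attempt in range(1, max_retries + 1):
        device.send(desired)
        if device.read() == desired:
            return attempt
    raise RuntimeError(
        f"device did not reach {desired!r} after {max_retries} tries"
    )
```

The rule would only ever set the logical state ("heater should be on"); the retry, timeout, and error handling live in one place, and a raised error becomes the notification.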

This is what we'd do in the embedded controller world that works with physical device inputs and outputs.

In your use case of preventing loss or injury due to failure, I would also use multiple data points that all must agree before an error is notified or error correction behaviour is acted upon. Similar to industrial or automotive - when a current draw is commanded on, monitor the current use. If it falls out of a range, flag an error. A heater could easily be monitored for temperature delta AND current draw. All these extra checks could be wrapped in the physical driver that does all this work for you to make sure everything is in range of whatever logical state the device should be in.
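One way to sketch the "all data points must agree" check in Python is shown below. The function name and both thresholds are placeholders chosen for illustration, not values from any real heater or driver.

```python
def heater_fault(commanded_on, current_amps, temp_delta_c,
                 min_amps=1.0, min_delta_c=0.5):
    """Flag a fault only when both measurements contradict the command.

    min_amps and min_delta_c are hypothetical thresholds; real values
    depend on the heater and the room.
    """
    drawing = current_amps >= min_amps   # heater is pulling power
    warming = temp_delta_c >= min_delta_c  # room temperature is rising
    if commanded_on:
        return (not drawing) and (not warming)
    return drawing and warming
```

For example, `heater_fault(True, 0.0, 0.0)` flags a fault, while `heater_fault(True, 8.0, 0.0)` does not, since only one of the two signals disagrees with the commanded state.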

robert.bruce · June 10, 2023, 3:55am

@coreystup thanks for your input. Here is a completely different example that is more along the lines of your advice:

You have a motion detector that triggers your hub to turn on lights inside your house when motion is detected. You have also written rules to notify you if you are away from home and the motion detector is triggered but the lights don't go off after a certain period of time. When that happens, in some instances you find that the hub had originally turned the lights on properly, but because of radio interference, didn't get the ON confirmation. The hub at that point thinks the lights are still in the OFF state. When your rules then command the lights OFF after a defined period of time, the hub cannot get confirmation of OFF based on an event such as "Wait for event: lights are turned OFF." It turns out, based on the current logic/capabilities in Rule Manager, that it is impossible to get confirmation of that event. When you get home and look at your Hub logs, you find that there is an entry for lights being turned off. You can see it with your eyes, but Rule Manager rules can't detect it in that situation.

So you add into your rule a Refresh command for the lights that occurs when there is no confirmation of an OFF event. Again, you see in the Log that the Refresh worked, but your rules won't trigger on the report of "Switch is OFF" from the device refresh. This is not possible right now without the addition of a "Switch Reports OFF" event in Rule Manager.

There are people here trying to help and suggesting to focus on the quality of the mesh, quality of devices etc. and I appreciate their input. For them, this would be such a rare occurrence that they can ignore it. But for you, you have pet cats that trigger your motion detectors frequently, and so the probability that this scenario happens is much higher for you. Or you might have times of high radio interference in your neighborhood. Or the geometry and construction of your house makes dropped radio signals happen more frequently for you.