Seeking Better Certainty of Device States

I think what you're requesting is that a given device have both a logical state and a physical state. The rules/app would set and query the logical state, and a device driver would be in charge of getting the physical state to match the logical state. The driver would contain all the retry, timeout and error handling logic needed to do so, so your rules are free from worrying about the physical. You could do this in a couple of different ways. One is to use virtual devices that act as intermediaries between the logical and physical devices. Another is to write new device drivers that encapsulate that same logic.

This is what we'd do in the embedded controller world that works with physical device inputs and outputs.

In your use case of preventing loss or injury due to failure, I would also use multiple data points that all must agree before an error is notified or error correction behaviour is acted upon. Similar to industrial or automotive - when a current draw is commanded on, monitor the current use. If it falls out of a range, flag an error. A heater could easily be monitored for temperature delta AND current draw. All these extra checks could be wrapped in the physical driver that does all this work for you to make sure everything is in range of whatever logical state the device should be in.
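As a rough sketch of the "multiple data points must agree" idea for a heater, something like the following could live inside the physical driver. The thresholds, field names and `heater_fault` function are all hypothetical and only illustrate the shape of the check; real values depend on the heater and the room.

```python
from dataclasses import dataclass


@dataclass
class HeaterReading:
    commanded_on: bool    # logical state requested by the rules
    current_amps: float   # measured current draw
    temp_delta_c: float   # temperature change over the sample window


# Hypothetical thresholds, chosen only for illustration.
MIN_ON_AMPS = 2.0
MAX_OFF_AMPS = 0.2
MIN_ON_TEMP_RISE_C = 0.3


def heater_fault(reading: HeaterReading) -> bool:
    """Return True only when BOTH signals disagree with the commanded state."""
    if reading.commanded_on:
        current_bad = reading.current_amps < MIN_ON_AMPS
        temp_bad = reading.temp_delta_c < MIN_ON_TEMP_RISE_C
    else:
        current_bad = reading.current_amps > MAX_OFF_AMPS
        temp_bad = reading.temp_delta_c >= MIN_ON_TEMP_RISE_C
    # A single odd reading is not enough; every check must agree.
    return current_bad and temp_bad
```

Requiring agreement trades a little detection speed for fewer false alarms from a single noisy sensor.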


@coreystup thanks for your input. Here is a completely different example that is more along the lines of your advice:

You have a motion detector that triggers your hub to turn on lights inside your house when motion is detected. You have also written rules to notify you if you are away from home and the motion detector is triggered but the lights don't go off after a certain period of time. When that happens, in some instances you find that the hub had originally turned the lights on properly, but because of radio interference, didn't get the ON confirmation. The hub at that point thinks the lights are still in the OFF state. When your rules then command the lights OFF after a defined period of time, the hub cannot get confirmation of OFF based on an event such as "Wait for event: lights are turned OFF." It turns out, based on the current logic/capabilities in Rule Machine, that it is impossible to get confirmation of that event. When you get home and look at your Hub logs, you find that there is an entry for lights being turned off. You can see it with your eyes, but Rule Machine rules can't detect it in that situation.

So you add into your rule a Refresh command for the lights that occurs when there is no confirmation of an OFF event. Again, you see in the Log that the Refresh worked, but your rules won't trigger on the report of "Switch is OFF" from the device refresh. This is not possible right now without the addition of a "Switch Reports OFF" event in Rule Machine.

There are people here trying to help and suggesting to focus on the quality of the mesh, quality of devices etc. and I appreciate their input. For them, this would be such a rare occurrence that they can ignore it. But for you, you have pet cats that trigger your motion detectors frequently, and so the probability that this scenario happens is much higher for you. Or you might have times of high radio interference in your neighborhood. Or the geometry and construction of your house makes dropped radio signals happen more frequently for you.

Device events that do not involve a change of state are not passed on by the hub to apps, be it Rule Machine or any other app. This is a core part of the architecture of the hub, and isn't really the source of the problem you are chasing, only a superficial artifact of it. If your theory is that an On event that is sent when the state of the device is already On, should be used to trigger an app, you'd be very unhappy with the outcome. This would be a completely different architecture than this hub. This has nothing to do with Rule Machine at all.

This has nothing at all to do with Rule Machine. It is merely responding to events and evaluating states as reported by devices and the hub. If the events or state are false, wrong results are inevitable irrespective of the app used.

You are barking up the wrong tree chasing reliability with rules. The overall reliability is limited by the least reliable element, and try as you might you can't overcome that. This has nothing to do with ACKs or device states and events. It has to do with things that cannot be relied on due to their inherent flakiness. All RF based communications protocols for consumer products suffer from certain intrinsic weaknesses. Robust systems would require rather huge effort to overcome these problems, an expense not commensurate with the value of the problem set. Your home automation system is not an ICBM failsafe launch system, and never will be.

The best way to achieve the highest reliability is to follow the KISS principle throughout your automation setup. The fewer moving parts, the fewer conditional tests, the better. Events trigger apps, resulting in states. It will NEVER be perfect. There will be automation failures from time to time.


If I were to implement the error handling of lost packets/RF interference at the driver level for a zwave power plug controlling a heater, I would do something like this:

  • Use a custom zwave driver (as the physical device in my example) for the plug that adds additional error handling features. If it requests a power on, yet doesn't receive a message saying the switch is on, it queries the status until it's known, then either retries the command or posts the state to the logical driver as the new confirmed state. It would have retry and a watchdog/timeout if no confirmation was received. Eventually, if it couldn't resolve the requested physical change, it would post an error and give up.
  • Use a virtual device (thermostat or whatever) that I would drive with the automation rules. The virtual device acts as the logical state. The virtual device commands the physical driver mentioned above.
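To make the retry/query/give-up flow above concrete, here is a minimal sketch in Python rather than Groovy. The `reconcile` function, its parameters and the `FakePlug` test double are all hypothetical, and bounded attempt counts stand in for a real wall-clock watchdog.

```python
from typing import Callable, Optional


def reconcile(
    desired: str,
    send_command: Callable[[str], None],
    query_state: Callable[[], Optional[str]],
    max_attempts: int = 3,
    max_queries: int = 3,
) -> str:
    """Drive the physical device toward `desired` ("on" or "off").

    Returns the confirmed state, or raises RuntimeError after giving up.
    """
    for _ in range(max_attempts):
        send_command(desired)
        reported = None
        # Query until the state is known; None means no reply arrived.
        for _ in range(max_queries):
            reported = query_state()
            if reported is not None:
                break
        if reported == desired:
            return reported  # confirmed: safe to post to the logical device
        # Unknown or wrong state: fall through and retry the command.
    raise RuntimeError(f"could not confirm switch {desired}")


class FakePlug:
    """Stand-in for a plug whose first `drops` commands are lost over RF."""

    def __init__(self, drops: int) -> None:
        self.state = "off"
        self.drops = drops
        self.commands_sent = 0

    def send(self, command: str) -> None:
        self.commands_sent += 1
        if self.drops > 0:
            self.drops -= 1  # command lost, device does not change
            return
        self.state = command

    def query(self) -> str:
        return self.state


plug = FakePlug(drops=1)
confirmed = reconcile("on", plug.send, plug.query)  # succeeds on the second try
```

The rules would only ever see the confirmed state or the final error, which keeps the retry logic out of the automation layer.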

However...

I agree with Bruce 100% on this. In the case of your breeding example (where your livelihood and/or animal lives are at stake), I would not use any RF environment for control/automation, only for monitoring. For your example with the heaters and environmental controls I would use closed loop thermostatic controlled heaters with some tiered failsafes (duplicate heater circuits and kill switches for overheat situations, etc). I would only use RF temperature sensors for monitoring/logging/notifications. Power monitoring too maybe.


@bravenel thank you for your reply.

What I am suggesting is not to change the current "Switch turns ON/OFF" event, but to add an additional choice of "Switch reports ON/OFF" event. Yes, this capability could be used in Wait for Event statements, Repeat/Until statements, and Triggers for rules etc.

I don't understand why you believe that I would be unhappy with that outcome. People would not be required to use that event choice, only have the option to use it. For me, I would be able to use this event to prevent the failures that I've mentioned.

I've already been able to greatly improve the reliability of my automation setup with rules like the ones I showed in the original post. When a device doesn't change state as it should, my rules detect it and the Repeat Loop in the Device ON or Device OFF rule continues to try until the device state changes. Ninety-nine plus percent of the time, within a few repeats, the device state changes properly. When it doesn't, it is usually due to the problem I'm mentioning in this thread. The ability to use these duplicate ACKs with a "Switch reports ON/OFF" event would enable me to increase this reliability even higher, and simplify my code writing immensely.

You have a fundamental misunderstanding. There is no difference beyond UI wording. Read about Events and States here: Introduction to Automation

You are free to do whatever you want, but this is not a real solution.

There are no such events reported by the hub to act upon. You could write custom drivers to do this, but trust me, you won't like the result. And, it's a huge waste of effort. If you are having to go to these contortions for your use case, you are using the wrong tools, as well said above by @coreystup.

If you want industrial grade automations, then you have to use industrial grade devices. These exist, are not cheap, and could be controlled by Hubitat. As long as you have flaky consumer devices, you're stuck with a bad fit.

If my hub gets fooled by a light being in the wrong state, maybe the light won't turn on when it should, or will stay on when it shouldn't. Such a failure self corrects on the next cycle. I'd guess this happens at a rate of less than 1 in 10000. The light not turning on has a super easy fix, hit the light switch. Attempting to take "home automation" beyond this expectation is not a smart move, imo.


Maybe it is just semantics or language. I re-read your article and all of that is very basic to me.

"Switch turns OFF" means the hub believes the device has changed state. It doesn't mean the device actually changed state; I've tested this. Zwave devices send a verification of their switch state back to the hub following a Switch ON or OFF command whether or not the state of the device actually changed. Whether that turns into an event on a Hubitat hub depends on whether the hub believes a change of state of the switch has occurred. If the hub believes no change of state occurred, then the verification of device state from the device is not saved as an event. That verification is, however, still visible in the Hub Logs.

What I mean by "Switch reports ON/OFF" as opposed to "Switch turns ON/OFF" is that I believe (but I don't know) that Rule Machine might be able to (with an update) fire an event for the verification of state it gets from the device after a command, whether the hub believes it is a state change or not. That is because the report of switch state that the device always sends can be considered an event of its own, and the hub normally fires an event after a report of a device state anyway. "Switch reports..." events would be a superset of "Switch turns..." events.

If you used rules like I use to verify that your light did what it was commanded to do, your 1 in 10,000 would be 1 in 1,000,000. I understand that you don't care if your automation fails at that low frequency, but my switches are changing state sometimes 500 times a day, especially when I'm incubating eggs and trying to obtain a tighter temperature control. It's not a good solution for me to leave a light on overnight. The 100-fold increase in reliability I get is real, not imaginary.
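To put those rates in perspective, a quick back-of-the-envelope calculation using the figures quoted above (500 state changes a day, 1 in 10,000 versus 1 in 1,000,000). It assumes every operation fails independently at a constant rate, which is a simplification; the function name is just for illustration.

```python
def mean_days_between_failures(one_in_n: int, operations_per_day: int) -> float:
    """Average days between failures for a '1 in N' per-operation failure rate."""
    return one_in_n / operations_per_day


baseline = mean_days_between_failures(10_000, 500)     # 20.0 days
improved = mean_days_between_failures(1_000_000, 500)  # 2000.0 days, about 5.5 years
```

At that switching rate, the difference is roughly one missed command every three weeks versus one every five and a half years.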

I understand what you are saying about industrial-grade devices and "flaky consumer devices." I don't want to change to industrial-grade automation. I like Hubitat and Zwave devices. The capability actually astounds me. I've been able to build in some redundancy as well as improve reliability 100-fold with the reliability-based rules that I've shown snippets of here. If it is possible that a RM change/platform update could make a "Switch reports ON/OFF" event available in RM, it would probably get me to a 1000-fold improvement in reliability. And like I mentioned, it would make my rule-writing a lot easier because I wouldn't have to go through as many "contortions" 🙂

I want to clarify what I meant when I wrote that I've tested this. I can take a simple Zwave switch device in its ON state that is paired to a Hubitat hub, shut down the hub, click the actuation button on the device to change its state to OFF, and then power up the hub again. At that point, the hub will not know that the device has changed state and will incorrectly believe that the switch is still in the ON state. I can then go onto the device page for the switch and attempt to turn it ON or OFF.

If I attempt to turn it OFF, the hub will receive an "event" (the ACK of the command) from the device (even though the device does not actually change state), and a "Switch turns OFF" event in RM will fire. This shows that the device sends an ACK to commands it receives whether or not it changes state.

The hub however, does not consider this to be a duplicate command and responds to the ACK from the device by changing the logical state of the device on the hub to OFF from ON. If I set a rule trigger or a Wait or Repeat/Until statement to the event "Switch turns OFF" for the device, that event will fire.

If I take the same device in the same starting condition (OFF state on a rebooted hub that thinks the device is ON) but I now command the device to ON, the device will change state from OFF to ON and also send an ACK to the successful command back to the hub. In this case though, the hub ignores the event it receives, and a "Switch turns ON" event will not fire. This event can't be used to satisfy Wait statements or act as rule triggers because the hub ignores these events. No matter how many times I re-issue an ON command, or a Refresh command for that matter, "Switch turns ON" will not fire. As a result, there is no way to verify that the device state is actually ON after the command (other than to turn the device OFF and then back ON). The device could just as well be unplugged from power, not in range of the hub, or completely failed. There is no way to know the difference without ON/OFF cycling the device because no event fires in RM when the hub believes that a device was commanded to a switch state that it was already in.

What this tells me is that whatever filter is discarding that event, that filter is at the hub level, not the device level. What I admit I don't know, is whether it is the device driver on the hub for the device that is stopping the hub from receiving the event, or whether the driver is passing the event to the hub and the hub is discarding the event (because it is not considered to be a state change). I'll be able to do a more intelligent job of discussing this if someone can enlighten me here.

What I am saying is that this is valuable information that should be available at the hub level in rules and which would allow me, and others who wanted, to write reliability-based rules with far more reliable outcomes. The increase in reliability is because the report of a device state after a command to change its state would be available to use to verify the successful operation of a rule, whether or not the hub properly knows the state of the device prior to execution of the rule.

I don't know about Z-Wave, but I'm pretty sure Zigbee messages always get forwarded to device drivers' parse() method and the driver code can elect to forcibly set the "state changed" flag when sending the event corresponding to that message, regardless of the value of the attribute - the default being to filter the event if the attribute has not changed. Perhaps you could write such drivers but that may create a lot of events (times the number of devices) and cause perf issues, etc.

What if the light bulb(s) are out? You probably have thought about all this already, but I wonder if there are other facts you can collect to assess the actual state of your controlled artifact. Perhaps your switch provides energy monitoring, or current measurement? What about (redundant) luminance or temperature sensors?


Drivers can post an event with a flag that says "force even if state didn't change". The event would then be posted as if it did change. It's up to the driver.
Edit: @hubitrep said the same thing above, I missed it. Lol.

That same driver could implement some retry, failsafe and logical vs physical state logic as I mentioned before.

A couple further thoughts:

  • Are your zwave devices at 100kbps? If not, it might be a good idea to pair with S2. S2 adds full CRC checksums on each message at all bitrates, whereas without security CRC checksums are only used at 100kbps. This would increase the reliability of messaging. I'm in the camp that uses S2 for any device that supports it. YMMV
  • Are you cycling zwave plugs with physical relays 500 times per day? Those relays have an electrical and mechanical lifespan rating (often something like 100k and 200k cycles respectively).
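For a sense of scale on the relay question, here is the arithmetic with the ballpark ratings mentioned above (100k electrical, 200k mechanical cycles) at 500 cycles per day. These are illustrative figures from the post, not datasheet values for any particular plug, and the function name is made up for the example.

```python
def days_until_rated_life(rated_cycles: int, cycles_per_day: int) -> float:
    """Days of continuous use before a relay reaches its rated cycle count."""
    return rated_cycles / cycles_per_day


electrical_days = days_until_rated_life(100_000, 500)  # 200.0 days
mechanical_days = days_until_rated_life(200_000, 500)  # 400.0 days
```

A rating is not a hard failure point, but at that pace the electrical rating would be reached in well under a year of continuous cycling.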

Perhaps this level of automation shouldn't be relied upon for this type of delicate life/death situation. You may want to rethink using hard wired automation with positive feedback. I think this is the only way you can be certain that things are working as you need them to work to make the eggs survive.


@coreystup I've been meaning to respond to your earlier post, because I agree with you in many ways and I appreciate your desire to help.

You may not have looked carefully at the simplified examples of what I call "reliability-based" RM rules I showed, but their purpose is to verify device state after commanding switch changes. I use a Repeat/Until loop with the switch command inside the loop to do what you are calling a retry. I use a "FaultCount" variable to count how many loops have occurred without a valid change of state. I can turn on a warning light or a buzzer etc. in my house, or send a notification to a device, if the FaultCount variable reaches a certain number. I also set a variable "ValidState" to what you are referring to as the logical state of the device, and it only gets set (to 0 for OFF and 1 for ON) if the rule can confirm that the device is in the desired/commanded state (from an event being fired in RM for the state change). When ValidState is set to 0 or 1, the Repeat loop terminates.
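For readers who think in code rather than RM actions, the loop described above is roughly equivalent to this Python sketch. It is not the actual rule: the function and callback names are hypothetical, and the `max_repeats` cap is an added assumption so the example always terminates.

```python
from typing import Callable, Optional, Tuple


def command_until_confirmed(
    target_on: bool,
    send_command: Callable[[bool], None],
    wait_for_event: Callable[[bool], bool],
    fault_limit: int = 5,
    on_fault_limit: Optional[Callable[[int], None]] = None,
    max_repeats: int = 20,
) -> Tuple[Optional[int], int]:
    """Repeat the switch command until an event confirms the new state.

    Returns (ValidState, FaultCount); ValidState is 1 for ON, 0 for OFF,
    or None if no confirming event was ever seen.
    """
    fault_count = 0
    valid_state = None  # stays None until an event confirms the state
    while valid_state is None and fault_count < max_repeats:
        send_command(target_on)
        if wait_for_event(target_on):      # "Wait for event" with a timeout
            valid_state = 1 if target_on else 0
        else:
            fault_count += 1
            if fault_count == fault_limit and on_fault_limit is not None:
                on_fault_limit(fault_count)  # e.g. buzzer or notification
    return valid_state, fault_count
```

The weak spot discussed in this thread is `wait_for_event`: if the hub already believes the device is in the target state, no event fires and the loop can never confirm.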

Of course as I have been saying, the RM event doesn't fire if the command is to change to a state the hub thinks the device is already in, and the only way to get out of that fortunately rare scenario (without the changes I'm seeking) is to cycle the device switch. Not only is that a gimmicky workaround, but it isn't always a good idea, with lights that need to stay OFF at night or an AC condenser motor.

Yes, a hard-wired automation system would be more reliable, and I've considered it many times, but I've always been able to find ways to circumvent the weaknesses I encounter in an RF setup. I currently run two C-7 hubs in a hub-mesh, and normally one hub does the rule/app work while the other does the radio work. I've set up a mechanism that each hub regularly tests the other for complete functionality (every 5 minutes), and for critical temperature control in my EggRoom, either hub can fail the other and take over complete control, with its own devices. I also have duplicate temp/humidity sensors in each room, one on each hub. I've been setting up my heaters and cooling units to be operable by two Zwave devices, and I've been working on adding physical high-temperature shutoff switches as well. So even though I'm choosing to stick with an RF setup, I'm improving its reliability all the time. I've been controlling my animal/egg rooms for more than 4 years this way, and although I've encountered some problems that could have become serious, I keep my animals at home where I spend most of my time, and I have always caught the problems early, knock-on-wood. I've never lost any animals/eggs and when I see a failure point, I've always found ways to minimize the possibility of a bigger problem.
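The mutual hub check can be pictured as a simple watchdog on each hub. This is only a sketch of the idea: the 5-minute interval is from my setup, but the class, the two-missed-checks tolerance and the hand-back behavior are hypothetical simplifications.

```python
HEARTBEAT_INTERVAL_S = 5 * 60       # each hub tests the other every 5 minutes
MISSED_CHECKS_BEFORE_TAKEOVER = 2   # hypothetical tolerance


class PeerMonitor:
    """Runs on one hub and tracks whether the other hub is still healthy."""

    def __init__(self, now_s: float) -> None:
        self.last_ok_s = now_s
        self.in_control = False

    def record_check(self, now_s: float, peer_ok: bool) -> None:
        if peer_ok:
            self.last_ok_s = now_s
            self.in_control = False  # peer is healthy again: hand control back
        elif now_s - self.last_ok_s >= HEARTBEAT_INTERVAL_S * MISSED_CHECKS_BEFORE_TAKEOVER:
            self.in_control = True   # peer has failed: take over with own devices
```

Tolerating one missed check avoids a takeover on a single dropped message, at the cost of a longer detection time.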

What I think I am hearing is that the device drivers on the hub are filtering out the events where there is no state change, not the hub's platform software. If that's the case, then the author of the device driver would possibly be able to make the modifications I suggest, while someone like @bravenel, who works on Rule Machine, would be less able to. I write Visual Basic code, and C++ looks like a foreign language to me. If it's a single line of code or a simple change, maybe I could envision trying to attack it myself, but I'd have to have the source code and I'm not sure it's available for these devices. I'm still figuring this stuff out as you can see.

To answer some of your other questions, all of my devices that are capable of S2 are on S2, otherwise S0. And yes, some of my relays get cycled in the hundreds per day during certain times of the year. I am very aware that I may see a relay failure at some point, but I still haven't. I've been moving to the newer 700 series devices, and all my switches except one are now 700.

Even with a hard-wired setup, relays can still fail. Even with an RF setup, redundancy can be accomplished, hardware fail switches can be implemented, and failure warnings can be created.

I did, and I would argue that RM is not good for doing that type of logic. I would do that in the driver. The rules would just be the simple temperature settings or time of day adjustments or feeding times, etc. The drivers would work out the error states and retry logic.

Drivers on HE are written in Groovy, a loosely typed, Java-like language that runs on the JVM. It's pretty easy to learn and work with.


Other than this driver shortcoming, my RM rules have been great and have done everything I needed in the manner I desired. It would be nice if the drivers were by convention written to do these things, but I'm not sure where your hesitancy to use RM comes from.

Are the standard HE device drivers for commercial devices typically open source? Do you think it would be an easy task to change the "state change" flag that you and @hubitrep mentioned, and leave the rest of the driver code unchanged?

The built-in drivers are closed source. The docs do give example drivers, likely including one for zwave control of a simple device like a plug. Community drivers are often a good place to start as well. The SmartThings Groovy drivers (the old cloud-based environment on ST; the current ST environment is Lua-based and is called Edge) are all open source on GitHub and are mostly compatible with HE with a few changes.


Yes, the hub discards events that are not state changes, in the normal case. As @coreystup posted above, this can be controlled by the driver, by forcing an event even when there is no state change involved.

No, it's the hub that filters out events that do not represent state changes. The driver creates the event, but in the normal case it does not attempt to control what becomes of the event with the isStateChange:true flag in the event (defaults to false, and leaves it to the hub to manage).
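A toy model of that filtering may help. This is plain Python, not hub code: the `HubDevice` class is hypothetical, and `is_state_change` only mimics the role of the isStateChange flag described above.

```python
from typing import List


class HubDevice:
    """Toy model of the hub's view of one switch."""

    def __init__(self, believed_state: str) -> None:
        self.believed_state = believed_state
        self.delivered: List[str] = []  # events passed on to apps

    def send_event(self, value: str, is_state_change: bool = False) -> bool:
        """Driver reports `value`; returns True if apps receive the event."""
        changed = value != self.believed_state
        self.believed_state = value
        if changed or is_state_change:  # the flag forces delivery
            self.delivered.append(value)
            return True
        return False  # duplicate report: filtered, apps never see it


# Hub wrongly believes the switch is ON; the device reports ON after a command.
hub = HubDevice(believed_state="on")
filtered = hub.send_event("on")                      # dropped, no app event
forced = hub.send_event("on", is_state_change=True)  # delivered anyway
```

The forced path is exactly what makes duplicate events reach apps, which is why it has to be used with care.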

I think it's important that you realize this architecture design is not just happenstance, and is done for a reason. In the normal case, duplicated events will cause apps huge headaches, causing things to happen when they should not. The whole idea is for apps to know that an event does represent a state change, and should be acted on accordingly. If a driver changes that meaning, there is the implication that apps would have to be changed accordingly. While this might be desirable in some cases, in general it is not.

The right way to deal with this case is with a custom driver, and not use isStateChange and apps (don't pass the problem up, just deal with it in the driver).

You have to start somewhere. Z-Wave drivers are in general non-trivial. Adding in state management would take them up a level of difficulty. It seems that you want every command to poll the device after the command is sent, to determine if the device responded as desired, and if not, to repeat the command on some interval until it does respond. Polling is generally not a good thing unless the poll rate is low (order of magnitude of a few seconds). But you have a poll rate of a minute, so that's doable for sure. This sort of thing definitely belongs in the driver, not in Rule Machine.

What driver are you using? Depending on what it is, we might be able to make the source code public for it.


@robert.bruce -

If these are Zooz devices, @jtp10181 has made his drivers available to the community. I think he has drivers for just about every Zooz device.


If Bruce is able to open source the driver, I'd be willing to help add the recovery logic to it. I write software for embedded systems controlling hardware devices for a living.


I thought this topic was DOA, but all of a sudden it's moving. I'm very happy about that.

@coreystup your offer is exceptionally kind. Thank you, and I may take you up on that.

What I would like to do is start with the most expeditious route, which is to modify current drivers to set the isStateChange variable to true for the switch command events that are not actually state changes. I realize that this is not the ideal long-term solution, but I want to get something that works without the failure-point I've been describing. I want to stick with RM rules to do the device state verification for now, because it is easy and I have been doing this for a long time. Once I can get my reliability-based rules to be free of this failure point, I can make the full rules I've written available for people to try. I want Hubitat users to be able to see how much of a reliability difference this makes.

If people are still interested after trying my rules/procedures, I'll be happy to move on from there to custom drivers that do the verification internally. I understand that is a better long-term choice.

I was wishfully thinking that an isStateChange : false in an open-source driver could be changed to true and this would do what I need. @bravenel has pointed out that the default value for this variable is false, so there might not be explicit isStateChange assignments for events where the command does not result in a physical state change. This will take a little detective work but whatever help anyone can give me with this, I'll greatly appreciate.

Thank you for pointing this out to me.

@bravenel , I truly appreciate your help with this, and thank you for helping clarify to me how the automation hubs and devices work.

I currently use the Aeotec SmartSwitch7, the Zooz Zen4 (also a plug-in switch module) and the Zooz Zen17 dual relay. The Zen17 relay is highly capable and versatile, and I could use it for all of my switching needs because I electrically modify all of my heating, cooling, lighting, and humidification devices already, so I don't need plug-in switches.

@bravenel , The Zen17 is ultimately the most useful device for me, but the downside is the driver is going to be far more complex, because it has to install and handle four child devices. I anticipate, given the choice, that I'll focus my initial efforts on one of the other switches I mentioned, but if any/all of those device drivers can be made public, I would like to have the code. Thanks