Interested in hearing what approach folks take within a Rule to deal with a missing/dead/errant/stuck/misbehaving device (sensor) where that device's ongoing reporting is critical.
Let's say we're talking a humidity, temperature, or soil moisture sensor.
Do you use a belt & braces approach in every rule you write whereby you test or check some or all of the following-
the number is in a valid range
the number is not "exactly-the-same" as yesterday's number
the sensor HAS reported "recently" (whatever "recently" is meaningful)
the sensor's battery reports a reasonable number (whatever "reasonable" is)
Or do you always use multiple sensors and cross check "what is reasonable" against each other ?
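For concreteness, the checks listed above could be bundled into one reusable function, sketched here in Python rather than in Rule Machine or a Hubitat driver. Every name and threshold below (Reading, check_reading, the 2-hour staleness window, the 20% battery floor, the peer tolerance) is a made-up assumption for illustration, not anything Hubitat provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Reading:
    value: float           # latest reported value (e.g. % relative humidity)
    reported_at: datetime  # when the hub last heard from the sensor
    battery_pct: float     # last reported battery level


def check_reading(
    current: Reading,
    previous_value: Optional[float],
    now: datetime,
    valid_range=(0.0, 100.0),             # assumed plausible range
    max_age=timedelta(hours=2),           # assumed meaning of "recently"
    min_battery_pct=20.0,                 # assumed meaning of "reasonable"
    peers: Optional[List[float]] = None,  # readings from other sensors, if any
    max_peer_delta=15.0,                  # assumed tolerance vs. the peer median
) -> List[str]:
    """Return a list of problems; an empty list means the reading looks usable."""
    problems = []
    low, high = valid_range
    if not (low <= current.value <= high):
        problems.append("out of range")
    if previous_value is not None and current.value == previous_value:
        problems.append("unchanged since previous sample")
    if now - current.reported_at > max_age:
        problems.append("stale")
    if current.battery_pct < min_battery_pct:
        problems.append("low battery")
    if peers and abs(current.value - median(peers)) > max_peer_delta:
        problems.append("disagrees with peers")
    return problems
```

A rule would then act only when the returned list is empty and send a notification otherwise. Note that the "unchanged" test is only as good as the sampling interval you pick; some sensors legitimately report the same number for long stretches.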
I ask out of realizing that lazy dereliction of dutiful validity checking is no way to go about building blind trust in your Rule-scape. I will admit that I have discovered things not working because of a sensing device battery being dead (not noticing the notice), a Zigbee mesh problem that I was unaware of, ...or whatever.
Up to this point it's been "OK" to just realize after-the-fact that a sensor isn't performing and a rule is thus in limbo. But that's not OK as more and more oversight & control is put under HE's umbrella in an environment.
I'm just wondering if there is a way to Elevate how this is done, maybe even at the Driver level, or if everybody is actually making sure to build all those tests into each and every Rule they build to make sure the Rule is acting on a valid number or not.
Thanks in advance for the discussion/thoughts/ideas.
EDIT ADD: Thinking about this more. I'm trying to think how to avoid checking the health or reasonableness of every sensor's data in every Rule that relies on it. So if it's not possible in the Device's Driver, then I'm imagining some Master Device Table wherein I can set up a standard set of expected behaviors for all the devices in an environment, and if reporting stops (yikes, some timing or other allowances would have to be made for, say, a key repeater going down and crippling multiple devices), a reported number falls outside some boundary condition, or a value is otherwise static for longer than reasonable....LET ME KNOW. I wonder if this is possible or way too much overhead for what it's worth?
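To make the Master Device Table idea a bit more tangible, here is a rough Python sketch of what such a table and a periodic scan over it might look like. This is not a Hubitat feature or API: Expectation, DeviceState, scan, and the repeater field are all hypothetical names, and the grouping by repeater is just one possible way to handle the "key repeater going down" allowance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Expectation:
    """One row of the hypothetical master device table."""
    max_silence: timedelta          # longest acceptable gap between reports
    low: float                      # lower bound for a plausible value
    high: float                     # upper bound for a plausible value
    max_static: timedelta           # longest time the value may stay unchanged
    repeater: Optional[str] = None  # repeater the device is assumed to route through


@dataclass
class DeviceState:
    value: float
    last_report: datetime  # last time the device reported anything
    last_change: datetime  # last time the reported value actually changed


def scan(table: Dict[str, Expectation],
         states: Dict[str, DeviceState],
         now: datetime) -> List[str]:
    """Compare every device against its table row and return alert messages."""
    alerts: List[str] = []
    silent_by_repeater: Dict[str, List[str]] = {}
    for name, exp in table.items():
        st = states.get(name)
        if st is None or now - st.last_report > exp.max_silence:
            if exp.repeater:
                silent_by_repeater.setdefault(exp.repeater, []).append(name)
            else:
                alerts.append(f"{name}: no report within {exp.max_silence}")
            continue
        if not (exp.low <= st.value <= exp.high):
            alerts.append(f"{name}: value {st.value} outside {exp.low}..{exp.high}")
        if now - st.last_change > exp.max_static:
            alerts.append(f"{name}: value static for more than {exp.max_static}")
    # Allowance for a shared repeater: several silent children become one alert.
    for repeater, names in silent_by_repeater.items():
        if len(names) > 1:
            silent = ", ".join(sorted(names))
            alerts.append(f"{repeater}: possible repeater outage, silent: {silent}")
        else:
            alerts.append(f"{names[0]}: no report within {table[names[0]].max_silence}")
    return alerts
```

Run on a schedule, the scan is cheap: one pass over the table, no per-Rule checks. The real overhead is in keeping the table's expectations honest for each device.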
That you even have to ask.... Look, sensors are going to fail, batteries are going to fail, cosmic rays are going to hit your home, s*** happens all of the time. So even if you create a ginormous watchdog in the sky, it too will fail sometimes, and then you'd need a super watchdog watching the watchdog....
For me the device activity app handles this pretty well. I guess the only things that are really critical are my water leak sensors. I get a report every day of any device that hasn't checked in in 24 hours, or whose battery is below my set threshold.
I use device notes to put in when I have changed a battery last. So I can see how long it's been since I changed it and I keep a tally in the note of how long between battery changes.
Not with $30 "smart" devices. These are not industrial grade products. You certainly could take HA to a higher level, but at considerably higher cost. You get what you pay for.
It's not been a problem here, even with $10 devices. The trick is not to try to precisely assess device health or have belts and suspenders on everything so it never fails, but to make the failure visible.
DAC is great for that purpose.
You could add other heuristics based on normal usage patterns.
But we can try to cover for their inadequacies & failings and as a result be known as "the Hub that thinks of just about everything so that you don't have to".
I think some of the Custom Apps are in place as crutches speaking directly to this, helping keep "the pulse of things" so that you don't have to worry as much.
To get back to my OP, I'm still asking folk to comment if they put much thought into handling anomalous (or stuck on last reported) sensor values to catch wayward values before they send a rule off in the wrong direction.
Just yesterday my gate contact sensor fell off, so the hub thought the driveway gate was open.
I had a rule that closed the gate if it was open for more than a half hour (deer is the main concern).
So, the hub tried to close the gate, which because it just uses the gate remote, caused it to open...for a good long time, before I discovered it.
I deleted the rule and now make do with a notification.
Not this example, but sometimes it's hard to predict what will be the result of this or/and that failing. I'd say, keep it simple, is it necessary, and worth any risk?
Right, and that's all great when you are around and can say...."no, it shouldn't". So that means writing another Rule that watches such activity/in-activity and intervenes when you can't if you're not home.
Yes, exactly....and this I guess prompted my original post. I'm finding I'm not testing validity but increasingly writing "follow up/clean up" Rules for some wayward instances that leave a consequential situation, i.e. gates open, valves not closed, pumps still running, etc.
A well-known problem for those with completed HA installations: What Now? Remember when that sensor failed to report and the lights were left on..... more to automate.... And then, check the checkers and keep the fun going...
Due to an addition that we put on our home, two major bathrooms have no windows to the outside world. The exhaust fans (high-powered inline 'remote' fans, installed in the attic) do heavy duty to remove excess moisture through exterior vents. Even for a brief shower, the built-up humidity takes quite a long time to exhaust properly.
Wait for his reply....but here is my take-away from your question.
As an aside to the OP-
It is once again evidence that known Best Practices, and the tools that help new and old users achieve them in this environment, should be integrated in the core system and not tucked away in Custom App/Driver add-ons that you have to know/learn about....as you are about to.
Hats off to all those that have seen the need and created these add-on solutions...while that is the blessing of HE (that said individual creativity can be expressed in helpful tools), it is also the curse for new users ...as I believe a chunk of that work should be the proof-of-concept that HE ought to adopt in the core system so it's JUST THERE ...and you don't have to be told to go add it.
First of all, what devices are critical in your environment? Door sensors? Water sensors? The smart outlet powering the iron lung?
I'd strongly suggest separating these things:
rules using a device
method for checking the health of a device
method for alerting on the health of a device
For example, I've got some devices (motion sensor) that are used in multiple rules (lighting, intrusion alarm, heating/cooling based on presence) -- I don't want to embed (duplicate) a health check for the motion sensor into each rule.
I'd suggest a set of rules to set per-device global variables based on device health, with a small and discrete number of values (e.g., "Health OK at timestamp", "Health in question due to reason at timestamp, pending confirmation (X)" -- where "X" is the number of queries attempted or a timeout value, etc, and "Device Health Bad (reason) at timestamp").
The rules that use a device (e.g., lighting, HVAC control, etc) would require the device health variable to be "OK" in order to proceed.
A separate rule would report on any devices that are not in good health, based on the variables.
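As a rough illustration of that separation, here is a Python sketch of the three pieces: a health-check rule that updates a per-device variable, a consumer rule gated on that variable, and a reporting rule. In Hubitat these would be Rule Machine rules and hub variables; the names below (Health, HealthVar, update_health, run_if_healthy, unhealthy_report) and the limit of 3 attempts are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Callable, Dict, List


class Health(Enum):
    OK = "ok"
    IN_QUESTION = "in question"
    BAD = "bad"


@dataclass
class HealthVar:
    """Stand-in for one per-device global variable."""
    state: Health
    reason: str
    timestamp: datetime
    attempts: int = 0  # failed confirmation queries so far (the "X" above)


MAX_ATTEMPTS = 3  # assumed number of failed queries before declaring the device bad


def update_health(var: HealthVar, check_passed: bool, reason: str,
                  now: datetime) -> HealthVar:
    """Health-check rule: one passing check restores OK; repeated failures escalate."""
    if check_passed:
        return HealthVar(Health.OK, "", now)
    attempts = var.attempts + 1
    state = Health.BAD if attempts >= MAX_ATTEMPTS else Health.IN_QUESTION
    return HealthVar(state, reason, now, attempts)


def run_if_healthy(health: Dict[str, HealthVar], device: str,
                   action: Callable[[], None]) -> bool:
    """Consumer rule: act only when the device's health variable is OK."""
    if health[device].state is Health.OK:
        action()
        return True
    return False


def unhealthy_report(health: Dict[str, HealthVar]) -> List[str]:
    """Reporting rule: list every device whose variable is not OK."""
    return [f"{name}: {v.state.value} ({v.reason}) at {v.timestamp.isoformat()}"
            for name, v in health.items() if v.state is not Health.OK]
```

The point of the design is that the lighting, alarm, and HVAC rules each need only the one-line gate, while the logic for deciding what "healthy" means lives in a single place per device.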
Thanks for your thoughtful reply, but starting it out like this kinda sucked. I mean seriously, do you not think there are more than a few folks in this forum that would have some serious issues if the automations & oversight they have come to trust would incur a day or so of less-than-obvious failure?
Security, Plumbing, and Freeze Protection aside....some actually have agriculture, animals, and equipment on the line. I could give two sh_ts about the lighting scenes so many spend so much time optimizing. That kinda stuff offers me no labor savings/facility oversight/and protections in comparison.
So yeah, I have LOTS of Iron Lungs that I'm not having to be hovering over because of the use of automation, before HE in fact.
EDIT ADD: I want to point out that the content of your post was exactly the kind of stuff I was hoping folk might throw on the table. Thanks.
Thank You!!
I asked without first searching myself.
So, double thanks.
My other question remained unasked in this thread because I did do a search for "DAC" & "device activity check" and learned it was an app available through HPM.
My question about what you consider critical was meant to inspire some thought about different levels of risk.
OK, let me put it another way:
Don't rely on low-end consumer-grade equipment (i.e., Hubitat and the whole ecosystem of sensors) for anything critical such as life support (and yes, I'd include animal life).