EDIT (08-Jan-2022) - While I started this thread more to discuss how best to represent device / service status information in drivers and apps, it has morphed more into how and what we can monitor, and where we should be doing this, but that's alright with me...
In recent times I have been setting up various virtual devices and checks for the availability of key parts of my HA setup, doing things like pinging other hubs such as my Bond Bridge, Harmony Hub and various Raspberry Pis, as well as making HTTP calls to services such as Grafana to make sure they are still available.
While ping is one test and HTTP calls another, both valid and useful, they don't always guarantee that other communication channels or services are up and running. I am interested to hear other developers' thoughts on how community-developed drivers could present this status information in a consistent way.
One option would be to implement the Presence capability, which is essentially what the different ping and HTTP calls are presenting, at least in how I have utilised them. But communications can sometimes be a little more nuanced than that, and I feel like it may be limiting in some circumstances to only have two options, present or not present.
Obviously, I’m not a developer, but I wish there was a specialized capability other than Presence that is used to indicate device availability.
I agree with your point that values for this capability should be more nuanced.
Shortly after posting I did think I had limited my audience by restricting my request to developers. Even though they are the ones who would ultimately implement any solution, it should be driven by the use case more than how to solve the problem. Now I sound like I'm at work...
I guess what I'm after is a consistent way to report and assess the availability of devices and services. And that doesn't have to be limited to the "larger" items like hubs, it could also be lights, sensors, etc.
I think all interested parties will jump in. Again, bearing in mind that I’m not a developer, this is an approach that I have taken in virtual drivers that I use to bring zigbee2mqtt devices into HE via NR.
I create a custom attribute called LQI into which I dump the device LQI as reported by z2m.
I have a second custom binary attribute called Connected which gets set to No if z2m hasn’t heard from the device in 3 hours.
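A minimal sketch of that kind of "Connected" check, written in Python rather than as a Hubitat driver. The function name, the "Yes"/"No" values and the way the last-seen time is passed in are assumptions for illustration; only the 3 hour threshold comes from the description above.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the "3 hours" mentioned above; everything else is hypothetical.
STALE_AFTER = timedelta(hours=3)


def connected_value(last_seen, now, stale_after=STALE_AFTER):
    """Return "Yes" if the device was heard from within stale_after, else "No"."""
    if last_seen is None:
        # Never heard from the device at all, so treat it as not connected.
        return "No"
    return "Yes" if (now - last_seen) <= stale_after else "No"


if __name__ == "__main__":
    now = datetime(2022, 1, 8, 12, 0, tzinfo=timezone.utc)
    print(connected_value(now - timedelta(hours=1), now))
    print(connected_value(now - timedelta(hours=4), now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.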
Part of the issue is needing an external server/service to really make such detection as accurate as possible - things like heartbeat/killswitch etc. Once you incorporate that you've left the confines of the HE system and things become problematic/non-standard quickly.
That's a neat approach - how often does z2m poll the device's LQI?
I poll each device every hour.
I didn't know you had to do it manually, doesn't z2m get LQIs automatically? (Edit: I misunderstood your response, I was wondering how often z2m updated the LQIs internally.)
(apologies @sburke781 for taking the thread on a bit of a tangent)
No worries, it was as much to promote discussion as come up with a solution.
You are right @erktrek. Accuracy is definitely an issue that each integration has to deal with, some with limitations outside of the developer's control. I guess I am not looking to solve that specific problem, or problems, but rather how best to present those results in a consistent way.
Maybe it should be a part of HSM ?
What makes you suggest HSM?
Thinking about some kind of framework to handle alerts etc. in a consistent way... seems like HSM might be good. You could certainly include virtual devices as triggers, maybe as some sort of hook/value add-on to HE's paid service?
Edit: for us multiple-hub owners, also incorporating Hub Mesh might give us local network detection as well... (another excuse to buy more hubs!!!!)
Each time the device checks in, which varies between devices.
Interesting idea.... I hadn't thought that far ahead. You are right that a built-in monitoring feature would be quite useful. I guess I was thinking more low level, but what to do on top of that reporting is certainly worth exploring. That may inform what is the most appropriate solution...
For me the concepts in Device Watchdog are what I had in mind, looking for activity or a common set of attributes to provide an indication of any problems, which is largely where this thread came out of.
I like the watchdog approach too but you need more than just a hub in order to detect and alert when something goes wrong. This is where I think HE could provide additional value - adding "hub monitoring" to their Protect service. It wouldn't be perfect because the connection at either end could go down while internally things continue to function. This is why I was thinking an addition to Hub Mesh as well - some sort of shared heartbeat service monitoring for multiple hubs and maybe phone apps.
edit: I know I am not discussing what the standards would be for low level usage but maybe (echoing what you said in your earlier post which I totally agree with) we should think about how it would be implemented in the hubs first to get an idea of what we would need to do.
edit2: A community framework (something like Hubitat Package Manager) would be good but an internal HE solution would be better. Which is why I was thinking HSM..
In the Telnet driver there is a networkStatus attribute that is supposed to be an ENUM of online|offline, although I think it might just be a string. I include this in my MQTT app to show broker availability. This came about as I (ab)use the Telnet capability because Hubitat wouldn’t add an MQTT capability.
There is no associated command exposed.
This is an example of what I was trying to promote discussion about, how do we come up with a common way to capture, report and handle things like the status of communication with hubs, devices or services external to the HE hub.
- There can be hard Online / Offline type situations, but some can be more nuanced, needing to indicate minor glitches in communications or the state of a more complex system
- Do we rely on capturing these different status values in the device(s) in HE or somewhere else?
- From @erktrek's post, how do we monitor the state of these and alert people to any potential issues?
Interesting problems to try and solve....
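To make the first point above a little more concrete, here is a rough sketch of a status value with more than two states. It is written in Python only to illustrate the idea; the `Availability` names and the `summarise` helper are hypothetical and not part of any existing HE capability.

```python
from enum import Enum


class Availability(Enum):
    # Hypothetical set of values; a real capability might choose different ones.
    ONLINE = "online"
    DEGRADED = "degraded"
    OFFLINE = "offline"
    UNKNOWN = "unknown"


def summarise(recent_checks):
    """Collapse recent check results (True = success) into a single status."""
    if not recent_checks:
        return Availability.UNKNOWN   # nothing has been checked yet
    if all(recent_checks):
        return Availability.ONLINE    # every recent check succeeded
    if not any(recent_checks):
        return Availability.OFFLINE   # every recent check failed
    return Availability.DEGRADED      # a mix, e.g. minor glitches


if __name__ == "__main__":
    print(summarise([True, True, False]).value)
```

A driver could still expose a plain online / offline value alongside something like this for apps that only care about the two hard states.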
IMHO HE isn't the place to do the monitoring of infrastructure, it should be left to something dedicated to that task such as Nagios, Monit, Sensu, etc.
I think doing it in HE you would be limited to simple PING or HTTP checks as you mentioned, and it's also not a complete picture. For example, if a PING to your Hue hub fails, is that because the Hue hub is down, or because some other piece of infrastructure in the path to it is down?
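As a rough illustration of that ambiguity, the sketch below checks the devices on the path before blaming the target. It is Python, the `diagnose` function and the host names are made up for the example, and `is_reachable` stands in for whatever PING or HTTP check is actually available.

```python
def diagnose(target, upstream, is_reachable):
    """Describe where a failure appears to be.

    upstream is the list of devices between the checker and the target,
    ordered from nearest to the checker to nearest to the target.
    """
    if is_reachable(target):
        return "target up"
    for hop in upstream:
        if not is_reachable(hop):
            # Something on the path is down, so the target's own state is unknown.
            return f"path problem at {hop}"
    return "target down"


if __name__ == "__main__":
    up = {"router", "switch"}  # hypothetical devices that currently respond
    print(diagnose("hue-hub", ["router", "switch"], lambda host: host in up))
```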
I personally use Nagios, where I have it monitoring over 150 IP connected devices and many thousands of services across those devices. It's infinitely configurable so you can define levels for normal / warning / critical states and all manner of alerts can be set up.
I also reflect the state of some devices back over MQTT and into Node-RED so they can be used in logic. For example, for Growl desktop notifications it's pointless sending them if the target PC / laptop is down, so I skip sending to devices that are offline.
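The normal / warning / critical idea boils down to comparing a measured value against two thresholds. A small Python sketch follows; the numeric codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), while the `classify` function and the example thresholds are assumptions and not taken from any real configuration.

```python
# Status codes following the usual Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(value, warn, crit):
    """Map a measurement (higher is worse) onto a status code."""
    if value is None:
        return UNKNOWN   # the check produced no measurement
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    # Hypothetical thresholds for a response time in milliseconds.
    print(classify(250, warn=200, crit=1000))
```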
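A sketch of the "skip offline targets" logic, in Python for illustration only. The `online_targets` helper, the host names and the status strings are assumptions; in the setup described above the states would arrive over MQTT rather than from a hard-coded dictionary.

```python
def online_targets(targets, status_by_host):
    """Return only the targets whose last known status is "online"."""
    # Hosts with no recorded status are skipped, the same as offline ones.
    return [t for t in targets if status_by_host.get(t) == "online"]


if __name__ == "__main__":
    # Hypothetical last-known states.
    statuses = {"office-pc": "online", "laptop": "offline"}
    for host in online_targets(["office-pc", "laptop", "media-pc"], statuses):
        print(f"would notify {host}")
```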