Availability of HA Infrastructure, e.g. other Hubs, RPis, key services, etc.

sburke781 · August 29, 2021, 11:36am

EDIT (08-Jan-2022) - While I started this thread more to discuss how best to represent device / service status information in drivers and apps, it has morphed into what we can monitor, how, and where we should be doing this, but that's alright with me...

In recent times I have been setting up various virtual devices and checks for the availability of key parts of my HA setup, doing things like pinging other hubs such as my Bond Bridge, Harmony Hub and various Raspberry Pis, as well as making HTTP calls to services such as Grafana to make sure they are still available.

While ping is one test and HTTP calls another, both valid and useful, they don't always guarantee that other communication channels or services are up and running. I am interested to hear other developers' thoughts on how community-developed drivers could present this status information in a consistent way.

One option would be to implement the Presence capability, which is essentially what the different ping and HTTP calls are presenting, at least in how I have utilised them. But communications can sometimes be a little more nuanced than that, and I feel like it may be limiting in some circumstances to only have two options, present or not present.
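
As a discussion starter, here is a minimal sketch of what a status richer than present / not present could look like when a ping result and an HTTP result are folded together. It is written in Python rather than as a Hubitat driver, and the names (`Availability`, `combine_checks`) and the four states are assumptions for illustration only, not an existing capability.

```python
from enum import Enum
from typing import Optional


class Availability(Enum):
    """Hypothetical status values; not an existing Hubitat capability."""
    ONLINE = "online"      # service responded, or host responded and service was not checked
    DEGRADED = "degraded"  # host responds to ping but the service check failed
    OFFLINE = "offline"    # no check that ran succeeded
    UNKNOWN = "unknown"    # no check has reported yet


def combine_checks(ping_ok: Optional[bool], http_ok: Optional[bool]) -> Availability:
    """Fold a ping result and an HTTP result into one status.

    None means "this check was not run or has not reported yet".
    """
    if ping_ok is None and http_ok is None:
        return Availability.UNKNOWN
    if http_ok:
        # The service answered, so treat it as reachable even if ping failed.
        return Availability.ONLINE
    if ping_ok:
        # Host is up; the service either was not checked or failed its check.
        return Availability.ONLINE if http_ok is None else Availability.DEGRADED
    return Availability.OFFLINE
```

Whether a host that pings but fails its HTTP check should count as "degraded" or "offline" is exactly the kind of convention that would need agreeing on.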

Discuss....

Simon

aaiyar · August 29, 2021, 11:50am

Obviously, I’m not a developer, but I wish there was a specialized capability other than Presence that is used to indicate device availability.

I agree with your point that values for this capability should be more nuanced.

sburke781 · August 29, 2021, 11:57am

Shortly after posting I did think I had limited my audience by restricting my request to developers. Even though they are the ones who would ultimately implement any solution, it should be driven by the use case more than how to solve the problem. Now I sound like I'm at work...

I guess what I'm after is a consistent way to report and assess the availability of devices and services. And that doesn't have to be limited to the "larger" items like hubs, it could also be lights, sensors, etc.

aaiyar · August 29, 2021, 12:13pm

I think all interested parties will jump in. Again, bearing in mind that I’m not a developer, this is an approach that I have taken in virtual drivers that I use to bring zigbee2mqtt devices into HE via NR.

I create a custom attribute called LQI into which I dump the device LQI as reported by z2m.

I have a second custom binary attribute called Connected which gets set to No if z2m hasn’t heard from the device in 3 hours.
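
The 3-hour rule above boils down to a staleness check. A minimal sketch in Python, assuming the last check-in time is available as a timestamp; the function name `connected_value` and the "Yes" value are assumptions (only "No" is described above).

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=3)  # the 3-hour window described above


def connected_value(last_seen: datetime, now: datetime,
                    stale_after: timedelta = STALE_AFTER) -> str:
    """Return "No" once the device has been silent for stale_after or longer.

    "Yes" for the healthy case is an assumed counterpart to "No".
    """
    return "No" if now - last_seen >= stale_after else "Yes"
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the rule easy to test.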

erktrek · August 29, 2021, 12:23pm

Part of the issue is needing an external server/service to really make such detection as accurate as possible - things like heartbeat/killswitch etc. Once you incorporate that you've left the confines of the HE system and things become problematic/non-standard quickly.

rocketwiz · August 29, 2021, 12:26pm

That's a neat approach - how often does z2m poll the devices' LQI?

aaiyar · August 29, 2021, 12:27pm

I poll each device every hour.

rocketwiz · August 29, 2021, 12:31pm

~~I didn't know you had to do it manually, doesn't z2m get LQIs automatically?~~ Edit: I misunderstood your response; I was wondering how often z2m updated the LQIs internally.

(apologies @sburke781 for taking the thread on a bit of a tangent)

sburke781 · August 29, 2021, 12:37pm

No worries, it was as much to promote discussion as come up with a solution.

sburke781 · August 29, 2021, 12:41pm

You are right @erktrek. Accuracy is definitely an issue that each integration has to deal with, some with limitations outside of the developer's control. I guess I am not looking to solve that specific problem, or problems, but rather how best to present those results in a consistent way.

erktrek · August 29, 2021, 12:42pm

Maybe it should be a part of HSM?

sburke781 · August 29, 2021, 12:43pm

What makes you suggest HSM?

erktrek · August 29, 2021, 12:46pm

Thinking about some kind of framework to handle alerts etc. in a consistent way... seems like HSM might be good. You could certainly include virtual devices as triggers, and maybe some sort of hook/value add-on to HE's paid service?

edit: for us multiple-hub owners, also incorporating Hub Mesh might give us local network detection as well... (another excuse to buy more hubs!!!!)

aaiyar · August 29, 2021, 12:50pm

Each time the device checks in, which varies between devices.

sburke781 · August 29, 2021, 12:51pm

Interesting idea.... I hadn't thought that far ahead. You are right that a built-in monitoring feature would be quite useful. I guess I was thinking more low level, but what to do on top of that reporting is certainly worth exploring. That may inform what is the most appropriate solution...

sburke781 · August 29, 2021, 12:53pm

For me the concepts in Device Watchdog are what I had in mind: looking for activity or a common set of attributes to provide an indication of any problems, which is largely where this thread came from.

erktrek · August 29, 2021, 1:10pm

I like the watchdog approach too, but you need more than just a hub in order to detect and alert when something goes wrong. This is where I think HE could provide additional value by adding "hub monitoring" to their Protect service. It wouldn't be perfect because the connection at either end could go down while internally things continue to function. This is why I was thinking of an addition to Hub Mesh as well - some sort of shared heartbeat service monitoring for multiple hubs and maybe phone apps.

edit: I know I am not discussing what the standards would be for low level usage but maybe (echoing what you said in your earlier post which I totally agree with) we should think about how it would be implemented in the hubs first to get an idea of what we would need to do.

edit2: A community framework (something like Hubitat Package Manager) would be good but an internal HE solution would be better. Which is why I was thinking HSM..

kevin · August 29, 2021, 5:35pm

In the Telnet driver there is a networkStatus attribute that is supposed to be an ENUM of online|offline, although I think it might just be a string. I include this in my MQTT app to show broker availability. This came about as I (ab)use the Telnet capability because Hubitat wouldn’t add an MQTT capability.
There is no associated command exposed.

sburke781 · August 30, 2021, 7:07am

This is an example of what I was trying to promote discussion about: how do we come up with a common way to capture, report and handle things like the status of communication with hubs, devices or services external to the HE hub?

- There can be hard Online / Offline type situations, but some can be more nuanced, needing to indicate minor glitches in communications or the state of a more complex system.
- Do we rely on capturing these different status values in the device(s) in HE or somewhere else?
- From @erktrek's post, how do we monitor the state of these and alert people to any potential issues?

Interesting problems to try and solve....

martyn · August 30, 2021, 9:41am

IMHO HE isn't the place to do the monitoring of infrastructure, it should be left to something dedicated to that task such as Nagios, Monit, Sensu, etc.

I think doing it in HE you would be limited to simple PING or HTTP checks as you mentioned, and it's also not a complete picture. For example, if a PING to your Hue hub fails, is that because the Hue hub is down or because some other piece of infrastructure in the path to it is down?
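
To illustrate the ambiguity, here is a rough Python sketch that probes the devices in front of the target before blaming the target itself. Everything in it is hypothetical: the function name `diagnose`, the idea of a hand-maintained `path` list, and the `is_reachable` callback, which stands in for whatever ping wrapper you already have.

```python
from typing import Callable, Sequence


def diagnose(target: str, path: Sequence[str],
             is_reachable: Callable[[str], bool]) -> str:
    """Explain a failed check by probing the hops in front of the target.

    path lists the intermediate devices in order, nearest first
    (for example a router, then a switch).
    """
    for hop in path:
        if not is_reachable(hop):
            # An upstream hop is down, so the target's own state is unknown.
            return f"{target} unreachable: {hop} is not responding"
    if is_reachable(target):
        return f"{target} is up"
    return f"{target} is down: every hop in front of it responds"
```

The sketch only shows the reasoning; a dedicated monitoring tool keeps track of these dependencies for you.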

I personally use Nagios, where I have it monitoring over 150 IP-connected devices and many thousands of services across those devices. It's infinitely configurable, so you can define levels for normal / warning / critical states, and all manner of alerts can be set up.
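
The normal / warning / critical idea can be sketched in a few lines of Python. This is not Nagios configuration, just an illustration of threshold levels; the function name `classify` and the "higher is worse" assumption are mine.

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a reading to normal / warning / critical.

    Assumes higher is worse (for example a response time in ms)
    and that warning <= critical.
    """
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "normal"
```

For example, with a warning level of 100 ms and a critical level of 500 ms, a 250 ms response would be reported as "warning".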

I also reflect the state of some devices back over MQTT and into Node-RED so they can be used in logic. For example, for Growl desktop notifications it's pointless sending them if the target PC / laptop is down, so I skip sending to devices that are offline.
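
The "skip offline targets" logic amounts to a filter over the last known status of each device. A minimal Python sketch, assuming the statuses arrive as a simple name-to-string mapping; the function name `online_targets` and the "online" value are illustrative, not taken from the actual Node-RED flow.

```python
from typing import Dict, Iterable, List


def online_targets(targets: Iterable[str], status: Dict[str, str]) -> List[str]:
    """Keep only targets whose last known status is "online".

    Targets with no recorded status are treated as offline and skipped.
    """
    return [t for t in targets if status.get(t) == "online"]
```

Treating an unknown status as offline is a design choice; for notifications it errs on the side of not sending.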