Hopefully I'm posting this in the right place...
Background:
I have a pair of Hubitat Elevation hubs (both are hardware version Rev C-7, running platform version 2.3.4.123). The two hubs share several devices over Hub Mesh. I run two of them to segregate several demanding apps and drivers (we'll get back to that) and to provide a more reliable connection to a couple of Zigbee devices.
The Problem:
Both hubs have been unreliable, going offline at seemingly random intervals, typically at the same time (so it seems the primary issue is either caused by, or propagating over, Hub Mesh). When they lock up, the hub indicator LEDs remain green, but the network connections go down and the hubs no longer respond to ping. They can only be restored by manually disconnecting and reconnecting the power.
Having gotten tired of running around my house, yanking out and re-inserting fiddly micro-USB cables, I took advantage of the fact that both hubs are connected to UniFi PoE Ethernet switches and bought a couple of PoE-to-USB splitters to power the hubs. This allows me to control the power for the hubs remotely, and I was able to create a script that pings both hubs every 5 minutes and automatically bounces power if they're offline. The script logs reset events. Here's the log from the last few days so you can see how often this is happening (the two hubs are named "Main" and "Office"):
( Sat Dec 24 00:09:22 CST 2022 ) Main Down, Bouncing Power
( Sat Dec 24 00:09:24 CST 2022 ) Office Down, Bouncing Power
( Sat Dec 24 20:33:15 CST 2022 ) Main Down, Bouncing Power
( Sat Dec 24 20:33:16 CST 2022 ) Office Down, Bouncing Power
( Sun Dec 25 06:27:48 CST 2022 ) Main Down, Bouncing Power
( Sun Dec 25 22:35:16 CST 2022 ) Main Down, Bouncing Power
( Sun Dec 25 22:35:19 CST 2022 ) Office Down, Bouncing Power
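For context, the watchdog logic is roughly the following. This is a simplified Python sketch of the idea, not the actual script: the hub IP addresses are placeholders, the ping flags assume a Linux host, and bounce_power is a hypothetical stub, since the real power-cycling call depends on how the UniFi switch is controlled.

```python
import subprocess
import time

# Placeholder addresses (assumption): substitute the real IPs of your hubs.
HUBS = {"Main": "192.168.1.10", "Office": "192.168.1.11"}
CHECK_INTERVAL_SECONDS = 5 * 60


def is_reachable(host):
    """Return True if the host answers a single ping (Linux-style flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def bounce_power(name):
    """Hypothetical stub: power-cycle the PoE port feeding this hub."""
    raise NotImplementedError("depends on how your switch is controlled")


def format_log_line(name, now=None):
    """Build a log line in the same shape as the entries shown above."""
    stamp = time.strftime("%a %b %d %H:%M:%S %Z %Y", time.localtime(now))
    return f"( {stamp} ) {name} Down, Bouncing Power"


def check_once(hubs, ping=is_reachable, bounce=bounce_power, log=print):
    """Ping every hub once; log and power-cycle any that do not answer."""
    bounced = []
    for name, host in hubs.items():
        if not ping(host):
            log(format_log_line(name))
            bounce(name)
            bounced.append(name)
    return bounced


def run_forever(hubs):
    """Repeat the check every five minutes until interrupted."""
    while True:
        check_once(hubs)
        time.sleep(CHECK_INTERVAL_SECONDS)
```

Calling run_forever(HUBS) from a long-running service, or check_once(HUBS) from a 5-minute cron job, gives the same behavior. A single missed ping triggers a power cycle here; requiring two or three consecutive failures first would make it less trigger-happy.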
This is a reasonable workaround for now, but I'd really like to get to the bottom of the reliability issue. The problem is that the hub logs are unhelpful. There are no telltale log entries when the hubs go offline; they just silently stop working, as far as I can tell. I'm wondering if there's something else I can do to track down the source of the problem.
I know this issue is probably being caused by a 3rd party app or driver. A UniFi presence driver and the Ecobee Thermostat Suite Manager apps that I run on the secondary hub are prime suspects, since they are relatively complex and create significant hub load. However, both integrations (to UniFi for presence detection and to my Ecobee thermostats) are pretty core to my home automation experience, so I really want to debug the issue (and inform the driver/app authors) rather than just giving up on them. Also, I don't know that these are the true culprits, because there is no detailed system logging prior to the failures. Does anyone have any advice on how to get to the bottom of this?
Editorializing here, and I know this may be a controversial statement: I don't expect Hubitat to warrant their devices against 3rd party plugins, apps, or drivers. That said, I'm not inclined to let Hubitat completely off the hook here: a misbehaving app or driver should NOT be able to take the entire hub down. This seems like a fundamental isolation issue in the base platform or OS, with a lack of defensive controls around 3rd party logic and resource use.
In any event, thanks for wading through a long post. If anyone out there happens to also run their hubs off of UniFi PoE switches and is interested in my little auto-reset script, let me know and I'll post it.