Zigbee radio goes offline - no hub load warnings

brad5 · May 23, 2022, 5:36pm

There have been a lot of posts about zigbee going offline but usually it is related to severe hub CPU load, which I do not appear to be experiencing. At about 11:27 I got an alert that the zigbee radio was offline. I checked the hub and it was indeed offline, but no hub CPU warning. From the UI I disabled the radio, waited a second or two, and re-enabled it. It's been on since then and seems to be working fine. I thought for sure I would have to shut down the hub and power down the radio for 30 seconds to get it to come back, but I didn't have to do that.

Here's the info I've captured, thank you @thebearmay for your great hub info driver. 5 min load looks fine, instant load doesn't seem bad either. I checked the location log, nothing additional there. And I attached charts of CPU load and temperature over time. No spikes around the time of the issue. The only non-HE drivers I was using were for 3 Sonoff sensors. I have since switched to the included drivers rather than the community-supported ones just as a diagnostic measure. When I look at my top device stats you have to go a fair ways down the list (in decreasing order of activity) to see a zigbee device at all. I sent an email with this info into support but thought I would post here as well. My issue doesn't seem to quite match a hardware problem either.

Quick update - got a reply from HE support (thank you @bobbyd) and they're not seeing anything in the engineering logs, so hopefully it's a one-off. If anyone has any ideas, feel free to share.

Rxich · May 24, 2022, 3:50am

Not super technical, but something drove up the CPU/temp at the same time. Check the logs at that time and see what app/rule was running. I've also had this happen when a repeater goes wonky.
What is app 2999?

brad5 · May 24, 2022, 11:13am

App 2999 is an RM instance that alerts me when the radio is off. It is also what is recording the temp and CPU. They aren't alerts... just diagnostic info that I have RM dump to the logs if the radio goes off. The big spikes in temp and CPU are from the weekly reboot a few days before.

Here's the rule:

Tony · May 24, 2022, 3:20pm

Not that it matters much but my OCD is forcing me to point out as an FYI that the radio on/off state is not represented by the Zigbee Network State Online/Offline status. You can set Zigbee to 'Disabled' and the radio stays on; if you set up your Zigbee sniffer you'll still see the hub's 0x0000 network source address broadcasting. In fact, if you look at a network trace with the radio in this state (and don't examine what commands are being sent) the difference in the amount of radio traffic is hard to spot.

What is really 'offline' is the upper layers of the Zigbee network stack (software). The radio hardware at the MAC layer remains 'on' in this state, even if no Zigbee commands are actually being processed. There doesn't appear to be a way to power cycle the radio by software command, hence the 'pull the plug and wait till caps drain, then restore power' advice when the internal radios (Zigbee as well as Z-Wave) really must be reset to fix a problem.

Early on it was assumed that 'Zigbee Network State' shown on the settings page reflected the on/off status of the radio, and that seeing 'Zigbee Offline' along with rising CPU temperatures meant that the hub might be turning off the radio as a protective measure because the radio was hot, or an environmental temp threshold was being exceeded.

Since the radio never gets shut off, what is more likely (my speculation here) is that as CPU load passes a certain threshold (for a variety of reasons: workload due to automation/network activity, maybe some process stuck in a loop, etc.), Zigbee network queues get overrun. Error checking in the software detects this and stops processing Zigbee commands (hence the 'Zigbee Network Offline' symptom); temp rise just correlates to higher CPU load.
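To make the speculation above concrete, here is a toy model of a bounded queue being overrun. It is only a sketch: the function name, rates, and capacity are all made up for illustration, and nothing here reflects how the hub's Zigbee stack is actually implemented.

```python
from collections import deque


def ticks_until_overrun(arrivals_per_tick, served_per_tick, capacity, ticks):
    """Toy model: return the tick at which a bounded queue overruns, or None.

    All parameters are hypothetical; this is not the hub's real logic.
    """
    queue = deque()
    for tick in range(ticks):
        # Messages arrive first; an arrival that finds the queue full is an overrun.
        for _ in range(arrivals_per_tick):
            if len(queue) >= capacity:
                return tick  # the error check trips and processing stops
            queue.append(tick)
        # Then the consumer drains as many messages as it can this tick.
        for _ in range(min(served_per_tick, len(queue))):
            queue.popleft()
    return None


# Consumer slightly slower than the producer: the backlog grows until it overruns.
print(ticks_until_overrun(arrivals_per_tick=3, served_per_tick=2, capacity=10, ticks=100))
# Consumer keeps up: no overrun within the simulated window.
print(ticks_until_overrun(arrivals_per_tick=2, served_per_tick=3, capacity=10, ticks=100))
```

The point of the model is that the consumer only has to fall slightly behind for the queue to fill eventually, which would fit an offline event with no dramatic CPU spike.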

brad5 · May 24, 2022, 3:39pm

Thanks for the info. I'd be surprised if CPU hitting 10% was enough to trigger that response but I'll keep an eye on it. There was no temp rise in correlation to the event.

erktrek · May 24, 2022, 3:42pm

I wonder if memory usage plays a role as well..

kahn-hubitat · May 24, 2022, 4:24pm

definitely. mine goes offline once every month or two when free memory drops too low..

hence i have a reboot rule based on zigbee off.

Tony · May 24, 2022, 4:40pm

Agreed; could be the 'process stuck in a loop'/queue overrun scenario (which wouldn't necessarily be caused by CPU load) at play here.

brad5 · May 24, 2022, 4:45pm

Hmmm that's a good point. I did check device and app stats and didn't see anything obvious. Any idea how best to check for that condition?

Tony · May 24, 2022, 4:51pm

Not as far as I know; if that is happening it's probably something so low level that no diagnostic information is exposed to the user.

brad5 · May 24, 2022, 4:53pm

BobbyD checked the engineering logs. Hopefully something like that would have turned up there. I'll keep an eye on it and see if it happens again. I can always go back to my nightly reboots but I hate to do that.

I had an issue a while ago where I had a temperature average app that I inadvertently had configured recursively... the output of the average temperature was also one of the inputs. That took me a looooong time to find. The hub didn't care but the Alexa integration sure did.

brad5 · June 2, 2022, 12:58pm

Additional update based on advice from @erktrek and @kahn-hubitat - I started charting memory usage and found a pretty interesting pattern. I'm not sure what the "lower limit" should be but I do think this is part of my issue. Any suggestions on how to identify the culprit without one by one disabling apps? The app stats page does not show memory utilization by app, but is there a proxy?
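One way to put a number on a pattern like this is to fit a straight line to the free-memory readings and estimate how long until they reach a chosen floor. The sketch below assumes the readings have been exported as (hours since reboot, free memory in KB) pairs; the function names and the sample values are made up for illustration, not taken from the chart. It does not identify which app is responsible, only how fast memory is being lost.

```python
def fit_line(samples):
    """Least-squares fit of free memory (KB) against time (hours).

    Returns (slope in KB per hour, intercept in KB).
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_floor(samples, floor_kb):
    """Hours after the last sample until the fitted line reaches floor_kb.

    Returns None if free memory is not trending downward.
    """
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None
    last_hour = samples[-1][0]
    return (floor_kb - intercept) / slope - last_hour


# Hypothetical readings: (hours since reboot, free memory in KB).
readings = [(0, 400_000), (24, 376_000), (48, 352_000)]
slope, _ = fit_line(readings)
print(f"trend: {slope:.0f} KB/hour")
print(f"hours until 200,000 KB: {hours_until_floor(readings, 200_000):.0f}")
```

Comparing the slope before and after disabling a group of apps is slower than a per-app memory stat, but it narrows things down without going strictly one by one.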

kahn-hubitat · June 2, 2022, 1:46pm

That graph looks pretty normal and just like my two hubs

brad5 · June 2, 2022, 1:56pm

Hmmm really? Do you have a lower limit below which hub performance is affected?

kahn-hubitat · June 2, 2022, 2:19pm

you can see in one of the graphs the auto reboot that occurred recently in nh when the zigbee went offline.. it appears to be around 140meg

in mich which is empty i cannot say how well stuff is working but it has been up for 39 days and my zigbee and zwave test devices are still responding quickly

brad5 · June 2, 2022, 2:32pm

Thanks! I think I will drop my reboot threshold to something shy of 200k and see if that helps.
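The reboot decision discussed in this thread (reboot when Zigbee goes offline, or when free memory falls below a threshold) can be sketched as below. This is plain Python showing the logic only, not Rule Machine or any Hubitat API. The function name is hypothetical, the readings are assumed to be in KB, and the requirement of several consecutive low readings is an added assumption to avoid rebooting on a single dip.

```python
def should_reboot(free_kb_readings, zigbee_online, threshold_kb, consecutive=3):
    """Decide whether to reboot the hub.

    free_kb_readings: free-memory samples in KB, oldest first (assumed unit).
    zigbee_online:    current Zigbee network state as reported by the hub.
    threshold_kb:     reboot floor chosen by the user.
    consecutive:      how many low readings in a row are required (an assumption).
    """
    if not zigbee_online:
        return True  # Zigbee is already offline, so reboot regardless of memory
    recent = free_kb_readings[-consecutive:]
    # Require a full window of readings that are all below the floor.
    return len(recent) == consecutive and all(r < threshold_kb for r in recent)


# A single dip below the floor does not trigger a reboot...
print(should_reboot([250_000, 185_000, 240_000], True, 190_000))
# ...but three low readings in a row do.
print(should_reboot([188_000, 186_000, 185_000], True, 190_000))
```

The 190,000 KB floor in the usage lines is just a stand-in for "something shy of 200k".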