[2.3.5.152-2.3.6.145] [C7] Help! Hub seems to be failing with repeat zigbee radio online/offline

Thanks, this post is really useful. I'm also experiencing this issue. It's frustrating: it occurred on my C7, and now that I have transferred everything onto a C8, it has started to occur again.

It would certainly be useful to have more tools and reporting available natively within the hub, both to see any issues and to check general performance.

I am also about to begin the process identified above. Fingers crossed.


It can also be caused by hub overload in general. The Zigbee radio is the first thing to shut down as the hub tries to recover.

Integrations with LAN devices can cause this too. For instance, I had an issue with my IoTaWatt: when it went offline, the driver kept trying to contact it, and the Zigbee radio would eventually shut down because of all the failed retries.

Unless there is an indicator one can check to confirm hub overload, or unless the specific causes of overload that lead to radio issues can be observed or eliminated, this is not useful information, sorry. In my case and many others, there is nothing indicative of any form of overload.

Did anyone from Hubitat staff ever confirm this to you? I can't wrap my head around it. If I had to guess what "hub overload" might have to do with this, if anything, I would think it's the hub starving the Zigbee stack of attention, leading or contributing to it crashing, with the hub rebooting the stack to recover once it notices. All speculation, of course; I'm unfamiliar with the hub's architecture. I believe all three of my repeated reboot sequences began around the time of the automated nightly backup. In the last instance, I was trying to get a backup done in preparation for a soft reset and had to give up as the Zigbee radio reboots became more frequent and the hub became unresponsive.

Presumably you had other indicators of the issue beyond the Zigbee radio going offline, in which case that is consistent with what I wrote.

Yes, this has been confirmed by both @bobbyd and @mike.maxwell; simply do a search on it.

Example:

Actually, the only clue was that, at the same time, I found a failure contacting the IoTaWatt (it was a bad GFCI). I then ran experiments over the course of several days and could reproduce it. @ogiewon updated his driver, and I haven't had a problem with the IoTaWatt causing that since.


I'll trot out the Sherlock Holmes analogy again, but it's true.
Repeatability is key, but hard to achieve most of the time.

It's the same, or worse, for Z-Wave.
I have a problem location for a shop light plug in the basement, almost directly under the hub.
It went offline yesterday after a reboot, and devices in the basement generally acted up.
Power cycling the plug fixed it.

My point, I guess, is that these meshes are fragile, perhaps more so with Z-Wave; I don't know.
But with experience, you get to know the rhythms of failure, lol, and how to address them.

With all this Zigbee radio reboot talk, I think I'll re-re-install the 12 Sengled E1C-NB7s. My battery-only network has been reliably boring:

Thanks. A couple of things:

  • the post is 5 years old; I wonder if the platform handles this differently now.
  • that sentence does not necessarily imply the hub is taking the radio down, just that the radio goes down when the hub is stressed. I am going to stick with my theory for now! :smiley:

I did a quick search and did not find other posts as specific as that one, but I did come across a post by staff stating that the hub is I/O-bound, not CPU-bound (which makes sense).

However I am unaware of any user-accessible stats or perf counters that allow us to monitor I/O, either on the LAN or with the radio stacks.

Here's a more recent one on the subject, @rlithgow1:
[Screenshot]


Ha, thanks for posting that screenshot; I was just digging around for that interaction!

Yeah, there are definitely mixed messages on this. But I know my ZB radio has definitely crashed once or twice when the hub was stressed back on my C7. Thankfully, it hasn't happened for a long time now (never on my C8, even with "elevated load" status).


I don't think elevated load alone does it. What I mean is that the CPU can run high, but it seems that when one specific thing is being hammered over and over at a rapid pace, the CPU gets a bit wonky (this is purely speculation based on observation, and I may well be talking out of my arse, albeit unintentionally).

On a side note, the one-minute "Zigbee radio off" duration on my shutdown/power-cycle rule was too short. When I rebooted the Zigbee radio yesterday, it took 124 seconds to come back and triggered the rule (there are other built-in delays). I bumped the rule up to five minutes.
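To illustrate why the one-minute setting tripped, here is a minimal Python sketch of the timing logic, assuming the rule simply waits for the radio to stay offline for a fixed duration before power cycling. The function names and the 2x safety margin are hypothetical illustrations, not Rule Machine features; only the 124-second recovery time comes from the post above.

```python
# Hypothetical model of a "Zigbee radio offline" watchdog rule.
# Assumption: the rule power cycles only if the radio stays offline for at
# least `threshold_s` seconds, while a normal radio reboot keeps it offline
# for `recovery_s` seconds (124 s was observed in the post above).


def rule_fires(recovery_s: float, threshold_s: float) -> bool:
    """Return True if a radio that recovers after `recovery_s` seconds
    would still trip a rule that waits `threshold_s` seconds."""
    return recovery_s >= threshold_s


def minimum_safe_threshold(observed_recoveries_s, margin: float = 2.0) -> float:
    """Pick a threshold comfortably above the slowest observed recovery.
    `margin` is a hypothetical safety factor, not a Hubitat setting."""
    return max(observed_recoveries_s) * margin


if __name__ == "__main__":
    observed = [124.0]  # seconds, from the radio reboot described above
    for threshold in (60.0, 300.0):  # one minute vs. five minutes
        print(threshold, rule_fires(observed[0], threshold))
    print(minimum_safe_threshold(observed))
```

With a 124-second recovery, a 60-second threshold fires on a normal reboot while a 300-second threshold does not, and doubling the slowest observed recovery (248 seconds) still falls under five minutes.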

How about the theory about repeaters running an older version of Zigbee?

It could still be an issue; I'm looking into that right now on mine. I'm running Sengled outlets.

In theory, @rlithgow1 is correct. The hardware will shut down the Zigbee radio to preserve its resources when dealing with an overload. However, based on incidents, this rarely, if ever, happens, thanks to the work that @gopher.ny has done over the years to fine-tune how overload is handled. As of today, September 8th 2023, the most common Zigbee radio issues are related to the mix of Zigbee devices.


An older model than the E1C-NB7, right?

Nope, that model. I'm using the Generic Zigbee Outlet driver.


Oy. I bought 12 of those on your recommendation. :slight_smile:
On the other hand, a while ago I bought a bunch of the Centralites on someone else's recommendation.


Thanks. So the radio may be taken down to preserve resources, but it doesn't sound like it's the first thing to go, at least not anymore.

It could be the first thing to go if you only see red in your logs with errors occurring every few ms.


The manual says cumulative power readings are available; I assume this means kWh. It doesn't seem that the generic driver supports energy. Not that I have power reporting turned on, but it would be nice.

As an aside, I just looked at the manual for the reset procedure for one of the 12 that had acted up before and was acting up again just now.
I found that pushing the button while plugging it in helped.

It supports watts.