[2.3.5.152-2.3.6.145] [C7] Help! Hub seems to be failing with repeat zigbee radio online/offline

I need help diagnosing a problem with my C7. Any ideas would be most welcome.

I've owned this C7 hub for a bit less than three years. Although I have seen isolated episodes of "zigbee radio offline" in the past, the hub now seems to be in a cycle of roughly two weeks of uptime after which I see the radio going offline for 7-8 seconds, then back online, repeatedly. This pattern has now been observed three times in a row, over the past 6-7 weeks.

Interestingly, the offline/online sequence seems to begin right after a backup.

The hub doesn't seem to be overloaded in any particular way. Here's what the last 30 days look like (reboot on 8/20 due to this online/offline sequence). Median CPU over that period was 6.75% and max 40% (sampling every 7 minutes using @thebearmay 's Hub info driver v3).

In terms of automation, I don't have anything running over that sort of time period. Device and app stats don't reveal anything obvious IMHO:
[Screenshots: device stats and app stats]

The only errors I found in the logs are these DNS name resolution failures, which I see from time to time and which are nothing to worry about. No interesting warnings either:

As this hub is covered by Hub Protect, I submitted a ticket and was told that this can be due to an overloaded hub or an "incompatible device". I was advised to take down any custom drivers, then add devices/drivers back one by one to figure out which is problematic. If I were to proceed like this, it could take me a long time (remember: the problematic behaviour only manifests after ~2 weeks!) during which time key automations would remain broken.

The Zigbee custom drivers in use on this hub are in fairly widespread use AFAIK :

  • Stelpro Ki thermostat driver by @philippe.charette (devices 1 and 2, so in use since I got the hub)
  • @birdslikewires 's Aqara Temp/RH sensor driver (in use for a couple of years I think)
  • @samuel.c.auclair 's Sinope thermostat drivers
  • the Zemismart Zigbee blind driver maintained by @kkossev

In addition, there is a custom driver I've written myself (derived from Samuel's work) for the Sinope water heater controller. None of these drivers are giving me problems. The attached devices all appear to work normally and dare I say flawlessly for months. In fact, the Aqara and Sinope drivers have been in use in two different locations with C7 hubs, for months. The other location has not seen any zigbee offline/online messages (I think ever).

Any ideas to troubleshoot this more efficiently would be most welcome. In particular, any tips on identifying a "bad/incompatible" zigbee device that could be taking down the hub's zigbee radio.

As a next step, I will revert the Sinope drivers to the system versions, as I can do that without too much impact on my automations. However, there is no system equivalent for the blinds or Aqara devices. I will also perform a "soft reset".

@support_team

So usually the zigbee radio going offline is due to something overloading the hub. The zigbee radio shuts down momentarily to recover from whatever is ailing the hub. This doesn't necessarily mean it's zigbee related. It could be from a LAN integration or something being overly chatty. It doesn't even mean the CPU is being overloaded. I would start with checking the logs for anomalies... (and since you tagged them, perhaps @support_team can look at the engineering logs). What kind of power reporting from devices do you have going on? What LAN integrations do you have? (Kasa is always suspect)

Thanks @rlithgow1, I'll have a look at Kasa. Did you see anything in the data that I provided that hints at hub overload though?

Not off hand but like I said, overload doesn't necessarily mean the cpu...

Interesting that it's a C-7. C-8 has/had some problems with that.

That 7 second off/on is a Zigbee radio reboot. You can get the same effect when you hit "Reboot" on the Zigbee settings page.

Something is telling it to reboot Zigbee radio.

What have you done in the last 6-7 weeks? Yeah, easy question.

Maybe do a soft reset and then restore? Maybe follow up with a graceful shutdown and power cycle?

That's all I got.

There is not even a single case proving that any Zigbee or Z-wave device, using any driver is causing these "Zigbee radio is offline" events. None.

I am very very curious to find out at least one 'problematic' Zigbee or Z-wave device or driver causing these offline event records.
No one has found any such thing so far.

Understood. Memory didn't seem super-low (still above 200MB, I've seen my other hub hum along below 170 MB after over 60 days...), DB size normal... are there other performance indicators I should look at ?

I also haven't seen anything yet that provided a smoking gun pointing at a particular device/combo or driver being at fault. I had continued Zigbee radio reboots on two C8 hubs, got a final replacement and it didn't have the ongoing reboots. Same devices/drivers migrated to all three C8s. Seems like this stuff is still mostly a mystery.

Actually, it is an easy question : nothing. I was away for most of that time. The most recent additions I have made were four energy monitoring plugs. They are a bit chatty but not to the point of overwhelming the hub methinks (1 message every 30 seconds). They are using the system driver.

Good point. I don't think I've power-cycled this hub in a while. Thanks !

Are you experiencing any specific Zigbee issues, like devices dropping off your hub, or not reporting all supported events (e.g., temp reported but contact or motion not reported)?

Or is it just that the reboots are happening, but everything works?

I did not find any evidence of any issues! Not a single device dropped and, from what I could tell, automations ran. For the first couple of occurrences of this pattern, I rebooted the hub as soon as I saw the repeated notifications, as a precaution.

This morning, however, I was losing UI responsiveness (through remote admin) and all I could do was reboot the hub. The hub events log then revealed that the radio had started to reboot every few minutes.
It has been running fine for the past six hours since the hub reboot.

These are quiet as a fish compared to some of the mmWave radars that I use without a problem... :slight_smile:
For some really spammy devices, see this screenshot. Yet there are absolutely no problems with the Hubitat hub's performance! :+1:

I just went through this.

Watch the zigbee log. I found that an indicator was a dev: that showed up with no id but valid clusters. This let me id the offending device (ultimately) as a misbehaving outlet.

My specific device... was probably a Sengled plug-in outlet. I've got a stack of plug-in outlets that I've yet to test, which I pulled on the day I had my voilà moment.

I was using the Sengleds as a local power button on a kettle. Unfortunately, I think the 15Amp load rating of the Sengled is not for resistive loads.

Note, I said zigbee logs, that is the logs found by clicking on settings->zigbee details->zigbee logging.

I opened this in a new window, then, using the device controls, stepped through the network, matching up device IDs and trying to figure out which one was kicking out the null ID. Ultimately I noticed it was reporting a cluster I suspect is related to power monitoring, and I soon began to suspect the plug-in outlets.

Since I removed the Sengleds and half a dozen other unused plug-in outlets, the problem has disappeared, and stability has resumed. Since recovery, I've installed 3 in-wall Jasco 43102s, and the problem has not recurred.

I'm still considering a way to test the plugins. Trashing them remains an option.

Example log entry:

dev:, 2023-08-31 06:01:59.033 PM, profileId:0x0, clusterId:0x8032, sourceEndpoint:0, destinationEndpoint:0 , groupId:0, lastHopLqi:255, lastHopRssi:0
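If you have saved the zigbee log to a text file, the manual search for nameless entries described above can be sketched roughly like this. It is only a minimal sketch: it assumes every line looks like the example entry (comma-separated, starting with `dev:`), which may not hold for every firmware version, and a device name containing a comma would break the split. `parse_entry`, `nameless_entries` and the `NAMED` sample line are made-up names and data for illustration.

```python
# Assumed line format, copied from the single example entry above.
SAMPLE = ("dev:, 2023-08-31 06:01:59.033 PM, profileId:0x0, clusterId:0x8032, "
          "sourceEndpoint:0, destinationEndpoint:0 , groupId:0, lastHopLqi:255, lastHopRssi:0")

# Hypothetical entry for a device that does have a name (invented for the demo).
NAMED = ("dev:Kettle Plug, 2023-08-31 06:02:03.120 PM, profileId:0x104, clusterId:0x6, "
         "sourceEndpoint:1, destinationEndpoint:1 , groupId:0, lastHopLqi:200, lastHopRssi:-60")


def parse_entry(line):
    """Split one log line into a dict; return None if it does not start with 'dev:'."""
    line = line.strip()
    if not line.startswith("dev:"):
        return None
    parts = [p.strip() for p in line.split(",")]
    entry = {
        "dev": parts[0][len("dev:"):].strip(),
        "time": parts[1] if len(parts) > 1 else "",
    }
    # Remaining fields are key:value pairs such as clusterId:0x8032.
    for part in parts[2:]:
        key, sep, value = part.partition(":")
        if sep:
            entry[key.strip()] = value.strip()
    return entry


def nameless_entries(lines):
    """Yield parsed entries whose device name is empty."""
    for line in lines:
        entry = parse_entry(line)
        if entry is not None and entry["dev"] == "":
            yield entry


for hit in nameless_entries([SAMPLE, NAMED]):
    print(hit["time"], hit.get("profileId"), hit.get("clusterId"))
```

Grouping the hits by `clusterId` afterwards is one way to see which kind of message the nameless device keeps sending.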

S.

Thanks for the detailed account! Will definitely sift through those logs.

So far I am not seeing anything exactly like that, but I do have one device sending some kind of ZDO message every 10 seconds.

It's an Aqara in-wall no-neutral switch - installed nearly a year ago, uses the Generic Zigbee Switch system driver. It controls a bathroom fan, which everyone is now used to not starting manually... So now that I've done a soft reset + power cycle, I'll wait and see if the problem comes back in a couple of weeks, and if it does I'll try replacing the switch.
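To put a number on "every 10 seconds" for each device, something like the following could be run over a saved copy of the zigbee log. This is a rough sketch that assumes the same `dev:<name>, <timestamp>, ...` layout as the example entry posted earlier in the thread; the device names and sample lines are invented, and `median_intervals` is a hypothetical helper, not anything the hub provides.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Timestamp layout assumed from the example entry, e.g. "2023-08-31 06:01:59.033 PM".
TIME_FORMAT = "%Y-%m-%d %I:%M:%S.%f %p"


def median_intervals(lines):
    """Return {device name: median seconds between consecutive messages}."""
    seen = defaultdict(list)
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 2 or not parts[0].startswith("dev:"):
            continue  # not a log entry in the assumed format
        try:
            stamp = datetime.strptime(parts[1], TIME_FORMAT)
        except ValueError:
            continue  # unexpected timestamp layout, skip the line
        seen[parts[0][len("dev:"):].strip()].append(stamp)

    result = {}
    for dev, stamps in seen.items():
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if gaps:  # devices seen only once have no interval to report
            result[dev] = median(gaps)
    return result


# Invented sample lines for illustration only.
sample = [
    "dev:Fan Switch, 2023-09-01 06:00:00.000 PM, profileId:0x0, clusterId:0x8032",
    "dev:Hall Sensor, 2023-09-01 06:00:05.000 PM, profileId:0x104, clusterId:0x402",
    "dev:Fan Switch, 2023-09-01 06:00:10.000 PM, profileId:0x0, clusterId:0x8032",
    "dev:Fan Switch, 2023-09-01 06:00:20.000 PM, profileId:0x0, clusterId:0x8032",
]
print(median_intervals(sample))
```

Sorting the result by interval puts the chattiest devices at the top, which gives a shortlist of candidates to unplug first.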

Would you know the specific model?

I took out all my Centralite plugs when zigbee had the radio reboot issue. I went repeater-less and no more problems.

I since bought 12 Sengled E1C-NB7 plugs. I did a short test, but went back to repeater-less, since I currently only have battery zigbee devices and that arrangement has been reliable.

When I did my test of the new Sengleds, however, one plug did have problems. The biggest telltale symptom was that the on/off switch on the device wouldn't work. I banged it on the table, (my go to move), and it started working again.

Anyway, I did have one zigbee radio reboot during this futzing-around period.

I have this exact same problem. Customer service told me the exact same thing… there is no way to easily tell which device may be flooding the hub.

You can get a zigbee sniffer. Use something like an XBee.

For now, based on @scottgu3 's experience, I am piping the zigbee logs websocket to a file on a local workstation. I will scrutinize the file for any hints (hope that, in itself, doesn't overload the hub! lol).
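For anyone wanting to do the same, here is a minimal sketch of the capture. It assumes the third-party `websockets` package is installed, and it assumes the zigbee log socket lives at `ws://<hub-ip>/zigbeeLogsocket`; I believe that is the path, but please verify it on your own hub. Messages are written as received, with no assumption about their payload format. `format_line`, `capture` and `main` are names made up for this example.

```python
import asyncio
from datetime import datetime


def format_line(message, received_at):
    """Prefix one websocket message with the local receive time, one message per line."""
    text = message.decode("utf-8", "replace") if isinstance(message, bytes) else message
    return f"{received_at.isoformat(timespec='milliseconds')} {text.strip()}\n"


async def capture(url, path):
    """Append every message from the websocket at `url` to the file at `path`."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(url) as ws:
        with open(path, "a", encoding="utf-8") as out:
            async for message in ws:
                out.write(format_line(message, datetime.now()))
                out.flush()  # keep the file current if the script is interrupted


def main(url, path):
    """Run the capture until the connection closes or the script is stopped."""
    asyncio.run(capture(url, path))


# Example call (URL path is an assumption, replace <hub-ip> with your hub's address):
# main("ws://<hub-ip>/zigbeeLogsocket", "zigbee.log")
```

The script only listens, so the extra load on the hub should be one more websocket client, much like leaving the zigbee logging page open in a browser.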

I suppose getting an XBee is an idea, but if I'm going to be throwing money and time at the problem... I've ordered a new C7 (not getting any further replies on my warranty ticket @bobbyD ).

However, one would expect that the hub itself could be more specific about what is going on, or that there would be some kind of diagnostic tool (or mode) you could boot into for these situations.

Why not a C8?