Zigbee goes offline under heavy load

jshimota · August 3, 2022, 12:46am

At 9:30a and 5p I run a 'diagnostic' on my environment. I have a group called 'Rare devices', which holds items that often don't get touched for days on end. A rule simply turns them all on, then turns them all off, and then Device Watchdog is triggered for a report. This is all automated.

Additionally, I have a nearly weekly manual process: a manual backup, then a shutdown, then I pull the power from the hub for 20 mins. Subjectively, I notice faster device responses to Alexa commands, and faster group responses as well. I do this every 4-5 days. This is a legacy process - there was a time when the community felt it was necessary, I got in the habit, and I've kept doing it. I have been told it was unnecessary - in no uncertain terms. I don't really care. After all, it is my hub, regardless of your opinion on my personal process.

Having detailed all of that for you - now the actual issue.
In the last 2-3 weeks I haven't done this process as usual. During these last 3 weeks I've experienced 3 'Zigbee offline' situations. The first two, I came home and most devices were turned on. The gf surmised 'we must have had a power failure'. I shrugged, used my Android dashboard and set everything back to normal and went my merry way.

Tonite I caught the failure. I have a notification sent to all my Alexas if my Zigbee goes offline.
First, I heard the notification that it was 5pm. Then I saw lights going on all over - I had the hubitat Android Dashboard open fullscreen so it was clear what was happening.
Normally, having sat through this process 100 times, the lights all go on, then 3-5 seconds later all the lights go off.
Today, the lights went on, but I could see on the dashboard that not all the expected lights/outlets completed. Then about 10 seconds later all my Echos around the house started saying 'Zigbee network offline'. The dashboard, which gets Zigbee status from Hub Info, said Zigbee was online, but inside the Hubitat it clearly was not (obviously a secondary, unrelated issue).
Indeed - no device was communicating, the Hub interface showed the HE zigbee network offline. But the logs don't show WHY it went offline.
No command issued to the hub to 'enable' Zigbee would work. There were NO entries in the logs and basically the hub went stupid, so I rebooted the hub.
The reboot fixed it.
I run an XCTU device and the Zigbee network was NOT down. I could open the device and it showed the C device was gone, but all the other devices were still communicating. It was only the hub that went south.

Based on what I saw, I believe there is some problem with load, or extended load (it takes nearly 30 seconds to turn off all my devices and turn them all back on - I use groups, and ALL groups have delays built in). I think the timers that HE uses when waiting for a long command to execute cause the HE to decide the Zigbee went down when it didn't, and the hub shuts it off.

I recognize this could be any number of problems, but given that there are numerous similar posts, I hope that continued reporting may trigger someone with the source to take a look - maybe my post will help someone find a weakness overlooked in the past.

Panda · August 3, 2022, 4:50am

Interesting - I need that diagnostic, since I've had a number of devices on lithium batteries that go from working to not working very abruptly. I don't catch this, and I have determined it causes a number of downstream issues.

I also keep running into my Wemos going offline under heavy load - the device still works via the Wemo app, but the hub just says "connection refused", so none of my rules/buttons through Hubitat work. I've decided the hub does that after it gets flooded, which is an issue with my rules that I'm trying to solve.

I also reboot nightly because of it, despite commentary that it was unnecessary...

Sebastien · August 3, 2022, 1:04pm

Excessive load on a hub will result in the Zigbee radio going offline. Some community apps and drivers will cause this excessive load. It is possible to see from the logs - Device Status and App Status - which driver or app might be causing the issues.

It is possible to temporarily disable some apps or devices if they seem to be causing issues.

If a second hub is available, moving potentially problematic devices and apps to their own hub, linking everything together via Hub Mesh can be a workaround.

jshimota · August 3, 2022, 2:29pm

@sebastien I have seen posts and responses similar to yours in the past - and I can understand that app load can take the HE down, but it doesn't actually explain the zigbee network going down, at least to me.
The HE load is CPU-related when it comes to apps - I also have a small monitor on load and have removed pretty much everything I can. My hub is already skinny; it's not loaded and there are no heavy apps. I ran Hubgraphs for a while and could get the CPU to say it was about 15%. I once misconfigured a driver I was writing and put the hub in a loop, and indeed it was bad. That's not the case anymore. A network driver or buffering problem should result in a log entry, since supposedly the HE is detecting it. An error routine that writes to the logs is clearly missing to help find this culprit.
I don't recall ever seeing the same results on pure Z-Wave nets. In a pure Zigbee environment it seems consistent. It's like a fault somewhere in the network stack is causing this - and I don't think it's a load issue, it's a timing issue. If I issue a 'turn all off' it does not work. HE stops turning things off partway through. Of the 80 physical devices I've got in the network, I can get about 45-50 to switch state before it stops working. Then if I issue it again, the rest will go. That's timing, not load. I'm convinced there is a Zigbee network related issue that's near the heart of the HE stack. Maybe a compiler library needs an update or something!
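
For anyone who wants to reproduce the "issue it again and the rest will go" behaviour in a controlled way, here is a minimal Python sketch of a send-and-verify loop. It is not Hubitat code: `send_off` and `is_off` are hypothetical callbacks standing in for whatever sends the command and reads back the reported state, and `max_passes` is an arbitrary choice.

```python
from typing import Callable, Iterable, List


def turn_off_with_retry(
    devices: Iterable[str],
    send_off: Callable[[str], None],   # hypothetical: sends one 'off' command
    is_off: Callable[[str], bool],     # hypothetical: True once the device reports 'off'
    max_passes: int = 3,
) -> List[str]:
    """Send 'off' to every device, then re-send only to those still on.

    Returns the devices that never confirmed 'off' after max_passes.
    """
    pending = list(devices)
    for _ in range(max_passes):
        if not pending:
            break
        for dev in pending:
            send_off(dev)
        # Keep only the devices that did not report the new state.
        pending = [dev for dev in pending if not is_off(dev)]
    return pending
```

Counting how many devices are left in `pending` after each pass would show whether the first pass reliably stalls at the same point (around 45-50 here) or at a random one.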

Sebastien · August 3, 2022, 2:44pm

It would be nice if another culprit could be found!

bobbles · August 3, 2022, 3:08pm

It has been documented many times by different people that under heavy load, the zigbee radio is the first thing to go offline.
What constitutes 'heavy load', who knows.

From your statement about the 80 physical devices, I'm going to assume that you are hitting your Zigbee network with 80 Zigbee off commands. Are you using any metering for this?
I've had issues with trying to turn off 15 zigbee devices with no metering. Sometimes it works, sometimes it does not. A repeat of the command, as in your case, seems to do the trick.
I ended up putting my zigbee lights into 'room groups' with metering.
I then use the groups to turn off the lights. This has sorted the issue for me.

Personally I would say that trying to hit 80 devices in the way you are is going to cause you issues. (Just my opinion of course).

jshimota · August 4, 2022, 2:51pm

Mornin all- my net design is like this:
Every room in our house has an overhead fan with a 3 bulb array. The fans are NOT smart so not in play. Each bulb is a smart bulb in the array. Some are RGB, Some are CT. The bulbs are grouped for each physical device.
Each group has metering. I've checked many times that all groups are set up the exact same way. I use metering because in my environment, if you command 3 bulbs to go off, randomly one might not. As a metered group, it always works. It may be important to note that I use subjective delays for all groups - ranging between 25 and 80ms - depending on distance from the hub.
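
To make the metering idea concrete, here is a rough Python sketch (not Hubitat code) of what a metered group amounts to: one command per device with a fixed pause in between. The `send` callback and the function names are made up for illustration, and it assumes commands go out strictly one after another.

```python
import time
from typing import Callable, Sequence


def metered_send(
    devices: Sequence[str],
    send: Callable[[str], None],           # hypothetical: sends one command to one device
    delay_ms: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send a command to each device, pausing delay_ms between consecutive sends."""
    for i, dev in enumerate(devices):
        if i:  # no pause before the first device
            sleep(delay_ms / 1000)
        send(dev)


def total_metering_seconds(device_count: int, delay_ms: float) -> float:
    """Time spent purely in the pauses when device_count commands go out serially."""
    return max(device_count - 1, 0) * delay_ms / 1000
```

Under that serial assumption, 80 devices would spend about 2 seconds in pauses at 25ms and about 6.3 seconds at 80ms, before counting any time the radio needs to actually deliver each command.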
I never use optimization - too often my HE thinks a bulb is on when it's not and commands are ignored.
I do not use Zigbee Group messaging. It was hit or miss in my environment so I choose not to use it. My environment does contain Zigbee 3 and Zigbee 2 (?) devices and I honestly don't know how to tell which is which. I stay away because, as I understand it, it's a Z3 thing.
(my groups:)

And a randomly chosen sample group

It is my expectation my home should be capable of an 'all off/on' command. In my mind, that should be test set 1 for any hub. I've heard of homes with 300+ devices and it isn't an issue - albeit that is a Control4 setup in a 9000 ft home.

What I'm going to try this AM is to set up 2 groups, one 'all devices by group' and one 'all devices specific'. I'll use pure groups in one, and pure specific devices in the other, to see what sort of difference that makes. Maybe a '1 ring to rule them all' group of groups might work better.

Still - imho, having HE command 80 devices shouldn't take the Zigbee down. Complain that it couldn't complete the command in a timely manner, okay, but as a test bed, 80 just doesn't feel like it's that big a deal. And the HE disabling the Zigbee network should be an optional setting that I control in my Settings - it shouldn't have arbitrary control of my network!

jshimota · August 4, 2022, 4:06pm

well- nothing useful to report I don't think. All lights/switches On off by group took longer to get started but seemed to give a better overall result.
A curious observation -
Starting with a clear browser screen of the Log page, on a different screen I issued the On command to groups. The log started to show devices going on, and when it finished, things looked okay. Then I gave the Off command, and 3 more items showed up in the Headers area! If the on and off were all the same devices, you would expect that to be static. The same happened when I ran the test on All on/off by device: the off command gave a few more devices all of a sudden, and they weren't the same ones that were missed when I repeated the test. I wonder if a video would explain that better. Gonna go video that... bleh. Too hard to capture. It is what it is.
Last note - fast, repeated pushing of on and off on multiple screens from multiple web page instances did not take down the hub. If it was 'load' then that should cause my net to go down, but it didn't... so odd.

jshimota · August 4, 2022, 4:30pm

I lied. I have another observation.
The Hub Information device, which I'm assuming is just reporting what the hub is reporting, shows a bizarre value for CPU Pct. I got a message 'hub load is severe'. Unit temp went from 109 to 129F as well.
I thought only humans can give 110%... So... theoretically - if the hub is calculating load incorrectly, the same routine could be in the zigbee network stack somewhere miscalculating network load and turning off the network... Possible?
Screen cap:

brad5 · August 4, 2022, 5:41pm

Take a look at this link about hub info. It gives a bit more of an explanation of the metrics. The hub has 4 processor cores so it may take that into account. But both the CPU 5 MIN number and the CPU PCT number definitely point to a heavily loaded hub. I suppose it's theoretically possible there's a calculation error but I've been using this for a while and it has at least been directionally correct.
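
As a rough illustration of why a figure over 100% isn't necessarily a bug: if the percentage is the load average divided by the number of cores (an assumption on my part, not something confirmed in this thread), then any load average above the core count lands above 100%. A small Python sketch, with a made-up function name:

```python
def load_to_percent(load_average: float, cores: int = 4) -> float:
    """Express a Unix-style load average as a percentage of available cores.

    Assumes percent = load / cores * 100; values above 100 simply mean more
    runnable work is queued than there are cores to run it.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores * 100
```

With that formula, a load average of 4.4 on a 4-core hub works out to roughly 110%, and a load average of 2.0 would read as 50%.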