[Solved] Zigbee Instability is back!

Anything is possible. It could be how the devices react to a buffer overflow. It’s just that I’ve never witnessed it on the Iris plugs before.

Was there any message logged in the Zigbee log when it happened? If not, can we assume HE is not involved and the device is operating on its own?

Here's the latest on this... There has been a continued degradation in the stability of the Zigbee mesh. Devices are continuing to go unresponsive, and when they do come back to life, it's not for very long. Battery powered end devices are also affected by this, as those that route through an unresponsive device frequently fail to report events.

I was able to get Wireshark set up with the ConBee USB stick using their ZShark remote capture software. It works very well. The trick is to first set up the Zigbee Alliance public key in Wireshark, then find a device to pair while watching the capture for the Transport Key. Once this key has been revealed during the pairing process, Wireshark can decode all of the packets.
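For anyone repeating this capture, here is a minimal sketch of the first step. It assumes the "public key" mentioned above is the well-known default Trust Center link key, whose bytes are the ASCII string `ZigBeeAlliance09`. The snippet only prints that key as hex so it can be pasted into Wireshark's Zigbee key preferences; the helper name `link_key_hex` is made up for illustration.

```python
def link_key_hex(ascii_key: str = "ZigBeeAlliance09", sep: str = ":") -> str:
    """Return the key bytes as separator-delimited uppercase hex.

    Assumption: the default key is the well-known Trust Center link key.
    """
    raw = ascii_key.encode("ascii")
    if len(raw) != 16:
        # Zigbee link and network keys are 128 bits long
        raise ValueError("expected a 16-byte key")
    return sep.join(f"{b:02X}" for b in raw)


print(link_key_hex())
```

The Transport Key seen during pairing would then be added to the same key list, which is what lets Wireshark decrypt the rest of the traffic.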

This proved to be an invaluable tool in trying to determine what is going on. What Wireshark shows is that there are no packets being sent to the device by the hub, nor is the hub broadcasting route requests to locate the affected device and communicate with it.

Chuck and Bruce have been great in working with me to get testing done on my system. The good news is that I've provided them with some Wireshark captures taken while the issue is happening. It has not been an easy issue to track down, and I'm sure it won't be an easy one to fix. But I do thank them for their efforts and sincerely hope they can track down a fix for this quickly.


You're not alone in noticing this instability. I have also noticed it since a couple of updates ago. Everything was fine for months, then all of a sudden after an update (I forget which) I started losing control of bulbs. Yes, they are Lightify, which are known problems, but they were fine for a long time and then boom. So I got 2 XBees (great devices) and thought that would solve the problem. Nope. I even had Iris contact sensors start to fall off. The last month has been really bad for my setup. The only thing that changed was the update. I didn't add anything to my hub.

I'm going to chime in and say that I have had very erratic behavior from my setup as well. I don't even know where to begin, so I have been following this thread trying to figure out if anything stands out as a solution. I have groups not sending commands and virtual devices not being consistently updated by rules, so it has been nearly impossible to nail down why my nearly 50 bulbs will not do what they are supposed to do. I press a button and the groups say they updated, but none of the lights respond. It has been maddening. I added about 8 Iris plugs for repeaters, and don't use any custom drivers, and still the lights will not consistently turn on when I push a button. I know this might not be related, but instability appears to be a theme on my system.

What type of bulbs are these?

Lightify. They work perfectly when I control each group of bulbs individually, but when I use a button to control the 5 groups that are in my open living room, dining room, kitchen, only some respond. Multiple pushes of the button (because the virtual dimmer for the group didn't update) sometimes fixes it. Other times I have to turn them off, wait 20 seconds, then try again.

For information, the buttons (3 of them) all call an RM action to turn on the virtual dimmers for each group.

If you are using Zigbee Group Messaging in Group-2.0, this can happen. Zigbee Group Messaging is a broadcast message, not a handshake. So there's a tradeoff with this approach. If you turn off Zigbee Group Messaging, you get some popcorning of the lights coming on.

Why not make another group that includes all bulbs in the 5 other groups? Zigbee supports multiple groups concurrently...

I plan to maintain a small test Hubitat system for a while, so long as the HE staff is willing to dedicate time and resources to tracking down the Zigbee bugs. Today while moving devices back to ST, I rebuilt a small Hubitat test system. Before I even hit 20 devices I noticed devices were starting to go unresponsive. On a whim I decided to bounce (disable and enable) Zigbee. I noticed that for a very short time afterward (no more than a minute), I was able to control all of the devices that had previously been unreachable.

Not sure what that means. In theory, I suppose that could force the hub to seek out routes to the unreachable devices. But I'm not that familiar with Zigbee routing. It's an observation I wanted to share anyhow.

What devices? Iris plugs only?

No, it's a completely mixed network of about 80 devices, some of which will be moving to ST, others will not. At the moment, plugs include Iris V2 and ST 2nd, 3rd, & 4th gen, ST contact and Iris multipurpose and ST buttons.

Evidently Zigbee networks with dense meshes of routers (dozens of routers within radio range of each other) can be as problematic as sparse ones due to limited router table and neighbor table space. Silicon Labs posted a knowledge base entry, "Guidelines for Large/Dense Networks with EmberZNet PRO", which mentions 16 as the number of in-range routers that their stack can track directly. Nodes not in the table require multi-hop routes despite being within 'one radio hop' range, along with all the extra latency and routing overhead which that implies.

Surely this varies by stack and there are lots of 'tuning knobs' to mitigate these scenarios, but I wonder if it is a factor in your environment. I read somewhere in the Digi Q&As that once you get more than 40 XBee devices within range of each other you're approaching a practical limit. Commercial deployments (with scores of 'always on' routers as in a building lighting control application) may go so far as to specially configure routing-capable devices as end devices to get around this.


Tony, you could have a point. I had no issues with SmartThings or Iris, but it goes without saying those platforms probably have customized Zigbee stacks. The V2 uses the EM3587-RTR Zigbee module, but I do not know if they’re using Ember. I have also read that Nordic supports 32 routing neighbors, so larger tables are a possibility.

You just gave me two more scenarios that I want to test. First, I have an OpenHAB build I’ve been playing around with all weekend. I have not actually configured any radios on it, but they support the Nortek stick using Ember. I’ll see if I can transplant the stick to my home server. Supposedly OH will import any existing paired devices, so in theory I should be able to bring up the network and see if I can replicate the issue there.

Another test would be to set up both hubs to create parallel Zigbee meshes to split up the number of in-range routers. In theory, that should overcome the 16 router limitation.

I think it’s a theory worthy of validation.


Some new data... I set up a test dashboard with all 42 SmartPlugs and one XBee so I could visually track which ones go unresponsive.

Routing devices status:
An hour ago 17/43 were dead.
About 15 minutes ago 16/43 dead.
Just now 17/43 dead.

In each trial, there was no consistency in which devices remained responsive and which did not. Some overlapped all three tests while others only failed one of the tests.

One thing is surprisingly consistent.... the quantity of dead devices.
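A minimal sketch of how the three snapshots could be compared, assuming each snapshot is recorded as a set of unresponsive device names. The device names below are made-up placeholders, not the actual results from the dashboard, and `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(trials):
    """trials: non-empty list of sets, each holding the devices dead in one snapshot."""
    # Count in how many snapshots each device appeared as dead
    counts = Counter(name for dead in trials for name in dead)
    return {
        "dead_per_trial": [len(dead) for dead in trials],
        "dead_every_time": sorted(set.intersection(*trials)),
        "dead_only_once": sorted(n for n, c in counts.items() if c == 1),
    }


# Hypothetical placeholder data, only to show the shape of the input
trials = [
    {"plug-01", "plug-02", "plug-03"},
    {"plug-02", "plug-03", "plug-04"},
    {"plug-03", "plug-04", "plug-05"},
]
print(summarize(trials))
```

A steady `dead_per_trial` alongside a small `dead_every_time` list would match the pattern described above: the count holds while the membership shifts.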

If you only had ~25 routing devices attached, would any of them go unresponsive?

43 - 17 = 26? Probably you need 3 hubs for all your routers
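As a rough check on the "3 hubs" guess, here is a sketch that takes the 16-router figure from the Silicon Labs entry quoted earlier at face value and assumes the routers could be split evenly across hubs. Whether the hub's stack actually has that limit is not established in this thread, so the constant is an assumption.

```python
import math

# Assumption: the 16 in-range-router figure from the Silicon Labs entry applies here
NEIGHBOR_TABLE_SIZE = 16


def hubs_needed(routers: int, table_size: int = NEIGHBOR_TABLE_SIZE) -> int:
    """Smallest number of separate meshes so that no mesh exceeds table_size routers."""
    return math.ceil(routers / table_size)


routers = 43   # 42 plugs plus one XBee, from the dashboard test
dead = 17      # worst snapshot reported
print("responsive:", routers - dead)
print("hubs needed:", hubs_needed(routers))
```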

How many directly connected to the hub?

I’ll PM my PayPal email for your donation. :smile:

I don’t know. I’ll have to check XCTU tonight.

That’s another test I’ll try after I attempt to use the stick in OpenHAB.


Thank you for continuing to troubleshoot this issue. Being able to figure out when it breaks should help to figure out the problem (I hope!)