[Solved] Zigbee Instability is back!

srwhite · January 28, 2019, 10:04pm

Another thought is that I can move the hub into the basement, two floors down. I’ve only got 9 plugs down there. That should in theory force a number of the routers to relay through the five 1st floor plugs. If nothing else, it will be interesting to see if the number of affected devices falls.

Navat604 · January 28, 2019, 10:09pm

I just counted and I have 33 Zigbee repeaters. 16 are Iris plugs and the rest are from various brands. Only a Leviton dimmer is offline, and from what the XBee shows, it was caused by an EcoSmart bulb. That's the only bulb I have paired directly with HE; I should get rid of it, but that's for another day.
When you said "dead", does that mean you have to reset and pair again? I don't have this issue, but once in a while I have to physically press the on/off button on an Iris plug to get it going again.

Tony · January 28, 2019, 10:29pm

This Large ZigBee Networks and Source Routing also speaks to the dense router scenario. I wonder if the 'many to one' routing optimization it mentions is used by HE; it could account for better performance you are seeing on other platforms (vs. vanilla source routing). The second paragraph in the routing section of the previously linked Silicon Labs KB note discusses how the technique is used with the Ember stack; other stacks (like Digi) use configuration parameters: Many to One Routing

Somel · January 28, 2019, 11:31pm

Tony really good stuff mate.
Thanks

srwhite · January 29, 2019, 12:05am

Here’s some more data. I moved the Zigbee hub down to the basement and retested all of the devices for responsiveness.

The first test, about 15 minutes after the hub booted up, yielded 13/43 dead.
The second test, just completed about 30 minutes after the first, yielded 11/43 dead.

I call this test inconclusive. A few fewer devices failed to respond, but that might be just as much a function of moving the hub as of the proximity to routing devices.
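For anyone who wants to put numbers on "inconclusive", here is a rough sketch using only the two counts above. The function names are made up for illustration, and the z-score treats the two runs as independent samples, which they are not (the same 43 devices were tested twice), so take it as a sanity check rather than a real statistical test.

```python
import math


def failure_rate(dead, total):
    """Fraction of tested devices that did not respond."""
    return dead / total


def two_proportion_z(dead_a, total_a, dead_b, total_b):
    """Rough z-score for the difference between two failure rates.

    Assumption: the two runs are independent samples. They are not
    (same devices both times), so this is only a ballpark figure.
    """
    pooled = (dead_a + dead_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (dead_a / total_a - dead_b / total_b) / std_err


first = failure_rate(13, 43)   # about 15 minutes after boot
second = failure_rate(11, 43)  # about 30 minutes after the first test
z = two_proportion_z(13, 43, 11, 43)

print(f"first test:  {first:.1%} dead")
print(f"second test: {second:.1%} dead")
print(f"rough z-score for the difference: {z:.2f}")
```

A z-score this far below 2 is consistent with the two-device difference being noise.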

Next test, put the stick in OpenHAB.

Somel · January 29, 2019, 12:11am

Good luck

BorrisTheCat · January 29, 2019, 12:15am

If you move the hub wouldn't you need to leave it not powered for 15-30 mins to force them all to look for and change their routes?

srwhite · January 29, 2019, 12:58am

Ideally yes, but I purposely rushed the first test to force the routes to update. The second test, 45 minutes later yielded similar results. I do not believe the issue is caused by having too many routers in close proximity to the hub.

Tony · January 29, 2019, 3:57am

The issue might not necessarily be just the number of routers in close proximity to the hub (though the coordinator also has router table limitations) but the proximity of repeaters to each other (how many neighbors each repeater has in range). That could also exhaust their table space. Depending on the routing scheme being used (AODV, source routing, many to one) and the resulting overhead, this effect might be significant.

Though it's not a solution for you (as you need the plugs for their functionality), I'm interested in the results of the test Dan suggested (a couple dozen routing devices in total).

srwhite · February 3, 2019, 3:30am

Hi all... It's been a few days since the last update on this topic. I think it's due for an update, especially for those who may have come across this thread for the first time.

I have continued to dedicate a significant amount of time on my system. On Tuesday I reset both hubs, moved a ton of devices off SmartThings and split my large system across two HE hubs, creating two different Zigbee stacks. One hub has 88 Zigbee devices, including 35 plugs, while the second has 84 devices, including 26 plugs.

@chuck.schwer and I have been working on this issue over the past week. He has generously taken time to create several test builds, the latest of which has allowed me to directly tweak the Zigbee stack. This has been deployed to one of my two hubs. I am very pleased to say that the updated hub with tweaks to the Zigbee configuration has yielded a remarkable increase in performance and stability.

My results are only based on a few hours of testing and observation, so more time for study is needed before drawing any final conclusions. However, as of this post, the unresponsive device issue has gone away as has the issue of duplicate incoming messages.

The other hub, without the updated settings, still exhibits the same unresponsive devices and duplicate message (shown below) symptoms.

At this point, it's too early to go into specifics as more testing and observation is still needed. I will say that it has nothing to do with the size of the neighbor table, but will refrain from providing further details until I've had more time to test.

We are definitely making progress now.

vjv · February 3, 2019, 3:33am

I want to cry.... Thank you

njanda · February 3, 2019, 3:38am

That's terrific to hear and thanks for all your pain and suffering whilst helping make this platform better for all.
Noice !

bjcowles · February 3, 2019, 4:06am

Thank you for your continued willingness to troubleshoot. Many, including myself, would have thrown their hands up (middle fingers extended) and walked away had they been in the same position.

My 1 1/2 years on ST was a miserable experience. HE has done everything I’ve needed and more, so it is very painful to see someone not getting the same. Glad you are seeing some positive results! Thanks again and good luck!

Tony · February 3, 2019, 3:16pm

That's encouraging. Duplicate messages would seem to be a symptom of the underlying problem, as they would be expected when a lot of retries are happening (acks not arriving in time to inhibit another retry attempt). AFAIK this should be visible in a sniffer trace (the sequence number in a retried message would not match in a retry attempt). Puzzling why they aren't being filtered at the application layer; supposedly the ZCL profile handles that.

I remember seeing something like this very early on when I was setting up my HE system last February (back then with only very few devices when I first began the migration); there were a few users that reported it as well. It seemed to go away but over the past several weeks or so I have noticed that an audio track (played in response to a Zigbee button push) usually gets played multiple times (as if the app is seeing multiple button presses). I should be able to see this with the CC2531 sniffer dongle (best $10 I ever spent) once I get the hang of using it.
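To illustrate what filtering duplicates at the application layer could look like, here is a minimal sketch. It is not how HE or any Zigbee stack actually does it; it simply assumes a retried frame repeats the source address and sequence counter of the original, and the class name, window length, and addresses are all made up for the example.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Drop messages whose (source, sequence) pair was seen recently."""

    def __init__(self, window_seconds=5.0, max_entries=256):
        self.window_seconds = window_seconds
        self.max_entries = max_entries
        self._seen = OrderedDict()  # (source, seq) -> time first accepted

    def accept(self, source, seq, now):
        """Return True for a new message, False for a recent duplicate."""
        # Forget entries older than the window, so a counter that wraps
        # around later is not mistaken for a retry.
        while self._seen:
            oldest_key = next(iter(self._seen))
            if now - self._seen[oldest_key] <= self.window_seconds:
                break
            self._seen.popitem(last=False)

        key = (source, seq)
        if key in self._seen:
            return False

        self._seen[key] = now
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # keep the table bounded
        return True


# Hypothetical button at address 0x1A2B sending counter value 17 twice.
dedup = DuplicateFilter()
print(dedup.accept("0x1A2B", 17, now=0.0))  # original press
print(dedup.accept("0x1A2B", 17, now=0.2))  # retry of the same frame
print(dedup.accept("0x1A2B", 18, now=0.3))  # a genuinely new press
```

With something like this in front of the app, the retried frame would be dropped and the audio track would only be triggered once per press.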

vjv · February 8, 2019, 11:07pm

Any news on your setup? I'm still here awaiting your comments.

srwhite · February 9, 2019, 3:18am

I suppose another update is in order... I have about 22 devices still on SmartThings, roughly 12-15 not connected to anything, and a couple of new devices that I've not yet deployed. But the overwhelming majority have been moved back to Hubitat.

I've now got my system completely split across two hubs/meshes, and have a third hub acting as a coordinator, endpoint for cloud integrations, and dashboard host. Hue lighting is connected to all three hubs so automations can be kept as close as possible to the hub containing the trigger devices.

The Zigbee meshes are running great... Solid, responsive, and no more duplicate messages. There are 94 Zigbee devices on the first, and 95 on the second. I'm still continuing to test and tweak, thanks to some special test builds. But I can say for certain that we now know exactly what was causing the issue before, and how to properly fix it. I'm not sure when those tweaks will make it into production, however.

The most recent challenge has been against some limitations in Hub Link/Link to Hub in a three hub environment. That's not a negative against Hubitat, I'm just looking to do more than the stock apps allow so I've spent most of my free time since Wednesday creating a replacement.

This weekend will be mostly focused on linking devices to the controller hub and rebuilding dashboards. I'm still missing dozens of automations but for now the priority is getting the dashboards rebuilt.

vjv · February 9, 2019, 4:58am

Very nice, I'm glad to know your system is working. It enhances our joy of having HE, and the quality/performance of the system too!

csteele · February 9, 2019, 5:24am

The big question is.. how's the WAF? We know it plummeted a couple weeks back... today?? Better, I hope.

kevin · February 9, 2019, 4:24pm

Really interested in this, and in some available way to get external devices into HE as virtual/mirrored devices, e.g. for implementing MQTT discovery. I hope you create a published hub link protocol for doing this. Keep us (me) updated.

erktrek · February 12, 2019, 12:18am

Thank you for working on this with the HE Folks and not giving up! As has been said we all benefit from your hard work. It is much appreciated.