Interesting. I do agree that making it end user settable (even if buried in a super secret "advanced users only" page) might be interesting for some testing.
Yes, I've also read similar, but there are additional things that should be taken into consideration, such as whether it happens only once and doesn't recur within a minute, etc. There is plenty of leeway in how to implement this and still stay within the standard. I know I would do it differently, but it's not my product. If I can solve this for my mesh and use HE I will (it worked before), otherwise I will revisit the alternative solutions, like Conbee. I would still very much prefer to see a solution that makes it possible to be more inclusive of "non-perfect" devices, especially when that can be done while staying within specs.
This is exactly what I had to do. After a very long while with my Zigbee working as it should, for the past year I've had issues. I installed more repeaters, checked all I could, but once that PAN changed, I was doomed.
They're all on Conbee now, all playing happily together. I am keeping my fingers crossed for some sort of change in the above. Thanks for the investigations into this, @markus.
FWIW, tagging @mike.maxwell to bring this discussion to his attention.
Thanks for pointing this out, we'll look into it.
8 days have passed, the last 4 days my mesh has been stable.
What I want here is to follow up on what I did to stabilize it and why it worked. Hopefully this can help others to diagnose similar issues.
When analyzing the traffic in Wireshark I saw a lot of Beacon packets. Upon further analysis I found that there were multiple other Zigbee meshes around me on the same channel (25). As a result, once every 12 to 24 hours ONE incorrect beacon packet was sent (out of about 80,000 otherwise correct beacon packets per 24 hours, which made up 17.2% of all packets) and caused HE to instantly change its PAN ID (see the first post for the details). This change is due to how Hubitat has chosen to react to these packets; it could be done differently and still be within the specs.
Each time the PAN ID changed, most of my end devices dropped off for 3-12 hours or so (Xiaomi/Aqara devices). Some didn't return at all (1 or 2 every time), some returned after just 1 or 2 hours (other standard Zigbee devices, such as Sonoff sensors).
Since the excessive number of Beacon packets was due to the many controllers around me on the channel, I switched to channel 26. I chose this channel since I know no gateway sold to Chinese consumers will select that channel.
4 days later my logs show a total of 702 beacon packets in FOUR days (0.3% of total packets) - All correct and none of the type that will cause a PAN ID change. My mesh is now stable.
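For anyone who wants to sanity-check the numbers above, here is a rough sketch of the arithmetic in Python. It assumes the 80,000 figure is beacons per 24 hours on channel 25 and that both percentages are the beacons' share of all captured packets. Since the percentages are rounded, the totals are estimates only, and the function names are just made up for illustration.

```python
def per_day(count: float, days: float) -> float:
    """Average number of packets per day over a capture period."""
    return count / days


def estimated_total(beacons: float, beacon_share: float) -> float:
    """Estimate total packets from a beacon count and the beacons' share of all packets."""
    return beacons / beacon_share


# Channel 25: about 80,000 beacons per 24 hours, 17.2% of all packets.
before_per_day = per_day(80_000, 1)
total_before_per_day = estimated_total(80_000, 0.172)

# Channel 26: 702 beacons over four days, 0.3% of all packets.
after_per_day = per_day(702, 4)
total_after_per_day = estimated_total(702, 0.003) / 4

# How many times fewer beacons per day after the channel change.
reduction_factor = before_per_day / after_per_day

print(f"beacons per day before: {before_per_day:.1f}")
print(f"beacons per day after:  {after_per_day:.1f}")
print(f"reduction factor:       {reduction_factor:.1f}")
print(f"estimated total packets per day before: {total_before_per_day:.1f}")
print(f"estimated total packets per day after:  {total_after_per_day:.1f}")
```

So the beacon traffic went from about 80,000 per day to roughly 175 per day, which is why the odds of hitting the one bad beacon dropped so much.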
With that said, if the PAN ID had not been changing, my mesh would have been stable the way it was. Sure, it wasn't an ideal environment, but all my automations fired instantly and I had 0 issues except when the PAN ID changed.
How can you use this information to troubleshoot a mass dropout of devices? Look for a PAN ID change. Since the PAN ID will not update in the UI until after a reboot of the hub, if you have lots of Zigbee devices drop all at once, write down your current 16-bit (2 bytes, 4 hex characters) PAN ID (NOT the 64-bit EXTENDED PAN ID). Then reboot your hub and look for a change in the PAN ID; if it has changed, you have your culprit for the drops. If this happens more than once, change your Zigbee channel (there are whole threads discussing which ones to use; you probably don't want to use the one I chose). When changing your Zigbee channel you will probably have to re-pair many of your devices, but there is no need to delete them from HE; just pair them again.
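If you keep notes of the PAN ID before and after a reboot, the comparison can be sketched like this in Python. This is only a minimal illustration: the helper names and the example values are hypothetical, and you still have to copy the PAN ID from the hub's Zigbee details page by hand.

```python
import string


def parse_pan_id(text: str) -> int:
    """Parse a 16-bit PAN ID written as 4 hex characters (an optional 0x prefix is allowed)."""
    cleaned = text.strip().lower()
    if cleaned.startswith("0x"):
        cleaned = cleaned[2:]
    # A 16-bit value is exactly 4 hex characters; 16 characters would be the extended PAN ID.
    if len(cleaned) != 4 or any(c not in string.hexdigits for c in cleaned):
        raise ValueError(f"expected a 4-character hex PAN ID, got {text!r}")
    return int(cleaned, 16)


def pan_id_changed(before_reboot: str, after_reboot: str) -> bool:
    """Return True if the PAN ID noted before the reboot differs from the one shown after it."""
    return parse_pan_id(before_reboot) != parse_pan_id(after_reboot)


# Hypothetical example values, not taken from any real hub.
if pan_id_changed("1A2B", "C3D4"):
    print("PAN ID changed: this is the likely cause of the mass dropout")
else:
    print("PAN ID unchanged: look for another cause")
```

The length check is there mostly to catch the common mistake of writing down the 64-bit extended PAN ID instead of the 16-bit one.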
Thank you for this explanation. Until now, I had thought that one of your zigbee devices had sent a malformed or otherwise incorrect beacon packet. Makes much more sense that it originated from a neighbor's zigbee mesh.
Hopefully Hubitat addresses this in a future platform release.
Sorry, this needs to be clarified: it is one of MY devices sending the incorrect Beacon packet, so far always one of the Xiaomi/Aqara sensors or buttons. All different types of them have done this. In general they all get it right, but among all those 80,000 beacon packets, one turns out wrong. This 1 incorrect packet is then interpreted by HE as a reason to change the PAN ID, which then starts the whole issue. However, the reason all of these beacon packets are sent is how the controllers around me behave. That is a more complex issue, but suffice it to say that it is according to specs.
You and me both. It is such an unnecessary reason to bring down the whole mesh.
Argh! This happened to me again today! I now have to reconnect all my zigbee devices... What a pain!!!
At least, thanks to @markus’ excellent work, I know what occurred and what my next steps are. Thank you again, Markus!
Really hoping this will be fixed soon. Thanks @mike.maxwell for looking into it! If there is any information I can provide from my hub, let me know. I will be happy to do anything that I can to help resolve this.
PAN ID prior to reboot:
PAN ID following reboot:
I knew I had seen this happen to me before, and I'm almost sure this is the issue with my Zigbee.
Last Feb I posted this.
I feel for you! These are not fun issues to have. Hope you can resolve them permanently!
When all or almost all devices fall off at the same time with the controller still online, there's likely not anything else it can be.
If a few endpoint devices fall off and the repeater they primarily route through changed its address, that is another type of issue with a similar origin. Here it is not so much due to what the hub does as what it didn't do prior to the change. I've seen these things happen around the same time as the nightly cleaning cycle or if a hub is slowed down significantly for other reasons. I still need to look at a lot more traffic in Wireshark to fully understand the origin of this issue, but a fix seems to be to make sure the hub never slows down, which is easier said than done sometimes...
You've struggled more than most with this, and tried everything to solve it. I do hope that now that this has been posted in a clear and harder-to-ignore way it will be prioritized by Hubitat.
I've seen my Zigbee go down completely a few times, yet it didn't seem to exhibit this behaviour. That said, that was a very long time ago, and it hasn't happened in at least the last year or so.
I've also had a very strange thing where I would take down my lighting circuit to add new light switches, or for some other reason, then power the lights back up, and bang! All my Zigbee devices would drop. This one has baffled the hell out of me, as all my ZHA devices still had power; it wasn't the socket circuit I took down, which I could have understood. And this was recent, but one thing I didn't do was check if the PAN ID had changed. This was the proverbial straw which pushed me to move all my Zigbee to deCONZ, and since then, not a single issue.
I have that, mate. I am pretty sure your and @Cobra's ears must have bled at times during my bleating. Thank you both for putting up with me on that front. I am hoping for a quick solution, so I can then move everything back to HE.
The two main things to check for are whether there have been device address changes (the 16-bit part) and whether the PAN ID has changed. There are of course other possible reasons for issues, but these are two of the big ones that can still exist in an otherwise well-built mesh. This is what we are talking about here: a mesh with more than enough repeaters, and good ones at that, all placed in locations as optimal as can be, with the mesh built according to all the best practices. When you then still have issues, there's not much left except these two and "broken" repeaters not doing their job (which in general results in one of these two issues). I do hope this topic helps others; it is not easy for most to understand all of this. In an ideal world only those who like going into this much detail would need to.
This is a great discovery and in fact my hub did change its PAN ID, but only certain devices fell off. Is there a reason only certain devices fall off?
Which aqara sensors did you find troublesome? water, contact or vibration or all of them?
I'm hoping it will help people diagnose this, and more importantly result in some Zigbee stack changes from Hubitat. I'm sure they're looking at it now, and when they've made the changes they'll tell us.
Those that properly detect a PAN ID change will return, though even those that do may be offline for 1-2 hours before working as normal again. This is more of an issue for battery-powered devices since they don't send or, more importantly, listen for traffic all the time.
I've had all three of these fall off, but water and vibration sensors have often stayed (or rather come back within 1-2 hours). Those with most issues for me have been motion, contact and T&H sensors. These three types fell off every time. Some did come back within 24 hours.
We will have an update for this in release 2.2.3. After chatting with Markus, we're going to bump EMBER_PAN_ID_CONFLICT_REPORT_THRESHOLD from 1 up to 10.
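To illustrate why raising the threshold should help, here is a simplified Python model of a conflict-report threshold. It is a sketch only and not the Ember stack's code: the class name is hypothetical, and the one-minute counting window is an assumption about how the setting behaves.

```python
from collections import deque


class PanConflictCounter:
    """Toy model: trigger a PAN ID change only after enough conflict reports within a window."""

    def __init__(self, threshold: int, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds  # assumed one-minute window
        self._reports = deque()  # timestamps of recent conflict reports

    def report(self, timestamp: float) -> bool:
        """Record one conflict report; return True if a PAN ID change would be triggered."""
        self._reports.append(timestamp)
        # Forget reports that are older than the counting window.
        while timestamp - self._reports[0] > self.window_seconds:
            self._reports.popleft()
        return len(self._reports) >= self.threshold


# One stray bad beacon, reported once at t=0 seconds.
old_setting = PanConflictCounter(threshold=1)
new_setting = PanConflictCounter(threshold=10)
print("threshold 1 triggers change: ", old_setting.report(0.0))
print("threshold 10 triggers change:", new_setting.report(0.0))
```

In this model a single bad beacon per 12 to 24 hours never reaches a threshold of 10, while a real, persistent conflict would keep generating reports and still trigger the change.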
Fantastic analysis by @markus and response from @mike.maxwell - for the benefit of all of us as users of Hubitat! This should help some problematic devices stay connected better. You guys freaking rock
I suggest we change @markus's title from King for a day to Emperor for a month.
I don't know if they are doing Emperor for a month yet, but I would certainly second that! @Kings
Let's wait and see if it actually helps before we name him Emperor. LOL.
I expect it will help, though.