Zigbee PAN Identifier Conflict

markus · July 16, 2020, 12:23pm

I have had a stable Zigbee mesh basically from the third week after installing HE last year. That is until recently. I had a minor power surge knock out all my IKEA Trådfri Repeaters so that I had to go and unplug them and plug them back in.
In that mesh I have 7 IKEA Trådfri Repeaters, 2 Xbees, 2 Aqara Outlets and 60+ Xiaomi/Aqara sensors + some other sensors and end devices. A few months ago I added extra repeaters, previously it was half of what I have today. It worked perfectly before this and after adding the additional repeaters a few months ago.

That is the background, now to my issue. Since my mesh was knocked out early last week, I've had problems with most of my sensors being offline when I wake up in the morning; by evening most have reconnected. Those that didn't reconnect by themselves usually just need me to activate them once or twice and they return. Occasionally I've re-paired a device or two.
I set about tracking down why they disconnected basically every night. I ran Wireshark to sniff all traffic during the night and found that the reason they all fell off is that HE sends a PAN Identifier Update packet after receiving just ONE (sometimes two?) PAN Identifier Conflict packet! This is within specs, but it is usually not set up that way since it can cause instability. In my case, having EMBER_PAN_ID_CONFLICT_REPORT_THRESHOLD set to 9 (8 of my repeaters sent this packet) or higher would have kept the whole mesh stable. The trigger for all these one-off Conflict packets was a SINGLE beacon packet (ALL other beacon packets have the extended address) without the extended PAN id in it. The standard says that if a routing device receives a beacon packet with the same PAN id as itself and the extended PAN id is either missing or different from the one the routing device has, it should emit a Conflict packet.
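To make the rule in the previous paragraph concrete, here is a minimal Python sketch of the check a routing device performs on an incoming beacon, as described above. The `Beacon` class, the function name and the example IDs are hypothetical and only for illustration; this is not code from any real Zigbee stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Beacon:
    """Hypothetical, simplified view of a received beacon."""
    pan_id: int                     # 16-bit PAN ID
    extended_pan_id: Optional[int]  # 64-bit extended PAN ID, None if absent


def is_pan_id_conflict(own_pan_id: int, own_extended_pan_id: int,
                       beacon: Beacon) -> bool:
    """Return True if a router would report a conflict for this beacon."""
    if beacon.pan_id != own_pan_id:
        # Different 16-bit PAN ID: another network, nothing to report.
        return False
    # Same 16-bit PAN ID: a conflict exists when the extended PAN ID
    # is missing (None) or differs from our own.
    return beacon.extended_pan_id != own_extended_pan_id


# Made-up IDs, for illustration only.
OWN_PAN = 0x1A2B
OWN_EXT = 0x1122334455667788

print(is_pan_id_conflict(OWN_PAN, OWN_EXT, Beacon(OWN_PAN, OWN_EXT)))  # normal beacon
print(is_pan_id_conflict(OWN_PAN, OWN_EXT, Beacon(OWN_PAN, None)))     # extended PAN ID missing
```

The second call is the case seen in my logs: same 16-bit PAN id, no extended PAN id, so every router that hears it reports a conflict.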

I see multiple issues here, one is that once HE changes PAN id, there is no indication in the UI, the Zigbee settings page doesn't update the PAN id until a reboot is performed. Another is that there really ought to be a notification in the UI, and the system events logs, about something as important as a PAN id change. It would help knowing this without having to use Wireshark!

Lastly, the threshold should be either settable or at the very least set high enough to not cause issues from minor and temporary problems.
These are my findings and thoughts, I'm not impressed by how this is handled. Tolerance in a mesh like Zigbee is important.
I have the full Wireshark logs saved, but will not post them for obvious reasons, if the HE team want to improve and do things better I can provide them if it helps.

EDIT: Added a screenshot of the logs, and yes, I've seen it happen after just 1, but this log shows it to be 2:

JasonJoel · July 16, 2020, 12:56pm

I'm surprised the conflict threshold is 1... I believe the default is 2, based on looking through some docs - I could be wrong though. (For others, the value is settable from 1-63, with the units being #/minute).

The recommendations I've seen online say it should be set to 2-3 for "large zigbee networks", but I think that is just a rule of thumb.
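For anyone wondering what a "#/minute" threshold means in practice, here is a rough Python sketch of a per-minute conflict-report counter. It assumes the hub acts once the number of reports inside a 60-second window reaches the threshold; that trigger rule, the class name and the method names are my assumptions for illustration, not the actual stack code.

```python
from collections import deque


class ConflictReportCounter:
    """Hypothetical sliding-window counter for PAN ID conflict reports."""

    def __init__(self, threshold: int, window_seconds: float = 60.0):
        if not 1 <= threshold <= 63:
            raise ValueError("threshold must be between 1 and 63")
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def report(self, now: float) -> bool:
        """Record one conflict report received at time `now` (in seconds).

        Returns True if this report would trigger a PAN ID change.
        """
        self._timestamps.append(now)
        # Forget reports that have fallen out of the window.
        while now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


# With a threshold of 1, a single report is enough.
print(ConflictReportCounter(threshold=1).report(0.0))

# With a threshold of 9, eight reports in quick succession are not.
counter = ConflictReportCounter(threshold=9)
print(any(counter.report(float(second)) for second in range(8)))
```

Under these assumptions, the 8 reports markus saw would trip a threshold of 1 or 2 but not a threshold of 9.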

JasonJoel · July 16, 2020, 12:59pm

Interesting. I do agree that making it end user settable (even if buried in a super secret "advanced users only" page) might be interesting for some testing.

markus · July 16, 2020, 1:02pm

Yes, I've also read similar, but there are additional things that should be taken into consideration, like whether it happens only once and doesn't recur within a minute, etc. There is plenty of leeway in how to implement this and still stay within the standard. I know I would do it differently, but it's not my product. If I can solve this for my mesh and use HE I will (it worked before), otherwise I will revisit the alternative solutions, like Conbee. I would still very much prefer to see a solution that makes it possible to be more inclusive of "non-perfect" devices. Especially when that can be done and still stay within specs.

Royski · July 16, 2020, 1:17pm

This is exactly what I had to do. After a very long while with my Zigbee working as it should, for the past year I've had issues. I installed more repeaters, checked all I could, but once that PAN changed, I was doomed.

They're all on Conbee now, all playing happily together. I am keeping my fingers crossed for some sort of change in the above. Thanks for the investigations into this @markus

aaiyar · July 16, 2020, 1:21pm

FWIW, tagging @mike.maxwell to bring this discussion to his attention.

mike.maxwell · July 16, 2020, 2:07pm

Thanks for pointing this out, we'll look into it.

markus · July 25, 2020, 12:25am

8 days have passed, the last 4 days my mesh has been stable.

What I want here is to follow up on what I did to stabilize it and why it worked. Hopefully this can help others to diagnose similar issues.

When analyzing the traffic in Wireshark I was seeing a lot of Beacon packets. Upon further analysis I found that there were multiple other Zigbee meshes around me on the same channel (25). As a result, once every 12 to 24 hours ONE incorrect beacon packet was sent (among about 80 000 correct ones per 24 hours, which made up 17.2% of all packets), and it caused HE to instantly change PAN ID (see the first post for the details). This change is due to how Hubitat has chosen to react to these packets; it could be done differently and still be within the specs.

Each time the PAN ID changed, most of my end devices dropped off for 3-12 hours or so (Xiaomi/Aqara devices). Some didn't return at all (1 or 2 every time), some returned after just 1 or 2 hours (other standard Zigbee devices, such as Sonoff sensors).

Since the excessive amount of Beacon packets were due to the many controllers around me on the channel, I switched to channel 26. I chose this channel since I know no gateway sold to Chinese consumers will select that channel.

4 days later my logs show a total of 702 beacon packets in FOUR days (0.3% of total packets) - All correct and none of the type that will cause a PAN ID change. My mesh is now stable.
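For those who want to put the before and after numbers side by side, here is the arithmetic as a small Python sketch. It only uses the figures quoted in this post; since the percentages are rounded, the totals it derives are rough estimates, and the function name is just for illustration.

```python
def implied_total_packets(beacon_count: float, beacon_share: float) -> float:
    """Estimate the total packet count from a beacon count and its share."""
    return beacon_count / beacon_share


# Channel 25: about 80 000 beacons per 24 hours, 17.2% of all packets.
before_total_per_day = implied_total_packets(80_000, 0.172)

# Channel 26: 702 beacons over four days, 0.3% of all packets.
after_total_four_days = implied_total_packets(702, 0.003)
after_beacons_per_day = 702 / 4

# How many times fewer beacons per day after the channel change.
reduction_factor = 80_000 / after_beacons_per_day

print(f"Implied total packets per day on channel 25: {before_total_per_day:,.0f}")
print(f"Implied total packets over four days on channel 26: {after_total_four_days:,.0f}")
print(f"Beacons per day on channel 26: {after_beacons_per_day:.1f}")
print(f"Beacon traffic dropped by a factor of about {reduction_factor:.0f}")
```

That works out to roughly 175 beacons per day on channel 26 against about 80 000 per day on channel 25, a drop by a factor of several hundred.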

With this said, if the PAN id had not been changing, my mesh would have been stable the way it was, sure it wasn't an ideal environment, but all my automations fired instantly and I had 0 issues except when the PAN id changed.

How can you use this information to troubleshoot a mass dropout of devices? Look for a PAN ID change. Since the PAN ID will not update in the UI until after a reboot of the hub, if you have lots of Zigbee devices drop all at once, write down your current 16-bit (2 bytes, 4 hex characters) PAN ID (NOT the 64-bit EXTENDED PAN ID). Then reboot your hub and look for a change in the PAN ID. If it has changed, you have your culprit for the drops. If this happens more than once, change your Zigbee channel (there are whole threads discussing which ones to use, you probably don't want to use the one I chose). When changing your Zigbee channel you will probably have to re-pair many of your devices, but there is no need to delete them from HE, just pair them again.
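If you like to keep notes in a script, the before and after comparison can be sketched in a few lines of Python. The PAN ID values below are made up, and the function names are hypothetical; you still have to read the actual values off the Zigbee settings page by hand.

```python
def parse_pan_id(text: str) -> int:
    """Parse a 16-bit PAN ID written as hex, e.g. "1A2B" or "0x1A2B"."""
    value = int(text.strip(), 16)
    if not 0 <= value <= 0xFFFF:
        # Most likely the 64-bit extended PAN ID was copied by mistake.
        raise ValueError(f"{text!r} is not a 16-bit PAN ID")
    return value


def pan_id_changed(before_reboot: str, after_reboot: str) -> bool:
    """Return True if the PAN ID noted before the reboot differs from the one after."""
    return parse_pan_id(before_reboot) != parse_pan_id(after_reboot)


# Made-up example values, not taken from any real hub.
if pan_id_changed("1A2B", "3C4D"):
    print("PAN ID changed: this explains a mass dropout of devices")
else:
    print("PAN ID unchanged: look for another cause")
```

The range check is there because the most likely mistake is writing down the 64-bit extended PAN ID instead of the 16-bit one.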

aaiyar · July 25, 2020, 2:55am

Thank you for this explanation. Until now, I had thought that one of your zigbee devices had sent a malformed or otherwise incorrect beacon packet. Makes much more sense that it originated from a neighbor's zigbee mesh.

Hopefully Hubitat addresses this in a future platform release.

markus · July 25, 2020, 4:50am

Sorry, this needs to be clarified: it is one of MY devices sending the incorrect Beacon packet, so far always one of the Xiaomi/Aqara sensors or buttons. All the different types of them have done this. In general they all get it right, but among all those 80 000 beacon packets, one turns out wrong. This 1 incorrect packet is then interpreted by HE as a reason to change the PAN ID, which then starts the whole issue. However, the reason all of these beacon packets are sent in the first place is how the controllers around me behave. That is a more complex issue, but suffice to say that it is according to specs.

You and me both, it is such an unnecessary reason to bring down the whole mesh.

Sebastien · July 30, 2020, 8:54pm

Argh! This happened to me again today! I now have to reconnect all my zigbee devices... What a pain!!!

At least, thanks to @markus’ excellent work, I know what occurred and what my next steps are. Thank you again Markus!

Really hoping this will be fixed soon. Thanks @mike.maxwell for looking into it! If there is any information I can provide from my hub, let me know. I will be happy to do anything that I can to help resolve this.

PAN ID prior to reboot:

PAN ID following reboot:

Royski · July 31, 2020, 9:55am

I knew I had seen this happen to me before, and almost sure this is the issue with my Zigbee.

Last Feb I posted this.

markus · July 31, 2020, 10:10am

I feel for you! These are not fun issues to have. Hope you can resolve them permanently!

When all or almost all devices fall off at the same time with the controller still online, there's likely not anything else it can be.

If a few endpoint devices fall off and the repeater they primarily route through changed its address, that is another type of issue which has a similar origin. Here it is not so much due to what the hub does as what it didn't do prior to the change. I've seen these things happen around the same time as the nightly cleaning cycle or if a hub is slowed down significantly for other reasons. I still need to look at a lot more traffic in Wireshark to fully understand the origin of this issue, but a fix seems to be to make sure the hub never slows down, which is easier said than done sometimes...

You've struggled more than most with this, and tried everything to solve it. I do hope that now that this has been posted in a clear and harder-to-ignore way it will be prioritized by Hubitat.

Royski · July 31, 2020, 10:35am

I've seen my Zigbee go down completely a few times, yet it didn't seem to exhibit this behaviour, that said that was a very long time ago, and hasn't happened in at least the last year or so.

I've also had a very strange thing where I would take down my lighting circuit to add new light switches, or some other reason, to then power the lights back up, and bang!, all my Zigbee devices would drop. This one has baffled the hell out of me, as all my ZHA devices were still with power, it wasn't the socket circuit I took down, that I could understand. And this was recent, but one thing I didn't do was check if the PAN had changed. This was the proverbial straw which pushed me to move all my Zigbee to deCONZ, and since, not a single issue.

I have that mate, I am pretty sure yours and @Cobra's ears must have bled at times during my bleating. Thank you both for putting up with me on that front, and I am hoping for a quick solution, so I can then move everything back to HE.

markus · July 31, 2020, 10:44am

The two main things to check for are whether there have been device address changes (the 16-bit part) and whether the PAN ID has changed. There are of course other possible reasons for issues, but these are two of the big ones that can still exist in an otherwise well-built mesh. This is what we are talking about here, a mesh with more-than-enough repeaters, and good ones at that. All placed in as optimal locations as can be with the mesh built according to all the best practices. When you then still have issues, there's not much left except these two and "broken" repeaters not doing their job (which in general results in one of these two issues). I do hope this topic helps others. It is not easy for most to understand all of this; in an ideal world only those who like going into this much detail would need to.

Rxich · July 31, 2020, 10:42pm

This is a great discovery and in fact my hub did change its PAN ID, but only certain devices fell off. Is there a reason only certain devices fall off?
Which aqara sensors did you find troublesome? water, contact or vibration or all of them?

markus · August 1, 2020, 12:27am

I'm hoping it will help people diagnose this, and more importantly result in some Zigbee stack changes from Hubitat. I'm sure they're looking at it now, when they've made the changes they'll tell us.

Those that properly detect a PAN id change will return, though even those that do may be offline for 1-2 hours before working as normal again. This is more of an issue for battery-powered devices since they don't send or, more importantly, listen for traffic all the time.

I've had all three of these fall off, but water and vibration sensors have often stayed (or rather come back within 1-2 hours). Those with most issues for me have been motion, contact and T&H sensors. These three types fell off every time. Some did come back within 24 hours.

mike.maxwell · August 1, 2020, 12:33am

We will have an update for this in release 2.2.3, after chatting with Markus we're going to bump EMBER_PAN_ID_CONFLICT_REPORT_THRESHOLD from 1 up to 10.

Angus_M · August 1, 2020, 9:55am

Fantastic analysis by @markus and response from @mike.maxwell - for the benefit of all of us as users of Hubitat! This should help some problematic devices stay connected better. You guys freaking rock

BrunoVoeten · August 1, 2020, 10:43am

I suggest we change @markus's title from King for a day to Emperor for a month