I have had a stable Zigbee mesh basically from the third week after installing HE last year. That is, until recently. A minor power surge knocked out all my IKEA Trådfri Repeaters, so I had to go and unplug them and plug them back in.
In that mesh I have 7 IKEA Trådfri Repeaters, 2 Xbees, 2 Aqara Outlets and 60+ Xiaomi/Aqara sensors, plus some other sensors and end devices. A few months ago I added extra repeaters; before that I had half of what I have today. The mesh worked perfectly both before and after I added them.
That is the background, now to my issue. Since my mesh was knocked out early last week, I've had problems with most of my sensors being offline when I wake up in the morning. By evening most have reconnected. Those that don't reconnect by themselves usually just need me to activate them once or twice and they return. Occasionally I've had to re-pair a device or two.
I set about tracking down why they disconnected basically every night. I ran Wireshark to sniff all traffic during the night and found that the reason they all fell off is that HE sends a PAN Identifier Update packet after receiving just ONE (sometimes two?) PAN Identifier Conflict packet! This is within spec, but it is not how things are usually set up, since it can cause instability. In my case, having EMBER_PAN_ID_CONFLICT_REPORT_THRESHOLD set to 9 or higher (8 of my repeaters sent this packet) would have kept the whole mesh stable. The trigger for all these one-off Conflict packets was a SINGLE beacon packet without the extended PAN id in it (ALL other beacon packets have it). The standard says that if a routing device receives a beacon packet with the same PAN id as itself, and the extended PAN id is either missing or different from the one the routing device has, it should emit a Conflict packet.
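To make the rule concrete, here is a minimal Python sketch of the conflict check as I described it above. It is only an illustration of my reading of the rule, not HE's or any Zigbee stack's actual code. The names `Beacon` and `is_pan_id_conflict` are made up for this post, and a missing extended PAN id is modelled as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Beacon:
    """Hypothetical, simplified view of a received beacon."""
    pan_id: int
    # None models a beacon that carries no extended PAN id (assumption).
    extended_pan_id: Optional[int]


def is_pan_id_conflict(own_pan_id: int,
                       own_extended_pan_id: int,
                       beacon: Beacon) -> bool:
    """Return True if a router should emit a Conflict packet for this beacon.

    Rule as described in the post: same PAN id as ours, and the extended
    PAN id is either missing or different from ours.
    """
    if beacon.pan_id != own_pan_id:
        # A different PAN id is simply another network, not a conflict.
        return False
    # Covers both "missing" (None) and "different" in one comparison.
    return beacon.extended_pan_id != own_extended_pan_id
```

With this check, a single stray beacon such as `Beacon(pan_id=0x1A2B, extended_pan_id=None)` makes every router on PAN id `0x1A2B` that hears it report a conflict, which matches the burst of one-off Conflict packets in my capture. The PAN id value here is an invented example, not the one from my network.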
I see multiple issues here. One is that once HE changes the PAN id, there is no indication in the UI: the Zigbee settings page doesn't show the new PAN id until a reboot is performed. Another is that there really ought to be a notification in the UI, and in the system event logs, about something as important as a PAN id change. It would help to know this without having to use Wireshark!
Lastly, the threshold should be either settable or at the very least set high enough to not cause issues from minor and temporary problems.
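As a rough illustration of why the threshold matters, here is a small Python sketch of a report counter. It assumes the coordinator changes the PAN id once the number of conflict reports received within a sliding window reaches the threshold. The 60-second window and all names (`ConflictReportCounter`, `would_change_pan_id`) are my own assumptions for the example, not HE's implementation.

```python
from collections import deque
from typing import Deque, Iterable


class ConflictReportCounter:
    """Hypothetical counter of conflict reports inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds  # assumed window length
        self._timestamps: Deque[float] = deque()

    def report(self, now: float) -> bool:
        """Record one report at time `now`; return True if a change would trigger."""
        self._timestamps.append(now)
        # Drop reports that have aged out of the window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


def would_change_pan_id(threshold: int, report_times: Iterable[float]) -> bool:
    """Replay report timestamps (in seconds, ascending) against a threshold."""
    counter = ConflictReportCounter(threshold)
    return any(counter.report(t) for t in report_times)
```

Replaying my case, 8 reports arriving within a few seconds, `would_change_pan_id(1, ...)` triggers on the very first report, while `would_change_pan_id(9, ...)` never triggers, so the mesh would have stayed on its PAN id.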
These are my findings and thoughts. I'm not impressed by how this is handled. Tolerance in a mesh like Zigbee is important.
I have the full Wireshark logs saved, but will not post them for obvious reasons. If the HE team wants to improve this and do things better, I can provide them if it helps.
EDIT: Added a screenshot of the logs, and yes, I've seen it happen after just 1, but this log shows it to be 2: