In Search of A Lost Child (Zigbee)

dan14 · November 18, 2022, 12:13am

Hi there!

I'm back. Unfortunately, my effort to get a firmware update from Ledvance for my problematic LED strips was unsuccessful, and they continue to randomly go unresponsive. I've developed a pretty reliable resolution path, which is to spam the device with about 50 on/off commands in rapid succession, and then reboot the HE. Usually that gets them back. Today I wasn't so lucky. I used the opportunity to collect some more data and have some findings to share.

The last time I had an issue with a device going poof, I discovered null entries in the Zigbee neighbor table:

That screenshot's from about a month ago. Tonight, more null entries. The name of the missing device tonight is "Backsplash". It doesn't appear in the neighbor table, but this null entry does:

After doing my command spam, Backsplash did make an appearance in the neighbor table, but in a state I'd never seen before (I'd never refreshed it immediately after issuing the rapid-fire commands): "In Discovery"

It remained there for only a brief moment before disappearing out of the neighbor table, and unfortunately, it still didn't respond to commands. I cycled through this a couple times with no success.

Ultimately, I had to do the very annoying factory reset procedure for the LED strip and put it in pairing mode, then re-run discovery. Once I did that, HE picked it right back up and I had it responsive again. It appears in the neighbor table routed...through itself, apparently, but it is working:

Whenever I have these issues, I've always got debug logging turned on for the problem device. Usually, when the device comes back alive after an HE reboot, the log records an address change for the device once I send it a successful command. It did so again tonight:

Something caught my eye there tonight. The "old" address was 8D08, just like it was a month ago (for a different unresponsive device, not this same one). Yet, when the device was non-responsive and "in discovery," it showed a different address in the route table.

In fact, I can recall at least one other instance of this happening where I distinctly remember the old, unresponsive address being 8D08.

From this, I can draw a couple conclusions, though they don't yet point to a root cause:

Whenever I have "null" in the route table, something either is or is about to be wrong.
These particular LED strips don't agree with address 8D08. I can't say for certain that having that address always makes them malfunction, but every time they malfunction, I've seen that address involved.

From there, I speculate that the problem device went to move on the mesh, and its own understanding of what address it went to disagreed with HE's understanding of same. HE was trying to send commands out to Backsplash at BF4B, but there was no Backsplash at BF4B, because Backsplash was really at 8D08.

One log entry I didn't get tonight when recovering the device via re-pairing was a zigbee device announcement in the logs. When my command spam+reboot method works, the address change log entry is often near a log entry with the device announcing itself.

Questions I have from this are why null keeps appearing in my route table as a portent of doom on the mesh and why the address 8D08 appears to exist as an event horizon for my devices: lights enter, but cannot escape.

I'd be happy to provide any additional information for developers as may be useful, including log access.

Thanks in advance!

Dan

Rxich · November 18, 2022, 5:08am

This is an amazing discovery. The hub used to have a phantom Zigbee device auto-recovery feature that was removed after a very few (from what I can see) complaints about it affecting pairing of new devices. I suggested making it an on/off toggle, but that was never implemented. In any event, IIRC Zigbee has an address cache that does "something", but let's try to call in someone way more knowledgeable: @Tony

rlithgow1 · November 18, 2022, 11:47am

@dan14 aren't these ZLL strips known to have problems with drop-offs? (I seem to remember seeing something on it a while back.) I have a few of these on my Hue bridge and they're stable there (I keep all ZLL stuff there). @mike.maxwell any thoughts on his findings above?

dan14 · November 18, 2022, 12:00pm

I'm pretty sure these particular ones are ZHA 1.2. Osram only used ZLL in EMEA, IIRC. You may definitely have seen something about them dropping off though, because I've been popping up about every month or so trying to hunt down the root cause.

dan14 · November 18, 2022, 12:02pm

Additional fun fact (sort of): a different LED strip (of the same type) has gone poof on me this morning. So we have a troubleshooting opportunity available.

Tony · November 18, 2022, 5:26pm

Ultimately, there's nothing inherently wrong with nulls appearing in the neighbor or route entry sections; however, it's an indication that a previously joined device has rejoined the network (and the database housekeeping that maintains the correspondence via the IEEE address just hasn't caught up yet).

TL;DR part follows:

There are two ways to identify a particular Zigbee device joined to a mesh: the 64-bit IEEE MAC address (globally unique and burned in at the factory) and a 16-bit short address, assigned at random whenever the device joins or rejoins the mesh. These short (NwkAddr) addresses are used to save overhead (airtime and power; microjoules per transmission really matter in a low-power protocol like Zigbee).

It's normal (and part of the design) for Zigbee devices to join and rejoin the mesh transparently. Given redundancy in terms of repeaters (RF range and capacity to host the required number of end devices), you can safely unplug a Zigbee router at any time without "wreaking havoc on your mesh": it's designed to self-heal and does so without further intervention (usually in tens of seconds, at least when things go by the book).

Should a Zigbee device (repeater or end device, or a mobile device that has moved) no longer be able to communicate with its parent, it has to pick a new one. Whenever it does this, there's a leave/join/rejoin process taking place which results in a new short address assignment. The reason new short IDs get generated at every rejoin is to avoid exhausting the address pool: you can't be sure a device that has left will ever come back, and there's no other mechanism to reclaim unused short IDs that would otherwise be hanging around forever.

Since only short IDs currently known to the new parent will be avoided as assignments during rejoins, there's a potential for conflicts with a NwkAddr currently in use. Should this happen, there's a conflict resolution protocol (using the MAC address) that sorts this out.

So on every join or rejoin a random 16-bit NwkAddr gets assigned by the new parent (hub or router). A previously joined device will be matched to its original instance using its MAC address but be referenced by its new short ID going forward. A 'null' appearing alongside a now unused short ID is an indication that this has occurred; eventually these nulls get purged.
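The bookkeeping described above can be sketched as a toy model. This is only an illustration under stated assumptions, not how the hub actually implements it: the `DeviceTable` class, its method names, and the IEEE address are hypothetical, and the two short addresses are just sample values.

```python
class DeviceTable:
    """Toy model of IEEE-to-short-address bookkeeping (hypothetical, not HE's code)."""

    def __init__(self):
        self.short_by_ieee = {}    # 64-bit IEEE address -> current 16-bit NwkAddr
        self.stale_shorts = set()  # unused short IDs awaiting purge (the "null" rows)

    def on_join(self, ieee, short):
        """Record a join or rejoin; return the device's previous short ID, if any."""
        old = self.short_by_ieee.get(ieee)
        if old is not None and old != short:
            self.stale_shorts.add(old)   # old short ID no longer maps to a device
        self.stale_shorts.discard(short)  # a reused short ID is live again
        self.short_by_ieee[ieee] = short
        return old

    def owner_of(self, short):
        """Return the IEEE address currently using this short ID, or None."""
        for ieee, current in self.short_by_ieee.items():
            if current == short:
                return ieee
        return None

    def purge(self):
        """Housekeeping pass that drops the stale entries."""
        self.stale_shorts.clear()


table = DeviceTable()
strip = 0x8418260000ABCDEF       # made-up IEEE address
table.on_join(strip, 0x8D08)     # initial join
previous = table.on_join(strip, 0xBF4B)  # rejoin with a new random short ID
print(f"old={previous:04X} new={table.short_by_ieee[strip]:04X}")
```

The key point of the sketch is that the IEEE address is the stable key; if the rejoin is never recorded, commands keep going to a short ID that no device answers to.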

Why the devices keep falling off and rejoining is the key issue. A device that isn't polling its parent (or neighbor) regularly can be explicitly told to leave the network (to free up resources for other end devices or to reclaim space in a neighbor table).

Alternatively, a device that's having problems communicating with its parent can attempt to leave that parent and rejoin on its own. It will scan the mesh for routers belonging to its PAN ID; another router hearing the rejoin request may accept it if its own child table isn't full. Either way it's an indication of communication problems...

Ken_Fraleigh · November 18, 2022, 6:09pm

If they have Osram rather than LEDvance printed on the device, or listed on the device page in HE, then they have the troublesome IQHome zigbee chip versus the newer one. My old Osram bulbs that I still have in my drawer of shame have IQHome printed on them somewhere, which I found to be ironic.

Rxich · November 18, 2022, 7:59pm

IMHO the OP's light strip is having trouble communicating... for some reason: home design, wifi, neighbors' nuclear reactors, spy drones, sunshine, bird sitting on the chimney, who knows with RF. In any event there is a solution: add a repeater near the device and hope the light strip joins via the repeater. I believe in zigbee2mqtt you can directly join a device through a specified router, which adds complexity but would be a great feature to have in HE, available as a toggle, like in Z2M. Not sure how that might interact with Zigbee's native routing logic tho.

@dan14 what is your firmware version? My Flex strip is:

endpointId: 01
application: 01
firmwareMT: 1189-001F-00102428
manufacturer: LEDVANCE
model: FLEX Outdoor RGBW
softwareBuild: 00102428

Which IIRC is the same as the Smart+ bulbs 102428 and my Gardenspot LEDS

Further thought on this: I suspect HE is using some technique to mitigate devices falling off. Ever since I did a simultaneous Zigbee channel switch (20<->25) between my C7 & C4 hubs, I often must shut down one hub to successfully pair a device on the other, even though the hubs are physically separated by at least 10 feet & 1 level in the home. If I don't shut down the C4, many devices that I try to pair to my C7 never finish initializing. I guess if it's not HE doing something, maybe it's the Nortek stick on my C4?

Mikeymike88 · November 18, 2022, 9:43pm

I have one of the flex strips too and it has the same firmware as you. I've also noticed the device falling off the network and being unresponsive when automations are scheduled. I added a Tuya Signal Repeater about 5 feet away from this device and it seems to be using it and staying connected a lot more consistently.

It still doesn't act as a repeater, but some of its connection stats look to be a bit better. It's now at LQI: 119, age: 4, inCost: 7, outCost: 5. Before adding the repeater it was always LQI: ~10-60, inCost: 7, and outCost: 0. Just wanted to give feedback on something that helped in my situation.

Rxich · November 18, 2022, 10:32pm

I have at least 6 repeaters (likely way too many!) around my flex strip and it only fell off once, in about a year, when I changed channels (not on the TV).

dan14 · November 18, 2022, 11:33pm

Mine are all these:

That's the latest firmware they got from the old Lightify hub they were on before that thing was discontinued. There's no newer firmware (in fact, no firmware at all apparently) published for it by LEDvance, which is a bummer. All 4 of them have the controller within about a yard of a repeater (either wired Jasco in-wall switch, or a Lightify plug).

dan14 · November 18, 2022, 11:41pm

I'm digging this thread, folks! A lot of great technical information here. I've noticed that my HE appears to be attempting to discover itself, via itself, at this particular moment:

A Journey of Self-Discovery

I've never seen that one before, but from other threads I understand 0000 is always the coordinator.

1924? Who might you be?

If I go to the Zigbee logs in settings, and unplug/replug my current wayward device (named Cabinets), I see this:

Have cabinets, will not illuminate:

However, it remains unresponsive to commands. HE's zigbee radio last knew of it here:

It's not in the route table.

Marco? Polo!

Blasting it with unsuccessful commands gets me an In Discovery on it for the route table (via the coordinator, so direct attach to HE):

That address does match what the zigbee device table shows for it, yet alas, it's not responding.

Back to the drawing board

That quickly goes away, but now the HE is on a journey of self-discovery again:

That's persisting for some time.

Repair with a re-pair

I couldn't get the device to come back at all, so I did the old 10-unplug-10-replug to get it into pairing mode, and re-paired. Here's the zigbee log from that occurring:

A lightbulb moment?

When I re-paired, I got the address change notification in the system log:

That stuck out at me. The "new" address of 1924 is what we were seeing in the zigbee log when I was just unplugging and re-plugging the non-communicating strip, before re-pairing the device. So the device already had a new short address, but the left hand didn't know what the right hand was doing: that's not the address that was in the Zigbee device table, nor in the route table when I spammed it into showing "In Discovery."

Thus, the lost LED strip is getting the new NwkAddr, but what appears not to be happening is this part from @Tony's incredibly detailed post:

A previously joined device will be matched to its original instance using its MAC address but be referenced by its new short ID going forward.

The device isn't being matched up after moving and getting a new short address, so it's effectively excommunicated. HE is still looking for it at the old address, and doesn't seem to be realigning it to the new address from the MAC. Even when the LED strip says hello to the mesh after being unplugged and plugged back in, I think that check might be getting missed, because the device on the new address (which is the lost device we're trying to send commands to) is just kind of ignored.

dan14 · November 19, 2022, 12:28am

I think that's got us way closer to an answer than I've ever been on this. Appreciate everyone who's been following along for the play-by-play!

Tony · November 19, 2022, 2:34am

It could be that your problematic device just isn't being heard properly when it's transmitting... when a node joins/rejoins the network and gets assigned a new short ID, it will broadcast a device announcement (with its MAC and new NwkAddr) so that all devices in the network get this information (that's necessary for address conflict resolution to work in case a duplicate short ID was generated).

Maybe these broadcasts aren't always getting through, hence the coordinator isn't catching up to the new short ID (which could have been assigned by another router).
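For anyone wanting to check what an announcement actually carries, here is a minimal sketch of decoding one. It assumes the standard ZDO Device_annce layout (2-byte NwkAddr, 8-byte IEEE address, 1-byte capability field, all little-endian) with the ZDO transaction sequence byte already stripped. The function name and the sample bytes are made up for illustration; they are not taken from the logs in this thread.

```python
import struct


def parse_device_announce(payload: bytes) -> dict:
    """Decode a Device_annce payload (assumed layout: NwkAddr, IEEE, capability)."""
    if len(payload) != 11:
        raise ValueError(f"expected 11 bytes, got {len(payload)}")
    # "<HQB" = little-endian uint16, uint64, uint8 (2 + 8 + 1 = 11 bytes)
    nwk_addr, ieee_addr, capability = struct.unpack("<HQB", payload)
    return {
        "nwk_addr": nwk_addr,
        "ieee_addr": ieee_addr,
        "router_capable": bool(capability & 0x02),   # device type bit
        "mains_powered": bool(capability & 0x04),    # power source bit
        "rx_on_when_idle": bool(capability & 0x08),  # receiver-on-when-idle bit
    }


# Made-up sample: short address 0x1924, a fictional IEEE address, capability 0x8E.
sample = bytes.fromhex("2419" "efcdab0000261884" "8e")
info = parse_device_announce(sample)
print(f"{info['ieee_addr']:016X} is now at {info['nwk_addr']:04X}")
```

If the coordinator receives a frame like this, it has everything needed to rematch the IEEE address to the new short ID; if the broadcast is lost, it does not.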

Could just be a marginal radio design or a bad sample.. I'd agree with @Rxich's suggestion to move a repeater nearby and see if that helps.

It's not necessary for the device to appear in the hub's neighbor table; evidently its RF link is too marginal for that. But likely it will appear in the neighbor table of a nearby repeater. Hopefully the RSSIs in the high -80s shown in the Zigbee logs would also improve; those are weak numbers.

dan14 · November 19, 2022, 3:00am

Appreciate that! I’ve got a couple different repeaters near each of these problem LED strips. Backsplash and Island have a pair of Jasco switches about 5 feet away, line of sight, and Cabinets has a Jasco dimmer and Lightify plug about 10 feet away, also line of sight.

The announcements when I’m unplugging and replugging appear to be complete messages, so I’m not sure why the MAC isn’t getting picked up and reassociated to the new address.

Note that the zigbee log is from the bottom up, so the un-associated device actually had a pretty decent RSSI when I was doing the unplug/replug routine.

Rxich · November 19, 2022, 4:52am

Hmm, what is/are your wifi channel(s) and your Zigbee channel? Do you have contributing home construction? Foil-faced insulation, plaster with wire mesh, "loud" neighbor wifi (check wifi network signal strength with a network analyzer like WiFiman)?

Tony · November 19, 2022, 5:31am

Interesting that the most recent log entries show the poorest last hop RSSI, given that previous messages appear to be via much stronger repeaters. Perhaps the last log entries were via a direct connection (maybe the other repeaters in the mesh just can't establish persistent links with that device).

Btw, 'In Discovery' appears in a route table entry when the node requesting the route (in this case the hub) doesn't know the target node's current NwkAddr... it broadcasts a route request using the IEEE address instead, then waits for a reply. The target of the request will send back its NwkAddr (via a routed message, if it knows a route back to the hub; if not, it will have to do a route request of its own).
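A rough way to picture the entry's lifecycle is the toy state model below. It is purely illustrative: the `RouteTable` class, its methods, and the timeout behavior are assumptions made to mirror what was observed in this thread (an entry briefly showing 'In Discovery' and then vanishing), not the radio stack's real logic.

```python
from enum import Enum


class RouteStatus(Enum):
    ACTIVE = "Active"
    IN_DISCOVERY = "In Discovery"


class RouteTable:
    """Toy route table keyed by device name (hypothetical, for illustration only)."""

    def __init__(self):
        self.entries = {}  # device name -> (status, next hop or None)

    def request_route(self, target):
        """Hub has a command to send but no usable route: start discovery."""
        self.entries[target] = (RouteStatus.IN_DISCOVERY, None)

    def on_reply(self, target, next_hop):
        """The target answered, so the entry becomes a usable route."""
        if target in self.entries:
            self.entries[target] = (RouteStatus.ACTIVE, next_hop)

    def on_timeout(self, target):
        """No answer arrived: the pending entry is dropped from the table."""
        entry = self.entries.get(target)
        if entry is not None and entry[0] is RouteStatus.IN_DISCOVERY:
            del self.entries[target]


routes = RouteTable()
routes.request_route("Island")
routes.on_reply("Island", next_hop="Jasco switch")  # healthy device answers
routes.request_route("Cabinets")
routes.on_timeout("Cabinets")                       # lost device never answers
print(routes.entries)
```

In this model, a device that never hears the request (or whose reply never reaches the hub) only ever shows up as a short-lived 'In Discovery' row.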

I'm not sure what to make of the route table entries showing 'In Discovery' with node ID 0000...

Ken_Fraleigh · November 19, 2022, 6:19am

That is much less helpful than you think. If those devices don’t have a strong enough connection, then they won’t be chosen as viable routes. I put 5 strong repeaters within 15 feet of the hub, and for devices upstairs I have a few more repeaters (it’s actually quite a lot more, but I have a lot of devices) as close to directly above the hub’s location as possible, and of course there are other repeating devices throughout the house. I have around 17 Ledvance recessed RGBW lights, 4 Gardenspot lights, and 5 of the Flex XL strips, and it’s extremely rare for one of them not to respond. They never drop off the mesh, but I got rid of the Osram bulbs and replaced them with Hue (on the Hue bridge) due to them being unreliable.

dan14 · November 21, 2022, 8:09pm

@Rxich I've got my 2.4GHz Wifi on Channel 11, and HE's Zigbee on Channel 12, so there shouldn't be any overlap there. I have some distant neighbors on channels 6 and 1 for their 2.4GHz, but there's not much I can do about that. My entire apartment is 780 sq. ft. and is pretty open, so I don't think we're seeing signal attenuation issues in this case. The devices that move can still talk to the mesh; they're just not being acknowledged by HE for some reason.
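The channel arithmetic behind that claim can be checked with a quick sketch. It uses the standard 2.4 GHz center-frequency formulas (Zigbee channels 11-26 at 2405 + 5 × (ch − 11) MHz, Wi-Fi channels 1-13 at 2407 + 5 × ch MHz). The nominal widths of 22 MHz for Wi-Fi and 2 MHz for Zigbee are simplifying assumptions, and the function names are made up for this example.

```python
WIFI_WIDTH_MHZ = 22    # assumed nominal Wi-Fi channel width
ZIGBEE_WIDTH_MHZ = 2   # assumed nominal Zigbee channel width


def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Zigbee channel (11-26)."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Wi-Fi channel (1-13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this formula covers Wi-Fi channels 1 to 13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int) -> bool:
    """True if the two channels' nominal frequency spans intersect."""
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (WIFI_WIDTH_MHZ + ZIGBEE_WIDTH_MHZ) / 2


for wifi_channel in (1, 6, 11):
    print(f"Zigbee 12 vs Wi-Fi {wifi_channel}: overlap={overlaps(12, wifi_channel)}")
```

Under these assumptions, Zigbee channel 12 (2410 MHz) sits well clear of Wi-Fi channels 6 and 11, but it falls inside the span of Wi-Fi channel 1 (2412 MHz), so a strong neighbor on channel 1 would be the one to watch.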

Rxich · November 22, 2022, 9:53pm

How is the hub positioned? Anything near/on top of it possibly interfering?
RF is, as you know, 80% science and 20% fairy dust.

Anyway, I think you've covered all the bases. The only other thought I had is that in doing this for over 4 years, I've only seen "low ram" once in the multiple hundreds of devices I own: the new Inovelli Blue switch. Do you have these connected? There are known Zigbee issues with certain switch batches with specific MAC addresses.

Also I've only ever seen "high ram" with my Xbee extenders. What are those devices with the high/low ram notations?