In Search of A Lost Child (Zigbee)

Ken_Fraleigh · November 19, 2022, 6:19am

That is much less helpful than you think. If those devices don’t have a strong enough connection, then they won’t be chosen as viable routes. I put 5 strong repeaters within 15 feet of the hub, and for devices upstairs I have a few more repeaters (it’s actually quite a lot more, but I have a lot of devices) as close to directly above the hub’s location as possible, and of course there are other repeating devices throughout the house. I have around 17 Ledvance recessed RGBW lights, 4 Gardenspot lights, and 5 of the Flex XL strips, and it’s extremely rare for one of them not to respond. They never drop off the mesh, but I got rid of the Osram bulbs and replaced them with Hue (on the Hue bridge) because they were unreliable.

dan14 · November 21, 2022, 8:09pm

@Rxich I've got my 2.4GHz Wifi on Channel 11, and HE's zigbee on Channel 12, so there shouldn't be any overlap there. I have some distant neighbors on channels 6 and 1 for their 2.4GHz, but there's not much I can do about that. My entire apartment is 780sq.ft and is pretty open, so I don't think we're seeing signal attenuation issues in this case. The devices that move can still talk to the mesh, they're just not being acknowledged by HE for some reason.
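
The channel reasoning above can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming the standard 2.4 GHz center frequencies (Zigbee channel k at 2405 + 5(k - 11) MHz, Wi-Fi channel n at 2407 + 5n MHz for channels 1 to 13). The 22 MHz Wi-Fi width and 2 MHz Zigbee width are nominal assumptions, and the function names are made up for illustration.

```python
def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Zigbee (802.15.4) channel, 11..26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Wi-Fi channel, 1..13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1 to 13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int,
             wifi_width_mhz: float = 22.0,
             zigbee_width_mhz: float = 2.0) -> bool:
    """True if the two nominal channel bands intersect (assumed widths)."""
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (wifi_width_mhz + zigbee_width_mhz) / 2


if __name__ == "__main__":
    for wifi in (1, 6, 11):
        print(f"Zigbee 12 vs Wi-Fi {wifi}: overlap={overlaps(12, wifi)}")
```

Under those assumptions, Zigbee 12 is well clear of Wi-Fi 11 but sits inside a Wi-Fi channel 1 signal, so the neighbors on channel 1 are the ones that could matter if their signal is strong enough to reach.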

Rxich · November 22, 2022, 9:53pm

How is the hub positioned? Anything near/on top of it possibly interfering?
RF is, as you know, 80% science and 20% fairy dust.

Anyway, I think you've covered all the bases. The only other thought I had is that in doing this for over 4 years, I've only seen "low ram" once in the multiple hundreds of devices I own: the new Inovelli Blue switch. Do you have these connected? There are known Zigbee issues with certain switch batches with specific MAC addresses.

Also I've only ever seen "high ram" with my Xbee extenders. What are those devices with the high/low ram notations?

dan14 · November 22, 2022, 10:28pm

The "Low Ram" devices are all Jasco switches, which are rock solid on the mesh.

The "High Ram" device that shows rather oddly in those logs is the HE itself -- as I understand it, the coordinator is always node 0000. I don't know why it's trying to discover itself, or route itself through itself. I also don't know why devices seem to have a route loop like that from time to time in the neighbor table.

The hub is physically positioned on the top of a shelf on my desk, with nothing above it. None of my other zigbee devices (those Jasco switches, Osram's gardenspot outdoor LEDs, a pair of Sengled bulbs, a pile of plugs) ever fall out of contact -- just those LED strips. And the zigbee logs seem to show that they're not actually falling off the mesh, just changing addresses and then being ignored.

I feel like if those announcements the LED strips make when they get unplugged/replugged were caught, or there was a periodic poll of the mesh, we might be able to find them and re-associate them to the device.

Rxich · November 22, 2022, 11:39pm

I'm thinking that the Jasco switches, and this is a guess, are using the newer EFR32MG Zigbee chip, or at least that's the only one I've ever seen report "low ram".
And my 2 HE C7s never report themselves as high ram, so I don't know if you got a V 1.2 hub. All I know is that from my perspective it's a bit odd, not necessarily bad tho.

Tagging @support, maybe this is something they have seen before, especially as it seems all WAS well, then not.

Tony · November 23, 2022, 1:10am

When you see a device 'routing through itself' this isn't a routing loop; it's just normally what you'll see in a route table entry (which only shows the first hop of any route) when the first hop to the destination is also the complete route. Meaning, the hub is sending a message to one of its neighbor routers-- the first and last hop are one and the same.

So that jogged my memory regarding the [null, 0000] via [null, 0000]... Whoops, I seem to have forgotten all about this when it came up some time ago in the thread 'Seeing "In Discovery" in the routing table' (post #4 by Tony). It's also normal.

High ram / Low ram concentrators are (AFAIK) features of the Zigbee stack (SiLabs refers to them as 'plug-ins') that can be enabled on some routers to make many-to-one routing more efficient. If a node is a high ram concentrator, it's telling other nodes that it has enough storage to hold a complete routing table and doesn't need a source route (back to the originating node) transmitted along with every message.
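
The first-hop rule for route table entries described above can be sketched in a few lines. This assumes entries are rendered in the "[name, addr] via [name, addr]" form quoted in this thread; the regular expression, the function name, and the "Desk Plug, 9C21" device in the usage line are hypothetical, not anything the hub exposes.

```python
import re
from typing import Optional

# Matches "[<name>, <4 hex digits>] via [<name>, <4 hex digits>]".
# The layout is assumed from the "[null, 0000] via [null, 0000]" text above.
ROUTE_RE = re.compile(
    r"\[(.*?),\s*([0-9A-Fa-f]{4})\]\s+via\s+\[(.*?),\s*([0-9A-Fa-f]{4})\]"
)


def classify_route(line: str) -> Optional[dict]:
    """Return destination, first hop, and whether the route is a single hop."""
    match = ROUTE_RE.search(line)
    if match is None:
        return None  # not a route table entry in the assumed format
    destination = match.group(2).upper()
    first_hop = match.group(4).upper()
    return {
        "destination": destination,
        "first_hop": first_hop,
        # Same address on both sides: the first hop is also the last hop,
        # i.e. the destination is a direct neighbor, not a routing loop.
        "direct": destination == first_hop,
    }


if __name__ == "__main__":
    print(classify_route("[null, 0000] via [null, 0000]"))
    print(classify_route("[Island, 145E] via [Desk Plug, 9C21]"))
```

An entry where both addresses match is reported as direct, which is the "routing through itself" case that looks like a loop but isn't.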

dan14 · November 23, 2022, 1:53am

I really appreciate this in-depth information, @Tony ! I am learning a ton about the way Zigbee works with HE, and it’s very helpful in updating my approach to narrow down the root cause here.

I’d love to be able to actually see the device announcement raw message, so I could check the accuracy of my speculation that my wayward devices are getting their new addresses back to the HE successfully and it’s just not re-associating them. Is there a way to get the zigbee radio logs (not the device logs) to a debug level?

Rxich · November 23, 2022, 3:09am

Here's a thought out of left field: check fcc.io on the web and look up the FCC ID of these strips. Could be they are ZLL, which are known to be problematic.

dan14 · November 23, 2022, 3:24pm

I am about 95% sure these are ZHA 1.2 — back in the days before ZB3, folks who wanted to use these with Hue in the US had to reach out to Osram support to get them flashed to a ZLL firmware. They were sold ZHA in the Americas and ZLL in EMEA. I never did the flash, so they should be ZHA. I’ll check, though.

dan14 · December 6, 2022, 2:44am

I know it’s been a minute, but I’ve confirmed these are all ZHA 1.2. I had another one go “dark” on me today. It showed activity in the zigbee logs, but HE just never went and re-associated the new address to the device. You can see some slow unplug/replugs in this screenshot, followed by Osram’s very annoying 2-off/5-on (x10) reset procedure, then me re-running device discovery, which finally found it and remapped it (“Island”).

I know I’m like the only one with this issue, and it’s only with these specific LED strips, but I really do feel like it’s something with HE not picking it up at the new address. There were approx 20 opportunities in that screenshot to catch 145E and re-map, and I just can’t see a scenario where the data didn’t make it back to the HE for every single one of them. It’s less than 12 feet away.

dan14 · January 26, 2023, 11:51pm

So I know everyone has been waiting on the edge of their seats for the next chapter in this saga. I’ve got an update to share. After a fair amount of Googling, I learned that some zigbee devices just really don’t like low Zigbee channels. I took a look at my 2.4GHz spectrum and determined that I could move higher without much of an impact to congestion, so at the very end of December I did that when I was having these LED strips start to fall off the mesh on a daily basis, sometimes multiple per day.

This has effected a near-complete resolution of the issue — I’ve had only one go missing in the entire month since.

However, I have started to see Jasco switches exhibit the same symptoms, which they never did before. There are a couple of key differences, though: manually actuating the switch immediately returns it to action, and taking no action results in them finding their way home again in no more than 24 hours (the LED strips wouldn’t ever recover; I left one in situ for a week and it never came back).

It makes sense that physically actuating the switch reconnects it to HE: the activity prompts the switch to communicate its state change, and then HE seems to realize the switch isn’t where it thought it was. I don’t have enough log data to determine if the automatic recovery is a result of the switch figuring out it’s been lost and finding its way home, or the HE realizing there’s a problem and correcting it. Given that the Osram LED strips would go MIA forever, I suspect that the Jasco switches have slightly better health monitoring logic and are the initiators of the recovery, but I can’t prove it.

Ultimately, it would be nice if there was some built-in automatic recovery mechanism. The root cause seems to be a device changing its address and not getting that change recorded by the coordinator for one reason or another. An earlier post in this thread suggested that used to exist, but no longer does. Even if it were just an optional, opt-in setting, I’d love to have that available.

rlithgow1 · January 26, 2023, 11:59pm

Is this a non-Plus Z-Wave switch? If so, you need to install the built-in Z-Wave poller. Or is it a Zigbee switch?

dan14 · January 27, 2023, 12:12am

Everything in discussion here is zigbee. I’m aware of the Wi-Fi and zigbee channel overlaps and have deployed accordingly.

rlithgow1 · January 27, 2023, 12:18am

Right, just figured I'd throw it out there for you.

Maybe @mike.maxwell can chime in... I'd also note that not all Zigbee devices like really high channels either. 20 is usually a good midpoint while keeping the APs at 11 or under. I have one Jasco Zigbee outlet, and when I first got it, it gave me a lot of trouble. When I went the same route as you, with the APs set to 11 or below, Zigbee on 20, and the signal strength lowered on all the APs, things stabilized immediately.

dan14 · May 9, 2023, 1:38am

Bumping for visibility, as this bug is still unfixed. I’ve since observed it with Jasco switches as well. For those, at least, manually actuating them will cause the HE to record the new address, and connectivity will resume. However, since there is no way to manually actuate an LED strip, once they’ve changed their address and HE misses the memo, they’re de facto off the mesh until they’re factory reset and re-paired.

rlithgow1 · May 9, 2023, 10:36am

This doesn't seem to be an issue for any others, so I'm not sure it's a "bug" vs. something strange going on with your system. I mean, there are lots of people out there running all Jasco/GE switches without issue. Based on the entire thread, it seems that you have a device somewhere that isn't repeating properly. While I know that you've been through your mesh with a fine-tooth comb, this almost sounds like the classic zll 1.2 bulb issue dropping devices.

dan14 · May 9, 2023, 3:27pm

Well, I’ve removed all those, so it’s not that. Please stop suggesting the problem is me.

I’ve literally reproduced the issue in the logs and provided them here. It’s a bug. If a device changes its address and HE doesn’t catch it when it happens, it will continue messaging to the old address indefinitely, even if the device announces itself again on the mesh when it’s unplugged/re-plugged. It’s ignored. That’s a bug. I know it’s a bug because other device-generated messages, such as an on/off state change, are not ignored — those correctly prompt HE to refresh its address table. HE responds correctly to one type of message event and not the other. A bug.

You can see HE doing the address refresh of a forgotten device from a state change event in the logs. You can see HE ignoring the other messages from a forgotten device in the logs. You can see the stale addresses in the zigbee device table. It’s very obvious what is happening and it’s not something I can fix by redoing my home or changing channels or using other bridges or anything else because it is a bug.
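
For reference, the re-mapping behavior being asked for can be sketched as follows. This is a minimal illustration, not Hubitat's implementation: it assumes the standard ZDO Device_annce message (cluster 0x0013), whose payload is a 1-byte transaction sequence number followed by the 16-bit network address, the 64-bit IEEE address, and a 1-byte capability field, all little-endian. The table layout, the function names, and the example IEEE address and old short address are hypothetical; 145E is the new address mentioned earlier in the thread.

```python
import struct
from dataclasses import dataclass
from typing import Dict

DEVICE_ANNCE_CLUSTER = 0x0013  # ZDO Device_annce cluster ID


@dataclass
class Announce:
    nwk_addr: int    # 16-bit short address, may change after a rejoin
    ieee_addr: int   # 64-bit IEEE address, fixed per device
    capability: int  # MAC capability flags


def parse_device_annce(payload: bytes) -> Announce:
    """Decode a Device_annce payload: seq (1), nwk (2), ieee (8), cap (1)."""
    if len(payload) < 12:
        raise ValueError("Device_annce payload must be at least 12 bytes")
    _seq, nwk, ieee, cap = struct.unpack_from("<BHQB", payload)
    return Announce(nwk_addr=nwk, ieee_addr=ieee, capability=cap)


def apply_announce(table: Dict[int, int], announce: Announce) -> bool:
    """Update a hypothetical IEEE -> short address table.

    Returns True when a known device's short address actually changed.
    Unknown devices are ignored here; pairing is out of scope.
    """
    old = table.get(announce.ieee_addr)
    if old is None:
        return False
    table[announce.ieee_addr] = announce.nwk_addr
    return old != announce.nwk_addr


if __name__ == "__main__":
    ieee = 0x8418260000ABCDEF            # hypothetical IEEE address
    table = {ieee: 0x7A10}               # hypothetical stale short address
    payload = bytes([0x01]) + struct.pack("<HQB", 0x145E, ieee, 0x8E)
    changed = apply_announce(table, parse_device_annce(payload))
    print(changed, f"{table[ieee]:04X}")
```

The point of keying the table on the IEEE address is that an announce from a device at a new short address can still be matched to the existing device record.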

ogiewon · May 9, 2023, 3:52pm

Tagging @mike.maxwell and @gopher.ny from the Hubitat team, as I haven't seen either of them chime in on this particular thread, yet. As they are the two SME's on Hubitat's Zigbee implementation, hopefully they can shed some light on what you've been experiencing on your Hubitat Hub since last November.

mike.maxwell · May 9, 2023, 4:55pm

Would the easiest way of seeing this issue be to fire up the zigbee logs and find a log entry where the displayName is replaced with the device's (probably new) 4-character hex address?

dan14 · May 9, 2023, 5:24pm

Yep! You need to cross-reference the zigbee logs and the HE event logs but it’s all there.

You can see it update the address when you run discovery after factory-resetting a “lost” device, or when a lost device sends a state change: the zigbee logs show the device ID communicating, and the event logs show HE recognizing it and updating the device address table. However, for devices that don’t have manual state changes, the only human intervention possible is to unplug and replug the device. When you do, the zigbee logs record the device (with no name) announcing its existence back on the mesh, but the event logs don’t show it updating the address, and the address table continues to show the previous, now vacant, address for the device.

Let me know if that makes sense?
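
The cross-referencing described above can be sketched as a small script. It assumes you have already copied two things out by hand: the short addresses listed in the zigbee device table, and the short addresses seen in the zigbee logs. Nothing here reads from the hub; the function name is hypothetical, and apart from 0000 and 145E the addresses are placeholders.

```python
from collections import Counter
from typing import Iterable


def unmapped_addresses(known_short_addrs: Iterable[str],
                       seen_short_addrs: Iterable[str]) -> Counter:
    """Count log sightings of short addresses missing from the device table.

    Addresses are 4-character hex strings; comparison ignores case.
    """
    known = {addr.upper() for addr in known_short_addrs}
    return Counter(
        addr.upper() for addr in seen_short_addrs if addr.upper() not in known
    )


if __name__ == "__main__":
    device_table = ["0000", "7A10", "9C21"]        # placeholder table contents
    log_sightings = ["145E", "9C21", "145e", "145E"]
    for addr, count in unmapped_addresses(device_table, log_sightings).items():
        print(f"{addr} seen {count} times but not in the device table")
```

An address that keeps showing up in the logs without a matching table entry is a candidate for a device that rejoined at a new address and was never re-mapped.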