HELP! All devices have fallen off the Zigbee mesh at once - Oz Hubitat user

OK! I am clueless & not sure what is causing this, so I thought I'd ask the community whether anyone has had this issue. Everything was working absolutely fine, then one morning all my devices stopped working with the hub. They all started showing OFFLINE in the My Home app on my Android phone.

I tried rebooting the hub, which I thought would fix it, but it did not. So I downloaded a 'known good' backup copy, did a soft reset on the hub & restored it from the backup, & still nothing seems to be working.

I removed everything, did a soft reset & started fresh by factory resetting my Zigbee switches & adding them back to the hub. I can successfully discover them & add them & they work momentarily. After a few minutes, they stop working.

This switch is close to the HUB in the Study.

I tried this on several other devices & got the same result: they drop off after a few minutes. I tried it with one of the motion sensors as well, with the same result.

I do have 3 Wi-Fi access points. I have used Wi-Fi Analyser & it shows these 3 are running on channels 1, 6 & 11. I have changed the Zigbee channel on the hub from 14 through to 25 & I still have the same issue. I have even reset the Zigbee radio. Nothing seems to be helping. My neighbour's Wi-Fi and all other discoverable SSIDs are on channels between 1 and 13.

My house is a 2-storey, 4-bedroom house of approx. 300 sq m, served by a single hub.

All my devices are Zigbee:

  1. Hubitat Rev C-5 with Platform version hub-2.3.2.134
  2. Roughly 5-7 Nue smart switches (mix of 1, 2 & 4 gang)
  3. Couple of Ikea Tradfri globes
  4. Couple of Nue contact sensors
  5. Couple of Motion Sensors

Other than the above Zigbee devices, I have a Eufy doorbell & a couple of Google Home Mini/screen devices, all connected to the Wi-Fi access points.

Am I missing something?
Is there someone who had this issue?
Is there something I can look into further that can help me fix this issue?
Is it time to get rid of Hubitat?

Any Help is highly appreciated

This pretty much blankets the entire 2.4 GHz spectrum. Your own access points on 1/6/11 leave only Zigbee channels 15/20/25 free, assuming you are not using 40 MHz channel widths.
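
For anyone who wants to check the channel overlap themselves, here is a rough sketch. It assumes the standard 2.4 GHz channel centres (Zigbee channel n at 2405 + 5*(n-11) MHz, Wi-Fi channel k at 2407 + 5*k MHz), a 2 MHz wide Zigbee channel and a 20 MHz wide Wi-Fi channel. The function names are made up for illustration, and it does not model 40 MHz bonding (which shifts the centre frequency) or how strong each signal actually is. This simple model puts the gaps between Wi-Fi 1/6/11 at Zigbee channels 15, 20, 25 and 26, so treat any channel list as a starting point rather than a guarantee.

```python
# Rough frequency-overlap model for 2.4 GHz Zigbee vs Wi-Fi channels.
# Assumption: nominal channel widths only; real-world interference also
# depends on signal strength and traffic, which this ignores.

def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Zigbee channel (11-26)."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz Wi-Fi channel (1-13)."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1 to 13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int,
             wifi_width_mhz: float = 20.0,
             zigbee_width_mhz: float = 2.0) -> bool:
    """True if the two nominal channel bands overlap in frequency."""
    distance = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return distance < (wifi_width_mhz + zigbee_width_mhz) / 2


def free_zigbee_channels(wifi_channels, wifi_width_mhz: float = 20.0) -> list:
    """Zigbee channels that overlap none of the given Wi-Fi channels."""
    wifi_channels = list(wifi_channels)
    return [
        z for z in range(11, 27)
        if not any(overlaps(z, w, wifi_width_mhz) for w in wifi_channels)
    ]


print(free_zigbee_channels([1, 6, 11]))    # own access points only
print(free_zigbee_channels(range(1, 14)))  # every Wi-Fi channel 1-13 in use
```

The second call returns an empty list, which is the "blanketed" situation: once neighbouring networks occupy every Wi-Fi channel from 1 to 13, no Zigbee channel is completely clear and it comes down to picking the least bad one.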

When did you upgrade to 2.3.2? Perhaps try rolling back to 2.3.1?

Anything unusual in the sys logs?

Anything in your network/home automation system change recently?

Have you tried with a device/sensor that you can place close to the hub?

Did you actually try powering down the hub for 5 minutes (unplug from wall not from hub) - after shutting it down in a controlled way of course?

Also any mention of zigbee bulbs or Aqara devices gets my radar bleeping for zigbee issues.

This works only for Z-Wave... You need to leave the hub off for around 20-25 mins to throw Zigbee devices into panic mode (to force a rebuild of the mesh).

Can the hub be rolled back to previous version?

Yeah, I had a motion sensor right next to the hub & it dropped out.

Yes, do it from yourhubip:8081 and Pick Restore Previous Version

Doing it now... Thanks

Switching to version 2.3.0.124, please wait...

Here we go again..

So i restored the Hub to 2.3.0.124 & started adding devices again, it all looked promising but then they all started dropping off again.

Zigbee on Channel 20.

Most recently added switches work for 4-5 minutes & then drop out

Any more ideas?

What do your hub logs show?

First thing I'd try in a situation like this (if a power off/shutdown/restart doesn't improve things) is to look at the hub's Zigbee neighbor table (aka the getChildAndRouteInfo page):

http://your hub's ip/hub/zigbee/getChildAndRouteInfo

This info is dynamic (routers always exchange status info every 15 sec or so); refresh the page to see any changes.
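
If refreshing by hand gets tedious, something like the following could grab a few snapshots at roughly one status interval apart. This is only a sketch: it assumes the page is reachable over plain HTTP from your computer without a login, and the function names and the example IP address are made up. The only thing taken from above is the URL path.

```python
import time
import urllib.request


def route_info_url(hub_ip: str) -> str:
    """Build the getChildAndRouteInfo URL for a hub at the given IP."""
    return f"http://{hub_ip}/hub/zigbee/getChildAndRouteInfo"


def fetch_route_info(hub_ip: str, timeout_s: float = 5.0) -> str:
    """Download the page once and return its body as text."""
    with urllib.request.urlopen(route_info_url(hub_ip), timeout=timeout_s) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_snapshots(hub_ip: str, count: int = 4, interval_s: float = 15.0,
                      fetch=fetch_route_info) -> list:
    """Fetch the page `count` times, sleeping `interval_s` between fetches."""
    snapshots = []
    for i in range(count):
        snapshots.append(fetch(hub_ip))
        if i < count - 1:
            time.sleep(interval_s)
    return snapshots


if __name__ == "__main__":
    # Hypothetical address; replace with your own hub's IP.
    for page in collect_snapshots("192.168.1.50", count=2):
        print(page)
        print("-" * 40)
```

The `fetch` parameter is there so the polling loop can be tried out without a hub on the network.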

It shows status of the most critical Zigbee devices (from a mesh perspective) within a single hop of your hub. They're the only mains-powered devices the hub is currently communicating with; basically they are the jumping off points that are needed to reach every single device in your network (aside from direct connected, non-routing devices that may appear in the child data section of that page).

If the Child Data and Neighbor Table Entry sections are completely blank, that means the hub has no radio connections to any device in your network (likely the Zigbee stack isn't running or the radio isn't working at all).

Neighbor connections must be stable and persistent. If the neighbor table is populated but the devices change frequently (the table is limited to 16 devices and should be populated with the 'best' neighbors), the hub doesn't have a solid RF connection with them. Either one of the routing devices is malfunctioning (perhaps it's repeatedly crashing and restarting) or there is some massive RF interference disrupting the network. Perhaps one key router (which other routers downstream in the mesh need for connectivity to the hub) has an issue and is the root of the problem you're having. Try powering it off and see if the mesh can recover without it.

You don't need to do the 20 minute hub shutdown heal to see the routers in the mesh reconfigure; it should happen within a few 15 second status intervals (the heal may be useful for getting child devices to find better parents if you have moved or added repeaters).

Also note that this page supplies info about the quality of the radio links with each of the neighbors, both transmission and reception metrics. LQI numbers below 200 indicate poor reception quality at the hub end of the link, but the LQI figures shown here tell you nothing about the reception at the remote end of the link; for this you need to look at the outCost figures. Those get ranked on a scale of 1 (best) to 7 (poorest). Typically viable links will show an outCost of three or less. If it's zero, the link isn't supplying status information (it's not uncommon to see one or more of these showing 0; they indicate devices that have the poorest reception, and if your mesh has enough repeaters it will just avoid sending traffic on them).
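
As a quick way to apply those rules of thumb to a single neighbor entry, here is a small sketch. The thresholds (LQI 200, outCost of three or less) are just the ones mentioned above, and the function name and wording of the labels are made up for illustration; it is not anything the hub itself provides.

```python
def assess_link(lqi: int, out_cost: int) -> str:
    """Summarise one neighbor table entry using the rules of thumb above."""
    if not 0 <= lqi <= 255:
        raise ValueError("LQI is expected to be in the range 0-255")
    if not 0 <= out_cost <= 7:
        raise ValueError("outCost is expected to be in the range 0-7")

    # outCost 0 means the remote device is not supplying link status,
    # so nothing can be said about reception at its end.
    if out_cost == 0:
        return "no link status from remote end"

    hub_side = "good" if lqi >= 200 else "poor"        # reception at the hub
    remote_side = "viable" if out_cost <= 3 else "weak"  # reception at the device
    return f"hub reception {hub_side}, remote reception {remote_side}"


print(assess_link(254, 1))
print(assess_link(255, 0))
```

An entry with LQI 255 and outCost 0 is a good example of why both numbers matter: the hub hears the device perfectly, yet has no confirmation that the device hears the hub.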

I don't know how much extra would have to be onboard/in firmware for a future HE hub to better help diagnose these kinds of things... but I'd be all in for it.

Some kind of "getChildAndRouteInfo parser/analyzer" app might be feasible to flag major symptoms like empty neighbor tables, a majority of neighbors with low LQI or zero outCost, or too many neighbor changes in a given interval (that would require analyzing and comparing multiple snapshots of the page). It could provide a good starting point for troubleshooting purposes, though some 'big picture' knowledge would still be necessary to make use of it.
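
A very rough sketch of what such an analyzer might look like, in Python rather than as a hub app. It assumes the neighbor lines look exactly like the ones pasted later in this thread (for example `[Study Switch, 757B], LQI:254, age:3, inCost:1, outCost:0`); the real page may wrap them in HTML or change format between platform versions. All names and the LQI threshold of 200 are illustrative.

```python
import re
from dataclasses import dataclass

# Assumed line format, copied from a neighbor table pasted in this thread.
NEIGHBOR_RE = re.compile(
    r"^\[(?P<name>.*), (?P<node_id>[0-9A-Fa-f]{4})\], "
    r"LQI:(?P<lqi>\d+), age:(?P<age>\d+), "
    r"inCost:(?P<in_cost>\d+), outCost:(?P<out_cost>\d+)$"
)


@dataclass(frozen=True)
class Neighbor:
    name: str
    node_id: str
    lqi: int
    age: int
    in_cost: int
    out_cost: int


def parse_neighbors(text: str) -> list:
    """Extract neighbor entries from page text; other lines are ignored."""
    neighbors = []
    for line in text.splitlines():
        m = NEIGHBOR_RE.match(line.strip())
        if m:
            neighbors.append(Neighbor(
                m["name"], m["node_id"].upper(), int(m["lqi"]),
                int(m["age"]), int(m["in_cost"]), int(m["out_cost"]),
            ))
    return neighbors


def flag_symptoms(neighbors, min_lqi: int = 200) -> list:
    """Return human-readable warnings for one snapshot."""
    if not neighbors:
        return ["neighbor table is empty"]
    flags = []
    low_lqi = [n for n in neighbors if n.lqi < min_lqi]
    zero_out = [n for n in neighbors if n.out_cost == 0]
    if len(low_lqi) * 2 > len(neighbors):
        flags.append("majority of neighbors have low LQI")
    if len(zero_out) * 2 > len(neighbors):
        flags.append("majority of neighbors have outCost 0")
    return flags


def churn(before, after) -> set:
    """Node IDs that appear in exactly one of two snapshots."""
    return {n.node_id for n in before} ^ {n.node_id for n in after}
```

Calling `parse_neighbors` on two page snapshots taken a few minutes apart and passing the results to `churn` gives a crude measure of how much the table is changing; deciding how many changes count as "too many" is still a judgment call.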

One other thing (bit of a long shot) - could you also try deleting the devices from HE and re-pairing them one at a time, allowing some time between pairings to make sure that a particular device isn't at fault? It seems strange that the ones connected directly to HE would fall off as well; these should remain connected regardless of whether the rest of the mesh dies, unless the radio/HE has developed a fault or there is a lot of interference. Your logs might give a clue to what's going on.

Another thing to try would be to turn off your channel 11 Wi-Fi and use Zigbee channel 25 (or select whichever channel has the least interference from your neighbours). Don't forget also to turn off 40 MHz channel bonding if it's on.

Would be nice to have these tools like network mapping built in. It would eliminate the need to fire up the xbee and XCTU!

Thanks @Tony for all your input. This is really helpful; I never knew this bit. Please see below. I have gone through getChildAndRouteInfo. I added a couple of switches & they initially had outCost 1. Now they do not work &, as you described, show outCost 0.

Parent child parameters
EzspGetParentChildParametersResponse [childCount=1, parentEui64=0000000000000000, parentNodeId=65535]

Child Data
child:[Pantry Motion Sensor, EEE1, type:EMBER_SLEEPY_END_DEVICE]

Neighbor Table Entry
[Upstairs Common Area, 05B7], LQI:255, age:3, inCost:1, outCost:0
[Backyard Main Switch, 48D8], LQI:255, age:4, inCost:1, outCost:0
[Study Switch, 757B], LQI:254, age:3, inCost:1, outCost:0
[Formal Switch, 87A7], LQI:255, age:3, inCost:1, outCost:0
[null, AB4D], LQI:253, age:4, inCost:3, outCost:0
[Family Area Switch, ADA7], LQI:254, age:4, inCost:1, outCost:0
[Upstairs Bathroom, FEFF], LQI:255, age:4, inCost:1, outCost:0

Route Table Entry
status:In Discovery, age:0, routeRecordState:0, concentratorType:None, [Upstairs Bathroom, FEFF] via [null, 0000]
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused
status:Unused

The Route Table does have some entries that appear/disappear, as if devices are trying to make a connection with the hub, but they never hold a consistent connection.

LQI seems to be good and above 200 for all devices.

Thanks for your reply, I did try to turn off all my Access points & pair devices & see if it stayed connected, but unfortunately that did not work.

I have very limited access to my Access Point (TENDA NOVA MW3), which works from the App so I cannot select which Channel they run on or change the frequency.

I did a bit more troubleshooting. Here is what I did:

  1. Removed all devices.
  2. Added just 1 IKEA Tradfri RGB bulb.
  3. Added a 1 gang switch. The Ikea globe & switch co-existed for quite some time without any issues.
  4. Introduced my bedroom 2 gang switch. As soon as I added this second switch, the first 1 gang switch stopped working & the outCost for both the 1 gang & 2 gang changed from 1 to 0.
  5. Through all this the Ikea RGB globe stayed connected & had no issues or conflicts.
  6. I then removed both switches again & re-added the 2 gang switch, and it stayed connected.
  7. After 15 minutes I added the 1 gang switch & it broke all switches again.

I'm running out of ideas at this point.

@mike.maxwell can you check his engineering logs? This is really strange

Try restarting all your Wi-Fi devices. Strange as it sounds, I have a few Wi-Fi cameras that have completely shut down my Zigbee network. Once I restarted them, my Zigbee devices all came back.

The reception part (LQI's, inCost figures) shows that the hub appears to be receiving OK; it's the transmission part that seems to be an issue. outCosts get reported as 0 after a number of 15 sec. reporting intervals have elapsed without remote devices sending a valid link status report (which should contain their assessment of reception quality at the remote end of the link).

For a viable link, instead of outCost=0 you want to see a number in the 1-7 range, derived from that remote device's measure of its own LQI. Zigbee's routing strategy then attempts to choose the lowest cost path when routing messages.
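
To make the "lowest cost path" idea concrete, here is a toy sketch. It assumes a path's cost is simply the sum of its per-link costs (each in the 1-7 range) and that the cheapest candidate wins. The path names and numbers are invented, and this is a simplification for illustration, not how the hub's Zigbee stack is actually implemented.

```python
def path_cost(link_costs) -> int:
    """Total cost of a path, taken here as the sum of its link costs."""
    link_costs = list(link_costs)
    if not link_costs:
        raise ValueError("a path needs at least one link")
    for cost in link_costs:
        # A cost of 0 means no link status is available, so the link
        # cannot be treated as usable in this model.
        if not 1 <= cost <= 7:
            raise ValueError(f"link cost {cost} is outside the usable 1-7 range")
    return sum(link_costs)


def cheapest_path(candidates: dict) -> str:
    """Name of the candidate path with the lowest total cost."""
    return min(candidates, key=lambda name: path_cost(candidates[name]))


# Hypothetical example: one poor direct link vs. two good hops via a repeater.
candidates = {
    "direct": [7],
    "via repeater": [1, 2],
}
print(cheapest_path(candidates))
```

In this made-up example the two-hop route wins because 1 + 2 is lower than 7, which is why a nearby device with a marginal direct link can end up routing through a repeater instead.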

Looks like some device had an issue pairing at some point (the null entry in the neighbor table might be a remnant from when it initially joined; that won't cause any issues and will get purged out eventually).

Not sure if this is feasible for you but I'd be curious to see if there is any subset of routing devices that would result in a stable mesh... turn all of them off (not necessary to remove them, just power them off) and turn on one... see if its entry in the neighbor table looks viable. After a while, turn on another, etc. Just in case a single bad device could be causing an issue. Otherwise, I'd have to suspect some transmission issue on the hub side (just going by all the zero outCost figures).

That part is normal; route table entries 'age out' when the path hasn't been in recent use and what is shown there is basically a 'cache' of route discovery requests that the hub makes when it wants to establish a path to a device. Those aren't meant to be static.