Persistent Hub Failures - Strange Issue

Nope. Just a single Unifi UDM Pro....

The key here is that the backup I took hours after it first started acting up again can be restored over and over again, and a one-hour lifespan is guaranteed. When I restore a backup from a week before, I am good for (you guessed it) a week. It's like the database or something else is slowly getting corrupted over time, like a fuse.

I sent a PM to @bobbyD asking if he could review this issue / thread a week ago but haven't heard anything back yet. Perhaps if someone from Hubitat is monitoring the forum they could get word to either Bobby or someone else in order to get me some help?

UPDATE ON ISSUE

So after I wrote my last post the hub crashing issue quickly returned to failing within an hour after restoring from any backup regardless of version. I however then made some headway extending the failure interval to a week by doing a SOFT RESET then a restore from any backup. According to everything I read a "soft reset" should not do anything more than a regular restore from backup so clearly there is something going on here with my situation that isn't normal.

Do you happen to have an old dumb 10/100 switch laying around? Those will typically drop any of the packets that would normally knock the hub offline. If you could try putting that between the hub and everything else it would be a worthwhile test just to rule all of that out.

Also, have you looked at the past logs at all? Can you get a screenshot of the logs leading up to one of the crashes?

I will dig up an old dumb switch, but even if there was a strange network issue, why would doing a restore after a "soft reset" extend the time to the next hub failure to literally exactly a week, when restoring without the soft reset makes it fail hourly like clockwork? The only logical explanation is that something internal to the database is getting "clogged up" over time and eventually causes the hub to crash. Restoring from backup doesn't help, but it seems the soft reset ahead of that cleans something out and resets the one-week clock.

Tried to paste in a couple pictures here but they were illegible. I however placed the two screenshots in this Google Drive Share.

Picture of recent logs while it's rebooting hourly and another about an hour after a soft reset and restore.

Thanks for the help.

If you paste in screenshots they look funny at first while it renders a thumbnail but once you post it and wait a few seconds you can click on it for the full size original.

Yes the inconsistent timing of the issue is making me think it is something on the hub, not the network, that's why I asked for the logs. But that switch test would take the network mostly out of the picture at least. We know it is not a hardware failure since two hubs have done the same thing.

Looks like you are using older drivers from Marcus? All of the warn logs give it away. They seem to be very over-engineered and staff seems to think some of the ZigBee commands they are using can be causing some problems. These have been the suspected problem of another user over in the beta section recently, he just switched to something else this week.

Is there another driver you can switch to, to see if that helps?

Also that device 589 looks like it is possibly trying to create a child device but is missing the driver.

What do you have for LAN/IP based devices integrated to the hub?

Other options for Aqara/Xiaomi drivers...

So here is what I did;

  • I deleted all of my Aqara Leak Sensor devices.
  • Deleted the Aqara device driver.
  • Installed the waytotheweb driver for my single Xiaomi door contact, moved the device over to the new driver, deleted the old driver.
  • The "device 589" seems to be a device bridged over (via the connector) on the Home Assistant side of my setup, so I went in and cleaned up the device that I suspect is the "orphan".
  • I did a hub backup, did a Soft Reset, and restored to the just taken backup after the changes made above.

Now that the one week clock is presumably ticking again my hope is one of the above changes fixed whatever / whomever the culprit has been all along and we will see how long I can go this round.

After a while I will take a look at the logs and see what's cooking. In the interest of not changing too many things at once I am letting the network alone for now.

Will keep you posted...

Thanks for the updates. We have reviewed your engineering log and there is nothing more than what you see in the past logs. It appears that your hub is frequently disconnecting from the cloud and/or network, which would cause LAN connected devices to stop working. It doesn't appear that the issue is on the hub side, so I would continue your efforts to troubleshoot local network connectivity.

Also, performing routine Soft Resets is not necessary, and only gained coincidental positive results. While doing so doesn't really hurt your hub, you could probably be better off resetting your hub's network by pressing and holding the physical button on the hub for ten seconds.

With all due respect your conclusions are not at all consistent with the observations and established facts as I have detailed above.

The frequent disconnects you are seeing in the logs are a combination of the hub dropping off of the network when it "crashes" and me constantly power cycling, restoring various backups, etc. in an attempt to get it back online so my lights and things around the house can continue to work.

My evidence that it indeed is the hub (not an external network issue) is the following;

  • When a brand new hub is introduced (i.e. the replacement you sent me) and I load any backup image through the restore process, it operates perfectly fine for a number of days (let's say a week).
  • After this week or so interval runs out the hub will halt (not respond to pings) and upon power cycle halt again ~ 1 hour later. Like clockwork this hourly cycle will repeat indefinitely.
  • Restoring any backup image will not stop this hourly halting cycle.
  • When the Soft Reset is performed and any backup is restored the hub again stays up for ~ 1 week just as when the replacement hub was first introduced.

The above behavior and precise time intervals show a clear, repeatable pattern based on manipulating the hub (i.e. the cycle of hourly halts changing to one week only after a soft reset is performed), which is completely counter to a random external network issue causing the hub to disconnect.
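A pattern like this can be checked against a simple list of halt times. Below is a minimal sketch in Python. The timestamps are hypothetical placeholders (not real log data from this thread), and the 25% tolerance is an arbitrary assumption.

```python
from datetime import datetime, timedelta

def uptime_intervals(halt_times):
    """Return the gaps between consecutive halts, oldest first."""
    ordered = sorted(halt_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def classify(interval, tolerance=0.25):
    """Label a gap 'hourly', 'weekly', or 'other' within a relative tolerance."""
    targets = {"hourly": timedelta(hours=1), "weekly": timedelta(days=7)}
    for label, target in targets.items():
        if abs(interval - target) <= target * tolerance:
            return label
    return "other"

# Hypothetical halt times: three halts about an hour apart, then
# (after a soft reset and restore) one roughly a week later.
halts = [
    datetime(2023, 1, 1, 8, 0),
    datetime(2023, 1, 1, 9, 5),
    datetime(2023, 1, 1, 10, 2),
    datetime(2023, 1, 8, 9, 30),
]
labels = [classify(gap) for gap in uptime_intervals(halts)]
print(labels)
```

If the gaps cluster tightly around one hour and one week, that points to something periodic rather than random network trouble, which would be expected to produce mostly "other".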

I understand this is a difficult nut to crack. I am not suggesting the hub itself is defective, which the replacement has ruled out. Clearly something is going on from a software standpoint that is causing the hub to fail at these precisely regular intervals. The fact that the soft reset / restore procedure extends this to a week shows something bad is "building up" between soft resets, and a soft reset is "clearing something out" which a regular restore does not.

My hope is that some of the steps I took yesterday may clear whatever this errant thing is but it will take about a week to find out.

If you'd like you can try the next release to see if that resolves the connectivity problem. The next update has some enhancements that may help reconnect your hub. Would you like to try that?

Let's wait and see if the tweaks I did yesterday do anything. If it stays up past the initial week or so crash interval, things might be resolved (I will want it to stay up for a number of weeks before claiming victory).

If it does crash again however on "schedule" sure I would be up for trying a beta release - nothing to lose. Maybe send me a link to the bits so I am prepared?

Again I appreciate the help as this is a difficult problem.

What is your DHCP lease time set to? There is another thread going on with a disconnect issue where he had it set to only 30 mins, and is testing right now to see if increasing that to 24hrs will help.
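To put rough numbers on why a short lease could matter: by the DHCP defaults in RFC 2131, a client tries to renew at half the lease time (T1) and to rebind at 87.5% of it (T2). The sketch below assumes the hub's DHCP client follows those defaults and that every renewal succeeds at T1; neither is confirmed for this hub.

```python
def renewal_schedule(lease_seconds):
    """Return (T1, T2, renewals per day) using the RFC 2131 default ratios.

    Assumes each renewal succeeds at T1, so a fresh lease starts every T1 seconds.
    """
    t1 = lease_seconds * 0.5      # default renewal time
    t2 = lease_seconds * 0.875    # default rebinding time
    renewals_per_day = 86400 / t1
    return t1, t2, renewals_per_day

for label, lease in (("30 min", 30 * 60), ("24 hr", 24 * 3600)):
    t1, t2, per_day = renewal_schedule(lease)
    print(f"{label}: renew at {t1 / 60:.0f} min, rebind at {t2 / 60:.0f} min, "
          f"{per_day:.0f} renewals/day")
```

Under those assumptions a 30 minute lease means a renewal attempt every 15 minutes (96 per day), while a 24 hour lease means one every 12 hours (2 per day), so a short lease gives far more chances for a renewal to go wrong.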

He also said if he simply disconnects the ethernet and reconnects it then the hub comes online again. If yours goes out again, that could be a test to try before rebooting, would be interesting to see if that brings yours back as well. May help to determine the issue.
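If it helps with that test, here is a rough Python sketch that pings the hub on a schedule and records when it stops and starts answering, so you can see whether replugging the cable alone brings it back. `HUB_IP` is a hypothetical address, and the script relies on the system `ping` command (on Windows the exit code can be 0 even for "destination unreachable" replies, so treat the result as approximate there).

```python
import platform
import subprocess
import time
from datetime import datetime, timedelta

HUB_IP = "192.168.1.50"  # hypothetical address; replace with your hub's IP

def ping_once(host, timeout_s=3):
    """Send one echo request via the system ping command; True if it replied."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

def find_outages(samples):
    """Turn (timestamp, reachable) samples into (start, end) outage windows.

    An outage still in progress at the last sample gets end=None.
    """
    outages, start = [], None
    for when, reachable in samples:
        if not reachable and start is None:
            start = when
        elif reachable and start is not None:
            outages.append((start, when))
            start = None
    if start is not None:
        outages.append((start, None))
    return outages

def monitor(host, rounds, interval_s=60):
    """Ping the host `rounds` times, `interval_s` apart, and return outages."""
    samples = []
    for _ in range(rounds):
        samples.append((datetime.now(), ping_once(host)))
        time.sleep(interval_s)
    return find_outages(samples)
```

Calling `monitor(HUB_IP, rounds=120)` would sample for roughly two hours. An outage window that closes right after you replug the cable, with no power cycle, would suggest a link or network problem; one that only closes after a power cycle points more towards the hub locking up.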

Using Unifi UDM Pro also?

My DHCP lease time = 24 hours. Also over the weeks I have tried simply unplugging the network cable and / or going into another port (and switch) to see if I could get it to reply to pings without power cycling. Never the case.

My topology leading up to the hub is;

Unifi UDM Pro > Unifi US 8 150W POE Switch > C-8
(although the C-8 is connected to a POE switch I turned OFF power to the C-8's port just to be safe)

Also tried;

Unifi UDM Pro > Unifi US 8 150W POE Switch > Unifi USW Flex Mini >C-8
(the Mini is a $30 "dumb" / non-POE Unifi switch)

Ok good to know, so not the same issue as the other one on multiple levels. That leads me more towards the hub locking up rather than simply getting knocked offline.

I am hopeful that the device changes you made will be the trick.

Not in that thread. I believe that one is a MikroTik.

I have a full UniFi network and have zero issues with my C8 hub. Nor did I ever have any issues with my old C5 hub.

My topology going to my C8 hub is:

UniFi UDM Pro --> USW 16 POE --> USW Lite 16 POE --> C8 Hub

The port the C8 hub is connected to in the USW Lite 16 POE is not a POE port.

I have a "Fixed IP Address" set in UniFi for the C8 hub.

Also just to clarify / reiterate: my configuration had been static for about two years, with not even a device added (and the same network switches). Only Hub OS updates. My original C-7 worked flawlessly and the C-8 was seamlessly migrated over last March. Not until six months past the introduction of the C-8 did this issue arise out of nowhere.

Maybe I didn't see it, but did this issue start with the latest release? A couple months ago would be about when the 2.3.6.x updates came about. I wonder what would happen if you rolled back to 2.3.5 series?

I would probably consider what BobbyD suggested above, I would jump forward to the 2.3.7 which has many improvements. At this point it isn't much of a Beta, it is nearly release quality in my opinion. Things are pretty stable for most from what I can see. You could always roll back if you had to.

I would say yes, about a couple of months ago. As I said the only thing that has changed in the last couple years with my setup is the occasional hub updates.

@bobbyD - OK, please send me the beta bits and I will install.
