Another hub lockup

Just highlight the table on the Devices page, then copy and paste it into Excel.

edit: Actually it's better to do it from the Z-Wave and Zigbee device pages.

The point of disabling custom code is not to reduce the functionality of your system. It is to find the source of the problem through selective isolation testing.

I agree that is a good strategy if the issues are recurring in a semi-timely fashion (every few days, once a week, etc). If the system only locks up every few weeks+, though, that is a hard road to go down.

The assumption that "outside code" is causing the problem is a bit opaque.

Developers should be given some information about which problems may arise.

  • It is well known if you go compute bound / infinite loop you can cause problems. But what happens after that has been eliminated?

I do think there are system-level memory leaks and db problems that build up over time. Since "outside code" does not allocate or deallocate memory, nor control the db through any API, it stands to reason that it cannot be held responsible for these issues, i.e. it should not be able to cause system-level leaks and lockups.

I expect most if not all the systems having this problem are not "infinite loop" problems, so there should be more debugging going on at the system level vs. "just keep removing things". It would be good to understand what the "offender" is doing wrong / what resource is being depleted. Either that or we need some counter so we know when to do a reboot/restart....
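To make the "counter" idea concrete, here is a rough sketch in Python of something that could run on another machine and watch how long the hub takes to answer. Everything in it is an assumption for illustration: the class name, the 200 ms baseline, the 5x factor and the window size are made up, and it says nothing about what resource is actually being depleted inside the hub.

```python
from collections import deque
from statistics import median


class SlowdownCounter:
    """Tracks recent response times and flags a sustained slowdown."""

    def __init__(self, window=20, baseline_ms=200.0, factor=5.0, min_samples=5):
        self.samples = deque(maxlen=window)  # most recent latencies, in ms
        self.baseline_ms = baseline_ms       # assumed "healthy" response time
        self.factor = factor                 # how much slower counts as trouble
        self.min_samples = min_samples       # do not react to one or two samples

    def record(self, latency_ms):
        """Store one measured response time in milliseconds."""
        self.samples.append(latency_ms)

    def reboot_suggested(self):
        """True when the median of the window exceeds baseline * factor."""
        if len(self.samples) < self.min_samples:
            return False
        return median(self.samples) > self.baseline_ms * self.factor
```

Using the median rather than the mean means a single slow response does not trigger anything; only a window that is mostly slow does.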

Is there really no way to get a memory dump of the running state out of the hub? This instability is becoming reminiscent of the ST instability that could never be explained and caused many to want to abandon it.

As the volunteer supporter of WebCoRE, I've actually thought of coming to you once or twice for insight into things that, I imagine, the rest of us haven't dealt with.

As you've pointed out, there's no API for the DB, but there is the ability to beat the ever lovin' begeezus out of it. :smiley: ApiXU has done that. It was my impression that InfluxDB did that. Both have poll loops that seem to be a lot of code and thus create delays (?). Same with my favorite of the "bad apps": Homebridge.

I think that the "bad app" brush has helped the Community because none of those Apps are anywhere close to as bad as they were.

What about big jumps in upgrades? A person upgrading each release, each hot fix, could have a different result than someone upgrading from two minor releases ago. (2.4 to 2.8)

We know the system "checks" the DB on boot. Hubitat hasn't really come out and said that the nightly DB cleanup is as good as the one done at reboot -- which can cause an older version of DB to be loaded. I assume that doesn't occur nightly.

I am not disagreeing with the position that there seems to be more to be gleaned from the system. I'm just saying there have been incremental improvements, and that baby-with-the-bathwater is not anything I'd try. :slight_smile:

I see lockups on hubs without webcore....so hard to know if my webcore experience matters....

In webcore I have found that state size (which is stored in the database) matters. Things got a lot faster as state size was reduced (as well as reduction in use of atomicState). This suggests to me that the db is a bottleneck in performance (which matches your statements on attributes).

Logging also seems to go to the db, so this may matter also for performance.

Another bit of work I did in webcore was to reduce the number of active threads. It is not clear to me how the system handles running out of threads (or how much queuing it can do if the thread pool is exhausted).
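For anyone unsure what "thread pool exhausted" means in practice, here is a generic Python sketch of a pool with a fixed number of workers and a fixed-size queue that rejects work once both are full. This is not how the hub is implemented (that is exactly what is unknown); `BoundedExecutor` and its parameters are hypothetical names used only to show the failure mode.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """Thread pool that rejects work once workers and queue are both full."""

    def __init__(self, workers=2, queue_size=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # One slot per running task plus one per queued task.
        self._slots = threading.BoundedSemaphore(workers + queue_size)

    def submit(self, fn, *args):
        # Non-blocking acquire: fail fast instead of queuing without limit.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("pool exhausted: task rejected")
        try:
            future = self._pool.submit(fn, *args)
        except Exception:
            self._slots.release()
            raise
        # Free the slot when the task finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

The alternatives to rejecting are to block the caller or to queue without limit. Unbounded queuing hides the overload until memory runs out, and blocking can deadlock if the blocked caller is itself holding something the workers need.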

Others in this thread have suggested networking may be involved. That seems plausible to me, but that said I don't have any first-hand experience that points me to this.

During the lockups, it does feel like some resource is exhausted or deadlocked. Without more information it is hard to figure out whether it is memory, threads, db access or something else. Likely it is a combination of the above, and the more load you put on the system, or the longer you run without a reboot, the more likely the combination occurs.

I have seen folks describe having this without much 3rd-party code, hence it is a combination resource issue that is not properly handled and ends up in a deadlock. So I think the statements to remove 3rd-party apps are mostly not helpful for most folks. What is really happening is that you are reducing load on the system, which does not really make sense as your automations grow in scale.

Finally, I do think inefficient apps can cause this combination to happen more frequently. That is not to say there is anything wrong with inefficient apps: the system should run slower, but it should not lock up. It should handle resource limits in a more graceful manner. So my view is that webcore was inefficient, and I made it much more efficient, but the problem has never been addressed at the system level. Now that there are many more complex apps (rm3, hub connect, etc), we are seeing these lockups that have been there all along and were never addressed.

Personally I'd like to see less work on "new" features and more work on making things stable. Fewer releases at this point would be good. There is no doubt that hub updates have started issues that were not there before, even with 3rd-party apps.

Obviously this will always happen as only "staff" will be across the detail in the product roadmap etc.

It's a risk we all take. :slight_smile:

Suffered my first hub lock today (while a 3 hour drive away from home until Friday). Irritating. I do think - from an ease-of-use perspective - there should be an easy way to reboot the device while remote (if it is at all contactable). Maybe when the app comes up and says "timed out" as it fails to connect to the dashboards, it could provide an option to attempt a remote reboot?

This is not going to help you at present but I have my hub plugged into a WeMo outlet.
As long as my router at home is working I can turn off/on the WeMo outlet remotely to get things working OK.
Not an elegant way of resetting the hub but it does work for me.
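For anyone who wants to automate the smart plug approach, here is a rough Python sketch of a watchdog that could run on another always-on machine. It is only a sketch under stated assumptions: the hub URL is whatever address your hub answers on, and `plug_off` / `plug_on` are placeholders you would have to implement for your own plug (WeMo, TP-Link, etc.); no real plug API is shown here.

```python
import http.client
import time
import urllib.request


def hub_responds(url, timeout=5.0):
    """Return True if the hub answers an HTTP request in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections and HTTP error statuses land here.
        return False


def watchdog(check, plug_off, plug_on, max_failures=3, interval=60.0,
             off_seconds=10.0, sleep=time.sleep, rounds=None):
    """Power-cycle the plug after max_failures failed checks in a row.

    check, plug_off and plug_on are caller-supplied callables (hypothetical).
    rounds=None loops forever; a number limits the loop, which helps testing.
    Returns how many power cycles were performed.
    """
    failures = 0
    cycles = 0
    done = 0
    while rounds is None or done < rounds:
        done += 1
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                plug_off()
                sleep(off_seconds)  # leave the hub unpowered briefly
                plug_on()
                cycles += 1
                failures = 0
        sleep(interval)  # also gives the hub time to boot after a cycle
    return cycles
```

A call would look something like `watchdog(lambda: hub_responds("http://<hub-address>/"), my_plug_off, my_plug_on)`. Requiring several failures in a row avoids cutting power over a single dropped request. It is still a hard power cut, so the "not elegant" caveat applies just the same.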

Yeah, thanks, will definitely set this up using an old tp-link smart plug I have lying around. As you say, doesn't help right now lol. Also, it's not particularly elegant to just pull the power anyway. It would be better to have a controlled remote recycle. Plus I agree with many comments elsewhere that this is really not acceptable performance. The only pieces of custom code I have are the tp-link driver and Magic Home drivers (neither of which is officially supported, as I recall), but nevertheless for this to lock the device up, with no standard ability to remotely reset, is irritating.

I've had hub lockup issues as well.
It would lock up every 3 to 4 days.
I disabled the Chromecast (beta) app and things have settled down.
Do you have either the native Chromecast or Alexa apps running?
Other posts have mentioned these as being problematic.

Ah yes, I do have Chromecast app installed. Maybe it's that.

Is there a PoE adapter we can use?

You mean to power the Hub?

I think this has been mentioned before. I seem to recall someone talking about using a POE adapter to power a C5, but I'd have to do some digging. Are you thinking of being able to power cycle it from something like a Ubiquiti switch by turning off the POE?

S.

Yes that's correct.

I've yet to experience lockups, more slowdowns.

I use Cisco.

https://www.amazon.com/gp/product/B075CQRX2H

2A would be cutting it close, depending on which model of Hubitat hub you are using.

FWIW my C4 is rock solid with this PoE adapter. The C5 claims to only require 1A, and is microUSB so there are other options there.

I have the C5 model. Will this be a problem since it's 2 amp and the C5 is 1 amp?