Another hub lockup

That is a prudent idea. Although I still think it shouldn't be necessary, I will try to expand on this before switching to something else. I had planned on this for a while, but have been waiting for HubConnect to work 100% with user drivers (including initial sync).

Maybe I will quit waiting... I'm still on 1.3, so I guess 1st step is to get on 1.4.


It was probably a 9 week run, in late summer, last year.

Currently I’m rebooting for updates, only.

I believe I've only rebooted “in anger” once... at the behest of Support. It didn’t fix whatever it was.

I rebooted my coordinator hub this morning (that would be the 2nd time) because I found a nasty problem and want to create a Support ticket for it and felt I needed to cover that base too. (It didn’t fix it.)

I 100% agree that it shouldn't be... but I decided to "follow the easier path" and target as much Up Time as I could wring out of them. :slight_smile:

It seemed like a simple thing at the time.. take a Development hub, save the backup, re-purpose it for "upstairs hub" and give it a try via HubLink/Link2Hub. If it didn't work, I was out nothing, because I already had the hub. Now, two MORE hubs later.... :smiley:

I can understand that. I just need things to work - and STAY working without a weekly hub reboot (as some are doing). If splitting between hubs will keep things working, then so be it (although that doesn't say much for the platform in general). I'll get over the cost/unnecessary nature of it, if I can fix it.

Many of us have admitted to being Tinkerers... (guilty :slight_smile: ) And I began my Hubitat experiment weeks too early and so I got "to play" with things that today I wouldn't have to. Others would say they had to FIX THINGS, but me, I got to play. :slight_smile:

My story is that I came to Hubitat with 4 other hubs all linked via ZWave. It was both horrible and wonderful at the same time. So you see, I've been in the Multiple Hub mindset for a very, very long time. I would say, it's a wonder it took me so long. :slight_smile:

I already knew that it wouldn't work worse. :smiley: I did have to work around all the restrictions of Link2Hub/HubLink...


It would be interesting to know if your current configuration with HubConnect is as robust. It was probably a year ago that WebCore was singled out as the poster child for the slowdown/lockup phenomenon; lots of other suspects have since emerged. The increase in the number of LAN integrations does seem to coincide with the increase in slowdown/lockup complaints. I think Dan's theory may have merit.

I'm OK with my 6-week problem-free streak considering I'm still running WebCore and OtherHub; I'd be even more OK with an automated maintenance reboot option.


I'm going to say Yes.

I'm biased, especially since HubConnect is a dream come true for ME. I have some hypotheses, and so far nothing has knocked them over.

  • Radios are our limiting factor. Relative to everything else now, they are 8080s in an i7 world.
  • LAN speeds are many times faster than the 'human speed' events we are tracking.
  • The quad-core CPUs in our hubs are likewise many times faster than 'human speed'.
  • Multiple hubs cut the horror of a dead hub in half.

(Human speed for me is about how quickly we can open then close a door, or step into and out of a motion detector's range. Or, put another way: how many events can we generate per second. Per second is good enough because at 1-gig LAN speed with quad-core processing, a second is a lifetime.)
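The back-of-envelope arithmetic above can be sketched out. A minimal sketch, assuming a 500-byte event message on the wire (that payload size is my assumption, not a measured figure):

```python
# Rough headroom estimate: how many events could a gigabit LAN carry
# per second, versus the roughly one-per-second 'human speed' rate?
# The 500-byte event payload is an assumed figure, not measured.
LINK_BPS = 1_000_000_000   # 1 Gbit/s LAN
EVENT_BYTES = 500          # assumed size of one event message on the wire

def events_per_second(link_bps: int = LINK_BPS, event_bytes: int = EVENT_BYTES) -> float:
    """Upper bound on events/sec the link alone could move."""
    return link_bps / (event_bytes * 8)

# A human generating one event per second uses a vanishing fraction of this.
print(f"{events_per_second():,.0f} events/sec of raw link capacity")
```

Even if the real payload is 10x larger, the link has five orders of magnitude of headroom over 'human speed' — which is why the radios, not the LAN or CPU, look like the bottleneck.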

I watch my HubConnect 'coordinator' log and see grass growing between Events. :slight_smile:

Steve @srwhite sees many Events per second. But then he's got 4x as many devices as me. :slight_smile:

Perhaps there is one magic number by which we could determine the benefit of additional Hubs. I'm pretty sure that with 3 HubConnected hubs I'm below that number, whatever it is. Steve may still be above that and would benefit from one more Hub :slight_smile:

What I also know is that $99 is a tiny investment in making the number of "Angry Events" descend towards zero, compared to the $$ spent on the 160+ devices. I can be cavalier about it because I made that choice. :slight_smile: Self-sanity perhaps :slight_smile: Please, no one do the multiplication for me !! I do not want to know that number. :fearful:


I feel I should clarify my position with HE.
I love it!!
For me it works better than ST did.
I've had a hub since April 2018 so I've seen the leap forward that has been made.
In the early days I had zigbee issues which support identified quickly and to resolve it I basically had to start again. (my fault).
No big deal.
I had a hub that would slow down as I was determined to run webCoRE. I bit the bullet, dumped it and put everything on RM 1.5 I think. Nearly 3 times as many rules but I got there.
HE worked flawlessly when I was away last year and my router died. When I got home I checked the events: my curtains opened and closed on time daily, and my lights turned on and off as well.
I've put feature requests in that have been implemented if viable.
I suppose my frustration could be that everything has worked too well!!
Anyway I will persevere as I'm sure the HE team will resolve the issue.

Well, that's that.


I'm adding in here also. I've had these same problems, with the same rules/code I've been using for months. I started disabling things, and I'm looking into automating a reboot. Two problems: (1) when the hub locks up, everything is down, which is bad; (2) once I automate a reboot, I won't know what is helping or hurting, because I'll be rebooting daily.
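For anyone else exploring the automated-reboot route: the hub exposes a local reboot endpoint that a scheduled job on another machine (a Pi, say) can POST to. A minimal sketch — the IP, port 8080, and the `/hub/reboot` path here are assumptions based on community posts, so verify them against your own hub before scheduling anything:

```python
# Sketch of a scheduled hub reboot, e.g. run nightly from cron on a Pi.
# The hub address and the POST /hub/reboot endpoint are assumptions;
# confirm them on your own hub before relying on this.
import urllib.request

def reboot_url(hub_ip: str, port: int = 8080) -> str:
    """Build the local reboot endpoint URL for a hub."""
    return f"http://{hub_ip}:{port}/hub/reboot"

def reboot_hub(hub_ip: str) -> None:
    """Ask the hub to reboot itself via its local admin endpoint."""
    req = urllib.request.Request(reboot_url(hub_ip), method="POST")
    with urllib.request.urlopen(req, timeout=15) as resp:
        print("Reboot requested, HTTP", resp.status)
```

Paired with a crontab entry like `5 3 * * * python3 reboot_hub.py`, that would give the nightly maintenance reboot — though, as noted above, a daily reboot also masks whatever the real culprit is.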

For those of you who are locking up, how many devices do you have connected to the hub that's locking up? Are they mostly zwave, zigbee, or other? How many rules do you run and how much logging are you doing of those rules and devices?

I have the following.
15 x zigbee motion sensors.
5 x zwave MS
6 x zigbee contact sensors.
4 x zwave CS
2 x wemo outlets
2 x zigbee o/ls
2 x zwave o/ls
15 x virtual devices.
Tado thermostat.
2 x sonos speakers.
2 x Google home devices.
Locative using makerapi.
Some other Fibaro switches and other stuff but as you can see, not that much really.
All my rules are using RM3.
I do use a couple of Cobras and bpts apps.
I never turn any logging on unless I have a rule that does not work, mostly because of how I've defined it. Only then will I turn it on, and I try to remember to turn it off.
I would think my hub is probably below average on what is defined on it.
That's about it though if I dig down there is other stuff I've probably missed.

I should add that since my last update I've disabled the Chromecast integration at the request of support, to see if this resolves my issues.
All good so far and I have everything crossed. :wink:


I had to switch back two days ago. It locks up after being online for less than 15 minutes, and it's very repeatable: it starts with all my zigbee devices not responding, then not long after that my zwave devices stop responding, then I can't access the portal. No issues in 132.
23 Z-Wave devices
21 zigbee
1 Nest thermostat with 9 Protect
1 HubDuino Parent Ethernet with 20 devices
1 HubDuino Parent Ethernet with 8 devices
Pushover driver
InfluxDB (I tried disabling that, made no difference.)
19 Rules in RM
14 Rules in simple lighting
2 Arduino devices
2 homemade apps to control the HubDuino devices
7 virtual devices
Homebridge with only my light switches connected to it.

I have been slowly removing 3rd-party code to try to stop the hub lockups. The system is not worth having if it's all removed. I am down to only the GE Z-Wave Plus light and fan drivers, the Aeon Home Energy Monitor Gen1 driver, the Linear garage door opener user driver, and the Envisalink app and driver (required). I am using the built-in Ecobee integration, the built-in Google Home, the Hue Bridge integration, Lock Code Manager, and the Chromecast integration. Removing the other 3rd-party apps like InfluxDB did seem to reduce the number of lockups, but not eliminate them. So either it's multiple 3rd-party apps causing the issues, or it's a resource limitation. We will never know, because the developers do not want us to.

Perhaps there is a way we could export our devices and apps, parse them, and track them on a spreadsheet to see what's in common? I really love everything else about the system; I just hate when it simply stops working.
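On the export idea: if you enable Maker API, its devices endpoint returns JSON that's easy to flatten into a CSV for side-by-side comparison. A sketch under those assumptions — the id/label/type record shape is what I believe Maker API returns, and the sample records below are made up, so check against your own hub's output:

```python
# Flatten a Maker API-style device list into CSV rows, plus a
# count-by-driver summary -- the "what's in common" view across users.
# The id/label/type record shape is an assumption about Maker API's
# /devices JSON; the sample records are invented for illustration.
import csv, io
from collections import Counter

def devices_to_csv(devices: list[dict]) -> str:
    """Write id,label,type rows to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "label", "type"])
    for d in devices:
        writer.writerow([d.get("id"), d.get("label"), d.get("type")])
    return buf.getvalue()

def count_by_type(devices: list[dict]) -> Counter:
    """How many devices use each driver."""
    return Counter(d.get("type") for d in devices)

# Made-up sample records:
sample = [
    {"id": "1", "label": "Hall Motion", "type": "Generic Zigbee Motion Sensor"},
    {"id": "2", "label": "Porch Light", "type": "Generic Z-Wave Switch"},
    {"id": "3", "label": "Stair Motion", "type": "Generic Zigbee Motion Sensor"},
]
print(count_by_type(sample))
```

With a handful of these CSVs in one spreadsheet, it would be quick to see whether locked-up hubs share a particular driver or integration.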


I have

Zwave Devices: 63
Zigbee Devices: 5
Rules: 24 (built in)
Logging turned off or minimum on everything.
Integrations: Hue, Chromecast, Ecobee, Google home, Lock Code Manager, Mode Manager, and Envisalink integration (3rd party)

Just highlight the table on the Devices page then copy and paste it into excel.

edit: Actually it's better to do it from the zwave and zigbee devices page.

The point of disabling custom code is not to reduce the functionality of your system. It is to find the source of the problem through selective isolation testing.


I agree that is a good strategy if the issues are recurring in a semi-timely fashion (every few days, once a week, etc). If the system only locks up every few weeks+, though, that is a hard road to go down.

The assumption that "outside code" is causing the problem is a bit opaque.

There should be some information for developers about what problems may arise.

  • It is well known that if you go compute-bound or hit an infinite loop, you can cause problems. But what happens after that has been eliminated?

I do think there are system-level memory leaks and DB problems that build up over time. Since "outside code" does not allocate/deallocate memory, nor control the DB through any API, it stands to reason that it cannot be held responsible for these; it should not be able to cause system-level leaks and lockups.

I expect most if not all of the systems having this problem are not "infinite loop" cases, so there should be more debugging going on at the system level, versus "just keep removing things". It would be good to understand what the "offender" is doing wrong and what resource is being depleted. Either that, or we need some counter so we know when to do a reboot/restart...

Is there really no way to get a memory dump of the running state out of the hub? This instability is becoming reminiscent of the ST instability that could never be explained and caused many to want to abandon it.


As the volunteer supporter of WebCoRE, I've actually thought of coming to you once or twice for insight into things I imagined the rest of us haven't dealt with.

As you've pointed out, there's no API for the DB, but there is the ability to beat the ever-lovin' begeezus out of it. :smiley: ApiXU has done that, and it was my impression that InfluxDB did too. Both have poll loops that seem to be a lot of code and thus create delays (?). Same with my favorite of the "bad apps": Homebridge.

I think that the "bad app" brush has helped the Community because none of those Apps are anywhere close to as bad as they were.

What about big jumps in upgrades? A person upgrading each release, each hot fix, could have a different result than someone upgrading from two minor releases ago. (2.4 to 2.8)

We know the system "checks" the DB on boot, which can cause an older version of the DB to be loaded. Hubitat hasn't really come out and said that the nightly DB cleanup is as good as the one done at reboot; I assume that check doesn't occur nightly.

I am not disagreeing with the position that there seems to be more to be gleaned from the system. I'm just saying there have been incremental improvements, and that baby-with-the-bathwater is not anything I'd try. :slight_smile:

I see lockups reported on hubs without webCoRE, so it's hard to know if my webCoRE experience matters...

In webcore I have found that state size (which is stored in the database) matters. Things got a lot faster as state size was reduced, as did reducing the use of atomicState. This suggests to me that the DB is a bottleneck in performance (which matches your statements on attributes).

Logging also seems to go to the DB, so it may matter for performance as well.

Another bit of work I did in webcore was to reduce the number of active threads. It is not clear to me how the system handles running out of threads (or how much queuing it can do if the thread pool is exhausted).

Others in this thread have suggested networking may be involved. That seems plausible to me, but that said, I don't have any first-hand experience that points me to it.

During the lockups, it does feel like some resource is exhausted or deadlocked. Without more information it is hard to figure out whether it is memory, threads, DB access, or something else... Likely it is a combination of the above, and the more load you put on the system, or the longer it runs without a reboot, the more likely that combination occurs.

I have seen folks describe having this problem without much 3rd-party code, hence it is likely a combined resource issue that is not properly handled and ends up in the deadlock. So I think the advice to remove 3rd-party apps is mostly not helpful for most folks. What it really does is reduce load on the system, which is not a sustainable answer as your automations grow in scale.

Finally, I do think inefficient apps can cause this combination to happen more frequently. That is not to say there is anything wrong with inefficient apps; the system should run slower, but it should not lock up. It should handle resource limits more gracefully. I viewed webcore as inefficient and made it much more efficient, but the problem has never been addressed at the system level. Now that there are many more complex apps (RM3, HubConnect, etc.), we are seeing lockups that have been there all along and were never addressed.

