Another hub lockup

I had exactly the same thing. Chromecast disabled but still had a lockup. Then I realised I still had apps trying to send messages to them.
I've now disabled those apps as well and things have settled down.
Do you have any apps still sending messages out? If so try disabling those as well.

1 Like

The underlying question to the Hubitat team is: is it possible to expedite resolution of these issues and move the Chromecast (beta) integration to a fully released product (if it is indeed the cause of these issues)? Any analysis available yet? This integration is hugely important!

Loving the Hubitat but need it to be robust!

The only thing my HE complains about with Chromecast is an unsupported volume command:
java.lang.NullPointerException: Cannot invoke method setVolume() on null object (setVolume)
Though I've only one CC and use it solely for TTS :blush:
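That stack trace looks like the classic case of calling a method on a device reference that came back null. A minimal Java sketch of the failure mode and the guard that avoids it (the `Speaker` interface and method names here are purely illustrative, not Hubitat's actual API):

```java
public class VolumeGuard {
    interface Speaker { void setVolume(int level); }

    // Mirrors the log entry: throws NullPointerException when the
    // device lookup returned null and we call setVolume() anyway.
    static void setVolumeUnsafe(Speaker speaker, int level) {
        speaker.setVolume(level);
    }

    // Guarded version: skips the command when the device reference
    // is null (e.g. device not found or not connected).
    static boolean setVolumeSafe(Speaker speaker, int level) {
        if (speaker == null) return false;
        speaker.setVolume(level);
        return true;
    }
}
```

If the integration's driver did something like the guarded version, the command would just be dropped (or logged) instead of throwing.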

@mike.maxwell wrote in a post that he was in the process of cleaning up the Chromecast Integration code. So, short answer I guess would be, it's being actively attended to.

4 Likes

Hmmm. OK. My hub (265 devices) is continuing to lock up on a 'a little less than weekly' basis. That, along with Alexa continuing to say 'Something must have gone wrong' on any non-trivial light operation, is really hurting the WAF.

I'll disable the Chromecast integration as well (in addition to every scrap of custom code I've ever written, influxDB logging in case the data rate just can't take it, etc). Maybe that'll fix it.

My guess, though, is it's just a plain resource exhaustion problem. My guess is that there really isn't enough RAM on these little embedded boards to run a java app of this scale, which leads to relatively little headroom for any dangling object style memory leak. I bet you any money you like that the linux logs we're not able to see simply have oom-killer nuking the java process periodically, and no watchdog to sort it out.

If I had reasonable confidence that splitting the devices between hubs might resolve the issue, I'd even give that a go, but it's a big pain in the arse job to move all the Z-wave and Zigbee bits around like that, and it'd seriously disrupt the house while I was doing it, so I'd not set out on that unless I was told that I was simply running too many devices for the little hub.

All thoroughly frustrating.

-- Jules

A 2nd Hub doubles the memory, right?

Moving influxDB to a rPi saves hub memory, correct?

Moving all the 'risky code' to its own hub would mean NOT having to disrupt the house. Leave the Z-stuff alone, just move the resource consumers.

Seems like you have a significant potential fix available, by your own logic. What if you're right? :slight_smile:

1 Like

My support call never got responded to after the initial "Reboot it and I'll go look at it", and now here I am traveling for work and it appears the hub is completely down, like, not even in the DHCP reservations down. (the portal says it last heard from the hub thursday)

Asked my wife who is staying at her moms taking care of a few things if she had been home and she summed it up best:

"No, I haven't gotten any push notifications, I assumed after all the time and money you put into that system that if I didn't get a push it was all fine..."

That stinks.

I bought 2 TPLink Kasa WiFi outlets last week for this exact scenario... Then at least I can hard reboot it by turning the outlet on/off.

It's annoying to have to have that kind of backup plan, but for now it is what it is...
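The outlet-as-reboot-switch idea can be taken one step further into a simple automatic watchdog: poll the hub with some health check and, after N consecutive failures, power-cycle it through the smart outlet. A hedged Java sketch of just the decision logic (the health check and the Kasa outlet control are stand-ins passed in from outside; neither Hubitat nor the Kasa API is modelled here):

```java
import java.util.function.BooleanSupplier;

public class HubWatchdog {
    private final int threshold;   // consecutive failures before acting
    private int misses = 0;

    public HubWatchdog(int threshold) { this.threshold = threshold; }

    // Run one health check; returns true when this check decided
    // to power-cycle the hub.
    public boolean check(BooleanSupplier hubIsHealthy, Runnable powerCycleOutlet) {
        if (hubIsHealthy.getAsBoolean()) {
            misses = 0;                // healthy: reset the failure counter
            return false;
        }
        misses++;
        if (misses >= threshold) {
            powerCycleOutlet.run();    // e.g. toggle the outlet off then on
            misses = 0;                // start counting again after the reboot
            return true;
        }
        return false;
    }
}
```

Requiring several consecutive misses avoids power-cycling the hub over one dropped ping, which matters since a hard power cut mid-write is exactly the kind of thing that corrupts a database.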

InfluxDB is on the naughty list and has been said by others to be responsible. I would disable that first and see how things progress from there. Disabling a lot at once doesn't help narrow things down as easily.

It might be interesting to disable the Echo Skill and see if that makes any difference.

I saw a log entry the other day that looked like the Alexa cloud socket timed out, and at the same time I got an alarm from my node-red hub performance monitor showing higher than normal response times, i.e. higher hub load. If the Echo Skill is not using async calls, it could be hanging up the hub when the cloud is laggy or unresponsive. It's been seen before that not using async calls can hang up the hub, so it makes sense. Of course it could also have been the other way around, where something unknown in the hub caused high load, and this caused the Alexa socket to break; maybe it timed us out. Or it could all be a coincidence.
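The sync-vs-async distinction above is worth making concrete. A minimal Java sketch of why a blocking cloud call can stall an event thread: with synchronous I/O the event handler waits out the whole request, so a laggy cloud delays everything queued behind it, while handing the wait to a background thread returns control immediately. (`slowCloudCall` is a stand-in for a laggy Alexa socket; this is an illustration of the general principle, not Hubitat's internals.)

```java
import java.util.concurrent.*;

public class AsyncSketch {
    // Stand-in for a slow/laggy cloud request (~200 ms).
    static String slowCloudCall() throws InterruptedException {
        Thread.sleep(200);
        return "ok";
    }

    // Blocking style: the event thread waits for the full round trip.
    // Returns elapsed milliseconds on the calling thread.
    static long blockingEvent() throws Exception {
        long start = System.nanoTime();
        slowCloudCall();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Async style: the wait happens on a background thread and a
    // callback would handle the response; the event thread returns
    // almost immediately.
    static long asyncEvent(ExecutorService background) {
        long start = System.nanoTime();
        background.submit(() -> {
            String response = slowCloudCall();
            // ...handle the response in a callback...
            return response;
        });
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

If the hub's event dispatch has a small fixed thread pool, a few hundred milliseconds of blocking per cloud call adds up quickly under load, which would match the "laggy cloud, slow hub" correlation.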

Yeah. InfluxDB is disabled. Among the list of other stuff that's already disabled is all the Chromecast Integration, and all my custom code. I'll further disable Alexa TTS, Google Home, and Homebridge.

I've got to say, though... if all the stuff that makes it useful is disabled, then there's not a lot of point in carrying on.

I think I'll lob one more call at support, and while that's happening, I'll take another peek at a home-brew home-assistant plus node-red kind of setup. At least I'll get to reuse the zWave and Zigbee sticks, but the other half won't put up with this for very long.

ARRGH! Managed a crash in less than 24 hours there. I rebooted it yesterday... and it had already failed by 8am this morning.

-- Jules

It's possible you have a corrupt DB. I can't speak for Alexa TTS, as I don't use it myself, but many others do and I don't recall anyone stating they narrowed down "crashes" to that. For Homebridge, it really depends on which one. The earlier version that ran on the hub was pointed to by a number of people that had issues (myself included), but the current version is said to not cause any issues.

Google Home I have running myself. No issues I can point to that single it out as a cause for any slowdowns or other problems. Chromecast Integration is beta and suspected. Work is being done to address those concerns. Regular "crashing" hasn't been an issue. I say this not to disbelieve you, but rather to suggest that it may be an extreme slowdown, versus a total lock-up. I was experiencing slowdowns in the middle of the day. Unusual behavior, which I am not experiencing since disabling the Chromecast Integration.

However, I also have the Nest Integration disabled at the moment, and this is not going to immediately narrow things down for me, since I disabled two at once. So far the hub has been stable for several days with both disabled. I'm going to re-enable the Nest Integration today and see if the hub remains stable over the next week.

In terms of 'extreme slowdown' vs 'crashing' or 'locking up', I suppose you're technically correct, but once the system isn't responding to events in a timely fashion, or (as was the case with both my reboots in the last 24 hours) the web interface can't render before the browser times out, the distinction stops mattering in practice.

Ultimately, the problem for me is that the domestic setting can't afford the disruption necessary to spend a lot of time tracking it down. I'll stuff it on a nightly reboot (it won't be the first device in the house like that - both my LTE internet connections fill their connection tables and explode if they're not rebooted every day or so) for the time being, but if support can't rapidly help me get to some sort of a resolution, then I think it's time for me to go find another approach, even if it's home-brew. I at least stand a better chance of seeing what's exploding that way.

Ultimately, I suspect that I'm just not a core user for this stuff. I'm pushing 270 devices installed now, so a system that would be stable for a dozen light bulbs just doesn't cut it for me. I think I'm going to end up hanging out where the crazy people go, which from a quick googling looks like Home Assistant and brushing up my YAML. Great :frowning:

-- Jules

1 Like

Or, you know, buy another hub or two and split the load... :smile:

A lot easier than starting over with a new system - and as a person that runs Home Assistant in one of my homes now, I'll say that it isn't a trivial amount of work... And definitely not guaranteed to work better either on large scale systems.

But that said, Home Assistant can, and does, work fine for many people.

1 Like

Homebrew doesn't seem like it would support this.

Haven't heard of people getting rapid support from the homebrew systems.

If you're experiencing explosions, I can't imagine finding comfort in a system that has such potential to explode and take up all your time.

I'm not sure I understand what you mean here. A core user for Hubitat is pretty much the category you're in.

I can think of nothing less I'd like to spend time on. Hope that's going to help. Good luck to you and I do mean this in all sincerity. Your statements are really at odds with each other. Hope you do find what's best for you.

I've been thinking about that... but the more I think about it, the less obvious it is that it fixes the problem. You're still going to want to reflect all of the devices on all of the hubs, just so that whatever's running on a particular hub can see the states of all the devices. It's a black box, so we've no way of knowing what's causing the load - it might be some of the additional apps, but equally, it might just be the load of carrying around all the device states. Either way... it also has the unpleasant side-effect of making the solution more complex, with more moving parts and more points of failure. From an engineering perspective, it's always preferable to replace a failing component with a single, better component than with multiple equal components that have additional dependencies on each other.

I'm certainly not that keen to start up again elsewhere... but what I am relatively sure of is that future progress will have to rely on me being able to see what's going on during a failure, and that's certainly something I can't do currently.

Shame, really - I was really rather pleased with it when it was only looking after a floor or so of the house.

-- Jules

The hub issues do seem to happen more the more devices there are and the more user code there is. Which sounds like memory leaks or resource exhaustion.

We will never know, though, as Hubitat doesn't expose any meaningful loading statistics.

I will say that if it is resource exhaustion, I would have much rather paid $300 for one high powered hub with more resources than the $300 I paid for 3 separate hubs...

That said, my 3 hub system is set up and working. So it can be done (although not without some good planning and work). Previously simple things like dashboards take a lot more thought - a dashboard can only connect to one hub....

1 Like

I actually didn't find that to be true. I only replicate a small subset of my devices. But I concede that is very dependent on how you are doing logic/RM rules/notifications.

No argument there. But a solution like Home Assistant has MANY tweakable/configurable individual internal components (and many of them NEED to be tweaked to make things work right). Not sure I would call it less complicated, even though it is in 'one box'. It seems like I'm constantly having to fiddle with both YAML and internal component versions on my Home Assistant setup.

Anyway, good luck either way. In the end the right solution is the one that works for YOU.

I will say that if I get to the point that I can't avoid hub lockups, Home Assistant is what I would likely switch to. Either that, or I would rip out the zigbee devices I have and go back to HomeSeer (already own the license, so cost isn't a barrier). Not sure, and hope I don't have to make that decision in the end.

Indeed. I'd expect roughly no support at all. But I would have access to all the gubbins, and I could then potentially fix the problem myself. Not attractive, I'll entirely agree with you, and it's not like I'm claiming that there's a sunny upland anywhere else in particular, but what's entirely clear from today is that a device that can lock up twice in a day isn't something that can keep working here. We'll be back to manual light switches and 'why did you spend all that cash on something that doesn't work' conversations, and life's REALLY too short for that.

Anyway... we'll see if support can cough something up that will help. They've been good with other people in the past, but hubs that aren't stable (at least for those people who have them) look to be a pain point that's not going away, for me at least and, by the look of the number of threads, for some other people too. Equally, I appreciate that the team isn't huge, and can't fix everybody. Sadly... the ultimate result of 'can't fix everybody' is also 'can't retain everybody'.

Maybe I'll check back in a couple of years... I'm sure eventually things will be resolved, or a newer version of the hub will be available by then, with more available resources and less fragility.

1 Like

Yeah. I'm with you on a 'pro' hub. In a relatively real sense, by moving from a hub-based solution to something that I can host on a proper platform, I'm trying to simply remove available resources from the pool of problems. But the Hubitat platform won't (not surprisingly) let me move off their hardware, and they don't make a bigger one. Moving off the hub also gets me something massively more fault-tolerant to run the automation on, and that's a comfort too.

-- Jules

2 Likes