C7 started locking up once a week or so

paganini · August 8, 2022, 7:46pm

Hello,

I've been using my C7 for about nine months now, without problems. Right now, all my automations are on HomeAssistant, and I use the hub to connect my Zigbee and Z-Wave devices.

For about a month now, I have been experiencing random, roughly weekly lock-ups. When it happens, the hub shows a steady green light and does not respond to pings or web requests. I tried unplugging and re-plugging the network cable, with no effect. The switch utility shows the link as active, with successful negotiation at 100Mbps, even when the hub is not responding. There are no heat-generating sources near the device.

I'm currently running 2.3.2.141.

Any help is appreciated. This is a difficult situation because the failure causes all my automations (alarms, light automations, etc) to fail, and some of them must run at odd hours of the night, when nobody is around to look.

Regards

neonturbo · August 8, 2022, 7:51pm

Any errors or warnings in the Logs tab?

Did you go to the logs tab, and over to the App Stats and Device Stats and see if something is consuming excessive hub memory? (Enable all columns under settings in those menus)

paganini · August 8, 2022, 7:59pm

I looked at "Past Logs" and all I could see were innocent looking events right before the hub locked, and then no events until I rebooted it.

I don't see anything showing obvious errors (but let me know if there's any specific place where I should look).

This is what I see in App Stats:

I sorted by "State Size", but it's not clear to me whether this is the most interesting column here.

Looking at Device Stats, I don't see anything particularly interesting or apparently out of line, but please let me know if some particular indicator is of interest.

cgmckeever · August 8, 2022, 8:09pm

@gopher.ny helped me out of this with a CPU change .. YMMV
I DM'd him my router ID and such and then he gave me the recommendation

thebearmay · August 8, 2022, 8:09pm

I normally sort by % of total if I'm looking for a "busy" device or app. Since you say this is a ~weekly occurrence, my guess is that it may be memory related, and would suggest maybe doing an hourly check on the amount of free memory. This can be done manually using http://yourHubIP/hub/advanced/freeOSMemory or if you want to automate the check you could use the Hub Information Driver and a quick rule.
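
For anyone who wants to script the hourly check rather than use a rule, here is a minimal Python sketch. It assumes the freeOSMemory endpoint returns a bare number in the response body (as the values quoted in this thread suggest) and that hub security/login is not enabled; the function names are made up for illustration.

```python
import urllib.request


def parse_free_memory(body: str) -> int:
    """Parse the endpoint's response body, assumed to be a bare integer."""
    return int(body.strip())


def fetch_free_memory(hub_ip: str, timeout: float = 5.0) -> int:
    """Fetch the free OS memory figure from the hub.

    Assumes the hub answers without authentication on plain HTTP.
    """
    url = f"http://{hub_ip}/hub/advanced/freeOSMemory"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_free_memory(resp.read().decode("utf-8"))
```

Calling `fetch_free_memory("yourHubIP")` from an hourly cron job and appending the result to a file would be enough to spot a downward trend over a week.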

paganini · August 8, 2022, 8:26pm

When sorting for % of total , the "Maker API" app is at the top with 0.029% (I suppose this is RAM consumption?).

Looking at freeOSMemory I see 505420, which I suppose to be Kbytes. It was higher and went steadily down, but seems to have stabilized at this level. I may create a Prometheus rule to scrape this and graph it. That would probably produce interesting results.

The theory of a memory leak is plausible, but it's also interesting that the hub OS doesn't take care of this automatically.

I'll keep an eye on this metric to see what happens and report back.

thebearmay · August 8, 2022, 9:39pm

Actually, "% of total" is time, not RAM, but if you have a runaway app or device it will be eating up clock ticks at a noticeable rate.

Correct. Mine usually start out around 490-501 immediately after boot and then slowly decrease. The hub tries to recover, but there are some apps/devices that don't seem to release all of their memory back to the hub (those with a lot of HTTP traffic tend to be my first suspects).

paganini · August 8, 2022, 10:48pm

Basically, yes, CPU time. So, at least for now, a runaway app or device doesn't seem to be the case.

I put together a quick scraper for Prometheus using the file collector. I don't have enough data (yet), but things seem to stabilize around 504k or so. It will be interesting to give it a few days and see how things work out.
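
The scraper itself isn't shown in this thread, so the following is only a rough sketch of how such a thing could look, assuming "file collector" means the node_exporter textfile collector (which reads `*.prom` files from a configured directory). The metric name, file name, and function names are hypothetical.

```python
import os
import tempfile


def render_metric(free_kb: int) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP hubitat_free_os_memory_kb Free OS memory reported by the hub.\n"
        "# TYPE hubitat_free_os_memory_kb gauge\n"
        f"hubitat_free_os_memory_kb {free_kb}\n"
    )


def write_prom_file(directory: str, free_kb: int, name: str = "hubitat.prom") -> str:
    """Write the metric file atomically so a scrape never sees a partial file."""
    final_path = os.path.join(directory, name)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metric(free_kb))
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    return final_path
```

Writing to a temporary file and renaming it is the usual precaution with the textfile collector, since the exporter may read the directory at any moment.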

I may graph temperature as well. This would give a more complete picture. Anything else useful that we can capture via these endpoints?

Regards

thebearmay · August 8, 2022, 11:00pm

CPU load or CPU % (load / 4) is something I always try to watch, as it may indicate a runaway app or device. If the CPU spikes at the same time as the temperature, it’s a real good indication that something is at issue.
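
As a small worked example of the load / 4 arithmetic, the sketch below converts a load value to a percentage (the multiplication by 100 is added here to express it as a percent) and flags the case where CPU and temperature are both high. The threshold values are arbitrary placeholders, not recommended limits.

```python
def cpu_percent(load: float, cores: int = 4) -> float:
    """Convert a load figure to a percentage of total CPU capacity."""
    return load / cores * 100.0


def both_spiking(
    load: float,
    temp_c: float,
    cpu_limit_pct: float = 50.0,  # placeholder threshold
    temp_limit_c: float = 60.0,   # placeholder threshold
) -> bool:
    """Return True when CPU % and temperature exceed their limits together."""
    return cpu_percent(load) > cpu_limit_pct and temp_c > temp_limit_c
```

For instance, a load of 1.0 works out to 25% on this scale.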

paganini · August 27, 2022, 7:31am

Hello there,

Since we spoke last time, I've been capturing the amount of free memory in the C7. The device locked up again today ~21:00. This is the memory profile for the last 17 days:

From the image, it appears we still had plenty of free memory, so this may not be memory related. The device has the green light, but is otherwise dead (doesn't respond to pings and a portscan reveals no ports responding).
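
The "no open ports" check above can be automated so that a lock-up gets noticed (and timestamped) without anyone watching. This is a minimal sketch using plain TCP connection attempts; the port list is an assumption (80 for the hub's web interface, the rest are guesses) and should be adjusted to whatever the hub normally answers on.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def open_ports(host: str, ports=(80, 443, 8080, 8081), timeout: float = 2.0) -> list:
    """Return the subset of the given ports that accept a connection."""
    return [port for port in ports if port_is_open(host, port, timeout)]
```

An empty list from `open_ports("yourHubIP")` on a hub that normally answers would match the "dead" state described here.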

Any further ideas?

Since this started out of nowhere (no updates or anything) I'm concerned about hardware issues, and I wonder what's the warranty on these devices (I found it surprisingly hard to find out on the website.)

kkossev · August 27, 2022, 7:52am

Usually it is very difficult and time consuming to find the core reason for these periodic lockups. In most of the cases the reason is not the hardware.

The easiest workaround is to schedule automatic hub reboots, once per week is a good starting point. You can search for ‘hub reboot’ here in the forum and in HPM.

paganini · August 27, 2022, 8:01am

Yes, I was wondering how to do that. It's curious though why it started so suddenly. This hub has been rock solid for months, going on without a single reboot.

thebearmay · August 27, 2022, 10:05am

You say you are running all of your automations on HA. What protocol are you using to go from HA to HE? The thought here is: if you lost DHCP temporarily, would your connection fail?

rlithgow1 · August 27, 2022, 10:56am

@bobbyD Maybe you can look at @paganini's engineering logs?

bobbyD · August 27, 2022, 11:36am

Sure. @paganini, send me a private message with your hub id and, if the hub is connected to the cloud, I will take a look.

paganini · August 27, 2022, 7:29pm

In general, all my "networking and automation" devices have a fixed IP. I double-checked and this is the case for the hubitat, so unlikely to be a DHCP issue. I may have to configure one of my switch ports as a monitoring port to see if anything at all is coming from the hub's MAC address when it's in this "dead" state.

For HA->HE, I'm using the Maker API App.

paganini · August 27, 2022, 7:32pm

Oh thanks a lot! Will send you a DM with the info.

thebearmay · August 27, 2022, 8:30pm

Depends on where you’re doing it. If you're using the router’s IP reservations, the DHCP service is still handing out the IP when the lease expires; it just knows to hand out the same one.

paganini · August 27, 2022, 10:53pm

Oh, I'm a belt & suspenders type of guy. My DHCP server runs on my Linux router. The hub itself is set to a fixed IP (no DHCP), and on the server I also have a fixed MAC->IP configuration for the hub. I do this for all my devices in case they lose the configuration and reboot with the default of DHCP.

paganini · September 6, 2022, 12:32am

So, just leaving this message here for posterity.

Thanks @bobbyD and everybody else. Log analysis by @bobbyD revealed it to be a hardware issue. I got a new C7 and replaced the old one.

The migration was somewhat painful. I had to remove ALL devices (most of them Z-Wave devices) and re-add them to the new hub. Then there was the fun of HA + Maker API, but it's all done now. On the bright side, the next migration shouldn't hurt so much.