Severe hub CPU load detected

C8 Pro / Latest

I'm logging a location event for severe overload on a regular 5 minute and 5 second interval.
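One way to confirm the cadence is to compute the gaps between consecutive event timestamps. Below is a minimal Python sketch, assuming the timestamps are copied by hand from the location events page. The function name and the sample values are made up for illustration and do not come from a real hub.

```python
from datetime import datetime

def gaps_in_seconds(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    times = sorted(datetime.strptime(s, fmt) for s in stamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

# Hypothetical timestamps, not taken from a real hub.
example = [
    "2024-01-01 10:00:00",
    "2024-01-01 10:05:05",
    "2024-01-01 10:10:10",
]
print(gaps_in_seconds(example))  # [305.0, 305.0]
```

A constant gap suggests the messages follow a reporting schedule rather than the timing of the underlying activity.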

I disabled every APP and that changed nothing.

I see nothing in the live or past logs that corresponds with the overload timestamps.

I've shut down the top four devices by % of busy in the log.

I've shut down the last few devices I've added. (I've seen this in the logs before, but it would quit after 7 or 8 cycles.)

Nothing stops this.

How do I troubleshoot this?

Power cycle it (not just a reboot):

Settings: shutdown
pull power, wait 10-20 sec, power up

That will clear it. But it just comes back. Always.

I need to stop this and find the root cause.

I often travel and need stability. How do I determine a root cause?

Logs are your friend: this one as well as the app/dev logs and the separate Zigbee/Z-Wave live logs. You need to get at least a hint of what is hammering the hub. Either it is one of the apps or devices, or something on one of the meshes is generating lots of interrupts.

Can you show screenshots of your app/dev logs pages, and potentially enable debug on some of your devices, to see if there is a high volume of events occurring? Are you by chance using MakerAPI, or do you have any LAN interfaces/devices?
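If the logs are too long to scan by eye, a rough tally per source can point at the noisiest app or device. This is only a sketch: it assumes you have already copied log rows into (source, message) pairs yourself, and the names below are hypothetical, not a Hubitat export format.

```python
from collections import Counter

def busiest_sources(records, top_n=5):
    """Count log records per source name and return the most frequent ones."""
    counts = Counter(name for name, _message in records)
    return counts.most_common(top_n)

# Hypothetical (source, message) pairs; replace with rows copied from the logs.
sample = [
    ("Hall Motion", "motion active"),
    ("ecobee", "poll"),
    ("ecobee", "poll"),
    ("Hall Motion", "motion inactive"),
    ("ecobee", "poll"),
]
print(busiest_sources(sample))  # [('ecobee', 3), ('Hall Motion', 2)]
```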

-gfa


Have to leave...will post in the AM...THANKS!!!

You should do a soft reset or download a recent backup and then restore it to clean the database. This might solve it.


Thanks! At least for now that seems to have solved the problem. It’s only been about 30 minutes since I did it, but it was happening every five minutes on the nose pretty much before.

Thanks! I never saw anything in the logs that indicated there was a problem except in the location log with the CPU spikes. I’ve restored the database and it seems to have fixed the problem for now, but if it comes back, I’ll do a little more hunting again.

I do have a lot of LAN-based tablets, use SharpTools, online, etc. One of my biggest app hits is usually ecobee. I turned off all my rules and disabled some of those devices as well, and it was still doing the same spike. I do want to get much better at troubleshooting these things.

Thanks. It didn't solve it.

Two Q's.

Are the log messages actually spiking almost exactly every 5 minutes, or is that just when the hub reports it? I think the latter is true from another post I saw.

What is the parameter below telling me?

It is a running 5 minute average as @gopher.ny mentioned in this post:
https://community.hubitat.com/t/2-2-5-109-runtime-stats-question/62360/3?u=ritchierich

I am not sure. Maybe @gopher.ny can chime in and provide input and guidance for you. He or @bobbyD can look at your engineering logs for clues too.

That value is the 5-minute CPU load value, as a tool like top would provide. That is why the message is created every 5 minutes: it takes that long for the value to change in any meaningful way. Also, a high value of 2 over a 5-minute interval means that during that interval there were likely times the CPU was much higher.
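To illustrate why a 5-minute figure hides short peaks, here is a rough Python sketch of an exponentially damped average, which is how Linux computes the load averages that top shows. Whether the hub computes its number exactly this way is an assumption on my part, and the workload below is invented.

```python
import math

def five_minute_load(runnable_samples, step_s=5.0, window_s=300.0):
    """Exponentially damped average of runnable-task counts, one sample per step."""
    decay = math.exp(-step_s / window_s)
    load = 0.0
    history = []
    for n in runnable_samples:
        # Each step keeps most of the old value and blends in the new sample.
        load = load * decay + n * (1.0 - decay)
        history.append(load)
    return history

# Hypothetical workload: idle for 60 s, 8 runnable tasks for 60 s, idle for 180 s.
samples = [0] * 12 + [8] * 12 + [0] * 36
history = five_minute_load(samples)
print(f"peak instantaneous demand: {max(samples)}")
print(f"peak 5-minute load:        {max(history):.2f}")
```

In this made-up case a one-minute burst of 8 runnable tasks never pushes the 5-minute value to 2, so a reported 2 implies either a longer or a heavier burst.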

What goes into calculating that value is complicated and doesn't directly relate to CPU usage. It is more like the number of threads needed to process work, based on CPU usage and other states the system is experiencing at a given time. This means things like waiting for storage or network connectivity can have a significant effect on the value.

What kind of value is meaningful, though, depends on the system's purpose, the hardware involved, and how sensitive it is to delays in processing. In its simplest interpretation you would want to keep the numeric value less than the number of threads the computer's CPU can handle. But that only accounts for pure CPU usage and not external wait states.

I have seen a system run perfectly fine when that value was consistently twice the number of threads the server could produce because of what that system did.

On Hubitat, a value of 2 or greater is likely a very bad thing, because we need our HA platform to respond immediately. As that number grows you run the risk of small delays in automations and such.

Unfortunately it can be very hard to pin down the cause, though. You may want to look at live logging and the Z-Wave/Zigbee logs to see if anything is causing a lot of activity. You may also want to post your app and device pages from the logging tab to see if anything stands out.

Thanks! That's a LOT of detail!

I am hoping you nailed the root cause of this, and it is the LAN. I have some really nice ASUS Mesh routers. Even with every tweak possible to set them to do nothing but route (except the FW), after a period of time they randomly start blocking things they determine to be a risk on a local LAN...local LAN to local LAN. I'm pretty decent at working on routers. I started installing them in 1992 and have worked for ISPs etc.

After swapping it out for a cheap TP-Link router and hard coding three of my wall tablets...(one has been all along for testing) they've responded perfectly. It was a little hard to track down since it wasn't blocking traffic to them from places like SharpTools.

Anyway, I hope to get all my addresses back into the DHCP server this evening. I did the main ones I needed yesterday...but not all of them and the HE is even less happy today running 4+'s. Another indication that you're right it's the LAN.

Hopefully :upside_down_face:


So, I need to watch it for a couple of days, but the problem certainly looks like my ASUS RT-AX92U Mesh routers.

CPU Load/Load% 0.19 / 4.75 %
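For anyone wondering how the two numbers relate: they are consistent with the percentage being the load divided by the number of CPU threads. That relationship and the thread count of 4 are inferred from the figures above, not taken from documentation. A small Python check:

```python
def load_percent(load, threads):
    """Load expressed as a percentage of the available hardware threads."""
    return load / threads * 100.0

def implied_threads(load, percent):
    """Thread count that would make the given load equal the given percentage."""
    return load / (percent / 100.0)

print(round(implied_threads(0.19, 4.75)))  # 4
print(round(load_percent(0.19, 4), 2))     # 4.75
```

By the same arithmetic, the 4+ readings from earlier would be around 100% or more.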

Glad I have a lot of wall tablets because I dim them all when I leave or go to sleep, and it was the most obvious symptom when they constantly responded erratically/intermittently.

Thanks!

This could be an example of when load isn't the best indicator of actual CPU load. I do think there is some kind of issue with the router, and the network latency was likely driving up the load value, but I suspect it wasn't blocking activity from occurring and the message could have been a red herring. TCP delays like the ones that happened could cause issues if you run out of a connection pool or something like that, but processing behind the scenes was likely running fine as long as the network delays were not causing connection pool contention.

Either way, it's awesome that you found it, and I'm glad I could help you find it.

A quick Google search showed some interesting threads about that router. It seems the order of activating the firewall with some other features can have some very detrimental effects.

At any point did you back up your config and do a factory reset on the Asus gear to see if that would fix it?


I could only ping my Fire Wall tablets from the router. This was intermittent and cleared on reboot. On and off, I could not reach them from my hardwired desktop. Again...this came and went. When I couldn't ping them, HE packets to them failed as well. All things like the wireless LAN separation have been turned off as well. That's been seamless since swapping routers. Again, Internet-based communications to them always worked. There was definitely a local LAN problem. Yes on the wipe and reboot of the router. I've been fighting this for months and just didn't want to spend the bucks to replace otherwise absolutely awesome routers.
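For anyone chasing a similar come-and-go problem, a small script on a wired machine can record when hosts stop answering. This Python sketch shells out to the system ping command. The addresses are hypothetical placeholders, and the flags assume Linux or Windows ping (macOS interprets -W in milliseconds).

```python
import platform
import subprocess
from datetime import datetime

def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping command; True if it replied."""
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 2,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ping is missing or hung; treat the host as unreachable.
        return False
    return result.returncode == 0

def unreachable(hosts, probe=ping_once):
    """Return the hosts that did not answer the probe."""
    return [h for h in hosts if not probe(h)]

# Hypothetical static addresses for the wall tablets.
TABLETS = ["192.168.1.51", "192.168.1.52", "192.168.1.53"]

if __name__ == "__main__":
    down = unreachable(TABLETS)
    print(datetime.now().isoformat(timespec="seconds"), "unreachable:", down or "none")
```

Run on a schedule, the output gives timestamps that can be lined up against the hub's overload events.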

I never set things up for Wireshark...wish I had. No idea other than lazy.

Interesting to know that this is TCP vs UDP traffic.

I wish ZWAVE was!!!!
