Logs are your friend - this one as well as the app/dev logs and the separate Zigbee/Z-Wave live logs. You need to get at least a hint of what is hammering the hub: either one of the apps or devices, or something on one of the meshes generating lots of interrupts.
Can you show screenshots of your app/dev logs pages, and potentially enable debug on some of your devices, to see if there is a high volume of events occurring? Are you by chance using MakerAPI, or do you have any LAN interfaces/devices?
Thanks! At least for now that seems to have solved the problem. It’s only been about 30 minutes since I did it, but it was happening every five minutes on the nose pretty much before.
Thanks! I never saw anything in the logs that indicated there was a problem except in the location log with the CPU spikes. I’ve restored the database and it seems to have fixed the problem for now, but if it comes back, I’ll do a little more hunting again.
I do have a lot of LAN-based tablets, use SharpTools, online, etc. One of my biggest app hits is usually ecobee. I turned off all my rules and disabled some of those devices as well, and it was still doing the same spike. I do want to get much better at troubleshooting these things.
Are the log messages actually spiking almost exactly every 5 minutes, or is that just when the hub reports it? I think the latter is true, from another post I saw.
That value is the 5 min CPU load value (the load average), as a tool like top would provide. That is why the message is created every 5 min: it would take that long for the value to change in any meaningful way. Also, a high value of 2 over a 5 min interval means that during that interval there were likely times the load was much higher.
What goes into calculating that value is complicated and doesn't directly relate to CPU usage. It is more like the number of threads needed to process work, based on CPU usage and other states the system is experiencing at a given time. This means things like waiting for storage or network connectivity can have a significant effect on the value.
What kind of value is meaningful, though, depends on the system's purpose, the hardware involved, and how sensitive it is to delays in processing. In its simplest interpretation you would want to keep the numeric value less than the number of threads the computer's CPU can handle. But that only accounts for pure CPU usage and not external wait states.
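To make that simplest interpretation concrete, here is a minimal Python sketch. It assumes a Unix-like machine where `os.getloadavg()` is available (this is something you would run on a PC or server, not on the hub itself), and the helper names `load_percent` and `is_overloaded` are made up for illustration. It only captures the "load versus thread count" rule of thumb, not the wait-state effects described above.

```python
import os


def load_percent(load, cpu_threads):
    """Express a load average as a percentage of the available CPU threads."""
    if cpu_threads <= 0:
        raise ValueError("cpu_threads must be positive")
    return round(100.0 * load / cpu_threads, 2)


def is_overloaded(load, cpu_threads):
    """Simplest interpretation: load at or above the thread count means work is queuing."""
    return load >= cpu_threads


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    # getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one, five, fifteen = os.getloadavg()
    threads = os.cpu_count() or 1
    print(f"5 min load {five:.2f} on {threads} threads = {load_percent(five, threads)}%")
    print("overloaded" if is_overloaded(five, threads) else "within thread count")
```

For example, a load of 0.19 on a hypothetical 4-thread CPU works out to 4.75%, while a load of 2 on a 2-thread CPU would already be at the limit.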
I have seen a system run perfectly fine when that value was consistently twice the number of threads the server could produce because of what that system did.
On Hubitat, a value of 2 or greater is likely a very bad thing, because we need our HA platform to respond immediately, right? As that number grows you run the risk of small delays in automations and such.
Unfortunately it can be very hard to pin down the cause, though. You may want to look at live logging and the Z-Wave/Zigbee logs to see if anything is causing a lot of activity. You may also want to post your app and device pages from the logging tab to see if anything stands out.
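If you copy log entries out into a file or spreadsheet, one way to see whether anything stands out is to tally entries per app or device over a time window. The sketch below is only an illustration: the record layout (`source`, and `timestamp` in seconds) and the sample names are hypothetical, not Hubitat's actual log format, so it would need adapting to however you export the entries.

```python
from collections import Counter


def busiest_sources(entries, window_start, window_end, top_n=5):
    """Count entries per source within [window_start, window_end) and return the top talkers."""
    counts = Counter(
        entry["source"]
        for entry in entries
        if window_start <= entry["timestamp"] < window_end
    )
    return counts.most_common(top_n)


# Made-up sample data, purely to show the shape of the input.
sample = [
    {"source": "thermostat app", "timestamp": 5},
    {"source": "wall tablet 1", "timestamp": 12},
    {"source": "thermostat app", "timestamp": 40},
    {"source": "thermostat app", "timestamp": 95},
    {"source": "motion sensor", "timestamp": 400},  # outside the window below
]

if __name__ == "__main__":
    for source, count in busiest_sources(sample, 0, 300):
        print(f"{source}: {count} entries in 5 min")
```

A source that dominates every 5 minute window is a reasonable first suspect, though a quiet log does not rule out load caused by network or storage waits.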
I am hoping you nailed the root cause of this, and it is the LAN. I have some really nice ASUS Mesh routers. Even with every tweak possible to set them to do nothing but route (except the FW), after a period of time they randomly start blocking things they determine to be a risk on a local LAN...local LAN to local LAN. I'm pretty decent at working on routers. I started installing them in 1992 and have worked for ISPs etc.
After swapping it out for a cheap TP-Link router and hard coding three of my wall tablets...(one has been all along for testing) they've responded perfectly. It was a little hard to track down since it wasn't blocking traffic to them from places like SharpTools.
Anyway, I hope to get all my addresses back into the DHCP server this evening. I did the main ones I needed yesterday...but not all of them and the HE is even less happy today running 4+'s. Another indication that you're right it's the LAN.
So, I need to watch it for a couple days, but the problem certainly looks like my ASUS RT-AX92U Mesh routers.
CPU Load/Load% 0.19 / 4.75 %
Glad I have a lot of wall tablets because I dim them all when I leave or go to sleep, and it was the most obvious symptom when they constantly responded erratically/intermittently.
This could be an example of when load isn't the best indicator of actual CPU usage. I do think there is some kind of issue with the router, and the network latency was likely driving up the load value, but I suspect it wasn't blocking activity from occurring and the message could have been a red herring. TCP delays like what happened could cause issues if you run out of connections in a pool or something like that, but processing behind the scenes was likely running fine as long as the network delays were not causing connection pool contention.
Either way, that is awesome you found it, and glad I could help you find it.
A quick Google search turned up some interesting threads about that router. It seems the order of activating the firewall with some other features can have some very detrimental effects.
At any point did you back up your config and do a factory reset on the Asus gear to see if that would fix it?
I could only ping my Fire Wall tablets from the router. This was intermittent and cleared on reboot. I could not get to them on and off from my hardwired desktop. Again...this came and went. When I couldn't ping them HE packets to them failed as well. All things like the wireless LAN separation have been turned off as well. That's been seamless since swapping routers. Again, Internet based communications always worked to them. There was definitely a local LAN problem. Yes, on the wipe and reboot of the router. I've been fighting this for months and just didn't want to spend the bucks to replace otherwise absolutely awesome routers.
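For an intermittent problem like this, a small script that pings a device on a schedule and records when it drops out can make the pattern easier to see than spot checks. This is a rough Python sketch, not a tested diagnostic tool: it assumes a Unix-style `ping` that accepts `-c 1` (Windows uses `-n` instead), and the function names and the address in the usage note are hypothetical.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Return True if a single ping gets a reply. Assumes a Unix-style `ping -c`."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into (first_fail, last_fail) spans."""
    windows = []
    start = None
    last_fail = None
    for timestamp, reachable in samples:
        if not reachable:
            if start is None:
                start = timestamp
            last_fail = timestamp
        elif start is not None:
            windows.append((start, last_fail))
            start = None
    if start is not None:
        windows.append((start, last_fail))
    return windows


def monitor(host, interval_s=30, rounds=10, probe=ping_once):
    """Probe the host `rounds` times and return the outage windows that were seen."""
    samples = []
    for _ in range(rounds):
        samples.append((time.time(), probe(host)))
        time.sleep(interval_s)
    return outage_windows(samples)
```

Calling something like `monitor("192.168.1.50", interval_s=30, rounds=120)` from the hardwired desktop would cover about an hour. Comparing the outage windows against the times the hub's commands failed would show whether the two line up.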
I never set things up for Wireshark...wish I had. No idea other than lazy.
Interesting to know that this is TCP vs UDP traffic.