Let's Chat About... Monitoring Hub Performance

Not a new topic... Including for me.... But like many of my late night posts, something that felt like a popular topic at the moment. So let's chat about how we:

  • Assess the current performance of our hubs
  • Look at the performance over longer periods of time
  • (and I'm expecting this will get lost in the conversation) How we diagnose potential issues with our automations

For me:

Immediate analysis of a "situation" typically includes checking the logs page for current and past logs, which will usually include errors and warnings, based on how I have set up most of my devices. If I have narrowed it down to a particular device or app that I am interested in, then I will likely turn on the logging options in the device or app setup. Otherwise I may also look at the Device or App Stats pages if I haven't already detected any degradation in performance elsewhere.

For longer term or broader assessment of performance, I tend to rely (like many users, I expect) on data coming from the Hub Information driver, including hub temperature, CPU, memory available, database size and uptime days. I export these (along with other device status info) to InfluxDB for presentation on Grafana dashboards. Monitoring such as this can be a topic of its own....
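To make the export step a little more concrete, here is a minimal sketch of formatting one hub reading as an InfluxDB line protocol point. The measurement name, tag, and field names below are assumptions for illustration and are not the actual attribute names exposed by the Hub Information driver.

```python
def _escape(text, chars):
    """Backslash-escape the characters that are special in line protocol."""
    text = str(text)
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point with numeric fields as an InfluxDB line protocol string.

    Measurement names escape commas and spaces; tag keys, tag values and
    field keys also escape the equals sign.
    """
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={float(v)}" for k, v in fields.items()
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical reading; the names and values are made up for the example.
line = to_line_protocol(
    "hub_info",
    {"hub": "C-8"},
    {"temperature": 38.5, "free_memory": 412000},
    1700000000000000000,
)
print(line)
```

Each such line can then be written to InfluxDB by whatever client or bridge is already in use, and graphed in Grafana from there.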

Beyond some of the more general suggestions I have made, I also suggest some basics when reporting any issues here on the Community, such as the platform version you are running, any logs for the app / device in question, plus many and varied screenshots displaying your issue.

Discuss...

2 Likes

When having issues with automations/applications, post a full screenshot of your setup page. There's a high number of rabbit holes around RM issues where 3/4 through the thread, the OP says something along the lines of "oh yeah, I have a required expression" and the 'aha' moment finally happens. If a problem is reproducible, logs should 100% be captured for the failure and provided. If there are devices involved, their logs also need to be included.

One gripe about long term retention is usability. I know it gives us the warm and fuzzies to have those data points, but I've rarely seen them actually contribute to RCA and what-not. Case in point, I'm seeing a memory leak on my C-8 where I'm getting down to the low 200s for available memory. Once it dips that low, things get weird with rule execution and device responsiveness. A reboot corrects it.

The issue (especially with how often updates are coming out) is that the first question posed when posting is "are you on the latest FW." To which I'll say no, go do the update, and, since we just rebooted, the issue is gone. By the time it comes back, I've got a new firmware release and the cycle repeats itself.

3 Likes

But you got a resolution... Tsk.... These unhappy users.... :wink: I get what you mean...I can understand those wanting to see a limited set of changes to compare results against.

I wouldn't call rebooting the hub a resolution in this case though. It's definitely not a high priority with the other issues they've got going on so I haven't pushed it. I also think it would be beneficial to start trying to figure it out once the firmware updates are able to slow down. The other thing is that I'm not seeing the same issue on my C-7 on the same fw revision. But, I also have the radios disabled on it. So, a small chance that the memory issue correlates with some of the ZigBee issues.

My point overall is that I would've done the same troubleshooting steps regardless of whether or not I had the memory data points.

1 Like

A big issue, and one I'm guilty of, is seeing a thread saying "I'm having XX issues" or "my hub is doing XX", and my mind goes to that place where I think I'm having the same issue and I jump to conclusions.

I guess that's the nature of our minds, and how putting info out on the forums goes. See, here I'm going to chime in on something LOL

@FriedCheese2006 I always watched my memory dwindle away on my C7 until I had to reboot. Upgraded it to the C8, and it still does the same thing. I repurposed my C7 to the garage and have offloaded a lot of automations to it. The memory on it is fine for weeks.

1 Like

How so? I think my point resounds....besides having something to look at...how does having the hub stats metrics provide a meaningful attribute for trouble-shooting?

1 Like

I should stop posting late at night... I think you make some really good points... I was only joking about the reboot being a resolution :slight_smile: I was mainly wanting to re-iterate one of the points you made, that changes being applied to the hub, either the reboot or a firmware update, don't help in understanding or diagnosing your kind of issue.

I think my only idea around using the device or app stats was to provide some kind of comparison, so if the usage of a device or app changes significantly, it may be worth investigating if some degradation in performance is being experienced. It wouldn't be the only thing to consider in troubleshooting an issue, but it may help in certain circumstances. Possibly not the one you describe, I expect.
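As a rough illustration of that kind of comparison, the sketch below flags any app or device whose usage has grown by some factor relative to an earlier snapshot. The names, the numbers, and the factor of 2 are all assumptions; the figures would have to be copied from the stats pages by hand or by some other means.

```python
def significant_changes(baseline, current, factor=2.0):
    """Return {name: ratio} for entries whose usage grew by at least `factor`.

    Entries missing from the baseline (or with a zero baseline) are skipped,
    since there is nothing to compare them against.
    """
    flagged = {}
    for name, now in current.items():
        before = baseline.get(name)
        if before and now / before >= factor:
            flagged[name] = round(now / before, 2)
    return flagged


# Hypothetical "% of total" figures from two snapshots taken a week apart.
last_week = {"Rule A": 1.0, "Rule B": 4.0}
today = {"Rule A": 3.0, "Rule B": 4.4, "New app": 1.0}
print(significant_changes(last_week, today))
```

Only increases are flagged here; a sharp drop could matter too, and would need a second check.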

1 Like

I had an idea in a different thread..

This is more a question for you developer types...

Could you create an RM rule, or even a basic app, that does some sort of action and determines the time it took to perform that action?

I.e., turn a virtual switch on and off on a schedule and track how long it took for that switch to change states. Then save that value as either a variable or a virtual device attribute that we can log.

I've been thinking about how hub performance seems to degrade with uptime, and trying to think of a way to measure it.

An app could certainly do that. Would be a simple function to log the time, turn the device on, wait for the attribute update, and log the time. Could even get fancy and write a line to delta the two times.
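The steps described here (log the time, send the command, wait for the attribute update, log the time again, take the delta) can be sketched in Python as below. This is only an illustration of the logic: a real Hubitat app would be written in Groovy and would more likely react to the device event than poll, and `FakeSwitch` is a made-up stand-in for a virtual switch.

```python
import time


def measure_latency(send_command, read_state, expected, timeout=5.0, poll=0.01):
    """Return seconds from sending a command until the state matches, or None on timeout."""
    start = time.monotonic()
    send_command()
    while time.monotonic() - start < timeout:
        if read_state() == expected:
            return time.monotonic() - start
        time.sleep(poll)
    return None


class FakeSwitch:
    """Hypothetical switch whose state changes `delay` seconds after on() is called."""

    def __init__(self, delay):
        self.delay = delay
        self._on_at = None

    def on(self):
        self._on_at = time.monotonic() + self.delay

    def state(self):
        if self._on_at is not None and time.monotonic() >= self._on_at:
            return "on"
        return "off"


switch = FakeSwitch(delay=0.05)
latency = measure_latency(switch.on, switch.state, "on")
print(f"switch reported 'on' after {latency:.3f}s")
```

The returned value is what would be stored in a variable or virtual device attribute and logged over time.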

1 Like

I think that's exactly what Hub Watchdog does/did. I never really found it useful enough, but I also didn't have a good way to visualize the differences over time.

2 Likes

I just installed that. Looks like what I was envisioning.

Maybe another thought: besides turning a device on or off, would it be possible to do some calculations in an app, time how long the calculation takes, and use that as a datapoint too?
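A minimal sketch of that idea, shown in Python for illustration only (on the hub itself this would be Groovy app code): run a fixed amount of arithmetic several times and keep the fastest run, so that a one-off pause does not skew the number. The workload size and repeat count are arbitrary assumptions.

```python
import time


def time_calculation(n=200_000, repeats=5):
    """Return the best-of-`repeats` time in seconds to sum the first n squares."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        total = 0
        for i in range(n):
            total += i * i
        best = min(best, time.perf_counter() - start)
    return best


print(f"fixed workload took {time_calculation():.4f}s")
```

Because this never touches a device or a radio, a slowdown in this number would point at the hub's own processing rather than the device network.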

Just spitballing ideas.

There is also a Node-RED performance flow that captures the time it takes for a page to load, which effectively gives a response time value. That is another metric I have shown on my Grafana graphs.
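The same measurement can be sketched outside Node-RED. The Python below is not the flow mentioned here, just an assumption about what such a check boils down to: time one HTTP request and treat a failed request as "no reading". The URL is a placeholder for whatever hub page is being timed.

```python
import time
import urllib.request


def response_time(url, fetch=None, timeout=10):
    """Return seconds taken to fetch `url`, or None if the request fails.

    `fetch` can be swapped out (for example in tests); by default the page
    is downloaded with urllib from the standard library.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.read()
    start = time.perf_counter()
    try:
        fetch(url)
    except OSError:  # covers URLError, timeouts and refused connections
        return None
    return time.perf_counter() - start


# Example call (placeholder address, commented out so the sketch runs offline):
# print(response_time("http://hub.local/"))
```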

I can share the flow if you want.

Your idea of turning a device on/off could be impacted by the device network so keep that in mind.

2 Likes

I have several flows in Node-RED, including the performance flow that @mavrrick58 mentions above, that monitor performance. Stats like response time and the time it takes to download a backup of the hub are graphed as data points over and above what’s included in the Hub Information driver, which I also graph.

I am connecting to the event and log web sockets and dropping that data into a MariaDB on my NAS. I have a table of all my devices and have an updated date that gets maintained by a trigger on the events table so I know if a battery device falls off. I am also capturing battery change durations too.
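The trigger idea can be illustrated with a small self-contained sketch. It uses SQLite from the Python standard library purely as a stand-in for MariaDB, and the table and column names are assumptions, not the actual schema described here.

```python
import sqlite3


def make_db():
    """Create an in-memory database with devices, events and a 'touch' trigger."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE devices (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
        CREATE TABLE events (device_id INTEGER, name TEXT, value TEXT, ts TEXT);
        -- Every new event stamps its device with the event time.
        CREATE TRIGGER touch_device AFTER INSERT ON events
        BEGIN
            UPDATE devices SET updated_at = NEW.ts WHERE id = NEW.device_id;
        END;
    """)
    return db


def record_event(db, device_id, name, value, ts):
    """Insert one event row; the trigger updates devices.updated_at."""
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)", (device_id, name, value, ts))


db = make_db()
db.execute("INSERT INTO devices VALUES (1, 'Front door sensor', NULL)")
record_event(db, 1, "battery", "87", "2024-01-01T12:00:00")
print(db.execute("SELECT name, updated_at FROM devices").fetchall())
```

Keeping the timestamp on the device row means the "has this device gone quiet" check never has to scan the full events table.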

Every 12 hours a flow will query the device table and let me know if something hasn’t reported in that timeframe. I also have it query the logs table for errors and give me a count so I can research if necessary, as some errors are expected, like with Ecobee and Amazon, which I just ignore unless the count is extreme.
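The two checks in that 12-hour job can be sketched roughly as follows. This is an illustration in Python rather than the actual flow, and the data shapes, device names, and the threshold for an "extreme" count are all assumptions.

```python
from datetime import datetime, timedelta


def stale_devices(last_seen, now, max_age=timedelta(hours=12)):
    """Return sorted names of devices with no report inside `max_age` (or never seen)."""
    return sorted(
        name for name, ts in last_seen.items()
        if ts is None or now - ts > max_age
    )


def error_report(log_rows, expected=(), extreme=50):
    """Count error rows per source; expected-noisy sources appear only at `extreme` or more."""
    counts = {}
    for source, level in log_rows:
        if level == "error":
            counts[source] = counts.get(source, 0) + 1
    return {s: c for s, c in counts.items() if s not in expected or c >= extreme}


now = datetime(2024, 1, 2, 12, 0)
last_seen = {
    "Door sensor": datetime(2024, 1, 2, 6, 0),   # 6 hours ago, fine
    "Leak sensor": datetime(2024, 1, 1, 20, 0),  # 16 hours ago, stale
    "New device": None,                          # never reported
}
rows = [("Ecobee", "error")] * 3 + [("Rule Machine", "error"), ("Rule Machine", "info")]

print(stale_devices(last_seen, now))
print(error_report(rows, expected=("Ecobee",)))
```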

2 Likes