2.2.6.140 Hub Performance Stats & Severe Load Alert are Incorrect

C-5 / 2.2.6.140 - I am experiencing very frequent Severe Load alerts, and HE slows down to the point that it turns off the Zigbee radio and is largely unresponsive to actions in general until reboot.

  1. Sonos seems to be very chatty in terms of raising events to subscribing clients. This needs to be looked at and I posted this issue on another thread as it seems unreasonable.

There seems to be a fundamental issue with how performance is measured and reported - either that or I am not understanding something trivial.

  1. Device Stats, for example, rises to 1172% load. This is the wrong stat. The load on the system maxes out at 100%, so calculating "Total Device Busy Time / Total Hub Uptime" as a running counter does not represent CPU load, either in real time or on average. A meaningful figure would sample the load at specific time intervals and average those samples over the duration, rather than use the current aggregation method - see screenshot attached.

  2. The alerting mechanism seems to generate Severe Load alerts when the above calculation hits ~75% on either the device stats or the app stats. However, as stated above, this calculation does not depict CPU load, which is a basic core statistic that the firmware can and should pick up from the operating system.

  3. Then there is the fact that the stats can be reset manually by clicking the trashcan icon, and the hub is suddenly relieved of the load? All that does is reset the performance counters; it has nothing to do with the actual load. Yet the Severe Load alert clears as if I had automagically erased history and the hub were now fully performant at 16% (and rising over time) - really?? If the hub was at 80% (let's say we get the core stat from the CPU), how would deleting history suddenly drop the CPU utilization to 16%? What happened to all those busy applications?
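To make the difference between a running counter and interval sampling concrete, here is a minimal Python sketch. The numbers and function names are made up for illustration and this is not how the hub firmware is known to compute anything; it only shows that a cumulative "busy time / uptime" ratio can sit far from the recent per-interval load, and that clearing the counters changes the ratio without changing the load.

```python
def cumulative_ratio(busy_ms, uptime_ms):
    """Running-counter style: total busy time over total uptime, as a percent."""
    return 100.0 * busy_ms / uptime_ms


def windowed_ratios(busy_samples_ms, window_ms):
    """Interval style: busy time seen in each fixed window, as a percent per window."""
    return [100.0 * busy / window_ms for busy in busy_samples_ms]


# Hypothetical busy time (ms) in six consecutive 60-second windows:
# two very busy minutes followed by four quiet ones.
samples = [48_000, 48_000, 600, 600, 600, 600]
window = 60_000

per_window = windowed_ratios(samples, window)

# Running counter over the whole period.
overall = cumulative_ratio(sum(samples), window * len(samples))

# Same workload, but with the counters cleared after the second window,
# so only the last four windows are counted.
after_reset = cumulative_ratio(sum(samples[2:]), window * len(samples[2:]))

print(per_window)
print(round(overall, 1), round(after_reset, 1))
```

In this made-up case the running counter reports about 27% while the last four minutes each ran at 1%, and the same quiet period reads 1% once the counters are cleared.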

And then the mystery begins - if the hub thinks it has a severe load (unsubstantiated, given the behavior described above), then why is it suddenly slow and unresponsive? What causes the performance to degrade, if we cannot see the real average load statistics?

In short, this, at least to me, is a major flaw in behavior that started with the latest firmware. It suddenly and unnecessarily changes the behavior of the hub to the point that I have to start pruning devices and integrations that were working fabulously on 2.2.4.

Thoughts?

Actually this could be along similar lines to what I saw recently.... I'll find the response from @thebearmay....

If Sonos speakers are wifi-based, this could be similar to my situation with my Kasa plugs and how I increased the polling frequency and have seen elevated CPU usage readings.

So for reference, you can get the CPU load average of the hub, if you so desire.

http://<hub_ip>/hub/advanced/cpu1min
http://<hub_ip>/hub/advanced/cpu5min
http://<hub_ip>/hub/advanced/cpu15min
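If you want to poll these three URLs from a script rather than a browser, something like the following Python sketch could work. It only assumes the URL pattern quoted above; the thread does not say what the response body looks like, so the sketch returns the raw text instead of parsing it. The helper names (`http_get_text`, `read_load_averages`) are hypothetical, and it assumes the endpoints answer a plain unauthenticated GET.

```python
from urllib.request import urlopen

# Endpoint names taken from the URLs quoted in this thread.
ENDPOINTS = ("cpu1min", "cpu5min", "cpu15min")


def http_get_text(url, timeout=5):
    """Fetch a URL and return its body as stripped text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace").strip()


def read_load_averages(hub_ip, fetch=http_get_text):
    """Return {endpoint name: raw response text} for the three load-average URLs.

    `fetch` is injectable so the function can be exercised without a hub.
    """
    return {
        name: fetch(f"http://{hub_ip}/hub/advanced/{name}")
        for name in ENDPOINTS
    }
```

Calling `read_load_averages("192.168.1.10")` (a placeholder address; substitute your hub's IP) would return a dictionary with one entry per endpoint.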

There's also a community-made driver that collects similar info and consolidates it for you.

The stats on the apps and devices pages report the amount of time those drivers or apps have run. If you're resetting the counters and the values are quickly increasing, it sounds like you have a device that is running constantly and possibly out of control. The percentage you see is amount of time running / total system uptime (this total uptime might reset when you reset stats). That percentage has nothing to do with CPU use.

For example, on my hub, device drivers have run for a total of 24 seconds out of a total uptime of almost 6 minutes (I just rebooted the hub).

image
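As a worked version of that arithmetic, here is a small Python sketch of the "time running / total uptime" percentage as described in this post. The function names are hypothetical, and 6 minutes is used as a round stand-in for "almost 6 minutes". The second helper inverts the formula to show how much accumulated run time a given percentage implies; how the stats page arrives at a figure above 100% is not explained in this thread.

```python
def busy_percent(busy_seconds, uptime_seconds):
    """Percent as described above: time running divided by total uptime."""
    return 100.0 * busy_seconds / uptime_seconds


def busy_seconds_for(percent, uptime_seconds):
    """Invert the formula: run time needed to display a given percent."""
    return percent / 100.0 * uptime_seconds


# 24 seconds of driver run time over roughly 6 minutes of uptime.
example = busy_percent(24, 6 * 60)

# Run time implied by a displayed 1172% for each hour of uptime.
needed = busy_seconds_for(1172, 60 * 60)

print(round(example, 1))       # about 6.7 percent
print(round(needed / 3600, 2)) # hours of run time per hour of uptime
```

By this formula, a displayed 1172% corresponds to about 11.7 hours of accumulated run time per hour of uptime, which is why the figure cannot be read as CPU load.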

Thank you - the CPU stats links are very helpful, and I can't understand why they are not made part of the UI, even as individual hyperlinks or buttons.

That doesn't negate my rant that the total stat that is displayed is severely misleading and, quite frankly, utterly useless.

@djw1191 - yes, the culprit in my case is 15 Sonos devices that are inevitably linked to one dashboard or another. This causes a subscription to a large number of the attributes, data and events that they produce. Given the event subscription model, one has to ask whether there is a way to configure which states and events are exposed, and whether the rate at which they are raised in the source code is reasonable for the type of device.

The Sonos app is a built-in HE platform app, so its source code is not exposed or editable. I cannot simply resolve this issue by optimizing code, and the same is likely to happen with many other built-in apps and device drivers. However, everything was peachy in 2.2.4 with the default of 100 states and 100 events, so something material has changed in the firmware's performance characteristics to cause such a severe ongoing load in 2.2.6, even with a global setting of 5 states and 5 events. There is much more to it for sure.

The overall gist of my complaint is still valid despite the fixes. The way performance reporting and optimization are implemented, relative to an understanding of CPU load at the infrastructure, OS, app, and device tiers, makes what is exposed to users in 2.2.6 both misleading and unhelpful.

How do you increase (or decrease) the polling frequency? Is it because you can access it via a custom driver for Kasa or is there a hidden way to do it for built-in apps like the Sonos integration one?

For my Kasa plugs this is a setting within the community drivers I am using, not something generic that would apply to Sonos. In your case, I would check whether the drivers / devices for your Sonos speakers within HE have a similar option to adjust any automatic polling, i.e. to reduce how often it occurs.
