Found a reason for Elevated/Severe Load warnings

Finally I figured out what was causing the massive Elevated/Severe Load warnings.
I have 4 very nice Xiaomi Light Sensors. (I have a few other Light Sensors, but only the
Xiaomi ones caused a problem.) 3 sensors are installed outside and one is inside.
During the Sunrise and Sunset time frames these light sensors are very active and can send
many messages only a few milliseconds apart. The Sensors/Drivers do not have any user
configurable parameters and basically bombard the hub with messages. Each of these sensors
is used in many different automations. That is why, every time the hub reported an elevated
load warning, different RM Rules were at the top of the app stats page and I was confused
about which rule was causing the problem. Finally I noticed a commonality among all the
potentially problematic rules: all of them use one of these very active light sensors.
So I had to redesign all these rules by adding some sort of debouncing (I gated each rule
with Private Boolean and a 30 sec delay before it can re-trigger).

What is not clear is why this suddenly became a problem.
All these sensors and rules are more than a year old and were running flawlessly on the
C7 hub up to platform version 2.3.4.148. All the bad things started after migrating to the C8
plus the new platform 2.3.5.xyz (latest is .121). Since the C8 processing unit is the same as
the C7's, I think something was changed in the platform itself.

Or your device connectivity is better. A possible explanation is that the sensors were on the edge of connectivity, so some of the spammy messages weren't making it to the hub, or were delayed, which spaced them apart. A more solid connection thanks to the external antenna means the hub is getting every single message as soon as it's sent.

This is not the case. I do remember seeing a lot of messages from these light sensors
right after I installed them. But somehow this was not a problem, and I completely forgot
about this phenomenon because the installation and associated rules are more than a year old.

EDIT.
In addition, another rule, also a year old, occasionally made it to the top of the most active rules.
This one could occasionally be re-triggered about every 1-2 sec by a Power Monitoring plug.

I am almost sure the latest platform became very sensitive (unfortunately) to the
rule re-triggering frequency (just my educated EE guess).

The next public release will turn excessive load errors into warnings for the built-in apps. So the rule will still run, albeit with complaints in the log.

That said, having a device trigger a rule every few milliseconds is a Bad Thing.

RM is not the best tool to debounce a chatty device. There is a small app in our public repo that will do a much better job of this:

This one is for a contact sensor, but it could be very easily changed for any specific device attribute that is a problem (on lines 19, 37, 38 and 49 of the code).

I 1000%+ agree with this statement. And for sure this must be prevented one way or another.
But what surprised me is that the C7 was reasonably happy with this occasional bombardment,
while the C8 is not happy even with re-triggering about once per second.

Well, at least in my case, when the hub was complaining about elevated load (never mind the
severe load warning) many RM rules became very unhappy. There were either missed
triggers or very sizable delays in actions. So it was not only warnings.
That is why I am not sure the idea of suppressing the errors is a good thing.

PS.
I was once asked to find out why an FPGA design was failing left and right.
It turned out the design had a gazillion reported design errors, but a very
smart designer had simply turned them off. Suddenly the non-working design looked
very clean.

Is it really necessary to have rules triggered with every change? I use four of the Xiaomi Mijia Illuminance sensors without issue. Granted, each one only reports every five seconds at the most, but I don't need to know the light level every five seconds. In rules I use a periodic trigger, between certain hours, and check the illuminance as part of the actions.

Oh yes, you are absolutely right.
My sort of de-bouncing is a very simple throttling of re-triggering:

The Required Expression is using Private Boolean = TRUE (could have some other logic)

Actions section:
set Private Boolean to FALSE
Actual Actions
delay x sec (30 sec my usual value)
set Private Boolean to TRUE

This is not a real de-bouncing but does the job.
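
For readers who think better in code, the same gate-and-delay idea can be sketched outside Rule Machine. This is only an illustration in Python, not Rule Machine code: `ThrottleGate`, `try_fire` and the injectable `clock` are made-up names, and the hold time here starts when the trigger is accepted rather than after the actions finish.

```python
import time


class ThrottleGate:
    """Let one trigger through, then drop triggers until hold_s has elapsed.

    The gate plays the role of Private Boolean in the rule above, and
    hold_s plays the role of the delay before it is set back to TRUE.
    """

    def __init__(self, hold_s=30.0, clock=time.monotonic):
        self.hold_s = hold_s
        self._clock = clock          # injectable so the sketch can be tested
        self._closed_until = None    # None means the gate has never closed

    def try_fire(self, action):
        """Run action() if the gate is open; return True when it ran."""
        now = self._clock()
        if self._closed_until is not None and now < self._closed_until:
            return False                          # gate closed: drop trigger
        self._closed_until = now + self.hold_s    # "set Private Boolean to FALSE"
        action()                                  # "Actual Actions"
        return True
```

With `hold_s=30`, a burst of sensor events a few milliseconds apart runs the actions once and the rest of the burst is dropped, which is throttling rather than true de-bouncing.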

Thank you very much for the de-bouncing app.
I will try to use it (if I am able to adjust/modify it for the light sensor).
Where does this app sit in the chain?
Does it take the output of the Device Driver and send a filtered value to the Event Generator?

During the Sunrise and Sunset ramps all of my Xiaomi Light Sensors are very active and
occasionally send messages a few ms apart. During the day they are quiet.
For my automations, reporting the light status every 30 sec should be sufficient.

I'm using Markus' driver for a Xiaomi/Mijia light sensor I have, and it allows you to control the reporting frequency. Does the driver you're using not allow control of reporting intervals? Maybe change drivers?

Even that seems excessive unless you actually NEED to register very small changes in light nearly instantly. Instead of having mine report every given time period in seconds or minutes, I have mine report once every half hour or any time the illuminance has changed more than 50lux since the last report.
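
The "report on a big enough change, or after a maximum interval" policy described above can be sketched like this. It is an illustration only, not the logic of any particular driver: `DeltaReporter` and `should_report` are hypothetical names, and the defaults (50 lux, 1800 s) simply mirror the numbers in this post.

```python
class DeltaReporter:
    """Decide whether a new reading is worth reporting.

    A reading is reported when it differs from the last reported value by
    more than min_delta, or when max_interval_s has passed since the last
    report (a heartbeat).
    """

    def __init__(self, min_delta=50.0, max_interval_s=1800.0):
        self.min_delta = min_delta
        self.max_interval_s = max_interval_s
        self._last_value = None   # last value that was actually reported
        self._last_time = None    # time (seconds) of that report

    def should_report(self, value, now_s):
        if self._last_value is None:
            due = True            # always report the very first reading
        else:
            changed = abs(value - self._last_value) > self.min_delta
            stale = (now_s - self._last_time) >= self.max_interval_s
            due = changed or stale
        if due:
            self._last_value = value
            self._last_time = now_s
        return due
```

The comparison is always against the last reported value, not the previous sample, so a slow drift still produces a report once it adds up to more than `min_delta`.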

I am using this Driver:


Unfortunately this Driver does not have a setting for the reporting interval:

At the time (a year+ ago) when I installed these sensors, the above driver was recommended
by the community because with this driver Xiaomi sensors did not drop from the Zigbee Mesh.
Well, in my case they never dropped. But this driver does not have any control over the
status reporting interval.
As I mentioned above, I already solved this problem by throttling the re-triggering interval.
I guess reducing the reporting frequency would do a better job, because it reduces the number
of Events, which obviously means less load for the CPUs.

From my understanding the driver you're using shouldn't have an impact on whether or not your device stays connected.

Not nearly instantly, but it must be fast enough because all these light sensors are in charge
of the Lighting and Curtain automations. I am in South Florida (Sunny Isles Beach) and all
my windows face West. After about 1:30 pm the Sun shines directly into all my windows.
What is worse, there are many fast-passing clouds. On one hand I don't want the Curtains
to open/close very frequently, but on the other hand I don't want very bright Sun shining
through the windows for a very long time. 30 sec (an experimental value) seems to be
satisfactory and has a relatively high WAF.

Here is one of my modified (and significantly simplified) rules:

The delay before re-triggering depends on what to expect.
For Curtains Closing it is 30 sec, but for Opening it is 3 min.

PS.
My original rule was far more complicated.
It was some sort of PID loop which required very fast data reporting.
Somehow this was not a problem on C7.

This sounds very logical, but according to what I was reading (a long time ago) it somehow does.
I personally did not experiment with different Drivers. I simply picked the one that was
recommended by the community and never had any drop-off problems.

Sounds like you've got something set up that you like. If you ever want to play around with the driver that has control over the reporting timing, here's the link.

I have played that game myself, I have to admit, hoping that a driver would help when things won't stay connected. But I've yet to find anything that proves the driver has any impact.

Thank you for the link.
But yes, now I am all set with the driver I already have.

I am trying to understand what this app is doing.
Correct me if I am wrong (I am not a fluent SW person), but it looks like the app gets the state of
the physical device, waits a number of ms and then updates the state of the virtual
output device accordingly. If the physical device changes its state during the delay, the delay
restarts without changing the state of the output device.
In other words, the output from the physical device must stay unchanged for the specified
de-bouncing timeout. If I am correct, this is exactly what I do in HW to filter out
noise from Contacts or other devices with a noisy output during state transitions.
This filtering algorithm works very well for devices with two (On/Off) states (like contacts).

However, for devices with an analog output (like Light/Temp/Humidity sensors) the filtering
algorithm must be more complicated. First of all, an analog device may not return the same
value in sequential reports. In other words, the output from an analog device may never
stay the same for the specified de-bouncing timeout. However, the output values may stay within
an acceptable delta for a long time. So filtering for an analog device should take into account
multiple sequential samples and make sure all of them stay within the acceptable delta
during the predefined de-bouncing timeout. And if "yes", send out the average value.

I can easily implement this in HW (actually I already did it many times), but I am not sure I will
be able to modify your code, designed for de-bouncing a Contact Sensor, to de-bounce
an analog device like a Light Sensor. I know exactly what should be done
(create an array, fill it like a circular buffer, find the min and max values, calculate the delta
between min and max, make sure the delta is less than acceptable, calculate the average, and if
everything is OK within the de-bouncing timeout send the average out), but my programming
skills are not good enough to implement this filtering algorithm in SW.
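
The steps listed in the post above (circular buffer, min/max, delta check, average) can be sketched in a few lines of Python. This is a sketch under assumptions, not a Hubitat app: `StableAverageFilter` and `add` are hypothetical names, and the de-bouncing timeout is expressed as a number of consecutive samples (`size`) instead of a time in ms.

```python
from collections import deque


class StableAverageFilter:
    """Emit the window average only while the readings are stable.

    Keeps the last `size` samples in a circular buffer. When the buffer is
    full and max - min is within max_delta, add() returns the average of
    the buffer; otherwise it returns None (nothing to send out yet).
    """

    def __init__(self, size=5, max_delta=20.0):
        self.max_delta = max_delta
        self._buf = deque(maxlen=size)   # oldest sample drops out automatically

    def add(self, value):
        self._buf.append(float(value))
        if len(self._buf) < self._buf.maxlen:
            return None                  # window not filled yet
        if max(self._buf) - min(self._buf) > self.max_delta:
            return None                  # still changing too much
        return sum(self._buf) / len(self._buf)
```

For example, with `size=3` and `max_delta=10`, the readings 100, 104, 102 produce 102.0 on the third sample, while a jump to 200 suppresses output until three consecutive readings agree again.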

This is a completely different problem, as you define what should happen.

The problem addressed by the app is literally a bouncing sensor, where it mistakenly sends two or more events in quick succession (within a few milliseconds), for a single physical world event. I have a contact sensor on our office door that does this. The app also takes advantage of a property of the scheduler, that scheduling itself forces a single queue, so that multi-processor relative timing is removed as a consideration.
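
As a rough illustration of that kind of de-bouncing for a two-state sensor, here is a Python sketch. It is not the code of the app: `Debouncer`, `on_event` and `tick` are hypothetical names, and explicit millisecond timestamps stand in for the hub scheduler. Each raw event restarts the settle timer, and the output only changes once the input has been quiet for `settle_ms`.

```python
class Debouncer:
    """Trailing-edge debounce for a two-state (e.g. open/closed) sensor."""

    def __init__(self, settle_ms=100, initial=None):
        self.settle_ms = settle_ms
        self.output = initial     # state of the "virtual output device"
        self._pending = None      # latest raw state waiting to settle
        self._deadline = None     # time at which _pending is accepted

    def on_event(self, state, now_ms):
        """Raw event from the physical device: (re)start the settle timer."""
        self.tick(now_ms)         # first commit anything that already settled
        self._pending = state
        self._deadline = now_ms + self.settle_ms

    def tick(self, now_ms):
        """Call when the timer fires (or periodically); returns the output."""
        if self._deadline is not None and now_ms >= self._deadline:
            self.output = self._pending
            self._pending = None
            self._deadline = None
        return self.output
```

With `settle_ms=100`, a door that reports open, closed, open within 6 ms produces a single "open" on the output about 100 ms after the last bounce.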

Using Private Boolean in a rule with a Required Expression reduces the window of vulnerability to a very small period of time, most likely less than the interval between bounces of a bouncing device. For these very short intervals, averaging data for analog devices is irrelevant.

I know you don't need any more help from me, but I can't help myself. Forgive me. :wink:

I really think that the appropriate place to throttle reporting is on the device itself, rather than letting it bombard the hub and writing work-arounds to ignore the extraneous events in rules. Just seems like you're working at it from the wrong end. Totally your choice, of course, and I will go away now. :smiley:
