Hub process slowdown after several days

I just created a rule fired by a button that just logs a message. It took 294 msecs from button push (per the logs) to log output.

Then I wrote a trivial app that does the same thing. It took 172 msecs. From that, I'd deduce that the base RM overhead is approx 122 msecs.
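
For reference, a trivial Hubitat app of that sort might look roughly like this (a sketch; names, inputs, and structure are illustrative, not the actual test code):

```groovy
// Minimal sketch of a trivial test app like the one described above.
// Names, inputs, and structure are illustrative, not the actual test code.
definition(
    name: "Button Log Test",
    namespace: "example",
    author: "example",
    description: "Logs on button push, to compare timing against an RM rule",
    category: "Test",
    iconUrl: "",
    iconX2Url: "")

preferences {
    section {
        input "btn", "capability.pushableButton", title: "Button to watch", required: true
    }
}

def installed() { initialize() }

def updated() {
    unsubscribe()
    initialize()
}

def initialize() {
    subscribe(btn, "pushed", pushHandler)
}

def pushHandler(evt) {
    // Comparing now() against the event's timestamp approximates the
    // button-push-to-handler latency without a second log window.
    log.info "pushed: handler entered ${now() - evt.unixTime} ms after the event"
}
```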

Your way is fair too. The one other thing I see is that there is always a difference of 200-300 ms between when the APP received the event (first log output) and when the rule received the same event (first log output of the rule). I have two windows with the logging open and compare timing between the two.

Then I ran it a second time, and the little app time dropped to 87 msecs. Probably because the second time it was already in memory.

One thing to note: the web UI itself is a large consumer of resources. So a better way to do this is to not use logging at all, but to keep internal times and then look at them later. In RM you can put now() into a variable.
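
In a custom app, the same keep-internal-times idea might look something like this sketch:

```groovy
def pushHandler(evt) {
    // Keep the measurement internal: store the latency in state and
    // inspect it later, instead of paying for log output on every event.
    state.lastLatencyMs = now() - evt.unixTime
}
```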

That can probably make a big difference. But don't rules stay in memory too?

Yes. There is always going to be some variability in these numbers as well, depending on what else is happening at the same time on the hub.

I think it is fair to say that RM introduces some runtime overhead. That's only logical, as well as empirical. There is no way that something large and complex can be as fast as something small and simple. It's a tradeoff.

Fair and I agree.
Now we just need to convince you that your idea was a great idea!!!! :rofl: :rofl: :rofl:

I am going to stay patient and wait for RM5 :rofl:

Yeah, I don't know if it triggered or not. I just know it didn't execute the second change. So my thought was that it was still taking care of the first change and, as you say, hadn't gotten things updated before the second change came in. It was probably a good thing it happened, as it told me something was going on that I wasn't aware of.

It would log both events if you have event logging turned on, even if it didn't fire the second time.

Once I have a rule working I leave the logging off. Figure there's no sense in loading up the log and making it harder to find something when I need to.

Cc: @dan.t

So I've done some profiling as well, looking at latency in a few ways. I have a simple driver that uses now() to calculate On->Off time. I run various triggers on non-overlapping schedules six times an hour and ship the results to InfluxDB (which I think I fixed to avoid killing the Hubitat).

This uses a 1-hour average (6 samples).

The graphs:

50ms: the driver sets a 50 ms runIn and calculates how long it actually took to get called. I use it as a reference/baseline.

RM4 trigger: An RM4 rule triggers on the On and sends a switch Off. This should be the latency of running RM4.

App: A custom Groovy app sends an On and subscribes to the Off.

The numbers align with what @bravenel said already: about 100 ms longer for RM4.

I'll share the code in a few days once I'm happy it's working ok.

So I don't have the issues everyone else is having. I do experience a lag at the beginning of the day while HE wakes up and turns on lights. It takes about 5-8 min, but once it does, it is fast all day long.

My config:

  • C3 edition hub
  • 72 devices connected
    • Zigbee, Z-Wave, and WiFi
  • 10 custom apps
  • 28 rules
  • 10 built-in apps
  • 10 custom device drivers
  • 16 dashboards

I ran my test rule and test app on 10/2. The rule was executing in about 100 msec. The app would execute in about 50 msec.

I then left town for a few days. No one home. Nothing going on except for a few Simple Lighting rules turning lights on/off while gone.

Got back and ran tests today 10/7. Rule is now taking about 700 msecs. App is taking about 300 msecs. So something has slowed things down.

I had disabled the Harmony driver as suggested by @bobbyD. So that apparently didn't help. The only other custom driver I have is the Inovelli one, and I can't really get along without it.

I have simplified rules, etc. So don't know what else to try.

I don't know if it's possible, but couldn't RM create its own child apps without the overhead? Presuming the overhead is in every child instance and not in the parent RM app...

So my hub has been up and running since Sep 30th and here's a graph of the latency since then for 6 different "tests"

50ms latency

The driver measures the time between the two points below (a sketch of such a driver follows the list):

  1. now()
  2. runInMillis(50) -> latency_driver.off()
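
Reconstructed from that description, the driver might look roughly like this (attribute and method names are assumptions, not the actual code):

```groovy
// Sketch of a latency-measuring virtual switch, reconstructed from the
// description above; attribute and method names are assumptions.
metadata {
    definition(name: "Latency Driver", namespace: "example", author: "example") {
        capability "Switch"
        attribute "latencyMs", "number"
        command "measureRunIn"
    }
}

def on() {
    state.onTime = now()
    sendEvent(name: "switch", value: "on")
}

def off() {
    // Whatever turned the switch off (RM4 rule, custom app, ...) defines
    // the measured path: on event -> subscriber -> off() call.
    if (state.onTime) sendEvent(name: "latencyMs", value: now() - state.onTime)
    sendEvent(name: "switch", value: "off")
}

// Baseline: how late does a 50 ms runInMillis actually fire?
def measureRunIn() {
    state.schedTime = now()
    runInMillis(50, "runInDone")
}

def runInDone() {
    // Anything beyond 50 ms is scheduler latency
    sendEvent(name: "latencyMs", value: now() - state.schedTime - 50)
}
```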

RM4 poll

  1. latency_driver.on()
  2. poll() on 6 Z-Wave devices
  3. latency_driver.off()

RM4 trigger

This measures the time it takes for an RM4 rule to receive an event and perform an action.

Rule 1:
Trigger: timed, 6 times/hour
Action: latency_driver.on()

Rule 2:
Trigger: latency_driver turned on
Action: latency_driver.off()

App

This one is written in Groovy, so it avoids RM4 overhead. It is much noisier and really benefits from averaging. This measures the time it takes for an app subscribed to an event to perform an action; a rough sketch of the app follows below.

App schedules events: schedule("0 6,16,26,36,46,56 * * * ?", turnOn)

subscribe(latency_driver.on)

Action: latency_driver.off()
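
Put together, such an app might look roughly like this (a sketch under those assumptions, not the poster's actual code):

```groovy
// Rough sketch of the measuring app described above, not the actual code.
definition(
    name: "Latency Test App",
    namespace: "example",
    author: "example",
    description: "Turns the latency driver on on a schedule, off on its on event",
    category: "Test",
    iconUrl: "",
    iconX2Url: "")

preferences {
    section {
        input "latencyDriver", "capability.switch", title: "Latency driver", required: true
    }
}

def installed() { initialize() }

def updated() {
    unsubscribe()
    unschedule()
    initialize()
}

def initialize() {
    // Six times an hour, offset so it doesn't overlap the other tests
    schedule("0 6,16,26,36,46,56 * * * ?", turnOn)
    subscribe(latencyDriver, "switch.on", onHandler)
}

def turnOn() {
    latencyDriver.on()
}

def onHandler(evt) {
    // The driver computes the on -> off elapsed time internally
    latencyDriver.off()
}
```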

Zigbee

An RM4 rule increases the dimmer level by 1% and measures how long it takes to get the level change event.

Z-Wave

This calls refresh() on a Z-Wave energy-monitoring Zooz switch and measures the time it takes to get a refreshed power reading. I was never sure whether the delay was just the device responding slowly, but the inflection on 10-5 maps to everything else.

It feels like every single device has an inflection point on 10-5 at 12 noon.

Random thoughts

Except perhaps for Zigbee, every test has a similar inflection point. It could be coincidence, but it's interesting that Zigbee latency increased about 6 hours earlier and ran much higher for about 12 hours. But then it settled down to follow the same increase. Maybe the device changed routes?

What I find really interesting is the nearly linear increase in latency after the initial step.

There's a lot running on the hub, sadly, from probably ~50-60 Zigbee and Z-Wave devices to the fixed-up InfluxDB logger. I rate limited its async HTTPS calls to at most 1 in flight at a time, and at most 1/sec. I did remove the LG TV app earlier since it needed some TLC.
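
For what it's worth, a rate limiter along those lines might look something like this sketch (the state-backed queue and endpoint are assumptions, not the actual logger code):

```groovy
// Sketch: at most one async HTTP post in flight, spaced at least 1 s apart.
// The state-backed queue and endpoint are assumptions, not the actual
// InfluxDB logger code; real code would also need to guard against
// concurrent executions mutating state.
def postRecord(String line) {
    state.queue = (state.queue ?: []) << line
    pumpQueue()
}

def pumpQueue() {
    if (state.busy || !state.queue) return
    state.busy = true
    String line = state.queue.remove(0)
    Map params = [uri: "http://influx.example:8086", path: "/write", body: line]
    asynchttpPost("postDone", params)
}

def postDone(response, data) {
    // Hold "busy" for 1 s so posts are at most one per second
    runInMillis(1000, "unbusy")
}

def unbusy() {
    state.busy = false
    pumpQueue()
}
```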

homebridge-hubitat-makerapi is also running and connected to the websocket interface; according to the logs, it's mostly websocket ping<->pongs interspersed with change events.

What's more interesting: turning Z-Wave lights on/off has no noticeable latency. So perhaps it's just event processing that's slow?

Oh well, reboot time it is. But pretty pictures it did make. I'm going on vacation for 10 days; it'll be interesting to see the graphs while I'm away.

I'm sorry to say that I had to reboot yesterday and today. I didn't know that I could crash the Zigbee radio by downloading a hub backup while it was changing modes; I have about 105 Zigbee devices. The backup downloaded, and then everything Zigbee stopped responding. I could still access the web interface, but the Zigbee radio remained unresponsive (or whatever it said to that effect) for several minutes, until I shut the hub down and restarted. It also happened last week, not during a mode change, but while downloading a backup to my PC.

It may sound drastic, but it might be interesting to see if disabling the Z-Wave and Zigbee radios, one at a time, brings it back to responsiveness or not. It's easy to toggle them off for a few minutes as a test, and then back on again.

I've read on here that the Zigbee radio is the first thing to have issues when the hub is having problems. Can't remember why.

That would be interesting, and it can't hurt, eh? Once I'm back from vacation I'll give it a test. Sadly I'll probably have to reboot the hub while overseas, which I'm not looking forward to. (Airbnb guests use the hub to access the house.)

And just to follow up, here's the latency change after reboot (all latency tests on one graph, with a 1-hour average to remove little spikes).
