Troubleshooting weird performance issue

C7 with 2.3.8.118.
I've been trying to figure out this weird issue, which seems to be getting progressively worse, for a few months now: random slowdowns and "hiccups". Like simple motion sensor rules taking ~1.5-2s to run instead of the typical ~200-300ms, or callback responses from Kasa devices taking seconds to process instead of sub-200ms.

But what surprised me the most was the network speed test:
EDIT: speed test was a red herring and is not relevant here.

Based on those symptoms alone, it's not unreasonable to assume it's an overloaded hub, but device stats / app stats don't look too bad (2% / 1.8% respectively). A database rebuild didn't help; the database itself is actually surprisingly small (dbSize: 8). Free memory is around 325M after a few days of uptime.
No errors in logs. Devices: 60 Zigbee, 50 Z-Wave, 20 Kasa.
The rest of the network tests didn't show anything problematic either:

Looking for ideas on what to try next to get to the bottom of this. "Upgrade to C8 Pro" is definitely being considered, but without a better understanding of what's going on, that feels kind of defeatist :slight_smile:
What else can it be? Impending hardware (eMMC) failure?

That speed test is beyond useless. Just ignore it.

Here's mine, on gigabit internet, on a C8-Pro hard-wired:

I'm guessing whatever this speed test is doing is half-baked and doing something stupid, like unzipping the zip file it downloads, and that's what it's doing for the other 40-55 seconds.

Do you have issues on purely Z-Wave automations? I.e., a Z-Wave motion sensor triggering a Z-Wave switch or similar?

I've got my money on this being 2.4GHz congestion. WiFi and/or ZigBee things randomly taking an extra second or two to respond certainly sounds like congestion to me. It's precisely why WiFi and ZigBee are always 'last option' purchases for me. If it's available as Z-wave, I get the Z-wave version. Even if it's a bit more money it's worth it, since it's entirely immune to my neighbors' WiFi causing issues.

Either that or it could be cloud servers causing latency, if those Kasa things use a cloud connection.

Oh interesting! So it looks like the speed test is basically just

wget -O /dev/null http://speedtest.wdc01.softlayer.com/downloads/test.zip

and it does sometimes take that long to download the file, with or without HE. The server probably has a very limited number of connection slots so it is that slow to respond, but when it finally responds, the download itself goes quickly. Okay, that part of the mystery is solved and seems to be completely unrelated to the rest of the issue :+1:

Only occasionally, as I have very few of those - but I do get a missed Z-Wave command here and there, like either porch or deck lights failing to turn off at sunrise. About 2 years ago I did the antenna mod, which solved almost all of my Z-Wave problems, and pretty much all the devices are connected directly to the hub with 1-3ms latencies, except for a couple of stubborn ones that keep changing their routing twice a day despite being within 15ft and direct line-of-sight of the hub :man_shrugging:
But the delays I'm seeing don't appear to be device-related; it's literally the time between seeing

dev:18 2024-02-13 06:51:11.585 info MBR Bathroom Motion Sensor is active

and

app:422 2024-02-13 06:51:12.906 info Action: On: MBR Bathroom Light

Almost like there's a stop-the-world garbage collector kicking in right in the middle of a rule execution.
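
For anyone who wants to see this gap in isolation, a throwaway app like the one below would log it directly: it subscribes to the sensor and records how long after the event's own timestamp the handler actually gets to run. A minimal sketch with made-up names; it only measures dispatch latency inside the hub, and hub timestamps aren't terribly precise, but second-plus gaps show up clearly.

    definition(
        name: "Event Dispatch Latency Logger",   // throwaway test app, name is arbitrary
        namespace: "test",
        author: "test",
        description: "Logs how long after an event's timestamp its handler runs",
        category: "Utility",
        iconUrl: "",
        iconX2Url: "")

    preferences {
        section("Devices to watch") {
            input "motionSensors", "capability.motionSensor", title: "Motion sensors", multiple: true
        }
    }

    def installed() { initialize() }
    def updated()  { unsubscribe(); initialize() }

    def initialize() {
        subscribe(motionSensors, "motion", motionHandler)
    }

    def motionHandler(evt) {
        // evt.date is when the hub recorded the event; now() is when this handler actually ran
        long lagMs = now() - evt.date.time
        log.info "${evt.displayName} ${evt.value}: handler ran ${lagMs} ms after the event timestamp"
    }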

I'm a fan of Z-Wave myself, and everything hardwired in my house is strictly Z-Wave. I use Zigbee primarily for sensors: I use battery backups for my repeaters, so if something goes wrong when the power is out, at least the mesh remains stable and sensors won't have to scramble looking for alternative routes, potentially dropping messages.

As for Kasa, it's even more interesting. I went way down the rabbit hole in the Kasa thread, but the TL;DR version is that responsiveness of the devices themselves is not the issue: it's the time it takes HE to process the response from the device. And since the same problem doesn't affect the Home Assistant instance I set up as an experiment (I even plugged it into the same network port where HE normally is), it doesn't look like WiFi congestion is at play here.
Kasa commands are sent over UDP to port 9999 using hubitat.device.HubAction, with a callback function that is invoked when the response from the device is handed over to the driver code to process. Typically that round trip takes about 100-150ms.
I.e. a command ("on") is sent -> the device switches on immediately -> the response is received by HE, typically within 150ms, but it can take anywhere between 1s and 10s, and I've seen it go as high as 23s.
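For reference, the send/receive path in the driver looks roughly like this. A minimal sketch, not the actual built-in code: the option names are the ones the HubAction LAN API documents, but the method names, the "deviceIP" data value, and the placeholder payload are mine (the real driver sends a scrambled, hex-encoded JSON command).

    // Minimal sketch of an async UDP command with a parse callback.
    def sendCmd(String payloadHex) {
        state.sentAt = now()                    // last thing the driver does before handing off
        def action = new hubitat.device.HubAction(
            payloadHex,                         // placeholder; the real command is hex-encoded
            hubitat.device.Protocol.LAN,
            [type: hubitat.device.HubAction.Type.LAN_TYPE_UDPCLIENT,
             destinationAddress: "${getDataValue("deviceIP")}:9999",
             encoding: hubitat.device.HubAction.Encoding.HEX_STRING,
             timeout: 10,
             callback: "udpParse"])
        sendHubCommand(action)
        // control returns to the platform here; no driver code runs again
        // until the platform invokes udpParse() with the device's response
    }

    def udpParse(message) {
        // first line of driver code on the receive side
        log.debug "response handed to driver ${now() - state.sentAt} ms after send"
        def resp = parseLanMessage(message)
        // ... decode resp.payload and update the device state ...
    }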
It gets really bad twice an hour, when ZWaveNodeStatisticsJob runs.
I kind of expected that handling async responses for LAN protocols would have much lower priority than anything related to Z-Wave / Zigbee; I just didn't expect it to be that bad.
Now I'm debating whether to just move everything cloud and LAN to HA, or give C8 Pro a shot.

I've got nothing nice to say about HA, so I definitely can't recommend that. I doubt a C8-Pro would "fix" everything on its own, but you could migrate everything other than those Kasa devices over to it, then turn the radios off on your current hub and use it just for the Kasa stuff, and bring them over to the C8-Pro via Hub Mesh.

Maybe these drivers just really suck then? Because that's not a "Hubitat" thing... I've got some WiFi devices that are every bit as 'instant' as anything else on my hub. I don't think there's any sort of prioritization of Z-wave over WiFi stuff on HE. If the hub is busy, everything will suffer. Heck, shutting off the ZigBee radio is one of the first things HE does when overloaded.

The Kasa drivers and app are built-in to HE.

The network interface can generate a lot more work for the hub than either of the Zigbee or Z-Wave radios, or both put together, so I would imagine there would be some sort of prioritization (or throttling) going on.

That doesn't mean it's not a CPU hog. My Sonos Advanced app does 10x what the built-in Sonos app does and uses less CPU. Built-in apps/drivers tend to be reasonably well coded, but they're certainly not all as optimized as they could be. I'd have thought the built-in stuff would be coded in pure Java for speed, but as far as I can tell, they're all just Groovy running on the same Groovy interpreter, with all the slowness that comes with Groovy, just like any other app or driver. Some definitely do run backend pieces in Java, like the HomeKit app, but most are Groovy, and judging by the example code on the HE official GitHub, I suspect there's a lot of use of def in them, with all the massive performance hits that incurs.

Can being the key word. If a device generates 200kb of network traffic it's going to have less impact on things the faster that 200kb is done and processed. 200kb on the LAN is less work than 200kb on one of the Z radios. Sure, the Z radios cap out at 100kbps and ~125kbps each, so their maximum impact at any one time is lower than the 100mbps that the LAN can theoretically do. But if it's just the interface sending a "turn this switch on" packet and another coming back with "ok, done, it's on now", it's going to be no more work to do that over LAN than over ZigBee or Z-wave, and probably less.

It's not like the Kasa stuff is somehow generating 40MB of data to process each time a switch is toggled or something. I'm writing a bunch of drivers for Shelly devices and I can tell you flat out they transmit less traffic for things like on/off than literally any ZigBee device I've ever worked on drivers for. They're comparable to Z-Wave in terms of how many bits need to be sent over the air to toggle a switch on/off. Even if Kasa is 2x worse for data transferred, it's still going to be in the same ballpark as what ZigBee does. A few hundred bits is a few hundred bits, regardless of the radio.

And like I said above, I've never seen any hints that the LAN connection is somehow de-prioritized in any way.

If I remember correctly, the built-in Kasa integration is based on @djgutheinz's community integration, and it uses async UDP for communicating with the devices. And async is probably the least CPU-intensive way to do it: the driver code immediately hands control back to Hubitat after sending the UDP packet, and after that it's completely out of our hands until Hubitat decides that it's our turn to receive the response payload.
All my Kasa devices combined use a tiny fraction of the CPU: only 4% of the overall 1.8% CPU usage, which gives us roughly 0.07% of CPU time? Hardly a "CPU hog". And yes, I know, "time spent" != "CPU used", because if the driver spends most of its time in blocking I/O, it would use very little actual CPU but will still show really high time-spent numbers. Either way it's moot, as that's the only metric we have visibility into.
Recently I've been running a slightly modified version of the latest open-source release of this integration, where I added a few extra log statements to track latency, and I'm seeing the same delays. I'm measuring the time between the last line of code after sending the packet and the first line of code when receiving the response. There are literally zero lines of driver code being executed between those two statements.
Packet capture on the router shows that the response packet from the device always arrives 6-7ms after the command is sent, which is basically the round-trip time over wifi. So I don't think the delay could be anywhere else besides Hubitat.
Everything you said about bandwidth and network traffic is, of course, true, but there's one caveat (without going into things like hardware interrupts etc.): we are dealing with a few levels of abstraction above the hardware layer, especially with async. I don't know any implementation details about how HE works internally, but I wouldn't be surprised if there's some kind of event/messaging bus that all the asynchronous events (including Z-Wave/Zigbee messages) go through before they get handed over to their respective device drivers. And it's not outside the realm of possibility that there's some kind of logic on top of it, something along the lines of "UDP? and there are Zigbee/Z-Wave messages waiting behind you? Back to the end of the queue you go, you have to wait". That's the only remotely reasonable explanation I could think of that doesn't involve aliens :smiley:, and of course we'll never know how it actually works unless we hear it from the developers themselves.
Anyway, this particular issue seems to be limited to async UDP only, so my takeaway is "when writing drivers, don't use async if you care about response latency for callbacks".
The engineer in me wants to rewrite the drivers to use a raw TCP socket, which should solve it, and the more pragmatic me wants to replace the one Kasa plug where I care about latency with a Z-Wave one and call it a win. Not sure yet who is going to win! :smiley:

And I hear you regarding HA, it's probably the least user-friendly smart home platform I've seen, with many interesting architectural choices, mostly to work around Python's limitations.

One other consideration for WiFi / Zigbee performance issues is 2.4 GHz WiFi and Zigbee interfering with each other. Their frequencies and signal strengths overlap, and WiFi routers do not look for non-SSID interference signals in auto channel mode. You can check online for information on this interference. If you are encountering this, you might need to change the Zigbee (or WiFi) channel.

Kasa devices. The long response times happen when there is no reply to the Hubitat command. I have built-in error handling that will first retry, then scan for devices and retry again (to handle the IP changing when the user has not assigned static IP addresses to the Kasa devices).

How did you assess this?

It shouldn't make any significant difference; your blocking call will put the thread to sleep and it will still have to be woken up. @gopher.ny did add some thread pool logic for async calls in 2.3.7.x, but in my experience there are explicit error messages in the logs when you hit pool thresholds.

What difference do you expect that would make? UDP is not much more than a protocol number over raw IP...

The WiFi part always takes only 6-7ms, and the rest is all ethernet. I run 2.4GHz on channels 1 and 6, and Zigbee on channel 20. So even if I had WiFi / Zigbee interference, having it affect ethernet cables to that degree would require a level of power that would turn my house into a giant microwave and cook its entire contents :slight_smile:

It does not affect the cables. It affects the WiFi reception at the router, the WiFi devices, and the hub (if on WiFi). The worst case would be at the hub if it is on WiFi (versus cable). There are a significant number of articles on the web discussing this (Zigbee and WiFi interference).

For the Kasa devices, I tried raw TCP as the protocol. Not all Kasa devices support the TCP interface (some are limited to the UDP interface), so it's a non-starter. (I actually support TCP on some of the devices (HS200 switches) as a backup to UDP LAN control. These were the very earliest versions of the Kasa devices.) Additionally, the Kasa app implementation uses UDP, as do almost all (if not all) implementations on GitHub.

Note: The raw TCP required a significant amount of additional processing within the driver to assure the message was complete and to combine the message segments.

Alt Lan Preference on Kasa HS200 Switch: (screenshot)

Honestly? An educated guess. If this affected not only UDP but TCP as well, we would be hearing about this a lot.

That is very true, but that would be the system kernel waking up the thread, which is actually really fast, and it wouldn't have to go through Hubitat's pipeline that actually calls the driver code. I'm not sure that in the async case there would even be a thread to wake up; I don't think HE maintains a dedicated thread for each configured device, as that would be quite wasteful. It's a lot more likely that every event that triggers a driver just executes the corresponding function within that driver on a thread pool, in whatever thread happens to be available.

I do believe the difference is entirely in how UDP vs TCP are handled internally in Hubitat's code. And yes, UDP is basically raw IP protocol + specific port, and TCP is UDP with guaranteed (heavy air quotes) delivery.
I just did an experiment and modified the driver to use interfaces.rawSocket, as David's code already had most of what was needed, and lo and behold, the problem disappeared. Now almost all responses are handled in about 150ms, which makes sense, as that's right about the same as the default TCP read interval, which is needed to logically separate the response payloads.
Now it just needs a few more changes to use persistent TCP connections (for some reason, interfaces.rawSocket.close() is an expensive operation, and it takes about 1s).
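For anyone curious, the change is pretty small. A rough sketch of the rawSocket variant, doing connect-per-command for simplicity (the method and data value names are mine, not David's code, and the real change also needs the message framing/reassembly handling mentioned earlier):

    // Rough sketch: one TCP connection per command (the real change also sets rawSocket
    // options such as byteInterface to deal with the binary payload, omitted here).
    def sendCmdTcp(String payload) {
        state.sentAt = now()
        interfaces.rawSocket.connect(getDataValue("deviceIP"), 9999)
        interfaces.rawSocket.sendMessage(payload)
    }

    // the platform calls parse() with whatever arrives on the socket
    def parse(String message) {
        log.debug "tcp response after ${now() - state.sentAt} ms"
        interfaces.rawSocket.close()
        // ... reassemble/decode the response and update the device state ...
    }

    def socketStatus(String message) {
        log.warn "socketStatus: ${message}"
    }

Keeping the connection open between commands and only reconnecting on error would be the next step, given how expensive close() turns out to be.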

So this is my setup:


All the traffic between Hubitat and the Kasa plug goes through the router, as they are sitting on different VLANs.
I know for a fact that the delay never happens at [1], [2], or [3]:
[1] takes about 10ms
[2]+[3] together take 6-7ms
There is no way to easily measure [4], but I'm willing to bet money that the problem is not there either, and that leg would be a fraction of a millisecond.
I just don't see how Zigbee/WiFi interference would come into play here, if the problem never affects [2] and [3]. Or am I missing something?

Thanks for sharing. This is my best shot at what could be happening. It is an interesting concern.

First, I never thought the excessive delays were on the cable side anywhere, given your configuration. I do still feel the delay is due to either RF (WiFi/Zigbee) induced device reception errors OR an overloaded router with too many devices (I have seen this in my WiFi-loaded house). When a valid response is not received by the Hubitat instantiation of the device, the delays then occur.

Hubitat delays are in the driver design, i.e. what the Kasa Hubitat device instantiation does. Why you see long times on the Kasa devices is the driver's built-in error handling. When a response is not received by the instantiation within 4 seconds, the command is automatically retried (with a debug message). If the second attempt fails, the app is asked to poll for Kasa devices to check if the IP was updated, updating as necessary. The command is then retried (third time). After the third attempt, retries are disabled until a new command succeeds.
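
To illustrate that flow (this is not the actual built-in code, just a rough outline with made-up names):

    // Rough outline of the retry flow described above; all names are hypothetical.
    def sendCmd(String command) {
        state.lastCommand = command
        runIn(4, "noResponse")               // arm the 4-second no-response timer
        sendLanCommand(command)              // hypothetical helper that builds and sends the HubAction
    }

    def handleResponse(message) {
        unschedule("noResponse")             // got a reply, cancel the timer
        state.retryCount = 0                 // a successful command re-enables retries
        // ... normal parsing ...
    }

    def noResponse() {
        state.retryCount = (state.retryCount ?: 0) + 1
        if (state.retryCount == 1) {
            log.debug "no response, retrying"
            sendCmd(state.lastCommand)
        } else if (state.retryCount == 2) {
            log.debug "still no response, asking the app to re-check device IPs"
            parent.updateDeviceIps()         // hypothetical app method standing in for the device poll
            sendCmd(state.lastCommand)
        } else {
            log.warn "giving up; retries disabled until the next successful command"
        }
    }

Each failed attempt adds its 4-second wait before the next try, which is where multi-second response times come from when a reply goes missing.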

Explanation of RF interference (since I do not know your experience level): for the signal to be processed, the WiFi signal must be at least about 10 dB stronger than other interfering signals at the same frequency.

  • Zigbee: the error will pop up when Zigbee and WiFi comms occur at the same time. (This issue is well documented on the internet.) (In your system, this could occur at either the access point antenna or the plug antenna.)
  • WiFi: the other RF interference can be WiFi-to-WiFi. This occurs when the signals of independent networks overlap at the same time. If your router/firewall also has an active 2.4 GHz transmitter, this may be the case. This is usually solved by setting the 2.4 GHz channels so they do not overlap.

My explanation of router overload: routers are designed to handle a certain number of devices. They can handle more; however, when this occurs, they may drop a device that has not checked in for an extended period. (This happened to me several months ago. Devices were not on the list within the router. I researched it on the internet and then executed the assumed solution.) I mitigated this by adding a mesh node to my existing router. No more issues. (Note: my router is listed as handling 30 devices, and it worked fine until I exceeded that number. I simply added the mesh node designed for the router and it worked fine.)

For those who prefer pictures: :slight_smile:

I hesitate to get involved here, but these latency values seem pretty vague. There seems to be a lot of analysis based on insufficient information.

The easy one first... I assume that when you say "[2]+[3] take 6-7ms", you are measuring this based on transmit and receive of UDP packets on the single interface on the firewall, correct? Assuming so, I generally believe this measurement, but you want to ensure a large enough sample set to confirm that the variance is small.

When you say "[1] takes about 10ms", what actually are you measuring? How are you determining the start point for the timing? And how are you determining the end point? What is the variance?

If you are basing this on timestamps in the hub, they are not that precise. In particular:

  • There may be a large amount of delay and variance between the point in time that a driver performs an action and the point at which a timestamp for that event is taken, for either the event or the log.
  • The event timestamp and the log timestamp on the hub are not the same thing. They frequently vary by at least 1ms.
  • Timestamps in the hub cannot be correlated with timestamps in other systems at a millisecond level. The hub's NTP implementation is not that precise.

If you have span or mirror port capability on the switch, you can at least reliably measure the round trip time from the hub to the Kasa device. If you have a large enough sample set this would allow you to remove the switch/firewall/switch/ap/kasa from consideration.

I wrote a small app that loops around writing UDP packets to an echo server running on my laptop on the same LAN (C7 wired to the switch, then over WiFi to the laptop). The timestamp taken just prior to executing sendHubCommand() is used as the payload, which allows calculating the command-to-response-handler round-trip time while taking the Kasa device out of the equation.
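Roughly, the sending side of the test looked like this (a sketch with made-up names, shown driver-style; the echo server address/port are placeholders for my laptop, which just sends every datagram straight back, and the decoding assumes the response payload arrives hex-encoded):

    // Test loop: fire 8 UDP packets at an echo server, payload = send timestamp.
    def runBurst() {
        8.times {
            String payloadHex = now().toString().getBytes().encodeHex().toString()
            sendHubCommand(new hubitat.device.HubAction(
                payloadHex,
                hubitat.device.Protocol.LAN,
                [type: hubitat.device.HubAction.Type.LAN_TYPE_UDPCLIENT,
                 destinationAddress: "192.168.1.50:7777",     // placeholder: laptop echo server
                 encoding: hubitat.device.HubAction.Encoding.HEX_STRING,
                 callback: "echoParse"]))
        }
        runIn(2, "runBurst")                                  // repeat every 2 seconds
    }

    def echoParse(message) {
        def resp = parseLanMessage(message)
        long sentAt = Long.parseLong(new String(resp.payload.decodeHex()))
        log.info "round trip: ${now() - sentAt} ms"
    }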

  • calls typically return in just over 100ms, but sometimes more, usually in integer multiples of 100ms
  • there clearly is a thread pool of size 8 for these commands, I assume per app/driver; overflowing the thread pool delays the queued calls by 100 ms (and I get those platform warnings re. pending commands)
  • I tried adding a known delay to the echo on the server side and the 100 ms pattern remains

I am not sure why I sometimes see up to 500 or 600ms round-trip times. I assume it's not network related but I didn't check; those times being very close to integer multiples of 100ms sure seem to correlate with the hub being busy processing something more important in its command queue.

I ran out of play time, might look into it further at some point.

Yea, I didn't even get around to OS timestamp granularity, OS thread scheduling, JVM timestamp granularity, JVM task scheduling, etc.

I would have expected a better granularity than 100ms, but I wouldn't really expect accuracy below 20ms in a JVM unless it's been specifically tuned for latency.

Yup. I was more interested in understanding the limits and what might lead to apparent multi-second delays.

The test calls sendHubCommand() 8 times (filling the thread pool) every 2 seconds. Probably hammering the hub a lot harder than any driver or app out there, but it revealed that, when running the test for a longer period, there were instances where the calls stacked up for several seconds.

Looking at the logs on the echo server side also revealed that a new socket appears to be created (and closed, I assume) for every call (I did not notice any pattern of reuse). I suppose that's not easily avoided, since a different callback can be assigned to every single call. It adds to the overhead nonetheless.
