Cron executions seem to make z-wave devices unresponsive

Hi,
Has anyone noticed that their Z-Wave devices lag, or even become unresponsive, at the precise moment cron-scheduled executions happen?

I've been investigating the issue for a while now and there appears to be a clear, direct correlation between these two events:

  1. To assess the problem, I run a "flash" command on a Z-Wave Plus switch. It lets me concretely "hear" when the Z-Wave network lags.

  2. I disabled all user-based custom apps/drivers. That brought some improvement, but the flashing switch would still lag at times. Simply put, by doing so I considerably reduced the number of cron-scheduled executions, hence the improvement. However, the lagging/unresponsiveness would still happen every time a device sent scheduled reports.

  3. I can clearly see in the logs that the lag occurred exactly when some devices returned values such as voltage, wattage, etc.

  4. I then tried to reproduce the same symptom (lagging in the switch flash execution) by forcing some devices to refresh and return these values over and over again. It did not produce the same effect at all!

That's when I noticed/concluded that it appeared to do this only when a command was executed from a cron scheduled task. Whenever there is a cron-based execution, my entire z-wave mesh starts lagging.

It looks like there is a problem with cron-scheduled executions in general.

In case this was due to my particular hub/DB/firmware being corrupted, here is what I have already tried, to no avail:

a. Cold reboot.

b. Soft reset/restore

Please let me know where I should look to resolve this issue or if you are aware of a known case of the sort.

I opened a ticket with customer service but got no answer after a week.

Thanks in advance for any help you can provide.

You know what they say, correlation is not always causation. Based on the details you shared, flashing plus power reporting is most likely overwhelming the Z-Wave radio, to the point that it becomes unresponsive. This is a common symptom for a mesh with multiple power-reporting devices.

Thanks so much for your answer and your help @bobbyD !

You are totally correct! I have considered both these concerns: a) causation/correlation and b) taxing the mesh with the flashing.

However, when I hit "refresh" on a device, inducing new reports, there is no problem. OK, I get that this only returns the "last reported value", so it is just a request to the DB.

So I also tried setting two power devices to report every 1 second. No issue (of course, it would be a problem in the long run).

As I understand it, 1-minute reports use cron expressions, while 1-, 5- or even 30-second reports probably use runIn() calls. Am I right? If so, then it still points to cron being the problem.

Also, the problem appears every time one of my apps (some of which have been running unchanged for months) runs a cron-based scheduled task...

My main concern is that these Z-Wave problems seem to have appeared while I had made no changes to the mesh (and it is quite a strong mesh; I've learned my lesson in that regard, I think...) nor any substantial change to my config in general.

Except for 1 device that needs to report its wattage every 30 seconds, my devices report at intervals of 1 minute or longer, most of them every 5+ minutes.

Let me know what you think and thanks again for your greatly appreciated input, as always.

Devices don't usually have driver schedules. They use internal parameters to schedule when to generate a report. You can see the schedules on the driver side on the Device Details page, under "Scheduled Jobs".

You may not have made physical changes, but the radio and its routing table change constantly, with or without your input. Devices can go bad, radio interference can happen, and new routes can be established, all behind the scenes and without the user's intervention.

Most problems happen when many devices hit the radio simultaneously. You can have 10 devices reporting power every hour, but if all 10 hit the radio at the same time, the radio will struggle.
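To make the "10 devices, once an hour" point concrete, here is a rough Python sketch. It is only a toy model, not anything the hub actually runs: the function name, the one-second buckets and the evenly spread offsets are all assumptions made for illustration.

```python
from collections import Counter

def peak_reports_per_second(offsets_s, interval_s, horizon_s):
    """Return the largest number of reports that land in the same second.

    offsets_s: hypothetical start offset (seconds) of each device's schedule.
    """
    hits = Counter()
    for offset in offsets_s:
        t = offset
        while t < horizon_s:
            hits[t] += 1          # one report from this device at second t
            t += interval_s
    return max(hits.values()) if hits else 0

INTERVAL = 3600        # every device reports once an hour
HORIZON = 24 * 3600    # simulate one day
N = 10                 # number of reporting devices

aligned = [0] * N                                     # all fire on the hour
staggered = [i * (INTERVAL // N) for i in range(N)]   # spread 6 minutes apart

peak_aligned = peak_reports_per_second(aligned, INTERVAL, HORIZON)
peak_staggered = peak_reports_per_second(staggered, INTERVAL, HORIZON)
print(peak_aligned, peak_staggered)
```

Same number of reports per day in both cases; the only difference is whether they pile up in the same second or arrive one at a time.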

Do you know a way to quickly identify a troubled device (on a C5 hub)?

I'm definitely going to look into this in more detail. It seems I'm using a community-based driver, "Aeotec Heavy Duty Smart Switch", that uses cron-scheduled tasks. Deactivating this device seems to do the trick for now, but I need to see what happens when I reactivate my other apps. I'll let you know.

OK, deactivating this device did nothing in the long run. I came back home today and, as has been the case every day for the past 2 weeks, my entire mesh is down. The biggest issue is that I don't know how to pinpoint suspect devices. The same config (same apps) was working fine 2 weeks ago, so obviously there's a faulty Z-Wave device somewhere generating queuing of some sort...

More likely, a device that was repeating died, and other devices on the edge of your mesh are having trouble communicating and keep resending messages to try to get through, in essence spamming your network.

I had one of my main switches die that tons of things were using to route through and while the mesh was not totally down, many devices were not working.

I have a lot of devices in a relatively small place (New York). Would your hypothesis still be valid if disabling and re-enabling Z-Wave makes everything work properly again for several minutes?

It could be a device that only works intermittently. But you are correct: if it works at startup and then stops, it could be a bad device.
Most likely it is some kind of battery-operated sensor on the edge of your mesh. Look there first.

If you turn debugging on for your devices, you may see tons of duplicate messages coming in. Hopefully that will help you narrow it down. Or just start removing batteries from a few devices at a time, or air-gapping switches, to see if it helps. I doubt it is a switch, though; those don't usually spam with reports... it's almost always a sensor.
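If you can copy the debug log out somewhere, a rough way to spot the spammers is to count repeats per device. Below is a small Python sketch of that idea. The `(timestamp, device, payload)` tuples, the device names and the 2-second window are all made up for illustration; real log lines would need to be parsed into that shape first.

```python
from collections import defaultdict

def count_duplicates(entries, window_s=2.0):
    """Count messages that repeat the same payload from the same device
    within window_s seconds of the previous identical message."""
    last_seen = {}              # (device, payload) -> timestamp of last message
    dupes = defaultdict(int)    # device -> number of repeated messages
    for ts, device, payload in sorted(entries):
        key = (device, payload)
        prev = last_seen.get(key)
        if prev is not None and ts - prev <= window_s:
            dupes[device] += 1
        last_seen[key] = ts
    return dict(dupes)

# Hypothetical log entries: (seconds since start, device name, reported value)
sample_log = [
    (0.0, "Hall multisensor", "temperature=21.5"),
    (0.4, "Hall multisensor", "temperature=21.5"),
    (0.9, "Hall multisensor", "temperature=21.5"),
    (5.0, "Kitchen switch", "switch=on"),
    (60.0, "Hall multisensor", "temperature=21.6"),
    (65.0, "Kitchen switch", "switch=off"),
]

duplicates = count_duplicates(sample_log)
print(duplicates)
```

A device that shows up here with a high count is a good first candidate for pulling the battery.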

Thank you. The way I understand it, a battery-operated device should not affect the mesh since it never acts as a repeater, right?

No, but those are the ones, especially temperature sensors or multi-sensors, that seem to send frequent reports, tying up and bringing down a mesh. Especially if they have marginal connectivity, say at the edge of the mesh, and continually keep resending.

Looks like you were right, actually. I had a couple of Aeon Multisensors that seemed to be making a mess. But that wasn't all, and I found several other culprits... Some people (I won't name them... suffice it to say that they're miniature human beings...) in my household had unplugged a couple of devices... lol

What helped the most was actually simply looking at which devices had not thrown any event in the past week. I should have thought of this sooner.
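For anyone who wants to automate that check outside the hub, here is a minimal Python sketch of the same idea. The device names, dates and the `last_events` mapping are hypothetical; you would have to fill it in from wherever you can read each device's last event time.

```python
from datetime import datetime, timedelta

def silent_devices(last_events, now, max_silence=timedelta(days=7)):
    """Return the names of devices whose last event is older than
    max_silence, or that have no recorded event at all."""
    return sorted(
        name
        for name, last in last_events.items()
        if last is None or now - last > max_silence
    )

# Hypothetical snapshot of "last event" times per device
now = datetime(2021, 3, 15, 12, 0)
last_events = {
    "Hall multisensor": now - timedelta(days=9),
    "Kitchen switch": now - timedelta(hours=2),
    "Garage contact sensor": None,                       # never reported
    "Porch outlet": now - timedelta(days=7, minutes=1),  # just over a week
}

stale = silent_devices(last_events, now)
print(stale)
```

Anything in that list is either unplugged, out of battery, or out of range, so it is worth checking those first.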

I use "device activity check". Works great to let me know of low battery levels and devices not communicating. In fact, I was notified this morning about a contact sensor that is MIA.