RM Rule mystery

@bravenel @jtp10181

This is a continuation for this thread:

HUB is C8 Pro running latest Platform 2.3.9.193
The Device is ZEN-17 (very old original verision)

And driver is Jeff's "Zooz ZEN17 Universal Relay Advanced" latest version driver v1.3.0

In short:
Everything is 100% correct when I am running this mystery rule(s) ether manually or automatically when I change a trigger time from 1:30am (night time run) to whatever closest time for testing.
However when rule runs automatically at the desired night time (1:30am) all the related logs are very confusing.
I am 100+% lost in this forest.

Here are a rules in question and a related logs (I am sorry, it could be a long list).
Mystery - 1
This very simply rule was suggested by Bruce as a second check for the Relay-1 status:

Mystery - 1
The rule is always triggered for all test runs but never triggered for the real night runs
(log will be displayed later in the combined log for the main rule).

This is a real pump control rule:

Log for the automatic run at a closest time, 100% correct, no problems
(log is the same for the manual run actions):

Mystery-2
Here is a log from night run at 1:30am:

The test "ZEN-17 Watch Relay-1" was not triggered at all.
Two test runs were triggered but automatic did not:

However the Device Event page clearly logged a device activity around 1:30am:

And here is a log from previous day run around 1:30am

According to this log pump was running twice:
First run for 5 sec:


...

and second run for 4 sec after 13 sec delay:


...

As I mentioned above, I am 100+% lost on what is going on with this two rules.
It looks like everything is OK at day time (all my test runs are 100% correct) but at a night time around 1:30 am something is not right. All logs for night run are very confusing plus never the same.

I am open to any ideas for the debugging this mysterious rules behavior.

PS.
I will run this rules at 1:30am few more times and than switch to day time run.
But any ideas are very welcome.

I am seeing where your rule fired at 1:30AM and then sent a command to the device. BUT there is nothing showing the device actually responded. This is why the "watch" rule did not fire, no state change event got recorded.

Here are the events when it worked:

Vs the 1:30 AM run which did not work (no switch events means device did not respond)

The device does respond. However "watch relay" rule triggers only if "physical" event is present/logged. This is absolutely unexpected behavior (BTW, this behavior was observed with built-in driver as well).
During my test runs (performed many times) sometime Device Event page shows both "command" and "physical" events but sometime (more often) it shows only a "command" events. BUT RELAY ALWAYS CLIKED REGARDLESS.
So, actual physical device never exhibited any problems and always responded 100% and really fast. This behavior is very consistent for all my test runs for rules and/or device itself.

All "who knows what" happened only when rule was automatically triggered at 1.30am.
Few times I ran this rule by changing trigger time to closest minute.
Guess what - it was 100% successful runs.
At this time I have no any idea what test should be created and running. I am 100+% lost.
But I am open to any ideas what to do next.
This setup has unused Relay-2. With this spare relay I can do any testing.
In addition I have so far spare ZEN-17 (most likely the same HW/FW revisions).
But I need an idea(s) what to do.

If the device responded it would be in the event history. How do you know it responded? Do you have a zniffer setup?

This is an exact expectation but unfortunately not a case.
And I REALLY want to know why this is not a case.

As I said, by watching running pump (a lot of extra vinegar already dispensed) and listening a relay clicks. But only while I was running tests, not at a night time when all unexpected things actually happened.

Believe it or not but yes:

But if you want me to use it efficiently I will need a little guidance for the exact sniffer settings.

EDIT
This is a captured Relay-2 On/Off activity:
|

Exactly, you dont know what the device did on the overnight runs.

Do you have any zniffer logs from the 1:30 AM range overnight? If you send me the trace file I could take a peek at it. I would just need to know the DNI of the device in question.

Might be good to have a trace from when it worked correctly also, to compare.

Of course, no. I turned on a sniffer only after you asked if I have one.
Sure, I can capture ZWave traffic over night and send it to you.
Is it any specific zniffer settings I should apply or just run it "as is" with all defaults.

PS.
By Murphy Law (which never fails) everything will be just fine because a lot of instrumentation (logs plus zniffer) will be ready to capture a problem.

You could filter it to just the device DNI we are worried about if you want, but I suspect it will be pretty quiet overnight anyway so it may not make a huge difference. Otherwise I usually just run it wide open and then filter it after I capture stuff.

You may want to filter it to just your "Home ID" so you dont pick up any other z-wave traffic on accident and clutter things up.

OK, I will run zniffer over night with all logs still enabled and will send you all the results tomorrow. Let see what will be trapped.

Unfortunately I missed a zniffer trace from overnight run (stupid me forgot to turn off PC Sleep option). Today I will turn off PC Sleep timer.
But here is another observation.
Once again - no matter what I am trying ALL my test runs (manual and/or triggered) are always 100% successful. I cannot make it running the same way as overnight run.
But from the Device Page "physical" events are always logged followed by "commands" and timing is always correct (for this test run I captured a xniffer trace file but how go I send it to you?):


And Physical Events for Parent Device is also present in a log if they are in a log for child:

Occasionally "physical" events were captured about 20 sec past "command" events.
This is still a BIG question why this is happening once in while.

However around 1.30am "physical" events are not logged in every possible logs.
Plus "Watch Relay-1" rule never triggered as well.
All this is a more or less clear pointer to Relay-1 actually did not turned on even an ON Command clearly was sent out by rule. I don't think this is related to yours custom driver because all this saga started with my observation of abnormal rule behavior with built-in driver.

And the biggest mystery is - all abnormal behavior happens only at 1:30am.

I hope, this mystery will be eventually solved.

Does the rule need to be exactly at 1:30 AM? You could try something like 1:42 AM or some other time slightly earlier that 1:30. Maybe the hub is doing something at exactly that time causing an issue. Still you would expect it to be reliable at any time but it is a worthwhile test to see if it is specifically just 1:30 where it fails or other times around there as well.

Also, if you are able to get a zniffer log of when we suspect it is failing I will be able to see what happens exactly. The question still remains, does he hub send out a command to the device and the device does not respond, or maybe the hub is not sending anything out. There is no good way to tell without h zniffer logs.

No, it could be ANY time including a day time. 1:30am was randomly picked up assuming hub should be near idle at this time. I can/will change rule run time to anything else but before I would like to get a clear answer on what is going on. Who knows, It could be very well hidden problem. This behavior is near 100% consistent. zniffer trace should provide a much better understanding on what is going on a Zwave physical layer. I really want to get a very clear answer. I promises to capture a trace file when all mysterious behavior will be in place.

EDIT

I have another active ZEN-17 controlled by buit-in driver (this is balcony door actuator controller).
The Device Event log always has a "physical" events logged together with "commands". However the Device Event log for the parent device is empty. This could be due to a differences between buit-in driver and yours custom.

1 Like

@bravenel since you were involved in the original thread
and @bcopeland since I think it is z-wave related.

I got zniffer traces via PM and it appears the issue at 1:30 is due to the good old ZWaveNodeStatisticsJob problem. Looks like the ON command from the hub delays for 20 seconds and by then the rule has queued up an OFF command so they both go out one after the other. So the device just stays off and replies back that it is off.

This coincides with the ~20 second delay that @velvetfoot originally observed a while back.

@vitaliy_kh is going to move the rule out a few minutes and check again tomorrow.

Here is the On and Off commands (~25ms apart) and then the device response saying that it is off.

Shifting rule run time from 1:30a to 1:35a "fixed" the problem.
Rule now runs exactly as it should.

BTW, I have few other rules starting at 6:30a.
They are randomly failing but not 100% consistent as a rule in question.

It looks this problem is very well known (somehow I missed this thread):

and must be addressed.

In a past I observed random failure or random rules and actually reported this problem multiple times. I never paid an attention on timestamps when rules were failing but I could be on a 30 min boundary. Probability for the random trigger may hit this condition is low but far from zero.

Many BIG thanks to @jtp10181 for chasing this problem.
I hop finally the problem will be addressed and fixed.

1 Like

I could not remember if I had turned that stats job off or not but I went ahead and turned it off on my main hub. You could turn it off to fix the issue and monitor the zwave details to see what impact it has.

The method to turn off is here: Z-Wave Pause / Disruption Every 30 minutes - #11 by velvetfoot

Already done.

Make sure you reboot afterwards, which I realized this morning. You can confirm in Logs > Scheduled jobs and search for "zwave". Before rebooting it was still there, after reboot it is gone.

Yes, I did reboot a hub. I checked "Scheduled Jobs" for "zwave" and it is gone.

Unfortunately this is not a real fix for the problem.
Now I can understand a reason for many unexplained random failures of otherwise stable rules. As I already mentioned, this abnormal behavior was reported many times but unfortunately was ignored.

@bravenel, I hope, this time the real fix will come (hopefully) in a next release.

PS.
I checked my rules and many time-based Triggers are "naturally" (why anybody will use something like hh:01/hh:31 instead of very logical hh:00/hh:30) aligned to the 30min boundary.
From the other side Platform should/must schedule potentially distractive tasks outside the hh:00/hh:30 boundaries.

Observation and summary:

The problem related to the ZWaveNodeStatisticsJob hub-internal task happens to be well know (to the stuff) but unfortunately not widely announced. If I got this right, this internal task is scheduled to run every 30min on hh:00/hh:30 boundaries. Any rule happens to be running at the same time is impacted by this task resulting in a huge (around 20sec) delays for executing ZWave related commands such as On, Off, etc.

My rule in question was scheduled to run at very bad time (1:30a). In addition the rule has On and Off commands spaced by 3 sec which is far less than potential delays. Also the Device (ZEN-17) is programmed for auto-off at a timeout set by rule but this was not a part of a problem. Actually the problem was an added Off command in case if auto-off failed. All this resulted in very 100% consistent rule failure when rule was running at a specified time.
However this rule never failed during every possible test because test never ran at a 30min boundaries.

Shifting run time to 1:35a "fixed" the problem.
Also now I can understand why many rules occasionally randomly failed without any apparent reason. The question was asked many times but I never got a good answer on what is going on and what could be the reason.
Now I know.

Also by suggestion from @jtp10181 I turned off this problematic hub-internal task.

1 Like

UPDATE

Hub was updated to the latest 2.3.9.196 platform.
Foe the sanity check I changed a Trigger Time to original 1:30am.
Rule which was constantly failing before now runs flawlessly.
@bcopeland BIG Thank You for the real ZWave problem fix.

1 Like