RM rule is not finishing

vitaliy_kh · July 5, 2023, 1:45am

Here is a rule in question:

Rule works just fine about 99.9% but once in a while Private Boolean is not resetting to
TRUE and as a result rule becomes non functional.
Of course, there is no logs (logs are off) to see at what exact point rule stopped.
But how even in theory how this rule may not finish?
There are few Wait for Expression statements but all they are protected with Timeouts.
What I am missing?

bertabcd1234 · July 5, 2023, 2:09am

Yeah, you're going to need those to troubleshoot...

The only thing I can think of off-hand is if your rule happens to re-trigger within the split second between your initial trigger and the first action, which sets PB to False. (This is unlikely, but could be possible, depending on your device/driver--not very likely with well-behaved real-word devices using well-written drivers, given your trigger.) But, as usual, Logs will tell you for sure what is happening--anything else is just a guess.

vitaliy_kh · July 5, 2023, 3:01am

This is absolutely impossible. The trigger for this rule is a self resetable Virtual Switch
with 2 sec auto reset time. This is done specifically to prevent potential retriggering.
Multiple buttons could turn this Virtual Switch on. But when switch is already on
it will stay on for 2 sec.

Debugging something which happens very seldom with logs is not easy task.
But say, log will show rule stopped at whatever line. Than what?
The problem is not easy reproducible. It will not be something like: yes, here is a
problem. That is why I am asking how in theory this rule may not go to the end.
Potential retriggering in a split second is not a case.
What else it could be?
What if timeout expiration and expression becomes true happens at nearly the
same time? Is this a possibility for failure?

bertabcd1234 · July 5, 2023, 3:20am

Then other log entries, device events, or application state may provide clues as to what actually went wrong.

But without log entries, it's all guessing. I don't want to spend my time doing that, and I suggest you don't, either.

vitaliy_kh · July 5, 2023, 7:36pm

The key point in the above statement is "MAY".
My experience for chasing very complex not easy repeatable problems like described above
by digging trough the gazillions of logs (assuming they are available) never was productive.
By Murphy Law the key entry always was missing. However guessing was somewhat helpful.

Talking about guessing.
Yes, guessing for the person observing the behavior of Black Box is not productive.
However for the person who know a lot (if not all) implementation details the object in
question is not a Black Box and just guessing could be very productive. In a past I was
able to solve few very complex problems just by guessing because I knew very well a lot
of implementation related details.

Anyway, I will enable the reflated logs and MAY BE the problem will be somewhat
revealed. Most likely the problem will be related to the RM (or even Platform) implementation
details. The RM rule in question is near linear and I (and you as well) don't see any good reason
why rule may not finish. Also all involved devices are Virtual (not physical), I.e. all potential
problems related to the device drivers and communications protocol are out of equation.
This leaves on a surface just RM and/or Platform implementation details (which are a Black
Box for me but not for you).

The reason for this post was to point the HE developers attention on the hard to reproduce
and therefore difficult to debug potential problem.

I don't see any reason to continue this discussion. At this point it should be stopped.
Anyway, Thank you for your time.

bertabcd1234 · July 5, 2023, 7:49pm

At the risk of continuing the discussion, your post made think of two things:

1:

If you want to know more about how something you write works, I would highly encourage you to write a custom app. There are many new developer docs, especially ones geared towards writing apps, if you haven't looked recently. It seems like you have the skills to learn this, and as has been suggested to you multiple times in the past, you might be happier with the level of control you can get by doing things this way.

2:

I can 100% assure you that they would want log entries, at a minimum, before looking at such a problem. (And it won't be "gazillions"--you can filter the display to a single app, this rule.)

vitaliy_kh · August 13, 2023, 1:14pm

UPDATE

Finally RM rule did not complete and the related log was captured.
Here is a rule in question (once in a while it fails to complete leaving Private Boolean = false):

And here is a related log when rule failed to complete:

This line:

Action: Wait for Expression: NOTBd Balcony Door Open/Close VSwitch C7(off) is off(T) [FALSE] --> timeout: 0:00:15

was a show stopper.

Here is an involved VSwitch status log:

Wait started at: 2023-08-12 03:59:23.257 PM
VSwitch reported ON condition at: 2023-08-12 03:59:23.195 PM EDT
(even a bit before Wait was started).
Wait is protected by 15sec timeout.
And Wait failed to finish either by condition became true or timeout

So what is it about?
My good guess, this is some sort of racing condition(s)?

PS.
I have few similar RM rules randomly failing the same way. I.e. rule is not finishing to the end
leaving some control variables in a wrong state.
What I am doing wrong?

bertabcd1234 · August 13, 2023, 6:16pm

What did the App Status page have during this time, particularly for event subscriptions and scheduled jobs? You appear to have clicked "Done" or "Update Rule" after this, so it's too late to check now, but those are the other pieces I mentioned above that may give some insight. (But I don't see any reason this should have happened.)

vitaliy_kh · August 13, 2023, 6:45pm

Yes, I had to manually reset Private Boolean to TRUE. I need this rule to be functional.

YES! And this is my point and concern. This NEVER must be happening but occasionally
it does. And as I mentioned, this is not the only one rule, there are few more rules like this.
The commonalty is a "programming style". All my "wait for expression' rules are near always
protected by TIMEOUT. The reason is - rule NEVER should/must be waiting forever and
ALWAYS must complete. The result is a bit opposite - these rules occasionally simply not
completing. Do you have an idea how to fix this phenomena?
I am thinking about adding something like "If Private Boolean stays FALSE too long reset
it to TRUE". But this will be just a "masking tape" for the real problem, whatever it is.

vitaliy_kh · August 26, 2023, 3:31pm

@bertabcd1234

Again RM rule did not finish!
Of course, there is no logs, etc. But what could be reason why this near linear RM rule set
Private Boolean to FALSE but did not reset it top TRUE?
Herse is a rule:

To my eyes there is absolutely no reason why this rule did not complete.
Occasionally I have this failure with many different RM rules. Of course, I cannot turn on logging
for every RM rule and failure is very seldom (which is good from one side but makes it near
impossible to debug). Something is going on and I really want to know what.

EDIT
The commonality between all occasionally failing rules is a "WAIT for Expression" statement
protected by TIMEOUT. My good educated guess: The re is some sort of race conditions
between "timeout expired" and EXPRESSION becomes true. If this two events coincide
the WAIT statement hangs. Once this actually was catch in a log (please see previous
post with logs). To my eyes this smells like platform related problem.

bertabcd1234 · August 26, 2023, 4:15pm

Might be one for @bravenel, but some notes:

Without logging for this rule, it's not possible to say much. (Why "can't" you enable it?) The questions above would still also be interesting to know.
If you think there is a problem with this particular action, you should be able to demonstrate it in a simple rule that does only that.

bravenel · August 26, 2023, 4:30pm

All display of PB show true, so I don't see what you're talking about -- and this is a static page, not updated by anything other than opening or refreshing the page. Without logs, this is a pointless discussion as there are no facts to discuss.

vitaliy_kh · August 26, 2023, 4:36pm

As I mentioned many times - many different rules occasionally (once 2-3 month) exhibit this
behavior. Therefore enabling ALL logs for All rules is impractical. The commonality is a
"Wait for Expression" statements in the rules. Near all "Wait for Expression" statement are protected by TIMEOUT (this my good EE design practice, it must not be a condition which will
be waiting forever).

What could be simpler than the rule above? Certainly something running say every 10 sec and
waiting for whatever EXPRESSION generated by another periodically running rule may catch the
problem much faster. The problem is - I don't have a spare hub just for testing. Two my workhorse
hubs (C7, and C8) are very busy with many automations running basically very smoothly. I cannot
use these to hubs for the very intensive debugging. But somebody from HE developers definitely
should/must have hub dedicated just for testing and debugging. Plus developers may have
much better idea how to create an environment for the debugging and testing.
The reason for this post(s) is to point developers for the potential hard to catch problem.

vitaliy_kh · August 26, 2023, 4:41pm

Yes, because I did manually reset PB to TRUE (I need this automation to be alive).
The problem is - on a previous run the rule did not complete leaving PB in a FALSE state
(and of course, logs was not turned on).
The logs shoving this problem (for a different rule) is few posts above.
Rule hang on "Wait for EXPESSION" statement protected by TIMEOUT. And it looks like
it happens with many rule with the same wait segments.

bertabcd1234 · August 26, 2023, 4:43pm

A rule with just the one particular action that you think is causing the problem (and maybe something after it to demonstrate).

bravenel · August 26, 2023, 4:46pm

Why not? Logging is not expensive, and it's the only way to prove a problem is what you think it might be. There are filters available to just look at the logs for a specific rule, to show what happens. This would be good engineering practice, having logging on for any rule for which you have some suspicion of proper function.

vitaliy_kh · August 26, 2023, 4:48pm

Did you understand my explanations above and a reasons why I cannot setup full scale
testing and debugging environment? With a real live runs it takes forever for this failure
to happen and every time different rule is failing. The commonality is a "Wait for EXPRESSION"
statements protected by TIMEOUT.

bravenel · August 26, 2023, 4:52pm

I have tested this extensively, and cannot get it to fail. I have also studied the code involved, with the same outcome. There could possibly be a race condition concerning the precise sequence of events, with simultaneous condition change and timer expiration, although the odds against this are extremely high. This could only be shown with logs. It's worth noting that an event causing condition change and a timer expiration event could launch two different instances of the rule, thus potentially opening the door to a race condition.

vitaliy_kh · August 26, 2023, 4:56pm

I have a lot of rules (maybe near 200). And forest of logs for all this rules will be expensive.
Navigation with filters among all this logs is a nightmare (I did try this approach).
In few post above there is a log showing a failure (I was waiting for this event few moths).
How many time you will need to see a failure? As you know, I am EE with huge experience
of designing and debugging very complex hw (yes, with embedded CPUs). I know very
well how to setup a debugging environment and debug a complex problems.
But I don't have the ability to do the same things on my two running hubs.

bravenel · August 26, 2023, 5:02pm

Not true. Logs are not expensive. I have hundreds of things logging constantly on my home hub, and there are no issues with this.

Look, it's up to you. If you really want to show that there is a problem you can show the logs, or just wave your hands about it with no logs.

So, turn on the logging. Without it, there is no point in this dialog.