Another Wait for Expression RM failure

vitaliy_kh · February 14, 2024, 1:35am

Sorry, I have reported "rule failed to finish" cases multiple times, the first time about 2 years ago.
But I didn't provide all the required info because random rules failed at random and, of course, logs weren't turned on.

bravenel · February 14, 2024, 1:54am

Yeah, Duration is like Stays, the timer starts when the Expression becomes true.
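As a rough illustration of that timing, here is a toy Python model (not Rule Machine code). The function name `wait_completes_at` and the sample format are hypothetical, and it assumes that the timer is dropped and restarted if the Expression goes false before the Duration has elapsed, which is how "stays" is commonly understood.

```python
def wait_completes_at(samples, duration):
    """Toy model of a wait with Duration.

    samples: time-ordered (time, expression_value) pairs.
    Returns the time the wait would finish, or None if the timer never starts.
    Assumption: a false sample before the timer expires resets the timer.
    """
    true_since = None
    for t, value in samples:
        # The timer expired before this sample arrived.
        if true_since is not None and t - true_since >= duration:
            return true_since + duration
        if value:
            if true_since is None:
                true_since = t  # timer starts when the expression becomes true
        else:
            true_since = None   # expression went false: timer is dropped
    # Still true after the last sample: the timer runs to completion.
    if true_since is not None:
        return true_since + duration
    return None
```

With a 60 second duration, an expression that goes true at 10, false at 30, and true again at 40 would finish at 100 in this model, not at 70.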

bravenel · February 14, 2024, 2:03am

Makes it pointless to pursue. You always have the chance to report with evidence, so I have a chance to attempt to reproduce the issue. Usually, these topics fall silent because you claim that it's not possible to find the logs. Sorry, but no logs makes it pointless to spend time on a reported problem, unless it is easily reproduced. Yours have not been reproducible, and we've been around this mulberry bush before.

Alan_F · February 14, 2024, 2:29am

So is there anything else I can look for if it goes into the state where it logs the 'wait' and then doesn't log anything after that?

bravenel · February 14, 2024, 3:19am

Once the Wait starts, which happens when the Expression is false, the rule creates some Event Subscriptions for the devices in the Expression. Show me those if you can catch them. That's how the rule would see events from those devices.

PunchCardPgmr · February 16, 2024, 3:56am

I'm grateful for this thread.

I'm building a new rule to control something starting around daybreak dependent on a contact.
I had set it up with a Required Expression that explicitly limited mistaken triggering at night.
But here's the key....I built this with a WAIT For Sunset to finish off the requisite actions in the Rule.

Sure enough in the first day I ran into a problem. I am not implicating the Wait For but reflecting on this thread had me thinking...what the hell are you hanging that Rule up ALL DAY LONG with a WAIT for ? Leaving it sitting there prone to whatever might occur for which you have not written in exceptions to handle.

So shaking the mindset that "of course this should be one single rule to handle the management of this one single task" I'm busting this out into TWO rules and adding conditions that check the state of things that otherwise would have been known/controlled in that previous single rule.

I feel much better not worrying about the WAIT For and anything happening to the Rule that I didn't expect or understand could happen to its flow.

vitaliy_kh · February 18, 2024, 10:34pm

@bravenel
Hub is C8 Pro @ 2.3.8.119
And here is one more failing rule (I have never seen anything like this before).

Here is a rule:

Here is an event subscription related to the latest failed run:

And here is a log:

Timer was started:

but at this time another trigger event re-triggered a rule:

and Wait for Expression did not wait for any events.

Now previously started 1 min timer expired

And the result is:
Action "Set Private Boolean True" is entirely missing.

It looks like the 3 lines at the top of the log simply belong to the end of the previous trigger.
This looks like evidence of a race condition, i.e. the re-triggered run interfered with the previously running instance. And because Private Boolean was not reset to True, this rule is now dead: it cannot be re-triggered. I think if triggering were not blocked, the rule could potentially self-heal on the next trigger.

bravenel · February 18, 2024, 10:47pm

What it looks like to me is that the Wait for Elapsed Time was not cancelled by the new trigger. I'll look into that. Why isn't the set PB true the last action instead of before that wait?

There are two instances executing at the same time, but that Wait for Elapsed should have been cancelled by the trigger.

You can toggle PB in the UI for the rule.

bravenel · February 18, 2024, 11:03pm

In a simple test, Wait for Elapsed Time is cancelled by a new Trigger. Adding in the Required Expression for PB, like yours, yields the same result -- namely that the Wait for Elapsed Time is cancelled by the new trigger.

However, in your case, this is undoubtedly a race condition, since your Wait for Elapsed Time and the new Trigger happen 4 milliseconds apart. The hub doesn't have that tight a timing for reading and writing state, which it would have to do since you have two instances of the rule running. The easy fix for this is to move your setting PB true to the end of the rule, where it belongs. Otherwise, no sure way to escape the race condition.

What you experienced is to be expected given what you have going on with that rule.

Well done on finding a 4 msec race condition
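The kind of race described in this post can be sketched with a toy Python model. This is not how the hub is implemented; `SharedState`, `second_trigger_cancels_wait`, and the 10 ms write latency are all made-up assumptions, chosen only to show why a second instance arriving 4 ms later might not see a wait that the first instance has just scheduled.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical persisted rule state; writes become visible after a latency."""
    write_latency_ms: int
    pending: list = field(default_factory=list)   # (visible_at_ms, key, value)
    visible: dict = field(default_factory=dict)

    def write(self, now_ms, key, value):
        self.pending.append((now_ms + self.write_latency_ms, key, value))

    def read(self, now_ms, key):
        # Apply every write whose latency has elapsed by now_ms.
        still_pending = []
        for visible_at, k, v in self.pending:
            if visible_at <= now_ms:
                self.visible[k] = v
            else:
                still_pending.append((visible_at, k, v))
        self.pending = still_pending
        return self.visible.get(key)


def second_trigger_cancels_wait(gap_ms, write_latency_ms=10):
    """Return True if instance B can see (and so cancel) instance A's wait."""
    state = SharedState(write_latency_ms)
    # Instance A starts its wait at t=0 and records it in shared state.
    state.write(0, "wait_scheduled", True)
    # Instance B is triggered gap_ms later and can cancel only what it sees.
    return state.read(gap_ms, "wait_scheduled") is True
```

In this model a trigger 4 ms after the wait starts misses it, while one 50 ms later cancels it as expected.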

vitaliy_kh · February 18, 2024, 11:35pm

The idea is to prevent re-triggering from multiple devices (motion sensors in this case) while any is still active, but to enable re-triggering as soon as all of them become inactive. If I move the PB reset to the end, the rule will certainly finish, but Presence will be reset as well. This would be undesired behavior. I guess I can split this rule into two rules in order to filter out the undesired Presence Status flickering.

vitaliy_kh · February 18, 2024, 11:37pm

I am an EE dealing with nanoseconds. So milliseconds in my environment are like a century.

vitaliy_kh · February 18, 2024, 11:41pm

This could be architecture-related, but why not instantly kill the already running instance on the next trigger event? To my eyes it would be a much cleaner implementation.

bravenel · February 19, 2024, 12:00am

And how are you to know that is the correct thing to do? Another option is to use single threaded app, which forces a second instance to wait. But this too has possible drawbacks.

Best solution is to not create rules where it is even an issue in the first place. Make it a "don't care" instead of a "gotcha".
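The "single threaded" option mentioned above, where a second instance is forced to wait, can be sketched in Python with a lock. This is only an analogy, not the hub's mechanism; `rule_lock`, `run_rule`, and the instance names are hypothetical.

```python
import threading
import time

rule_lock = threading.Lock()          # only one "instance" may run at a time
log = []
first_has_lock = threading.Event()    # lets the demo start B only once A is running


def run_rule(name, hold_s, signal=None):
    with rule_lock:                   # a second instance blocks here until the first ends
        log.append(f"{name} start")
        if signal is not None:
            signal.set()
        time.sleep(hold_s)            # stands in for the rule's actions
        log.append(f"{name} end")


a = threading.Thread(target=run_rule, args=("A", 0.05, first_has_lock))
a.start()
first_has_lock.wait()                 # make sure A holds the lock before B is triggered
b = threading.Thread(target=run_rule, args=("B", 0.0))
b.start()
a.join()
b.join()
```

Instance B never interleaves with A, which removes the race, but B is also delayed for as long as A runs. That delay is the drawback to weigh.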

vitaliy_kh · February 19, 2024, 12:33am

It is all based on my extensive experience in HW design. But yes, HW design is very different from SW design. My SW design skills are very limited.

And how should I know whether it is an issue or not? My rule above is very simple, and to my eyes it should not have any issues, but in fact it does. Because you know all the internals and implementation details very well, your view is very different from a user's. Step by step I am learning what should not be done, but this is not easy without knowing what is going on behind the scenes.

Anyway, as usual I learned one more little piece. Thank you for the explanations.

bravenel · February 19, 2024, 12:38am

Any time you (and I mean this personally) create a 'fancy' rule that isn't KISS, beware... Pretty much every rule you've posted that has a problem is 'fancy'. My rules are dwarfed in comparison. I don't do fancy.

The possibility of complex and dynamic changes in state, multiple triggers, changing Required Expression truth dynamically during rule execution... these are all elements for problems. KISS saves the day.

vitaliy_kh · February 19, 2024, 1:07am

Well, I don't think my rules are 'fancy'. I have been doing logic design (in HW) for many years, and so far whatever I have designed works very well (these are the words of my co-workers and managers).

What is 'fancy' in this specific rule? I explained to you why resetting PB is not the very last action.
To my eyes this rule is very simple and logical. But oops, it is problematic because multiple instances are running in parallel and interfering with each other.

Sure, creating multiple simple rules could be much better than one complex rule. But because there is no good rule management (something as simple as a folder for a set of related rules would do the job), many users (not just myself) create one big rule just for maintainability.

BTW, (just thinking out loud) is it possible to install multiple RMs?
If yes, this could be a very 'fancy' solution for managing rules.

bravenel · February 19, 2024, 1:44am

What would you do in an electrical circuit where there is a 4 nanosecond possible race condition? How would you guard against this? Would your circuit solve it by preventing it in the first place, or would you just take your chances?

For this rule why don't you add Duration to the Wait for Expression, since you want this to not fire again unless 1 minute has passed since there was motion? I don't understand your reasoning.

Using Rule Machine, once you allow for multiple simultaneous execution, as you intentionally do in this rule, you are opening yourself to this type of race condition. Seems odd that you'd get 4 msec separation, but you proved it possible.

vitaliy_kh · February 19, 2024, 8:36am

In HW design taking chances is absolutely prohibited. Everything must be 100% guaranteed.
In short, all external asynchronous events are first synchronized to the system clock (system event). After this first stage of synchronization everything becomes 100% predictable. The output of all logic equations must be stable before the next clock pulse. Because all delays in the logic components and the clock period are well known, it is not a big deal to satisfy these requirements.

No, I want to set the Virtual Presence Sensor to the "arrived" state from whichever Motion Sensor in the area becomes "active" first. Once this is done, I don't want the rule to be re-triggered by other MS in the area. However, resetting the Virtual Presence Sensor to the "departed" state must be done after all MS in the area become "inactive", plus a delay. My rule was designed to do exactly this, and most of the time it works as expected. However, there turns out to be a hidden problem. The rule works flawlessly, at least visually, if PB is not involved.
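The behavior described in this paragraph can be written down as a small state machine. The Python below is a sketch of the intended logic only, not a Rule Machine rule; `PresenceTracker`, its method names, and the sensor names in the usage note are hypothetical. It keeps all state in one object, so a second motion event updates that state instead of starting a competing instance.

```python
class PresenceTracker:
    """Arrive on the first active sensor; depart after all are inactive plus a delay."""

    def __init__(self, sensors, depart_delay):
        self.active = {name: False for name in sensors}
        self.depart_delay = depart_delay
        self.presence = "departed"
        self.all_inactive_since = None   # time at which the last sensor went inactive

    def on_motion(self, sensor, is_active, now):
        self.active[sensor] = is_active
        if is_active:
            self.presence = "arrived"
            self.all_inactive_since = None      # any activity cancels a pending departure
        elif not any(self.active.values()):
            self.all_inactive_since = now       # start the departure delay

    def tick(self, now):
        """Called periodically; applies the departure once the delay has elapsed."""
        if (self.all_inactive_since is not None
                and now - self.all_inactive_since >= self.depart_delay):
            self.presence = "departed"
            self.all_inactive_since = None
        return self.presence
```

For example, with sensors "hall" and "kitchen" and a 60 second delay, presence stays "arrived" while either sensor is active and flips to "departed" only 60 seconds after the last one goes inactive.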

Well, this looks like a BIG problem.
Almost every rule, after being triggered (i.e. started), has to wait for some events such as Delays, Conditions, etc. before finishing. However, the next Trigger Event may happen (and usually does) before the rule is complete. Examples are motion-based lighting control or my rule in question. Instead of restarting clean, a second instance (or even more than two) of the rule is created. The result is race conditions between multiple running instances and unnecessary memory use by already useless but still somewhat active instances. I am sorry to say this, but it looks like a system design problem. In HW it is impossible to create multiple running instances on the fly, but it is possible to instantiate them during system compilation. An example is multi-core CPUs.
I am not a SW designer, but instead of creating multiple running instances for each trigger event, why not reinitiate (restart clean) the already running instance?
This is exactly what is happening in HW.
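For readers wondering what "restart clean on each trigger" might look like in software, here is a minimal Python asyncio sketch. It is an illustration of the idea only and says nothing about how Rule Machine works or could work; `RestartingRule` and the timings are hypothetical.

```python
import asyncio


class RestartingRule:
    """Each new trigger cancels the run in progress and starts over."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.task = None
        self.completed = []          # ids of runs that reached the end

    async def _actions(self, run_id):
        await asyncio.sleep(self.delay_s)   # stands in for a Wait or Delay
        self.completed.append(run_id)

    def trigger(self, run_id):
        # Cancel the previous run, if it is still waiting, before starting a new one.
        if self.task is not None and not self.task.done():
            self.task.cancel()
        self.task = asyncio.create_task(self._actions(run_id))


async def demo():
    rule = RestartingRule(delay_s=0.05)
    rule.trigger(1)
    await asyncio.sleep(0.01)        # second trigger arrives while run 1 is waiting
    rule.trigger(2)
    await asyncio.sleep(0.1)         # give run 2 time to finish
    return rule.completed


completed = asyncio.run(demo())
```

Only the second run completes here. Whether dropping the first run is the right behavior depends on the rule, which is the objection raised earlier in the thread.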

bravenel · February 19, 2024, 4:36pm

So, why not approach rule design this way as well? Or, better yet, learn to code in Groovy so you have complete control.

Have you thought of using Duration on the Wait for Expression, instead of delay after?

Rule Machine is a very awkward way to do motion-based lighting. Use Room Lights instead. It handles the complexities and subtleties of this. With Rule Machine, you will need multiple rules to do what a single Room Lights can do. I would never suggest using RM for lighting applications, especially motion-activated lighting.

Alan_F · February 19, 2024, 5:02pm

I hate to re-rail (is that a word?) this thread, but I think I caught a 'wait for' failure that ISN'T a race condition.

Here are the logs:

Here are the subscriptions:

I'll screen shot the rest of the application status page but only post it if more is needed.

To summarize my issue again, the rule is waiting for one of two things to happen: the front door closes, or 5 minutes elapse since it was opened. The front door closes about 15 seconds after this wait begins, but the rule doesn't proceed at that point, nor does it proceed after 5 minutes.

It is 12:01 as I write this, and the rule hasn't logged anything since the screenshot above.
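The expected behavior of that wait (proceed when the door closes or when the timeout expires, whichever comes first) can be sketched in Python with asyncio. This only illustrates the intended semantics and is not Rule Machine code; `wait_for_close_or_timeout` is a hypothetical name, and the timings are shortened from minutes to fractions of a second.

```python
import asyncio


async def wait_for_close_or_timeout(door_closed, timeout_s):
    """Return as soon as the door-closed event fires or the timeout elapses."""
    try:
        await asyncio.wait_for(door_closed.wait(), timeout=timeout_s)
        return "closed"
    except asyncio.TimeoutError:
        return "timed out"


async def demo():
    # Case 1: the door closes before the timeout.
    door = asyncio.Event()
    asyncio.get_running_loop().call_later(0.01, door.set)
    first = await wait_for_close_or_timeout(door, 0.05)

    # Case 2: the door never closes, so the timeout ends the wait.
    never_closed = asyncio.Event()
    second = await wait_for_close_or_timeout(never_closed, 0.05)
    return first, second


outcomes = asyncio.run(demo())
```

In both cases the wait ends, which is what makes the report above notable: neither the event nor the timeout ended the wait.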