Another Wait for Expression RM failure

vitaliy_kh · February 6, 2024, 10:12am

There is a bit of commonality. "Wait for Event" simply waits for Event to happen. However "Wait for Expression" waits for Event(s) first an than evaluates an expression. In both cases the Event is a primary and must to happen.

It looks like rules which are failing to complete must be gated with "Requires Expression" and occasionally "Wait for Expression/Event" will not be executed. I have a few rules failing the same way even when "Wait for Expression" is protected by Timeout. In this case the Timeout is an Event in addition to the Expression which also requires an Event in a first place. So, it looks like a somehow missing Event(s) are the source for a problem. And occasionally missing Event(s) happens in a gated with "Required Expression" rules. So, there is some commonality for all observed failing behavior.

bravenel · February 6, 2024, 5:40pm

Well, ok, but everything in the hub is event driven. Without events, nothing happens.

Meaningless in and of itself. The question is what event is not causing the correct behavior. You have yet to show any evidence of that.

vitaliy_kh · February 6, 2024, 7:01pm

This is 100% clear and nothing is wrong with this approach.
However the Event(s) could be either missing or (most likely) masked. And missing/masked Event is definitely creates a problem for the logic which is waiting for this Event(s) to happen.

I did show an evidence of missing Event(s).
But how do I know (or find out?) which Event(s) is missing if log does not record any Event(s) at all? You mentioned, looking through the past logs could reveal something. May be. But I have a hard time to find something which did not happen. What I can say for sure - app log did not log any Events for the Devices in question. However the Events page from the Device Driver page clearly shows Status of the Device was changed (in this case MS reported Inactive State). So the Status of the Device was changed (from Active to Inactive), Hub Event Generator logic must produce an Event. App Log does not record an expected Event. So, where and how I should find out if Event was actually generated (mots likely it was) or was missing/masked (most likely a case)?
My debugging capability is far less than yours. Unless I am missing something, all what I can use for the debugging - is digging through the enormous amount of logs.

bravenel · February 6, 2024, 7:18pm

The logs for the device itself, or even its Events from the device page. No event, no cigar. If they exist they are logged -- if not by the app then by the device itself.

This is not difficult at all, I do it all of the time. You can filter for the app plus the devices involved in the app in Live Logs, and even in Past Logs. You know when the app fails. You can look back in time in the device Events to that time. My own approach is to leave a tab open on Live Logs, and look back on that for whatever it is I'm looking for, presumably something that has happened recently.

Without logs, the rest is hand-waving. Sorry, but how would you test a circuit without some instrument? For the hub, logs are the instrument.

vitaliy_kh · February 6, 2024, 7:41pm

Sure, instrumentation is a must to have,
Since now I know (I guess) what failing rule should look like I will try to create a simulation for this case. This is exactly what I am doing for the debugging randomly failing hardware. BTW, if a failure is 100% reproducible this is not a problem, this is a just a bad design which is easy to fix. However real problem always exhibit a random behavior.

bravenel · February 6, 2024, 7:56pm

With software, not so much. Vast majority of bugs are easily reproduced. Race conditions are very rare, and even then can be spotted. Some deep issues such as caching are more difficult. For your case, a combination of logs, and snapshot of Event Subscriptions right after the Wait are the first required steps.

bravenel · February 6, 2024, 8:27pm

To explain a bit more:

A Wait creates new Event Subscriptions for the events being waited for, or for the conditions of a Wait Expression. These subscriptions are different than those of either triggers or Required Expressions, both of which would still be in effect. When the Wait is completed, or the rule triggered, these Wait subscriptions are removed.

One possible explanation for a failed Wait would be that the event subscription was not created properly. This would mean that the waited for event would not be logged and the Wait would hang. While in theory this could explain your problem, generally speaking such a failure would always occur and every Wait would fail due to such a bug. A deeper (and never seen before) problem could be that the event subscriptions don't happen quickly enough (race condition), and that the waited for event happens before they are in place. This would be revealed by the logs showing the relative timing of the Wait action and the missed event. Such a missed event would show up in the device logs with timing very close to the Wait action. It's worth noting that motion sensors have a time interval from the cessation of physical action that causes motion.active event, to the reporting of motion.inactive. This is usually at least 15 seconds (an eternity for software). Again, seeing the device logs or events would reveal this interval. If immediately after the Wait action the hub went into some totally busy state, such that no events happen at all and all are missed, or the event subscriptions were not accomplished during that interval, that could explain your problem -- but if this were the case the failures would show up all over your hub, not just with Waits. And, such a failure would be visible in the logs.

vitaliy_kh · February 6, 2024, 9:00pm

Well, I have seen a lot of random SW failures. I am (still) working for company which makes Blood Analyzers. The most exciting was a case when analyzer suddenly (of course very random, and of course in the middle of the night) started a very loud nasty screeching noise. The main computer board was design by myself and I was told I have only few days to fix whatever everybody was thinking is a problem with the Sound Chip. To make long story short, the actual problem was introduced by the upgraded Linux WiFi Driver. This driver occasionally was using a memory dedicated for the Sound Chip DMA as a buffer for packets. This was 100% proved but even top paid Linux Gurus could not figure out what is going on. And guess what? The "fix" was to remove the WiFi Card and the associated WiFi Driver from the system.
Bottom line:
Random SW failure is not uncommon.

vitaliy_kh · February 6, 2024, 9:12pm

You know near everything about hub internals.
For me hub is something like a Black Box (thanks for many valuable explanation now it is grey).
By analyzing carefully all my cases and whatever reported by other users the potentially
problematic rule:
A. Must have a Required Expression.
B. Occasionally fails to complete and stopes on "Wait for Expression" or "Wait for Event" statements.

With the rule presented at the beginning of this conversation the case became closer to Back-and-White. If "Private Boolean" is used to prevent multi-triggering rule randomly fails. Otherwise the rule is rock solid. This gives me an idea how to create a simulation for this case.
So, I will try to create a simulation for this case and see what will happen.

bravenel · February 6, 2024, 10:20pm

To the extent we've seen any, they were related to things like caching. Random hardware and mesh issues, those are frequent. We don't know that you don't have a flaky motion sensor or mesh (yet).

One form that cache issues can take relates to state, and Private Boolean relies on state, as do Waits. A Wait exits the rule, leaving behind state and event subscriptions. So a problem either writing state when the rule exits, or reading it upon an event would be problematic.

vitaliy_kh · February 6, 2024, 10:57pm

I guess, I made this very clear multiple times:
Rule is rock solid unless "Private Bolen" is involved.
Right now this is very clear Back and White case and I will try to simulate this case.
As far as a problematic devices goes, all my MSs and Hub Mesh (btw, rule in question does not use mesh) are very solid and reliable. And 'yes', if anything is not right first thing I am checking for the potentially problematic Device(s) ('yes", sometime this is a case).

bravenel · February 7, 2024, 12:33am

Still waiting for the logs that show the motion.inactive after the Wait (or before it).

vitaliy_kh · February 10, 2024, 5:37pm

Statistically it is too early to report/claim a 100% victory but after updating hubs to the latest releases 2.3.8.117 --> 2.3.8.118 and enabling use of "Private Boolean" I am not seeing failing to complete rules. I hope, everything is finally/actually fixed. So far so good but let me see what happens over longer time frame.

Alan_F · February 13, 2024, 2:46am

I just had an older rule that hasn't failed before (as far as I can remember) fail. This one does NOT use a timeout but does have a duration on the wait. So many differences, but still the same behavior. The rule logs that the wait is beginning and then doesn't log any of the events that it should be waiting for.

Here is the rule:

The logs should look like this:

but the next time the dryer ran:

It should have been logging the home power level every minute as it did the first time.

The rule shows these subscriptions:

This is a C-8 Pro running 2.3.8.118.

@bravenel - Is there anything else we can do as users to help run this issue down?

bravenel · February 13, 2024, 4:15pm

Do you have Event logging on in that rule? Really need to see that. Also need to see Scheduled Jobs from the App Status page (at the bottom) right after the Wait begins, to see if the Duration is scheduled.

PunchCardPgmr · February 13, 2024, 5:54pm

Bug or not,

I think what you experienced....and the discussed implications of cache, state, and event subscriptions leaves a lot of room for gaps in understanding what's happening under the hood and what the Rule design and/or device behavior (errant or not) might be contributing to the observed outcome.

I had gone from ZERO Required Expressions to PLENTY thinking they were the next best thing to sliced bread as far as avoiding rules stepping on top of themselves when one instance is all that was required.

Lately I've been taking another look at whether or not using that mechanism is necessary simply because I've had way too many Private Booleans "stuck in FALSE" due to the rule not finishing out the way I expected.

Could that be a design flaw on my part, sure enough...it's easy to trip over how Delays, Waits, and their cancel-able options behave given certain conditions, external or internal to the particular rule.

EDIT: Maybe I should say "due to things not proceeding as expected" because I too have wondered if the Rule was "hung on the Wait". But I want to be careful here and say that I have not always known where best to look to conclusively say "hung" was an appropriate description. Thus why I'm following the diagnostic steps being offered here.

Certainly if there is a bug here that should be addressed...I've had very simple rules which misbehaved around similarly designed constructs as above. But I've chalked it up to me not understanding something....(or even having done something like editing a rule while it was active and seemingly gotten things out of sync) and there in-lies an opportunity...to educate and real-time inform so that users themselves don't create these quandaries.

vitaliy_kh · February 13, 2024, 6:55pm

This is EXACTLY near all of my failed rules cases.
However after latest update to 2.3.8.188 I am not seeing these fails.
But again, statistically the time frame is too short.

Alan_F · February 14, 2024, 12:31am

I've experienced the same thing, but that's not the issue in this case. The recent failures I've had were not caused by a rule not finishing out the way I expected. They are due to a rule simply not following through on a wait action. In the case of a 'wait for event or duration' there is nothing about private Booleans or required expressions that should prevent the rule from proceeding to the next action after the duration even if the event subscription was somehow dropped.

I don't have 'cancel pending actions when required expression becomes false' enabled on the misbehaving rules, so changes to the required expression shouldn't interrupt the execution of the actions.

I didn't have event and trigger logging on for this rule, only action logging. I've now turned on all logging and will wait for it to fail again. I'm going to throw a text notification into the rule just before the wait begins so I can log in and take a look at the scheduled jobs. It doesn't fail every time, so it may take a while to capture another failure.

bravenel · February 14, 2024, 12:39am

This is what makes this extremely difficult to diagnose. It's almost always the case that bugs in RM are easily reproducible. So this one, and others that are also transient, pose great difficulties, especially when trying to diagnose from afar.

Given that there are a few users reporting transient failures (posted above), I have to worry about what could be underlying these reports. All of this has only come up recently, and so far I have no clue what underlies the reports. The lack of reproducibility is frustrating the effort to get to the bottom of what is going on. So in all honesty, I'm grasping at whatever straws I can get from logs and App Status page details.

Alan_F · February 14, 2024, 1:04am

Ok, I was impatient and decided to try to test the rule now. While I didn't get it to fail, I did observe that there is no scheduled job showing when the 'wait for' expression is initially false.

I modified the rule a bit so I could force it to run. I added a virtual button as a trigger and added a text notification to my phone.

The first few times I triggered the rule the 'wait for' was immediately true because the dryer wasn't running and the house power use was below 4 kW. Every time I triggered it under those conditions it showed a scheduled job:

Here are the logs where the 'wait for' was immediately true:
(on this particular run I had set the expression to '< 1100', rather than the original 4000. That shouldn't matter, I'm just pointing it out to be thorough)

App status page:

Then I modified it to wait for the home power use to drop below 1 kW, as the value was a bit above that so the wait wouldn't immediately be satisfied.

Here are the logs where the 'wait for' was not immediately true and there was no job scheduled.

App status page:

It didn't fail in this case. It ran as expected, but there was never any scheduled job. If there is supposed to be a scheduled job showing, then something is wrong. I refreshed the app status page several times while it ran and it was blank every time.