I had the most improbable series of events happen today. It appears that it exposed a race condition bug in the Notification app. With that said let me explain the series of events.
- Our garage door was opened this morning at 8:42:50.400 AM EDT.
- Our garage door was closed this morning at 8:52.50.405 AM EDT.
Yes you read that correctly. We managed to have the garage door sensor open and close exactly 10 minutes and 5 milliseconds apart as you can see below.
Turns out that we have the notification for the garage door configured to fire 10 minutes after it is opened and not closed.
Of course the notification that it was open fired about 200 milliseconds later. Also note how the notification description changed the event time from "8:42 AM" to "8:52 AM".
The problem being as you can see above that it then fired again 30 minutes later and still 30 minutes after that. The notifications fired even though state of the notification showed that the last event received was the close event which should prevent the notifications from firing.
My hub is currently running the following.
Hubitat Elevation® Platform Version
2.3.0.124
Hardware Version
Rev C-7
Let me know if there is anything else you might need.
You can call it a race condition if you like, but there's no way to avoid it with this particular logic (given the odd timing). It starts looking for the "next event" after 10 minutes, but the event had already happened before it started looking for it. What is your suggestion as to how to solve this circumstance? There is always going to be some asynchrony in the relative timing of hub events to the tune of possibly 100 msecs more or less.
So let me say. I totally understand the first notification firing. I think that's OK and to be expected.
My quibble is the future state of the application and that it continued to fire afterwards. I don't know your code but I can make some educated guesses and some ideas given my knowledge and experience.
One way you might handle this would be to shove all event with a particular app instance through a serial queue so events are orderly and processing happens in a predictable and deterministic way. Of course that could produce blocking effects and might cause performance problems so probably not be best candidate.
You could shove only portion of the code that handles the state of the application through some type of serial queue and have logic to recognize these edge cases... e.g.
- I fired a notification and scheduled the next doRepeat.
- Event came in looks like I scheduled a doRepeat let's cancel that.
Conversely
- Event came in and need to cancel things.
- I need to schedule the next doRepeat but looks like the event already fired. Move along.
I think this is really the critical path here and likely is the part, if I were to guess, that isn't synchronized and which the race condition happened.
Of course those are likely difficult complicated solutions.
Maybe a more straightforward idea would be to check the state of the world right before a notification is fired and recognize when things aren't right. Looking at my screenshot the application state clearly knows that the lastEvtValue is closed. Maybe the notification routine should be taking that into account and canceling the notifications and reseting the state in that instance?
You know the codebase infinitely better than I do. I can only let you know what happened and give my perspective as limited as it may be. 
1 Like
I will take a look at this. I suspect it's a multiple simultaneous instance problem involving application state. That can probably be overcome. Don't know how to test it, given the very specific timing required for it to fail.
2 Likes
There is indeed a critical state variable involved with this. I believe the fix is to commit that to atomic state, so that separate app instances are effectively serialized with respect to reading and writing that variable.
Test plan, using naked hub with a single virtual device: First, replicate your failure by throwing carefully timed events at Notify. Once that can be more or less reliably caused to fail, introduce fix to code, and see if it actually fixes it.
4 Likes