What is Hubitat's thread safety model?

artyom.tokmakov · March 8, 2020, 7:14am

Thanks, but I guess it's not the same as when Hubitat itself calls threadSafeAppend as a callback?

Per my understanding of atomicState's documentation, atomicState is loaded from the DB before each execution and written to the DB every time it's updated.

In your example there is only a single execution, so per the documentation it loads from the DB once, and then that one loaded atomicState object is accessed from multiple threads.

When Hubitat calls the callbacks itself, it loads from the DB for every execution, so the atomicState object is potentially different for every call (of course, I'm only guessing here, and I'm probably wrong).
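To make that hypothesis concrete, here is a minimal sketch in plain Java (Groovy, which Hubitat apps are written in, runs on the JVM, so the locking semantics carry over). It assumes the per-execution loading guessed at above; `FakeDb`, `Execution` and `SnapshotDemo` are hypothetical names, not Hubitat APIs. Because each execution reads its own snapshot, a lock around the read-modify-write still loses an update:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the hub's database.
class FakeDb {
    private final Map<String, String> rows = new HashMap<>();

    synchronized String load(String key) { return rows.getOrDefault(key, ""); }

    synchronized void save(String key, String value) { rows.put(key, value); }
}

// Hypothetical model of one handler execution under the assumed semantics:
// state is loaded once, when the execution starts.
class Execution {
    private static final Object LOCK = new Object();
    private final FakeDb db;
    private final Map<String, String> snapshot = new HashMap<>();

    Execution(FakeDb db, String key) {
        this.db = db;
        // The load happens here, outside any lock the handler takes later.
        snapshot.put(key, db.load(key));
    }

    void appendUnderLock(String key, String data) {
        synchronized (LOCK) {
            // Reads this execution's (possibly stale) snapshot, not the DB.
            String updated = snapshot.get(key) + data;
            snapshot.put(key, updated);
            db.save(key, updated);
        }
    }
}

class SnapshotDemo {
    // Both executions start before either one writes.
    static String run() {
        FakeDb db = new FakeDb();
        Execution first = new Execution(db, "queuedData");
        Execution second = new Execution(db, "queuedData");
        first.appendUnderLock("queuedData", "a0");
        second.appendUnderLock("queuedData", "a1"); // overwrites "a0"
        return db.load("queuedData");
    }
}
```

`SnapshotDemo.run()` returns `"a1"`: the `"a0"` write is lost even though the two appends never overlap in time.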

So here's my attempt to run the same test, but with Hubitat calling the callbacks on multiple threads (it's quite possible I'm doing something wrong here):

```groovy
// Test app: schedule 11 callbacks (appendData0..appendData10) so that
// Hubitat itself runs them, potentially on different threads.
def getAppends() { 10 }
def getChars() { 'b' }

void scheduleAllAppends()
{
    for (def val in 0..getAppends()) {
        runInMillis(1, "appendData${val}")
    }

    // Verify the result once all callbacks should have finished
    runInMillis(2000, "checkData")
}

// Appends "a<num>" and "b<num>"
void appendImpl(def num)
{
    for (def c in 'a'..getChars()) {
        threadSafeAppend("${c}${num}")
    }
}

void appendData0() { appendImpl(0) }
void appendData1() { appendImpl(1) }
void appendData2() { appendImpl(2) }
void appendData3() { appendImpl(3) }
void appendData4() { appendImpl(4) }
void appendData5() { appendImpl(5) }
void appendData6() { appendImpl(6) }
void appendData7() { appendImpl(7) }
void appendData8() { appendImpl(8) }
void appendData9() { appendImpl(9) }
void appendData10() { appendImpl(10) }

void checkData()
{
    log.info("Checking atomicState.queuedData...")
    def data = atomicState.queuedData
    for (def val in 0..getAppends()) {
        for (def c in 'a'..getChars()) {
            def stringToFind = "${c}${val}"
            if (!data.contains(stringToFind)) {
                log.info("queuedData is incomplete - '${stringToFind}' not found in ${data}")
            }
        }
    }
    log.info("Done checking atomicState.queuedData...")
}

def threadSafeAppend(String data) {
    String oldData
    String updatedData

    // Read-modify-write of atomicState under a lock on this app object
    synchronized(this) {
        oldData = atomicState.queuedData
        atomicState.queuedData = oldData + data
        updatedData = atomicState.queuedData
    }

    log.info("Appended '${data}'. '${oldData}' => '${updatedData}'")

    updatedData
}

// Shared by installed() and updated(): reset the data and rerun the test
private def update()
{
    unschedule()
    atomicState.queuedData = ""
    scheduleAllAppends()
}

def installed()
{
    update()
}

def updated()
{
    update()
}
```

The output I get (newest entries first):

```text
23:08:18.147 info Done checking atomicState.queuedData...
23:08:18.146 info queuedData is incomplete - 'a10' not found in a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a9b9b10
23:08:18.141 info queuedData is incomplete - 'a1' not found in a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a9b9b10
23:08:18.139 info Checking atomicState.queuedData...
23:08:16.232 info Appended 'b9'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a9' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a9b9' ## Or maybe this one overwrites a10
23:08:16.229 info Appended 'b10'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a9b9' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a9b9b10' ## Overwrites a10
23:08:16.221 info Appended 'a10'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a10'
23:08:16.217 info Appended 'a9'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8a9'
23:08:16.214 info Appended 'b8'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8b8'
23:08:16.208 info Appended 'a8'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7a8'
23:08:16.152 info Appended 'b7'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7b7'
23:08:16.145 info Appended 'a7'. 'a0b0b1a2b2a3b3a4b4a5b5a6b6' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6a7'
23:08:16.137 info Appended 'b6'. 'a0b0b1a2b2a3b3a4b4a5b5a6' => 'a0b0b1a2b2a3b3a4b4a5b5a6b6'
23:08:16.129 info Appended 'a6'. 'a0b0b1a2b2a3b3a4b4a5b5' => 'a0b0b1a2b2a3b3a4b4a5b5a6'
23:08:16.106 info Appended 'b5'. 'a0b0b1a2b2a3b3a4b4a5' => 'a0b0b1a2b2a3b3a4b4a5b5'
23:08:16.100 info Appended 'a5'. 'a0b0b1a2b2a3b3a4b4' => 'a0b0b1a2b2a3b3a4b4a5'
23:08:16.088 info Appended 'b4'. 'a0b0b1a2b2a3b3a4' => 'a0b0b1a2b2a3b3a4b4'
23:08:16.083 info Appended 'a4'. 'a0b0b1a2b2a3b3' => 'a0b0b1a2b2a3b3a4'
23:08:16.078 info Appended 'b3'. 'a0b0b1a2b2a3' => 'a0b0b1a2b2a3b3'
23:08:16.073 info Appended 'a3'. 'a0b0b1a2b2' => 'a0b0b1a2b2a3'
23:08:16.045 info Appended 'b2'. 'a0b0b1a2' => 'a0b0b1a2b2'
23:08:16.038 info Appended 'a2'. 'a0b0b1' => 'a0b0b1a2'
23:08:15.990 info Appended 'b1'. 'a0b0' => 'a0b0b1'
23:08:15.988 info Appended 'b0'. 'a0' => 'a0b0' ## It sees only 'a0'
23:08:15.985 info Appended 'a1'. 'a0' => 'a0a1'
23:08:15.982 info Appended 'a0'. '' => 'a0'
```

So my question remains: is there a good way to accumulate data in a thread-safe way? I suspect not.

graysoncarr · March 8, 2020, 1:07pm

Oh yeah, that behavior makes sense now that I look at the state diagram in the docs. The DB read happens before your app executes, and thus outside of the synchronized statement. I have another idea, though: try creating a child smart app whose only purpose is to save and retrieve data from its atomicState, and then call its methods within a synchronized block.

So move this to a child app, without the synchronized block:

```groovy
def threadSafeAppend(String data) {
    String oldData = atomicState.queuedData
    atomicState.queuedData = oldData + data
    String updatedData = atomicState.queuedData

    log.info("Appended '${data}'. '${oldData}' => '${updatedData}'")

    updatedData
}
```

And then change your original threadSafeAppend method (in your parent) to:

```groovy
def threadSafeAppend(String data) {
    synchronized(this) {
        // Straight quotes: curly quotes are not valid Groovy string delimiters
        def child = findChildAppByName("My Child App")
        child.threadSafeAppend(data)
    }
}
```

artyom.tokmakov · May 29, 2020, 12:50am

Thanks, it took me a while to test your suggestion.
It still doesn't seem to work, and I'm still getting data corruption.

ross2 · March 6, 2021, 4:01pm

I want to fix issues in some of the community scripts that I'm using on my HE.

So I first looked for documentation on the Hubitat environment. I didn't find anything on the threading model or several other key things to understand before spending time on these scripts. Some key items in the Developer Documentation (such as State) take me to "There is currently no text in this page...", so those pages don't exist. While searching for thread safety I landed on this forum thread. Community members referencing SmartThings' documentation for a key component (atomicState) seems really odd to me.

Let me first say that I really appreciate @bravenel's responses in this thread and Hubitat's interaction with the community. I'm sure there is a stack of issues he'd like to be making progress on, and taking the time to respond to help the community here is really appreciated.

I'd like to share a couple of thoughts on the barrier to entry, of which thread safety is one piece.

Developers with any experience on this platform (or on SmartThings) are most likely familiar with the issues raised by multi-threading. Many end users are as well, because it is not difficult to run into problems with Rules in Rule Machine --> a not infrequent topic when an error is thrown by a rule caused by this. We assume that app and driver developers are familiar with this topic in general, or quickly become familiar with it when they encounter it.

This indicates that one needs key tribal knowledge of this platform in order to write quality drivers and apps and to be efficient with their development time. A wiki page explaining the basics, written by someone with access to the core sources, would help a lot of people.

There is value in giving developers the tools they're looking for, since they help expand your platform for you, and best of all, you never even have to give us a single dollar!

In order for Hubitat to best attract developers to spend their time on this, there needs to be good documentation to start with. Android is a different world, but it's worth considering briefly as a case study. The main reason it caught on and quickly became such a success is that the developer SDK documentation is very well fleshed out (and they made tools that developers can use easily). The platform documentation (AOSP internals), however, has really only begun to be fleshed out in the past few years. My point is, they made it as easy as they could for app developers to sell their platform for them, and good documentation targeting those developers made it possible to ramp up quickly.

As a business we always have to confront the issue of how to prioritize what we apply resources to.

I totally understand this and work in the world of prioritizing software issues every day. If a goal is to make this platform accessible to owner developers, however, Hubitat needs to prioritize documentation of the fundamental environment the scripts are operating in. Otherwise, those who may try to contribute will spend their time elsewhere. If developer-driven expansion is a lower priority than other business goals, putting off documentation is fine for the business purposes.

As a low-level software engineer with many years of experience, I want to contribute to this community (and fix the scripts I see issues in), and I think I have a good background for it. But my time is limited, and I have other priorities in life. I know I could bumble around through trial and error, figuring out things like how atomic atomicState really is, how methods can be called on different threads, and what synchronization solutions are available, but without official documentation that would take considerably more time than I expected. The barrier to entry for contributing is steeper than I expected, so I am unlikely to spend my free time learning the tribal knowledge required to contribute here efficiently.

jwetzel1492 · March 6, 2021, 4:31pm

Context: I’m a professional software developer, and have spent literal years of my life working on multi-threading and synchronization topics. I also have a dozen or so apps and drivers released and in use by the community.

I have not had to think about locking or semaphores or anything like that at all while developing my drivers and apps. And I would bet a dollar that 99% of all drivers and apps do not need it. I think I’ve only seen a single example of an app from another forum member where it looked necessary.

It’s an event-based model. Your code is called in response to an event queue. Now, could you create a race condition by your application-level logic, such as having two different apps that do things to trigger each other infinitely? Of course. But that’s a problem with application logic, and not low-level thread synchronization.

To get into one detail you mentioned: my understanding is that state writes to the database when your event is done being handled, while atomicState writes to the database immediately. But you still don’t need to worry about thread synchronization.

Think of it like running some JavaScript in a browser. To the developer, it’s single-threaded and based on events and callbacks.

(Anyone please feel free to correct me if I’m wrong on a detail. But it stands that thread synchronization is not something I’ve had to worry about while developing my apps and drivers.)

artyom.tokmakov · July 30, 2021, 9:37am

I guess it's a "bit" late for a reply, but I'll add it here for other desperate developers to read.

My understanding is that in JavaScript, your events run on the same thread (at least, from the developer's point of view). The event loop picks them up from the queue, and they run one after another, thus safely changing shared objects such as DOM.
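The event-loop model described above can be sketched in plain Java with a single-threaded executor. This is an analogy only; according to the reports in this thread it is not how Hubitat schedules handlers. Every queued handler runs to completion before the next one starts, so an unsynchronized counter comes out exact. `EventLoopDemo` is a hypothetical name:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Event-loop analogy: one thread drains a queue of handlers, one at a time.
class EventLoopDemo {
    // Only the loop thread increments this; volatile just makes the final
    // value visible to the caller after the loop shuts down.
    private static volatile int counter = 0;

    static int run(int events) {
        counter = 0;
        ExecutorService loop = Executors.newSingleThreadExecutor();
        for (int i = 0; i < events; i++) {
            // No lock needed: handlers never overlap on a single thread.
            loop.submit(() -> { counter++; });
        }
        loop.shutdown();
        try {
            loop.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter;
    }
}
```

`EventLoopDemo.run(1000)` returns 1000. Swap `newSingleThreadExecutor()` for a multi-threaded pool and the same unguarded increment can come up short, which matches the lost updates reported in this thread.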

You'd expect that the same thing is happening in Hubitat (well, I expected that for sure before I was disillusioned).

In Hubitat's case, events are actually handled on different threads, and there is no synchronization of state between them. Neither atomicState nor state is thread safe for this purpose: there's no guarantee that if one event is updating state, another one will see a consistent update.

So you have to manually do this with tricky undocumented global synchronization objects like these (from this thread):

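```groovy
import groovy.transform.Field

// One lock object shared by every execution of this app's code
@Field static Object mutex = new Object()

def handler() {
    synchronized (mutex) {
        // Your code under a real mutex
    }
}
```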

Which is frustrating.
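The same pattern in plain Java (which shares Groovy's `synchronized` semantics): a static lock and a static in-memory buffer, so every handler invocation on the pool contends for the same monitor and reads the same data. `SharedBuffer` is a hypothetical name. Keeping the data in a static field instead of atomicState is an assumption of the sketch; whether Hubitat preserves static fields across code saves or reboots is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SharedBuffer {
    // Like `@Field static Object mutex`: one lock shared by every invocation.
    private static final Object MUTEX = new Object();
    // The data lives in the same static scope, not in a per-execution copy.
    private static final List<String> QUEUED = new ArrayList<>();

    static void threadSafeAppend(String data) {
        synchronized (MUTEX) {
            QUEUED.add(data);
        }
    }

    static List<String> snapshot() {
        synchronized (MUTEX) {
            return new ArrayList<>(QUEUED);
        }
    }

    // Same workload as the test app earlier in the thread: callbacks 0..appends,
    // each appending "a<n>" and "b<n>", run on a thread pool.
    static List<String> runTest(int appends) {
        synchronized (MUTEX) {
            QUEUED.clear();
        }
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int num = 0; num <= appends; num++) {
            final int n = num;
            pool.submit(() -> {
                for (char c = 'a'; c <= 'b'; c++) {
                    threadSafeAppend("" + c + n);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return snapshot();
    }
}
```

`SharedBuffer.runTest(10)` always yields all 22 entries, in whatever order the threads happened to acquire the lock.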

jwetzel1492 · July 30, 2021, 12:58pm

I’m curious what events you have that are happening so close together in time? In my house, events occur seconds and minutes apart. No one is simultaneously sending multiple conflicting commands to my door locks, for example.

672southmain · July 30, 2021, 3:21pm

You may not fully understand the issues in concurrent processes that interact, and which need mutually exclusive access to a resource.

Just as one example, the Litter Robot driver needs to poll a cloud server to get status and receive events. On my hub, a rule needs to interact with that driver and with another rule that changes color (red, yellow, green, and flashing red) of Hue under-vanity lightstrips to indicate the level of cat poop in the litter drawer, and whether the poop level is critical and needs to be emptied.

Those processes are asynchronous and unrelated, and critical regions are needed to control the lightstrips. It’s the asynchronous nature of the cloud server responses relative to the polling of the cloud server and the flashing of the under-vanity light period that creates the concurrency issues.
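A generic illustration of such a critical region, in plain Java: two hypothetical sequences (a color change and a flash pattern) take the same `ReentrantLock`, so their commands never interleave. `LightStrip` and its methods are made-up names; real Rule Machine rules and the Hue integration are not modeled here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class LightStrip {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> commands = Collections.synchronizedList(new ArrayList<>());

    // The whole multi-step sequence is the critical region.
    void runSequence(String name, int steps) {
        lock.lock();
        try {
            for (int i = 0; i < steps; i++) {
                commands.add(name + i); // stand-in for sending one device command
                Thread.yield();         // invites interleaving; the lock prevents it
            }
        } finally {
            lock.unlock();              // always released, even if a command throws
        }
    }

    List<String> commands() {
        return new ArrayList<>(commands);
    }

    // Runs both sequences concurrently and returns the commands in send order.
    static List<String> demo() {
        LightStrip strip = new LightStrip();
        Thread color = new Thread(() -> strip.runSequence("color", 5));
        Thread flash = new Thread(() -> strip.runSequence("flash", 5));
        color.start();
        flash.start();
        try {
            color.join();
            flash.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return strip.commands();
    }
}
```

Whichever sequence wins the lock sends all five of its commands before the other starts.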

Just one example. I’ve got several.

dman2306 · July 30, 2021, 4:28pm

First - energy metering devices can send many events per second.

Second - you sound like you’re assuming a single event type for one device. What if I have a handler that is handling all temp changes for 20 different devices?

Yeah, I find synchronization to be a challenge sometimes. Not because it’s hard (though it is a more challenging programming topic) but because as @artyom.tokmakov said, none of this is documented and so we just get to try to figure it out and make our best guesses about what is going on under the hood.

jwetzel1492 · July 30, 2021, 4:45pm

I'm really curious if you could show me an example. Are you having events from all 20 devices touch a single shared resource? (And thus wanting to synchronize it.)

jwetzel1492 · July 30, 2021, 4:50pm

No, I've spent years of my career working on parallel systems and synchronization. I'm fine with "issues in concurrent processes that interact". I'm more asking about your specific use cases here, because in my own personal code inside the Hubitat runtime, synchronization has never been an issue. So I'm curious what your use case is.

Just as one example, the Litter Robot driver needs to poll a cloud server to get status and receive events. On my hub, a rule needs to interact with that driver and with another rule that changes color (red, yellow, green, and flashing red) of Hue under-vanity lightstrips to indicate the level of cat poop in the litter drawer, and whether the poop level is critical and needs to be emptied. Those processes are asynchronous and unrelated, and critical regions are needed to control the lightstrips.

I think this is where I'm not understanding your case yet. It seems that your rule would be triggered by a state change on one of the Litter Robot's attributes, and would then send a command to the light strip. The rule doesn't need to know about the internals of the litter robot driver, and doesn't need to know about the asynchronous API call. What am I missing that makes it need to be more complicated than that? Is it the "and with another rule" part? What is going on in that rule?

dman2306 · July 30, 2021, 4:58pm

Yeah. I’m not at home, but I’ll find you a sample. Imagine a situation where an event should increment a state variable. I’ve had situations where 10 events happen but the end result is 8. Why? The state.variable++ operations stomp on each other because they’re not synchronized. It’s basically a TOCTOU bug, because the increment isn’t atomic: when state.variable is read, it still has the old value (because the other thread hasn’t written it yet). atomicState reduces the odds of this, but it’s still not actually atomic. This is the classic “ATM withdrawal” example all software engineers learn in school. The solution is thread synchronization.
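The lost-update pattern described above, reproduced in plain Java with hypothetical names: `unsafeIncrement()` is the non-atomic read-then-write, and `safeIncrement()` uses `AtomicInteger` so the read and write happen as one step. Only the safe version's result is deterministic.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CounterDemo {
    private static int plain = 0; // like state.variable
    private static final AtomicInteger ATOMIC = new AtomicInteger();

    // Read, then write: two threads can read the same value and both write value + 1.
    static void unsafeIncrement() {
        int seen = plain;
        plain = seen + 1;
    }

    // Read and write as a single atomic step.
    static void safeIncrement() {
        ATOMIC.incrementAndGet();
    }

    static void runConcurrently(int events, Runnable handler) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < events; i++) {
            pool.submit(handler);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Can come back lower than `events`; the shortfall varies from run to run.
    static int unsafeCount(int events) {
        plain = 0;
        runConcurrently(events, CounterDemo::unsafeIncrement);
        return plain;
    }

    // Always equals `events`.
    static int safeCount(int events) {
        ATOMIC.set(0);
        runConcurrently(events, CounterDemo::safeIncrement);
        return ATOMIC.get();
    }
}
```

`AtomicInteger` is enough for a single counter; when several values must change together, a `synchronized` block over one shared lock is the more general tool. Neither addresses values reloaded per execution from atomicState, which is the separate problem discussed earlier in the thread.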

tomw · July 30, 2021, 5:06pm

I have had to synchronize access to atomicState for device discovery in my apps where I want to maintain a global list of what is discovered.

I send out a UPnP discovery message, and everything on my network chatters back, asynchronously and unpredictably relative to each other, with a lot of potential for collisions and concurrent runs of the response handler.

So, it's either fiddle with it until it seems synchronized "enough", or say "just run again if it doesn't work" in the user notes. Or both.
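For the discovery case, one lock-free option is a concurrent set whose `add()` reports whether the element was new. This plain-Java sketch assumes each UPnP response carries a stable device identifier; `DiscoveryRegistry` and the `uuid-N` ids are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DiscoveryRegistry {
    private static final Set<String> DISCOVERED = ConcurrentHashMap.newKeySet();

    // add() is atomic: for a given id exactly one caller ever gets true,
    // so "first time seen" work (e.g. creating a child device) runs once.
    static boolean onResponse(String deviceId) {
        return DISCOVERED.add(deviceId);
    }

    static int size() {
        return DISCOVERED.size();
    }

    // Every device answers several times, all handled on a pool.
    // Returns how many responses were treated as new devices.
    static int simulate(int devices, int repliesPerDevice) {
        DISCOVERED.clear();
        AtomicInteger newDevices = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int r = 0; r < repliesPerDevice; r++) {
            for (int d = 0; d < devices; d++) {
                final String id = "uuid-" + d;
                pool.submit(() -> {
                    if (onResponse(id)) {
                        newDevices.incrementAndGet();
                    }
                });
            }
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return newDevices.get();
    }
}
```

With 20 devices answering 5 times each, exactly 20 responses are treated as new, no matter how the handlers overlap.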

dman2306 · July 30, 2021, 5:12pm

Very true! That was the first time I saw the synchronization issues on HE. At the time I didn’t know enough about their thread model to know what was happening.

bertabcd1234 · July 30, 2021, 5:16pm

Do you do any Z-Wave driver development? Hubitat recommends using a couple static field variables to handle Supervision: [GUIDE] Writing Z-Wave Drivers for S2

I've seen some people complain about errors that look like concurrency issues, and I see that the 2.2.8 release notes mention a fix for them:

C7: Fixed concurrency error produced on some Z-Wave drivers when using S2 and Supervision.

However, I'm curious what the fix was... switching to ConcurrentHashMap instead of the default Map, maybe? In any case, that is a practical example where this matters. Not only could Hubitat get two Z-Wave reports back from the device in quick succession (where the first instance of the driver does not finish executing before the second begins), but static fields are shared among all devices that use the driver, not just one particular installed instance. That's why these are static Maps indexed by device ID (if someone could tell me why they recommend converting the ID to a String instead of keeping the actual Long type, I'd be interested in knowing, but that's unrelated...). So if two devices using the same driver each get a single report back at nearly the same time, the same concern applies.

(If ConcurrentHashMap was the fix, I don't see that mentioned anywhere and definitely not in the above. Might be worth asking, unless this was some platform-level change, but the "some drivers" phrasing doesn't sound like it. Probably worth asking in that other thread...)
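A plain-Java sketch of the kind of change being speculated about: a static map shared by every device on the driver, with `ConcurrentHashMap.compute()` doing each per-device update atomically. The session-counter semantics and the wrap-around bound of 64 are assumptions for illustration, not taken from Hubitat's S2 guide; `SessionIds` is a hypothetical name.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Driver-wide static map keyed by device ID (as a String, following the
// guide's convention mentioned above). Shared by every device on the driver.
class SessionIds {
    private static final ConcurrentHashMap<String, Integer> NEXT_ID = new ConcurrentHashMap<>();
    private static final int LIMIT = 64; // hypothetical wrap-around bound

    // compute() applies the update atomically per key, so two reports for the
    // same device can never read the same current value and get the same id.
    static int nextSessionId(String deviceId) {
        return NEXT_ID.compute(deviceId,
                (id, current) -> current == null ? 0 : (current + 1) % LIMIT);
    }

    // Issues `count` ids for one device from a pool; returns the distinct ids seen.
    static Set<Integer> issueConcurrently(String deviceId, int count) {
        NEXT_ID.remove(deviceId);
        Set<Integer> issued = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < count; i++) {
            pool.submit(() -> { issued.add(nextSessionId(deviceId)); });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return issued;
    }
}
```

With a plain `HashMap` and a separate get-then-put, two concurrent reports could be handed the same id; here 32 concurrent requests for one device always produce 32 distinct ids.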

dman2306 · July 30, 2021, 5:24pm

Not much, no. Most of my issues have been in apps.

bcopeland · July 30, 2021, 5:39pm

Yes

jwetzel1492 · July 30, 2021, 5:51pm

Ah, that's interesting. I see what you're saying there. It's not a situation I've had to solve in my apps, but I follow now. What kind of state variables are you incrementing?

artyom.tokmakov · July 30, 2021, 5:55pm

I'm really curious if you could show me an example. Are you having events from all 20 devices touch a single shared resource? (And thus wanting to synchronize it.)

I've hit this with the InfluxDB Logger app, which collects events from different devices and then sends a batch HTTP request to InfluxDB to log them. It therefore needs to share state between the threads that add events and the thread that sends them to the DB.

But it doesn't have to be as complicated as this or the examples others have shared. You can have a single timer or HTTP handler, and the events it triggers will already conflict with your lock events if they share any state.
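One way to structure that kind of batching without an explicit lock is a concurrent queue: event handlers `offer()`, and the flush handler `poll()`s until the queue is empty. `BatchLogger` is a hypothetical sketch in plain Java, not the InfluxDB Logger app's actual code, and the HTTP request itself is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BatchLogger {
    private static final Queue<String> PENDING = new ConcurrentLinkedQueue<>();

    // Safe to call from any number of event-handler threads at once.
    static void onEvent(String line) {
        PENDING.offer(line);
    }

    // Each poll() removes one element atomically, so a line lands in exactly
    // one batch; lines offered mid-drain join this batch or wait for the next.
    static List<String> flush() {
        List<String> batch = new ArrayList<>();
        String line;
        while ((line = PENDING.poll()) != null) {
            batch.add(line);
        }
        return batch; // would become the body of one batched write request
    }

    // Producers on a pool, with flushes interleaved; returns every line drained.
    static List<String> simulate(int producers, int eventsEach) {
        PENDING.clear();
        List<String> drained = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int p = 0; p < producers; p++) {
            final int id = p;
            pool.submit(() -> {
                for (int e = 0; e < eventsEach; e++) {
                    onEvent("device" + id + " value=" + e);
                    if (e % 10 == 0) {
                        drained.addAll(flush()); // flush while others still produce
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        drained.addAll(flush()); // final flush picks up the remainder
        return new ArrayList<>(drained);
    }
}
```

Even with flushes racing the producers, every line is drained exactly once.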

I really wish Hubitat's documentation would be clear about this issue, and about solutions (maybe it is now, and I'm not up to date?). It just hurts developers and ecosystem for no reason.

dman2306 · July 30, 2021, 5:58pm

I was building a presence app, because geofencing is so unreliable. I was basically using multiple apps and keeping track: if more than N of them report you’re away, then you’re away. But I noticed the count didn’t match reality. It turned out to be this issue.
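Counters that concurrent handlers increment and decrement tend to drift. A sturdier design, sketched here in plain Java with hypothetical names, records each source's latest report in a concurrent map and derives the count on demand, so a repeated or racing report can only overwrite that source's own entry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PresenceQuorum {
    // Latest report per source (e.g. one entry per presence app or sensor).
    private final Map<String, Boolean> away = new ConcurrentHashMap<>();
    private final int threshold;

    PresenceQuorum(int threshold) {
        this.threshold = threshold;
    }

    // put() is atomic per key, and a repeated report from one source is idempotent.
    void report(String source, boolean isAway) {
        away.put(source, isAway);
    }

    // Derived on demand instead of maintained with ++/--.
    long awayCount() {
        return away.values().stream().filter(Boolean::booleanValue).count();
    }

    boolean isAway() {
        return awayCount() >= threshold;
    }
}
```

For example, with a threshold of 2, reports of away from "geofence" and "wifi" (even if "wifi" reports twice) give a count of 2; a later "geofence" home report drops it to 1. `awayCount()` reads a weakly consistent view, so during a burst of reports it may lag a moment behind, but it cannot drift the way a shared counter can.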