So, @bobbyD and I were going through some logs today, and we found some lovely entries like this one:
...HubSchedulerFactory - job dev161Recur.refresh has been firing for 3494 seconds
In case you just went "like huh?" that's a device driver method running for almost an hour. It doesn't matter what driver it is; the point is that the platform has no safeguards against this kind of thing at the moment. And even well-written driver code can fall victim to an unforeseen combination of inputs, or just spend way too long waiting for, say, a network response from a third-party service.
So the question I'd like to throw out there is: what is a reasonable time limit for an app/driver method? 15 seconds? One minute? I'm going to implement something to address this scenario in version 2.2.4, and I'd rather it be based on the developer community's input instead of gut feeling. Any constructive input is welcome, please speak up!
I'm not educated enough on what would/can cause this, but 1 minute seems long enough and 5 minutes sounds like a lifetime. Then again, I guess it just depends on what the circumstances are.
I usually add a 15 to 30 second timeout on all of my http calls. May have missed one or two but have tried to go back and make sure they are all there.
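Beyond the platform's built-in HTTP timeout option, there is a generic JVM-level way to bound any blocking call: run it through a `Future` and give `get()` a deadline. This is a minimal sketch of that pattern, not Hubitat code; `callWithDeadline` and the sleep-based "slow call" are illustrative stand-ins.

```java
import java.util.concurrent.*;

// Sketch: bounding a blocking call with a hard deadline, independent of any
// library-level timeout option. All names here are illustrative.
public class BoundedCall {
    // Runs the task, but gives up after timeoutMs and returns the fallback.
    static String callWithDeadline(Callable<String> task, long timeoutMs, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(task).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            return fallback;
        } finally {
            pool.shutdownNow(); // interrupt the straggler rather than leak the thread
        }
    }

    public static void main(String[] args) {
        // A "slow HTTP call" stand-in that sleeps for 5 seconds.
        Callable<String> slow = () -> { Thread.sleep(5000); return "response"; };
        Callable<String> fast = () -> "response";

        System.out.println(callWithDeadline(fast, 1000, "timed out")); // response
        System.out.println(callWithDeadline(slow, 100, "timed out"));  // timed out
    }
}
```

The caller gets a predictable worst-case latency even when the underlying call has no timeout knob of its own.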
Historically, WebCoRE had a significant problem the last time this was tried and it had to be reverted. WebCoRE has moved forward enormously since, but that's one that I would make sure got an Alpha or Beta test.
I think the topic needs a bit more substance: for example, what's the "something" you're proposing as the cure? Will the driver/app need to be altered to recover gracefully? From a HubConnect vantage point, we'd probably wish the detail was available for distribution back to the originator.
That was my thought too.
Agreed, it would be appropriate for hub timing intervention to have feedback to the app/driver.
At this point, I'm thinking a simple Thread.interrupt(). If a method is blocked waiting for something to happen, or sleeping, this will trigger an InterruptedException, which can be handled by the app/driver. We can't really force-kill a thread in Java, only ask nicely for it to pretty please step away from the buffet.
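A minimal sketch of that behavior in plain Java (this is just the JVM mechanism being described, not hub code): a thread stuck in `sleep()` receives the interrupt as an `InterruptedException`, giving it a chance to clean up.

```java
// Sketch of the proposed mechanism: interrupting a thread that is blocked
// in sleep/wait triggers InterruptedException, which app/driver code can
// catch to recover. Plain JVM behavior, not a Hubitat API.
public class InterruptDemo {
    static volatile String outcome = "not started";

    static Thread startSleeper() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(60_000); // pretend to be a method stuck for a minute
                outcome = "finished normally";
            } catch (InterruptedException e) {
                // The platform "asked nicely"; the code gets a chance to recover.
                outcome = "interrupted, cleaned up";
            }
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        Thread sleeper = startSleeper();
        Thread.sleep(100);   // let it reach the sleep
        sleeper.interrupt(); // the watchdog's "pretty please step away"
        sleeper.join(1000);
        System.out.println(outcome); // interrupted, cleaned up
    }
}
```

Note that the interrupt only takes effect at interruptible points (sleep, wait, blocking queue operations); a thread spinning in pure computation has to check `Thread.interrupted()` itself.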
I would think you'd have to wait at least twice as long as any normal TCP timeout. That would allow the driver code to handle a timeout scenario elegantly without getting whacked by the platform. 15 seconds seems way too short. Somewhere in the 1-5 minute range seems much more reasonable.
the point is that the platform has no safeguards against this kind of thing at the moment
I agree that it seems odd and unnecessary to have a method taking that long to execute, but does it actually cause an issue definitively?
Unless it is detected to be deadlocked and/or blocking access to a shared resource that affects unrelated operations, a blanket limitation on the execution time of any individual method seems sort of heavy handed. Said another way: a false positive might cause more problems than it solves.
It does create problems - there’s a limited thread pool used for running app and device methods. Have a handful of runaways like that and other apps/drivers start to starve.
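The starvation effect is easy to demonstrate with a deliberately tiny fixed pool (a stand-in for the hub's app/driver pool; the pool size and timings here are illustrative): one runaway task delays an unrelated quick task from even starting.

```java
import java.util.concurrent.*;

// Sketch of pool starvation: a shared fixed-size pool where one runaway
// task delays an unrelated quick task. Sizes/timings are for the demo only.
public class StarvationDemo {
    // Returns how long the quick task waited before it could even start.
    static long quickTaskDelayMs() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1); // tiny pool for the demo
        long submitted = System.nanoTime();
        // The "runaway": hogs the only worker thread for 300 ms.
        pool.submit(() -> { try { Thread.sleep(300); } catch (InterruptedException ignored) {} });
        // The unrelated quick task just records when it finally got to run.
        Future<Long> quick = pool.submit(() -> (System.nanoTime() - submitted) / 1_000_000);
        long delay = quick.get();
        pool.shutdown();
        return delay;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("quick task waited ~" + quickTaskDelayMs() + " ms");
    }
}
```

Scale the 300 ms hog up to an hour-long runaway and a handful of them exhausts the pool entirely.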
Makes sense. It's not clear how close to that limit things run in a 'typical' system.
I just saw your earlier suggestion. Seems reasonable to me:
At this point, I'm thinking simple Thread.interrupt()
When would you invoke that limitation? I'd suggest that it should only be when the thread pool (or other limited resource) is sufficiently starved. The behavior of the Linux oom killer is a potentially useful example.
I am far from a developer, so take everything I say with a grain of salt. But I am struggling to think of why you would need a driver to be active that long at a time. Shouldn't the driver do its job, then sit back and wait for the next command? Even if you are stepping up (dimming) a light over time or something, you are sending frequent but intermittent commands, right?
I mean, these commands are sort of serial up to a point, if I am reading Victor's post correctly? The pipe is only so big, so you have to keep things moving along. Even if something is really slow or locked up, it seems like it should execute within a minute. If not, then do we want things to purposely but gracefully fail (my guess is yes) and move on to the next queued item?
If this were a car, it would give an error code after some quantity of these messages were clogging up the CAN network or after it saw a non-responsive device. It might retry things every key cycle, but it wouldn't just pound the network so bad that it made the car inoperative! Like sorry we can't turn on the headlights because you left the blinker on too long! Or because the blinker bulb burned out.
The longest I can remember something on my hub taking to respond is on the very rare occasion my locks have misfired. They might take a minute to respond at worst. But an hour! So my uninformed opinion from a user side of things would be to make it fail gracefully if possible, after maybe a couple minutes to maybe 5 minutes.
The platform supports both. It supports synchronous conversations, in which the hub must wait for a response. Then there's asynchronous, which is of course the opposite: the two pieces do not rely on one another. Fire and forget, followed by a "surprise, here's some data." Thus writing async is a touch harder. This is just one example to aid understanding, not a tutorial.
Seeing as how methods like httpGet allow me to have a timeout of up to 300s (which I use in a few apps for long polling) wouldn’t it need to be at least that high? Any long running web request would be an issue depending on how you implement this (is waiting on a synchronous httpGet considered “sleep” time?) While I understand the reasoning for what you’re proposing, this is a big part of why I left SmartThings. Their limits of ~30s were very difficult to stay within in many cases. Apps like Ecobee Suite (I’m sure @storageanarchy can add details) and AlarmDecoder had to come up with workarounds of firing events to do stuff asynchronously to get around the limits.
Edit: how do calls to pauseExecution fit into this solution? I have code that does this in a loop to implement a lazy man's retry system. Similar question on synchronization methods like awaiting on a semaphore or using the synchronized keyword. That's generally considered sleep time for the thread since you're yielding the processor to other threads. Will your solution count that toward the limit?
Edit 2: in HPM when someone does package match ups, if they have many packages installed this process could take several minutes to complete. That’s why I made the method asynchronous and in the background using runIn. On my hub it takes about 8 mins to complete as an FYI.
The more I think about this, the more I’m thinking I’ll have a lot of potential rewriting to do... maybe I’m wrong, guess we’ll see.
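For what it's worth, the semaphore case above is interruptible on the JVM: a thread parked in `Semaphore.acquire()` would surface a watchdog's `Thread.interrupt()` as an `InterruptedException` rather than being exempt as "sleep time." A minimal sketch (plain Java, not hub code; whether the platform actually counts such waits toward the limit is the open question):

```java
import java.util.concurrent.Semaphore;

// Sketch: blocking waits like Semaphore.acquire() are interruptible, so a
// max-runtime watchdog's interrupt would cut them short. Illustrative only.
public class SemaphoreInterrupt {
    static volatile boolean sawInterrupt = false;

    public static void main(String[] args) throws Exception {
        Semaphore gate = new Semaphore(0); // no permits: acquire() blocks forever
        Thread waiter = new Thread(() -> {
            try {
                gate.acquire();
            } catch (InterruptedException e) {
                sawInterrupt = true; // the wait was cut short by the interrupt
            }
        });
        waiter.start();
        Thread.sleep(100);  // let the waiter block on the semaphore
        waiter.interrupt(); // what a max-runtime watchdog might do
        waiter.join(1000);
        System.out.println("interrupted while waiting: " + sawInterrupt);
    }
}
```

Plain `synchronized` lock acquisition, by contrast, does not respond to interrupts until the lock is obtained, so those waits would behave differently.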
I suspect you're wrong, because each async call would be its own thread. The asynchttp call would effectively end one thread, and the handler would be another, different thread. At least that's the picture in my head.
Yes, asynchttp calls are processed in separate threads. The callback is invoked when the HTTP request is done, in a thread context different from the one that initiated the asynchttp call. While HTTP calls are subject to the HTTP timeout limit, they would not be subject to the max method run time. Same goes for the runZZZ methods; they are asynchronous.
Can you point me to a driver example? The reason I've started this discussion is to get a better understanding. Looking at the code will provide a clear picture of the behavior hub code needs to work well with.
Edit: Let me take a look at the package manager first...
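The thread-handoff being described can be sketched in plain Java with `CompletableFuture` standing in for the platform's asynchttp machinery (this is not the Hubitat API, just an analogue): the initiating method returns immediately, and the callback later runs on a different, pool-managed thread.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of why async callbacks dodge a per-method time limit: the request
// and its callback run on different threads. CompletableFuture stands in
// for the platform's asynchttp machinery; it is not Hubitat API.
public class AsyncThreadsDemo {
    static volatile String callerThread;
    static volatile String callbackThread;

    public static void main(String[] args) {
        callerThread = Thread.currentThread().getName();
        CompletableFuture.supplyAsync(() -> "fake response")   // "request" on a pool thread
            .thenAcceptAsync(resp ->                           // callback on a pool thread too
                callbackThread = Thread.currentThread().getName())
            .join(); // wait only so the demo can print; real code just returns
        System.out.println(callerThread + " vs " + callbackThread);
    }
}
```

Each hop is a short-lived method invocation on its own thread, so no single invocation ever approaches the limit even if the overall operation takes minutes.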
If runXXX is excluded, HPM won't be an issue. In the ST world it applied to all execution time slices; no call could take more than a set limit of their cloud time. Can I throw out an idea? Release a beta with an advanced setting to set the limit in seconds. Default it to something like 60, then let developers report back on what they had to increase it to (if anyone even needed to), and that will give some hard data. Right now we're kind of all guessing.
I am talking about httpGet, synchronous, not the async methods.
Definitely, and the starting limit is going to be at least 300. A potentially high-impact change like this is not going to just appear in release notes one day. ST had to pay for cloud infrastructure running all that resource-hungry code. We, on the other hand, are just trying to eliminate common slowdown causes based on what appears in the support tickets.
Those would be subject to the limit, as they run in the app/driver method thread context.
You could release a beta that only ALERTS on the proposed trigger point, and doesn't kill threads.
Then you would have data on how often this would come into play without breaking things.
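That alert-only idea can be sketched as a watchdog that records when each method starts and logs, rather than kills, anything running past the limit. All names and timings here are illustrative, not actual hub internals:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an alert-only watchdog: record method start times, then
// periodically sweep and log (never interrupt) overruns. Illustrative only.
public class AlertOnlyWatchdog {
    static final long LIMIT_MS = 200; // tiny for the demo; ~60_000 in the proposal
    static final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
    static volatile int alerts = 0;

    static void methodStarted()  { startTimes.put(Thread.currentThread(), System.currentTimeMillis()); }
    static void methodFinished() { startTimes.remove(Thread.currentThread()); }

    static void sweep() { // run periodically; only alerts, never kills threads
        long now = System.currentTimeMillis();
        startTimes.forEach((t, started) -> {
            if (now - started > LIMIT_MS) {
                alerts++;
                System.out.println("ALERT: " + t.getName() + " has been firing for " + (now - started) + " ms");
            }
        });
    }

    public static void main(String[] args) throws Exception {
        Thread slow = new Thread(() -> {
            methodStarted();
            try { Thread.sleep(500); } catch (InterruptedException ignored) {}
            methodFinished();
        }, "dev161Recur.refresh");
        slow.start();
        Thread.sleep(300); // past the limit, method still running
        sweep();           // logs the overrun but leaves the thread alone
        slow.join();
    }
}
```

Running a beta in this mode would surface exactly the frequency data being asked for, with zero risk of breaking existing apps.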
We have a watchdog process that dumps that information in the log, but it's not the public log. That can change. Needless to say, we only see the most egregious slowdown cases.