Seeking Better Certainty of Device States

bertabcd1234 · June 11, 2023, 3:17am

Not to suggest a totally different path without some thought, but changing the driver would affect the way all apps work, and this may have unintended consequences for some apps, or not (it depends on your setup and what this means for your apps). The default platform behavior is as it is for that reason. Have you considered using a custom app instead of a custom driver? When subscribing to the event, one of the options you can use is filterEvents: false. This should catch events that the device sends, even if they aren't considered state changes by the platform. See: the subscribe() signatures docs. This does depend on the driver still trying to send events (and relying on the default platform filtering), but most Z-Wave and Zigbee drivers I know of would. With an open driver like the one suggested above, you could make sure of it, too.

Or you could modify the driver. But if your rule is reasonably complex and you have any Groovy (or Java or similar, or could learn) skills, then an app might be an easier approach anyway — and the advantage is that it provides you this option without affecting anything you don't want to work in the way a driver modification would cause. Also, from a practical perspective, the driver developer docs are still being worked on, but there are likely enough app docs now to get someone started who has any capability for this (though this particular driver modification should also be easy if that really is the best option for your use).

EDIT: I'm adding the fact that this is a rarely-used technique (likely for the same reason that most drivers don't work this way in the first place), so you're unlikely to find many examples. However, subscribe() itself is incredibly common, and this modification to that is easy; most of the work would be writing the app code itself to do what you want "action"-wise.
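
For anyone who wants to try the app route, a minimal sketch of the subscription might look like the following. It assumes an app with a device input named `switches` and a handler named `switchHandler` (both names are hypothetical), and it leaves out the `definition` and `preferences` blocks a real app needs. The stand-ins at the bottom are not hub code; they only let the sketch run in plain Groovy and would be deleted in a real app.

```groovy
import groovy.transform.Field

// ---- App-side code, written the way it might sit in a Hubitat app ----

def initialize() {
    unsubscribe()
    // filterEvents: false asks the hub to deliver every "switch" event,
    // including ones whose value matches the current state.
    subscribe(switches, "switch", "switchHandler", [filterEvents: false])
}

def switchHandler(evt) {
    // Record each report, changed or not, so later logic can confirm a command landed.
    state.lastReport = [device: evt.displayName, value: evt.value]
}

// ---- Stand-ins so the sketch runs outside a hub; delete these in a real app ----
@Field Map state = [:]
@Field List switches = ["Porch Light"]   // on a hub this comes from a preferences input
@Field List subscriptions = []

def unsubscribe() { subscriptions.clear() }

def subscribe(devices, String attribute, String handler, Map options = [:]) {
    subscriptions << [devices: devices, attribute: attribute, handler: handler, options: options]
}

initialize()
switchHandler([displayName: "Porch Light", value: "on"])   // simulate a repeated "on" report
```

Whether repeated reports actually reach the handler still depends on the driver sending them, as noted above.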

EDIT (again): I may have misunderstood one suggestion above, that being to make the driver retry the command if it doesn't seem successful, not to make the driver force everything to be a state change and leave the retries to an app. That could work -- or if your device supports the Z-Wave Supervision command class and is paired with S2 (I think that is required for this), you could use Supervision encapsulation for "actuator" commands (on, off, etc.). The driver can retry the command in a few seconds for a few times if the device doesn't get a Supervision reply, and there are some examples on the forum of how to do this. This technically doesn't tell you whether a state change actually happened, just that the device received the command (or if it never replied that it did), but if the device is working properly, it should almost be the same.

robert.bruce · June 11, 2023, 4:07am

@bertabcd1234 thank you for your desire to help.

Yes, the basis for the reliability improvement I'm trying to accomplish is to get verification of the state change from the device following a command, so it looks like Supervision Encapsulation wouldn't get me what I need.

I have plenty of coding experience, but not in Java or Groovy. I've looked at Groovy driver code and it is reasonably intuitive, but I wouldn't be able to just dive in.

Likewise, I have no familiarity with custom Apps. I get the gist of what you are saying, that a custom app could be created to subscribe only to the events that I want, and by setting the filterEvents option, I could have simple access to events that the hub normally discards. This is very interesting to me, but again, I wouldn't be able to just dive in.

The RM rules I've already written do an excellent job of verification of device state, retry, etc., for the events that pass through the isStateChange filter in the driver, and that is the only caveat. The capability of RM is robust, I am very happy with it. Using my RM rules doesn't add a significant overhead on the CPU. For an initial trial, it seems easier to me to modify an existing driver by setting the isStateChange variable to true for all switch commands, and if it causes problems as @bravenel thinks it could, I'll find out.

I would prefer the easiest path to get what I can demonstrate as a proof of concept. I have had to go through a couple dozen iterations of the RM reliability-based rules I've created to get to where they work well for me, so that I don't end up with multiple instances running, endless loops, that the verification routines are cancelled properly, that the ON rules work with the OFF rules, etc. The rules I've created are not seemingly complex, but if they are written improperly, they get the hub into trouble. I've gone through that and found the problems/challenges so I'm in an advanced state there. I know it isn't an ideal long-term solution for something that might ultimately be truly useful to a lot of people, but I want to get to a proof of concept point first without too much expenditure of effort. I've got a lot of animal care and public interaction work I've got to keep up with.

bravenel · June 11, 2023, 4:18am

All of this should be done in the driver. The current driver sends a command but doesn't follow with any verification that the command was completed by the device. The Groovy to do this would be fairly simple, much simpler than your rules. Going down the path you are describing of messing with isStateChange would be a mistake, imo.

coreystup · June 11, 2023, 6:06am

I think it's the opposite. @bertabcd1234's suggestion to use Supervision is exactly what would help in securing the level of reliability. It forces the lower level chipset to do the additional work of tracking a failed/non ack'ed request, performing retries, etc.

I know you've put a lot of time into adding error handling into RM but it's the wrong place to do it. Listen to what @bravenel and others are saying. Using the Supervision CC with S2 devices and some additional error handling in the device driver would take care of all of it at the correct level. The driver wouldn't post the new event state until it could confirm (which receiving a positive Supervision Report would do) that the device confirms its new state.

One thing I'm not sure drivers tend to do is resync their state on a hub reboot, which you brought up as a possible corner case. That would be a good thing to add as well.

robert.bruce · June 11, 2023, 6:14am

@coreystup how difficult a project do you think this would be? Would there be a way for the driver to send an event to the hub if a certain number of retries had been done without a report of a valid device state so that the event could trigger a warning action (like a switch turning ON)?

Edit: I apologize, your previous post came in just as I was posting this

coreystup · June 11, 2023, 6:19am

Not very difficult to get started. I'd estimate a few hours to a day's work for the meat of it. Tweak over time.

Sure, the driver could post an error type event that another rule could act on and notify via sms, turn on a lamp, etc. Whatever.

robert.bruce · June 11, 2023, 6:37am

The devices I've tested don't call a self Refresh on a hub reboot. I've used this behavior (or lack of behavior) to purposely put devices into a state different than the hub believes them to be in. See my Post #27.

Since it's important for reliability because device state can change when a hub is down, I put a Refresh command for every device paired to a hub in a rule that gets triggered by a hub startup event.
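
The same startup refresh could also be done from a small custom app instead of a rule. The sketch below assumes the app subscribes to the hub's systemStart location event and that `devicesToRefresh` is an input of devices that support refresh(); that name and `hubStarted` are hypothetical. The stand-ins at the bottom exist only so the sketch runs in plain Groovy.

```groovy
import groovy.transform.Field

// ---- App-side code ----

def initialize() {
    // systemStart is the location event fired when the hub finishes booting.
    subscribe(location, "systemStart", "hubStarted")
}

def hubStarted(evt) {
    // Ask every selected device to report its real state after a restart.
    devicesToRefresh.each { it.refresh() }
}

// ---- Stand-ins so the sketch runs outside a hub; delete these in a real app ----
class FakeDevice {
    int refreshCount = 0
    def refresh() { refreshCount++ }
}

@Field def location = "hub location"
@Field List devicesToRefresh = [new FakeDevice(), new FakeDevice()]
@Field List subscriptions = []

def subscribe(target, String event, String handler) {
    subscriptions << [target: target, event: event, handler: handler]
}

initialize()
hubStarted([name: "systemStart"])   // simulate the hub coming back up
```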

robert.bruce · June 11, 2023, 6:48am

@coreystup this is what led me to think that Supervision might not be the answer. But if Supervision can be combined with a test for the device to report the desired state, then it could work. I normally don't retry more frequently than every minute. I don't know if Supervision can be set to retry at a defined time interval.

coreystup · June 11, 2023, 6:55am

You shouldn't even need to test (query) for the desired state. Just keep asking for it to be put into that state until it confirms that it is. Sending a GET + receiving a REPORT is the same amount of airtime/packets as just asking it to be placed into that state (SET + SUPERVISED REPORT). The first retries would be handled by the chipset (tries every 500ms X times, etc). The driver could/would have a retry layer on top of that if needed.

I also don't think the ZEN17 would be any more difficult to work with either. It's still just a binary switch, even though it has multiple endpoints, etc. I would suspect the biggest issue is the level of support that each of the devices you're thinking of using has for S2 Supervision, etc. Zooz tends to be responsive to any issues/bugs pointed out though.

Just FYI, I have a ZEN16 and ZEN17 available for development use. I don't have the other units you mentioned.

robert.bruce · June 11, 2023, 7:02am

Do you mean the chipset of the device? The device retries on its own? Is the retry time interval or number of retries able to be set by the driver?

coreystup · June 11, 2023, 7:11am

The Z-Wave chip itself in both the hub and the device. When sending supervised encapsulated data, they can each do their own retries for failed acks, checksum failures, decryption failures etc.

Devices usually offer configurability or some reasonable defaults if not. For instance, the ZCOMBO smoke/CO alarm offers:

Then on TOP of that if the driver needs to do additional retry/error handling it can do so using whatever logic needed.

robert.bruce · June 11, 2023, 7:22am

@coreystup That's all very interesting. My belief is that the primary failures I see are either the device not receiving the command or the hub not receiving the ACK. Maybe there are some checksum failures in there as well, but I was seeing the same occasional failures before I was using S2.

There is a Supervision command class on the Zen17 btw, which I imagine you would expect being that is an S2 device.

robert.bruce · June 11, 2023, 9:55am

@bertabcd1234 I can only find a few references of users attempting to include Supervision into custom driver code, but I can't find any examples of anyone actually testing it and using it. Did you find some posts that I am missing? It's a great capability in Zwave, but it seems like the only places it is being used are commercial security devices and the like, such as Zwave doorlocks etc., not custom driver code.

Not that there is any reason it wouldn't work in custom code. Just can't find any examples.

bertabcd1234 · June 11, 2023, 4:01pm

Yes, also linked to in the docs:

The driver you were previously referred to may also be using it already (or not; I did not look at the code).

bravenel · June 11, 2023, 4:13pm

This is correct, and it is a case that needs to be dealt with by some means.

I am going to create a modified driver this morning that does this. My approach will not be Z-Wave specific, won't be using Supervision CC, so it won't be limited to S2 devices. Instead, a simple check and repeat approach. It certainly could have a settable maximum number of retries after which it throws some event (sets an attribute), and also a settable refresh interval. This is not a difficult thing to do, and I will share the core part of what I come up with.

bravenel · June 11, 2023, 4:58pm

Here is the code for POC of this. I added two commands to the driver, onV and offV. These could be commanded with RM custom commands. For now, the refresh rate is hardwired, and there is no max retries check. I tested this by turning off the switch, then unplugging it; turned it on from the device page, and after some time plugged it back in. Works as expected. Here are the logs:

Notes on the logs: The refOff at 44:11.285 is after offV() checking that the light is indeed off. The next refOn at 45:40.740 is 10 seconds after I gave the onV() command on the device page, checking to see if it turned on. It did not since it was unplugged. At 46:10.843 the redoOn attempts again to turn it on, and this time succeeds. The final refOn at 46:20.862 is when the sequence stops, as the switch has responded to the command and is now on.

The code for this would work in most drivers:

def onV() {
    cancelOffCheck()
    state.onCheck = true
    runIn(10, refOn)
    on()
}

def refOn() {
    Boolean check = state.onCheck
    state.onCheck = false
    log.debug "refOn: check=$check, device=${device.currentValue("switch")}"
    if(check && !(device.currentValue("switch") == "on")) {
        state.onCheck = true
        runIn(10, redoOn)
        refresh()
    }
}

def redoOn() {
    Boolean check = state.onCheck
    state.onCheck = false
    log.debug "redoOn: check=$check, device=${device.currentValue("switch")}"
    if(check && !(device.currentValue("switch") == "on")) {
        state.onCheck = true
        runIn(10, refOn)
        on()
    }
}

def cancelOffCheck() {
    state.offCheck = false
    unschedule(refOff)
    unschedule(redoOff)
}

def offV() {
    cancelOnCheck()
    state.offCheck = true
    runIn(10, refOff)
    off()
}

def refOff() {
    Boolean check = state.offCheck
    state.offCheck = false
    log.debug "refOff: check=$check, device=${device.currentValue("switch")}"
    if(check && !(device.currentValue("switch") == "off")) {
        state.offCheck = true
        runIn(10, redoOff)
        refresh()
    }
}

def redoOff() {
    Boolean check = state.offCheck
    state.offCheck = false
    log.debug "redoOff: check=$check, device=${device.currentValue("switch")}"
    if(check && !(device.currentValue("switch") == "off")) {
        state.offCheck = true
        runIn(10, refOff)
        off()
    }
}

def cancelOnCheck() {
    state.onCheck = false
    unschedule(refOn)
    unschedule(redoOn)
}

The calls to cancelOnCheck() and cancelOffCheck() should be added to the off() and on() methods (with test that it's checking) instead of where they are in this code. My test driver is actually a dimmer driver, so cancelOffCheck() would also be added to setLevel() as well. So if the device is turned on with onV(), and then turned off with off(), it doesn't continue to check.

To handle the hub offline and reboot issue, I think the state variables state.onCheck and state.offCheck would instead be an attribute, with values like [checkingOn, checkingOff, notChecking]. Then for those devices where recovering from hub offline is important, an RM rule with trigger of systemStart would check that attribute value, and if checkingOn would do onV to the device, etc. This restarts the checking for the device that hadn't responded before the reboot. Like this (my test device is called Tree):

Finally, nothing says this has to be done with a new (custom) command in the driver. This could be done directly in the on() and off() methods, effectively changing the driver from fire and forget, to fire and keep trying until success. And choosing this option could be made a preference in the driver, so a single driver can serve both approaches.
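
The POC above hardwires the refresh rate and has no max retries check. One way to add a retry limit and a failure attribute might look like this sketch of the "on" side (the "off" side would mirror it). The names `MAX_RETRIES`, `CHECK_SECONDS`, `verifyStatus`, `checkOn`, `recheckOn`, `finishOn` and `switchIsOn` are all hypothetical, and a real driver would also have to declare `verifyStatus` as an attribute in its metadata. The stand-ins at the bottom are not hub code; they simulate a device that never turns on so the sketch runs in plain Groovy.

```groovy
import groovy.transform.Field

// ---- Driver-side code ----
@Field static final int MAX_RETRIES = 3      // hypothetical limit; could be a preference
@Field static final int CHECK_SECONDS = 10   // same interval the POC hardwires

def onV() {
    state.onCheck = true
    state.retries = 0
    sendEvent(name: "verifyStatus", value: "checking")
    runIn(CHECK_SECONDS, "checkOn")
    on()
}

def checkOn() {
    if (!state.onCheck) return                 // checking was cancelled
    if (switchIsOn()) return finishOn("ok")
    refresh()                                  // ask the device to report, then look again
    runIn(CHECK_SECONDS, "recheckOn")
}

def recheckOn() {
    if (!state.onCheck) return
    if (switchIsOn()) return finishOn("ok")
    if (state.retries >= MAX_RETRIES) return finishOn("failed")   // give up and flag it
    state.retries = state.retries + 1
    runIn(CHECK_SECONDS, "checkOn")
    on()                                       // resend the command
}

def finishOn(String result) {
    state.onCheck = false
    sendEvent(name: "verifyStatus", value: result)   // a rule can trigger on "failed"
}

def switchIsOn() { device.currentValue("switch") == "on" }

// ---- Stand-ins so the sketch runs outside a hub; delete these in a real driver ----
class FakeDevice {
    Map attrs = ["switch": "off"]              // simulated device that stays off
    def currentValue(String name) { attrs[name] }
}

@Field Map state = [:]
@Field FakeDevice device = new FakeDevice()
@Field List scheduled = []
@Field List events = []
@Field List commandsSent = []

def runIn(int seconds, String handler) { scheduled << handler }
def sendEvent(Map evt) { events << evt }
def on() { commandsSent << "on" }
def refresh() { commandsSent << "refresh" }

onV()
while (scheduled) { this."${scheduled.removeAt(0)}"() }   // run the queued checks in order
```

In the simulation the device never responds, so the sequence ends with `verifyStatus` set to "failed" after the original command plus three retries. An RM rule triggered by that attribute could then raise whatever warning is wanted.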

coreystup · June 11, 2023, 7:41pm

Or what about adding the Initialize capability to the driver, and if a preference is enabled ("Recover from hub offline") then it restarts the previous checking.

bravenel · June 11, 2023, 7:45pm

Yeah. That’s good. I forgot about Initialize capability. No rule needed.
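
A minimal sketch of that idea, assuming the driver declares capability "Initialize" in its metadata and has a boolean preference named `recoverAfterReboot` (a hypothetical name). It reuses the state.onCheck and state.offCheck flags and the onV()/offV() commands from the POC; here those two commands are replaced by stand-ins that only record the call, so the sketch runs in plain Groovy.

```groovy
import groovy.transform.Field

// ---- Driver-side code ----

def initialize() {
    // With capability "Initialize", the hub calls this at startup.
    if (!settings.recoverAfterReboot) return
    if (state.onCheck) {
        onV()        // a verified "on" was still pending before the reboot
    } else if (state.offCheck) {
        offV()       // a verified "off" was still pending before the reboot
    }
}

// ---- Stand-ins so the sketch runs outside a hub; delete these in a real driver ----
@Field Map settings = [recoverAfterReboot: true]
@Field Map state = [onCheck: true, offCheck: false]   // as if onV() had not finished
@Field List restarted = []

def onV() { restarted << "onV" }
def offV() { restarted << "offV" }

initialize()   // simulate the hub calling initialize() after startup
```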

robert.bruce · June 11, 2023, 8:24pm

@bravenel this is awesome. It is going to take me some time to digest and absorb this properly because I have minimal familiarity with Groovy and driver structure, but I'm generally understanding the code. Most of my lack of comprehension comes from which words are Hubitat keywords/reserved words, which are general conventions for variables/methods in Hubitat drivers, and which must be defined in the driver. I suspect that you have defined everything here that you need to though.

Particularly "onCheck", I haven't seen that, is that a predefined State in Hubitat drivers?

Somehow I was thinking Initialize was a method that was called when the device was powered up, and I saw your suggestion about hub offline/reboot as addressing when the hub rebooted. Does Initialize get called in both situations? Are you suggesting using an attribute instead of a state because attributes are preserved through a hub reboot and states are not?

If the added commands onV() and offV() were left in the driver, then cancelOffCheck() and cancelOnCheck() would need to remain in those methods as well as be added to on() and off() I presume.

Thanks again.

bravenel · June 11, 2023, 8:57pm

No, state objects are just inventions of the coder.

The initialize method is called when the device is created. The initialize capability causes that method to be called upon hub startup, which is a different kettle of fish. This is used for things such as reconnecting to telnet for a driver that uses telnet, etc. So it could serve the purpose of restarting the checking after a reboot if it was happening before.

No, I did that to expose it to RM, but if initialize is used, it doesn't need to be an attribute, and state would work fine. Both state and attributes are preserved across reboots.

No, because in each case the corresponding method is called. onV() calls on(). No need to cancel checking twice. But there is a need to cancel it in on() and off() in case those are commanded instead of onV() or offV().

runIn() is a hub service method that schedules something to happen in the future. device.currentValue() is a built-in method that returns the current value of a device attribute. on(), onV(), refresh(), etc. are all names of methods in the driver. Certain methods are mandatory depending on the Capabilities defined. For example, Switch Capability mandates that the driver must implement on() and off(). onV() and offV() were declared to be commands of the driver in its metadata section:

        command "onV"
        command "offV"

That causes those methods to show up on the device page, exposing them to the world.