Aeotec Smart Switch 7 with 2.2.4 driver question

No. I didn't actually know about the wakeup broadcast behavior of FLiRS devices, and I can't really test it at the moment as I have only one in use. Still, it seems a bit weird to me that it hammers the network. Once again, I don't know all the details yet, so my next thoughts are mostly assumptions based on general principles.

There are two types of RF communicators (devices): those that work in half-duplex mode and those that work in full-duplex mode. This simply means that a device can either only receive or transmit at any given moment, or do both at once. In both cases only one device at a time can communicate with a given device. (Wi-Fi MIMO is a somewhat different thing that uses multiple antennas to carry several spatial streams.)
The lowest-level protocol usually enforces the rules for acquiring the communication channel and negotiating between devices. Higher-level protocols work on already established logical connections (which are actually packets sent within negotiated device-to-device sessions).
From this perspective, multiple concurrent transmissions from awakened devices sound more like some sort of issue in the channel negotiation logic. But I might be wrong.

All in all I have 40+ Z-Wave devices, so my mesh needs to be handled with care. :wink:
One FLiRS device won't hurt, but I can reproduce the effect by triggering just the 3 TRVs in my living room.
Triggering all 7 TRVs I have will seriously upset the hub.
You'll find some evidence of this in the forums, e.g.

Eurotronic thermostats hang my hub - #55 by Arek

Interesting, but I can only speak from my observations: sometimes, when triggering my living-room devices all at once, all indicators light up almost simultaneously, but only on a good day; mostly one of them gets stuck.

If there's one thing I've learned from this forum, it's that there's a lot that's special about this protocol. :wink:

Hmm.. I've seen some of the mentioned lags even with my single device, even in manual control mode. The experimental driver tries to ensure actuator command delivery with a 60 s timeout. I've seen cases where the device took up to 40 s to wake up. I didn't apply supervision to status requests, only to commands (on/off, setHeatingPoint).

The device log will show if the driver had to retry any supervised commands, or if a command timed out without confirmation.

The device state will show the contents of the driver command queue (but the page needs to be refreshed to see updates).

I'm curious whether it helps with multiple devices.

This is a lot, what are your other Z-Wave devices?
Sounds like a mesh issue to me, or maybe you have some ghost devices?
If you find a mains-powered device on the Z-Wave details page that shows no routes, it needs to be excluded, if possible, and paired again.

You mean I should test the experimental Fibaro driver?

"Should" is way too strong wording)))) But you can try)

I'm keeping an eye on my Z-Wave network, and it looks fine. The long wakeup case is rather rare, but I've seen it happen. I see my devices change routes from time to time (with no obvious need or reason), and it made me suspect that something happening while a route is changing could be the reason for such a delay.

Just gathering statistics and looking for possible causes)

I can tell you that this behaviour is normal.
If you're interested in going deeper into Z-Wave, I recommend having a look at the SiLabs docs.

Now for my testing:
I created a rule that is supposed to trigger my living-room TRVs at once.
Installed the drivers, rebooted the hub.
The result was always the same: 2 out of 3 TRVs worked, one failed, without writing any logs.
Also, this very device could not be controlled by any means: entering the setpoint manually didn't work, nor did refreshing it.
Tried all that several times.

Reverting to your "standard" drivers fixed the problem. Weird.

Edit: I just remembered also having had an issue with long response times, and all the affected devices were SS7s included with S2 security. Look here.

Did you have a chance to check whether the device queue was empty or had some commands queued?
Like this:

I have an issue where commands sometimes get stuck in the queue, and I can't figure out why yet. While I suspect a possible multithreading issue, I'm still looking for an error in my code.

Both seemed to work, this is what the logs said for each of them:

These logs say that two control commands were accepted by the target device and removed from the queue as acknowledged.

Ah, sorry, didn't have an eye on that, this is the faulty device, the queue for the other 2 is empty:

Arhhh.. The queue is stuck.

You can press "Flush all queued commands" to give it a punch. The commands will start executing.

I need to somehow track down what goes wrong...

I have a feeling that UI interactions are handled in single-threaded mode, while schedules fire action events from other threads.

And they did, which also caused the device to go nuts. :grin:

After some testing and logging I'm pretty sure the 'experimental' command-queue-based drivers are facing thread conflict issues. With a small amount of scheduled data, one thread might not see that another thread is pushing something to the queue. With a large amount of scheduled data, threads don't see that the command queue is being consumed, which leads to an infinite command send/execution loop.

The conflict is between the cron event callback and runIn/runInMillis calls (the latter can conflict with itself because its events run on different threads).

I found mentions on the forum of a few thread-safe collection types. But exposing multithreaded behavior to a driver instance without providing a full synchronization API is (to put it mildly) a weird solution. I would expect either a full MT API stack or no multithreading at all (in the programming model for drivers and apps).

By my last point I mean that driver instances may run in different threads, but I don't see a need to run a single driver instance's methods from different threads at the same time. Each driver instance could have its own event queue under the hood, filled from different threads and executed by a single worker thread. This way all drivers would be protected from MT issues.
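To make the idea concrete, here is a minimal sketch in Python (not hub code) of one event queue per driver instance that is filled from any thread and drained by a single worker. The class and method names (`DriverInstance`, `post`, `stop`) are made up for illustration and say nothing about how the hub is actually built.

```python
import queue
import threading


class DriverInstance:
    """Hypothetical driver wrapper: many producer threads, one consumer."""

    def __init__(self):
        self._events = queue.Queue()  # thread-safe FIFO
        self.handled = []             # only ever touched by the worker thread
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def post(self, event):
        """Called from any thread (UI, cron, timers); it only enqueues."""
        self._events.put(event)

    def _run(self):
        while True:
            event = self._events.get()
            if event is None:           # sentinel: stop the worker
                break
            self.handled.append(event)  # the "driver method" runs here, serially

    def stop(self):
        self._events.put(None)
        self._worker.join()


def produce(driver, producer_id, count):
    for i in range(count):
        driver.post((producer_id, i))


def demo():
    """Four threads post 100 events each; one worker handles all of them."""
    driver = DriverInstance()
    producers = [
        threading.Thread(target=produce, args=(driver, n, 100)) for n in range(4)
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    driver.stop()
    return driver.handled
```

Because driver state is only touched from the worker thread, the driver code itself needs no locks; only the queue has to be thread-safe.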

It's not about whether a given use case will be hit or not. It's about a rifle that will go off sooner or later. Basically it poses a serious risk if you try to use the hub in some critical system/environment.

To fix my experimental implementation I see 2 ways. The simpler way is to use 'pauseExecution' instead of postponing. But the documentation doesn't mention whether the call is blocking (whether it blocks only the driver script and allows a context switch so other scripts can run, or blocks the execution thread completely). The more complex way is to use the concurrent collections provided.

Made fixes for multithreaded command queue access. It's a pity that I have to use a global variable. I would prefer to have some shared-access driver field for this purpose, à la 'state', instead.

Testing experimental implementations for stability

Started to investigate Schedule CC to implement temporary manual override functionality. It seems the hub has an issue with parsing Schedule CC replies.

Requesting:
ScheduleSupported

HUB parsing result:
ScheduleSupportedReport(fallbackSupport:null, numberOfSupportedCc:null, numberOfSupportedScheduleId:null, overrideSupport:null, startTimeSupport:null, supportEnabledisable:null, supportedOverrideTypes:null)

Device replies with
FD 0A 02 43 01 40 01 80
FD -> numberOfSupportedScheduleId: 253, same as in manual
0A -> supportEnabledisable: false, fallbackSupport: false, startTimeSupport: [hour/minute, weekdays]
02 -> numberOfSupportedCc: 2
43 01 -> thermostat setpoint: set
40 01 -> thermostat mode: set
80 -> overrideSupport: true, supportedOverrideTypes:[]
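The byte-by-byte breakdown above can be reproduced with a small standalone parser. This is a Python sketch, not driver code. The function name is invented, and the bit positions (top bit for enable/disable, next bit for fallback, low six bits for the start time mask, top bit of the last byte for override) are an assumption that is merely consistent with the decoding in this post; check them against the Schedule CC specification before relying on them.

```python
def parse_schedule_supported_report(payload: bytes) -> dict:
    """Decode a Schedule Supported Report payload (bit layout is assumed)."""
    if len(payload) < 4:
        raise ValueError("payload too short")
    flags = payload[1]
    cc_count = payload[2]
    cc_end = 3 + 2 * cc_count  # each supported CC is a (cc, command mask) pair
    if len(payload) < cc_end + 1:
        raise ValueError("payload shorter than the announced CC list")
    supported = [
        (payload[i], payload[i + 1]) for i in range(3, cc_end, 2)
    ]
    override = payload[cc_end]
    return {
        "numberOfSupportedScheduleId": payload[0],
        "supportEnabledisable": bool(flags & 0x80),  # assumed bit 7
        "fallbackSupport": bool(flags & 0x40),       # assumed bit 6
        "startTimeSupport": flags & 0x3F,            # raw bitmask, bits 0..5
        "numberOfSupportedCc": cc_count,
        "supportedCc": supported,
        "overrideSupport": bool(override & 0x80),    # assumed bit 7
        "supportedOverrideTypes": override & 0x7F,   # raw bitmask
    }


# The reply quoted in this post.
REPORT = parse_schedule_supported_report(bytes.fromhex("FD0A024301400180"))
```

For the quoted reply this yields 253 schedule IDs, a start time mask of 0x0A (two bits set, which the post reads as hour/minute and weekdays), the two CC pairs 0x43/0x01 and 0x40/0x01, and override support with an empty type mask.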

I see you had quite a busy time. :slight_smile:

You mean for security or stability-reasons?

Hm, maybe a dumb question, but one of my TRVs, the one with firmware 4.6, occasionally spins the valve fully open.
While in this state, the device won't accept any commands, neither locally nor via the driver; I always have to recalibrate it to make it work again.
With the prior version of your drivers it did this every night. Annoying, but it looks like I've now found a pattern: it always seems to go nuts at about 03:20 in the morning.
The hub's database cleanup runs at 03:15 every day, do you think there could be a connection between these 2 events?

Stability. Imagine you have a critical emergency vent for when CO/CO2 reaches a critical level, and in 1 case out of 10000 it might not start because of device state corruption caused by multithreaded access. (A hypothetical example, just to explain the point.)

Hm.. I'd like to say "no". Sounds like a weird connection. But if I had such behavior I would definitely investigate it more deeply (by putting tons of logging around all the commands that go to and from the device).

Btw, the driver has some attributes related to the topic:

attribute "alarmBattery", "STRING" // low, empty, charging, idle
attribute "alarmCannotReachTemp", "STRING" // alarm, idle
attribute "alarmHardwareMalfunction", "STRING" // alarm, idle
attribute "alarmWindowOpened", "STRING" // alarm, idle

alarmHardwareMalfunction is triggered at:
// External sensor error 0x02
// Motor error 0x03
// Calibration error 0x04

But I haven't implemented parsing of the actual reason.

Not all of these may be available from devices with firmware prior to 4.7.

All of those are initiated by the device and can be used in RM to send notifications.

The experimental drivers now have command queue logging/tracing options. The output is very verbose and shows what was sent to the device at a specific time. It can be enabled for a specific device.

It will look like this:
Execute commands [[delay:0, timeout:100, sessionId:-1, sessionTimeout:60000, commandString:9F0302700501], [delay:150, timeout:100, sessionId:-1, sessionTimeout:60000, commandString:9F0302700502], [delay:150, timeout:100, sessionId:-1, sessionTimeout:60000, commandString:9F0302700503]].

Commands are sent one by one, first-in/first-out.
The 'delay' of the very first command shows whether it will be sent after some time, right now ('0'), or has already been sent ('-1') and the timeout for an 'Application status report' has been reached (the device had its chance to ask for a retry, so the command is due to be removed and the queue proceeds to the next one).
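As a reading aid for such log lines, here is a tiny Python sketch that classifies the head of the queue by its 'delay' value as described above. The helper name is made up, the entries are trimmed to the two timing fields, and treating the values as milliseconds is an assumption based on sessionTimeout:60000 in the sample.

```python
# Queue from the sample log line, trimmed to the timing fields and command.
LOGGED_QUEUE = [
    {"delay": 0, "timeout": 100, "commandString": "9F0302700501"},
    {"delay": 150, "timeout": 100, "commandString": "9F0302700502"},
    {"delay": 150, "timeout": 100, "commandString": "9F0302700503"},
]


def head_command_state(commands):
    """Describe what the driver is about to do with the first queued command."""
    if not commands:
        return "queue empty"
    delay = commands[0]["delay"]
    if delay == -1:
        # Already sent; once its timeout passes it is dropped from the queue.
        return "already sent, waiting out the timeout"
    if delay == 0:
        return "send now"
    return f"send in {delay} ms"  # unit assumed to be milliseconds


STATE = head_command_state(LOGGED_QUEUE)
```

For the logged queue the head command has delay 0, so it is sent immediately, and the two commands behind it each wait their 150 before going out.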

Some functions were added to parameter 2 with v4.7, but notifications for hardware failures do get reported. I had to return my first TRV due to a hardware / tamper failure; this was during my Animus Heart days, but I assume this should also work for the hub?

So I created a rule that would notify me if an event with value 3 or 4 is reported, do you think this will work?

Just installed the new experimental driver, will start debugging tonight, thanks for that.
I assume the command string in the debug logs shows the CC and the executed commands?

A general observation regarding the recent and also the former non-experimental drivers:
Triggering my TRVs seems to be clogging my network again, even though I added a delay of 8 s between each execution.
The behaviour is the same when triggering many devices at a time: at least one won't execute at all, and other automations are executed with a delay or not at all.

But I found that now the devices report changes immediately, great job. :slight_smile:

The driver translates some fields of parameter 3 and notification events into 'alarm*' attributes. 'openedwindow' and 'cannotreachtemp' (with the heating medium demand report enabled) are triggered for sure. The others should be too. Waiting for the battery to discharge to check the battery-related 'alarms' :slight_smile:

2, 3 and 4 are implemented through the alarmHardwareMalfunction attribute, but currently without specifying which one was the cause. I plan to improve that; I'm just lacking a bit of info on how to extract that number from the notification report to tell them apart.

Yes, but in the final encoded form, the way it goes to the Z-Wave transmitter. I can add the unpacked form for debugging.

The command queue behavior can be tuned in terms of timings. There is a set of fields for that in the code. Every command has 2 timing parameters: delay and timeout. The first speaks for itself. The second is how long to wait for a possible application status report from the device in case it is busy and wants you to retry later. Supervised commands (the driver uses them to command the device but not to request reports) also have a retry interval. By default the driver retries a command every second for a minute until the device gives an answer.
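The default retry policy for supervised commands (every second, for up to a minute) can be sketched as a plain loop. This is a simplified Python illustration, not the driver's code: in the real driver the confirmation arrives asynchronously, whereas here it is a callable that is polled, and every name in the sketch (`send_supervised`, `simulate`, the parameters) is hypothetical.

```python
import time


def send_supervised(send, confirmed, retry_interval=1.0, session_timeout=60.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Resend until confirmed or the session times out.

    Returns the number of attempts on success, or None on timeout.
    """
    deadline = clock() + session_timeout
    attempts = 0
    while clock() < deadline:
        send()
        attempts += 1
        if confirmed():
            return attempts
        sleep(retry_interval)
    return None


def simulate(confirm_on_attempt):
    """Run the loop against a fake clock so no real time passes.

    confirm_on_attempt: attempt number at which the fake device answers,
    or None if it never answers.
    """
    now = [0.0]
    sent_at = []

    def fake_sleep(seconds):
        now[0] += seconds

    result = send_supervised(
        send=lambda: sent_at.append(now[0]),
        confirmed=lambda: confirm_on_attempt is not None
        and len(sent_at) >= confirm_on_attempt,
        clock=lambda: now[0],
        sleep=fake_sleep,
    )
    return result, sent_at
```

Lowering `retry_interval` makes a single device respond sooner after it wakes up but puts more traffic on the air, which matters once several driver instances retry at the same time.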

You can try different values.

The main goal is to avoid spamming the device with different commands. But it doesn't prevent multiple driver instances from spamming the RF channel.

Trying to find a good formula to make it work nicely)