Z-wave device message sometimes delivered as a different device

mledford · February 6, 2021, 8:28pm

I have had the following happen to me enough times over the past 6 months of owning my Hubitat c-5 that I am confident that this is a problem. I thought I had solved this problem through a workaround that I'll describe later but it happened again today though worse than previously.

Today at almost the exact moment that my garage door sensor reported it was open one of my leak sensors reported it was wet. Automation took over and closed the main water valve. Great. However, the water sensor's alarm wasn't going off and a quick peek under the house showed that the area was totally dry. I've seen some posts about similar things so I wonder if this might be the thing that is happening to others.

My water sensor reported wet when it wasn't. I've had this happen enough times and looking at the logs that it seems clear to me that somehow the message that another device is sending is getting interpreted as the water sensor. This is bad for all sorts of reasons but really. How can I trust anything that is happening if the simple things don't hold true?

Previously I one of my motion sensor's node id was exactly one less than the water sensor. I had the water sensor report wet several a couple of times while walking past the motion sensor. So I figured this was the problem. Either the device is sending the wrong node id in its message, the radio screws up the node id, or software (memory corruption) is changing the value. I don't have any sniffers so I have no idea where exactly the problem is. So, I excluded the motion sensor device from the network and then added it again. This caused the node id of this device to no longer be one less than the water sensor. Since I made that move I have yet to have that water sensor to report wet again in the past 3-4 months. Where it had reported wet 2-3 times in the previous 3 months.

Clearly I was on to something and thought I was in the clear. However, today as mentioned previous it happened again. However, instead of the triggering device being exactly one node id before the water sensor it was two. Also, it is a different water sensor from the one reporting wet the previous times. So to me this doesn't seem to be a leak sensor problem. It seems that the problem is the Hubitat c-5 device itself.

So here is some extra information.

The last time my hub was restarted:
systemStart System startup with build: 2.2.4.158 2.2.4.158 2021-01-17 06:11:52.680 PM EST

The log events around the time in question:
dev:391 2021-02-06 12:50:42.934 pm info Kitchen Motion motion is inactive
dev:100 2021-02-06 12:50:33.166 pm info Main Water Shut-Off switch is off
dev:100 2021-02-06 12:50:33.159 pm info Main Water Shut-Off switch is off
dev:100 2021-02-06 12:50:33.155 pm info Main Water Shut-Off valve is closed
dev:100 2021-02-06 12:50:33.155 pm info Main Water Shut-Off valve is closed
app:136 2021-02-06 12:50:32.063 pm info Action: Notify ALL iPhone, iPhone: '%device% detected water at %time% on %date%.'
app:136 2021-02-06 12:50:31.989 pm info Action: IF (Main Water Shut-Off open TRUE) Close: Main Water Shut-Off
app:136 2021-02-06 12:50:31.926 pm info Leak Notification and Mitigation Triggered
app:136 2021-02-06 12:50:31.920 pm info Leak Notification and Mitigation: Crawl Space Leak water wet
dev:97 2021-02-06 12:50:31.831 pm info Crawl Space Leak is wet
dev:12 2021-02-06 12:50:31.119 pm info Garage Door was opened
dev:11 2021-02-06 12:50:30.285 pm info Keeping Room Motion motion is active

Node id of devices in the log:
0x0B (011) Keeping Room Motion
0x0C (012) Garage Door
0x0E (014) Crawl Space Leak
0x14 (020) Main Water Shut-Off
0x29 (041) Kitchen Motion

I'm happy to answer any questions.

Cheers,
Michael

ccoupe · February 8, 2021, 6:47am

That is truly weird. If every hub did that, no one would be working. Since we are working, it's unique to you. Your off by one unless it's off by 2 is really difficult to image how the stock software or network stack could do that. Perhaps a coincidence? What are the devices - brand and model number? By any chance have you been setting Network Device ID numbers manually, overriding what the hub and device choose?

mike.maxwell · February 8, 2021, 2:57pm

The Z-Wave stack in the C5 has been in production for 3 years now using at least 3 different Z-Wave radios, and hasn't had any fundamental changes in over a year.
So while one can never say never, the chances of this being an issue caused by Hubitat are as close to zero as one can get in my opinion.

Given this seems to be related to your moisture sensors, which sensors are they?

mledford · February 8, 2021, 3:07pm

So I've thought a good deal about this over the past 6 months about what has been happening before I posted a thing about it.

Coincidence? Based on what has happened to me here I have to say no. Let's just take this particular instance and not even think about my previous experiences. What are the chances that (looking at the log posted) within 1 second of the initial Garage Door (0x0C) opened notification that the Crawl Space Leak (0x0E) would report wet, not physically alarm, and not actually be wet? I'd have to guess the chances are fairly slim. I'll also mention that the sensor still reports as wet even though it is dry. Why? Likely it's because it never sent a message reporting itself as wet so it can't send a message reporting itself as dry. Therefore the hub has no reason to believe it is not wet. My guess is the hub is just a state machine stored in a database. It has stored off that this device is wet and until it hears that it is not wet it won’t change. Noticing that this morning I have pressed the refresh button on the leak sensor in hopes that the next time it wakes up the hub will ask it to send an update on its current state.

Let's address the off-by-one and off-by-two of the node id. So thinking about this more I feel this is either some possible software overwriting memory error or hardware memory bit-flip error. In the case of the recent incident it's easy to see how this might have happened. The Garage Door node id (0x0C) is 1100 in binary. The node id value just needs the second least significant bit to flip to become the Crawl Space Leak (0x0E) 1110 in binary. My guess is that the kind of messages these two send out are likely "compatible" with each other. This allows one to be mistaken for the other. The garage door reports open could be mistaken for the leak sensor reports wet. I would be interested to see if the software is keeping track of the notification device in the data structures as it moves through the system. Seems like it would be. And that brings to my mind as I'm typing that maybe this is a bug in the Dome Leak Sensor driver where it is not checking to see that the message is indeed a water alarm message it is being sent. So the software is responding incorrectly to bad inputs wherever or however it may be generated.

Let's be clear I'm not at this point saying that this is happening every day or every other day or even once a week. I'm simply saying that I have seen evidence that this has and is happening and it is disconcerting. It's possible this is happening more than we know. Maybe the bit-flips in your hub don't correspond to a device there. Or if the messages on which this happens are not compatible and the software does the right thing most of the time and ignores them. Or maybe my hub is just in the right place to get hit by a stray cosmic ray more frequently than everyone else causing the bit flips. I realize there are many reasons why this could be happening and I don't have all the answers here. But it seems pretty clear that this is happening at least to me and given some reports I'm pretty sure it's probably happening to others too.

Without seeing the code or the flow of information through the system I can only speculate as to what might be the problem.

The garage door sensor is an Ecolink Z-Wave Plus TILT-ZWAVE2.5-ECO. The leak sensor is a Dome Z-Wave Plus Leak Sensor.

No. I have not modified the node ids ever. That would be silly. I'm aware that during inclusion of a Z-Wave device the hub tells the device the value to set the home id and which value it should have for the node id. Changing those values would, as you are suggesting, mess everything up. Because the device node id wouldn't change. That only happens during inclusion or exclusion.

mike.maxwell · February 8, 2021, 3:29pm

The only way to conclusively see what's going on is via frame captures, by the time the frames have made it through the radio firmware it's too late.
This is one of those issues that just cannot be diagnosed with out that level if info.

mledford · February 8, 2021, 3:35pm

Sure enough this caused the device to go back to dry.

dev:97 2021-02-08 10:30:21.697 am warn debug logging disabled...
dev:97 2021-02-08 10:20:51.419 am info Crawl Space Leak battery is 68%
dev:97 2021-02-08 10:20:51.416 am debug BatteryReport(batteryLevel:68)
dev:97 2021-02-08 10:20:50.389 am info Crawl Space Leak  is dry
dev:97 2021-02-08 10:20:50.386 am debug SensorBinaryReport(sensorValue:0, sensorType:6)
dev:97 2021-02-08 10:20:50.092 am debug WakeUpNotification()

ccoupe · February 8, 2021, 4:00pm

If I assume (yes) that there is a CRC check on the message then the location of where a node-id can be corrupted by hardware error (cpu or radio) is pretty narrow. If the node-id is not part of the checksum/CRC then the location is wider but you'd expect many errors, not one ever few days.

mledford · February 8, 2021, 4:06pm

OK. I'm happy to attempt getting you whatever information you need. Give me directions and I see if I can make it happen.

I still wonder if the Dome Leak Sensor driver is not looking to make sure that that the message sensor is of type water (6). Or that the Dome Leak Sensor driver is not looking to make sure that the notification type is of water alarm (05). Since the device sends both types of messages.

I found this information from Dome's website that describe the command classes for the leak sensor.

mledford · February 8, 2021, 4:13pm

I believe this is correct. Unless the corruption comes inside the device sending the message itself then the corruption wouldn't be coming due to interference because of the CRC check. I think this is why @mike.maxwell was suggesting that for conclusive proof we'd need frame captures. However, I'm still looking at this as possibly solvable in software. While it would be great that the bit-flips shouldn't happen I think that there are parts of the z-wave standard in place that allow the device drivers to potentially handle these edge cases.

e.g. I see the binary report or notification report. It reports 0XFF (on/wet) but... wait the message says that it isn't a sensor type of water or a notification type of water alarm. Something went terribly wrong and I shouldn't do anything with this message.

ccoupe · February 8, 2021, 4:19pm

If you had some replacement sensors and hub you could swap them and test. That could be less expensive than a Zwave frame capture device (last I looked) and eventually you need to replace the hardware that the capture shows as error-prone.

mledford · February 8, 2021, 4:27pm

Just to be clear this has happened with two different devices of two different types as indicated in the initial post. This was happening with a motion sensor which caused leak sensor A to report wet. Now it has happened with the garage door sensor (contact) which caused leak sensor B to report wet. You might could appreciate then why I don't believe these are the source of the problem.

JasonJoel · February 8, 2021, 4:32pm

If you search for zwave sniffer or zniffer on these forums or Google there should be a bunch of resources on how to setup a sniffer.

Basically you need a sniffer compatible usb zwave stick (UZB being the most commonly used for sniffing) and a Windows PC to run the sniffer/logging software.

To get effective sniffer logs, though, make sure the sniffer PC is physically close to the hubitat hub. You want the usb sniffer stick to hear the same packets the hub does, after all, which means it needs it be in the same physical area as the hub.

Good luck.

Tony · February 8, 2021, 5:15pm

Maybe unlikely, but it's conceivable that an intermediate repeater could be the cause of frame corruption. According to the data rate in use, Z-Wave uses either an 8-bit checksum (for 9.6kbps and 40kbps rates) or 16-bit CRC (for 100kbps) for the frame check sequence; it does cover the node ID. Depending on the alignment of a couple of erroneous bits in the frame (failing hardware rather than random noise would seem more likely to produce this scenario), an 8-bit checksum would be ineffective at detecting this.

The Z-Wave portion of my network is mostly for lighting, except for a couple of contact sensors and two of the original Aeon doorbells (the old Gen 5 version that can store a hundred or so MP3 files that can be played as audio alerts). Over the past three years that I've used those devices I've experienced at least a dozen instances where the incorrect audio track was played when triggered by an automation. I've noticed that it's usually the same incorrect track number (hearing 'warning: carbon monoxide" instead of "mailbox door is open" gets your attention) and wondered if a corrupted message frame might be the culprit, rather than faulty doorbell hardware since both units have shown this behavior.

mike.maxwell · February 8, 2021, 6:28pm

even if that were true, that wouldn't set off your garage contact sensors, or visa versa...

mledford · February 8, 2021, 6:56pm

I don't think I fully understand your response here.

Let's say I have two sensors. A garage contact sensor with node id 4 and a water leak sensor with node id 5.
Let's say that the garage contact sensor was activated as open and sent messages to the hub.
Let's say that somewhere somehow between the sensor and some part of the Hubitat software the node id in the message from the garage contact sensor managed to change from node id 4 to node id 5.
Let's say that since the node id of this message is now 5 that the Hubitat software routes the message to the device driver for node id 5. In this case that would be the Dome leak sensor which is node id 5.

Are you saying that even if the leak sensor's device driver parsing wasn't looking to make sure the message is of the appropriate message type (water or water alarm depending on the type of message) the leak sensor driver, which is receiving the message, would NOT cause the Hubitat system to recognize the message and act on it as if the the water leak sensor had reported a leak message?

mike.maxwell · February 8, 2021, 7:01pm

that I'm saying is that if the message got delivered to the leak sensor driver, it's not also going to be delivered to the contact sensor driver, and visa versa, forget about how the driver parses the delivered data, parsing issues are not related to this aparent comingling issue you have...

mledford · February 8, 2021, 7:10pm

OK. I think I follow what you are saying.

In my specific example the logs show that both the garage contact sensor and the water sensor were activated. You are saying this wouldn't be possible.

I agree on the premise of a single message. However, my understanding is that z-wave devices can and will send multiple messages when it hasn't been ACK'ed. I believe this means it could be possible for multiple messages to be sent that essentially "say" the same thing. And one of those messages could be lucky enough to have a bit flip as mentioned.

Would you agree?

martyn · February 8, 2021, 7:39pm

It doesn't really work like that though.

Generally speaking nowadays, different sensor device types send different Command Class messages, so even if your contact sensor somehow changed its node ID in mid air to be the same as the node ID for the leak sensor, it would also have to magically change the Command Class message from reporting a "contact alarm" to reporting a "leak alarm" in order for it to trigger anything in a leak sensor device driver, even if the message from the contact alarm actually ended up there.

Historically this type of issue did used to occur though (not with Hubitat I should say, just in general).

Back in the early days of Z-Wave when everything used simpler Command Classes like Basic or Binary Sensor, messages flying around were nearly identical apart from node ID. So it didn't take much to corrupt them.

It was still rare that an in air message got corrupted though, what was more likely to happen was a dodgy Z-Wave stack in A N Other Z-Wave device would mix up node IDs when repeating messages. So for example you'd get two different lights turning on or off, even though you'd only sent a message for one of them.

As others have said, you really need to get a sniffer to catch what's happening as it happens. Ultimately though, Hubitat will only act on the data it receives - if that data is generated spuriously, but is otherwise correct, then there's not much they can do about that.

mledford · February 8, 2021, 8:10pm

That would only matter if the driver paid attention to the types which are in the commands.

Instead of just supposing I went and tested it.

I just changed my garage door sensor from the Generic Z-Wave Contact Sensor to the Dome Leak Sensor driver. I then turned on debugging for the device. I created a small rule to notify me and then I opened the garage door. I think the logs speaks volumes.

dev:12 2021-02-08 02:57:01.680 pm info  Garage Door  is dry
dev:12 2021-02-08 02:57:01.679 pm debug SensorBinaryReport(sensorValue:0, sensorType:255)
dev:12 2021-02-08 02:57:01.654 pm info  Garage Door  is dry
dev:12 2021-02-08 02:57:01.652 pm debug SensorBinaryReport(sensorValue:0, sensorType:255)
dev:12 2021-02-08 02:57:01.629 pm debug ignore: NotificationReport(v1AlarmType:0, v1AlarmLevel:0, reserved:0, notificationStatus:255, notificationType:6, event:23, sequence:false, eventParametersLength:0, eventParameter:[])
dev:12 2021-02-08 02:56:41.346 pm info  Garage Door  is wet
dev:12 2021-02-08 02:56:41.345 pm debug SensorBinaryReport(sensorValue:255, sensorType:255)
dev:12 2021-02-08 02:56:41.328 pm debug ignore: NotificationReport(v1AlarmType:0, v1AlarmLevel:0, reserved:0, notificationStatus:255, notificationType:6, event:22, sequence:false, eventParametersLength:0, eventParameter:[])
dev:12 2021-02-08 02:56:41.327 pm info  Garage Door  is wet
dev:12 2021-02-08 02:56:41.324 pm debug SensorBinaryReport(sensorValue:255, sensorType:255)
dev:12 2021-02-08 02:56:15.056 pm debug The configuration will be updated the next time the device wakes up.  You can force the device to wake up immediately by pressing the connect button once.
dev:12 2021-02-08 02:56:15.047 pm warn  description logging is: true
dev:12 2021-02-08 02:56:15.045 pm warn  debug logging is: true
dev:12 2021-02-08 02:56:15.016 pm info  updated...
dev:12 2021-02-08 01:10:14.738 pm info  Garage Door battery is 100%

So as you can see the driver totally doesn't look at the sensor type.

So while I and Hubitat likely can't control bit-flips. Hubitat sure as heck can fix the driver so that it ignores device types in messages that don't make sense.

mike.maxwell · February 8, 2021, 8:19pm

Trust me, that's not going to fix your issue...
We can banter about the driver being the problem here, but it isn't.