HE behaving strangely, not "working"

This is probably a long shot, but in the Settings tab, go into both Zigbee and Zwave sections and be sure the radio is enabled. It might not hurt to disable, wait, and re-enable it.

I also had an oddity after a power loss last week where it changed Zigbee channels to some random low-numbered channel, and almost all my Zigbee stuff didn't work correctly. I changed channels, rebooted, and all was fine again.

Yeah, twice now it's been off and unplugged for more than a minute (the first time moving it the 6 feet to the other side of the room, and then moving it back).

....

Baffling.

After the repair (again), I think it's OK now?
Google Photos
That should be a picture of me standing in the door to the room where the hub is. It started out, and is currently, in the back left hand corner. I moved it to the wall on the opposite side, in the big "hole" you can see just on this side of the ductwork above the water heater. The wall switch that "always worked fine" is literally at my left elbow in that picture. Makes sense if anything's gonna work, it's that. :slight_smile:

Here's a diagram.


I took the picture from the spot marked "Pic" at the bottom, pointing in the general direction that the arrow points in. Dark purple "1" was the Hubitat's first location, and you can see the HVAC ductwork, all the radiant floor control piping, and the water heater are largely between the Hubitat and the rest of the house.

So I moved it from purple 1, to red 1. I figured I'd do that, run a repair in case that was moving it enough to actually make a difference, and see if ... well, as long as it worked I figured it was better.

It didn't work.

BUT I think something was going wrong BEFORE I moved it. That's why I thought to move it, but it wasn't really the reason I moved it. In other words, the move was triggered by the problems, though it was more of a thing I figured I needed to do and that was an opportune time to do it, not that I really felt like the move was gonna fix anything, or much anyway.

LEGEND: Purple and red "1" are the hubitat. 12 and 13 are pink because they're battery devices. All the rest are switches, and green vs. orange has nothing to do with switch type - it's "green are downstairs" and the map applies directly, "orange are upstairs" and the map's wrong (and they're only in approximate positions)

@mahlerrd From a troubleshooting standpoint, since this happened when you were updating firmware the next thing I would do to try to isolate the problem is to power cycle the Z-Wave devices themselves for a few minutes, starting with those nearest the hub. It's quite possible that some have become confused and are not properly acting as repeaters. You can then try to see if commands can be sent to them but don't be surprised if another Z-Wave repair or two along the way will be necessary if the routes that were being generated are not optimum.

Please forgive me if you've previously read this.
It had a great impact on me, and it comes from a reputable source.

The following is from a much quoted document on making up a good zwave mesh:

3. Place the hub in a central location

Putting the hub in a corner of the basement might be convenient, but it's a terrible idea for Z-Wave. The hub is the most important node in the network and should have the best location possible. While Z-Wave is a mesh network and can route or hop thru other nodes in the mesh, each hop is a significant delay and chokes up the network with more traffic. Ideally the hub should reach 90% of the nodes in your Smart Home without relying on routing. If the hub has Wifi then putting it in a central location is easy, you just need a wall outlet to plug it in. I have my hub hung off the back of a TV cabinet in roughly the middle of the first floor of my home.
(bolding - mine)

Funnily enough, I just tried that and it broke everything and appeared to be unfixable until I moved it back. :slight_smile:

What's the process to move it? I figured it was nothing more complicated than power off, move, power on, run repair?

Of course, I still think I was having other problems, possibly a confused device as suggested, so maybe confused device, THEN a change in topology and the device wasn't ... working right... or something?

Anyway, all seems well now, so maybe I'll move it later today to the inside corner of my wife's office.

(If HE did PoE I would have an even better option. But it doesn't, so I don't.)

I hope that you did the power off via the Hub Shutdown command, right?
Also, I've found for both zigbee and zwave you have to give some time for the network to settle down. How much time? I don't know exactly, but at least 30 to 60 minutes.

Maybe not, but you can use a POE to USB adapter easily enough if you really want to use POE... Many people do it that way.

Yep. Of course I can't guarantee that electrical power will never be interrupted, so the product should gracefully handle this situation. Still, I didn't pull the plug without the appropriate ceremonies.

Ah, I should have thought of that. And sort of did, but hadn't followed up on it yet. Nice to see that's a possibility.

But another thought occurred to me, in that according to Silicon Labs' Z-Wave PC Controller, the Hub is talking directly to only 7 devices.

But, of those 7 devices 2 are as far away from the hub as you can get (one's a battery device and I moved it recently, so it probably doesn't count, but the other should, although it's also still a direct communication after 4 or 5 repairs). So maybe I'm misreading the topology map? Of the below, 11, 12, and the group in the middle 38-41 all make sense.

13 made sense 3 weeks ago before I moved it (12 and 13 are my only two battery-powered devices), but it's stayed as a "direct communication" through at least 4 or 5 repairs. And 58 is as far from the hub as you can get, about 42 feet, through 3 walls, and I think it might actually have the width of the stair treads in the way.

But generally, everything's connected to half of everything else, EXCEPT the hub isn't even connecting to things in the same gang as devices it will connect to (see especially 35, 36, and 37)

Here's a map of the "direct communication" nodes to the hub:


NOTE GREEN IS FIRST FLOOR, YELLOW/ORANGE IS SECOND FLOOR (so the floorplan aligns with green, doesn't align with yellow).

If 58 and 13 aren't a problem, I don't see why anything in the house is a problem.

But even disregarding that, if the 38-41 group works fine directly communicating (and we'll assume it's not super-marginal signal strength there, since we have two examples of "farther away through more walls" working), and given the Hubitat Elevation is currently 9 feet up and is just as close to the plane the upstairs devices are on as the downstairs devices, there's another 7 devices that are the same or closer than the 7 it currently talks to, if only it would try.

But it doesn't appear to try?

And arguably, half of the devices are within 20-25 feet of it with only a single wall occluding them (and Z-Wave's roughly 908 MHz isn't bad at penetrating walls; it's why cordless telephones used the 900 MHz band forever).

I can easily move it to near the group in the lower right (that's my wife's office, I'd have to punch down her jacks and whatnot, but I can do that), that would get it within 10 feet of 20 devices, within 20-25 feet of at least half.

So I guess the question is, and I'm sure there's no answer for this, why doesn't my hub even just talk to all the ones it should be able to talk to? Is there any way to make it more aggressive at direct communication instead of going through the mesh?

I mean, that diagram above is really weird. #14 (laundry room, center bottom room) is talking to 58 as well, ... I think my whole network is at most one hop from the hub (which I think would be fine), and a ratio of about half direct connect and half 1 hop away, except that the hub seems to not want to directly communicate with anything, and uses 5 close devices as its first hop (and two far away), meaning now I'm possibly two hops.

(I'm in the process of building some trees and maps of potential communication to confirm this).
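For anyone else building the same kind of tree, here is a minimal sketch of counting hops with a breadth-first search over a neighbor table. The node numbers are borrowed from this thread, but the links in `neighbors` are made up for illustration, and `hops_from_hub` is a hypothetical helper, not anything the hub or PC Controller provides. A result of 1 means direct communication with the hub, 2 means one repeater in between, and so on.

```python
from collections import deque

def hops_from_hub(neighbors, hub=1):
    """Breadth-first search: fewest radio links from the hub to each node."""
    hops = {hub: 0}
    queue = deque([hub])
    while queue:
        node = queue.popleft()
        for peer in neighbors.get(node, ()):
            if peer not in hops:          # first visit is the shortest path
                hops[peer] = hops[node] + 1
                queue.append(peer)
    return hops

# Hypothetical neighbor table (node -> nodes it can hear directly).
# These links are invented; copy the real ones out of the topology map.
neighbors = {
    1: [11, 12, 38],
    11: [1, 14, 35],
    12: [1],
    38: [1, 36],
    14: [11, 58],
    35: [11],
    36: [38],
    58: [14],
}

hops = hops_from_hub(neighbors)
for node in sorted(hops):
    print(f"node {node}: {hops[node]} link(s) from the hub")
```

Any node missing from the result has no path to the hub at all in the table you fed in, which is worth checking before blaming the route selection.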

So, while I'd love to move it, I'm not sure what it'll gain, since the hub doesn't seem to want to talk to all the devices that are close to it already but seems to prefer to let the mesh handle it.

Still, it would be interesting for the sake of science!

I have also used the PC Controller program, and I also cannot rationalize why my hub (node 1) has some nodes as "direct reports" and others not. It just didn't make sense to me.
Perhaps it's some sort of "legacy" impact. I have read on other sites that there seems to be some sort of "preference" for lower-numbered node IDs.
That is, perhaps the zwave repair doesn't really make up a new route for a node if the one that's currently in its tables works.
I'm hopefully moving my zwave devices to a C-7, and I'm going to try again and look at the nodes that are directly connected to node 1. Perhaps it will clear itself up.

Perhaps @bcopeland would like to comment on how the Hubitat zwave repair program works.

Interesting. I rejoined the PC Controller to the network, let it simmer for a while and checked and now the "Hub not talking to things" problem is even worse. But that's not real concerning; it's still working (and indeed the network seems fine!)

First thing I'm doing is I just air-gapped all the switches in the house.

I pulled the airgap switch, left it off for ~5 seconds, pushed it back in, waited the second or two for it to turn on and resume whatever state it was in and for the light to settle, then waited another 5 seconds or more before doing the next one. Probably too quickly, but ... all seemed OK with that and I really didn't want to spend an hour doing this, slowly. :slight_smile:

Mostly I went from inside out, hub side to far away. Obviously I wasn't super worried about precision there, but generally that's what I did.

Now running a repair, and the repair is going FAR faster than it had previously. 5 minutes after starting it's already done 15 of the 60, whereas before it was taking several hours (3+) to complete. Of course maybe it hung up on a few later, so I'm not sure it'll complete faster, just that it's at least moving far more quickly now than I've ever seen it move before. And it feels like it's completing faster.

Is that ... some sort of maintenance? Once per year, or a couple of times, or maybe just "after you've got everything mostly settled into place after having sporadically and in no particular order added 60 devices willy-nilly" that you should sort of air-gap everything one at a time and let them come back up?

I mean, that feels like it's not illegitimate, if you know what I mean.

Anyway, will report later today on how that worked. (As it stands, 20 minutes after starting the repair we're already halfway done. If it keeps that up, it'll be done 5x as quickly as it used to take. So I think that's a very good sign, and an indicator that there was a device or three that were confused.)
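The "5x" estimate can be checked with a quick linear extrapolation. This is only a sketch: it assumes the remaining nodes take about as long per node as the ones already done, and it takes "several hours (3+)" as exactly 180 minutes. `projected_total_minutes` is a hypothetical helper name.

```python
def projected_total_minutes(done, total, elapsed_minutes):
    """Linear extrapolation of total repair time from progress so far."""
    return elapsed_minutes * total / done

early = projected_total_minutes(done=15, total=60, elapsed_minutes=5)   # 20.0
later = projected_total_minutes(done=30, total=60, elapsed_minutes=20)  # 40.0

previous = 3 * 60           # assumption: "3+ hours" taken as 180 minutes
speedup = previous / later  # 4.5

print(f"projection at 5 min:  {early:.0f} min total")
print(f"projection at 20 min: {later:.0f} min total")
print(f"speedup vs. before:   {speedup:.1f}x")
```

The projection doubles between the two checkpoints, so the later nodes are going slower than the first 15, but about 40 minutes against 180 is still roughly 4.5x.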

Unfortunately lack of a "Repair is done" entry means "Repair failed". I didn't understand this until very recently. :frowning:

A successful repair of a node looks like this:

sys:1 2020-07-23 21:22:11.399 trace Z-Wave Node 202: Repair is done.
sys:1 2020-07-23 21:22:11.381 trace Z-Wave Node 202: Repair is requesting node neighbor info
sys:1 2020-07-23 21:22:11.377 trace Z-Wave Node 202: Repair is adding return route
sys:1 2020-07-23 21:22:11.373 trace Z-Wave Node 202: Repair is deleting routes
sys:1 2020-07-23 21:22:09.661 trace Z-Wave Node 202: Repair is requesting device associations
sys:1 2020-07-23 21:22:09.641 trace Z-Wave Node 202: Repair is updating neighbors
sys:1 2020-07-23 21:22:09.622 trace Z-Wave Node 202: Repair is updating neighbors
sys:1 2020-07-23 21:22:09.602 trace Z-Wave Node 202: Repair is updating neighbors
sys:1 2020-07-23 21:22:04.552 trace Z-Wave Node 202: Repair is updating neighbors
sys:1 2020-07-23 21:22:02.469 trace Z-Wave Node 202: Repair setting SUC route
sys:1 2020-07-23 21:22:02.426 trace Z-Wave Node 202: Repair pinging

Yes, it would be nice to have a "failed" entry in the log, especially if you have a large number of devices.
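In the meantime, with a lot of devices it may be easier to paste the repair log into a small script than to eyeball it. The sketch below assumes the log text has been copied out by hand and that every repair line contains `Z-Wave Node N: Repair ...` as in the sample above; it matches on that part only and ignores the prefix. `repair_summary` is a hypothetical helper, and the node 203 lines in `sample` are invented to show a failure.

```python
import re
from collections import defaultdict

# Match only the "Z-Wave Node N: Repair ..." part of each line.
NODE_LINE = re.compile(r"Z-Wave Node (\d+):\s*(Repair .+?)\s*$")

def repair_summary(log_text):
    """Return (nodes with 'Repair is done', nodes that never got there)."""
    steps = defaultdict(list)
    for line in log_text.splitlines():
        match = NODE_LINE.search(line)
        if match:
            steps[int(match.group(1))].append(match.group(2))
    done = {node for node, msgs in steps.items()
            if any(msg.startswith("Repair is done") for msg in msgs)}
    failed = set(steps) - done
    return sorted(done), sorted(failed)

# Node 202 lines are from the post above; node 203 lines are made up.
sample = """\
sys:1 2020-07-23 21:22:11.399 trace Z-Wave Node 202: Repair is done.
sys:1 2020-07-23 21:22:02.426 trace Z-Wave Node 202: Repair pinging
sys:1 2020-07-23 21:23:40.100 trace Z-Wave Node 203: Repair setting SUC route
sys:1 2020-07-23 21:23:10.100 trace Z-Wave Node 203: Repair setting SUC route
sys:1 2020-07-23 21:23:05.000 trace Z-Wave Node 203: Repair pinging
"""

done, failed = repair_summary(sample)
print("repaired:", done)
print("no 'Repair is done':", failed)
```

Note this only sees nodes that logged at least one repair line. A node the repair skipped entirely will not show up in either list, so compare the totals against your device count.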

This should be much better with the C-7 that offers a richer Z-Wave properties page. Really looking forward to that.

Did you update the firmware on any of your devices? I was thinking it was just your hub. There are some devices you might have to exclude then re-include after an update.

Again, I'm afraid I don't have a solution but I have another statement:

ZWave repair is initiated by the Hub sending a message to all devices, asking that they determine their neighbors... and then return that info to the hub.

Let's think about what can go wrong with that simple idea. :smiley:

First, the hub can send all it wants; it's receiving that counts. Any ZWave device that didn't hear the request will sit there and do nothing on its own. Each device will still respond to its neighbors, though. So Device X may not have heard the Hub, and therefore will NOT try to get a neighbor list itself, but it will respond when its neighbors ask. What that means in practice is that any device hidden behind Device X didn't hear the initial request either, and isn't in range of any other neighbors.

The ZWave Repair completes and Device X is now reachable, but nothing hidden behind it is... which gets solved by yet another ZWave Repair.

I bumbled onto this years back when I grabbed a copy of the ZWave Repair logs and sorted it, looking only at the final message.. lo and behold, there were devices missing. I did another Repair and got busy scratching my head.. because the 2nd time more devices were in the list.
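Here is a toy model of why a second or third Repair can pick up more devices. It is deliberately simplified and is an assumption on my part, not a description of what the hub's repair code actually does: it pretends each Repair pass can only reach nodes that are direct neighbors of something already reachable. The node numbers and links are invented, with node 3 playing "Device X".

```python
def repair_passes(neighbors, hub=1):
    """Toy model: list the nodes that become reachable on each repair pass."""
    reachable = {hub}
    passes = []
    while True:
        # Nodes in direct range of anything reachable before this pass.
        newly = {peer
                 for node in reachable
                 for peer in neighbors.get(node, ())
                 if peer not in reachable}
        if not newly:
            return passes
        passes.append(sorted(newly))
        reachable |= newly

# Hypothetical layout: 4 is hidden behind Device X (3), and 5 behind 4.
neighbors = {1: [2, 3], 3: [4], 4: [5]}

for number, nodes in enumerate(repair_passes(neighbors), start=1):
    print(f"repair {number}: picks up {nodes}")
```

In this layout it takes three passes before node 5 shows up, which matches the "more devices in the list the second time" experience, at least in spirit.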

Firmware: @erktrek
I did update some firmware.

Every time I've had to exclude/include, it's broken all the automation associated with #thing. There isn't even a way I've found to automatically (or semi-manually) give #re-included-thing the same name as what it had before. Is there a way around that?

What would you folks' process be if you wanted to update the firmware on 50 devices? Would you really exclude them all, update the firmware, re-include them, and rebuild the entire set of automations?

Down the road, I'll have enough set up that I do not want to mess with having to recreate and reassign various things. Hence trying to sort out this process now. I hope to not need to update firmware often on devices, but I want to be able to update firmware without too much disruption if I need to. E.g. I don't mind if, in some cases, I have to exclude/include one, but I'd like to keep that to a minimum, or zero if possible.

To that end, I'm hoping the newly ... jiggered and fiddled with network, which apparently works 5x faster (for no reasons I can actually figure out), will let updates work via the Firmware Updater app.

Repairs: @csteele and @dennypage
I did check - there were no skipped devices in this Repair, and none gave anything other than what appears to be normal repair information in the logs. I wish these logs could be copied automatically to another system (syslog, FTP on a schedule, or even an easy way to pull them regularly via API or webhook?).

Anyway, I can totally run several more repairs, spaced out a few hours from each other. And will kick another off right now. Since they only take 25 minutes now this is fine (better than 3 hours!)

See this post. Same concept, just with one device.

That it says "Repair setting SUC route" multiple times for a given node, and does not say "Repair is done" for that node, is an indication that the node was not successfully repaired.

Who thought up that bizarre state of affairs? :slight_smile:
In any case, only 5 didn't complete then, and none that I would consider critical (e.g. none that would be a choke point to farther places.) I'll check on those specifically.

Wow.

I mean, thanks. And sure, that's better than breaking all your automations.

But still, wow.

I'm in the process of doing that for 115 devices...

I feel for you.