HE behaving strangely, not "working"

I have a network of about 60 devices (nearly all are hard wired repeaters and 80% are Inovelli Switches and Dimmers).

My Hubitat Elevation was in a less than desirable location, and though it probably had good connectivity to a dozen switches near it, it was probably relying on the mesh for more devices than it should have been (they built the house around it, and it ended up on one side of a wall of HVAC ductwork).

Nevertheless, and I feel this is important here - the network worked fine until last night.

Last night I was trying to update some firmware on some switches, and was having problems.

I had thought the problems were with the updaters, not with the network or Hubitat, but now that I think about this I'm not so sure - maybe it was just coincidence, but it seems suspicious that I started having these problems at that time. Do note I updated only a couple, waiting several minutes after each was finished, back on the network, updated and working properly before starting the next. Until that process just stopped working.

Anyway, last night I realized I should move my HE, so I powered it down via the UI and moved it to a different wall in the room it's in, so it still has a clear shot to what it HAD been talking to, but now it should have a far better signal to dozens or more devices (e.g. now the ductwork is between the HE and the outside, not between the HE and the rest of the house).

I turned it back on, fired up a Repair and let it go overnight. Finished a couple hours later, and just sat there like the usual dingus it is for the remainder of the night.

And now ... many/most devices aren't working through the HE. I'm not sure any are, in fact. But I mean "aren't working" in a very specific way - If I click any button in HE on Devices pages or wherever, nothing happens. I can't turn them on, I can't turn them off, I can't get them to give me information. But I can go to logs and see me press buttons on switches, so there is some communication happening.

I am running another Repair, just to see what happens afterwards. It's about 2/3rds done (another hour or so, probably after lunch it'll be done) and I'll give it a couple of hours to settle after that.

Any other thoughts?

Device and other information:
Hubitat Elevation® Platform Version 2.2.2.126
Hardware Version Rev C-5
All apps are "built ins" and there's only 7 - Groups and Scenes (I have one or two groups, no scenes), Hubitat Dashboard (a couple of test dashboards that no one uses regularly yet), Maker API with access to all the things, Mode Manager, Motion Lighting Apps (nothing set up in it at all), Rule Machine (also nothing set up in it at all) and Simple Automation Rules (a handful of things in it only).
I only have three custom drivers - the Inovelli LZW30-SN and LZW31-SN drivers, and the one for the Inovelli 4-in-1.
So, while obviously I may be missing something, I don't think I have a complex setup, nor any of the "problematic" apps most folks seem to have problems with. It's of moderately large size but not of high complexity.
There's nothing interesting in the logs. When I push buttons in the UI, the logs don't ever show anything, though as I mentioned, if I press the devices' physical buttons the logs show that press.

It sounds like one or both of these things may have seriously messed up your z-wave mesh?

I’m no z-wave expert so I’m not exactly sure how to troubleshoot further, but I would add that where humans think the hub should have a good signal doesn’t always match up to what actually works for the underlying radio networks.

Can you put the hub back where it was, to remove that variable from the equation (temporarily at least)?

I'll let the almost-finished repair finish (on number 58 of 65 nodes), and see what that changes if anything.

But yeah, in fact now that I've repatched it with a longer network cable, I can just set it back where it was. Worth a try.

Now this is the curious part. So the Z-Wave repair is running? Are there any errors or timeouts in the logs? If not, then there IS device communication.

If device communication is possible, which the Z-Wave repair requires in order to actually do the repair and the route updates, then it's not the mesh.

Yes, well, it's finished now.

There are no errors, a snippet of the tail end of the logs:

sys:1 2020-07-26 12:02:12.584 info Finished Z-Wave Network Repair
sys:1 2020-07-26 12:00:32.673 trace Z-Wave Node 67: Repair setting SUC route
sys:1 2020-07-26 12:00:31.634 trace Z-Wave Node 67: Repair setting SUC route
sys:1 2020-07-26 12:00:30.589 trace Z-Wave Node 67: Repair setting SUC route
sys:1 2020-07-26 12:00:28.559 trace Z-Wave Node 67: Repair setting SUC route
sys:1 2020-07-26 12:00:22.508 trace Z-Wave Node 67: Repair pinging
sys:1 2020-07-26 12:00:22.503 trace Z-Wave Node 67: Repair starting
sys:1 2020-07-26 11:58:30.245 trace Z-Wave Node 66: Repair setting SUC route
sys:1 2020-07-26 11:58:12.601 trace Z-Wave Node 66: Repair setting SUC route
sys:1 2020-07-26 11:57:54.957 trace Z-Wave Node 66: Repair setting SUC route
sys:1 2020-07-26 11:57:38.754 trace Z-Wave Node 66: Repair setting SUC route
sys:1 2020-07-26 11:57:17.411 trace Z-Wave Node 66: Repair pinging
sys:1 2020-07-26 11:57:17.406 trace Z-Wave Node 66: Repair starting

(That was the last two. The rest were all like that as well. )
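For anyone who wants to compare nodes across a long repair log rather than eyeball it, here is a rough Python sketch. It assumes the log was copied out of the hub's Logs page as text in the shape shown above (with or without spaces between the fields); the function name `summarize_repair` and the `SAMPLE` text are made up for illustration, and what a repeated "setting SUC route" line actually means inside the hub is not something this sketch claims to know.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches pasted lines such as
#   sys:12020-07-26 12:00:22.503 traceZ-Wave Node 67: Repair starting
# and the same line with spaces between the fields.
LINE_RE = re.compile(
    r"sys:1\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*(info|trace)\s*(.*)"
)
NODE_RE = re.compile(r"Z-Wave Node (\d+): Repair (.+)")


def summarize_repair(log_text):
    """Return {node_id: {"suc_attempts": int, "seconds": float}}."""
    per_node = defaultdict(lambda: {"times": [], "suc_attempts": 0})
    for raw in log_text.splitlines():
        line = LINE_RE.match(raw.strip())
        if not line:
            continue  # not a log line we recognise
        node = NODE_RE.match(line.group(3))
        if not node:
            continue  # e.g. the "Finished Z-Wave Network Repair" line
        stamp = datetime.strptime(line.group(1), "%Y-%m-%d %H:%M:%S.%f")
        entry = per_node[int(node.group(1))]
        entry["times"].append(stamp)
        if node.group(2).startswith("setting SUC route"):
            entry["suc_attempts"] += 1
    return {
        node_id: {
            "suc_attempts": e["suc_attempts"],
            # Time between the first and last line logged for this node.
            "seconds": (max(e["times"]) - min(e["times"])).total_seconds(),
        }
        for node_id, e in per_node.items()
    }


# A few lines taken from the snippet above (newest first, as pasted).
SAMPLE = """\
sys:12020-07-26 12:00:32.673 traceZ-Wave Node 67: Repair setting SUC route
sys:12020-07-26 12:00:22.503 traceZ-Wave Node 67: Repair starting
sys:12020-07-26 11:58:30.245 traceZ-Wave Node 66: Repair setting SUC route
sys:12020-07-26 11:57:17.406 traceZ-Wave Node 66: Repair starting
"""

if __name__ == "__main__":
    for node_id, info in sorted(summarize_repair(SAMPLE).items()):
        print(node_id, info)
```

On the lines posted above, node 67 goes from "starting" to its last route line in about 10 seconds, while node 66 takes about 73 seconds with roughly 16 to 21 seconds between route lines. Whether that gap points to retries or timeouts is a guess, but it is the kind of difference a per-node summary makes easy to spot.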

And after that Repair finished, there are no changes that I can see, things still aren't working, though the log shows button presses.

Interesting. Last Activity for a device I'm sitting under (Guest Bedroom Lights) shows nothing since last night, though I've done quite a few things with it since then.

Maker API debug logs show things happening, but not consistently.

So I turned off Maker API's access to everything (... removing it entirely would be harder), I'll see if that changes anything after a hub restart maybe.

If no change, I'll move the hub the few feet back, but I can't imagine that being important at this time.

There was an issue with 2.2.2.127 and Makerapi.
Did you load that update?
A new update has just been released.
It may be worthwhile doing a backup, downloading the new update and see if things improve.
Just a thought and something to try.
Good luck!

I'm trying it now.

Speaking of which, I write too long of stuff already so some things I didn't include that now seem like I should have

  • I tend to stay mostly updated (check every month or so), but I think just Friday I updated to 2.2.2.126 (from 2.2.2.122). I checked even this morning for newer updates, and it didn't say there were any available. But it was fine all Friday and most of Saturday.

But I checked now and there's 2.2.2.129 available, so I'm updating, because why not.

I have backups (HE has last 7, I have 6 of those downloaded - because I ... like to have backups before I fool with things too much, and I have an additional one made yesterday before I started fiddling).

OK, updated and I let it settle afterwards for a couple of minutes, no change.

It's like the various device pages just don't DO anything any more. It's hard to describe, except ... everything seems to work fine, but nothing works.

No errors. No warnings. No "active" indication that it can't talk to the devices, except by inferring it's not since a) nothing happens when you click stuff and b) lastActivity on most of these are stuck on yesterday.

I apparently have one device that's properly reporting to the HE out of the dozen or so that I can press while watching logs.

I think I'm going back to my last 2.2.2.122 backup from 7/24.

Also platform rollback to 2.2.2.122, which matches the DB version, so I should be back to square one now, or at least will be when that's done.

Preliminary results ...

I wonder if it's hardware?

I went back to a known good time, 7/24 - both platform version and DB. That made no difference.

I moved it back (using original network cable, too) to original location (which as I mentioned is literally eight feet from the new location, just on the other side of a small room). That also made no difference.

Of the 10-12 switches I have in the rooms near me, two of them update "Last Activity" properly when you press the physical switches and generate logs, the others don't. I'm too lazy to go around and click them all and get a more definitive list.
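A more definitive list could be built without walking the house, given some export of each device's Last Activity time. The sketch below is only that: it assumes you already have device names and last-activity timestamps in hand (how you get them out of the hub is left open), and the function name `stale_devices`, the 12-hour threshold, and the example times are all hypothetical.

```python
from datetime import datetime, timedelta


def stale_devices(last_activity, now, max_age=timedelta(hours=12)):
    """Return sorted names of devices not heard from within max_age.

    last_activity maps a device name to a datetime, or to None if the
    hub has never recorded any activity for that device.
    """
    stale = []
    for name, seen in last_activity.items():
        if seen is None or now - seen > max_age:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    now = datetime(2020, 7, 26, 14, 0)
    example = {
        "Guest Bedroom Lights": datetime(2020, 7, 25, 21, 30),
        "Laundry Room": datetime(2020, 7, 25, 20, 5),
        "Network Closet": datetime(2020, 7, 26, 13, 55),
    }
    for name in stale_devices(example, now):
        print("no recent activity:", name)
```

Pressing each physical switch once and then running a check like this would separate the devices that still report to the hub from the ones that do not.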

Neither of those that update Last Activity can be triggered on/off via zwave and the hubitat.

I did find one that works - both physically and remotely - and it's the first thing I installed in this house and the closest to the hubitat. I know that's first because it's the room the Hubitat and other network equipment is in.

But the second thing (also close to the HE) that I installed does not. This is the laundry room next to the room where the HE is - it's the only other room in the house that doesn't have natural light, so it was way early on the list. :slight_smile:

Which is implying NOT hardware, but still mesh/network or something.

Let me try another repair. I mean, ... you can't break broken any more than it is, so it won't get worse.

Indeed, if it gets worse it's almost better because it's pushing me to .. wipe and reload. That would suck, but if I had started yesterday when I began to have problems, I'd probably have the whole thing rebuilt by this point or at least nearly so, even including a couple of "let it sit for 3+ hours". But I would rather try to "fix" it since if this happened after I had more automation set up, it would be a terrible thing.

Also thanks everyone for the assistance, it is much appreciated!

I have no sure cure but like a lot of folks here, am curious how your hub got into this odd state.

You said you rebooted and shutdown, I believe... but did you do a shutdown followed by pulling the power? Wait 30 seconds, plug it back in? The Z-Radios don't lose power during a shutdown and the filtering capacitors can hold some power for a moment or two.. thus the 30 second wait.. to let the radios power down also.

If you have, forgive me. :slight_smile:

This is probably a long shot, but in the Settings tab, go into both Zigbee and Zwave sections and be sure the radio is enabled. It might not hurt to disable, wait, and re-enable it.

I also had an oddity after a power loss last week where it changed Zigbee channels to some random low numbered channel, and almost all my Zigbee stuff didn't work correctly. I changed channels, rebooted, and all was fine again.

Yeah, twice now it's been off and unplugged for more than a minute (the first time moving it the 6 feet to the other side of the room, and then moving it back).

....

Baffling.

After the repair (again), I think it's OK now?
Google Photos
That should be a picture of me standing in the door to the room where the hub is. It started out, and is currently, in the back left hand corner. I moved it to the wall on the opposite side, in the big "hole" you can see just on this side of the ductwork above the water heater. The wall switch that "always worked fine" is literally at my left elbow in that picture. Makes sense if anything's gonna work, it's that. :slight_smile:

Here's a diagram.


I took the picture from that spot marked at the bottom, "Pic", pointing in the general direction that the arrow points in. Dark Purple "1" was the hubitat's first location, and you can see the HVAC ductwork, all the radiant floor control piping, and the water heater are largely between the hubitat and the rest of the house.

So I moved it from purple 1, to red 1. I figured I'd do that, run a repair in case that was moving it enough to actually make a difference, and see if ... well, as long as it worked I figured it was better.

It didn't work.

BUT I think something was going wrong BEFORE I moved it; that's why I thought to move it, but it wasn't the reason I moved it. In other words, the move was triggered by the problems, though it was more of a thing I figured I needed to do and that was an opportune time to do it, not that I really felt like the move was gonna fix anything, or much anyway.

LEGEND: Purple and red "1" are the hubitat. 12 and 13 are pink because they're battery devices. All the rest are switches, and green vs. orange has nothing to do with switch type - it's "green are downstairs" and the map applies directly, "orange are upstairs" and the map's wrong (and they're only in approximate positions)

@mahlerrd From a troubleshooting standpoint, since this happened when you were updating firmware the next thing I would do to try to isolate the problem is to power cycle the Z-Wave devices themselves for a few minutes, starting with those nearest the hub. It's quite possible that some have become confused and are not properly acting as repeaters. You can then try to see if commands can be sent to them but don't be surprised if another Z-Wave repair or two along the way will be necessary if the routes that were being generated are not optimum.

Please forgive me if you've previously read this.
It had a great impact on me, and it comes from a reputable source.

The following is from a much quoted document on making up a good zwave mesh:

3. Place the hub in a central location

Putting the hub in a corner of the basement might be convenient, but its a terrible idea for Z-Wave. The hub is the most important node in the network and should have the best location possible. While Z-Wave is a mesh network and can route or hop thru other nodes in the mesh, each hop is a significant delay and chokes up the network with more traffic. Ideally the hub should reach 90% of the nodes in your Smart Home without relying on routing. If the hub has Wifi then putting it in a central location is easy, you just need a wall outlet to plug it in. I have my hub hung off the back of a TV cabinet in roughly the middle of the first floor of my home.
(bolding - mine)

Funnily enough, I just tried that and it broke everything and appeared to be unfixable until I moved it back. :slight_smile:

What's the process to move it? I figured it was nothing more complicated than power off, move, power on, run repair?

Of course, I still think I was having other problems, possibly a confused device as suggested, so maybe confused device, THEN a change in topology and the device wasn't ... working right... or something?

Anyway, all seems well now, so maybe I'll move it later today to the inside corner of my wife's office.

(If HE did PoE I would have an even better option. But it doesn't, so I don't.)

I hope that you did the power off via the Hub Shutdown command, right?
Also, I've found for both zigbee and zwave you have to give some time for the network to settle down. How much time? I don't know exactly, but at least 30 to 60 minutes.

Maybe not, but you can use a POE to USB adapter easily enough if you really want to use POE... Many people do it that way.

Yep. Of course I can't guarantee that electrical power will never be interrupted, so the product should gracefully handle this situation. Still, I didn't pull the plug without the appropriate ceremonies.

Ah, I should have thought of that. And sort of did, but hadn't followed up on it yet. Nice to see that's a possibility.

But another thought occurred to me, in that according to Silicon Labs' Z-Wave PC Controller, the Hub is talking directly to only 7 devices.

But, of those 7 devices 2 are as far away from the hub as you can get (one's a battery device and I moved it recently, so it probably doesn't count, but the other should, although it's also still a direct communication after 4 or 5 repairs). So maybe I'm misreading the topology map? Of the below, 11, 12, and the group in the middle 38-41 all make sense.

13 made sense 3 weeks ago before I moved it (12 and 13 are my only two battery powered devices), but it's stayed as a "direct communication" through at least 4 or 5 repairs. And 58 is as far from the hub as you can get, about 42 feet, through 3 walls and I think it might actually have the width of the stair treads in the way.

But generally, everything's connected to half of everything else, EXCEPT the hub isn't even connecting to things in the same gang as devices it will connect to (see especially 35, 36, and 37)

Here's a map of the "direct communication" nodes to the hub:


NOTE GREEN IS FIRST FLOOR, YELLOW/ORANGE IS SECOND FLOOR (so the floorplan aligns with green, doesn't align with yellow).

If 58 and 13 aren't a problem, I don't see why anything in the house is a problem.

But even disregarding that, if the 38-41 group works fine directly communicating (and we'll assume it's not super-marginal signal strength there, since we have two examples of "farther away through more walls" working), and given the Hubitat Elevation is currently 9 feet up and is just as close to the plane the upstairs devices are on as the downstairs devices, there's another 7 devices that are the same or closer than the 7 it currently talks to, if only it would try.

But it doesn't appear to try?

And arguably, half of the devices are within 20-25 feet of it with only a single wall occluding them (and 908 MHz, where US Z-Wave runs, isn't bad at penetrating walls; it's why they used the 900 MHz band with cordless telephones forever).

I can easily move it to near the group in the lower right (that's my wife's office, I'd have to punch down her jacks and whatnot, but I can do that), that would get it within 10 feet of 20 devices, within 20-25 feet of at least half.

So I guess the question is, and I'm sure there's no answer for this, why doesn't my hub even just talk to all the ones it should be able to talk to? Is there any way to make it more aggressive at direct communication instead of through the mesh?

I mean, that diagram above is really weird. #14 (laundry room, center bottom room) is talking to 58 as well, ... I think my whole network is at most one hop from the hub (which I think would be fine), and a ratio of about half direct connect and half 1 hop away, except that the hub seems to not want to directly communicate with anything, and uses 5 close devices as its first hop (and two far away), meaning now I'm possibly two hops.

(I'm in the process of building some trees and maps of potential communication to confirm this).
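A minimal sketch of that kind of tree-building, assuming the neighbor lists have been copied by hand out of the PC Controller into a plain dictionary: a breadth-first search from the hub gives the fewest repeaters each node could need. The function name `repeaters_needed` and the tiny neighbor map are invented for illustration (the node numbers are borrowed from this thread, but the links are not the real ones), and the result is the best case the topology allows, not the route the hub actually picks.

```python
from collections import deque


def repeaters_needed(neighbors, hub=1):
    """Map each reachable node to the fewest repeaters between it and the hub.

    neighbors maps a node ID to the node IDs it can hear. Links are
    treated as two-way. Direct neighbors of the hub need 0 repeaters;
    nodes that cannot be reached at all are left out of the result.
    """
    graph = {}
    for node, peers in neighbors.items():
        graph.setdefault(node, set())
        for peer in peers:
            graph[node].add(peer)
            graph.setdefault(peer, set()).add(node)

    distance = {hub: 0}
    queue = deque([hub])
    while queue:
        current = queue.popleft()
        for nxt in sorted(graph.get(current, ())):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)

    # A node at distance d from the hub has d - 1 repeaters on its path.
    return {node: d - 1 for node, d in distance.items() if node != hub}


if __name__ == "__main__":
    # Hypothetical neighbor map, NOT the real network.
    example = {1: [11, 38, 58], 58: [14], 38: [35, 36], 14: [20]}
    for node, count in sorted(repeaters_needed(example).items()):
        print(f"node {node}: {count} repeater(s)")
```

Counting the nodes that come back with 0 repeaters also gives the direct-reach share. With the 7 direct nodes reported above out of roughly 60 devices, that share would be somewhere around 11-12%, a long way from the 90% figure in the guideline quoted earlier in the thread.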

So, while I'd love to move it, I'm not sure what it'll gain, since the hub doesn't seem to want to talk to all the devices that are close to it already but seems to prefer to let the mesh handle it.

Still, it would be interesting for the sake of science!

I have also used the PC Controller program, and I also cannot rationalize why my hub (node 1) has some nodes as "direct reports" and others not. It just didn't make sense to me.
Perhaps it's some sort of "legacy" impact. I have read on other sites that there seems to be some sort of "preference" for lower-numbered node IDs.
That is, perhaps the zwave repair doesn't really make up a new route for a node if the one that's currently in its tables works.
I'm hopefully moving my zwave devices to a c7 and I'm going to try again and see those nodes that are directly connected to node 1. Perhaps it will clear itself up.

Perhaps @bcopeland would like to comment on how the Hubitat zwave repair program works.