Seeking help for C7 unresponsive

Seeking help or guidance to resolve a C7 that becomes unresponsive after 4-6 hours.

Symptoms:

  • Light stays green.

  • Can access via 8081 to reboot, which allowed it to run for about 6 hours; this interval seems to be shrinking. As “the” hour approaches, automations such as turning lights on and off, or venting, become slower to respond or stop responding.

Background:

The C7 has performed perfectly since before January 2024 without changes. On February 8th, new firmware was found, so the C7 was updated. Since then, things have not been good. Attempted several soft resets and restores from backups, and also attempted restoring the previous firmware; this did not resolve things. Did a network reset just in case.

With the help of a friend, to whom I am grateful, many things were tried: tracking memory and CPU usage every 5 minutes, attempting to reboot if CPU load went above 50% for 5 minutes, disabling the hub mesh to the C8, and disabling larger apps (ecoBee suite, Alarm Panel for DSC, Airthings, webcore). The results were the same: reboot needed in 4-6 hours.

Summary of actions since Feb 8th:

  1. Hub Information Driver v3 is installed, to track memory, cpu usage

  2. Have soft reset and restored, a few times now.

  3. Have downgraded back to 2.3.6.146 without soft reset and restore

  4. Network reset.

  5. Yes - can connect and shut down from 8081

  6. Yes - can ping

  7. Platform Version was: 2.3.6.146, now restored to 2.3.8.119

  8. Hardware Version: C-7

  9. Connection is ethernet - static IP assigned by mac address.

  10. Power via supplied power block and cable

  11. Ethernet cable connects to the same switch as everything else in home.

  12. No firewalls, proxies or such between C7 and LAN and home devices.

  13. Jumbo frames are not enabled

Devices:

  • 3 Zigbee devices shared from hub mesh with a C8

  • 4 Matter devices shared from hub mesh with a C8

  • 15 zwave devices

  • 15 Lutron (Zigbee)

  • 13 Meross (wifi) - started disabling as part of debugging

  • Had 4 Wemo (wifi) - now all removed as part of debugging (was always heavy and wanted to remove anyway)

  • 1 EcoBee thermostat - 3 sensors - via Ecobee Suite Manager

  • 12 DSC security via Hubitat Alarm App

  • 3 Air Things devices - via the Air Things Cloud

Anything in the logs like a lot of chattiness from any particular device?

What 3rd party apps do you have installed?

Do you have @thebearmay 's Hub Information driver installed? (makes it useful to monitor memory levels)

Hi,

  • nothing noteworthy re: chatty logs apps or such
  • No new third party apps. Noteworthy ones: ecoBee suite, Alarm Panel (for DSC), Airthings, webcore
  • Hub Information Driver v3 is installed, to track memory, cpu usage

Every 5 min: timestamp | cpu percent | freemem | uptime

  • nb. no entries from 2:35 to 6:00, when the hub was rebooted manually using 8081
  • nb. no entries from 8:05 to 9:05, when the hub was again rebooted manually using 8081
  • reboot RM rules for cpupct > 51 and freemem < 250000 never triggered
2024-02-19T02:15:00.104-04 | 2 | 425816 | 0d, 2h, 21m, 16s
2024-02-19T02:20:00.134-04 | 2 | 425816 | 0d, 2h, 21m, 16s
2024-02-19T02:25:00.091-04 | 7 | 409872 | 0d, 2h, 31m, 16s
2024-02-19T02:30:00.091-04 | 7 | 409872 | 0d, 2h, 31m, 16s
2024-02-19T02:35:00.097-04 | 1 | 409972 | 0d, 2h, 41m, 16s
2024-02-19T06:00:00.185-04 | 1 | 534368 | 0d, 0h, 1m, 24s
2024-02-19T06:05:00.108-04 | 1 | 534368 | 0d, 0h, 1m, 24s
2024-02-19T06:10:00.107-04 | 38 | 552420 | 0d, 0h, 11m, 8s
2024-02-19T06:15:00.160-04 | 38 | 552420 | 0d, 0h, 11m, 8s
2024-02-19T06:20:00.105-04 | 5 | 494272 | 0d, 0h, 21m, 9s
2024-02-19T06:25:00.103-04 | 5 | 494272 | 0d, 0h, 21m, 9s
2024-02-19T06:30:00.140-04 | 4 | 475844 | 0d, 0h, 31m, 9s
2024-02-19T06:35:00.104-04 | 4 | 475844 | 0d, 0h, 31m, 9s
2024-02-19T06:40:00.099-04 | 4 | 457336 | 0d, 0h, 41m, 9s
2024-02-19T06:45:00.095-04 | 4 | 457336 | 0d, 0h, 41m, 9s
2024-02-19T06:50:00.120-04 | 6 | 446600 | 0d, 0h, 51m, 9s
2024-02-19T06:55:00.097-04 | 6 | 446600 | 0d, 0h, 51m, 9s
2024-02-19T07:00:00.111-04 | 1 | 434960 | 0d, 1h, 1m, 9s
2024-02-19T07:05:00.094-04 | 1 | 434960 | 0d, 1h, 1m, 9s
2024-02-19T07:10:00.150-04 | 5 | 463948 | 0d, 1h, 11m, 9s
2024-02-19T07:15:00.098-04 | 5 | 463948 | 0d, 1h, 11m, 9s
2024-02-19T07:20:00.141-04 | 1 | 459376 | 0d, 1h, 21m, 9s
2024-02-19T07:25:00.096-04 | 1 | 459376 | 0d, 1h, 21m, 9s
2024-02-19T07:30:00.100-04 | 4 | 456140 | 0d, 1h, 31m, 10s
2024-02-19T07:35:00.142-04 | 4 | 456140 | 0d, 1h, 31m, 10s
2024-02-19T07:40:00.096-04 | 4 | 444868 | 0d, 1h, 41m, 10s
2024-02-19T07:45:00.100-04 | 4 | 444868 | 0d, 1h, 41m, 10s
2024-02-19T07:50:00.151-04 | 6 | 442328 | 0d, 1h, 51m, 10s
2024-02-19T07:55:00.152-04 | 6 | 442328 | 0d, 1h, 56m, 8s
2024-02-19T08:00:00.100-04 | 2 | 429632 | 0d, 2h, 1m, 10s
2024-02-19T08:05:00.099-04 | 2 | 429632 | 0d, 2h, 1m, 10s
2024-02-19T09:05:00.633-04 | 3 | 430020 | 0d, 0h, 1m, 24s
2024-02-19T09:10:00.113-04 | 3 | 430020 | 0d, 0h, 1m, 24s
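
For anyone wanting to scan a log like the one above for the silent periods, here is a minimal Python sketch. It assumes the four-column "timestamp | cpu | freemem | uptime" layout shown in this post; the function names `parse_line` and `find_gaps` are made up for illustration and are not part of Hub Information Driver or Hubitat.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "<timestamp> | <cpu pct> | <free mem> | <d>d, <h>h, <m>m, <s>s"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \| (?P<cpu>\d+) \| (?P<mem>\d+) \| "
    r"(?P<d>\d+)d, (?P<h>\d+)h, (?P<m>\d+)m, (?P<s>\d+)s$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The log writes the UTC offset as "-04"; strptime's %z wants "-0400".
    ts = datetime.strptime(m["ts"] + "00", "%Y-%m-%dT%H:%M:%S.%f%z")
    uptime = timedelta(
        days=int(m["d"]), hours=int(m["h"]),
        minutes=int(m["m"]), seconds=int(m["s"]),
    )
    return {"ts": ts, "cpu": int(m["cpu"]), "freemem": int(m["mem"]), "uptime": uptime}


def find_gaps(samples, expected=timedelta(minutes=5)):
    """Return (last_seen, next_seen, rebooted) for every pause in logging.

    A pause is any step between consecutive samples longer than twice the
    expected interval. `rebooted` is True when uptime went backwards.
    """
    gaps = []
    for prev, cur in zip(samples, samples[1:]):
        if cur["ts"] - prev["ts"] > 2 * expected:
            gaps.append((prev["ts"], cur["ts"], cur["uptime"] < prev["uptime"]))
    return gaps
```

Fed the lines above, this should flag the 02:35 to 06:00 and 08:05 to 09:05 pauses, both followed by a lower uptime. The "twice the interval" tolerance is an arbitrary choice.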

Hi,

Please know, as an extreme attempt to assess, all devices are disabled, and 95% of the apps are disabled. Apps left enabled are Hubitat Package Manager, Hub Information Driver v3, Mode Manager and RM.

Hopefully this will let things stay up for 1-2 days. Then I will start enabling things again.

Thank you

@bobbyD

OP stated he'd done that several times now :frowning:

These 7 devices are all coming from the C8 to the C7 via Hub Mesh, correct?

What type(s) of Z-Wave devices are these? Do any of them have power/energy reporting enabled on them? Are they paired with S0 security by any chance?

What are these devices? :thinking: Are these the Lutron Aurora Smart Bulb Dimmer Switch devices? That is the only Zigbee device that Lutron still sells, IIRC.

Just trying to fully understand everything that is running on the hub...

Unfortunately I have seen this kind of issue on some hubs running the DSC security integration. It's using telnet and if there is a communication problem with the panel, the integration closes your telnet connection, which is critical to your hub's proper functionality.

You can confirm issues with telnet by checking the logs for errors or warnings like "Telnet connect failed" or "Connection failure".

Hi,

Checking the logs with almost all apps disabled (RM is running to monitor cpu and freemem, which are the only log entries), there is one exception. There are two entries, both the same and about 5 hours apart, that I have not seen before:

sys:1 2024-02-19 05:03:22.837 PM warn Z-Wave Network responded with Busy message.

Does this error suggest anything? All devices are disabled and only the base apps are running.

There were other earlier entries generating errors, as I mistakenly left some webcore pistons running. Once they were disabled the logs cleaned up as expected.

Otherwise...

These 7 devices are all coming from the C8 to the C7 via Hub Mesh, correct?

Correct, C8 shares them and the C7 was using them in some webcore pistons since September. Operating three blinds depending on time and temperature, and 4 Matter Outlets to operate some LED lights.

What type(s) of Z-Wave devices are these? Do any of them have power/energy reporting enabled on them? Are they paired with S0 security by any chance?

The zwave devices are Zooz SE11/SE18 sensors (no security) and some door contact sensors.

There are also three Alfred locks that have security enabled.

There are also two RING repeaters which I will check on security and can report back.

What are these devices? :thinking: Are these the Lutron Aurora Smart Bulb Dimmer Switch devices? That is the only Zigbee device that Lutron still sells, IIRC.

The Lutron devices are Caseta DIVA dimmer switches. Sorry, I thought Lutron was zigbee.

You can confirm issues with telnet by checking the logs for errors or warning like "Telnet connect failed" or "Connection failure".

I will check on this and report back. I know there were some WSS errors, and telnet errors once I started trying to debug things, as I was working to isolate things. Before all this started I do not recall such.

Please know, with everything disabled the C7 has stayed up so far. Hubitat Alarm (DSC) will be re-enabled, and I will report back.

Thank you :slight_smile:

Please know, C7 has the following remaining active (mention to help clarify state as best as possible):

Devices left:

  1. Hub Information driver

  2. Hubitat Alarm Panel (just re-enabled to explore for telnet errors)

     • 12 devices - contacts, motion and smoke alarms

  3. Three Blinds C8 mesh shared (will not stay disabled, if checked)

Apps:

  1. File Manager Backup & Restore

  2. Hubitat Alarm

  3. Hubitat Package Manager

  4. Hubitat Safety Monitor

  5. Mode Manager

  6. Rebooter

  7. Rule Machine:

  • mem < 250K for 5 min
  • cpu > 51 for 5 min
  • log cpu, freemem and uptime

Plan to report back in morning if things stay up :slight_smile:

That's not normal.
"Disabling" Z-Wave devices does not stop them from talking to the hub, it just stops the hub from processing their messages any further. Seems like something is jamming up the radio. If you watch the Z-Wave logs you might be able to see if any of them are being very chatty. Or turn on debug logging for the devices.

First step when you see the Z-Wave busy message would be to power down and unplug for 30 seconds, to restart the radio.

That usually means you have Z-Wave ghosts. Post screenshots of your Z-Wave details page.

Plan to report back in morning if things stay up

Traveling, so a little late. The guidance has helped with plans to investigate further, and I wanted to report back asap. Next update can be Friday due to travels.

I found reports of telnet issues with Lutron, so a cabling change was made to rule out a failed cable or port on the router. Basically, wifi, C7, Lutron, Hue (no issues reported), and DNS server (no issues reported) are now all moved to the same switch as the DSC.

Since doing the above, and with DSC back online, things stayed up yesterday. Then ecoBee, Airthings, and all devices were brought back online later in the day. This resulted in two Meross devices (wifi) producing new errors; these two devices are disabled now.

As things were still up 6 hours later, Webcore was brought back last night, and this morning there were two more symptoms:

  • one of the pistons using the meshed blinds did not fire. (can attempt to find the log entry later.) This piston is now disabled.
  • C7 responsiveness changed: all pages would load, albeit slowly, except the logs page. Additionally, the cpu/mem log file stopped logging during the night at 00:35, and resumed once rebooted at 09:50. The C7 was rebooted to get into the logs.
2024-02-20T23:35:00.122-04 | 6 | 414776 | 0d, 18h, 51m, 18s
2024-02-20T23:40:00.108-04 | 12 | 352552 | 0d, 19h, 1m, 18s
2024-02-20T23:45:00.060-04 | 12 | 352552 | 0d, 19h, 1m, 18s
2024-02-20T23:50:00.064-04 | 4 | 363056 | 0d, 19h, 11m, 18s
2024-02-20T23:55:00.059-04 | 4 | 363056 | 0d, 19h, 11m, 18s
2024-02-21T00:00:00.189-04 | 1 | 361916 | 0d, 19h, 21m, 18s
2024-02-21T00:05:00.106-04 | 1 | 361916 | 0d, 19h, 21m, 18s
2024-02-21T00:10:00.100-04 | 5 | 359328 | 0d, 19h, 31m, 18s
2024-02-21T00:15:00.097-04 | 5 | 359328 | 0d, 19h, 31m, 18s
2024-02-21T00:20:00.069-04 | 3 | 356936 | 0d, 19h, 41m, 18s
2024-02-21T00:25:00.062-04 | 3 | 356936 | 0d, 19h, 41m, 18s
2024-02-21T00:30:00.065-04 | 2 | 356540 | 0d, 19h, 51m, 18s
2024-02-21T00:35:00.107-04 | 2 | 356540 | 0d, 19h, 51m, 18s
2024-02-21T09:50:00.639-04 | 0 | 355724 | 0d, 0h, 1m, 30s
2024-02-21T09:55:00.122-04 | 0 | 355724 | 0d, 0h, 1m, 30s
2024-02-21T10:00:00.124-04 | 39 | 520444 | 0d, 0h, 11m, 17s
2024-02-21T10:05:00.063-04 | 39 | 520444 | 0d, 0h, 11m, 17s
2024-02-21T10:10:00.128-04 | 3 | 503456 | 0d, 0h, 21m, 17s
2024-02-21T10:15:00.104-04 | 3 | 503456 | 0d, 0h, 21m, 17s

With the webcore piston that uses the meshed blinds disabled, the two Meross wifi devices disabled, and everything colocated on one switch, it is hoped the C7 stays up. Will know more later today or tomorrow.

Thank you.

This is an update in case it may give anyone an epiphany.

The current lead is a bit surprising for me as it concerns the ecoBee Suite helpers. After updating to the latest 2.3.8.122, the logs gave a hint at a change.

Last night the ecoBee suite contact and circulation helper apps were the only things disabled. Reviewing the logs again, the last log entry seemed to coincide with a circulation event issued to ecoBee. These are the only two helpers used, so they are easy to disable. If the C7 stays up 1d, then perhaps a recreated webCore piston will be tested.

Background:

The past week has resulted in inconsistent uptime for the C7 without rebooting. Various things are monitored: telnet with DSC and Lutron, error/warn log entries, wifi devices, spamming log entries, devices with any pending events. It is clear that if a device were to misbehave (simulated by disabling it), it can wreak havoc in the logs and such. This makes the debugging longer, as the device states are used in many ways.

Three RM rules were added to record a reboot event, in both the event log and a log file, when CPU > 25 or free mem < 250K for 10 minutes, as well as on severe load events. This is in addition to the logging of CPU, freemem and uptime every 5 minutes. This helped, yet the C7 could still freeze before the reboot could happen. Fortunately, the 8081 service page is always available.
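
To make the "for 10 minutes" condition concrete, here is a rough Python sketch of that kind of sustained-threshold check. It only illustrates the logic and is not how Rule Machine implements it; the function name, the tuple layout, and the default limits (taken from the numbers above) are assumptions.

```python
def sustained_breach(samples, cpu_limit=25, mem_floor=250_000, hold_minutes=10):
    """Return True if CPU stays above cpu_limit, or free memory stays below
    mem_floor, for at least hold_minutes without a healthy sample in between.

    samples: list of (minute, cpu_pct, free_mem) tuples in ascending time
    order, with free_mem in whatever unit the hub log reports.
    """
    breach_start = None
    for minute, cpu, mem in samples:
        if cpu > cpu_limit or mem < mem_floor:
            if breach_start is None:
                breach_start = minute      # first unhealthy sample of a run
            if minute - breach_start >= hold_minutes:
                return True                # unhealthy for the whole hold period
        else:
            breach_start = None            # a healthy sample resets the timer
    return False
```

One limitation is visible in the sketch itself: if the hub stops producing samples before the hold period has elapsed, the check never fires, which is consistent with the freezes described above.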

At the moment the C7 has been up for 15h. This is good considering that over the past 3-4 weeks it has ranged from 3-4 hours to 21 hours. There used to be a scheduled weekly reboot of the C7. During this debugging, this was changed to a daily reboot. This is now disabled and only the above RM rules are used instead. This may have been a good thing.

Yesterday the review of time entries suggested the webcore piston for CO2 monitoring and venting and the ecoBee circulation helper may be competing. This is odd since these two processes have been running well for the past year, e.g. June 2023-Jan 2024. So, the ecoBee helpers were disabled, as they were less important. Since this was done yesterday, the C7 has been up for 16h now.

If the C7 hangs after this change, I may need to redo the process of disabling things 1 by 1, since it was confirmed that if the C7 has nothing enabled it will stay up. This is not easy, as webcore pistons seem sensitive to disabled devices.

Additionally, there does seem to be some sensitivity to wifi devices and the reboot of the router (the router was a suspect, and is now replaced and ruled out). There was no log spamming, just an error that resolves in a minute or two. Wemo outlets (wifi) were also suspect, and have been removed (they were given away). There is also the occasional warning about configuring a Zooz device. This too seems to resolve within a minute or two (I assume it is normal since it is near a reboot event).

Otherwise, the logs continue to seem clean: typical temp change updates from ecoBee, motion triggers from DSC and Zooz, lights turned on and off, as expected (unless C7 hangs :wink: ).

Well it seems it can be the ecoBee helpers - :frowning: I do so like the helpers, yet they are removed.

The C7 has been up twice now for over 1d. It even hit two days so I added the helpers back and well, it went down within 6 hours.

The pattern seems to centre around the ecoBee Suite and the adding of Helpers. Only two were used. The contact helper was used for over 1 year. The circulation helper was actually one thing that was newly added in January (yep... been working on this for a bit before posting.)

I attempted to use the Hubitat ecoBee app, but could not find a way to check if everything was idle nor control the HRV for venting. These are used in a webcore CO2 venting piston. After uninstalling Hubitat ecoBee and reinstalling the ecoBee Suite app, things stayed up, but not quite as long. Oddly, this time I had to remove the ecoBee Suite package and reinstall it for things to stabilize. Once ecoBee Suite was removed and reinstalled, things have stayed up for 1d, 10h, 28m, 30s.

Now, I will leave things alone and see if it will last some days. It is suspected that there is still something amiss, as the zwave motion sensor responsiveness to turn lights on seems to fluctuate: after a fresh reboot things are beautifully quick, then responsiveness decreases, and then it seems to come back.

This past week there were a few other errors, which for now I am not linking to the above and will monitor:

  • cloud backup failed with the error: Failed Steps: zwave. A complete shutdown and power off of the C7 resolved this. This prevented the setup from lasting more than 1d... but it has not been going ever since.
  • there were a few java.lang.NullPointerException exceptions for some SE11 and Meross WIFI switches. It seems these start to happen when the hub performance starts to go down.

Should all go well over the next days, I hope to report back that the hub has been up for days :slight_smile:

I am hopeful, as the C7 and system survived a power outage last night. The LAN (C7, router, ISP modem and such) are all on UPSs. It did not even seem to notice. Now, the wifi switches and outlets and such were a different story :slight_smile: All worked as designed though - yay!

If the above suggests anything please let me know. Until next time. Cheers!

Question: is there a way to revert back to an earlier firmware, e.g. 2.3.6.146?

And/or have the Java system reverted on the C7?


Uptime results are very mixed:

  • The log continues to show java.lang.* errors (OutOfMemoryError/ Exception/ NullPointerException), mostly related to Meross switch devices.

  • The log continues to show daily configure warnings for zwave devices (SE18/ SE11) (no ghost devices).

  • A plot of memory consumption over the uptime intervals would produce a declining slope.

  • The RM rule for severe load detects the condition, but fails to reboot. The only RM rule remaining is to log cpu/mem/uptime every 10 minutes.
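
As a rough way to put a number on the declining memory slope mentioned in the list above, here is a small Python sketch that fits an ordinary least-squares line through (uptime, free memory) points. The function name and input layout are made up for illustration; the points would come from the cpu/mem/uptime log.

```python
def freemem_slope(points):
    """Least-squares slope of free memory against uptime.

    points: list of (uptime_minutes, free_mem) pairs from one uptime interval.
    Returns the change in free memory per minute (negative means declining).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all uptime values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

A falling line by itself does not prove a leak, since free memory normally drops after boot as caches fill, so comparing the slope between intervals with and without a suspect app is probably more telling than any single value.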

I would like to test reverting the C7 back to a firmware version from fall 2023 to rule out the updates versus other potential changes.

Otherwise, created a watchdog piston on the C8 to confirm the C7 uptime has changed every 10 minutes and, if not, reboot the C7 (with thanks to @Sebastien for the idea).
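
The watchdog idea, sketched in Python rather than as a webCoRE piston, looks roughly like this. The class and its two callables are hypothetical: `read_uptime` stands for however the C8 obtains the C7's uptime value, and `reboot` for whatever actually restarts the C7; neither is a real Hubitat API.

```python
class UptimeWatchdog:
    """Trigger a reboot when the watched hub's uptime stops changing."""

    def __init__(self, read_uptime, reboot):
        self._read_uptime = read_uptime  # hypothetical: returns the current uptime value
        self._reboot = reboot            # hypothetical: asks the hub to restart
        self._last = None

    def check(self):
        """Run once per interval (e.g. every 10 minutes). Returns True if stalled."""
        try:
            current = self._read_uptime()
        except Exception:
            current = None               # an unreachable hub counts as stalled
        stalled = current is None or current == self._last
        if stalled:
            self._reboot()
            self._last = None            # start fresh after the reboot request
        else:
            self._last = current
        return stalled
```

The check only works if the reported uptime actually advances between runs, so the check interval should be longer than the interval at which uptime is refreshed.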

Please let me know if reverting C7 is possible, and how.

Thank you

Try above....

Yes that will work. The rest of that topic is unrelated but the steps to revert back to the last 2.3.6 release are the same and will work.

@danabw and @jtp10181, thank you. I am just learning about endpoints now.

I saw the new release 2.3.8.128 this morning as well, which suggests some changes. Not sure which way I will go. Can report back results.

Well... results still mixed.

The only changes now are: the only remaining RM rule is the logging of cpu/mem/uptime, and the C8 watchdog checks uptime.

Things worked longer:

date time | cpupct | freemem | cpupct15 | uptime
2024-03-20T06:50:00.150-03 | 5 | 331780 | 2 | 1d, 6h, 57m, 23s

ecoBee checked for updates 30 secs after this, and then log entries start to become further apart. ecoBee did not even complete its updates until a reboot.

C8 watchdog caught the uptime not changing. The /hub/reboot/ endpoint did not reboot the C7 :frowning:

:8081 diag tool on C7 does reboot the C7.

Thoughts?