Seeking help for C7 unresponsive

That's not normal.
"Disabling" Z-Wave devices does not stop them from talking to the hub; it just stops the hub from processing their messages any further. It seems like something is jamming up the radio. If you watch the Z-Wave logs you might be able to see whether any devices are being very chatty. Or turn on debug logging for the devices.

First step when you see the z-wave busy would be to power down and unplug for 30 seconds, to restart the radio.

That usually means you have Z-Wave ghosts. Post screenshots of your Z-Wave details page.

Plan to report back in morning if things stay up

Traveling, so a little late. The guidance has helped with plans to investigate further, and I wanted to report back asap. The next update may be Friday because of travel.

I found reports of telnet issues with Lutron, so a cabling change was made to rule out a failed router port or cable. Basically, wifi, C7, Lutron, Hue (no issues reported), and the DNS server (no issues reported) are now all on the same switch as the DSC.

Since doing the above, and with DSC back online, things stayed up yesterday. Then ecoBee, Airthings, and all other devices were brought back online later in the day. This resulted in two Meross devices (wifi) producing new errors; these two devices are disabled now.

As things were still up 6 hours later, Webcore was brought back last night, and this morning there were two more symptoms:

  • one of the pistons using the meshed blinds did not fire. (can attempt to find the log entry later.) This piston is now disabled.
  • C7 responsiveness changed: all pages would load, albeit slowly, except the log. Additionally, the cpu/mem log file stopped logging during the night at 00:35 and resumed after the reboot at 09:50. The C7 was rebooted to get into the logs.
2024-02-20T23:35:00.122-04 | 6 | 414776 | 0d, 18h, 51m, 18s
2024-02-20T23:40:00.108-04 | 12 | 352552 | 0d, 19h, 1m, 18s
2024-02-20T23:45:00.060-04 | 12 | 352552 | 0d, 19h, 1m, 18s
2024-02-20T23:50:00.064-04 | 4 | 363056 | 0d, 19h, 11m, 18s
2024-02-20T23:55:00.059-04 | 4 | 363056 | 0d, 19h, 11m, 18s
2024-02-21T00:00:00.189-04 | 1 | 361916 | 0d, 19h, 21m, 18s
2024-02-21T00:05:00.106-04 | 1 | 361916 | 0d, 19h, 21m, 18s
2024-02-21T00:10:00.100-04 | 5 | 359328 | 0d, 19h, 31m, 18s
2024-02-21T00:15:00.097-04 | 5 | 359328 | 0d, 19h, 31m, 18s
2024-02-21T00:20:00.069-04 | 3 | 356936 | 0d, 19h, 41m, 18s
2024-02-21T00:25:00.062-04 | 3 | 356936 | 0d, 19h, 41m, 18s
2024-02-21T00:30:00.065-04 | 2 | 356540 | 0d, 19h, 51m, 18s
2024-02-21T00:35:00.107-04 | 2 | 356540 | 0d, 19h, 51m, 18s
2024-02-21T09:50:00.639-04 | 0 | 355724 | 0d, 0h, 1m, 30s
2024-02-21T09:55:00.122-04 | 0 | 355724 | 0d, 0h, 1m, 30s
2024-02-21T10:00:00.124-04 | 39 | 520444 | 0d, 0h, 11m, 17s
2024-02-21T10:05:00.063-04 | 39 | 520444 | 0d, 0h, 11m, 17s
2024-02-21T10:10:00.128-04 | 3 | 503456 | 0d, 0h, 21m, 17s
2024-02-21T10:15:00.104-04 | 3 | 503456 | 0d, 0h, 21m, 17s
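
In case it helps anyone reading a similar file, here is a rough Python sketch for spotting where the 5-minute logging stopped. It assumes the `timestamp | cpu | freemem | uptime` layout shown above; the function names and the "more than twice the interval" rule are my own choices, not anything from the hub.

```python
from datetime import datetime, timedelta

# A few lines copied from the log above.
SAMPLE = """\
2024-02-21T00:30:00.065-04 | 2 | 356540 | 0d, 19h, 51m, 18s
2024-02-21T00:35:00.107-04 | 2 | 356540 | 0d, 19h, 51m, 18s
2024-02-21T09:50:00.639-04 | 0 | 355724 | 0d, 0h, 1m, 30s
2024-02-21T09:55:00.122-04 | 0 | 355724 | 0d, 0h, 1m, 30s
"""


def parse_timestamp(text):
    # The log writes the UTC offset as "-04"; strptime's %z expects "-0400".
    return datetime.strptime(text.strip() + "00", "%Y-%m-%dT%H:%M:%S.%f%z")


def find_gaps(log_text, expected=timedelta(minutes=5), slack=2.0):
    """Return (last_seen, next_seen) pairs where logging paused.

    A pause is any step between consecutive entries longer than
    expected * slack (an assumed tolerance).
    """
    stamps = [
        parse_timestamp(line.split("|")[0])
        for line in log_text.splitlines()
        if line.strip()
    ]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > expected * slack]


for start, end in find_gaps(SAMPLE):
    print(f"no entries between {start:%H:%M} and {end:%H:%M}")
```

The last timestamp before a gap is a handy starting point when searching the main hub log for whatever ran just before the freeze.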

With the webcore piston that uses the mesh blinds disabled, the two Meross wifi devices disabled, and everything colocated on one switch, it is hoped the C7 stays up. I should know more later today or tomorrow.

Thank you.

This is an update in case it may give anyone an epiphany.

The current lead is a bit surprising for me as it concerns the ecoBee Suite helpers. After updating to the latest 3.2.8.122, the logs gave a hint at a change.

Last night the ecoBee Suite contact and circulation helper apps were the only things disabled. Reviewing the logs again, the time of the last log entry seemed to coincide with a circulation event issued to ecoBee. These are the only two helpers used, so they are easy to disable. If the C7 stays up for 1d, then perhaps the next test will be to recreate a webCore piston.

Background:

The past week has resulted in inconsistent uptime for the C7 without rebooting. Various things are monitored: telnet with DSC and Lutron, error/warn log entries, wifi devices, spamming log entries, and devices with any pending events. It is clear that if a device were to misbehave (simulated by disabling it), it can wreak havoc in the logs and elsewhere. This makes the debugging take longer, as the device states are used in many ways.

Three RM rules were added to reboot the hub, recording the reboot event in the event log and in a log file, when CPU > 25 or free memory < 250K for 10 minutes, as well as on severe load events. This is in addition to the logging of CPU, freemem and uptime every 5 minutes. This helped, yet the C7 could still freeze before the reboot could happen. Fortunately, the 8081 service page is always available.
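
For anyone wanting to reason about the "for 10 minutes" part outside of RM, here is a minimal Python sketch of that kind of sustained-threshold check. It is not the RM rule itself. The tuple layout, the function names, and reading 250K as 250,000 in the same units as the freemem column are all assumptions on my part.

```python
def breach_minutes(samples, cpu_limit=25, free_mem_floor=250_000):
    """How long the most recent unbroken run of 'bad' samples has lasted.

    samples: list of (minute, cpu, free_mem) tuples in time order.
    A sample is 'bad' when cpu is above cpu_limit or free_mem is
    below free_mem_floor (both limits are assumed values).
    """
    run_start = None
    for minute, cpu, free_mem in samples:
        if cpu > cpu_limit or free_mem < free_mem_floor:
            if run_start is None:
                run_start = minute  # first bad sample of a new run
        else:
            run_start = None  # a good sample ends the run
    if run_start is None:
        return 0
    return samples[-1][0] - run_start


def should_reboot(samples, hold_minutes=10, **limits):
    # Only act when the condition has held for the whole hold period.
    return breach_minutes(samples, **limits) >= hold_minutes


# With 5-minute logging, three bad samples in a row span 10 minutes.
print(should_reboot([(0, 39, 520444), (5, 39, 520444), (10, 30, 503456)]))
```

One thing the sketch makes obvious: a check like this runs on the hub itself, so it cannot fire once the platform has already frozen.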

At the moment the C7 has been up for 15h. This is good considering that over the past 3-4 weeks uptime has ranged from 3-4 hours to 21 hours. There used to be a scheduled weekly reboot of the C7. During this debugging, this was changed to a daily reboot. This is now disabled and only the above RM rules are used instead. This may have been a good thing.

Yesterday the review of time entries suggested the webcore piston for CO2 monitoring and venting and the ecoBee circulation helper may be competing. This is odd since these two processes have been running well for the past year, e.g. June 2023-Jan 2024. So, the ecoBee helpers were disabled, as they were less important. Since this was done yesterday, the C7 has been up for 16h now.

If the C7 hangs after this change, I will need to redo the process of disabling things one by one, since it was confirmed that if the C7 has nothing enabled it will stay up. This is not easy, as webcore pistons seem sensitive to disabled devices.

Additionally, there does seem to be some sensitivity to wifi devices and the reboot of the router (the router was a suspect, and is now replaced and ruled out). There was no log spamming, just an error that resolves in a minute or two. Wemo outlets (wifi) were also suspect, and have been removed (they were given away). There is also the occasional warning about configuring a Zooz device. This too seems to resolve within a minute or two (I assume it is normal since it is near a reboot event).

Otherwise, the logs continue to look clean: typical temp change updates from ecoBee, motion triggers from DSC and Zooz, lights turned on and off, as expected (unless C7 hangs :wink: ).

Well, it seems it may be the ecoBee helpers - :frowning: I do so like the helpers, yet they are removed.

The C7 has been up twice now for over 1d. It even hit two days, so I added the helpers back and, well, it went down within 6 hours.

The pattern seems to centre around the ecoBee Suite and the adding of helpers. Only two were used. The contact helper was used for over 1 year. The circulation helper was actually the one thing that was newly added in January (yep... been working on this for a bit before posting).

I attempted to use the Hubitat ecoBee app, but could not find a way to check if everything was idle, nor to control the HRV for venting. These are used in a webcore CO2 venting piston. After uninstalling the Hubitat ecoBee app and reinstalling the ecoBee Suite app, the hub stayed up, but not quite as long. Oddly, this time I had to remove the ecoBee Suite package and reinstall it for things to stabilize. Once ecoBee Suite was removed and reinstalled, things have stayed up for 1d, 10h, 28m, 30s.

Now, I will leave things alone and see if it will last some days. I suspect there is still something amiss, as the responsiveness of the zwave motion sensors in turning lights on seems to fluctuate: after a fresh reboot things are beautifully quick, then responsiveness decreases, and then it seems to come back.

This past week there were a few other errors, which for now I am not linking to the above and will monitor:

  • cloud backup failed with the error: Failed Steps: zwave. A complete shutdown and power off of the C7 resolved this. This prevented the setup from lasting more than 1d... but it has not recurred since.
  • there were a few java.lang.NullPointerException exceptions for some SE11 and Meross WIFI switches. It seems these start to happen when the hub performance starts to go down.

If all goes well over the next few days, I hope to report back that the hub has been up for days :slight_smile:

I am hopeful, as the C7 and system survived a power outage last night. The LAN devices (C7, router, ISP modem and such) are all on UPSs. The C7 did not even seem to notice. Now, the wifi switches, outlets and such were a different story :slight_smile: All worked as designed though - yay!

If the above suggests anything please let me know. Until next time. Cheers!

Question: is there a way to revert back to an earlier release, e.g. 2.3.6.146?

And/or have the Java system reverted on the C7?


Uptime results are very mixed:

  • The log continues to show java.lang.* errors (OutOfMemoryError/ Exception/ NullPointerException), mostly related to Meross switch devices.

  • The log continues to show daily configure warnings for zwave devices (SE18/ SE11); there are no ghost devices.

  • A plot of memory consumption over the uptime intervals would produce a declining slope.

  • The RM rule detects severe load, but fails to reboot the hub. The only RM rule remaining logs cpu/mem/uptime every 10 minutes.
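
To put a number on the "declining slope" without plotting, a least-squares fit over the logged free memory is enough. This is only a sketch: the function name is made up, and the sample values are the freemem column from the first log excerpt in this thread (23:35 to 00:35), used purely as an illustration.

```python
def slope_per_hour(points):
    """Least-squares slope of free memory against elapsed time.

    points: list of (elapsed_minutes, free_mem) pairs.
    Returns the change in free memory per hour (negative = declining).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return 60 * num / den


# freemem values from the earlier excerpt, one sample every 5 minutes
free_mem = [414776, 352552, 352552, 363056, 363056, 361916, 361916,
            359328, 359328, 356936, 356936, 356540, 356540]
points = [(5 * i, mem) for i, mem in enumerate(free_mem)]

print(f"free memory trend: {slope_per_hour(points):.0f} per hour")
```

A single hour is far too short to prove a leak, and the first sample dominates this fit, so the same calculation over a full uptime interval would be more telling.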

I would like to test reverting the C7 to a firmware version from fall 2023 to separate the effect of firmware updates from other potential changes.

Otherwise, I created a watchdog piston on the C8 to confirm the C7 uptime has changed every 10 minutes and, if not, reboot the C7 (with thanks to @Sebastien for the idea).
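
The watchdog idea, written out as a small Python sketch rather than a piston. The uptime format is the one from the cpu/mem log; the class name, the `max_stalled_checks` setting, and leaving the actual reboot call to the caller are assumptions for illustration only.

```python
import re


def uptime_seconds(text):
    """Convert an uptime string such as '1d, 6h, 57m, 23s' to seconds."""
    match = re.fullmatch(r"\s*(\d+)d,\s*(\d+)h,\s*(\d+)m,\s*(\d+)s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised uptime: {text!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


class UptimeWatchdog:
    """Flags a hub whose reported uptime has stopped changing."""

    def __init__(self, max_stalled_checks=1):
        self.max_stalled_checks = max_stalled_checks
        self.last = None
        self.stalled = 0

    def check(self, uptime_text):
        """Feed one reading; return True when a reboot should be requested."""
        current = uptime_seconds(uptime_text)
        if current == self.last:
            self.stalled += 1  # same value as last time: hub may be frozen
        else:
            self.stalled = 0   # value moved (or hub rebooted): all good
        self.last = current
        return self.stalled >= self.max_stalled_checks


dog = UptimeWatchdog()
for reading in ["0d, 19h, 41m, 18s", "0d, 19h, 51m, 18s", "0d, 19h, 51m, 18s"]:
    print(reading, "->", "reboot" if dog.check(reading) else "ok")
```

The check interval needs to be at least as long as the interval at which the uptime value is refreshed; in the 5-minute log above, each uptime value appears twice, so checking every 5 minutes would raise false alarms.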

Please let me know if reverting C7 is possible, and how.

Thank you

Try above....

Yes that will work. The rest of that topic is unrelated but the steps to revert back to the last 2.3.6 release are the same and will work.

@danabw and @jtp10181, thank you. I am just learning about endpoints now.

I saw the new release 2.3.8.128 this morning as well, which suggests some changes. Not sure which way I will go. I will report back with results.

Well... results still mixed.

Only changes now are: remaining RM is the logging of cpu/mem/uptime, and C8 watchdog of uptime.

Things worked longer:

date time | cpupct | freemem | cpupct15 | uptime
2024-03-20T06:50:00.150-03 | 5 | 331780 | 2 | 1d, 6h, 57m, 23s

ecoBee checked for updates 30 secs after this, and then log entries start to become further apart. ecoBee did not even complete its updates until a reboot.

The C8 watchdog caught the uptime not changing. The /hub/reboot/ endpoint did not reboot the C7 :frowning:

:8081 diag tool on C7 does reboot the C7.

Thoughts?

I would disable all the ecobee stuff and see if that helps. I feel like a lot of people who have lock up issues are using the ecobee integration.

Thank you. I have done that before, disabling everything: ecoBee, webcore, airthings and such. All were disabled and then turned back on in varying combinations. I can confirm ecoBee helpers definitely needed to go.

Curiously, there was a recent 3 day stint - yay, but not repeated in months - when Adguard Home (pihole alternative) was disabled. I am testing ecoBee left out of Adguard to see if it changes anything (reports stated nothing was blocked so this is a double check).

Please know reverting to 236 was promising, until webcore was failing. The webcore downgrade did not seem to work well. The latest 238 is now installed.

A quick update... The upgrade to 2.3.8.128 seems to have stabilized things. Now testing adding an ecoBee contact helper back.

Should the contact helper work, it would mean everything is back to normal :slight_smile:

Here's hoping!

Thank you for support.

Update...

  1. Just in... the C7 running 2.3.8 went down :frowning: within 6 hours of enabling the ecoBee Suite Contact Helper, which had run reliably on the C7 with 2.3.6 for over 10 months. Now testing its removal to see if the multi-day uptime resumes.

  2. A new webcore piston using ecoBee Suite devices was introduced last week and ran fine during the 3+ day uptime. It is a webcore "smart circulation" piston using the ecoBee Suite devices. Assuming things recover, a "contact helper" piston may need to be tested.

QUESTION: What can be done to get C7 back on 2.3.6 and its HE webcore working?
The 2.3.6 downgrade broke HE webcore on the C7: webcore would not load, saying to log out and back in to fix it, in a repeating loop. Once the C7 was restored to 2.3.8 its webcore was accessible again, yet the paused pistons needed to have some of their devices corrected. Please note there is also an HE webcore on the C8 (v2.3.8).

Not sure since I do not use webcore, but I think there is a way to backup all your pistons? You might need to do a backup, delete the app from the hub, reinstall the app then restore the backup? Hopefully someone else who uses webcore can chime in if this is safe to try.

Well, with the contact helper disabled and, hopefully, an equivalent new WC piston in its place, the C7 has been up for double the time (12h) compared to when the contact helper was enabled (6h). The ecoBee Suite's devices are still being used. A sample of the piston expression:

!contains([EcobeeTherm: ecobee:equipmentStatus], 'fan' ) && (abs([EcobeeSensor: Desk (QBZN) : temperature] - [EcobeeTherm: ecobee : temperature]) > {MAX_DELTA})

Hopefully, this can:

  • perhaps point to what the C7 change was between 236 and 238
  • perhaps help someone else that is encountering this issue
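
For readers who do not use webcore, the piston expression above reads roughly like the Python below. This is my paraphrase only: the function and parameter names are invented, and it assumes `contains` is a plain substring test on the equipment status text.

```python
def piston_condition(equipment_status, desk_temp, thermostat_temp, max_delta):
    """True when the fan is not already running and the desk sensor
    differs from the thermostat by more than max_delta degrees."""
    fan_running = "fan" in equipment_status  # assumed substring match
    return (not fan_running) and abs(desk_temp - thermostat_temp) > max_delta


print(piston_condition("idle", 24.0, 21.5, 2.0))      # fan off, 2.5 degrees apart
print(piston_condition("fan only", 24.0, 21.5, 2.0))  # fan already running
```

So the piston only does something when the fan is idle and the two temperature readings have drifted apart by more than the configured delta.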

For now, ecoBee Suite helpers are not being used, and things seem to be working. @jtp10181 perhaps I can test 236 with the piston backup this weekend. Thank you for the idea; I had not considered the remove and restore from WC backup.

Otherwise, I will report back over the next while, hopefully to share that the C7 has been responsive with a long uptime :slight_smile:

ecoBee or Airthings may not be the problem on the C7.

ecoBee and Airthings were moved to the C8, and removed from the C7, and the C7 still went down after a day.

Summary of new setup:

  • moved ecoBee and Airthings off C7 to a C8
  • C7 now has webcore for light and motion pistons, Lutron, Hue, zwave motion/sensors, and the HubitAlarm
  • C7 hosts HubitatAlarm (DSC alarm part is on a separate PC/SBC) and shares some contacts sensors to C8. C8 uses these sensors in some WC pistons.
  • C7 also shares some other Zooz SE18 and SE11 devices to C8.
  • Confirmed no jumbo frames on a NAS, or used by anything else, i.e. MTUs are all default 1500.
  • C8 has been up for days now.

Is it still the case where you can access the diagnostic port to reboot?
That basically means the main platform / UI is crashing.

Try a different power supply and cable, just to be sure it's not some sort of underpower issue.
Do you check the temps at all using Hub Info? If temp gets too hot it can cause CPU issues, which might lead to the platform crash.

Otherwise I am thinking you should downgrade to 2.3.6 and see how that works.

Yes, you understand correctly: 8081 works, UI/main platform unresponsive.

Will do for the PS and cable. The original C7 PS and cable are still in use. I can test these soon to rule them out.

As to the CPU temp, I can confirm the temp is in the "green". I can also confirm the CPU usage does start spiking, even significantly. It was observed above 0.9, and even at 1.8+. All per hub-a-dashery.

I can also confirm your idea to restore pistons from a backup works. It was used to move ecobee and airthings from C7 to C8; it is a lot of work. So, the remaining full test of downgrading to 2.3.6 again is being put off for a bit because of the light/motion pistons that would need restoring.

For what it is worth, I have considered attempting to shape the IP tx/rx traffic to the C7. My suspicion is there is some change after 2.3.6 in this area. This past weekend while installing a new switch there was a significant load observed on the C7 CPU (UI was not responsive, but the zwave devices were still sluggish), which it recovered from, yet memory did not recover. Additionally, I noticed before moving airthings off the C7 that CPU load on the C7 increased when the Airthings IP connection was being converted.

Thank you for the thoughts!!

Interesting... Two hue C7 bulbs were mesh shared to the C8 for use in a piston. The piston uses an ecoBee temperature sensor's motion on the C8 to turn the bulbs on and off.

The C7 did not stay up 3 hours when this was implemented. ecoBee is on C8, and not shared to the C7. Crazy...

The C7 has other hue bulbs tied into its own webcore piston and stays up.

I might try the other way - ecoBee sensor shared to C7 and C7 runs the piston.