Multi Hub Task Splitting

Here is a little history.
Unfortunately, I have far too many cases where an RM rule fails to complete. A failing rule usually has re-triggering protection: the Private Boolean (PB) is set to FALSE at the very beginning and to TRUE at the very end, with a Required Expression of PB = TRUE. All these rules fail randomly at a "Wait for Expression" action. The nature of the failure is that the Expression becomes TRUE but the rule does not advance beyond that point. I have reported this multiple times, but I was usually unable to provide all the required information, such as the logs for the rule itself, the status of the rule subscription(s), the logs for all the devices involved, etc. Something was always missing, and because the failure is so random, it is/was impossible to collect all of that information.

As of now I have given up fighting this phenomenon.
Instead, I am thinking about separating tasks across multiple hubs. Ideally it would be nice to have
a single hub with true task separation. I am talking about truly dedicated, complete processing units for different tasks, such as:

  • Dedicated processing for all Z-Wave communications
    (this is already the case: there is a dedicated Z-Wave sub-processor);
  • Dedicated processing for all Zigbee communications
    (this is already the case: there is a dedicated Zigbee sub-processor);
  • Dedicated LAN/Matter processing
    (I am not using Matter, but I have a C7 hub dedicated to almost all LAN integrations);
  • Dedicated Event and Application processing units
    (right now this is all part of the multi-core/multi-tasking/multi-threading SW implementation).

Because the current HW does not have a dedicated Event Processor and Application Processor, I am seriously thinking about using separate hubs for these tasks.
My BIG question is:
If I use, say, one hub only for Z-Wave (and/or Zigbee, and/or LAN) communication and one hub dedicated to applications only (all hubs interconnected via Hub Mesh), where will the Event Processing be done?
If the Event Processing for, say, Z-Wave is done on the dedicated Z-Wave hub, and so on, this should be a big help in splitting tasks between multiple CPUs. However, all this mess with the HW will not improve anything if the Event Processing is done on the Application hub.

There are guys on here way smarter than me, but I have been on Hubitat for years now and I do have a 3-hub mesh in operation. Well, one is just for handling Aqara devices on a separate Zigbee channel, but two of the hubs definitely hand the baton to each other during certain operations.

IMHO, from an engineering perspective, I don’t think making your setup more complicated is going to achieve the simplification you are looking for. I think your issues will persist and become even harder to debug.

Targeting your issue: using PB will prevent a rule from running a second instance, BUT the rule still gets triggered. And when the rule gets triggered while the previous instance is sitting on a WAIT statement, the first instance gets cancelled and your PB will be left in the wrong state. I think this is the root of your issue.

Everything depends on a solid answer (as of now I don't have one) to where the Event Processing will actually happen. If it happens on the pure Application hub, then it makes no sense. But if all the Event Processing happens on the respective Communication hub, it should make a difference, because there is already a dedicated RF Communication Processor (Z-Wave and/or Zigbee) and the main CPU would become a dedicated Event Processor. In other words, every task would have a truly dedicated CPU.
I really don't want to make this story too long.
At this time I am looking for a real multi-processing HW implementation, i.e. a truly dedicated CPU, with no shared memory, for each relatively complex task, such as
(maybe I am missing something):

  • RF/LAN Communications (individual CPU for each communication protocol);
  • Event Processing unit;
  • Application unit;

No, the intention of using a PB is to prevent multiple re-triggering.
Without a PB, failing rules become somewhat self-healing, because re-triggering "fixes" a stalled rule.
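
The difference between the guarded and the unguarded case can be modeled in a few lines. This is only a sketch in Python, not Rule Machine code: the `Rule` class, its method names, and the `stalls` flag are all hypothetical, and the stall itself is simply assumed rather than explained.

```python
class Rule:
    """Toy model of a rule, optionally guarded by a Private Boolean (PB)."""

    def __init__(self, use_pb):
        self.use_pb = use_pb
        self.pb = True  # Required Expression is "PB = TRUE"

    def trigger(self, stalls=False):
        # Required Expression check: the trigger is ignored while PB is FALSE.
        if self.use_pb and not self.pb:
            return "ignored"
        if self.use_pb:
            self.pb = False  # first action of the rule
        if stalls:
            # Models the reported failure: the wait never releases,
            # so the final "set PB to TRUE" action is never reached.
            return "stalled"
        if self.use_pb:
            self.pb = True  # last action of the rule
        return "completed"


guarded = Rule(use_pb=True)
unguarded = Rule(use_pb=False)

# The first run stalls in both cases; the next two triggers are normal.
guarded_results = [guarded.trigger(stalls=True), guarded.trigger(), guarded.trigger()]
unguarded_results = [unguarded.trigger(stalls=True), unguarded.trigger(), unguarded.trigger()]

print(guarded_results)
print(unguarded_results)
```

In this model the guarded rule stays locked out after a single stall, while the unguarded rule recovers on the next trigger, which matches the "self-healing" behavior described above.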

I'm not sure you're going to get what you want by adding another hub.

I had a similar, if not the same issue with rules stopping after a wait for event even though the event then happened. Unlike yours, my rules did not use PB to disable the rule. In fact, they were intended to be retriggered, with only the last instance proceeding to the end of the rule. They did have required expressions that could prevent triggering, but no PB to get stuck in the wrong state. I also (seemingly) randomly had these rules fail to finish. Since I wasn't using a PB, they could be triggered again later and usually would run correctly the next time.

I recently went from a single C-7 to a C-8 Pro with all Zigbee and Z-wave devices on it and a C-7 running all of my cloud-based services and integrations. One of the rules in question was moved to the factory-reset, non-radio hub. Yet I had the same failure at least once after that.

I have given up and removed the wait for expression/wait for event wherever I could from my rules, as it doesn't seem to work reliably for me and so far no one has been able to diagnose why, despite my posting the logs and rule subscriptions from both successful and failed runs.

In short, without knowing the cause of the issue you're having, there is no guarantee that adding a separate hub will resolve the issue. You're assuming it's related to hub loading or multi-tasking, but that is just an assumption at this point. If you decide to move the rules and it starts to work 100%, please make sure to post back here as that would be a very interesting result.

My apologies, I did not read correctly. You ARE using the PB as a required expression and yes, you are correct, that should stop the second trigger from occurring.

I love multiple hubs :smiley:

But I'm a big user of "co-processing" -- by using Node-Red for the most sensible things. I submit that you could fire up an instance of Node-Red on some processor you already have available, in an afternoon. If it helps with that particular problem, great, if not, it hints at whether another hub will help.

I offer:
https://www.hubitatcommunity.com/nodered/Hubitat_Node-Red/index.php
if you need a quick tour of Node-Red as it relates to Hubitat.

All Hub Mesh ultimately does (besides some nice things like showing preference values and logs on the other hub too) is send events from the source to the destination hub and commands in the other direction. I'm not sure this approach is likely to solve your problem if it's events happening in quick succession, unless Hub Mesh happens to de-duplicate (if it's the same device; no relevance if it's not), but that would be by chance and not any designed behavior.

I read all your posts related to this issue very carefully. The problem does exist, and because it is very infrequent and random across multiple rules, it is very hard to debug.
My current idea is to create a pipelined processing chain with a dedicated processor for each major task: Communication_with_Devices --> Event_Processing --> Applications
Z-Wave/Zigbee RF communications already have dedicated processors.
LAN/Matter communication could be handled by a dedicated hub (in my case a C7 does this).
A dedicated hub could be used as an Application Processor.
If I have, say, two hub-meshed hubs, where the first one does Z-Wave and the second one runs all the rules (apps), where exactly does the Event Processing happen?
If the Z-Wave hub does it, then this could be beneficial, but if the app hub does the Event Processing, there will not be any improvement.

My observations tell me the problem is somewhere in the "Wait for Expression" implementation.
By nature, "Wait for Expression" is Combinational Logic, but because the hub is event driven, this Combinational Logic relies heavily on Events. If an Event somehow goes missing, the Expression will not be re-evaluated (something like this).
I am sure there is a System Timer Interrupt (usually around 10-20 ms). I would use this system interrupt to create a somewhat slower Time Event (say, 100 ms) and use it to evaluate all Combinational Expressions. This way the Expression would always be evaluated correctly regardless of Events. But this may create an excessive CPU load (I am not a SW engineer). That is why a dedicated Event Processor could be very beneficial. Just thinking out loud.
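
To make the idea concrete, here is a small Python simulation. It is only a sketch under stated assumptions: the device names, the `release_time` function, and the notion of a "dropped" notification are all hypothetical, and nothing here reflects how the hub is actually implemented. It compares a wait that is re-evaluated only when an event notification arrives with one that is also re-evaluated on a periodic time tick.

```python
def expression(state):
    # Hypothetical expression: motion inactive AND light off.
    return state["motion"] == "inactive" and state["light"] == "off"


def release_time(changes, dropped, tick_ms=None, horizon_ms=1000):
    """Return the time (ms) at which the wait is released, or None.

    changes: list of (time_ms, device, value) state changes.
    dropped: indexes of changes whose event notification is lost
             (the device state itself still updates).
    tick_ms: if set, the expression is also re-evaluated every tick_ms.
    """
    state = {"motion": "active", "light": "on"}
    pending = sorted(enumerate(changes), key=lambda item: item[1][0])
    for now in range(0, horizon_ms + 1, 10):  # 10 ms simulation step
        notified = False
        while pending and pending[0][1][0] <= now:
            index, (_, device, value) = pending.pop(0)
            state[device] = value          # device state always updates
            if index not in dropped:
                notified = True            # the event reached the waiting rule
        timer_tick = tick_ms is not None and now % tick_ms == 0
        if (notified or timer_tick) and expression(state):
            return now
    return None


changes = [(120, "motion", "inactive"), (130, "light", "off")]

normal = release_time(changes, dropped=set())
lost_event = release_time(changes, dropped={1})
lost_event_with_tick = release_time(changes, dropped={1}, tick_ms=100)

print(normal, lost_event, lost_event_with_tick)
```

With no lost notification the wait releases as soon as the second change arrives. If that last notification is lost, the purely event-driven wait never releases in this model, while the tick-driven variant releases at the next 100 ms boundary, at the cost of evaluating the expression on every tick.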

I need to think about this, but frankly I have a hard time understanding how an Event is sent across Hub Mesh. An Event is a Status Change of a Device. For a Switch, say, the Event occurs when the Switch Status changes from OFF to ON and vice versa. I can clearly understand how Hub Mesh will/can update the Status of the Switch, but how the related Event is passed from one hub to another is not clear or intuitive.

Unfortunately, this problem is not only mine. A few more users have reported a similar problem quite a few times.

Thank you for the advice. I am very happy with RM (when it works), but I am really tired of fighting these random RM failures. So maybe it is time to switch to a different automation engine.

Have you given any more consideration to simply writing a custom app? You can do everything you could do with a Rule here, except you have complete control over exactly how it executes.

If you haven't looked at the developer docs in the last year or so, there are lots of new additions that should help you get started, or someone here may offer you assistance if you state a specific goal (probably a better approach than sharing a rule since the best approach, IMHO, is to think about the outcome you want, your requirements, etc. rather than specify how it might be done in one particular app).

Even if you don't know Groovy, it's a Java-esque scripting language that is designed to be fairly easy to learn. Given your background, I really think you can do this. There are some Hubitat-specific tips in the docs, lots of Groovy resources online, and many example apps from both Hubitat officially as well as the community to help as well.

Oh yes, I am/was thinking about writing a custom app. (This is exactly what I do when I need custom hardware, and thanks to FPGAs I don't need custom PCBs.)
But when I think about debugging my creations with only logs available as a debugging instrument (correct me if I am wrong), it immediately becomes a show stopper.
And the more I think about the root cause of the observed problem, the more I am nearly 100% convinced that the problem is outside RM. I might be wrong, but I suspect the Event Processing engine. In that case (if this is the case) a custom app may not help.

Yeah, I was corrected already. I missed that he was using the Required Expression.

This is mostly true, though you also have the App Status page that can give you a view into the current state of your app, plus tools like App Stats. But with logging being entirely under your control, it can be a quite flexible tool during development. This is how all the built-in apps, including Rule Machine, were made, too (well, aside from a couple that are just wrappers around some built-in features like HomeKit). And it's certainly more flexible than a Rule where you can only turn the logging it does offer on or off. :slight_smile:

OK, maybe the debugging is not as bad as I think.
Now I have another dilemma. Of course, for the learning curve I have to start with something very simple. But then what? I cannot replace all my RM rules (far too many) with custom apps. Replace the rules that fail most often? But basically every rule with a "Wait for Expression" statement is prone to failure. Since this failure is very random, statistically the rules that are used more often will fail more often. Creating a very flexible custom app would be nothing more than another RM. At this point I am lost on a long-term strategy.

The more I think about this failure, the more convinced I am that the problem is with the "Wait for Expression" implementation. The "Wait for Expression" statement is Combinational Logic, but it relies heavily on Events (because the hub is event driven). I am nearly 100% sure that the frequency of the failures depends on the logic complexity of the "Wait for Expression" statements. I cannot recall any rule with a very simple Expression (waiting for only one Event/State change) ever failing.
More logic in the Expression requires re-evaluation on every Event, whenever a Device changes its State. And what if 2 (or more) Events happen at nearly the same time? Depending on the implementation details (which are a huge unknown), there could be a Race Condition that causes at least one of the near-simultaneous Events to be missed, resulting in rule failure.
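
One way such a race could look is sketched below in Python. This is purely a hypothetical illustration: the real implementation is unknown, the device names and function names are made up, and the "stale snapshot" interleaving is an assumption chosen only because it produces the symptom (the Expression is TRUE in the end, yet no evaluation ever saw it as TRUE).

```python
def expression(state):
    # Hypothetical two-device expression: fan off AND light off.
    return state["fan"] == "off" and state["light"] == "off"


def handle_serially(events):
    """Each event is applied and evaluated before the next one starts."""
    state = {"fan": "on", "light": "on"}
    results = []
    for device, value in events:
        state[device] = value
        results.append(expression(state))
    return results


def handle_with_stale_snapshots(events):
    """Both handlers copy the state before either has written its change."""
    shared = {"fan": "on", "light": "on"}
    snapshots = [dict(shared) for _ in events]  # taken "at the same time"
    results = []
    for snapshot, (device, value) in zip(snapshots, events):
        snapshot[device] = value   # the handler sees only its own change
        shared[device] = value     # the real state does get updated
        results.append(expression(snapshot))
    return results, expression(shared)


events = [("fan", "off"), ("light", "off")]

serial_results = handle_serially(events)
racy_results, final_truth = handle_with_stale_snapshots(events)

print(serial_results)
print(racy_results, final_truth)
```

In the serial case the second evaluation sees both changes and releases the wait. In the racy case neither evaluation sees both changes, even though the final state satisfies the expression, so a purely event-driven wait would stay stuck until some later event forces another evaluation.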
Anyway, at this time I am not quite ready to dive into creating custom apps.
I like RM (when it works), and I have a few ideas for working around the observed failures.
I will try those first.

UPDATE

Here is a modified failing rule:

The original "Wait for Expression" with Duration statement was replaced with a Repeat Loop.
The idea is to force re-evaluation of the Expression.
The rule has been tested and works as expected. But of course some time is needed to see whether this modification actually fixes the problem.
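
For readers who think in code, the general shape of this workaround is a polling loop with a timeout. The Python sketch below is an analogy only, not the actual rule: `wait_by_polling`, `FakeClock`, and the chosen intervals are hypothetical names and values, and the fake clock is there just so the example runs instantly.

```python
import time


def wait_by_polling(expression, interval_s, timeout_s,
                    clock=time.monotonic, sleep=time.sleep):
    """Re-evaluate `expression` every `interval_s` seconds.

    Returns True as soon as the expression is true, or False once
    `timeout_s` seconds have passed without it becoming true.
    """
    deadline = clock() + timeout_s
    while True:
        if expression():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Simulated clock so the demo does not actually sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Case 1: the expression becomes true after 3 simulated seconds.
clock = FakeClock()
released = wait_by_polling(lambda: clock.now >= 3.0, 1.0, 10.0,
                           clock=clock.time, sleep=clock.sleep)
released_at = clock.now

# Case 2: the expression never becomes true, so the loop times out.
clock2 = FakeClock()
stuck_result = wait_by_polling(lambda: False, 1.0, 5.0,
                               clock=clock2.time, sleep=clock2.sleep)

print(released, released_at, stuck_result, clock2.now)
```

The design trade-off is the one already noted in this thread: the loop no longer depends on any single event being delivered, but it evaluates the expression on every pass whether or not anything changed.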

I'm fairly confident that multiple hubs aren't going to help your issue. Rather, they're going to make debugging even more complex, given the distributed nature and the added network layer (with variable delays).

Having separate hubs for different radios and LAN rules makes some sense, depending on the size/scale of your various radio meshes, as it allows more activities to happen concurrently, given more cores, etc.

But you seem focused on event processing and on trying to isolate it in a single location, which isn't going to work with multiple hubs. My understanding is that when a device/variable changes state or value, an event is triggered, and if that variable/device is part of Hub Mesh, the event/change is sent over the network to the hubs that have registered an interest in that source device/variable. Once the network packet with the change comes in, the event is re-triggered on the receiving hub, and the interested/subscribed applications are then triggered, etc. So the event happens at the source (the location of the "real device"), is sent over the network, and is then re-triggered on the receiving hub.
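
That flow can be sketched as a toy model in Python. To be clear, this only mirrors the understanding described above; it is not the Hub Mesh protocol, and the `Hub` class, its methods, and the device name are all hypothetical.

```python
class Hub:
    """Toy hub: raises events locally and forwards them to linked hubs."""

    def __init__(self, name):
        self.name = name
        self.subscribers = {}  # device -> callbacks of local apps/rules
        self.mesh_links = {}   # device -> remote hubs sharing this device
        self.log = []          # (device, value, origin) per raised event

    def subscribe(self, device, callback):
        self.subscribers.setdefault(device, []).append(callback)

    def share(self, device, remote_hub):
        self.mesh_links.setdefault(device, []).append(remote_hub)

    def raise_event(self, device, value, forwarded=False):
        self.log.append((device, value, "mesh" if forwarded else "local"))
        # Local subscribers are processed on THIS hub.
        for callback in self.subscribers.get(device, []):
            callback(self.name, device, value)
        # Only the source hub forwards; a forwarded event is not re-sent.
        if not forwarded:
            for remote in self.mesh_links.get(device, []):
                remote.raise_event(device, value, forwarded=True)


radio_hub = Hub("radio")
app_hub = Hub("apps")
radio_hub.share("Bath Motion", app_hub)

handled = []
app_hub.subscribe("Bath Motion",
                  lambda hub, device, value: handled.append((hub, device, value)))

radio_hub.raise_event("Bath Motion", "active")

print(radio_hub.log)
print(app_hub.log)
print(handled)
```

In this model the event is raised once on the radio hub and once more on the apps hub, and the rule's subscription is serviced on the apps hub, which is the point being made about event processing not staying in one place.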

I've had problems similar to what you're describing, and I also use PBs and required expressions in RM to avoid multiple/concurrent executions. The thing I've learned is to avoid wait for expression/delays in the internals of rules. I just use various global Hub Variables (HVs) to record state, and keep rules very short and flow-through. If I need to wait for something to "finish" an operation/task, in my world that's a separate rule, with the "wait" as part of the TRIGGER, based on one of the hub variables. I keep the related rule "snippets" together just through naming conventions. That keeps the rules from "blocking", and they quickly clear PB/execute/set PB. So a given automation task is typically made up of 3-4 small rule segments that coordinate via HVs. This also allows "external things" (aka Alexa/Google Home) to execute subsets of these rule segments based on verbal commands (e.g. ending a task that's midway in progress, etc.).
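
As a rough illustration of that pattern, here is a Python sketch in which three short "rules" coordinate through one hub variable and none of them ever waits internally. Everything in it is hypothetical: the `task_state` variable, the fan/humidity/voice events, and the `run` helper are made up for the example and are not taken from any real rule.

```python
def run(events):
    """Feed events to three short, non-blocking rules; return what happened."""
    hub_vars = {"task_state": "idle"}  # stands in for a global Hub Variable
    actions = []

    def rule_start(event):
        # Trigger: button pushed. Required Expression: task_state = idle.
        if event == ("button", "pushed") and hub_vars["task_state"] == "idle":
            actions.append("fan on")
            hub_vars["task_state"] = "running"

    def rule_finish(event):
        # The "wait" lives in the trigger: humidity dropping ends the task.
        if event == ("humidity", "low") and hub_vars["task_state"] == "running":
            actions.append("fan off")
            hub_vars["task_state"] = "idle"

    def rule_cancel(event):
        # An external command (e.g. a voice assistant) can end the task early.
        if event == ("voice", "stop") and hub_vars["task_state"] == "running":
            actions.append("fan off (cancelled)")
            hub_vars["task_state"] = "idle"

    for event in events:
        for rule in (rule_start, rule_finish, rule_cancel):
            rule(event)
    return actions, hub_vars["task_state"]


normal_run = run([("button", "pushed"), ("button", "pushed"),
                  ("humidity", "low"), ("humidity", "low")])
cancelled_run = run([("button", "pushed"), ("voice", "stop"),
                     ("humidity", "low")])

print(normal_run)
print(cancelled_run)
```

Each rule runs to completion immediately, so there is no suspended instance that can be left hanging; the state that a single long rule would hold across a wait is kept in the hub variable instead.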

Obviously, YMMV, and you can just code apps in Groovy as well. But if I have a rule with a delay/wait inside of it (I do have a few, and I don't use PBs in those cases), I reconsider that approach and always try to have the "waits" as part of a smaller rule's trigger. Moving to multiple hubs for this specific reason around RM is going to make matters even more difficult to debug, IMHO (not saying there aren't valid cases for multiple hubs; dealing with LAN stuff is a big one).

Good Luck.

That is my second idea for working around the observed issue.
Having multiple rules perform a single task is a bit messy, which is why this one is second in line. My first is to try a Repeat Loop (I already posted the modified rule above) to force evaluation of the Expression. The modified rule works as expected, but who knows how stable it will be over time.

One other minor point about your Bath Occupancy example above: I'm sure you understand that it would be two rules in my approach, one to mark it occupied and a separate rule to mark it unoccupied.

My only other point is that I've found HV booleans MUCH faster to process (a few ms) than messing with a Virtual Switch (your "BathRoom Occupied"). If you need the VS for other reasons, that's fine. I would set an HV boolean for occupied right after clearing the PB; then, in my second rule (with motion stopping as the trigger), I would have a required expression of Bathroom Occupied HV = TRUE to start waiting for your assortment of lights and fans to clear.

I'm not saying one way or the other SHOULD matter, but I do spend time trying to shave ms off rule executions. Some things, like Z-Wave/Zigbee IO delays, can't be helped, but I've had better luck with HV globals for state than with virtual switches (in some cases I've used both, if I want the VSwitch in a dashboard or visible to Alexa, but I always use the HV for the required expressions).

That's just what I've found to work much more reliably for me: short and sweet rules, with delays and waits as parts of triggers.

Yes, this is clear. But as I mentioned, having multiple rules for a single task is messy and much harder to maintain over time, because many little details evaporate very quickly and you have to open and look at all the related rules at the same time. For this specific reason I am trying to keep all the logic in a single rule. Otherwise, multiple trigger-based rules seem to be a far more solid solution. Let me see how my "fix" (using a Repeat Loop) works over time.
So far so good, but of course it is too soon to claim victory.
But I am sure that TRIGGER-based waiting for anything is far more stable.
And yes, my preference is a Boolean Variable over a Virtual Switch.