Configurable Backup/Log Clean Time and/or Options

@chuck.schwer I feel bad when I tag you guys, but this one might mitigate a hub crash, so I figured why not. Every night at around 2:00 AM I notice a huge slowdown that lasts around 10 minutes. Up to this point I have assumed that it was related to the backup. Perhaps it is not, though, because my backup files are typically time-stamped an hour later (maybe the timestamp is incorrect). Maybe it's when log files are cleaned or zipped or something. I don't know.

ANYWAY... Today I noticed something fantastically fun. During the slowdown things got so slow that my HE device with the WebSockets client (which implements a socket.io client) failed to check in on time. The remote server disconnected the WS client. The HE device's webSocketStatus method got word that it had disconnected, and I had code on the disconnect event to immediately connect again. On connect, the remote server sends me a bunch of information about the session. On top of that, I send a request for device info to refresh all of the devices (of the integration this WS supports). It ends up being a significant amount of information, and it is used to refresh all of the virtual devices from the remote server.

You might be able to see where this is going already. The reconnect and refresh added to the slowdown that was already happening at 2:00 AM. After reconnecting and getting a bunch of data it missed another check-in. It reconnected again. Missed another check-in or two. Eventually the HE hub went down in a blaze of glory.
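One driver-side way to break that loop is to stop reconnecting immediately and instead wait a little longer after each failed attempt. Below is a minimal sketch of exponential backoff with jitter, written in Python purely for illustration (a real HE driver would be Groovy). The class name `ReconnectBackoff` and the `base` and `cap` values are my own assumptions, not anything from the hub's API.

```python
import random


class ReconnectBackoff:
    """Hypothetical helper: computes how long to wait before each reconnect."""

    def __init__(self, base=2.0, cap=300.0, rng=random.random):
        self.base = base      # assumed ceiling for the first delay, in seconds
        self.cap = cap        # assumed upper bound on any delay, in seconds
        self.rng = rng        # injectable so the jitter can be tested
        self.attempt = 0      # consecutive failed attempts so far

    def next_delay(self):
        # Double the ceiling on every attempt, but never exceed the cap.
        # The exponent is clamped so the power cannot grow without bound.
        ceiling = min(self.cap, self.base * (2 ** min(self.attempt, 32)))
        self.attempt += 1
        # "Equal jitter": half the ceiling is fixed, the other half is random,
        # so there is always some pause and retries do not line up exactly.
        return ceiling / 2 + self.rng() * ceiling / 2

    def reset(self):
        # Call once a connection has stayed up long enough to trust it.
        self.attempt = 0
```

In a driver, the disconnect branch of webSocketStatus could ask for `next_delay()` and schedule the reconnect that many seconds out (for example with something like `runIn`) instead of connecting straight away, and call `reset()` only after a few successful check-ins. It would not remove the 2:00 AM slowdown, but it should keep the reconnect and refresh from piling onto it.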

Now for the feature requests:

  • Can we have more control over whatever is happening in the slowdown that starts around 2:00 AM? What is happening during that slowdown window anyway? Is it the backup? I'm sort of a night owl. I would prefer the backup to happen at 3:00 AM or 4:00 AM. I'm often still awake around 2:00 AM. I would definitely choose to have it later, and I might even choose to have it happen every other day or even every week once I stop tinkering.
  • Can we get a function to call from driver code to see whether or not the backup (or other heavy system use) is occurring? We could check it before potentially very resource-intensive actions and skip or schedule them. Or maybe there is a better approach to this. I dunno. Sort of just thinking out loud. I can't think of a great way for me to have mitigated that in the driver though.
  • Maybe we could get file system access to where the logs and backups are stored via SMB or SSH or something? I wouldn't mind being able to pull a backup automatically to a robust backup solution (my DAS or NAS). I would also be able to rest easier knowing that a crash in the middle of a backup or zip didn't leave some partial file lying around that would exist indefinitely. Maybe you already clean those up.
  • This last one you can probably skip because I think it is just me being picky. However, I'll mention it anyway to give others the opportunity to chime in. I've noticed that the time to log a message does not seem to scale linearly with the size of the message. Small log messages are very quick, and up to a certain size they all seem to take about the same amount of time, i.e. near instant. However, if you are logging a page or two of JSON or XML, it can sometimes freeze the log up and impact performance for 5 to 10 seconds. Sometimes you need to print a JSON response out, and sometimes JSON responses are a few pages. I obviously wouldn't leave these in long term. They would be limited to development, but at the same time I feel like there might be some room for improvement in the way that HE handles larger log events. It doesn't seem like they should take so long.
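Until something like the "is the hub busy?" check from the second bullet exists, the closest thing I can picture on the driver side is a plain time-of-day guard. This is a rough sketch in Python for illustration only (a real driver would be Groovy). The function names are made up, and the 2:00 to 3:00 window is just my guess from what I see on my own hub, not a documented maintenance schedule.

```python
from datetime import datetime, time, timedelta

# Assumed quiet window, based only on the slowdown observed around 2:00 AM.
WINDOW_START = time(2, 0)
WINDOW_END = time(3, 0)


def in_window(now, start=WINDOW_START, end=WINDOW_END):
    """True if now's time of day falls in [start, end), even across midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def seconds_until_window_ends(now, start=WINDOW_START, end=WINDOW_END):
    """0 if heavy work can run now, else how long to postpone it."""
    if not in_window(now, start, end):
        return 0
    finish = datetime.combine(now.date(), end)
    if finish <= now:
        # The window crosses midnight and ends tomorrow.
        finish += timedelta(days=1)
    return int((finish - now).total_seconds())
```

A driver could call `seconds_until_window_ends(datetime.now())` before a full device refresh and, if the result is non-zero, schedule the refresh for that many seconds later rather than running it immediately. It is only a guess at when the hub is busy, which is exactly why a real function from the platform would be better.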

I think we should get some insight into what is actually happening during maintenance and have the ability to control the time. That seems like a reasonable request.

Looking at my performance monitor, I see a good 45-60 minutes when response times are very high. Almost an hour of maintenance seems excessive to me, since it runs every night. I also do not know what is actually happening during this period, so maybe that time frame is normal.