When using local Protocol.TELNET
connections in a Driver, the connection that's established doesn't appear to have the TCP Keep-Alive (KA) setting enabled.
If the Driver is primarily reading data from the resulting Socket, it will block indefinitely waiting for data, even though the remote end may never respond.
If the "other end" is abnormally terminated (e.g. an AWS kill, or manual Instance termination) then the dead connection will only be detected once the Driver/Client code attempts to write()
to the underlying connection.
I suspect this problem would also affect any long-running WebSocket connections.
NB: Even once enabled, Linux TCP Keep-Alive defaults to 2+ hrs before it terminates a wayward connection. These defaults can (and should) be lowered so that dead connections are detected and handled in a practical timeframe.
Here are the typical Linux defaults (from sysctl):
net.ipv4.tcp_keepalive_time=7200
net.ipv4.tcp_keepalive_intvl=75
net.ipv4.tcp_keepalive_probes=9
ie. don't start Keep-Alive processing until the connection has been idle for 2 hrs (7200s), then send 9 KA probes 75s apart (675s, about 11.25m) before the kernel marks the connection dead.
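As a concrete illustration, lowered settings might look like the following sysctl fragment. The specific numbers here are only an example (they give roughly 10 minutes from last traffic to dead-connection detection: 300s idle + 5 probes x 60s = 600s), not a recommendation from the platform:

```
# /etc/sysctl.conf -- illustrative values only
net.ipv4.tcp_keepalive_time=300     # start probing after 5 min of idle
net.ipv4.tcp_keepalive_intvl=60     # send probes 60s apart
net.ipv4.tcp_keepalive_probes=5     # declare the connection dead after 5 unanswered probes
```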
Version/Config information
- Hubitat Elevation, C-5 2.0.8.113
Request
Enable SO_KEEPALIVE
on all sockets created for Drivers, and change the OS-level defaults so that wayward/dead connections are detected more quickly (eg. 10 minutes, tops).
Once Keep-Alive is enabled on each socket, the probe timings can either be changed system-wide in Linux (impacting everything) or set on each connection individually.
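Since Hubitat Drivers run on the JVM, a plain-Java sketch of what "per socket" means is shown below. This is not the Hubitat platform code (which isn't public), just the standard `Socket.setKeepAlive(true)` call the platform would need to make; the loopback `ServerSocket` merely stands in for a Telnet target:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    /** Enable SO_KEEPALIVE on one connection; returns the resulting flag. */
    static boolean enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);
        // On JDK 11+ the probe timings can also be tuned per socket via
        // jdk.net.ExtendedSocketOptions (TCP_KEEPIDLE, TCP_KEEPINTERVAL,
        // TCP_KEEPCOUNT), overriding the system-wide sysctl defaults.
        return socket.getKeepAlive();
    }

    public static void main(String[] args) throws IOException {
        // Loopback listener stands in for the remote Telnet endpoint.
        try (ServerSocket server = new ServerSocket(0);
             Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
            System.out.println("keepAlive=" + enableKeepAlive(socket));
        }
    }
}
```

The per-socket route (the `ExtendedSocketOptions` comment above) is the less invasive option, since it avoids changing kernel behaviour for unrelated traffic.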
Work-around
Have each Driver author always use an Application-level ping, over the Protocol.TELNET
connection, in order to force the dead-connection processing to kick in.
Most should likely do this anyway, but TCP Keep-Alive would be a handy fallback for a range of bad connection-drop situations.
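The shape of that work-around, again as a generic JVM sketch rather than Hubitat's actual driver API: put a timeout on blocking reads, and when the timeout fires, send an application-level ping instead of waiting forever. The class and method names here are hypothetical:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class PingWorkaround {
    /** Returns true if nothing arrived before the deadline. */
    static boolean readTimedOut(Socket socket, int millis) throws IOException {
        socket.setSoTimeout(millis);      // bound every blocking read
        try {
            socket.getInputStream().read();
            return false;                 // data (or EOF) arrived in time
        } catch (SocketTimeoutException e) {
            return true;                  // silence -- time to send an app-level ping
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
            // The listener never writes, so the read gives up after 200ms
            // instead of blocking forever; a real Driver would now send its
            // ping and tear the connection down if that also goes unanswered.
            System.out.println("timedOut=" + readTimedOut(socket, 200));
        }
    }
}
```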
References
@chuck.schwer I think you wanted platform/OS-level issues brought to your attention.