When using local
Protocol.TELNET connections in a Driver, the connection that's established doesn't appear to have the TCP Keep-Alive (KA) setting enabled.
If the Driver is primarily reading from the resulting Socket, it will block indefinitely waiting for data, even though the remote end may never respond.
If the "other end" is abnormally terminated (e.g. an AWS kill, or manual instance termination), the dead connection is only detected once the Driver/client code attempts to
write() to the underlying connection.
I suspect this problem would impact any long-running WebSocket connections also.
NB: Even once enabled, Linux TCP Keep-Alive defaults to 2+ hours before it terminates a wayward connection. Those defaults can be lowered (and should be) so dead connections are detected in a practical timeframe.
Here are the typical Linux defaults:
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
i.e. don't start keep-alive processing until the connection has been idle for 2 hrs (7200s), then send 9 KA probes 75s apart (about 11.25 minutes, 675s) before the kernel marks the connection dead.
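For reference, lowering these system-wide on a Linux box is a sysctl change. The values below are purely illustrative (they match the "10 minutes, tops" idea), not a tested recommendation for the Hubitat platform:

```shell
# Illustrative tuning: start probing after 10 min idle, then 5 probes
# 30 s apart (~2.5 min) before the kernel declares the peer dead.
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=5
# To persist across reboots, put the same keys in /etc/sysctl.conf
```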
- Hubitat Elevation, C-5 184.108.40.206
Proposal: enable SO_KEEPALIVE on all sockets created for Drivers, and change the OS-level defaults to more quickly detect wayward/dead connections (e.g. 10 minutes, tops).
Once enabled on each socket, the system-wide settings can be changed in Linux (impacting everything), or they can be set on each connection discretely.
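Since Hubitat runs on the JVM, enabling keep-alive per socket is a one-line call, and JDK 11+ exposes the per-connection tuning knobs. This is only a sketch of the idea (I obviously don't know what the platform's socket-creation code looks like):

```java
import java.io.IOException;
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

public class KeepAliveDemo {
    // Enable TCP keep-alive on a freshly created socket and, where the
    // JDK/OS supports it (e.g. Linux on JDK 11+), tighten the per-connection
    // timers so a dead peer is detected in ~12.5 minutes instead of ~2 hours.
    static void enableKeepAlive(Socket s) throws IOException {
        s.setKeepAlive(true);  // SO_KEEPALIVE
        try {
            s.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 600);    // idle secs before first probe
            s.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 30); // secs between probes
            s.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 5);     // probes before declaring dead
        } catch (UnsupportedOperationException e) {
            // Platform without per-socket knobs: the system-wide sysctl defaults apply.
        }
    }
}
```

The per-socket route is the discrete option mentioned above: it avoids touching the system-wide sysctls, so nothing else on the hub is affected.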
Alternatively: have each Driver author implement an application-level ping over the
Protocol.TELNET connection, in order to force the dead-connection processing to kick in.
Most should likely do this anyway, but TCP Keep-Alive would be a handy fallback for a bunch of bad connection-drop situations.
@chuck.schwer I think you wanted platform/OS-level issues brought to your attention.