When using local Protocol.TELNET
connections in a Driver, the connection that's established doesn't appear to have the TCP Keep-Alive (KA) setting enabled.
If the Driver is primarily reading data from the resulting Socket, it will block indefinitely waiting for data, even though the remote end may never respond.
If the "other end" is abnormally terminated (e.g. an AWS kill, or manual Instance termination) then the dead connection will only be detected once the Driver/Client code attempts to write()
to the underlying connection.
I suspect this problem would also affect any long-running WebSocket connections.
NB: Even once enabled, Linux TCP Keep-Alive defaults to 2+ hrs before it terminates a wayward connection. These defaults can (and should) be lowered so that dead connections are detected and handled in a practical timeframe.
Here are the typical Linux defaults (from sysctl):
net.ipv4.tcp_keepalive_time=7200
net.ipv4.tcp_keepalive_intvl=75
net.ipv4.tcp_keepalive_probes=9
ie. don't start Keep-Alive processing until the connection has been idle for 2 hrs (7200s), then send 9 KA probes 75s apart (675s, about 11.25m) before the kernel marks the connection dead.
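As a concrete illustration, lowered settings might look like the following sysctl fragment. The specific numbers here are only an example (they give roughly 10 minutes from last traffic to dead-connection detection: 300s idle + 5 probes x 60s = 600s), not a recommendation from the platform:

```
# /etc/sysctl.conf -- illustrative values only
net.ipv4.tcp_keepalive_time=300     # start probing after 5 min of idle
net.ipv4.tcp_keepalive_intvl=60     # send probes 60s apart
net.ipv4.tcp_keepalive_probes=5     # declare the connection dead after 5 unanswered probes
```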
Version/Config information
- Hubitat Elevation, C-5 2.0.8.113
Request
Enable SO_KEEPALIVE
on all sockets created for Drivers, and change the OS-level defaults so that wayward/dead connections are detected more quickly (eg. 10 minutes, tops).
Once Keep-Alive is enabled on each socket, the probe timings can either be changed system-wide in Linux (impacting everything) or set on each connection individually.
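Since Hubitat Drivers run on the JVM, a plain-Java sketch of what "per socket" means is shown below. This is not the Hubitat platform code (which isn't public), just the standard `Socket.setKeepAlive(true)` call the platform would need to make; the loopback `ServerSocket` merely stands in for a Telnet target:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    /** Enable SO_KEEPALIVE on one connection; returns the resulting flag. */
    static boolean enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);
        // On JDK 11+ the probe timings can also be tuned per socket via
        // jdk.net.ExtendedSocketOptions (TCP_KEEPIDLE, TCP_KEEPINTERVAL,
        // TCP_KEEPCOUNT), overriding the system-wide sysctl defaults.
        return socket.getKeepAlive();
    }

    public static void main(String[] args) throws IOException {
        // Loopback listener stands in for the remote Telnet endpoint.
        try (ServerSocket server = new ServerSocket(0);
             Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
            System.out.println("keepAlive=" + enableKeepAlive(socket));
        }
    }
}
```

The per-socket route (the `ExtendedSocketOptions` comment above) is the less invasive option, since it avoids changing kernel behaviour for unrelated traffic.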
Work-around
Have each Driver author always use an Application-level ping, over the Protocol.TELNET
connection, in order to force the dead-connection processing to kick in.
Most should likely do this anyway, but TCP Keep-Alive would be a handy fallback for a range of bad connection-drop situations.
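The shape of that work-around, again as a generic JVM sketch rather than Hubitat's actual driver API: put a timeout on blocking reads, and when the timeout fires, send an application-level ping instead of waiting forever. The class and method names here are hypothetical:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class PingWorkaround {
    /** Returns true if nothing arrived before the deadline. */
    static boolean readTimedOut(Socket socket, int millis) throws IOException {
        socket.setSoTimeout(millis);      // bound every blocking read
        try {
            socket.getInputStream().read();
            return false;                 // data (or EOF) arrived in time
        } catch (SocketTimeoutException e) {
            return true;                  // silence -- time to send an app-level ping
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
            // The listener never writes, so the read gives up after 200ms
            // instead of blocking forever; a real Driver would now send its
            // ping and tear the connection down if that also goes unanswered.
            System.out.println("timedOut=" + readTimedOut(socket, 200));
        }
    }
}
```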
References
@chuck.schwer I think you wanted platform/OS-level issues brought to your attention.