I've made a few posts about my two hubs reporting Cloud Connection Unavailable. First let me say, THIS IS NOT A HUBITAT ISSUE. I have been fighting this problem since November 2020. Before that, no disconnects. I am reaching out to anyone who has experienced connection issues like this:
Boot Hub. At startup, the hubs connect to the cloud with no problem. Anywhere from a few minutes to hours later, both hubs lose their WAN connection. Along with that, I have a Raspberry Pi running Node-RED that also loses connection at the same time as the hubs. Can't ping anything on the WAN. If I wait it out, connections re-establish on their own; sometimes it takes a few minutes, other times hours. Also on my network are 3 WIN10 PCs that never lose connection. I have tried 2 different routers, a Synology and an ASUS Merlin (which connects to my ISP's RG, a Pace 5268ac). Because my WIN10 PCs experience no connection problem and my 2 hubs and Raspberry Pi do, I don't think there is any hardware or software configuration problem. I think that because before November 2020 there were no disconnects for well over a year.
Now the workaround solution:
If I set up an automation endpoint on a hub that I call from outside the network (WAN), the disconnect issue stops. I only need to call an endpoint on one hub to prevent all 3 devices from losing WAN connection.
My ISP is AT&T running a Pace 5268ac in DMZ mode. Main network router is currently Asus Merlin.
My question to anyone out there: have you ever experienced a problem like this? I have my suspicions this problem may lie with the ISP and/or their equipment, but trying to get anyone to listen is futile because the connection is technically never lost. I am at a loss to understand why this is happening.
Not Hubitat specific, but some things that I have seen cause connectivity problems:
DoS or flood protections that block origins that are sending too many messages
For long-living connections, idle timeouts or max connection timeouts
These are both usually resolved by finding and adjusting settings on routers or firewalls or IDS systems. I would start with looking at your router logs and see if there is anything there when the problem occurs.
I have enabled full logs on my Asus Merlin and set my firewall to log all traffic, and there is nothing indicating a problem, at least for that router. It does show numerous packet drops from bots looking for open ports.
On the Pace router the system log shows the following (not sure what I am looking at, other than it looks to be DNS?):
err
Jan 18 17:07:37
daemon: ustatsd[1892]: Parsing failed for fields [device info] at [ 142 0 3 918 0 8006 src=1.1.1.1 dst=107.199.xxx.xxx sport=53 dport=51013 src=107.199.xxx.xxx dst=1.1.1.1 sport=51013 dport=53 ] on file: /proc/net/conntrack_dpi
err
Jan 18 17:07:07
daemon: ustatsd[1892]: Parsing failed for fields [device info] at [ 142 0 3 918 0 8006 src=1.1.1.1 dst=107.199.xxx.xxx sport=53 dport=51013 src=107.199.xxx.xxx dst=1.1.1.1 sport=51013 dport=53 ] on file: /proc/net/conntrack_dpi
err
Jan 18 17:06:37
daemon: ustatsd[1892]: Parsing failed for fields [device info] at [ 142 0 3 918 0 8006 src=1.1.1.1 dst=107.199.xxx.xxx sport=53 dport=51013 src=107.199.xxx.xxx dst=1.1.1.1 sport=51013 dport=53 ] on file: /proc/net/conntrack_dpi
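For anyone trying to read entries like these: a minimal sketch, assuming the bracketed part of the message is a connection-tracking style record with two `src=/dst=/sport=/dport=` tuples (one per direction). It only pulls those fields out so you can see which hosts and ports are involved; the function names are made up for illustration. In the sample, both tuples involve port 53 on 1.1.1.1, which is consistent with DNS traffic. The error text itself only says that ustatsd failed to parse a field, so it may not be related to the disconnects.

```python
import re

# One of the log lines from the Pace system log, copied as posted.
SAMPLE = (
    "daemon: ustatsd[1892]: Parsing failed for fields [device info] at "
    "[ 142 0 3 918 0 8006 src=1.1.1.1 dst=107.199.xxx.xxx sport=53 dport=51013 "
    "src=107.199.xxx.xxx dst=1.1.1.1 sport=51013 dport=53 ] "
    "on file: /proc/net/conntrack_dpi"
)

# Matches key=value pairs for the four fields we care about.
PAIR = re.compile(r"(src|dst|sport|dport)=(\S+)")


def parse_flow(line):
    """Return a list of dicts, one per src/dst/sport/dport group in the line."""
    pairs = PAIR.findall(line)
    directions = []
    # Each direction is assumed to be four consecutive key=value pairs.
    for i in range(0, len(pairs) - 3, 4):
        directions.append(dict(pairs[i:i + 4]))
    return directions


def looks_like_dns(directions):
    """True if any direction uses port 53 as source or destination port."""
    return any(
        d.get("sport") == "53" or d.get("dport") == "53" for d in directions
    )


if __name__ == "__main__":
    flows = parse_flow(SAMPLE)
    for flow in flows:
        print(flow)
    print("DNS-like:", looks_like_dns(flows))
```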
The Pace has an admin UI that’ll show the NAT entries.
The older version of ATT’s (fiber) residential gateways used to have a tiny amount of RAM, and the NAT table could easily overrun it... at which point bad things would happen to all outbound connections.
These devices had no real bridge mode (they still created NAT entries), so even a bigger router behind them couldn’t work around this limit.
If this is what’s going on, then chase down the entity creating all the entries in that table and see if you can lower the primary producer.
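A rough sketch of how you might find the primary producer, assuming you can get connection-tracking lines in a `src=... dst=...` format from some device on your network (for example `/proc/net/nf_conntrack` on Linux-based router firmware; whether your router exposes that file is an assumption, and the sample lines and addresses below are invented). It just counts entries per originating address.

```python
import re
from collections import Counter

# The first src= on a line is taken to be the originating (LAN) side.
SRC = re.compile(r"\bsrc=(\S+)")


def top_sources(lines, n=5):
    """Count tracked connections per source address, busiest first."""
    counts = Counter()
    for line in lines:
        match = SRC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)


# Hypothetical sample data, for illustration only.
SAMPLE_LINES = [
    "udp 17 25 src=192.168.1.50 dst=1.1.1.1 sport=40000 dport=53 "
    "src=1.1.1.1 dst=192.168.1.50 sport=53 dport=40000",
    "tcp 6 300 src=192.168.1.50 dst=203.0.113.7 sport=40001 dport=443 "
    "src=203.0.113.7 dst=192.168.1.50 sport=443 dport=40001",
    "tcp 6 300 src=192.168.1.20 dst=203.0.113.9 sport=50000 dport=443 "
    "src=203.0.113.9 dst=192.168.1.20 sport=443 dport=50000",
]

if __name__ == "__main__":
    for address, count in top_sources(SAMPLE_LINES):
        print(address, count)
```

To run it against a real file, pass the open file object instead of the sample list, e.g. `top_sources(open(path))`. A device with far more entries than the rest (IP cameras, a VPN, a torrent client) would be the first thing to block temporarily.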
Interesting Information. I don't know if I have a newer or older one but it's been installed almost three years now. Here is a snapshot of the NAT. Right now all devices are working. I will check it when hubs fall offline again.
I had said that setting up an endpoint on a hub and then calling it from the cloud solved my problem? Well, it reduces the frequency of the hubs disconnecting, but does not stop it entirely.
What I have discovered is even though the Hubitats cannot reach the cloud (cloud connection unavailable), the cloud can reach the hubs. So at the time the hubs can't get out, they can still receive my calls from the cloud.
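To pin down exactly when outbound connectivity drops, something like the following could run on a LAN device such as the Raspberry Pi. This is a sketch under assumptions: it treats "can open a TCP connection to a public host and port" as "can get out", and the host and port you probe are your choice. It records only the moments the state changes, which gives timestamps to compare against router logs.

```python
import socket
import time


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def log_transitions(samples):
    """Reduce (timestamp, reachable) samples to a list of UP/DOWN changes."""
    events = []
    last = None
    for timestamp, reachable in samples:
        if reachable != last:
            events.append((timestamp, "UP" if reachable else "DOWN"))
            last = reachable
    return events


def watch(host, port, interval_s=30, rounds=10):
    """Probe host:port every interval_s seconds and print state changes."""
    samples = []
    for i in range(rounds):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        samples.append((stamp, can_reach(host, port)))
        if i < rounds - 1:
            time.sleep(interval_s)
    for stamp, state in log_transitions(samples):
        print(stamp, state)
    return samples

# Example (not run automatically): watch("1.1.1.1", 53, interval_s=30, rounds=120)
```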
Tried that idea a while ago. It didn't make any difference. Right now the router is using OpenDNS. Thanks for the suggestion though.
I have a sneaking suspicion the problem is with AT&T or their Pace router. They did push a firmware update back in late October or early November, right around the time the problem started but I have no way to know if that has caused my problem.
It almost seems like maybe their network is seeing traffic from my devices as some kind of threat and is cutting them off. Again, pure speculation.
Someone I know said AT&T chose 1.1.1.1 as the management IP for their business fiber routers. He had to block it because the flood of DNS requests would crash the router.
That would be amusing, since it's part of Cloudflare's public DNS resolver (1.1.1.1 / 1.0.0.1 ) - like Google's 8.8.8.8 / 8.8.4.4, it's also geo-routed.
AT&T's pro-line tend to route you to dedicated [AT&T] DNS servers as part of the setup, so perhaps they never noticed they were overriding a public DNS
The Pace routers have an internal UI that'll show you the NAT table itself, and let you clear it (in emergencies). I don't recall where it is exactly, but it's there.
Clearing it will knock out all the in-flight connections, and well-behaved apps will re-establish theirs.
Looking at my logs it is showing it using 1.1.1.1 so something is wonky in my Asus settings. I will take a closer look. Thanks for that tidbit of info.
Okay, I changed my Asus router's DNS server. Now, looking at the system log on the Pace router/gateway, I still get these messages, only this time it is using the DNS I assigned on the Asus. I don't know if this is related to the disconnect problem or not. My guess is it's not, but I don't know enough to be sure.
After I made the DNS change I stopped calling the endpoint on my hub and sure enough, within 5-10 minutes the hub was once again blocked from sending out. Re-enabled the remote endpoint calls from the cloud and the hub reconnects, and stays connected.
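For reference, the remote call can be as simple as a small script running somewhere outside the home network. This is only a sketch: `ENDPOINT_URL` is a hypothetical placeholder (the real cloud endpoint address and access token come from the hub), and the 60-second interval is an assumption, not something tested here.

```python
import time
import urllib.request

# Hypothetical placeholder; substitute the hub's real cloud endpoint URL.
ENDPOINT_URL = "https://example.invalid/hub-endpoint"


def call_endpoint(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def keep_alive(url, interval_s=60, rounds=None,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Call the endpoint repeatedly; return the number of failed calls.

    rounds=None loops forever; pass a number to stop after that many calls.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        ok = call_endpoint(url, opener=opener)
        if not ok:
            failures += 1
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "ok" if ok else "FAILED")
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
    return failures

# Example (not run automatically): keep_alive(ENDPOINT_URL, interval_s=60)
```

Logging the failures with timestamps also gives a record of whether inbound calls keep working while the hub reports the cloud as unavailable.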
Just to echo what @guessed has called out: I don’t think you’re running into a DNS issue. I’ve never used the ATT hardware that he calls out, but the sort of symptoms you’re describing do sound similar to what I would expect to happen if the NAT table filled up. So I’d suggest that @guessed's suggestion is a good path to try next.
Thanks. I have looked for the NAT table described but can't find it... yet. The Pace 5268ac is not a very intuitive RG; in fact, it looks like even the cheapest routers made have more options. I have gone to the router's site map, which is supposed to show every option, and the only thing dealing with NAT is what I posted in a screenshot above. Needless to say, it's incredibly frustrating. Not only that, getting help from AT&T is a useless endeavor. I'll keep poking around on Google to see if the NAT thing is an unlisted option. Thanks
Edit: To add clarity, I am on DSL but the Pace router I have looks to have an input for a fiber connection.
Update: Apparently they have hobbled the NAT page under diagnostics. What I posted above is the only thing that shows. IOW, there is no table listing anymore.
Steps that I have taken:
Closed down my VPN running on 1 port
Lowered the UDP & TCP Session Timeout
I do have a bunch of devices, including IP cameras, that I can temporarily block to see if that helps, but for now I will try the settings above to see if things improve.
Thanks to all for your advice. From reading other forums, it seems the NAT problem is the issue. I will post more if I can find and correct the issue.
After nearly 3 months of ripping my hair out trying to find out why my hubs and Raspberry Pi were losing connection to the outside world, I have finally discovered the source of the problem. It's a vap11g-300 that was being used for a security camera that only had ethernet. The vap11g can operate in bridge mode or as a WiFi repeater. I discovered something odd when I rebooted my main router. As the main router was booting, and before its UI was ready, if I typed the IP of what was supposed to be my main router, I was getting the UI of the vap11g. I don't know how or why it would do this, but removing it from my network resolved the problem.
Not that this has any relevance to Hubitat, but more as an example: even though you may think a problem is with a certain device, be it a Hubitat or any other networked device, you have to consider everything and not just what seems obvious. I replaced routers, switches, and cables, and spent many, many hours configuring and testing, just to find out it was this oddball device causing all the trouble. Oh well! Live and learn?
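One way a problem like this might be spotted sooner, sketched under the assumption that the vap11g was answering on the router's IP address: collect (IP, MAC) pairs from a device's ARP/neighbor table over time and flag any IP that has been seen with more than one MAC. How you collect the pairs depends on your OS; the addresses below are invented for illustration.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: [macs]} for every IP seen with more than one MAC address.

    observations is an iterable of (ip, mac) pairs gathered over time.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize so "AA-BB-..." and "aa:bb:..." count as the same MAC.
        macs_by_ip[ip].add(mac.lower().replace("-", ":"))
    return {
        ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1
    }


# Hypothetical observations: the gateway IP shows up with two different MACs.
SAMPLE_OBSERVATIONS = [
    ("192.168.1.1", "AA-BB-CC-00-00-01"),
    ("192.168.1.50", "aa:bb:cc:00:00:50"),
    ("192.168.1.1", "aa:bb:cc:00:00:99"),
    ("192.168.1.1", "aa:bb:cc:00:00:01"),
]

if __name__ == "__main__":
    for ip, macs in find_ip_conflicts(SAMPLE_OBSERVATIONS).items():
        print("Possible conflict on", ip, "->", ", ".join(macs))
```

A gateway address that flips between two MACs is worth chasing before replacing routers or cables.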
New one on me too. Plus it was doing it from the WiFi connection back to the main router, as the ethernet cable was plugged into the camera. I never would have caught it if it wasn't for trying to bring up my router's UI before it was fully up and running. I lost track of the number of hours I spent chasing this one down.