Network Help Needed

Admittedly not a network guy here, but I noticed a problem on my home network last night and need help.

I have two computers in my office, each plugged into a separate Netgear 48-port gigabit managed switch. Both switches are plugged into my EdgeRouter. I started a 300GB file copy from computer 1 on switch 1 to a server (computer 3) down in my IT closet, and I noticed it was basically maxing out the gigabit link (~113 MB/s). Without doing some tracing, I’m unsure which of the two switches this third computer is plugged into. After about 5 minutes, I noticed computer 2 had lost internet access, and Netflix on the family room TV had lost its connection too. I could ping computer 1 from computer 2, and could open the EdgeRouter page from computer 2, but had no access to the outside world from this machine. I also noticed that my HE could not see computer 3, which it polls for all sorts of status (my 4Runner, shower, etc).
As soon as the file transfer completed, everything went back to normal.
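
For anyone following along, here is a rough sanity check of the numbers in this post. It is only a sketch: it assumes decimal units (1 GB = 1000 MB) and a nominal 1000 Mbit/s link, and the function names are made up for illustration.

```python
# Back-of-the-envelope numbers for the copy described above.
LINK_MBIT_PER_S = 1000   # nominal gigabit Ethernet line rate (assumed)
OBSERVED_MB_PER_S = 113  # copy speed reported in the post
FILE_GB = 300            # size of the copy (decimal GB assumed)


def link_utilisation(mb_per_s: float, link_mbit_per_s: float) -> float:
    """Fraction of the link's nominal rate used by the payload alone."""
    return (mb_per_s * 8) / link_mbit_per_s


def transfer_minutes(size_gb: float, mb_per_s: float) -> float:
    """Minutes needed to move size_gb at a steady mb_per_s."""
    return (size_gb * 1000) / mb_per_s / 60


if __name__ == "__main__":
    util = link_utilisation(OBSERVED_MB_PER_S, LINK_MBIT_PER_S)
    minutes = transfer_minutes(FILE_GB, OBSERVED_MB_PER_S)
    print(f"payload utilisation: {util:.0%}")
    print(f"estimated duration: {minutes:.0f} min")
```

The payload alone is about 90% of the nominal rate before any Ethernet/IP/TCP overhead is counted, so any link the copy crosses has very little left for other traffic for the better part of an hour.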

Any idea on how to solve this problem? I noticed the same issue about a week earlier during another large file transfer from computer 1 to computer 2 - initiating this copy from computer 2. My daughter was watching Netflix, and it lost connection.

Put the two computers on the same switch if possible; otherwise you are flooding the network and making everything else work harder. You could create a VLAN, but that is somewhat advanced.

Unfortunately not that easy… every data wall outlet in the house has two Cat6 drops - one back to each managed switch. Any other options here, other than a VLAN?

Edit: I have cascaded switches for decades now and never seen this issue. What makes this configuration different is what has me confused. Btw, I’ve had this set up for about 6 years and never had any other issues.

But again, not a network guy.

Dumb question, would cascading my rack mounted switches make any difference? Meaning switch 2 -> switch 1 -> edgerouter instead of switch 1 -> edgerouter <- switch 2?

I mean, overall your network is fine (separate switches). What's interesting is that your other devices went offline (well, couldn't be reached due to network traffic). It's almost as if the switch that the "server" is plugged into was acting more like a hub than a switch. As a switch, it should balance things much better. Do you have QoS or any odd things turned on with any of the switches? Another thing that could have happened is that the network card on the server side was having issues and sending an abnormal amount of resend requests... (you would need to set up something like Wireshark to see). Again, this is all only speculation on my end (I'm a network engineer).
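
If the server is a Linux box (an assumption, the thread only says it runs bash scripts), one rough way to check the resend theory without a full packet capture is to read the kernel's TCP counters in /proc/net/snmp before and after a copy. The sketch below parses that file's `Tcp:` lines; the helper name and the sample text are made up for illustration.

```python
# Sample shaped like the "Tcp:" lines of /proc/net/snmp (values are invented).
SAMPLE_SNMP = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 10 5 0 0 2 1000 2000 50 0 3 0\n"
)


def tcp_retransmit_ratio(snmp_text: str) -> float:
    """Return RetransSegs / OutSegs from the contents of /proc/net/snmp."""
    # The file holds a header line and a value line, both prefixed "Tcp:".
    tcp_lines = [ln.split() for ln in snmp_text.splitlines() if ln.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp: header/value pair found")
    header, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(header, (int(v) for v in values)))
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0


if __name__ == "__main__":
    print(f"retransmit ratio in sample: {tcp_retransmit_ratio(SAMPLE_SNMP):.3f}")
```

On a real machine you would pass in the contents of /proc/net/snmp instead of the sample. The counters are cumulative since boot, so the useful number is the difference between a reading taken before the copy and one taken after it.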

Not really... especially if you're not running a backplane (which yours are likely not able to use anyway).

In your server, are you using a mechanical drive or SSDs?

It depends on the traffic flow. Anything within a VLAN should not have to go to the EdgeRouter. So a VLAN10 client on switch1 connecting to a VLAN10 client on switch2 is better off with the cascade of switches, especially if the link is 10GE.
Traffic between VLANs has to go to the layer 3 (and possibly firewall) device to determine whether that traffic is possible (or allowed).
One option is to evaluate which clients are on each switch and move some of them if they mostly communicate amongst themselves.
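
To make the difference concrete, here is a small sketch that traces which devices a same-VLAN copy between computer 1 and computer 2 would cross in each layout. It is a toy model only: the node names are taken from the thread, the link lists are my reading of the two layouts, and it ignores VLAN routing entirely.

```python
from collections import deque


def shortest_path(links, src, dst):
    """Breadth-first search over an undirected list of (a, b) links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route between src and dst


# Current layout: both switches hang off the EdgeRouter.
STAR = [("pc1", "switch1"), ("pc2", "switch2"),
        ("switch1", "edgerouter"), ("switch2", "edgerouter")]
# Proposed cascade: switch2 -> switch1 -> edgerouter.
CASCADE = [("pc1", "switch1"), ("pc2", "switch2"),
           ("switch2", "switch1"), ("switch1", "edgerouter")]

if __name__ == "__main__":
    for name, links in (("star", STAR), ("cascade", CASCADE)):
        print(name, ":", " -> ".join(shortest_path(links, "pc1", "pc2")))
```

In the current layout the copy passes through the EdgeRouter and both switch uplinks; in the cascade it only crosses the switch-to-switch link. The trade-off is that in the cascade, internet traffic from switch 2 would share that same inter-switch link with the copy.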

I am not aware of any QOS on either switch, but these switches are infamous for having their web UI go down within months of a reboot, so I can’t check without rebooting - which I have to schedule - it’s winter break here and the internet is being used heavily.

I actually started looking at QOS on the ERX to limit each switch’s bandwidth to 75% as a possible solution - or “band-aid”.
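
Rough numbers for what a 75% cap would mean, as a sketch only. It assumes a nominal 1000 Mbit/s link and decimal units, and the function names are made up; it says nothing about how the ERX would actually enforce the limit.

```python
def capped_rate_mb_per_s(link_mbit_per_s: float, cap_fraction: float) -> float:
    """Maximum payload rate in MB/s if the link is capped at cap_fraction."""
    return link_mbit_per_s * cap_fraction / 8


def headroom_mbit_per_s(link_mbit_per_s: float, cap_fraction: float) -> float:
    """Bandwidth in Mbit/s left over for everything else on the link."""
    return link_mbit_per_s * (1 - cap_fraction)


if __name__ == "__main__":
    print(f"copy ceiling: {capped_rate_mb_per_s(1000, 0.75)} MB/s")
    print(f"headroom: {headroom_mbit_per_s(1000, 0.75)} Mbit/s")
```

So a large copy would top out somewhat below the ~113 MB/s seen earlier, in exchange for a quarter of the link staying free for streaming and internet traffic.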

That particular virtual server is running on a mechanical drive… but the fact that some other devices on the network lose internet connection makes me think it’s something other than the drive… could be a coincidence, but I’ve never noticed any latency on that server. It’s mostly running simple bash scripts.

Well, my thought process on the drive is that if there is a problem with the drive slowing down (which happens), the processing of the overall system is affected, because the slow writes create a bottleneck. I've seen it happen, so just throwing it out there. Like I said, it's hard to really see what's going on without monitoring overall network traffic to get a bead on what's what.

Ah, good thought - and that server’s share that I was copying to happens to be on a (hardware) raid array of mech drives.

BUT, the same issue happened the other day when I was copying between computers 1 & 2 - the server wasn’t involved. My 8-yr old daughter actually asked me about it yesterday am - “did you figure out why Netflix stopped working the other night?”

Hmm... Yeah, I'd use Wireshark to see what's going on...

Dude, I am soooo glad you suggested that I look at my disks. I had forgotten that old ESXi (5.5… ya, I know!) host was running most of my VMs’ boot/root partition file systems off an SSD, with the secondary storage disks being on the RAID array of mechanical drives. Based on your suggestion, I decided to log into ESXi at the time to check if I had any drive issues and noticed that my SSD was starting to degrade. So I moved my 2 most important VMs’ root partitions to the RAID array.
This past Friday, ESXi crashed and the SSD was finally toast. Somehow I managed to (barely) boot into recovery mode on ESXi, and the only two VMs that survived were the two that I had moved to the RAID array. I managed to spin up a temp Proxmox machine and import those two VMs from ESXi. Total downtime… (a very panicked) 2 hours.

Phew!!!

Ps. Phew!!!!!!!


Back up those VMs to a RAID array or to the cloud…
