The Case Of The Apparently Firewalled Connections

(Note: This is a part of my Adventures In Geekdom.)

Earlier this year, we (Pulsar Aviation) began having some intermittent congestion issues. A web page would just cease loading partway through and wouldn't do anything until the user hit the refresh button. The issue was intermittent enough that I had a difficult time observing it. By the time I'd noticed the issue and gotten iftop and friends up, the problem had gone away. A few times it lasted much longer and I was able to note a vague sort of correlation between high bandwidth usage and the occurrence of this issue. A reboot of the DSL Modem/Router would cause things to go back to normal.

At some point I noticed that this was also affecting an IRC connection. I maintain two connections 24/7 in a screen session. One of these was sometimes exhibiting a similar problem where the connection would suddenly cease. Not drop; packets would just stop making it across and it took several minutes for either side to time out. Initially I dismissed this as the IRC server being flaky, but one day I realized that the connection that wasn't exhibiting the issue was an IPv6 connection. This was the key that enabled me to diagnose the problem.

I did some further testing, attempting to get the issue to manifest while I was watching via tcpdump. I set up a loop to repeatedly try to download a file. It took a while, but eventually I was able to watch what happened when the issue cropped up. The session would look like this:

Several minutes later, it was still sending ACKs and getting nothing back. Eventually it would send a FIN. Testing this with a friend, I was able to determine that the server saw the same thing in reverse. It would get to a point and repeatedly resend the data, never seeing any of my ACKs or FINs. During all this, other attempts would succeed without issue. It was as if that failed connection was suddenly firewalled off.
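For illustration, the download loop mentioned above might look something like the following Python sketch. This is not the script I actually used; the function names, the timeout value, and the example URL in the usage note are all assumptions made up for the example. The idea is just to fetch the same file over and over and record which attempts stall, so there is a timestamped event to line up against the tcpdump capture.

```python
import time
import urllib.request
from typing import Callable, List, Tuple


def fetch_once(url: str, timeout: float = 30.0) -> int:
    """Download url completely and return the number of bytes read.

    The timeout applies to each blocking socket operation, so a connection
    that silently stops delivering data raises an OSError subclass instead
    of hanging forever.
    """
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        while True:
            chunk = resp.read(65536)
            if not chunk:
                return total
            total += len(chunk)


def probe(fetch: Callable[[], int], attempts: int,
          pause: float = 0.0) -> List[Tuple[int, str]]:
    """Call fetch repeatedly and note which attempts fail or stall."""
    results = []
    for i in range(attempts):
        try:
            size = fetch()
            results.append((i, f"ok {size} bytes"))
        except OSError as exc:  # covers timeouts and URLError
            results.append((i, f"stalled: {exc}"))
        time.sleep(pause)
    return results
```

Something like `probe(lambda: fetch_once("http://example.com/testfile"), attempts=500, pause=1.0)` (with a hypothetical URL) could then be left running while tcpdump watches the interface.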

Over time, the problem was slowly getting worse. Rather than taking two weeks after reboot before these issues would appear, it started happening after a day or two, and finally within minutes of the reboot. I'd also been able to observe packet loss to the router itself on the LAN interface from devices plugged directly into it, and had tried the simple things like cable replacement.

So, I figure that this one must be dying or something. Rather than try to spend any more time diagnosing it, I just go out to find a replacement. (Warranty having already expired.) As it happens I had the same choice as before. Same exact model DSL Modem/Router. Ah, well. So I buy it, bring it back and install it. All is well... for a few days.

Shortly after installing the replacement, the same problems started occurring. So, either this one is faulty, too, or something else is wrong. I hunt around for anything else that could be happening to cause this, but I really can't locate any other possible cause.

I decide that, given some of the other problems that we've had with this model, it's probably just a horribly-made device and that I should find a replacement online, even if it takes a few days to arrive. I end up buying a Cisco SR520 and with some effort, get it up and running, at which point the problems go away.

Based on observations both before and after the installation of the Cisco router, my final determination was that it wasn't bandwidth saturation that was the issue, but rather an excess of NAT sessions. The stats from the new router as of mid-June 2009 indicate that we've usually got somewhere between 300 and 500 active NAT sessions, sometimes peaking around 5000. I figure that the cheaper equipment wasn't able to keep up with that, and was dropping sessions from the NAT table, causing an instant severance of the connection. This is why the issue was only evident for IPv4 connections and not IPv6 connections; since IPv6 connections go through a tunnel to SixXS via a static port forwarding assignment, they weren't subject to the NAT table.
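To make that theory concrete, here is a toy Python model of a NAT table with a small fixed capacity. It is only a sketch under assumptions: I don't know how the old router actually managed its table, so the class name, the capacity, the addresses, and the evict-the-oldest policy are all invented for illustration. It simply shows how a long-lived connection can lose its mapping when enough new sessions arrive, after which its packets have nowhere to go.

```python
from collections import OrderedDict
from typing import Tuple

# (src_ip, src_port, dst_ip, dst_port) of a connection from the LAN side
Flow = Tuple[str, int, str, int]


class TinyNatTable:
    """Hypothetical NAT table that evicts its oldest mapping when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.sessions: "OrderedDict[Flow, int]" = OrderedDict()
        self.next_port = 40000  # arbitrary starting external port

    def open(self, flow: Flow) -> int:
        """Map a new outbound flow to an external port."""
        if flow in self.sessions:
            return self.sessions[flow]
        if len(self.sessions) >= self.capacity:
            # Assumed behavior: the oldest session is silently discarded.
            self.sessions.popitem(last=False)
        port = self.next_port
        self.next_port += 1
        self.sessions[flow] = port
        return port

    def forward(self, flow: Flow) -> bool:
        """Return True if a mid-stream packet still has a mapping."""
        return flow in self.sessions


nat = TinyNatTable(capacity=3)
irc = ("192.168.1.10", 50000, "203.0.113.5", 6667)
nat.open(irc)  # the long-lived connection
for i in range(3):  # a burst of short web requests
    nat.open(("192.168.1.20", 51000 + i, "198.51.100.7", 80))

print(nat.forward(irc))  # False: the IRC mapping was evicted
```

In this model the IRC connection is never reset; it just stops being forwarded, which matches the symptom of both ends retransmitting until they time out.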

This is version 11 of this page, which was last modified at 13:52 on 2009-07-01 by treed.