A quick story about node.js, socket.io, and the Linux TCP stack.

werner
Dec 12, 2018

Over the last week, one of our legacy core servers started throwing errors at an increasingly alarming rate.

For a stable legacy server serving a few enterprise apps developed more than two years ago, all under a code freeze, this seemed fairly odd.

I immediately hopped on the box and checked the shouting Nginx error logs and lo and behold, I was greeted with a rapidly rolling report of upstream timeouts:

(Screenshot: the Nginx error log. Behold… a sea of errors…)

Strange, I thought. A basic sanity check (including the git HEAD commit IDs of the services and their process uptime) showed everything as it should be: frozen in time and running smoothly. Except it wasn't.

Note that this service is part of Atajo, which exclusively uses WebSockets for realtime communication between devices and backends. This “legacy” core, however, still used socket.io (with polling fallback), as you can see in the image above.

My focus then turned to the main external variable — client connections.

I like the following command to give me a pretty good idea of the TCP connection states on the box:

netstat -tan | awk '{print $6}' | sort | uniq -c
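
As an aside, ss gives a similar picture on modern distributions; something along these lines works, though the exact invocation is just my habit and not part of the original diagnosis:

# Socket state summary
ss -s
# TCP sockets currently in TIME-WAIT (the first output line is a header)
ss -tan state time-wait | wc -l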

In this case, the netstat command gave me:

  11948 ESTABLISHED
    145 FIN_WAIT1
   1425 FIN_WAIT2
      9 LAST_ACK
    118 LISTEN
     69 SYN_RECV
  60523 TIME_WAIT

Holy moly, I thought… 60k connections in the TIME_WAIT state. That’s getting close to the entire TCP port range on the VM. Something was wrong.
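
At that point it is worth checking what the box actually allows, because the stock defaults are tighter than you might think. Something like this shows the current values (the same knobs I end up tuning further down):

# Current ephemeral port range and TIME_WAIT related defaults on the box
sysctl net.ipv4.ip_local_port_range
sysctl net.ipv4.tcp_fin_timeout
sysctl net.ipv4.tcp_tw_reuse
sysctl net.ipv4.tcp_max_tw_buckets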

Was this a (D)DoS attack?

Checking the firewall (iptables) and the throughput with iftop, I could see that all the traffic was legit and the requesting IPs were from allowed subnets.
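
The checks themselves were nothing exotic; roughly the following (the exact flags and the interface name are just examples, adjust to your box):

# Per-rule packet and byte counters on the firewall
sudo iptables -L -n -v
# Live per-connection throughput (eth0 is an example interface)
sudo iftop -n -i eth0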

I jumped on the phone to one of our support engineers to ask whether any more users had been allocated to this app, and indeed there were.

Normally our enterprise clients notify us when they are expecting to allocate more users to an app, but in this case, because everything had been running smoothly for more than two years (the app was almost forgotten), they simply added another division (~1000 users) without a second thought.

Now, to be honest, 1000 users is not that many, but something about the extra load was causing upstream timeouts.

So, back in the Nginx error log, I could see that these were all polling connection attempts.

Basically, socket.io on a client device starts with an HTTP handshake and tries to upgrade to a WebSocket if the device supports it (some of the devices in the field were running Android 4.0 or older). If the upgrade doesn't happen, the socket is “simulated” by HTTP long-polling.
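
A quick way to see the polling-versus-WebSocket split is to count the transport query parameter that socket.io puts on every request; the log path below is just the usual default, adjust as needed:

# Rough split of socket.io requests by transport
grep -c 'transport=polling' /var/log/nginx/access.log
grep -c 'transport=websocket' /var/log/nginx/access.log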

So I had two questions: why were the sockets not upgrading, and why would legitimately non-WebSocket-capable apps suddenly start a connection flood of incessant polling after years of stability?

I suspected the two were related and went back into the logs to see when this started happening, when it got worse, and when it got better.

Long story short:

  • I realised that, firstly, this VM had never been configured with my standard TCP kernel flag tuning (it was still a stock Ubuntu image).
  • Secondly, because of this, once Nginx started saturating the upstreams, the WebSocket upgrade stopped happening, so more and more devices fell back to polling (a feedback loop of sorts) until we were seeing about 500 polling requests a second on a 2 CPU / 16 GB RAM VM, which obviously will not work. (A quick way to put a number on that rate is shown just below.)
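
To put a number on a flood like that, a rough count of requests per second straight from the access log does the job; this assumes the default combined log format (timestamp in the fourth field) and the default log path, so adjust for your setup:

# Requests per second, busiest seconds first
awk '{print $4}' /var/log/nginx/access.log | uniq -c | sort -rn | head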

So, I’ll share an example TCP kernel flag config with you. Add this to /etc/sysctl.d/100-socket.io.conf and load it with sysctl --system (a plain sysctl -p only reads /etc/sysctl.conf, not the files under /etc/sysctl.d/):

net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.netdev_max_backlog = 4096
net.core.rmem_max = 16777216
net.core.somaxconn = 4096
net.core.wmem_max = 16777216
net.ipv4.tcp_max_syn_backlog = 20480
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.min_free_kbytes = 65536
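
Loading and verifying looks like this (the keys in the spot-check are arbitrary examples):

# Load everything under /etc/sysctl.d/ (plus /etc/sysctl.conf)
sudo sysctl --system
# Spot-check that the new values took effect
sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_fin_timeout net.core.somaxconn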

And for some more sugar, add a high keepalive to your Nginx upstream config:

upstream atajo {
    hash $arg_UUID consistent;
    server 127.0.0.1:30000;
    server 127.0.0.1:30001;
    keepalive 512;
}
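
The upstream block alone is not the whole story: for the keepalive pool to actually be used, and for the WebSocket upgrade to pass through at all, the matching proxy location needs HTTP/1.1 and the upgrade headers. That part of our config isn't shown above, so treat the following as a minimal sketch; the /socket.io/ path is socket.io's default and the map is a standard Nginx pattern, not something lifted from our setup:

# WebSocket requests get "Connection: upgrade"; ordinary polling requests get an
# empty value, which drops the Connection header so the upstream keepalive pool
# can be reused.
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      '';
}

# Inside the server block that proxies to the upstream above
location /socket.io/ {
    proxy_pass http://atajo;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $host;
}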

As an interesting aside, the hash $arg_UUID consistent part is there for sticky balancing between upstreams (socket.io requires sticky sessions so that handshake packets are directed to the same upstream). Instead of hashing on the IP address (you can have hundreds of devices coming in over an APN or an internal corporate network from the same IP address), I use the device UUID as provided by Android and iOS via their standard APIs. This allows for a smoother distribution of connections.
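
If you ever want to sanity-check that the stickiness actually holds, one low-tech way is to log $arg_UUID and $upstream_addr next to each other and compare counts. This is purely illustrative and assumes a custom access log whose first two fields are exactly those variables:

# If every device sticks to one upstream, these two counts come out equal.
awk '{print $1}' /var/log/nginx/sticky.log | sort -u | wc -l
awk '{print $1, $2}' /var/log/nginx/sticky.log | sort -u | wc -l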

Anyway, the lesson is ALWAYS check your kernel TCP flags!

Have a good one.
