The One 502 in 20,000

For as long as we can remember, our load balancer has thrown the occasional 502. Roughly one in every 20,000 requests failed, a tiny blip in the grand scheme of things.

Given the bigger infrastructure challenges we were dealing with, this wasn’t worth chasing – until it was.

In DevOps, even the smallest cracks eventually demand attention.

That moment came yesterday: one of our critical APIs failed, for the first time in three years (that we know of).

We started investigating.

A quick Google search – “ELB gunicorn 502” – led us straight to the fix.

The root cause?

  • ELB’s idle connection timeout is 60 seconds by default.
  • Gunicorn’s keep-alive timeout is just 2 seconds.

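Gunicorn was closing idle connections well before ELB was ready to let go. Every so often, ELB forwarded a request over a connection gunicorn had just closed, and the client got a 502.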

All we had to do was set a longer keep-alive value when starting gunicorn, one that outlasts ELB’s idle timeout:

gunicorn --keep-alive 65 <other-args>

The gunicorn docs even recommend a higher value when running behind a load balancer!
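
The same setting can also live in a gunicorn config file instead of on the command line. Here is a minimal sketch, assuming a file named gunicorn.conf.py and a 5-second safety margin. The constant names are ours for illustration; only `keepalive` is an actual gunicorn setting.

```python
# gunicorn.conf.py (hypothetical file name; pass it with -c)

# ELB's default idle connection timeout, in seconds.
# Assumption: the default has not been changed on the load balancer.
ELB_IDLE_TIMEOUT_SECONDS = 60

# Illustrative margin so gunicorn never closes an idle connection first.
SAFETY_MARGIN_SECONDS = 5

# gunicorn reads this module-level name; it is equivalent to --keep-alive 65.
keepalive = ELB_IDLE_TIMEOUT_SECONDS + SAFETY_MARGIN_SECONDS
```

It would be started with something like: gunicorn -c gunicorn.conf.py <other-args>. Deriving the value from the ELB timeout keeps the relationship visible if someone changes the load balancer setting later.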

Five minutes later, the pull request was merged and deployed, and the problem vanished immediately.

Sometimes, the fix really is on the first page of Google – if you just google for the right thing.