
For as long as we can remember, our load balancer has thrown the occasional 502. Roughly one in every 20,000 requests failed, a tiny blip in the grand scheme of things.
Given the bigger infrastructure challenges we were dealing with, this wasn’t worth chasing – until it was.

In DevOps, even the smallest cracks eventually demand attention.
That moment came yesterday: one of our critical APIs failed – the first time in three years (that we know of).

We started investigating.
A quick Google search – “ELB gunicorn 502” – led us straight to the fix.
The root cause?
- ELB’s idle connection timeout is 60 seconds by default.
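- Gunicorn’s keep-alive timeout is just 2 seconds by default.

Gunicorn was closing idle connections well before ELB was ready to let go. Whenever ELB then reused a connection that gunicorn had already closed, the request failed and ELB answered with a 502.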
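
To make the mismatch concrete, here is a deliberately simplified model of it. The function name and parameters are hypothetical, and the sketch ignores network latency and the exact timing of the race; it only encodes the two timeouts listed above.

```python
def reuse_can_fail(idle_seconds: float,
                   backend_keepalive: float = 2.0,
                   lb_idle_timeout: float = 60.0) -> bool:
    """Return True if the load balancer may reuse a connection
    that the backend has already closed.

    Simplified model: the backend closes a connection once it has been
    idle for `backend_keepalive` seconds, while the load balancer keeps
    it in its pool until `lb_idle_timeout` seconds of idleness.
    """
    backend_closed = idle_seconds >= backend_keepalive
    lb_still_holding = idle_seconds < lb_idle_timeout
    return backend_closed and lb_still_holding


# Connection idle for 10 s with the default timeouts: the danger window.
print(reuse_can_fail(10))                        # True
# Same idle time, but the backend now outlasts the load balancer.
print(reuse_can_fail(10, backend_keepalive=65))  # False
```

Once the backend timeout is longer than the load balancer's, the window in which the load balancer holds a dead connection disappears in this model.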
All we had to do was set a longer keep-alive value when starting gunicorn:

gunicorn --keep-alive 65 <other-args>

The gunicorn docs even suggest using a higher value when running behind a load balancer!
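
If you would rather keep the setting in a config file than on the command line, the same thing can be sketched in a gunicorn config module. This is not what we deployed, just an equivalent: the file name and the two uppercase helper constants are assumptions for illustration, while `keepalive` is the gunicorn setting behind `--keep-alive`.

```python
# gunicorn.conf.py: a minimal sketch, not our actual config.

# Assumption: the ELB idle timeout is still at its 60-second default.
ELB_IDLE_TIMEOUT_SECONDS = 60

# Hypothetical safety margin so gunicorn always outlasts the ELB.
KEEPALIVE_MARGIN_SECONDS = 5

# gunicorn reads the module-level name `keepalive`
# (the config-file equivalent of --keep-alive).
keepalive = ELB_IDLE_TIMEOUT_SECONDS + KEEPALIVE_MARGIN_SECONDS
```

Start gunicorn with `gunicorn -c gunicorn.conf.py <other-args>`. Deriving the value from the ELB timeout keeps the relationship between the two numbers visible if someone later changes the ELB setting.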

Five minutes later, the pull request was merged and deployed, and the problem vanished immediately.

Sometimes, the fix really is on the first page of Google – if you just google for the right thing.