Full downtime

Incident Report for Buttondown

Resolved

For about fifteen minutes (between 5:03am PST and 5:17am PST), Buttondown's servers were overloaded. Over the past week, Buttondown has settled into a very bursty traffic pattern in which around two million emails are all sent out at 8am EST (aka 5am PST). My work to autoscale workers means that these emails are going out as planned, but the burstiness of the traffic means that the entire application falls over as the event webhooks coming back from those emails monopolize traffic.

In the short term, this is an easy fix: the burst is not that large, so I'm just going to keep the steady-state scale high enough to handle it.

In the medium-to-long term, though, webhook traffic really shouldn't be impacting the core application. I'll be carving out a separate deployable endpoint for webhooks that can be a) scaled independently and b) DDoS'd without customer impact.
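
To make the idea concrete, here is a minimal sketch of what a standalone webhook receiver could look like. It is an illustration only, not Buttondown's actual code: the names `accept_event`, `WebhookHandler`, and `EVENTS`, the port, and the in-memory queue are all assumptions (a real deployment would more likely push to a durable queue). The receiver does nothing but parse the payload and enqueue it, and it sheds load with a 503 once its buffer is full.

```python
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical bounded in-memory buffer; the size is an arbitrary assumption.
EVENTS = queue.Queue(maxsize=10_000)


def accept_event(raw, events):
    """Return the HTTP status code for a single webhook delivery."""
    try:
        event = json.loads(raw)
    except ValueError:
        return 400  # payload is not valid JSON
    if not isinstance(event, dict):
        return 400  # expect a JSON object per event
    try:
        events.put_nowait(event)
    except queue.Full:
        return 503  # buffer is full: shed load instead of blocking
    return 202  # accepted for later processing


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = accept_event(self.rfile.read(length), EVENTS)
        self.send_response(status)
        self.end_headers()


if __name__ == "__main__":
    # Runs as its own process, separate from the core application.
    HTTPServer(("0.0.0.0", 8081), WebhookHandler).serve_forever()
```

Because the receiver runs as its own process and only enqueues, a burst of webhooks can at worst fill its buffer and earn the sender a 503; it never competes with customer-facing requests for the same workers.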

Posted Mar 31, 2021 - 05:00 PDT