General availability outage

Incident Report for Buttondown

Resolved

For around an hour this morning, Buttondown had significantly degraded availability.

## What happened?
New hosts refused to spin up and were "correctly" throwing 500s for around 30% of requests. (This only impacted hosts that were automatically cycling in and out, which is why it wasn't all requests.)

## Why did this happen?
I'm using an undocumented Notion API to power documentation search, and the token I was using for that API expired in a way that I was not defensively programming against. This meant that each time a server tried to restart, it would hit the API, fall over, and then pass that failure on to the client. As soon as the failures were widespread enough, I got an alert for it... but I was out on a run. As soon as I got back, I hit the circuit breaker for that codepath and things got back to normal.
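
For the curious, here is a minimal sketch of what programming defensively around a startup call like this could look like. It is not the actual Buttondown code: the names `load_search_index`, `fetch`, `enabled`, and `fetch_with_expired_token` are all hypothetical, and the real Notion call is replaced by a stand-in that simply raises.

```python
from typing import Callable, Dict, List

SearchIndex = List[Dict[str, str]]


def load_search_index(
    fetch: Callable[[], SearchIndex],
    enabled: bool = True,
) -> SearchIndex:
    """Return the documentation search index, or an empty one on any failure."""
    if not enabled:
        # The circuit breaker is tripped: skip the external call entirely.
        return []
    try:
        return fetch()
    except Exception:
        # An expired token, a timeout, or a changed response shape degrades
        # search instead of taking the host down during startup.
        return []


def fetch_with_expired_token() -> SearchIndex:
    # Hypothetical stand-in for the external call failing on an expired token.
    raise PermissionError("token expired")


# Startup survives the failure; search is just empty until the token is fixed.
index = load_search_index(fetch_with_expired_token)
```

The tradeoff is that documentation search quietly returns nothing while the token is broken, so a guard like this probably wants its own alert rather than relying on 500s to surface the problem.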

## Why won't this happen again?
That codepath is gonna stay switched off for a little while, but I plan on moving all of that compilation to a build-time step anyway, removing the Notion codepath from the critical path of the application!
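
Roughly, a build-time step could look like the sketch below. This is an illustration under assumptions, not the planned implementation: the function names and the JSON file format are hypothetical, and `fetch` stands in for whatever pulls the documentation from Notion.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

SearchIndex = List[Dict[str, str]]


def build_search_index(fetch: Callable[[], SearchIndex], path: Path) -> None:
    """Run at build time: a failure here fails the build, not a live host."""
    path.write_text(json.dumps(fetch()))


def read_search_index(path: Path) -> SearchIndex:
    """Run at startup: reads a local file and never touches the network."""
    if not path.exists():
        # No compiled index was shipped; serve with empty search results.
        return []
    return json.loads(path.read_text())
```

With this split, an expired token breaks a deploy (loudly, before any traffic is involved) rather than breaking hosts as they cycle in.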

## Any questions?
Email me: justin@buttondown.email

Posted Nov 29, 2020 - 09:00 PST