Learn how you can add fire drills to your software development lifecycle to increase your production resilience and better understand how your systems handle failure.
What is a fire drill?
There are many names for similar processes: “fire drills”, “game days”, “chaos testing”. In this post I’ll mainly be speaking about what has worked well for my teams, but this is a process that has a lot of room for interpretation and can easily be adjusted to fit your team’s needs. It’s the principles that matter.
At a high level, a fire drill is a process for executing a set of failure scenarios against your application/system and confirming that what you expect to happen actually happens. It’s deceptive in its simplicity, but it’s an incredibly powerful tool for building resilient systems.
Why? What problem do fire drills solve?
Fire drills force you to question your assumptions. What happens if your database connection is unavailable? What error codes will your application return? Do your PagerDuty alarms fire? Have you documented how to handle that alarm? Does a failure in component A affect component B?
How to prepare for a fire drill
Walk through your application/system and identify any dependencies you have. Do you make calls to another API (even one you own)? Do you connect to a database? Do you consume or publish to any queues? This is a good time to document all of these things with some simple architectural diagrams if you haven’t already.
List out every alert/alarm that you have for this application/system (you do have alarms, right?). Gather links to all of your metrics/dashboards. Find (or create) a runbook to track all of your alerts and what should be done if they fire.
Prepare your scenarios
Now that you have listed out all of your dependencies and alerts, you can start thinking about the scenarios that you want to test during the fire drill. A good starting point is to have a scenario for each dependency failing, and one that exercises each alarm.
Some example scenarios:
- MySQL database is unreachable
- Redis cache is full
- Service B is taking >5s to respond
- AWS permissions are broken
- Perform a traffic failover to a second region/data center
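One low-ceremony way to keep these scenarios honest is to track them as plain data checked into the repo, so it’s obvious at a glance which ones still lack a documented expectation. This is just a sketch; the field names, injection steps, and alarm names below are invented for illustration:

```python
# A minimal scenario register; adapt the fields to your own template.
SCENARIOS = [
    {"name": "MySQL database is unreachable",
     "inject": "Point the DB host at a non-routable address",
     "expectation": "Writes fail fast; DB connection alarm fires"},
    {"name": "Redis cache is full",
     "inject": "Fill the cache with junk keys",
     "expectation": "Cache misses spike; latency alarm fires"},
    {"name": "AWS permissions are broken",
     "inject": "Attach a deny-all policy to the service role",
     "expectation": ""},  # not yet documented -- a gap worth chasing down
]

def undocumented(scenarios):
    """Names of scenarios that still lack a documented expectation."""
    return [s["name"] for s in scenarios if not s["expectation"]]
```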
Flesh each of these out as much as you can. You’re going to have to make a trade-off of completeness vs time investment. Ideally you would cover multiple cases for each dependency, since your system might handle a total outage more gracefully than degraded performance or higher latencies. Make a judgment call here on how deep you want to go.
Prepare execution steps
For each scenario, you need to think through and document how you will inject that failure into your system (and un-inject it!). Write down the specific steps so that anyone on your team would be able to recreate that failure.
This can be hard; you might have to get creative to cover all of your scenarios. For some of the simpler ones you can do things to your application config to simulate the failure (swap out endpoints, use the wrong credentials, wrong ports, etc). Depending on your tech stack, there may be tools that you can use to inject some of the more tricky failures like network congestion or random packet loss (ex. tylertreat/comcast). We’ll walk through some specific example scenarios a little further down.
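The config-swap trick can be as simple as pointing a hostname at an address that will never answer. A sketch, assuming nothing beyond the standard library (the “real” hostname here is hypothetical; 203.0.113.1 is a reserved TEST-NET-3 documentation address from RFC 5737, so it is guaranteed not to respond):

```python
import socket

def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, unreachable, and timeouts
        return False

REAL_DB_HOST = "db.internal.example"  # whatever your config points at today
BROKEN_DB_HOST = "203.0.113.1"        # reserved address: never answers
```

To inject the failure, swap the configured host for the broken one and redeploy; to un-inject, swap it back. The probe function doubles as a quick sanity check that the injection actually took effect.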
An aside: Dev/Staging vs Production
Ideally, you would run your fire drill scenarios in a production environment. This will give you the most accurate results and the best chance of finding issues that may only pop up in production. For example, staging environments often have fewer resources allocated to them, less traffic flowing through them, and less alerting or SLO tracking. You should evaluate your own situation; maybe your team has perfect parity between production and pre-production environments (kudos if so). But, more likely than not, there are differences that could bury issues that even a fire drill will not expose.
One common thing I’ve seen is that alerts in pre-production environments do not get routed to a real PagerDuty (or similar) instance, making it hard to verify that pages actually fire correctly.
If your application/system can be deployed in multiple regions or data centers, that can be a good way to run your scenarios against production without it affecting your customers.
Start with the assumption that you will run the scenarios in production, and if that isn’t possible, have a strong argument ready for why not.
Document assumptions & expectations
For each scenario you prepared above, document what you expect to happen when you execute it.
- What alerts do you expect to trigger?
- What do you expect your metrics/dashboards to show?
- How will the service respond? (Consumer lag? HTTP status codes? etc)
Be as specific as possible. If you don’t know, dig into the code to make an educated guess. It’s okay to leave this blank in cases where you’re not sure how the system will respond, but it might be a red flag if you aren’t sure how you expect your system to respond to failure.
Execute the fire drill
Now for the fun part 🎉.
Executing the fire drill is best done as a team (or at least with one partner), since there can be a lot to manage and monitor. Schedule some time with your team and block off at least an hour (I’ve had some larger systems with a dozen-plus scenarios take upwards of half a day). Grab some donuts/burritos and get crackin'.
For each scenario, in order:
- Execute the failure injection steps (break permissions, take down a dependency, shut off the database, etc).
- Wait and monitor what happens, watch for alerts, monitor dashboards, try to run requests/data through your system.
- Document what actually happens in as much detail as you can.
- Revert the failure injection and wait for the system to come back to equilibrium.
You can split this work up; have one person documenting, one person injecting the failures, another making requests, etc. Whatever works best for your team.
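The “try to run requests through your system” step benefits from being repeatable, so every scenario gets observed the same way. A small probe loop like the sketch below can help; the probe callable is whatever makes sense for your system (an HTTP request returning a status code, a queue publish, etc.), and exceptions are tallied too, since “it threw” is itself a finding:

```python
import time

def observe(probe, attempts=5, interval=0.0):
    """Poll the system during a drill and tally the outcomes.

    `probe` is any zero-argument callable, e.g. one that makes a request
    and returns the HTTP status code. Exceptions are tallied under the
    exception class name rather than aborting the drill.
    """
    tally = {}
    for _ in range(attempts):
        try:
            outcome = probe()
        except Exception as exc:
            outcome = type(exc).__name__
        tally[outcome] = tally.get(outcome, 0) + 1
        time.sleep(interval)
    return tally
```

Running this before injection, during the failure, and after reverting gives you three comparable snapshots to paste straight into your notes.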
Examine the results
Good thing you took such detailed notes while executing each scenario. Let’s examine.
- Did the alerts that you expected to fire, fire? Did ones you didn’t expect to fire, fire? Did they fire in a timely manner?
- Did you have all the right metrics and monitoring to cover each failure scenario?
- Would you have been able to identify the failure scenario based solely on the metrics and alerts?
- Were there any unexpected failures?
- Was the customer experience (data flow/status codes) what you expected?
- After reverting the failure, did the system recover successfully? Did it take longer than you expected? Did your alarms all resolve in a timely manner?
Now that you’ve examined the results of the fire drill, it’s time to come up with some action items, or remediations.
Your stated expectations almost certainly didn’t line up with reality for every scenario; I bet there were some surprises.
As a team, think about what you could do to fix your system to respond better to these failures. Maybe you just need to tune some alert timings/thresholds. Maybe you need to introduce or tune timeout settings. Maybe you need to take a big step back and think about larger architectural changes to get more resiliency.
Prioritize and commit
Now this exercise wouldn’t be very useful if we didn’t do anything with the results.
Stack rank your remediations, and commit to tackling some of the easier ones ASAP. Bigger items you may need to schedule into your road map later, but at least now you know the cases where your system doesn’t respond as expected.
If you haven’t gone to production yet¹, you should be able to clearly identify which remediations should be fixed before letting customers in.
When to (re)run a fire drill
Fire drills shouldn’t be a one-time event. Yes, they are super important before releasing a new system, but they also should be run periodically to identify any drift in expectations. This is more of an issue for projects which are being actively developed, since you are more likely to make changes that could affect the scenarios.
Come up with a cadence that works best for your team; maybe you fire drill your top three most important systems every six months.
A good rule of thumb is that you should execute a fire drill every time you add a new dependency to the system, since that is a very clear addition of logic where you need to handle failures. Note that you may not need to run the entire fire drill again, maybe just one or two new scenarios.
Let’s walk through an example fire drill for an imagined system, Message Saver 9000.
This system consumes messages from an AWS SQS queue and writes to a DynamoDB table. To keep things simple for this example, there is no way to read the messages back out.
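The core loop of Message Saver 9000 might look something like the sketch below. The clients are injected so the logic can be exercised with fakes; in production you would pass `boto3.client("sqs")` and `boto3.client("dynamodb")`, and the queue URL and table name are made up for this example:

```python
def save_messages(sqs, dynamodb, queue_url, table_name):
    """Read one batch from SQS, persist each message, then delete it.

    Deleting only after a successful put_item means a DynamoDB failure
    leaves the message on the queue for redelivery -- exactly the kind
    of assumption a fire drill should verify.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    saved = 0
    for msg in resp.get("Messages", []):
        dynamodb.put_item(
            TableName=table_name,
            Item={"MessageId": {"S": msg["MessageId"]},
                  "Body": {"S": msg["Body"]}},
        )
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])
        saved += 1
    return saved
```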
Here is our simple fire drill scenario template:
| # | Scenario | How to Simulate | Expectation | Actual |
|---|----------|-----------------|-------------|--------|
Okay, let’s think about the failure cases for Message Saver 9000.
- What if SQS is down?
- What if DynamoDB is down?
- What if the system is overloaded?
Let’s write these up, assuming we have some basic monitoring and alarms in place.
| # | Scenario | How to Simulate | Expectation | Actual |
|---|----------|-----------------|-------------|--------|
| 1 | SQS is down | Remove the … | Can’t read messages. … | TBD |
| 2 | DynamoDB is down | Change the … | Can’t save items. … | TBD |
| 3 | Service is overloaded | Publish many messages to the SQS queue | SQS lag will be immediate. … | TBD |
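Scenario 3’s injection step (“publish many messages to the SQS queue”) is easy to script. A sketch with an injected client so it can be tested without AWS credentials; with the real thing you would pass `boto3.client("sqs")` and your actual queue URL:

```python
def flood_queue(sqs, queue_url, count):
    """Send `count` test messages in batches of 10 (the SQS batch limit)."""
    sent = 0
    for start in range(0, count, 10):
        batch = [{"Id": str(i), "MessageBody": f"drill-message-{i}"}
                 for i in range(start, min(start + 10, count))]
        sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)
        sent += len(batch)
    return sent
```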
Okay, let’s say we execute these scenarios, and we documented the following findings.
| # | Scenario | How to Simulate | Expectation | Actual | Result |
|---|----------|-----------------|-------------|--------|--------|
| 1 | SQS is down | Remove the … | Can’t read messages. … | … | ✅ |
| 2 | DynamoDB is down | Change the … | Can’t save items. … | … | ❌ |
| 3 | Service is overloaded | Publish many messages to the SQS queue | SQS lag will be immediate. … | … | ❌ |
Oops, looks like 2 out of our 3 scenarios did not go as planned!
Let’s come up with some remediations for the issues we saw.
- The DynamoFailure alarm didn’t fire properly, and we didn’t see any DynamoDB failures in our metrics dashboard.
- The SQSFallingBehind alarm took too long to fire when we overloaded the queue; we expected it to fire more immediately.
- When DynamoDB was throttling us, not all messages made it to the DB, we lost data!
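For the data-loss issue, one common fix (sketched here, not the only option) is to retry throttled writes with exponential backoff instead of dropping the message. `ThrottledError` below is a stand-in for the real exception, botocore’s `ProvisionedThroughputExceededException`; the writer and sleep function are injected so the logic is testable:

```python
import time

class ThrottledError(Exception):
    """Stand-in for DynamoDB's throughput-exceeded error."""

def put_with_backoff(write, item, attempts=5, base_delay=0.1,
                     sleep=time.sleep):
    """Retry a throttled write with exponential backoff.

    `write` is any callable that raises ThrottledError when the store
    pushes back. The final failure is re-raised so the message can go
    back on the queue instead of vanishing.
    """
    for attempt in range(attempts):
        try:
            return write(item)
        except ThrottledError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```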
Now we document the remediations, get them into our issue tracker, and assign a priority.
| Scenario # | Issue | Priority | Ticket # |
|------------|-------|----------|----------|
| 3 | When DynamoDB was throttling us, not all messages made it to the DB, we lost data! | High | MSAVER-1236 |
Looks like we found a pretty serious data durability issue we need to remediate before we release!
¹ Right before you deploy a new system is a great time to fire drill!