Week-Long Outage: Lifelong Lessons
Briefly

"Outages are often like a murder mystery, trying to figure out who done it. Was it the feature code with the bug? Was it the configuration change? Was it everyone's favorite, DNS? It's always DNS. Except for, in the story I'm about to share, it actually wasn't DNS."
"My goal with this talk is to entertain you with a roller coaster ride through one of the biggest outages of my career. What comes after, the lessons learned, while not as exciting, are even more important."
"During the story, I'm going to talk about an Elasticsearch upgrade. During this upgrade, we went from Elasticsearch 2 to Elasticsearch 5. Despite the fact that the numbers jump by a value of 3, the actual major version bump was only 1."
The speaker compares one of the biggest outages of their career to a murder mystery, with suspects such as a feature-code bug, a configuration change, or DNS. The speaker emphasizes that contributing to an outage is generally undesirable. The narrative centers on an Elasticsearch upgrade from version 2 to version 5 which, despite the jump in numbering, was only a single major version bump. The outage not only provided critical lessons but also steered the speaker's career toward becoming a Site Reliability Engineer (SRE).
Read at InfoQ