My First SRE Con
I feel re-energized after attending my first Site Reliability Engineering Con in Brooklyn, NY this year. I have been really interested in SRE practices, and at this, my first formal SRE event, I was able to learn so much from such a great group of people and organizations.
What is Site Reliability Engineering?
Google literally wrote the book on Site Reliability Engineering. SREs are responsible for the resiliency and robustness of complex, distributed systems. Even with its thousands of engineers, Google, per their presentation, has around 400 SREs in the SRE organization. This shows that SREs are typically a leveraged resource, providing expertise to multiple platforms and services. If you are curious about what SREs do and have spent even a cursory amount of time researching, you may land on the same first question I did: what is the difference between an SRE and a DevOps Engineer? I produced a little chart below that I am presenting internally to AppDynamics.
The TL;DR is that SREs are software engineers focusing on operations problems, whereas DevOps folks are typically systems engineers focusing on the development pipeline.
How Does an SRE Really Feel?
Being on the front line of complex systems is not easy. Pretty much all of the time, SREs are tasked with automating and providing expertise on platforms that are on the bleeding edge. During an incident, and especially during your first few incidents if you are new, the feeling can be not the best. SREs are not the service owners; they can remedy issues but usually do not provide a fix or long-term solution. I equate being an SRE to playing Street Fighter, and this amazing piece of internet gold pretty much encapsulates fighting against an incident.
A constant battle against the unknown at the worst possible time, e.g. during an incident, is taxing. Our brand reputation can be tarnished very quickly. An excellent example of what SREs face, and of how multiple engineering teams have to come together, is the recent MailChimp outage. MailChimp did an amazing job documenting their Mandrill outage and explaining the steps taken along the way to research and remedy the incident. By any measure, this series of steps was not easy, especially when making these moves in a production outage situation. Kudos to the MailChimp teams.
There is no Root Cause
One of the most influential talks for me at SRECon NA 2019 had to be Ryan Kitchens' talk on focusing on what is working. Too often in disciplines that focus on production, most of the attention goes to what has gone wrong rather than what has gone right. My biggest takeaway from Ryan's talk was that there is no root cause. I strongly believe in this. I write about the Fog of Development a good bit; no one person has an end-to-end view of the entire system. Our systems are too complex and we are fallible, aka human; the long and short of it is that humans built the systems. There is just a series of events that manifests itself in a “perfect storm”.
A Look to SRECon 2020
I am very excited to return to another SRECon. I was happy to present a lightning talk at SRECon this year. I will be trading in the hipster digs of Park Slope, Brooklyn for the high-tech Santa Clara, CA scene next year as SRECon expands. Onwards to SRECon NA 2020!