Actionable Production Escalations
I've long considered the following items the basics of an actionable production escalation. These were taught to me by Googlers (mostly when I violated these understated values). At a minimum, any production escalation from an SRE should document the following:
1. An exception, call graph, logs or metrics showing the problem
2. A first-pass characterization of the problem (what is it / how much impact)
3. Why me? (Do we need a PoC that you wouldn't know otherwise?)
4. What have you already tried?
5. Things that you have noted that are out of the ordinary.
6. How specifically can I help solve this problem? (Find a PoC? Look at the code? Judge downstream impact? Validate severity?)
Following the above process keeps a check on the level of due diligence needed before a Dev escalation. It also helps formulate concrete action items as part of the escalation process. I've found that this helps resolve issues more quickly and keeps the prod overhead low for devs. What do you think?
This is really nice, thanks for sharing! I think the order could be 1, 2 (what's the issue); 4, 5 (what's the existing effort); and 3, 6 (why and how the person you reach out to can help).
Thank you - that makes sense. Reposting your response below for a quick copy-paste in the future:
1. What's the problem? (exception, call graphs, logs, metrics, impact)
2. What's the existing effort? (what has been tried, things out of the ordinary)
3. Who's the best person to solve the issue & why? (PoC, code issue, bug, etc.)
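For anyone who would rather keep the three questions above as a reusable template than as plain text, here is a minimal Python sketch. The `Escalation` class, its field names, and the sample incident text are all hypothetical, made up for illustration; the only thing it does is flag which of the three sections are still blank before you send the escalation.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class Escalation:
    """Hypothetical template mirroring the three questions above."""

    problem: str = ""          # 1. exception, call graphs, logs, metrics, impact
    existing_effort: str = ""  # 2. what has been tried, things out of the ordinary
    why_you: str = ""          # 3. why this person, and the specific help needed

    def missing(self) -> list[str]:
        """Return the names of sections that are empty or whitespace only."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_actionable(self) -> bool:
        """An escalation counts as actionable once every section is filled in."""
        return not self.missing()


# Illustrative draft with invented incident details: only the first section is filled in.
draft = Escalation(problem="Timeouts on the checkout path; stack trace and dashboard attached")
print(draft.missing())        # ['existing_effort', 'why_you']
print(draft.is_actionable())  # False
```

This only checks that each section has some text; judging whether the content shows enough due diligence is still up to the person escalating.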