Incident Management
A clear and repeatable incident management (sometimes called incident response) process is critical to resolving incidents quickly and restoring service for your customers. Below, we outline some key considerations for implementing your incident response documentation and training program.
Incident Lifecycle
The incident lifecycle is fairly standard across service providers. Even so, it's a good idea to map out the lifecycle in a way that makes sense for your organization. Use the example below as a starting point and as a building block for developing your detailed incident response procedures.
Detection | Classification and Declaration | Diagnosis | Resolution | Closure |
---|---|---|---|---|
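If you track incidents in tooling, the phases in the table above can be modeled as a small ordered type. The following is a minimal sketch, not a prescribed implementation: the names `Phase`, `next_phase`, and `walk_lifecycle` are hypothetical, and the strictly linear ordering is an assumption made for illustration.

```python
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    """Lifecycle phases, in the order shown in the table above."""
    DETECTION = 1
    CLASSIFICATION_AND_DECLARATION = 2
    DIAGNOSIS = 3
    RESOLUTION = 4
    CLOSURE = 5


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after Closure.

    Assumption: phases advance strictly in order. A real process may
    loop back, e.g. from Resolution to Diagnosis if a fix does not hold.
    """
    if current is Phase.CLOSURE:
        return None
    return Phase(current.value + 1)


def walk_lifecycle() -> List[str]:
    """List every phase name from Detection through Closure."""
    names = []
    phase: Optional[Phase] = Phase.DETECTION
    while phase is not None:
        names.append(phase.name)
        phase = next_phase(phase)
    return names
```

Encoding the order in one place makes it easier to keep procedures, dashboards, and reports aligned on the same phase names as your lifecycle evolves.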
Postmortems, Post-Incident Reports, and Root Cause Analyses
Quality Assurance
People
Team Organization, Structure, and Hiring
Roles and Responsibilities
One of the most important aspects of incident management is having roles clearly defined before an incident strikes. Name the roles, define the responsibilities, ensure that there is always a designated person or on-call rotation to cover that role at any time of day, and drill your incident management process with those roles present.
We recommend having the following roles in place. However, depending on the size and structure of your organization, you may choose to split these responsibilities up further or consolidate them into a single role.
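One way to verify that every role is covered at any time of day is to check the rota for gaps. The sketch below assumes a simple daily rota expressed in whole hours; the function names, the `Shift` type, and the example shift times are all hypothetical and would need adapting to your scheduling tool.

```python
from typing import Dict, List, Tuple

# (start_hour, end_hour) on a 24-hour clock, end exclusive.
Shift = Tuple[int, int]


def uncovered_hours(shifts: List[Shift]) -> List[int]:
    """Return the hours of the day (0-23) that no shift covers."""
    covered = set()
    for start, end in shifts:
        if start < end:
            covered.update(range(start, end))
        else:
            # Overnight shift, e.g. (22, 6) covers 22-23 and 0-5.
            # Note: start == end is treated as a full 24-hour shift.
            covered.update(range(start, 24))
            covered.update(range(0, end))
    return [hour for hour in range(24) if hour not in covered]


def coverage_gaps(rota: Dict[str, List[Shift]]) -> Dict[str, List[int]]:
    """Map each role to its uncovered hours, omitting fully covered roles."""
    gaps = {}
    for role, shifts in rota.items():
        missing = uncovered_hours(shifts)
        if missing:
            gaps[role] = missing
    return gaps


# Hypothetical daily rota; the shift times are illustrative only.
example_rota = {
    "Incident Manager": [(0, 8), (8, 16), (16, 24)],
    "Communications Manager": [(8, 20)],
    "On-Call Engineer": [(22, 6), (6, 14), (14, 22)],
}
```

With this example rota, `coverage_gaps(example_rota)` reports a gap only for the Communications Manager role, which has no cover before 08:00 or from 20:00 onward.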
Incident Manager or Incident Commander
An Incident Manager (IM) (sometimes called an Incident Commander) is the person responsible for restoring service, serving your customers, and ultimately protecting the brand. When on duty, the Incident Manager has the 24x7 responsibility to do what it takes to restore service health. The IM responds to all potentially customer-impacting events and leads whatever team is necessary to restore service as quickly as possible. The IM makes tough decisions when required and owns the incident from start to finish, including how it is handled and the follow-up to determine the root cause and the actions needed to prevent similar issues in the future.
Who is it?
Typically, it's a mid- to upper-level manager with an ability to problem solve. Knowledge of the service that they're supporting is incredibly helpful in order to direct troubleshooting and resolution activities and have the confidence to act in a high-pressure situation. They're willing to do what it takes to defend service availability and health. In some companies, this is a dedicated role that is hired for specifically. In others, it's an on-call rotation of engineering leaders.
What do they do?
The Incident Manager owns the incident from start to finish. They ensure timely restoration of service as the primary focus and respond to all high-severity, potentially customer-impacting events. They're responsible for ensuring that whatever team is necessary to restore service is engaged as quickly as possible. They assemble a virtual team (a team outside of the organizational structure that is built for only one purpose) of on-call engineers (OCEs) whom they drive to restore service. They take ownership of critical decisions and ensure that the general service delivery goals, such as security, availability, and data integrity, are maintained. They're responsible for tracking the progress of an incident in real time and ensuring that the right data is captured for reporting updates to the customer, support teams, and other internal audiences. After the incident, they follow up to determine the root cause and help develop corrective actions that will prevent the incident from happening again. They also provide the details required by the problem management process, by the Communications Managers, and by other groups that might need follow-up.
Communications Manager
A Communications Manager (CM) is the person responsible for communicating to your customers and internal audiences during an incident. When on duty, the Communications Manager has the 24x7 responsibility to provide quality communications. This includes publishing external service health or status page messages (like those that Trustleaf provides), sending messages to internal audiences to keep them apprised of the customer impact and current status of an incident, and developing customer-facing root cause analysis (RCA) documentation (sometimes called a post-incident review or root cause messaging). CMs collaborate with the IMs and OCEs to ensure that timely, targeted, and accurate incident communications and notifications are delivered to customers and internal stakeholders.
Who is it?
This is usually someone with training in incident management, incident communications, public relations, corporate communications, marketing communications, or some combination of those experiences. They're usually good technical writers who can take the detailed internal information that's discussed during an incident call and translate it into language that is appropriate for customers and for internal audiences that may be more business-oriented.
What do they do?
The Communications Manager drives the customer-facing and internal-facing incident communications end to end. They post, publish, and/or send the incident communications and notifications to customers. Sometimes the medium is a self-hosted, authenticated service health dashboard. Sometimes it's a public-facing status page like one hosted on Trustleaf (for ultimate transparency and trust). Sometimes it's an email or SMS message from a customer database or from a list of explicitly subscribed users. The CM serves as a court reporter during an incident to document the progress of the incident and provide unbiased information to internal audiences.
On-Call Engineer (OCE)
The On-Call Engineer (OCE) (sometimes called a Designated Responsible Individual or Site Reliability Engineer) is the technical resource tasked with investigating, troubleshooting, and fixing an incident from a technical standpoint. There may be multiple OCEs engaged during an incident (think network, data center, software, subcomponent, etc.) who all have some role in restoring service. They're a group of individuals who have a 24x7 responsibility to resolve live-site customer issues, either by providing workarounds or by providing configuration or software updates to address an issue. The issues that they investigate come from three sources: customers, monitoring, or the engineers themselves. OCEs also participate in the RCA process to produce bug fixes or additional system monitoring and logging.
Who is it?
The OCE is typically an engineer of some kind who specializes in a particular area that supports a critical function of the service, its features, or its supporting infrastructure.
What do they do?
OCEs debug the issue that's causing the incident. They troubleshoot and fix the issue to restore service, and they serve in an on-call rotation for the component area that they support.
Other Roles
Companies may have other supporting roles during an incident. For example, support and account teams ensure that customers have direct support during an incident, and they may also help relay important information to the incident conference call (sometimes called a bridge).
Roles and Responsibilities Matrix
A roles and responsibilities matrix is a great tool for establishing and communicating the basic expectations of each of the roles involved in the incident management or incident response process. Below is an example of a roles and responsibilities matrix using the roles that we suggested above.
Role | Responsibilities |
---|---|
Incident Manager (IM) | |
Communications Manager (CM) | |
On-Call Engineer (OCE) | |
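A roles and responsibilities matrix can also be kept as structured data so that it is easy to review, version, and render. The sketch below is an illustration only: the responsibilities are paraphrased from the role descriptions earlier in this section rather than taken from an original matrix, and the names `ROLES` and `render_matrix` are hypothetical.

```python
from typing import Dict, List

# Responsibilities paraphrased from the role descriptions in this section;
# the wording and grouping here are illustrative, not the original matrix.
ROLES: Dict[str, List[str]] = {
    "Incident Manager (IM)": [
        "Own the incident from start to finish",
        "Engage the teams needed to restore service",
        "Make critical decisions",
        "Track progress and capture data for updates",
        "Follow up on root cause and corrective actions",
    ],
    "Communications Manager (CM)": [
        "Publish customer-facing incident communications",
        "Keep internal audiences informed of impact and status",
        "Document the progress of the incident",
        "Develop customer-facing RCA documentation",
    ],
    "On-Call Engineer (OCE)": [
        "Investigate and troubleshoot the issue",
        "Fix the issue or provide a workaround",
        "Participate in the RCA process",
    ],
}


def render_matrix(roles: Dict[str, List[str]]) -> str:
    """Render the roles as a two-column table in a pipe-delimited style."""
    lines = ["Role | Responsibilities |", "---|---|"]
    for role, duties in roles.items():
        lines.append(f"{role} | {'; '.join(duties)} |")
    return "\n".join(lines)
```

Calling `print(render_matrix(ROLES))` produces one table row per role, which you can paste into your incident documentation and adjust to match how your organization splits or consolidates the roles.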
Expectations
Clearly communicate the behaviors you expect during an incident to your team and to all of the partner teams that will participate at any point in the incident management process. Below are some common behaviors to communicate internally. These best practices help participants collaborate on an incident call and keep the focus on driving the incident towards service restoration:
- Effectively using technology
- Bridge courtesy and professionalism
- Incident investigation and coordination
Effectively Using Technology
Get familiar with incident tools and processes before an incident occurs
- Your headset and audio settings should be configured correctly before an incident call.
- Know how to quickly select the correct audio device if you have to get online from a remote location, or configure your computer's audio settings to automatically use specific devices in specific scenarios (if applicable).
- All required permissions, VPN access, and other required access should be configured and working correctly.
- Know how and where to access established incident resources, like conference bridges. When the incident engagement process begins, you are expected to join the established call as soon (and as safely) as possible.
Keep the call distraction- and noise-free
- If you are not speaking, mute.
- Always double-check your mute status before and after speaking or when someone enters your office or other workspace.
- If you are in a noisy area, move to a quieter location or put on a headset.
- If you are having any trouble joining the call or others report having trouble hearing you, just dial in from a cell phone or ask someone to add your mobile number to the call.
- Every second is critical on an incident bridge, especially at the start. Crucial time spent troubleshooting audio settings could be put towards the investigation instead.
Make use of the chat or IM window
- For things like server names and ticket numbers.
- For information that is not time-sensitive.
- "I'm sorry. I'm having a hard time hearing you. Can you place that in the chat?"
- For keeping track of key milestones, and keeping IM-only participants informed.
- Use at-mentions to ensure accountability and clarify who an ask is directed to.
Make use of any whiteboard features
- For important data that will be frequently referenced. This will reduce repeated requests for the same information that can often derail a bridge call.
Internal conference bridges are for company employees only
- Under no circumstances should customers be added to internal conference bridges or calls.
- If needed, other customer-facing resources should facilitate organizing separate conference calls with customers if a direct discussion is required.
- Ensure that any participants who have dialed in directly via phone have identified who they are and what team they represent.
- Proactively identify yourself when dialing in to a call directly.
Bridge Courtesy & Professionalism
Be clear and concise
- Only speak when necessary, and speak with intention.
- As more resources join a call, the potential for a chaotic bridge grows.
- Know what you're going to say before saying it.
- Consider how Air Traffic Control coordinates radio communications to keep things organized. Make a habit of briefly introducing yourself and the team you represent when speaking, and direct questions or assign tasks to specific people to avoid ambiguity and ensure accountability.
- "This is Micah Gregorio from the Incident Communications team. Can Joe Smith from the Network team provide the log details from..."
- Check your mute status frequently.
Practice Active Listening
- Know your role, know what you're responsible for, and be ready to respond when called upon.
- Keep the bridge informed of your status and availability in case you are needed.
- Follow along with bridge progress and know what data is available.
- When information needs to be continuously repeated, this creates interruptions and takes time away from remediating the issue.
- Check on the bridge to make sure you are no longer needed before disengaging.
We're all on the same team
- Having a helpful attitude goes a long way in moving an incident towards resolution. If you were incorrectly engaged, don't spend more energy pushing back than you would have spent on pointing someone in the right direction instead.
- Consider the difference in response between "Why was I engaged for this?" and "This isn't my role, but try this/this person might be able to help. Feel free to engage me again if it turns out that X is the problem."
- "There's always time in the postmortem."
- Clarifying the correct escalation path, why paging resources didn't work, etc. can happen after the incident. Service restoration and doing the right thing for the customer is always the first priority.
- Be available to give a warm handoff when raising or transferring an incident escalation.
- The start of the bridge is the most critical time to get the essential details. Make sure nobody needs any info from you before disengaging. Receiving an incident call with a vague description and then discovering that the person who raised the incident is offline completely derails the investigation.
- Remember the human and be appreciative of the contributions of others.
Incident Investigation and Coordination
Assign tasks and set clear expectations
- Keep track of assigned tasks and next steps using a whiteboard feature.
- Avoid vague expectations and set specific commitments. Ask for acknowledgement from specific individuals when making a request.
- Set a cadence for check-ins on current progress. People tend to be more comfortable checking in and providing an update at a particular time once expectations are set, rather than being put on the spot for an update.
- Set expectations regarding when to escalate or try another approach if progress is not being made.
Always approach incident bridges with a sense of urgency
- Service restoration is the top priority; avoid casual chat and joking, analyzing performance, or postmortem discussions.
- As things slow down, don't take a back seat; continue to maintain focus on the affected customers.
- Take initiative and set the tone for everyone on the bridge.
- Understand the pre-established criteria for severity levels and incident escalations.
Training
Drills / Practice
Process
A clear, concrete, and repeatable incident management process that results in the same outcomes no matter who is leading it, where it's being led, or when it's being led is a key component of a successful incident response program. Have a plan and execute it. An incident is not the time to change how incidents are run (whenever reasonably possible). You can think about how to do incidents better in the future as a part of the problem management process.
Process and Documentation
Incident Severity Matrix
Question Bank
Having a question bank that your Incident Managers can reference helps establish a repeatable pattern for understanding, diagnosing, and developing solutions for major incidents. Question banks not only help Incident Managers and on-call engineers from a technical perspective, they also help ensure that everyone who's participating in the incident process can take the technical information and translate it into something that customers and other (potentially non-technical) stakeholders will understand. Use the following question bank as an example of something that you can implement on your own. This incident question bank is a table that describes the categories of questions that you can ask (e.g., status, impact, root cause, etc.) and when in the incident you should start asking those questions (e.g., during the investigating or service restoration phases).
 | Investigating | Service Degradation | Restoring Service | Service Restored |
---|---|---|---|---|
Status | | | | |
Impact | | | | |
Workaround | | | | |
Start/End Time | | | | |
Next Update / Estimated Time to Resolve | | | | |
Preliminary Root Cause / Next Steps | | | | |
Other / Tracking | | | | |
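A question bank organized by category and phase maps naturally onto a nested lookup. The sketch below uses the category and phase names from the table above, but the questions themselves are hypothetical placeholders, as are the names `PHASES`, `QUESTION_BANK`, and `questions_for`; substitute the questions your own Incident Managers rely on.

```python
from typing import Dict, List

PHASES = [
    "Investigating",
    "Service Degradation",
    "Restoring Service",
    "Service Restored",
]

# Category and phase names come from the table above. The questions are
# hypothetical placeholders; replace them with your own.
QUESTION_BANK: Dict[str, Dict[str, List[str]]] = {
    "Status": {
        "Investigating": ["What is the current status of the investigation?"],
        "Restoring Service": ["Which mitigation steps are in progress?"],
    },
    "Impact": {
        "Investigating": ["Which customers or features are affected?"],
    },
    "Workaround": {
        "Service Degradation": ["Is there a workaround customers can use?"],
    },
}


def questions_for(
    phase: str, bank: Dict[str, Dict[str, List[str]]]
) -> Dict[str, List[str]]:
    """Return the questions to ask in `phase`, grouped by category."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    return {
        category: by_phase[phase]
        for category, by_phase in bank.items()
        if phase in by_phase
    }
```

For example, `questions_for("Investigating", QUESTION_BANK)` returns only the Status and Impact questions, so an Incident Manager sees just the prompts relevant to the current phase.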