Incident Management

A clear and repeatable incident management (sometimes called incident response) process is critical to resolving incidents quickly and restoring service for your customers. Below we outline some of the key considerations to keep in mind when implementing your incident response documentation and training program.

Incident Lifecycle

The incident lifecycle is fairly standard across service providers. Even so, it's a good idea to map out the incident lifecycle in a way that makes sense for your organization. You can use the example below as a basic starting point and develop your lifecycle from there; it serves as a building block for your detailed incident response procedures.

Detection → Classification and Declaration → Diagnosis → Resolution → Closure
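
If you track incidents in your own tooling, it can also help to encode the lifecycle stages explicitly so every incident moves through the same states. Below is a minimal sketch in Python; the stage names mirror the example lifecycle above, and the transition rules and function names are illustrative assumptions to adapt to your own process.

from enum import Enum

class IncidentStage(Enum):
    DETECTION = "detection"
    CLASSIFICATION_AND_DECLARATION = "classification_and_declaration"
    DIAGNOSIS = "diagnosis"
    RESOLUTION = "resolution"
    CLOSURE = "closure"

# Allowed transitions between stages; adjust to match your own lifecycle.
TRANSITIONS = {
    IncidentStage.DETECTION: {IncidentStage.CLASSIFICATION_AND_DECLARATION},
    IncidentStage.CLASSIFICATION_AND_DECLARATION: {IncidentStage.DIAGNOSIS},
    IncidentStage.DIAGNOSIS: {IncidentStage.RESOLUTION},
    IncidentStage.RESOLUTION: {IncidentStage.CLOSURE},
    IncidentStage.CLOSURE: set(),
}

def advance(current: IncidentStage, target: IncidentStage) -> IncidentStage:
    # Move an incident to the next stage, rejecting skipped or backward steps.
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Cannot move from {current.name} to {target.name}")
    return target

Keeping the stages in one place like this makes it easier to drive dashboards, notifications, and postmortem timelines from the same definitions.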

Postmortems, Post-Incident Reports, and Root Cause Analyses

Quality Assurance

People

Team Organization, Structure, and Hiring

Roles and Responsibilities

One of the most important aspects of incident management is having roles clearly defined before an incident strikes. Name the roles, define the responsibilities, ensure that there is always a designated person or on-call rotation to cover that role at any time of day, and drill your incident management process with those roles present.

We recommend having the following roles in place. However, depending on the size and structure of your organization, you may choose to split these responsibilities up further or consolidate them into a single role.

Incident Manager or Incident Commander

An Incident Manager (IM) (sometimes called an Incident Commander) is the person responsible for restoring service, serving your customers, and ultimately protecting the brand. When on duty, the Incident Manager has the 24x7 responsibility to do what it takes to restore service health. The IM responds to all potentially customer-impacting events and leads whatever team is necessary to restore service as quickly as possible. The IM makes tough decisions when required and owns the incident from start to finish, including how it is handled and the follow-up to determine root cause and actions to prevent similar issues in the future.

Who is it?

Typically, it's a mid- to upper-level manager with strong problem-solving skills. Knowledge of the service that they're supporting is incredibly helpful for directing troubleshooting and resolution activities and for having the confidence to act in a high-pressure situation. They're willing to do what it takes to defend service availability and health. In some companies, this is a dedicated role that is hired for specifically. In others, it's an on-call rotation of engineering leaders.

What do they do?

The Incident Manager owns the incident from start to finish. Their primary focus is ensuring timely restoration of service, and they respond to all high-severity, potentially customer-impacting events. They're responsible for ensuring that whatever team is necessary to restore service is engaged as quickly as possible. They assemble a virtual team (a team outside of the organizational structure that is built for only one purpose) of on-call engineers (OCEs) whom they drive to restore service. They take ownership of critical decisions and ensure that general service delivery goals such as security, availability, and data integrity are maintained. They're responsible for tracking the progress of an incident in real time and ensuring that the right data is captured for reporting updates to customers, support teams, and other internal audiences. After the incident, they follow up to determine the root cause and help develop corrective actions that will prevent the incident from happening in the future. They also provide the details required by the problem management process, as well as by the communications managers and other groups that might need follow-up.

Communications Manager

A Communications Manager (CM) is the person responsible for communicating with your customers and internal audiences during an incident. When on duty, the Communications Manager has the 24x7 responsibility to provide quality communications to customers, including posting external service health or status page messages (like those that Trustleaf provides), sending messages to internal audiences to keep them apprised of the customer impact and current status of an incident, and developing customer-facing root cause analysis (RCA) documentation (sometimes called a post-incident review or root cause messaging). CMs collaborate with the IMs and OCEs to ensure that timely, targeted, and accurate incident communications and notifications are delivered to customers and internal stakeholders.

Who is it?

This is usually someone with training in incident management, incident communications, public relations, corporate communications, marketing communications, or some combination of those experiences. They're usually good technical writers, taking the detailed internal information that's discussed during an incident call and translating it into something appropriate for customers and internal audiences that may be more business-oriented.

What do they do?

The Communications Manager drives the customer-facing and internal-facing incident communications end to end. They post, publish, and/or send the incident communications and notifications to customers. Sometimes the medium is a self-hosted, authenticated service health dashboard. Sometimes it's a public-facing status page like one hosted on Trustleaf (for ultimate transparency and trust). Sometimes it's an email or SMS message from a customer database or from a list of explicitly subscribed users. The CM serves as a court reporter during an incident to document the progress of the incident and provide unbiased information to internal audiences.

On-Call Engineer (OCE)

The On-Call Engineer (OCE) (sometimes called a Designated Responsible Individual or Site Reliability Engineer) is the technical resource tasked with investigating, troubleshooting, and fixing an incident from a technical standpoint. There may be multiple OCEs engaged during an incident (think network, data center, software, subcomponent, etc.) who all have some role in restoring service. They're a group of individuals who have a 24x7 responsibility to resolve live-site customer issues, either by providing workarounds or by delivering configuration or software updates to address an issue. The issues that they investigate come from three sources: customers, monitoring, or the OCEs themselves. OCEs also participate in the RCA process to produce bug fixes or additional system monitoring and logging.

Who is it?

The OCE is typically an engineer of some kind who specializes in a particular area that supports a critical function of the service, its features, or its supporting infrastructure.

What do they do?

OCEs debug the issue that's causing the incident. They serve in an on-call rotation for their component area, and they troubleshoot and fix issues to restore service.

Other Roles

Companies may have other supporting roles during an incident. For example, support and account teams ensure that customers have direct support during an incident, and they may also help relay important information to the incident conference call (sometimes called a bridge).

Roles and Responsibilities Matrix

A roles and responsibilities matrix is a great tool for establishing and communicating the basic expectations of each of the roles involved in the incident management or incident response process. Below is an example of a roles and responsibilities matrix using the roles that we suggested above.

Role Responsibilities
Incident Manager (IM)
  • Owns the incident from start to finish, including directing people and making final decisions
  • Tracks action items and their owners
  • Engages other required teams and resources to resolve the issue as quickly as possible
  • Tracks and completes the postmortem process
Communications Manager (CM)
  • Develops and sends customer-facing communications
  • Develops and sends internal-facing communications
  • Keeps a timeline of the incident
On-Call Engineer (OCE)
  • Diagnoses the issue
  • Determines what other resources are required
  • Develops and implements a fix for the issue
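
Because each of these roles needs round-the-clock coverage, it can also help to treat the matrix and your on-call schedules as data and check them against each other. The sketch below is a minimal illustration that assumes you can export shifts from your paging tool as simple (role, start, end) records; the role names mirror the matrix above, and the export format and helper function are assumptions, not any particular tool's API.

from datetime import datetime

ROLES = ["Incident Manager", "Communications Manager", "On-Call Engineer"]

# Example shift export: (role, shift start, shift end). The real format will
# depend on the paging or scheduling tool you use.
shifts = [
    ("Incident Manager", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 12, 0)),
    ("Incident Manager", datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 0, 0)),
    ("Communications Manager", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 2, 0, 0)),
    ("On-Call Engineer", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 2, 0, 0)),
]

def coverage_gaps(role, shifts, window_start, window_end):
    # Return (start, end) windows where no one is on call for the given role.
    intervals = sorted((start, end) for r, start, end in shifts if r == role)
    gaps, cursor = [], window_start
    for start, end in intervals:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps

for role in ROLES:
    gaps = coverage_gaps(role, shifts, datetime(2024, 1, 1), datetime(2024, 1, 8))
    if gaps:
        print(f"{role} is uncovered during: {gaps}")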

Expectations

Clearly communicate the expected behaviors during an incident to your team and to all of the partner teams that will participate at any point during the incident management process. Below are some common behaviors to communicate internally; these best practices can help everyone collaborate on an incident call and maintain focus on driving the incident toward service restoration:

Effectively Using Technology

Get familiar with incident tools and processes before an incident occurs

Keep the call distraction- and noise-free

Make use of the chat or IM window

Make use of any whiteboard features

Internal conference bridges are for company employees only

Bridge Courtesy & Professionalism

Be clear and concise

Practice active listening

We're all on the same team

Incident Investigation and Coordination

Assign tasks and set clear expectations

Always approach incident bridges with a sense of urgency

Training

Drills / Practice


Process

A clear, concrete, and repeatable incident management process that produces the same outcomes no matter who is leading it, where it's being led, or when it's being led is a key component of a successful incident response program. Have a plan and execute it. An incident is not the time to change how incidents are run (whenever reasonably possible); you can think about how to run incidents better in the future as part of the problem management process.

Process and Documentation

Incident Severity Matrix

Question Bank

Having a question bank that your Incident Managers can reference helps establish a repeatable pattern for understanding, diagnosing, and developing solutions for major incidents. Question banks don't just help Incident Managers and on-call engineers from a technical perspective; they also help ensure that everyone who's participating in the incident process can take the technical information and translate it into something that customers and other (potentially non-technical) stakeholders will understand. Use the following question bank as an example of something that you can implement on your own. This incident question bank is a table that describes the categories of questions that you can ask (e.g., status, impact, root cause, etc.) and when in the incident you should start asking those questions (e.g., during the investigating or service restoration phases). A minimal sketch of how such a question bank might be represented in code follows the table.

Phases (when to start asking): Investigating, Service Degradation, Restoring Service, Service Restored
Status
  • What service-side problem or alert do we see? How is it different from normal behavior?
  • What actions have we taken so far, even if we ruled them out as a problem?
  • What are we doing next (e.g. gathering or analyzing additional logs, developing and testing a fix, restarting system services, etc.)?
  • What action did we take to initiate recovery?
  • Will customers need to take any action, or is there anything they can do to speed up recovery?
  • Did we take any other actions to complete the recovery process?
Impact
  • What service is affected?
  • Which region / data center / infrastructure unit is affected?
  • Or, can an exact list of customer IDs be quickly obtained?
  • Is there a specific error message customers will see?
  • Is the entire service affected or only specific features?
  • Is impact specific to certain types of customer configurations (e.g., release version)?
  • Is impact intermittent in nature or does it occur 100% of the time?
Workaround
  • Are there any steps customers can take to self-resolve the issue, or are there alternative features customers could utilize to accomplish the same tasks?
Start/End Time
  • When was the first alert?
  • When did customers first notice impact or when was the first support case raised?
  • When would the last impacted customer no longer experience impact?
Next Update / Estimated Time to Resolve
  • How long do we expect the current action to take?
  • When do we expect to be able to share another update with customers?
  • Will we have the current action completed within the next hour (or two hours or three hours)?
  • Is there an ETA?
  • If not, what is a ballpark expectation for this scenario (e.g., 2 hours, 2 days, 2 weeks)? Take that time and provide a buffer for your next update to stakeholders.
Preliminary Root Cause / Next Steps
  • Is there any indication of the preliminary suspected cause, even at just a high level?
  • If update-related, what was the purpose of the update?
  • Are there safeguards in place to prevent this type of problem that didn't execute as expected, or is this a previously unknown failure scenario?
  • If not, what problem did we identify that prompted us to take the recovery actions we did?
  • Do we know of any immediate next steps we'll be taking to prevent this from happening again?
  • Was this caught by monitoring?
  • Was this a testing miss?
Other / Tracking
  • How many customer support cases did we receive during this incident?
  • Are there any additional tickets or bugs associated with this case for tracking?
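
If you keep the question bank next to your incident tooling, it can be represented as data so the right prompts surface automatically as an incident progresses. Below is a minimal sketch in Python; the categories and question text are drawn from the table above, while the start-phase assignments, structure, and helper function are illustrative assumptions to adapt to your own question bank.

PHASES = ["Investigating", "Service Degradation", "Restoring Service", "Service Restored"]

# Each category lists a few example questions and the phase at which to start
# asking them. Only a subset of the full question bank above is shown here.
QUESTION_BANK = {
    "Status": {
        "start_phase": "Investigating",
        "questions": [
            "What service-side problem or alert do we see? How is it different from normal behavior?",
            "What actions have we taken so far, even if we ruled them out as a problem?",
        ],
    },
    "Impact": {
        "start_phase": "Investigating",
        "questions": [
            "What service is affected?",
            "Is the entire service affected or only specific features?",
        ],
    },
    "Preliminary Root Cause / Next Steps": {
        "start_phase": "Restoring Service",
        "questions": [
            "Is there any indication of the preliminary suspected cause, even at just a high level?",
            "Do we know of any immediate next steps we'll be taking to prevent this from happening again?",
        ],
    },
}

def prompts_for(phase: str) -> list[str]:
    # Return every question whose category starts at or before the given phase.
    active = []
    for category, entry in QUESTION_BANK.items():
        if PHASES.index(entry["start_phase"]) <= PHASES.index(phase):
            active.extend(f"[{category}] {q}" for q in entry["questions"])
    return active

print("\n".join(prompts_for("Restoring Service")))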

Technology

Understanding Technology

Understanding Your Company's Technology

Understanding Your Company's Product and Customers

Selecting Your Incident Management and Incident Communications Toolset