Incident Management

A clear and repeatable incident management (sometimes called incident response) process is critical to resolving incidents quickly and restoring service for your customers. Below we outline some of the key considerations to keep in mind when implementing your incident response documentation and training program.

Incident Lifecycle

The incident lifecycle is fairly standard across service providers. Even so, it's a good idea to map out the incident lifecycle in a way that makes sense for your organization. You can use the example below as a basic starting point and develop your lifecycle from there; it serves as a building block for your detailed incident response procedures.

Detection → Classification and Declaration → Diagnosis → Resolution → Closure
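
If you track incidents in your own tooling, it can also help to encode the lifecycle stages explicitly so every incident moves through the same states. Below is a minimal sketch in Python; the stage names mirror the example lifecycle above, and the transition rules and function names are illustrative assumptions to adapt to your own process.

from enum import Enum

class IncidentStage(Enum):
    DETECTION = "detection"
    CLASSIFICATION_AND_DECLARATION = "classification_and_declaration"
    DIAGNOSIS = "diagnosis"
    RESOLUTION = "resolution"
    CLOSURE = "closure"

# Allowed transitions between stages; adjust to match your own lifecycle.
TRANSITIONS = {
    IncidentStage.DETECTION: {IncidentStage.CLASSIFICATION_AND_DECLARATION},
    IncidentStage.CLASSIFICATION_AND_DECLARATION: {IncidentStage.DIAGNOSIS},
    IncidentStage.DIAGNOSIS: {IncidentStage.RESOLUTION},
    IncidentStage.RESOLUTION: {IncidentStage.CLOSURE},
    IncidentStage.CLOSURE: set(),
}

def advance(current: IncidentStage, target: IncidentStage) -> IncidentStage:
    # Move an incident to the next stage, rejecting skipped or backward steps.
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Cannot move from {current.name} to {target.name}")
    return target

Keeping the stages in one place like this makes it easier to drive dashboards, notifications, and postmortem timelines from the same definitions.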

Postmortems, Post-Incident Reports, and Root Cause Analyses

Quality Assurance

People

Team Organization, Structure, and Hiring

Roles and Responsibilities

One of the most important aspects of incident management is having roles clearly defined before an incident strikes. Name the roles, define the responsibilities, ensure that there is always a designated person or on-call rotation to cover that role at any time of day, and drill your incident management process with those roles present.

We recommend having the following roles in place. However, depending on the size and structure of your organization, you may choose to split these responsibilities up further or consolidate them into a single role.

Incident Manager or Incident Commander

An Incident Manager (IM) (sometimes called an Incident Commander) is the person responsible for restoring service, serving your customers, and ultimately protecting the brand. When on duty, the Incident Manager has the 24x7 responsibility to do what it takes to restore service health. The IM responds to all potentially customer-impacting events and leads whatever team is necessary to restore service as quickly as possible. The IM makes tough decisions when required and owns the incident from start to finish, including how it is handled and the follow-up to determine root cause and actions to prevent similar issues in the future.

Who is it?

Typically, it's a mid- to upper-level manager with strong problem-solving skills. Knowledge of the service that they're supporting is incredibly helpful for directing troubleshooting and resolution activities and for having the confidence to act in a high-pressure situation. They're willing to do what it takes to defend service availability and health. In some companies, this is a dedicated role that is hired for specifically. In others, it's an on-call rotation of engineering leaders.

What do they do?

The Incident Manager owns the incident from start to finish. Their primary focus is ensuring timely restoration of service, and they respond to all high-severity, potentially customer-impacting events. They're responsible for ensuring that whatever team is necessary to restore service is engaged as quickly as possible. They assemble a virtual team (a team outside of the organizational structure that is built for only one purpose) of on-call engineers (OCEs) whom they drive to restore service. They take ownership of critical decisions and ensure that general service delivery goals such as security, availability, and data integrity are maintained. They're responsible for tracking the progress of an incident in real time and ensuring that the right data is captured for reporting updates to customers, support teams, and other internal audiences. After the incident, they follow up to determine the root cause and help develop corrective actions that will prevent the incident from happening in the future. They also provide the details required by the problem management process, as well as by the communications managers and other groups that might need follow-up.

Communications Manager

A Communications Manager (CM) is the person responsible for communicating with your customers and internal audiences during an incident. When on duty, the Communications Manager has the 24x7 responsibility to provide quality communications to customers, including posting external service health or status page messages (like those that Trustleaf provides), sending messages to internal audiences to keep them apprised of the customer impact and current status of an incident, and developing customer-facing root cause analysis (RCA) documentation (sometimes called a post-incident review or root cause messaging). CMs collaborate with the IMs and OCEs to ensure that timely, targeted, and accurate incident communications and notifications are delivered to customers and internal stakeholders.

Who is it?

This is usually someone with training in incident management, incident communications, public relations, corporate communications, marketing communications, or some combination of those experiences. They're usually good technical writers, taking the detailed internal information that's discussed during an incident call and translating it into something appropriate for customers and internal audiences that may be more business-oriented.

What do they do?

The Communications Manager drives the customer-facing and internal-facing incident communications end to end. They post, publish, and/or send the incident communications and notifications to customers. Sometimes the medium is a self-hosted, authenticated service health dashboard. Sometimes it's a public-facing status page like one hosted on Trustleaf (for ultimate transparency and trust). Sometimes it's an email or SMS message from a customer database or from a list of explicitly subscribed users. The CM serves as a court reporter during an incident to document the progress of the incident and provide unbiased information to internal audiences.

On-Call Engineer (OCE)

The On-Call Engineer (OCE) (sometimes called a Designated Responsible Individual or Site Reliability Engineer) is the technical resource tasked with investigating, troubleshooting, and fixing an incident from a technical standpoint. There may be multiple OCEs engaged during an incident (think network, data center, software, subcomponent, etc.) who all have some role in restoring service. They're a group of individuals who have a 24x7 responsibility to resolve live-site customer issues, either by providing workarounds or by delivering configuration or software updates to address an issue. The issues that they investigate come from three sources: customers, monitoring, or the OCEs themselves. OCEs also participate in the RCA process to produce bug fixes or additional system monitoring and logging.

Who is it?

The OCE is typically an engineer of some kind who specializes in a particular area that supports a critical function of the service, its features, or its supporting infrastructure.

What do they do?

OCEs debug the issue that's causing the incident. They serve in an on-call rotation for their component area, and they troubleshoot and fix issues to restore service.

Other Roles

Companies may have other supporting roles during an incident. For example, support and account teams ensure that customers have direct support during an incident, and they may also help relay important information to the incident conference call (sometimes called a bridge).

Roles and Responsibilities Matrix

A roles and responsibilities matrix is a great tool for establishing and communicating the basic expectations of each of the roles involved in the incident management or incident response process. Below is an example of a roles and responsibilities matrix using the roles that we suggested above.

Role Responsibilities
Incident Manager (IM)
  • Owns the incident from start to finish, including directing people and making final decisions
  • Tracks action items and their owners
  • Engages other required teams and resources to resolve the issue as quickly as possible
  • Tracks and completes the postmortem process
Communications Manager (CM)
  • Develops and sends customer-facing communications
  • Develops and sends internal-facing communications
  • Keeps a timeline of the incident
On-Call Engineer (OCE)
  • Diagnoses the issue
  • Determines what other resources are required
  • Develops and implements a fix for the issue
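
Because each of these roles needs round-the-clock coverage, it can also help to treat the matrix and your on-call schedules as data and check them against each other. The sketch below is a minimal illustration that assumes you can export shifts from your paging tool as simple (role, start, end) records; the role names mirror the matrix above, and the export format and helper function are assumptions, not any particular tool's API.

from datetime import datetime

ROLES = ["Incident Manager", "Communications Manager", "On-Call Engineer"]

# Example shift export: (role, shift start, shift end). The real format will
# depend on the paging or scheduling tool you use.
shifts = [
    ("Incident Manager", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 12, 0)),
    ("Incident Manager", datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 0, 0)),
    ("Communications Manager", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 2, 0, 0)),
    ("On-Call Engineer", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 2, 0, 0)),
]

def coverage_gaps(role, shifts, window_start, window_end):
    # Return (start, end) windows where no one is on call for the given role.
    intervals = sorted((start, end) for r, start, end in shifts if r == role)
    gaps, cursor = [], window_start
    for start, end in intervals:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps

for role in ROLES:
    gaps = coverage_gaps(role, shifts, datetime(2024, 1, 1), datetime(2024, 1, 8))
    if gaps:
        print(f"{role} is uncovered during: {gaps}")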

Expectations

Clearly communicate the expected behaviors during an incident to your team and to all of the partner teams that will participate at any point during the incident management process. Below are some common behaviors to communicate internally; these best practices can help everyone collaborate on an incident call and maintain focus on driving the incident toward service restoration:

Effectively Using Technology

Get familiar with incident tools and processes before an incident occurs

Keep the call distraction- and noise-free

Make use of the chat or IM window

Make use of any whiteboard features

Internal conference bridges are for company employees only

Bridge Courtesy & Professionalism

Be clear and concise

Practice active listening

We're all on the same team

Incident Investigation and Coordination

Assign tasks and set clear expectations

Always approach incident bridges with a sense of urgency

Training

Drills / Practice


Process

A clear, concrete, and repeatable incident management process that produces the same outcomes no matter who is leading it, where it's being led, or when it's being led is a key component of a successful incident response program. Have a plan and execute it. An incident is not the time to change how incidents are run (whenever reasonably possible); you can think about how to run incidents better in the future as part of the problem management process.

Process and Documentation

Incident Severity Matrix

Question Bank

Having a question bank that your Incident Managers can reference helps establish a repeatable pattern for understanding, diagnosing, and developing solutions for major incidents. Question banks don't just help Incident Managers and on-call engineers from a technical perspective; they also help ensure that everyone who's participating in the incident process can take the technical information and translate it into something that customers and other (potentially non-technical) stakeholders will understand. Use the following question bank as an example of something that you can implement on your own. This incident question bank is a table that describes the categories of questions that you can ask (e.g., status, impact, root cause, etc.) and when in the incident you should start asking those questions (e.g., during the investigating or service restoration phases). A minimal sketch of how such a question bank might be represented in code follows the table.

Phases (when to start asking): Investigating, Service Degradation, Restoring Service, Service Restored
Status
  • What service-side problem or alert do we see? How is it different from normal behavior?
  • What actions have we taken so far, even if we ruled them out as a problem?
  • What are we doing next (e.g. gathering or analyzing additional logs, developing and testing a fix, restarting system services, etc.)?
  • What action did we take to initiate recovery?
  • Will customers need to take any action, or is there anything they can do to speed up recovery?
  • Did we take any other actions to complete the recovery process?
Impact
  • What service is affected?
  • Which region / data center / infrastructure unit is affected?
  • Or, can an exact list of customer IDs be quickly obtained?
  • Is there a specific error message customers will see?
  • Is the entire service affected or only specific features?
  • Is impact specific to certain types of customer configurations (e.g., release version)?
  • Is impact intermittent in nature or does it occur 100% of the time?
Workaround
  • Are there any steps customers can take to self-resolve the issue, or are there alternative features customers could utilize to accomplish the same tasks?
Start/End Time
  • When was the first alert?
  • When did customers first notice impact or when was the first support case raised?
  • When would the last impacted customer no longer experience impact?
Next Update / Estimated Time to Resolve
  • How long do we expect the current action to take?
  • When do we expect to be able to share another update with customers?
  • Will we have the current action completed within the next hour (or two hours or three hours)?
  • Is there an ETA?
  • If not, what is a ballpark expectation for this scenario (e.g., 2 hours, 2 days, 2 weeks)? Take that time and provide a buffer for your next update to stakeholders.
Preliminary Root Cause / Next Steps
  • Is there any indication of the preliminary suspected cause, even at just a high level?
  • If update-related, what was the purpose of the update?
  • Are there safeguards in place to prevent this type of problem that didn't execute as expected, or is this a previously unknown failure scenario?
  • If not, what problem did we identify that prompted us to take the recovery actions we did?
  • Do we know of any immediate next steps we'll be taking to prevent this from happening again?
  • Was this caught by monitoring?
  • Was this a testing miss?
Other / Tracking
  • How many customer support cases did we receive during this incident?
  • Are there any additional tickets or bugs associated with this case for tracking?
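
If you keep the question bank next to your incident tooling, it can be represented as data so the right prompts surface automatically as an incident progresses. Below is a minimal sketch in Python; the categories and question text are drawn from the table above, while the start-phase assignments, structure, and helper function are illustrative assumptions to adapt to your own question bank.

PHASES = ["Investigating", "Service Degradation", "Restoring Service", "Service Restored"]

# Each category lists a few example questions and the phase at which to start
# asking them. Only a subset of the full question bank above is shown here.
QUESTION_BANK = {
    "Status": {
        "start_phase": "Investigating",
        "questions": [
            "What service-side problem or alert do we see? How is it different from normal behavior?",
            "What actions have we taken so far, even if we ruled them out as a problem?",
        ],
    },
    "Impact": {
        "start_phase": "Investigating",
        "questions": [
            "What service is affected?",
            "Is the entire service affected or only specific features?",
        ],
    },
    "Preliminary Root Cause / Next Steps": {
        "start_phase": "Restoring Service",
        "questions": [
            "Is there any indication of the preliminary suspected cause, even at just a high level?",
            "Do we know of any immediate next steps we'll be taking to prevent this from happening again?",
        ],
    },
}

def prompts_for(phase: str) -> list[str]:
    # Return every question whose category starts at or before the given phase.
    active = []
    for category, entry in QUESTION_BANK.items():
        if PHASES.index(entry["start_phase"]) <= PHASES.index(phase):
            active.extend(f"[{category}] {q}" for q in entry["questions"])
    return active

print("\n".join(prompts_for("Restoring Service")))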

Technology

Understanding Technology

Understanding Your Company's Technology

Understanding Your Company's Product and Customers

Selecting Your Incident Management and Incident Communications Toolset