If your company is around long enough, it will encounter a major incident. This is unavoidable. However, there are a number of important measures you can take to minimise the impact on the business and the customer. When you inevitably get that 2am call, you will be glad you did.
Defining a Major Incident
A major incident is defined as an event that has occurred, or will imminently occur, and which has a significant impact on the business or organisation. This impact is often financial, such as a tier-one website going down for an online business, leading to an inability to trade. It also includes events that will have serious legal repercussions or cause reputational or brand damage, such as the inability to pay staff members on time or a major disruption to employees being able to do their jobs.
Incident Management & Escalation
Incident management is the process of correctly and efficiently handling incidents, ensuring that issues are escalated to the correct people in a timely fashion to minimise the impact on the business, and that all stakeholders are kept informed. A post-incident review ensures that the business learns from the incident and can avoid similar incidents in the future.
There are four sections to the MIMP:
- Pre-incident preparation
- Declaring the Incident
- Managing the Incident
- Closing the Incident
Pre-incident Preparation
Pre-incident preparation is vital for efficient incident management. You do not want to be working out who the relevant people are, or how to go about contacting them, at 2am on a Saturday night.
Every part of the business should have a list of the services and business areas it relies on. Summarising this in a dependency matrix will help teams find and notify each other quickly. This list should include both the technical and business teams that will need to be notified. If you are just getting started and are unsure of your dependencies, it can be a good idea to draw up a dependency diagram. This is similar to a process flow diagram but also lists the teams involved in each step.
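To make the idea of a dependency matrix concrete, here is a minimal sketch in Python. The service names, team names and the `depends_on`/`teams` layout are all made up for illustration; your own matrix will look different and may well live in a spreadsheet or an alerting tool rather than in code.

```python
from collections import deque

# Hypothetical dependency matrix: each service lists the services it relies on
# and the technical and business teams to notify when it is impacted.
DEPENDENCY_MATRIX = {
    "checkout": {
        "depends_on": ["payment-api", "web-frontend"],
        "teams": ["Checkout Engineering", "Customer Services"],
    },
    "payment-api": {
        "depends_on": ["database"],
        "teams": ["Payments Engineering", "Finance"],
    },
    "web-frontend": {"depends_on": [], "teams": ["Web Engineering"]},
    "database": {"depends_on": [], "teams": ["Platform Engineering"]},
    "payroll": {"depends_on": ["database"], "teams": ["HR Systems", "Finance"]},
}


def affected_services(failed_service, matrix):
    """Return the failed service plus everything that relies on it,
    directly or indirectly (breadth-first walk over reverse dependencies)."""
    affected = {failed_service}
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for name, entry in matrix.items():
            if current in entry["depends_on"] and name not in affected:
                affected.add(name)
                queue.append(name)
    return affected


def teams_to_notify(failed_service, matrix):
    """Collect the teams for every affected service, sorted for stable output."""
    teams = set()
    for name in affected_services(failed_service, matrix):
        teams.update(matrix[name]["teams"])
    return sorted(teams)


if __name__ == "__main__":
    for team in teams_to_notify("database", DEPENDENCY_MATRIX):
        print(team)
```

In this made-up example a database outage reaches the payment API, payroll and checkout, so both engineering teams and business teams such as Finance end up on the notification list, while the web frontend team does not.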
Each team also needs an up-to-date list of contacts and an escalation path: for example, first contacting the on-call engineer, then escalating to the service owner or head of department if they are unreachable. Communication channels should also be agreed on, though generally a call is the most efficient communication method.
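An escalation path like the one described above can be thought of as an ordered list that you work through until somebody answers. The sketch below assumes a hypothetical `try_contact` function supplied by you (a phone call, a page, whatever channel you have agreed on); the names and roles are invented.

```python
# Hypothetical escalation path for one team, ordered from first to last resort.
ESCALATION_PATH = [
    {"role": "on-call engineer", "name": "Alex"},
    {"role": "service owner", "name": "Sam"},
    {"role": "head of department", "name": "Jo"},
]


def escalate(path, try_contact):
    """Walk the escalation path in order and return the first contact reached.

    `try_contact` is a caller-supplied function that returns True when the
    contact was reached (for example, they picked up the call).
    Returns None if nobody on the path could be reached.
    """
    for contact in path:
        if try_contact(contact):
            return contact
    return None


if __name__ == "__main__":
    # Pretend the on-call engineer is unreachable but the others are available.
    reachable = {"Sam", "Jo"}
    reached = escalate(ESCALATION_PATH, lambda c: c["name"] in reachable)
    print(reached)
```

Returning `None` when nobody answers is a deliberate choice in this sketch: it forces the caller to decide what happens next rather than silently giving up.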
As you can see, there is a lot of information that needs to be kept up to date. This can be challenging if teams are constantly evolving and the engineers on call are constantly rotating. There are a number of applications that can help you keep this information organised; one of the more well-known ones is xMatters (I have no affiliation). However you choose to do it, the important thing is to have a centralised place where teams can easily find information on who to contact in an emergency. This should also be stored somewhere that an incident is unlikely to affect. It is no good having an up-to-date contact list if it is stored on your server and the major incident is that your server has gone down…
Declaring the Incident
A major incident is generally declared by someone senior, such as the Head of Department. However, as they are usually not the first point of contact, it is up to the first point of contact to use their judgement on when to escalate.
This should be done by looking at the overall disruption to the functioning of the business. In general, the event should be escalated if it is in breach of its SLA, or if it has caused or will cause significant:
- Inability to trade.
- Reputational damage.
- Legal ramifications.
- Inability of employees to do their job.
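The checklist above can be written down as a simple decision aid. This is only a sketch: the `Assessment` fields are my own names for the criteria listed, and whether an impact counts as significant is still a judgement call for the person on the ground.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """First responder's judgement of the event (field names are illustrative).

    Each flag covers impact that has already happened or is about to happen.
    """
    sla_breached: bool = False
    inability_to_trade: bool = False
    reputational_damage: bool = False
    legal_ramifications: bool = False
    employees_unable_to_work: bool = False


def should_escalate(assessment):
    """Escalate if the SLA is breached or any significant impact applies."""
    return any([
        assessment.sla_breached,
        assessment.inability_to_trade,
        assessment.reputational_damage,
        assessment.legal_ramifications,
        assessment.employees_unable_to_work,
    ])


if __name__ == "__main__":
    print(should_escalate(Assessment(inability_to_trade=True)))
```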
Managing the Incident
When you are alerted to a major incident, the first thing you should do is work out which teams you need to contact. Use your dependency matrix (see Pre-incident preparation) to identify the technical expertise that you need as well as the business stakeholders that you need to inform.
Start off by getting the technical teams in the same room or on the same call.
Things you are looking to establish are:
- Whether this has happened before and, if so, what the workaround was.
- Whether any changes have been made recently and whether they could have caused it.
- The extent of the systems and services that are affected.
- A list of theories and possible solutions, to be worked through in order of how likely each is to solve the problem.
Once you have an understanding of the extent of the problem, you need to notify the services that are affected by it as well as the business stakeholders. If you are doing this through a conference call, make sure that you keep the business stakeholders separate from the technical stakeholders. Likewise, if you are using an alerting platform, you should send separate notifications to the technical and business stakeholders.
First comms should be sent as soon as possible and need to include a clear impact statement, even if you are unsure of the solution at this stage. This should be communicated in business language, e.g. that customers are currently unable to check out, rather than that the payment API is refusing authentication.
Second comms should be sent 30 to 60 minutes later and give an update on the business impact. You should also list the cause, if it is known, and state the plan to resolve it.
Further comms should be sent every 30 to 60 minutes or at the agreed-upon time. These will contain status updates as the list of solutions is worked through.
Final comms should be sent out once the impact has been mitigated or permanently resolved. This should include the total quantified business impact, e.g. the total number of orders lost, and should also include the next steps to put a permanent fix in place.
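The comms cadence above could be sketched as follows. This is a minimal illustration that defaults to a 30-minute interval; the function names `update_schedule` and `format_update` and the message layout are hypothetical, so substitute whatever cadence and template your business has agreed on.

```python
from datetime import datetime, timedelta


def update_schedule(first_comms, interval_minutes=30, count=3):
    """Return the times of the next `count` updates after the first comms."""
    step = timedelta(minutes=interval_minutes)
    return [first_comms + step * n for n in range(1, count + 1)]


def format_update(sequence, impact, cause=None, plan=None):
    """Build a short update in business language.

    The cause is reported as under investigation until it is known, and the
    plan line is only included once there is a plan to share.
    """
    lines = [
        f"Update {sequence}",
        f"Business impact: {impact}",
        f"Cause: {cause or 'under investigation'}",
    ]
    if plan:
        lines.append(f"Plan: {plan}")
    return "\n".join(lines)


if __name__ == "__main__":
    first = datetime(2024, 1, 6, 2, 0)  # example date: first comms at 2am
    for due in update_schedule(first):
        print(due.strftime("%H:%M"))
    print(format_update(1, "Customers are currently unable to check out"))
```

Note that the impact line is written for the business, not for engineers, which matches the advice on first comms.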
Closing the Incident
Once the incident has been resolved, you should hold a Post Incident Review (PIR) while the events are still fresh, usually within 2 days of the issue being mitigated. This meeting should include everyone who was involved in the incident.
Things you should discuss:
- What preventative measures can we put in place to stop this or similar incidents from happening in the future?
- What measures can we put in place to detect similar incidents early?
- What measures can we put in place to mitigate the impact if this does happen again in the future?
The next steps should be drawn up and owned by the team whose service was the root cause of the incident, and each must include a time frame for completion. The next steps and their progress should be sent to all those involved in the incident to keep them informed.
Other important points
- Even if you are not the root cause, it is your responsibility to manage and communicate the incident until the relevant Head of Department can be found. It is also your responsibility to inform the stakeholders who rely on your product or service.
- You should not make code changes on the fly unless it is absolutely necessary. Code deployed into production during an incident may not have been tested thoroughly and may cause more issues than it fixes.
- The incident can be declared resolved once you have put a workaround in place and have some measures to stop it recurring. A permanent fix can be engineered later as part of the next steps.
- During an incident, record all actions that you take and share them with everyone on the call so that they can follow events as they happen.
- This is an excellent time to see if your automated alerting or failover systems actually worked as you expected them to. Did your monitoring software detect the problem and alert the relevant teams, or was it discovered only when it started to affect the business?
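For the point above about recording actions, even a very simple timestamped log goes a long way, and it gives you a ready-made timeline for the PIR. The sketch below is one possible shape, not a recommendation of any particular tool; the `IncidentLog` class and its methods are invented for illustration.

```python
from datetime import datetime, timezone


class IncidentLog:
    """Append-only record of the actions taken during an incident."""

    def __init__(self):
        self.entries = []

    def record(self, who, action, when=None):
        """Add an entry, timestamped now (UTC) unless `when` is given."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, who, action))

    def timeline(self):
        """Return the entries in time order, one line each, ready to share."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{when:%H:%M} {who}: {action}" for when, who, action in ordered
        )


if __name__ == "__main__":
    log = IncidentLog()
    log.record("Alex", "Rolled back latest release")
    log.record("Sam", "Confirmed checkout is working again")
    print(log.timeline())
```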