Incident management at Aircall

Joel Vaz • 9 Minutes

Summary
Introduction
Aircall Incident Management Process
Architectural Decision
Inside the automation document
Integration With Slack
Cost Analysis
The Path Forward
Conclusion
Further Reading

Summary

At Aircall, our customers are at the heart of everything we do. We understand that in the world of tech, even with the best intentions and robust systems, things can occasionally go wrong. Our commitment is to ensure that when issues arise, they are resolved swiftly and with minimal impact on our customers.

For this purpose, we've developed a robust, custom solution for handling incidents. This solution leverages our deep knowledge of serverless architecture and our creativity, allowing us to act quickly and efficiently when problems arise. Everyone knows their role, communication is clear and effective, and customer impact is minimized. We gather essential metrics from each incident, maintain a culture of blameless accountability, and disseminate lessons learned transparently among all stakeholders.

We call this tool the Aircall Incident Management Tool. Admittedly, we spent most of our creative energy on its development and architecture, rather than on naming it.

Introduction

In the world of technology, ensuring seamless functionality for our customers is a constant challenge. As with any technology since the invention of the wheel, things don't always go as planned. At Aircall, we embrace this reality. What sets us apart is our proactive approach to handling these inevitable hiccups. We focus on improving our methods to manage uncertainty, minimize client impact, and learn from every incident.

Our journey began with a straightforward yet manual incident management process. This involved numerous custom actions and a pre-defined response plan. However, as our operations scaled, it became clear that this manual approach was insufficient. Steps were frequently skipped, and communication became fragmented across multiple channels. This not only extended incident response times but also delayed crucial communications between our engineers, support teams, and clients.

Realizing the need for a more efficient system, we explored various industry solutions. Our goal was to find an approach that adhered to our requirements, maintained our customer-centric focus, and was cost-effective. After careful consideration, we decided to build a custom in-house solution. This solution leverages AWS Systems Manager (SSM) for incident management, integrates our home-grown Slack Bot (Airbot), and taps into our extensive expertise in serverless computing within the cloud.

Today, we proudly use a tailored solution that prioritizes simplicity, rapid incident resolution, and continuous learning. Our incident management tool enables seamless incident reporting, ensuring visibility for all engineering, management, and customer support teams. It streamlines the incident management process, allowing us to concentrate on resolving issues and implementing follow-up actions.

This blog post will take you through the journey of developing our incident management tool, highlighting the challenges we faced, the solutions we implemented, and the key lessons we learned along the way.

Aircall Incident Management Process

Initially, incident management at Aircall relied on a pre-defined response plan made up of many manual, custom actions:

Report the incident on the communication platform engineering channel.
Create a new Slack channel for the incident with the appropriate title.
Invite required support teams, engineers and managers.
Create a voice communication channel, i.e. a Zoom meeting.
Resolve the incident.
Create follow-up items on the Agile board.
Create the post-mortem document.
Manually compute downtime and write up the lessons learned.

The response plan was entirely manual, and during incidents, steps were often skipped. Communication occurred across multiple channels, diverging from the centralized approach outlined in the response plan. This led to prolonged incident response times and delays in communication between engineers, support teams, and clients.

Recognizing the need for change, we evaluated various industry options to find a solution that met our requirements and customer-centric approach while being cost-effective. After thorough discussions, we chose to develop a custom in-house solution. This solution leverages AWS Systems Manager (SSM) for incident management, our proprietary Slack Bot (Airbot), and our extensive knowledge of serverless computing within the cloud.

This brings us to today. We now have a custom-made solution emphasizing simplicity, fast incident resolution, and a strong foundation of learned lessons from each incident. Our tool allows anyone to report an incident with visibility for all engineering, management, and customer support teams. It shifts the effort from managing the incident to resolving it and addressing follow-up items.

Requirements	Implementation
Fully automated and customizable response plans	Incident response plans managed via code and deployed on AWS; CI/CD deployment of response plans; metrics are automatically acquired and stored; fault-proof design with multi-region enabled
Simple and fast to use	Easy Slack integration that allows seamless usage by both technical and non-technical colleagues; incident resources created fast and in parallel tasks (Slack channel, Zoom room, Jira tickets, …); responders added to the communication channel automatically
Company-wide visibility of incidents	The incident management tool is used in a public Slack channel and provides feedback across the company once an incident is started
Follow-up actions management	The tool automates the creation of the post-mortem document, and follow-up items are easy to create and associate with the post-mortem; uptime computation is performed automatically

Below we dive into the technical implementation, the challenges we faced, a cost analysis, and the lessons we learned. We also give a sneak peek at planned improvements to the tool.

Architectural Decision

As described in the requirements above, the incident management tool is implemented using a serverless architecture within the AWS ecosystem, together with Confluence and Slack. The tool is composed of independent services that work in parallel, with no single point of failure. The diagrams below describe our current infrastructure setup for the incident management tool.

Start Incident Infrastructure

Resolve Incident Infrastructure

The start incident workflow is deployed in multiple regions, so that if services in our main AWS region are down, an incident can still be started from a fallback region.
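A minimal sketch of the fallback idea, not the production implementation: the caller supplies an ordered list of regions and a function that starts the incident in one region. The names `start_incident_with_fallback`, `start_in_region`, and the region labels are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when the incident could not be started in any region."""


def start_incident_with_fallback(
    start_in_region: Callable[[str], str],
    regions: Sequence[str],
) -> Tuple[str, str]:
    """Try each region in order; return (region, incident_id) for the first success."""
    errors: Dict[str, str] = {}
    for region in regions:
        try:
            return region, start_in_region(region)
        except Exception as exc:  # any failure moves on to the next region
            errors[region] = str(exc)
    raise AllRegionsFailed(f"could not start the incident in any region: {errors}")


# Usage with a fake starter whose primary region is unavailable.
def fake_start(region: str) -> str:
    if region == "primary":
        raise RuntimeError("region unavailable")
    return f"incident-in-{region}"


used_region, incident_id = start_incident_with_fallback(fake_start, ["primary", "fallback"])
```

In a real deployment `start_in_region` would wrap whatever call starts the automation in that region; it is injected here so the sketch stays independent of any SDK.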

The following sections describe the automation document steps used in our custom incident management process and, finally, how we integrated the tool into our own Slack bot for ease of use.

Inside the automation document

We use an automation document to start the different steps of our incident resolution plan. Each step creates a set of resources that our teams need to collaborate. We use a mix of single, isolated Lambda functions and a Step Functions workflow to run the different automations. For state storage we currently use the AWS SSM incident management tool, but we are in the process of replacing it with DynamoDB; most of the incident state and resources are already managed in DynamoDB. The diagram below shows the infrastructure for the incident automation process.

Start Incident Workflow Infrastructure

Notice that all steps are non-blocking: if one step fails during an API call to its service, the remaining resources can still be created.
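The non-blocking behaviour can be illustrated with a small sketch. It uses a thread pool as a stand-in for the Lambda functions and Step Functions workflow, and the step names and return values are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_steps(steps: Dict[str, Callable[[], str]]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run every step in parallel and return (created, failed), keyed by step name."""
    created: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        futures = {name: pool.submit(step) for name, step in steps.items()}
        for name, future in futures.items():
            try:
                created[name] = future.result()
            except Exception as exc:  # a failed step must not block the others
                failed[name] = str(exc)
    return created, failed


def create_zoom_room() -> str:
    raise RuntimeError("Zoom API unavailable")  # simulated failure


created, failed = run_steps({
    "slack_channel": lambda: "#incident-example",
    "zoom_room": create_zoom_room,
    "jira_ticket": lambda: "TICKET-1",
})
```

Here the Zoom step fails, yet the Slack channel and the Jira ticket are still reported as created, and the failure is recorded rather than raised.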

Above we shared a diagram of the start incident workflow. The resolve incident workflow follows similar logic: a list of steps is performed to mark the incident as resolved with company-wide visibility and to ensure that the required follow-up actions are automated accordingly.

Overview of the incident resolution workflow

Finally, we go over our Slack integration, where our Slack bot listens to events, assumes a role on AWS, and performs the input steps for incident management.

Integration With Slack

This tool works in close collaboration with our self-made Slack bot. The bot listens to events and can trigger responses on our cloud platform as well as make API calls to some of our services. We will not deep dive into our Airbot configuration here, since that would take another technical blog post, but it allows our tool to be used seamlessly within a Slack channel while keeping our infrastructure secrets away from the Slack API.

Airbot - Aircall's custom Slack bot icon

The figure below shows an example of the custom Slack bot's response to the incident start action.

Airbot Slack response to an incident start action. Notice the URL to join the newly created Slack channel and the message for company-wide visibility.

The tool is triggered through simple Slack Workflows in which a user provides two pieces of information:

Incident Title - meaningful title that describes in short words the incident.
Incident level of impact - based on the priority matrix below.

Severity	Description
1	Major service disruption. System outage affecting a significant number of users. Service continuously unavailable to log in or establish phone calls, with no workaround.
2	Key functionality impaired. Issues affect key functionality and/or cause substantial performance disruption in Customer’s use. No workaround is available.

The Slack workflows can be seen in the figure below. We currently have two: one to start an incident and one to resolve it. The resolve incident workflow automatically searches for currently active incidents. If only one is found, that incident is marked as resolved. If multiple incidents are found, a list of them is printed on Slack with the command to close each one; the user pastes the corresponding command and sends it as a Slack message, which is captured by our Slack bot, and the incident is resolved automatically.

Figure with the list of possible workflows that trigger the incident automation
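The decision made by the resolve workflow can be sketched as follows. This is an illustration under assumptions: the `Incident` type, the `resolve-incident` command format, and the message shown when no incident is active are all hypothetical and not taken from Airbot.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    incident_id: str
    title: str


def plan_resolution(active: List[Incident]) -> str:
    """Return the Slack message for a resolve request, given the active incidents."""
    if not active:
        # Assumed behaviour: nothing to resolve.
        return "No active incidents found."
    if len(active) == 1:
        # Exactly one active incident: resolve it directly.
        return f"Incident '{active[0].title}' marked as resolved."
    # Several active incidents: list one command per incident for the user to paste.
    lines = ["Multiple active incidents found. Paste one of these commands to resolve:"]
    lines += [f"resolve-incident {i.incident_id}  ({i.title})" for i in active]
    return "\n".join(lines)


message = plan_resolution([
    Incident("INC-1", "Calls failing to connect"),
    Incident("INC-2", "Dashboard login errors"),
])
```

Returning a message rather than acting directly keeps the sketch side-effect free; the bot would perform the actual resolution once it receives the pasted command.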

Additionally, once a workflow is started, a small input box appears for the user to fill in the data described above.

Start incident workflow prompt where the user is required to fill in the data.

It was essential that the tool have safety mechanisms so that incidents are not started by mistake, and that we have visibility on who performs each action. For this, a confirm button is shown after the workflow is submitted.

A message appears after the workflow is submitted asking for additional confirmation. At this point, everyone can already see that something is going on.

After confirming the incident has started and resources are created, we can consider the incident resolved once the client-level impact is mitigated. A user then needs to initiate the "Resolve Incident" workflow, which informs us of the resolution and automatically gathers incident metrics. This triggers the start of the follow-up actions:

A transcript of the Slack channel history is attached to the post-mortem document.
The Jira task associated with the incident is automatically filled in, with labels such as:
- Incident duration.
- Incident origin (internal / third-party provider).
The incident Slack channel is automatically archived once 30 days have passed since the channel was created without any messages being sent on it.
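Two of the follow-up actions above reduce to simple date arithmetic. The sketch below shows one way to compute a duration label and an archive check. The label format is an assumption, and `should_archive` takes one reading of the 30-day rule: 30 days since the last message, or since creation if the channel never received one.

```python
from datetime import datetime, timedelta
from typing import Optional

ARCHIVE_AFTER = timedelta(days=30)


def incident_duration_label(started_at: datetime, resolved_at: datetime) -> str:
    """Format the incident duration as hours and minutes for a ticket label."""
    total_minutes = int((resolved_at - started_at).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


def should_archive(created_at: datetime, last_message_at: Optional[datetime], now: datetime) -> bool:
    """True once 30 days have passed with no activity in the incident channel."""
    last_activity = last_message_at or created_at
    return now - last_activity >= ARCHIVE_AFTER


label = incident_duration_label(
    datetime(2024, 1, 8, 10, 0),
    datetime(2024, 1, 8, 11, 25),
)
```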

In order to guarantee that incidents are resolved promptly we send a reminder message every hour if there are no new communications on the incident Slack channel. This message is only sent during work hours.
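The reminder rule can be expressed as a small predicate. The work-hour window (weekdays, 09:00 to 18:00) is an assumption made for this sketch, since the exact hours are not stated here.

```python
from datetime import datetime, time, timedelta

WORK_START = time(9, 0)    # assumed start of work hours
WORK_END = time(18, 0)     # assumed end of work hours
REMINDER_INTERVAL = timedelta(hours=1)


def should_send_reminder(last_message_at: datetime, now: datetime) -> bool:
    """True if the channel has been quiet for an hour and it is currently work hours."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    in_work_hours = WORK_START <= now.time() < WORK_END
    quiet_for_an_hour = now - last_message_at >= REMINDER_INTERVAL
    return is_weekday and in_work_hours and quiet_for_an_hour
```

A scheduled job could evaluate this predicate for every open incident channel and post the reminder when it returns true.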

Cost Analysis

Cost was an essential requirement when choosing and designing the solution. There are many out-of-the-box solutions on the market, but they would lock us into a monthly subscription plan, something it was essential for us to avoid. Although managed solutions have their advantages, in this particular case a custom-made solution with a small development effort proved beneficial in the long run, especially since serverless know-how is part of our company DNA.

The Path Forward

As with all the other tools we provide, we engage with our clients and with the engineering and support teams to gather usage feedback and define future improvements to the incident automation tool. The next improvement is to remove the AWS incident automation tool from our infra: it is used only for visibility and to store the state of the incident, and this is easily replaced with DynamoDB.

On the tool development backlog we have additional ideas and possible increments such as:

Self Mitigation / Self Healing - Since the tool is tailored to our needs, we can use it to automatically manage and resolve known infra issues. This is a huge topic for us and one of the key objectives for the future of the tool. We can use our robust monitoring platform to quickly detect patterns and apply a pre-defined solution. A few examples that we have planned:
- Restart/flush an EC2 instance.
- Clear cache from a specific service.
- Switch network traffic.
- Restart a service.
Additional features on the custom response plan - The possibility to automatically trigger alerts for management during a P1 incident, and additional post-mortem automation to fill in more fields of the document.
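The self-mitigation idea of matching a detected pattern to a pre-defined solution could take the shape of a simple lookup table. This is only a sketch of a planned feature: the alert names and the placeholder actions are invented, and real actions would call the relevant infrastructure APIs.

```python
from typing import Callable, Dict, Optional


def restart_service(target: str) -> str:
    return f"restarted {target}"  # placeholder for a real restart call


def clear_cache(target: str) -> str:
    return f"cleared cache for {target}"  # placeholder for a real cache flush


# Hypothetical mapping from a detected alert pattern to its pre-defined remediation.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {
    "service_unresponsive": restart_service,
    "stale_cache": clear_cache,
}


def remediate(alert_name: str, target: str) -> Optional[str]:
    """Apply the pre-defined remediation for a known alert, or return None if unknown."""
    action = REMEDIATIONS.get(alert_name)
    if action is None:
        return None  # unknown pattern: leave it to the on-call engineers
    return action(target)
```

Keeping the mapping as data means a new known issue can be covered by adding one entry, while anything unrecognised still goes to a human.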

This is just a glimpse of the many features we have envisioned for the incident management tool.

Conclusion

Ensuring the adoption of our incident management tool was crucial. A tool adds no value if our teams don’t know how to use it. Therefore, our priority was to keep it as simple as possible and to provide comprehensive training through live demonstrations, detailed documentation, and interactive learning modules. It was gratifying to see how quickly both technical and non-technical personnel adapted to using the tool.

This success motivates us to continue enhancing our internal tools and automating our processes. As Site Reliability Engineers (SREs), we are passionate about automation and constantly seek opportunities to streamline our workflows. Sustainability and cost-efficiency are also key factors in all our decisions. By leveraging serverless architecture and optimising our resources, we ensure that our solutions are not only effective but also sustainable and cost-effective.

Our customer obsession at Aircall drives us to continually add value for our clients. By improving our internal processes and tools, we ensure that we can deliver the best possible experience to our customers, even when challenges arise. Our commitment to sustainability means that we are not only focused on immediate solutions but also on long-term impacts and efficiencies.

The journey with the Aircall Incident Management Tool exemplifies our commitment to innovation, sustainability, and customer satisfaction. We look forward to continuing this journey, always striving to exceed our customers' expectations while maintaining a responsible and cost-effective approach to our technological advancements.
