How DataDome Uses Postmortems to Learn From Every Incident

Incidents happen in any software system. After each one, DataDome runs a postmortem to find the root cause and prevent the issue from recurring.

Sarah Belghiti

Operations Manager, Site Reliability Engineering

At DataDome, site reliability engineering (SRE) keeps our solution running reliably across several complex, distributed systems. As the operations manager on the SRE team, I have an overview of the changes and features that inevitably lead to incidents and outages as we scale and grow. We believe in learning from every incident through postmortems.

In the 24–48 hours following an incident, all teams have the opportunity to learn by leaning on the expertise of both client-facing and technical teams, embracing a diversity of thought to reach a resolution. In this article, we will explore the following topics:

  • How we figure out the source of the problem, together.
  • The power in cultivating a blameless culture.
  • How these learnings enhance technical expertise within DataDome.

The Foundation of Incident Resolution: Transparency

When an incident occurs, my priority is to mitigate customer impact by resolving the issue. The postmortem is performed after resolution to document the incident and to provide a single source of truth for the root cause, the actions taken, and the lessons learned. Postmortem reports are available to the whole company to help resolve similar incidents.

DataDome provides 24/7 support to all customers via on-call rotations. When an alert is received, the appropriate teams are notified and take action immediately. As more information is uncovered and the root cause is identified, our experts know exactly who to engage internally to reach a solution.

A postmortem report begins with understanding whether the incident was caused by a new deployment or by something less straightforward. Our template document gathers all the information needed to complete the postmortem process, including tickets and logs. Once the incident is fixed, all stakeholders involved know it’s critical to take a step back, review exactly what happened, and take action together to:

  • Improve the stability of infrastructure and software code.
  • Nurture trusting relationships with all customers.

Boris Tréhin, Head of Solutions & Services Enablement, explains that, without postmortems, the customer support and delivery process would be very difficult. Postmortems are often reviewed by client-facing teams, opening the door to proactive discussions with customers directly.

In these discussions, we communicate the exact scope of the incident, paired with tangible short-term and long-term action plans to address any technical issues. Technical stakeholders understand that incidents happen, and our process is designed to keep the same incident from happening again. This is a true testament to the #TeamSpirit and #CustomerCentric mindset BotBusters embody each day, and it fosters the collaborative nature of DataDome’s working environment.

“Both proactive and reactive collaboration between each team involved is key. Because of this, we are able to communicate all documentation to customers with confidence. This allows us to maintain a strong relationship, and for them to continue trusting DataDome’s product.”

–Boris Tréhin, Head of Solutions & Services Enablement

Cultivating a Blameless Culture

A blameless postmortem is fundamental. Every system fails, and the postmortem is an opportunity to learn from those failures so that we do not repeat them. Our process concentrates on pinpointing the underlying factors that led to the event, without pointing fingers at any specific person or group.

Blame can lead people to avoid raising issues for fear of repercussions. Ultimately, we do not care who was responsible for an incident because we are a team: responsibility does not lie with one person.

“People who do nothing, break nothing. We are human; we will make mistakes. It does not interest me to know who made the mistake; I’m focused on understanding why this mistake happened, and how we can avoid it in the future.”

–Jean-Louis Bergamo, Head of Infrastructure

This mindset, set by leadership, fosters a safe environment in which shortcomings can be discussed openly. A blameless culture makes root cause analysis smoother and encourages BotBusters to speak up if something has gone wrong.

“To have expertise in a technology, one must face certain challenges to learn more,” adds Jean-Louis. “Not only do these conversations promote knowledge transfer between teams internally, they also help deepen technical aptitudes.”

Enhancing Our Technical Expertise

We store every postmortem in Notion, including key details such as:

  • Timelines & impacts.
  • Observations & actions during an incident.
  • What was done well.
  • Which processes could be improved.
  • The technical strategy moving forward.
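The fields listed above map naturally onto a small structured record. The following is a minimal sketch in Python, not DataDome's actual Notion template: every class and field name here is an assumption made for illustration. Note that the record deliberately has no field for who made a mistake, in keeping with a blameless process.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEntry:
    at: datetime
    observation: str  # what was seen at this point in the incident
    action: str = ""  # what was done in response, if anything


@dataclass
class Postmortem:
    # All names are hypothetical; they mirror the bullet list, not a real template.
    title: str
    impact: str
    root_cause: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    to_improve: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def duration(self) -> timedelta:
        """Time between the earliest and latest timeline entries."""
        if not self.timeline:
            return timedelta(0)
        times = [entry.at for entry in self.timeline]
        return max(times) - min(times)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, so a reviewer can spot gaps."""
        sections = {
            "impact": self.impact,
            "root_cause": self.root_cause,
            "timeline": self.timeline,
            "went_well": self.went_well,
            "to_improve": self.to_improve,
            "next_steps": self.next_steps,
        }
        return [name for name, value in sections.items() if not value]


# Example with made-up values.
report = Postmortem(title="Example incident", impact="One region", root_cause="")
report.timeline.append(TimelineEntry(datetime(2024, 1, 1, 10, 0), "Alert received"))
report.timeline.append(
    TimelineEntry(datetime(2024, 1, 1, 10, 45), "Traffic restored", "Change reverted")
)
print(report.duration())
print(report.missing_sections())
```

A helper like `missing_sections` is one way to check that a report is complete before it is shared more widely.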

With the root cause identified, we are able to pinpoint improvements in our tech stack, locating areas we can fix or enhance in order to avoid or mitigate future incidents.

Using an Incident to Improve DataDome

In one incident, we made a change to the AWS security group for a region. Unexpectedly, we then stopped receiving traffic entirely for that region. While customers were routed automatically to the closest region, we still needed to fix the issue. We resolved it quickly, and the postmortem then allowed us to locate areas of improvement in our technical processes that could prevent future incidents.
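A region that silently stops receiving traffic is the kind of symptom a simple per-region check can surface. The sketch below is an illustration under stated assumptions, not a description of DataDome's monitoring: the function name, the region names, and the idea of feeding it request counts for a recent time window are all hypothetical. Because automatic rerouting can keep customer-facing checks healthy, a check on each region's own traffic may be a useful complement.

```python
def silent_regions(requests_per_region, expected_regions, min_requests=1):
    """Return the expected regions whose request count in the window is too low.

    requests_per_region: mapping of region name -> requests seen in the window.
    expected_regions: regions that should be serving traffic.
    A region missing from the mapping is treated as having received 0 requests.
    """
    return sorted(
        region
        for region in expected_regions
        if requests_per_region.get(region, 0) < min_requests
    )


# Example with made-up region names and counts.
counts = {"region-a": 1200, "region-b": 0}
print(silent_regions(counts, {"region-a", "region-b", "region-c"}))
# → ['region-b', 'region-c']
```

Treating a missing region as zero matters here: a region that reports nothing at all is exactly the case described above.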

To enhance the DataDome solution and processes, we:

  • Improved the ways we test configuration changes in a staging environment. While testing in staging is standard for code changes, it is sometimes neglected for configuration changes, even though the consequences can be just as serious.
  • Strengthened the monitoring systems we have in place. This helps us identify the root cause of an incident faster.
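One way to test a configuration change before it is applied is to compare the current and proposed rule sets against a list of traffic flows that must keep working. The sketch below assumes a simplified, hypothetical representation of ingress rules (protocol, port range, IPv4 CIDR); it does not call any AWS API and is not a reconstruction of the change involved in the incident above.

```python
import ipaddress
from typing import NamedTuple


class IngressRule(NamedTuple):
    # Simplified, hypothetical rule shape for illustration only.
    protocol: str
    from_port: int
    to_port: int
    cidr: str  # IPv4 CIDR block allowed by this rule


def allows(rule, protocol, port, cidr):
    """True if the rule permits the whole source range `cidr` on `port`."""
    source = ipaddress.ip_network(cidr)
    allowed = ipaddress.ip_network(rule.cidr)
    return (
        rule.protocol == protocol
        and rule.from_port <= port <= rule.to_port
        and source.subnet_of(allowed)
    )


def lost_access(current, proposed, required):
    """Required flows that the current rules allow but the proposed rules do not.

    Each required flow is a (protocol, port, cidr) tuple.
    """
    def covered(rules, flow):
        return any(allows(rule, *flow) for rule in rules)

    return [
        flow for flow in required
        if covered(current, flow) and not covered(proposed, flow)
    ]


# Example: narrowing the source range would cut off public HTTPS traffic.
current = [IngressRule("tcp", 443, 443, "0.0.0.0/0")]
proposed = [IngressRule("tcp", 443, 443, "10.0.0.0/8")]
required = [("tcp", 443, "0.0.0.0/0")]
print(lost_access(current, proposed, required))
# → [('tcp', 443, '0.0.0.0/0')]
```

A non-empty result would be a reason to stop and review the change in staging before rolling it out to a production region.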

Even when incidents affect parts of our solution or infrastructure that already follow industry best practices, postmortems help us learn what may not be working as intended and find ways to keep the incident from happening again.

Conclusion

Incidents occur in any software. What happens after an incident is resolved is key to preventing future problems. DataDome’s postmortem process ensures we review the root cause and solution, as well as find ways to improve our technical procedures moving forward.

We are also transparent with our clients. Communicating incidents and their resolutions nurtures their confidence, and we can do so because we trust our ability to mitigate issues and come out stronger. Our blameless approach to postmortems helps every employee feel empowered to bring up issues without fear of repercussions.

Without assigning blame to any one person or group, postmortems help us find opportunities to grow as a company and improve our bot and fraud solution, making our product one that is built on real experiences and lessons learned.

If you are interested in joining DataDome in our mission to rid the web of fraudulent traffic, we encourage you to check out our open positions. You can also send us your resume as an unsolicited application!