Meet Our Head of Research: Q&A With Antoine Vastel, PhD

Antoine Vastel, PhD, Head of Research at DataDome, discusses threat research & what it takes to analyze, scale, & automate threat insights.

When someone attacks you, they don’t play within the rules, and they don’t care what university you went to. So it’s good to have a diversity of skills to ensure that we always have what it takes to stay ahead of bots.

Antoine Vastel, PhD, Head of Research at DataDome

As Head of Research, Antoine analyzes customer inquiries and alerts from internal monitoring systems, ensuring the global DataDome team stays up to date on new findings and suspicious activity. He also works directly with customers to prepare for flash sales and limited edition releases. Fun facts: Antoine builds his own (good) bots—for research, of course—and in his spare time, he writes code that even competitors can’t wait to feature in their products!

Q: What does your typical workday look like?

A: On a typical day, I’ll spend a good deal of time reviewing inquiries from DataDome’s customers and alerts from our internal monitoring systems, with the goal of finding new ways to improve our bot detection capabilities. We monitor a wide range of signals linked to potentially suspicious traffic, and whenever something looks suspicious, we take a closer look.

For example, an American website may have legitimate traffic from Europe, but if the volume of European traffic grows sharply for no apparent reason, we’ll dig into the context and look for recurring patterns to make sure we aren’t missing something.
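To make that concrete, here’s a minimal sketch of the kind of baseline check such an alert could rest on. Everything here is illustrative: the data shapes and the threshold are invented, not DataDome’s actual alerting logic.

```ts
// Hypothetical sketch: flag a region whose current traffic volume sits far
// above its recent baseline. Shapes and threshold are illustrative only.
interface RegionTraffic {
  region: string;         // e.g. "EU" traffic hitting a US website
  hourlyCounts: number[]; // request counts for the trailing hours
  currentCount: number;   // request count for the current hour
}

function isSuspiciousSpike(t: RegionTraffic, sigmas = 4): boolean {
  const n = t.hourlyCounts.length;
  const mean = t.hourlyCounts.reduce((a, b) => a + b, 0) / n;
  const variance =
    t.hourlyCounts.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
  // Flag when the current hour deviates strongly from the baseline.
  return t.currentCount > mean + sigmas * Math.sqrt(variance);
}
```

A spike alone isn’t a verdict; it’s the trigger for the manual review described above.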

Our aim is always to use these customer requests and internal alerts as input to improve our bot detection engine for the long term. The goal isn’t just to close a ticket or make an alert go away, but to ensure that all input is properly processed and taken into account.

As head of a distributed global team, I must also ensure that we communicate and share our knowledge with each other. For example, when we create machine learning models, the goal is to automate detection at scale and solve specific issues.

When our data analysts share the problems they encounter, it helps the data scientists understand what’s valuable to automate. In the same way, it’s useful for the threat researchers to know the limits of our current detection. If someone discovers a new way to properly forge a lot of attributes, we know that we should invest time in improving the detection of this specific problem.

Finally, I work directly with many of our customers to prepare for flash sales or releases of limited-edition products. For this type of event, we have a flash sales protection mode with more aggressive behavioral detection models to ensure that only humans participate, and we can also manually monitor the traffic for the duration of the sales.

Q: How do you and your team go about your threat research?

A: We have different ways to stay ahead of bots. All the requests and signals that we collect for our customers are stored in a big, in-house Elasticsearch cluster.

After a bot attack, we can access the Elasticsearch data either with Kibana or a Python script to analyze how our detection performed. Could we have improved it by fine-tuning our models, or by adjusting some signatures?
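As an illustration, a post-attack review could start with an aggregation like the one below, written against the official Elasticsearch JavaScript client. The index and field names are invented for the example and aren’t DataDome’s actual schema.

```ts
import { Client } from '@elastic/elasticsearch';

// Placeholder node URL; a real cluster would be internal.
const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical index and field names: pull the requests seen during an
// attack window and break them down by how the engine classified them.
async function reviewAttackWindow(site: string, from: string, to: string) {
  const result = await client.search({
    index: 'traffic-events', // invented index name
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { site } }, // invented field
          { range: { '@timestamp': { gte: from, lte: to } } },
        ],
      },
    },
    aggs: {
      verdicts: { terms: { field: 'detection.verdict' } }, // invented field
    },
  });
  console.log(result.aggregations);
}
```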

Sometimes, when we look at the data, we’ll realize that we need new signals to optimize the detection. If the attackers used an open-source tool, we can just download it, play around with it, analyze the code, and watch what happens when it interacts with a page protected by DataDome.

It’s the same for paid products, like scalper bots: We can buy them, analyze them, figure out their modus operandi and how they work, and try to come up with new ways to detect them based on either their signature or their behavior.

Then there are proxies. We subscribe to plenty of proxy providers to make sure we detect their IP addresses properly. If an IP address has been used by a proxy provider, we use that information to train our machine learning models and monitor their performance.
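To illustrate the idea, here’s a hypothetical sketch of how such a signal could be turned into a model feature; the names, shapes, and example addresses are all invented.

```ts
// Hypothetical feature extraction: mark whether an IP has ever been
// observed at a proxy provider we subscribe to. Not DataDome's pipeline.
const knownProxyIPs = new Set<string>([
  '203.0.113.7',    // example addresses (documentation ranges), stand-ins
  '198.51.100.23',  // for IPs collected via provider subscriptions
]);

interface RequestFeatures {
  seenAsProxy: number;       // 1 if the IP has appeared at a proxy provider
  requestsPerMinute: number; // example of an additional behavioral feature
}

function extractFeatures(ip: string, requestsPerMinute: number): RequestFeatures {
  return {
    seenAsProxy: knownProxyIPs.has(ip) ? 1 : 0,
    requestsPerMinute,
  };
}
```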

To detect bots, it’s also important to understand the attackers’ mindset, so yes—I also create and use bots, in a respectful way of course!

An attacker always has a goal, usually to make money. So I try to think in the same way: If I have x dollars to invest, what can I do to make a profit as quickly as possible? When I take the cost parameter into account, what’s the best direction and how do I adapt my strategy?

The point is really to stay on top of the tools attackers use and ensure that we detect them; if we don’t, our analysts and researchers look for new signals or tune the current detection models.

Q: How did you get into the threat research field?

A: My PhD research initially focused on privacy topics. I was studying browser fingerprinting, and I crawled the top 1 million Alexa websites to identify trackers, collect script execution traces, and so on. At the time, if you wanted to create a bot, you would either make HTTP requests with a Python script or use PhantomJS, which worked, but wasn’t very advanced.

Then Google released Headless Chrome, and everything became simple: it was suddenly easy and straightforward to make a realistic bot. So, I continued to scrape websites to collect information for my privacy research. But at one point, I was blocked because I wasn’t really hiding my fingerprint.

When I tried to understand exactly what triggered the blocking, I realized that there were detection solutions available for programs like Selenium and PhantomJS, but there was nothing for Headless Chrome because it was brand new. So, I started to reverse engineer some code here and there, and I also started a GitHub repository, called Headless Cat & Mouse, together with someone from Google’s Headless Chrome team.
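For context, the early checks that distinguished Headless Chrome looked roughly like the probes below. This is a simplified sketch of well-known, publicly documented signals, each of which can be forged; it is not DataDome’s detection script.

```ts
// Simplified browser-side probes of the kind publicly documented at the
// time. Every one of these signals can be forged, which is exactly the
// cat-and-mouse game described above.
function headlessHints(): string[] {
  const hints: string[] = [];
  // Automation flag exposed when the browser is driven by a webdriver.
  if (navigator.webdriver) hints.push('navigator.webdriver is true');
  // Early Headless Chrome did not expose the window.chrome object.
  if (!('chrome' in window)) hints.push('window.chrome is missing');
  // Headless environments often report no plugins and no languages.
  if (navigator.plugins.length === 0) hints.push('no plugins');
  if (navigator.languages.length === 0) hints.push('no languages');
  return hints;
}
```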

I really liked this work, and eventually my PhD research shifted toward bot detection. It also made sense because browser fingerprinting is less used for tracking now, since browser vendors have done a lot of work to reduce the entropy of attributes. However, fingerprinting is still really valuable for detecting bots, both server-side and client-side.

So, I kept all my knowledge of fingerprinting for tracking, but now I apply it to bot detection and protecting websites and apps against attackers.

Fun fact: A couple of years ago, Antoine wrote an open-source implementation of a canvas fingerprinting algorithm inspired by Elie Bursztein’s Picasso paper. His code is now used by other companies, including Discord, in their bot detection scripts.
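The general technique behind that kind of canvas fingerprinting is easy to sketch: render a fixed scene, then hash the pixels, so that small differences in GPU, drivers, and font rendering surface in the hash. The snippet below is a bare-bones illustration of that idea, not the Picasso implementation itself.

```ts
// Bare-bones canvas fingerprint: draw a fixed scene, then hash the pixels.
// Differences in GPU, drivers, and font rendering yield different hashes.
async function canvasFingerprint(): Promise<string> {
  const canvas = document.createElement('canvas');
  canvas.width = 200;
  canvas.height = 50;
  const ctx = canvas.getContext('2d')!;
  ctx.fillStyle = '#f60';
  ctx.fillRect(0, 0, 200, 50);
  ctx.font = '16px Arial';
  ctx.fillStyle = '#069';
  ctx.fillText('fingerprint probe', 4, 30);
  ctx.beginPath();
  ctx.arc(160, 25, 15, 0, Math.PI * 2);
  ctx.strokeStyle = 'rgba(102, 0, 102, 0.7)';
  ctx.stroke();
  // Hash the serialized pixels with the Web Crypto API (secure contexts).
  const bytes = new TextEncoder().encode(canvas.toDataURL());
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}
```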

Q: How have you seen automated threats change since you started?

A: As I just mentioned, at the time of PhantomJS, it was difficult to make headless browsers realistic. There were plenty of open-source tests to detect it, and detection was quite easy. That all changed with Headless Chrome and automation libraries like Puppeteer.

These bots have consistent HTTP headers (because they are browsers), consistent TLS fingerprints, and so on, and you really need expert knowledge to detect them.

Another evolution is the availability of open-source libraries like Puppeteer Extra Stealth. You can just install the package, and it will take care of forging your fingerprint very realistically. This has made it a lot easier to create realistic bots, and a lot more difficult to detect them.
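To give a sense of how low the barrier has become, the documented usage of the real puppeteer-extra and puppeteer-extra-plugin-stealth packages boils down to a few lines; the target URL here is a placeholder.

```ts
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// One line enables a battery of fingerprint-forging evasions.
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com'); // placeholder target
  // Naive headless checks (navigator.webdriver, missing window.chrome,
  // empty plugins...) now come back looking like a regular browser.
  await page.screenshot({ path: 'page.png' });
  await browser.close();
})();
```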

Some companies even offer bots “as-a-service”, especially for scraping and scalping purposes. The users don’t need to be bot developers at all, because the service takes care of forging the fingerprint, rotating user agents and proxies, etc., and they only pay for successful requests.

There’s a new economy around scraper bots that didn’t exist when I started, with a lot of money to be made for bot developers.

Another emerging trend is that, ironically, there are more and more humans in the loop. Attackers use bots to automate part of the process, but humans will step in to help the bots bypass detection at some point.

CAPTCHA farms are a typical example. When I joined DataDome, it was still rather uncommon for bots to bypass CAPTCHAs. Now, it’s frequent. Libraries in the Puppeteer Extra ecosystem even offer plugins that interact with CAPTCHA farms, so if an attacker doesn’t know how to bypass a CAPTCHA themselves, they can just use a plugin to hand it off to a solving service.
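For example, the puppeteer-extra ecosystem’s recaptcha plugin wires a page into a third-party solving service in a few lines, per its documented usage; the API key and target URL below are placeholders.

```ts
import puppeteer from 'puppeteer-extra';
import RecaptchaPlugin from 'puppeteer-extra-plugin-recaptcha';

// The plugin forwards CAPTCHAs found on the page to a paid solving
// service and injects the returned token.
puppeteer.use(
  RecaptchaPlugin({
    provider: { id: '2captcha', token: 'API_KEY_PLACEHOLDER' },
  })
);

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com/login'); // placeholder target
  await page.solveRecaptchas(); // detect, submit to the farm, inject
  await browser.close();
})();
```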

Finally, bot developers are investing a lot more resources than they used to. Data is the new gold, and many companies make their living collecting, packaging, and reselling data. They need web scraping in order to exist, so they are ready to invest a lot of time and money. It’s a never-ending game, and it takes real expertise to understand the increasingly advanced strategies of the bots.

Q: What’s the best part of your job?

A: DataDome protects some of the world’s biggest websites and applications, and blocking attackers has a real positive impact both for the companies and for their end users. It’s nice to walk down the street and see familiar brands all over the place, knowing that we protect their businesses and help them operate smoothly.

What’s also really cool at DataDome is that our detection engine is built in such a way that it’s very easy to deploy new signals and new strategies. If we have an idea or an intuition, we can deploy it on a subset of traffic, start collecting signals to confirm whether our intuition was correct, and very quickly use these signals to block millions of malicious requests.

Of course, we carefully review and monitor all our changes to make sure we never disrupt our customers’ traffic. But in very little time, we can start to collect new signals and enforce new detection patterns, which is really useful when a customer comes under a new kind of attack.

Q: If you had to work in any other industry or role, what would it be?

A: Difficult question! I really enjoy security, so maybe something related to human-driven fraud, like fighting money laundering. It’s an interesting topic with potentially huge impact.

Q: What advice do you have for someone looking to break into online threat research?

A: There isn’t really a right or wrong background for this work, and it’s something you can start learning by yourself if you’re motivated.

It’s important to be familiar with how the web works: web architecture, networking, JavaScript, proxies, etc. And of course, you should try to create a few bots and see what you can learn! Most importantly, you need to be curious. You won’t find a tutorial for this online, so you need to experiment and develop your knowledge yourself.

When we hire for my team, we look for people with an automation mindset. At our scale, data processing and data automation are very important.

Aside from that, our team members have different skills. Some are data scientists with deep machine learning expertise; some have a mix of data science and cybersecurity analyst skills. Others are true security experts, with specialist skills in topics like network and web security.

So, we have a wide range of profiles, and we want to continue that way. When someone attacks you, they don’t play within the rules, and they don’t care what university you went to. So it’s good to have a diversity of skills to ensure that we always have what it takes to stay ahead of bots.