Our battle with fraud
According to our estimates, about half of our advertising product's traffic was fraudulent in the spring of 2021. We used a third-party tool to filter fraud, but we had to pay for it and could not influence the “magic” under the solution's hood.
Taking matters into our own hands, we figured out the details and worked out our own filter system, raising the conversion rate on advertisers’ websites. After disabling fraudulent partners (the bot farms), we reduced the fraud share to 10%.
In this article we talk about the basics of our approach. However, we will not disclose all secrets.
Briefly about the solution
Our ads project includes the Campaign Manager, a tool that helps advertisers set up advertising campaigns by setting budgets and targeting traffic to their site. Advertisers want the target audience to perform specific actions on the site: follow links or buy goods. Our goal is to attract these users through our advertising network.
The Campaign Manager supports push ads: classic push notifications displayed by operating-system tools, and in-page pushes that appear on the page in the browser in an iframe. The user sees the advertisement, clicks, and lands on our backend. That's where all the fun happens.
Why is it necessary to fight fraud?
In theory, we could simply show the ad to the user who requested it, registering the click via an intermediate redirect. In practice, we are forced to evaluate traffic quality and filter out a massive amount of fraud so that advertisers' money does not go to waste.
We filter out two types of traffic.
1. Traffic from bots that will never buy anything from the advertiser: for example, automated scripts created with browser-testing tools (PhantomJS, Selenium, Puppeteer, Playwright, and others). With their help, the browser itself “presses” certain buttons, emulating ad views.
2. Traffic in the wrong price range. Push traffic is expensive, and advertisers pay for it accordingly. Fraudulent sources bring us other, cheaper kinds of ad traffic and pass it off as push to make more money. We filter out such publishers to protect advertisers' interests, even though legitimate users can come to us through this channel. After all, our client should get what he pays for.
Initially, we used a third-party traffic filtering tool to get to market quickly. But it turned out to be the wrong choice for us for several reasons:
- the solution works through its own servers;
- there is obscure magic under the hood;
- it is yet another external dependency to load;
- the tool introduces quite a lot of traffic loss;
- it produced false positives;
- we stumbled upon errors that silently turned filtering off for us.
The main reason, though, is that the third-party tool did not account for the specifics of push advertising and could not handle fraud-detection cases specific to it, including traffic in the wrong price range. It can only tell you whether a user is a bot or not, and that's it. We tried more expensive solutions, but they did not suit us either.
As a result, we had to dive into the topic independently.
Redirects and Intermediate Pages
Fraud is assessed and filtered using an intermediate page accessed by the user who requested the ad.
On this page we collect browser settings and other available data to evaluate whether the traffic is good or bad. The intermediate page sends all the collected data to the backend, where we decide whether to show our advertiser's ad in response to this request. Those who are rejected see a stub instead. As a result, the advertiser sees a high conversion rate and values the tool for the quality of its traffic.
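The flow above can be sketched as a simple backend decision function. The names (`decide`, `missing_ua_trap`, the parameter keys) are illustrative assumptions, not our production API:

```python
def decide(click_params, traps):
    """Run every enabled trap over the parameters collected on the
    intermediate page; any trap that fires downgrades the click."""
    for trap in traps:
        if trap(click_params):
            return "stub"        # rejected: show the stub instead
    return "advertiser"          # passed: redirect to the advertiser

# Example trap: reject clicks that arrive without a User-Agent at all.
def missing_ua_trap(params):
    return not params.get("userAgent")
```

The real system runs many traps in priority order, but the shape of the decision is the same: any hit means the advertiser never pays for the click.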
We implemented several variants of the intermediate page:
- The Minimal page does not collect anything and is not used regularly (0% of clicks are sent to it). It was implemented to understand how the very fact of an “extra” redirect to an intermediate page affects a user's transition to the advertiser's site. It turned out that each redirect loses about 5–7% of clicks.
- The Normal page collects about 140 different parameters. We use it to monitor partners' traffic in real time and highlight fraud. We also used to run an external fraud-detection system on it, which helped us test some of our hypotheses. Collecting so many parameters and running a third-party system is not free in terms of traffic losses: we lose about 32% of traffic here.
Depending on the task, we redirect traffic between the intermediate pages. Generally, we use the Light page. If we need to experiment with a new trap, we switch to Normal. In some exceptional cases, we return to Minimal.
How filtering is carried out
We filter incoming traffic using traps we developed, each working with one or more of the collected traffic parameters. For each type of fraud, we gradually build up traps: we monitor the parameters, form hypotheses, and test traps on a small share of users. If the test is successful (the trap improves the quality of traffic), we roll the trap out to all users. Each trap cuts off a certain amount of fraud, but traps are “not free” in terms of performance, so we prioritize the most efficient ones.
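Testing a trap on a small share of users can be done with deterministic bucketing. This is a minimal sketch under our own naming assumptions; the hashing scheme is one common way to do it, not necessarily the one used in production:

```python
import hashlib

def in_test_cohort(user_key, trap_name, percent):
    """Deterministically assign a user to a trap's test cohort.

    Hashing (trap, user) keeps each user's assignment stable across
    requests, while different traps get independent cohorts."""
    digest = hashlib.sha256(f"{trap_name}:{user_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100   # e.g. percent=5 covers buckets 0..499
```

Because the assignment is a pure function of the user key and trap name, a user sees consistent behavior for the whole experiment, which keeps conversion comparisons clean.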
Revealing all the traps means giving bot growers a tool to bypass our anti-fraud system. But I can provide a few examples that will make the approach clear.
Time trap (Low time to click)
We track the time between actions of users with the same IP address and User-Agent. A human cannot click on multiple ads at once or click on an ad the instant its icon loads, so the time between loading the icon and the click should not be too short.
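A minimal sketch of this trap, keyed by the (IP, User-Agent) pair. The threshold and the in-memory store are illustrative assumptions; a real deployment would use shared storage and a tuned cut-off:

```python
MIN_SECONDS_TO_CLICK = 1.0   # illustrative threshold, tuned on real data

icon_loaded_at = {}          # (ip, user_agent) -> icon load timestamp

def record_icon_load(ip, user_agent, ts):
    icon_loaded_at[(ip, user_agent)] = ts

def is_too_fast(ip, user_agent, click_ts):
    """A click arriving almost immediately after its icon loaded
    (for the same IP + User-Agent pair) is treated as bot-like."""
    loaded = icon_loaded_at.get((ip, user_agent))
    return loaded is not None and click_ts - loaded < MIN_SECONDS_TO_CLICK
```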
High CTR trap
CTR (click-through rate) is a well-known metric for analyzing the effectiveness of advertising campaigns: the ratio of the number of clicks to the number of impressions, expressed as a percentage. It reflects what share of ad impressions turned into clicks and visits to the advertiser's website. This value is usually small, no more than a few percent, so high values are a sign of fraud.
It should be noted that this trap targets not a specific visitor but a traffic source.
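A per-source version of the check might look like this. The 3% threshold and the minimum-impressions guard are illustrative assumptions, not our production values:

```python
HIGH_CTR_THRESHOLD = 0.03   # assumption: "a few percent" is normal
MIN_IMPRESSIONS = 1_000     # don't judge a source on too little data

def is_suspicious_source(impressions, clicks):
    """Flag a traffic source whose CTR is implausibly high."""
    if impressions < MIN_IMPRESSIONS:
        return False         # not enough data to judge
    return clicks / impressions > HIGH_CTR_THRESHOLD
```

The minimum-impressions guard matters: a brand-new source with 10 impressions and 2 clicks would otherwise be banned on noise alone.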
Window size trap
Logic dictates that the human user’s document window should be of adequate size. We filter all traffic for which “window.innerWidth” is less than 200 and “window.innerHeight” is less than 100. These parameters are usually equal to 0 for bots.
Curiously, we also came across exceptions to this rule: the parameters are 0, yet the user reaches the advertiser's site and performs targeted actions. Later we found an explanation: a new generation of bots has appeared that periodically visits the advertiser's website so that ML tools do not flag it as fraud. We need to ban them, but the probability of a false positive on legitimate users is high. Our team does not yet have a unanimous opinion on this rule, so we continue to experiment.
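The rule itself is a one-liner over the collected parameters. The parameter names here mirror the browser properties but are otherwise an assumption about the payload format:

```python
def tiny_window_trap(params):
    """Reject clicks reporting an implausibly small document window;
    bots frequently report both dimensions as 0."""
    return (params.get("innerWidth", 0) < 200
            and params.get("innerHeight", 0) < 100)
```

Note the `and`: both dimensions must be tiny, so a legitimate narrow layout with a normal height is not caught.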
Object Presence Hook
Most users come from the Chrome browser, so we can afford to set browser-specific hooks.
A legitimate user must have a “window.chrome” object in the window (i.e., it must not be undefined). At the same time, many bot drivers hide the browser by removing this object for some reason. So we filtered out traffic in which this property is undefined. The hook is mentioned here only as an example, since it is now disabled (it started producing a lot of false positives on Firefox).
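A sketch of this (now disabled) check, with the Firefox lesson applied: only User-Agents that claim to be Chrome are tested. The `hasWindowChrome` field name is a hypothetical stand-in for whatever the intermediate page reports:

```python
def missing_chrome_object_trap(params):
    """Flag traffic whose User-Agent claims Chrome but which reports
    no window.chrome object. Restricting the check to Chrome UAs is
    essential: other browsers (e.g. Firefox) legitimately lack it."""
    claims_chrome = "Chrome" in params.get("userAgent", "")
    return claims_chrome and not params.get("hasWindowChrome", False)
```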
Not all rules make it into production. For example, we assumed that the icon request, the impression, and the click should all carry the same User-Agent. We set such a trap but got too many false positives: it turned out that many partners pass along arbitrarily modified User-Agent strings.
Filtration quality control
Before rolling a new trap out, we check how well it works. Here is how.
The most straightforward approach is to query ClickHouse and eyeball the parameters of the filtered traffic.
We can also look at the quality of traffic from the advertiser’s side — evaluate the conversion when a user clicks on an ad and performs certain actions on the site. When the target action is completed, the advertiser informs us about it.
Our Campaign Manager has collected enough data to analyze which traffic does not bring conversions; often this points to bot-generated traffic. Conversion stats help us look for new traps and test how well the old ones work.
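Conversion analysis per source reduces to a simple aggregation. This is a toy sketch with assumed inputs (one source id per click and per reported conversion); the real pipeline runs over ClickHouse, not Python lists:

```python
from collections import Counter

def conversion_rate_by_source(click_sources, conversion_sources):
    """Return the conversion rate per traffic source, so that sources
    converting at or near zero stand out for closer inspection."""
    clicks = Counter(click_sources)
    conversions = Counter(conversion_sources)
    return {src: conversions[src] / n for src, n in clicks.items()}
```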
In general, our filtering is rather lenient: we filter out traffic only if we are 99% sure it is fraud.
We can analyze both the requests of a single user and the whole traffic flow from a particular partner. We have indeed come across publishers whose traffic was mostly fraud. We blacklist such sites and study what was happening; such cases help us build new traps.
To analyze the clusters of traffic allocated by specific parameters, we use CatBoost. This is a reasonably effective classifier from Yandex, which works well with categorical features.
We feed traffic data into it, and at the output we get a list of the most significant parameters, on the basis of which we can create new rules. In addition, CatBoost can build a feature-importance matrix, so we get a lot of interesting things out of it: for instance, which traffic properties reveal fraud.
However, there is no further automation: issues with partners are resolved at the management level. We have had precedents of shutting down partners that supplied a lot of junk traffic, especially in the beginning. After such a “cleaning” of partners, the share of fraud in our traffic dropped from about half to 10%.
We have recently also taken a different path and begun to look at the automation tools themselves. While analyzing the product's source code and its plugins, we found several markers of Puppeteer. The problem, however, is that bot tools regularly undergo complete refactoring, and pull requests appear that fix the “holes” we found in bot growers' scripts. Then we have to study the new version of the bot system to understand how to catch it. In the same way, we study new browser versions: they gain new properties that can be used to judge the quality of traffic.