Advertisement detection: how machine learning changes the way we discover technologies

Colin de Vries
  • about 1 month ago
  • 6 min read

How do we know which advertisers are growing, which publishers are popular, and which ad networks have the most impact? It all comes down to Technology Detection, a practice that is very effective but resource-intensive. We streamlined our fingerprint identification process with a bit of help from machine learning. Read on to learn how online advertising works, how we track it with algorithmic assistance, and why the human component remains central to digital work.

Online advertising is an integral, though sometimes annoying, part of our digital experience. Whether you want it or not, advertising shapes the way we browse, shop, and interact online. But what might seem like just a small nuisance—think banner ads, sponsored posts, or a pre-roll video—actually relies on a complex behind-the-scenes ecosystem. This system involves a network of advertisers, publishers, ad networks, and tech platforms, all collaborating to deliver seemingly omnipresent advertisements. Understanding how these ads are crafted and targeted, and the technology used to deliver them, can shed light on the machinery that powers the digital advertising industry.

How does online advertising work?

To advertise online, advertisers need to find websites where their message will reach specific audience segments. They pay for their content to be published, with payment models based on metrics such as cost per thousand impressions (CPM) or cost per click (CPC).
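As a rough illustration of how the two payment models differ, the sketch below computes what a campaign would cost under each. CPM is conventionally quoted per thousand impressions. The function names, rates, and campaign figures are made up for the example; real pricing depends on the platform and the deal.

```python
def cpm_cost(impressions: int, rate_per_thousand: float) -> float:
    """Cost when the advertiser pays per thousand impressions (CPM)."""
    return impressions / 1000 * rate_per_thousand


def cpc_cost(clicks: int, rate_per_click: float) -> float:
    """Cost when the advertiser pays per click (CPC)."""
    return clicks * rate_per_click


# Hypothetical campaign: 250,000 impressions that produced 1,200 clicks.
impressions, clicks = 250_000, 1_200
print(cpm_cost(impressions, 4.00))  # billed on impressions, 4.00 per thousand
print(cpc_cost(clicks, 0.50))       # billed on clicks, 0.50 per click
```

The same campaign can cost quite different amounts depending on which metric is billed, which is one reason advertisers care about where and how their ads are placed.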

These are the most common ways advertisements end up on websites:

  • Advertising Networks and Exchanges

These platforms collect available advertisement space from many sites, then match it with advertisers looking for placements. The networks let advertisers manage their campaigns across many sites at once, so they don't have to negotiate directly with every single site. Google AdSense is a prime example.

  • Programmatic Advertising

Programmatic advertising lets advertisers buy advertisement space in real time using automated technology. Instead of negotiating placements by hand, algorithms bid on individual impressions as pages load, a process known as real-time bidding (RTB). The system ‘learns’ from how users interact with advertisements where best to place them for maximum effectiveness. The Trade Desk is an example of this.

  • Direct Purchase

Some advertisers negotiate directly with website owners to buy advertisement spaces. This is often the case with high-traffic websites, where certain advertisement placements can be expensive but incredibly lucrative. Direct purchase advertising is common on websites like BuzzFeed, The New York Times, or CNN.

  • Affiliate Marketing

This kind of marketing involves websites that host advertisements from other businesses. The host websites earn a commission based on the sales or leads they generate through these ads. Examples of popular affiliate programs are Amazon Associates or, in the Netherlands, bol Affiliate Marketing.

  • Retargeting

When a user visits a site but doesn't make a purchase, advertisers can use cookies to recognize that visitor and show advertisements for the site or product on other websites. The idea is to draw the visitor back to the original site later. We’ve all experienced this: seeing products or holidays we recently searched for in a banner ad on a seemingly unrelated website. Facebook Ads is known for this strategy.

  • Native Advertising

These advertisements are designed to match the look and feel of a site's regular content, offering a less intrusive advertising experience. They are typically managed through dedicated native advertising platforms. On Instagram, native ads appear in users' feeds and are designed to look like regular posts, and on Amazon, native ads appear as “sponsored” products within search results.

bar chart showing the top ten ad platforms, with various Google ad services taking up four of the top 6 spots.
Figure 1: Top 10 advertising platforms detected on websites, by market share.

Why do we detect online advertising technology?

Different advertising strategies require separate technology and algorithms to achieve their goals of matching the “right” ad to the intended audience. These differing demands have made the advertising technology landscape particularly complex due to its diversity. With numerous players, including many smaller, niche companies, staying up-to-date with advertising technology requires constant attention. 

Understanding fingerprints in advertising

To detect what kind of technology is being used on a website and identify the advertiser behind it, we look for what are commonly referred to as ‘fingerprints.’ A fingerprint, in this context, is a unique identifier embedded within the code of a website that indicates the presence of a specific advertising platform or technology. Fingerprints can be fragments of code, URLs, cookies, or specific headers that are characteristic of certain technologies. They serve as digital signatures that help identify and classify the technologies a website uses.

Let’s take a look at the fingerprints of Google AdSense. In the code below, elements like "pagead2.googlesyndication.com" and "adsbygoogle" serve as fingerprints, indicating the presence of Google AdSense on the website.

Lines of HTML code highlighting Google
Figure 2: An example of HTML code for Google AdSense.
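In its simplest form, matching known fingerprints against a page could look something like the sketch below. It only does plain substring matching. The two AdSense strings are the ones named in the text, while the dictionary layout, the function name, and the sample HTML are assumptions made for illustration, not a description of our production system.

```python
# Fingerprint strings per technology. The two AdSense strings come from the
# example in the text; the dictionary layout is an assumption for illustration.
FINGERPRINTS = {
    "Google AdSense": ["pagead2.googlesyndication.com", "adsbygoogle"],
}


def detect_technologies(html):
    """Map each detected technology to the fingerprints found in the source."""
    source = html.lower()
    hits = {}
    for technology, patterns in FINGERPRINTS.items():
        matched = [p for p in patterns if p.lower() in source]
        if matched:
            hits[technology] = matched
    return hits


# Made-up page fragment resembling a typical AdSense embed.
sample_html = """
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle" style="display:block"></ins>
"""

print(detect_technologies(sample_html))
```

Substring matching is easy to reason about, but it only finds fingerprints that someone has already written down, which is exactly the limitation that makes discovering new ones laborious.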

We can take nu.nl, one of the largest commercial news sites in the Netherlands, as another example. Advertisements are prominently displayed on the front page and often overshadow the content itself. By examining the website's source code, we can detect links to various advertising platforms. These snippets serve as fingerprints that identify providers like Google AdSense, DPGMedia Advertising, and DoubleClick.

Screenshot of a news website with wrap-around and in-text banner ads.
Figure 3: A screenshot showing a website with prominent banner ads.
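One way to surface such links from a page's source is to collect the hostnames that scripts, iframes, and images are loaded from, and keep only those that do not belong to the site itself. The sketch below does this with Python's standard library. The class and function names, the placeholder domain `example-news.test`, and the sample markup are all hypothetical, and a real pipeline would look at far more signals than `src` attributes.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class ResourceHostCollector(HTMLParser):
    """Counts hostnames of resources loaded via script, iframe and img tags."""

    def __init__(self):
        super().__init__()
        self.hosts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "iframe", "img"):
            return
        src = dict(attrs).get("src")
        if not src:
            return
        host = urlparse(src).hostname  # None for relative URLs
        if host:
            self.hosts[host] += 1


def external_hosts(html, own_domain):
    """Return third-party hosts referenced by the page, with their counts."""
    collector = ResourceHostCollector()
    collector.feed(html)
    return {
        host: count
        for host, count in collector.hosts.items()
        if host != own_domain and not host.endswith("." + own_domain)
    }


# Made-up markup; example-news.test stands in for the site being inspected.
sample_html = """
<script src="https://securepubads.g.doubleclick.net/tag/js/gpt.js"></script>
<script src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script src="/static/app.js"></script>
<img src="https://media.example-news.test/logo.png">
"""

print(external_hosts(sample_html, "example-news.test"))
```

Hosts that show up this way are only candidates: someone still has to decide whether a given host really identifies an advertising platform.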

The challenge of manually detecting technologies

Traditionally, identifying new fingerprints for undiscovered technologies was a manual process. Our technology detection team would inspect the source code of countless websites, searching for the unique identifiers. This method, while effective, was incredibly time-consuming and resource-intensive.

Enhancing fingerprint discovery with machine learning

Being tech people at a tech company, we built a new system to improve the fingerprint discovery process. It uses machine learning to analyze large quantities of web data, detecting potential fingerprints for both known and undiscovered advertising platforms. 

This is possible with a random forest classifier, a machine learning method that combines the predictions of many decision trees. During training, the classifier builds multiple decision trees. At prediction time, each tree makes its own prediction about whether a specific pattern could represent a new fingerprint, in this case an advertising fingerprint. The most commonly predicted outcome becomes the final classification, which lets us detect new fingerprints with confidence even when they show subtle variations.

A simplified diagram of a decision tree
Figure 4: A schematic representation of how a random forest classifier works.

In Figure 4, each decision tree independently classifies the same input: here, a mix of shapes. Some trees may classify the input as “Class-A” (hexagon), while others might classify it as “Class-B” (triangle). The random forest classifier combines these classifications through majority voting, leading to a final, more accurate prediction: “Class-A.”
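The voting step can be sketched in a few lines of Python. Note the simplification: in a real random forest the trees are learned from labelled examples, whereas the three "trees" below are hand-written rules. The feature names (`third_party_host`, `sites_seen_on`, `in_script_tag`) and the threshold are hypothetical and are not the features our model uses; the sketch only illustrates how individual votes become one prediction.

```python
from collections import Counter

# Each "tree" is a hand-written rule standing in for a learned decision tree.
# Feature names and thresholds are hypothetical, for illustration only.


def tree_domain(candidate):
    return "fingerprint" if candidate["third_party_host"] else "not_fingerprint"


def tree_spread(candidate):
    return "fingerprint" if candidate["sites_seen_on"] >= 50 else "not_fingerprint"


def tree_context(candidate):
    return "fingerprint" if candidate["in_script_tag"] else "not_fingerprint"


# An odd number of trees avoids ties between the two classes.
FOREST = [tree_domain, tree_spread, tree_context]


def majority_vote(candidate, forest=FOREST):
    """Return the most common label and the share of trees that voted for it."""
    votes = Counter(tree(candidate) for tree in forest)
    label, count = votes.most_common(1)[0]
    return label, count / len(forest)


# A candidate pattern on which the trees disagree: two vote "fingerprint",
# one votes "not_fingerprint".
candidate = {"third_party_host": True, "sites_seen_on": 12, "in_script_tag": True}
label, share = majority_vote(candidate)
print(label, round(share, 2))
```

The share of agreeing trees is a useful by-product: a narrow majority like this one is a natural signal that a prediction deserves closer human review.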

Our machine learning model does the time-intensive work of preliminary identification, but our human component is the key to the system. Our technology detection team reviews the model's predictions, verifies the found fingerprint, and ensures the reliability of each new discovery added to our database.

The initial deployment of our machine learning model has significantly improved the detection of advertising platforms that were previously undetected, broadening our understanding and offering better insights into the market. For recognized platforms like Google AdSense, our model has substantially increased detection rates.

The future of technology detection

Advertising is a large part of everyone’s online experience, and we can learn plenty from looking behind the scenes. The advertising technology we detect tells us more about the companies, preferences, and trends that are active online. Because the web is ever-evolving, new technologies, and with them new digital fingerprints, keep popping up, and machine learning tools have helped us keep pace with those changes. The human element of technology detection is what ensures high quality, while these new tools make the resource-intensive work more efficient.

Our project also lays the groundwork for other innovations: we anticipate further enhancements in our detection capabilities, allowing us to rapidly identify and adapt to emerging advertising technologies. From here, we’ll be able to provide timely insights into the latest tools and trends in digital advertising and in other industries, too. We’re exploring how our machine learning techniques can be applied to other fields across the web, which promises even greater impact.
