Diffusion of User Tracking Data in the Online Advertising Ecosystem

Proceedings on Privacy Enhancing Technologies ; 2018 (4):85–103

Muhammad Ahmad Bashir and Christo Wilson

Diffusion of User Tracking Data in the Online
Advertising Ecosystem
Abstract: Advertising and Analytics (A&A) companies have started collaborating more closely with one another due to the shift in the online advertising industry towards Real Time Bidding (RTB). One natural way to understand how user tracking data moves through this interconnected advertising ecosystem is by modeling it as a graph. In this paper, we introduce a novel graph representation, called an Inclusion graph, to model the impact of RTB on the diffusion of user tracking data in the advertising ecosystem. Through simulations on the Inclusion graph, we provide upper and lower estimates on the tracking information observed by A&A companies. We find that 52 A&A companies observe at least 91% of an average user’s browsing history under reasonable assumptions about information sharing within RTB auctions. We also evaluate the effectiveness of blocking strategies (e.g., AdBlock Plus), and find that major A&A companies still observe 40–90% of user impressions, depending on the blocking strategy.

Keywords: Online Tracking, RTB, Cookie Matching

DOI 10.1515/popets-2018-0033
Received 2018-02-28; revised 2018-06-15; accepted 2018-06-16.

Muhammad Ahmad Bashir: Northeastern University, E-mail: ahmad@ccs.neu.edu
Christo Wilson: Northeastern University, E-mail: cbw@ccs.neu.edu

1 Introduction

In the last decade, the online display advertising industry has massively grown in size and scope. According to the Interactive Advertising Bureau (IAB), revenue from the online display ad industry in the U.S. totaled $88B in 2017, a growth of 21.4% from 2016 [63]. This increased spending is fueled by advances that enable advertisers to target users with increasing levels of precision, even across different devices and platforms.

Another recent change in the online display advertising ecosystem is the shift from ad networks to ad exchanges, where advertisers bid on impressions being sold in Real Time Bidding (RTB) auctions. The rise of RTB has forced Advertising and Analytics (A&A) companies to collaborate more closely with one another, in order to exchange data about users and facilitate bidding on impressions [10, 58]. The move towards RTB has also caused A&A companies to specialize into particular roles. For example, Supply-Side Platforms (SSPs) work with publishers (e.g., CNN) to help manage their relationship with ad exchanges, while Demand-Side Platforms (DSPs) try to optimize ad placement and bidding on behalf of advertisers. In short, due to RTB, the online advertising ecosystem has become enormously complex.

A natural way to model this complex ecosystem is in the form of a graph. Graph models that accurately capture the relationships between publishers and A&A companies are extremely important for practical applications, such as estimating revenue of A&A companies [26], predicting whether a given domain is a tracker [34], or evaluating the effectiveness of domain-blocking strategies on preserving users’ privacy.

However, to date, technical limitations have prevented researchers from developing accurate graph models of the online advertising ecosystem. For example, Gomer et al. [29] propose a Referer graph, where nodes represent publishers or A&A domains, and two nodes ai and aj are connected if an HTTP message to aj is observed with ai as the HTTP Referer. Unfortunately, as we will show, graphs built using Referer information may contain erroneous edges in cases where a third-party script is embedded directly into a first-party context (i.e., is not sandboxed in an iframe).

In this paper, to model the diffusion of user tracking data within RTB auctions, we propose a novel and accurate representation of the advertising graph called an Inclusion graph. The Inclusion graph corrects the technical problem of the Referer graph by using the actual inclusion relationships between domains to represent edges, rather than imprecise Referer relationships. We are able to construct Inclusion graphs thanks to advances in browser instrumentation that allow researchers to conduct web crawls that record the exact provenance of all HTTP(S) requests [6, 10, 41].

We use crawled data consisting of around 2M impressions from popular e-commerce websites collected
by a specially instrumented version of Chrome [10] to construct the Inclusion graph. In § 4, we examine the fundamental graph properties of the Inclusion graph and compare it to a Referer graph, created using the same dataset to understand their salient differences. In § 5, we demonstrate a concrete use case for the Inclusion graph by using simulations to model the flow of tracking data to A&A companies. Furthermore, we compare the efficacy of different real-world and graph theoretic “blocking” strategies (e.g., AdBlock Plus [2], Ghostery [25], and Disconnect [18]) at reducing the flow of tracking information to A&A companies.

Overall, we make the following key contributions:
– We introduce the Inclusion graph as a model for capturing the complexity of the online advertising ecosystem. We use the Inclusion graph as a substrate for modeling the flow of impressions to A&A companies by taking into account the browsing behavior of users and the dynamics of RTB auctions.
– We find that the Inclusion graph has substantive differences in graph structure compared to the Referer graph because 48.4% of resource inclusions in our crawled data have an inaccurate Referer.
– Through simulations, we find that 52 A&A companies are each able to observe 91% of an average user’s impressions as they browse, under modest assumptions about data sharing in RTB auctions. 636 A&A companies are able to observe at least 50% of an average user’s impressions. Even under the strictest simulation assumptions, the top 10 A&A companies observe 89–99% of all user impressions.
– We simulate the effect of five blocking strategies, and find that AdBlock Plus (the world’s most popular ad blocking browser extension [45, 62]) is ineffective at protecting users’ privacy because major ad exchanges are whitelisted under the Acceptable Ads program [73]. In contrast, Disconnect blocks the most information flows to A&A companies, followed by removal of top 10% A&A nodes. However, even with strong blocking, major A&A companies still observe 40–80% of user impressions.

The raw data we use in this study is publicly available.¹ We have also publicly released the source code and data from this study.²

¹ http://personalization.ccs.neu.edu/Projects/Retargeting/
² http://personalization.ccs.neu.edu/Projects/AdGraphs/

2 Background and Related Work

In this section, we review technical details of and current computer science research on the online display advertising ecosystem. We start by discussing related work on user privacy and tracking. Next, we present examples of the current display ad serving process and define the roles of different actors in the ecosystem, followed by a brief overview of efforts to empirically measure these processes. Lastly, we examine prior work that modeled the ad ecosystem as a graph.

2.1 Tracking and Blocking

To show relevant ads to users, advertisers rely heavily on collecting information about users as they browse the web. This data collection is achieved by embedding trackers into webpages that gather browsing information about each user.

The area of tracking has been well studied. Krishnamurthy et al. and others have documented the pervasiveness of trackers and the associated user privacy implications over time [15, 20, 26, 33, 37–39]. Furthermore, tracking techniques have evolved over time. Persistent cookies [35], local state in browser plugins [7, 68, 69], and various browser fingerprinting methods [1, 21, 36, 51, 55, 57, 65] are some of the techniques that have been deployed to track users. Englehardt et al. [20] found evidence of tracking via the Audio and Battery Status JavaScript APIs. In addition to tracking users themselves, advertisers try to maximize their knowledge of each user’s interest profile by sharing information with each other via cookie matching [1, 10, 23, 58]. Falahrastegar et al. examine how tracking differs across geographic regions [22].

Users have become increasingly concerned with the amount and types of tracking information collected about them [47, 70]. Several surveys have investigated users’ concerns about targeted ads, their preferences towards tracking, and usage of privacy tools [8, 42, 48, 66, 71]. Concerns about the privacy implications of tracking (as well as the insecurity of online ad networks [75]) have led to increased adoption of tools that block trackers and ads. Two studies have examined the usage of ad blockers in-the-wild [45, 62], while Walls et al. looked at efforts to whitelist “acceptable advertisers” [73].

Merzdovnik et al. critically examined the effectiveness of tracker blocking tools [49]; in contrast, Nithyanand et al. studied advertisers’ efforts to counter
ad blockers [56]. Mughees et al. examined the prevalence of anti-ad blockers in the wild [53]. In this work, we expand on the existing blocking literature by taking the effects of ad auctions and cookie matching into account.

The research community has proposed a variety of mechanisms to stop online tracking that go beyond blacklists of domains and URLs. Li et al. [43] and Ikram et al. [32] used machine learning to identify trackers, while Papaodyssefs et al. [60] examined the use of private cookies to avoid being tracked. Nikiforakis et al. propose the complementary idea of adding entropy to the browser to evade fingerprinting [54]. However, despite these efforts, third-party trackers are still pervasive and pose real privacy issues to users [49].

2.2 The Online Advertising Ecosystem

Numerous studies have chronicled the online advertising ecosystem, which is composed of companies that: track users, serve ads, act as platforms between publishers (websites that rely on advertising revenue to pay for content creation) and advertisers, or all of the above. Mayer et al. present an accessible introduction to this topic in [46]. In this work, we collectively refer to companies engaged in analytics and advertising as A&A companies.

Recently, the online ad ecosystem has begun to shift from ad networks to ad exchanges, which implement Real Time Bidding (RTB) auctions to sell impressions to advertisers. In the advertising industry, the term “impression” is used when advertising or tracking content is rendered in a user’s browser after they visit a webpage [17]. To participate in RTB auctions, A&A companies must implement cookie matching, which is a process by which different A&A companies exchange their unique tracking identifiers for specific users. Several studies have examined the emergence of cookie matching [1, 10, 23, 58]. Ghosh et al. theoretically model the incentives for A&A companies to collaborate with their competitors in RTB auction systems [24].

[Figure 1: (a) Cookie Matching Example; (b) RTB Example with Two Exchanges and Two Auctions. Legend: Publisher, Exchange, SSP, DSP/Advertiser; HTTP(S) Request/Response; RTB Bidding.]
Fig. 1. Examples of (a) cookie matching and (b) showing an ad to a user via RTB auctions. (a) The user visits publisher p1 (1), which includes JavaScript from advertiser a1 (2). a1’s JavaScript then cookie matches with exchange e1 by programmatically generating a request that contains both of their cookies (3). (b) The user visits publisher p2, which then includes resources from SSP s1 and exchange e2 (1)–(3). e2 solicits bids (4) and sells the impression to e1 (5) (6), which then holds another auction (7), ultimately selling the impression to a1 (8) (9).

Figure 1(a) illustrates the typical process used by A&A companies to match cookies. When a user visits a website (1), JavaScript code from a third-party advertiser a1 is automatically downloaded and executed in the user’s browser (2). This code may set a cookie in the user’s browser, but this cookie will be unique to a1, i.e., it will not contain the same unique identifiers as the cookies set by any other A&A companies. Furthermore, the Same Origin Policy (SOP) prevents a1’s code from reading the cookies set by any other domain. To facilitate bidding in future RTB auctions, a1 must match its cookie to the cookie set by an ad exchange like e1. As shown in the figure, a1’s JavaScript accomplishes this by programmatically causing the browser to send a request to e1 (3). The JavaScript includes a1’s cookie in the request, and the browser automatically adds a copy of e1’s cookie, thus allowing e1 to create a match between its cookie and a1’s.

Figure 1(b) shows an example of how an ad may be shown on publisher p2 using RTB auctions. When a user visits p2 (1), JavaScript code is automatically downloaded and executed either from a Supply Side Platform (SSP) (2) or an ad exchange. SSPs are A&A companies that specialize in maximizing publisher revenue by forwarding impressions to the most lucrative ad exchange. Eventually the impression arrives at the auction held by ad exchange e2 (3), and e2 solicits bids from advertisers and Demand Side Platforms (DSPs) (4). DSPs are A&A companies that specialize in executing ad campaigns on behalf of advertisers. Note that all participants in the auction observe the impression; however, because only e2’s cookie is available at this point, auction participants that have not matched cookies with e2 will not be able to identify the user.

The process of filling an impression may continue even after an RTB auction is won, because the winner may be yet another ad exchange or ad network. As shown in Figure 1(b), the impression is purchased from e2 by e1 (5) (6), who then holds another auction (7) and ultimately sells to a1 (the advertiser from the cookie matching example) (8) (9). Ad exchanges and ad networks
routinely match cookies with each other to facilitate the flow of impression inventory between markets.

Measurement Studies. Barford et al. broadly characterized the web adscape and identified systematically important ad networks [9]. Rodriguez et al. measured the ad ecosystem that serves mobile devices [72], while Zarras et al. specifically examined ad networks that serve malicious ads [75]. Gill et al. modeled the revenue earned by different A&A companies [26], while other studies have used empirical measurements to determine the value of individual users to online advertisers [58, 59]. Many studies have used a variety of methods to study the targeted ads that are displayed to users under a variety of circumstances [9–11, 16, 30, 44].

2.3 Ad Ecosystem Graphs

A natural structure for modeling the online ad ecosystem is a graph, where nodes represent publishers and A&A companies, and edges capture relationships between these entities. Gomer et al. [29] built and analyzed graphs of the ad ecosystem by making use of the Referer field from HTTP requests. In this representation, a relationship di → dj exists if there is an HTTP request to domain dj with a Referer header from domain di.

While Gomer et al. provided interesting insights into the structure of the ad ecosystem, their referral-based graph representation has a significant limitation. As we describe in § 3.3, relying on the HTTP Referer does not always capture the correct relationships between A&A parties, thus leading to incorrect graphs of the ad ecosystem. We re-create this graph representation using our dataset (see § 3) and compare its properties to a more accurate representation in § 4.

Kalavri et al. [34] created a bipartite graph of publishers and associated A&A domains, then transformed it to create an undirected graph consisting solely of A&A domains. In their representation, two A&A domains are connected if they were included by the same publisher. This construction leads to a highly dense graph with many complete cliques. Kalavri et al. leveraged the tight community structure of A&A domains to predict whether new, unknown URLs were A&A or not. However, this co-occurrence representation has a conceptual shortcoming: it may include edges between A&A domains that do not directly communicate or have any business relationship. Due to this shortcoming, we do not explore this graph representation in this work.

3 Methodology

Our goal is to capture the most accurate representation of the online advertising ecosystem, which will allow us to model the effect of RTB on diffusion of user tracking data. In this section, we introduce the dataset used in this study and describe how we use it to build a graph representation of the ad ecosystem.

3.1 Dataset

In this work, we use the dataset provided by Bashir et al. [10]. The goal of [10] was to causally infer the information sharing relationships between A&A companies by (1) crawling products from popular e-commerce websites and then (2) observing corresponding retargeted ads on publishers. Bashir et al. conducted web crawls that covered 738 major e-commerce websites (e.g., Amazon) and 150 popular publishers (e.g., CNN).³ The authors chose top e-commerce sites from Alexa’s hierarchical list of online shops [4], and manually chose publishers from the Alexa Top-1K. They crawled 10 manually selected products per e-commerce site to signal strong intent to trackers and advertisers, followed by 15 randomly chosen pages per publisher to elicit display ads. In total, Bashir et al. repeated the entire crawl nine times, resulting in data for around 2M impressions.

³ For simplicity, we refer to these e-commerce websites as publishers, to distinguish them from A&A domains.

3.2 Inclusion Trees

Bashir et al. [10] used a specially instrumented version of Chromium for their web crawls. Their crawler recorded the inclusion tree for each webpage, which is a data structure that captures the semantic relationships between elements in a webpage (as opposed to the DOM, which captures syntactic relationships) [6, 41]. The crawler also recorded all HTTP request and response headers associated with each visited URL.

To illustrate the importance of inclusion trees, consider the example webpage shown in Figure 2(a). The DOM shows that the page from publisher p ultimately includes resources from four third-party domains (a1 through a4). It is clear from the DOM that the request to a3 is responsible for causing the request to a4, since the script inclusion is within the iframe. However, it
is not clear which domain generated the requests to a2 and a3: the img and iframe could have been embedded in the original HTML from p, or these elements could have been created dynamically by the script from a1. In this case, the inclusion tree shown in Figure 2(b) reveals that the image from a2 was dynamically created by the script from a1, while the iframe from a3 was embedded directly in the HTML from p.

[Figure 2: (a) DOM Tree for http://p.com/index.html; (b) Inclusion Tree; (c) Inclusion Graph; (d) Referer Graph. Legend: Publisher, A&A.]
Fig. 2. An example HTML document and the corresponding inclusion tree, Inclusion graph, and Referer graph. In the DOM representation, the a1 script and a2 img appear at the same level of the tree; in the inclusion tree, the a2 img is a child of the a1 script because the latter element created the former. The Inclusion graph has a 1:1 correspondence with the inclusion tree. The Referer graph fails to capture the relationship between the a1 script and a2 img because they are both embedded in the first-party context, while it correctly attributes the a4 script to the a3 iframe because of the context switch.

The instrumented Chromium binary used by Bashir et al. was able to correctly determine the provenance of webpage elements, regardless of how they were created (e.g., directly in HTML, via inline or remotely included script tags, dynamically via eval(), etc.), or where they were located (in the main context or within iframes). This was accomplished by tagging all scripts with provenance information (i.e., first-party for inline scripts), and then dynamically monitoring the execution of each script. New scripts created during the execution of a given script (e.g., via document.write()) were linked to their parent.⁴ More details about how Chromium was instrumented and inclusion trees were extracted are available in [6].

⁴ Note that JavaScript within a given page context executes serially, so there is no ambiguity created by concurrency. Although Web Workers may execute concurrently, they cannot include third party scripts or modify the DOM.

Cookie Matching. The Bashir et al. dataset also includes labels on edges of the inclusion trees indicating cases where cookie matching is occurring. These labels are derived from heuristics (e.g., string matching to identify the passing of cookie values in HTTP parameters) and causal inferences based on the presence of retargeted ads. We use this data in § 5 to constrain some of our simulations.

3.3 Graph Construction

A natural way to model the online ad ecosystem is using a graph. In this model, nodes represent A&A companies, publishers, or other online services. Edges capture relationships between these actors, such as resource inclusion or information flow (e.g., cookie matching).

Canonicalizing Domains. We use the data described in § 3.1 to construct a graph for the online advertising ecosystem. We use effective 2nd-level domain names to represent nodes. For example, x.doubleclick.net and y.doubleclick.net are represented by a single node labeled doubleclick. Throughout this paper, when we say “domain”, we are referring to an effective 2nd-level domain name.⁵

⁵ None of the publishers and A&A domains in our dataset have two-part TLDs, like .co.uk, which simplifies our analysis.

Simplifying domains to the effective 2nd-level is a natural encoding for advertising data. Consider two inclusion trees generated by visiting two publishers: publisher p1 forwards the impression to x.doubleclick.net and then to advertiser a1. Publisher p2 forwards to y.doubleclick.net and advertiser a2. This does not imply that x.doubleclick and y.doubleclick only sell impressions to a1 and a2, respectively. In reality, DoubleClick is a single auction, regardless of the subdomain, and a1 and a2 have the opportunity to bid on all impressions. Individual inclusion trees are snapshots of how one particular impression was served; only in aggregate can all participants in the auctions be enumerated. Further, 3rd-level domains may read 2nd-level cookies without violating the Same Origin Policy [52]: x.doubleclick.net and y.doubleclick.net may both access cookies set by .doubleclick, and do in practice.

The sole exception to our domain canonicalization process is Amazon’s Cloudfront Content Delivery Network (CDN). We routinely observed Cloudfront hosting ad-related scripts and images in our data. We manually examined the 50 fully-qualified Cloudfront domains
(e.g., d31550gg7drwar.cloudfront.net) that were preceded or followed by A&A domains in our data, and mapped each one to the corresponding A&A company (e.g., adroll in this case).

Inclusion graph. We propose a novel representation called an Inclusion graph that is the union of all inclusion trees in our dataset. Our representation is a directed graph of publishers and A&A domains. An edge di → dj exists if we have ever observed domain di including a resource from dj. Edges may exist from publishers to A&A domains, or between A&A domains. Figure 2(c) shows an example Inclusion graph.

Referer graph. Gomer et al. [29] also proposed a directed graph representation consisting of publishers and A&A domains for the online advertising ecosystem. In this representation, each publisher and A&A domain is a node, and edge di → dj exists if we have ever observed an HTTP request to dj with Referer di. Figure 2(d) shows an example Referer graph corresponding to the given webpage. The Bashir et al. [10] dataset includes all HTTP request and response headers from the crawl, and we use these to construct the Referer graph.

Although the Referer and Inclusion graphs seem similar, they are fundamentally different for technical reasons. Consider the examples shown in Figure 2: the script from a1 is included directly into p’s context, thus p is the Referer in the request to a2. This results in a Referer graph with two edges that does not correctly encode the relationships between the three parties: p → a1 and p → a2. In other words, HTTP Referer headers are an indirect method for measuring the semantic relationships between page elements, and the headers may be incorrect depending on the syntactic structure of a page. Our Inclusion graph representation fixes the ambiguity in the Referer graph by explicitly relying on the inclusion relationships between elements in webpages. We analyze the salient differences between the Referer and Inclusion graph in § 4.

Weights. We also create a weighted version of these graphs. In the Inclusion graph, the weight of di → dj encodes the number of times a resource from di sent an HTTP request to dj. In the Referer graph, the weight of di → dj encodes the number of HTTP requests with Referer di and destination dj.

3.4 Detection of A&A Domains

For us to understand the role of A&A companies in the advertising graph, we must be able to distinguish A&A domains from publishers and non-A&A third parties like CDNs. In the inclusion trees from the Bashir et al. dataset [10], each resource is labeled as A&A or non-A&A using the EasyList and EasyPrivacy rule lists. For all the A&A labeled resources, we extract the associated 2nd-level domain. To eliminate false positives, we only consider a 2nd-level domain to be A&A if it was labeled as A&A more than 10% of the time in the dataset.

3.5 Coverage

There are two potential concerns with the raw data we use in this study: does the data include a representative set of A&A domains? and does the data contain all of the outgoing edges associated with each A&A domain? To answer the former question, we plot Figure 3, which shows the overlap between the top x A&A domains in our dataset (ranked by inclusion frequency by publishers) with all of the A&A domains included by the Alexa Top-5K websites.⁶ We observe that 99% of the 150 most frequent A&A domains appear in both samples, while 89% of the 500 most frequent appear in both. These findings confirm that our dataset includes the vast majority of prominent A&A domains that users are likely to encounter on the web.

⁶ Our dataset and the Alexa Top-5K data were both collected in December 2015, so they are temporally comparable.

To answer the second question, we plot Figure 4, which shows the number of unique external A&A domains contacted by A&A domains in our dataset as the crawl progressed (i.e., starting from the first page crawled, and ending with the last). Recall that the dataset was collected over nine consecutive crawls spanning two weeks of time, each of which visited 9,630 individual pages spread over 888 domains.

[Figure 3: x-axis "Top x A&A Domains" (0 to 1000); y-axis "% Overlap with A&A from Alexa Top-5K" (0 to 100).]
Fig. 3. Overlap between frequent A&A domains and A&A domains from Alexa Top-5K.

[Figure 4: x-axis "# Pages Crawled" (0 to 15K); y-axis "# Unique External A&A Domains" (0 to 900).]
Fig. 4. Unique A&A domains contacted by each A&A domain as we crawl more pages.

We observe that the number of A&A → A&A edges rises quickly initially, going from 0 to 800 in 3,600

Graph Type  |V|   |E|    |V_WCC|  |E_WCC|  Avg. Deg. (In)  Avg. Deg. (Out)  Avg. Path Length  Cluster. Coef.  S∆ [31]  Degree Assort.
Inclusion   1917  26099  1909     26099    13.612          13.612           2.748†            0.472‡          31.254‡  -0.31‡
Referer     1923  41468  1911     41468    21.564          21.564           2.429†            0.235‡          10.040‡  -0.29‡

Table 1. Basic statistics for Inclusion and Referer graph. We show sizes for the largest WCC in each graph. † denotes that the metric is calculated on the largest SCC. ‡ denotes that the metric is calculated on the undirected transformation of the graph.

crawled pages. Then, the growth slows down, requiring an additional 12,000 page visits to increase from 800 to 900. In other words, almost all A&A edges were discovered by half-way through the very first crawl; eight subsequent iterations of the crawl only uncovered 12.5% more edges. This demonstrates that the crawler reached the point of diminishing returns, indicating that the vast majority of connections between A&A domains that existed at the time are contained in the dataset.

that should be in the core of the network are incorrectly attached to publishers along the periphery.

Structure and Connectivity. As shown in Table 1, the Inclusion graph has large, well-connected components. The largest Weakly Connected Component (WCC) covers all but eight nodes in the Inclusion graph, meaning that very few nodes are completely disconnected. This highlights the interconnectedness of the ad ecosystem. The average node degree in the Inclusion graph is 13.6, and
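As an illustration of the edge weights defined under "Weights" above, the following minimal sketch builds both weighted graphs from a list of observed requests. The record fields (`parent`, `referer`, `dest`) and the sample data are hypothetical; the paper does not describe its data format at this level of detail.

```python
from collections import Counter

def build_weighted_graphs(requests):
    """Return (inclusion, referer) edge-weight Counters keyed by (src, dst).

    Each request is a dict with hypothetical keys:
      'parent'  - domain of the resource that issued the request
      'referer' - domain carried in the HTTP Referer header
      'dest'    - domain the request was sent to
    """
    inclusion = Counter()
    referer = Counter()
    for r in requests:
        # Inclusion graph: count requests sent by a resource from d_i to d_j.
        inclusion[(r["parent"], r["dest"])] += 1
        # Referer graph: count requests with Referer d_i and destination d_j.
        referer[(r["referer"], r["dest"])] += 1
    return inclusion, referer

# Publisher p embeds a script from a1; that script contacts a2 twice,
# but the browser reports p as the Referer for every request.
requests = [
    {"parent": "p", "referer": "p", "dest": "a1"},
    {"parent": "a1", "referer": "p", "dest": "a2"},
    {"parent": "a1", "referer": "p", "dest": "a2"},
]
inclusion, referer = build_weighted_graphs(requests)
print(dict(inclusion))
print(dict(referer))
```

In this toy input the Inclusion graph records a1 → a2 with weight 2, while the Referer graph attributes the same two requests to p → a2, which is the ambiguity the Inclusion graph is meant to remove.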


[Figure 5: size of the WCC, |WCC| (y-axis, 0 to 2000), versus k (x-axis, 0 to 70).]
Fig. 5. k-core: size of the Inclusion graph WCC as nodes with degree ≤ k are recursively removed.

Betweenness Centrality    Weighted PageRank
google-analytics          doubleclick
doubleclick               googlesyndication
googleadservices          2mdn
facebook                  adnxs
googletagmanager          google
googlesyndication         adsafeprotected
adnxs                     google-analytics
google                    scorecardresearch
addthis                   krxd
criteo                    rubiconproject
Table 2. Top 10 nodes ranked by betweenness centrality and weighted PageRank in the Inclusion graph.
network that have disassortative connectivity, which we examine in the next section.

4.2 Cores and Communities

We now examine how nodes in the Inclusion graph connect to each other using two metrics: k-cores and community detection. The k-core of a graph is the subset of a graph (nodes and edges) that remain after recursively removing all nodes with degree ≤ k. By increasing k, the loosely connected periphery of a graph can be stripped away, leaving just the dense core. In our scenario, this corresponds to the high-degree ad exchanges, ad networks, and trackers that facilitate the connections between publishers and advertisers.

Figure 5 plots k versus the size of the WCC for the Inclusion graph. The plot shows that the core of the Inclusion graph rapidly declines in size as k increases, which highlights the interdependence between A&A domains and the lack of a distinct core.

Next, to examine the community structure of the Inclusion graph, we utilized three different community detection algorithms: label propagation by Raghavan et al. [64], Louvain modularity maximization [12], and the centrality-based Girvan–Newman [27] algorithm. We chose these algorithms because they attempt to find communities using fundamentally different approaches.

Unfortunately, after running these algorithms on the largest WCC, the results of our community analysis were negative. Label propagation clustered all nodes into a single community. Louvain found 14 communities with an overall modularity score of 0.44 (on a scale of -1 to 1 where 1 is entirely disjoint clusters). The largest community contains 771 nodes (40% of all nodes) and 3252 edges (12% of all edges). Out of 771 nodes, 37% are A&A. However, none of the 14 communities corresponded to meaningful groups of nodes, either segmented by type (e.g., publishers, SSPs, DSPs, etc.) or segmented by ad exchange (e.g., customers and partners centered around DoubleClick). This is a known deficiency in modularity maximization based methods, that they tend to produce communities with no real-world correspondence [5]. Girvan–Newman found 10 communities, with the largest community containing 1,097 nodes (57% of all nodes) and 16,424 edges (63% of all edges). Out of 1,097 nodes, 64% are A&A. However, the modularity score was zero, which means that the Girvan–Newman communities contain a random assortment of internal and external (cross-cluster) edges.

Overall, these results demonstrate that the web display ad ecosystem is not balkanized into distinct groups of companies and publishers that partner with each other. Instead, the ecosystem is highly interdependent, with no clear delineations between groups or types of A&A companies. This result is not surprising considering how dense the Inclusion graph is.

4.3 Node Importance

In this section, we focus on the importance of specific nodes in the Inclusion graph using two metrics: betweenness centrality and weighted PageRank. As before, we focus on the largest WCC. The betweenness centrality for a node n is defined as the fraction of all shortest paths on the graph that traverse n. In our scenario, nodes with high betweenness centrality represent the key pathways for tracking information and impressions to flow from publishers to the rest of the ad ecosystem. For weighted PageRank, we weight each edge in the Inclusion graph based on the number of times we observe it in our raw data. In essence, weighted PageRank identifies the nodes that receive the largest amounts of tracking data and impressions throughout each graph.
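The recursive pruning used for the k-core analysis in § 4.2 can be sketched as follows. This is a minimal illustration on an undirected adjacency map, not the authors' implementation; the function name and the toy graph are hypothetical. Note that it follows the definition given in the text (remove nodes with degree ≤ k), which corresponds to what is usually called the (k+1)-core.

```python
def prune_low_degree(adj, k):
    """Recursively remove nodes with degree <= k.

    adj maps each node to the set of its neighbours (undirected graph).
    Returns the set of nodes that survive the pruning.
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    queue = [u for u, vs in adj.items() if len(vs) <= k]
    removed = set()
    while queue:
        u = queue.pop()
        if u in removed:
            continue
        removed.add(u)
        for v in adj[u]:
            adj[v].discard(u)
            # Removing u may push a neighbour below the threshold.
            if v not in removed and len(adj[v]) <= k:
                queue.append(v)
        adj[u] = set()
    return set(adj) - removed

# Toy graph: a triangle (a, b, c) with one pendant node d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
for k in range(3):
    print(k, sorted(prune_low_degree(toy, k)))
```

Plotting the number of survivors (or the size of their largest connected component) against k gives a curve of the kind shown in Figure 5.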

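As a rough illustration of the weighted PageRank metric introduced in § 4.3, the sketch below runs a plain power iteration in which a node splits its score among its outgoing edges in proportion to their weights. The damping factor of 0.85, the iteration count, and the toy edge list are assumptions made for this example; the paper does not state which parameters or implementation were used.

```python
def weighted_pagerank(edges, damping=0.85, iters=100):
    """edges: dict {(src, dst): weight}. Returns dict node -> score."""
    nodes = sorted({n for edge in edges for n in edge})
    n = len(nodes)
    out_weight = {u: 0.0 for u in nodes}
    for (src, _dst), w in edges.items():
        out_weight[src] += w
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Score held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if out_weight[u] == 0)
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for (src, dst), w in edges.items():
            # Each edge carries a share proportional to its weight.
            new[dst] += damping * rank[src] * w / out_weight[src]
        rank = new
    return rank

# Hypothetical weighted Inclusion-graph fragment.
edges = {
    ("pub1", "x"): 5,
    ("pub2", "x"): 3,
    ("pub2", "y"): 1,
    ("x", "y"): 2,
}
scores = weighted_pagerank(edges)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

Nodes that receive many heavily weighted edges accumulate score, which matches the interpretation in the text: high weighted PageRank marks domains that receive large volumes of requests.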

Table 2 shows the top 10 nodes in the Inclusion graph based on betweenness centrality and weighted PageRank. Prominent online advertising companies are well represented, including AppNexus (adnxs), Facebook, and Integral Ad Science (adsafeprotected). Similar to prior work, we find that Google’s advertising domains (including DoubleClick and 2mdn) are the most prominent overall [29]. Unsurprisingly, these companies all provide platforms, i.e., SSPs, ad exchanges, and ad networks. We also observe trackers like Google Analytics and Tag Manager. Interestingly, among 15 unique domains across the two lists, ten only appear in a single list. This suggests that the most important domains in terms of connectivity are not necessarily the ones that receive the highest volume of HTTP requests.

5 Information Diffusion

In § 4, we examined the descriptive characteristics of the Inclusion graph, and discussed the implications of this graph structure on our understanding of the online advertising ecosystem. In this section, we take the next step and present a concrete use case for the Inclusion graph: modeling the diffusion of user tracking data across the ad ecosystem under different types of ad and tracker blocking (e.g., AdBlock Plus and Ghostery). We model the flow of information across the Inclusion graph, taking into account different blocking strategies, as well as the design of RTB systems and empirically observed transition probabilities from our crawled dataset.

5.1 Simulation Goals

Simulation is an important tool for helping to understand the dynamics of the (otherwise opaque) online advertising industry. For example, Gill et al. used data-driven simulations to model the distribution of revenue amongst online display advertisers [26].

Here, we use simulations to examine the flow of browsing history data to trackers and advertisers. Specifically, we ask:
1. How many user impressions (i.e., page visits) to publishers can each A&A domain observe?
2. What fraction of the unique publishers that a user visits can each A&A domain observe?
3. How do different blocking strategies impact the number of impressions and fraction of publishers observed by each A&A domain?

These questions have direct implications for understanding users’ online privacy. The first two questions are about quantifying a user’s online footprint, i.e., how much of their browsing history can be recorded by different companies. In contrast, the third question investigates how well different blocking strategies perform at protecting users’ privacy.

5.2 Simulation Setup

To answer these questions, we simulate the browsing behavior of typical users using the methodology from Burklen et al. [14].9 In particular, we simulate a user browsing publishers over discrete time steps. At each time step our simulated user decides whether to remain on the current publisher according to a Pareto distribution (exponent = 2), in which case they generate a new impression on that publisher. Otherwise, the user browses to a new publisher, which is chosen based on a Zipf distribution over the Alexa ranks of the publishers. Burklen et al. developed this browsing model based on large-scale observational traces, and derived the distributions and their parameters empirically. This browsing model has been successfully used to drive simulated experiments in other work [40].

We generated browsing traces for 200 users. On average, each user generated 5,343 impressions on 190 unique publishers. The publishers are selected from the 888 unique first-party websites in our dataset (see § 3.1).

During each simulated time step the user generates an impression on a publisher, which is then forwarded to all A&A domains that are directly connected to the publisher. This emulates a webpage with multiple slots for display ads, each of which is serviced by a different SSP or ad exchange. However, it is insufficient to simply forward the impression to the A&A domains directly connected to each publisher; we also must account for ad exchanges and RTB auctions [10, 58], which may cause the impression to spread farther on the graph. We discuss this process next. The simulated time step ends when all impressions arrive at A&A domains that do not forward them. Once all outstanding impressions have terminated, time increments and our simulated user generates a new impression, either from their currently selected publisher or from a new publisher.

9 To the best of our knowledge, there are no other empirically validated browsing models besides [14].
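The browsing model of § 5.2 can be approximated with the short sketch below. It is one possible reading of the description, not the Burklen et al. model itself: the number of consecutive impressions on a publisher is drawn from a Pareto distribution with shape 2, and the next publisher is drawn with probability proportional to 1/rank. The Zipf exponent of 1, the function name, and the publisher list are assumptions made for illustration.

```python
import random

def simulate_browsing(publishers, n_impressions, alpha=2.0, zipf_s=1.0, seed=0):
    """Return a list with one publisher per simulated impression.

    publishers must be ordered by Alexa rank (most popular first).
    alpha is the Pareto shape; zipf_s is an assumed Zipf exponent.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** zipf_s) for rank in range(1, len(publishers) + 1)]
    trace = []
    while len(trace) < n_impressions:
        # Choose the next publisher with Zipf-like popularity.
        pub = rng.choices(publishers, weights=weights, k=1)[0]
        # paretovariate() returns values >= 1, so every visit yields
        # at least one impression on the chosen publisher.
        stay = int(rng.paretovariate(alpha))
        trace.extend([pub] * stay)
    return trace[:n_impressions]

publishers = ["site-a.example", "site-b.example", "site-c.example"]
trace = simulate_browsing(publishers, n_impressions=1000)
print(len(trace), len(set(trace)))
```

Each element of the returned trace corresponds to one time step, i.e., one impression that is then handed to the propagation stage.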


5.2.1 Impression Propagation

Our simulations must account for direct and indirect propagation of impressions. Direct flows occur when one A&A domain sells or redirects an impression to another A&A domain. We refer to these flows as “direct” because they are observable by the web browser, and are thus recorded in our dataset. Indirect flows occur when an ad exchange solicits bids on an impression. The advertisers in the auction learn about the impression, but this is not directly observable to the browser; only the winner is ultimately known.

Direct Propagation. To account for direct propagation, we assign a termination probability to each A&A node in the Inclusion graph that determines how often it serves an ad itself, versus selling the impression to a partner (and redirecting the user’s browser accordingly). We derive the termination probability for each A&A node empirically from our dataset. When an impression is sold, we determine which neighboring node purchases the impression based on the weights of the outgoing edges. For a node $a_i$, we define its set of outgoing neighbors as $N_o(a_i)$. The probability of selling to neighbor $a_j \in N_o(a_i)$ is $w(a_i \to a_j) / \sum_{a_y \in N_o(a_i)} w(a_i \to a_y)$, where $w(a_i \to a_j)$ is the weight of the given edge.

Figure 6 shows the termination probability for A&A nodes in the Inclusion graph. We see that 25% of the A&A nodes have a termination probability of one, meaning that they never sell impressions. The remaining 75% of A&A nodes exhibit a wide range of termination probabilities, corresponding to different business models and roles in the ad ecosystem. For example, DoubleClick, the most prominent ad exchange, has a termination probability of 0.35, whereas Criteo, a well-known advertiser specializing in retargeting, has a termination probability of 0.63.

[Figure 6: CDF (y-axis) of the termination probability per node (x-axis, 0 to 1).]
Fig. 6. CDF of the termination probability for A&A nodes.

Figure 7 shows the mean incoming edge weights for A&A nodes in the Inclusion graph. We observe that the distribution is highly skewed towards nodes with extremely high average incoming weights (note that the x-axis is in log scale). This demonstrates that heavy-hitters like DoubleClick, GoogleSyndication, OpenX, and Facebook are likely to purchase impressions that go up for auction in our simulations.

[Figure 7: CDF (y-axis) of the mean weight on incoming edges (x-axis, log scale, 1 to 100K).]
Fig. 7. CDF of the weights on incoming edges for A&A nodes.

Indirect Propagation. Unfortunately, precisely accounting for indirect propagation is not currently possible, since it is not known exactly which A&A domains are ad exchanges, or which pairs of A&A domains share information. To compensate, we evaluate three different indirect impression propagation models:
– Cookie Matching-Only: As we note in § 3.2, the Bashir et al. [10] dataset includes 200 empirically validated pairs of A&A domains that match cookies. In this model, we treat these 200 edges as ground-truth and only indirectly disseminate impressions along these edges. Specifically, if ai observes an impression, it will indirectly share with aj iff ai → aj exists and is in the set of 200 known cookie matching edges. This is the most conservative model we evaluate, and it provides a lower-bound on impressions observed by A&A domains.
– RTB Relaxed: In this model, we assume that each A&A domain that observes an impression, indirectly shares it with all A&A domains that it is connected to. Although this is the correct behavior for ad exchanges like Rubicon and DoubleClick, it is not correct for every A&A domain. This is the most liberal model we evaluate, and it provides an upper-bound on impressions observed by A&A domains.
– RTB Constrained: In this model, we select a subset of A&A domains E to act as ad exchanges. Whenever an A&A domain in E observes an impression, it shares it with all directly connected A&A domains, i.e., to solicit bids. This model represents a more realistic view of information diffusion than the Cookie Matching-Only and RTB Relaxed models because the graph contains few but extremely well connected exchanges.

For RTB Constrained, we select all A&A nodes with out-degree ≥ 50 and in/out degree ratio r in the range 0.7 ≤ r ≤ 1.7 to be in E. These thresholds were chosen after manually looking at the degrees and ratios for known ad exchanges and ad exchanges marked by Bashir et al. [10]. This results in |E| = 36 A&A nodes being chosen as ad exchanges (out of 1,032 total A&A domains in the Inclusion graph). We enforce restrictions on r because A&A nodes with disproportionately large amounts of incoming edges are likely to be trackers (information enters but is not forwarded out), while those with disproportionately large amounts of outgoing edges are likely SSPs (they have too few incoming edges to be an ad exchange). Table 6 in the appendix shows the domains in E, including major, known ad exchanges like App Nexus, Advertising.com, Casale Media, DoubleClick, Google Syndication, OpenX, Rubicon, Turn, and Yahoo. 150 of the 200 known cookie matching edges in our dataset are covered by this list of 36 nodes.

[Figure 8: four-panel diagram over publishers p1, p2, exchanges e1, e2, and advertisers a1 to a5. Legend: node types (publisher, exchange, DSP/advertiser); edge types (cookie matched, non-cookie matched); activation (direct, indirect). Observed impression counts per panel:
(a) Example Graph: all nodes 0; a false negative edge is marked.
(b) Cookie Matching: e1 = 1, e2 = 1, a1 = 1, a2 = 2, a3 = 0, a4 = 0, a5 = 2; a false negative impression is marked.
(c) RTB Constrained: e1 = 1, e2 = 1, a1 = 1, a2 = 2, a3 = 1, a4 = 0, a5 = 2.
(d) RTB Relaxed: e1 = 1, e2 = 1, a1 = 1, a2 = 2, a3 = 1, a4 = 2, a5 = 2; false positive impressions are marked.]
Fig. 8. Examples of our information diffusion simulations. The observed impression count for each A&A node is shown below its name. (a) shows an example graph with two publishers and two ad exchanges. Advertisers a1 and a3 participate in the RTB auctions, as well as DSP a2 that bids on behalf of a4 and a5. (b)–(d) show the flow of data (dark grey arrows) when a user generates impressions on p1 and p2 under three diffusion models. In all three examples, a2 purchases both impressions on behalf of a5, thus they both directly receive information. Other advertisers indirectly receive information by participating in the auctions.

Figure 8 shows hypothetical examples of how impressions disseminate under our indirect models. Figure 8(a) presents the scenario: a graph with two publishers connected to two ad exchanges and five advertisers. a2 is a bidder in both exchanges, and serves as a DSP for a4 and a5 (i.e., it services their ad campaigns by bidding on their behalf). Light grey edges capture cases where the two endpoints have been observed cookie matching in the ground-truth data. Edge e2 → a3 is a false negative because matching has not been observed along this edge in the data, but a3 must match with e2 to meaningfully participate in the auction.

Figure 8(b)–(d) show the flow of impressions under our three models. In all three examples, a user visits publishers p1 and p2, generating two impressions. Further, in all three examples a2 wins both auctions on behalf of a5; thus e1, e2, a2, and a5 are guaranteed to observe impressions. As shown in the figure, a2 and a5 observe both impressions, but other nodes may observe zero or more impressions depending on their position and the dissemination model. In Figure 8(b), a3 does not observe any impressions because its incoming edge has not been labeled as cookie matched; this is a false negative because a3 participates in e2’s auction. Conversely, in Figure 8(d), all nodes always share all impressions, thus a4 observes both impressions. However, these are false positives, since DSPs like a2 do not routinely share information amongst all their clients.

5.2.2 Node Blocking

To answer our third question, we must simulate the effect of “blocking” A&A domains on the Inclusion graph. A simulated user that blocks A&A domain aj will not make direct connections to it (the solid outlines in Figure 8). However, blocking aj does not prevent aj from tracking users indirectly: if the simulated user contacts ad exchange ai, the impression may be forwarded to aj during the bidding process (the dashed outlines in Figure 8). For example, an extension that blocks a2 in Figure 8 will prevent the user from seeing an ad, as well as prevent information flow to a4 and a5. However, blocking a2 does not stop information from flowing to e1, e2, a1, a3, and even a2!

We evaluate five different blocking strategies to compare their relative impact on user privacy under our three impression propagation models:
1. We randomly blocked 30% (310) of the A&A nodes from the Inclusion graph.10
2. We blocked the top 10% (103) of A&A nodes from the Inclusion graph, sorted by weighted PageRank.

10 We also randomly blocked 10% and 20% of A&A nodes, but the simulation results were very similar to that of random 30%.

3. We blocked all 594 A&A nodes from the Ghostery [25] blacklist.
4. We blocked all 412 A&A nodes from the Disconnect [18] blacklist.
5. We emulated the behavior of AdBlock Plus [2], which is a combination of whitelisting A&A nodes from the Acceptable Ads program [73], and blacklisting A&A nodes from EasyList [19]. After whitelisting, 634 A&A nodes are blocked.

We chose these methods to explore a range of graph theoretic and practical blocking strategies. Prior work has shown that the global connectivity of small-world graphs is resilient against random node removal [13], but we would like to empirically determine if this is true for ad network graphs as well. In contrast, prior work also shows that removing even a small fraction of top nodes from small-world graphs causes the graph to fracture into many subgraphs [50, 74]. Ghostery and Disconnect are two of the most widely-installed tracker blocking browser extensions, so evaluating their blacklists allows us to quantify how good they are at protecting users’ privacy. Finally, AdBlock Plus is the most popular ad blocking extension [45, 62], but contrary to its name, by default it whitelists A&A companies that pay to be part of its Acceptable Ads program [3]. Thus, we seek to understand how effective AdBlock Plus is at protecting user privacy under its default behavior.

5.3 Validation

To confirm that our simulations are representative of our ground-truth data, we perform some sanity checks. We simulate a single user in each model (who generates 5K impressions) and compare the resulting simulated inclusion trees to the original, real inclusion trees.

[Figure 9: percentile bars for Original, RTB-R, RTB-C, and CM. Panels: (a) Number of nodes (y-axis: # Nodes Activated, 0 to 300); (b) Tree depth (y-axis: Tree Depth, 0 to 6).]
Fig. 9. Comparison of the original and simulated inclusion trees. Each bar shows the 5th, 25th, 50th (in black), 75th, and 95th percentile value.

First, we look at the number of nodes that are activated by direct propagation in trees rooted at each publisher. Figure 9a shows that our models are conservative in that they generate smaller trees: the median original tree contains 48 nodes, versus 32, seven, and six from our models. One caveat to this is that publishers in our simulated trees have a wider range of fan-outs than in the original trees. The median publishers in the original and simulated trees have 11 and 12 neighbors, respectively, but the 75th percentile trees have 16 and 30 neighbors, respectively.

Second, we investigate the depth of the inclusion trees. As shown in Figure 9b, the median tree depth in the original trees is three, versus two in all our models. The 75th percentile tree depth in the original data is four, versus three in the RTB Relaxed and RTB Constrained models, and two in the most restrictive Cookie Matching-Only model. These results show that overall, our models are conservative in that they tend to generate slightly shorter inclusion trees than reality.

Third, we look at the set of A&A domains that are included in trees rooted at each publisher. For a publisher p that contacts a set $A_p^o$ of A&A domains in our original data, we calculate $f_p = |A_p^s \cap A_p^o| / |A_p^o|$, where $A_p^s$ is the set of A&A domains contacted by p in simulation. Figure 10 plots the CDF of $f_p$ values for all publishers in our dataset, under our three models. We observe that for almost 80% of publishers, 90% of the A&A domains contacted in the original trees are also contacted in trees generated by the RTB Relaxed model. This falls to 60% and 16% as the models become more restrictive.

Fourth, we examine the number of ad exchanges that appear in the original and simulated trees. Examining the ad exchanges is critical, since they are responsible for all indirect dissemination of impressions. As shown in Figure 11, inclusion trees from our simulations contain an order of magnitude fewer ad exchanges than the original inclusion trees, regardless of model.11 This suggests that indirect dissemination of impressions in our models will be conservative relative to reality.

11 Because each of our models assumes that a different set of A&A nodes are ad exchanges, we must perform three corresponding counts of ad exchanges in our original trees.

Number of Selected Exchanges. Finally, we investigate the impact of exchanges in the RTB Constrained model. We select the top x A&A domains by out-degree to act as exchanges (subject to their in/out degree ratio r being in the range 0.7 ≤ r ≤ 1.7), then execute a simulation. As shown in Figure 12, with 20


[Figure 10: CDF (Frac. of Publishers) vs. Frac. of A&A Contacted; series: CM, RTB-C, RTB-R]
Fig. 10. CDF of the fractions of A&A domains contacted by publishers in our original data
that were also contacted in our three simulated models.

[Figure 11: CDF vs. # of Ad Exchanges per Tree (log scale); series: Original and Simulation
for CM, RTB-C, RTB-R]
Fig. 11. Number of ad exchanges in our original (solid lines) and simulated (dashed lines)
inclusion trees.

[Figure 12: CDF vs. Fraction of Impressions; series: x = 5, 10, 20, 30, 50, 100]
Fig. 12. Fraction of impressions observed by A&A domains in the RTB-C model when the top x
exchanges are selected.
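
The exchange-selection rule described under "Number of Selected Exchanges" (top x A&A domains by out-degree, with in/out degree ratio r in 0.7 ≤ r ≤ 1.7) can be sketched as follows. This is a minimal illustration, not our simulation code: the function name `select_exchanges`, the edge-list input, the use of unweighted degrees (distinct neighbors), filtering by ratio before taking the top x, and alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import defaultdict


def select_exchanges(edges, x, r_min=0.7, r_max=1.7):
    """Pick up to x nodes by out-degree whose in/out degree ratio is in [r_min, r_max].

    edges: iterable of (source, destination) pairs between A&A domains.
    Degrees count distinct neighbors (an assumption of this sketch).
    """
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)

    candidates = []
    # Only nodes with at least one outgoing edge appear here, so the
    # ratio below never divides by zero.
    for node, outs in out_nbrs.items():
        out_deg = len(outs)
        in_deg = len(in_nbrs.get(node, ()))
        ratio = in_deg / out_deg
        if r_min <= ratio <= r_max:
            candidates.append((node, out_deg))

    # Highest out-degree first; ties broken by name for determinism.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return [node for node, _ in candidates[:x]]
```

Varying `x` over 5, 10, 20, 30, 50, and 100 and re-running a simulation for each selection corresponds to the sweep plotted in Fig. 12.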

Blocking        Cookie Matching-Only    RTB Constrained    RTB Relaxed
Scenarios       %E      %W              %E      %W         %E      %W
No Blocking     16.9    31.0            33.9    55.9       71.8    81.3
AdBlock Plus    12.3    28.0            25.6    50.3       48.4    68.6
Random 30%      12.1    21.8            22.1    34.2       48.7    54.8
Ghostery        3.52    9.87            6.82    18.2       13.5    21.9
Top 10%         6.03    5.01            8.18    5.52       26.8    13.4
Disconnect      2.98    3.66            4.72    6.01       16.3    11.6

Table 3. Percentage of Edges (%E) that are triggered in the Inclusion graph during our
simulations under different propagation models and blocking scenarios. We also show the
percentage of edge Weights (%W) covered via triggered edges.

Cookie Matching-Only        RTB Constrained             RTB Relaxed
doubleclick        90.1     google-analytics   97.1     pinterest          99.1
criteo             89.6     quantserve         92.0     doubleclick        99.1
quantserve         89.5     scorecardresearch  91.9     twitter            99.1
googlesyndication  89.0     youtube            91.8     googlesyndication  99.0
flashtalking       88.8     skimresources      91.6     scorecardresearch  99.0
mediaforge         88.8     twitter            91.3     moatads            99.0
adsrvr             88.6     pinterest          91.2     quantserve         99.0
dotomi             88.6     criteo             91.2     doubleverify       99.0
steelhousemedia    88.6     addthis            91.1     crwdcntrl          99.0
adroll             88.6     bluekai            91.1     adsrvr             99.0

Table 4. Top 10 nodes that observed the most impressions under our simulations with no
blocking.
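
To make comparisons across the rows of Table 3 concrete, the sketch below computes how much each blocking scenario reduces triggered edges (%E) or covered edge weight (%W) relative to the No Blocking baseline. The values are copied from Table 3; the dictionary layout, the helper name `relative_reduction`, and the model keys CM, RTB-C, and RTB-R (the abbreviations used in our figure legends) are choices made for this illustration only.

```python
# (%E, %W) per blocking scenario and propagation model, copied from Table 3.
TABLE_3 = {
    "No Blocking":  {"CM": (16.9, 31.0), "RTB-C": (33.9, 55.9), "RTB-R": (71.8, 81.3)},
    "AdBlock Plus": {"CM": (12.3, 28.0), "RTB-C": (25.6, 50.3), "RTB-R": (48.4, 68.6)},
    "Random 30%":   {"CM": (12.1, 21.8), "RTB-C": (22.1, 34.2), "RTB-R": (48.7, 54.8)},
    "Ghostery":     {"CM": (3.52, 9.87), "RTB-C": (6.82, 18.2), "RTB-R": (13.5, 21.9)},
    "Top 10%":      {"CM": (6.03, 5.01), "RTB-C": (8.18, 5.52), "RTB-R": (26.8, 13.4)},
    "Disconnect":   {"CM": (2.98, 3.66), "RTB-C": (4.72, 6.01), "RTB-R": (16.3, 11.6)},
}


def relative_reduction(scenario, model, metric="E"):
    """Fractional drop versus No Blocking for the same model.

    metric: "E" for triggered edges, "W" for covered edge weight.
    """
    if metric not in ("E", "W"):
        raise ValueError("metric must be 'E' or 'W'")
    idx = 0 if metric == "E" else 1
    baseline = TABLE_3["No Blocking"][model][idx]
    value = TABLE_3[scenario][model][idx]
    return 1.0 - value / baseline


if __name__ == "__main__":
    for scenario in TABLE_3:
        drop = relative_reduction(scenario, "RTB-C", "W")
        print(f"{scenario:13s} RTB-C weight reduction: {drop:.1%}")
```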

or more exchanges the distribution of impressions observed by A&A domains stops growing,
i.e., our RTB Constrained model is relatively insensitive to the number of exchanges. This
is not surprising, given how dense the Inclusion graph is (see § 4). We observed similar
results when we picked top nodes based on PageRank.

5.4 Results

We take our 200 simulated users and “play back” their browsing traces over the unmodified
Inclusion graph, as well as graphs where nodes have been blocked using the strategies
outlined above. We record the total number of impressions observed by each A&A domain, as
well as the fraction of unique publishers observed by each A&A domain under different
impression propagation models.

Triggered Edges. Table 3 shows the percentage of edges between A&A nodes that are triggered
in the Inclusion graph under different combinations of impression propagation models and
blocking strategies. No blocking/RTB Relaxed is the most permissive case; all other cases
have fewer edges and less weight because (1) the propagation model prevents specific A&A
edges from being activated and/or (2) the blocking scenario explicitly removes nodes.
Interestingly, AdBlock Plus fails to have significant impact relative to the No Blocking
baseline, in terms of removing edges or weight, under the Cookie Matching-Only and RTB
Constrained models. Further, the top 10% blocking strategy removes fewer edges than
Disconnect or Ghostery, but it reduces the remaining edge weight to roughly the same level
as Disconnect, whereas Ghostery leaves more high-weight edges intact. These observations
help to explain the outcomes of our simulations, which we discuss next.

No Blocking. First, we discuss the case where no A&A nodes are blocked in the graph.
Figure 13 shows the fraction of total impressions (out of ∼5,300) and fraction of unique
publishers (out of ∼190) observed by A&A domains under different propagation models. We
find that the distribution of observed impressions under RTB Constrained is very similar to
that of RTB Relaxed, whereas observed impressions drop dramatically under the Cookie
Matching-Only model. Specifically, the top 10% of A&A nodes in the Inclusion graph (sorted
by impression count) observe more than 97% of the impressions in RTB Relaxed, 90% in RTB
Constrained, and 29% in Cookie Matching-Only. We observe similar patterns for fractions of
publishers observed across the three indirect propagation models. Recall that the Cookie
Matching-Only and RTB Relaxed models function as lower- and upper-bounds on observability;
that


[Figure 13: CDF vs. Observed Fraction; series: Impressions and Publishers for Cookie
Matching-Only, RTB Constrained, RTB Relaxed]
Fig. 13. Fraction of impressions (solid lines) and publishers (dashed lines) observed by
A&A domains under our three models, without any blocking.

[Figure 14: two panels, CDF vs. Fraction of Impressions, each with No Blocking as baseline.
(a) Disconnect, Ghostery, AdBlock Plus. (b) Top 10% and Random 30% of nodes.]
Fig. 14. Fraction of impressions observed by A&A domains under the RTB Constrained (dashed
lines) and RTB Relaxed (solid lines) models, with various blocking strategies.
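
The curves in Fig. 13 and Fig. 14 are empirical CDFs over A&A domains, so a statement of the form "90% of the nodes see less than 40% of the impressions" is a point read off such a curve. The sketch below shows that reading on toy data; the function names and the sample values are hypothetical and are not taken from our measurements.

```python
def share_of_domains_below(observed_fractions, threshold):
    """Share of A&A domains whose observed fraction is strictly below threshold."""
    if not observed_fractions:
        raise ValueError("need at least one domain")
    below = sum(1 for f in observed_fractions if f < threshold)
    return below / len(observed_fractions)


def empirical_cdf(observed_fractions):
    """Sorted (value, cumulative share) points, i.e., the curve a CDF plot draws."""
    ordered = sorted(observed_fractions)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


if __name__ == "__main__":
    # Toy per-domain fractions of impressions observed (hypothetical values).
    toy = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.38, 0.39, 0.90]
    print(share_of_domains_below(toy, 0.40))
    print(empirical_cdf(toy)[:3])
```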

the results from the RTB Constrained model are so similar to the RTB Relaxed model is
striking, given that only 36 nodes in the former spread impressions indirectly, versus
1,032 in the latter.

Although the overall fraction of observed impressions drops significantly in the Cookie
Matching-Only model, Table 4 shows that the top 10 A&A domains observe 99%, 96%, and 89% of
impressions on average under RTB Relaxed, RTB Constrained, and Cookie Matching-Only,
respectively. Some of the top ranked nodes are expected, like DoubleClick, but other cases
are more interesting. For example, Pinterest is connected to 178 publishers and 99 other
A&A domains. In the Cookie Matching-Only model, it ranks 47 because it is directly embedded
in relatively few publishers, but it ascends to rank seven under RTB Constrained and rank
one under RTB Relaxed, once indirect sharing is accounted for. This drives home the point
that although Google is the most pervasively embedded advertiser around the web [15, 65],
there are roughly 52 other A&A companies that also observe greater than 91% of users’
browsing behaviors (in the RTB Constrained model), due to their participation in major ad
exchanges.

With Blocking. Next, we discuss the results when AdBlock Plus (i.e., the Acceptable Ads
whitelist and EasyList blacklist) is used to block nodes. AdBlock Plus has essentially zero
impact on the fraction of impressions observed by A&A domains: the results in Figure 14a
under the RTB Constrained and RTB Relaxed models are almost coincident with those for the
models when no blocking is applied at all. The problem is that the major ad networks and
exchanges are all present in the Acceptable Ads whitelist, and thus all of their partners
are also able to observe the impressions, even if they are (sometimes) prevented from
actually showing ads to the user. Indeed, the top 10 nodes in Table 4 with no blocking and
in Table 5 with AdBlock Plus are almost identical, save for some reordering.

Next, we examine Ghostery and Disconnect in Figure 14a. As expected, the amount of
information seen by A&A domains decreases when we block domains from these blacklists.
Disconnect’s blacklist does a much better job of protecting users’ privacy in our
simulations: after blocking nodes using the Disconnect blacklist, 90% of the nodes see less
than 40% of the impressions in the RTB Constrained model, and less than 53% in the RTB
Relaxed model. In contrast, when using the Ghostery blacklist, 90% of the nodes see less
than 75% of the impressions in both RTB models. Table 5 shows that the top 10 A&A domains
are only able to observe at most 40–59% and 73–83% of impressions when the Disconnect and
Ghostery blacklists are used, respectively, depending on the indirect propagation model.

As shown in Figure 14b, blocking the top 10% of A&A nodes from the Inclusion graph (sorted
by weighted PageRank) causes almost as much reduction in observed impressions as
Disconnect. Table 5 helps to orient the top 10% blocking strategy versus Disconnect and
Ghostery in terms of overall reduction in impression observability and the impact on
specific A&A domains. In contrast, blocking 30% of the A&A nodes at random has more impact
than AdBlock Plus, but less than Disconnect and Ghostery. The top 10 nodes under the “no
blocking” and “random 30%” (not shown) strategies observe similar impression fractions.
Both of these results agree with the theoretical expectations for small-world graphs, i.e.,
their connectivity is resilient against random blocking, but not necessarily targeted
blocking.

We do not show results for our most restrictive model (i.e., Cookie Matching-Only) in
Figure 14, since the majority of A&A companies view almost zero impressions. Specifically,
90% of A&A companies view less
