Diffusion of User Tracking Data in the Online Advertising Ecosystem
Proceedings on Privacy Enhancing Technologies; 2018 (4):85–103

Muhammad Ahmad Bashir and Christo Wilson

Muhammad Ahmad Bashir: Northeastern University, E-mail: ahmad@ccs.neu.edu
Christo Wilson: Northeastern University, E-mail: cbw@ccs.neu.edu

Abstract: Advertising and Analytics (A&A) companies have started collaborating more closely with one another due to the shift in the online advertising industry towards Real Time Bidding (RTB). One natural way to understand how user tracking data moves through this interconnected advertising ecosystem is by modeling it as a graph. In this paper, we introduce a novel graph representation, called an Inclusion graph, to model the impact of RTB on the diffusion of user tracking data in the advertising ecosystem. Through simulations on the Inclusion graph, we provide upper and lower estimates on the tracking information observed by A&A companies. We find that 52 A&A companies observe at least 91% of an average user's browsing history under reasonable assumptions about information sharing within RTB auctions. We also evaluate the effectiveness of blocking strategies (e.g., AdBlock Plus), and find that major A&A companies still observe 40–90% of user impressions, depending on the blocking strategy.

Keywords: Online Tracking, RTB, Cookie Matching
DOI 10.1515/popets-2018-0033
Received 2018-02-28; revised 2018-06-15; accepted 2018-06-16.

1 Introduction

In the last decade, the online display advertising industry has massively grown in size and scope. According to the Interactive Advertising Bureau (IAB), revenue from the online display ad industry in the U.S. totaled $88B in 2017, a growth of 21.4% from 2016 [63]. This increased spending is fueled by advances that enable advertisers to target users with increasing levels of precision, even across different devices and platforms.

Another recent change in the online display advertising ecosystem is the shift from ad networks to ad exchanges, where advertisers bid on impressions being sold in Real Time Bidding (RTB) auctions. The rise of RTB has forced Advertising and Analytics (A&A) companies to collaborate more closely with one another, in order to exchange data about users and facilitate bidding on impressions [10, 58]. The move towards RTB has also caused A&A companies to specialize into particular roles. For example, Supply-Side Platforms (SSPs) work with publishers (e.g., CNN) to help manage their relationship with ad exchanges, while Demand-Side Platforms (DSPs) try to optimize ad placement and bidding on behalf of advertisers. In short, due to RTB, the online advertising ecosystem has become enormously complex.

A natural way to model this complex ecosystem is in the form of a graph. Graph models that accurately capture the relationships between publishers and A&A companies are extremely important for practical applications, such as estimating revenue of A&A companies [26], predicting whether a given domain is a tracker [34], or evaluating the effectiveness of domain-blocking strategies on preserving users' privacy.

However, to date, technical limitations have prevented researchers from developing accurate graph models of the online advertising ecosystem. For example, Gomer et al. [29] propose a Referer graph, where nodes represent publishers or A&A domains, and two nodes a_i and a_j are connected if an HTTP message to a_j is observed with a_i as the HTTP Referer. Unfortunately, as we will show, graphs built using Referer information may contain erroneous edges in cases where a third-party script is embedded directly into a first-party context (i.e., is not sandboxed in an iframe).

In this paper, to model the diffusion of user tracking data within RTB auctions, we propose a novel and accurate representation of the advertising graph called an Inclusion graph. The Inclusion graph corrects the technical problem of the Referer graph by using the actual inclusion relationships between domains to represent edges, rather than imprecise Referer relationships. We are able to construct Inclusion graphs thanks to advances in browser instrumentation that allow researchers to conduct web crawls that record the exact provenance of all HTTP(S) requests [6, 10, 41].

We use crawled data consisting of around 2M impressions from popular e-commerce websites collected
by a specially instrumented version of Chrome [10] to construct the Inclusion graph. In § 4, we examine the fundamental graph properties of the Inclusion graph and compare it to a Referer graph, created using the same dataset, to understand their salient differences. In § 5, we demonstrate a concrete use case for the Inclusion graph by using simulations to model the flow of tracking data to A&A companies. Furthermore, we compare the efficacy of different real-world and graph theoretic "blocking" strategies (e.g., AdBlock Plus [2], Ghostery [25], and Disconnect [18]) at reducing the flow of tracking information to A&A companies.

Overall, we make the following key contributions:
– We introduce the Inclusion graph as a model for capturing the complexity of the online advertising ecosystem. We use the Inclusion graph as a substrate for modeling the flow of impressions to A&A companies by taking into account the browsing behavior of users and the dynamics of RTB auctions.
– We find that the Inclusion graph has substantive differences in graph structure compared to the Referer graph because 48.4% of resource inclusions in our crawled data have an inaccurate Referer.
– Through simulations, we find that 52 A&A companies are each able to observe 91% of an average user's impressions as they browse, under modest assumptions about data sharing in RTB auctions. 636 A&A companies are able to observe at least 50% of an average user's impressions. Even under the strictest simulation assumptions, the top 10 A&A companies observe 89–99% of all user impressions.
– We simulate the effect of five blocking strategies, and find that AdBlock Plus (the world's most popular ad blocking browser extension [45, 62]) is ineffective at protecting users' privacy because major ad exchanges are whitelisted under the Acceptable Ads program [73]. In contrast, Disconnect blocks the most information flows to A&A companies, followed by removal of top 10% A&A nodes. However, even with strong blocking, major A&A companies still observe 40–80% of user impressions.

The raw data we use in this study is publicly available.¹ We have also publicly released the source code and data from this study.²

¹ http://personalization.ccs.neu.edu/Projects/Retargeting/
² http://personalization.ccs.neu.edu/Projects/AdGraphs/

2 Background and Related Work

In this section, we review technical details of and current computer science research on the online display advertising ecosystem. We start by discussing related work on user privacy and tracking. Next, we present examples of the current display ad serving process and define the roles of different actors in the ecosystem, followed by a brief overview of efforts to empirically measure these processes. Lastly, we examine prior work that modeled the ad ecosystem as a graph.

2.1 Tracking and Blocking

To show relevant ads to users, advertisers rely heavily on collecting information about users as they browse the web. This data collection is achieved by embedding trackers into webpages that gather browsing information about each user.

The area of tracking has been well studied. Krishnamurthy et al. and others have documented the pervasiveness of trackers and the associated user privacy implications over time [15, 20, 26, 33, 37–39]. Furthermore, tracking techniques have evolved over time. Persistent cookies [35], local state in browser plugins [7, 68, 69], and various browser fingerprinting methods [1, 21, 36, 51, 55, 57, 65] are some of the techniques that have been deployed to track users. Englehardt et al. [20] found evidence of tracking via the Audio and Battery Status JavaScript APIs. In addition to tracking users themselves, advertisers try to maximize their knowledge of each user's interest profile by sharing information with each other via cookie matching [1, 10, 23, 58]. Falahrastegar et al. examine how tracking differs across geographic regions [22].

Users have become increasingly concerned with the amount and types of tracking information collected about them [47, 70]. Several surveys have investigated users' concerns about targeted ads, their preferences towards tracking, and usage of privacy tools [8, 42, 48, 66, 71]. Concerns about the privacy implications of tracking (as well as the insecurity of online ad networks [75]) have led to increased adoption of tools that block trackers and ads. Two studies have examined the usage of ad blockers in the wild [45, 62], while Walls et al. looked at efforts to whitelist "acceptable advertisers" [73]. Merzdovnik et al. critically examined the effectiveness of tracker blocking tools [49]; in contrast, Nithyanand et al. studied advertisers' efforts to counter
ad blockers [56]. Mughees et al. examined the prevalence of anti-ad blockers in the wild [53]. In this work, we expand on the existing blocking literature by taking the effects of ad auctions and cookie matching into account.

The research community has proposed a variety of mechanisms to stop online tracking that go beyond blacklists of domains and URLs. Li et al. [43] and Ikram et al. [32] used machine learning to identify trackers, while Papaodyssefs et al. [60] examined the use of private cookies to avoid being tracked. Nikiforakis et al. propose the complementary idea of adding entropy to the browser to evade fingerprinting [54]. However, despite these efforts, third-party trackers are still pervasive and pose real privacy issues to users [49].

2.2 The Online Advertising Ecosystem

Numerous studies have chronicled the online advertising ecosystem, which is composed of companies that: track users, serve ads, act as platforms between publishers (websites that rely on advertising revenue to pay for content creation) and advertisers, or all of the above. Mayer et al. present an accessible introduction to this topic in [46]. In this work, we collectively refer to companies engaged in analytics and advertising as A&A companies.

Recently, the online ad ecosystem has begun to shift from ad networks to ad exchanges, which implement Real Time Bidding (RTB) auctions to sell impressions to advertisers. In the advertising industry, the term "impression" is used when advertising or tracking content is rendered in a user's browser after they visit a webpage [17]. To participate in RTB auctions, A&A companies must implement cookie matching, which is a process by which different A&A companies exchange their unique tracking identifiers for specific users. Several studies have examined the emergence of cookie matching [1, 10, 23, 58]. Ghosh et al. theoretically model the incentives for A&A companies to collaborate with their competitors in RTB auction systems [24].

[Figure 1: (a) Cookie Matching Example; (b) RTB Example with Two Exchanges and Two Auctions. Legend: Publisher, Exchange, SSP, DSP/Advertiser; HTTP(S) Request/Response, RTB Bidding.]

Fig. 1. Examples of (a) cookie matching and (b) showing an ad to a user via RTB auctions. (a) The user visits publisher p1 ①, which includes JavaScript from advertiser a1 ②. a1's JavaScript then cookie matches with exchange e1 by programmatically generating a request that contains both of their cookies ③. (b) The user visits publisher p2, which then includes resources from SSP s1 and exchange e2 ①–③. e2 solicits bids ④ and sells the impression to e1 ⑤ ⑥, which then holds another auction ⑦, ultimately selling the impression to a1 ⑧ ⑨.

Figure 1(a) illustrates the typical process used by A&A companies to match cookies. When a user visits a website ①, JavaScript code from a third-party advertiser a1 is automatically downloaded and executed in the user's browser ②. This code may set a cookie in the user's browser, but this cookie will be unique to a1, i.e., it will not contain the same unique identifiers as the cookies set by any other A&A companies. Furthermore, the Same Origin Policy (SOP) prevents a1's code from reading the cookies set by any other domain. To facilitate bidding in future RTB auctions, a1 must match its cookie to the cookie set by an ad exchange like e1. As shown in the figure, a1's JavaScript accomplishes this by programmatically causing the browser to send a request to e1 ③. The JavaScript includes a1's cookie in the request, and the browser automatically adds a copy of e1's cookie, thus allowing e1 to create a match between its cookie and a1's.

Figure 1(b) shows an example of how an ad may be shown on publisher p2 using RTB auctions. When a user visits p2 ①, JavaScript code is automatically downloaded and executed either from a Supply Side Platform (SSP) ② or an ad exchange. SSPs are A&A companies that specialize in maximizing publisher revenue by forwarding impressions to the most lucrative ad exchange. Eventually the impression arrives at the auction held by ad exchange e2 ③, and e2 solicits bids from advertisers and Demand Side Platforms (DSPs) ④. DSPs are A&A companies that specialize in executing ad campaigns on behalf of advertisers. Note that all participants in the auction observe the impression; however, because only e2's cookie is available at this point, auction participants that have not matched cookies with e2 will not be able to identify the user.

The process of filling an impression may continue even after an RTB auction is won, because the winner may be yet another ad exchange or ad network. As shown in Figure 1(b), the impression is purchased from e2 by e1 ⑤ ⑥, who then holds another auction ⑦ and ultimately sells to a1 (the advertiser from the cookie matching example) ⑧ ⑨. Ad exchanges and ad networks
routinely match cookies with each other to facilitate the flow of impression inventory between markets.

Measurement Studies. Barford et al. broadly characterized the web adscape and identified systematically important ad networks [9]. Rodriguez et al. measured the ad ecosystem that serves mobile devices [72], while Zarras et al. specifically examined ad networks that serve malicious ads [75]. Gill et al. modeled the revenue earned by different A&A companies [26], while other studies have used empirical measurements to determine the value of individual users to online advertisers [58, 59]. Many studies have used a variety of methods to study the targeted ads that are displayed to users under a variety of circumstances [9–11, 16, 30, 44].

2.3 Ad Ecosystem Graphs

A natural structure for modeling the online ad ecosystem is a graph, where nodes represent publishers and A&A companies, and edges capture relationships between these entities. Gomer et al. [29] built and analyzed graphs of the ad ecosystem by making use of the Referer field from HTTP requests. In this representation, a relationship d_i → d_j exists if there is an HTTP request to domain d_j with a Referer header from domain d_i.

While Gomer et al. provided interesting insights into the structure of the ad ecosystem, their referral-based graph representation has a significant limitation. As we describe in § 3.3, relying on the HTTP Referer does not always capture the correct relationships between A&A parties, thus leading to incorrect graphs of the ad ecosystem. We re-create this graph representation using our dataset (see § 3) and compare its properties to a more accurate representation in § 4.

Kalavri et al. [34] created a bipartite graph of publishers and associated A&A domains, then transformed it to create an undirected graph consisting solely of A&A domains. In their representation, two A&A domains are connected if they were included by the same publisher. This construction leads to a highly dense graph with many complete cliques. Kalavri et al. leveraged the tight community structure of A&A domains to predict whether new, unknown URLs were A&A or not. However, this co-occurrence representation has a conceptual shortcoming: it may include edges between A&A domains that do not directly communicate or have any business relationship. Due to this shortcoming, we do not explore this graph representation in this work.

3 Methodology

Our goal is to capture the most accurate representation of the online advertising ecosystem, which will allow us to model the effect of RTB on diffusion of user tracking data. In this section, we introduce the dataset used in this study and describe how we use it to build a graph representation of the ad ecosystem.

3.1 Dataset

In this work, we use the dataset provided by Bashir et al. [10]. The goal of [10] was to causally infer the information sharing relationships between A&A companies by (1) crawling products from popular e-commerce websites and then (2) observing corresponding retargeted ads on publishers. Bashir et al. conducted web crawls that covered 738 major e-commerce websites (e.g., Amazon) and 150 popular publishers (e.g., CNN).³ The authors chose top e-commerce sites from Alexa's hierarchical list of online shops [4], and manually chose publishers from the Alexa Top-1K. They crawled 10 manually selected products per e-commerce site to signal strong intent to trackers and advertisers, followed by 15 randomly chosen pages per publisher to elicit display ads. In total, Bashir et al. repeated the entire crawl nine times, resulting in data for around 2M impressions.

³ For simplicity, we refer to these e-commerce websites as publishers, to distinguish them from A&A domains.

3.2 Inclusion Trees

Bashir et al. [10] used a specially instrumented version of Chromium for their web crawls. Their crawler recorded the inclusion tree for each webpage, which is a data structure that captures the semantic relationships between elements in a webpage (as opposed to the DOM, which captures syntactic relationships) [6, 41]. The crawler also recorded all HTTP request and response headers associated with each visited URL.

To illustrate the importance of inclusion trees, consider the example webpage shown in Figure 2(a). The DOM shows that the page from publisher p ultimately includes resources from four third-party domains (a1 through a4). It is clear from the DOM that the request to a3 is responsible for causing the request to a4, since the script inclusion is within the iframe. However, it
is not clear which domain generated the requests to a2 and a3: the img and iframe could have been embedded in the original HTML from p, or these elements could have been created dynamically by the script from a1. In this case, the inclusion tree shown in Figure 2(b) reveals that the image from a2 was dynamically created by the script from a1, while the iframe from a3 was embedded directly in the HTML from p.

[Figure 2: (a) DOM Tree for http://p.com/index.html; (b) Inclusion Tree; (c) Inclusion Graph; (d) Referer Graph. Legend: Publisher, A&A.]

Fig. 2. An example HTML document and the corresponding inclusion tree, Inclusion graph, and Referer graph. In the DOM representation, the a1 script and a2 img appear at the same level of the tree; in the inclusion tree, the a2 img is a child of the a1 script because the latter element created the former. The Inclusion graph has a 1:1 correspondence with the inclusion tree. The Referer graph fails to capture the relationship between the a1 script and a2 img because they are both embedded in the first-party context, while it correctly attributes the a4 script to the a3 iframe because of the context switch.

The instrumented Chromium binary used by Bashir et al. was able to correctly determine the provenance of webpage elements, regardless of how they were created (e.g., directly in HTML, via inline or remotely included script tags, dynamically via eval(), etc.), or where they were located (in the main context or within iframes). This was accomplished by tagging all scripts with provenance information (i.e., first-party for inline scripts), and then dynamically monitoring the execution of each script. New scripts created during the execution of a given script (e.g., via document.write()) were linked to their parent.⁴ More details about how Chromium was instrumented and inclusion trees were extracted are available in [6].

⁴ Note that JavaScript within a given page context executes serially, so there is no ambiguity created by concurrency. Although Web Workers may execute concurrently, they cannot include third party scripts or modify the DOM.

Cookie Matching. The Bashir et al. dataset also includes labels on edges of the inclusion trees indicating cases where cookie matching is occurring. These labels are derived from heuristics (e.g., string matching to identify the passing of cookie values in HTTP parameters) and causal inferences based on the presence of retargeted ads. We use this data in § 5 to constrain some of our simulations.

3.3 Graph Construction

A natural way to model the online ad ecosystem is using a graph. In this model, nodes represent A&A companies, publishers, or other online services. Edges capture relationships between these actors, such as resource inclusion or information flow (e.g., cookie matching).

Canonicalizing Domains. We use the data described in § 3.1 to construct a graph for the online advertising ecosystem. We use effective 2nd-level domain names to represent nodes. For example, x.doubleclick.net and y.doubleclick.net are represented by a single node labeled doubleclick. Throughout this paper, when we say "domain", we are referring to an effective 2nd-level domain name.⁵

⁵ None of the publishers and A&A domains in our dataset have two-part TLDs, like .co.uk, which simplifies our analysis.

Simplifying domains to the effective 2nd-level is a natural encoding for advertising data. Consider two inclusion trees generated by visiting two publishers: publisher p1 forwards the impression to x.doubleclick.net and then to advertiser a1. Publisher p2 forwards to y.doubleclick.net and advertiser a2. This does not imply that x.doubleclick and y.doubleclick only sell impressions to a1 and a2, respectively. In reality, DoubleClick is a single auction, regardless of the subdomain, and a1 and a2 have the opportunity to bid on all impressions. Individual inclusion trees are snapshots of how one particular impression was served; only in aggregate can all participants in the auctions be enumerated. Further, 3rd-level domains may read 2nd-level cookies without violating the Same Origin Policy [52]: x.doubleclick.net and y.doubleclick.net may both access cookies set by .doubleclick.net, and do in practice.

The sole exception to our domain canonicalization process is Amazon's Cloudfront Content Delivery Network (CDN). We routinely observed Cloudfront hosting ad-related scripts and images in our data. We manually examined the 50 fully-qualified Cloudfront domains
(e.g., d31550gg7drwar.cloudfront.net) that were preceded or followed by A&A domains in our data, and mapped each one to the corresponding A&A company (e.g., adroll in this case).

Inclusion graph. We propose a novel representation called an Inclusion graph that is the union of all inclusion trees in our dataset. Our representation is a directed graph of publishers and A&A domains. An edge d_i → d_j exists if we have ever observed domain d_i including a resource from d_j. Edges may exist from publishers to A&A domains, or between A&A domains. Figure 2(c) shows an example Inclusion graph.

Referer graph. Gomer et al. [29] also proposed a directed graph representation consisting of publishers and A&A domains for the online advertising ecosystem. In this representation, each publisher and A&A domain is a node, and edge d_i → d_j exists if we have ever observed an HTTP request to d_j with Referer d_i. Figure 2(d) shows an example Referer graph corresponding to the given webpage. The Bashir et al. [10] dataset includes all HTTP request and response headers from the crawl, and we use these to construct the Referer graph.

Although the Referer and Inclusion graphs seem similar, they are fundamentally different for technical reasons. Consider the examples shown in Figure 2: the script from a1 is included directly into p's context, thus p is the Referer in the request to a2. This results in a Referer graph with two edges that does not correctly encode the relationships between the three parties: p → a1 and p → a2. In other words, HTTP Referer headers are an indirect method for measuring the semantic relationships between page elements, and the headers may be incorrect depending on the syntactic structure of a page. Our Inclusion graph representation fixes the ambiguity in the Referer graph by explicitly relying on the inclusion relationships between elements in webpages. We analyze the salient differences between the Referer and Inclusion graph in § 4.

Weights. We also create a weighted version of these graphs. In the Inclusion graph, the weight of d_i → d_j encodes the number of times a resource from d_i sent an HTTP request to d_j. In the Referer graph, the weight of d_i → d_j encodes the number of HTTP requests with Referer d_i and destination d_j.

3.4 Detection of A&A Domains

For us to understand the role of A&A companies in the advertising graph, we must be able to distinguish A&A domains from publishers and non-A&A third parties like CDNs. In the inclusion trees from the Bashir et al. dataset [10], each resource is labeled as A&A or non-A&A using the EasyList and EasyPrivacy rule lists. For all the A&A labeled resources, we extract the associated 2nd-level domain. To eliminate false positives, we only consider a 2nd-level domain to be A&A if it was labeled as A&A more than 10% of the time in the dataset.

3.5 Coverage

There are two potential concerns with the raw data we use in this study: does the data include a representative set of A&A domains? And does the data contain all of the outgoing edges associated with each A&A domain? To answer the former question, we plot Figure 3, which shows the overlap between the top x A&A domains in our dataset (ranked by inclusion frequency by publishers) with all of the A&A domains included by the Alexa Top-5K websites.⁶ We observe that 99% of the 150 most frequent A&A domains appear in both samples, while 89% of the 500 most frequent appear in both. These findings confirm that our dataset includes the vast majority of prominent A&A domains that users are likely to encounter on the web.

⁶ Our dataset and the Alexa Top-5K data were both collected in December 2015, so they are temporally comparable.

[Figure 3: x-axis: Top x A&A Domains; y-axis: % Overlap with A&A from Alexa Top-5K.]

Fig. 3. Overlap between frequent A&A domains and A&A domains from Alexa Top-5K.

To answer the second question, we plot Figure 4, which shows the number of unique external A&A domains contacted by A&A domains in our dataset as the crawl progressed (i.e., starting from the first page crawled, and ending with the last). Recall that the dataset was collected over nine consecutive crawls spanning two weeks, each of which visited 9,630 individual pages spread over 888 domains.

[Figure 4: x-axis: # Pages Crawled; y-axis: # Unique External A&A Domains.]

Fig. 4. Unique A&A domains contacted by each A&A domain as we crawl more pages.

We observe that the number of A&A → A&A edges rises quickly initially, going from 0 to 800 in 3,600
crawled pages. Then, the growth slows down, requiring an additional 12,000 page visits to increase from 800 to 900. In other words, almost all A&A edges were discovered by half-way through the very first crawl; eight subsequent iterations of the crawl only uncovered 12.5% more edges. This demonstrates that the crawler reached the point of diminishing returns, indicating that the vast majority of connections between A&A domains that existed at the time are contained in the dataset.

Graph Type  |V|   |E|    |V_WCC|  |E_WCC|  Avg. Deg. (In)  Avg. Deg. (Out)  Avg. Path Length  Cluster. Coef.  S^Δ [31]  Degree Assort.
Inclusion   1917  26099  1909     26099    13.612          13.612           2.748†            0.472‡          31.254‡   -0.31‡
Referer     1923  41468  1911     41468    21.564          21.564           2.429†            0.235‡          10.040‡   -0.29‡

Table 1. Basic statistics for Inclusion and Referer graph. We show sizes for the largest WCC in each graph. † denotes that the metric is calculated on the largest SCC. ‡ denotes that the metric is calculated on the undirected transformation of the graph.

[...] that should be in the core of the network are incorrectly attached to publishers along the periphery.

Structure and Connectivity. As shown in Table 1, the Inclusion graph has large, well-connected components. The largest Weakly Connected Component (WCC) covers all but eight nodes in the Inclusion graph, meaning that very few nodes are completely disconnected. This highlights the interconnectedness of the ad ecosystem. The average node degree in the Inclusion graph is 13.6, and
Diffusion of User Tracking Data in the Online Advertising Ecosystem 92 2000 Betweenness Centrality Weighted PageRank 1600 google-analytics doubleclick doubleclick googlesyndication |WCC| 1200 googleadservices 2mdn 800 facebook adnxs 400 googletagmanager google 0 googlesyndication adsafeprotected 0 10 20 30 40 50 60 70 adnxs google-analytics k google scorecardresearch addthis krxd Fig. 5. k-core: size of the Inclusion graph WCC as nodes with criteo rubiconproject degree ≤ k are recursively removed. Table 2. Top 10 nodes ranked by betweenness centrality and weighted PageRank in the Inclusion graph. network that have disassortative connectivity, which we examine in the next section. segmented by ad exchange (e.g., customers and part- ners centered around DoubleClick). This is a known 4.2 Cores and Communities deficiency in modularity maximization based methods, that they tend to produce communities with no real- We now examine how nodes in the Inclusion graph con- world correspondence [5]. Girvan–Newman found 10 nect to each other using two metrics: k-cores and com- communities, with the largest community containing munity detection. The k-core of a graph is the subset 1,097 nodes (57% of all nodes) and 16,424 edges (63% of a graph (nodes and edges) that remain after recur- of all edges). Out of 1,097 nodes, 64% are A&A. How- sively removing all nodes with degree ≤ k. By increas- ever, the modularity score was zero, which means that ing k, the loosely connected periphery of a graph can be the Girvan–Newman communities contain a random as- stripped away, leaving just the dense core. In our sce- sortment of internal and external (cross-cluster) edges. nario, this corresponds to the high-degree ad exchanges, Overall, these results demonstrate that the web dis- ad networks, and trackers that facilitate the connections play ad ecosystem is not balkanized into distinct groups between publishers and advertisers. 
Figure 5 plots k versus the size of the WCC for the Inclusion graph. The plot shows that the core of the Inclusion graph rapidly declines in size as k increases, which highlights the interdependence between A&A domains and the lack of a distinct core.

Next, to examine the community structure of the Inclusion graph, we utilized three different community detection algorithms: label propagation by Raghavan et al. [64], Louvain modularity maximization [12], and the centrality-based Girvan–Newman [27] algorithm. We chose these algorithms because they attempt to find communities using fundamentally different approaches.

Unfortunately, after running these algorithms on the largest WCC, the results of our community analysis were negative. Label propagation clustered all nodes into a single community. Louvain found 14 communities with an overall modularity score of 0.44 (on a scale of -1 to 1 where 1 is entirely disjoint clusters). The largest community contains 771 nodes (40% of all nodes) and 3252 edges (12% of all edges). Out of 771 nodes, 37% are A&A. However, none of the 14 communities corresponded to meaningful groups of nodes, either segmented by type (e.g., publishers, SSPs, DSPs, etc.) or

of companies and publishers that partner with each other. Instead, the ecosystem is highly interdependent, with no clear delineations between groups or types of A&A companies. This result is not surprising considering how dense the Inclusion graph is.

4.3 Node Importance

In this section, we focus on the importance of specific nodes in the Inclusion graph using two metrics: betweenness centrality and weighted PageRank. As before, we focus on the largest WCC. The betweenness centrality for a node n is defined as the fraction of all shortest paths on the graph that traverse n. In our scenario, nodes with high betweenness centrality represent the key pathways for tracking information and impressions to flow from publishers to the rest of the ad ecosystem. For weighted PageRank, we weight each edge in the Inclusion graph based on the number of times we observe it in our raw data. In essence, weighted PageRank identifies the nodes that receive the largest amounts of tracking data and impressions throughout each graph.
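As a rough illustration of the weighted PageRank metric described above, the following sketch runs power iteration over a toy edge list in which each edge weight stands in for the number of times that edge was observed. The function name, the damping factor of 0.85, the fixed iteration count, and the example node names are assumptions made for illustration; none of them are values reported in this paper.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank where rank flows in proportion to edge weight.

    edges maps (src, dst) -> positive observation count (hypothetical input format).
    Returns a dict mapping each node to its score; scores sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)

    # Total outgoing weight per node, used to normalise each edge's share.
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for (src, dst), weight in edges.items():
            if out_weight[src] > 0:
                new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


# Toy graph: one publisher, an exchange, a tracker, and a DSP (made-up names).
toy_edges = {("pub", "adx"): 9, ("pub", "tracker"): 1, ("adx", "dsp"): 5}
scores = weighted_pagerank(toy_edges)
```

In this toy graph the exchange receives nine tenths of the publisher's outgoing weight, so it ends up with a higher score than the tracker, which mirrors the intuition that heavily used edges concentrate rank on their targets.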
Table 2 shows the top 10 nodes in the Inclusion graph based on betweenness centrality and weighted PageRank. Prominent online advertising companies are well represented, including AppNexus (adnxs), Facebook, and Integral Ad Science (adsafeprotected). Similar to prior work, we find that Google’s advertising domains (including DoubleClick and 2mdn) are the most prominent overall [29]. Unsurprisingly, these companies all provide platforms, i.e., SSPs, ad exchanges, and ad networks. We also observe trackers like Google Analytics and Tag Manager. Interestingly, among 14 unique domains across the two lists, ten only appear in a single list. This suggests that the most important domains in terms of connectivity are not necessarily the ones that receive the highest volume of HTTP requests.

5 Information Diffusion

In § 4, we examined the descriptive characteristics of the Inclusion graph, and discussed the implications of this graph structure on our understanding of the online advertising ecosystem. In this section, we take the next step and present a concrete use case for the Inclusion graph: modeling the diffusion of user tracking data across the ad ecosystem under different types of ad and tracker blocking (e.g., AdBlock Plus and Ghostery). We model the flow of information across the Inclusion graph, taking into account different blocking strategies, as well as the design of RTB systems and empirically observed transition probabilities from our crawled dataset.

5.1 Simulation Goals

Simulation is an important tool for helping to understand the dynamics of the (otherwise opaque) online advertising industry. For example, Gill et al. used data-driven simulations to model the distribution of revenue amongst online display advertisers [26].

Here, we use simulations to examine the flow of browsing history data to trackers and advertisers. Specifically, we ask:
1. How many user impressions (i.e., page visits) to publishers can each A&A domain observe?
2. What fraction of the unique publishers that a user visits can each A&A domain observe?
3. How do different blocking strategies impact the number of impressions and fraction of publishers observed by each A&A domain?

These questions have direct implications for understanding users’ online privacy. The first two questions are about quantifying a user’s online footprint, i.e., how much of their browsing history can be recorded by different companies. In contrast, the third question investigates how well different blocking strategies perform at protecting users’ privacy.

5.2 Simulation Setup

To answer these questions, we simulate the browsing behavior of typical users using the methodology from Burklen et al. [14].⁹ In particular, we simulate a user browsing publishers over discrete time steps. At each time step our simulated user decides whether to remain on the current publisher according to a Pareto distribution (exponent = 2), in which case they generate a new impression on that publisher. Otherwise, the user browses to a new publisher, which is chosen based on a Zipf distribution over the Alexa ranks of the publishers. Burklen et al. developed this browsing model based on large-scale observational traces, and derived the distributions and their parameters empirically. This browsing model has been successfully used to drive simulated experiments in other work [40].

⁹ To the best of our knowledge, there are no other empirically validated browsing models besides [14].

We generated browsing traces for 200 users. On average, each user generated 5,343 impressions on 190 unique publishers. The publishers are selected from the 888 unique first-party websites in our dataset (see § 3.1).

During each simulated time step the user generates an impression on a publisher, which is then forwarded to all A&A domains that are directly connected to the publisher. This emulates a webpage with multiple slots for display ads, each of which is serviced by a different SSP or ad exchange. However, it is insufficient to simply forward the impression to the A&A domains directly connected to each publisher; we also must account for ad exchanges and RTB auctions [10, 58], which may cause the impression to spread farther on the graph. We discuss this process next. The simulated time step ends when all impressions arrive at A&A domains that do not forward them. Once all outstanding impressions have terminated, time increments and our simulated user generates a new impression, either from their currently selected publisher or from a new publisher.
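The browsing model above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulator used in this paper: it interprets the Pareto decision as drawing the number of consecutive impressions per visit from a Pareto distribution with shape 2, it assumes a Zipf exponent of 1 over publisher rank, and all function names and example domains are hypothetical. It only covers the first step of a time step, i.e., forwarding each impression to the A&A domains directly connected to the publisher.

```python
import random
from collections import Counter


def generate_trace(ranked_publishers, n_impressions, rng,
                   pareto_shape=2.0, zipf_exponent=1.0):
    """Return a list with one publisher per simulated impression.

    ranked_publishers is ordered by popularity rank (rank 1 first).
    """
    # Zipf weights: the publisher at rank k is chosen with weight 1 / k**s.
    weights = [1.0 / (rank ** zipf_exponent)
               for rank in range(1, len(ranked_publishers) + 1)]
    trace = []
    while len(trace) < n_impressions:
        # Simplification: the draw may return the publisher just visited.
        publisher = rng.choices(ranked_publishers, weights=weights, k=1)[0]
        # paretovariate returns a float >= 1, so each visit yields >= 1 impression.
        dwell = int(rng.paretovariate(pareto_shape))
        trace.extend([publisher] * dwell)
    return trace[:n_impressions]


def direct_observations(trace, direct_neighbors):
    """Count impressions seen by A&A domains directly connected to each publisher."""
    seen = Counter()
    for publisher in trace:
        for domain in direct_neighbors.get(publisher, ()):
            seen[domain] += 1
    return seen


# Toy inputs (made-up domains), seeded for repeatability.
rng = random.Random(0)
publishers = ["news.example", "blog.example", "shop.example"]
neighbors = {
    "news.example": ["adx.example", "tracker.example"],
    "blog.example": ["adx.example"],
    "shop.example": ["adx.example", "dsp.example"],
}
trace = generate_trace(publishers, 50, rng)
observed = direct_observations(trace, neighbors)
```

Because every toy publisher embeds `adx.example`, that domain observes all 50 impressions, while the other domains observe only the impressions generated on the publishers that embed them.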
[Figure 6: x-axis Termination Probability per Node, y-axis CDF.]
Fig. 6. CDF of the termination probability for A&A nodes.

[Figure 7: x-axis Mean Weight on Incoming Edges (log scale), y-axis CDF.]
Fig. 7. CDF of the weights on incoming edges for A&A nodes.

5.2.1 Impression Propagation

Our simulations must account for direct and indirect propagation of impressions. Direct flows occur when one A&A domain sells or redirects an impression to another A&A domain. We refer to these flows as “direct” because they are observable by the web browser, and are thus recorded in our dataset. Indirect flows occur when an ad exchange solicits bids on an impression. The advertisers in the auction learn about the impression, but this is not directly observable to the browser; only the winner is ultimately known.

Direct Propagation. To account for direct propagation, we assign a termination probability to each A&A node in the Inclusion graph that determines how often it serves an ad itself, versus selling the impression to a partner (and redirecting the user’s browser accordingly). We derive the termination probability for each A&A node empirically from our dataset. When an impression is sold, we determine which neighboring node purchases the impression based on the weights of the outgoing edges. For a node $a_i$, we define its set of outgoing neighbors as $N_o(a_i)$. The probability of selling to neighbor $a_j \in N_o(a_i)$ is $w(a_i \to a_j) / \sum_{a_y \in N_o(a_i)} w(a_i \to a_y)$, where $w(a_i \to a_j)$ is the weight of the given edge.

Figure 6 shows the termination probability for A&A nodes in the Inclusion graph. We see that 25% of the A&A nodes have a termination probability of one, meaning that they never sell impressions. The remaining 75% of A&A nodes exhibit a wide range of termination probabilities, corresponding to different business models and roles in the ad ecosystem. For example, DoubleClick, the most prominent ad exchange, has a termination probability of 0.35, whereas Criteo, a well-known advertiser specializing in retargeting, has a termination probability of 0.63.

Figure 7 shows the mean incoming edge weights for A&A nodes in the Inclusion graph. We observe that the distribution is highly skewed towards nodes with extremely high average incoming weights (note that the x-axis is in log scale). This demonstrates that heavy-hitters like DoubleClick, GoogleSyndication, OpenX, and Facebook are likely to purchase impressions that go up for auction in our simulations.

Indirect Propagation. Unfortunately, precisely accounting for indirect propagation is not currently possible, since it is not known exactly which A&A domains are ad exchanges, or which pairs of A&A domains share information. To compensate, we evaluate three different indirect impression propagation models:
– Cookie Matching-Only: As we note in § 3.2, the Bashir et al. [10] dataset includes 200 empirically validated pairs of A&A domains that match cookies. In this model, we treat these 200 edges as ground-truth and only indirectly disseminate impressions along these edges. Specifically, if $a_i$ observes an impression, it will indirectly share with $a_j$ iff $a_i \to a_j$ exists and is in the set of 200 known cookie matching edges. This is the most conservative model we evaluate, and it provides a lower-bound on impressions observed by A&A domains.
– RTB Relaxed: In this model, we assume that each A&A domain that observes an impression indirectly shares it with all A&A domains that it is connected to. Although this is the correct behavior for ad exchanges like Rubicon and DoubleClick, it is not correct for every A&A domain. This is the most liberal model we evaluate, and it provides an upper-bound on impressions observed by A&A domains.
– RTB Constrained: In this model, we select a subset of A&A domains E to act as ad exchanges. Whenever an A&A domain in E observes an impression, it shares it with all directly connected A&A domains, i.e., to solicit bids. This model represents a more realistic view of information diffusion than the Cookie Matching-Only and RTB Relaxed models because the graph contains few but extremely well connected exchanges.

For RTB Constrained, we select all A&A nodes with out-degree ≥ 50 and in/out degree ratio r in the range 0.7 ≤ r ≤ 1.7 to be in E. These thresholds were chosen after manually looking at the degrees and ratios for known ad exchanges and ad exchanges marked by Bashir et al. [10]. This results in |E| = 36 A&A nodes being chosen as ad exchanges (out of 1,032 total A&A domains in the Inclusion graph). We enforce restrictions on r because A&A nodes with disproportionately large amounts of incoming edges are likely to be trackers (information enters but is not forwarded out), while those with disproportionately large amounts of outgoing edges are likely SSPs (they have too few incoming edges to be an ad exchange). Table 6 in the appendix shows the domains in E, including major, known ad exchanges like AppNexus, Advertising.com, Casale Media, DoubleClick, Google Syndication, OpenX, Rubicon, Turn, and Yahoo. 150 of the 200 known cookie matching edges in our dataset are covered by this list of 36 nodes.

[Figure 8: panels (a) Example Graph, (b) Cookie Matching, (c) RTB Constrained, (d) RTB Relaxed. Legend: node types Publisher, Exchange, DSP/Advertiser; edge types Cookie Matched, Non-Cookie Matched; activation Direct, Indirect. Annotations: false negative edge in (a), false negative impression in (b), false positive impressions in (c) and (d).]
Fig. 8. Examples of our information diffusion simulations. The observed impression count for each A&A node is shown below its name. (a) shows an example graph with two publishers and two ad exchanges. Advertisers $a_1$ and $a_3$ participate in the RTB auctions, as well as DSP $a_2$ that bids on behalf of $a_4$ and $a_5$. (b)–(d) show the flow of data (dark grey arrows) when a user generates impressions on $p_1$ and $p_2$ under three diffusion models. In all three examples, $a_2$ purchases both impressions on behalf of $a_5$, thus they both directly receive information. Other advertisers indirectly receive information by participating in the auctions.

Figure 8 shows hypothetical examples of how impressions disseminate under our indirect models. Figure 8(a) presents the scenario: a graph with two publishers connected to two ad exchanges and five advertisers. $a_2$ is a bidder in both exchanges, and serves as a DSP for $a_4$ and $a_5$ (i.e., it services their ad campaigns by bidding on their behalf). Light grey edges capture cases where the two endpoints have been observed cookie matching in the ground-truth data. Edge $e_2 \to a_3$ is a false negative because matching has not been observed along this edge in the data, but $a_3$ must match with $e_2$ to meaningfully participate in the auction.

Figure 8(b)–(d) show the flow of impressions under our three models. In all three examples, a user visits publishers $p_1$ and $p_2$, generating two impressions. Further, in all three examples $a_2$ wins both auctions on behalf of $a_5$; thus $e_1$, $e_2$, $a_2$, and $a_5$ are guaranteed to observe impressions. As shown in the figure, $a_2$ and $a_5$ observe both impressions, but other nodes may observe zero or more impressions depending on their position and the dissemination model. In Figure 8(b), $a_3$ does not observe any impressions because its incoming edge has not been labeled as cookie matched; this is a false negative because $a_3$ participates in $e_2$’s auction. Conversely, in Figure 8(d), all nodes always share all impressions, thus $a_4$ observes both impressions. However, these are false positives, since DSPs like $a_2$ do not routinely share information amongst all their clients.

5.2.2 Node Blocking

To answer our third question, we must simulate the effect of “blocking” A&A domains on the Inclusion graph. A simulated user that blocks A&A domain $a_j$ will not make direct connections to it (the solid outlines in Figure 8). However, blocking $a_j$ does not prevent $a_j$ from tracking users indirectly: if the simulated user contacts ad exchange $a_i$, the impression may be forwarded to $a_j$ during the bidding process (the dashed outlines in Figure 8). For example, an extension that blocks $a_2$ in Figure 8 will prevent the user from seeing an ad, as well as prevent information flow to $a_4$ and $a_5$. However, blocking $a_2$ does not stop information from flowing to $e_1$, $e_2$, $a_1$, $a_3$, and even $a_2$!

We evaluate five different blocking strategies to compare their relative impact on user privacy under our three impression propagation models:
1. We randomly blocked 30% (310) of the A&A nodes from the Inclusion graph.¹⁰
2. We blocked the top 10% (103) of A&A nodes from the Inclusion graph, sorted by weighted PageRank.
3. We blocked all 594 A&A nodes from the Ghostery [25] blacklist.
4. We blocked all 412 A&A nodes from the Disconnect [18] blacklist.
5. We emulated the behavior of AdBlock Plus [2], which is a combination of whitelisting A&A nodes from the Acceptable Ads program [73], and blacklisting A&A nodes from EasyList [19]. After whitelisting, 634 A&A nodes are blocked.

¹⁰ We also randomly blocked 10% and 20% of A&A nodes, but the simulation results were very similar to that of random 30%.

We chose these methods to explore a range of graph theoretic and practical blocking strategies. Prior work has shown that the global connectivity of small-world graphs is resilient against random node removal [13], but we would like to empirically determine if this is true for ad network graphs as well. In contrast, prior work also shows that removing even a small fraction of top nodes from small-world graphs causes the graph to fracture into many subgraphs [50, 74]. Ghostery and Disconnect are two of the most widely-installed tracker blocking browser extensions, so evaluating their blacklists allows us to quantify how good they are at protecting users’ privacy. Finally, AdBlock Plus is the most popular ad blocking extension [45, 62], but contrary to its name, by default it whitelists A&A companies that pay to be part of its Acceptable Ads program [3]. Thus, we seek to understand how effective AdBlock Plus is at protecting user privacy under its default behavior.

5.3 Validation

To confirm that our simulations are representative of our ground-truth data, we perform some sanity checks. We simulate a single user in each model (who generates 5K impressions) and compare the resulting simulated inclusion trees to the original, real inclusion trees.

[Figure 9: panels (a) Number of nodes (y-axis # Nodes Activated) and (b) Tree depth (y-axis Tree Depth); x-axis categories Orig, RTB-R, RTB-C, CM.]
Fig. 9. Comparison of the original and simulated inclusion trees. Each bar shows the 5th, 25th, 50th (in black), 75th, and 95th percentile value.

First, we look at the number of nodes that are activated by direct propagation in trees rooted at each publisher. Figure 9a shows that our models are conservative in that they generate smaller trees: the median original tree contains 48 nodes, versus 32, seven, and six from our models. One caveat to this is that publishers in our simulated trees have a wider range of fan-outs than in the original trees. The median publishers in the original and simulated trees have 11 and 12 neighbors, respectively, but the 75th percentile trees have 16 and 30 neighbors, respectively.

Second, we investigate the depth of the inclusion trees. As shown in Figure 9b, the median tree depth in the original trees is three, versus two in all our models. The 75th percentile tree depth in the original data is four, versus three in the RTB Relaxed and RTB Constrained models, and two in the most restrictive Cookie Matching-Only model. These results show that overall, our models are conservative in that they tend to generate slightly shorter inclusion trees than reality.

Third, we look at the set of A&A domains that are included in trees rooted at each publisher. For a publisher $p$ that contacts a set $A^o_p$ of A&A domains in our original data, we calculate $f_p = |A^s_p \cap A^o_p| / |A^o_p|$, where $A^s_p$ is the set of A&A domains contacted by $p$ in simulation. Figure 10 plots the CDF of $f_p$ values for all publishers in our dataset, under our three models. We observe that for almost 80% of publishers, 90% of the A&A domains contacted in the original trees are also contacted in trees generated by the RTB Relaxed model. This falls to 60% and 16% as the models become more restrictive.

Fourth, we examine the number of ad exchanges that appear in the original and simulated trees. Examining the ad exchanges is critical, since they are responsible for all indirect dissemination of impressions. As shown in Figure 11, inclusion trees from our simulations contain an order of magnitude fewer ad exchanges than the original inclusion trees, regardless of model.¹¹ This suggests that indirect dissemination of impressions in our models will be conservative relative to reality.

¹¹ Because each of our models assumes that a different set of A&A nodes are ad exchanges, we must perform three corresponding counts of ad exchanges in our original trees.

Number of Selected Exchanges. Finally, we investigate the impact of exchanges in the RTB Constrained model. We select the top x A&A domains by out-degree to act as exchanges (subject to their in/out degree ratio r being in the range 0.7 ≤ r ≤ 1.7), then execute a simulation. As shown in Figure 12, with 20
or more exchanges the distribution of impressions observed by A&A domains stops growing, i.e., our RTB Constrained model is relatively insensitive to the number of exchanges. This is not surprising, given how dense the Inclusion graph is (see § 4). We observed similar results when we picked top nodes based on PageRank.

[Figure 10: x-axis Frac. of A&A Contacted, y-axis CDF (Frac. of Publishers); series CM, RTB-C, RTB-R.]
Fig. 10. CDF of the fractions of A&A domains contacted by publishers in our original data that were also contacted in our three simulated models.

[Figure 11: x-axis # of Ad Exchanges per Tree (log scale), y-axis CDF; series CM, RTB-C, RTB-R for original and simulated trees.]
Fig. 11. Number of ad exchanges in our original (solid lines) and simulated (dashed lines) inclusion trees.

[Figure 12: x-axis Fraction of Impressions, y-axis CDF; series for x = 5, 10, 20, 30, 50, 100.]
Fig. 12. Fraction of impressions observed by A&A domains in RTB-C model when top x exchanges are selected.

5.4 Results

We take our 200 simulated users and “play back” their browsing traces over the unmodified Inclusion graph, as well as graphs where nodes have been blocked using the strategies outlined above. We record the total number of impressions observed by each A&A domain, as well as the fraction of unique publishers observed by each A&A domain under different impression propagation models.

Triggered Edges. Table 3 shows the percentage of edges between A&A nodes that are triggered in the Inclusion graph under different combinations of impression propagation models and blocking strategies. No blocking/RTB Relaxed is the most permissive case; all other cases have fewer edges and less weight because (1) the propagation model prevents specific A&A edges from being activated and/or (2) the blocking scenario explicitly removes nodes. Interestingly, AdBlock Plus fails to have significant impact relative to the No Blocking baseline, in terms of removing edges or weight, under the Cookie Matching-Only and RTB Constrained models. Further, the top 10% blocking strategy removes fewer edges than Disconnect or Ghostery, but it reduces the remaining edge weight to roughly the same level as Disconnect, whereas Ghostery leaves more high-weight edges intact. These observations help to explain the outcomes of our simulations, which we discuss next.

Blocking        Cookie Matching-Only    RTB Constrained    RTB Relaxed
Scenarios       %E       %W             %E       %W        %E       %W
No Blocking     16.9     31.0           33.9     55.9      71.8     81.3
AdBlock Plus    12.3     28.0           25.6     50.3      48.4     68.6
Random 30%      12.1     21.8           22.1     34.2      48.7     54.8
Ghostery        3.52     9.87           6.82     18.2      13.5     21.9
Top 10%         6.03     5.01           8.18     5.52      26.8     13.4
Disconnect      2.98     3.66           4.72     6.01      16.3     11.6
Table 3. Percentage of Edges that are triggered in the Inclusion graph during our simulations under different propagation models and blocking scenarios. We also show the percentage of edge Weights covered via triggered edges.

Cookie Matching-Only        RTB Constrained            RTB Relaxed
doubleclick        90.1     google-analytics   97.1    pinterest          99.1
criteo             89.6     quantserve         92.0    doubleclick        99.1
quantserve         89.5     scorecardresearch  91.9    twitter            99.1
googlesyndication  89.0     youtube            91.8    googlesyndication  99.0
flashtalking       88.8     skimresources      91.6    scorecardresearch  99.0
mediaforge         88.8     twitter            91.3    moatads            99.0
adsrvr             88.6     pinterest          91.2    quantserve         99.0
dotomi             88.6     criteo             91.2    doubleverify       99.0
steelhousemedia    88.6     addthis            91.1    crwdcntrl          99.0
adroll             88.6     bluekai            91.1    adsrvr             99.0
Table 4. Top 10 nodes that observed the most impressions under our simulations with no blocking.

No Blocking. First, we discuss the case where no A&A nodes are blocked in the graph. Figure 13 shows the fraction of total impressions (out of ∼5,300) and fraction of unique publishers (out of ∼190) observed by A&A domains under different propagation models. We find that the distribution of observed impressions under RTB Constrained is very similar to that of RTB Relaxed, whereas observed impressions drop dramatically under the Cookie Matching-Only model. Specifically, the top 10% of A&A nodes in the Inclusion graph (sorted by impression count) observe more than 97% of the impressions in RTB Relaxed, 90% in RTB Constrained, and 29% in Cookie Matching-Only. We observe similar patterns for fractions of publishers observed across the three indirect propagation models. Recall that the Cookie Matching-Only and RTB Relaxed models function as lower- and upper-bounds on observability; that
the results from the RTB Constrained model are so similar to the RTB Relaxed model is striking, given that only 36 nodes in the former spread impressions indirectly, versus 1,032 in the latter.

[Figure 13: x-axis Observed Fraction, y-axis CDF; series Cookie Matching-Only, RTB Constrained, RTB Relaxed.]
Fig. 13. Fraction of impressions (solid lines) and publishers (dashed lines) observed by A&A domains under our three models, without any blocking.

[Figure 14: panels (a) Disconnect, Ghostery, AdBlock Plus and (b) Top 10% and Random 30% of nodes, each with a No Blocking baseline; x-axis Fraction of Impressions, y-axis CDF.]
Fig. 14. Fraction of impressions observed by A&A domains under the RTB Constrained (dashed lines) and RTB Relaxed (solid lines) models, with various blocking strategies.

Although the overall fraction of observed impressions drops significantly in the Cookie Matching-Only model, Table 4 shows that the top 10 A&A domains observe 99%, 96%, and 89% of impressions on average under RTB Relaxed, RTB Constrained, and Cookie Matching-Only respectively. Some of the top ranked nodes are expected, like DoubleClick, but other cases are more interesting. For example, Pinterest is connected to 178 publishers and 99 other A&A domains. In the Cookie Matching-Only model, it ranks 47 because it is directly embedded in relatively few publishers, but it ascends to ranks seven and one in the RTB Constrained and RTB Relaxed models, respectively, once indirect sharing is accounted for. This drives home the point that although Google is the most pervasively embedded advertiser around the web [15, 65], there are roughly 52 other A&A companies that also observe greater than 91% of users’ browsing behaviors (in the RTB Constrained model), due to their participation in major ad exchanges.

With Blocking. Next, we discuss the results when AdBlock Plus (i.e., the Acceptable Ads whitelist and EasyList blacklist) is used to block nodes. AdBlock Plus has essentially zero impact on the fraction of impressions observed by A&A domains: the results in Figure 14a under the RTB Constrained and RTB Relaxed models are almost coincident with those for the models when no blocking is applied at all. The problem is that the major ad networks and exchanges are all present in the Acceptable Ads whitelist, and thus all of their partners are also able to observe the impressions, even if they are (sometimes) prevented from actually showing ads to the user. Indeed, the top 10 nodes in Table 4 with no blocking and in Table 5 with AdBlock Plus are almost identical, save for some reordering.

Next, we examine Ghostery and Disconnect in Figure 14a. As expected, the amount of information seen by A&A domains decreases when we block domains from these blacklists. Disconnect’s blacklist does a much better job of protecting users’ privacy in our simulations: after blocking nodes using the Disconnect blacklist, 90% of the nodes see less than 40% of the impressions in the RTB Constrained model, and less than 53% in the RTB Relaxed model. In contrast, when using the Ghostery blacklist, 90% of the nodes see less than 75% of the impressions in both RTB models. Table 5 shows that the top 10 A&A domains are only able to observe at most 40–59% and 73–83% of impressions when the Disconnect and Ghostery blacklists are used, respectively, depending on the indirect propagation model.

As shown in Figure 14b, blocking the top 10% of A&A nodes from the Inclusion graph (sorted by weighted PageRank) causes almost as much reduction in observed impressions as Disconnect. Table 5 helps to orient the top 10% blocking strategy versus Disconnect and Ghostery in terms of overall reduction in impression observability and the impact on specific A&A domains. In contrast, blocking 30% of the A&A nodes at random has more impact than AdBlock Plus, but less than Disconnect and Ghostery. The top 10 nodes under the “no blocking” and “random 30%” (not shown) strategies observe similar impression fractions. Both of these results agree with the theoretical expectations for small-world graphs, i.e., their connectivity is resilient against random blocking, but not necessarily targeted blocking.

We do not show results for our most restrictive model (i.e., Cookie Matching-Only) in Figure 14, since the majority of A&A companies view almost zero impressions. Specifically, 90% of A&A companies view less