Proceedings on Privacy Enhancing Technologies; 2021 (2):259–281

Giorgio Di Tizio* and Fabio Massacci

A Calculus of Tracking: Theory and Practice

Abstract: Online tracking techniques, the interactions among trackers, and the economic and social impact of these procedures in the advertising ecosystem have received increasing attention in recent years. This work proposes a novel formal model that describes the foundations on which the visible process of data sharing behaves in terms of the network configurations of the Internet (including CDNs, shared cookies, etc.). From our model, we define relations that can be used to evaluate the impact of different privacy mitigations and determine if websites should comply with privacy regulations. We show that the calculus, based on a fragment of intuitionistic logic, is tractable and constructive: any formal derivation in the model corresponds to an actual tracking practice that can be implemented given the current configuration of the Internet. We apply our model to a dataset obtained from OpenWPM to evaluate the effectiveness of tracking mitigations up to the Alexa Top 100.

Keywords: online tracking, ad-blocker, formal model

DOI 10.2478/popets-2021-0027
Received 2020-08-31; revised 2020-12-15; accepted 2020-12-16.

*Corresponding Author: Giorgio Di Tizio: University of Trento, E-mail: giorgio.ditizio@unitn.it
Fabio Massacci: University of Trento, Vrije Universiteit Amsterdam, E-mail: fabio.massacci@unitn.it

1 Introduction

Online tracking of users for targeted advertising is the reality of today's Internet, and the extent of such tracking has been the subject of intense research activity [1]. Research studies span from fact-finding studies (e.g. [2, 3]) to technical analyses, both of classical techniques (e.g. based on cookie syncing [4]) and of novel techniques (e.g. based on browser fingerprinting [5]). Economic analyses are also not uncommon (e.g. [6]). A number of mitigation tools also emerged to (partly) block trackers (e.g. Ghostery, Disconnect, Adblock Plus, etc.) and researchers have also investigated whether they are effective (e.g. [7, 8]) and their side-effects (e.g. [9, 10]). These robust research activities, which we sample in Tab. 1 and Tab. 2, generated a number of open databases that provide internet snapshots¹ and that can be used to experiment with tracking behavior and the effectiveness of tracking mitigations, as well as to derive metrics for a tracker's market share or trackers' concentration [11], at least for what is measurable from the Internet.²

¹ E.g. Web Census: http://webtransparency.cs.princeton.edu/webcensus
² Obviously, data exchange agreements between owners of seemingly different trackers do affect conclusions based on the internet. Detecting such agreements is hard given the present information asymmetry [12].

The major privacy concern of users is with whom and for which purposes their personal information is shared, not by which technology [13, 14]. In other words, users are not worried that an entity they interact with collects data, but that these data are shared with other entities. An interesting question, so far unanswered, is how we can provide a third-party independent verification of empirical tracking claims. A study might claim that tool A is more effective than tool B at mitigating trackers, but there is hardly any way for a third party to check why and how unless one re-runs the entire study and traces all results. Even the claim that a tracker burstnet.com can potentially know whoever visited website amazon.com is hard to check unless one re-runs the entire study.

We answer this question with a formally grounded mechanism: a calculus of tracking for the internet. We associate a formal relation with the various ways to exchange data (access and inclusion of web pages, cookie syncing, redirects, etc.) as they are measurable from the internet (and available in open datasets such as Web Census) and identify formal rules that capture how internet visits can be tracked. We can formally prove that a tracker can potentially know that a user visited a website. By using a fragment of intuitionistic logic, we can extract from the proof the actual web page configuration that makes such tracking possible. The formal model allows one to determine if a website should comply with privacy laws (e.g. COPPA), or to compare different mitigations and conclude whether a mitigation is strictly better than another, or at least quantitatively better along a Pareto frontier. For this approach to be a useful link between theory and practice, such a calculus must be tractable and
the analysis must be performable on the fraction of the Internet visited by a user as available from open datasets (e.g. OpenWPM), and we do so up to the Alexa Top 100.

The overview of the key works in this area (§2), summarized in Tab. 1, shows that the majority of the research focused on large-scale and technological analysis, while no attempt has been made to formally describe the sharing procedures. We fill this gap with the following contributions:
– we present the first formal model (predicates and rules) that describes the passage of tracking information across websites that is externally measurable (§3 and §4);
– we prove that inferences are tractable and that one can reconstruct the configuration responsible for a concrete tracking practice from the proof (§5);
– we formalize some of the interesting tracking relations that can be captured by our system (§6);
– we extend the model to consider the uncertainty of Internet interactions (§7);
– we discuss scalability issues and challenges (§8) and instantiate our model in a state-of-the-art theorem prover to determine tracking practices and websites that should comply with COPPA (§8.2);
– we compare the effectiveness of different mitigations (Ghostery, Disconnect, Adblock Plus, and Privacy Badger) on the Top 5, Top 10, Top 50 and Top 100 visited domains (§9).

We conclude by discussing the added value and the limitations of our approach (§10) and future work (§11).

Goals: we provide a framework that generates a third-party independent verification of tracking practices for individual cases, i.e. for single users that browse a limited number of websites. Large-scale studies over millions of domains are unrealistic and inappropriate to help users determine the best countermeasure to enhance their privacy or to determine which websites should comply with COPPA. Furthermore, these studies lack transparency and do not provide concrete evidence of the effectiveness of the mitigations analyzed (and in some cases provide contrasting results). Our framework fills the gap by providing an explanation, in the form of a formal proof, that can help users evaluate the effectiveness of different add-ons, or a proof that shows whether a website should comply with COPPA.

Non-goals: we do not consider methods that rely on back-office data exchange for user identification, for example collecting browser fingerprints and using them in back-office data sharing. As such, we could assert that certain websites perform fingerprinting, but in the absence of a publicly known relation between these websites on how fingerprints (or other personal information) are shared, this would be meager knowledge. Furthermore, the application of the framework is not intended for large-scale analysis of the whole Internet, where formal reasoning hardly scales.

2 Analysis of Tracking Practices

Over the last years, researchers have identified different tracking techniques on the Internet. To make the paper self-contained, we present an overview of the techniques and refer the reader to Tab. 8 in Appendix A.1 for additional information. Tab. 2 reports some research questions addressed by the state of the art. Most papers examined the effectiveness of tracker-blocking tools, while others focused on trackers' pervasiveness and the techniques used to track Internet users. We considered works about the Online Advertising Ecosystem, privacy policies, and formal modeling of the Internet.

A Summary of User Identification

HTTP Cookies are IDs associated with a user and are set by websites through JavaScript code or HTTP responses. Cookies are automatically attached by the browser to all subsequent requests to the websites. The major difference compared to browser fingerprinting is that the ID is stored locally on the user's machine [7, 27].

Browser Fingerprinting is used by websites to collect information from the browser to build a unique fingerprint [28]. For example, to personalize the content, a website can request device-specific information like user-agent, HTTP headers, plugins, fonts, screen resolution, OS, canvas and AudioContext [5, 29, 30] via HTTP headers or JavaScript code [22]. These attributes can be used to generate a unique fingerprint for tracking purposes. Other approaches exploit OS and hardware properties to generate device fingerprints that allow cross-browser tracking [31, 32].

Other Browser Storage, for example HTML5 localStorage, Flash LSOs, and HTTP headers (e.g. ETag) [33, 34], is used by websites to store IDs and track users even if HTTP cookies are deleted.

Other tracking techniques exploit browsing history [35] and the caching process of DNS records [36].
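The fingerprinting attributes listed above are typically serialized and hashed into a compact identifier. A minimal sketch of this idea (the attribute names and values are illustrative only, not taken from any specific tracker):

```python
import hashlib

def browser_fingerprint(attrs: dict) -> str:
    """Hash a set of browser attributes into a single identifier.

    Toy sketch: real fingerprinters combine many more signals
    (canvas, AudioContext, font probing) and weigh their entropy.
    """
    # Serialize attributes in a stable order so the same browser
    # always yields the same fingerprint.
    canonical = "|".join(f"{k}={attrs[k]}" for k in sorted(attrs))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

fp = browser_fingerprint({
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "screen": "1920x1080x24",
    "timezone": "UTC+1",
    "fonts": "Arial,DejaVu Sans,Noto",
})
```

Because the identifier is recomputed from the device's properties on every visit, it survives cookie deletion, which is precisely why the later sections treat fingerprinting separately from cookie-based identifiers.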
Table 1. Research Topics Addressed by the State-of-the-Art Research ([15][2][6][16][17][3][18][7][19][4][20][21][8][22][23][24][25][26], and our work)

Fact finding: analysis of the tracking ecosystem; tracking coverage; search context exposure to tracking; detection of hidden flows among trackers; detection of privacy regulation violations; detection of duty compliance to privacy regulations.
Economic analysis: cookie syncing incentives; revenue with and without cookies.
Technical analysis: effectiveness of blocking techniques; development of a detection mechanism; classification of trackers; analysis of tracking techniques.
Logic: formal expression of privacy policy; formal analysis of tracking.

Data Sharing Across Websites

Cookie Syncing is an increasingly popular technique [4] employed by trackers to share the IDs associated with a user [17, 19, 30]. A common cookie-syncing technique is to pass the IDs as parameters in an HTTP request. This procedure allows the websites to map different IDs to a single user and link information from different trackers.

Analysis of the Online Advertising Ecosystem

Ghosh et al. [15] analyzed the leakage of information in the Real-Time Bidding (RTB) protocol and modeled the revenue of advertisers with and without syncing. Gill et al. [2] employed HTTP traces to model the revenue earned by different trackers, whereas Marotta et al. [6] empirically estimated the value of targeted advertisement depending on the presence or absence of cookies. We aim to address an orthogonal problem, i.e. how we can formally prove that cookies are employed for tracking users, independently of how trackers utilize them.

Iqbal et al. [37] developed a graph-based ML classifier of ads and trackers. The tool builds a graph representation of the HTML structure, the network requests, and the JavaScript in the web page to determine tracking practices from specific features. Gomer et al. [16] analyzed how the search context exposes users to tracking practices using directed network graphs based on referral header information. Bashir et al. [3, 17] detected flows of information between advertisers based on retargeted ads. They constructed an inclusion graph to model the advertising ecosystem, analyzed the graph properties, and simulated the impacts of tracker-blocking tools like Disconnect and Adblock Plus. Kalavri et al. [18] represented traffic logs through 1-mode and 2-mode graphs to highlight the connected communities of the trackers and proposed an automated tracker detection mechanism based on graph properties. Similarly, in our paper we (implicitly) employ graphs to represent the interactions among websites that are exploited in the tracking ecosystem. However, we also provide a formal description of the trackers' interactions to prove tracking practices and privacy compliance.

Formal Models for IT-Security

Speicher et al. [38] used a model based on AI planning with grounded predicates in the context of the email infrastructure. Simeonovski et al. [39] proposed a model based on property graphs in the context of Internet core services. We instead focus on tracking practices and present a stronger formal foundation compared to the previous works by proposing a formal Gentzen calculus.

Analysis of Children's Online Privacy Protection Act Compliance

Data collection and web tracking are regulated by data protection laws. For example, the General Data Protection Regulation (GDPR) [40] is currently in force in the EU. Still, it is not uncommon to observe violations [41]. When it comes to collecting personal information from children, additional laws apply. In the U.S. the Children's Online Privacy Protection Rule (COPPA) [42] imposes requirements for websites that
collect personally identifiable information (PII)³ from children under 13 y/o. COPPA requires posting a privacy policy containing the personal information collected, with whom and for which goal these data are shared, getting verifiable parental consent (for example by calling a toll-free number), and allowing parents to review the PII collected and revoke the consent. Determining if a website should comply with COPPA is not easy. There have been several violations in the past, for example by Playdom [43] and Youtube [44], with fines of several millions of dollars. Apart from U.S. FTC reports, some studies developed frameworks to analyze Android apps to determine violations [23, 45, 46]. However, there is not yet a tool for parents to determine such violations due to the complex interactions among websites.

³ E.g. first and last name, home address, SSN, persistent identifiers (e.g. cookies, fingerprinting), etc.

Several papers tried to formally express privacy policy (e.g. [25, 26]). In the context of COPPA, Barth et al. [24] proposed a framework to formally describe this policy based on first-order temporal logic. Although similar to our work, [24] is a theoretical model that focuses on the description of privacy policies. In contrast, our model describes tracking behaviors for advertising and generates proofs that websites should comply with COPPA based on data from the Internet. For example, parents can determine which websites should comply based on their children's visited websites. The effective compliance can be manually verified by the parents, and the proof can prompt further investigations by the FTC.

Table 2. Research Questions From the State of the Art. Papers belong to some research classes: Fact finding (F), Economic analysis (E), Technical analysis (T), and Logic (L).

[15]  E    What induces publishers to perform cookie sync?
[2]   F,E  What is the revenue given different information?
[6]   E    How much do cookies influence publishers' revenues?
[16]  F    How does search context impact users' privacy?
[17]  F    What are the tracking ecosystem characteristics? Can retargeted ads detect exchange of info?
[3]   F,T  Can we capture interactions among publishers and ads companies based on web inclusions? How effective are tracking mitigations?
[18]  F,T  Can we identify trackers via mode graphs?
[7]   F,T  How can we classify web tracking behaviors? How effective are tracking mitigations?
[19]  F,T  How are trackers distributed in the Top 1M? How effective are tracking mitigations? Can we automatically collect tracking behaviors?
[4]   F,T  Which are the characteristics of cookie syncing? Which is the impact for the privacy of the users?
[20]  T    Are invisible pixels used extensively by trackers? Can we classify trackers based on invisible pixels? How effective are mitigations vs invisible pixels?
[21]  F,E  What is the impact of cookie syncing and RTB? Which is the price value of users' private data?
[8]   F,T  How effective are add-ons for desktop and mobile? What are the challenges of mobile tracking protection?
[22]  F,T  How to identify fingerprinting using font-probing? How effective are the tools against fingerprinting?
[23]  F,T  How to detect 3rd-party privacy violations?
[24]  L    How can we formally reason about norms of transmission of PII?
[25]  L    How can we formalize and check privacy rules?
[26]  L    How can we formalize privacy rules?

3 A Formal Model for Tracking

We define a privacy threat as arising when a website can know that a user visited other websites as a result of a process of data sharing. The major concern for a user is not that a website knows this is a recurring user, but that unrelated websites get knowledge of this activity thanks to the exchange of data. We did not model tracking techniques that have not been observed in the wild (e.g. [47]).

3.1 Predicates

Tab. 3 contains predicates that capture network interactions among websites and users (in our analysis we reduced the websites to the PS⁴ and PS+1 of the URL⁵) as well as the type of mitigations considered. IncludeContent(w, w') indicates an inclusion of some content of website w' in website w. The predicate IncludeContent_cookie(w, w') describes, in addition, a transmission of cookies collected by w to w'. Redirect_cookie(w, w') (Redirect(w, w')) indicates an HTTP redirection from the website w to the website w' with (without) a transfer of cookies collected by w. Block_request(w) indicates that the connection to the website w is blocked, for example by an add-on. Block_tp_cookie(w) indicates that the website w is not allowed to set cookies. These two mitigations protect users using different techniques. If Block_request(w)

⁴ https://publicsuffix.org/
⁵ For example, https://s.ytimg.com is reduced to ytimg.com.
Table 3. Ground Truth Network Interactions and Mitigations. The predicates are obtained from ground truth data (G) and are not derived from rules of the model.

IncludeContent(w, w')         G  website w includes 3rd-party content from the website w' (e.g. within an i-frame tag).
IncludeContent_cookie(w, w')  G  website w includes 3rd-party content from the website w', sending its cookies.
Redirect(w, w')               G  website w redirects visitors to the website w'. w does not append cookies in the redirection.
Redirect_cookie(w, w')        G  website w redirects visitors to the website w'. w appends its cookies in the URL (or payload).
Visit(w)                      G  intentional access to website w by a user.
Block_request(w)              G  extension blocks connections directed to the website w based on filter lists (e.g. Disconnect).
Block_tp_cookie(w)            G  3rd-party cookie blocking for website w.

Table 4. Web Tracking Predicates. The predicates are derived (D) from one or more rules of the model.

Link(w, w')           D  websites w and w' have a possible path to share information w/o exchange of cookies.
Link_cookie(w, w')    D  websites w and w' have a possible path to share information via cookies from w to w'.
Access(w, w')         D  website w forces access to a resource in w' via a redirection or an inclusion.
Access_cookie(w, w')  D  website w forces access to a resource in w' via a redirection (inclusion) attaching w's cookies.
Cookie_sync(w, w')    D  website w synchronizes its cookies with the website w'. The operation is unidirectional.
Knows(w, w')          D  potential ability of website w to track users on a (possibly different) website w'.
req_COPPA(w)          D  website w should comply with COPPA.

is evaluated as true (e.g. Disconnect blocks the website w), then all requests to w are blocked. If Block_tp_cookie(w) is evaluated as true (e.g. 3rd-party cookie blocking protection is active), it means that the browser does not allow w to set HTTP cookies; however, it does not block HTTP requests directed to w.

Tab. 4 summarizes the predicates that describe a possible exchange of user information between websites. Link_cookie(w, w') and Link(w, w') identify a possible path, as a result of an inclusion or a redirection, between w and w' that can be exploited for tracking. Access(w, w') and Access_cookie(w, w') capture a successful redirection or inclusion that forces a user to contact website w' from the website w. Cookie_sync(w, w') indicates cookie syncing between w and w' to share IDs.

The predicates in Tab. 3 are obtained from the collected data, as we will see in §8. The predicates in Tab. 4 are inferred from the rules of the model.

3.2 General Derivation Rules

The model includes the classical Gentzen rules for the quantifier-free fragment of intuitionistic first-order logic (IFOL). We use IFOL since its proofs are constructive and thus there is a pairing between proofs and attacks.

Definition 1 (Internet Snapshot). The symbol N, possibly with subscripts, denotes a finite (possibly empty) set of instantiated predicates from Tab. 3 and 4 that captures the interactions (inclusions, redirections, etc.) among websites observed on the Internet.

We denote the pure "status" of the internet with N and the set of predicates capturing a specific mitigation X on the internet snapshot N with N*_X. For example, N = {IncludeContent(w, w'), Redirect(w', w''), ...} and N*_X = {Block_request(w'), Block_request(w''), ...}.

The turnstile ⊢ separates the assumptions on the left from the propositions on the right. The sequence of formulas on the left of ⊢ is in conjunction. The horizontal line separates preconditions from postconditions. The capital letters A, B and C, possibly with subscripts, denote formulae of the quantifier-free fragment of IFOL, whose predicates are drawn from Tab. 3 and 4. The variable w ∈ W, possibly with apices, is a variable over websites. Constants (e.g. facebook.com) denote websites. WL and WR stand for Weakening Left/Right, CL for Contraction Left. We have the following rules for intuitionistic logic:

(Ax)   ─────          (Cut)  N ⊢ A    A, N ⊢ B          (CL)  N, A, A ⊢ B
       A ⊢ A                 ─────────────────                ───────────
                                   N ⊢ B                       N, A ⊢ B

(WL)   N ⊢ B          (WR)   N ⊢
       ─────────             ─────
       N, A ⊢ B              N ⊢ B

(¬L)   N ⊢ A          (¬R)   N, A ⊢
       ─────────             ───────
       N, ¬A ⊢               N ⊢ ¬A
(∧L)   N, A, B ⊢ C          (∧R)   N ⊢ A    N ⊢ B
       ────────────                ──────────────
       N, A ∧ B ⊢ C                  N ⊢ A ∧ B

(∨L)   N, A ⊢ C    N, B ⊢ C        (∨R1)  N ⊢ A          (∨R2)  N ⊢ B
       ────────────────────              ─────────             ─────────
           N, A ∨ B ⊢ C                  N ⊢ A ∨ B             N ⊢ A ∨ B

(→L)   N ⊢ A    N, B ⊢ C          (→R)   N, A ⊢ B
       ─────────────────                 ──────────
         N, A → B ⊢ C                    N ⊢ A → B

In our derivations, we use neither the ∨Ri nor the ∨L rules, as we are only interested in deriving knowledge predicates. We can also have domain-specific axioms of the form A → B that can be added to a derivation with the rule:

(DomAx)  N, A → B ⊢ C    (A → B is a Domain Axiom)
         ─────────────────────────────────────────
                         N ⊢ C

We represent a domain axiom A1 ∧ ... ∧ An → B as a rule with preconditions A1, ..., An and postcondition B, and vice-versa, as in IFOL ⊢ and → are interchangeable. The tracking-specific rules in the next section are indeed domain axioms.

4 Tracking Specific Rules

Information Flow
In Fig. 1a the rules IncludeW and RedirectW show how a link between two websites w' and w can be created. Rule Redirect (Include) in Fig. 1a illustrates how the redirection to (inclusion of) another website can be employed by trackers to pass information, for example cookies. The rule ImpRed (ImpInc) in Fig. 1b shows that the predicate Redirect_cookie (IncludeContent_cookie) implies the predicate Redirect (IncludeContent).

Network Interactions
Rules AccessToW and AccessTo in Fig. 1a describe access to resources with a possible propagation of information between two websites w and w' (with or without an exchange of cookies). The rule PropagateAccess shows how the access can be propagated through websites.

Third-Party Tracking
The rule 3rdpartyTracking in Fig. 1a shows that 3rd-parties present on a website w can track users. This rule can be applied recursively to describe complex interactions among websites, as shown in Appendix A.2. This rule does not consider the possibility of browser fingerprinting (which is not blocked by the Block_tp_cookie mitigation), since we are interested in the data-sharing process that leads to tracking users, and no information on how fingerprints are shared is available. Furthermore, as pointed out by several works [48, 49], browser fingerprinting is not as accurate as cookies to identify users. Thus, ad transactions carried out without the presence of cookies are not enough to produce targeted advertisements [6]. We further discuss this extension in §11.

Cookie Syncing
The rule Sync in Fig. 1c shows the preconditions required to implement cookie syncing between websites w and w'. Cookie syncing requires the exchange of cookies to link the IDs used by the two trackers. This technique is also called First-to-Third-party Cookie Syncing [20]. Rule PropagateSync shows how to propagate cookie syncing through a sequence of websites.

Tracking via Cookie Syncing
In Fig. 1c, the rule SyncTracking describes how cookie syncing between websites w' and w'' allows tracking users even on websites where a tracker is not explicitly present. This is known as Third-to-Third-party Cookie Syncing [20]. We did not define a rule to describe cookie forwarding because it is a special case of 3rdpartyTracking where the tracker passively receives the user's history collected by a 3rd-party. The cookies forwarded could be used for back-office exchange, which is outside the scope of our model. All the rules assume the intention of the websites to track users.

COPPA Compliance
A website w must comply with COPPA if at least one of these conditions holds [42] for w:
1. it is directed to children under 13 y/o and collects PII.
2. it is directed to children under 13 y/o and allows another website w' to collect PII.
3. it has a general audience, but it has actual knowledge that it collects PII from children under 13 y/o.
4. it collects PII from users of a website w' directed to children under 13 y/o.

However, websites that fall under conditions (1) and (3) and collect only persistent identifiers (e.g. cookies) are not obliged to comply with COPPA if the persistent identifier is used for internal operations only. It is important to underline that this exception does not allow behavioral advertising.

Fig. 1. Tracking Derivation Rules

(a) Tracking Flow

(IncludeW)  IncludeContent(w, w')     (RedirectW)  Redirect(w, w')
            ─────────────────────                  ───────────────
                Link(w, w')                          Link(w, w')
  If a website w includes content from (redirects to) site w', then there is a link between w and w' that allows an exchange of information.

(Redirect)  Redirect_cookie(w, w')    (Include)  IncludeContent_cookie(w, w')
            ──────────────────────               ────────────────────────────
              Link_cookie(w, w')                     Link_cookie(w, w')
  During a redirection (inclusion) it is possible to append a cookie of w for w'.

(AccessToW)  Link(w, w')    ¬Block_request(w')
             ─────────────────────────────────
                     Access(w, w')

(AccessTo)   Link_cookie(w, w')    ¬Block_request(w')
             ────────────────────────────────────────
                     Access_cookie(w, w')
  If a website w includes content from (redirects to) a website w' (this case includes connections exploiting social buttons) that is not blocked by any extension, then the user is forced to access the resources of w' from w.

(PropagateAccess)  Access(w, w')    Access(w', w'')
                   ────────────────────────────────
                            Access(w, w'')
  If a website w forces access to the resources of w' and w' forces access to the resources of w'', then the user that visits w is forced to access website w''.

(3rdpartyTracking)  Visit(w)    Access(w, w')    ¬Block_tp_cookie(w')
                    ─────────────────────────────────────────────────
                                  Knows(w', w)
  If a user visits a website w that forces access to a website w' not blocked by any mitigation, then w' knows that the user visited w.

(b) Tracking Implications

(ImpInc)  IncludeContent_cookie(w, w')    (ImpRed)  Redirect_cookie(w, w')
          ────────────────────────────              ──────────────────────
             IncludeContent(w, w')                     Redirect(w, w')
  Redirect_cookie and IncludeContent_cookie are particular cases of Redirect and IncludeContent respectively.

(c) Tracking With Cookie Sharing

(Sync)  Access_cookie(w, w')    ¬Block_tp_cookie(w')
        ────────────────────────────────────────────
                   Cookie_sync(w, w')
  A website w redirects the user to a website w' inserting cookies of w in the request. If the connection to w' is not blocked by any mitigation and the browser allows w' to set its cookies, then w' can receive w's cookies and synchronize them with its own.

(PropagateSync)  Cookie_sync(w, w')    Cookie_sync(w', w'')
                 ──────────────────────────────────────────
                           Cookie_sync(w, w'')
  Cookie syncing can be propagated.

(SyncTracking)  Knows(w', w)    Cookie_sync(w', w'')
                ────────────────────────────────────
                           Knows(w'', w)
  The presence of cookie syncing with w' allows a website w'' to track users on the website w even if it is not explicitly present.

(d) COPPA Compliance

(COPPAcomplRelease)  Knows(w, w')    Kids(w')    w ≠ w'
                     ──────────────────────────────────
                              req_COPPA(w')
  If a website w tracks users on a children-related website w', then w' should comply with COPPA. This rule covers case (2).

(COPPAcomplCollect)  Knows(w, w')    Kids(w')    w ≠ w'
                     ──────────────────────────────────
                              req_COPPA(w)
  If a website w tracks users on a children-related website w', then w should comply with COPPA. This rule covers case (4).

(COPPAcomplBehav)  Kids(w)    Knows(w, w')    BehavioralAds(w)
                   ───────────────────────────────────────────
                              req_COPPA(w)
  If w is a children-related website that collects PII on an external website w', then it can perform behavioral advertising. This rule covers cases (1) and (3).

(COPPAcomplCS)  Kids(w)    Cookie_sync(w', w)
                ─────────────────────────────
                        req_COPPA(w)
  If w is a children-related website and performs cookie syncing with w' (i.e. it receives cookies from w'), then it can create profiles for its users for behavioral advertising. It is a special case of COPPAcomplBehav.

Fig. 1d shows the rules that describe when a website should comply with COPPA. The predicate Kids(w) describes a website directed to children under 13 y/o; req_COPPA(w) identifies a website that should comply with COPPA.

COPPAcomplRelease and COPPAcomplCollect describe conditions (2) and (4): if a children-related website w' allows a website w to track its users, then both websites must comply with COPPA. We impose w ≠ w' so as not to fall into conditions (1) and (3), where COPPA is not mandatory if the data is used for internal activities. It is important to underline that in our model the Knows predicate implies the employment of a persistent identifier (e.g. cookies). The scenario described in COPPAcomplCollect is not always straightforward to observe due to the exchange of cookies (e.g. cookie syncing) among websites.

COPPAcomplBehav captures cases (1) and (3). Our model describes only personal identifiers, thus we need to determine if a certain website uses this information for external operations (e.g. behavioral advertising). BehavioralAds(w) could be instantiated using the approach presented by Liu et al. [50], and we leave the integration with our model for future work. Rule COPPAcomplCS shows a special case of COPPAcomplBehav where a children-related website w receives cookies from another website. Cookie syncing is a known technique utilized for behavioral advertising [4]. It is important to underline that the opposite case (Cookie_sync(w, w')), in which the children-related website w sends cookies to an external website, is already covered by the rule COPPAcomplCollect, since Cookie_sync(w, w') generates a Knows(w', w) that triggers the mentioned rule.

Do we really need a formal approach? Consider Youtube and assume Kids(youtube.com) always holds (as sometimes it might be necessary to treat information according to COPPA). We might be tempted to conclude that any website that includes cookies from youtube.com should be COPPA compliant. This informal reasoning seems to imply that any website importing a social button or a video from Youtube should be COPPA compliant. However, by applying the set of rules we previously presented, we can instead prove that this is actually incorrect. Suppose Kids(youtube.com) and that we have IncludeContent(abc.com, youtube.com) due to the social gadget; then, by applying rules IncludeW, AccessToW, and 3rdpartyTracking, we have Knows(youtube.com, abc.com). At this point, none of the COPPA rules can produce req_COPPA(abc.com). Instead, it is possible to trigger rule COPPAcomplBehav by showing that youtube.com is performing behavioral advertising and thus should comply with COPPA.

As previously stated, the predicate Knows describes the potential ability of a website to track users. Thus, the obtainment of the predicate req_COPPA is not by itself a definitive proof of the need for compliance. However, it provides an explanation that can trigger further investigations by the FTC on which data are actually sent (for example, due to a complaint of a parent). Websites must then prove that the exchange either did not occur or did not contain children's information.

5 Decidability and Theorem Proving

To show the decidability of our construction we rely on the relation between logic programs and a fragment of intuitionistic logic (in particular Harrop formulae [51]).

Theorem 1 (PTIME Knows Decidability). It is possible to decide whether the internet snapshot N allows a website w* to know about the user's visit to another website w (N ⊢ Knows(w*, w)) in polynomial time in the size of the snapshot N.

Proof. We rely on embedding both the snapshot and the rules as Harrop formulae.

  G ::= A | G1 ∧ G2 | H → G
      | G1 ∨ G2 | ∀w G | ∃w G    (not used here)
  H ::= A | G → A | ∀w H | H1 ∧ H2

where A is a predicate, G is a goal formula and H is a Harrop formula. An internet snapshot N is encoded as a (large) conjunction, which is a Harrop formula. Each rule from §4 is also encoded as a Harrop formula. For example, 3rdpartyTracking can be coded as the Harrop formula

  ∀w ∀w'. (Visit(w) ∧ Access(w, w') ∧ (Block_tp_cookie(w') → ⊥)) → Knows(w', w)

which parses as H ::= ∀w∀w' H with inner H ::= G → A, where the goal G is the conjunction of the atoms Visit(w) and Access(w, w') with Block_tp_cookie(w') → ⊥ (itself of the form H → A, encoding the negated premise).

From Theorem A in [51], the pair of the query and the rules of LJ from §3.2 form a logic programming language. As we have no disjunction on the right of ⊢ for the query of interest, the rules (∨Ri) responsible for the PSPACE complexity of intuitionistic logic do not apply. Since N is finite, there are at most O(|N|²) different constants, as we have at most two arguments for each predicate. Hence, the instantiation of all quantified formulae embedding the rules generates at most O(|N|⁶)
A Calculus of Tracking: Theory and Practice 267 ground propositional rules (we have at most three vari- predicates present in the proof and the interpolant. able per rule), even if no optimization can be done (e.g. We can then eliminate N1 from N and try to derive distinguishing between content delivery networks and N \ N1 ` Knows(w ∗ , w). If we succeed, it means there is actual websites). Thus, the ground instantiation of the another way to exchange data, so we extract a new sub- rules is poly in the size of the snapshot and also the set N2 and continue the process until for Ni , i = 1 . . . no query of interest can be decided in poly time. derivation is possible. As deciding a single query is de- cidable in polynomial time (See Theorem 1) the process We do not claim that the calculus of tracking for terminates after a polynomial number of interactions. arbitrary formulae including knowledge predicates is The union of all sets Ni is the desired set Nω . tractable. The presence of disjunction on the right would make decidability jump to PSPACE [52] as one could As immediate from the proof above, one could also stop encode QBF as a decision problem in the formula on the search as soon as the first subset of the internet the right of `. This decision problem could well use snapshot, N1 , responsible for the tracking is identified. Knows(w ∗ , w) as predicates but they could be replaced This is what we do with a theorem prover. There may with abstract ps and qs and would have no relation with be more than one proof because a prover can choose to the complexity of inferring visibility relations on the in- apply one rule before another one according to a suitable ternet. As of now, we do not see a practical need to heuristic that may lead to a faster proof search (see incorporate disjunction on the right. GAPT [57] for additional information). 
Different proofs From Theorem 1 follows that COPPA compliance may also come from the existence of different tracking rules can be also encoded as Hereditary Harrop formulas possibilities on the Internet. The important thing is that using the knowledge relations as basic atoms: one can be found in poly time (see Th.1). Corollary 1.1 (PTIME COPPA Compliance). It is possible to decide whether the internet snapshot N requires a website w to be COPPA Compliant (N ` 6 Using the Calculus for Tracking req_COPPA(w)) in poly time in the size of N . Relations Next, we show that from a derivation we can reconstruct We formally define tracking relations that are of prac- the connections responsible for the tracking. tical interest through our formal model. We illustrate some of these relations in the practical case of Alexa Theorem 2 (Map Proofs to Configurations). Top 5, 10, 50, and 100 websites later in §9. Given a derivation of Knows(w ∗ , w) from an inter- net snapshot N (N ` Knows(w ∗ , w)), one can extract an essential subset of the configuration Nω ⊆ N such that Flow Propagation N \Nω 6` Knows(w ∗ , w). Given the sharing of information through redirections, content inclusions, and cookie syncing and given a se- Proof. This result follows from the existence of uniform quence of visited web sites, we can study how the knowl- proofs6 for the fragment of interest [53] and the exis- edge about this sequence is distributed on the Internet. tence of a feasible interpolation for intuitionistic logic This is possible through a graph where we underline [54, 55]. Given a derivation of N ` Knows(w ∗ , w) one edges with predicate Knows(w 0 , w) to identify the web- can construct a uniform proof and the existence of the sites that know if a user visited another website. 
We interpolant guarantees that we have a set of formu- can map this representation to a Venn diagram where lae that only includes constants shared from the an- we identify which trackers are directly and indirectly tecedent (the internet configuration) and the succedent included in the websites visited. We define a mapping (the knowledge predicate). Hence we can use the proof between the predicate Knows and the set theory: to reconstruct the tracking steps and data exchanges responsible for Knows(w ∗ , w) in a subset N1 , as the KnowsUser(N , w) = {w∗ | N ` Knows(w ∗ , w)} (3) where KnowsUser(N , w) represent the set of websites w∗ that are able to track a user on the website w in an 6 A finite constructive process applies uniformly to every for- mula, either producing an intuitionistic proof of the formula or Internet snapshot N . demonstrating that no such proof can exist.
Lowest Tracking Coverage
Our formal model generates relations between websites through Knows predicates for a given N. We compare different mitigations to determine which produces the lowest tracking coverage. A mitigation X applied to an Internet snapshot N (written N*_X) disables some Knows predicates.

Definition 2 (Mitigation Subsumption). Let N be an Internet snapshot and N*_X, N*_Y two mitigations. We say that the mitigation N*_X is more effective than N*_Y iff for all pairs (w, w′): N, N*_X ⊢ Knows(w′, w) implies N, N*_Y ⊢ Knows(w′, w).

Intuitively, any Knows predicate obtained from N by applying the mitigation N*_X is also obtained from N by applying N*_Y. We developed a script that automatically generates TPTP input files for Slakje to obtain a proof of mitigation subsumption. Unfortunately, this definition can rarely be applied. Indeed, if the mitigations modify different parts of the graph of Knows predicates, the results cannot be compared. Thus, we propose a quantitative analysis.

Given an Internet snapshot N, a mitigation N*_X is better than a mitigation N*_Y if both conditions hold:
– C1: The mitigation N*_X blocks access to a smaller number of websites compared to the mitigation N*_Y.
– C2: The trackers obtained with N*_X are smaller in number compared to the trackers obtained with N*_Y.

Given an Internet snapshot N, we define its access size and knowledge size respectively as follows:

    ||N||_A = Σ_{w∈W} |{(w, w∗) ∈ W² | N ⊢ Access(w, w∗)}|
    ||N||_K = Σ_{w∈W} |{(w, w∗) ∈ W² | N ⊢ Knows(w, w∗)}|

Site breakage, used to compare mitigations' performance [37], is directly influenced by the access size. We can now compare two mitigations N*_X and N*_Y.

Definition 3 (Quantitative Mitig. Subsumption). Mitigation N*_X is quantitatively more effective than mitigation N*_Y iff the access size of X is larger than (or equal to) the access size of Y, ||N*_X||_A ≥ ||N*_Y||_A, and the knowledge size of X is smaller than the knowledge size of Y, ||N*_X||_K < ||N*_Y||_K.

Intuitively, the mitigation X reduces the number of trackers that know the user's visited websites more than Y does, while still keeping a larger (or equal) number of accessible sites than Y. The ideal performance would be to drop one accessed site per blocked accessed tracker (i.e. we lose only the tracker itself).

However, one of the major concerns in terms of privacy is not that a high number of trackers know about fragments of a user's activity, but that few trackers can reconstruct (almost entirely) the activity of a user. For example, Google is present in roughly 80% of the Top 1 million domains [19] and thus has a high tracking coverage. We thus propose an additional definition:

Definition 4 (Mitigation Subsump. per Tracker). Mitigation N*_X is quantitatively more effective than mitigation N*_Y against the tracker w iff the access size of X is larger than (or equal to) the access size of Y, ||N*_X||_A ≥ ||N*_Y||_A, and the knowledge size of X projected to w is smaller than the knowledge size of Y projected to w.

In other words, the mitigation N*_X produces a higher reduction of websites where the tracker w can control a user compared to the mitigation N*_Y, while still keeping a larger (or equal) number of accessible sites than Y. Unfortunately, it is hard to fulfill both conditions (as we will see in §9.1, Fig. 7): being less tracked means losing more access.

7 Coping With Uncertainty

The model considers the possibility of tracking practices given a static description of the Internet (e.g. from OpenWPM). However, interactions among websites are often non-deterministic [56] and the same applies to tracking behaviors [7]. For example, 3rd-parties embedded in a website can include different 3rd-parties depending on the results of RTB, thus changing the set of connections observed. Furthermore, trackers can have different behaviors on the same site; for example, a tracker can behave both as analytics (and thus cannot track the users over different websites) and as a 3rd-party tracker [7]. To handle this uncertainty, we extended the model by considering the likelihoods of deriving predicates over different snapshots:

– A snapshot N is a picture of the Internet at some point in time and it is deterministic by construction (either an inclusion is there or it is not, and the same holds for mitigations). We intuitionistically derive either the predicate A, ¬A, or neither (if we do not have enough data, e.g. OpenWPM failed to detect an inclusion), but never both.
– Uncertainty stems from the fact that snapshots change over time. So yesterday A held, the day before yesterday ¬A held, and three days ago neither was derivable. What can we conclude about today if we do not want to resample it?

The overall semantics for N snapshots is:

    ⟨N₁, …, N_N⟩ ⊢ A(a, b) ⟺ |{Nᵢ : Nᵢ ⊢ A}| = a·N ∧ |{Nᵢ : Nᵢ ⊢ ¬A}| = (1 − b)·N

To each derivation we associate a minimal and a maximal likelihood in the same way Ferson et al. [57] derived a probability-box:

– a: the likelihood that A is derivable from the predicates in the Internet snapshots.
– 1 − b: the likelihood that ¬A is derivable from the predicates in the Internet snapshots.

In other words, a is the minimum likelihood that A is going to be true, while b is the maximum. The p-box captures the evidence across all snapshots, so it must potentially include evidence that A holds for sure (a·N snapshots) and that it does not hold for sure ((1 − b)·N snapshots where ¬A was found to hold, e.g. when mitigations were detected). The gap between a and b measures the uncertainty, i.e. what we cannot prove because the law of excluded middle does not hold and thus b ≠ 1 − a. To compute the uncertainty, a brute-force solution is simply to derive the proof for each snapshot and then aggregate the results.

The likelihood provides information on the Internet as an evolving ecosystem. At any given time, of course, in the snapshot true at that moment the probability collapses to 0 or 1, in the same way that a tossed coin is always heads or tails.

To illustrate the extension we considered 17 snapshots from January 2016 to October 2017 obtained from WebCensus (see Section 8) and we computed the likelihood that the derivations responsible for the proof of Knows(revsci.net, qq.com) in Fig. 15 are obtained from the snapshots. Fig. 15 shows the likelihood derivation for a snapshot Nᵢ. In Appendix A.3 we present the exhaustive set of rules used in the proof. The maximum likelihood b for some derivations is 1 because, as we cannot observe a ¬IncludeContent, we cannot exclude that the interaction happened but we failed to observe it.

8 Dataset and Scalability

8.1 Dataset for Internet Snapshot

We evaluate our model with the 10k Site ID Detection(1) 2016 dataset⁷ collected using a stateful instance of OpenWPM⁸. We summarize the tables and columns employed to instantiate the predicates of our model in Fig. 2. The table site_visits contains the list of the Top 10k Alexa visited domains. The table urls contains the set of URLs loaded during the crawling. The table http_responses contains the HTTP responses.

From the dataset, we extracted the sequences of redirections and inclusions necessary to instantiate the predicates. From table http_responses we employed visit_id (ids for the top 10k websites), url_id (ids for the URLs of the HTTP responses), response_status (HTTP status codes), location_id (in case of a redirection, the id of the new URL to visit; NULL otherwise), and time_stamp (timestamps of the HTTP responses).

We provide an example of the mapping into the model in Appendix A.5. Tab. 5 shows the number of HTTP responses received for the Top Alexa. The responses are used to find the sequence of redirections. In addition, Tab. 5 shows the number of predicates obtained by applying our model on the Top 10, 20, 30, 40, and 50 Alexa domains of the dataset without any mitigation. After visiting the Top 50 domains the users contacted 190 different websites with more than 6k connections. The number of redirections remains relatively small compared to the number of inclusions observed.

The numbers of HTTP responses in Tab. 5 for the Top 30, 40 and 50 domains are slightly different from the values of Link(w, w′) because the crawler failed to collect some HTTP responses during a sequence of redirections, probably due to network problems. However, we can correctly close the sequence even if a response is missing.

To compute the sequence of redirections, we first extract the sequence of HTTP responses for each visit_id considered, then order the responses using the column time_stamp (to avoid considering intermediate redirections as the beginning of a new connection), and extract the redirections from the set of HTTP responses. Cookie syncing can be detected by analyzing URLs [20] and payloads in POST requests.

⁷ https://webtransparency.cs.princeton.edu/webcensus/
⁸ https://github.com/mozilla/OpenWPM
For illustrative purposes, we use here 200 empirically validated domain pairs performing cookie syncing from [17].

Fig. 2. Mapping of the Predicates to the Dataset. Tables from OpenWPM used to instantiate the predicates Visits, IncludeContent, and Redirect of our model (continuous lines). It is also possible to instantiate the predicates IncludeContent_cookie and Redirect_cookie from the tables http_responses and urls (dashed lines), but in Section 9.1 we employed empirically validated pairs from [17].

Table 5. # of Predicates and HTTP Responses for the Top Alexa

    Variables vs Top Domains     10    20    30    40    50
    HTTP responses              925  1957  2864  3618  4530
    IncludeContent(w, w′)       824  1803  2681  3391  4272
    Redirect(w, w′)             101   154   184   229   261
    Link(w, w′)                 925  1957  2865  3620  4533
    Link_cookie(w, w′)            3     3     3     5     6
    Access(w, w′)               925  2272  3636  5024  6382
    Access_cookie(w, w′)          3     3     3     5     6
    Cookie_sync(w, w′)            3     3     3     7     8

Fig. 3. Fragment of the TPTP Input for Slakje to Prove Knows(fbcdn.net, facebook.com).

8.2 Theorem Proving Implementation

We leverage the GAPT tool [58] to generate proofs for the Knows and the req_COPPA predicates. We use the intuitionistic prover Slakje [59] to produce formal proofs based on the rules in our model. We encode the model and the data using the TPTP syntax. The data used for the axioms is generated from actual data obtained using OpenWPM. We instantiated the Kids(w) predicate using the Top 50 Alexa in the Kids and Teens category. This approach is fully automated by a script that generates a sequence of axioms from the database, the model, and the conjecture to prove.

Fig. 3 shows a fragment of the TPTP input for the prover, where the model is encoded and the relevant data are inserted as axioms. Fig. 15 and 14 in Appendix A.4 show examples of the proofs generated by Slakje for Knows(revsci.net, qq.com) and req_COPPA(flashtalking.com), respectively. We evaluated the performance of Slakje by assuming the Top 5, 10, 50, and 100 as visited domains to generate a proof for Knows(facebook.com, fbcdn.net) and vice versa.

Table 6. Timing for Successful and Failed Proof Attempts. Runs of Slakje to prove Knows(fbcdn.net, facebook.com) and vice versa (not provable) with different visited domains.

    # visited   TPTP input   Successful proof   Failed proof
    domains     [# axioms]   Time [sec]         Time [sec]
    5                   75          1.4                 1.1
    10                 209          1.8                 1.6
    50                 867         10.5                19.3
    100              2,343      1,469.8              >3,600

Tab. 6 shows the performance with an Intel i7-8750H @ 2.20GHz and 2 GB RAM for the Java VM.

8.3 Scalability

Our complexity analysis gives an upper-bound of O(|N|⁶), which is inadequate for the application of the approach beyond very compact domains. Indeed, our goal is not to provide Internet-scale analysis but third-party verifiable evidence for individual cases where numbers are manageable.
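The axiom-generation step of the pipeline (database → axioms → conjecture) can be sketched as below: predicate facts and a simplified rule set are serialized as TPTP `fof` formulae together with the Knows conjecture. The exact encoding the authors feed to Slakje is the one in Fig. 3; the formula names, quoting, and the omission of the Block_tp_cookie condition here are our own simplifications.

```python
# Sketch of the TPTP axiom-generation script: facts extracted from the
# database become fof axioms, the (simplified) tracking rules become
# universally quantified axioms, and the query becomes the conjecture.
# Formula names and quoting are illustrative, not the paper's encoding.

def tptp_axioms(facts, goal):
    """facts: iterable of tuples like ("includecontent", "abc.com", "t.com").
    goal: (tracker, visited) pair for the Knows conjecture."""
    lines = []
    for i, f in enumerate(facts):
        pred, args = f[0], ",".join(f"'{a}'" for a in f[1:])
        lines.append(f"fof(ax{i}, axiom, {pred}({args})).")
    # simplified rule set: Link from inclusions/redirections, transitive
    # Access, and third-party tracking yielding Knows
    lines += [
        "fof(r_inc, axiom, ![W,W2]: (includecontent(W,W2) => link(W,W2))).",
        "fof(r_red, axiom, ![W,W2]: (redirect(W,W2) => link(W,W2))).",
        "fof(r_acc1, axiom, ![W,W2]: (link(W,W2) => access(W,W2))).",
        "fof(r_acc2, axiom, ![W,W2,W3]: ((access(W,W2) & link(W2,W3)) => access(W,W3))).",
        "fof(r_track, axiom, ![W,W2]: ((visits(W) & access(W,W2)) => knows(W2,W))).",
        f"fof(goal, conjecture, knows('{goal[0]}','{goal[1]}')).",
    ]
    return "\n".join(lines)
```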
For example, users rarely visit more than 100/120 websites [35, 60]⁹, the cookie duration is typically short [61], and cliques, important for COPPA, are relatively small [19]. Thus, scalability in this application is not a problem.

As shown in Tab. 6, the time required to generate a proof increases with the number of axioms. This number depends on the interactions observed by the user on the visited websites. To improve performance we perform DBMS slicing by extracting only the interactions that are obtained from the user's visited websites (e.g. Top 5, Top 10, etc.), and not the entire set of Internet interactions, and then perform proof reconstruction. Unsound search followed by proof reconstruction is a new trend in Automated Reasoning [62]. This is the minimum set of interactions (and thus axioms) that must be considered to avoid missing possible tracking practices. For example, if we extract only the interactions generated by visiting a website w, and not those of all the other visited websites, we can miss interactions generated from other visits that reach w, as shown in Fig. 4.

Fig. 4. The Problem of Determining Knows(w′, facebook.com). Determining which websites w′ know about the visit to facebook.com (Knows(w′, facebook.com)) by analyzing only the interactions generated by the facebook.com visit misses interactions generated by adobe.com (visited by the user) with facebook.com, and thus potential trackers.

9 Analysis of Mitigations

9.1 Evaluation of Tracking Relations

We evaluated our approach on the dataset previously presented with the filter lists of three widely deployed extensions (Ghostery, Disconnect, Adblock Plus). We consider neither the Firefox third-party cookie blocking feature for unvisited websites¹⁰ nor other Firefox configurations that were either too restrictive (e.g. block all cookies) or overlapping (e.g. Firefox uses the Disconnect blacklist). We used the blacklists of Ghostery, Disconnect, and Adblock Plus from Bashir et al. [3, 17] (the data was collected in 2016 too). We then compared the effectiveness of some of the mitigations (Disconnect and Adblock Plus) in 2016 with their 2019 versions. We also extended the comparison with Privacy Badger and Adblock Plus (enforced with EasyList&EasyPrivacy) in the 2019 scenario.

Flow Propagation
Fig. 5 shows the graph of Access obtained by applying our rules on the Top 5 Alexa domains without any mitigation, while Fig. 6 shows the Venn diagrams obtained by computing the Knows predicates of the model without any mitigation and with the Disconnect mitigation.

⁹ Skewed towards tech-savvy users, thus these values are likely upper bounds.
¹⁰ This feature is unable to block Google in certain situations. Firefox employs by default the Google search engine and, thus, establishes connections with Google domains if the website is not accessed directly through its domain name (we assume non-tech-savvy users behave in this way). As a result, Google domains (and all its subdomains) are whitelisted by Firefox and can bypass the third-party cookie block.
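Applying a filter list before recomputing the Knows relation can be sketched as below: Block_request(w) drops every connection towards w, Block_tp_cookie(w) keeps the connection but removes w from the tracker sets, and the resulting per-site sets correspond to the Venn-diagram circles of Fig. 6. The rule set is the simplified Link/Access chain without cookie syncing, and all names are illustrative.

```python
# Sketch of applying a mitigation to a snapshot of connections:
# Block_request(w) removes connections towards w entirely, while
# Block_tp_cookie(w) only stops w from tracking. knows_user then
# computes, per visited site, the third parties that can know the visit.

def apply_mitigation(links, block_request, block_tp_cookie):
    """links: set of (src, dst) connections.
    Returns (kept connections, trackers whose cookies are blocked)."""
    kept = {(a, b) for (a, b) in links if b not in block_request}
    return kept, set(block_tp_cookie)

def knows_user(visits, links, blocked_trackers=frozenset()):
    """Per visited site, the set of third parties that can know the visit."""
    # transitive closure of links (Access), then filter blocked trackers
    access = set(links)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(access):
            for (c, d) in links:
                if b == c and (a, d) not in access:
                    access.add((a, d))
                    changed = True
    return {w: {t for (v, t) in access
                if v == w and t not in blocked_trackers and t != w}
            for w in visits}
```

Comparing `knows_user` before and after `apply_mitigation` gives exactly the kind of per-site shrinkage reported for Disconnect in Fig. 6 (e.g. a visited site losing most of its reachable trackers once a central third party is blocked).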
Fig. 5. Access Graph for the Top 5 Alexa Domains. Access predicates obtained without any mitigation in the Top 5 Alexa domains. Several connections are made to different third-party domains. Understanding how many trackers can potentially know about your youtube.com visits is far from trivial (even ignoring any back-office sharing agreement).

Fig. 6. Comparing Tracking Knowledge for the Alexa Top 5. (a) KnowsUser(N, w) without mitigations: each circle is a visited Top 5 Alexa site and includes the trackers which can potentially know about this visit. (b) KnowsUser(N, w) with the Disconnect mitigation: Disconnect significantly limits potential trackers when visiting youtube.com (from 9 to 4) and yahoo.com (from 9 to 3).

Lowest Tracking Coverage
Fig. 6a shows the Venn diagrams for the Top 5 domains without any mitigation (N*_B = ∅), while Fig. 6b shows the Venn diagram with the Disconnect mitigation (N*_A = Disconnect). From Def. 2 we have that N*_A is more effective than N*_B.

Comparing Different Mitigations
We compared the effectiveness of the filter lists of Ghostery, Disconnect, and Adblock Plus (based on EasyList) in 2016, and of Disconnect, Adblock Plus (based on EasyList), Adblock Plus (enforced with EasyList&EasyPrivacy), and Privacy Badger ("trained" on the Top 200 Alexa domains in December 2019) in 2019, based on Def. 3 presented in §6. We extracted the requests from the dataset and recursively applied the filter lists of the different mitigations to the connections established for the Top 5, 10, 50, 100 Alexa domains. Except for Privacy Badger, which provides the list of domains it "learned" either to block completely (Block_request(w)) or to stop setting cookies for (Block_tp_cookie(w)), for all the other mitigations we rely on their blacklist of domains, i.e. only Block_request(w). The results are normalized with respect to N without any mitigation.

We employed the filter lists from [3, 17] and the database previously presented to analyze the effectiveness of the mitigations in 2016. We then computed the current effectiveness of the filter lists of Adblock Plus (with and without the addition of the EasyPrivacy list), Disconnect, and Privacy Badger in 2019 with an up-to-date database¹¹ from June 2019. Fig. 7 shows the comparison of the mitigations. The dashed line and the dash-dotted line correspond to two different efficiency levels. The first is a 1-for-1 drop: for each connection that the mitigation blocks, it blocks one tracker; the second represents a 1-for-2 drop: for each connection that the mitigation blocks, it blocks two trackers.

Fig. 7a shows that, among the filter lists in 2016, Disconnect is the most aggressive mitigation up to the Top 100 domains, where Ghostery behaves similarly. Adblock Plus is the most permissive mitigation in 2016. However, Adblock Plus shows a big increase in efficacy in its 2019 version. For example, in the Top 100, a 26% reduction of the accessed content generates a 66% decrease of trackers. It is worth mentioning that the filter list of Adblock Plus from [3] is also roughly 46 times smaller than the list in 2019 and that currently there is overlap between EasyList and EasyPrivacy [37]. In contrast, Disconnect does not significantly improve in 2019, despite a more restrictive behavior. We found that the combination of EasyList and EasyPrivacy (EasyList&EasyPrivacy) achieves the highest protection at the cost of the most restrictive approach. Finally, Privacy Badger showed a similar level of protection compared to Disconnect and EasyList&EasyPrivacy but with a significantly higher number of connections allowed, due to its balancing of blocking connections and cookies.

¹¹ The database contains the same information as the 2016 database with small differences in the structure; for example, the 2019 version presents a table for the redirections.