The Web Never Forgets: Persistent Tracking Mechanisms in the Wild

Page created by Albert Farmer
 
CONTINUE READING
The Web Never Forgets: Persistent Tracking Mechanisms in the Wild
The Web Never Forgets:
                         Persistent Tracking Mechanisms in the Wild

                            Gunes Acar1 , Christian Eubank2 , Steven Englehardt2 , Marc Juarez1
                                            Arvind Narayanan2 , Claudia Diaz1

                                                 1
                                                     KU Leuven, ESAT/COSIC and iMinds, Leuven, Belgium
                                                            {name.surname}@esat.kuleuven.be
                                                                               2
                                                                                   Princeton University
                                                            {cge,ste,arvindn}@cs.princeton.edu

ABSTRACT                                                                                              1.    INTRODUCTION
We present the first large-scale studies of three advanced web
tracking mechanisms — canvas fingerprinting, evercookies                                                 A 1999 New York Times article called cookies compre-
and use of “cookie syncing” in conjunction with evercookies.                                          hensive privacy invaders and described them as “surveillance
Canvas fingerprinting, a recently developed form of browser                                           files that many marketers implant in the personal computers
fingerprinting, has not previously been reported in the wild;                                         of people.” Ten years later, the stealth and sophistication of
our results show that over 5% of the top 100,000 websites                                             tracking techniques had advanced to the point that Edward
employ it. We then present the first automated study of                                               Felten wrote “If You’re Going to Track Me, Please Use Cook-
evercookies and respawning and the discovery of a new ev-                                             ies” [18]. Indeed, online tracking has often been described
ercookie vector, IndexedDB. Turning to cookie syncing, we                                             as an “arms race” [47], and in this work we study the latest
present novel techniques for detection and analysing ID flows                                         advances in that race.
and we quantify the amplification of privacy-intrusive track-                                            The tracking mechanisms we study are advanced in that
ing practices due to cookie syncing.                                                                  they are hard to control, hard to detect and resilient
   Our evaluation of the defensive techniques used by                                                 to blocking or removing. Canvas fingerprinting uses the
privacy-aware users finds that there exist subtle pitfalls —                                          browser’s Canvas API to draw invisible images and ex-
such as failing to clear state on multiple browsers at once                                           tract a persistent, long-term fingerprint without the user’s
— in which a single lapse in judgement can shatter privacy                                            knowledge. There doesn’t appear to be a way to automati-
defenses. This suggests that even sophisticated users face                                            cally block canvas fingerprinting without false positives that
great difficulties in evading tracking techniques.                                                    block legitimate functionality; even a partial fix requires a
                                                                                                      browser source-code patch [40]. Evercookies actively circum-
Categories and Subject Descriptors                                                                    vent users’ deliberate attempts to start with a fresh pro-
                                                                                                      file by abusing different browser storage mechanisms to re-
K.6.m [Management of Computing and Information                                                        store removed cookies. Cookie syncing, a workaround to
Systems]: Miscellaneous; H.3.5 [Information Storage                                                   the Same-Origin Policy, allows different trackers to share
and Retrieval]: Online Information Services — Web-based                                               user identifiers with each other. Besides being hard to de-
services; K.4.4 [Computers and Society]: Electronic                                                   tect, cookie syncing enables back-end server-to-server data
Commerce — Security                                                                                   merges hidden from public view.
                                                                                                         Our goal is to improve transparency of web tracking
Keywords                                                                                              in general and advanced tracking techniques in particular.
                                                                                                      We hope that our techniques and results will lead to bet-
Web security; privacy; tracking; canvas fingerprinting;
                                                                                                      ter defenses, increased accountability for companies deploy-
browser fingerprinting; cookie syncing; evercookie, Java-
                                                                                                      ing exotic tracking techniques and an invigorated and in-
Script; Flash
                                                                                                      formed public and regulatory debate on increasingly persis-
                                                                                                      tent tracking techniques.
                                                                                                         While conducting our measurements, we aimed to auto-
                                                                                                      mate all possible data collection and analysis steps. This
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
                                                                                                      improved the scalability of our crawlers and allowed us to
for profit or commercial advantage and that copies bear this notice and the full citation             analyze 100,000 sites for fingerprinting experiments, as well
on the first page. Copyrights for components of this work owned by others than the                    as significantly improve upon the scale and sophistication of
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or                the prior work on respawning, evercookies and cookie sync-
republish, to post on servers or to redistribute to lists, requires prior specific permission         ing.
and/or a fee. Request permissions from permissions@acm.org.
CCS’14, November 3–7, 2014, Scottsdale, Arizona, USA.                                                 1.1   Contributions
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2957-6/14/11 ...$15.00.
                                                                                                        First study of real-world canvas fingerprinting
http://dx.doi.org/10.1145/2660267.2660347.                                                            practices. We present the results of previously unreported

                                                                                                674
The Web Never Forgets: Persistent Tracking Mechanisms in the Wild
canvas fingerprinting scripts as found on the top 100,000                  We show that cookie syncing can greatly amplify privacy
Alexa sites. We find canvas fingerprinting to be the most                breaches through server-to-server communication. While
common fingerprinting method ever studied, with more than                web privacy measurement has helped illuminate many pri-
5% prevalence. Analysis of the real-world scripts revealed               vacy breaches online, server-to-server communication is not
that they went beyond the techniques suggested by the aca-               directly observable. All of this argues that greater oversight
demic research community (Section 3).                                    over online tracking is becoming ever more necessary.
   Automated analysis of evercookies and respawn-                          Our results only apply to desktop browsing. Studying
ing. We describe an automated detection method for ever-                 similar tracking mechanisms on mobile platforms requires
cookies and cookie respawning. Applying this analysis, we                distinct methodologies and infrastructure and is left to fu-
detected respawning by Flash cookies on 10 of the 200 most               ture work.
popular sites and found 33 different Flash cookies were used
to respawn over 175 HTTP cookies on 107 of the top 10,000                2.   BACKGROUND AND RELATED WORK
sites. We also uncover a new evercookie vector, IndexedDB,
which was never found in the wild before (Section 4). Re-
                                                                            The tracking mechanisms studied in this paper can be
markably, respawning has already led to a lawsuit and a
                                                                         differentiated from their conventional counterparts by their
$500,000 settlement [14], and yet it is quite prevalent on the
                                                                         potential to circumvent users’ tracking preferences, being
web.
                                                                         hard to discover and resilient to removal. We selected three
   Cookie syncing privacy analysis. We find instances of
                                                                         of the most prominent persistent tracking techniques — can-
syncing of respawned IDs in the wild, i.e., an ID respawned
                                                                         vas fingerprinting, evercookies and cookie syncing — based
by one domain is passed to another domain. Respawning
                                                                         on the lack of adequate or comprehensive empirical measure-
enables trackers to link a user’s browsing logs before cookie
                                                                         ments of these mechanisms in the wild. We now give a brief
clearing to browsing logs after cookie clearing. In our mea-
                                                                         overview of these techniques.
surements, approximately 1.4% of a user’s browser history
can be linked this way in the wild. However, the figure                  Canvas fingerprinting: Canvas fingerprinting is a type of
jumps to at least 11% when these respawned cookies are                   browser or device fingerprinting technique that was first pre-
subsequently synced. Cookie syncing also allows trackers                 sented in a paper by Mowery and Shacham in 2012 [32].
to merge records on individual users, although this merging              The authors found that by using the Canvas API of modern
cannot be observed via the browser. Our measurements in                  browsers, an adversary can exploit subtle differences in the
Section 5 show that in the model of back-end merging we                  rendering of the same text or WebGL scenes to extract a
study, the number of trackers that can obtain a sizable frac-            consistent fingerprint that can easily be obtained in a frac-
tion (40%) of a user’s browsing history increases from 0.3%              tion of a second without user’s awareness.
to 22.1%.
   Novel techniques. In performing the above experi-                     The same text can be rendered in different ways on dif-
ments, we developed and utilized novel analysis and data                 ferent computers depending on the operating system, font
collection techniques that can be used in similar web pri-               library, graphics card, graphics driver and the browser. This
vacy studies.                                                            may be due to the differences in font rasterization such as
                                                                         anti-aliasing, hinting or sub-pixel smoothing, differences in
   • Using the strace debugging tool for low-level monitor-              system fonts, API implementations or even the physical dis-
     ing of the browser and the Flash plugin player (Section             play [32]. In order to maximize the diversity of outcomes,
     4.2).                                                               the adversary may draw as many different letters as possi-
   • A set of criteria for distinguishing and extracting                 ble to the canvas. Mowery and Shacham, for instance, used
     pseudonymous identifiers from traditional storage vec-              the pangram How quickly daft jumping zebras vex in their
     tors, such as cookies, as well as other vectors such                experiments.
     as Flash storage. By extracting known IDs, we can                   The entropy available in canvas fingerprints has never
     track them as they spread to multiple domains through               been measured in a large-scale published study like Panop-
     cookie syncing.                                                     ticlick [16]. Mowery and Shacham collected canvas finger-
   Making the code and the data public. We intend                        prints from 294 Mechanical Turk users and computed 5.73
to publicly release all the code we developed for our exper-             bits of entropy for their dataset. Since this experiment was
iments and all collected data, including (i) our crawling in-            significantly limited for measuring the canvas fingerprint-
frastructure, (ii) modules for analysing browser profile data            ing entropy, they had a further estimate of at least 10 bits,
and (iii) crawl databases collected in the course of this study.         meaning one in a thousand users share the same finger-
                                                                         print [32].
1.2    Implications                                                      Figure 1 shows the basic flow of operations to fingerprint
   The thrust of our results is that the three advanced track-           canvas. When a user visits a page, the fingerprinting script
ing mechanisms we studied are present in the wild and some               first draws text with the font and size of its choice and adds
of them are rather prevalent. As we elaborate on in Section              background colors (1). Next, the script calls Canvas API’s
6.1, they are hard to block, especially without loss of con-             ToDataURL method to get the canvas pixel data in dataURL
tent or functionality, and once some tracking has happened,              format (2), which is basically a Base64 encoded representa-
it is hard to start from a truly clean profile. A frequent ar-           tion of the binary pixel data. Finally, the script takes the
gument in online privacy debates is that individuals should              hash of the text-encoded pixel data (3), which serves as the
“take control” of their own privacy online. Our results sug-             fingerprint and may be combined with other high-entropy
gest that even sophisticated users may not be able to do so              browser properties such as the list of plugins, the list of
without significant trade-offs.                                          fonts, or the user agent string [16].

                                                                   675
from a Flash cookie that the user may fail to remove (Fig-
                                                                                             ure 2c).
                                                                                             Cookie syncing: Cookie synchronization or cookie sync-
                                                                                             ing is the practice of tracker domains passing pseudonymous
                     (1)                                                                     IDs associated with a given user, typically stored in cookies,
                                       (2) ToDataURL()
         FillText()                                                                          amongst each other. Domain A, for instance, could pass an
                                       
        FillStyle()                    w0KGgoAAAANSUhEUgAAA
         FillRect()                                                                          ID to domain B by making a request to a URL hosted by
                                       SwAAACWCAYAAABkW7XS
              ...                      AAAeq0leXgV1d0...                                     domain B which contains the ID as a parameter string. Ac-
                                                                                             cording to Google’s developer guide to cookie syncing (which
                                                                                             they call cookie matching), cookie syncing provides a means
                                       (3)                                                   for domains sharing cookie values, given the restriction that
                                  Hash()                                                     sites can’t read each other cookies, in order to better facili-
                                                                                             tate targeting and real-time bidding [4].
 Figure 1: Canvas fingerprinting basic flow of operations                                    In general, we consider the domains involved in cookie sync-
                                                                                             ing to be third parties — that is, they appear on the first-
                                                                                             party sites that a user explicitly chooses to visit. Although
                                                                                             some sites such as facebook.com appear both in a first and
                                                                     (2)
                                                                       (2)Write
                                                                           Write
                                                                           (2) Write
                                                                                             third-party context, this distinction is usually quite clear.
                                                 (1)
                                                   (1)Read
                                                       Read
                                                       (1) Read
                                                                                             The authors of [38] consider cookie synchronization both as
   id=123
    id=123
       id=123 id=123
               id=123
                  id=123 id=123
                           id=123
                              id=123                     id=123 id=123
                                                    id=123
                                                      id=123     id=123
                                                                    id=123                   a means of detecting business relationships between different
                 HTTP
                  HTTP
                    HTTP                  HTTP
                                           HTTP
                                             HTTP                  HTTP
                                                                    HTTP
                                                                      HTTP                   third-parties but also as a means of determining to what de-
    LSOs
     LSOs
       LSOs     Cookies
                 Cookies
                   Cookies   LSOs
                              LSOs
                                LSOs        Cookies LSOs
                                         Cookies
                                          Cookies    LSOs
                                                       LSOs       Cookies
                                                                   Cookies
                                                                     Cookies                 gree user data may flow between parties, primarily through
               (a)                (b)                       (c)                              real-time bidding. In the present work, we study the impli-
                                                                                             cations of the fact that trackers that share an ID through
Figure 2: Respawning HTTP cookies by Flash evercookies:                                      syncing are in position to merge their database entries cor-
(a) the webpage stores an HTTP and a Flash cookie (LSO),                                     responding to a particular user, thereby reconstructing a
(b) the user removes the HTTP cookie, (c) the webpage                                        larger fraction of the user’s browsing patterns.
respawns the HTTP cookie by copying the value from the
Flash cookie.                                                                                2.1    Related work
                                                                                                While HTTP cookies continue to be the most common
                                                                                             method of third-party online tracking [41], a variety of more
                                                                                             intrusive tracking mechanisms have been demonstrated, re-
Evercookies and respawning: A 2009 study by Soltani
                                                                                             fined and deployed over the last few years. In response, var-
et al. showed the abuse of Flash cookies for regenerating
                                                                                             ious defenses have been developed, and a number of studies
previously removed HTTP cookies, a technique referred to
                                                                                             have presented measurements of the state of tracking. While
as “respawning” [43]. They found that 54 of the 100 most
                                                                                             advertising companies have claimed that tracking is essen-
popular sites (rated by Quantcast) stored Flash cookies, of
                                                                                             tial for the web economy to function [42], a line of research
which 41 had matching content with regular cookies. Soltani
                                                                                             papers have proposed and prototyped solutions to carry out
et al. then analyzed respawning and found that several
                                                                                             behavioral advertising without tracking.
sites, including aol.com, about.com and hulu.com, regener-
                                                                                                Fingerprinting, novel mechanisms. Researchers have
ated previously removed HTTP cookies using Flash cookies.
                                                                                             presented novel browser fingerprinting mechanisms such as
A follow up study in 2011 found that sites use ETags and
                                                                                             those based on performance metrics [31], the JavaScript en-
HTML5 localStorage API to respawn cookies [7].
                                                                                             gine [33] , the rendering engine [50], clock skew [23], We-
In 2010, Samy Kamkar demonstrated the “Evercookie,” a                                        bGL and canvas fingerprinting [32]. Most of those stud-
resilient tracking mechanism that utilizes multiple storage                                  ies followed the path opened by the influential Panopticlick
vectors including Flash cookies, localStorage, sessionStor-                                  study [16], which demonstrated the potentials of browser
age and ETags [21]. Kamkar employed a variety of novel                                       fingerprinting for online tracking.
techniques, such as printing ID strings into a canvas image                                     Measurement studies. Web privacy measurement is a
which is then force-cached and read from the cached im-                                      burgeoning field; an influential early work is [25] and promi-
age on subsequent visits. Instead of just respawning HTTP                                    nent recent work includes [29, 41]. Mayer and Mitchell made
cookies by Flash cookies, his script would check the cleared                                 a comprehensive survey of tracking in combination with the
vectors in the background and respawn from any storage                                       policy that surrounds it, and developed a tool for similar
that persists.                                                                               web privacy measurement studies [29]. Roesner et al. ana-
                                                                                             lyzed different tracking methods and suggested a taxonomy
Figure 2 depicts the stages of respawning by Local Shared                                    for third-party tracking [41].
Objects (LSOs), also known as Flash cookies. Whenever                                           Other papers have looked at various aspects of web pri-
a user visits a site that uses evercookies, the site issues an                               vacy, including PII leakage [26], mobile web tracking [17],
ID and stores it in multiple storage mechanisms, including                                   JavaScript inclusions [35], targeted advertisements [27], and
cookies, LSOs and localStorage. In Figure 2a, the value 123                                  the effectiveness of blocking tools [28].
is stored in both HTTP and Flash cookies. When the user                                         Two studies measured the prevalence of different finger-
removes her HTTP cookie (Figure 2b), the website places                                      printing mechanisms and evaluated existing countermea-
a cookie with the same value (123) by reading the ID value                                   sures [37, 6]. Nikiforakis et al. studied three previ-

                                                                                       676
ously known fingerprinting companies and found 40 such                 advertising, without server-side profiles. In Adnostic, the
sites among the top 10K sites employing practices such                 browser continually updates a behavioral profile of the user
as font probing and the use of Flash to circumvent proxy               based on browsing activity, and targeting is done locally [14].
servers [37]. Acar et al. found that 404 sites in the top mil-         PrivAd has a similar model, but includes a trusted party that
lion deployed JavaScript-based fingerprinting and 145 sites            attempts to anonymize the client [20]. RePriv has the more
of the top 10,000 sites leveraged Flash-based fingerprint-             general goal of enabling personalization via interest profiling
ing [6].                                                               in the browser [19]. Bilenko et al. propose a model in which
   In comparison to these studies, we focus on canvas fin-             the user’s profile and recent browsing history is stored in a
gerprinting, which, to the best of our knowledge, has never            cookie [11]. Other work on similar lines includes [8, 49, 34].
been reported to be found in the wild and is much harder
to block.
   Several studies have looked at the use of Flash cook-
                                                                       3.    CANVAS FINGERPRINTING
ies (LSOs) and, in particular, the use of Flash cookies to                Canvas fingerprinting works by drawing text onto canvas
respawn HTTP cookies [43, 7, 30]. Soltani et al. uncovered             and reading the rendered image data back. In the following
the first use of respawning by Flash cookies [43], and in a            experiments we used an instrumented Firefox browser that
follow-up study, Ayenson et al. found the first use of cache           we built by modifying the source code and logged all the
ETags and localStorage for respawning [7]. McDonald and                function calls that might be used for canvas fingerprinting.
Cranor analyzed the landing pages of 100 popular websites,
plus 500 randomly-selected websites, and found two cases               3.1    Methodology and Data collection
of respawning in the top 100 websites and no respawning                   Our methodology can be divided into two main steps. In
in the randomly selected 500 sites [30]. In a recent study,            the first, we identified the ways we can detect canvas fin-
Sorensen analyzed the use of cache as a persistent storage             gerprinting, developed a crawler based on an instrumented
mechanism and found several instances of HTTP cookies                  browser and ran exploratory crawls. This stage allowed us
respawned from cached page content [44]. The main dif-                 to develop a formal and automated method based on the
ference between our study and the papers mentioned here                early findings. In the second step, we applied the analysis
is that we automated respawning detection as explained in              method we distilled from the early findings and nearly fully
Section 4, and this allowed us to analyze orders of magnitude          automated the detection of canvas fingerprinting.
more sites.                                                               Mowery and Shacham used fillText and ToDataURL
   Olejnik et al. studied cookie syncing (which they call              methods to draw text and read image data respectively [32].
cookie matching) [38]. They found that over 100 cookie                 We logged the return value of ToDataURL and, in order to
syncing events happen on the top 100 sites. In comparison              find out the strings drawn onto the canvas, we logged the
to their work, our study of cookie syncing (i) is large-scale,         arguments of fillText and strokeText methods1 .
covering 3,000 sites, (ii) is based on crawling rather than               We logged the URL of the caller script and the
crowd-sourcing, allowing easier comparative measurements               line number of the calling (initiator) code using Fire-
over time and (iii) presents a global view, in that we go be-          fox’s nsContentUtils::GetCurrentJSContext and nsJSU-
yond detecting individual sync events and are able to cap-             tils::GetCallingLocation methods. This allowed us to
ture and analyze the propagation of IDs through the tracking           precisely attribute the fingerprinting attempt to the respon-
ecosystem. Further, we study how cookie syncing interacts              sible script and the code segment. All function call logs were
with respawning, leading to more persistent tracking and               parsed and combined in a SQLite database that allowed us
widening the effects of these two vulnerabilities taken indi-          to efficiently analyze the crawl data. For each visit, we also
vidually.                                                              added cookies, localStorage items, cache metadata, HTTP
   Program analysis of JavaScript (i.e., static analysis and           request/response headers and request bodies to the SQLite
dynamic analysis) is a common technique in web security                database. We used mitmproxy 2 to capture HTTP data and
[46]. A few studies have used such techniques for blocking             parsed data accumulated in the profile folder for other data
or measuring web trackers. Orr et al. use static analysis              such as cookies, localStorage and cache data. The aggre-
to detect and block JavaScript-loaded ads [39]. Tran et al.            gated data were used in the early stage analysis for canvas
use dynamic taint analysis to detect various privacy-invasive          fingerprinting and evercookie detection, which is explained
behaviors [48]. Acar et al. use behavioral analysis to detect          in Section 4.2. Our browser modifications for Firefox con-
fingerprinting scripts that employ font probing [6].                   sist of mere 33 lines of code, spread across four files and the
   Defenses. Besson et al. [10] examined the theoretical               performance overhead of the modifications is minimal.
boundaries of fingerprinting defenses using Quantified In-                We crawled the home pages of the top 100,000 Alexa
formation Flow. Following a more practical approach, Niki-             sites with the instrumented Firefox browser between 1-5
forakis and others developed a defense called PriVaricator             May 2014. We used Selenium [5] to drive browsers to sites
to prevent linkability from fingerprinters by randomizing              and ran multiple Firefox instances in parallel to reduce the
browser features such as plugins [36]. Finally, Unger et al.           1
[50], studied the potentials of browser fingerprinting as a              In addition to these three methods we intercepted calls to
                                                                       MozFetchAsStream, getImageData and ExtractData meth-
defense mechanism against HTTP(S) session hijacking.                   ods which can be used to extract canvas image data. But we
   In Section 6.1 we discuss how existing privacy tools defend         did not put effort into recording the extracted image data
against the advanced tracking mechanisms we study.                     for three reasons: they were not used in the original can-
   Behavioral targeting without tracking. Several pa-                  vas fingerprinting paper [32], they are less convenient for
pers have addressed the question of whether all this tracking          fingerprinting (requires extra steps), and we did not find
is in fact necessary — they proposed ways to achieve the               any script that uses these methods and fingerprints other
                                                                       browser properties in the initial experiments.
purported goals of third-party tracking, primarily targeted            2
                                                                         http://mitmproxy.org/

                                                                 677
crawl time. Implementing some basic optimizations and                  3.2           Results
a naive load limiting check, we were able to run up to 30                 Table 1 shows the prevalence of the canvas fingerprinting
browsers in parallel on a 4-core 8GB desktop machine run-              scripts found during the home page crawl of the Top Alexa
ning GNU/Linux operating system. The modified browsers                 100,000 sites. We found that more than 5.5% of crawled
were run in a chroot jail to limit the effects of the host op-         sites actively ran canvas fingerprinting scripts on their home
erating system.                                                        pages. Although the overwhelming majority (95%) of the
  False positive removal The Canvas API is used by                     scripts belong to a single provider (addthis.com), we discov-
many benign scripts to draw images, create animations or               ered a total of 20 canvas fingerprinting provider domains, ac-
store content for games. During our crawls we found in-                tive on 5542 of the top 100,000 sites5 . Of these, 11 provider
teresting use cases, such as generating dynamic favicons,              domains, encompassing 5532 sites, are third parties. Based
creating tag clouds, and checking font smoothing support.              on these providers’ websites, they appear to be companies
By examining the distinctive features of false positives and           that deploy fingerprinting as part of some other service
the fingerprinting scripts found in the initial experiments,           rather than offering fingerprinting directly as a service to
we distilled the following conditions for filtering out false          first parties. We found that the other nine provider do-
positives:                                                             mains (active on 10 sites) are in-house fingerprinting scripts
                                                                       deployed by first parties. Note that our crawl in this paper
    • There should be both ToDataURL and fillText (or                  was limited to home pages. A deeper crawl covering internal
      strokeText) method calls and both calls should come              pages of the crawled sites could find a higher percentage of
      from the same URL.                                               fingerprinting.

    • The canvas image(s) read by the script should con-

                                                                         Frequency
      tain more than one color and its(their) aggregate size                         600

      should be greater than 16x16 pixels.                                           400

                                                                                     200
    • The image should not be requested in a lossy compres-
      sion format such as JPEG.                                                       0
                                                                                           10K   20K   30K   40K   50K   60K   70K   80K   90K   100K

   Checking the origin of the script for both read and write                                                  Top Alexa Rank
access helped us to remove scripts that use canvas for only            Figure 3: Frequency of canvas fingerprinting scripts on the
generating images but not reading them or vice versa. Al-              home pages of Top Alexa 100K sites.
though it is possible that two scripts from the same domain
can divide the work to circumvent our detection method, we
accepted that as a limitation.                                            The 5.5% prevalence is much higher than what other
   Enforcing a 16x16 pixel size limit allowed us to filter out         fingerprinting measurement studies had previously found
scripts that read too few pixels to efficiently extract the            (0.4% [37], 0.4%, 1.5% [6]), although these studies may not
canvas fingerprint. Although there are 28192 possible color            be directly comparable due to the differences in methodol-
combinations for a 16x16 pixel image3 , operating systems or           ogy and data collection. Also note that canvas fingerprinting
font libraries only apply anti-aliasing (which is an important         was first used by AddThis between January 15 to February
source of diversity for canvas fingerprinting) to text larger          1st, 2014, 6 which was after all the mentioned studies.
than a minimum font size.4
   The final check was to filter out cases where canvas image
                                                                                           Rank interval      % of sites with canvas
data is requested in a lossy compression format. Under a
lossy compression scheme, the returned image may lose the                                                     fingerprinting scripts
subtle differences that are essential for fingerprinting.                                  [1, 1K)            1.80
   Applying these checks, we reduced the false positive ratio                              [1K, 10K)          4.93
to zero for the 100,000 crawl, upon which we perform our                                   [10K, 100K]        5.73
primary analysis. We used static analysis to make sure the
scripts we flagged as canvas fingerprinting were also collect-         Table 2: Percentage of sites that include canvas fingerprint-
ing other high-entropy browser properties such as plugins,             ing scripts on the homepage, found in top 100K Alexa sites
navigator features and screen dimensions. It should be noted           divided in intervals of variable length. Websites in the 1 to
that in other pilot crawls (beyond 100K), we witnessed some            1K rank interval are 2.5 times less likely to embed a canvas
false positives that our conditions failed to remove. Also,            fingerprinting script than a site within 1K-10K interval.
we believe that a determined tracker may potentially cir-
cumvent our detection steps using more advanced but less
                                                                         Below rank 10,000, the prevalence of canvas fingerprint-
reliable attacks such as pixel stealing using SVG filters [45]
                                                                       ing is close to uniform. However, we found that the top
or CSS shaders [24].
                                                                       1,000 sites are 2.5 times less likely to have included canvas
             w×h
  2colordepth
3                    16×16
                , 232     = 28192 for the RGBA                         5
                                                                         We discarded some cases where the canvas fingerprinting
color space, which uses 24 bits for the colors                         script is served from a content delivery network (CDN) and
(RGB) and 8 bits for the alpha channel.        See,                    additional analysis was needed to distinguish between dif-
http://www.whatwg.org/specs/web-apps/current-                          ferent providers serving from the same (CDN) domain. In-
work/multipage/the-canvas-element.html#pixel-                          cluding these cases would only change the number of unique
manipulation                                                           sites with canvas fingerprinting to 5552 (from 5542).
4                                                                      6
  https://wiki.ubuntu.com/Fonts#Font_Smoothing                           The date was determined using http://httparchive.org/

                                                                 678
Number of
   Fingerprinting script                                                           Text drawn into the canvas
                                                            including sites                                 0        Tweet   0

   ct1.addthis.com/static/r07/core130.js                           5282            Cwm fjordbank glyphsYourvext  quiz, ὠ
                                                                                                            Browser
   i.ligatus.com/script/fingerprint.min.js                         115             http://valve.github.io Index        U+1F603 (128515)
   src.kitcode.net/fp2.js                                           68             http://valve.github.io Class        Other Symbol (So)
                                                                                                           Block       Emoticons
   admicro1.vcmedia.vn/fingerprint/figp.js                          31             http://admicro.vn/
                                                                                                        Java Escape "\ud83d\ude03"
   amazonaws.com/af-bdaz/bquery.js                                  26             Centillion        Javascript Escape "\ud83d\ude03"
   *.shorte.st/js/packed/smeadvert-intermediate-ad.js               14             http://valve.github.io
                                                                                                       Python Escape u'\U0001f603'
   stat.ringier.cz/js/fingerprint.min.js                             4             http://valve.github.io
                                                                                                      HTML Escapes 😃 😃
   cya2.net/js/STAT/89946.js                                         3                                 URL Encoded q=%F0%9F%98%83
                                                                                   ABCDEFGHIJKLMNOPQRSTUVWXYZ
                                                                                                           UTF8        f0 9f 98 83
                                                                                   abcdefghijklmnopqrstuvwxyz0123456789+/
   images.revtrax.com/RevTrax/js/fp/fp.min.jsp                         3           http://valve.github.io UTF16        d83d de03
   pof.com                                                             2           http://www.plentyoffish.com
                                                                                                     Contact Us
   *.rackcdn.com/mongoose.fp.js                                        2           http://api.gonorthleads.com
   9 others*                                                           9           (Various)
   TOTAL                                                         5559
                                                            (5542 unique1 )        -

                          Table 1: Canvas fingerprinting domains found on Top Alexa 100K sites.
              *: Some URLs are truncated or omitted for brevity. See Appendix for the complete list of URLs.
                      1: Some sites include canvas fingerprinting scripts from more than one domain.

fingerprinting scripts than the ones within the 1,000-10,000           and adds new tests to extract more entropy from the can-
range.                                                                 vas image. Specifically, we found that in addition to the
   Note that the URL http://valve.github.io, printed by                techniques outlined in Mowery and Shacham’s canvas fin-
many scripts onto the canvas, belongs to the developer of              gerprinting paper [32] AddThis scripts perform the following
an open source fingerprinting library7 . Furthermore, all              tests:
scripts except one use the same colors for the text and back-
ground shape. This similarity is possibly due to the use of                • Drawing the text twice with different colors and the
the publicly available open source fingerprinting library fin-               default fallback font by using a fake font name, starting
gerprintjs [51]. Figure 4 shows five different canvas images                 with “no-real-font-”.
used by different canvas fingerprinting scripts. The images
are generated by intercepting the canvas pixel data extracted              • Using the perfect pangram8 “Cwm fjordbank glyphs
by the scripts listed in Table 1.                                            vext quiz” as the text string
                                                                           • Checking support for drawing Unicode by printing the
                                                                             character U+1F603 a smiling face with an open mouth.
                                                                           • Checking for canvas globalCompositeOperation sup-
                                                                             port.
                                                                           • Drawing two rectangles and checking if a specific point
                                                                             is in the path by the isPointInPath method.

                                                                         By requesting a non-existent font, the first test tries to em-
                                                                       ploy the browser’s default fallback font. This may be used
                                                                       to distinguish between different browsers and operating sys-
                                                                       tems. Using a perfect pangram, which includes a single in-
Figure 4: Different images printed to canvas by fingerprint-           stance of each letter of the English alphabet, the script enu-
ing scripts. Note that the phrase “Cwm fjordbank glyphs                merates all the possible letter forms using the shortest string.
vext quiz” in the top image is a perfect pangram, that is, it          The last three tests may be trying to uncover the browser’s
contains all the letters of the English alphabet only once to          support for the canvas features that are not equally sup-
maximize diversity of the outcomes with the shortest possi-            ported. For instance, we found that the Opera browser can-
ble string.                                                            not draw the requested Unicode character, U+1F603.
                                                                         Another interesting canvas fingerprinting sample was the
                                                                       script served from the admicro.vcmedia.vn domain. By in-
  Manually analyzing AddThis’s script, we found that it                specting the source code, we found that the script checks
goes beyond the ideas previously discussed by researchers              the existence of 1126 fonts using JavaScript font probing.
7                                                                      8
  See, https://github.com/Valve/fingerprintjs/blob/                      http://en.wikipedia.org/wiki/List_of_pangrams#
v0.5.3/fingerprint.js#L250                                             Perfect_pangrams_in_English_.2826_letters.29

                                                                 679
Overall, it is interesting to see that commercial tracking            that are obfuscated or embedded in longer strings using non-
companies are advancing the fingerprinting technology be-               standard delimiters or ID strings that happen to have a high
yond the privacy/security literature. By collecting finger-             similarity. Similarly, an adversarial tracker could continually
prints from millions of users and correlating this with cookie          change an identifier or cookie sync short-lived identifiers, but
based identification, the popular third party trackers such             keep a mapping on the back end to enable long-term track-
as AddThis are in the best position to both measure how                 ing. Therefore, the results of this analysis provide a lower
identifying browser features are and develop methods for                bound on the presence of evercookie storage vectors and on
monitoring and matching changing fingerprints. Note that                the level of cookie syncing.
according to a recent ComScore report, AddThis “solutions”
reaches 97.2% of the total Internet population in the United            4.2    Flash cookies respawning HTTP cookies
States and get 103 billion monthly page views.9                            Although there are many “exotic” storage vectors that can
                                                                        be used to store tracking identifiers, Flash cookies have a
4.    EVERCOOKIES                                                       clear advantage of being shared between different browsers
    Evercookies are designed to overcome the “shortcomings”             that make use of the Adobe Flash plugin10 . We developed a
of the traditional tracking mechanisms. By utilizing multiple           procedure to automate the detection of respawning by Flash
storage vectors that are less transparent to users and may              cookies employing the method discussed in Section 4.1 to
be more difficult to clear, evercookies provide an extremely            detect IDs and using GNU/Linux’s strace [22] debugging
resilient tracking mechanism, and have been found to be                 tool to log access to Flash cookies.
used by many popular sites to circumvent deliberate user                   Compared to earlier respawning studies [43, 7, 30], the
actions [43, 7, 14]. In this section, we first provide a set            method employed in this paper is different in terms of au-
of criteria that we used to automatically detect identifier             tomation and scale. In prior studies, most of the work, in-
strings, present detailed results of an automated analysis of           cluding the matching of HTTP and Flash cookie identifiers
respawning by Flash evercookies, and show the existence of              was carried out manually. By automating the analysis and
respawning by both HTTP cookies and IndexedDB.                          parallelizing the crawls, we were able to analyze 10,000 web-
                                                                        sites, which is substantially more than the previous studies
4.1     Detecting User IDs                                              (100 sites, 700 sites). Note that, similar to [30], we only
  Given that not all instances of the various potential stor-           visited the home pages, whereas [43, 7] visited 10 internal
age vectors are used to track users, detecting evercookies              links on each website. Another methodological difference is
hinges on determining whether a given string can serve as a             that we maintained the Flash cookies when visiting different
user ID. In order to detect persistent IDs in a given storage           websites, whereas [43, 7] used a virtual machine to prevent
vector, we leveraged data from two simultaneous crawls on               contamination. Last, [30] also used the moving and contrast-
separate machines and applied the following set rule set for            ing Flash cookies from different computers to determine ID
determining which elements are identifying. We present the              and non-ID strings, which is one of the main ideas of the
rules with respect to HTTP cookies but note that they are               analysis described below.
applicable to other storage locations of a similar format.                 For this analysis we used data from four different crawls.
                                                                        First, we sequentially crawled the Alexa top 10,000 sites and
     • Eliminate cookies that expire within a month of being            saved the accumulated HTTP and Flash cookies (Crawl1 ).
       placed. These are too transient to track a user over             We then made three 10,000 site crawls, two of which were
       time.                                                            run with the Flash cookies loaded from the sequential crawl
                                                                        (Crawl2,3 ). The third crawler ran on a different machine,
     • Parse cookie value strings using common delimiters               without any data loaded from the previous crawl (Crawl4 ).
       (e.g. : and &). This extracts potentially identifying            Note that, except for the sequential crawl (Crawl1 ), we ran
       strings from non-essential data.                                 multiple browsers in parallel to extend the reach of the study
     • Eliminate parsed fields which don’t remain constant              at the cost of not keeping a profile state (cookies, localStor-
       throughout an individual crawl. Identifiers are likely           age) between visits. During each visit, we ran an strace
       to be unchanging.                                                instance that logs all open, read and write system calls of
                                                                        Firefox and all of its child processes. Trace logs were parsed
     • Compare instances of matching parsed cookie fields               to get a list of Flash cookies accessed during the visit, which
       (for cookies with the same domain and name) between              are then parsed and inserted into a crawl database.
       two unrelated crawls on different machines.                         For the analysis, we first split the Flash cookie contents
         – Eliminate fields which are not the same length.              from the three crawls (Crawl2,3,4 ) by using a common set of
                                                                        separators (e.g. ”=:&;). We then took the common strings
         – Eliminate fields which are more than 33% sim-                between crawls made with the same LSOs (Crawl2,3 ) and
           ilar according to the Ratcliff-Obershelp algo-               subtracted the strings found in LSO contents from the unre-
           rithm [12]. These are unlikely to contain sufficient         lated crawl (Crawl4 ). We then checked the cookie contents
           entropy.                                                     from the original profile (Crawl1 ) and cookies collected dur-
  The presented method provides a strict and conservative               ing the visits made with the same LSO set (Crawl2,3 ). Fi-
detection of identifiers that we believe (through manual in-            nally, we subtracted strings that are found in an unrelated
spection) to have a very low false positive rate. We antici-            visit’s cookies (Crawl4 ) to minimize the false positives. Note
pate several sources of false negatives, for example ID strings         that, in order to further eliminate false positives, one can use
9
                                                                        cookies and LSOs from other unrelated crawls since an ID-
 http://www.businesswire.com/news/home/
                                                                        10
20131113005901/en/comScore-Ranks-AddThis-1-                              iOS based devices and Chrome/Chromium bundled with
Distributed-Content-United                                              the Pepper API are exceptions

                                                                  680
string cannot be present in unrelated crawls. We used the              exactly matched the content of the Flash cookie named
100K crawl described in the canvas fingerprinting experi-              simg.sinajs.cn/stonecc_suppercookie.sol. This Flash
ments for this purpose.                                                cookie was used to respawn HTTP cookies on Chinese mi-
  For clarity, we express a simplified form of the operation           croblogging site weibo.com and its associated web portal
in set notation:                                                       sina.com.cn. To the best of our knowledge, this is the first
  M axRank
                                                                       report of IndexedDB as an evercookie vector. A more thor-
     [                                                                 ough study of respawning based on IndexedDB is left for
             ((((F2i ∩ F3i ) \ F4 ) ∩ C2i ∩ C3i ) \ C4 ),
                                                                       future study.
     i=1

where Fni denotes Flash cookies from Crawln for the site               4.3    HTTP cookies respawning Flash cookies
with the Alexa rank i, Cni denotes Cookies from Crawln                     We ran a sequential crawl of the Top 3,000 Alexa sites
for the site with the Alexa rank i and F4 , and C4 denotes all         and saved the accumulated HTTP and Flash cookies. We
Flash cookies and HTTP cookies collected during Crawl4 .               extracted IDs from this crawl’s HTTP cookies using the
   We applied the method described above to four crawls                method described in Section 4.1. We then made an addi-
run in May 2014 and found that 33 different Flash cook-                tional sequential crawl of the Top 3,000 Alexa sites on a
ies from 30 different domains respawned a total of 355                 separate machine loading only the HTTP cookies from the
cookies on 107 first party domains during the two crawls               initial crawl.
(Crawl2,3 ). Table 3 shows that on six of the top 100 sites,               Our method of detecting HTTP respawning from Flash
Flash cookies are used to respawn HTTP cookies. Nine                   cookies is as follows: (i) take the intersection of the initial
of top ten sites on which we observed respawning belong                crawl’s flash objects with the final crawl’s flash objects (ii)
to Chinese companies (one from Hong Kong) whereas the                  subtract common strings from the intersection using an un-
other site belongs to the top Russian search engine Yan-               related crawl’s flash objects and (iii) search the resulting
dex. The Flash cookie that respawned the most cook-                    strings for the first crawl’s extracted HTTP cookie IDs as
ies (69 cookies on 24 websites) was bbcookie.sol from the              described in Section 4.1. This enables us to ensure that the
bbcdn-bbnaut.ibillboard.com domain which belongs to                    IDs are indeed found in the Flash objects of both crawls,
a company that is found to use Flash based fingerprint-                aren’t common to unrelated crawls, and exist as IDs on the
ing [6]. Note that this Flash cookie respawned almost three            original machine. Using this method, we detected 11 differ-
HTTP cookies per site which belong to different third party            ent unique IDs common between the three storage locations.
domains (bbelements.com, .ibillboard.com and the first-                    These 11 IDs correspond to 14 first-party domains, a
party domain). The domain with the second highest number               summary of which is provided by Table 8 in the Ap-
of respawns was kiks.yandex.ru which restored 11 cookies               pendix. We primarily observe respawning from JavaScript
on 11 sites in each crawl (Crawl2,3 ).                                 originating from two third-parties: www.iovation.com, a
                                                                       fraud detection company that is specialized in device fin-
 Global                                 Respawning          1st/3rd    gerprinting, and www.postaffiliatepro.com, creators of af-
             Site              CC
  rank                                (Flash) domain         Party     filiate tracking software (that runs in the first-party con-
 16          sina.com.cn       CN     simg.sinajs.cn        3rd*       text). We also observe three instances of what appears to
 17          yandex.ru         RU     kiks.yandex.ru        1st        be in-house respawning scripts from three brands: Twitch
 27          weibo.com         CN     simg.sinajs.cn        3rd*       Interactive (twitch.tv and justin.tv), casino.com, and
 41          hao123.com        CN     ar.hao123.com         1st        xlovecam.com.
 52          sohu.com          CN     tv.sohu.com           1st
 64          ifeng.com         HK     y3.ifengimg.com       3rd*       5.    COOKIE SYNCING
 69          youku.com         CN     irs01.net             3rd
 178         56.com            CN     irs01.net             3rd           Cookie synchronization — the practice of third-party do-
 196         letv.com          CN     irs01.net             3rd        mains sharing pseudonymous user IDs typically stored in
 197         tudou.com         CN     irs01.net             3rd        cookies — provides the potential for more effective tracking,
                                                                       especially when coupled with technologies such as evercook-
                                                                       ies. First, pairs of domains who both know the same IDs
Table 3: Top-ranked websites found to include respawning               via synchronization can use these IDs to merge their track-
based on Flash cookies. CC: ISO 3166-1 code of the coun-               ing databases on the back end. Second, respawned cookies
try where the website is based. 3rd*: The domains that                 may contain IDs that are widely shared due to prior sync
are different from the first-party but registered for the same         events, enabling trackers to link a user’s browsing histories
company in the WHOIS database.                                         from before and after clearing browsing state.
                                                                          In this section, we present our method for detecting syncs,
  IndexedDB as Evercookie While running crawls for                     present an overview of the synchronization landscape and ex-
canvas fingerprinting experiments, we looked for sites that            amine the threats of back-end database merges and history-
store data in the IndexedDB storage vector. Specifically,              linking for users who clear state.
we checked the storage/persistent directory of the Fire-
fox profile. A very small number of sites, only 20 out of              5.1    Detecting cookie synchronization
100K, were found to use the IndexedDB storage vector.                     Using the techniques outlined in Section 4.1, we identified
Analyzing the IndexedDB file from the respawning crawl                 cookies containing values likely to be user IDs. In order to
(Crawl2) described above, we found that a script from the              learn which domains know a given ID through synchroniza-
weibo.com domain stored an item in the IndexedDB that                  tion, we examined cookie value strings and HTTP traffic.

                                                                 681
If a domain owns a cookie containing an ID, clearly the                                             Third party cookie policy
                                                                                    Statistic            Allow        Block
domain knows that ID. In fact, a telltale sign of cookie sync-
ing is multiple domains owning cookies containing the same               #    IDs                         1308          938
ID. Likewise, if an ID appears anywhere in a domain’s URL                #    ID cookies                  1482          953
                                                                         #    IDs in sync                  435          347
string (e.g. in the URL parameters), then that domain also
                                                                         #    ID cookies in sync           596          353
knows the ID. Note that a given tracker may simply ignore
                                                                         #    (First*) Parties in sync (407) 730     (321) 450
an ID received during a sync, but as we will demonstrate in              #    IDs known per party      1/2.0/1/33   1/1.8/1/36
Section 5.3, trackers opting to store IDs have the ability to            #    Parties knowing an ID 2/3.4/2/43      2/2.3/2/22
gain user data through history merging.
   The domains involved in HTTP traffic can be divided into             Table 4: Comparison of high-level cookie syncing statistics
(referrer, requested URL, location) tuples in which the loca-           when allowing and disallowing third-party cookies (top 3,000
tion domain is non-empty only for HTTP response redirects.              Alexa domains). The format of the bottom two rows is
The rules for ID passing are as follows:                                minimum/mean/median/maximum. *Here we define a first-
                                                                        party as a site which was visited in the first-party context
      • If an ID appears in a requested URL, the requested
                                                                        at any point in the crawl.
        domain learns the ID.
      • If an ID appears in the referrer URL, the requested
        domain and location domain (if it exists) learn the ID.            Appendix B shows a summary of the top 10 parties in-
                                                                        volved in cookie synchronization under both cookie policies.
      • If an ID appears in the location URL, the requested             Observe that although some parties are involved in less sync-
        domain learns the ID.                                           ing under the stricter cookie policy, many of the top parties
                                                                        receive the same number of IDs. Overall, disabling third-
   We cannot assume that the referrer learns a synced ID                party cookies reduces the number of synced IDs and parties
appearing in the requested URL or location URL string [38].             involved in syncing by nearly a factor of two. While this
In particular, third-party JavaScript executing a sync on a             reduction appears promising from a privacy standpoint, in
first-party site will cause the first-party to show up as the           the next section we will see that even with this much sparser
referrer, even though it may not even be aware of the ID                amount of data, database merges could enable domains to
sync. Although we can determine the directionality of ID                reconstruct a large portion of a user’s browsing history.
syncs in the cases of redirects, the fraction of flows in which            Included in Appendix C is a summary of the top 10 most
we could determine both the sender and receiver was small.              shared IDs under both cookie policies. For a specific exam-
Hence, when examining cookie synchronization, we focused                ple, consider the most shared ID which all third party cook-
on which domains knew a given ID, rather than attempting                ies are allowed, which was originally created by turn.com.
to reconstruct the paths of ID flows.                                   This ID is created and placed in a cookie on the first page
5.2      Basic results                                                  visit that includes Turn as a third-party. On the next page
                                                                        visit, Turn makes GET requests to 25 unique hostnames
   Before examining the privacy threats that can stem from              with a referrer of the form http://cdn.turn.com/server/
cookie synchronization, we first provide an overview of                 ddc.htm?uid=... that contains its ID. These
cookie syncing activities that occur when browsing under                25 parties gain knowledge of Turn’s ID, as well as their own
different privacy settings. We ran multiple crawls of the               tracking cookies, in the process. Similar sharing occurs as
top 3,000 Alexa domains on Amazon EC211 instances using                 the user continues to browse, eventually leading to 43 total
three different Firefox privacy settings: allowing all cookies          domains. With third-party cookies disabled, the top shared
(i.e. no privacy-protective measures), allowing all cookies             IDs come from a disjoint set of parties, largely composed
but enabling Do Not Track, and blocking third-party cook-               of syncs which share a first party cookie with several third-
ies. With all cookies allowed, the impact of Do Not Track on            party sites.
the aggregate statistics we measure was negligible. In par-
ticular, enabling Do Not Track only reduced the number of               5.3    Back-end database synchronization
domains involved in synchronization by 2.9% and the num-
                                                                           We now turn to quantifying how much trackers can learn
ber of IDs being synced by 2.6%. This finding is consistent
                                                                        about users’ browsing histories by merging databases on the
with studies such as Balebako et al. [9] — they find that, due
                                                                        back-end based on synced IDs. Cookie syncing allows track-
to lack of industry enforcement, Do Not Track provides lit-
                                                                        ers to associate a given user both with their own pseudony-
tle practical protection against trackers. We therefore omit
                                                                        mous ID and with IDs received through syncs, facilitating
further measurement and analysis of the effect of Do Not
                                                                        later back-end merges. We cannot observe these merges di-
Track in this section.
                                                                        rectly, so we do not know if such merges occur with any
   Table 4 shows high-level statistics for illustrative crawls
                                                                        frequency. That said, there is a natural incentive in the
under the two third-party cookie settings. We say that an
                                                                        tracking ecosystem to aggregate data in order to learn a
ID is involved in synchronization if it is known by at least
                                                                        much larger fraction of a user’s history.
two domains. Cookies and domains are involved in synchro-
                                                                           First, assuming no collaboration among third-party track-
nization if they contain or know such an ID, respectively.
                                                                        ers, only a handful of trackers are in position to track a
The statistics displayed aggregate both third-party and
                                                                        sizeable fraction of an individual’s browsing history. As per
first-party data, as many domains (e.g. doubleclick.com,
                                                                        Olejnik et al [38], if a visited first party appears as the re-
facebook.com) exist in both the Alexa Top 3000 and as
                                                                        ferrer in a request to another domain, we assume the second
third-parties on other sites.
                                                                        domain knows about this visit. For a crawl of 3,000 sites
11
     http://aws.amazon.com/ec2/                                         when allowing all cookies, only two of the 730 trackers could

                                                                  682
With third party cookies               5.4    Respawning and syncing

                    0.8
                                                              No Merge              At a given point in time, cookie synchronization pro-
                                                              With Merge
      Proportion of history known                                                vides a mechanism for trackers to link a user’s history to-
                              0.4
                                                                                 gether. Represented as a graph, sites in an individual’s his-
                                                                                 tory can be represented as nodes with edges between sites
                                                                                 if a user tagged with some pseudonymous ID visited both
                       0.0

                                                                                 sites. When a user clears his cookies and restarts browsing,
                                    0      200       400       600               the third parties will place and sync a new set of IDs and
                                                                                 eventually reconstruct a new history graph.
                                                                                    Since these history graphs correspond to browsing periods
                                         Without third party cookies
                                                                                 with completely different tracking IDs, they will be disjoint
           0.8

                                                              No Merge           — in other words, trackers can not associate the individual’s
                                                             With Merge
                                                                                 history before and after clearing cookies. However, if one of
                                                         With Merge
      0.4

                                                                                 the trackers respawns a particular cookie, parts of the two
                                                                                 history graphs can be connected by an edge, thereby linking
                                                                                 an individual’s history over time. This inference becomes
                   0.0

                                                                                 stronger if this respawned ID is synced to a party present
                                    0      200       400       600               on a large number of the sites that a user visits.
                                                                                    To test this possibility, we ran two 3,000 site crawls on two
                                    Rank of domain (decreasing order)            EC2 instances, A and B. We cleared the cookies, Flash stor-
Figure 5: Proportions of user history known when allow-                          age, cache, and local storage from machine B and loaded the
ing and blocking third party cookies under the two different                     Flash files from A to seed respawning from Flash. Finally,
merging schemes. Note that since the x-axis is sorted by the                     we ran another 3,000 site crawl on site B.
proportion of a user’s history that a domain can recover,                           We discovered a total of 26 domains that respawned IDs
the domains may appear in different orders for the different                     between the two crawls on machine B either through Flash
models.                                                                          or through other means12 . Three of these IDs were later
                                                                                 observed in sync flows. After conducting manual analysis,
                                                                                 we were unable to determine the exact mechanism through
recover more than 40% of a user’s history and only 11 could                      which 18 of these IDs were respawned since we cleared all
recover more than 10%. When disabling third-party cook-                          the storage vectors previously discussed, nor did we detect
ies, the corresponding numbers are two and six, respectively.                    JavaScript-based browser fingerprinting. We conjecture that
These results are consistent with earlier findings in Roesner                    these IDs were respawned through some form of passive,
et al [41].                                                                      server-side fingerprinting13 .
   We consider the following model of back-end database                             One of these IDs provides a useful case study. After
merges: a domain can merge its records with a single other                       respawning this ID, its owner, merchenta.com, passed it
domain that mutually knows some ID. We assume that when                          to adnxs.com through an HTTP redirect sync call. Now,
two domains merge their records for a particular user, they                      merchenta.com by itself is not in a position to observe a
will share their full records. Our model assumes some col-                       large fraction of a user’s history — it only appears on a sin-
laboration within the tracking ecosystem — among domains                         gle first party domain (casino.com). In fact, the largest ob-
already known to share IDs — but is much weaker than as-                         served percentage of a user’s history observable by a cookie-
suming full cooperation.                                                         respawning domain acting alone was 1.4%. However, by
   Figure 5 shows the proportion of a user’s 3,000-site brows-                   passing its ID to adnxs.com, merchenta.com enabled a much
ing history a domain can recover, in decreasing sorted order,                    larger proportion of a user’s history to be linked across state
if a user enables all cookies. The figure when blocking third-                   clears.
party cookies (also Figure 5) takes a identical shape but is                        In particular, we observed adnxs.com on approximately
steeper because it only includes roughly 60% as many par-                        11% of first party sites across the two crawls. Thus adnxs.
ties.                                                                            com now has the ability to merge its records for a particular
   Observe that after introducing the ability for a site to                      user before and after an attempt to clear cookies, although of
merge records directly with one other tracker, the known                         course we have no insight into whether or not they actually
proportion of a user’s 3,000-site history dramatically in-                       do so. This scenario enables at least 11% of a user’s history
creased for a large number of sites. When third-party cook-                      to be tracked over time.
ies are allowed, 101 domains can reconstruct over 50% of a                          Our measurements in this section illustrate the potential
user’s history and 161 could recover over 40%. Even when                         for cookie respawning and syncing event on a single site by a
these cookies are blocked, 44 domains could recover over
40% of a user’s history.                                                         12
                                                                                    The exact method here is not important, as we are con-
   Not much is known about how prevalent back-end                                 cerned with the fact that an ID which has been respawned
database merges are. In terms of incentives, a pair of track-                     is later involved in sync.
                                                                                 13
ers may enter into a mutually beneficial arrangement to in-                         Note that a document from one of these respawning do-
crease their respective coverage of users’ browsing histories,                    mains, merchenta.com mentions tracking by fingerprint-
or a large tracker may act as a data broker and sell user                         ing: “Merchenta’s unique fingerprint tracking enables con-
                                                                                  sumers to be engaged playfully, over an extended period of
histories for a fee.                                                              time, long after solely cookie-based tracking loses its effec-
                                                                                  tiveness”, http://www.merchenta.com/wp-content/files/
                                                                                  Merchenta%20Case%20Study%20-%20Virgin.pdf.

                                                                           683
You can also read