Domain name encryption is not enough: privacy leakage via IP-based website fingerprinting

Page created by Cindy Lang

IT & Technique

English

Like
Share
Embed
Fullscreen
Slides
Download HTML
Download PDF
Abuse

←

→

Page content transcription

If your browser does not render page correctly, please read the page content below

Proceedings on Privacy Enhancing Technologies ; 2021 (4):420–440

Nguyen Phong Hoang, Arian Akhavan Niaki, Phillipa Gill, and Michalis Polychronakis

Domain name encryption is not enough: privacy
leakage via IP-based website fingerprinting
Abstract: Although the security benefits of domain
name encryption technologies such as DNS over TLS
                                                                  1 Introduction
(DoT), DNS over HTTPS (DoH), and Encrypted Client
                                                                  Due to the increase of Internet surveillance in recent
Hello (ECH) are clear, their positive impact on user
                                                                  years [13, 31], users have become more concerned about
privacy is weakened by—the still exposed—IP address
                                                                  their online activities being monitored, leading to the
information. However, content delivery networks, DNS-
                                                                  development of privacy-enhancing technologies. While
based load balancing, co-hosting of different websites on
                                                                  various mechanisms can be used depending on the de-
the same server, and IP address churn, all contribute to-
                                                                  sired level of privacy [48], encryption is often an in-
wards making domain–IP mappings unstable, and pre-
                                                                  dispensable component of most privacy-enhancing tech-
vent straightforward IP-based browsing tracking.
                                                                  nologies. This has led to increasing amounts of Internet
In this paper, we show that this instability is not a
                                                                  traffic being encrypted [3].
roadblock (assuming a universal DoT/DoH and ECH
                                                                       Having a dominant role on the Internet, the web
deployment), by introducing an IP-based website finger-
                                                                  ecosystem thus has witnessed a drastic growth in
printing technique that allows a network-level observer
                                                                  HTTP traffic being transferred over TLS [28]. Although
to identify at scale the website a user visits. Our tech-
                                                                  HTTPS significantly improves the confidentiality of
nique exploits the complex structure of most websites,
                                                                  web traffic, it cannot fully protect user privacy on its
which load resources from several domains besides their
                                                                  own when it comes to preventing a user’s visited web-
primary one. Using the generated fingerprints of more
                                                                  sites from being monitored by a network-level observer.
than 200K websites studied, we could successfully iden-
                                                                  Specifically, under current web browsing standards, the
tify 84% of them when observing solely destination IP
                                                                  domain name information of a visited website can still
addresses. The accuracy rate increases to 92% for pop-
                                                                  be observed through DNS queries/responses, as well as
ular websites, and 95% for popular and sensitive web-
                                                                  the Server Name Indication (SNI) field of the TLS hand-
sites. We also evaluated the robustness of the gener-
                                                                  shake packets. To address this problem, several domain
ated fingerprints over time, and demonstrate that they
                                                                  name encryption technologies have been proposed re-
are still effective at successfully identifying about 70%
                                                                  cently to prevent the exposure of domain names, in-
of the tested websites after two months. We conclude
                                                                  cluding DNS over TLS (DoT) [51], DNS over HTTPS
by discussing strategies for website owners and host-
                                                                  (DoH) [49], and Encrypted Client Hello (ECH) [88].
ing providers towards hindering IP-based website fin-
                                                                       Assuming an idealistic future in which all network
gerprinting and maximizing the privacy benefits offered
                                                                  traffic is encrypted and domain name information is
by DoT/DoH and ECH.
                                                                  never exposed on the wire as plaintext, packet metadata
Keywords: Domain Name Encryption, DoT, DoH, En-                   (e.g., time, size) and destination IP addresses are the
crypted Client Hello, Website Fingerprinting                      only remaining information related to a visited website
DOI 10.2478/popets-2021-0078
                                                                  that can be seen by a network-level observer. As a result,
Received 2021-02-28; revised 2021-06-15; accepted 2021-06-16.     tracking a user’s browsing history requires the observer
                                                                  to infer which website is hosted on a given destination IP
                                                                  address. This task is straightforward when an IP address
                                                                  hosts only one domain, but becomes more challenging
Nguyen Phong Hoang: Stony Brook University, E-mail:               when an IP address hosts multiple domains. Indeed, re-
nghoang@cs.stonybrook.edu
                                                                  cent studies have shown an increasing trend of websites
Arian Akhavan Niaki: University of Massachusetts -
Amherst, E-mail: arian@cs.umass.edu                               being co-located on the same hosting server(s) [47, 95].
Phillipa Gill: University of Massachusetts - Amherst, E-mail:     Domains are also often hosted on multiple IP addresses,
phillipa@cs.umass.edu                                             while the dynamics of domain–IP mappings may also
Michalis Polychronakis: Stony Brook University, E-mail:           change over time due to network configuration changes
mikepo@cs.stonybrook.edu
                                                                  or DNS-based load balancing.

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting     421

     Given these uncertainties in reliably mapping do-        nections per fingerprint. Furthermore, our attack ex-
mains to their IPs, we investigate the extent to which        ploits the fact that websites often load many external re-
accurate browsing tracking can still be performed by          sources, including third-party analytics scripts, images,
network-level adversaries based solely on destination         and advertisements, making their fingerprints more dis-
IPs. In this work, we introduce a lightweight website fin-    tinguishable. We thus also investigated whether the re-
gerprinting (WF) technique that allows a network-level        moval of these resources due to browser caching or ad
observer to identify with high accuracy the websites a        blocking could help to make websites less prone to IP-
user visits based solely on IP address information, en-       based fingerprinting (§8).
abling network-level browsing tracking at scale [74]. For          By analyzing the HTTP response header of the web-
instance, an adversary can use—already collected by ex-       sites studied, we find that 86.1% of web resources are
isting routers—IPFIX (Internet Protocol Flow Informa-         cacheable, causing fewer network connections to be ob-
tion Export) [101] or NetFlow [2] records to easily ob-       servable by the adversary if these resources are loaded
tain the destination IP addresses contacted by certain        from the browser’s cache (§8.1). Moreover, using the
users, and track their browsing history.                      Brave browser to crawl the same set of websites, we
     For our attack, we first crawl a set of 220K web-        found that the removal of third-party analytics scripts
sites, comprising popular and sensitive websites selected     and advertisements can impact the order in which web
from two website ranking lists (§5). After visiting each      resources are loaded (§8.2), significantly reducing the ac-
website, we extract the queried domains to construct a        curacy of the enhanced fingerprints from 91% to 76%.
domain-based fingerprint. The corresponding IP-based          Nonetheless, employing the initially proposed WF tech-
fingerprint is then obtained by continuously resolving        nique in which the critical rendering path [38] is not
the domains into their IPs via active DNS measurement         taken into account, we could still fingerprint 80% of the
(§4.1). By matching these IPs from the generated fin-         websites even when browser caching and ad blocking are
gerprints to the IP sequence observed from the network        considered.
traffic when browsing the targeted websites, we could              Regardless of the high degree of website co-location
successfully fingerprint 84% of them (§6.3). The suc-         and the dynamics of domain–IP mappings, our findings
cessful identification rate increases to 92% for popular      show that domain name encryption alone is not enough
websites, and 95% for popular and sensitive websites.         to protect user privacy when it comes to IP-based WF.
     To further enhance the discriminatory capacity           As a step towards mitigating this situation, we discuss
of the fingerprints, we consider the critical rendering       potential strategies for both website owners and hosting
path [38] to capture the approximate ordering structure       providers towards hindering IP-based WF and maximiz-
of the domains that are contacted at different stages         ing the privacy benefits offered by domain name encryp-
while a website is being rendered in the browser (§4.2).      tion. To the extent possible, website owners who wish to
Our results show that the enhanced fingerprints could         make IP-based website fingerprinting harder should try
allow for 91% of the tested websites to be successfully       to (1) minimize the number of references to resources
identified based solely on their destination IPs(§6.4).       that are not served by the primary domain of a website,
     Given the high variability of website content and        and (2) refrain from hosting their websites on static IPs
domain–IP mappings across time, we expect that once           that do not serve any other websites. Hosting providers
generated, a fingerprint’s quality will deteriorate quickly   can also help by (1) increasing the number of co-located
over time. To assess the aging behavior of the finger-        websites per hosting IP, and (2) frequently changing the
prints, we conducted a longitudinal study over a period       mapping between domain names and their hosting IPs,
of two months. As expected, fingerprints become less          to further obscure domain–IP mappings, thus hindering
accurate over time, but surprisingly, after two months,       IP-based WF attacks.
they are still effective at successfully identifying about
70% of the tested websites (§7).
                                                              2 Background and Motivation
     As our WF technique is based on the observation
of the IPs of network connections that fetch HTTP re-         In this section, we review some background information
sources, it is necessary to evaluate the impact of HTTP       on domain name encryption technologies and discuss the
caching on the accuracy of the fingerprints. This is be-      motivation behind our study. In particular, we highlight
cause cached resources can be loaded directly from the        how our IP-based fingerprinting attack is different from
browser’s cache when visiting the same website for a          prior works, allowing network-level adversaries to effec-
second time, resulting in the observation of fewer con-       tively mount the attack at scale.

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting     422

2.1 Domain Name Encryption                                    cannot provide any meaningful privacy benefit without
                                                              the use of DoT/DoH, and vice versa. Mozilla has sup-
In today’s web browsing environment, there are two
                                                              ported ECH since Firefox 85 [52].
channels through which domain name information is
exposed on the wire: plaintext DNS requests/responses,
and the Server Name Indication (SNI) extension of TLS.
                                                              2.2 Website Fingerprinting
     The plaintext nature of DNS not only jeopardizes
user privacy, but also allows network-level entities to in-   Website fingerprinting (WF) is a type of traffic analy-
terfere with user connections. For example, an on-path        sis attack, employed to construct fingerprints for a set
attacker can inject forged DNS responses to redirect a        of websites based on the traffic pattern observed while
targeted user to malicious hosts [27]. The domain name        browsing them. Depending on which metadata is visible
information exposed via DNS packets and the SNI field         from the encrypted traffic, different WF techniques can
has also been intensively exploited by state-level net-       be used to determine whether a user under surveillance
work operators for censorship purposes [14, 46, 75, 78,       visited any of the monitored websites.
82]. To cope with these security and privacy problems,             Numerous WF attacks targeting anonymized or
several solutions have been recently proposed to safe-        obfuscated communication channels have been pro-
guard domain name information on the wire, including          posed [40, 58, 73, 79, 85, 98, 107, 108], in which
DoT [51], DoH [49], and ECH [88].                             the actual destination IP address is hidden by means
     By encrypting DNS traffic, DoT/DoH preserves the         of privacy-enhancing network relays [32, 48], such as
integrity and confidentiality of DNS resolutions. Sev-        Tor [26] or the Invisible Internet Project (I2P) [43, 113].
eral companies (e.g., Google [35], Cloudflare [4]) of-        However, WF attacks on standard encrypted web traf-
fer free DoT/DoH services to the public, while popu-          fic (i.e., HTTPS), in which no privacy-enhancing net-
lar web browsers including Chrome and Firefox already         work relays are employed, have not been comprehen-
support DoH [35, 67], with the latter enabling DoH            sively investigated, especially at the IP-address level.
(through Cloudflare) by default for users in the US since     This is because the domain name information previ-
2019. However, the design choice of these vendors to          ously available in several plaintext protocols (e.g., DNS,
centralize all DNS resolutions to one trusted resolver        the SNI extension of TLS, and OCSP queries [90]) can
has raised several privacy concerns. As a result, more        be easily obtained from the network traffic (§2.1). This
privacy-centric DNS resolution schemes have also been         information alone can already be used for a straight-
proposed, including DoH over Tor [71, 91], Oblivious          forward inference of applications or web services being
DNS [92, 97], and distributed DoH resolution [44], to         visited [42, 102]. However, in an idealistic future where
not only conceal DNS packets from on-path observers,          domain name encryption (i.e., DoT/DoH and ECH) is
but also to deal with “nosy” recursors.                       fully deployed, visibility to any information above the IP
     SNI has been incorporated into the TLS protocol          layer will be lost. Under these conditions, and given the
since 2003 [17] as a workaround for name-based virtual        high degree of web co-location [47], our ultimate goal
hosting servers to co-host many websites that support         is to investigate the extent to which websites can still
HTTPS. During the TLS handshake, the client includes          be fingerprinted at scale, based solely on the IP address
the domain name of the intended website in the SNI            information of the servers being contacted.
field in order for the server to respond with the cor-             Given the numerous WF methods introduced in the
responding TLS certificate of that domain name. Until         past, one may wonder why do we even need another
TLS 1.2, this step takes place before the actual encryp-      website fingerprinting method? In addition to the afore-
tion begins, leaving the SNI field transmitted in plain-      mentioned reasons and pitfalls of previous WF tech-
text, and exposing users to similar security and privacy      niques [53], the rationale behind our fingerprinting tech-
risks as discussed above. TLS 1.3 provides an option          nique based solely on IP-level information stems from
to encrypt the SNI field, concealing the domain name          the increasing deployment of domain name encryption
information [87]. Since March 2020, ESNI has been re-         technologies [25, 61]. Currently, most web traffic does
worked into the ECH extension [88]. In order for ECH          expose domain information, as domain name encryption
to function, a symmetric encryption key derived from          has not been fully deployed [103], and thus network-level
the server’s public key has to be obtained in advance.        browsing tracking at scale through DNS or SNI is much
This public key can be obtained via an HTTPS resource         easier. However, once this massive-scale monitoring ca-
record lookup. Thus, it is important to note that ECH         pability is gone due to DoT/DoH and ECH, the next

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting   423

best option for ISPs to continue tracking at a similar        sual browsing behavior (except, perhaps, the rare event
scale will be to rely on IP addresses, which are already      of a browser restart with many previously open tabs).
collected as part of IPFIX (Internet Protocol Flow In-        Although some users may visit more than one website
formation Export) [101] or NetFlow [2] records by ex-         at a time, they mostly interact with one tab at a time.
isting routers. Although more elaborate fingerprinting        There is also a time gap when changing from one tab to
schemes can certainly be conceived for HTTPS traffic,         another to open or reload a different website, especially
these will require a significant deployment effort and        for users with a single screen. All these together allow
cost [74], both for constructing and maintaining the fin-     an observer to distinguish between individual website
gerprints, as well as for matching them.                      visits, as also evident by existing techniques that can
                                                              be employed to split a network trace of such multi-tab
                                                              activity into individual traces [23, 24, 111]. Moreover,
3 Threat Model
                                                              although many individual users may be located behind
Internet service providers (ISPs) have been increasingly      the same NAT network, Verde et al. [105] have devel-
harvesting user traffic for monetization purposes, such       oped a framework that can identify different individ-
as targeted advertising [18, 34, 55]. Our threat model        uals behind a large metropolitan WiFi network based
considers the real-world scenario in which a local adver-     on NetFlow records. To keep our study simple, we thus
sary (e.g., an ISP) passively monitors users’ traffic and     assume that the adversary already employs the afore-
attempts to determine whether a particular website was        mentioned techniques to obtain the network trace of
visited. The adversary carries out the following steps to     different individuals before conducting our WF attack.
create website fingerprints.
     First, the adversary visits a set of websites and        4 Fingerprint Construction
records all domain names that are contacted to fetch
their resources. A domain-based fingerprint for each          Next, we explain how we construct IP-based fingerprints
website is then built from this set of contacted domains.     in more detail, from creating the initial domain-based
After that, these domains are periodically resolved to        fingerprints to deriving the final IP-based fingerprints.
their hosting IPs, which are used to construct IP-based       At a conceptual level, we first explore the straightfor-
fingerprints. Depending on the relationship between a         ward approach of resolving all domains loaded while vis-
domain name and its hosting IP(s), a connection to a          iting a website to their hosting IPs, from which we cre-
unique IP can be used to easily reveal which website is       ate a set of unique IPs that can potentially be used as
being visited if the IP only hosts that particular domain     the IP-based fingerprint for that website. We then take
name. Finally, to conduct the WF attack, the adversary        the critical rendering path [38] into account to improve
tries to match the sequence of IP addresses found in          the fidelity of the fingerprints, by considering the ap-
the network trace of the monitored user with the IP-          proximate order in which domains are loaded while the
based fingerprints constructed in the previous step to        website is being rendered on the screen.
infer which website was visited.
     The effectiveness of our attack depends on two pri-      4.1 Basic IP-based Fingerprint
mary factors, namely, the uniqueness (§6) and stability
                                                              Assuming an idealistic web browsing scenario in which
(§7) of the fingerprints. It is worth emphasizing that
                                                              domain name information can no longer be extracted
our model does not assume fingerprinting of obfuscated
                                                              from network traffic due to the full deployment of do-
network traffic, in which the IP address information is
                                                              main name encryption, the only remaining information
already hidden by means of privacy-enhancing technolo-
                                                              visible to the adversary is packet metadata (e.g., time,
gies. This class of attacks, whose goal is to use sophisti-
                                                              size) and sequences of connections to remote IP ad-
cated traffic analysis methods to fingerprint anonymized
                                                              dresses of contacted web servers. Under these condi-
network traffic, has been extensively investigated by
                                                              tions, the adversary would need to fingerprint targeted
prior studies [20, 40, 58, 79, 98]. Our threat model re-
                                                              websites based primarily on this information. As intro-
quires only minimal information collected from the net-
                                                              duced in our threat model, the adversary first visits the
work traffic, i.e., destination IPs.
                                                              targeted websites and records all domains that are con-
     We consider a browsing scenario in which one web-
                                                              tacted while browsing each website. Each domain can
site is visited at a time, which is particularly valid
                                                              then be resolved to its hosting IP address(es). As a re-
when it comes to ordinary web users on devices with
                                                              sult, the mapping between domains and hosting IP ad-
smaller screens, and is also most often the case of ca-

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting            424

dress(es) is the basic unit on which the adversary relies                   4.2 Enhanced IP-based Fingerprint with
to construct IP-based fingerprints.                                             Connection Bucketing
     When browsing a website, the browser first contacts
                                                                            When a webpage is visited, besides the initial connec-
the web server to fetch the initial resource—usually
                                                                            tion to the primary domain, multiple requests may then
an HTML document. It then parses the HTML docu-
                                                                            be issued in parallel to fetch other resources referenced
ment and subsequently fetches other web resources ref-
                                                                            in the initial HTML document. Once fetched, these re-
erenced. Based on this underlying mechanism of fetch-
                                                                            sources may sometimes trigger even more requests for
ing a webpage, the adversary constructs domain-based
                                                                            other sub-resources (e.g., JavaScript). The absolute or-
fingerprints as follows. For a given website, its domain-
                                                                            der of these requests on the wire can change from time to
based fingerprint consists of two parts: the primary do-
                                                                            time, depending on many uncertain factors, such as the
main, denoted as dp , and a set of secondary domains,
                                                                            performance of the upstream network provider and the
denoted as ds . The fingerprint then can be represented
                                                                            underlying operating system. For that reason, we did
as: dp + {ds1 , ds2 , ..., dsn }. The primary domain is the
                                                                            not consider the order in which domains are contacted
domain of the URL shown in the browser’s address bar,
                                                                            when constructing the domain-based fingerprints, and
and typically corresponds to the server to which the
                                                                            thus we gather all secondary domains into one bucket,
first connection is made for fetching the initial HTML
                                                                            as described in §4.1. However, when viewing all these
document of the visited webpage. Secondary domains
                                                                            requests as a whole, there still exists a high-level order-
may be different from the primary one and are used for
                                                                            ing relationship that we can capture when considering
hosting other resources needed to load the webpage.
                                                                            the critical rendering path [38]. Specifically, there are
     From the domain-based fingerprints constructed
                                                                            certain render-blocking and critical objects that always
above, the adversary can then obtain their correspond-
                                                                            need to be loaded prior to some other objects.
ing IP-based fingerprints by repeatedly resolving the do-
                                                                                 The chain of events from fetching an initial HTML
main names into their hosting IP(s). Given that a do-
                                                                            file to rendering the website on screen is referred to as
main name may be resolved to more than one IP, each
                                                                            the critical rendering path [38]. When visiting a web-
domain in a domain-based fingerprint is converted to a
                                                                            site, the browser first contacts the primary domain to
set of IP(s) with at least one IP in it. As domain-based
                                                                            fetch the initial HTML file (e.g., index.html). Next,
fingerprints are comprised of two parts, the inherent
                                                                            this file is parsed to construct the DOM (Document
structure of IP-based fingerprints also consists of two
                                                                            Object Model) tree. The browser then fetches several
parts. The first part contains the IP(s) of the primary
                                                                            web resources from remote destinations to render the
domain name, while the second part is a set of sets of
                                                                            webpage. Depending on the complexity of the webpage,
IPs obtained by resolving the secondary domains. The
                                                                            these resources may include HTML, JavaScript, CSS,
fingerprint then can be represented as:
                                                                            image files, and third-party resources, which may in turn
 {dp ip1 , dp ip2 , ..., dp ipn } + {{ds1 ip1 , ds1 ip2 , ..., ds1 ipn },   load more sub-resources hosted on other third-party do-
   {ds2 ip1 , ds2 ip2 , ..., ds2 ipn }, ..., {dsn ip1 , ..., dsn ipn }}     mains [76]. According to the Internet Archive, a typi-
                                                                            cal website loads an average of 70 web resources as of
     To simplify the construction and matching of finger-                   this writing [7]. When considering this critical rendering
prints, we reduce the above fingerprint by considering                      path, there are three important events that we can use
the union of the sets of IP addresses of all secondary do-                  to cluster connections into three “buckets:” domLoad-
mains (second part of the above fingerprint). The sim-                      ing, domContentLoaded, and domComplete.
plified version of the IP-based fingerprint can thus be                     domLoading is triggered when the browser has re-
represented by just two sets of IP addresses as:                            ceived the initial HTML file and parses it to construct
  {dp ip1 , dp ip2 , ..., dp ipn } + {ds1 ip1 , ds1 ip2 , ..., ds1 ipn ,    the DOM tree. As a result, multiple parallel connections
                                                                            to fetch critical resources referenced by the DOM tree
      ds2 ip1 , ds2 ip2 , ..., ds2 ipn , ..., dsn ip1 , ..., dsn ipn }
                                                                            are initiated right after this event is fired.
    Although it might seem that this simplification dis-                    domContentLoaded is triggered when both the
cards some part of the structural information of the                        DOM and CSSOM (CSS Object Model) are ready [38],
page, we found no significant difference in accuracy                        signaling the browser to create the render tree. The
when evaluating both fingerprint formats. Therefore, we                     event is typically fired without waiting for style sheets,
opt to use the latter fingerprint structure, as it is simpler               images, and subframes to load [69]. After this event, sub-
and allows for faster matching.                                             sequent connections can often be observed for fetching

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting            425

elements such as non-blocking style sheets, JavaScript               on those websites that are legitimately visited in real-
files, images, and subframes.                                        world scenarios. Therefore, we choose to use domains
domComplete is triggered when the website and                        from the Tranco top-site ranking list [104], since it has
its sub-resources have been loaded. After this event                 been shown to have a good overlap with the web traffic
is fired, non-essential objects can still be downloaded              observed by the Chrome User Experience Report [84].
in the background, leaving the critical rendering path                    The research community often relies on one of
unaffected. For example, external JavaScript files are               the four top-site lists (Alexa [9], Majestic [11], Um-
known to be render-blocking and are recommended to                   brella [12], and Quantcast [5]). However, studies have
be moved to the end of the webpage or to be included                 discovered several issues with these lists that can neg-
with a defer attribute of the  tag [36]. There-              atively impact research outcomes if not handled prop-
fore, a small cluster of connections can often be observed           erly [89, 104]. To remedy the shortcomings of these lists,
after this event is triggered.                                       the Tranco list is curated to aggregate the four afore-
     Based on these observations about the critical ren-             mentioned lists, resulting in a list of more than seven
dering path, we enhance the structure of our finger-                 million domains. Even then, due to the dynamic na-
prints (both domain-based and the corresponding IP-                  ture of the web [15, 57], there are still domains that are
based ones) to comprise the primary domain and three                 unstable, not responsive, or do not serve any web con-
sets of domains corresponding to the three events afore-             tent in the Tranco list, especially in its long tail [84, 89].
mentioned. The enhanced domain-based fingerprint can                 Therefore, we select the top 100K popular domains from
then be represented as:                                              the Tranco list for our study, because any ranking un-
                                                                     der 100K is not statistically significant, as suggested by
  dp + {ds1 , ds2 , ...} + {ds2 , ds3 , ...} + {..., dsn−1 , dsn }
                                                                     both top-list providers and previous studies [9, 89].
    Note that a domain can appear in more than one                        While some websites are so common that visiting
bucket if multiple objects are fetched from that domain              them may be considered to be a very low privacy risk
at different times. Accordingly, the representation of the           (e.g., facebook.com, twitter.com, or youtube.com),
enhanced IP-based fingerprint follows the same struc-                the leakage of visits to more “sensitive” websites is def-
ture, comprising four sets of IP addresses: i) the set of            initely an important concern. This is even more so in
IP addresses of the primary domain, and ii) three sets of            oppressive regions, where browsing certain online con-
IP addresses corresponding to the three buckets above.               tent could be considered as a violation of local regu-
                                                                     lations [39, 72, 112]. To that end, we complement our
                                                                     dataset by manually choosing websites from the Alexa
5 Experiment Setup                                                   list that belong to categories deemed “sensitive,”1 such
                                                                     as LGBT, sexuality, gambling, medical, and religion. In
In this section, we provide the details of how we set
                                                                     total, our dataset consists of 220,743 domains, includ-
up and conducted our experiments for assessing the ef-
                                                                     ing the top 100K popular domains of the Tranco list2
fectiveness of IP-based website fingerprinting. In par-
                                                                     and 126,597 domains from Alexa’s sensitive categories.
ticular, we discuss the rationale behind our test list of
                                                                     Among these, there are 5,854 common domains between
websites, and the duration and location of our measure-
                                                                     the two data sources.
ment.
                                                                          Although one may consider our test list as a closed-
                                                                     world dataset, it is infeasible to repeatedly crawl the
5.1 Selection of Test Domains                                        entire Internet, which has more than 362.3M domains
From an adversarial point of view, it is desirable for an            registered at the time of our experiment [8]. It is also
attacker to be able to reveal as many websites as possi-             unlikely that a network-level adversary is interested in
ble. It is, however, impractical to crawl the entire Inter-
net, given that there are more than 362.3 million domain
names registered across all top-level domains (TLDs) as              1 It is worth noting that sensitivity can be different from site
of 2020 [8]. In addition, many of them are dormant or                to site, depending on who, when, and from where is visiting
even unwanted domains [100] that the majority of In-                 the site [42]. We intuitively choose these complementary do-
                                                                     mains based on our common sense of what is sensitive based on
ternet users will never visit. As our goal is to assess the
                                                                     those categories that are often blocked by many Internet censors
extent to which domain name encryption would prevent                 around the world [75].
the leakage of the majority of users’ browsing activi-               2 The list was created on March 3rd 2020, and is available at
ties via IP-based website fingerprinting, we opt to focus            https://tranco-list.eu/list/J2KY.

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 426

fingerprinting all websites on the Internet. However, the we have collected a total of 24 data batches during a
ability of continuously conducting our WF attack on two-month period. To stimulate future studies in this re-
more than 220K domains highlights the scalability of search domain, we make our dataset available to the re-
our method. In fact, we cover an order of magnitude search community at https://homepage.np-tokumei.
more domains compared to previous WF attacks against net/publication/publication_2021_popets.
DoT/DoH traffic [19, 50, 96], in which the largest open- Our measurement is conducted from a cluster of ma-
world setting comprised fewer than 10K domains [19]. chines located in a gigabit academic network in the US.
Due to the rapid increase in the use of content deliv-
ery networks (CDN), web content can be served from
5.2 Measurement Duration and Location
multiple servers distributed across different locations,
Prior work often overlooks the temporal aspect when depending on the origin of the request [45]. Although
constructing fingerprints. More specifically, many efforts our dataset can be considered as representative for web
are conducted in a one-off manner or over a short pe- users within our geographical area, it would have missed
riod of time, neglecting the temporal characteristics of some IP addresses of CDN-hosted websites which can
fingerprints whose websites’ content may evolve over only be observed at other locations. Nonetheless, as
time [57]. For our work, in which we focus on IP-based mentioned in §3, the adversary in our threat model is
fingerprints, the churn in domain–IP mappings is also local and also has access to the Internet from the same
another major concern [45]. Due to the variability of network location as the monitored users (e.g., the ISP
website content and hosting IPs across time, a previ- of a home or corporate user), and will not observe any
ously constructed fingerprint may not be valid after a other IPs of CDN-hosted websites either. Adversaries
certain time period. Therefore, a longitudinal measure- at different locations can always set up machines within
ment study is essential to examine the robustness of their network of interest and conduct the same exper-
fingerprints, which in turn impacts the efficacy of their iment with ours to construct a dataset of fingerprints
use in WF attacks. that matches those websites browsed by users within
Over a period of 60 days (from March 5th to May the network of their control.
3rd, 2020), we repeatedly crawled the 220K websites
from our test list curated in §5.1, using the Chrome
browser (desktop version 80.0), running on Ubuntu
6 Fingerprinting Accuracy
20.04 LTS. When visiting each website, we extract all Next, we evaluate the accuracy of our WF techniques
domains contacted to construct the fingerprint for that using the data collected in §5. We begin with an analy-
website using the steps discussed in §4. At the network sis of the information entropy that we can expect from
level, we capture the sequence of destination IPs con- domain-based fingerprints, and then evaluate the accu-
tacted, to evaluate the accuracy of our IP-based WF racy of IP-based fingerprints.
method (§6). Note that the collection of this sequence
of IPs is oblivious to the domains extracted indepen- 6.1 Fingerprint Entropy
dently when loading each website.
Due to DNS-based load balancing, many domains, As discussed in §3, the creation of IP-based fingerprints
especially of popular websites, may map to different IP is based on the domains contacted while visiting the
addresses at different times [45]. Therefore, once the set targeted websites. Therefore, it is important to first ex-
of domains that were contacted to render each web- amine the uniqueness of these domains, as it impacts
site is extracted, we continuously resolve them to ob- the effectiveness of IP-based WF. This will aid us in
tain their IP addresses until the next crawl. This best- deciding whether a domain should be included or not
effort approach allows us to obtain as many IPs as pos- as part of a fingerprint. For example, if a certain do-
sible for those domains that employ DNS-based load main is contacted when visiting every single website,
balancing. However, to make sure our experiment does then there is no point in including it. In contrast, if a
not saturate DNS servers (thus affecting other legiti- unique domain is only contacted when visiting a partic-
mate users), we enforce a rate limit of at least three ular website, it will make the fingerprint more distin-
hours. In other words, contacted domains of a web- guishable. The more unique a domain is, the higher the
site are only resolved again if they were not resolved information entropy that can be gained [94], resulting
within the last three hours. The entire process for each in a better fingerprint.
crawl batch takes approximately 2.5 days. As a result,

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting               427

                                             Domain            IP                Table 1. Top-ten domains that yield the lowest entropy.
                       100

                        80                                                          Domain name                        # Websites     Entropy
        Cumulative %

                        60                                                          www.google-analytics.com          114K   (55%)         0.87
                        40                                                          fonts.gstatic.com                 102K   (49%)         1.03
                                                                                    fonts.googleapis.com              102K   (49%)         1.04
                        20
                                                                                    www.google.com                     76K   (37%)         1.44
                         0                                                          stats.g.doubleclick.net            72K   (35%)         1.53
                             0   2   4       6    8    10 12 14      16   18
                                                                                    www.googletagmanager.com           64K   (31%)         1.71
                                         Information entropy (bit)
                                                                                    www.facebook.com                   53K   (25%)         1.97
Fig. 1. CDF of the information entropy gained per domain/IP as                      connect.facebook.net               53K   (25%)         1.98
a percentage of all the unique domains/IP addresses observed.                       googleads.g.doubleclick.net        49K   (24%)         2.09
                                                                                    ajax.googleapis.com                34K   (16%)         2.62
     Let P (d) be the probability that a particular domain
will be contacted when visiting a given website among                            (that have been observed to be) hosted on it. Note that
the targeted websites. The information entropy in bits                           calculating the entropy using both average and median
gained if that particular domain is contacted can be                             gives us similar results because most co-hosted domains
calculated using the following formula:                                          on the same IP addresses often provide a similar amount
            Information entropy = − log2 P (d)                                   of information entropy. We thus choose the former one.
                                                                                      Considering these two possibilities, we then calcu-
     The blue (solid) line in Figure 1 shows the entropy                         late the information entropy of the 340K IPs observed
gained from each domain as a percentage of the approx-                           from our continuous DNS measurement in each crawl
imately 475K unique domains observed in each crawl                               batch. As indicated by the red (dashed) line of Figure 1,
batch. Almost 90% of these domains are unique to the                             50% of the IPs provide at least 9 bits of information en-
website from which they are referenced, yielding high                            tropy, while there is a group of more than 30% of the
information entropy (17.7 bits). In contrast, there are                          IPs that provide a high amount of information entropy
a few domains from which we can only gain a small                                (17.7 bits), which correspond to IPs hosting only a sin-
amount of entropy. This result aligns well with the study                        gle domain.
of Greschbach et al. [37] in which the Alexa top-site list
was crawled and for 96.8% of the websites there exists at                        6.2 Primary Domain to IP Matching
least one domain that is unique only to these websites.
     Table 1 lists the top-ten domains that provide the                          Before evaluating our WF techniques, we first investi-
least information entropy. Eight of them belong to                               gate whether WF is needed at all. Prior studies have
Google and two belong to Facebook. These domains                                 shown that a significant fraction of websites have a
are commonly included in many websites, thus only                                one-to-one mapping between primary domains and their
contributing a small amount of entropy. For instance,                            hosting IP(s) [45]. This first connection can be distin-
www.google-analytics.com is included in more than                                guished from the subsequent requests since there is usu-
half of the websites. However, we opt to keep them                               ally a noticeable time gap during which the browser
as part of our fingerprints, as most of them still pro-                          needs to contact the primary domain to download the
vide more than one bit of information, helping to differ-                        initial HTML file, parses it, and constructs the DOM
entiate between websites that reference these external                           tree before multiple subsequent connections are initi-
Google/Facebook services and those that do not.                                  ated to fetch referenced resources. As a result, it is
     Based on the entropy for each domain computed                               straightforward for an adversary to target this very first
above, we then calculate the entropy gained per IP ad-                           connection to infer which website is being visited.
dress, since our ultimate goal is to perform WF at the                                More specifically, when a domain is hosted on one
IP level. Given an observed IP, there are two possibili-                         IP or multiple IPs without sharing its hosting server(s)
ties regarding the domain(s) it may correspond to. First,                        with any other domains, it is easy to infer the domain
the IP may be associated with only a single domain, in                           from the IP(s) of its hosting server(s). We analyzed the
which case its entropy can be deduced directly from the                          DNS records of all primary domains in our dataset to
domain’s entropy. Second, the IP may co-host multiple                            quantify the fraction of websites that can be finger-
domains. In this case, the IP’s entropy is calculated by                         printed by just targeting their primary IP(s). We find
taking the average of the entropy values of all domains                          that 52% of the websites studied have their primary

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 428

Table 2. Percentage of successfully identified websites using i) naive primary domain to IP matching (§6.2), ii) basic fingerprinting
(§4.1, §6.3), and iii) enhanced fingerprinting with connection bucketing (§4.2, §6.4).

Website type Total Primary Domain IP-based Fingerprinting Connection Bucketing

All websites crawled 208,191 107,455 (52%) 174,662 (84%) 189,527 (91%)
Popular websites 93,661 58,989 (63%) 86,147 (92%) 90,231 (96%)
Sensitive websites 120,293 51,538 (43%) 93,988 (78%) 104,983 (87%)
Sensitive and popular 5,763 3,072 (53%) 5,473 (95%) 5,687 (99%)

domain hosted on their own IP(s), while the remaining the popular websites and 78% of the sensitive websites.
48% are co-hosted on a server with at least another web- More worrisome is the fact that 95% of sensitive and
site. This result means that an adversary can already popular websites can be fingerprinted.
infer 52% of the targeted websites based solely on the
IP address of the very first connection to the primary 6.4 Enhanced Website Fingerprinting with
domain, without having to consider secondary connec- Connection Bucketing
tions. The third column of Table 2 shows the breakdown
of these websites in terms of their popularity and sen- We next evaluate the effectiveness of the enhanced
sitivity. Note that the total number of websites shown WF (§4.2), in which we take the critical rendering path
here is lower than the total number of test websites se- into consideration to cluster IPs into three buckets. Sim-
lected in §5 due to some unresponsive websites when ilarly to the basic fingerprints (§6.3), given a sequence
conducting our experiment. of IPs [ip0 , ip1 , ip2 , ..., ipn ], we first scan ip0 against all
fingerprints to create a pool of candidate fingerprints.
For the subsequence [ip1 , ip2 , ..., ipn ], our goal is to split
6.3 Basic IP-based Website Fingerprinting
it into three buckets of connections that can potentially
To fingerprint the remaining 48% of websites whose pri- be matched with the three buckets of IPs in the IP-
mary domains are co-hosted, an adversary would need based fingerprints. Based on the time of each connection
to analyze the second part of their IP-based fingerprint initiation captured at the network level (§5.2), we use
that captures the IP addresses of secondary domains. k-means clustering [30, 60, 63] to split them into three
Going back to the way we build our IP-based fin- sets of IPs.
gerprints in §4.1, the basic IP-based fingerprint has two For every candidate fingerprint, we intersect each
parts. The first part consists of the primary domain’s bucket of IPs in the fingerprint to the corresponding
IP(s), and the second part comprises a set of IPs ob- bucket of IPs captured from the network trace. Then,
tained by resolving all secondary domains. Given a se- for each matching IP, we add its entropy to the total
quence of IPs [ip0 , ip1 , ip2 , ..., ipn ] observed from a net- amount of entropy gained for that particular fingerprint.
work trace, we first scan ip0 against the primary part of Finally, we choose the fingerprint with the highest en-
all IP-based fingerprints, which are created by repeated tropy to predict the visited website.
active DNS measurements (§5.2). If ip0 is found among Using this approach, the accuracy rate can be im-
the primary IPs of a given fingerprint, the fingerprint proved to 91%. The breakdown of fingerprinted websites
is added to a pool of candidates. We then compare the is shown in the last column of Table 2. For the popular
subset {ip1 , ip2 , ..., ipn } with the secondary part of each and the sensitive websites, we obtain an accuracy rate
candidate fingerprint. For each matching IP, we add the of 96% and 87%, respectively. However, a more alarm-
entropy provided by that IP to the total amount of en- ing result is that 99% of sensitive and popular websites
tropy gained for that particular candidate fingerprint. can be precisely fingerprinted, posing a severe privacy
Finally, we choose the fingerprint with the highest total risk to their visitors.
entropy to predict the website visited. We now look into the remaining 9% (18,664) of web-
Using this IP-based WF method we obtained an sites for which we could not find an exact match. As
increased matching rate of 84%—that is, 84% of the shown in Figure 2, 20% of these websites have only
websites in our data set were identified with 100% ac- two matching candidates, while about 50% of them
curacy. The breakdown of the fingerprinted websites is have only up to ten matching candidates. By manually
shown in the fourth column of Table 2. Among these examining some of these fingerprints, we found many
fingerprinted websites, we could precisely match 92% of cases in which the matching candidate domains actu-

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting                         429

                       100                                                                                   5 days       15 days          2 months
                                                                                                             10 days      1 month
        Cumulative %    80
                                                                                                       100
                        60
                                                                                                        80

                                                                                        Cumulative %
                        40
                                                                                                        60
                        20
                                                                                                        40
                         0
                             10   0
                                        10 1
                                                      102
                                                                   10   3                               20
                                      Number of candidate websites                                      0
                                                                                                          0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Fig. 2. CDF of candidate websites per fingerprint for the remain-                                       Stable                                  Unstable
                                                                                                                         Difference degree
ing 9% (18,664) of websites that could not be matched.
                                                                                Fig. 3. CDF of the stability of domain names loaded in each web-
ally point to the same website (but without redirect-                           site as a percentage of all websites studied.
ing to the same domain). These are mostly owned by
organizations who have registered the same name un-                             previously observed at time t0 , yielding a difference de-
der different TLDs (e.g., bayer.de vs. bayer.com), or                           gree of 0. In contrast, a website is considered as unstable
variations of the domain (e.g., christianrock.net vs.                           when its difference degree is 1, meaning that all domains
christianhardrock.net) to protect their brand against                           observed at time t1 are different from those previously
domain squatting [100]. A more determined adversary                             seen at t0 .
could invest the effort to implement more advanced                                   Figure 3 shows the stability of the websites studied
techniques for identifying such duplicate websites. For                         in terms of the domains contacted while visiting them.
instance, string similarity can be used to cluster similar                      About 30% of the websites contact the exact same set of
domains, while image similarity can be used to group                            domains to download web resources for the whole two-
websites with similar screenshots of the start page.                            month period of our study. Within a five-day period,
                                                                                80% of the websites are still almost completely stable,
                                                                                with a difference degree lower than 0.1, while this per-
7 Fingerprint Stability
                                                                                centage decreases to 50% over the two-month period.
As mentioned in our threat model, the efficacy of a WF                          Understandably, almost half of the websites we study
attack also depends on the stability of fingerprints over                       are the most popular on the Internet. Hence, it is ex-
time. There are two primary reasons why a website fin-                          pected that their content will be changed or updated on
gerprint may go stale. First, the website may change                            a regular basis. However, even after two months, almost
over time with existing elements removed and new ele-                           80% of the websites are still stable, with a difference de-
ments added [57]. Second, the mapping between a do-                             gree lower than 0.3, meaning that 70% of observed do-
main and its hosting IP(s) may also change [45]. Con-                           mains are still being used to host web resources needed
sequently, a previously constructed fingerprint may no                          to render these websites.
longer be valid after a certain time period.                                         This is a favorable result for the adversary, as it
    Since our IP-based fingerprints are constructed                             shows that domains are an effective and consistent fea-
based on domains contacted while browsing the targeted                          ture. The result particularly implies that the adversary
websites, we first examine the extent to which these                            does not need to keep crawling all websites repeatedly
websites are stable in terms of the domains that they                           to construct domain-based fingerprints. Based on the
reference. We introduce a difference metric to quantify                         results of Figure 3, the adversary perhaps can divide
the change in this set of domains for a given website                           websites into two groups comprising stable and less sta-
over time as follows. Let Dt0 and Dt1 be the sets of                            ble websites. For instance, the stable group consists of
contacted domains observed when browsing a website                              80% of websites with a difference degree lower than 0.2
at time t0 and t1 , respectively, the difference degree for                     after ten days, while the less stable group consists of the
this website is calculated as:                                                  remaining 20% of websites. Then, the adversary would
                                               (Dt0 ∪ Dt1 ) − (Dt0 ∩ Dt1 )      only need to re-crawl the less stable ones every ten days
     Difference degree =                                                        to keep their domain-based fingerprints fresh, instead of
                                                       Dt 0 ∪ Dt 1
                                                                                all websites.
    Based on this definition, we consider a website as                               However, in our threat model, what can be actu-
stable during a period t0 → t1 when the set of domains                          ally observed by the adversary is only IP addresses. We
observed at time t1 have not changed compared to those

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting                    430

                             5 days       15 days          2 months                                100
                             10 days      1 month
                                                                                                    80

                                                                                      Accuracy %
                       100
                                                                                                    60
                        80
        Cumulative %

                                                                                                    40
                        60
                                                                                                    20
                        40
                                                                                                     0
                        20                                                                         ≤ 2.5 5   10 15 20 25 30 35 40 45 50 55 60
                                                                                                              Days since fingerprints constructed
                        0
                          0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
                        Stable                                  Unstable      Fig. 5. Fingerprint stability over time. Even after two months,
                                         Difference degree                    70% of the tested websites can still be fingerprinted.

Fig. 4. CDF of the stability of IP addresses in each IP-based fin-
                                                                              son why our IP-based fingerprints are still effective at
gerprint as a percentage of all websites studied.
                                                                              revealing 70% of the targeted websites even though a
thus apply the same difference formula to quantify the                        large number of IP-based fingerprints have changed sig-
stability of IP-based fingerprints. Unlike domain-based                       nificantly after two months, as indicated in Figure 4.
fingerprints, IP-based fingerprints become stale faster,                           The above finding means that the adversary can in-
as shown in Figure 4. We find that only 10% of the IP-                        telligently split domains into two groups, based on pre-
based fingerprints contain the same set of IPs over the                       viously observed data. The first group consists of do-
course of two months. After ten days, 60% of the finger-                      mains whose IPs are dynamic, while the second group
prints have more than 30% of their IPs changed. After                         contains domains whose IPs remain static over a config-
two months, half of the IPs have changed in more than                         urable amount of time. The adversary then only needs
50% of the fingerprints.                                                      to periodically perform DNS lookups for the first group
     Given these results, we investigate how the instabil-                    after a desired amount of time has passed, depending on
ity of the IP-based fingerprints impacts the accuracy of                      the network overhead and resources the adversary can
our WF attack. We consider an attack scenario in which                        sustain for conducting the attack.
the adversary uses fingerprints constructed in the past
to track the users’ browsing activities at a future time.                     8 Fingerprint Robustness
Figure 5 shows the accuracy (i.e., the percentage of suc-
cessfully identified websites) of our enhanced WF ap-                         We next examine the impact of HTTP caching on the
proach over the course of two months. Within 2.5 days3                        effectiveness of our WF since resources are often cached
after their generation, our fingerprints consistently yield                   by web browsers to improve websites’ performance. In
a high accuracy of 91%. Over the course of two months,                        addition, our WF also exploits the fact that websites
we can see a gradual decrease in the accuracy. However,                       often load external resources, including images, style
this decrease is quite modest, as after five to ten days                      sheets, fonts, and even “unwanted” third-party analytics
since their construction the fingerprints can still be used                   scripts, advertisements, and trackers [76], which result
to accurately identify about 80% of the websites. This                        in a sequence of connections to several servers with dif-
number only decreases to about 70% after two months.                          ferent IPs, making the fingerprints more unique. Thus,
     Although IP-based fingerprints go stale faster com-                      we also investigate whether blocking these unnecessary
pared to their domain-based fingerprints, those IP ad-                        resources would help make websites less distinguishable,
dresses that change frequently mostly correspond to sec-                      thus reducing their fingerprintability.
ondary domains, and only a small fraction corresponds
to primary domains (see Appendix A for details). The
vast majority of primary domains are hosted on mostly                         8.1 Impact of HTTP Caching on Website
static IP addresses for the whole period of our study.                            Fingerprinting Accuracy
As a result, the persistently stable IP addresses of these                    When a website is revisited, cached resources can be
primary domains in the IP-based fingerprints is the rea-                      served from the local cache without the browser fetching
                                                                              them again from their origins. Since our attack is based
                                                                              solely on the observation of the IP of connections to
3 The lowest time granularity is 2.5 days because each crawl
                                                                              remote destinations, we are interested in examining the
batch in our dataset requires this amount of time to be collected,            fraction of cacheable resources and the extent to which
as discussed in §5.                                                           HTTP caching impacts the effectiveness of our WF.

Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting                431

                       100                                                                     84

        Cumulative %    80
                                                                                               83

                                                                                  Accuracy %
                        60
                                                                                               82
                        40

                        20                                                                     81

                         0                                                                     80
                          -1   0   5m    1h 5h 1d 5d 1M      1Y                                 First 5m 10m 1h 5h 10h 1d 5d 10d 1M 2M
                                   Freshness timeline                                           visit
                                                                                                         Revisit after (x) amount of time
Fig. 6. CDF of the freshness timeline of HTTP resources.
(m: minute, h: hour, d: day, M: month, Y: year)                           Fig. 7. The accuracy of IP-based fingerprinting attack when ex-
                                                                          cluding destinations where cacheable resources are hosted. (m:
                                                                          minute, h: hour, d: day, M: month)
     Analyzing the response header of 21.3M objects ob-
served while crawling the tested websites, we find that
                                                                               There are two primary reasons why our attack is
86.1% of them are cacheable. In other words, these
                                                                          still highly effective although the majority of resources
HTTP resources can be stored and served from the local
                                                                          are cacheable. First, excluded IPs often host long-term
cache without being downloaded again from the remote
                                                                          cacheable shared resources, such as fonts and JavaScript
servers when being revisited.
                                                                          code, which contribute only a small amount of entropy
     Utilizing the cache-control information in the
                                                                          to the fingerprint if included. Second, for cacheable re-
HTTP response header, we compute the freshness time-
                                                                          sources hosted on IPs with high entropy, not all re-
line for each resource. The freshness timeline is the
                                                                          sources have the same freshness timeline. In fact, we find
amount of time during which the browser can store
                                                                          that half of the origins that host resources cacheable for
and serve resources from its cache without download-
                                                                          more than one hour also serve another resource with
ing them again from their original servers. Figure 6
                                                                          a freshness timeline shorter than five minutes, causing
shows the distribution of the freshness timeline of 21.3M
                                                                          at least one network connection to the original server if
objects. The value “-1” denotes uncachable resources
                                                                          revisited.
(13.9%) that must be downloaded again from their ori-
gin if revisited, while “0” indicates cacheable resources
(21.9%) that always need to be revalidated with their
origin. In other words, these two types of resources
                                                                          8.2 Fingerprinting Under Ad Blocking
will always cause a network connection to their original                  Due to the prevalence of ads and analytics scripts that
servers if revisited. On the other hand, the remaining                    harvest users’ information [86], many advertisement and
64.2% of resources can be loaded directly from the local                  tracker blocking tools have been developed to protect
cache without making any network connections.                             user privacy. Of these tools, the Brave browser has stood
     Next, we evaluate the impact of cacheable resources                  out to be one of the best browsers for user privacy on
on our attack accuracy by excluding IPs on which                          the Clearnet to date [56].
cacheable resources are hosted. We use the basic finger-                      Therefore, we opt to use the Brave browser for in-
printing method here for our evaluation (§4.1) instead                    vestigating the impact of ad blocking on IP-based WF.
of the enhanced one (§4.2), because revisited resources                   During the last four batches of our data collection pro-
may not be freshly loaded in the order of the critical                    cess (i.e., ten days), at the same time with crawling the
rendering path as in the first visit.                                     test websites (without blocking ads and trackers), we in-
     Although many web resources are cached, we could                     strumented the Brave browser (desktop version 1.6.30)
still obtain a high accuracy. As shown in Figure 7,                       to crawl these websites a second time. While the Brave
even when websites are revisited after only five minutes,                 browser is loading each website, we also capture (1) the
meaning that the majority of resources can be served                      set of domains contacted to fetch web resources for ren-
from the local cache, an accuracy of 80% can still be                     dering the website, and (2) the sequence of IPs con-
obtained—a decrease of just 4% (from 84%) compared                        tacted to fetch web resources.
to when websites are visited for the first time. If web-                      Using the same fingerprinting techniques as in §6,
sites are revisited after one hour, one day, or one month,                we then match the sequences of IP addresses observed
our basic WF attack can obtain a gradually increased                      while browsing with the Brave browser to our IP-based
accuracy of 80.8%, 81.4%, and 82.3%, respectively.                        fingerprints. As expected, the fingerprinting accuracy

You can also read