Domain name encryption is not enough: privacy leakage via IP-based website fingerprinting
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Proceedings on Privacy Enhancing Technologies ; 2021 (4):420–440 Nguyen Phong Hoang, Arian Akhavan Niaki, Phillipa Gill, and Michalis Polychronakis Domain name encryption is not enough: privacy leakage via IP-based website fingerprinting Abstract: Although the security benefits of domain name encryption technologies such as DNS over TLS 1 Introduction (DoT), DNS over HTTPS (DoH), and Encrypted Client Due to the increase of Internet surveillance in recent Hello (ECH) are clear, their positive impact on user years [13, 31], users have become more concerned about privacy is weakened by—the still exposed—IP address their online activities being monitored, leading to the information. However, content delivery networks, DNS- development of privacy-enhancing technologies. While based load balancing, co-hosting of different websites on various mechanisms can be used depending on the de- the same server, and IP address churn, all contribute to- sired level of privacy [48], encryption is often an in- wards making domain–IP mappings unstable, and pre- dispensable component of most privacy-enhancing tech- vent straightforward IP-based browsing tracking. nologies. This has led to increasing amounts of Internet In this paper, we show that this instability is not a traffic being encrypted [3]. roadblock (assuming a universal DoT/DoH and ECH Having a dominant role on the Internet, the web deployment), by introducing an IP-based website finger- ecosystem thus has witnessed a drastic growth in printing technique that allows a network-level observer HTTP traffic being transferred over TLS [28]. Although to identify at scale the website a user visits. Our tech- HTTPS significantly improves the confidentiality of nique exploits the complex structure of most websites, web traffic, it cannot fully protect user privacy on its which load resources from several domains besides their own when it comes to preventing a user’s visited web- primary one. Using the generated fingerprints of more sites from being monitored by a network-level observer. than 200K websites studied, we could successfully iden- Specifically, under current web browsing standards, the tify 84% of them when observing solely destination IP domain name information of a visited website can still addresses. The accuracy rate increases to 92% for pop- be observed through DNS queries/responses, as well as ular websites, and 95% for popular and sensitive web- the Server Name Indication (SNI) field of the TLS hand- sites. We also evaluated the robustness of the gener- shake packets. To address this problem, several domain ated fingerprints over time, and demonstrate that they name encryption technologies have been proposed re- are still effective at successfully identifying about 70% cently to prevent the exposure of domain names, in- of the tested websites after two months. We conclude cluding DNS over TLS (DoT) [51], DNS over HTTPS by discussing strategies for website owners and host- (DoH) [49], and Encrypted Client Hello (ECH) [88]. ing providers towards hindering IP-based website fin- Assuming an idealistic future in which all network gerprinting and maximizing the privacy benefits offered traffic is encrypted and domain name information is by DoT/DoH and ECH. never exposed on the wire as plaintext, packet metadata Keywords: Domain Name Encryption, DoT, DoH, En- (e.g., time, size) and destination IP addresses are the crypted Client Hello, Website Fingerprinting only remaining information related to a visited website DOI 10.2478/popets-2021-0078 that can be seen by a network-level observer. As a result, Received 2021-02-28; revised 2021-06-15; accepted 2021-06-16. tracking a user’s browsing history requires the observer to infer which website is hosted on a given destination IP address. This task is straightforward when an IP address hosts only one domain, but becomes more challenging Nguyen Phong Hoang: Stony Brook University, E-mail: when an IP address hosts multiple domains. Indeed, re- nghoang@cs.stonybrook.edu cent studies have shown an increasing trend of websites Arian Akhavan Niaki: University of Massachusetts - Amherst, E-mail: arian@cs.umass.edu being co-located on the same hosting server(s) [47, 95]. Phillipa Gill: University of Massachusetts - Amherst, E-mail: Domains are also often hosted on multiple IP addresses, phillipa@cs.umass.edu while the dynamics of domain–IP mappings may also Michalis Polychronakis: Stony Brook University, E-mail: change over time due to network configuration changes mikepo@cs.stonybrook.edu or DNS-based load balancing.
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 421 Given these uncertainties in reliably mapping do- nections per fingerprint. Furthermore, our attack ex- mains to their IPs, we investigate the extent to which ploits the fact that websites often load many external re- accurate browsing tracking can still be performed by sources, including third-party analytics scripts, images, network-level adversaries based solely on destination and advertisements, making their fingerprints more dis- IPs. In this work, we introduce a lightweight website fin- tinguishable. We thus also investigated whether the re- gerprinting (WF) technique that allows a network-level moval of these resources due to browser caching or ad observer to identify with high accuracy the websites a blocking could help to make websites less prone to IP- user visits based solely on IP address information, en- based fingerprinting (§8). abling network-level browsing tracking at scale [74]. For By analyzing the HTTP response header of the web- instance, an adversary can use—already collected by ex- sites studied, we find that 86.1% of web resources are isting routers—IPFIX (Internet Protocol Flow Informa- cacheable, causing fewer network connections to be ob- tion Export) [101] or NetFlow [2] records to easily ob- servable by the adversary if these resources are loaded tain the destination IP addresses contacted by certain from the browser’s cache (§8.1). Moreover, using the users, and track their browsing history. Brave browser to crawl the same set of websites, we For our attack, we first crawl a set of 220K web- found that the removal of third-party analytics scripts sites, comprising popular and sensitive websites selected and advertisements can impact the order in which web from two website ranking lists (§5). After visiting each resources are loaded (§8.2), significantly reducing the ac- website, we extract the queried domains to construct a curacy of the enhanced fingerprints from 91% to 76%. domain-based fingerprint. The corresponding IP-based Nonetheless, employing the initially proposed WF tech- fingerprint is then obtained by continuously resolving nique in which the critical rendering path [38] is not the domains into their IPs via active DNS measurement taken into account, we could still fingerprint 80% of the (§4.1). By matching these IPs from the generated fin- websites even when browser caching and ad blocking are gerprints to the IP sequence observed from the network considered. traffic when browsing the targeted websites, we could Regardless of the high degree of website co-location successfully fingerprint 84% of them (§6.3). The suc- and the dynamics of domain–IP mappings, our findings cessful identification rate increases to 92% for popular show that domain name encryption alone is not enough websites, and 95% for popular and sensitive websites. to protect user privacy when it comes to IP-based WF. To further enhance the discriminatory capacity As a step towards mitigating this situation, we discuss of the fingerprints, we consider the critical rendering potential strategies for both website owners and hosting path [38] to capture the approximate ordering structure providers towards hindering IP-based WF and maximiz- of the domains that are contacted at different stages ing the privacy benefits offered by domain name encryp- while a website is being rendered in the browser (§4.2). tion. To the extent possible, website owners who wish to Our results show that the enhanced fingerprints could make IP-based website fingerprinting harder should try allow for 91% of the tested websites to be successfully to (1) minimize the number of references to resources identified based solely on their destination IPs(§6.4). that are not served by the primary domain of a website, Given the high variability of website content and and (2) refrain from hosting their websites on static IPs domain–IP mappings across time, we expect that once that do not serve any other websites. Hosting providers generated, a fingerprint’s quality will deteriorate quickly can also help by (1) increasing the number of co-located over time. To assess the aging behavior of the finger- websites per hosting IP, and (2) frequently changing the prints, we conducted a longitudinal study over a period mapping between domain names and their hosting IPs, of two months. As expected, fingerprints become less to further obscure domain–IP mappings, thus hindering accurate over time, but surprisingly, after two months, IP-based WF attacks. they are still effective at successfully identifying about 70% of the tested websites (§7). 2 Background and Motivation As our WF technique is based on the observation of the IPs of network connections that fetch HTTP re- In this section, we review some background information sources, it is necessary to evaluate the impact of HTTP on domain name encryption technologies and discuss the caching on the accuracy of the fingerprints. This is be- motivation behind our study. In particular, we highlight cause cached resources can be loaded directly from the how our IP-based fingerprinting attack is different from browser’s cache when visiting the same website for a prior works, allowing network-level adversaries to effec- second time, resulting in the observation of fewer con- tively mount the attack at scale.
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 422 2.1 Domain Name Encryption cannot provide any meaningful privacy benefit without the use of DoT/DoH, and vice versa. Mozilla has sup- In today’s web browsing environment, there are two ported ECH since Firefox 85 [52]. channels through which domain name information is exposed on the wire: plaintext DNS requests/responses, and the Server Name Indication (SNI) extension of TLS. 2.2 Website Fingerprinting The plaintext nature of DNS not only jeopardizes user privacy, but also allows network-level entities to in- Website fingerprinting (WF) is a type of traffic analy- terfere with user connections. For example, an on-path sis attack, employed to construct fingerprints for a set attacker can inject forged DNS responses to redirect a of websites based on the traffic pattern observed while targeted user to malicious hosts [27]. The domain name browsing them. Depending on which metadata is visible information exposed via DNS packets and the SNI field from the encrypted traffic, different WF techniques can has also been intensively exploited by state-level net- be used to determine whether a user under surveillance work operators for censorship purposes [14, 46, 75, 78, visited any of the monitored websites. 82]. To cope with these security and privacy problems, Numerous WF attacks targeting anonymized or several solutions have been recently proposed to safe- obfuscated communication channels have been pro- guard domain name information on the wire, including posed [40, 58, 73, 79, 85, 98, 107, 108], in which DoT [51], DoH [49], and ECH [88]. the actual destination IP address is hidden by means By encrypting DNS traffic, DoT/DoH preserves the of privacy-enhancing network relays [32, 48], such as integrity and confidentiality of DNS resolutions. Sev- Tor [26] or the Invisible Internet Project (I2P) [43, 113]. eral companies (e.g., Google [35], Cloudflare [4]) of- However, WF attacks on standard encrypted web traf- fer free DoT/DoH services to the public, while popu- fic (i.e., HTTPS), in which no privacy-enhancing net- lar web browsers including Chrome and Firefox already work relays are employed, have not been comprehen- support DoH [35, 67], with the latter enabling DoH sively investigated, especially at the IP-address level. (through Cloudflare) by default for users in the US since This is because the domain name information previ- 2019. However, the design choice of these vendors to ously available in several plaintext protocols (e.g., DNS, centralize all DNS resolutions to one trusted resolver the SNI extension of TLS, and OCSP queries [90]) can has raised several privacy concerns. As a result, more be easily obtained from the network traffic (§2.1). This privacy-centric DNS resolution schemes have also been information alone can already be used for a straight- proposed, including DoH over Tor [71, 91], Oblivious forward inference of applications or web services being DNS [92, 97], and distributed DoH resolution [44], to visited [42, 102]. However, in an idealistic future where not only conceal DNS packets from on-path observers, domain name encryption (i.e., DoT/DoH and ECH) is but also to deal with “nosy” recursors. fully deployed, visibility to any information above the IP SNI has been incorporated into the TLS protocol layer will be lost. Under these conditions, and given the since 2003 [17] as a workaround for name-based virtual high degree of web co-location [47], our ultimate goal hosting servers to co-host many websites that support is to investigate the extent to which websites can still HTTPS. During the TLS handshake, the client includes be fingerprinted at scale, based solely on the IP address the domain name of the intended website in the SNI information of the servers being contacted. field in order for the server to respond with the cor- Given the numerous WF methods introduced in the responding TLS certificate of that domain name. Until past, one may wonder why do we even need another TLS 1.2, this step takes place before the actual encryp- website fingerprinting method? In addition to the afore- tion begins, leaving the SNI field transmitted in plain- mentioned reasons and pitfalls of previous WF tech- text, and exposing users to similar security and privacy niques [53], the rationale behind our fingerprinting tech- risks as discussed above. TLS 1.3 provides an option nique based solely on IP-level information stems from to encrypt the SNI field, concealing the domain name the increasing deployment of domain name encryption information [87]. Since March 2020, ESNI has been re- technologies [25, 61]. Currently, most web traffic does worked into the ECH extension [88]. In order for ECH expose domain information, as domain name encryption to function, a symmetric encryption key derived from has not been fully deployed [103], and thus network-level the server’s public key has to be obtained in advance. browsing tracking at scale through DNS or SNI is much This public key can be obtained via an HTTPS resource easier. However, once this massive-scale monitoring ca- record lookup. Thus, it is important to note that ECH pability is gone due to DoT/DoH and ECH, the next
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 423 best option for ISPs to continue tracking at a similar sual browsing behavior (except, perhaps, the rare event scale will be to rely on IP addresses, which are already of a browser restart with many previously open tabs). collected as part of IPFIX (Internet Protocol Flow In- Although some users may visit more than one website formation Export) [101] or NetFlow [2] records by ex- at a time, they mostly interact with one tab at a time. isting routers. Although more elaborate fingerprinting There is also a time gap when changing from one tab to schemes can certainly be conceived for HTTPS traffic, another to open or reload a different website, especially these will require a significant deployment effort and for users with a single screen. All these together allow cost [74], both for constructing and maintaining the fin- an observer to distinguish between individual website gerprints, as well as for matching them. visits, as also evident by existing techniques that can be employed to split a network trace of such multi-tab activity into individual traces [23, 24, 111]. Moreover, 3 Threat Model although many individual users may be located behind Internet service providers (ISPs) have been increasingly the same NAT network, Verde et al. [105] have devel- harvesting user traffic for monetization purposes, such oped a framework that can identify different individ- as targeted advertising [18, 34, 55]. Our threat model uals behind a large metropolitan WiFi network based considers the real-world scenario in which a local adver- on NetFlow records. To keep our study simple, we thus sary (e.g., an ISP) passively monitors users’ traffic and assume that the adversary already employs the afore- attempts to determine whether a particular website was mentioned techniques to obtain the network trace of visited. The adversary carries out the following steps to different individuals before conducting our WF attack. create website fingerprints. First, the adversary visits a set of websites and 4 Fingerprint Construction records all domain names that are contacted to fetch their resources. A domain-based fingerprint for each Next, we explain how we construct IP-based fingerprints website is then built from this set of contacted domains. in more detail, from creating the initial domain-based After that, these domains are periodically resolved to fingerprints to deriving the final IP-based fingerprints. their hosting IPs, which are used to construct IP-based At a conceptual level, we first explore the straightfor- fingerprints. Depending on the relationship between a ward approach of resolving all domains loaded while vis- domain name and its hosting IP(s), a connection to a iting a website to their hosting IPs, from which we cre- unique IP can be used to easily reveal which website is ate a set of unique IPs that can potentially be used as being visited if the IP only hosts that particular domain the IP-based fingerprint for that website. We then take name. Finally, to conduct the WF attack, the adversary the critical rendering path [38] into account to improve tries to match the sequence of IP addresses found in the fidelity of the fingerprints, by considering the ap- the network trace of the monitored user with the IP- proximate order in which domains are loaded while the based fingerprints constructed in the previous step to website is being rendered on the screen. infer which website was visited. The effectiveness of our attack depends on two pri- 4.1 Basic IP-based Fingerprint mary factors, namely, the uniqueness (§6) and stability Assuming an idealistic web browsing scenario in which (§7) of the fingerprints. It is worth emphasizing that domain name information can no longer be extracted our model does not assume fingerprinting of obfuscated from network traffic due to the full deployment of do- network traffic, in which the IP address information is main name encryption, the only remaining information already hidden by means of privacy-enhancing technolo- visible to the adversary is packet metadata (e.g., time, gies. This class of attacks, whose goal is to use sophisti- size) and sequences of connections to remote IP ad- cated traffic analysis methods to fingerprint anonymized dresses of contacted web servers. Under these condi- network traffic, has been extensively investigated by tions, the adversary would need to fingerprint targeted prior studies [20, 40, 58, 79, 98]. Our threat model re- websites based primarily on this information. As intro- quires only minimal information collected from the net- duced in our threat model, the adversary first visits the work traffic, i.e., destination IPs. targeted websites and records all domains that are con- We consider a browsing scenario in which one web- tacted while browsing each website. Each domain can site is visited at a time, which is particularly valid then be resolved to its hosting IP address(es). As a re- when it comes to ordinary web users on devices with sult, the mapping between domains and hosting IP ad- smaller screens, and is also most often the case of ca-
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 424 dress(es) is the basic unit on which the adversary relies 4.2 Enhanced IP-based Fingerprint with to construct IP-based fingerprints. Connection Bucketing When browsing a website, the browser first contacts When a webpage is visited, besides the initial connec- the web server to fetch the initial resource—usually tion to the primary domain, multiple requests may then an HTML document. It then parses the HTML docu- be issued in parallel to fetch other resources referenced ment and subsequently fetches other web resources ref- in the initial HTML document. Once fetched, these re- erenced. Based on this underlying mechanism of fetch- sources may sometimes trigger even more requests for ing a webpage, the adversary constructs domain-based other sub-resources (e.g., JavaScript). The absolute or- fingerprints as follows. For a given website, its domain- der of these requests on the wire can change from time to based fingerprint consists of two parts: the primary do- time, depending on many uncertain factors, such as the main, denoted as dp , and a set of secondary domains, performance of the upstream network provider and the denoted as ds . The fingerprint then can be represented underlying operating system. For that reason, we did as: dp + {ds1 , ds2 , ..., dsn }. The primary domain is the not consider the order in which domains are contacted domain of the URL shown in the browser’s address bar, when constructing the domain-based fingerprints, and and typically corresponds to the server to which the thus we gather all secondary domains into one bucket, first connection is made for fetching the initial HTML as described in §4.1. However, when viewing all these document of the visited webpage. Secondary domains requests as a whole, there still exists a high-level order- may be different from the primary one and are used for ing relationship that we can capture when considering hosting other resources needed to load the webpage. the critical rendering path [38]. Specifically, there are From the domain-based fingerprints constructed certain render-blocking and critical objects that always above, the adversary can then obtain their correspond- need to be loaded prior to some other objects. ing IP-based fingerprints by repeatedly resolving the do- The chain of events from fetching an initial HTML main names into their hosting IP(s). Given that a do- file to rendering the website on screen is referred to as main name may be resolved to more than one IP, each the critical rendering path [38]. When visiting a web- domain in a domain-based fingerprint is converted to a site, the browser first contacts the primary domain to set of IP(s) with at least one IP in it. As domain-based fetch the initial HTML file (e.g., index.html). Next, fingerprints are comprised of two parts, the inherent this file is parsed to construct the DOM (Document structure of IP-based fingerprints also consists of two Object Model) tree. The browser then fetches several parts. The first part contains the IP(s) of the primary web resources from remote destinations to render the domain name, while the second part is a set of sets of webpage. Depending on the complexity of the webpage, IPs obtained by resolving the secondary domains. The these resources may include HTML, JavaScript, CSS, fingerprint then can be represented as: image files, and third-party resources, which may in turn {dp ip1 , dp ip2 , ..., dp ipn } + {{ds1 ip1 , ds1 ip2 , ..., ds1 ipn }, load more sub-resources hosted on other third-party do- {ds2 ip1 , ds2 ip2 , ..., ds2 ipn }, ..., {dsn ip1 , ..., dsn ipn }} mains [76]. According to the Internet Archive, a typi- cal website loads an average of 70 web resources as of To simplify the construction and matching of finger- this writing [7]. When considering this critical rendering prints, we reduce the above fingerprint by considering path, there are three important events that we can use the union of the sets of IP addresses of all secondary do- to cluster connections into three “buckets:” domLoad- mains (second part of the above fingerprint). The sim- ing, domContentLoaded, and domComplete. plified version of the IP-based fingerprint can thus be domLoading is triggered when the browser has re- represented by just two sets of IP addresses as: ceived the initial HTML file and parses it to construct {dp ip1 , dp ip2 , ..., dp ipn } + {ds1 ip1 , ds1 ip2 , ..., ds1 ipn , the DOM tree. As a result, multiple parallel connections to fetch critical resources referenced by the DOM tree ds2 ip1 , ds2 ip2 , ..., ds2 ipn , ..., dsn ip1 , ..., dsn ipn } are initiated right after this event is fired. Although it might seem that this simplification dis- domContentLoaded is triggered when both the cards some part of the structural information of the DOM and CSSOM (CSS Object Model) are ready [38], page, we found no significant difference in accuracy signaling the browser to create the render tree. The when evaluating both fingerprint formats. Therefore, we event is typically fired without waiting for style sheets, opt to use the latter fingerprint structure, as it is simpler images, and subframes to load [69]. After this event, sub- and allows for faster matching. sequent connections can often be observed for fetching
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 425 elements such as non-blocking style sheets, JavaScript on those websites that are legitimately visited in real- files, images, and subframes. world scenarios. Therefore, we choose to use domains domComplete is triggered when the website and from the Tranco top-site ranking list [104], since it has its sub-resources have been loaded. After this event been shown to have a good overlap with the web traffic is fired, non-essential objects can still be downloaded observed by the Chrome User Experience Report [84]. in the background, leaving the critical rendering path The research community often relies on one of unaffected. For example, external JavaScript files are the four top-site lists (Alexa [9], Majestic [11], Um- known to be render-blocking and are recommended to brella [12], and Quantcast [5]). However, studies have be moved to the end of the webpage or to be included discovered several issues with these lists that can neg- with a defer attribute of the tag [36]. There- atively impact research outcomes if not handled prop- fore, a small cluster of connections can often be observed erly [89, 104]. To remedy the shortcomings of these lists, after this event is triggered. the Tranco list is curated to aggregate the four afore- Based on these observations about the critical ren- mentioned lists, resulting in a list of more than seven dering path, we enhance the structure of our finger- million domains. Even then, due to the dynamic na- prints (both domain-based and the corresponding IP- ture of the web [15, 57], there are still domains that are based ones) to comprise the primary domain and three unstable, not responsive, or do not serve any web con- sets of domains corresponding to the three events afore- tent in the Tranco list, especially in its long tail [84, 89]. mentioned. The enhanced domain-based fingerprint can Therefore, we select the top 100K popular domains from then be represented as: the Tranco list for our study, because any ranking un- der 100K is not statistically significant, as suggested by dp + {ds1 , ds2 , ...} + {ds2 , ds3 , ...} + {..., dsn−1 , dsn } both top-list providers and previous studies [9, 89]. Note that a domain can appear in more than one While some websites are so common that visiting bucket if multiple objects are fetched from that domain them may be considered to be a very low privacy risk at different times. Accordingly, the representation of the (e.g., facebook.com, twitter.com, or youtube.com), enhanced IP-based fingerprint follows the same struc- the leakage of visits to more “sensitive” websites is def- ture, comprising four sets of IP addresses: i) the set of initely an important concern. This is even more so in IP addresses of the primary domain, and ii) three sets of oppressive regions, where browsing certain online con- IP addresses corresponding to the three buckets above. tent could be considered as a violation of local regu- lations [39, 72, 112]. To that end, we complement our dataset by manually choosing websites from the Alexa 5 Experiment Setup list that belong to categories deemed “sensitive,”1 such as LGBT, sexuality, gambling, medical, and religion. In In this section, we provide the details of how we set total, our dataset consists of 220,743 domains, includ- up and conducted our experiments for assessing the ef- ing the top 100K popular domains of the Tranco list2 fectiveness of IP-based website fingerprinting. In par- and 126,597 domains from Alexa’s sensitive categories. ticular, we discuss the rationale behind our test list of Among these, there are 5,854 common domains between websites, and the duration and location of our measure- the two data sources. ment. Although one may consider our test list as a closed- world dataset, it is infeasible to repeatedly crawl the 5.1 Selection of Test Domains entire Internet, which has more than 362.3M domains From an adversarial point of view, it is desirable for an registered at the time of our experiment [8]. It is also attacker to be able to reveal as many websites as possi- unlikely that a network-level adversary is interested in ble. It is, however, impractical to crawl the entire Inter- net, given that there are more than 362.3 million domain names registered across all top-level domains (TLDs) as 1 It is worth noting that sensitivity can be different from site of 2020 [8]. In addition, many of them are dormant or to site, depending on who, when, and from where is visiting even unwanted domains [100] that the majority of In- the site [42]. We intuitively choose these complementary do- mains based on our common sense of what is sensitive based on ternet users will never visit. As our goal is to assess the those categories that are often blocked by many Internet censors extent to which domain name encryption would prevent around the world [75]. the leakage of the majority of users’ browsing activi- 2 The list was created on March 3rd 2020, and is available at ties via IP-based website fingerprinting, we opt to focus https://tranco-list.eu/list/J2KY.
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 426 fingerprinting all websites on the Internet. However, the we have collected a total of 24 data batches during a ability of continuously conducting our WF attack on two-month period. To stimulate future studies in this re- more than 220K domains highlights the scalability of search domain, we make our dataset available to the re- our method. In fact, we cover an order of magnitude search community at https://homepage.np-tokumei. more domains compared to previous WF attacks against net/publication/publication_2021_popets. DoT/DoH traffic [19, 50, 96], in which the largest open- Our measurement is conducted from a cluster of ma- world setting comprised fewer than 10K domains [19]. chines located in a gigabit academic network in the US. Due to the rapid increase in the use of content deliv- ery networks (CDN), web content can be served from 5.2 Measurement Duration and Location multiple servers distributed across different locations, Prior work often overlooks the temporal aspect when depending on the origin of the request [45]. Although constructing fingerprints. More specifically, many efforts our dataset can be considered as representative for web are conducted in a one-off manner or over a short pe- users within our geographical area, it would have missed riod of time, neglecting the temporal characteristics of some IP addresses of CDN-hosted websites which can fingerprints whose websites’ content may evolve over only be observed at other locations. Nonetheless, as time [57]. For our work, in which we focus on IP-based mentioned in §3, the adversary in our threat model is fingerprints, the churn in domain–IP mappings is also local and also has access to the Internet from the same another major concern [45]. Due to the variability of network location as the monitored users (e.g., the ISP website content and hosting IPs across time, a previ- of a home or corporate user), and will not observe any ously constructed fingerprint may not be valid after a other IPs of CDN-hosted websites either. Adversaries certain time period. Therefore, a longitudinal measure- at different locations can always set up machines within ment study is essential to examine the robustness of their network of interest and conduct the same exper- fingerprints, which in turn impacts the efficacy of their iment with ours to construct a dataset of fingerprints use in WF attacks. that matches those websites browsed by users within Over a period of 60 days (from March 5th to May the network of their control. 3rd, 2020), we repeatedly crawled the 220K websites from our test list curated in §5.1, using the Chrome browser (desktop version 80.0), running on Ubuntu 6 Fingerprinting Accuracy 20.04 LTS. When visiting each website, we extract all Next, we evaluate the accuracy of our WF techniques domains contacted to construct the fingerprint for that using the data collected in §5. We begin with an analy- website using the steps discussed in §4. At the network sis of the information entropy that we can expect from level, we capture the sequence of destination IPs con- domain-based fingerprints, and then evaluate the accu- tacted, to evaluate the accuracy of our IP-based WF racy of IP-based fingerprints. method (§6). Note that the collection of this sequence of IPs is oblivious to the domains extracted indepen- 6.1 Fingerprint Entropy dently when loading each website. Due to DNS-based load balancing, many domains, As discussed in §3, the creation of IP-based fingerprints especially of popular websites, may map to different IP is based on the domains contacted while visiting the addresses at different times [45]. Therefore, once the set targeted websites. Therefore, it is important to first ex- of domains that were contacted to render each web- amine the uniqueness of these domains, as it impacts site is extracted, we continuously resolve them to ob- the effectiveness of IP-based WF. This will aid us in tain their IP addresses until the next crawl. This best- deciding whether a domain should be included or not effort approach allows us to obtain as many IPs as pos- as part of a fingerprint. For example, if a certain do- sible for those domains that employ DNS-based load main is contacted when visiting every single website, balancing. However, to make sure our experiment does then there is no point in including it. In contrast, if a not saturate DNS servers (thus affecting other legiti- unique domain is only contacted when visiting a partic- mate users), we enforce a rate limit of at least three ular website, it will make the fingerprint more distin- hours. In other words, contacted domains of a web- guishable. The more unique a domain is, the higher the site are only resolved again if they were not resolved information entropy that can be gained [94], resulting within the last three hours. The entire process for each in a better fingerprint. crawl batch takes approximately 2.5 days. As a result,
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 427 Domain IP Table 1. Top-ten domains that yield the lowest entropy. 100 80 Domain name # Websites Entropy Cumulative % 60 www.google-analytics.com 114K (55%) 0.87 40 fonts.gstatic.com 102K (49%) 1.03 fonts.googleapis.com 102K (49%) 1.04 20 www.google.com 76K (37%) 1.44 0 stats.g.doubleclick.net 72K (35%) 1.53 0 2 4 6 8 10 12 14 16 18 www.googletagmanager.com 64K (31%) 1.71 Information entropy (bit) www.facebook.com 53K (25%) 1.97 Fig. 1. CDF of the information entropy gained per domain/IP as connect.facebook.net 53K (25%) 1.98 a percentage of all the unique domains/IP addresses observed. googleads.g.doubleclick.net 49K (24%) 2.09 ajax.googleapis.com 34K (16%) 2.62 Let P (d) be the probability that a particular domain will be contacted when visiting a given website among (that have been observed to be) hosted on it. Note that the targeted websites. The information entropy in bits calculating the entropy using both average and median gained if that particular domain is contacted can be gives us similar results because most co-hosted domains calculated using the following formula: on the same IP addresses often provide a similar amount Information entropy = − log2 P (d) of information entropy. We thus choose the former one. Considering these two possibilities, we then calcu- The blue (solid) line in Figure 1 shows the entropy late the information entropy of the 340K IPs observed gained from each domain as a percentage of the approx- from our continuous DNS measurement in each crawl imately 475K unique domains observed in each crawl batch. As indicated by the red (dashed) line of Figure 1, batch. Almost 90% of these domains are unique to the 50% of the IPs provide at least 9 bits of information en- website from which they are referenced, yielding high tropy, while there is a group of more than 30% of the information entropy (17.7 bits). In contrast, there are IPs that provide a high amount of information entropy a few domains from which we can only gain a small (17.7 bits), which correspond to IPs hosting only a sin- amount of entropy. This result aligns well with the study gle domain. of Greschbach et al. [37] in which the Alexa top-site list was crawled and for 96.8% of the websites there exists at 6.2 Primary Domain to IP Matching least one domain that is unique only to these websites. Table 1 lists the top-ten domains that provide the Before evaluating our WF techniques, we first investi- least information entropy. Eight of them belong to gate whether WF is needed at all. Prior studies have Google and two belong to Facebook. These domains shown that a significant fraction of websites have a are commonly included in many websites, thus only one-to-one mapping between primary domains and their contributing a small amount of entropy. For instance, hosting IP(s) [45]. This first connection can be distin- www.google-analytics.com is included in more than guished from the subsequent requests since there is usu- half of the websites. However, we opt to keep them ally a noticeable time gap during which the browser as part of our fingerprints, as most of them still pro- needs to contact the primary domain to download the vide more than one bit of information, helping to differ- initial HTML file, parses it, and constructs the DOM entiate between websites that reference these external tree before multiple subsequent connections are initi- Google/Facebook services and those that do not. ated to fetch referenced resources. As a result, it is Based on the entropy for each domain computed straightforward for an adversary to target this very first above, we then calculate the entropy gained per IP ad- connection to infer which website is being visited. dress, since our ultimate goal is to perform WF at the More specifically, when a domain is hosted on one IP level. Given an observed IP, there are two possibili- IP or multiple IPs without sharing its hosting server(s) ties regarding the domain(s) it may correspond to. First, with any other domains, it is easy to infer the domain the IP may be associated with only a single domain, in from the IP(s) of its hosting server(s). We analyzed the which case its entropy can be deduced directly from the DNS records of all primary domains in our dataset to domain’s entropy. Second, the IP may co-host multiple quantify the fraction of websites that can be finger- domains. In this case, the IP’s entropy is calculated by printed by just targeting their primary IP(s). We find taking the average of the entropy values of all domains that 52% of the websites studied have their primary
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 428 Table 2. Percentage of successfully identified websites using i) naive primary domain to IP matching (§6.2), ii) basic fingerprinting (§4.1, §6.3), and iii) enhanced fingerprinting with connection bucketing (§4.2, §6.4). Website type Total Primary Domain IP-based Fingerprinting Connection Bucketing All websites crawled 208,191 107,455 (52%) 174,662 (84%) 189,527 (91%) Popular websites 93,661 58,989 (63%) 86,147 (92%) 90,231 (96%) Sensitive websites 120,293 51,538 (43%) 93,988 (78%) 104,983 (87%) Sensitive and popular 5,763 3,072 (53%) 5,473 (95%) 5,687 (99%) domain hosted on their own IP(s), while the remaining the popular websites and 78% of the sensitive websites. 48% are co-hosted on a server with at least another web- More worrisome is the fact that 95% of sensitive and site. This result means that an adversary can already popular websites can be fingerprinted. infer 52% of the targeted websites based solely on the IP address of the very first connection to the primary 6.4 Enhanced Website Fingerprinting with domain, without having to consider secondary connec- Connection Bucketing tions. The third column of Table 2 shows the breakdown of these websites in terms of their popularity and sen- We next evaluate the effectiveness of the enhanced sitivity. Note that the total number of websites shown WF (§4.2), in which we take the critical rendering path here is lower than the total number of test websites se- into consideration to cluster IPs into three buckets. Sim- lected in §5 due to some unresponsive websites when ilarly to the basic fingerprints (§6.3), given a sequence conducting our experiment. of IPs [ip0 , ip1 , ip2 , ..., ipn ], we first scan ip0 against all fingerprints to create a pool of candidate fingerprints. For the subsequence [ip1 , ip2 , ..., ipn ], our goal is to split 6.3 Basic IP-based Website Fingerprinting it into three buckets of connections that can potentially To fingerprint the remaining 48% of websites whose pri- be matched with the three buckets of IPs in the IP- mary domains are co-hosted, an adversary would need based fingerprints. Based on the time of each connection to analyze the second part of their IP-based fingerprint initiation captured at the network level (§5.2), we use that captures the IP addresses of secondary domains. k-means clustering [30, 60, 63] to split them into three Going back to the way we build our IP-based fin- sets of IPs. gerprints in §4.1, the basic IP-based fingerprint has two For every candidate fingerprint, we intersect each parts. The first part consists of the primary domain’s bucket of IPs in the fingerprint to the corresponding IP(s), and the second part comprises a set of IPs ob- bucket of IPs captured from the network trace. Then, tained by resolving all secondary domains. Given a se- for each matching IP, we add its entropy to the total quence of IPs [ip0 , ip1 , ip2 , ..., ipn ] observed from a net- amount of entropy gained for that particular fingerprint. work trace, we first scan ip0 against the primary part of Finally, we choose the fingerprint with the highest en- all IP-based fingerprints, which are created by repeated tropy to predict the visited website. active DNS measurements (§5.2). If ip0 is found among Using this approach, the accuracy rate can be im- the primary IPs of a given fingerprint, the fingerprint proved to 91%. The breakdown of fingerprinted websites is added to a pool of candidates. We then compare the is shown in the last column of Table 2. For the popular subset {ip1 , ip2 , ..., ipn } with the secondary part of each and the sensitive websites, we obtain an accuracy rate candidate fingerprint. For each matching IP, we add the of 96% and 87%, respectively. However, a more alarm- entropy provided by that IP to the total amount of en- ing result is that 99% of sensitive and popular websites tropy gained for that particular candidate fingerprint. can be precisely fingerprinted, posing a severe privacy Finally, we choose the fingerprint with the highest total risk to their visitors. entropy to predict the website visited. We now look into the remaining 9% (18,664) of web- Using this IP-based WF method we obtained an sites for which we could not find an exact match. As increased matching rate of 84%—that is, 84% of the shown in Figure 2, 20% of these websites have only websites in our data set were identified with 100% ac- two matching candidates, while about 50% of them curacy. The breakdown of the fingerprinted websites is have only up to ten matching candidates. By manually shown in the fourth column of Table 2. Among these examining some of these fingerprints, we found many fingerprinted websites, we could precisely match 92% of cases in which the matching candidate domains actu-
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 429 100 5 days 15 days 2 months 10 days 1 month Cumulative % 80 100 60 80 Cumulative % 40 60 20 40 0 10 0 10 1 102 10 3 20 Number of candidate websites 0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Fig. 2. CDF of candidate websites per fingerprint for the remain- Stable Unstable Difference degree ing 9% (18,664) of websites that could not be matched. Fig. 3. CDF of the stability of domain names loaded in each web- ally point to the same website (but without redirect- site as a percentage of all websites studied. ing to the same domain). These are mostly owned by organizations who have registered the same name un- previously observed at time t0 , yielding a difference de- der different TLDs (e.g., bayer.de vs. bayer.com), or gree of 0. In contrast, a website is considered as unstable variations of the domain (e.g., christianrock.net vs. when its difference degree is 1, meaning that all domains christianhardrock.net) to protect their brand against observed at time t1 are different from those previously domain squatting [100]. A more determined adversary seen at t0 . could invest the effort to implement more advanced Figure 3 shows the stability of the websites studied techniques for identifying such duplicate websites. For in terms of the domains contacted while visiting them. instance, string similarity can be used to cluster similar About 30% of the websites contact the exact same set of domains, while image similarity can be used to group domains to download web resources for the whole two- websites with similar screenshots of the start page. month period of our study. Within a five-day period, 80% of the websites are still almost completely stable, with a difference degree lower than 0.1, while this per- 7 Fingerprint Stability centage decreases to 50% over the two-month period. As mentioned in our threat model, the efficacy of a WF Understandably, almost half of the websites we study attack also depends on the stability of fingerprints over are the most popular on the Internet. Hence, it is ex- time. There are two primary reasons why a website fin- pected that their content will be changed or updated on gerprint may go stale. First, the website may change a regular basis. However, even after two months, almost over time with existing elements removed and new ele- 80% of the websites are still stable, with a difference de- ments added [57]. Second, the mapping between a do- gree lower than 0.3, meaning that 70% of observed do- main and its hosting IP(s) may also change [45]. Con- mains are still being used to host web resources needed sequently, a previously constructed fingerprint may no to render these websites. longer be valid after a certain time period. This is a favorable result for the adversary, as it Since our IP-based fingerprints are constructed shows that domains are an effective and consistent fea- based on domains contacted while browsing the targeted ture. The result particularly implies that the adversary websites, we first examine the extent to which these does not need to keep crawling all websites repeatedly websites are stable in terms of the domains that they to construct domain-based fingerprints. Based on the reference. We introduce a difference metric to quantify results of Figure 3, the adversary perhaps can divide the change in this set of domains for a given website websites into two groups comprising stable and less sta- over time as follows. Let Dt0 and Dt1 be the sets of ble websites. For instance, the stable group consists of contacted domains observed when browsing a website 80% of websites with a difference degree lower than 0.2 at time t0 and t1 , respectively, the difference degree for after ten days, while the less stable group consists of the this website is calculated as: remaining 20% of websites. Then, the adversary would (Dt0 ∪ Dt1 ) − (Dt0 ∩ Dt1 ) only need to re-crawl the less stable ones every ten days Difference degree = to keep their domain-based fingerprints fresh, instead of Dt 0 ∪ Dt 1 all websites. Based on this definition, we consider a website as However, in our threat model, what can be actu- stable during a period t0 → t1 when the set of domains ally observed by the adversary is only IP addresses. We observed at time t1 have not changed compared to those
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 430 5 days 15 days 2 months 100 10 days 1 month 80 Accuracy % 100 60 80 Cumulative % 40 60 20 40 0 20 ≤ 2.5 5 10 15 20 25 30 35 40 45 50 55 60 Days since fingerprints constructed 0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Stable Unstable Fig. 5. Fingerprint stability over time. Even after two months, Difference degree 70% of the tested websites can still be fingerprinted. Fig. 4. CDF of the stability of IP addresses in each IP-based fin- son why our IP-based fingerprints are still effective at gerprint as a percentage of all websites studied. revealing 70% of the targeted websites even though a thus apply the same difference formula to quantify the large number of IP-based fingerprints have changed sig- stability of IP-based fingerprints. Unlike domain-based nificantly after two months, as indicated in Figure 4. fingerprints, IP-based fingerprints become stale faster, The above finding means that the adversary can in- as shown in Figure 4. We find that only 10% of the IP- telligently split domains into two groups, based on pre- based fingerprints contain the same set of IPs over the viously observed data. The first group consists of do- course of two months. After ten days, 60% of the finger- mains whose IPs are dynamic, while the second group prints have more than 30% of their IPs changed. After contains domains whose IPs remain static over a config- two months, half of the IPs have changed in more than urable amount of time. The adversary then only needs 50% of the fingerprints. to periodically perform DNS lookups for the first group Given these results, we investigate how the instabil- after a desired amount of time has passed, depending on ity of the IP-based fingerprints impacts the accuracy of the network overhead and resources the adversary can our WF attack. We consider an attack scenario in which sustain for conducting the attack. the adversary uses fingerprints constructed in the past to track the users’ browsing activities at a future time. 8 Fingerprint Robustness Figure 5 shows the accuracy (i.e., the percentage of suc- cessfully identified websites) of our enhanced WF ap- We next examine the impact of HTTP caching on the proach over the course of two months. Within 2.5 days3 effectiveness of our WF since resources are often cached after their generation, our fingerprints consistently yield by web browsers to improve websites’ performance. In a high accuracy of 91%. Over the course of two months, addition, our WF also exploits the fact that websites we can see a gradual decrease in the accuracy. However, often load external resources, including images, style this decrease is quite modest, as after five to ten days sheets, fonts, and even “unwanted” third-party analytics since their construction the fingerprints can still be used scripts, advertisements, and trackers [76], which result to accurately identify about 80% of the websites. This in a sequence of connections to several servers with dif- number only decreases to about 70% after two months. ferent IPs, making the fingerprints more unique. Thus, Although IP-based fingerprints go stale faster com- we also investigate whether blocking these unnecessary pared to their domain-based fingerprints, those IP ad- resources would help make websites less distinguishable, dresses that change frequently mostly correspond to sec- thus reducing their fingerprintability. ondary domains, and only a small fraction corresponds to primary domains (see Appendix A for details). The vast majority of primary domains are hosted on mostly 8.1 Impact of HTTP Caching on Website static IP addresses for the whole period of our study. Fingerprinting Accuracy As a result, the persistently stable IP addresses of these When a website is revisited, cached resources can be primary domains in the IP-based fingerprints is the rea- served from the local cache without the browser fetching them again from their origins. Since our attack is based solely on the observation of the IP of connections to 3 The lowest time granularity is 2.5 days because each crawl remote destinations, we are interested in examining the batch in our dataset requires this amount of time to be collected, fraction of cacheable resources and the extent to which as discussed in §5. HTTP caching impacts the effectiveness of our WF.
Domain Name Encryption Is Not Enough: Privacy Leakage via IP-based Website Fingerprinting 431 100 84 Cumulative % 80 83 Accuracy % 60 82 40 20 81 0 80 -1 0 5m 1h 5h 1d 5d 1M 1Y First 5m 10m 1h 5h 10h 1d 5d 10d 1M 2M Freshness timeline visit Revisit after (x) amount of time Fig. 6. CDF of the freshness timeline of HTTP resources. (m: minute, h: hour, d: day, M: month, Y: year) Fig. 7. The accuracy of IP-based fingerprinting attack when ex- cluding destinations where cacheable resources are hosted. (m: minute, h: hour, d: day, M: month) Analyzing the response header of 21.3M objects ob- served while crawling the tested websites, we find that There are two primary reasons why our attack is 86.1% of them are cacheable. In other words, these still highly effective although the majority of resources HTTP resources can be stored and served from the local are cacheable. First, excluded IPs often host long-term cache without being downloaded again from the remote cacheable shared resources, such as fonts and JavaScript servers when being revisited. code, which contribute only a small amount of entropy Utilizing the cache-control information in the to the fingerprint if included. Second, for cacheable re- HTTP response header, we compute the freshness time- sources hosted on IPs with high entropy, not all re- line for each resource. The freshness timeline is the sources have the same freshness timeline. In fact, we find amount of time during which the browser can store that half of the origins that host resources cacheable for and serve resources from its cache without download- more than one hour also serve another resource with ing them again from their original servers. Figure 6 a freshness timeline shorter than five minutes, causing shows the distribution of the freshness timeline of 21.3M at least one network connection to the original server if objects. The value “-1” denotes uncachable resources revisited. (13.9%) that must be downloaded again from their ori- gin if revisited, while “0” indicates cacheable resources (21.9%) that always need to be revalidated with their origin. In other words, these two types of resources 8.2 Fingerprinting Under Ad Blocking will always cause a network connection to their original Due to the prevalence of ads and analytics scripts that servers if revisited. On the other hand, the remaining harvest users’ information [86], many advertisement and 64.2% of resources can be loaded directly from the local tracker blocking tools have been developed to protect cache without making any network connections. user privacy. Of these tools, the Brave browser has stood Next, we evaluate the impact of cacheable resources out to be one of the best browsers for user privacy on on our attack accuracy by excluding IPs on which the Clearnet to date [56]. cacheable resources are hosted. We use the basic finger- Therefore, we opt to use the Brave browser for in- printing method here for our evaluation (§4.1) instead vestigating the impact of ad blocking on IP-based WF. of the enhanced one (§4.2), because revisited resources During the last four batches of our data collection pro- may not be freshly loaded in the order of the critical cess (i.e., ten days), at the same time with crawling the rendering path as in the first visit. test websites (without blocking ads and trackers), we in- Although many web resources are cached, we could strumented the Brave browser (desktop version 1.6.30) still obtain a high accuracy. As shown in Figure 7, to crawl these websites a second time. While the Brave even when websites are revisited after only five minutes, browser is loading each website, we also capture (1) the meaning that the majority of resources can be served set of domains contacted to fetch web resources for ren- from the local cache, an accuracy of 80% can still be dering the website, and (2) the sequence of IPs con- obtained—a decrease of just 4% (from 84%) compared tacted to fetch web resources. to when websites are visited for the first time. If web- Using the same fingerprinting techniques as in §6, sites are revisited after one hour, one day, or one month, we then match the sequences of IP addresses observed our basic WF attack can obtain a gradually increased while browsing with the Brave browser to our IP-based accuracy of 80.8%, 81.4%, and 82.3%, respectively. fingerprints. As expected, the fingerprinting accuracy
You can also read