A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Proceedings on Privacy Enhancing Technologies ; 2020 (2):24–44 Zhiju Yang and Chuan Yue* A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments Abstract: Web measurement is a powerful approach to studying various tracking practices that may compro- 1 Introduction mise the privacy of millions of users. Researchers have Web tracking is frequently performed over the Internet built several measurement frameworks and performed by various trackers to collect the information of users’ a few studies to measure web tracking on the desk- browsing activities for various purposes including per- top environment. However, little is known about web sonalized advertisement, targeted attacks, and surveil- tracking on the mobile environment, and no tool is lance [12, 21, 32]. Traditionally, stateful HTTP cook- readily available for performing a comparative measure- ies are used as the dominant technique to track online ment study on mobile and desktop environments. In this users [27]. In recent years, advanced stateful tracking work, we built a framework called WTPatrol that allows techniques such as Flash cookies and advanced stateless us and other researchers to perform web tracking mea- tracking techniques such as browser or device finger- surement on both mobile and desktop environments. printing have also become very popular over the Inter- Using WTPatrol, we performed the first comparative net [1, 3, 24, 30]. User privacy has been keeping com- measurement study of web tracking on 23,310 websites promised by trackers that utilize these techniques. that have both mobile version and desktop version web- Web measurement is a powerful approach to study- pages. We conducted an in-depth comparison of the web ing online tracking practices and techniques. Re- tracking practices of those websites between mobile and searchers have built several measurement frameworks desktop environments from two perspectives: web track- and performed a few studies to measure web tracking on ing based on JavaScript APIs and web tracking based the desktop environment [2, 11, 19, 21]. However, almost on HTTP cookies. Overall, we found that mobile web all the existing measurement studies were performed on tracking has its unique characteristics especially due to the desktop environment. Little is known about web mobile-specific trackers, and it has become increasingly tracking on the mobile environment, and no tool is read- as prevalent as desktop web tracking. However, the po- ily available for performing a comparative measurement tential impact of mobile web tracking is more severe study on mobile and desktop environments. To the best than that of desktop web tracking because a user may of our knowledge, only Eubank et al. [13] measured the use a mobile device frequently in different places and role played by HTTP cookies and JavaScript in mobile be continuously tracked. We further gave some sugges- web tracking on 500 websites. tions to web users, developers, and researchers to defend Measuring web tracking specifically on the mobile against web tracking. environment is necessary and important because of the Keywords: Web tracking, measurement, user privacy, following reasons. First, mobile web browsing has been mobile, JavaScript API booming and overtaking the desktop web browsing over DOI 10.2478/popets-2020-0016 the years [35]. Second, web tracking measurement on the Received 2019-08-31; revised 2019-12-15; accepted 2019-12-16. desktop environment cannot represent the complete web tracking practices because many websites have mobile- specific versions which could have different tracking practices. Third, certain web techniques such as plu- gin support are implemented differently between mo- bile and desktop browsers, which could lead to web tracking practice differences. Fourth, many web tech- niques and events such as those device sensor related Zhiju Yang: Colorado School of Mines, E-mail: zhi- ones are mobile-specific, and can also be leveraged for juyang@mines.edu tracking mobile web users. Therefore, measuring mo- *Corresponding Author: Chuan Yue: Colorado School of bile web tracking by simply using a desktop browser to Mines, E-mail: chuanyue@mines.edu
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 25 mimic a mobile browser may lead to inaccurate results sites that have both mobile version and desktop version as we verified in an experiment (Section 3). webpages; (3) we conducted an in-depth comparison of In this work, we built a framework called WTPa- web tracking practices of those websites between mo- trol that allows us and other researchers to perform bile and desktop environments from the perspectives of web tracking measurement on both mobile and desk- JavaScript APIs and HTTP cookies; (4) we gave some top environments. WTPatrol consists of three modules: suggestions to web users, developers, and researchers to automation driver which serves to automatically control defend against web tracking. the crawling behavior of a browser, WTPatrol browser Section 2 reviews the related work. Section 3 de- extension which instruments a browser for detecting scribes the design of our WTPatrol measurement frame- a variety of web tracking practices, and data analyzer work. Section 4 presents our data collection, dataset, which analyzes and compares web tracking related data and tracker definition. Section 5 presents and analyzes collected from mobile and desktop websites. our measurement results. Section 6 gives our sugges- Using WTPatrol, we performed the first compara- tions. Section 7 concludes this paper. tive measurement study of web tracking on 23,310 web- sites that have both mobile version and desktop version webpages. We conducted an in-depth comparison of web tracking practices of those websites between mobile and 2 Related Work desktop environments from two perspectives: tracking Web tracking compromises the privacy of users by asso- based on JavaScript APIs and based on HTTP cookies. ciating their identities with their browsing activities on From the perspective of JavaScript APIs, we in to- different websites. In this section, we review the studies tal identified 5,835 different trackers 1 and found that on investigating the effectiveness of web tracking tech- (1) 762 (13.1%) trackers are mobile-specific because niques, and the studies on measuring the adoption of they only appeared on mobile websites, (2)1,783 (30.6%) different web tracking techniques in the wild. trackers are desktop-specific because they only appeared on desktop websites, and (3) 3,290 (56.3%) trackers ap- peared on both mobile and desktop websites. From the 2.1 Effectiveness of Web Tracking perspective of HTTP cookies, we in total identified 5,574 different trackers and found that (1) 695 (12.5%) track- Techniques ers are mobile-specific because they only placed cook- ies on mobile websites, (2) 1,536 (27.6%) trackers are In general, there are two types of web tracking tech- desktop-specific because they only placed cookies on niques: stateful and stateless. In term of stateful tech- desktop websites, and (3) 3,343 (59.9%) trackers placed niques, a tracking website usually generates some unique cookies on both mobile and desktop websites. Overall, identifiers, saves them to users’ machines, and tracks we found that mobile web tracking has its unique char- users when the identifiers are carried back. In con- acteristics especially due to mobile-specific trackers, and trast, stateless techniques are usually fingerprinting- it has become increasingly as prevalent as desktop web based, and a tracking website derives unique user iden- tracking in terms of the overall volume of trackers. How- tifiers or fingerprints from the attributes of browsers, ever, the potential impact of mobile web tracking is operating systems, and/or devices. more severe than that of desktop web tracking because HTTP cookies are the oldest stateful tracking tech- a user may use a mobile device frequently in different niques [17], but recent studies such as the one per- places and be continuously tracked. formed by Roesner et al. [27] showed that they are still Our paper makes the following major contribu- widely used on many websites. In addition to HTTP tions: (1) we built a web tracking measurement frame- cookies, many advanced stateful web tracking tech- work, WTPatrol, that allows researchers to perform web niques have also widely appeared over the years. For ex- tracking measurement studies on both mobile and desk- ample, supercookies are Adobe Flash files, HTTP Etag top environments; (2) we performed the first compara- is an HTTP response header field, and HTML5 local tive measurement study of web tracking on 23,310 web- storage is a browser’s local storage for large data ob- jects; they can all be used to store stateful information at the client-side and track web users [1, 3, 24, 30]. Due to the awareness of privacy issues caused 1 A tracker in this paper is identified at the granularity of the fully qualified domain name (FQDN) [50]. by stateful techniques and the capability for privacy-
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 26 conscious users to disable such techniques, more web- OpenWPM-Mobile which still runs on a desktop envi- sites have started to use stateless web tracking tech- ronment; they found a significant overlap between track- niques such as browser fingerprinting. Browser finger- ing scripts and scripts accessing mobile sensor data. printing identifies unique web browsers based on their Lerner et al. [19] analyzed the Internet archive data from attributes. Researchers often evaluate the effectiveness 1996 to 2016, and found that web tracking has become of browser fingerprinting techniques by setting up a web- more sophisticated and pervasive over the years. site to invite visitors and collect the attributes of their Existing measurement studies exemplified above browsers. For example, in the influential Panopticlick were all performed using desktop computers. Measur- study [10], Eckersley collected 470,161 browser finger- ing mobile web tracking directly using mobile devices is prints composed of 10 attributes, and found that 83.6% necessary and important (Section 1), and it needs the of those browser fingerprints are unique. Similar to the support from tools that ideally leverage a full-fledged Panopticlick study, Laperdrix et al. collected 118,934 mobile browser. To the best of our knowledge, only Eu- browser fingerprints composed of 17 attributes [18], and bank et al. measured mobile web tracking using a mobile found that 90% of desktop browser fingerprints and 81% browser [13]. In more details, they ported the Fourth- of mobile browser fingerprints are unique, respectively. Party extension to the mobile Firefox, and measured the Researchers also demonstrated the fingerprintabil- role of HTTP cookies and JavaScript played in mobile ity of some other individual attributes such as those web tracking on 500 websites. In contrast, we performed of browser extensions [31], JavaScript engines [22, 23], the first comparative measurement study on 23,310 web- and HTML5 battery status [25]. Meanwhile, some re- sites, and conducted an in-depth analysis of JavaScript searchers showed the fingerprintability of smartphones API based and HTTP cookie based web tracking prac- by exploiting the manufacturing imperfections in hard- tices between mobile and desktop environments. ware devices such as speakers and microphones [4, 7, 34] as well as accelerometer and gyroscope motion sen- sors [4, 8, 9]. It is worth noting that third-party tracking also occurs in mobile apps [14, 15, 20], although the fo- 3 Measurement Framework cus of our work is on browser-based tracking. A few measurement tools are available on desktop en- vironments as described in Section 2.2. However, there is no measurement framework readily available for re- 2.2 Web Tracking Measurement Studies searchers to perform comparative web privacy measure- ment studies on mobile and desktop environments, de- To explore the extent to which web tracking techniques spite that mobile web browsing has been booming and reviewed above are adopted in the wild, researchers have overtaking the desktop web browsing over the years [35]. developed different tools and conducted several mea- Measuring mobile web tracking directly using mo- surement studies. bile devices (instead of using desktop computers without Mayer et al. [21] implemented a Firefox browser ex- or with emulated mobile browser user agents) is neces- tension, FourthParty, which can intercept HTTP traffic sary and important. This is because many differences and monitor Document Object Model (DOM) events. between mobile and desktop websites can incur inaccu- They investigated third-party web tracking policies and racy if we use a desktop-based tool to measure mobile techniques by using this tool over 500 websites. Acar websites. We verified this inaccuracy by examining the et al. [2] developed a measurement tool, FPDetective, differences between visiting websites using a real mobile which is built upon a stripped-down browser Phan- Firefox browser and using an emulated mobile Firefox tomJS and the Chromium browser driven by Sele- browser (which is a desktop Firefox browser configured nium [56]. They measured the top million Alexa web- with the user agent string of a real mobile Firefox and sites, identified new fingerprinting scripts and Flash the screen resolution of a smartphone). objects, and found that web tracking techniques are In more details, we visited the homepages of the widely used. More recently, Englehardt et al. [11] de- top 100 Alexa websites and compared the correspond- veloped OpenWPM, which is a comprehensive web pri- ing webpage sources between the real mobile browser vacy measurement platform that directly leverages a and the emulated one. We found that 15 out of the full-fledged Firefox browser. They also measured the top top 100 Alexa websites returned different webpages (in million Alexa websites, and showed the prevalence of terms of the DOM structure, JavaScript content, or many web tracking techniques. Das et al. [6] developed
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 27 page content) to the two browsers. The proportional 3.1 WTPatrol Browser Extension distances (Defined in Section 4.2) of returned webpages of these 15 websites vary from 0.03 to 0.95, with the As reviewed in Section 2.2, OpenWPM [11] is a recent average, standard deviation, and median values at 0.41, platform for measuring web tracking on desktop envi- 0.34, and 0.27, respectively. Specifically, five websites ronments. It mainly uses a Firefox browser extension returned two different versions of webpages to these expanded from FourthParty [21] to measure web track- two browsers (i.e., the emulated mobile browser was de- ing practices. This approach of using a browser exten- tected as a desktop browser, and the desktop version sion and a full-fledged browser enables more complete webpages are at least 38% larger in file size than the web tracking measurement than using a stripped-down corresponding mobile version webpages). For the rest browser such as PhantomJS [11]. Therefore, we build ten websites, eight of them returned webpages differing our WTPatrol browser extension upon the Firefox ex- in at least one script element as exemplified in Table 5 tension used in OpenWPM by porting it to mobile Fire- of Appendix B, while two of them returned webpages fox and further expanding it with new capabilities. Since differing in at least one large div element with more a desktop Firefox extension does not automatically work than six child elements. This result indicates that web- on mobile Firefox, to begin with, we modified the code of sites may use other techniques beyond checking the user the OpenWPM Firefox extension to make it compatible agent string and screen resolution to detect the type of with mobile Firefox (specifically Firefox for Android) a browser, and would not return mobile content to an using the Firefox Add-on SDK. We then expanded this emulated mobile browser. mobile version extension from three aspects. Note that Das et al. [6] built OpenWPM-Mobile to First, we made the architectural change. OpenWPM more realistically emulate Firefox for Android running integrates its data storage into a separate module writ- on a real Moto G5 Plus smartphone. Because emulating ten in Python instead of into its extension. While this devices is complicated as it requires manual fine-tuning approach lessens the burden of the Firefox browser, it for every new device that comes out, we in this work increases the complexity and incurs new dependency by take the simpler approach of using real mobile devices having the extra communication with the data storage although it has its own limitations. It is not very scalable module installed on the running platform. Therefore, in terms of using multiple (or multiple types of) devices to simplify the usage of our extension and remove the and is relatively expensive, in comparison to the use of dependency on an extra data storage module, we inte- emulated mobile devices. grated the data storage directly into the extension by Therefore, we set out to develop a framework, WT- taking the same approach as in FourthParty [21]. Patrol, that can allow researchers to accurately measure Second, we enriched the list of instrumentation for web tracking practices on both mobile and desktop en- tracking techniques. As shown in Figure 1, our WT- vironments, and comprehensively perform comparative Patrol browser extension mainly measures three types web privacy measurement studies. Figure 1 illustrates of techniques that can be potentially utilized to track the architecture of WTPatrol with three modules: au- web users: JavaScript API, HTTP request and response, tomation driver which automatically controls the crawl- and event registration. Table 1 summarizes all the 34 ing behavior of a browser, WTPatrol browser extension JavaScript APIs (or objects) instrumented in the WT- which instruments a browser for detecting a variety of Patrol browser extension. We added new instruments web tracking practices, and data analyzer which ana- to measure all the event handler registration activi- lyzes and compares web tracking related data collected ties with the initial goal to capture mobile-specific web from mobile and desktop websites. tracking techniques. In more details, by instrumenting window.addEventListener, our WTPatrol browser ex- tension is capable of, for example, capturing device- motion events which are related to the motion sensor access, and devicelight events which are related to the ambient light sensor access. Third, we added a self-crawling mode to the exten- sion, so that it can automatically visit a list of websites one by one as OpenWPM does. For each website, our extension can close the previous browser tab and open a Fig. 1. The architecture of the WTPatrol framework.
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 28 new one to visit its homepage; it stays for a configured tice differences between the two environments. We now period of time before visiting the next website. present data collection, dataset, and tracker definition. Based on this completed WTPatrol browser exten- sion for mobile Firefox, we also created a desktop version with the same capabilities. So now web tracking mea- 4.1 Data Collection surement studies can be fairly performed and compared on both mobile and desktop environments. In our study, we used four mobile devices, which are Google Nexus 6p smartphones installed with Firefox for Android and mobile WTPatrol. We also used four Table 1. Thirty four JavaScript APIs (or objects) that can be utilized to perform web tracking. desktop computers installed with Ubuntu 14.04, desk- Tracking Technique Tracking Technique top Firefox, and desktop WTPatrol. The version of both window.navigator.userAgent window.document.cookie Firefox for Android and desktop Firefox is 53.0 with- window.addEventListener window.navigator.plugins out the tracking protection function. Each data collec- window.navigator.language window.screen.colorDepth window.localStorage window.Storage tion experiment includes two stateless2 measurements CanvasRenderingContext2D window.name on the mobile and desktop environments, respectively. window.navigator.platform window.navigator.appName From our campus network, we ran the two stateless mea- window.sessionStorage AudioContent window.navigator.appVersion window.navigator.product surements side by side on the homepages of 116,000 window.navigator.vendor HTMLCanvasElement websites3 , among which 100,000 websites are the top window.navigator.doNotTrack window.performance Alexa websites4 and 16,000 websites are sampled lower window.navigator.mimeTypes window.screen.pixelDepth window.navigator.onLine window.navigator.oscpu rank websites. The way we sampled the lower rank web- window.screen.orientation ScriptProcessorNode sites is selecting the first 2,000 of every 100,000 websites window.navigator.geolocation RTCPeerConnection within the rank from 200,000 to 1,000,000 based on the window.navigator.buildID AnalyserNode window.navigator.cookieEnabled OscillatorNode Alexa top one million site list, resulting in eight groups. window.navigator.appCodeName GainNode After each homepage is loaded, WTPatrol collects the data for 20 seconds, among which the last five seconds are used to scroll down the current webpage five times 3.2 WTPatrol Automation Driver with an interval of one second and a scrollHeight of the HTML body element. The rendered webpage source will Automation is essential for performing large-scale web be saved after the last scroll for analysis. measurement studies. For the sake of simplicity, we did Considering the dynamic nature of web content not choose automation tools such as Selenium [56] to (e.g., dynamic advertising and time differences), we per- drive the Firefox browser and our browser extension; formed two data collection experiments with one day instead, we wrote shell scripts and leveraged Android difference between them using two sets of mobile de- Debug Bridge (ADB) to build our mobile version au- vices and desktop computers. These two experiments, tomation driver. Specifically, our shell scripts can pro- referred to as Experiment A and Experiment B, were grammatically start/stop the browser, prepare browsing performed from June 10th, 2019 to June 30th, 2019. profiles, and configure our WTPatrol browser extension. That is, in both experiments, we collected data from The shell scripts are executed on a desktop computer, all the 116,000 websites, but the data for each web- and they communicate with an Android device via USB site in Experiment B was collected one day after that using the ADB service. In contrast, our desktop ver- for the same website in Experiment A. Due to reasons sion automation driver executes shell scripts directly on such as timeout and site server errors, in total we have a terminal to drive a desktop Firefox browser and our 96,093 and 98,335 successful website visits with webpage browser extension. sources saved in Experiments A and B, respectively, and have 89,828 successful website visits in both ex- periments. We further identified from these 89,828 sites 4 Data Collection, Dataset, and 2 With the browser state being cleaned for each website visit. Tracker Definition 3 To speed up the data collection, we divided the 116,000 sites into roughly two halves, and used two sets of smartphones and We perform measurements using both mobile devices desktop computers to crawl each half independently. and desktop computers to capture web tracking prac- 4 The Alexa top one million site list is dated on June 3rd, 2019.
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 29 a list of 23,310 sites that have both desktop and mo- and A2 using Formula 1: bile versions of homepages. In other words, when each Pn D(t1 [i], t2 [i]) of these 23,310 sites is visited by a desktop browser and P D(t1 , t2 ) = Pi=1 n (1) a mobile browser, respectively, the two returned home- i=1 L(t1 [i], t2 [i]) pages are not very similar, indicating that the website The authors in [5] used this metric to compare the simi- generates different webpages between the two environ- larity between two phishing webpages. We use this met- ments and a comparison between them is meaningful. ric to compare the similarity between the mobile ver- sion and the desktop version webpages of the same site. Note that we provided an example of the proportional 4.2 A Dataset of 23,310 Sites distance calculation in Appendix A. Basically, the larger a proportional distance is, the We are not aware of an existing list of websites that have bigger the difference between a pair of mobile webpage both desktop and mobile versions, so we compare the and desktop webpage, indicating that the given website pair of desktop and mobile webpages in terms of their is more likely to have two different versions on mobile similarity for each of those 89,828 websites to eventually and desktop environments. We calculated the propor- identify that list of 23,310 sites as the dataset. There tional distance values of the 89,828 websites in Exper- are many ways to compare the similarity between two iments A and B individually. We found that for 35.4% webpages, ranging from simple but strict comparison on and 36.2% of all the 89,828 websites in Experiments A whether two pages are identical, to complex but flexi- and B, respectively, the proportional distance between ble DOM tree edit distance calculation. We leveraged a each pair of mobile webpage and desktop webpage is 0, simple yet flexible proportional distance metric proposed indicating that the two versions are identical or nearly in [5] to quantify the similarity between two webpages. identical between mobile and desktop environments. This metric is the ratio of the HTML tag-vector level Strictly speaking, once the proportional distance is Hamming Distance to the number of tags that appear greater than 0, some differences exist between a mobile in at least one of the two webpages, so that the former is webpage and the corresponding desktop webpage. How- normalized to take into account the webpage complexity ever, considering that some subtle differences could be differences among many different websites. incurred due to the dynamics of webpages such as time- Formally, the proportional distance between two sensitive strings and dropdown lists, choosing a thresh- webpages A1 and A2 is calculated as follows [5]: (1) de- old value that is greater than 0 but not too large will fine an arbitrary but fixed ordering of a corpus of n be helpful for us to suppress such noises. HTML tags (n=107 as in [5], that is, we consider 107 To select a reasonable proportional distance value different HTML tags which are from the complete set as the threshold for identifying whether a given website of tags provided by World Wide Web Consortium [60] has two specific versions, we randomly sampled 700 web- after removing some of more common tags such as sites from seven groups5 that have different proportional , , and but including tags such distance values and manually inspected their webpage as ), and define a “vector” as an array of n in- content. In more details, we started from the group of tegers; (2) construct the vector t1 for the webpage A1 websites with the proportional distance value at 0.05, by counting the number of times each HTML tag of the and found that their two versions of webpages do not corpus appears in A1 , and construct the corresponding differ significantly especially in terms of the JavaScript vector t2 for the webpage A2 ; (3) for each pair of corre- code and text content; we continued the manual inspec- sponding integers x1 and x2 in the two vectors t1 and t2 , tion for the groups of websites with the proportional dis- define the Hamming Distance D(x1 , x2 ) = 1 if x1 6= x2 , tance value at 0.1, 0.15,..., 0.35, respectively, and found and D(x1 , x2 ) = 0 otherwise; (4) for each pair of corre- that 0.35 is a reasonable threshold value for which at sponding integers x1 and x2 in the two vectors t1 and least 65% of the websites in the group differ significantly. t2 , further define L(x1 , x2 ) = 1 if x1 6= 0 OR x2 6= 0, and There is always a trade-off between choosing a smaller L(x1 , x2 ) = 0 otherwise, to indicate whether the corre- and a larger threshold value in similarity comparison. sponding tag appears in at least one of the two pages; In our case, choosing a smaller threshold value would (5) for non-null vectors t1 and t2 , calculate the propor- include into our dataset more websites that do not have tional distance P D(t1 , t2 ) between the two webpages A1 5 With 100 websites from each group.
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 30 significant differences between their two versions of web- fication method is commonly used in many studies such pages, while choosing a larger threshold value would ex- as [11]. clude from our dataset more websites that do have obvi- We then classify a third-party domain as a tracker ous differences between their two versions of webpages. if it would be blocked by some ad block engine. In Therefore, we decided to use 0.35 as the threshold this paper, we leverage the ad block engine used in value, which is indeed close to the value 0.32 used in [5] the Brave browser [39] to classify third-party domains for comparing phishing webpages. In total, we identified as trackers or non-trackers. In more details, the engine a list of 23,310 (25.9% of 89,828) sites with the propor- first parses filters based on blacklists such as EasyList, tional distance values greater than 0.35 in both Experi- and then matches third-party URLs against the parsed ments A and B as the dataset for us to further compare filters (e.g., regular expressions, URL substrings, and their privacy practices. Figure 2 depicts the distribu- whitelisted rules). The blacklists used in our work are tion of the 23,310 sites in six Alexa ranking groups. EasyPrivacy [49], EasyList [47], and 12 EasyList vari- Note that the selected group is for sites with the rank ants for different countries or languages [48]. Note that between 200,000 and 1,000,000 as described at the be- this collection of trackers may target at either desktop ginning of this section. We observed a clear trend that or mobile users as we inspected. While blacklists would the higher the ranking group, the larger the percentage never be complete to include all trackers, they are ap- of websites being selected from the group. This trend in- propriate for us to capture the differences of web track- tuitively indicates that higher ranked websites are more ing on the two environments. After parsing and match- likely to have both desktop-specific and mobile-specific ing against these lists, a third-party domain is classified versions of webpages than lower ranked ones. as a tracker if it would be blocked by the ad block engine. For instance, the first rule of the current EasyList is “&act=ads_”; example.com will be labeled as a tracker 12000 9953 10000 8591 (19.9%) in the “example.com/index.html?x=1&act=ads_” con- (21.5%) text, but not in the “example.com/beningpage.html” # of Websites 8000 6000 context as no matching pattern is identified by the ad- 4000 2270 block engine. In other words, similar to [11], we clas- 956 1179 (14.2%) 2000 361 (23.9%) (23.6%) sify a third-party domain as a tracker only if it is in (36.1%) 0 the tracking context. Also note that well-known third- ed k) ) k) ) k) 5k 0k ,1 party libraries such as jQuery and React are usually not 10 ct 00 ,5 k, le [1 ,1 k, [1 Se 0k [5 0k [1 blocked by blacklists, and thus they are not considered [5 Ranking Group as trackers in our study. Fig. 2. Distribution of the 23,310 sites in six ranking groups. A tracker may perform tracking activities using JavaScript APIs or placing HTTP cookies. In this pa- per, we refer to JavaScript API using trackers and cookie placement trackers as JS-trackers and cookie- 4.3 Tracker Definition trackers, respectively. Our WTPatrol browser exten- sion records the JavaScript execution (Table 1) and A tracker in this paper is identified at the granularity of cookie setting (e.g., via HTTP responses) activities as- the fully qualified domain name (FQDN) [50]. We iden- sociated with a visited first-party websites; therefore, in tify a tracker only if it appears in a third-party URL, our measurements, we ensure that a tracker has actually i.e., it must be a third-party domain. To determine if a used some of those techniques in practice, i.e., a tracker URL is a first-party URL or third-party URL, we lever- on a first-party webpage has executed JavaScript code age Mozilla’s Public Suffix List [53], in which a public or placed cookies that can be used for tracking. suffix is “one under which Internet users can (or histor- In the rest of this paper, our analysis is based on ically could) directly register names.” More specifically, the data collected in both Experiments A and B. We we classify a URL as first-party if its public suffix to- count a tracker as being included by a first-party web- gether with the first section of the domain name pro- site as long as certain tracking activities are identified ceeding the public suffix is identical to that of the URL in either Experiment A or Experiment B. Correspond- displayed in the address bar of a browser; otherwise, we ingly, unique JavaScript APIs and cookies are counted classify the URL as third-party. This third-party classi- as long as they are identified in either Experiment A or Experiment B.
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 31 pairs (with 100 different first-party domains and 100 dif- 5 Measurement Results Analysis ferent tracker domains) to estimate the false positives in our study. In terms of the first situation, we found that In this section, we present our measurement results zero pair has the same organization. In terms of the and analysis from the perspective of tracking based on second situation, by using the same two properties of a JavaScript APIs and the perspective of tracking based within-site tracker 6 as defined in [27], we identified six on HTTP cookies. We conclude this section with a sum- within-site trackers in the 100 sampled pairs. Overall, mary of results and implications. the false positive rate based on the sampled pairs is 6% considering both situations. 5.1 Tracking Based on JavaScript APIs 10000 30 5.1.1 Trackers’ Distribution and Their Techniques 25 # of JavaScript APIs Accessed 7500 # of First-party Sites 20 Figure 3 presents the general view of all the JS- 5000 trackers, i.e., 4,052 on the mobile environment and 15 5,073 on the desktop environment, in terms of the num- 10 2500 ber of first-party sites they appeared on and the number 5 of tracking JavaScript APIs they accessed. The y-axis 0 0 10 0 10 1 10 2 10 3 on the left side and the blue curve represent the number Tracker Index of first-party websites that include each of all the track- (a) Trackers on the mobile version first-party websites ers shown on the x-axis in log scale, while the y-axis on the right side and the red markers represent the num- 10000 30 ber of JavaScript APIs (i.e., out of those 34 JavaScript 25 # of JavaScript APIs Accessed APIs shown in Table 1) accessed by each of those track- 7500 # of First-party Sites 20 ers. The blue curves in Figure 3 show the “long-tail” 5000 15 phenomenon with trackers in the tail only appearing on very few websites. In more details, we found the vast 10 2500 majority of the trackers, i.e., 3,601 (88.9%) out of 4,052 5 trackers on the mobile environment and 4,389 (86.5%) 0 0 10 0 10 1 10 2 10 3 out of 5,073 trackers on the desktop environment, only Tracker Index appear on less than ten first-party websites. In addition, (b) Trackers on the desktop version first-party websites the number of JavaScript APIs accessed by all the track- Fig. 3. All the JS-trackers that appeared on the most number of ers vary from 1 to 27. On average, the trackers appear those 23,310 first-party websites. on 18 mobile websites and 24 desktop websites. To estimate the false positives in our tracker iden- tification, we manually analyzed 100 randomly sampled Figures 4a and 4b show the top 20 JS-trackers websites. False positives may occur in two situations. that appeared on the most number of those 23,310 first- One is that a tracker is indeed not a third-party be- party mobile version and desktop version websites, re- cause it belongs to the same organization as its cor- spectively. The meanings of the y-axes and x-axis are responding first-party, thus a false positive occurs. For identical to those in Figure 3, except that the x-axis rep- example, fbcdn.net may be misclassified as a tracker on resents the top 20 trackers with their domain names, and facebook.com while they belong to the same organiza- the bars represent the number of first-party websites tion Facebook. The other situation is that a tracker is that include each of these 20 trackers. We found that just a within-site (instead of cross-site) tracker accord- www.google-analytics.com is the top tracker on both mo- ing to Roesner et al. [27], thus a false positive occurs. A within-site tracker (e.g., google-analytics.com on a first- party website) cannot use unique identifiers to track 6 Property 1: the tracker’s script, running in the context of the users across sites while a cross-site tracker (e.g., dou- site, sets a site-owned cookie. Property 2: the tracker’s script bleclick.net on a first-party website) can [27]. We ran- explicitly leaks the site-owned cookie in the parameters of a request to the tracker’s domain, circumventing the same-origin domly sampled 100 (first-party domain, tracker domain) policy.
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 32 bile and desktop websites, appearing on 10,464 (44.9%) to each other between those sampled trackers and the and 10,228 (43.9%) of them, respectively. Note that, top trackers, we found that the percentages of trackers eight of the top ten trackers belong to the same organi- that have the reputation grade lower than Good (i.e., zation, Google, indicating its dominant tracking power. the reputation score is less than 60) are different with Thirteen common trackers appear on both Figures 4a 47.6% and 30.7% in the two groups, respectively. and 4b, which means that those top trackers monopolize We inspected all the JavaScript APIs accessed web tracking on both mobile and desktop environments. by trackers on mobile and desktop environments, and In Figure 4, on average the top 20 trackers appear on summarized in Table 6 of Appendix C. We found 2,196 mobile websites accessing 9 JavaScript APIs and that window.document.cookie is the most commonly ac- on 3,146 desktop websites accessing 11 JavaScript APIs. cessed JavaScript API with 67.4% and 67.5% trackers on mobile and desktop environments using it. Mean- 10000 25 while, window.navigator.userAgent is at second place with 59.8% and 55.4% trackers on mobile and desk- # of JavaScript APIs Accessed 20 # of First-party Sites 7500 15 top environments using it. The two lists contain the 5000 10 same set of top 10 JavaScript APIs, and their ranks 2500 5 of the top five APIs are even identical. In addition, 0 0 the percentages of trackers on these two environments ot om ch ru om on wr com nd t m m rg et om m ne .h com io m et nag om that use a corresponding API are close to each other. ce om m et . om sy vice t m co ba r.co ex ne .o re ogle ub ok.n pp .co c co -a um ar.c co ds dic ck.n m eo. g. ds lic.c oo oub n.co . ct c .c rip tjar. .c eo k. . s. m ics. am em n. in ja je ge glea ect ces it er lic o j rd .ya rit i t.b r ot e ro cl b o ar o .c t c. glea lec t .h .c at i ys ly t.h le i at c se ic v rs na r ic az .ne r ic ds .fa at og dse se a re nd va at We did not observe any JavaScript API that is only oo le-a st o -a g am ent ag sc ba yn st d .g tag d n. . ca . .g s n .g cd g le g n e oo re gl go co gl oo js co .g go .g o w .s se 2.g c. .g pu w w accessed on the mobile environment. Even for sen- sb w o w w tp w w ad w w w cu w pa Tracker sor access, the window.addEventListener API is used (a) Trackers on the mobile version first-party websites to listen to events such as devicemotion, but this 10000 25 API is also widely used to listen to other events on # of JavaScript APIs Accessed 20 both mobile and desktop environments. However, there # of First-party Sites 7500 15 are two APIs (i.e., window.navigator.plugins and win- 5000 10 dow.navigator.mimeTypes) being only accessed on the 2500 5 desktop environment due to the lack of support on the 0 0 mobile environment. et m ca s.c .com om ep ata et ed et rg n. t m m ct om ce m ne .n na com m t te s.co om s0 om c. nne atio m t The results of tracker distribution and tracking ce om ne n n ne .o pp .co co o dn .c co ds .m eo. k. o k. c ct .c . .c .c r.c co rdi. riteo k. re teo c lic ck gu ted. s. o d s. m am rch je gs tics n. ed rit ic bo ep ice ep lec ge ic lead ecli io .2 ro ri g. lecl .c ct a c .c og s.g. cat y vi o rv se ic b al te te l m st oog ub er ou fa b ic at ds se ro u di og ma n ro ro ad ou nd cu goo ct. JavaScript APIs in this subsection have the following -a st z do l.a d.d yn rd go tag d n. e ta sy es af gl af a af cd e le s. e oo gl ds re gl co ad ogl d oo .a .g .a ba .g le fw o w .g .s xe .g pu w w 2. at sb w major implications. First, from the perspective of the w w pi tp w re w go w w w w ge se pa Tracker overall volume of trackers, web tracking on the mobile (b) Trackers on the desktop version first-party websites environment is nearly as pervasive as that on the desk- Fig. 4. Top 20 JS-trackers that appeared on the most number of top environment, with 4,052 and 5,073 trackers on mo- those 23,310 first-party websites. bile and desktop environments, respectively. Second, the top trackers have a great impact on web tracking be- cause they are intensively included by first-party web- We investigated 20 randomly sampled trackers sites on both environments. Third, although the track- in each long tail, and found that the average number ers on the long tails access a relatively small number of of tracking JavaScript APIs accessed by these track- tracking JavaScript APIs, the overall volume of those ers is four on both environments (which is much less trackers is large and the impact of them should never than that for the top 20 trackers). We further checked be neglected. the trustworthiness of these 40 sampled trackers using a popular web reputation lookup service WebOfTrust (WOT) [58]. From 21 of them that have the reputa- 5.1.2 Relationship Between Mobile and Desktop tion data, we found that the average reputation score Trackers is 65 and 58 on mobile and desktop environments, re- spectively. In comparison, we found that 20 trackers in Figure 5 is a Venn diagram summarizing the relation- Figure 4a and 19 trackers in Figure 4b have the repu- ship between the trackers identified on mobile websites tation data with the average scores 66 and 60, respec- and desktop websites. We found that in total 4,052 dif- tively. Though the average reputation scores are close ferent trackers appeared on the mobile websites and
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 33 5,073 different trackers appeared on the desktop web- information of 251 trackers from the WHOIS record. sites, indicating the prevalence of including a variety of Figure 13 in Appendix D details the tracker coverage of trackers by first-party websites. These two environments the four data sources. share 3,290 common trackers. We found that 3,107 out Overall, we identified 382 unique organizations that of these 3,290 common trackers appeared on matched host these trackers. Note that by comparing with the desktop and mobile first-party websites. That is, each of 742 unique organizations hosting the 1,783 desktop- those trackers appeared on both the mobile and desk- specific trackers, we found that 258 of these 382 orga- top versions of at least one same first-party website. The nizations only appear on the mobile environment. We lists of these trackers can be accessed at a GitHub public summarized the top 10 organizations that host the most repository7 . number of mobile-specific JS-trackers as shown in Ta- Meanwhile, 762 trackers are unique on the mo- ble 2. We identified that five organizations, which are bile environment, indicating that a large number of Adobe, Optimzely, CloudFlare, Amazon, and Kameleoon, mobile-specific trackers exist on the Internet al- provide their content delivery network (CDN) services though they only use a subset of JavaScript APIs that to various trackers, thus making them ranked in this appeared on the desktop environment as shown in Ta- top 10 list. Interestingly, we found that the rest five ble 6. Note that due to the nature of measurement study, are all domain management organizations. They pro- these 762 trackers should only be considered as mobile- vide domain information protection services, which are specific in the results obtained by this study. They may privacy measures implemented by a large number of do- appear on the desktop environment in other studies, mains. In more details, among those 251 trackers whose and new mobile-specific trackers may be discovered at organization information was retrived from the WHOIS different times. record, there are 34, 25, 14, 8, and 7 having organi- zation of “Redacted for Privacy”, “Domains by Proxy”, “Whois Guard”, “Global Domain Privacy Services”, and Mobile Desktop “Whois Privacy”, respectively. Therefore, these five or- ganizations cannot represent the real organizations of 762 3,290 1,783 those trackers. These results indicate that some track- ers tend to use CDN services to host their track- ing services and intentionally hide their domain registration information. Fig. 5. The number of JS-trackers that appeared on the mobile websites and the desktop websites. Table 2. Top 10 organizations that host the most number of mobile-specific JS-trackers. 5.1.3 Who and Where Are Those JS-trackers? Organization # of mobile-specific trackers Adobe 48 To have a better understanding of the 762 mobile- Redacted For Privacy 34 specific trackers in terms of their organization and Domains By Proxy 25 country information, we leveraged the CrunchBase Optimizely 19 CloudFlare 17 database [45], Tim Libert’s library [57], TLS certificate, Whois Guard 14 and WHOIS record [59] as data sources 8 . In more de- Amazon 9 tails, we attempted to identify tracker information from Kameleoon 9 Global Domain Privacy Services 8 these four data sources in order. For each tracker, we Whois Privacy 7 stopped the identification process once the information is obtained from a data source. In total, we successfully retrieved the organization information of 411 trackers Table 3 shows the top 10 countries, which are also from either the CrunchBase database, Tim Libert’s li- extracted from the aforementioned data sources, that brary, or TLS certificate, and retrieved the organization have the most number of mobile-specific trackers. We found that the United States is the top one country that has the most, i.e., 264, mobile-specific trackers. 7 https://github.com/jun521ju/PETS2020_Web_Tracking. 8 D&B Hoovers can also be used for this purpose [26]. We did not leverage it because we could not obtain its database access.
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 34 30 Table 3. Top 10 countries that have the most number of mobile- 20 specific JS-trackers. 18 Average # of JavaScript APIs Accessed 25 16 Average # of Trackers 14 20 Country # of trackers Country # of trackers 12 United States 264 Panama 25 10 15 China 47 Germany 21 8 10 Japan 40 Unite Kingdom 17 6 France 33 Korea 16 4 5 Redacted For Privacy9 31 Canada 14 2 0 0 s ts es e rts e lt ss th y l ns g ce s n na ew u om nc et Ar n er io l am ne Ad o ea ee en pi ci io ie at N Sp t H pu op So si eg H _T er Sc G re Bu om Sh ef R ec d an R R C s_ 5.1.4 First-party Sites That Include JS-trackers d Ki Category (a) Mobile websites To explore the probability for first-party websites to 30 have more JS-trackers on their desktop versions than on 20 18 Average # of JavaScript APIs Accessed 25 their mobile versions, we performed the Binomial Pro- 16 portion statistical test for the 23,310 first-party websites. Average # of Trackers 14 20 12 Initially, we set our null hypothesis as H0 : P = 0.5, and 10 15 set our alternative hypothesis as H1 : P > 0.5, where 8 10 6 P represents the probability for first-party websites to 4 5 have more JS-trackers on their desktop versions than on 2 0 0 their mobile versions. We increased the value of P with s ts es e ts e t ss lth l y s ng ce s n ul na ew om nc et en Ar or er io am ne Ad ea en pi io ci ie at N Sp t Te H pu op So si eg H er Sc G re Bu d_ om the step of 0.05 to revise our initial hypotheses, and Sh ef R ec an R R C s_ d Ki found that we can even reject the null hypothesis with Category P = 0.7 at the 95% confidence interval, which indicates (b) Desktop websites that the probability for first-party websites to have Fig. 6. Average number of trackers appeared on first-party web- more JS-trackers on their desktop versions than sites of different Alexa categories. on their mobile versions is greater than 0.7 . To measure the prevalence of trackers included in sult. Correspondingly, more tracking JavaScript APIs different categories of the first-party websites, we used are accessed on the desktop environment than on the Alexa’s top 500 sites list in each of the 16 categories to mobile environment. determine the category of each of the first-party web- We further explored the prevalence of the trackers sites (note that 1,154 mobile websites and 1,199 desk- on different ranks of the first-party websites as shown top websites are successfully categorized because Alexa in Figure 7. According to Alexa’s ranked site list, we only provides top 500 sites per category). In Figures 6a first divided the 23,310 websites into six ranking groups. and 6b, we then averaged the number of trackers in- We then averaged in each group the number of trackers cluded by the first-party websites in each category rep- (represented by the vertical bars with the y-axis on the resented by the vertical bars with the y-axis on the left left side) included by the first-party websites as well as side, and averaged the number of JavaScript APIs ac- the number of JavaScript APIs accessed (represented by cessed by the corresponding trackers represented by the the curve with the y-axis on the right side) by the corre- curve with the y-axis on the right side. sponding trackers. We found that both the mobile and We found that the news category has the most desktop websites ranked between 1,000 and 5,000 have trackers on the desktop environment, while the the most (i.e., 5 and 9, respectively) trackers on aver- home category has the most trackers on the mo- age. Generally speaking, the average number of track- bile environment. Except for the adult category, the ers in each group smoothly decreases as the rank de- rest 15 categories all have nearly double or over dou- creases except for the ranking group [1, 1k). The selected ble numbers of trackers on desktop websites than on group (i.e., the sites with the rank between 200,000 and mobile websites; this category-based result is consistent 1,000,000 as described in Section 4.1) has the least num- with the aforementioned Binomial Proportion test re- ber of trackers. These results indicate that web track- ing tends to occur more intensively on the higher ranked websites, no matter for the mobile or desk- 9 The country information is hidden by this domain manage- top environment, than on the lower ranked ones. ment organization.
A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments 35 15 15 Desktop # of Trackers sor data. In addition, we found that 31 mobile websites Average # of JavaScript APIs Accessed Mobile # of Trackers Desktop # of JS APIs Accessed Mobile # of JS APIs Accessed accessed the ambient light sensor data. Average # of Trackers 10 10 5 5 5.2 Tracking Based on HTTP Cookies While HTTP cookies can be set by JavaScript code, tra- 0 0 ditionally the majority of cookies are directly set by a ed k) ) k) ) k) 5k 0k ,1 10 ct 00 ,5 k, le [1 ,1 k, [1 Se 0k browser based on the HTTP responses. What we present [5 0k [1 Ranking Group [5 in this subsection includes both cases and is a combined Fig. 7. Average number of trackers included by first-party web- result of cookie usage. Since some cookies are placed for sites of different Alexa ranking groups on both mobile and desk- functionality purposes instead of tracking, we follow the top environments. same method as in Section 4 of [11] to identify ID cook- ies. ID cookies store unique user identifiers and can be When inspecting the average number of JavaScript used for tracking. In more details, an ID cookie, which APIs accessed in different ranking groups, we found that is represented by a tuple of (cookie-name, parameter- the number on the desktop environment decreases from name, parameter-value), meets the following four crite- 13 in the ranking group [1, 1k) to 8 in the selected group, ria [11]: (1) has a lifetime longer than 90 days, (2) the while the number on the mobile environment does not length of its parameter-value is larger than 7 and less significantly change across different ranking groups. On than 101, (3) its parameter-value remains the same in average, the number of JavaScript APIs accessed on mo- our data collection experiment, and (4) the similarity of bile and desktop websites is 8 and 12, respectively. its parameter-value between our Experiments A and B is lower than 66% according to the Ratcliff-Obershelp algorithm[55]. We present our analysis based on all the 5.1.5 Sensor Access identified ID cookies. Motion sensors such as accelerometers and gyroscopes are widely equipped on modern smartphones. Re- 5.2.1 Trackers That Use HTTP Cookies searchers have recently presented motion sensor based device fingerprinting attacks [4, 6, 8, 9] as reviewed Figure 8 presents the general view of all the cookie- in Section 2. For example, the devicemotion event in trackers, i.e., 4,038 on the mobile environment and 4,879 browsers is fired at a regular interval and provides in- on the desktop environment, in terms of the number of formation about the force of acceleration and the rates first-party sites they appeared on and the number of the of rotation of the device. By listening to devicemotion unique cookies they placed. We found the vast major- events using the window.addEventListener API, attack- ity of these trackers, i.e., 3,540 out of 4,038 trackers on ers can collect motion sensor data from users’ devices the mobile environment and 4,148 out of 4,879 trackers without their awareness. on the desktop environment, only appear on less than As described in Section 3, we instrumented the win- ten first-party websites. The cookies placed by all these dow.addEventListener API to capture all the event han- trackers vary from 1 to 247. On average, trackers placed dler registration activities. In our measurements, we cookies on 10 mobile websites and 16 desktop websites. captured 456 and 302 unique types of events on desktop Figures 9a and 9b show the top 20 trackers that and mobile environments, respectively, while 212 unique placed cookies on the most number of those 23,310 first- types of events appeared on both environments. Specif- party mobile version and desktop version websites, re- ically, six types of events appeared on the top 10 list spectively. The y-axis on the left side and the vertical of each environment as shown in Table 4. We observed bars represent the number of first-party websites that that events related to sensor access, such as deviceorien- included the cookie-trackers shown on the x-axis, while tation and devicemotion, only appeared on the mobile the y-axis on the right side and the curve represent environment. In more details, we found that 3.3% (780 the number of different ID cookies created by each of out of 23,310) mobile websites accessed the motion sen- those trackers. Tracker agkn.com creates four different cookies on the most number of (i.e., 1,840 or 7.9% of 23,310) first-party desktop websites, while tracker ad-
You can also read