Email Tracking: a Study on its Prevalence - KU Leuven (ESAT)
Email Tracking: a Study on its Prevalence Shirin Kalantari Thesis submitted for the degree of Master of Science in Engineering: Computer Science, option Secure Software Thesis supervisors: Prof. dr. ir. C. Diaz Prof. dr. ir. F. Piessens Assessors: Prof. dr. B. Berendt Dr. J.T. Mühlberg Mentors: M. Juárez T. van Goethem Academic year 2018 – 2019
© Copyright KU Leuven

Without written permission of the thesis supervisors and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to the Departement Computerwetenschappen, Celestijnenlaan 200A bus 2402, B-3001 Heverlee, +32-16-327700 or by email info@cs.kuleuven.be. A written permission of the thesis supervisors is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.
Contents

Abstract
1 Introduction
    1.1 Structure of the report
2 Background and Literature Review
    2.1 Introduction
    2.2 Email Protocols
    2.3 HTML email
    2.4 HTTP request
    2.5 Rendering HTML emails
    2.6 Commercial Newsletter Emails
    2.7 Conclusion
3 Problem Statement and Methodology
    3.1 Introduction
    3.2 Email tracking for senders
    3.3 Email tracking for third parties
    3.4 Identifying Tracking Images
    3.5 Identifying HTTP resources in email
    3.6 Data
    3.7 Conclusion
4 Implementation and Results
    4.1 Introduction
    4.2 Data
    4.3 Read receipt
    4.4 Identifying personalized tokens
    4.5 Identifying tracking Images
    4.6 Remote Contents in email
5 Discussion
    5.1 Introduction
    5.2 Email read receipt
    5.3 Improving the defence
    5.4 Future work
    5.5 Conclusion
6 Conclusion
A Infrastructures
    A.1 Mail server
Bibliography
Abstract

Unlike web tracking, email tracking has attracted little academic interest. In its most common form, email tracking is the result of an unwanted HTTP request. In this thesis we measure the prevalence of different email tracking methods in a corpus of commercial newsletter emails. We discover that a certain method of email tracking can provide online trackers with a persistent user identifier, and we find that 59% of the senders in our corpus can leak this identifier. In addition to third parties, email tracking enables the sender to obtain additional information about recipients. While senders can be notified about user interactions with their email through standardized protocols, email tracking gives them the same information with a larger scope and without explicit user consent. We discuss existing countermeasures and their effectiveness in resolving the concerns raised by email tracking.
Chapter 1
Introduction

The Internet plays a vital role in our lives. Every internet service that we use produces data, and a vast amount of data about users' online interactions is being collected. Online trackers are interested in collecting this information and, as previous studies have shown, web tracking is common on popular websites [27]. While web traffic dominates our modern usage of the internet, electronic mail is another unavoidable part of our online life. It is estimated that by the end of 2018 more than 280 billion emails were sent daily [3]. It is no exaggeration to say that our email inboxes hold traces of almost every action we take online. Today, email systems deliver commercial newsletters, news briefings, social media updates, password recovery links and, of course, daily-life communications. Because email is used so widely in daily life, preserving its privacy is paramount.

Given the persistence of trackers on the web, it is reasonable to expect trackers in users' inboxes as well. Previous research on email tracking has shown the different methods that are used for tracking emails and the prevalence of these methods in practice [26, 79]. Email tracking as discussed in both of these works is at its core the result of HTTP requests that are made upon user interactions with an HTML email. These HTTP requests leak information related to an already sent email. Summarizing the tracking methods described in these papers, the information that leads to tracking takes three forms: meta-data, HTTP headers and personalized URL tokens. Meta-data is inherent to any HTTP request and includes information such as the user's IP address and timezone. HTTP headers leak information such as cookies, the referrer address, and the user agent. These headers are sent as a result of misconfigurations or loose privacy settings of email clients. Finally, personalized tokens are deliberately included by the sender inside the URLs of an email.
The recipient's email address as part of a URL is an example of a personalized token that was studied in previous works [26, 79].

The implication of email tracking as discussed in previous works is twofold. The first issue, which is raised in both papers, is that by using HTTP requests the sender can learn whether, when, and how a user has interacted with an email. Xu et al. discussed that such methods of email tracking can be used to launch long-term
surveillance attacks against the recipient [79]. The second privacy pitfall, which is only discussed in the paper by Englehardt et al., is the leakage of personalized tokens to unauthorized parties [26]. The personalized tokens that they studied were based on the recipient's email address and were considered to be instances of Personally Identifiable Information (PII). The existing countermeasures are not very effective in resolving these concerns. The most effective (and yet practical) countermeasure can only prevent meta-data and HTTP header leakage and cannot prevent tracking through personalized URL tokens [26].

In this thesis we further clarify the implications and methods of email tracking. We want to understand the role that email tracking plays in online tracking, and whether it can provide additional information about a user that cannot be obtained through well-studied web tracking methods. On a smaller scale, we want to understand the role of email tracking for email senders. In both cases, we are interested to know whether email tracking methods are transparent to end users and whether users can have control over them. Based on the results of this thesis we are able to demonstrate:

• How personalized tokens in email could be used to build a persistent method of online tracking. To our knowledge, such a method of tracking has been unknown to this date. In comparison with Englehardt et al. we use a broader definition of personalized tokens, where they do not necessarily have to be based on the recipient's email address and are not considered as PII.

• How HTTP requests in email can inform the sender about user interactions with an email in an obscure way. We demonstrate how probable it is for the sender to obtain information about both the specific recipient and the specific email from an HTTP request.

• What resources in email generate HTTP requests and whether these resources could be replaced by offline alternatives.
• A novel method for identifying advertisements based on their HTML structure and URL query parameters rather than their domain. We used this method to identify advertisements that are not detected by existing countermeasures.

We measure the prevalence of email tracking in a corpus of 343,998 newsletter emails that were collected for this thesis. The emails came from 1,148 different senders and were collected from 2018-03-31 to 2018-10-26. Based on the results of this thesis we propose guidelines that could be used to improve the state of defence against email tracking.

1.1 Structure of the report

In Chapter 2 we provide background information about email protocols and review the cause of email tracking and the information that it leaks. We see how
commercial parties use this information to provide customized email services for their clients. In Chapter 3 we elaborate on the consequences of email tracking and propose methods for measuring the prevalence of different tracking methods. In Chapter 4 we report and interpret the results of applying the proposed methods to our corpus, along with their limitations. In Chapter 5 we assess our results and their importance in practice. Finally, we summarize our findings and conclude in Chapter 6.
Chapter 2
Background and Literature Review

2.1 Introduction

HTML emails and the HTTP contents in them are the ground for email tracking. In this chapter, we study HTML emails in more detail. We first discuss email protocols to understand how HTML is encoded in an email. We describe the HTML rendering process and identify the parts of an HTTP request that could lead to information leakage. We outline the existing countermeasures and their effectiveness in preventing unwanted information leakage. We close the chapter by describing email tracking services that are currently used in commercial newsletter emails.

2.2 Email Protocols

Sending an email involves several protocols and pieces of software. Figure 2.1 is an overview of the key software and protocols that are used in sending and retrieving an email. To compose an email the sender uses Mail User Agent (MUA or UA) software. MUAs can be categorized into two major types: web-mail clients like Gmail web-mail, and local mail clients like Thunderbird and iOS Mail. To send an email the MUA submits it over the Simple Mail Transfer Protocol (SMTP) to the sender's mail server, also called a Mail Transfer Agent (MTA). The MTA is in charge of transmitting emails and relays the email to the recipient's network via SMTP. Some examples of mailbox providers are Gmail, Yahoo!, and Outlook. Before routing an email each MTA might perform security checks on it, such as spam filtering and malware detection. When an email reaches its destination MTA, the recipient's MUA can retrieve the newly arrived email over an email access protocol. In this section we elaborate on the details of these protocols.
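As a rough illustration of the first step of this flow, the following Python sketch composes a plain-text message the way an MUA might and defines, without calling it, a function that would submit the message to the sender's MTA over SMTP. The addresses, the host argument and the helper names compose and submit are assumptions made for this example only; port 587 with STARTTLS is a common submission setup, but not the only one.

```python
import smtplib
from email.message import EmailMessage


def compose(sender: str, recipient: str, subject: str, text: str) -> EmailMessage:
    """Build a plain-text email object as an MUA would before submission."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(text)
    return msg


def submit(msg: EmailMessage, host: str, port: int = 587) -> None:
    """Hand the message to the sender's MTA over SMTP.

    Not called in this sketch: it needs a reachable mail server, and most
    servers also require authentication (smtp.login) before accepting mail.
    """
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.send_message(msg)


# Hypothetical addresses, used only to show the resulting email object.
message = compose("newsletter@sender.example", "alice@recipient.example",
                  "Weekly offers", "Hello Alice, here are this week's offers.")
print(message)
```

Printing the message shows the content headers followed by the body, which is the part of the email object that the recipient's MUA eventually interprets.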
Figure 2.1: An overview of steps and protocols that are used to send emails.

2.2.1 SMTP: The Email Transport Protocol

SMTP is used to transmit email objects. The email object consists of two parts: the envelope headers and the content [46]. The SMTP envelope headers store bookkeeping information regarding the delivery and transportation of an email. They are intended to be used by MTAs in order to transfer the email from the sender's network to the recipient's network. The SMTP content itself consists of two parts: the header section and the body. The content header section contains information that is used by the email client, for instance the email subject. The content headers are colon-separated key-value pairs. The body contains the email message. SMTP can only carry email messages represented in US-ASCII. Multipurpose Internet Mail Extensions (MIME) relax this restriction by defining algorithms that can be used to encode the email message into US-ASCII.

2.2.2 MIME

MIME refines the email object carried by the SMTP protocol to allow for more practical contents. Using MIME the email object can contain textual contents with character sets other than US-ASCII, non-textual message contents like file attachments, and multi-part message bodies [33]. The email object is organized in MIME parts. Each MIME part has headers that provide additional information about the contents it encloses. The Content-ID is an identifier for a MIME part that can be used to reference this MIME part in other parts of the email [33]. To provide the encoding information, each MIME part uses two mail headers: the Content-Type header indicates the type of content that is being encoded and the Content-Transfer-Encoding header indicates the encoding scheme. A MIME content type is expressed by a type and a subtype. The MIME type is the general description of the kind of data carried in the MIME enclosure. The subtype offers a more specific
description of the type of enclosed data [33]. Figure 2.2 is an example of a MIME part that encodes an image.

Figure 2.2: Encoding an image in email message using MIME.

2.2.3 Email Access Protocols

Email access protocols are used to transfer email objects from recipients' mail servers to their MUAs. Post Office Protocol (POP) and Internet Message Access Protocol (IMAP) are standardized protocols that are deployed in most commercial mail servers and MUAs. Some mailbox providers use a custom protocol for transferring mail objects to MUAs. These protocols are often used by the native MUAs that the mailbox provider has developed. For example, Microsoft previously used DeltaSync and WebDAV as email retrieval protocols.

POP3

POP3 is a simple retrieval protocol described by RFC 1939 [59]. Using this protocol the MUA can use the commands defined in the RFC to communicate with the mail server. The main operations are authentication of the user to the mail server, message retrieval and message deletion. Once a message is retrieved, the POP session can be terminated and the MUA can operate on the email offline.

IMAP

Defined by more than 10 RFCs, IMAP is a relatively complex email retrieval protocol. Using IMAP each email can have an additional set of flags associated with it. These
flags communicate information about user interactions between the MUA and the mail server. Table 2.1 lists the IMAP flags specified in RFC 3501 [21].

Flag         Description
\Seen        Message has been read
\Answered    Message has been answered
\Flagged     Message is "flagged" for urgent/special attention
\Deleted     Message is "deleted" for later removal
\Draft       Message has not completed composition (marked as a draft)
\Recent      Message is "recently" arrived in this mailbox

Table 2.1: IMAP flags as described in the Flags Message Attribute section of RFC 3501 [21].

2.3 HTML email

Although the first motivation for using MIME was to support European characters in email [61], its introduction also enabled sending emails with richer text formatting like HTML. Using HTML, email messages are no longer restricted to textual content. Emails can contain well-designed messages with integrated multimedia contents that render consistently across different mail clients. HTML emails are reported to have been sent since 1995 [72]. While the main motivation was to have graphics and styling, HTML emails have also raised concerns about user privacy from early on [72, 9]. Today the scope of these concerns has been reduced; nevertheless, HTTP resources that are included in an email give the sender the means to obtain additional information about an already sent email. HTTP contents are resources that are hosted on remote servers and are loaded through HTTP requests. In this section we identify general HTTP contents, the different methods of including them and the consequences of using them in email, and we outline the HTML rendering process of MUAs.

2.3.1 HTTP Contents: General

The HTTP resources that can exist in an HTML page are Cascading Style Sheets (CSS), scripts, links, and images. CSS is used to attach style to HTML documents. Scripts are code that runs on the client's machine when the HTML page loads or upon certain user interactions [8].
Links are one of the prominent features of HTML and connect one page to another HTML resource [7]. Images provide richer contents in a page. Because of their basic role, images are the most popular HTTP content for tracking purposes.

CSS: Style sheets are used for styling an HTML page. There are three alternatives for including CSS in an HTML page: inline, internal and external CSS. Inline CSS is
expressed using the style attribute inside an HTML tag. Internal CSS is expressed inside a <style> tag within the <head> of an HTML page. External CSS consists of separate CSS files that are linked from the <head> of the HTML page with a <link> tag. These three methods are illustrated in Figure 2.3. Among these three methods only loading external CSS files results in an HTTP request. However, loading external CSS in email opens an attack surface that can be exploited to change the contents of an email. The exploit called Ropemaker enables a malicious attacker to change the content of an email after it is sent, just by changing the content of the external CSS that is used inside the email [36].

Figure 2.3: Including CSS in HTML: external, internal, and inline CSS.

Scripts: JavaScript is very popular on the web. In an HTML file, JavaScript code can be internal, written inside a <script> tag, or external, referenced from a <script> tag. While internal JavaScript code does not by itself cause any HTTP request, having JavaScript inside email is known to be very dangerous. Already in 1998, the Reaper vulnerability was found in HTML emails; it enabled the sender of an email to wiretap the email messages when they were forwarded by the recipient to another email address [78].

Links: Links, which are expressed using the <a> tag, are also remote content. On the web, links are the most commonly used HTML element [57]. The HTTP request for a link is made when a user clicks on it. Since links need an explicit user interaction they are less hazardous. However, when clicking on links users are moved to their browsing context, which allows for all traditional web-tracking methods [26].
Images: There are different methods for including an image inside an HTML page, for example through the <img> tag and other tags, or through the CSS background-image property [6]. When expressed using the <img> tag, images inside an HTML email can be of three different kinds: external, data URI, and Content-ID (CID). External images have a remote URL in their src attribute. With a data URI, the src attribute includes the 'immediate data' [54] directly embedded. CID images come as attachments to emails, and the image src attribute references the MIME Content-ID of the attachment [54]. Figure 2.4 demonstrates how the image in Figure 2.2 can be referenced within an <img> tag inside an email.

Figure 2.4: Including a CID image: The image from Figure 2.2 is referenced within an <img> tag.

Among these three methods, external images are the preferred method for including images in email. Although CID and data URI images prevent HTTP-based email tracking, the security pitfalls of these two methods make external images the advisable choice. The data URI scheme can be used to launch phishing attacks in email [47, 55]. CID images also have their drawbacks, since they can negatively affect the delivery of an email. With CID images included in an email message, the chance of getting blocked by spam filters increases. Most spam filters use the text of the email message to classify it as spam [64, 16]. To circumvent these textual filters, spammers can place their messages inside images. This is called image spam [16]. Figure 2.5 shows examples of image spam emails. For this reason, having many embedded images in an email alerts spam filters that the email might contain image spam. Gmail uses optical character recognition (OCR) techniques to extract the text from an image and runs its spam filters on it [11]. Email service providers advise against CID and data URI images and recommend external images instead [53, 42, 58].
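The inclusion methods discussed in this section can be told apart mechanically. The following Python sketch, which uses only the standard html.parser module, groups the resources of an HTML email into external images, data URI images, CID images, external CSS, external scripts and links. The class name, the dictionary keys and the example HTML are assumptions made for illustration; only <img> elements are inspected for images, so images included through other tags or through CSS properties are not covered.

```python
from html.parser import HTMLParser


class RemoteContentFinder(HTMLParser):
    """Group the resources referenced by an HTML email by inclusion method."""

    def __init__(self):
        super().__init__()
        self.found = {key: [] for key in (
            "external_image", "data_uri_image", "cid_image",
            "external_css", "external_script", "link")}

    def handle_starttag(self, tag, attrs):
        # Valueless attributes are reported as None; normalise them to "".
        attributes = {name: (value or "").strip() for name, value in attrs}
        if tag == "img":
            self._classify_image(attributes.get("src", ""))
        elif tag == "link" and "stylesheet" in attributes.get("rel", "").lower():
            self.found["external_css"].append(attributes.get("href", ""))
        elif tag == "script" and attributes.get("src"):
            self.found["external_script"].append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            self.found["link"].append(attributes["href"])

    def _classify_image(self, src):
        # The URL scheme decides whether loading the image needs a request.
        scheme = src.split(":", 1)[0].lower()
        if scheme == "cid":
            self.found["cid_image"].append(src)
        elif scheme == "data":
            self.found["data_uri_image"].append(src)
        elif scheme in ("http", "https"):
            self.found["external_image"].append(src)


def find_remote_content(html: str) -> dict:
    finder = RemoteContentFinder()
    finder.feed(html)
    finder.close()
    return finder.found


# Hypothetical email body used only to exercise the classifier.
EXAMPLE_HTML = (
    '<html><head><link rel="stylesheet" href="https://cdn.example/s.css"></head>'
    '<body><img src="https://tracker.example/open.gif?u=42">'
    '<img src="cid:logo@sender.example">'
    '<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=">'
    '<a href="https://shop.example/offer">Offer</a></body></html>'
)
report = find_remote_content(EXAMPLE_HTML)
for kind, urls in report.items():
    print(kind, urls)
```

Of the six groups, only external images, external CSS and external scripts can trigger a request while the email is being rendered; links trigger a request on a click, and CID and data URI images are loaded from the email itself.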
2.4 HTTP request

HTTP requests are the main source of tracking in email. The privacy concerns of HTTP requests are due to the meta-data, HTTP headers and personalized URL tokens that they convey.

2.4.1 Meta-data and HTTP headers

HTTP requests in email can be generalized in the following form:

    GET request-URL *(request-header)
Figure 2.5: Some examples of image spam: Each email is structured as one image. The textual contents are part of the image. The image is taken from the study by Ketari et al., A Study of Image Spam Filtering Techniques [44].

The GET method indicates a retrieval request, request-URL is the address of the remote resource, and request-header stands for zero or more HTTP headers that the MUA uses to include additional information. Some HTTP headers are:

• User-Agent: A header containing operating system and MUA specifications like vendor and version.

• Cookie: A header containing information previously set by the server.

• Referer: A header carrying the address of the page from which the HTTP request was made.

• Date: A header indicating the time and date at which the request was made (according to the client's machine).

HTTP is an application layer protocol which depends on lower-layer protocols such as TCP/IP. These protocols also carry meta-data, for instance the IP address, ports and packet size.

Privacy considerations

HTTP headers and meta-data can be used to obtain additional information about the recipient. At the TCP/IP level, the IP address conveys information about the approximate location and timezone of the recipient [41, 74]. The HTTP headers that are sent in the request reveal identifying information. The User-Agent
header, when combined with other meta-data, can contribute to uniquely identifying the recipient [79]. When using a web-based MUA, Referer and Cookie headers might be sent along with the request, which could compromise the privacy of the recipient. The Referer header, as specified in RFC 7231, helps servers to identify the source of their traffic and allows user agents to generate back links [32]. When using a web MUA this header might point to the URL of the web mail, which might contain session information [32, 26]. The Cookie header was originally designed to carry user identification information. When it is sent along with an HTTP request, the server can associate the request with previous requests of the same user on the web. Figure 2.6 contains an example of an HTTP request for an image made by a web MUA. The request contains Cookie, User-Agent, and Referer headers.

Figure 2.6: Loading an image through Outlook web client. The request contains Cookie, User-Agent, and Referer headers that can be used by the server for identifying the user.

Countermeasures

Blocking remote contents: Most email agents can be customized to block HTTP contents in an email. In MUA terminology this is referred to as blocking remote contents. This countermeasure blocks requests that are made by the rendering engine upon loading a page. What is considered to be the set of remote contents varies between MUAs. For example, Thunderbird's blocking of remote contents disables automatic loading of external CSS files and images [17], but in order to disable links other settings must be changed (see footnote 1). Gmail does not have any setting for disabling links, but it can be customized to block images, while external CSS is strictly prohibited in Gmail [17]. When users explicitly decide to load HTTP contents, either by clicking on links or by enabling remote contents, this countermeasure is no longer effective.

Content proxies: This is a countermeasure deployed by Google in Gmail [38].
With a proxy, the request for remote contents uses the proxy's properties instead of those of the user's MUA. In this setting meta-data such as the user's IP address and timezone are protected. The HTTP headers are also protected, since the proxy does not have access to the user's web browsing cookies. However, content proxies still leak the time of opening an email since they do not pre-fetch contents. They do not cache the images or change the caching policy of the response either, so each time a user opens an email a new request will potentially be made.

Footnote 1: The value of network.protocol-handler.external-default must be set to false in the preference file.

2.4.2 HTTP request: Personalized URL

URLs in email can have identifying tokens embedded in them. Based on these tokens, the sender can relate a request to one specific recipient. The study by Englehardt et al. [26] revealed that in their corpus of newsletter emails, 29% of emails had at least one link with personalized tokens. They had a predefined set of values for the personalized tokens of each recipient and searched links for instances of these values. These personalized tokens were considered to be PII and were limited to either the recipient's email address or some hashing and encoding schemes applied to the email address. Formula 2.1 shows the encoding schemes that were used in their work. Figure 2.7 shows two examples of URLs with such tracking tokens. In Figure 2.7 (a), the sender used the recipient's email address as the user-identifying token, and in Figure 2.7 (b), as the name of the query parameter suggests, a hash of the email address has been used (in this case MD5).

\[
\begin{aligned}
PII_0 &= e \\
PII_1 &= E(e) \\
PII_2 &= H(e) \\
PII_3 &= H(E(e)) \\
PII_4 &= E(H(e)) \\
PII &\in \{PII_0, PII_1, PII_2, PII_3, PII_4\}
\end{aligned}
\tag{2.1}
\]

where e is the recipient email address, H ∈ {MD5, SHA1, SHA256, SHA384, ...} is a hash function and E ∈ {URL encoding, Base64, Base32, ...} is an encoding.

Figure 2.7: Examples of images with user identifying tracking tokens inside emails. (a) The email address is used as a query parameter. (b) The MD5 digest of the email address is used as a query parameter.
Since the authors considered these tokens as instances of PII, their focus was on leakage of these tokens to third parties.
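To make Formula 2.1 concrete, the following Python sketch enumerates candidate tokens for one email address and searches a URL for them. It is an illustration under stated assumptions, not the procedure used by Englehardt et al.: hashes are taken as lowercase hexadecimal digests, encodings are applied to UTF-8 text, only the four hashes and three encodings named in the formula are covered, and the function names and example URLs are hypothetical.

```python
import base64
import hashlib
from urllib.parse import quote, unquote

HASHES = ("md5", "sha1", "sha256", "sha384")            # the set H
ENCODERS = {                                             # the set E
    "urlencode": lambda s: quote(s, safe=""),
    "base64": lambda s: base64.b64encode(s.encode()).decode(),
    "base32": lambda s: base64.b32encode(s.encode()).decode(),
}


def hash_hex(name: str, text: str) -> str:
    """Assumption: H(x) is the lowercase hex digest of the UTF-8 text."""
    return hashlib.new(name, text.encode()).hexdigest()


def candidate_tokens(email: str) -> dict:
    """Map a label such as 'md5(e)' to the token value it denotes."""
    tokens = {"e": email}                                           # PII_0
    for enc_name, encode in ENCODERS.items():
        tokens[f"{enc_name}(e)"] = encode(email)                    # PII_1
    for hash_name in HASHES:
        tokens[f"{hash_name}(e)"] = hash_hex(hash_name, email)      # PII_2
        for enc_name, encode in ENCODERS.items():
            tokens[f"{hash_name}({enc_name}(e))"] = hash_hex(
                hash_name, encode(email))                           # PII_3
            tokens[f"{enc_name}({hash_name}(e))"] = encode(
                hash_hex(hash_name, email))                         # PII_4
    return tokens


def find_pii_tokens(url: str, email: str) -> list:
    """Return the labels of all candidate tokens that occur in the URL."""
    haystacks = (url, unquote(url))   # also look at the percent-decoded URL
    return sorted(label for label, token in candidate_tokens(email).items()
                  if any(token in haystack for haystack in haystacks))


# Hypothetical recipient and URLs, in the spirit of Figure 2.7.
EMAIL = "alice@example.com"
url_a = "https://t.example/o.gif?u=alice%40example.com"
url_b = "https://t.example/o.gif?md5=" + hash_hex("md5", EMAIL)
print(find_pii_tokens(url_a, EMAIL))
print(find_pii_tokens(url_b, EMAIL))
```

A token can match under more than one label, because URL-encoding a hexadecimal digest leaves it unchanged; a search of this kind also finds only tokens derived from the email address, not arbitrary per-recipient identifiers.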
Countermeasure

Request blocking: In addition to blocking remote contents, another countermeasure for preventing the leakage of PII tokens is URL filtering. Englehardt et al. demonstrated that using ad-blocker and tracking-blocker extensions roughly halves the number of third parties that receive PII tokens [26].

2.5 Rendering HTML emails

HTTP requests in email can be categorized into two main types: explicit requests that are made upon certain user interactions, and implicit requests that are made by the MUA. In terms of information leakage and privacy impact both kinds of request are the same. However, implicit requests are more hazardous since they are made without user involvement, as soon as the client renders an email.

In order to display HTML emails, modern MUAs take two steps: preprocessing and HTML rendering [67, 13]. In the preprocessing step, based on the MUA's policy, some HTML tags are removed (HTML stripping) and certain elements are overwritten (HTML overwriting). HTML stripping removes HTML tags that enable serious attacks in email. For example, the <script> tag is removed because of the Reaper vulnerability, which leads to wiretapping of emails. Data URIs are removed in some cases since they can be used to launch phishing attacks [47]. Figure 2.8 demonstrates such a phishing attack on Gmail in 2017 [55]. Using a data URI the attacker encodes the HTML code of a phishing site into the email. When the user opens this email, the embedded code is rendered and can be used to steal sensitive information. To prevent this attack, mail clients such as Gmail and Yahoo! (web and mobile, on both iOS and Android) strip data URIs from email [17]. The HTML overwriting step implements the MUA's functionality for blocking remote contents. The client overwrites the HTML properties of the contents that it blocks in such a way that the rendering engine does not request them (see Figure 2.9).

In the rendering step the HTML part is interpreted into visual elements.
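The preprocessing step described above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation of any particular MUA: it strips script elements and data URI images, and it blocks remote images by renaming their src attribute. The attribute name data-blocked-src and the class and function names are assumptions made for this example; real clients use their own overwriting schemes (compare Figure 2.9).

```python
from html import escape
from html.parser import HTMLParser


class EmailPreprocessor(HTMLParser):
    """Re-emit an HTML email while stripping scripts and blocking remote images.

    Comments and doctype declarations are dropped to keep the sketch short.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.output = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":              # HTML stripping: drop the element
            self.in_script = True
            return
        rendered = []
        for name, value in attrs:
            value = value or ""
            scheme = value.split(":", 1)[0].strip().lower()
            if tag == "img" and name == "src":
                if scheme == "data":     # HTML stripping: remove data URIs
                    continue
                if scheme in ("http", "https"):
                    # HTML overwriting: rename the attribute so that the
                    # rendering engine no longer requests the image.
                    name = "data-blocked-src"   # hypothetical attribute name
            rendered.append(f' {name}="{escape(value, quote=True)}"')
        self.output.append(f"<{tag}{''.join(rendered)}>")

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self.handle_starttag(tag, attrs)    # emit <img .../> as <img ...>

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.output.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:           # script bodies are discarded
            self.output.append(data)

    def handle_entityref(self, name):
        self.output.append(f"&{name};")

    def handle_charref(self, name):
        self.output.append(f"&#{name};")


def preprocess(html: str) -> str:
    parser = EmailPreprocessor()
    parser.feed(html)
    parser.close()
    return "".join(parser.output)


# Hypothetical email body used only to exercise the preprocessor.
RAW = ('<p>Hi &amp; welcome</p><script>alert(1)</script>'
       '<img src="https://t.example/p.gif?u=1" alt="banner">'
       '<img src="cid:logo@sender.example">'
       '<img src="data:image/gif;base64,R0lGODlhAQABAAAAACw=" alt="inline">')
cleaned = preprocess(RAW)
print(cleaned)
```

The CID image is left untouched because loading it does not cause an HTTP request, while the remote image keeps its URL under a different attribute name so that the client could restore it if the user chooses to load remote contents.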
The MUA's choice of how to deploy the rendering engine affects the information leakage of HTTP requests. Depending on this choice, MUAs can be categorized into local and web-based types. A local MUA comes with its own rendering engine, while a web-based MUA uses the web browser's engine for rendering HTML emails. With a web-based MUA like Gmail (https://mail.google.com/mail/u/0/#inbox), the email becomes part of Gmail's web page in order to be displayed. If the email itself has an HTML part, this HTML code is inserted inside the web mail's HTML code. The browser cannot distinguish that these two pieces of HTML code come from different sources. As a result, the requests for contents inside the email include HTTP headers like Cookie. Local MUAs use a separate rendering engine which does not have access to the user's web browsing information. Hence HTTP requests made from a local MUA cannot include the user's web cookies.
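The difference between the two kinds of MUA can be seen from the server's side. The following Python sketch parses two hand-written HTTP requests for the same image and lists the identifying headers each one carries. Both requests are made-up examples in the spirit of Figure 2.6, and the host names, header values and function names are assumptions; which headers a real client sends depends on its configuration and on the browser's cookie policy.

```python
TRACKING_HEADERS = ("user-agent", "cookie", "referer")


def parse_request(raw: str):
    """Split a raw HTTP/1.1 request into method, target and a header dict."""
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, target, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, headers


def identifying_headers(raw: str) -> dict:
    """Return the headers of a request that help identify the user."""
    _, _, headers = parse_request(raw)
    return {name: headers[name] for name in TRACKING_HEADERS if name in headers}


# Hypothetical request as a browser might send it for a web-based MUA.
WEB_MUA_REQUEST = (
    "GET /pixel.gif?t=abc123 HTTP/1.1\r\n"
    "Host: tracker.example\r\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0\r\n"
    "Referer: https://webmail.example/inbox\r\n"
    "Cookie: uid=42\r\n"
    "\r\n"
)
# Hypothetical request from a local MUA with its own rendering engine.
LOCAL_MUA_REQUEST = (
    "GET /pixel.gif?t=abc123 HTTP/1.1\r\n"
    "Host: tracker.example\r\n"
    "User-Agent: ExampleMail/3.2\r\n"
    "\r\n"
)

print(identifying_headers(WEB_MUA_REQUEST))
print(identifying_headers(LOCAL_MUA_REQUEST))
```

In both cases the request URL itself, including any token in its query string, still reaches the server, which is why header protection alone does not prevent tracking through personalized URLs.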
Figure 2.8: Using a data URI, when the recipient opens the phishing email a fake Gmail login page is shown. Image is taken from a blog post by Mark Maunder [55].

Figure 2.9: HTML overwriting: When remote images are blocked in Outlook web, the src attribute of a remote image is overwritten.

2.6 Commercial Newsletter Emails

To better understand email tracking we study the email tracking functionality that is currently provided as a service. Companies that send out newsletter emails often use an Email Service Provider (ESP) for delivering emails and managing their mailing lists. Mailchimp (https://mailchimp.com/) is an example of a well-known ESP. Besides offering services like
email templates, list management and email delivery, ESPs also track each campaign and give reports to their customers. The following list contains the different aspects of an email message that major ESPs track and report to their customers [28]:

1. Email delivery and bounce rates.
2. Open rate.
3. Click-through rate.
4. Opt-out rates.
5. Spam complaints.
6. Meta-data information (IP, timezone and devices).
7. Users who forward email.

Email delivery and bounce rate: Marketers monitor the delivery status of each campaign and the placement of their emails in users' inboxes [63]. ESPs report cases where emails fail to reach their destination inboxes and divide the failures into two categories: soft bounces and hard bounces [52, 77]. Hard bounces happen when there are technical problems that prevent email delivery. The reasons include the mail server being down, typos in the email address or network problems [52, 77]. In such scenarios the sending mail server fails to deliver the email and might receive, from the recipient's mail server or from a transport system, a bounce message with a status code that explains the problem, like those described in RFC 3463 [75].

Open rate and click-through rate: An HTTP read receipt is one of the implications of email tracking. If the MUA loads remote resources as soon as the recipient reads a commercial email, the request can operate as a read receipt for the sender. We elaborate on the properties of requests that can operate as a read receipt in Chapter 3. ESPs use requests for remote images as an intermediary to obtain open and click-through rates. The click-through rate indicates how successful a campaign is in terms of engaging its recipients [28]. ESPs store the links that each recipient has clicked on, which enables senders to infer the user's interests and use them for creating personalized offers.
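How requests can serve as open and click reports may be easier to see in code. The following Python sketch is a hypothetical model of an ESP's bookkeeping and does not describe any real provider: the domain esp.example, the paths /o.gif and /c, the query parameter names and the class CampaignTracker are all assumptions. It issues one opaque token per recipient and campaign, embeds the token in an image URL and in rewritten link URLs, and maps an incoming request back to the recipient.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit


class CampaignTracker:
    """Toy model of per-recipient open and click tracking (hypothetical)."""

    def __init__(self, base: str = "https://esp.example"):
        self.base = base
        self.tokens = {}    # token -> (recipient, campaign)
        self.events = []    # one entry per observed request

    def token_for(self, recipient: str, campaign: str) -> str:
        token = secrets.token_urlsafe(12)
        self.tokens[token] = (recipient, campaign)
        return token

    def pixel_url(self, token: str) -> str:
        """URL of an image that is requested when the email is rendered."""
        return f"{self.base}/o.gif?{urlencode({'t': token})}"

    def click_url(self, token: str, target: str) -> str:
        """Rewritten link that passes through the ESP before the target."""
        return f"{self.base}/c?{urlencode({'t': token, 'to': target})}"

    def record(self, requested_url: str) -> dict:
        """Resolve an incoming request to a recipient, campaign and action."""
        parts = urlsplit(requested_url)
        query = parse_qs(parts.query)
        recipient, campaign = self.tokens[query["t"][0]]
        event = {
            "kind": "open" if parts.path == "/o.gif" else "click",
            "recipient": recipient,
            "campaign": campaign,
            "target": query.get("to", [None])[0],
        }
        self.events.append(event)
        return event


# Hypothetical recipient and campaign.
tracker = CampaignTracker()
token = tracker.token_for("alice@recipient.example", "spring-sale")
opened = tracker.record(tracker.pixel_url(token))
clicked = tracker.record(
    tracker.click_url(token, "https://shop.example/item?id=7"))
print(opened)
print(clicked)
```

The token in this sketch is random and carries no email address, yet it still identifies one recipient and one campaign to whoever holds the lookup table.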
Opt-out rates: Anti-spam regulations and online authorities require marketing emails to provide an opt-out option for recipients [15]. The CAN-SPAM Act is an example of such a regulation that is currently in place in the US [35]. There are two opt-out options in marketing emails: an unsubscribe link and the List-Unsubscribe header. With an unsubscribe link, the user locates and clicks on a link that is included in the newsletter email. The second method for opting out is the List-Unsubscribe header, as specified in RFC 2369 [15]. Both methods serve the same purpose, but List-Unsubscribe is usually intended to be used by mail client software, which uses this header to provide an unsubscribe button in its user interface. An
example of such a button is shown in Figure 2.10. Content providers track opt-out rates to improve their future emails and to have fewer users unsubscribe from their mailing list.

Figure 2.10: Gmail unsubscribe button: The image is taken from the official Google+ post announcing the unsubscribe button in 2014 [39].

Spam complaints: MUAs often have a spam report button that users can use to mark emails as spam. From such a report the mailbox provider learns of a human-identified spam email that its spam filters failed to catch. To fight spam globally, the mailbox provider sends this information to other ISPs and mailbox providers through a so-called Complaint Feedback Loop [31]. As stated in RFC 6449: “Senders of bulk, transactional, social, or other types of email can also use this feedback to adjust their mailing practices, using Spam Complaints as an indicator of whether the Recipient wishes to continue receiving email” [31]. So in addition to ISPs, senders might also receive this feedback information. In this way they can learn what kind of content a user identifies as spam or junk and improve their emails. Sources like [28] suggest that the ESP receives only aggregated information about the number of complaints per campaign. Regarding the information contained in this feedback report, RFC 6449 states:

“[...] the Recipient’s or reporter’s Email Address and IP address may be categorized as private data and removed from the feedback report that is provided to the Feedback Consumer. Privacy laws and corporate data classification standards should be consulted when determining what information should be considered private.” [31]

Looking at the feedback loop policies of mailbox providers, AOL and Microsoft redact the user’s email address from the abuse reports that they send to ESPs or registered email senders, but they keep the complaint message intact [25, 24].
As we already discussed in the Introduction chapter, newsletter emails contain tracking tokens, and hence ESPs can use these tokens to find the email address of the user who made the spam complaint. Some mailbox providers send the report in full detail, without modifications such as removing the email address [80]. Gmail only sends feedback loops to a limited number of ESPs [37]. In addition, to receive feedback loops from Gmail, the email should contain a special feedback-id header. This header contains a unique sender identifier and three other optional fields that the sender can use to embed
identifiers of their choice (Google’s example suggestions were campaign and customer identifiers). Figure 2.11 shows an example of a feedback-id header that was used in one of the emails in our corpus. When a user reports an email as spam in Gmail, Gmail uses this header to send the feedback report to the ESP that sent that email.

Figure 2.11: An example feedback-id header in our corpus.

Most users are not aware that when they hit the report-as-spam button, such information is propagated to different parties, potentially with their email address attached to it.

Meta data: ESPs use meta-data information to report the approximate location of each recipient and the devices and software they use for reading emails. Marketers use this information to customize their campaigns for different platforms. The time at which each recipient reads the newsletter email is also monitored. ESPs provide services to deliver emails according to the individual recipient’s time zone [12, 19].

Forward information: Marketers can use the ESP’s services to include a link inside each email that the user can use to forward the email to friends. The forwarded email is sent via the ESP and contains a link that points to the web-hosted version of the same email. The ESP does not add the new recipient’s email address to the mailing list, but it does report the users who forward the email through the forward-to-a-friend link [28, 2]. Figure 2.12 shows the scenario of forwarding an email using a forward-to-a-friend link.

Figure 2.12: Following the forward-to-a-friend link for one of the emails in our corpus.
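The feedback-id header described above can be read with Python's standard email module. The following is a minimal sketch, assuming the colon-separated layout in which the sender identifier is the last field and up to three optional fields precede it; the header value, the addresses and the function name parse_feedback_id are hypothetical and are not taken from our corpus.

```python
from email import message_from_string


def parse_feedback_id(raw_email):
    """Split a Feedback-ID header into its sender identifier and optional fields.

    Assumption: the value is colon separated and the sender identifier
    is the last field, preceded by up to three optional fields.
    """
    message = message_from_string(raw_email)
    value = message.get("Feedback-ID")  # header lookup is case-insensitive
    if value is None:
        return None
    fields = [field.strip() for field in value.split(":")]
    return {"sender_id": fields[-1], "optional_fields": fields[:-1]}


# Hypothetical newsletter email; every value below is made up for illustration.
raw_email = (
    "From: news@example.com\n"
    "To: alice@example.org\n"
    "Feedback-ID: campaign42:customer7:weekly:examplesender\n"
    "Subject: Weekly offers\n"
    "\n"
    "Hello\n"
)

print(parse_feedback_id(raw_email))
```

If a sender places a per-recipient value in one of the optional fields, the feedback report can be linked back to an individual subscriber even when the mailbox provider redacts the email address.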
2.6.1 Newsletters Analytic Features

Users register for newsletter emails by filling in a subscription form. Before receiving newsletter emails, they receive an email with a confirmation link that has to be clicked for the subscription to be finalized. By clicking on this link, subscribers give their consent to receive subsequent emails from this sender.

URL parameters: Marketers use different mediums when they promote a campaign, and they often want to compare the effectiveness of each medium in attracting users to the promoted campaign. Take the example of a website that wants to promote a specific product. This product has a page on the website (https://www.example.com/greatProduct). The company promotes this product on its Facebook page, in its weekly newsletter emails, in banners on its website, and through online advertisements. Some users will purchase the product, and the website wants to know which campaign resulted in each purchase. To collect this kind of information, the company embeds a set of parameters in the URL that points to the product in the different mediums. There is a common set of query parameters for this purpose, called Urchin Tracking Module or Urchin Traffic Monitor (UTM) parameters. UTM parameters were introduced by, and are used by, Google Analytics [40]. There are five query parameters that can be added to a URL. Here we explain the three mandatory parameters:

1. utm_source: This parameter identifies the entity (advertisement) that initiated the click.
2. utm_medium: This parameter identifies the medium.
3. utm_campaign: This parameter identifies the name of the campaign.

When a user lands on the promoted web page, the query parameters are sent to Google Analytics for reporting purposes [40].

A/B Testing: Marketing emails are subject to A/B testing, in which senders build different variations of a single campaign and compare users’ interaction with the different versions to see which variation performs better.
The motivation behind A/B testing is simple: marketers want to make sure that each email they send triggers maximum user engagement, so they build variations of a single campaign and test different settings. Some ESPs like MailChimp provide A/B testing as a service to their customers [1]. The result of A/B testing on emails is that not all users receive the same version of an email. Users subscribed to the same newsletter might therefore get emails with slightly different content, subject lines or HTML templates.
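To make the UTM parameters of this section concrete, the following sketch tags the example product URL with the three mandatory parameters and reads them back, as an analytics script on the landing page could. The parameter values and the helper names add_utm and extract_utm are illustrative assumptions; only the parameter names and the example URL come from the text.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def add_utm(url, source, medium, campaign):
    """Return the URL with the three mandatory UTM parameters appended."""
    parts = urlparse(url)
    query = parse_qs(parts.query)  # keep any query parameters already present
    query.update({
        "utm_source": [source],
        "utm_medium": [medium],
        "utm_campaign": [campaign],
    })
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def extract_utm(url):
    """Return the UTM parameters found in a landing-page URL."""
    query = parse_qs(urlparse(url).query)
    return {name: values[0] for name, values in query.items()
            if name.startswith("utm_")}


# Hypothetical values for a link placed in the weekly newsletter.
tagged = add_utm("https://www.example.com/greatProduct",
                 "weekly_newsletter", "email", "great_product_launch")
print(tagged)
print(extract_utm(tagged))
```

Using a different utm_medium value for the Facebook page, the banners and the newsletter lets the website attribute each purchase to the medium that produced the click.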
2.7 Conclusion

With HTML emails being the root cause of email tracking, in this chapter we outlined the email protocols that enable sending HTML emails. Email tracking is tightly related to the HTTP requests that are made from HTML emails. We explained the type of information that can be obtained from these requests and the existing countermeasures. We described the different HTTP contents that can exist in an HTML email and their alternatives, and we discussed that remote images and links do not have a straightforward offline alternative. We elaborated on the HTML rendering process of MUAs and discussed how the choice of rendering engine can affect the information leaked by HTTP requests. To expand our understanding of commonly used email tracking methods, we compiled a list of the analytic services that ESPs provide to their customers.
Chapter 3

Problem Statement and Methodology

3.1 Introduction

In the previous chapters, we recognized email tracking as the result of HTTP requests. We broke down the information in an HTTP request into meta-data, HTTP headers and personalized URL tokens, and we discussed the effectiveness of existing countermeasures in blocking these parts. Summarizing the email tracking methods discussed in the previous chapter, we notice two different trends in email tracking. On the one hand, commercial senders actively monitor user engagement with their emails. In their commercial interest they want to increase user interaction with their emails, and they deploy email tracking to obtain analytic metrics. On the other hand, the paper by Englehardt et al. demonstrated how common it is for third parties to receive PII about recipients through email tracking. In this chapter we elaborate on these two privacy concerns of email tracking.

3.2 Email tracking for senders

Newsletter emails are reported to bring the highest return on investment for senders. Senders therefore try to maximize engagement, for example by sending emails at a time at which users are more likely to open them. One particular company, Phrasee (https://phrasee.co/), offers machine-generated subject lines that are promised to boost email open rates. Senders obtain engagement information through the HTTP requests that are made for remote contents in their emails. HTTP requests for resources that are included in emails are assumed to provide information about user interactions with emails. In this setting, HTTP requests for remote contents can work as a read receipt for the sender. However, an HTTP request can work as a read receipt only when it contains information about both the recipient (who) and the email that was read (which email). While
obtaining user-identifying information from HTTP requests has been discussed in previous works, methods for obtaining email-identifying information from an HTTP request have not been studied.

Figure 3.1: Eve sends her emails with personalized remote images; however, she uses the same image in her emails.

To better illustrate the importance of email identifiers for an HTTP read receipt, we use an example scenario. Eve has sent two emails, e1 and e2, to Alice and Bob. Eve is interested to know whether and when Alice and Bob read her emails. For this reason she has inserted a remote beacon image i in her emails with the URL structure http://eve.com/i/?user= (Figure 3.1, a). Eve personalizes the user query parameter of this URL based on the recipient (Figure 3.1, b and c). For reading emails, Alice and Bob use an email client like Gmail, which loads external images by default but uses a content proxy. Alice opens e1 and her email client makes a request for image i. Eve notices this request and wants to associate it with one recipient (Alice or Bob) and one email (e1 or e2) (Figure 3.1, d):

• (who) Eve uses the personalized token in the URL of the request to find out that Alice has made the request (meta-data and HTTP headers cannot be used, since the request is made by the proxy).
• (which email) Eve cannot determine which of her emails has been read by Alice, since Eve has used the same image in both emails. However, if Eve uses different images in e1 and e2 (Figure 3.2), she can identify the email based on the image that is being requested.
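The two attribution steps of this scenario can be sketched from Eve's point of view as follows. This is an illustration under stated assumptions rather than a description of a real tracker: the token value alice, the image paths i1 and i2, and the function name attribute_request are hypothetical, and the mapping from images to emails stands for Eve's own record of what she embedded in each email.

```python
from urllib.parse import parse_qs, urlparse


def attribute_request(url, image_to_emails):
    """Return (recipient token, candidate emails) for a request Eve observes."""
    parts = urlparse(url)
    # Who: the personalized token survives the content proxy because it is
    # part of the URL, unlike the client IP address or the HTTP headers.
    user = parse_qs(parts.query).get("user", [None])[0]
    # Which email: the requested image narrows down the candidate emails.
    image = parts.path.strip("/").split("/")[-1]
    return user, sorted(image_to_emails.get(image, []))


# Figure 3.1: the same image i is embedded in both emails.
shared_image = {"i": {"e1", "e2"}}
# Figure 3.2: each email embeds its own image.
distinct_images = {"i1": {"e1"}, "i2": {"e2"}}

print(attribute_request("http://eve.com/i/?user=alice", shared_image))
print(attribute_request("http://eve.com/i1/?user=alice", distinct_images))
```

With the shared image, both e1 and e2 remain candidates, so the request is not a complete read receipt. With distinct images, the request identifies both Alice and the email she opened.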
Figure 3.2: Eve sends her emails with personalized remote images; this time she uses different images in her emails.

With this example we want to clarify that, among all HTTP requests, we consider requests that can potentially identify both the user and the email to be a privacy hazard. Senders can obtain message-identifying information from HTTP requests if they include unique resources (or unique URLs) in their emails. The request for these unique resources identifies the email that is being opened. User-identifying information can be obtained from meta-data (IP), HTTP headers (Cookie) or URL tokens (recipient email address as a token). If we consider the way images are loaded by MUAs, we can further relax the constraint on the uniqueness of resources. We suggest that even with one unique image per email, senders can obtain their email identifier. We leverage the request-all or block-all policy of email clients with regard to remote images: an email client either blocks all the images or loads all the images in an email. Under this policy, with only one unique image in an email, the sender can learn that the email is read whenever the email client loads images. In the context of our example, when Alice decides to load images in e1, there is no way she can prevent the request for i1. In order to see how often senders can obtain email-identifying information from HTTP requests, we take the following steps:

• For emails in one inbox, get all emails that are sent from the same sender.
• For emails in this set, extract all external images that use the domain of the sender.
• If for each email there is at least one image that has not been used in other emails, the sender can use the request for external images to obtain information about the email being read.
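The steps above can be sketched with Python's standard HTML parser. This is a minimal illustration, not the implementation used for the results in Chapter 4. It assumes that the emails of one sender are already grouped in a dictionary, that two images are the same when their full URLs are equal, and that subdomains count as the sender domain; the function names and sample emails are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageCollector(HTMLParser):
    """Collect the src URLs of remote (http/https) images in an HTML body."""

    def __init__(self):
        super().__init__()
        self.sources = set()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and urlparse(src).scheme in ("http", "https"):
                self.sources.add(src)


def sender_images(html, sender_domain):
    """Return the remote images of one email that are hosted on the sender domain."""
    collector = ImageCollector()
    collector.feed(html)
    hosted = set()
    for src in collector.sources:
        host = urlparse(src).hostname or ""
        if host == sender_domain or host.endswith("." + sender_domain):
            hosted.add(src)
    return hosted


def emails_with_unique_image(emails, sender_domain):
    """Map each email id to True if it has an image used in no other email."""
    per_email = {email_id: sender_images(html, sender_domain)
                 for email_id, html in emails.items()}
    usage = Counter(url for urls in per_email.values() for url in urls)
    return {email_id: any(usage[url] == 1 for url in urls)
            for email_id, urls in per_email.items()}


# Hypothetical emails from eve.com found in one inbox.
emails = {
    "e1": '<img src="http://eve.com/logo.png"><img src="http://eve.com/i1.png">',
    "e2": '<img src="http://eve.com/logo.png">',
}
print(emails_with_unique_image(emails, "eve.com"))
```

The sender can obtain an email identifier for every message when all values in the returned dictionary are true.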
3.3 Email tracking for third parties

As we discussed in the previous chapter, Englehardt et al. claim that PII URL tokens that reach third-party domains raise privacy concerns [26]. We build on this claim in two ways:

1. While they claim that leakage of URL tokens to unauthorized parties happens when the request reaches a third-party domain, we find a specific case in which tokens in the URL reach an unauthorized party through the sender domain (first party).
2. We apply this claim more widely in this thesis by:
(a) Highlighting the privacy concerns of personalized tokens in URLs that are not considered PII.
(b) Demonstrating that personalized URLs can pose a privacy risk even when they only reach the sender of an email.

3.3.1 Leaking through sender domain

LiveIntent (footnote 2) is a Supply Side Platform (SSP) that enables publishers to earn revenue by managing their advertising space inside emails [51]. LiveIntent is among the trackers that received the highest number of PII tokens in the paper by Englehardt et al. [26]. The advertisements are served through a so-called LiveTag, which is a clickable remote image. The domain that is used in the URLs of a LiveTag belongs to the first party (the email sender’s domain). To serve advertisements through a LiveTag, senders dedicate a subdomain to LiveIntent through a DNS CNAME setting. This subdomain is hosted by LiveIntent and redirects to LiveIntent contents. Figure 3.3 shows a LiveIntent advertisement in an email and its LiveTag HTML, which uses the sender domain. For LiveIntent advertisements, the domain-based distinction between first party and third party does not work: although the URL of the advertisement is on a subdomain of the first party, it actually serves third-party contents.

3.3.2 Generalizing Methods: URL tokens that are not PII

In their paper, Englehardt et al. argued that hashing and encoding schemes applied to an email address do not provide enough privacy protection.
They still considered these transformations of an email address to be PII. Indeed, we find a later work supporting the claim that hashing PII is not enough to preserve its privacy [23]. However, whether an email address itself is PII is debatable. Some sources consider an email address to be PII [56, 69], while other research mentions regulations that explicitly exclude email addresses from being PII [60].

Footnote 2: https://www.liveintent.com/
Figure 3.3: A LiveIntent advertisement in an email coming from stltoday.com and its HTML code snippet. The advertisement URLs use the sender domain.

We argue that even if we do not consider these tokens to be PII, the privacy risks of personalized URLs remain the same. Personalized tokens provide additional information to the third parties that are involved in loading the resource. Besides being a transformation of an email address, personalized tokens can take other forms. We find one marketing platform, Blueshift (footnote 3), that uses a randomly generated string as the user identifier in its marketing emails. Figure 3.4 is an example of an image in an email that is sent through Blueshift. The URL is personalized by a string in Universally Unique Identifier (UUID) format, and this personalized URL token is generated from a random number. RFC 4122 and X.667 describe guidelines and recommendations for UUID generation [49, 14]. A UUID can be generated in three ways: name-based, time-based and pseudo-random (also called random-number-based). The name-based version uses a globally unambiguous name like an email address, the time-based version uses the system clock, and the pseudo-random version uses a cryptographic random number generator to generate the UUID value. The 13th hexadecimal digit of a UUID string, not counting hyphens (bits 7 to 4 of octet 9), indicates to which version the string belongs. Table 3 in X.667 [14] maps this value to a UUID version. In Figure 3.4, the 13th hexadecimal digit is 4, which indicates that this UUID was generated in the pseudo-random format.

Although the token in Figure 3.4 is based on a random number, we claim that the sender can associate this value with a particular recipient. To support this claim, we verified that the value of the uid parameter differs among distinct recipients that receive the same email. The tokens in Figure 2.7 in the previous chapter and in Figure 3.4 are used for the same purpose.
Using this token, the sender can associate the HTTP request with a particular recipient.

Footnote 3: https://blueshift.com/
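The version check described above can be reproduced with Python's standard uuid module. The sketch below uses a made-up UUID, not the token from Figure 3.4, and the helper names are hypothetical. It reads the version both from the parsed UUID and directly from the 13th hexadecimal digit of the string.

```python
import uuid


def uuid_version(token):
    """Return the UUID version of a token, or None if it is not a valid UUID."""
    try:
        return uuid.UUID(token).version
    except ValueError:
        return None


def thirteenth_hex_digit(token):
    """Return the 13th hexadecimal digit of a UUID string, ignoring hyphens."""
    return token.replace("-", "")[12]


# Made-up pseudo-random UUID used as a stand-in for a tracking token.
sample_token = "1b4e28ba-2fa1-41d2-883f-0016d3cca427"

print(uuid_version(sample_token))
print(thirteenth_hex_digit(sample_token))

# A name-based UUID derived from a hypothetical email address, for comparison.
name_based = uuid.uuid5(uuid.NAMESPACE_DNS, "alice@example.com")
print(name_based.version)
```

A token with version 4 cannot be recomputed from the email address by a third party, but as argued above, the sender who generated it can still map it to the recipient.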
Figure 3.4: A URL with a UUID-formatted string as its user-identifying tracking token. The highlighted character indicates that this token is a pseudo-random UUID.

When an HTTP resource is loaded, third parties can receive the tokens in its URL:

• A request for a file (like CSS or an image) in an email can go through a chain of redirects to third parties, and each redirect includes the previous URL in its Referer header [26].
• When the user clicks on a link, in addition to the third parties included by the redirect chain, the landing web page can also embed third parties. As we have seen in the previous chapter in Section 2.6.1, these trackers can receive the URL parameters of a web page.

If these tokens have the properties of an identifier, the third party can use them as a means for persistent tracking. We argue that while trackers (third parties) might not be able to link the user identifiers in a URL to an email address, they can link these tokens to the tracking profile of the recipient. Again we illustrate this argument with an example. In this setting Eve, a powerful (widely spread) online tracker, has compiled a profile on Alice through online tracking (Figure 3.5). Alice receives newsletter emails from example.com. As a third party, Eve is present on all web pages of this site. Alice clicks on a link with a personalized URL and lands on one of the web pages of example.com. Eve receives information regarding this visit (through third-party cookies or JavaScript) along with the URL of the page. Based on the tracking profile she has for Alice, Eve identifies this request. Over time Eve obtains one additional piece of information about Alice: the fact that she is identified by user=@|!$e on example.com (footnote 4). Now let us assume that Alice wants to clear her tracking profile. She changes her device, IP address and browsing software, but she still uses the same email address. She again clicks on a personalized link from example.com.
This time Eve cannot use her conventional online tracking methods to identify Alice, but she uses the user=@lice token to retrieve Alice’s profile (Figure 3.6).

Footnote 4: If Eve has an online partnership with example.com, just to know which query parameter they use for user identification, she can extend her profile from the first visit.

URL tokens that only reach the sender: Even when the URL tokens only reach the first party, they can be used to identify the user for an HTTP read receipt. We expect certain properties of personalized URL parameters. Personalized tokens with these properties can be used by both the sender and the third parties for
tracking purposes. Senders use these tokens to obtain user-identifying information for an HTTP read receipt. Third parties that get involved when the request is loaded can extend their tracking profile for that user with the personalized token as a persistent identifier. The properties that we expect of these tokens are:

• Within a URL, query parameters can hold personalized tokens.
• For emails coming from one sender to a specific recipient, personalized tokens remain the same in different emails.
• For two distinct users receiving the same email, the personalized tokens are different.

Figure 3.5: The personalized URL token is added to the online profile of Alice.

Figure 3.6: Identifying Alice based on web tracking methods is not possible; however, Eve can use the personalized token.

3.4 Identifying Tracking Images

The existing countermeasures do not discriminate between requests for tracking and non-tracking images. Once the user decides to load images, the MUA fetches all images in an email. But the privacy risks and the functionality of tracking images and non-tracking images are not comparable. Web beacons and advertisements are
two types of tracking images that we focus on. Beacons are images of a few pixels in size that are used purely for tracking. They are not visible to the human eye and hence serve no purpose other than tracking. It is reasonable to assume that when users choose to load images, their intention is to load visible images that are part of the email message. Advertisements are another example of images that we assume to serve tracking purposes in email. The question we want to answer here is whether we can find methods for identifying these two types of tracking images and blocking the requests for them.

3.4.1 Identifying Beacons

To identify beacons we can use their small size. It is a recommended best practice to explicitly set the HTML image size attributes [6, 66]. In email this is even more relevant: since some MUAs block images by default, the sender wants to make sure that the email looks proper and the template does not flicker when images are loaded. One method of specifying the image size is through the height and width attributes of the <img> element. With the size properties present, it is possible to identify beacons before loading an image, as any image with a size that is not recognizable by humans. Figure 3.7 shows some images of a few pixels in size, the largest being 10 × 10 pixels. Based on these images, we consider any image of less than 100 pixels (with height and width less than 10 pixels) to be a web beacon.

Figure 3.7: Examples of images of a few pixels, from left to right: 2 × 2, 3 × 3, 4 × 4, 5 × 5 and 10 × 10.

3.4.2 Identifying Advertisements

In Section 3.3.1, we introduced an email advertiser that uses a first-party domain for its advertisements. The existing ad-blocking methods that work based on URL filtering cannot detect such images and prevent them from loading. We propose using the HTML structure of the advertisement element and its URL structure as a method for identifying and blocking it.
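Before turning to the LiveTag structure, the size-based beacon heuristic of Section 3.4.1 can be sketched as follows. This is a minimal illustration that only inspects the width and height attributes of img tags. It assumes that both sides must be strictly smaller than 10 pixels, and that sizes given in CSS or as percentages are ignored; the names and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser

MAX_BEACON_SIDE = 10  # assumed threshold in pixels, applied strictly


def parse_pixels(value):
    """Parse values like '1' or '1px' into an int; return None for anything else."""
    match = re.fullmatch(r"\s*(\d+)\s*(px)?\s*", value or "", re.IGNORECASE)
    return int(match.group(1)) if match else None


class BeaconFinder(HTMLParser):
    """Collect the src of every image whose declared size is a few pixels."""

    def __init__(self):
        super().__init__()
        self.beacons = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        width = parse_pixels(attributes.get("width"))
        height = parse_pixels(attributes.get("height"))
        if width is None or height is None:
            return  # size unknown before loading, so no decision is made
        if width < MAX_BEACON_SIDE and height < MAX_BEACON_SIDE:
            self.beacons.append(attributes.get("src"))


def find_beacons(html):
    finder = BeaconFinder()
    finder.feed(html)
    return finder.beacons


# Hypothetical email body with one beacon and two ordinary images.
sample_html = (
    '<img src="http://eve.com/b.gif" width="1" height="1">'
    '<img src="http://eve.com/banner.png" width="600" height="200">'
    '<img src="http://eve.com/photo.png">'
)
print(find_beacons(sample_html))
```

Because the check runs on the HTML alone, an MUA could apply it before any image is requested and skip only the images it flags.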
Figure 3.8 is an example of an advertisement block. It is called a LiveTag, which is a placeholder for showing advertisements served by LiveIntent (https://www.liveintent.com) in email. URLs that are used in a LiveTag should contain certain query parameters. To identify the user, they should carry the email address of the recipient in the e query parameter or the MD5 hash of the recipient’s email address in the m query parameter. The query parameter p is also required and identifies the sender [50]. If we generalize the HTML structure of a LiveTag, we arrive at the following procedure: Find all <a> elements that have an <img> tag as their immediate child in the HTML structure. Check if the URLs in the src and the href attributes of these elements
contain the corresponding query parameters of a LiveTag, which are p and either e or m.

Figure 3.8: A sample LiveTag, which is used for serving advertisements in email. The image is taken from the LiveIntent page Publisher Onboarding and Tag Implementation [50].

3.5 Identifying HTTP resources in email

To identify personalized URL tokens, we narrowed our focus to links and external images in email. However, there is no previous analysis of the different HTTP resources that can exist in an HTML email. Once we know the different HTTP resources and their prevalence in email, we can focus on minimizing the privacy concerns of the commonly used resources. To identify HTTP resources in an HTML email, we need to determine which parts of an HTML element can potentially lead to an HTTP request by the MUA. If we look at the structure of the example HTML element in Figure 3.9, we can see that not every URL leads to an HTTP request. In this case, clicking on the link results in a request to badsite.com. For all HTML tags we assume that if they include a URL as part of one of their attribute values, then they embed an HTTP resource. However, there are two tags whose text is also processed by the MUA, namely <script> and <style>. As we discussed in the previous chapter, <script> tags are usually removed by the MUA in the preprocessing step. However, many MUAs support the <style> element for including internal CSS in an HTML page [17]. For this reason we also consider a URL in the text of a <style> element to be an HTTP resource. To summarize the method we use for identifying HTTP contents, we traverse the HTML document and for each tag:

• We search the values of all its attributes for a URL.
• If the tag is <style>, we also search its text for a URL.

3.6 Data

One of the properties that we specified for personalized URL tokens is their variability for distinct users receiving the same email. In our dataset we need to have multiple