A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists


Quirin Scheitle¹, Oliver Hohlfeld², Julien Gamba³,⁴, Jonas Jelten¹, Torsten Zimmermann², Stephen D. Strowes⁵, Narseo Vallina-Rodriguez³,⁶

¹ Technical University of Munich (TUM), ² RWTH Aachen University, ³ IMDEA Networks Institute, ⁴ Universidad Carlos III de Madrid, ⁵ RIPE NCC, ⁶ ICSI

arXiv:1805.11506v1 [cs.NI] 29 May 2018

ABSTRACT
   A broad range of research areas including Internet measurement, privacy, and network security rely on lists of target domains to be analysed; researchers make use of target lists for reasons of necessity or efficiency. The popular Alexa list of one million domains is a widely used example. Despite their prevalence in research papers, the soundness of top lists has seldom been questioned by the community: little is known about the lists’ creation, representativity, potential biases, stability, or overlap between lists.
   In this study we survey the extent, nature, and evolution of top lists used by research communities. We assess the structure and stability of these lists, and show that rank manipulation is possible for some lists. We also reproduce the

This leaves researchers using those lists with little to no information about content, stability, biases, evolution and representativity of their contents.
   This work provides an in-depth understanding of such top lists and the scientific ecosystem around them, by analysing Alexa [1], Umbrella [2], and Majestic [3] as prominent examples. We discuss the following characteristics:
   Significance: We conduct a survey of 687 networking-related papers published in 2017, investigating if, and to what extent, these papers use Internet top lists. We find 69 (10.0%) of papers to use a top list (cf., §3).
   Structure: The properties of domains in different top lists, such as a surprising amount of invalid TLDs, a surprisingly low intersections (
Alexa Global Top 1M list [1]. It is generated based on web activity monitored by the Alexa browser plugin*, “directly measured sources” [5] and “over 25,000 different browser extensions” [6] over the past three months [7] from “millions of people” [5]. No information exists on the plugin’s user base, which opens questions on potential biases in terms of, e.g., geography or age of its user base. Alexa lists are offered for sale with few free offerings. The most popular free offering, under no explicit license and briefly discontinued in late 2016, is the list of the Global Top 1M domains. Paid offerings include top lists per country, industry, or region.
   Cisco Umbrella: Another widely used top list is Cisco’s Umbrella top list, a service launched in mid December 2016. This list contains the Top 1M domains (including subdomains) as seen by Cisco’s OpenDNS service [2]. This DNS-based nature is fundamentally different from collecting website visits or links. Hence, the Umbrella list contains Fully Qualified Domain Names (FQDN) for any kind of Internet service, not just web sites as in the case of Alexa or Majestic. Data is provided “free of charge” without explicit license.
   Majestic: The third top list is the Majestic Million [3], which was released in October 2012. This service offers a creative commons licensed Top 1M list based on Majestic’s web crawler. It ranks sites by the number of /24 subnets linking to that site [8]. This is yet another data collection methodology, and similar to Alexa, heavily web-focused. While the Majestic list is currently not widely used in research, we still include it in our study for its orthogonal mechanism, its explicitly open license, and its availability for several years.
   Other Top Lists: There are few other top lists available, but as those are not consistently available or fluctuate in size, we will not investigate them in detail in this paper. Quantcast [9] provides a list of the Top 1M most frequently visited websites per country, measured through their web intelligence plugin on sites. Only the US-based list can be downloaded, all other lists can only be viewed online and hide ranks when not purchased. The Statvoo [10] list provides an API and a download for their Top 1M sites, but has frequently been inaccessible in the months before this publication. Statvoo does not offer insights about the metrics they use in their creation process. Another list for domains is the Chrome UX report [11], which publishes telemetry data about most commonly visited domains by Chrome users. However, domains are not ranked and not kept at a static size, which loses some benefits of other top lists.

3.  SIGNIFICANCE OF TOP LISTS
   Scientific literature often harnesses one or more of the top lists outlined in §2. In order to better understand how often and to what purpose top lists are used by the literature, we survey 687 recent publications.

3.1  Methodology
   We survey papers published at 10 network-related venues in 2017 listed in Table 1. First, we search the 687 papers published at these venues for keywords†, in an automated manner. Next, we inspect matching papers manually, removing false positives (e.g., Amazon’s Alexa home assistant, or an author named Alexander), and papers that mention or reference the lists without actually using them as part of a study.
   Finally, we reviewed the remaining 69 papers (10.0%) that made use of a top list, with various aims in mind: to understand the top lists used (§3.2), the nature of the study and the technologies measured (§3.3), whether the study was dependent on the list for its results (§3.4), and whether the study was possibly replicable (§3.5). Table 1 provides an overview of the results.
   We find the measurements research area, at an average of 23.5%, to be most reliant on top lists, but also other fields such as security (8.5%), systems (6.4%) and web technology (7.9%) use top lists frequently.

3.2  Top Lists Used
   We first investigate which list and what subsets of lists are typically being used; Table 1 provides an overview of the lists that were used in the studies we identified. We find that 29 studies make use of the Alexa Global Top 1M, the most common choice across the publications at the selected venues. However, we also find a surprising variety of subsets of the Alexa Top 1M list being used.
   All papers except one [67] use a list collated by Alexa. This paper instead uses the Umbrella Top 100 list to assess importance of ASes with BGP bursts. The Majestic list was not used in any of the reviewed papers.
   Note also that a study may use multiple distinct subsets of a list. For example, one study uses the Alexa Global Top 1k, 10K, 500K and Top 1M at different stages of the study [59]. We count these as distinct use-cases in Table 1.
   We also find that 59 studies only use Alexa as their source. Ten papers use lists from more than one origin; one paper uses the Alexa Global Top 1M, the Umbrella Top 1M, and various DNS zone files as sources [19]. In total, two studies make use of the Cisco Umbrella Top 1M [19, 65].
   Category and country-specific lists are also being used: eight studies use country-specific lists from Alexa, usually choosing only one country; one study selected 138 countries [27]. Category-based lists are rarer still: two studies made use of category subsets [15, 69].

3.3  Characterisation of Studies
   To show that top lists are used for various types of studies, we looked at the range of topics covered and the technologies measured in our surveyed papers. For each paper we assigned a broad purpose, and the network layer in focus.
   Purposes: For all papers, we reviewed the broad area of study. The largest category we identified encompasses

     * Available for Internet Explorer, Firefox, and Google Chrome.
     † “alexa”, “umbrella”, and “majestic”

Venue             Area           Papers   # using list   %       # dependent    # date?        References
                                                                 Y    V    N    List   Study
ACM IMC           Measurements       42             11   26.2%    8    2    1      1       3   [12–22]
TMA               Measurements       19              4   21.1%    1    1    1      0       0   [23–25]
PAM               Measurements       20              4   20.0%    3    1    0      0       0   [26–29]
Usenix Security   Security           85             12   14.1%    8    4    0      2       0   [30–41]
IEEE S&P          Security           60              5    8.3%    3    2    0      1       1   [42–47]
ACM CCS           Security          151             11    7.3%    4    5    2      1       1   [48–58]
NDSS              Security           68              3    4.4%    2    0    1      0       0   [59–61]
ACM CoNEXT        Systems            40              4   10.0%    2    1    1      0       1   [62–66]
ACM SIGCOMM       Systems            38              3    7.9%    3    0    0      0       0   [67–69]
WWW               Web Tech.         164             13    7.9%   11    1    1      2       3   [70–82]
Total                               687             69   10.0%   45   17    7      7       9

Alexa Global Top . . .
  1M: 29    100k: 2    75k: 1    50k: 2    25k: 2    20k: 1    16k: 1    10k: 11    8k: 1
  5k: 2     1k: 5      500: 8    400: 1    300: 1    200: 1    100: 8    50: 3      10: 1
Alexa Country:   2
Alexa Category:  2
Umbrella 1M:     3
Umbrella 1k:     1

Table 1: Left: Use of top lists at 2017 venues. The ‘dependent’ column indicates whether we deemed the results of the study to
rely on the list used (‘Y’), or that the study relies on a list for verification (‘V’) of other results, or that a list is used but the
outcome doesn’t rely on the specific list selected (‘N’). The ‘date’ column indicates how many papers stated the list download or
the measurement date. Right: Type of lists used in 69 papers from left. Multiple counts for papers using multiple lists.
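
The percentage column of Table 1 follows directly from the two count columns. As a minimal sketch (the names `TABLE1`, `TOTAL`, and `share` are ours and purely illustrative, and rounding to one decimal is an assumption about how the table was produced), the shares can be recomputed as follows:

```python
# Counts copied from Table 1: venue -> (papers published, papers using a top list).
TABLE1 = {
    "ACM IMC": (42, 11),
    "TMA": (19, 4),
    "PAM": (20, 4),
    "Usenix Security": (85, 12),
    "IEEE S&P": (60, 5),
    "ACM CCS": (151, 11),
    "NDSS": (68, 3),
    "ACM CoNEXT": (40, 4),
    "ACM SIGCOMM": (38, 3),
    "WWW": (164, 13),
}

# The "Total" row as printed in Table 1 (not re-derived here).
TOTAL = (687, 69)


def share(using: int, papers: int) -> float:
    """Percentage of papers using a top list, rounded to one decimal."""
    return round(100.0 * using / papers, 1)


# Per-venue shares, sorted in descending order like the table's % column.
per_venue = {venue: share(using, papers) for venue, (papers, using) in TABLE1.items()}
for venue, pct in sorted(per_venue.items(), key=lambda item: -item[1]):
    print(f"{venue:16s} {pct:5.1f}%")

print(f"{'Total':16s} {share(TOTAL[1], TOTAL[0]):5.1f}%")
```

Keeping the raw counts next to the derived shares makes it easy to re-check the table if a paper is reclassified during manual inspection.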

various aspects of security, across 38 papers in total: this includes phishing attacks [79, 80], session safety during redirections [81], and domain squatting [56], to name a few. Nine more papers study aspects of privacy & censorship, such as the Tor overlay network [59] or user tracking [33]. Network or application performance is also a popular area: ten papers in our survey focus on this, e.g., HTTP/2 server push [70], mobile web performance [69], and Internet latency [27]. Other studies look at economic aspects such as hosting providers.
   Layers: We also reviewed the network layers measured in each study. Many of the papers we surveyed focus on web infrastructure: 22 of the papers are concerned with content, 8 focus on the HTTP(S) protocols, and 7 focus on applications (e.g., browsers [37, 38]).
   Studies relating to core network protocols are commonplace: DNS [30,34,49,50,59] (we identified 3 studies relating to domain names as separate from DNS protocol measurements [22,56,61]), TCP [17,25], IP [12,13,16,24,62,67], and TLS/HTTPS [19,35,36,48,55,74,81] layer measurements are common among our surveyed papers.
   Finally, we identify 12 studies whose experimental design measures more than one specific layer; e.g., cases where a full connection establishment (from initial DNS query to HTTP request) is being studied.
   We conclude from this that top lists are frequently used to explicitly or implicitly measure DNS, IP, and TLS/HTTPS characteristics, which we investigate in depth in §8.

3.4  Are Results Dependent on Top Lists?
   In this section, we discuss how dependent the results of studies are on the top lists chosen. For this, we fill the “dependent” columns in Table 1 as follows:
   Dependent (Y): Across all papers surveyed, we identify 45 studies whose results may be affected by the list chosen. Such a study would take a list of a certain day, measure some characteristic over the set of domains in that list, and draw conclusions about the measured characteristic. In a case like that, the results depend on the list being used: a different set of domains in the list may have yielded different results.
   Verification (V): We identify 17 studies that use a list only to verify their results. A typical example would be to develop some algorithm to find domains with a certain property, and then use a top list to check whether these domains are popular. In such cases, the algorithm developed is independent of the list’s content.
   Independent (N): Eight studies cite and use a list, but we determine that their results are not necessarily reliant on the list. These are typically papers that use a top list as one of so many input sources that changes in the top lists would likely not affect the overall results.

3.5  Are Studies Replicable?
   Repeatability, replicability, and reproducibility are ongoing concerns in Computer Networks [83, 84] and Internet Measurement [85]. While specifying the date when a top list was downloaded, and the date when measurements were conducted, is not necessarily sufficient to reproduce studies, both are an important first step.
   Table 1 lists two “date” columns that indicate whether the list download date or the measurement dates were given‡. Across all 69 papers using top lists, only 7 stated the date the list was retrieved, and 9 stated the measurement date. Unfortunately, only 2 papers give both the list and measurement date and hence fulfil these basic criteria for reproducibility. This does not necessarily mean that the other papers are not reproducible: authors may publish the specific top list used

     ‡ We require a specific day to be given to count a paper; the few papers just citing a year or month were counted as no date given.

as part of data, or authors might be able to provide those dates upon inquiry. However, recent investigations of reproducibility in networking hint that this may be an unlikely expectation [85, 86]. We find two papers that explicitly discuss instability and bias of top lists, and use aggregation or enrichment to stabilise results [43, 65].

3.6  Summary
   Though our survey has a certain level of subjectivity, its broad findings should be stable: (i) that top lists are frequently used, (ii) that many papers’ results depend on list content, and (iii) that few papers indicate a precise date of the list download or the measurements conducted.
   We also find that top list use to measure network and security characteristics (DNS, IP, HTTPS/TLS) is common. We further investigate how top list use impacts result quality and stability in studies by measuring these layers in §6.

4.  TOP LISTS DATASET
   For the three lists in focus of our study, we source daily snapshots as far back as possible. Many top lists come from our own archives, and others were shared with us by other members of the research community. Table 2 gives an overview of our data sets along with some metrics discussed in §5. For the Alexa list, we have a data set with daily snapshots from January 2009 to March 2012, named AL0912, and one dataset from April 2013 to April 2018, named AL1318. As the Alexa list underwent a significant change in January 2018, which we will study in the next sections, we create a partial dataset named AL18 to reflect this change. For the Umbrella list, we have a dataset spanning 2016 to 2018, named UM1618. For the Majestic Million list, we cover June 2017 to April 2018.
   As many of our analyses are comparative between lists, we create a JOINT dataset, spanning the overlapping period from June 2017 to the end of April 2018. We also sourced individual daily snapshots from the community and the Internet Archive, but only use those with continuous daily data.

5.  STRUCTURE OF TOP LISTS
   In this section, we analyse the structure and nature of the three top lists in our study. This includes questions such as TLD coverage, subdomain depth, and list intersection.
   DNS Terms used in this paper, for clarity, are the following: for www.net.in.tum.de, .de is the public suffix (and top level domain), tum.de is the base domain, in.tum.de is the first subdomain, and net.in.tum.de is the second subdomain. We count www.net.in.tum.de as a third-level subdomain.

5.1  Domain Name Depth and Breadth
   A first characteristic to understand about top lists is the scope of their coverage: how many of the active TLDs do they cover, and how many do they miss? How deep are they going into specific subdomains, choosing trade-offs between breadth and depth?
   TLD Coverage is a first indicator of list quality and coverage. Per IANA [87], 1,543 TLDs exist as of May 20th, 2018. Based on this list, we calculate the number of valid and invalid TLDs per list. The average coverage of valid TLDs in the JOINT period is ≈700 TLDs, covering only about 50% of active TLDs. This implies that measurements based on top lists may miss up to 50% of TLDs in the Internet.
   At the Top 1k level we find quite different behaviour with 105 valid TLDs for Alexa, 50 for Majestic, but only 13 (com/net/org and a few other TLDs) for Umbrella. We speculate that this is rooted in DNS administrators of highly queried DNS names preferring the smaller set of professionally managed and well-established top level domains over the sometimes problematic new gTLDs [88–90].
   Invalid TLDs occur neither in the Top 1k domains nor in the Alexa Top 1M domains, but as a minor count in the Majestic Top 1M (7 invalid TLDs, resulting in 35 domain names), and as a significant count in the Umbrella Top 1M: there, we can find 1,347 invalid TLDs, in a total of 23k domain names (2.3% of the list). This is an early indicator of a specific characteristic in the Umbrella list: invalid domain names queried by misconfigured hosts or outdated software can easily get included into the list.
   Comparing valid and invalid TLDs also reveals another structural change in the Alexa list on July 20th, 2014: before that date, Alexa had a fairly static count of 206 invalid and 248 valid TLDs. After that date, invalid TLDs have been reduced to ≈0, and valid TLDs have shown continued growth from 248 to ≈800. This confirms again that top lists can undergo rapid and unannounced changes in their characteristics, which may significantly influence measurement results.
   Subdomain Depth is an important property of top lists. Base domains offer more breadth and variety in setups, while subdomains may offer interesting targets besides a domain’s main web presence. The ratio of base to sub domains is hence a breadth/depth trade-off, which we want to explore for the three lists used. Table 2 shows the average number of base domains (µBD) per top list. We note that Alexa and Majestic contain almost exclusively base domains with few exceptions (e.g., for blogspot). In contrast, Umbrella only contains an average of 28% base domains, i.e., it puts much more emphasis on depth of domains. Table 2 also details the subdomain depth for a single-day snapshot (April 30, 2018) of all lists. As the Umbrella list is based on DNS lookups, such deep DNS labels can easily become part of the Umbrella list, regardless of the origin of the request. In fact, Umbrella holds subdomains up to level 33 (an IPv6 rDNS pointer).
   We also note that the base domain is usually part of the list when its subdomains are listed. On average, each list contains only a few hundred subdomains whose base domain is not part of the list.
   Domain Aliases are domains with the same second-level domain, but different top-level domains, e.g., google.com and google.de. Table 2 shows the number of domain aliases as DUPSLD. We find a moderate level of ≈5% of domain

 List       Top   Dataset               Dates    µTLD ± σ    µBD ± σ         SD1    SD2     SD3     SDM    DUPSLD             µ∆     µNEW
 Alexa      1M    AL0912      29.1.09–16.3.12    248 ± 2     973k ± 2k       1.6%   0.4%    ≈0%         4   47k ± 2k           23k      n/a
 Alexa      1M    AL1318      30.4.13–28.1.18    545 ± 180   972k ± 6k       2.2%   0.1%    ≈0%         4   49k ± 3k           21k      5k
 Alexa      1M    AL18        29.1.18–30.4.18    771 ± 8     962k ± 4k       3.7%   ≈0%     ≈0%         4   45k ± 1k          483k    121k
 Alexa      1M    JOINT        6.6.17–30.4.18    760 ± 11    972k ± 7k       2.6%   ≈0%     ≈0%         4   51k ± 4k          147k     38k
 Umbrella   1M    JOINT        6.6.17–30.4.18    580 ± 13    273k ± 13k     49.9%   14.7%   5.9%       33   15k ± 1k          100k     22k
 Majestic   1M    JOINT        6.6.17–30.4.18    698 ± 14    994k ± 617      0.4%   ≈0%     ≈0%         4   49k ± 1k            6k      2k
 Alexa      1k    JOINT        6.6.17–30.4.18    105 ± 3      990 ± 2        1.3%   0.0%    0.0%        1    22 ± 2         9 (78§)   4 (8§)
 Umbrella   1k    JOINT        6.6.17–30.4.18     13 ± 1      317 ± 6       52.0%   14%     ≈0%         6    11 ± 2            44        2
 Majestic   1k    JOINT        6.6.17–30.4.18     50 ± 1      939 ± 3        5.9%   0.1%    0.1%        4    32 ± 1             5       .8
 Umbrella   1M    UM1618     15.12.16–30.4.18    591 ± 45    281k ± 16k     49.4%   14.5%   5.7%       33   15k ± 1k          118k      n/a

Table 2: Datasets: mean of valid TLDs covered (µTLD), mean of base domains (µBD), mean of sub-domain level spread (SDn
for share of n-th level subdomains, SDM for maximum sub-domain level), mean of domain aliases (DUPSLD), mean of daily
change (µ∆) and mean of new (i.e., not included before) domains per day (µNEW). §: Average after Alexa’s change in January 2018.
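
The structural columns of Table 2 build on the DNS terms defined in §5. The following is a minimal sketch of how base domain, subdomain depth, and TLD validity could be derived for list entries. It assumes that every public suffix is a single label, which holds for the .de example but not for multi-label suffixes; a full analysis would need a public suffix list. The function names and the `valid_tlds` argument (standing in for the IANA TLD list) are illustrative and not taken from the paper.

```python
from collections import Counter


def split_domain(name: str, suffix_labels: int = 1) -> tuple[str, int]:
    """Return (base domain, subdomain depth) for a domain name.

    Assumption: the public suffix consists of `suffix_labels` labels
    (1 for .de or .com). Depth 0 is the base domain itself.
    """
    labels = name.rstrip(".").lower().split(".")
    base_len = suffix_labels + 1
    if len(labels) < base_len:
        raise ValueError(f"{name!r} has no base domain under this assumption")
    return ".".join(labels[-base_len:]), len(labels) - base_len


def depth_shares(domains: list[str], max_level: int = 3) -> dict[int, float]:
    """Share of list entries at each subdomain level (cf. SD1..SD3)."""
    counts = Counter(split_domain(d)[1] for d in domains)
    return {level: counts.get(level, 0) / len(domains) for level in range(max_level + 1)}


def tld_validity(domains: list[str], valid_tlds: set[str]) -> tuple[set[str], set[str]]:
    """Partition the TLDs seen in a list into (valid, invalid) sets."""
    seen = {d.rstrip(".").lower().rsplit(".", 1)[-1] for d in domains}
    return seen & valid_tlds, seen - valid_tlds


# The example name from Section 5 is a third-level subdomain of tum.de.
print(split_domain("www.net.in.tum.de"))
print(depth_shares(["tum.de", "in.tum.de", "net.in.tum.de", "www.net.in.tum.de"]))
print(tld_validity(["tum.de", "printer.localdomain"], {"de", "com"}))
```

Under this convention the share of depth-0 entries corresponds to the base-domain share, and the maximum observed depth corresponds to SDM.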

aliases within various top lists, with only 1.5% for Majestic. Analysis reveals a very flat distribution, with the top entry google at ≈200 occurrences.

5.2  Intersection between Lists
   We next study the intersection between lists. All three lists in our study promise to provide a view on the most popular domains in the Internet, hence measuring how much these lists agree is a strong indicator of bias in list creation. Figure 1a shows the intersection between top lists over time during the JOINT duration. We see that the intersection is quite small: for the Top 1M domains, Alexa and Majestic share 285k domains on average during the JOINT duration. Alexa and Umbrella agree on 150k, Umbrella and Majestic on 113k, and all three only on 99k out of 1M domains.
   For the Top 1k lists, the picture is more pronounced. Alexa and Majestic agree on average in the JOINT period on 295 domains, Alexa and Umbrella on 56, Majestic and Umbrella on 65, and all three only on 47 domains.
   This disagreement on top domains suggests a high bias in the list creation. We note that even the two web-based lists, Alexa and Majestic, only share an average of 29% of domains.
   Standing out from Figure 1a is the fact that the Alexa list has changed its nature in January 2018, reducing the average intersection with Majestic from 285k to 240k. This change also introduced a weekly pattern, which we discuss further in §6.2. A possible root cause for this could be that Alexa reduced the impact of its 3-month sliding window [7], making the list more volatile and susceptible to weekly patterns. We contacted Alexa about this change, but have not heard back.

5.3  Studying Top List Discrepancies
   The low intersection between Umbrella and the other lists could be rooted in the DNS vs. web-based creation. Our hypothesis is that the web-based creation of Alexa and Majestic lists, which also hold very few subdomains, misses domains providing embedded content as well as domains popular on mobile applications [65, 91]. In this section, we explore the origin of discrepancies across domain lists.

 List       # Disjunct   % hpHosts     % Lumen     % Top 1M
 Alexa          1,224       3.10%         1.55%      99.10%
 Umbrella       1,116      20.16%        39.43%      25.63%
 Majestic         665       1.95%         3.76%      93.63%

Table 3: Share of one-week Top 1k disjunct domains present in hpHosts (blacklist), Lumen (mobile) and in the Top 1M of the other ranks.

   We aggregate the Alexa, Umbrella, and Majestic Top 1k domains from the last week of April 2018, and analyse the set of 3,005 disjunct domains across these lists, i.e., those found only in a single list. 40.73% of these domains originate from Alexa, 37.14% from Umbrella, and 22.13% from Majestic. Subsequently, we identify whether the disjunct domains are associated with mobile traffic or third-party advertising and tracking services not actively visited by users, but included through their DNS lookups. We opt against utilizing domain classifiers such as the OpenDNS Domain Tagging service [92], as it has been reported that categories are vague and coverage is low [91].
   Instead, we use the data captured by the Lumen Privacy Monitor [93] to associate domains with mobile traffic for more than 60,000 Android apps, and use popular anti-tracking blacklists such as MalwareBytes’ hpHosts ATS file [94]. We also check if the domains from a given rank can be found in the aggregated Top 1M of the other two ranks during the same period of time. Table 3 summarises the results. As we suspected, Umbrella has significantly more domains flagged as “mobile traffic” and third-party advertising and tracking services than the other lists. It also has the lowest proportion of domains shared with other Top 1M lists.
   This confirms that Umbrella is capable of capturing domains from any device using OpenDNS, such as mobile and IoT devices, and also includes domains users are not aware of visiting, such as embedded third-party trackers in websites. Alexa and Majestic provide a web-specific picture of Internet traffic.

[Figure 1 panels: (a) Intersection between Top1M lists (Alexa-Majestic, Alexa-Umbrella, Umbrella-Majestic, Alexa-Umbrella-Majestic). (b) Daily changes of Top1M entries (Alexa Top 1M, Umbrella Top 1M, Majestic Top 1M). (c) Average % Daily Change over Rank (Alexa_18, Alexa_1318, Umbrella_JOINT, Majestic_JOINT).]
Figure 1: Intersection, daily changes and average stability of top lists (y-axis re-scaled at 10 in right plot).
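
The daily-change count plotted in Figure 1b is a set difference between consecutive daily snapshots. The sketch below computes it and additionally separates domains that enter a list for the first time from domains that re-enter it, which is the distinction Section 6.1 draws. Snapshot contents and function names are hypothetical; this is an illustration, not the authors' tooling.

```python
def daily_change(today, tomorrow):
    """Number of domains present today but missing tomorrow (set difference)."""
    return len(set(today) - set(tomorrow))


def new_vs_returning(snapshots):
    """For each day after the first, return (new, returning): entering domains
    never seen before versus entering domains seen on some earlier day."""
    seen = set(snapshots[0])
    result = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        entering = set(cur) - set(prev)
        new = entering - seen
        result.append((len(new), len(entering) - len(new)))
        seen |= set(cur)
    return result


# Three hypothetical daily snapshots of a fixed-size list.
snapshots = [
    {"a.example", "b.example", "c.example"},
    {"a.example", "b.example", "d.example"},
    {"a.example", "c.example", "d.example"},
]
print([daily_change(x, y) for x, y in zip(snapshots, snapshots[1:])])
print(new_vs_returning(snapshots))
```

Because the lists have a fixed size, the number of domains leaving on a given day equals the number entering, so either direction of the set difference yields the same count.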

6.      STABILITY OF TOP LISTS
   Armed with a good understanding of the structure of top lists, we now focus on their stability over time. Research has revealed hourly, daily and weekly patterns on ISP traffic and service load, as well as significant regional and demographic differences in accessed content due to user habits [95–98]. In this section, we assess to what degree weekly patterns manifest in top lists, as a first step towards understanding the impact of studies picking a list at a given time.

6.1     Daily Changes
   We start our analysis by understanding the composition and evolution of top lists on a daily basis. As all top lists have the same size, we use the count of intersecting domains as the metric to compare evolution over time.
   Figure 1b shows the number of domains that changed from one day to the next day¶. The Majestic list is very stable (6k daily change), the Umbrella list offers significant churn (118k), and the Alexa list used to be stable (21k), but changed its characteristic in January 2018 (483k) to now be even more unstable than the Umbrella list.
   The Umbrella list clearly presents weekly patterns, as does the Alexa list after the change in January 2018. The average daily changes are given in column µ∆ of Table 2, including those for the Top 1k subsets.
   Which Ranks Change? Previous studies of Internet traffic revealed that the distribution of accessed domains and services follows a power-law distribution [66,95–98]. Therefore, the ranking of domains in the long tail is expected to be based on significantly smaller and hence less reliable numbers.
   Figure 1c displays the stability of lists as a function of their size. The y-axis shows the mean number of daily changing domains in the top X domains, where X is depicted on the x-axis. For example, an x-value of 1000 means that the lines at this point show the average daily change per list for the Top 1k domains. The figure shows instability increasing with higher (i.e., numerically larger) ranks for Alexa and Umbrella, but not for Majestic. We plot Alexa before and after its January 2018 change, highlighting the significance of the change across all its ranks. Even its Top 1k domains have increased their instability from 0.62% to 7.7% of daily change.
   New or In-and-out Domains? Daily changes in domain lists may stem from a set of domains churning in and out of a list, or from new domains joining every day. Figure 2a shows the cumulative sum of all domains ever seen in a list. We find Majestic to grow linearly by about 2k new domains per day (of 6k daily change), from 1M to 1.7M, Umbrella to grow at about 20k new domains per day (of 118k daily change) from 1M to 7.3M, and Alexa to grow by 5k (of 21k) and 121k (of 483k) domains per day, before and after its structural change in January 2018. Because of Alexa’s change, its cumulative number of domains increased to 13.5M domains.
   Across all lists, we find an average of about 1⁄5 to 1⁄3 of daily changing domains to be domains entering the list for the first time. This also implies that 2⁄3 to 4⁄5 of daily changing domains are domains that keep going in and out of a list. We also show these and the similar Top 1k numbers in column µNEW of Table 2.
   This behaviour is further confirmed by Figure 2b. In this figure, we compute the intersection between a fixed starting day and the upcoming days. We compute it seven times, with each day of the first week of the JOINT dataset as the starting day. Figure 2b shows the evolution of the daily median value between these seven intersections.
   This shows several interesting aspects: (i) the long-term trend in temporal decay per list, confirming much of what we have seen before (high stability for Majestic, weekly patterns and high instability for Umbrella and the late Alexa list), and (ii) the fact that for Alexa and Umbrella, the decay is non-monotonic, i.e., a set of domains is leaving and rejoining at weekly intervals.
   For How Long are Domains Part of a Top List? We investigate the average number of days a domain remains in both the Top 1M and Top 1k lists in Figure 2c. This figure displays a CDF with the number of days from the JOINT dataset on the x-axis, and the normalised cumulative probability that a domain is part of the list for X or fewer days. Our analysis reveals significant differences across lists. While about 90% of domains in the Alexa Top 1M list are in the list for 50 days or less, 40% of domains in the Majestic Top 1M list remain in the list across all days. With this reading, lines closer to the lower right corner are better in the sense that more domains have stayed in the list for more days, while lines closer to the upper left indicate that domains get indexed or removed more quickly. The lists show quite different behaviour, with Majestic Top 1k being the most stable by far (only ≈ 26% of domains are present on fewer than 100% of days), followed by Majestic Top 1M, Umbrella Top 1k, Alexa Top 1k, Umbrella Top 1M, and Alexa Top 1M. It is interesting to note that the Majestic Top 1M list offers similar stability as the Alexa and Umbrella Top 1k lists.

[Figure 2 panels: (a) Cumulative sum of all domains ever included in Top 1M lists (Top 1k similar). (b) List intersection against a fixed starting set (median value between seven different starting days). (c) CDF of % of domains over days included in Top 1M and Top 1k lists.]
Figure 2: Run-up and run-down of domains; share of days that domains spend in a top list.

6.2     Weekly Patterns
   We now investigate the weekly pattern in the Alexa and Umbrella lists as observed in Figure 1b. We do not include Majestic as it does not display a weekly pattern. In this section, we resort to different statistical methods to investigate those weekend patterns. We describe each method in its relevant subsection.
   How Do Domain Ranks Change over the Weekends? The weekly periodic patterns in Figure 1b show that list content depends on the day of the week. To investigate this pattern statistically, we calculate a weekday and weekend distribution of the rank position of a given domain and compute the distance between those two distributions using the Kolmogorov-Smirnov (KS) test. This method allows us to statistically determine to what degree the distributions of a domain’s ranks on weekdays and weekends overlap, represented in Figure 3a. We include Majestic as a baseline without a weekly pattern. For Alexa Top 1M, we can see that about 35% of domains have a KS distance of one, meaning that their weekend and weekday distributions have no data point in common. This feature is also present in Umbrella’s rank, where over 15% of domains have a KS distance of 1. The changes are less pronounced for the Top 1k Alexa and Umbrella lists, suggesting that the top domains are more stable. This demonstrates that a certain set of domains, the majority of them located in the long tail, present disjoint rankings between weekends and weekdays.
   What Domains are More Popular on Weekends? This leads to the question about the nature of domains changing in popularity with a weekly pattern. To investigate this, we group domains by “second-level-domain” (SLD), which we define as the label left of a public suffix per the Public Suffix list [99]. Figures 3b and 3c display the time dynamics of SLD groups for which the number of domains varies by more than 40%. For Alexa, we can see stable behaviour before its February 2018 change. We see that some groups such as blogspot|| or tumblr.com are significantly more popular on weekends than on weekdays. The opposite is true for domains under sharepoint.com (a web-based Microsoft Office platform). Umbrella shows the same behaviour, with nessus.org (a threat intelligence tool) more popular during the week, and ampproject.org (a dominant website performance optimisation framework) and nflxso.net (a Netflix domain) more popular on weekends. These examples confirm our hypothesis that the weekly patterns stem from different Internet usage on weekends**.

[Figure 3 panels: (a) Kolmogorov-Smirnov (KS) distance between weekend and weekday distributions (Alexa, Umbrella and Majestic, Top 1k and Top 1M). (b) Weekday/weekend dynamics in Alexa Top 1M Second-Level-Domains (SLDs): tumblr.com, sharepoint.com, blogspot, wordpress.com. (c) Weekday/weekend dynamics in Umbrella Top 1M SLDs: nessus.org, ampproject.org, nflxso.net.]
Figure 3: Comparison of weekday vs. weekend distributions and dynamics in Second-Level-Domains (SLDs).

¶ |{domains_n} \ {domains_n+1}|, with n=date
|| This includes all blogspot.* domains
** Data indicates prevailing Saturday and Sunday weekends.
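
The weekday/weekend comparison of Section 6.2 can be illustrated with the two-sample KS statistic, i.e. the largest gap between the two empirical CDFs of a domain's ranks. The sketch below is a dependency-free version under stated assumptions: the function name `ks_distance` and the rank series are hypothetical, and it is not necessarily how the authors computed Figure 3a (SciPy's `scipy.stats.ks_2samp` reports the same statistic).

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over observed x."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


# Hypothetical daily ranks of one domain, split by day type.
weekday_ranks = [5200, 5100, 5300, 5250, 5150]
weekend_ranks = [2100, 2000]
print(ks_distance(weekday_ranks, weekend_ranks))  # rank ranges fully separated

# A domain whose ranks barely move between day types.
stable_weekday = [10, 11, 10, 12, 11]
stable_weekend = [11, 10]
print(ks_distance(stable_weekday, stable_weekend))
```

A distance of 1 arises when every weekend rank lies on one side of every weekday rank, which matches the reading given above for domains with a KS distance of one.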

                        Highest rank                   Median rank                   Lowest rank
Domain              Alexa  Umbrella  Majestic     Alexa  Umbrella  Majestic     Alexa  Umbrella  Majestic
google.com              1         1         1         1         1         1         2         4         8
facebook.com            3         1         2         3         6         2         3         8        19
netflix.com            21         1       455        32         2       515        34       487       572
jetblue.com         2,284    14,291     4,810     3,133    29,637     4,960     5,000    56,964     5,150
mdc.edu            25,619   177,571    24,720    35,405   275,579    26,122    88,093   449,309    30,914
puresight.com     183,088   593,773   687,838   511,800   885,269   749,819   998,407   999,694   869,872

Table 4: Rank variation for some more and less popular websites in the Top 1M lists.
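
The three statistics in Table 4 follow from a domain's series of daily ranks; note that the "highest" rank is the numerically smallest value. A minimal sketch, in which the function name `rank_summary` and the daily rank series are hypothetical:

```python
from statistics import median


def rank_summary(daily_ranks):
    """Return (highest, median, lowest) rank; rank 1 is the highest position."""
    return min(daily_ranks), median(daily_ranks), max(daily_ranks)


# Hypothetical daily ranks for two domains over five days.
ranks = {
    "top.example": [1, 1, 2, 1, 1],
    "tail.example": [183088, 511800, 998407, 420000, 610000],
}
for domain, series in ranks.items():
    print(domain, rank_summary(series))
```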

6.3     Order of Domains in Top Lists
   As top lists are sorted, a statistical analysis of order variation completes our view of top lists’ stability. We use the Kendall rank correlation coefficient [100], commonly known as Kendall’s τ coefficient, to measure rank correlation, i.e., the similarity in the order of lists. Kendall’s correlation between two variables will be high when observations have a similar order between the two variables, and low when observations have a dissimilar (or fully reversed for a correlation of -1) rank between the two variables.
   In Figure 4, we show the CDF of Kendall’s τ rank correlation coefficient for the Alexa, Umbrella and Majestic Top 1k domains in two cases: (i) for day to day comparisons; (ii) for a static comparison to the first day in the JOINT dataset. For analysis, we can compare the percentage of very strongly (> 95%) correlated ranks. For day to day comparisons, Majestic is clearly leading the way at 99%, with Alexa (72%) and Umbrella (40%) following at significant distance.
   When compared against a reference day, very strong correlation drops below 5% for all lists. This suggests that the order variations are not perceived in the short term, but may arise when considering longer temporal windows.
   Investigating the Long Tail: To compare higher and lower ranked domains, we take three exemplary domains each from the Top 100 and from the lower ranks. Table 4 summarises the results. For each of the six domains, we compute the highest, median, and lowest rank over the duration of the JOINT dataset. The difference of variability between top and bottom domains is striking and in line with our previous findings: the ranks of top domains are fairly stable, while the ranks of bottom domains vary drastically.

[Figure 4 series: Alexa 1k, Umbrella 1k, Majestic 1k, each for day to day comparison and against a fixed first day ("1st day fix").]
Figure 4: CDF of Kendall’s τ between top lists.

6.4     Summary
   We investigate the stability of top lists, and find abrupt changes, weekly patterns, and significant churn for some lists. Lower ranked domains fluctuate more, but the effect heavily depends on the list and the subset (Top 1k or Top 1M). We can confirm that the weekly pattern stems from leisure-oriented domains being more popular on weekends, and that the rank of domains outside the top ranks can fluctuate significantly.

7.      UNDERSTANDING AND INFLUENCING TOP LISTS RANKING MECHANISMS
   We have seen that top lists can be rather unstable from day to day, and hence investigate what traffic levels are required and at what effort it is possible to manipulate the ranking of a certain domain. As discussed previously, the Alexa list is based on its browser toolbar and “various other sources”, Umbrella is based on OpenDNS queries, and Majestic is based on the count of subnets hosting websites pointing to the website in question. In this section, we investigate the ranking mechanisms of these top lists more closely.

7.1     Alexa
   Alexa obtains visited URLs through “over 25,000 different browser extensions” to calculate site ranks through visitor and page view statistics [6, 101]. There is no further information about these toolbars besides Alexa’s own toolbar. Alexa also provides data to The Internet Archive to add new sites [102]. It has been speculated that Alexa provides tracking information to feed the Amazon recommendation and profiling engine since Amazon’s purchase of Alexa in 1999 [103]. To better understand the ranking mechanism behind the Alexa list, we reverse engineer the Alexa toolbar and investigate what data it gathers. Upon installation, the toolbar fetches a unique identifier, called the Alexa ID (aid), which is stored in the browser’s local storage. This identifier is used for distinctly tracking the device. During installation, Alexa requests information about age, (binary) gender, household income, ethnicity, education, children, and the toolbar installation location (home/work). All of these are linked to the aid. After installation, the toolbar transfers for each visited site:

[Figure 5: two heat maps; x-axis: Probes (100, 1000, 5000, 10000); y-axis: Queries per Day (1, 10, 50, 100). Cell values per row, left to right, Friday | Sunday: 1 query per day: 38,556 | 22,300; 10 queries: 36,985 | 21,147; 50 queries: 30,477 | 17,261; 100 queries: 199,598, 51,469, 30,343 | 614,575, 103,255, 30,066, 17,208.]
Figure 5: Umbrella rank depending on probe count and query frequency. Left is a Friday’s, right a Sunday’s list. Empty fields indicate the settings did not result in a Top 1M ranking.

ments with 100, 1k, 5k, and 10k probes, and at frequencies of 1, 10, 50, and 100 queries per probe per day [107]. The resulting ranks, stabilised after several days of measuring, are depicted in Figure 5. A main insight is that the number of probes has a much stronger influence than the query volume per probe: 10k probes at 1 query per day (total of 10k queries) achieve a rank of 38k, while 1000 probes at 100 queries per day (total 100k queries) only achieve rank 199k.
   It is a reasonable and considerate choice to base the ranking mechanism mainly on the number of unique sources, as it makes the ranking less susceptible to individual heavy hitters.
                                                                                           TTL Influence: To test whether the Umbrella list nor-
 the page URL, screen/page sizes, referer, window IDs, tab                              malises the potential effects of TTL of values, we query DNS
 IDs, and loading time metrics. For a scarce set of 8 search en-                        names with 5 different TTL values from 1000 probes at a
 gine and shopping URLs†† , referer and URL are anonymised                              900s interval [108]. We can not determine any significant
 to their host name. For all other domains, the entire URL, in-                         effect of the TTL values, all 5 domains maintain a distance
 cluding all GET parameters, is transmitted to Alexa’s servers                          of less that 1k list places over time.
 under data.alexa.com. Because of the injected JavaScript,                                 This is coherent with our previous observation that the
 the visit is only transmitted if the site actually exists and was                      Umbrella rank is mainly determined from the number of
 loaded. In April 2018, Alexa’s API DNS name had a rank                                 clients and not the query volume per client: as the TTL
 of ≈30k in the Umbrella list, indicating at least 10k unique                           volume would mainly impact the query volume per client, its
 source IP addresses querying that DNS through OpenDNS                                  effect should be marginal.
 name per day (cf §7.2).
                                                                                        7.3 Majestic
    Due to its dominance, the Alexa rank of a domain is an
 important criterion in domain trading and search engine op-                               The Majestic Million top list is based on a custom web
 timisation. Unsurprisingly, there is a gray area industry of                           crawler mainly used for commerical link intelligence [109].
 sites promising to “optimise” the Alexa rank of a site for                             In its beginning, Majestic ranked sites by the raw number of
 money [104–106]. Though sending synthetic data to the                                  referring domains. As this had an undesired outcome, the
 backend API should be possible at reasonable effort, we re-                            link count was normalised by the count of referring /24-
 frain from doing so due to a current backend API change with                           subnets to limit the influence of single IP addresses [110].
 no updated toolbar available for analysis yet.                                         The list is calculated using 90 days of data [111]. As this
                                                                                        approach is similar to PageRank [112], except that Majestic
 7.2                   Umbrella                                                         does not weight incoming links by the originating domain,
    As the Umbrella list is solely based on DNS queries                                 referral services can increase a domain’s popularity. We can,
 through the OpenDNS public resolver, it mainly reflects do-                            however, not see an efficient way to influence a domain’s rank
 mains frequently resolved, not necessarily domains visited                             in the Majestic list, without using referral services.
 by humans, as confirmed in §5.3. Examples are the Internet
 scanning machines of various research institutions, which                              8.        IMPACT ON RESEARCH RESULTS
 show up in the Umbrella ranking through forward-confirmed                                 §3 revealed that top lists are used in a broad manner in net-
 reverse-DNS, but unlikely from humans entering the URL                                 working, security and systems study, but especially prevalent
 into their browser. Building a top list based on DNS queries                           in Internet measurement, and that top lists are being used for
 has various trade-offs and parameters, which we aim to ex-                             Internet measurements across all layers. As replication of
 plore here. One specifically is the TTL value of a DNS                                 all studies from our survey is not possible, we evaluate the
 domain name. As the DNS highly relies on caching, TTL val-                             impact of the lists’ structure on research results in the Internet
 ues could introduce a bias in determining popularity based on                          measurement field by investigating (i) common layers, such
 query volume: domain names with higher Time-To-Live val-                               as DNS and IP, that played a role in many studies, and (ii) a
 ues can be cached longer and may cause fewer DNS queries at                            sample of specific studies across a variety of layers, aiming
 upstream resolvers. To better understand Umbrella’s ranking                            for one specific study per layer.
 mechanism and query volume required, we set up 7 RIPE At-                                 We evaluate those scientific results with 3 questions in
 las measurements [107], which query the OpenDNS resolvers                              mind: (i) what is the bias when using a top list as compared
 for DNS names under our control.                                                       to a general population of com/net/org domains‡‡ (ii) what is
    Probe Count versus Query Volume: We set up measure-                                      ‡‡
                                                                                             com/net/org is still only a 45% sample of the general population
                  ††
    As   of  2018-05-17, these   are  google.com,                                       (156.7M of 332M domains as per [113]), but more complete and still
 instacart.com,    shop.rewe.de,     youtube.com,                                       unbiased samples are difficult to obtain due to ccTLDs restrictive
 search.yahoo.com, jet.com and ocado.com                                                zone file access policies. [19, 114–117]

Study                         Alexa 1K          Umbrella 1K       Majestic 1K       Alexa 1M          Umbrella 1M       Majestic 1M       com/net/org (157.24M ± 172K)
NXDOMAIN (1)                  t ∼0.0% ± 0.0%    t ∼0.0% ± 0.0%    t ∼0.0% ± 0.0%    t 0.13% ± 0.02    s 11.51% ± 0.9    s 2.66% ± 0.09    0.8% ± 0.02
IPv6-enabled (2)              s 22.7% ± 0.6     s 22.6% ± 1.0     s 20.7% ± 0.4     s 12.9% ± 0.9     s 14.8% ± 0.8     s 10.8% ± 0.2     4.1% ± 0.2
CAA-enabled (1)               s 15.3% ± 0.9     s 5.6% ± 0.3      s 27.9% ± 0.3     s 1.7% ± 0.1      s 1.0% ± 0.0      s 1.5% ± 0.0      0.1% ± 0.0
CNAMEs (3)                    n 53.1% ± 1.1     s 44.46% ± 0.43   s 64.8% ± 0.34    n 44.1% ± 1       t 27.86% ± 1      s 39.81% ± 0.15   51.4% ± 1.7
CDNs (via CNAME) (3)          s 27.5% ± 0.89    s 29.9% ± 0.37    s 36.1% ± 0.22    s 6% ± 0.6        s 10.14% ± 0.63   s 2.6% ± 0.01     1.3% ± 0.004
Unique AS IPv4 (avg.) (3,4)   256 ± 5           132 ± 4           250 ± 3           19511 ± 597       16922 ± 584       17418 ± 61        34876 ± 53
Unique AS IPv6 (avg.) (3,4)   44 ± 5            26 ± 2            48 ± 30           1856 ± 56         2591 ± 157        1236 ± 793        3025 ± 9
Top 5 AS (Share) (3)          s 52.68% ± 1.74   s 53.33% ± 1.75   s 51.74% ± 1.73   s 25.68% ± 0.67   n 33.95% ± 1.06   s 22.29% ± 0.17   40.22 ± 0.09
TLS-capable (5)               s 89.6%           s 66.2%           s 84.7%           s 74.65%          n 43.05%          s 62.89%          36.69%
HSTS-enabled HTTPS (5)        s 22.9%           n 13.0%           s 27.4%           s 12.17%          s 11.65%          n 8.44%           7.63%
HTTP2 (3)                     s 47.5% ± 0.75    s 36.3% ± 2.4     s 36.6% ± 0.72    s 26.6% ± 0.88    s 19.11% ± 0.63   s 19.8% ± 0.15    7.84% ± 0.08

(1): µ Apr, 2018   (2): µ of JOINT period (6.6.17–30.4.18)   (3): µ Apr, 2018 - 8. May, 2018   (4): no share, thus no t, n, or s   (5): Single day/list in May, 2018

Table 5: Internet measurement characteristics compared across top lists and general population, usually given as µ ± σ. For each cell, we highlight if it significantly (50%, see note 6) exceeds (s) or falls behind (t) the base value (1k / 1M, 1M / com/net/org), or not (n). In almost all cases (s and t), top lists significantly distort the characteristics of the general population.
(6): For base values over 40%, the test for significant deviation is 25% and 5σ.
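
The highlighting rule in the caption of Table 5 can be written down as a small function. The following is only a sketch under stated assumptions: the function name `classify` is hypothetical, the 50% and 25% thresholds are read as relative deviations from the base value, and σ is taken to be the standard deviation of the compared cell; the caption does not spell out either detail.

```python
def classify(value: float, base: float, sigma: float = 0.0) -> str:
    """Return 's' (exceeds), 't' (falls behind), or 'n' (not significant).

    value and base are percentages as printed in Table 5.
    Assumption: thresholds are relative deviations from the base value.
    Assumption: sigma is the standard deviation of the compared cell.
    """
    if base == 0:
        # Relative deviation is undefined; treat any non-zero value as exceeding.
        return "n" if value == 0 else "s"

    deviation = abs(value - base)
    relative = deviation / base

    if base > 40.0:
        # Note 6: for base values over 40%, require 25% deviation and 5 sigma.
        significant = relative > 0.25 and deviation > 5 * sigma
    else:
        # Default test: more than 50% deviation from the base value.
        significant = relative > 0.50

    if not significant:
        return "n"
    return "s" if value > base else "t"


# CNAMEs, Umbrella 1M (27.86 ± 1) against com/net/org (51.4): marked t in Table 5.
umbrella_cnames = classify(27.86, 51.4, sigma=1.0)
# CNAMEs, Alexa 1M (44.1 ± 1) against com/net/org (51.4): marked n in Table 5.
alexa_cnames = classify(44.1, 51.4, sigma=1.0)
```

Under these assumptions the two example cells reproduce the markers printed in the table; other readings of the 5σ condition (for instance, using the σ of the base value) are equally possible.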

[Figure 6: three time-series panels, each with one line for Alexa Top 1M, Umbrella Top 1M, Majestic Top 1M, and com/net/org.
 (a) % of NXDOMAIN responses (y-axis 0% to 20%, x-axis 2017-04 to 2018-04).
 (b) % of IPv6 Adoption (y-axis 4% to 12%, x-axis 2017-04 to 2018-04).
 (c) % of CAA-enabled domains (y-axis 0.0% to 1.75%, x-axis 2017-10 to 2018-05).]
Figure 6: DNS characteristics in the Top 1M lists and general population of about 158M domains.
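
Figure 6b relies on the IPv6 criterion given in §8.1.1: a domain counts as IPv6-enabled if it returns at least one routed IPv6 address as an AAAA record or within a chain of up to 10 CNAMEs. A minimal offline sketch of that check follows; the `records` mapping, the `routed_prefixes` list, and the function name are hypothetical stand-ins for real DNS responses and BGP data, and the documentation prefix 2001:db8::/32 is used purely for illustration.

```python
from ipaddress import ip_address, ip_network

MAX_CNAME_HOPS = 10  # chain length limit stated in the text


def is_ipv6_enabled(name, records, routed_prefixes):
    """Check the AAAA criterion on pre-collected DNS data.

    records: hypothetical mapping name -> {"AAAA": [addr, ...], "CNAME": target}
    routed_prefixes: networks assumed to be visible in BGP
    """
    seen = set()
    hops = 0
    while name is not None and name not in seen and hops <= MAX_CNAME_HOPS:
        seen.add(name)  # guards against CNAME loops
        entry = records.get(name, {})
        for addr in entry.get("AAAA", []):
            ip = ip_address(addr)
            if ip.version == 6 and any(ip in net for net in routed_prefixes):
                return True
        name = entry.get("CNAME")  # follow the chain, if any
        hops += 1
    return False


# Illustrative data only: one domain reaches an AAAA record via a CNAME.
records = {
    "example.org": {"CNAME": "cdn.example.net"},
    "cdn.example.net": {"AAAA": ["2001:db8::1"]},
    "v4only.example": {},
}
routed = [ip_network("2001:db8::/32")]
enabled = [d for d in ("example.org", "v4only.example")
           if is_ipv6_enabled(d, records, routed)]
```

The adoption share is then the number of enabled domains divided by the list size.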

the difference in result when using a different top list? (iii) what is the difference in result when using a top list from a different day?

8.1 Domain Name System (DNS)

   A typical first step in list usage is DNS resolution, which is also a popular research focus (as seen in Section 3). We split this view into a record type perspective (e.g., IPv6 adoption) and a hosting infrastructure perspective (e.g., CDN prevalence and AS mapping).

8.1.1 Record Type Perspective

   We investigate the share of NXDOMAIN domains and IPv6-enabled domains, and the share of CAA-enabled domains as an example of a DNS-based measurement study [114]. Results are shown in Table 5 and Figure 6.

   NXDOMAIN: An NXDOMAIN error code in return to a DNS query means that the queried DNS name does not exist at the respective authoritative nameserver. This error code is unexpected for allegedly popular domains. Surprisingly, we find the share of NXDOMAIN responses in both the Umbrella (11.5%) and the Majestic (2.7%) top lists to be higher than in the general population of com/net/org domains (0.8%, cf. Table 5). This is in line with the fact that already ≈23k of the domains in the Umbrella list belong to non-existent top-level domains (cf. §5.1). Figure 6a shows that the NXDOMAIN share is, except for Umbrella, very stable over time. Among Top 1k ranked domains we found close to 0 NXDOMAINs. One notable exception is teredo.ipv6.microsoft.com, a service discontinued in 2013 and unreachable, but still commonly appearing at high ranks in Umbrella.

   This also highlights a challenge in Majestic's ranking mechanism: while counting the number of links to a certain website is nicely stable over time, it is also very slow to react to domains going out of business.

   Tracking IPv6 adoption has been the subject of several scientific studies such as [118, 119]. We compare IPv6 adoption across top lists and the general population, for which we count the number of domains that return at least one routed IPv6 address as an AAAA record or within a chain of up to 10 CNAMEs. At 11–13%, we find IPv6 enablement across top lists to significantly exceed the general population at 4%. Also, the highest adoption lies with Umbrella, a good indication for IPv6 adoption: when the most frequently resolved DNS names support IPv6, many subsequent content requests can use IPv6.

   CAA Adoption: Exemplary for other record types, we also investigate the adoption of Certification Authority Authorization (CAA) records in top lists and the general population. CAA is a rather new record type, and has become mandatory for CAs to check before certificate issuance, cf. [114, 120]. We measure CAA adoption as described in [114], i.e., the count of base domains with an issue or issuewild set. Similar to IPv6 adoption, we find CAA adoption among top lists (1–2%) to significantly exceed adoption among the general

population at 0.1%. Even more stunning, the Top 1k lists feature a CAA adoption of up to 28%, distorting the 0.1% in the general population by two orders of magnitude.

   Takeaway: The DNS-focused results above highlight that top lists may paint a picture in which results are significantly distorted compared to the general population, a fact to be urgently kept in mind. Figure 6 also shows that Umbrella, and recently Alexa, can have different results when using a different day. The day-to-day differences, ranging, e.g., from 1.5% to 1.8% CAA adoption around a mean of 1.7% for Alexa, are not extreme, but should be accounted for.

8.1.2 Hosting Infrastructure Perspective

   Domains can be hosted by users themselves, by hosting companies, or by a variety of CDNs. The hosting landscape is the subject of a body of research that uses top lists to obtain target domains. We hence study the share of hosting infrastructures obtained by measuring domains listed by the different top lists.

   CDN Prevalence: We start by studying the prevalence of CDNs in top lists and the general population. Since many CDNs use DNS CNAME records to redirect clients to servers, we perform daily DNS resolutions in April 2018. We then match the CNAME records observed in the DNS against a list of CNAME patterns for 77 CDNs [121] to identify CDN use.

   We first observe that the prevalence of CDNs differs by list and domain rank (see Table 5), with all Top 1M lists exceeding the general population by at least a factor of 2, and all Top 1k lists exceeding the general population by at least a factor of 20. When grouping the CDN ratio per list by weekdays (see Figure 7a), we observe minor influences of weekends vs. weekdays due to the top list dynamics described in §6.2.

   After the adoption of CDNs in general, we study the structure of CDN adoption. We analyse the top 5 CDNs and show their distribution in Figure 7 to study if the relative share is stable over different lists. We thus show the fraction of domains using one of the top 5 CDNs for both a subset of the Top 1k and the entire list of Top 1M domains per list. We first observe that the relative share of the top 5 CDNs differs by list and rank (see Figure 7b), but is generally very high at more than 80%. The biggest discrepancy is between using a top list and focusing on the general population of com/net/org domains. Google dominates the general population with a share of 71.17% due to many (private) Google-hosted sites. Domains in top lists are more frequently hosted by typical CDNs (e.g., Akamai). Grouping the CDN share per list by weekday in Figure 7c shows a strong weekend/weekday pattern for Alexa, due to rank dynamics (see Section 6.2). Interestingly, the weekend days have a higher share of Google DNS, indicating that fewer enterprise-grade hosted domains are being visited on the weekend.

   These observations highlight that using a list or not has significant influence on the top 5 CDNs observed, and, if using Alexa, the day of list creation as well.

   ASNs: We next analyse the distribution of Autonomous Systems (AS) that announce a DNS name's A record in BGP, as per route-view data from the day of the measurement, obtained from [122]. First, we study the AS diversity by counting the number of different ASes hit by the different lists. We observe large differences between lists in the number of unique ASes (cf. Table 5); while Alexa Top 1M hits the most ASes, i.e., 19511 on average, Umbrella Top 1M hits the fewest, i.e., 16922 on average. To better understand which ASes contribute the most IPs, we next focus on studying the top ASes. Figure 7d shows the top 5 ASes for the Top 1k and Top 1M domains of each list, as well as the set of com/net/org domains. We observe that both the set of involved ASes and their share differ by list.

   First, we note that the share of the top 5 ASes is 40% in the general population, compared to an average of 53% in the Top 1k and an average of 27% in the Top 1M lists.

   In terms of structure, we further observe that GoDaddy (AS26496) clearly dominates the general population with a share of 25.99%, while it only accounts for 2.74% on the Alexa Top 1M and for 4.45% on the Majestic Top 1M.

   While Alexa and Majestic share a somewhat similar distribution for both the Top 1M and Top 1k lists, Umbrella offers a quite different view, with a high share of Google/AWS hosted domains, which also relates to the CDN analysis above.

   This view is also eye-opening for other measurement studies: with a significant share of a population hosted or controlled by 5 different ASes, it is of no surprise that certain higher layer characteristics behave differently.

8.2 TLS

   In line with the prevalence of TLS studies among surveyed top list papers in §3, we next investigate TLS adoption among lists and the general population. To probe for TLS support, we instruct zgrab to visit each domain via HTTPS for one day per list in May 2018. As in the previous sections, we again prefix every domain with www. (except for Umbrella, which contains subdomains), as we found greater coverage for these domains. We were able to successfully establish TLS connections with 74.65% of the Alexa, 62.89% of the Majestic, 43.05% of the Umbrella, and 36.69% of the com/net/org domains (cf. Table 5). For Top 1k domains, TLS support further increases by 15–30% per list.

   These results show TLS support to be most pronounced among Alexa-listed domains, and that support in top lists generally exceeds the general population.

   HSTS: As one current research topic [19], we study the prevalence of HTTP Strict Transport Security (HSTS) among TLS-enabled domains. We define a domain to be HSTS-enabled if the domain provides a valid HSTS header with a max-age setting > 0. Out of the TLS-enabled domains, 12.17% of the Alexa, 11.65% of the Umbrella, 8.44% of the Majestic, and 7.63% of the com/net/org domains provide HSTS support (see Table 5). Only inspecting Top 1k domains again increases support significantly to 22.9% for Alexa,

[Figure 7, panel (a): Ratio of detected CDNs by list (x-axis) & weekday (y-axis).]

Weekday   Alexa 1k   Alexa 1M   Umbrella 1k   Umbrella 1M   Majestic 1k   Majestic 1M
M         0.265      0.069      0.297         0.087         0.362         0.026
T         0.280      0.055      0.295         0.102         0.361         0.026
W         0.279      0.055      0.303         0.106         0.361         0.026
T         0.282      0.055      0.302         0.105         0.363         0.026
F         0.279      0.056      0.301         0.106         0.363         0.026
S         0.275      0.059      0.299         0.105         0.360         0.026
S         0.263      0.068      0.297         0.100         0.361         0.026

[Figure 7, panel (b): Share of top 5 CDNs, Top 1k vs. Top 1M vs. com/net/org. Stacked bars (0.0 to 1.0) for Alexa, Umbrella, Majestic, and c/n/o. Legend: Akamai, Amazon, Zenedge, Google, WordPress, Highwinds, Fastly, Facebook, CHN Net, Incapsula, Instart.]
[Figure 7, panel (c): Share of top 5 CDNs, daily pattern (Mon - Sun). Stacked bars (0.0 to 1.0) for Alexa, Umbrella, and Majestic. Legend: Google, Fastly, Amazon, Akamai, Incapsula, Instart, WordPress.]
[Figure 7, panel (d): Share of top 5 ASes, Top 1k vs. Top 1M vs. com/net/org. Stacked bars for Alexa, Umbrella, Majestic, and c/n/o. Legend: Akamai (20940), Microsoft (8075), Cloudflare (13335), GoDaddy (26496), Google (15169), OVH (16276), Amazon (16509), 1&1 (8560), Amazon (14618), Confluence (40034), Fastly (54113).]

Figure 7: Overall CDN ratio, ratio of top 5 CDNs, and ratio of top 5 ASes, dependent on list, list type, and weekday.
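
The CDN ratios in Figure 7 come from matching observed CNAME records against a list of CNAME patterns for 77 CDNs [121] (§8.1.2). The sketch below shows one way such matching could look. It is an assumption-laden illustration: the three suffix entries are a tiny hand-picked subset and not the pattern list from [121], suffix matching stands in for whatever pattern syntax that list uses, and the function names are hypothetical.

```python
# Illustrative subset only; NOT the 77-CDN pattern list referenced as [121].
CDN_CNAME_SUFFIXES = {
    "Akamai": ("edgekey.net", "edgesuite.net", "akamaiedge.net"),
    "Fastly": ("fastly.net",),
    "Amazon": ("cloudfront.net",),
}


def detect_cdn(cname_chain):
    """Return the first CDN whose suffix matches a CNAME in the chain, else None."""
    for cname in cname_chain:
        host = cname.rstrip(".").lower()  # normalise trailing dot and case
        for cdn, suffixes in CDN_CNAME_SUFFIXES.items():
            if any(host == s or host.endswith("." + s) for s in suffixes):
                return cdn
    return None


def cdn_ratio(cnames_by_domain):
    """Fraction of domains with at least one CNAME matching a known CDN."""
    if not cnames_by_domain:
        return 0.0
    hits = sum(1 for chain in cnames_by_domain.values() if detect_cdn(chain))
    return hits / len(cnames_by_domain)


# Hypothetical resolution results for two domains of a list.
observed = {
    "shop.example": ["shop.example.edgekey.net."],
    "blog.example": [],
}
ratio = cdn_ratio(observed)
```

Computing `cdn_ratio` per list and per weekday yields a matrix of the same shape as panel (a).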

13.0% for Umbrella, and 27.4% for Majestic. HSTS support is, again, over-represented in top lists.

   Takeaway: We have analysed the properties of top lists and the general population across many layers, and found that: Top lists (i) generally show significantly more extreme measurement results, e.g., protocol adoption. This effect is pronounced, typically reaching 2 orders of magnitude for the Top 1k domains. Results can (ii) be affected by a weekly pattern, e.g., the % of protocol adoption may yield a different result when using a list generated on a weekend as compared to a weekday. This is a significant limitation to be kept in mind when using top lists for measurement studies.

8.3 HTTP/2 Adoption

   One academic use of top lists is to study the adoption of upcoming protocols, e.g., HTTP/2 [117, 123]. The motivation for probing top listed domains can be based on the assumption that very popular domains are more likely to adopt new protocols and are thus promising targets to study. We thus exemplify this effect and the influence of different top lists by assessing the HTTP/2 adoption as a current example studied in the literature [117, 123].

   To assess the HTTP/2 adoption, we probe domains in top lists and the general population. We try to fetch the domains' landing page via HTTP/2 by using the nghttp2 library. We prefix all domains in Alexa and Majestic with www., if not already present, as we found greater coverage for these domains. In case of a successfully established HTTP/2 connection, we issue a GET request for the / page of the domain. We follow up to 10 redirects and, if actual data for the landing page is transferred via HTTP/2, we count the domain as HTTP/2-enabled. We probe top lists on a daily basis and the larger zone file on a weekly basis.

   We show HTTP/2 adoption in Figure 8. First, we observe that the HTTP/2 adoption of all com/net/org domains is 7.84% on average and thus significantly lower than for domains listed in Top 1M lists (up to 26.6% for Alexa), and even more so for Top 1k lists, which, displayed in Table 5, show adoption around 50% or more.

   One explanation is that, as shown above, popular domains are more likely hosted on progressive infrastructures (e.g., CDNs) than the general population.

[Figure 8: line plot of Share [%] (y-axis, 0 to 60) over time (x-axis, 2018-04-11 to 2018-05-08) with one line each for Alexa 1M, Umbrella 1M, Majestic 1M, Alexa 1k, Umbrella 1k, Majestic 1k, and c/n/o.]
Figure 8: HTTP/2 adoption over time for the respective Top 1k and Top 1M ranks of each list compared to the population of com/net/org domains.

   We next investigate HTTP/2 adoption between top lists based on Figure 8. Unsurprisingly, we observe that HTTP/2 adoption differs by list and by weekday for those lists with a weekday pattern (cf. §6.2). We also note the extremely different result when querying the Top 1k lists as compared to the general population.

   Takeaway: Similar to other adoption rates, HTTP/2 is highly overrepresented in top lists, and results can vary when using a weekend or a weekday top list.

9. DISCUSSION

   We have shown in §3 that top lists are being frequently used in scientific studies. We acknowledge that using top lists has distinct advantages: they provide a stable, small set of relevant domains that can be compared over time. However, the use of top lists also comes with certain disadvantages, which we have explored in this paper. First, such lists typically come with an unclear selection bias, which can lead to over-representation of certain domains. We have observed in §8 that almost all conceivable measurements suffer significant bias when using a Top 1M list, and excessive bias in terms of magnitudes when using a Top 1k list. This indicates that top lists are a biased sample of the general population. Second, we have shown that one-off measurements may yield unstable results when lists change too much from day to day. This effect stems from the fact that top lists themselves are prone to significant daily change. Furthermore, the choice
