The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

Nick Ruest,1 Jimmy Lin,2 Ian Milligan,3 and Samantha Fritz3
1 York University Libraries
2 David R. Cheriton School of Computer Science, University of Waterloo
3 Department of History, University of Waterloo

arXiv:2001.05399v1 [cs.DL] 15 Jan 2020

ABSTRACT
The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building—all proceeding concurrently in mutually-reinforcing efforts. As we near the end of our initially-conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from about two hundred users, totaling over 280 terabytes of web archives.

1 INTRODUCTION
The Archives Unleashed project aims to improve scholarly access to web archives using a three-pronged strategy that simultaneously addresses technology, process, and community. Our efforts began in 2017 with the generous support of a grant by the Andrew W. Mellon Foundation and supplemented by a number of other sources. As we are nearing the end of the initially-conceived three-year project, the goal of this paper is to share with the broader community experiences we have accumulated along the way.

There is no doubt that scholars are ill-equipped to face the already-here deluge of digital materials for study, but how to provide scholarly access remains a stubborn, unsolved problem. In this project, our focus is on web archives, but the same can be said of tweets, emails, and a plethora of born-digital records. As the field of "web history" emerges, as defined by the SAGE Handbook of Web History [4] and two recent monographs [3, 13], this problem is becoming all the more pressing. Since the rationale for and importance of preserving the web has already been articulated elsewhere, we find no need to repeat those arguments here.

Recognizing that this challenge can only be addressed by a multi-disciplinary team, we assembled one at the outset comprising three main stakeholders: librarians and archivists who are charged with gathering and managing digital collections, computer scientists who build the tools necessary to manipulate those collections, and, of course, scholars who interrogate their contents to support various lines of inquiry. Throughout this paper, we use the term "scholar" as a shorthand for the individuals who make use of web archives for their studies, typically historians, digital humanists, or social scientists. Due to the composition of our team, the needs of historians are perhaps best represented, but our engagements with the broader community through datathons (discussed below) and other workshops have ensured that the social sciences and the digital humanities have not been shortchanged.

The biggest hurdle we set out to overcome is the classic chicken-and-egg problem in technology adoption: On the one hand, without guidance from "users" (that is, the scholars who study web archives), tool builders are taking stabs in the dark on what the real needs are. They might produce computational artifacts that are not useful and perhaps tackle the wrong aspects of the overall challenge. On the other hand, without any existing tools as reference points and often lacking the technical training to understand "what's possible", scholars may have a hard time articulating their needs. Without a community of eager scholars, tool builders are often hesitant to invest in software development efforts that might not lead to meaningful adoption. And of course, without the right computational tools, scholars cannot make headway in their inquiries. Thus, we were at an impasse.

The goal of this paper is to report how we have made progress in providing scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building—all proceeding concurrently in mutually-reinforcing efforts. Although the problem remains far from solved, significant progress has been made: Our contributions comprise lessons learned that are hopefully valuable to the rest of the community and can be transferred beyond web archives into other domains of digital preservation as well.

2 PROJECT HISTORY
Warcbase [12], the immediate predecessor to our current project, was conceived as a scalable web archiving platform to support temporal browsing and large-scale analytics. Development of the platform began in 2013 by one of the co-authors of this paper (Lin), a computer scientist working on big data, and lacking input from scholars, it suffered exactly from the "tool builder without users" problem discussed above. At around the same time, another one of the co-authors (Milligan), a historian studying early web history, faced the exact opposite problem: he was engaging in scholarly inquiry at scale without the proper computational tools and thus struggled to make headway. A fortuitous and timely gathering of the co-authors, along with many like-minded individuals at the Working with Internet Archives for Research (WIRE) Workshop in 2014, planted the seeds of what would eventually evolve into the Archives Unleashed project.
In 2015, a group of colleagues who had coalesced at the workshop (including the two co-authors mentioned above) collectively stumbled upon a potential solution to the scholarly access problem: to overcome the chicken-and-egg problem by building tools and community simultaneously. Through a series of in-person "datathons" dubbed "Archives Unleashed", the group brought together a few dozen stakeholders to engage in community building and skills training [14]. Over the course of two to three days, tool builders (typically, computer scientists) interacted with scholars, librarians, and archivists with the goal of overcoming the challenges outlined above. The first datathon was held in Toronto in March 2016, primarily with the support of the Social Sciences and Humanities Research Council of Canada and the U.S. National Science Foundation (plus additional contributions from a host of other organizations). The event successfully brought together a group of stakeholders to initiate the community building process, which continued through three more events, in June 2016, February 2017, and June 2017.

The present Archives Unleashed project retained the catchy moniker from these previous datathons and officially began in June 2017 with a different team. By that time, the approach to simultaneously building technology and community seemed to be gaining traction, and thus it made sense to build on those initial successes. We additionally recognized the importance of process (more details below), thus leading to a three-pronged strategy:

(1) Development of the Archives Unleashed Toolkit (AUT). This open-source toolkit represents the evolution of the Warcbase platform, with the benefit of a better understanding of scholars' needs, gained through both the composition of our team and experiences from the datathons.

(2) Deployment of the Archives Unleashed Cloud (AUK). To bridge the gap between open-source tools and their deployment at scale, we built a one-stop portal that allows users to ingest their collections and execute a number of analyses with a few keystrokes and mouse clicks. This service operationalizes a process model for scholarly inquiry that provides guidance on how to interrogate collections at scale. Although the Archives Unleashed Toolkit is open source, which means that anyone is able to install and run the software, we expect that most scholars would be uninterested in managing their own infrastructure. Thus, AUK can be viewed as the "canonical instance" of AUT where we handle deployment and maintenance, freeing scholars to focus on their inquiries.

(3) Organization of Archives Unleashed Datathons. These efforts at community-building and outreach represent a continuation of the previous in-person events, but with a shift in focus to the tools and services that the project was developing. In addition, we used these venues as an opportunity to develop longer-term plans to sustain the scholarly community around web archiving beyond the life of the project.

In order to more easily make headway in addressing these goals, our efforts primarily focused on thematic web archive collections being gathered by a diverse group of small to medium cultural heritage institutions, specifically those who are subscribers to Internet Archive's Archive-It service. These subscribers represent the vast majority of institutions in the United States that engage in active web archiving efforts; the 2017 National Digital Stewardship Alliance Web Archiving Survey Report found that 94% of institutions were using Archive-It that year, with an additional 4% collecting via separate Internet Archive contracts [9]. Individual collections of this sort typically range from tens of gigabytes to (a small number of) terabytes. Quite specifically, collection development and content harvesting lie outside the scope of our project, as there are already many existing resources that provide guidance on those aspects of the web archiving lifecycle. Thus, we assume that an organization already has a collection of WARC files (the standard container format for web archives) or ARC files (an older file format), in most cases already held by the Internet Archive as part of Archive-It, and desires to provide scholarly access to them.

We believe that targeting Archive-It subscribers maximizes our potential impact. By definition, these organizations already recognize the need for web archiving, and most of them have already begun to harvest content that meets their institutions' development needs. These small-to-medium organizations often operate with barebones staffing, and hence are not in a position to actively facilitate scholarly access—yet they truly feel the pain of letting their collections lie fallow. Our project aims to answer the question: "We've begun a web archiving program and have gathered a few collections, now what?" While our toolkit may also be useful to large organizations, for example, national libraries, they are not the primary audience we intend to serve.

Given this brief overview of the genesis of the Archives Unleashed project, the remainder of this paper will focus on lessons we have learned along the way that we hope will be valuable to the broader community.

3 HIGH-LEVEL LESSONS
As with most technology adoption challenges, perhaps unsurprisingly, the technology itself (AUT and AUK in our case) is relatively straightforward. The toolkit and the cloud service represent fairly standard instances of open-source software development, and we have adopted standard best practices that lead us down relatively well-trodden paths. From the technical perspective, the biggest innovation in the toolkit (beyond its Warcbase foundations) is the move from Spark's resilient distributed datasets (RDDs) to DataFrames and a shift from Scala to Python as the language of choice (see Section 5 for more details).

We have come to realize that the process of scholarly inquiry (in the face of the daunting sizes of many collections) is the most critical component in building and sustaining a community around web archiving. This is not to diminish the importance of getting the community together in the first place (the third prong of our strategy), a challenge our datathons have already made some headway on. However, even once we get all the stakeholders "in the same room", they still need to "do something" to engage in scholarly inquiry, and that "something" can leverage our proposed process model to serve as a framework for organizing their activities. We had previously proposed what we called the FAAV cycle [12] to characterize how scholars interrogated web archives, comprised of four main activities: filter, analyze, aggregate, and visualize.

In this paper, we present a refinement to FAAV that we term FEAV, where "analyze" has been replaced with "extract" for a more accurate characterization.
This process model provides a descriptive characterization of scholarly activity based on our observations and can be used prescriptively in a pedagogical manner. It offers an organizing framework for the rest of the project, guiding the development of the toolkit as well as the cloud service.

The articulation of the FEAV process represents a major contribution of our work. Our single most important insight is that the main activities (filter, extract, aggregate, visualize) can be disaggregated across time, space, and tools. That is, each of the activities need not occur in the same session or even the same location, and perhaps most importantly, with the same tools. More concretely, we have realized that a number of "derivative products" provide useful starting points to scholarly inquiry, more so than the raw collections themselves (at least at the outset). These derivatives are essentially the output of a pre-determined pipeline of filtering, extraction, and aggregation that we explicitly store and share. In addition to being useful from a scholarly perspective, these derivatives are also much smaller than the raw collections, typically manipulable on a laptop or in a cloud notebook. These derivatives and the disaggregation of FEAV yield important implications for adoption, which we detail in Section 4.4.

4 THE FEAV PROCESS MODEL
As we have argued above regarding the central role of process in facilitating scholarly access to web archives, it makes sense to begin with a detailed discussion of our FEAV model.

Our previous work has considerably influenced this process model. Even prior to the Archives Unleashed project, in the context of Warcbase, we aimed to serve a well-defined group of scholars as our primary audience. Although we did not expect these scholars to have formal computer science training, our initial efforts targeted those who were already comfortable with a scripting language such as Python or R. We assumed that the scholars could perform simple manipulations of datasets in well-defined formats (e.g., CSV or JSON), use standard libraries to execute common tasks (e.g., create a word cloud), or were proficient enough with search engines to figure out what to do by searching online for instructions. Furthermore, we expected that the scholars were already comfortable with the command line and possessed basic knowledge of the file system (e.g., moving or copying files and directories).

In our experience, such a skillset is relatively common, particularly among recently-trained scholars who want to seriously work with digital objects. It might not be unrealistic to expect that these form the "core competencies" of all future scholars in this intellectual space, but how to bring everyone up to this level of technical proficiency is beyond the scope of our work. More importantly, even if graduate schools and professional organizations within the humanities and social sciences are not properly training scholars in these skills, they are competencies obtainable through libraries and other free resources. The Programming Historian, for example, has existed in various formats since 2008 [1] and Software Carpentry (and related projects such as Data Carpentry) has been running in-person workshops around the world on basic computational skills and principles. A growing body of literature explores best practices for how best to develop collaborative lessons for these diverse learners [8]. In other words, it would truly be a bridge too far in most cases for a historian to learn how to parse one hundred gigabytes of raw WARC files without assistance; but, with the support of appropriate training and resources (such as those referenced above), scholars can be reasonably expected to develop fluency in using Jupyter notebooks to manipulate modest amounts of CSV or JSON data (for example).

4.1 From Madlibs to a Process Model
For our target group of scholars, early datathons showed that they generally had little difficulty grasping the concepts (e.g., transformations on large collections of data records, accessing fields in a tuple, etc.) and the "mechanics" of using Warcbase (and later, the Archives Unleashed Toolkit). However, when it became time to analyze their own collections, the scholars were often unsure where to begin [12]. Indeed, we were faced with this conundrum: with us in the room to provide support, scholars could perform amazingly creative analyses, but few adopted the technology on their own without assistance. It soon became clear that scholars did not know where to start. Faced with several hundred gigabytes of WARC files and a command shell (with a blinking cursor) ready to accept commands, what to do?

We have attempted to jumpstart the process of inquiry by providing a number of examples illustrating the capabilities of our tools, adopting a "madlibs" (i.e., fill-in-the-blank) approach; more in Section 7. For example, we illustrate: this is how you find the website that has the most mentions of Canadian Prime Minister Justin Trudeau—if you want to change the person of interest and the collection, change the variables here and there. Or, this is how you create a word cloud of the crawl from September 2014—you can change the temporal interval using this variable, or focus on a particular domain by setting that variable.

What began to emerge, essentially, was a "cookbook" with a series of "recipes" for addressing a number of common analytics tasks. Indeed, this format now forms the backbone of our online documentation.1 These recipes can be phrased in the form of "how do I..." questions, for example:

• extract all URLs in a collection?
• count the occurrences of different domains?
• extract plain text from URLs matching a pattern?
• compute counts of specific keywords?
• find the most common person mentioned?
• extract the anchor text of links to a particular URL?
• find all pages that link to a site?
• determine the most popular gif?
• compute the checksum of images?

A sketch of one such recipe appears at the end of this subsection. With this cookbook approach, combined with observations of how our tool was being used successfully in the datathons and through interactions with other scholars, we began to generalize scholarly activities into a process model we termed the FAAV cycle, first articulated in Lin et al. [12]. FAAV stands for "filter", "analyze", "aggregate", and "visualize"—the four main activities we found scholars engaging in as part of their inquiries. Over the past few years, we have further refined FAAV into FEAV, which we describe in detail next: the main refinement is replacing "analyze" with "extract" for greater descriptive accuracy.

1 https://github.com/archivesunleashed/aut-docs
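To give a flavor of these recipes, the following sketch answers the second question above ("how do I count the occurrences of different domains?") in the style of the toolkit's Scala scripts shown later in Figure 2. The collection path is a stand-in, and the exact function names follow the toolkit's conventions rather than any particular release:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Count the occurrences of different domains: filter to valid pages,
// extract each page's domain, then aggregate by count.
RecordLoader.loadArchives("collection/", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)            // the ten most frequent (domain, count) pairs
  .foreach(println)

In the madlibs spirit, changing the collection path or swapping ExtractDomain for another extraction function yields a different recipe with the same overall shape.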
4.2 The Four Main Activities
The four main activities of the FEAV model are "filter", "extract", "aggregate", and "visualize" in support of scholarly inquiry:

Filter. Typical web archive collections range from tens of gigabytes to terabytes; it is relatively rare—with perhaps the exception of high-level exploratory probes and "distant reading" studies from the greatest distance—that an individual analysis would consume the entire collection. Thus, a scholar usually begins by focusing on a particular subset of the web archive, which we characterize as filtering. This can be accomplished by content, metadata, or some extracted information. Content-based filtering might be based on text, e.g., consider only pages mentioning a particular set of keywords, or based on the hyperlink structure between the pages, e.g., consider only pages that link to a particular website or domain. Metadata-based filtering might be based on a particular range of crawl dates or pages whose URLs match a particular pattern. Filtering can also be based on any information extracted from either the content or metadata, as the result of running arbitrary user-defined functions (more below). The filtering criteria may be arbitrarily complex and nested, for example, a scholar is only interested in pages containing a particular keyword that link to a specific domain, within some temporal range.

Extract. After selecting a subset of material, the scholar typically then extracts some information of interest. Examples include extracting the plain text from the raw HTML source, identifying mentions of named entities (e.g., people, organizations, places), assessing the sentiment of the underlying text, etc. Extractions need not be limited to HTML—for example, a scholar might be interested in PDFs in a collection—or even to textual data—we have begun experimenting with pipelines that analyze images, for example. Typically, extraction is accomplished by user-defined functions (UDFs); the user in this case refers to the programmer. Generally, we would not expect scholars to write their own UDFs (at least at the outset); instead, we provide a library of UDFs to perform common operations that they can draw from and assemble in novel combinations. Note that UDFs are extensible and can invoke arbitrarily-complex functionalities—for example, leveraging a deep neural network to perform object detection on images [19].

Aggregate. The output of filtering and extraction is a collection of records of interest, typically already far smaller (often, orders of magnitude smaller) than the raw collection. In most cases, however, these records need to be aggregated or summarized before they are suitable for human consumption. The simplest example of aggregation is to produce a table of counts, e.g., how many times a person or location has been mentioned within a set of pages, how many times a particular page has been linked to, etc. Other common examples include finding the maximum (e.g., the page with the most incoming links), the minimum (e.g., the least frequently-mentioned name from some list of individuals), or the average (e.g., the average sentiment expressed across pages of a website as determined by an automatic classifier).

Visualize. Finally, the aggregated results are presented in some sort of visualization for the scholar's consumption. The visualization can be as simple as a table or list showing individual records, directly generated by our toolkit, or the output of the toolkit can be passed to an external application to generate complex, interactive visualizations. These visualizations can be static (e.g., a figure, graph, or plot) or interactive; they can be designed to support further exploration (i.e., an intermediate product) or be prepared for a publication or a blog post (i.e., for dissemination). It is not our intention for the Archives Unleashed Toolkit to evolve into a comprehensive visualization framework; instead, our main goal is to support interoperability with existing visualization toolkits in the broader ecosystem that scholars may already be familiar with.

4.3 Discussion
Although we describe the FEAV process model as comprised of distinct activities, it is important to note that all of the activities are closely intertwined in practice, and may not even proceed in the described order. For example:

Does a scholar filter first, then perform extraction, or vice versa? Sometimes, filtering is performed on extraction results—for example, running a named-entity detector to identify person names, and then considering only pages that mention certain names. In other cases, the distinction between filtering and extraction is blurred—for example, filtering based on hyperlinks technically requires running a link extractor on the page first, but conceptually, the scholar's goal is to filter, not to extract. In practice, filtering and extraction are often tightly coupled.

Filtering is also commonly applied to the output of aggregations. After counting, the scholar might wish to discard items that appear too frequently or too rarely. In web collections, there is inevitably a long tail of items that appear only a small number of times (e.g., misspelled names), and for the most part, scholars are not interested in those. On the flip side, there are often items that appear very frequently (e.g., "here" as anchor text) and hence it makes sense to discard those so as not to clutter up the analysis. The combination of these activities in our process model, we argue, is flexible enough to accommodate diverse scholarly needs.

Finally, filtering, extraction, and aggregation might proceed cohesively in a tight loop: For example, an initial set of filtering keywords yields too many pages and thus requires additional refinement to produce a manageable set of results. Or the opposite could happen—the filter was too restrictive and did not retain enough records for analysis. These activities might even alter the course of scholarly inquiry—for example, a result leads to an interesting question that compels the scholar to follow another tangential thread.

Note that during the normal course of scholarly inquiry, some of the activities might be skipped altogether. For example, if the filter is very specific and retains only a handful of records (e.g., half a dozen pages), then the scholar may decide to examine those results directly—in this case, there is no meaningful aggregation and visualization to speak of. By varying the specificity of the filter, a scholar can switch between "distant" or "close" reading [15] when using our toolkit. Once again, these possibilities are captured by our process model.
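To make these activities concrete, the following hypothetical sketch shows how filtering, extraction, and aggregation might compose in a single script, written in the style of the toolkit's Scala scripts shown later in Figure 2. The collection path, URL pattern, keyword, and output path are stand-ins, and the particular filters (keepUrlPatterns, keepContent) follow the toolkit's conventions rather than any specific release:

import io.archivesunleashed._

RecordLoader.loadArchives("collection/", sc)
  .keepValidPages()                               // filter: valid HTML pages only
  .keepUrlPatterns(Set(".*gouv\\.qc\\.ca.*".r))   // filter: URLs within a domain of interest
  .keepContent(Set("environnement".r))            // filter: pages mentioning a keyword
  .map(r => r.getCrawlDate)                       // extract: the crawl date of each match
  .countItems()                                   // aggregate: matching pages per crawl date
  .saveAsTextFile("keyword-by-date/")             // store for visualization elsewhere

Note that visualization is deliberately absent here: as discussed above, the output can simply be stored and handed to an external plotting tool.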
4.4 Standard Derivatives
Both our original FAAV and the refined FEAV model emerged as a descriptive, bottom-up characterization of what we observed scholars doing as they engaged with web archives. It was not our original intent that the model be applied prescriptively—our goal was merely to provide a reference that scholars could consult. Later on, though, we realized that our model could indeed serve a pedagogical purpose; that is, during tutorial sessions in our datathons, we offered FEAV as a way to help new users of the toolkit structure their approach to inquiry. We would offer: once you've formed your initial research question, think about what subset of the collection you would need to interrogate, what information you'd need to extract from those pages, etc.

As a result of this guidance, many analyses began much the same way, with very similar initial steps. In particular, we have found three analysis products to be so frequently used by scholars that with the Archives Unleashed Cloud and for collections used at our datathons, we have begun to pre-generate and store them, before they are even requested. Among other benefits, this obviates the need for scholars to run potentially large analysis jobs before they arrive at a datathon, saving precious time in our in-person events. These are what we term our three "standard derivatives", the creation of which is tightly integrated into the Archives Unleashed Cloud (see Section 6 for more details):

• Domain Distribution. We extract all URLs to compute the frequency of domains appearing in the collection.
• Domain Webgraph. We extract all hyperlinks to create a domain-to-domain network graph; that is, the hyperlink structure of the collection, aggregated at the domain level.
• Plain Text. We extract plain text from all web pages, along with metadata such as crawl date, domain name, and the URL.

Figure 1: The relationship between our FEAV process model and the Archives Unleashed Toolkit and Cloud. The toolkit is designed to support all four main activities, shown by the blue arrows. The cloud, in contrast, is designed to only support a subset of the activities, but additionally allows users to manage the ingestion of collections and downloading of derivatives (green arrows).

It remains an interesting question as to whether these derivatives are popular because we provide them, or if their popularity is in response to scholarly demand. The answer is likely a mixture of both, but these derivatives have two important properties: First, they are genuinely useful in answering a number of basic questions that are applicable to every web archive: The domain distribution provides a starting point to answering the question "What's in this collection?" It also provides the scholar with a sense of potential biases that may be present—by noting, for example, the over- or under-representation of certain sites. The domain webgraph provides a tractable overview of the collection structure—after all, the hyperlink structure is one important distinguishing characteristic that separates web archives from other large collections of documents. Page-to-page link structure is usually too fine-grained to be helpful in providing an overview, and we have found that domain-based aggregation provides a nice middle ground that shows interesting structures without being overwhelming in size. The plain text of pages is useful for obvious reasons, serving as the starting point for most content-based analyses.

Second, and perhaps more importantly, these derivatives are often small enough to be manipulable on the scholar's laptop. Our previous work [6] provides a detailed quantitative analysis of the relationship between raw and derivative sizes, but in rough terms, for a "typical" Archive-It collection, the domain distribution data is usually less than 1 MB, the domain webgraph is tens of MB, and the raw text is perhaps tens of GB.

An apt analogy to the creation and use of these derivatives might be the notion of an opening game in chess, which formulaically captures combinations of the first few moves in chess games. The existence of fixed opening moves and well-known countermoves certainly does not diminish the overall beauty of the game or the creativity necessary to play—in the same way that our derivatives do not diminish the creative potential of scholarly inquiry. These products merely present useful starting points, and scholars always have the option of starting from the raw web archives.

4.5 FEAV Disaggregation
Somewhat surprisingly, we have found the standard derivatives described above to be so useful that we encourage scholars (particularly those just learning about web archives) to use them as starting points of analyses, rather than the raw archives themselves. This has led to the single biggest insight that has facilitated adoption—the recognition that FEAV can be disaggregated across time, space, as well as tools. What does this mean? Allow us to explain:

The derivatives represent a combination of filtering, extraction, and aggregation whose output we capture, store, and share. Thus, when using these products, the scholars are taking the output of FEA and continuing with their own iteration of FEAV. At that point, how these derivatives came to be generated (when and where) becomes irrelevant (this is exactly the disaggregation we speak of), since scholars can engage with the material on their own laptops (a different time, a different place). Furthermore, scholars are not limited to analyses with our toolkit—they can use whatever tools they are already familiar with, be it Python, R, and yes, even Microsoft Excel. In this way, our process model supports seamless integration of our tools and services with the broader ecosystem. Given the popularity of Python for data science today, scholars can conduct analyses in notebooks, manipulate the derivatives using Pandas DataFrames, use existing packages for visualization, or take advantage of a multitude of other capabilities in the Python ecosystem. This means that scholars can begin inquiries into web collections without needing to know anything specific about web archives (for example, the difference between an ARC and a WARC or the subtleties in parsing hyperlinks to recover anchor texts).
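As an illustration of this disaggregated workflow, the sketch below reads a downloaded domain distribution derivative and lists the most frequent domains. For consistency with the paper's other examples we use plain Scala, though a scholar would more likely do the same in a Python notebook with Pandas; the file name and its one-record-per-line "domain,count" layout are assumptions for illustration only.

import scala.io.Source

// Hypothetical derivative: each line holds a "domain,count" pair.
val topDomains = Source.fromFile("domain-distribution.csv")
  .getLines()
  .map(_.split(","))
  .collect { case Array(domain, count) => (domain, count.toInt) }
  .toSeq
  .sortBy(-_._2)    // most frequent domains first
  .take(10)

topDomains.foreach { case (domain, count) => println(s"$domain\t$count") }

Nothing in this snippet is specific to web archives, which is precisely the point: the derivative is just another small dataset.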
RDD version (left):

RecordLoader.loadArchives("collection/", sc)
  .keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContent))
  .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
  .filter(r => r._1 != "" && r._2 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("domain-webgraph/")

DataFrame version (right):

RecordLoader.loadArchives("collection/", sc)
  .webgraph()
  .groupBy(ExtractDomain($"src"), ExtractDomain($"dest"))
  .count()
  .filter($"count" > 5)
  .write.csv("domain-webgraph/")

Figure 2: Simplified Scala scripts for generating the domain webgraph using the Archives Unleashed Toolkit. On the left we present the analysis using resilient distributed datasets (RDDs); on the right we present the same analysis using DataFrames.

These relationships are shown in Figure 1. With the Archives Unleashed Toolkit, a scholar can directly engage in the FEAV processes (with perhaps the support of external applications for visualization)—these are indicated by the blue arrows. This, of course, requires the scholar to download, install, and configure the toolkit, presenting a potential barrier to entry, but offering the most complete suite of capabilities. Alternatively, the scholar can directly download the derivatives from the Archives Unleashed Cloud. Behind the scenes, the portal is still using the toolkit to perform filtering, extraction, and aggregation, but these steps are hidden from the scholar.

It is somewhat ironic that the success of our project means making our project invisible to scholars. For example, a social scientist downloads the domain webgraph of a particular collection, loads the structure into a Pandas DataFrame in Python, and proceeds to analyze the clustering properties of a number of websites of interest. She is unfamiliar with how the webgraph was extracted (through an AUT job triggered by AUK in the cloud, see Section 6), but those details are unimportant; the upshot is that we have enabled her research in an unobtrusive manner. The scholar already knew how to manipulate DataFrames using Pandas, and so analyzing web archives did not require learning any new skills.

We have fully embraced this philosophy of making our tools and services as invisible as possible, because the best strategy for adoption is to not require users to learn anything new! In this way, the Archives Unleashed Cloud has evolved into a conduit that allows users to ingest raw web archive collections and produce these derivatives for easy access.

5 THE TOOLKIT
The Archives Unleashed Toolkit (AUT) is the successor to Warcbase and can be characterized as a refinement of the older package, as opposed to an entirely new platform. For this reason, we focus on differences and new features, and avoid duplicating previously-published material; see Lin et al. [12] for additional details.

At a high level, AUT provides a domain-specific language for analyzing web archives built on top of the open-source Apache Spark data processing platform. Previously, Warcbase also supported temporal browsing via HBase, a distributed, scalable, big data store that supports low-latency random access to large datasets. This feature, however, was little used, since scholars were already accustomed to using the Internet Archive's Wayback Machine for temporal browsing; hence, AUT removed such support from Warcbase.

On top of Spark's core data structure, known as resilient distributed datasets (RDDs), the Archives Unleashed Toolkit provides three main capabilities specific to web archiving:

• Input and output connectors. Our toolkit provides abstractions to handle robust parsing of the WARC and ARC container files that comprise standard web archive collections, exposing RDDs with records containing usable fields such as HTML content and HTTP headers (for web pages). It is easy to understate the amount of effort that has been devoted to building robust input connectors: throughout the project we have encountered countless errors, corner cases, and non-compliant data across hundreds of terabytes of web archives, each of which we've had to manually debug and build error handling code for. This has likely been the biggest sink of development effort. In addition to input connectors, the toolkit also includes output connectors to facilitate interoperability with external applications, for example, exporting webgraphs in the GraphML format to be used with the Gephi graph visualization platform.

• Library of user-defined functions (UDFs) for extraction. To support extraction of useful information from raw individual records, the toolkit provides a collection of UDFs to conduct various analyses. These include functions for manipulating URLs, working with text, hyperlinks, images, and more. This library is constantly evolving in response to scholars' needs.

• Convenience transformations. Spark programs are typically described as a sequence of transformations operating on RDDs. Many commonly-used sequences of transformations, such as filtering RDDs to retain records of a certain type (e.g., HTML pages) as well as a number of frequently-used aggregations, are encapsulated in "convenience transformations" to reduce the verbosity of analysis scripts.

A simplified AUT script (in Scala) for generating the domain webgraph (i.e., one of our standard derivatives, see Section 4.4) is shown on the left side of Figure 2. The script has been simplified for presentation purposes, but retains the same conceptual structure as the actual working version. It begins by invoking the RecordLoader to load a collection: note here that we hide details such as compression, processing of multiple files, WARC vs. ARC formats, etc. The scholar only needs to specify the path of the directory containing the data. Next, keepValidPages is a convenience transformation that filters the raw archive RDD to retain only valid HTML pages. The transformation hides many details that go into the decision of whether a record is a "valid" HTML page, based on the server's response MIME type, file extension, and other details.

After the collection has been filtered to retain only the HTML pages, hyperlinks are extracted, and from the source and target of the hyperlinks we keep only the domain (the flatMap and map transformations; ExtractLinks and ExtractDomain are UDFs). A filter transformation is applied to discard empty output, and the results are then aggregated by count: countItems is another convenience transformation that AUT provides (essentially serving as syntactic sugar for a groupBy and count). Another filter is applied to discard all target domains that receive five or fewer inlinks (to reduce the amount of noise in the webgraph) before the final (source domain, destination domain, count) output tuples are materialized and saved to a text file. Here, we can see exactly how activities in our process model (Section 4.2) translate into RDD transformations in an AUT script, although in this case the script does not generate a visualization, but rather stores the output for subsequent consumption elsewhere.
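As a rough illustration of what such a convenience transformation hides, countItems behaves approximately like the following sequence of standard Spark operations (an approximation for exposition, not the toolkit's exact implementation):

// Roughly what rdd.countItems() expands to: pair each item with 1,
// sum the pairs by key, and sort by descending count.
rdd.map(item => (item, 1))
   .reduceByKey(_ + _)
   .sortBy(_._2, ascending = false)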
Given this script, the Spark engine orchestrates execution over collections at scale. Although Spark was designed to run on clusters, we have primarily processed collections on individual multi-core servers. This decision makes sense for a few reasons: since our project uses transient cloud virtual machines, spinning up and down clusters on demand adds an additional level of configuration complexity that is not strictly needed for our use case. Spark is still able to make use of multiple cores on a single machine to analyze collections in parallel. Most collections can be processed by individual servers within reasonable amounts of time, and furthermore our jobs are not latency sensitive. For even our largest collections (see Section 6), Spark has proven to be robust and has no problem with job scheduling or task management—it is simply a matter of waiting for jobs to complete. Our users are a patient lot.

To complete this walkthrough, the domain webgraph can be ingested into the popular open-source network analysis platform Gephi [2] for visualization and further exploration. The software has robust documentation and an active user base within the digital humanities and computational social sciences, making it a natural platform for scholars interested in web archives. In this example, we use a web archive of the Ministry of Environment of Québec, collected between 2011 and 2014 by the Bibliothèque et Archives nationales du Québec. After a series of basic transformations in Gephi—in this case, sizing the nodes and labels based on computed PageRank values—the scholar can see the basic outlines of the collection's hyperlink structure, shown in Figure 3. Potential research questions begin to emerge: For example, almost all hyperlinks are to pages within the qc.ca domain; there are few external ones beyond a few web services or platforms (Twitter for embedded accounts or Google). Given the history of Canadian federalism, which has seen uneasy relationships between federal and provincial levels of government, it is notable that there are no links to Environment Canada, their federal counterpart. Already, a research project—looking at various provincial ministries to explore changing relationships with federal counterparts—begins to take shape.

Figure 3: A visualization of the domain webgraph from a web archive of the Ministry of Environment of Québec, collected between 2011 and 2014 by the Bibliothèque et Archives nationales du Québec.

Development efforts on the toolkit have followed standard best practices for open-source software. Source code is held in a public GitHub repository,2 where we extensively use "issues" to keep track of bugs and feature requests, as well as for planning new features and discussing high-level design. All issues are public and participation is open to everyone who may be interested—while (quite obviously) our team is the most active in the online forum, we regularly receive comments and feedback from the community. Modifications to the codebase occur via pull requests and undergo code review before they are merged to the master branch. The toolkit has fairly good unit test coverage, and standard continuous deployment tools simplify automated testing as part of the code review process. We create official releases periodically using standard toolchains.

2 https://github.com/archivesunleashed/aut

To conclude our discussion of the toolkit, we present two ongoing development efforts:

Transitioning from RDDs to DataFrames. While RDDs provide the core data structure in Spark, the platform has seen the emergence of DataFrames as an alternative higher-level abstraction for manipulating large collections of records. The primary difference is that DataFrames conform to schemas and are organized into named columns, much like tables in a relational database, whereas RDDs can comprise heterogeneous records of arbitrary format. While RDDs are more flexible—in that they support arbitrary record-level transformations—this flexibility is rarely needed by scholars, and in fact can lead to confusion and unexpected errors (for example, type mismatches). With DataFrames, the scholar can refer to fields using meaningful names like the src and dest of a hyperlink, as opposed to RDDs, where they must use Scala's underscore notation (e.g., r._1) to access individual fields in a tuple. Furthermore, Spark DataFrames were inspired by DataFrames in the highly-popular Pandas package for data analysis in Python. Many scholars are already familiar with Pandas DataFrames, and thus they would be comfortable manipulating Spark DataFrames with minimal training. This flattens the learning curve and lowers barriers to adoption. There are additional performance benefits as well: conformance to schemas allows the Spark engine to safely make certain optimizations that would not be possible with RDDs, and thus certain operations with DataFrames may become quicker to execute.

We are currently in the process of replicating RDD features using DataFrames in Scala, with the goal of providing two different approaches to analyses; i.e., everything that can be done with RDDs will have a counterpart using DataFrames. More concretely, this involves creating different DataFrame "views" on the raw archive records, since our input connectors are still written in terms of RDDs and we are leveraging Spark's internal machinery to build DataFrames from RDDs. For example, the webgraph view presents a table of hyperlinks in a collection, with four columns: the crawl date, the source URL, the destination URL, and the anchor text.
Using this DataFrame, we can extract the domain webgraph with the simplified script shown on the right of Figure 2. The juxtaposition of the RDD version and the DataFrame version highlights some of the advantages of DataFrames: the scholar can refer to named columns using the dollar sign ($) notation, which allows her to more easily track the course of each datum through the analytical flow. Although conceptually the records are undergoing similar transformations, the DataFrames code is simpler and easier to understand. We have verified that execution of the DataFrames script is no slower than the RDD version, and thus clarity does not come at the cost of performance.

Transitioning from Scala to Python. As Spark was implemented in Scala, we followed suit and built AUT primarily in Scala as well. Although scholars are less likely to be familiar with Scala (compared to Python or R), we have not found the language choice to be an insurmountable barrier based on experience from our datathons. Nevertheless, it would be desirable to allow scholars to conduct analyses in Python, which is the language they are most likely to already know. This is possible with PySpark, which provides a Python interface to Spark, and we are currently in the process of transitioning to Python as the default language when using the toolkit. Specifically, we aim to replicate in Python all existing functionalities in Scala. In conjunction with DataFrames, the next iteration of the toolkit will be more intuitive and familiar to scholars, further reducing the barriers to adoption.

6 THE CLOUD
The Archives Unleashed Cloud (AUK) was conceived as the "canonical deployment" of the Archives Unleashed Toolkit. Although AUT is open source, we anticipated that installing, configuring, and deploying the toolkit might pose too high a barrier of entry for most scholars. Instead, we aimed to create a "cloud portal" whereby scholars could easily ingest their collections and leverage the capabilities of AUT, while we handled configuration, maintenance, and deployment transparently behind the scenes. Currently, AUK is built with tight integration with Internet Archive's Archive-It service: As discussed in Section 2, this represents the greatest potential for achieving impact.

From the technical perspective, the Archives Unleashed Cloud (AUK) is a Rails application. Users can log in via their GitHub or Twitter credentials (the two current authentication methods we support), and are then presented with a basic dashboard interface. Here, users provide their Archive-It credentials, which triggers a background job that imports metadata from the collections in their Archive-It account. Once this job is finished, the user is notified via email that they can now analyze any of their Archive-It collections. A screenshot of this dashboard interface is shown in Figure 4.

Figure 4: A screenshot from the Archives Unleashed Cloud, showing the user dashboard that provides an overview of collections available for analysis.

Once a collection is selected for analysis, AUK triggers a chain of background jobs to perform the actual computation. These jobs physically execute on infrastructure provided by Compute Canada, which is a service that provides Canadian researchers with computing resources. First, the raw WARC (or ARC) files comprising the collection are copied over to our Compute Canada storage via Archive-It's Web Archiving Systems API (WASAPI) data transfer endpoint. Once the download job is finished, AUK coordinates the generation of the standard derivatives discussed in Section 4.4 using the toolkit. This is accomplished by provisioning a single-node server for Spark execution, as described in Section 5.

A few smaller jobs follow, and the user is notified via email once the entire processing pipeline has finished. The final product is a collection overview page: an example for the Anarchist Archives collection from the University of Victoria is shown in Figure 5.

Figure 5: A screenshot from the Archives Unleashed Cloud showing the overview of a collection, which provides a domain webgraph visualization, distribution of top domains, and links for downloading derivatives.

For collections that are manageable in size, we provide a JavaScript-based visualization of the domain webgraph; a bar chart showing the top 10 domains in the collection is also provided. Finally, the overview page provides download links for the derivative products that were created by the toolkit. Currently, these products are available in CSV format, but we are in the process of adding support for Apache Parquet, a popular columnar storage format for big data. The webgraphs are also available in a format that can be directly read by the Gephi graph visualization platform.

The Archives Unleashed Cloud itself is open source and publicly available on GitHub,3 and thus anyone could run their own private instance.

3 https://github.com/archivesunleashed/auk