The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

Nick Ruest,1 Jimmy Lin,2 Ian Milligan,3 and Samantha Fritz3
1 York University Libraries
2 David R. Cheriton School of Computer Science, University of Waterloo
3 Department of History, University of Waterloo

arXiv:2001.05399v1 [cs.DL] 15 Jan 2020

ABSTRACT
The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building—all proceeding concurrently in mutually-reinforcing efforts. As we near the end of our initially-conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from about two hundred users, totaling over 280 terabytes of web archives.

1 INTRODUCTION
The Archives Unleashed project aims to improve scholarly access to web archives using a three-pronged strategy that simultaneously addresses technology, process, and community. Our efforts began in 2017 with the generous support of a grant by the Andrew W. Mellon Foundation and supplemented by a number of other sources. As we are nearing the end of the initially-conceived three-year project, the goal of this paper is to share with the broader community experiences we have accumulated along the way.

There is no doubt that scholars are ill-equipped to face the already-here deluge of digital materials for study, but how to provide scholarly access remains a stubborn, unsolved problem. In this project, our focus is on web archives, but the same can be said of tweets, emails, and a plethora of born-digital records. As the field of "web history" emerges, as defined by the SAGE Handbook of Web History [4] and two recent monographs [3, 13], this problem is becoming all the more pressing. Since the rationale for and importance of preserving the web has already been articulated elsewhere, we find no need to repeat those arguments here.

Recognizing that this challenge can only be addressed by a multi-disciplinary team, we assembled one at the outset comprising three main stakeholders: librarians and archivists who are charged with gathering and managing digital collections, computer scientists who build the tools necessary to manipulate those collections, and, of course, scholars who interrogate their contents to support various lines of inquiry. Throughout this paper, we use the term "scholar" as a shorthand for the individuals who make use of web archives for their studies, typically historians, digital humanists, or social scientists. Due to the composition of our team, the needs of historians are perhaps best represented, but our engagements with the broader community through datathons (discussed below) and other workshops have ensured that the social sciences and the digital humanities have not been shortchanged.

The biggest hurdle we set out to overcome is the classic chicken-and-egg problem in technology adoption: On the one hand, without guidance from "users" (that is, the scholars who study web archives), tool builders are taking stabs in the dark on what the real needs are. They might produce computational artifacts that are not useful and perhaps tackle the wrong aspects of the overall challenge. On the other hand, without any existing tools as reference points and often lacking the technical training to understand "what's possible", scholars may have a hard time articulating their needs. Without a community of eager scholars, tool builders are often hesitant to invest in software development efforts that might not lead to meaningful adoption. And of course, without the right computational tools, scholars cannot make headway in their inquiries. Thus, we were at an impasse.

The goal of this paper is to report how we have made progress in providing scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building—all proceeding concurrently in mutually-reinforcing efforts. Although the problem remains far from solved, significant progress has been made: Our contributions comprise lessons learned that are hopefully valuable to the rest of the community and can be transferred beyond web archives into other domains of digital preservation as well.

2 PROJECT HISTORY
Warcbase [12], the immediate predecessor to our current project, was conceived as a scalable web archiving platform to support temporal browsing and large-scale analytics. Development of the platform began in 2013 by one of the co-authors of this paper (Lin), a computer scientist working on big data, and lacking input from scholars, it suffered exactly from the "tool builder without users" problem discussed above. At around the same time, another one of the co-authors (Milligan), a historian studying early web history, faced the exact opposite problem: he was engaging in scholarly inquiry at scale without the proper computational tools and thus struggled to make headway. A fortuitous and timely gathering of the co-authors, along with many like-minded individuals at the Working with Internet Archives for Research (WIRE) Workshop in 2014, planted the seeds of what would eventually evolve into the Archives Unleashed project.
In 2015, a group of colleagues who had coalesced at the workshop (including the two co-authors mentioned above) collectively stumbled upon a potential solution to the scholarly access problem: to overcome the chicken-and-egg problem by building tools and community simultaneously. Through a series of in-person "datathons" dubbed "Archives Unleashed", the group brought together a few dozen stakeholders to engage in community building and skills training [14]. Over the course of two to three days, tool builders (typically, computer scientists) interacted with scholars, librarians, and archivists with the goal of overcoming the challenges outlined above. The first datathon was held in Toronto in March 2016, primarily with the support of the Social Sciences and Humanities Research Council of Canada and the U.S. National Science Foundation (plus additional contributions from a host of other organizations). The event successfully brought together a group of stakeholders to initiate the community building process, which continued through three more events, in June 2016, February 2017, and June 2017.

The present Archives Unleashed project retained the catchy moniker from these previous datathons and officially began in June 2017 with a different team. By that time, the approach to simultaneously building technology and community seemed to be gaining traction, and thus it made sense to build on those initial successes. We additionally recognized the importance of process (more details below), thus leading to a three-pronged strategy:

(1) Development of the Archives Unleashed Toolkit (AUT). This open-source toolkit represents the evolution of the Warcbase platform, with the benefit of a better understanding of scholars' needs, gained through both the composition of our team and experiences from the datathons.

(2) Deployment of the Archives Unleashed Cloud (AUK). To bridge the gap between open-source tools and their deployment at scale, we built a one-stop portal that allows users to ingest their collections and execute a number of analyses with a few keystrokes and mouse clicks. This service operationalizes a process model for scholarly inquiry that provides guidance on how to interrogate collections at scale. Although the Archives Unleashed Toolkit is open source, which means that anyone is able to install and run the software, we expect that most scholars would be uninterested in managing their own infrastructure. Thus, AUK can be viewed as the "canonical instance" of AUT where we handle deployment and maintenance, freeing scholars to focus on their inquiries.

(3) Organization of Archives Unleashed Datathons. These efforts at community-building and outreach represent a continuation of the previous in-person events, but with a shift in focus to the tools and services that the project was developing. In addition, we used these venues as an opportunity to develop longer-term plans to sustain the scholarly community around web archiving beyond the life of the project.

In order to more easily make headway in addressing these goals, our efforts primarily focused on thematic web archive collections being gathered by a diverse group of small to medium cultural heritage institutions, specifically those who are subscribers to Internet Archive's Archive-It service. These subscribers represent the vast majority of institutions in the United States that engage in active web archiving efforts; the 2017 National Digital Stewardship Alliance Web Archiving Survey Report found that 94% of institutions were using Archive-It that year, with an additional 4% collecting via separate Internet Archive contracts [9]. Individual collections of this sort typically range from tens of gigabytes to (a small number of) terabytes. Quite specifically, collection development and content harvesting lie outside the scope of our project, as there are already many existing resources that provide guidance on those aspects of the web archiving lifecycle. Thus, we assume that an organization already has a collection of WARC files (the standard container format for web archives) or ARC files (an older file format), in most cases already held by the Internet Archive as part of Archive-It, and desires to provide scholarly access to them.

We believe that targeting Archive-It subscribers maximizes our potential impact. By definition, these organizations already recognize the need for web archiving, and most of them have already begun to harvest content that meets their institutions' development needs. These small-to-medium organizations often operate with barebones staffing, and hence are not in a position to actively facilitate scholarly access—yet they truly feel the pain of letting their collections lie fallow. Our project aims to answer the question: "We've begun a web archiving program and have gathered a few collections, now what?" While our toolkit may also be useful to large organizations, for example, national libraries, they are not the primary audience we intend to serve.

Given this brief overview of the genesis of the Archives Unleashed project, the remainder of this paper will focus on lessons we have learned along the way that we hope will be valuable to the broader community.

3 HIGH-LEVEL LESSONS
As with most technology adoption challenges, perhaps unsurprisingly, the technology itself (AUT and AUK in our case) is relatively straightforward. The toolkit and the cloud service represent fairly standard instances of open-source software development, and we have adopted standard best practices that lead us down relatively well-trodden paths. From the technical perspective, the biggest innovation in the toolkit (beyond its Warcbase foundations) is the move from Spark's resilient distributed datasets (RDDs) to DataFrames and a shift from Scala to Python as the language of choice (see Section 5 for more details).

We have come to realize that the process of scholarly inquiry (in the face of the daunting sizes of many collections) is the most critical component in building and sustaining a community around web archiving. This is not to diminish the importance of getting the community together in the first place (the third prong of our strategy), a challenge our datathons have already made some headway on. However, even once we get all the stakeholders "in the same room", they still need to "do something" to engage in scholarly inquiry, and that "something" can leverage our proposed process model to serve as a framework for organizing their activities. We had previously proposed what we called the FAAV cycle [12] to characterize how scholars interrogated web archives, comprised of four main activities: filter, analyze, aggregate, and visualize.

In this paper, we present a refinement to FAAV that we term FEAV, where "analyze" has been replaced with "extract" for a more accurate characterization.
This process model provides a descriptive characterization of scholarly activity based on our observations and can be used prescriptively in a pedagogical manner. It offers an organizing framework for the rest of the project, guiding the development of the toolkit as well as the cloud service.

The articulation of the FEAV process represents a major contribution of our work. Our single most important insight is that the main activities (filter, extract, aggregate, visualize) can be disaggregated across time, space, and tools. That is, each of the activities need not occur in the same session or even the same location, and perhaps most importantly, with the same tools. More concretely, we have realized that a number of "derivative products" provide useful starting points to scholarly inquiry, more so than the raw collections themselves (at least at the outset). These derivatives are essentially the output of a pre-determined pipeline of filtering, extraction, and aggregation that we explicitly store and share. In addition to being useful from a scholarly perspective, these derivatives are also much smaller than the raw collections, typically manipulable on a laptop or in a cloud notebook. These derivatives and the disaggregation of FEAV yield important implications for adoption, which we detail in Section 4.4.

4 THE FEAV PROCESS MODEL
As we have argued above regarding the central role of process in facilitating scholarly access to web archives, it makes sense to begin with a detailed discussion of our FEAV model.

Our previous work has considerably influenced this process model. Even prior to the Archives Unleashed project, in the context of Warcbase, we aimed to serve a well-defined group of scholars as our primary audience. Although we did not expect these scholars to have formal computer science training, our initial efforts targeted those who were already comfortable with a scripting language such as Python or R. We assumed that the scholars could perform simple manipulations of datasets in well-defined formats (e.g., CSV or JSON), use standard libraries to execute common tasks (e.g., create a word cloud), or were proficient enough with search engines to figure out what to do by searching online for instructions. Furthermore, we expected that the scholars were already comfortable with the command line and possessed basic knowledge of the file system (e.g., moving or copying files and directories).

In our experience, such a skillset is relatively common, particularly among recently-trained scholars who want to seriously work with digital objects. It might not be unrealistic to expect that these form the "core competencies" of all future scholars in this intellectual space, but how to bring everyone up to this level of technical proficiency is beyond the scope of our work. More importantly, even if graduate schools and professional organizations within the humanities and social sciences are not properly training scholars in these skills, they are competencies obtainable through libraries and other free resources. The Programming Historian, for example, has existed in various formats since 2008 [1] and Software Carpentry (and related projects such as Data Carpentry) has been running in-person workshops around the world on basic computational skills and principles. A growing body of literature explores best practices for how best to develop collaborative lessons for these diverse learners [8]. In other words, it would truly be a bridge too far in most cases for a historian to learn how to parse one hundred gigabytes of raw WARC files without assistance; but, with the support of appropriate training and resources (such as those referenced above), scholars can be reasonably expected to develop fluency in using Jupyter notebooks to manipulate modest amounts of CSV or JSON data (for example).

4.1 From Madlibs to a Process Model
For our target group of scholars, early datathons showed that they generally had little difficulty grasping the concepts (e.g., transformations on large collections of data records, accessing fields in a tuple, etc.) and the "mechanics" of using Warcbase (and later, the Archives Unleashed Toolkit). However, when it became time to analyze their own collections, the scholars were often unsure where to begin [12]. Indeed, we were faced with this conundrum: with us in the room to provide support, scholars could perform amazingly creative analyses, but few adopted the technology on their own without assistance. It soon became clear that scholars did not know where to start. Faced with several hundred gigabytes of WARC files and a command shell (with a blinking cursor) ready to accept commands, what to do?

We have attempted to jumpstart the process of inquiry by providing a number of examples illustrating the capabilities of our tools, adopting a "madlibs" (i.e., fill-in-the-blank) approach; more in Section 7. For example, we illustrate: this is how you find the website that has the most mentions of Canadian Prime Minister Justin Trudeau—if you want to change the person of interest and the collection, change the variables here and there. Or, this is how you create a word cloud of the crawl from September 2014—you can change the temporal interval using this variable, or focus on a particular domain by setting that variable.

What began to emerge, essentially, was a "cookbook" with a series of "recipes" for addressing a number of common analytics tasks. Indeed, this format now forms the backbone of our online documentation.1 These recipes can be phrased in the form of "how do I..." questions, for example:

• extract all URLs in a collection?
• count the occurrences of different domains?
• extract plain text from URLs matching a pattern?
• compute counts of specific keywords?
• find the most common person mentioned?
• extract the anchor text of links to a particular URL?
• find all pages that link to a site?
• determine the most popular gif?
• compute the checksum of images?

A sketch of one such recipe appears at the end of this subsection. With this cookbook approach, combined with observations of how our tool was being used successfully in the datathons and through interactions with other scholars, we began to generalize scholarly activities into a process model we termed the FAAV cycle, first articulated in Lin et al. [12]. FAAV stands for "filter", "analyze", "aggregate", and "visualize"—the four main activities we found scholars engaging in as part of their inquiries. Over the past few years, we have further refined FAAV into FEAV, which we describe in detail next: the main refinement is replacing "analyze" with "extract" for greater descriptive accuracy.

1 https://github.com/archivesunleashed/aut-docs
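To give a flavor of these recipes, the following sketch answers the second question above ("how do I count the occurrences of different domains?") in the style of the toolkit's Scala scripts shown later in Figure 2. The collection path is a stand-in, and the exact function names follow the toolkit's conventions rather than any particular release:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Count the occurrences of different domains: filter to valid pages,
// extract each page's domain, then aggregate by count.
RecordLoader.loadArchives("collection/", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)            // the ten most frequent (domain, count) pairs
  .foreach(println)

In the madlibs spirit, changing the collection path or swapping ExtractDomain for another extraction function yields a different recipe with the same overall shape.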
4.2 The Four Main Activities
The four main activities of the FEAV model are "filter", "extract", "aggregate", and "visualize" in support of scholarly inquiry:

Filter. Typical web archive collections range from tens of gigabytes to terabytes; it is relatively rare—with perhaps the exception of high-level exploratory probes and "distant reading" studies from the greatest distance—that an individual analysis would consume the entire collection. Thus, a scholar usually begins by focusing on a particular subset of the web archive, which we characterize as filtering. This can be accomplished by content, metadata, or some extracted information. Content-based filtering might be based on text, e.g., consider only pages mentioning a particular set of keywords, or based on the hyperlink structure between the pages, e.g., consider only pages that link to a particular website or domain. Metadata-based filtering might be based on a particular range of crawl dates or pages whose URLs match a particular pattern. Filtering can also be based on any information extracted from either the content or metadata, as the result of running arbitrary user-defined functions (more below). The filtering criteria may be arbitrarily complex and nested, for example, a scholar is only interested in pages containing a particular keyword that link to a specific domain, within some temporal range.

Extract. After selecting a subset of material, the scholar typically then extracts some information of interest. Examples include extracting the plain text from the raw HTML source, identifying mentions of named entities (e.g., people, organizations, places), assessing the sentiment of the underlying text, etc. Extractions need not be limited to HTML—for example, a scholar might be interested in PDFs in a collection—or even to textual data—we have begun experimenting with pipelines that analyze images, for example. Typically, extraction is accomplished by user-defined functions (UDFs); the user in this case refers to the programmer. Generally, we would not expect scholars to write their own UDFs (at least at the outset); instead, we provide a library of UDFs to perform common operations that they can draw from and assemble in novel combinations. Note that UDFs are extensible and can invoke arbitrarily-complex functionalities—for example, leveraging a deep neural network to perform object detection on images [19].

Aggregate. The output of filtering and extraction is a collection of records of interest, typically already far smaller (often, orders of magnitude smaller) than the raw collection. In most cases, however, these records need to be aggregated or summarized before they are suitable for human consumption. The simplest example of aggregation is to produce a table of counts, e.g., how many times a person or location has been mentioned within a set of pages, how many times a particular page has been linked to, etc. Other common examples include finding the maximum (e.g., the page with the most incoming links), the minimum (e.g., the least frequently-mentioned name from some list of individuals), or the average (e.g., the average sentiment expressed across pages of a website as determined by an automatic classifier).

Visualize. Finally, the aggregated results are presented in some sort of visualization for the scholar's consumption. The visualization can be as simple as a table or list showing individual records, directly generated by our toolkit, or the output of the toolkit can be passed to an external application to generate complex, interactive visualizations. These visualizations can be static (e.g., a figure, graph, or plot) or interactive; they can be designed to support further exploration (i.e., an intermediate product) or be prepared for a publication or a blog post (i.e., for dissemination). It is not our intention for the Archives Unleashed Toolkit to evolve into a comprehensive visualization framework; instead, our main goal is to support interoperability with existing visualization toolkits in the broader ecosystem that scholars may already be familiar with.

4.3 Discussion
Although we describe the FEAV process model as comprised of distinct activities, it is important to note that all of the activities are closely intertwined in practice, and may not even proceed in the described order. For example:

Does a scholar filter first, then perform extraction, or vice versa? Sometimes, filtering is performed on extraction results—for example, running a named-entity detector to identify person names, and then considering only pages that mention certain names. In other cases, the distinction between filtering and extraction is blurred—for example, filtering based on hyperlinks technically requires running a link extractor on the page first, but conceptually, the scholar's goal is to filter, not to extract. In practice, filtering and extraction are often tightly coupled.

Filtering is also commonly applied to the output of aggregations. After counting, the scholar might wish to discard items that appear too frequently or too rarely. In web collections, there is inevitably a long tail of items that appear only a small number of times (e.g., misspelled names), and for the most part, scholars are not interested in those. On the flip side, there are often items that appear very frequently (e.g., "here" as anchor text) and hence it makes sense to discard those so as not to clutter up the analysis. The combination of these activities in our process model, we argue, is flexible enough to accommodate diverse scholarly needs.

Finally, filtering, extraction, and aggregation might proceed cohesively in a tight loop: For example, an initial set of filtering keywords yields too many pages and thus requires additional refinement to produce a manageable set of results. Or the opposite could happen—the filter was too restrictive and did not retain enough records for analysis. These activities might even alter the course of scholarly inquiry—for example, a result leads to an interesting question that compels the scholar to follow another tangential thread.

Note that during the normal course of scholarly inquiry, some of the activities might be skipped altogether. For example, if the filter is very specific and retains only a handful of records (e.g., half a dozen pages), then the scholar may decide to examine those results directly—in this case, there is no meaningful aggregation and visualization to speak of. By varying the specificity of the filter, a scholar can switch between "distant" or "close" reading [15] when using our toolkit. Once again, these possibilities are captured by our process model.
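To make these activities concrete, the following hypothetical sketch shows how filtering, extraction, and aggregation might compose in a single script, written in the style of the toolkit's Scala scripts shown later in Figure 2. The collection path, URL pattern, keyword, and output path are stand-ins, and the particular filters (keepUrlPatterns, keepContent) follow the toolkit's conventions rather than any specific release:

import io.archivesunleashed._

RecordLoader.loadArchives("collection/", sc)
  .keepValidPages()                               // filter: valid HTML pages only
  .keepUrlPatterns(Set(".*gouv\\.qc\\.ca.*".r))   // filter: URLs within a domain of interest
  .keepContent(Set("environnement".r))            // filter: pages mentioning a keyword
  .map(r => r.getCrawlDate)                       // extract: the crawl date of each match
  .countItems()                                   // aggregate: matching pages per crawl date
  .saveAsTextFile("keyword-by-date/")             // store for visualization elsewhere

Note that visualization is deliberately absent here: as discussed above, the output can simply be stored and handed to an external plotting tool.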
4.4 Standard Derivatives
Both our original FAAV and the refined FEAV model emerged as a descriptive, bottom-up characterization of what we observed scholars doing as they engaged with web archives. It was not our original intent that the model be applied prescriptively—our goal was merely to provide a reference that scholars could consult. Later on, though, we realized that our model could indeed serve a pedagogical purpose; that is, during tutorial sessions in our datathons, we offered FEAV as a way to help new users of the toolkit structure their approach to inquiry. We would offer: once you've formed your initial research question, think about what subset of the collection you would need to interrogate, what information you'd need to extract from those pages, etc.

As a result of this guidance, many analyses began much the same way, with very similar initial steps. In particular, we have found three analysis products to be so frequently used by scholars that with the Archives Unleashed Cloud and for collections used at our datathons, we have begun to pre-generate and store them, before they are even requested. Among other benefits, this obviates the need for scholars to run potentially large analysis jobs before they arrive at a datathon, saving precious time in our in-person events. These are what we term our three "standard derivatives", the creation of which is tightly integrated into the Archives Unleashed Cloud (see Section 6 for more details):

• Domain Distribution. We extract all URLs to compute the frequency of domains appearing in the collection.
• Domain Webgraph. We extract all hyperlinks to create a domain-to-domain network graph; that is, the hyperlink structure of the collection, aggregated at the domain level.
• Plain Text. We extract plain text from all web pages, along with metadata such as crawl date, domain name, and the URL.

Figure 1: The relationship between our FEAV process model and the Archives Unleashed Toolkit and Cloud. The toolkit is designed to support all four main activities, shown by the blue arrows. The cloud, in contrast, is designed to only support a subset of the activities, but additionally allows users to manage the ingestion of collections and downloading of derivatives (green arrows).

It remains an interesting question as to whether these derivatives are popular because we provide them, or if their popularity is in response to scholarly demand. The answer is likely a mixture of both, but these derivatives have two important properties: First, they are genuinely useful in answering a number of basic questions that are applicable to every web archive: The domain distribution provides a starting point to answering the question "What's in this collection?" It also provides the scholar with a sense of potential biases that may be present—by noting, for example, the over- or under-representation of certain sites. The domain webgraph provides a tractable overview of the collection structure—after all, the hyperlink structure is one important distinguishing characteristic that separates web archives from other large collections of documents. Page-to-page link structure is usually too fine-grained to be helpful in providing an overview, and we have found that domain-based aggregation provides a nice middle ground that shows interesting structures without being overwhelming in size. The plain text of pages is useful for obvious reasons, serving as the starting point for most content-based analyses.

Second, and perhaps more importantly, these derivatives are often small enough to be manipulable on the scholar's laptop. Our previous work [6] provides a detailed quantitative analysis of the relationship between raw and derivative sizes, but in rough terms, for a "typical" Archive-It collection, the domain distribution data is usually less than 1 MB, the domain webgraph is tens of MB, and the raw text is perhaps tens of GB.

An apt analogy to the creation and use of these derivatives might be the notion of an opening game in chess, which formulaically captures combinations of the first few moves in chess games. The existence of fixed opening moves and well-known countermoves certainly does not diminish the overall beauty of the game or the creativity necessary to play—in the same way that our derivatives do not diminish the creative potential of scholarly inquiry. These products merely present useful starting points, and scholars always have the option of starting from the raw web archives.

4.5 FEAV Disaggregation
Somewhat surprisingly, we have found the standard derivatives described above to be so useful that we encourage scholars (particularly those just learning about web archives) to use them as starting points of analyses, rather than the raw archives themselves. This has led to the single biggest insight that has facilitated adoption—the recognition that FEAV can be disaggregated across time, space, as well as tools. What does this mean? Allow us to explain:

The derivatives represent a combination of filtering, extraction, and aggregation whose output we capture, store, and share. Thus, when using these products, the scholars are taking the output of FEA and continuing with their own iteration of FEAV. At that point, how these derivatives came to be generated (when and where) becomes irrelevant (this is exactly the disaggregation we speak of), since scholars can engage with the material on their own laptops (a different time, a different place). Furthermore, scholars are not limited to analyses with our toolkit—they can use whatever tools they are already familiar with, be it Python, R, and yes, even Microsoft Excel. In this way, our process model supports seamless integration of our tools and services with the broader ecosystem. Given the popularity of Python for data science today, scholars can conduct analyses in notebooks, manipulate the derivatives using Pandas DataFrames, use existing packages for visualization, or take advantage of a multitude of other capabilities in the Python ecosystem. This means that scholars can begin inquiries into web collections without needing to know anything specific about web archives (for example, the difference between an ARC and a WARC or the subtleties in parsing hyperlinks to recover anchor texts).
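As an illustration of this disaggregated workflow, the sketch below reads a downloaded domain distribution derivative and lists the most frequent domains. For consistency with the paper's other examples we use plain Scala, though a scholar would more likely do the same in a Python notebook with Pandas; the file name and its one-record-per-line "domain,count" layout are assumptions for illustration only.

import scala.io.Source

// Hypothetical derivative: each line holds a "domain,count" pair.
val topDomains = Source.fromFile("domain-distribution.csv")
  .getLines()
  .map(_.split(","))
  .collect { case Array(domain, count) => (domain, count.toInt) }
  .toSeq
  .sortBy(-_._2)    // most frequent domains first
  .take(10)

topDomains.foreach { case (domain, count) => println(s"$domain\t$count") }

Nothing in this snippet is specific to web archives, which is precisely the point: the derivative is just another small dataset.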
RDD version (left):

RecordLoader.loadArchives("collection/", sc)
  .keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContent))
  .map(r => (ExtractDomain(r._1), ExtractDomain(r._2)))
  .filter(r => r._1 != "" && r._2 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("domain-webgraph/")

DataFrame version (right):

RecordLoader.loadArchives("collection/", sc)
  .webgraph()
  .groupBy(ExtractDomain($"src"), ExtractDomain($"dest"))
  .count()
  .filter($"count" > 5)
  .write.csv("domain-webgraph/")

Figure 2: Simplified Scala scripts for generating the domain webgraph using the Archives Unleashed Toolkit. On the left we present the analysis using resilient distributed datasets (RDDs); on the right we present the same analysis using DataFrames.

These relationships are shown in Figure 1. With the Archives Unleashed Toolkit, a scholar can directly engage in the FEAV processes (with perhaps the support of external applications for visualization)—these are indicated by the blue arrows. This, of course, requires the scholar to download, install, and configure the toolkit, presenting a potential barrier to entry, but offering the most complete suite of capabilities. Alternatively, the scholar can directly download the derivatives from the Archives Unleashed Cloud. Behind the scenes, the portal is still using the toolkit to perform filtering, extraction, and aggregation, but these steps are hidden from the scholar.

It is somewhat ironic that the success of our project means making our project invisible to scholars. For example, a social scientist downloads the domain webgraph of a particular collection, loads the structure into a Pandas DataFrame in Python, and proceeds to analyze the clustering properties of a number of websites of interest. She is unfamiliar with how the webgraph was extracted (through an AUT job triggered by AUK in the cloud, see Section 6), but those details are unimportant; the upshot is that we have enabled her research in an unobtrusive manner. The scholar already knew how to manipulate DataFrames using Pandas, and so analyzing web archives did not require learning any new skills.

We have fully embraced this philosophy of making our tools and services as invisible as possible, because the best strategy for adoption is to not require users to learn anything new! In this way, the Archives Unleashed Cloud has evolved into a conduit that allows users to ingest raw web archive collections and produce these derivatives for easy access.

5 THE TOOLKIT
The Archives Unleashed Toolkit (AUT) is the successor to Warcbase and can be characterized as a refinement of the older package, as opposed to an entirely new platform. For this reason, we focus on differences and new features, and avoid duplicating previously-published material; see Lin et al. [12] for additional details.

At a high level, AUT provides a domain-specific language for analyzing web archives built on top of the open-source Apache Spark data processing platform. Previously, Warcbase also supported temporal browsing via HBase, a distributed, scalable, big data store that supports low-latency random access to large datasets. This feature, however, was little used, since scholars were already accustomed to using the Internet Archive's Wayback Machine for temporal browsing; hence, AUT removed such support from Warcbase.

On top of Spark's core data structure, known as resilient distributed datasets (RDDs), the Archives Unleashed Toolkit provides three main capabilities specific to web archiving:

• Input and output connectors. Our toolkit provides abstractions to handle robust parsing of the WARC and ARC container files that comprise standard web archive collections, exposing RDDs with records containing usable fields such as HTML content and HTTP headers (for web pages). It is easy to understate the amount of effort that has been devoted to building robust input connectors: throughout the project we have encountered countless errors, corner cases, and non-compliant data across hundreds of terabytes of web archives, each of which we've had to manually debug and build error handling code for. This has likely been the biggest sink of development effort. In addition to input connectors, the toolkit also includes output connectors to facilitate interoperability with external applications, for example, exporting webgraphs in the GraphML format to be used with the Gephi graph visualization platform.

• Library of user-defined functions (UDFs) for extraction. To support extraction of useful information from raw individual records, the toolkit provides a collection of UDFs to conduct various analyses. These include functions for manipulating URLs, working with text, hyperlinks, images, and more. This library is constantly evolving in response to scholars' needs.

• Convenience transformations. Spark programs are typically described as a sequence of transformations operating on RDDs. Many commonly-used sequences of transformations, such as filtering RDDs to retain records of a certain type (e.g., HTML pages) as well as a number of frequently-used aggregations, are encapsulated in "convenience transformations" to reduce the verbosity of analysis scripts.

A simplified AUT script (in Scala) for generating the domain webgraph (i.e., one of our standard derivatives, see Section 4.4) is shown on the left side of Figure 2. The script has been simplified for presentation purposes, but retains the same conceptual structure as the actual working version. It begins by invoking the RecordLoader to load a collection: note here that we hide details such as compression, processing of multiple files, WARC vs. ARC formats, etc. The scholar only needs to specify the path of the directory containing the data. Next, keepValidPages is a convenience transformation that filters the raw archive RDD to retain only valid HTML pages. The transformation hides many details that go into the decision of whether a record is a "valid" HTML page, based on the server's response MIME type, file extension, and other details.

After the collection has been filtered to retain only the HTML pages, hyperlinks are extracted, and from the source and target of the hyperlinks we keep only the domain (the flatMap and map transformations; ExtractLinks and ExtractDomain are UDFs). A filter transformation is applied to discard empty output, and the results are then aggregated by count: countItems is another convenience transformation that AUT provides (essentially serving as syntactic sugar for a groupBy and count). Another filter is applied to discard all target domains that receive five or fewer inlinks (to reduce the amount of noise in the webgraph) before the final (source domain, destination domain, count) output tuples are materialized and saved to a text file. Here, we can see exactly how activities in our process model (Section 4.2) translate into RDD transformations in an AUT script, although in this case the script does not generate a visualization, but rather stores the output for subsequent consumption elsewhere.
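As a rough illustration of what such a convenience transformation hides, countItems behaves approximately like the following sequence of standard Spark operations (an approximation for exposition, not the toolkit's exact implementation):

// Roughly what rdd.countItems() expands to: pair each item with 1,
// sum the pairs by key, and sort by descending count.
rdd.map(item => (item, 1))
   .reduceByKey(_ + _)
   .sortBy(_._2, ascending = false)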
Given this script, the Spark engine orchestrates execution over collections at scale. Although Spark was designed to run on clusters, we have primarily processed collections on individual multi-core servers. This decision makes sense for a few reasons: since our project uses transient cloud virtual machines, spinning up and down clusters on demand adds an additional level of configuration complexity that is not strictly needed for our use case. Spark is still able to make use of multiple cores on a single machine to analyze collections in parallel. Most collections can be processed by individual servers within reasonable amounts of time, and furthermore our jobs are not latency sensitive. For even our largest collections (see Section 6), Spark has proven to be robust and has no problem with job scheduling or task management—it is simply a matter of waiting for jobs to complete. Our users are a patient lot.

To complete this walkthrough, the domain webgraph can be ingested into the popular open-source network analysis platform Gephi [2] for visualization and further exploration. The software has robust documentation and an active user base within the digital humanities and computational social sciences, making it a natural platform for scholars interested in web archives. In this example, we use a web archive of the Ministry of Environment of Québec, collected between 2011 and 2014 by the Bibliothèque et Archives nationales du Québec. After a series of basic transformations in Gephi—in this case, sizing the nodes and labels based on computed PageRank values—the scholar can see the basic outlines of the collection's hyperlink structure, shown in Figure 3. Potential research questions begin to emerge: For example, almost all hyperlinks are to pages within the qc.ca domain; there are few external ones beyond a few web services or platforms (Twitter for embedded accounts or Google). Given the history of Canadian federalism, which has seen uneasy relationships between federal and provincial levels of government, it is notable that there are no links to Environment Canada, their federal counterpart. Already, a research project—looking at various provincial ministries to explore changing relationships with federal counterparts—begins to take shape.

Figure 3: A visualization of the domain webgraph from a web archive of the Ministry of Environment of Québec, collected between 2011 and 2014 by the Bibliothèque et Archives nationales du Québec.

Development efforts on the toolkit have followed standard best practices for open-source software. Source code is held in a public GitHub repository,2 where we extensively use "issues" to keep track of bugs and feature requests, as well as for planning new features and discussing high-level design. All issues are public and participation is open to everyone who may be interested—while (quite obviously) our team is the most active in the online forum, we regularly receive comments and feedback from the community. Modifications to the codebase occur via pull requests and undergo code review before they are merged to the master branch. The toolkit has fairly good unit test coverage, and standard continuous deployment tools simplify automated testing as part of the code review process. We create official releases periodically using standard toolchains.

2 https://github.com/archivesunleashed/aut

To conclude our discussion of the toolkit, we present two ongoing development efforts:

Transitioning from RDDs to DataFrames. While RDDs provide the core data structure in Spark, the platform has seen the emergence of DataFrames as an alternative higher-level abstraction for manipulating large collections of records. The primary difference is that DataFrames conform to schemas and are organized into named columns, much like tables in a relational database, whereas RDDs can comprise heterogeneous records of arbitrary format. While RDDs are more flexible—in that they support arbitrary record-level transformations—this flexibility is rarely needed by scholars, and in fact can lead to confusion and unexpected errors (for example, type mismatches). With DataFrames, the scholar can refer to fields using meaningful names like the src and dest of a hyperlink, as opposed to RDDs, where they must use Scala's underscore notation (e.g., r._1) to access individual fields in a tuple. Furthermore, Spark DataFrames were inspired by DataFrames in the highly-popular Pandas package for data analysis in Python. Many scholars are already familiar with Pandas DataFrames, and thus they would be comfortable manipulating Spark DataFrames with minimal training. This flattens the learning curve and lowers barriers to adoption. There are additional performance benefits as well: conformance to schemas allows the Spark engine to safely make certain optimizations that would not be possible with RDDs, and thus certain operations with DataFrames may become quicker to execute.

We are currently in the process of replicating RDD features using DataFrames in Scala, with the goal of providing two different approaches to analyses; i.e., everything that can be done with RDDs will have a counterpart using DataFrames. More concretely, this involves creating different DataFrame "views" on the raw archive records, since our input connectors are still written in terms of RDDs and we are leveraging Spark's internal machinery to build DataFrames from RDDs. For example, the webgraph view presents a table of hyperlinks in a collection, with four columns: the crawl date, the source URL, the destination URL, and the anchor text.
Using this DataFrame, we can extract the domain webgraph with the simplified script shown on the right of Figure 2. The juxtaposition of the RDD version and the DataFrame version highlights some of the advantages of DataFrames: the scholar can refer to named columns using the dollar sign ($) notation, which allows her to more easily track the course of each datum through the analytical flow. Although conceptually the records are undergoing similar transformations, the DataFrames code is simpler and easier to understand. We have verified that execution of the DataFrames script is no slower than the RDD version, and thus clarity does not come at the cost of performance.

Transitioning from Scala to Python. As Spark was implemented in Scala, we followed suit and built AUT primarily in Scala as well. Although scholars are less likely to be familiar with Scala (compared to Python or R), we have not found the language choice to be an insurmountable barrier based on experience from our datathons. Nevertheless, it would be desirable to allow scholars to conduct analyses in Python, which is the language they are most likely to already know. This is possible with PySpark, which provides a Python interface to Spark, and we are currently in the process of transitioning to Python as the default language when using the toolkit. Specifically, we aim to replicate in Python all existing functionalities in Scala. In conjunction with DataFrames, the next iteration of the toolkit will be more intuitive and familiar to scholars, further reducing the barriers to adoption.

6 THE CLOUD
The Archives Unleashed Cloud (AUK) was conceived as the "canonical deployment" of the Archives Unleashed Toolkit. Although AUT is open source, we anticipated that installing, configuring, and deploying the toolkit might pose too high a barrier of entry for most scholars. Instead, we aimed to create a "cloud portal" whereby scholars could easily ingest their collections and leverage the capabilities of AUT, while we handled configuration, maintenance, and deployment transparently behind the scenes. Currently, AUK is built with tight integration with Internet Archive's Archive-It service: As discussed in Section 2, this represents the greatest potential for achieving impact.

From the technical perspective, the Archives Unleashed Cloud (AUK) is a Rails application. Users can log in via their GitHub or Twitter credentials (the two current authentication methods we support), and are then presented with a basic dashboard interface. Here, users provide their Archive-It credentials, which triggers a background job that imports metadata from the collections in their Archive-It account. Once this job is finished, the user is notified via email that they can now analyze any of their Archive-It collections. A screenshot of this dashboard interface is shown in Figure 4.

Figure 4: A screenshot from the Archives Unleashed Cloud, showing the user dashboard that provides an overview of collections available for analysis.

Once a collection is selected for analysis, AUK triggers a chain of background jobs to perform the actual computation. These jobs physically execute on infrastructure provided by Compute Canada, which is a service that provides Canadian researchers with computing resources. First, the raw WARC (or ARC) files comprising the collection are copied over to our Compute Canada storage via Archive-It's Web Archiving Systems API (WASAPI) data transfer endpoint. Once the download job is finished, AUK coordinates the generation of the standard derivatives discussed in Section 4.4 using the toolkit. This is accomplished by provisioning a single-node server for Spark execution, as described in Section 5.

A few smaller jobs follow, and the user is notified via email once the entire processing pipeline has finished. The final product is a collection overview page: an example for the Anarchist Archives collection from the University of Victoria is shown in Figure 5.

Figure 5: A screenshot from the Archives Unleashed Cloud showing the overview of a collection, which provides a domain webgraph visualization, distribution of top domains, and links for downloading derivatives.

For collections that are manageable in size, we provide a JavaScript-based visualization of the domain webgraph; a bar chart showing the top 10 domains in the collection is also provided. Finally, the overview page provides download links for the derivative products that were created by the toolkit. Currently, these products are available in CSV format, but we are in the process of adding support for Apache Parquet, a popular columnar storage format for big data. The webgraphs are also available in a format that can be directly read by the Gephi graph visualization platform.

The Archives Unleashed Cloud itself is open source and publicly available on GitHub,3 and thus anyone could run their own private instance.

3 https://github.com/archivesunleashed/auk