Introduction to Web Mining for Social Scientists
Lecture 8: Scrapers, Spiders, Crawlers
Prof. Dr. Ulrich Matter (University of St.Gallen)
13/04/2021

1 Web Mining Programs

In very simple terms, web data mining is about writing programs that automatically download web pages. Most of the simpler programs and scripts we have implemented so far can be called 'web scrapers' or simply 'scrapers', referring to the task of automatically extracting (scraping) specific data from a webpage. Scrapers are usually designed for a data extraction task related to a specific website or a clearly defined set of websites (e.g., scrape the headlines of all Swiss newspapers' homepages). More sophisticated web mining programs deal with situations where we want to automatically extract data from the Web without having a clearly defined set of pages to be visited. These programs incorporate an algorithm that traverses the Web, jumping from one website to another, extracting links to other websites, and storing information from the visited pages in a local repository. Such programs are generally referred to as 'crawlers', 'spiders', or 'robots' (all three terms refer essentially to the same concept). Crawlers contain the basic blueprint of a web scraper (HTTP interaction, extraction of data, storage of data) but work in a more autonomous and dynamic way, in the sense that they constantly update the list of URLs to be visited. This dynamic aspect requires some careful implementation considerations.

2 Applications for Web Crawlers

In computer science, the development of crawlers reaches a level of sophistication that goes well beyond the needs of typical social science research. In fact, a core industry of the modern Internet is built on the concept of crawlers: search engines. In order to work properly, a search engine such as Google's has to index (and therefore physically store) the entire open Web. Google's crawler constantly traverses, stores, and indexes all webpages it can find. When we search for a term on www.google.com, the search is actually executed on Google's copy of the Web; the URLs in the returned search results then point to the respective webpages on the real Web. As webpages across the Internet change frequently, Google's crawler needs to constantly traverse the Web at high speed in order to keep the index up to date. Otherwise, the returned search results might contain URLs that point to webpages that no longer exist. Such a high-performance crawler is not only a sophisticated piece of software; it also relies on a massive hardware infrastructure to yield the necessary performance.

Crawlers for search engines might be the most impressive kind from a technical point of view, but many smaller, less sophisticated crawlers are applied in other contexts. Such applications include:

• Business intelligence and marketing: collect information about competitors, competitors' products, and prices.
• Monitor websites in order to be automatically notified about changes.
• Harvest email addresses for spamming.
• Collect personal information for phishing attacks.
• Dynamic generation of websites with content crawled from other websites.
For academic research in the social sciences, crawlers are interesting tools for a number of reasons. They can serve as custom-made search engines, traversing the Web to collect specific content that is otherwise hard to find. They are a natural extension of a simple scraper focused on a specific website. And they are the primary tool of the trade if the research question at hand is specifically about the link structure of certain websites, that is, if we want to investigate which pages link to which other pages in order to build a link network. An illustrative and well-known example of the latter is the study by Adamic and Glance (2005), who investigate how political blogs linked to each other during the 2004 U.S. election. The authors reveal a clear political divide in the U.S. blogosphere: bloggers sympathizing with one of the two major parties (Republicans/Democrats) predominantly link to blogs/blog entries associated with the same party and rarely link to blogs that favor the other party's position (see Figure 1). Their study got a lot of attention, partly because it was interpreted as providing evidence for Cass Sunstein's famous thesis that the Internet generates 'filter bubbles' and 'echo chambers' and thereby fosters political polarization (Sunstein 2002).

Figure 1: Graph depicting the community structure of political blogs. Nodes are blogs, edges are links between blogs. The colors indicate political orientation, red for conservative and blue for liberal. The size of nodes reflects the number of in-links. Source: Adamic and Glance (2005).

3 Crawler Theory and Basic Concepts

A Web crawler is fundamentally a graph traversal algorithm (or 'graph search algorithm'), a process of visiting each node in a graph (network) where nodes represent webpages and edges represent hyperlinks.
The algorithm then starts with one or several nodes/pages (so-called 'seeds'), extracting each of the URLs to other pages, adding those URLs to the list of URLs to be visited (the 'frontier'), and then moving on to the next page. Figure 2 illustrates how such an algorithm basically works.
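To make the notions of seeds, frontier, and visited pages concrete, here is a small, self-contained R sketch of a breadth-first traversal; the toy link graph and all variable names are made up purely for illustration and are not part of the lecture code.

# Toy example (illustrative): breadth-first traversal of a small, made-up link graph.
# Each list element maps a 'page' to the 'pages' it links to.
toy_graph <- list(
     A = c("B", "C"),
     B = c("A", "D"),
     C = c("D"),
     D = c("A")
)

frontier <- c("A")        # seed
visited  <- character(0)  # pages already visited

while (length(frontier) > 0) {
     current  <- frontier[1]         # take the first page from the frontier (FIFO)
     frontier <- frontier[-1]
     if (current %in% visited) next  # skip pages we have already seen
     visited <- c(visited, current)
     # 'extract links' from the current page and add unseen ones to the frontier
     new_links <- setdiff(toy_graph[[current]], visited)
     frontier  <- c(frontier, new_links)
}

visited  # order in which the pages were visited: "A" "B" "C" "D"

A real crawler replaces the toy list lookup with an HTTP request and link extraction from the downloaded page, which is exactly what the implementation in Section 4 does.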
Figure 2: Illustration of a graph traversal algorithm (breadth-first). Source: https://algorithmsandme.in.

A key aspect of the crawling algorithm is the order in which new URLs are picked ('de-queued') from the frontier to be visited next. This, in fact, determines the type of graph traversal algorithm implemented in the crawler program. There are two important paradigms for how the next URL is de-queued from the frontier that matter in practical applications of web crawlers for relatively simple web mining tasks: first-in-first-out and different forms of priority queues. The comparison below contrasts the two approaches with regard to implementation and consequences (following Liu (2011, 311 ff.)).

First-in-First-Out
• Type of crawler: breadth-first crawler.
• Implementation: add new URLs to the tail of the frontier; pick the next URL to crawl from the head of the frontier.
• Consequence: crawling is highly correlated with the in-degree of a webpage (the number of links pointing to it). The result is strongly affected by the choice of seeds (the initial set of URLs in the frontier).

Priority Queue
• Type of crawler: preferential crawler.
• Implementation: keep the frontier ordered according to the estimated value of each page (based on some predefined criteria, e.g., words occurring in the URL); always pick the URL from the head of the frontier.
• Consequence: the result is highly dependent on how the ordering of the frontier is implemented.
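The difference between the two paradigms boils down to how the next URL is taken from the frontier. The following rough R sketch illustrates this; the helper functions and the keyword-based scoring rule are assumptions made for illustration, not part of the lecture code.

# Illustrative sketch: two ways to pick the next URL from the frontier.
# 'frontier' is a character vector of URLs; 'score_url' stands for any
# user-defined function that estimates the value of a URL (hypothetical helper).

# First-in-first-out (breadth-first): take the URL at the head of the frontier;
# newly found URLs are appended at the tail elsewhere in the crawler.
dequeue_fifo <- function(frontier) {
     list(next_url = frontier[1], frontier = frontier[-1])
}

# Priority queue (preferential): re-order the frontier by an estimated
# page value before taking the head.
dequeue_priority <- function(frontier, score_url) {
     frontier <- frontier[order(score_url(frontier), decreasing = TRUE)]
     list(next_url = frontier[1], frontier = frontier[-1])
}

# Example scoring rule: prefer URLs that contain the word 'election'
score_by_keyword <- function(urls) as.integer(grepl("election", urls))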
Without further specifications, both types of crawlers could (in theory) continue traversing the Web until they have visited all webpages on the Internet. This will hardly be needed for any standard application of crawlers in the social sciences. Thus, in most practical applications of a simple crawler following either one of the two principles described above, a stopping criterion will be defined, bringing the crawler to a halt after reaching a certain goal. This predefined goal can be of a quantitative or qualitative nature. For example, a data collection task to investigate the link structure surrounding a specific website (e.g., an online newspaper) might be limited to a certain degree of the link network. That is, we define that the crawler only visits all pages within the domain of the website, plus all pages directly linked from within the domain to pages outside the domain (but not pages linked from outside the domain to even further pages, etc.). Another criterion to stop the crawler could simply be the number of pages visited. Usually, such crawling rules can be implemented by restricting the type of URLs (depending on the domain) or the number of URLs that can be added to the frontier, and/or by keeping track of all the pages that were already visited (see the short sketch below).

In the simplest case, traversing the Web by extracting links from newly visited pages and keeping track of the pages already visited is all that a crawler does. In that case, the aim of the crawler is to collect data on the link structure itself. Any extensions regarding the collection of content can then be added by means of scraping tasks to be executed for each of the visited pages. In the simplest case, the 'scraping' task can consist of just saving the entire webpage to a local repository. Other tasks could involve extracting only specific parts (tables, titles, etc.) from the visited pages and saving these in a predefined format. In short, any of the scraping exercises in the previous lectures could be implemented in one way or another as part of a crawler. Thus, the scraping part of a Web crawler can be seen as a separate component, a module that can be developed and optimized separately (independently of how the graph traversal part of the crawler is implemented). Figure 3 illustrates a basic crawler that simply stores the visited webpages in a local repository.

Figure 3: Flow chart of a basic Web crawler. Source: Liu (2011, 313).
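Before turning to the practical implementation, here is a minimal sketch of such restrictions, expressed as simple checks on candidate URLs and a page counter; the domain string and function names are made-up examples rather than part of the lecture code.

# Illustrative helpers for typical stopping/restriction rules.

# keep only URLs within a given domain and drop pages already visited
keep_candidate_urls <- function(urls, visited, domain = "example.com") {
     urls <- urls[grepl(domain, urls, fixed = TRUE)]  # crude domain restriction
     setdiff(urls, visited)                           # drop already visited pages
}

# halt criterion based on the number of pages visited so far
reached_halt <- function(n_visited, max_pages = 100) {
     n_visited >= max_pages
}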
4 Practical Implementation

Following the flow chart and the discussion of the web scraping aspects of a crawler above, we now have a look at the practical implementation of the components making up a well-functioning breadth-first crawler. For each component, critical aspects of the implementation are discussed and implemented in R. For the sake of the exercise, we build a crawler that traverses the link structure surrounding a specific Wikipedia page until the frontier is empty or 100 pages have been visited.

4.1 Initiating frontier, crawler history, and halt criterion

The starting point of any crawler is to define the initial set of URLs to be visited (the seed, i.e., the initial frontier). Note that if the frontier consists of more than one URL, the order in which the URLs are entered into the frontier matters for the crawling result.

# PREAMBLE -----

# load packages
library(httr)
library(rvest)
library(xml2)
library(Rcrawler)

# initiate frontier vector with initial URLs as seeds
# (any Wikipedia article URL can serve as the seed; the one below is an example)
frontier <- c("https://en.wikipedia.org/wiki/Web_mining")
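The section title also mentions the crawler history and the halt criterion. A minimal sketch of how these could be initialized follows; the variable names are assumptions and not necessarily those of the original script.

# initiate the crawler history: URLs of pages already visited
visited_urls <- character(0)

# halt criterion: stop after this many visited pages
max_pages <- 100

# counter of pages visited so far
n_visited <- 0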
4.3 Extract links, canonicalize links, add to frontier

This component consists of what we would implement as the second component of a simple scraper. However, by the nature of the crawling algorithm, the focus is on how to extract URLs in order to add them to the frontier. Here, the 'canonicalization' of URLs is important for the robustness and stability of the crawler. The aim of canonicalizing URLs is to ensure that the URLs added to the frontier work and are not duplicated. Key aspects of this are to ensure that URLs are absolute and do not contain unnecessary parts and formatting (remove standard ports and default pages, coerce characters to lower case, etc.). For this task, we make use of the LinkNormalization() function provided in the Rcrawler package.

# LINK EXTRACTION -----

# extract all URLs (html_doc is the parsed page; current_url is its URL)
links_nodes <- html_nodes(html_doc, "a")
links <- html_attr(links_nodes, "href")
# canonicalize the URLs and add them to the frontier
frontier <- c(frontier, LinkNormalization(links, current_url))
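In addition to canonicalization, it is good practice to keep the frontier free of duplicates, of pages already visited, and (for this exercise) of URLs outside Wikipedia. A minimal sketch, assuming the visited_urls vector initialized above:

# DE-DUPLICATION -----

# drop duplicated URLs from the frontier and remove pages already visited
frontier <- unique(frontier)
frontier <- setdiff(frontier, visited_urls)

# optionally, restrict the crawl to English Wikipedia articles only
frontier <- frontier[grepl("^https://en\\.wikipedia\\.org/wiki/", frontier)]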
4.6 The crawler loop

All that is left to be done in order to have a fully functioning web crawler is to put the components together in a while-loop. With this loop, we tell R to keep executing all the code within the loop until the stopping criteria are reached. Note that crawlers based on while-loops without wisely defined control statements that keep track of when to stop could go on forever. Once a crawler (or any other program based on a while-loop) is implemented, it is imperative to test the program by setting the control variables such that we would expect the program to stop after a few iterations if everything is implemented correctly. If we were to implement a crawler that we expect to visit thousands of pages before reaching its goal, it might be hard to get a sense of whether the crawler is really iterating towards its goal or whether it is simply still running because we haven't correctly defined the halting criterion.

#######################################################
# Introduction to Web Mining 2017
# 8: Crawlers, Spiders, Scrapers
#
# Crawler implementation:
# A simple breadth-first web crawler collecting the
# link structure up to n visited pages.
#
# U.Matter, November 2017
#######################################################

# PREAMBLE -----

# load packages
library(httr)
library(rvest)
library(Rcrawler)

# initiate frontier vector with initial URLs as seeds
# (any Wikipedia article URL can serve as the seed; the one below is an example)
frontier <- c("https://en.wikipedia.org/wiki/Web_mining")
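As a minimal sketch (not the original handout's exact code; variable names such as visited_urls, n_visited, edge_list, and current_url are assumptions), the components discussed above can be combined into a while-loop along the following lines:

# CRAWLER LOOP -----

# initiate crawler history, page counter, and halt criterion
visited_urls <- character(0)
n_visited <- 0
max_pages <- 100

# container for the collected link structure (edge list: from -> to)
edge_list <- data.frame(from = character(0), to = character(0))

while (length(frontier) > 0 && n_visited < max_pages) {

     # de-queue the next URL from the head of the frontier (breadth-first)
     current_url <- frontier[1]
     frontier <- frontier[-1]

     # skip pages that have already been visited
     if (current_url %in% visited_urls) next

     # fetch and parse the page; skip it if the request fails
     html_doc <- try(read_html(current_url), silent = TRUE)
     if (inherits(html_doc, "try-error")) next

     # record the visit
     visited_urls <- c(visited_urls, current_url)
     n_visited <- n_visited + 1

     # extract and canonicalize all links on the page
     links <- html_attr(html_nodes(html_doc, "a"), "href")
     links <- unique(LinkNormalization(links, current_url))
     links <- links[!is.na(links)]

     if (length(links) > 0) {
          # store the link structure and add new URLs to the tail of the frontier
          edge_list <- rbind(edge_list, data.frame(from = current_url, to = links))
          frontier <- c(frontier, setdiff(links, visited_urls))
     }

     # be polite to the server: pause between requests
     Sys.sleep(1)
}

After the loop stops, edge_list contains the collected link structure (one row per link from a visited page to a target URL) and can be turned into a network object for further analysis.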
References

Adamic, Lada A., and Natalie Glance. 2005. "The Political Blogosphere and the 2004 U.S. Election: Divided They Blog." In Proceedings of the 3rd International Workshop on Link Discovery, 36–43. New York: ACM.

Liu, Bing. 2011. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. 2nd ed. Berlin: Springer.

Sunstein, Cass R. 2002. Republic.com. Princeton, NJ: Princeton University Press.