Introduction to Web Mining for Social Scientists
                                 Lecture 8: Scrapers, Spiders, Crawlers

                                           Prof. Dr. Ulrich Matter
                                          (University of St.Gallen)

                                                  13/04/2021

1     Web Mining Programs
 In very simple terms, web data mining is about writing programs that automatically download web pages.
 Most of the simpler programs and scripts we have implemented so far can be called ‘web scrapers’ or simply
‘scrapers’, referring to the task of automatically extracting (scraping) specific data out of a webpage. Scrapers
 are usually designed for a data extraction task related to a specific website or a clearly defined set of websites
(e.g., scrape the headlines of all Swiss newspapers’ homepages). More sophisticated programs in web data
 mining deal with situations where we want to automatically extract data from the Web without having a
 clearly defined set of pages to be visited. These programs thus incorporate an algorithm that traverses the
Web, jumping from one website to another, extracting links to other websites and storing information from
 the visited pages in a local repository. Such programs are generally referred to as ‘crawlers’, ‘spiders’, or
‘robots’ (all terms refer essentially to the same concept). Crawlers contain the basic blueprint of a web scraper
(HTTP interaction, extraction of data, storage of data) but work in a more autonomous and dynamic way in
 the sense that they constantly update the list of URLs to be visited. This dynamic aspect needs some careful
 implementation considerations.

2     Applications for Web Crawlers
In computer science, the development of crawlers reaches a level of sophistication that goes well beyond the
needs in usual social science research. In fact, a core industry of the modern Internet is built on the concept
of crawlers: search engines. In order to work properly, a search engine such as Google’s has to index (and
therefore physically store) the entire (open) Web. Google’s crawler is constantly traversing, storing, and
indexing all webpages it can find. When we search for a term on www.google.com, the search is actually executed
on Google’s copy of the Web. The URLs in the returned search results then point to the respective webpages
on the real Web. As webpages across the Internet change frequently, Google’s crawler needs to constantly
traverse the Web at high speed in order to keep the index up to date. Otherwise the returned search results
might contain URLs that point to webpages that no longer exist. Such a high-performance crawler is
not only a sophisticated piece of software; it also relies on a massive hardware infrastructure in order to yield
the necessary performance.
Crawlers for search engines might be the most impressive kind from a technical point of view, but many smaller,
less sophisticated crawlers are applied in other contexts. Such applications include:
    • Business intelligence and marketing: collecting information about competitors and competitors’ products
      and prices.
    • Monitoring websites in order to be automatically notified about changes.
    • Harvesting email addresses for spamming.
    • Collecting personal information for phishing attacks.
    • Dynamically generating websites with content crawled from other websites.

For academic research in the social sciences, crawlers are interesting tools for a number of reasons. They
can serve as custom-made search engines, traversing the Web to collect specific content that is otherwise
hard to find. They are a natural extension of a simple scraper focused on a specific website. And they are
the primary tool of the trade if the research question at hand is specifically about the link structure of certain
websites, that is, if we want to investigate which pages are linked to which other pages in order to build
a link network. An illustrative and well-known example of the latter is the study by Adamic and Glance
(2005), who investigate how political blogs linked to each other during the 2004 U.S. election. The authors
reveal a clear political divide in the U.S. blogosphere: bloggers sympathizing with one of the two major
parties (Republicans/Democrats) predominantly link to blogs/blog entries associated with the same party
and rarely link to blogs that favor the other party’s position (see Figure 1). Their study got a lot of attention,
partly because it was interpreted as providing evidence for Cass Sunstein’s famous thesis that the Internet
generates ‘filter bubbles’ and ‘echo chambers’ and thereby fosters political polarization (Sunstein 2002).

Figure 1: Graph depicting the community structure of political blogs. Nodes are blogs, edges are links
between blogs. The colors indicate political orientation: red for conservative and blue for liberal. Orange
links go from liberal to conservative, and purple ones from conservative to liberal. The size of each node
reflects the number of other blogs that link to it (in-links). Source: Adamic and Glance (2005).

3       Crawler Theory and Basic Concepts

A Web crawler is fundamentally a graph traversal algorithm (or ‘graph search algorithm’), a process of
visiting each node in a graph (network) where nodes represent webpages and edges represent hyperlinks. The
algorithm then starts with one or several nodes/pages (so-called ‘seeds’), extracting each of the URLs to
other pages and adding those URLs to the list of URLs to be visited (the ‘frontier’), and then moving on to
the next page. Figure 2 illustrates how such an algorithm basically works.
Figure 2: Illustration of a graph traversal algorithm (breadth-first). Source: https://algorithmsandme.in.
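
To make the traversal logic concrete before dealing with actual webpages, the following minimal sketch runs a
breadth-first traversal on a small toy graph in R (the graph and all object names are illustrative assumptions,
not part of the lecture code); the list entries play the role of the links found on each ‘page’.

# toy graph: each element contains the 'links' found on that 'page'
graph <- list(A = c("B", "C"),
              B = c("A", "D"),
              C = c("D"),
              D = character(0))

frontier <- "A"          # seed
visited <- character(0)  # traversal history

while (length(frontier) > 0) {
  current <- frontier[1]          # de-queue from the head (FIFO)
  frontier <- frontier[-1]
  if (current %in% visited) next  # skip nodes already visited
  visited <- c(visited, current)
  # 'extract links' and add unseen ones to the tail of the frontier
  frontier <- c(frontier, setdiff(graph[[current]], visited))
}

visited  # order of visits: "A" "B" "C" "D"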

A key aspect of the crawling algorithm is the order in which new URLs are picked (‘de-queued’) from the
frontier to be visited next. This, in fact, determines the type of graph traversal algorithm implemented in the
crawler program. There are two important paradigms of how the next URL is de-queued from the frontier
that matter in practical applications of web crawlers for relatively simple web mining tasks: first-in-first-out
and different forms of priority queue. The table below compares the two approaches regarding implementation
and consequences (following Liu (2011, 311 ff)).

  Property                 First-in-First-Out                                Priority Queue
 Type of Crawler           Breadth-First Crawler                             Preferential Crawler
 Implementation            Add new URLs to tail of frontier, pick next       Keep the frontier ordered
                           URL to crawl from head of frontier.               according to the estimated value
                                                                             of the page (according to some
                                                                             predefined criteria, e.g. words
                                                                             occurring in the URL etc.),
                                                                             always pick URL from head of
                                                                             the frontier.
 Consequence               Crawling is highly correlated with the            Result is highly dependent on
                           in-degree of a webpage (the number of links       how the ordering of the frontier is
                           pointing to it). The result is strongly           implemented.
                           affected by the choice of seeds (initial set of
                           URLs in frontier).
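
The practical difference between the two paradigms boils down to which element of the frontier is de-queued
next. The following minimal sketch contrasts the two strategies (the URLs and the keyword-based scoring rule
are made-up examples, not part of the lecture code):

# example frontier (made-up URLs)
frontier <- c("https://example.org/a",
              "https://example.org/b?topic=election",
              "https://example.org/c")

# breadth-first (FIFO): pick from the head, append new URLs to the tail
next_url <- frontier[1]
frontier <- frontier[-1]

# preferential crawler: keep the frontier ordered by an estimated page value,
# here simply whether the URL contains a keyword (an arbitrary criterion)
score <- grepl("election", frontier)
frontier <- frontier[order(score, decreasing = TRUE)]
next_url <- frontier[1]
frontier <- frontier[-1]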

Without further specifications, both types of crawlers could (in theory) continue traversing the Web until
they have visited all webpages on the Internet. This will hardly be needed for any standard application of
crawlers in the social sciences. Thus, in most practical applications of a simple crawler following either one of
the two principles described above, a stopping criterion will be defined, bringing the crawler to a halt after
reaching a certain goal. This predefined goal can be of a quantitative or qualitative nature. For example, a
data collection task to investigate the link structure surrounding a specific website (e.g., an online newspaper)
might be limited to a certain degree of the link network. That is, we define that the crawler visits only the
pages within the domain of the website and follows the links pointing directly from within the domain to
other pages outside the domain (but not the links pointing from those outside pages to even further
pages, etc.). Another criterion to stop the crawler could simply be the number of pages visited. Usually such
crawling rules can be implemented by restricting the type of URLs (depending on the domain) or the number of
URLs that can be added to the frontier, and/or by keeping track of all the pages that were already visited.
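
A minimal sketch of such crawling rules in R (the domain as well as the links and visited objects are
assumptions standing in for the corresponding parts of a crawler, not lecture code):

# only en-queue URLs that belong to the target domain (made-up example domain)
target_domain <- "www.example-newspaper.com"
new_links <- links[grepl(target_domain, links, fixed = TRUE)]
frontier <- c(frontier, new_links[!new_links %in% visited])

# halt criterion based on the number of pages visited so far
max_pages <- 100
keep_crawling <- length(frontier) > 0 & length(visited) < max_pages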
In the simplest case, traversing the Web by extracting links from newly visited pages and keeping track of the
pages already visited is all that a crawler does. In that case, the aim of the crawler is to collect data on the
link structure itself. Any extensions regarding the collection of the content can then be added by means of
scraping tasks to be executed for each of the pages visited. In the simplest case, the ‘scraping’ task can consist
of just saving the entire webpage to a local repository. Other tasks could involve extracting only specific parts
(tables, titles, etc.) from the visited pages and saving these in a predefined format. In short, any of the scraping
exercises in the previous lectures could be implemented in one way or another as part of a crawler. Thus,
the scraping part of a Web crawler can be seen as a separate component, a module that can be separately
developed and optimized (independently of how the graph traversal part of the crawler is implemented).
Figure 3 illustrates a basic crawler that simply stores the visited webpage in a local repository.

                    Figure 3: Flow chart of a basic Web crawler. Source: Liu (2011, 313).
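
The storage step shown in the flow chart can be implemented as such a separate module. The following
sketch is one illustrative way to do this (the function name, directory, and file-naming scheme are assumptions,
not taken from the lecture): it simply writes each visited page as an HTML file to a local repository.

# scraping/storage module: save the fetched page in a local repository
store_page <- function(html_doc, url, repo_dir = "local_repository") {
  # create the repository directory if it does not exist yet
  if (!dir.exists(repo_dir)) dir.create(repo_dir)
  # derive a file name from the URL
  file_name <- paste0(gsub("[^[:alnum:]]", "_", url), ".html")
  xml2::write_html(html_doc, file.path(repo_dir, file_name))
}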

4     Practical Implementation
Following the flow chart and the discussion of the web scraping aspects of a crawler above, we now have a look
at the practical implementation of the components making up a well-functioning breadth-first crawler. For
each component, critical aspects of the implementation are discussed and implemented in R. For the sake of
the exercise, we build a crawler that traverses the link structure surrounding a specific Wikipedia page until
the frontier is empty or 100 pages have been visited.

4.1    Initiating frontier, crawler history, and halt criterion
The starting point of any crawler is to define the initial set of URLs to be visited (the seed, i.e., the initial
frontier). Note that if the frontier consists of more than one URL, the order in which the URLs are entered
into the frontier matters for the crawling result.
# PREAMBLE -----

# load packages
library(httr)
library(rvest)
library(xml2)
library(Rcrawler)

# initiate frontier vector with initial URLs as seeds
# (placeholder seed URL; replace with the Wikipedia page of interest)
frontier <- "https://en.wikipedia.org/wiki/Web_mining"

# initiate the crawler history (keeps track of pages already visited)
visited <- character(0)

# halt criterion: stop after at most 100 visited pages
max_pages <- 100
4.3    Extract links, canonicalize links, add to frontier
This component consists of what we would implement in the second component of a simple scraper. However,
by the nature of the crawling algorithm, the focus is on how to extract URLs in order to add them to the frontier.
In this context, the ‘canonicalization’ of URLs is important for the robustness/stability of the crawler. The aim of
canonicalizing URLs is to ensure that the URLs added to the frontier work and are not duplicated. Key
aspects of this are to ensure that URLs are absolute and do not contain unnecessary parts and formatting
(removing standard ports and default pages, coercing characters to lower case, etc.). For this task, we make use
of the LinkNormalization() function provided in the Rcrawler package.
# LINK EXTRACTION -----

# extract all URLs (html_doc and current_url stem from the page-fetching step)
links_nodes <- html_nodes(html_doc, xpath = "//a")
links <- html_attr(links_nodes, "href")
# canonicalize the URLs and add them to the frontier
links <- LinkNormalization(links, current_url)
frontier <- c(frontier, links)
4.6    The crawler loop
All that is left to be done in order to have a fully functioning web crawler is to put the components together
in a while-loop. With this loop we tell R to keep executing all the code within the loop until the stopping
criteria are reached. Note that crawlers based on while-loops without wisely defined control statements
that keep track of when to stop could go on forever. Once a crawler (or any other program based on a
while-loop) is implemented, it is imperative to test the program by setting the control variables such that we
would expect the program to stop after a few iterations if everything is implemented correctly. If we were to
implement a crawler that we expect to visit thousands of pages before reaching its goal, it might be hard to
get a sense of whether the crawler is really iterating towards its goal or whether it is simply still running
because we haven’t correctly defined the halting criterion.
#######################################################
# Introduction to Web Mining 2017
# 8: Crawlers, Spiders, Scrapers
#
# Crawler implementation:
# A simple breadth-first web crawler collecting the
# link structure up to n visited pages.
#
# U.Matter, November 2017
#######################################################

# PREAMBLE -----

# load packages
library(httr)
library(rvest)
library(Rcrawler)

# initiate frontier vector with initial URLs as seeds
# (placeholder seed URL; replace with the Wikipedia page of interest)
frontier <- "https://en.wikipedia.org/wiki/Web_mining"
# initiate crawler history, link-structure container, and halt criterion
visited <- character(0)
link_list <- list()
max_pages <- 100

# CRAWLER LOOP: iterate until the frontier is empty or max_pages were visited -----
while (length(frontier) > 0 & length(visited) < max_pages) {
  # de-queue the next URL from the head of the frontier (breadth-first)
  current_url <- frontier[1]
  frontier <- frontier[-1]
  if (current_url %in% visited) next  # skip pages already visited
  # fetch and parse the page; skip pages that cannot be fetched/parsed
  resp <- tryCatch(GET(current_url), error = function(e) NULL)
  if (is.null(resp) || http_error(resp)) next
  html_doc <- tryCatch(read_html(content(resp, as = "text", encoding = "UTF-8")),
                       error = function(e) NULL)
  if (is.null(html_doc)) next
  visited <- c(visited, current_url)
  # extract all links, canonicalize them, store them, add new ones to the frontier
  links <- html_attr(html_nodes(html_doc, xpath = "//a"), "href")
  links <- LinkNormalization(links[!is.na(links)], current_url)
  link_list[[current_url]] <- links
  frontier <- c(frontier, links[!links %in% visited])
}
