Obtaining a Video Dataset from YouTube via DASH
                                Ida Marie Frøseth, Stefan Leicht, Richard Reimer, Viet Thi Tran

   Abstract: Multi-modal video analysis considers both the video itself
and the surrounding metadata at the same time to extract information
from the video. In online video databases, metadata may consist of
ratings, comments or information regarding the uploader. As such, in
order to test video analysis algorithms, a large corpus of videos and
metadata is necessary. The corpus should be a representative sample of
all available videos (within the chosen parameters), rather than a
limited set of uploaders or categories, as the latter would be biased.
Consequently, this paper proposes a tool for the YouTube video platform
that uses two distinct approaches to obtain such a representative
sample of all videos, extract their metadata and export the information
in JSON, CSV or XML format. Furthermore, the tool supports downloading
the videos themselves.
   Index Terms: DASH, HTTP, YouTube, Video, Download tool, Dataset

                         I. INTRODUCTION

   The rapid technological progress over the past decade, in addition
to the increasing popularity of Web 2.0 applications, has led to a vast
increase of user-generated content. This holds especially true for
YouTube, an online video service created in 2005 and purchased by
Google in November 2006.
   A YouTube video is surrounded by a wide range of metadata. This
metadata can be grouped in three categories: 1) information regarding
the author/uploader; 2) general information on the video itself; and
3) information concerning user communication. An example of the latter
is the comment section, where users can openly state their opinions and
feelings on any given video.
   By analysing these data, general statistics, behavioural patterns
and even a person's emotions, and possibly the emotions of an entire
city, can be detected. In 2014, Guthier et al. conducted a multimodal
analysis of the Twitter platform. They developed a system to detect
emotions and visualize them on a map, performing a text analysis of
messages and using the geolocation tag in the metadata to determine the
location of the message [1].
   In order to perform an accurate multimodal analysis of a
user-generated database like YouTube, a large corpus of data that
reflects the entire database of YouTube must be retrieved [2][3]. Since
YouTube's growth rate drastically increases every year [4], the dataset
has to be relatively fresh to conduct an up-to-date analysis.
   In the following chapters the functionality and development of a
tool to retrieve such a dataset from YouTube (including videos,
metadata and user interaction) will be described in detail. The
information in its entirety is stored in a file format that can be used
later on to perform multimodal analysis.
   Chapter II discusses the related work in the field of YouTube data
gathering research. In Chapter III, the Dynamic Adaptive Streaming over
HTTP (DASH) protocol is described in greater detail, with additional
information on how YouTube utilizes DASH. The proposed tool supports
two approaches for video crawling: one using the YouTube API with a
default key search as input, and the other using a traditional web
crawling technique; see Chapter IV for more details. The design of the
tool is described in Chapter V, while the evaluation of the tool is
included in Chapter VI. Chapter VII concludes the paper with a
conclusion and an outlook on future work.

                         II. RELATED WORK

   There are a number of studies that use different crawling methods to
gather and analyse YouTube video metadata. Cha et al. [5] analysed the
popularity distribution, popularity evolution and content duplication
of user-generated video content on YouTube in 2007. They crawled 1.7
million videos from the Entertainment category and 250,000 from the
Science category. Their results showed that the recommendation engine
of YouTube favors a small number of popular items, pointing the user
away from unpopular ones. This observation is congruent with web search
engines. J. Cho and S. Roy [6] showed in a seven-month experimental
study that popular websites are preferred by search engines. As a
consequence, popular websites become more popular while unpopular
websites become less popular.
   Cheng et al. [7] also collected metadata for three million videos in
2007 to examine the popularity of YouTube videos. Their approach for
data gathering was a breadth-first search starting with a primary video
and then traversing through all related videos down to the fourth
depth. They revealed that related videos have a clustering coefficient
indicating that a grouping of videos exists. This means that the video
search results are biased towards the initial video: starting with a
music video, most of the related links will also be music videos. In
addition, Cheng et al. [7] revealed characteristics with lower impact
on the search result, which show that to some degree any two videos can
be linked through the related videos of YouTube.
   To gather a potentially unbiased dataset for measuring the
popularity and view counts of videos, Szabo and Huberman [8] examined
the newly added YouTube videos daily for a 30-day period. After a
10-day examination period [...] with a 90% accuracy.

  Ida Marie Frøseth and Viet Thi Tran are with University of Oslo, Norway
  S. Leicht and R. Reimer are with University of Mannheim, Germany
   Zhou et al. [9] used random prefix sampling to gather information
via the YouTube Search API, with evidence to support that their method
provides a random sample of all videos. Although they focus their
research effort on estimating the total number of YouTube videos in
2011, their approach looks most promising for our purpose of gathering
a representative sample of all videos.

          III. DYNAMIC ADAPTIVE STREAMING OVER HTTP

   Media streaming is by far the largest application using the
Internet, and it is increasing. Cisco predicts that by 2020 Internet
video will contribute around 90 percent of the network load [10]. The
traditional way of streaming uses the stateful Real-Time Protocol, but
there has been a dramatic increase in streaming over HTTP over the last
couple of years [11]. HTTP is a stateless protocol, and traditionally
when a user requests a video the entire video stream is downloaded,
regardless of whether the user switches view during playback. This
leads to the obvious downside of potential bandwidth waste, and even
network congestion [11]. This, among other reasons, has led to the
development of Dynamic Adaptive Streaming over HTTP. Another big
advantage of streaming over HTTP is that developers do not have to
worry about firewalls and NATs. The following sections give an overview
of the protocol and of the Media Presentation Description, and finally
describe how YouTube uses DASH.

A. DASH protocol overview

   Figure 1 shows a simplified view of how DASH works. When a video is
uploaded to the server, the server encodes and stores the video in
various qualities. The video is also split into segments so that the
client can adapt the video quality to fit the network bandwidth. When
the user starts to view the video, the client issues a GET request for
each segment of the video. If the user switches view during playback,
the stream stops with the last GET request. To make this adaptive
behaviour possible, DASH uses a file called the Media Presentation
Description (MPD), or DASH manifest, to tell the client which segments
are available at which quality. With DASH, more logic and control is
moved to the client, and the client does not have to negotiate with the
server to get a suitable stream [11]. The server side can also use
proxies to cache the streams, and the streams of different qualities
can even be distributed between multiple servers.

Fig. 1. DASH overall description [12]

B. Media Presentation Description – the heart of DASH

   What makes DASH adaptive is the MPD, or DASH manifest. The MPD tells
the client how the segmentation is done and which encodings and
resolutions are available for a particular video entry. The MPD file
format is XML, and its structure is shown in figure 2.

Fig. 2. DASH MPD file structure [13]

   As figure 2 shows, the MPD contains one or more periods, each
denoted with a start time. Inside a period there are one or more
adaptation sets. The adaptation sets separate the different kinds of
content; for example, there can be one adaptation set for the available
videos, one for audio and one for text. There are also multiple
adaptation set containers, like WebM and ISO BMFF. Within one
adaptation set there are one or more representations. Multiple
representations within one adaptation set are alternatives to each
other. Each representation has some meta-information about the
available stream, like the codec, resolution, required bandwidth and,
most importantly, the base URL to retrieve the stream. As soon as the
client has chosen the best suited stream segment, it downloads the
segment by issuing a GET request to the provided base URL.
   Each representation must contain at least one segment element and an
initialization segment, which represent the segment information. The
segment information can either be inherited from the higher-level
segment information of the adaptation set or period, or it can be
defined in the representation itself. The DASH documentation lists
three different segment information element types, SegmentBase,
SegmentTemplate and SegmentList, depending on the use case [14]. The
next section outlines in detail how YouTube uses DASH and how YouTube
uses the SegmentBase type.

C. DASH in YouTube

   When YouTube was launched, it used progressive download over HTTP to
deliver its video content. With progressive download the entire video
is downloaded as a fully runnable file. The drawback with this approach
is that the user could not start to view the video before the entire
stream was downloaded. YouTube is one of many service providers that
have adopted DASH to ensure a better quality of their service. In
addition to DASH, YouTube also supports regular streaming. The
difference between regular streaming
and DASH is that regular streaming, instead of slicing the stream into
multiple segments, requests the entire video stream in one slice¹.
YouTube started using DASH after Google I/O 2013². The largest
advantage of adaptive streaming is, as mentioned, that the client can
switch to a suitable video quality while playing if the bandwidth gets
better or worse, or even based on the CPU of the user's device. More
control and logic are moved to the client side with DASH, and the next
few sections describe how YouTube makes this work.
   1) The YouTube Itag: Each video on YouTube may have several related
download streams, which means that the user or client application can
switch to a suitable quality of the content depending on the quality of
the network. YouTube provides its videos using Adobe Dynamic Streaming
for Flash, which supports dynamic streaming over HTTP, but it does not
purely adopt the international standard MPEG-DASH. Instead, YouTube
uses a DASH manifest that is embedded in the video information in the
HTML content, and encodes the URL with a so-called itag which
identifies different types of streams and qualities. We did not manage
to find an official document describing these itags, but on Wikipedia
some users have compiled a comprehensive table of itags. Figure 3 shows
the itags for DASH videos³; in addition there are similar tables for
non-DASH streams and live streaming videos.
   2) YouTube's DASH manifest: Since YouTube embeds its MPD information
in the HTML content, the player does not have to download the manifest;
it only has to be aware of the itag for particular streams. If the
client does not know these itags, Google has an API where it is
possible to download the manifest for a particular video stream based
on the URL for that stream. To get a profound understanding of how DASH
is utilized in YouTube we inspected several MPD files, and appendix B
shows an overview of our results. To make sure that the MPD files are
not biased in some way, a filtered random search was performed to
retrieve videos with a varied set of attributes; these attributes
consisted of age, duration and quality. The dataset was relatively
small, but it shows that all the inspected videos contain only one
period. It also shows that YouTube supports two different
representation containers for adaptive streaming, namely WebM and ISO
BMFF⁴. Hence, in a YouTube MPD you will usually find two adaptation
sets for audio and two for video; for audio these are audio/webm and
audio/mp4. By inspecting different DASH manifests we also found that
YouTube defines its segment information within the representation by
using the BaseURL element and the SegmentBase template, for example:

   <Initialization range="0-234"/>

   3) Automatically switching view during playback: YouTube supports
both switching quality during playback and seeking in the video
content, i.e. going back and forth in time. This is done by using a
byte-versus-time mapping⁵. As mentioned in the DASH section, the video
is split into segments. When playing a YouTube video, each segment of
video is displayed between two yellow bars on the progress bar. These
yellow bars indicate where the information segments, or keyframes as
YouTube calls them, are located. If a user changes the quality during
playback, the new stream has to start at one of these information
segments. This means that if the user changes quality in between two
information segments, the playback jumps back or forth in time to the
closest information segment. When the user hits play, the video segment
is cached by the browser, and the part that has been cached is
displayed as a grey shadow in the progress bar. All these features make
the user experience much smoother, and YouTube also experiences a huge
saving in network traffic⁶.

                         IV. IMPLEMENTATION

   This chapter first gives an overall description of the tool design
and of the features each of the tool modules supports, followed by a
section that presents in detail how the video download function is
realized (section IV-B), before it describes the strategies for getting
a representative dataset of YouTube in more detail (section IV-C).

  ¹ http://www.onlinevideo.net/2011/05/streaming-vs-progressive-download-vs-adaptive-streaming/, retrieved 31.10.2015
  ² https://en.wikipedia.org/wiki/YouTube#Video_technology, retrieved 31.10.2015
  ³ https://en.wikipedia.org/wiki/YouTube#Quality_and_formats, retrieved 28.10.2015
  ⁴ http://www.streamingmediaglobal.com/Articles/Editorial/Featured-Articles/The-State-of-MPEG-DASH-Deployment-96144.aspx, retrieved 20.10.2015
  ⁵ From min 8:50 to min 9:45, Google IO 2013, https://www.youtube.com/watch?v=UklDSMG9ffU, retrieved 31.10.2015
  ⁶ Google IO 2013, https://www.youtube.com/watch?v=UklDSMG9ffU, retrieved 31.10.2015
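   To make the MPD structure of Section III concrete (period,
adaptation sets, representations, BaseURL and SegmentBase with an
Initialization byte range), the following minimal Python sketch parses
a small manifest and picks a representation for a given bandwidth. The
manifest, its URLs, byte ranges and bandwidth values are invented for
illustration and are not taken from a real YouTube manifest; the helper
names are hypothetical as well. Only the element names and the
namespace follow the MPEG-DASH schema.

```python
import xml.etree.ElementTree as ET

# XML namespace of MPEG-DASH manifests.
NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

# Hypothetical manifest: one period, one audio and one video adaptation set.
SAMPLE_MPD = """
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period start="PT0S">
    <AdaptationSet mimeType="audio/mp4">
      <Representation id="140" bandwidth="128000">
        <BaseURL>https://example.invalid/audio_140</BaseURL>
        <SegmentBase indexRange="592-1019">
          <Initialization range="0-591"/>
        </SegmentBase>
      </Representation>
    </AdaptationSet>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="134" bandwidth="600000" width="640" height="360">
        <BaseURL>https://example.invalid/video_134</BaseURL>
        <SegmentBase indexRange="235-1000">
          <Initialization range="0-234"/>
        </SegmentBase>
      </Representation>
      <Representation id="136" bandwidth="2200000" width="1280" height="720">
        <BaseURL>https://example.invalid/video_136</BaseURL>
        <SegmentBase indexRange="240-1100">
          <Initialization range="0-239"/>
        </SegmentBase>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
"""


def list_representations(mpd_xml):
    """Flatten Period -> AdaptationSet -> Representation into a list of dicts."""
    root = ET.fromstring(mpd_xml.strip())
    reps = []
    for period in root.findall("mpd:Period", NS):
        for aset in period.findall("mpd:AdaptationSet", NS):
            for rep in aset.findall("mpd:Representation", NS):
                init = rep.find("mpd:SegmentBase/mpd:Initialization", NS)
                reps.append({
                    "id": rep.get("id"),
                    "mime_type": aset.get("mimeType"),
                    "bandwidth": int(rep.get("bandwidth")),
                    "base_url": rep.findtext("mpd:BaseURL", namespaces=NS).strip(),
                    "init_range": init.get("range") if init is not None else None,
                })
    return reps


def pick_representation(reps, mime_type, available_bps):
    """Pick the best representation that fits the bandwidth, else the smallest."""
    candidates = [r for r in reps if r["mime_type"] == mime_type]
    fitting = [r for r in candidates if r["bandwidth"] <= available_bps]
    if fitting:
        return max(fitting, key=lambda r: r["bandwidth"])
    return min(candidates, key=lambda r: r["bandwidth"])


def range_header(byte_range):
    """HTTP header for fetching one byte range from the BaseURL."""
    return {"Range": "bytes=" + byte_range}


if __name__ == "__main__":
    reps = list_representations(SAMPLE_MPD)
    best = pick_representation(reps, "video/mp4", available_bps=1_000_000)
    print(best["id"], best["base_url"], range_header(best["init_range"]))
```

   A client following this pattern would first request the
initialization range and then the media byte ranges from the chosen
BaseURL; switching quality amounts to calling pick_representation
again with a new bandwidth estimate.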
A. Design

   The YouTube downloader tool is designed with four modules: Search,
Information Extractor, graphical user interface (GUI) and YTManagement;
see Figure 4. The Search module is the most important module, since it
is responsible for getting a representative dataset of unique video IDs
from YouTube. Next, the Information Extractor is fed with these video
IDs and downloads both the metadata and the video in the desired
format. The GUI displays these features in an intuitive way and is
additionally able to compute and display some statistics of the result
after a completed search request. Lastly, all the interaction between
the modules is handled by the YTManager.

Fig. 4. YouTube Downloader design

   1) Search module - getting a representative dataset: As figure 4
shows, the Search module supports two approaches for getting a dataset
of video IDs: the API search and the jsoup crawler. The first uses the
YouTube API, while the latter uses a traditional web crawling technique
and the jsoup library to parse the HTML document. The API search
approach supports filtering, while the jsoup crawler does not. See
section IV-C for details on each crawler. The YouTube API supports
filtering on a vast number of parameters⁷. This tool only includes the
most used filter items, but it can easily be expanded to all available
filters. When using the filters it is also important to notice that
most of these attributes are user defined, and there is no guarantee
that a video is a music video even though it has this category tag
assigned to it. It is also important to notice the default value of
each parameter when a user uploads a video. This influences the result,
because apparently not all users alter these values, as some of the
attributes are only accessible through the advanced settings when
uploading a video. Figure 5 shows an overview of the supported features
and which crawler implements which feature. The three columns keyword,
filter and random show the attributes that can be changed, and each row
identifies a combination of attributes. For example, row number two
shows that the API crawler supports a random search with no filters or
keyword. The features that are not supported are denoted with a "NO"
and the row is shaded gray. These features are: 1) a non-random search
with a filter applied, 2) a random search with a keyword applied and
3) a random search with both keyword and filter applied.

  ⁷ https://developers.google.com/youtube/v3/, retrieved 26.10.2015

Fig. 5. Tool supported features

   Filters supported by the API search:
   •   Keyword: adding this filter results in videos that have this
       keyword in the title, description or tags. Be aware that by
       adding this filter the result will no longer be random, since
       the random crawler alters this parameter to make the search
       random.
   •   Location and radius: defines a circular geographic area and
       restricts the search to videos that specify in their metadata a
       geographic location that falls within that area. The radius must
       be followed by one of the measurement units m, km, ft or mi.
       When no unit is given, the default unit km is applied.
   •   Period: restricts the search to videos in a specified period.
       The default period covers everything up to the current date, and
       it is also possible to configure a specific day and month.
   •   Category: gives a result within the specified category. The
       category is a value the uploader defines in the advanced
       settings, and the default value is People & Blogs.
   •   Language: returns results relevant for the specified language.
   •   Region: returns the results for the specified country.
   •   Duration: returns the videos that are within the specified
       duration.
   •   Definition: returns only videos that support the specified
       definition; this is either SD, HD or both.
   •   Type: a video can be tagged with either Episode, Show or Movie,
       and this filter returns only videos that fit this parameter.
       Configuring the type parameter is optional, and it is located in
       the advanced settings tab when uploading a video to YouTube.
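   The filters listed above map onto query parameters of the search
endpoint of the YouTube Data API v3. The following Python sketch only
builds the request URL and does not contact the API. The function and
argument names are hypothetical and are not taken from the tool; the
query parameter names (q, location, locationRadius, publishedAfter,
publishedBefore, videoCategoryId, relevanceLanguage, regionCode,
videoDuration, videoDefinition, videoType) are those of the API, where
SD and HD correspond to the values "standard" and "high". Appending km
to a radius without unit mirrors the default described above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def normalize_radius(radius):
    """Append the default unit km when the radius carries no unit."""
    radius = str(radius).strip()
    if radius.endswith(("m", "km", "ft", "mi")):
        return radius
    return radius + "km"


def build_search_url(api_key, keyword=None, location=None, radius=None,
                     published_after=None, published_before=None,
                     category_id=None, language=None, region=None,
                     duration=None, definition=None, video_type=None,
                     max_results=50):
    """Translate the tool-level filters into a search request URL."""
    # The video-specific filters require type=video.
    params = {"part": "snippet", "type": "video",
              "maxResults": max_results, "key": api_key}
    if keyword:
        params["q"] = keyword
    if location is not None:
        if radius is None:
            raise ValueError("a location filter needs a radius")
        lat, lng = location
        params["location"] = f"{lat},{lng}"
        params["locationRadius"] = normalize_radius(radius)
    # Optional filters are only sent when the user has set them.
    optional = {
        "publishedAfter": published_after,    # RFC 3339 timestamp
        "publishedBefore": published_before,  # RFC 3339 timestamp
        "videoCategoryId": category_id,       # numeric category id
        "relevanceLanguage": language,        # e.g. "en"
        "regionCode": region,                 # e.g. "NO"
        "videoDuration": duration,            # "short", "medium" or "long"
        "videoDefinition": definition,        # "standard" (SD) or "high" (HD)
        "videoType": video_type,              # "episode" or "movie"
    }
    params.update({k: v for k, v in optional.items() if v is not None})
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("API_KEY", location=(59.91, 10.75), radius="25",
                           definition="high", region="NO"))
```

   Leaving keyword unset keeps the request compatible with the random
crawler, which needs to control that parameter itself.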
   2) Information extractor: As figure 4 shows, there are two
information extractors; the Metadata extractor and the Video
downloader. The metadata extractor uses the YouTube API,
while the support for downloading a video was removed when
YouTube merged from APIv2 to APIv3, hence the Video                     Fig. 6. Tool Graphical user interface
Downloader has to parse the HTML to extract the download
link, and therefore the description of how to download a video
                                                                        be empty, and will populated when a search has been executed.
is awarded its own section, see section IV B. The tool enables
                                                                        The statistics view display the distribution of categories, year
the user to choose what type of information to include in the
                                                                        and likes. And the result view displays a list of the fetched
download, because adding more information would take more
                                                                        videos and the user can click on one of the video to look at the
time and more space. The user have basically the following
                                                                        fetched metadata. Be aware that the comments and url are not
                                                                        included in this view. As figure 4 depicts, there are also two
  1)    Only download video metadata                                    different search views, one for the API crawler approach and
  2)    Include comments in the video metadata                          one for the Jsoup crawler. This distinct separation between the
  3)    Include video download link in the metadata                     two ensures that the user is aware what crawler he is using.
  4)    Include the video in all the available formats and add          When the user has chosen the appropriate search filters and
        the video link to the metadata                                  settings and the search is started, the tool enables the user
                                                                        to stop the crawling. When the crawling is canceled, all the
Option two through four would give a metadata file that
                                                                        metadata up to the point before canceling is saved and the gui
has the comments and/or the download URL link included.
                                                                        statistics are going to be calculated and drawn.
YouTube API implements a RESTful API that uses JSON
as data representation format. Therefore the default download
format is in JSON, but the tool also support conversion to              B. YouTube Video Download
XML and CSV. YouTube keeps all the metadata for a video
in a video object and is fetched by using the YouTube Video                YouTube API v3 has no support for downloading the video,
List API, this will result in the information about the video.          instead they offer three YouTube player APIs to embed a
To fetch the comments, another get request has to be issued             YouTube video player, these are the IFrame API [15], Android
since the comments are not stored in a video object itself              Player API[16] and the iOS Helper Library[17]. Be aware that
but within a comment object. This will result in the desired            to download a video is actually against YouTubes policy, but
number of top level comments for the video. Which comments              since this was a part of the task of this project we implemented
that are marked as top level depends on what settings the users         it as a proof-of-concept.
has selected when uploading the video: either the most popular comments or
the most recent comments. Each comment also comes with some metadata, such as
the author and the publication date; see appendix C for details. The tool
limits the number of comments to five in order to limit the amount of data
and to ensure that the download finishes within a reasonable time. Each
comment can contain up to 10 000 characters (about 10 kB), while one video
entry without comments is around 1700 characters and upwards, so a single
comment could take up the space of five videos. Another issue is that only
about 30% of the videos have comments at all; where no comments are
available, they are either disabled or no one has commented on the video yet.

   https://www.youtube.com

   3) Graphical user interface: The GUI uses a tabbed pane to present the
user with three views: Search, Result and Statistics, see figure 6. The
Result view and the Statistics would initially

   1) Extracting and identifying a URL stream from the HTML content: Since
the video player at www.youtube.com is an HTML embedded video player, it is
possible to parse the HTML content of a YouTube video page and extract the
available streams. All adaptive streams follow the tag adaptive_fmts, and the
regular streams are located after the tag url_encoded_fmt_stream_map. By
searching for these patterns within the HTML content, it is possible to
extract all the video links and decode them using URL decoding. As mentioned
in section III, YouTube supports a large number of video formats and
containers, and offers both DASH videos and regular streams. All formats
available for a video are included in the HTML document of that particular
video. To identify the format and quality, YouTube uses an itag⁸, which is
added to the URL of each stream.

   ⁸ https://en.wikipedia.org/wiki/YouTube#Quality_and_formats, retrieved
28.10.2015
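The extraction step described in 1) can be sketched in a few lines of Python.
This is a minimal sketch, not the tool's implementation. It assumes that the
value following url_encoded_fmt_stream_map has already been cut out of the
page, that entries are separated by commas, and that each entry is a
URL-encoded query string with `itag` and `url` fields. The function name
`parse_stream_map` and the sample string are made up for illustration.

```python
from urllib.parse import parse_qs


def parse_stream_map(stream_map: str) -> list[dict]:
    """Split a comma-separated stream map into one dict per stream.

    Assumed layout: each entry is a URL-encoded query string such as
    'itag=22&url=https%3A%2F%2F...'.
    """
    streams = []
    for entry in stream_map.split(","):
        if not entry:
            continue
        # parse_qs URL-decodes the values and returns lists; keep the first.
        fields = {key: values[0] for key, values in parse_qs(entry).items()}
        streams.append(fields)
    return streams


# Made-up sample with two entries (not real YouTube data).
sample = (
    "itag=22&url=https%3A%2F%2Fexample.invalid%2Fvideoplayback%3Fid%3D1,"
    "itag=18&url=https%3A%2F%2Fexample.invalid%2Fvideoplayback%3Fid%3D2"
)
streams = parse_stream_map(sample)
for stream in streams:
    print(stream["itag"], stream["url"])
```

Each resulting dictionary carries the decoded stream URL together with the
itag that identifies its format and quality.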

   2) Handling encrypted YouTube signature: Each stream has a signature to
ensure integrity. This signature is denoted signature, sig, s or RTMPE⁹. If
the signature is identified with "s" and RTMPE, the signature is encrypted
using Adobe's own security mechanism. There is no official documentation on
how YouTube decrypts the signatures of its videos, but there are many
discussions about how to decrypt the signature of a YouTube download link,
and they found that YouTube also includes the function to decrypt the
signature¹⁰. The downloader tool does not support streams that have an
encrypted signature; it only supports URLs whose signature is denoted
signature or sig.

   3) Download the video file in parallel with the video info: Since a video
is large compared to its metadata, downloading the video takes much longer
than retrieving the metadata. To ensure that metadata retrieval is not slowed
down by the video download, the tool has a separate thread that handles the
download process. This thread has a monitor that keeps a queue of videos to
be downloaded. Whenever the crawler discovers a new video, it puts the video
into this queue. The download thread handles one video at a time until the
queue is empty. The downloader also creates one thread for each available
video quality, so that it can download all available streams concurrently. In
addition, whenever a video is added to the queue, the video downloader
responds with all the URLs of that video so that they can be saved in the
video information file. It is also important to note that a download link has
an expiry time. The download links therefore do not stay valid forever, and
each time the tool needs to download videos it must fetch the download links
again. Videos cannot be downloaded once their links have expired.

C. Crawling strategies

   Gathering and analysing the metadata of YouTube videos can be of great
interest not only from a social perspective (e.g. detecting user emotions by
inspecting the comments on videos uploaded in a given area) but also from a
technical perspective (e.g. how many videos are uploaded to YouTube every
day, and how much traffic do they cause?).

   Unfortunately, this information and further statistics are not publicly
available; YouTube only publishes a few general statistics about its number
of users, mobile usage percentages and advertisement¹¹. Obtaining more
detailed information is not an effortless task and has to be done either by
crawling the YouTube website or by using the official YouTube API v3. Before
we elaborate on both data gathering approaches, we describe how YouTube
uniquely identifies its videos.

   1) YouTube Video IDs: Each YouTube video link is determined by a unique
11-character identifier (YouTube video ID). The first 10 characters of the ID
can each be any of the 64 characters in S = {0-9, _, -, A-Z, a-z}. The 11th
and last character is one of the 16 characters in
T = {0, 4, 8, A, E, I, M, Q, U, Y, c, g, k, o, s, w}.
In total, the ID space has 64^10 × 16 possible IDs. Zhou et al. [9] showed in
an experimental setting with 2 million video IDs that these IDs are randomly
generated from the ID space and do not follow any sequence or pattern. For
each new video upload, YouTube selects an unused ID from this pool.

   2) YouTube API v3 Crawler: The YouTube API v3 offers an undocumented way
to alter the search result. When the API keyword search is used with a string
of the format "watch?v=x...z", including the quotation marks, where "x...z"
is a prefix of length 1-11 built from the sets S and T, the API returns
videos whose IDs start with this prefix. For example, the keyword search for
"watch?v=fXEz" results in 24 unrelated videos that were uploaded between two
weeks and six years ago and have 0 to 5000 views. All video IDs start with
"fXEz". We noticed that there is an exception for the "-" literal: at the
beginning or end of a search term, "-" serves a special function as a
whitespace character. When we use "watch?v=-XEz", the YouTube API returns
video IDs that start with "XEz". This increases the number of returned videos
sharply, from 24 to over 850. Evidently, the prefix length determines the
number of search results. Table I shows, for each prefix length, the number
of videos returned for 1000 search requests. When the prefix is too long, the
search engine might not return any results, because the probability that the
prefix occurs in the YouTube ID space is very small. This holds especially
for prefix lengths greater than 5. Conversely, a short prefix of length 1-3
returns more results than the API user can actually retrieve.

   Prefix length   Number of videos   Mean number per request
         2            16.245.872            16.245,87
         3               801.220               801,22
         4                20.997                21,00
         5                   625                 0,63
         6                    23                 0,02
                              TABLE I
   NUMBER OF RETURNED VIDEOS PER PREFIX LENGTH FOR 1000 API REQUESTS.

   YouTube limits the result list of each API search to a maximum of 500
videos¹². Consequently, prefix searches with prefixes shorter than 4 are
unsuitable, because not all results can be retrieved. Retrieving only the
first 500 videos biases the result list towards more popular videos, because
YouTube applies a relevance filter to every search by default. For the same
reason, the generated search prefix has to be adjusted so that it does not
include a "-" at the beginning or end, since this would effectively shorten
the prefix. A prefix length of 4 returns a mean of 21 videos per search
request, tested with 1.000 random requests. This is an optimal length for
traversing the random YouTube video ID space, because it does not run into
any result limit set by the API. The search does not treat the video ID
prefix as case sensitive; a query with "watch?v=fXEz" therefore returns the
same results as "watch?v=FXEZ".

   For our YouTube API v3 crawler, we take advantage of this search behaviour
to gather a representative sample of all available YouTube videos by randomly
generating strings of length 4 that satisfy the characteristics described
above. The performance and quality of this random API crawler are discussed
in the evaluation, section V.

   API crawler    1.000 Requests   10.000 Requests   100.000 Requests
   1 Thread          3 min 8 sec     30 min 12 sec         4 h 39 min
   5 Threads            32,7 sec       5 min 6 sec      49 min 31 sec
   10 Threads           14,7 sec      2 min 47 sec      30 min 24 sec
   25 Threads            8,6 sec       1 min 4 sec       10 min 6 sec
   50 Threads            5,2 sec            38 sec       6 min 17 sec
   100 Threads           5,0 sec            29 sec       4 min 11 sec
                              TABLE II
                         API CRAWLER SPEED

   Jsoup crawler  1.000 Requests   10.000 Requests   100.000 Requests
   1 Thread                6 min         1 h 6 min        25 h 31 min
                              TABLE III
                        JSOUP CRAWLER SPEED

   ⁹ https://en.wikipedia.org/wiki/Real_Time_Messaging_Protocol#Encryption,
retrieved 28.10.2015
   ¹⁰ http://stackoverflow.com/questions/23975878/getting-the-signature-of-a-youtube-video,
retrieved 30.10.2015
   ¹¹ https://www.youtube.com/yt/press/en/statistics.html, retrieved 26.10.2015
   ¹² https://code.google.com/p/gdata-issues/issues/detail?id=4282#c24,
retrieved 26.10.2015
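The random prefix generation described in 2) can be sketched as follows. This
is an illustration under stated assumptions rather than the tool's code: the
names `random_prefix` and `search_term` are hypothetical, redrawing a prefix
that starts or ends with "-" is only one possible way to "adjust" it, and the
actual API request is left out. The sketch also checks the size of the ID
space, which works out to 2^64.

```python
import random
import string

# Alphabet S from the text: digits, "_", "-", upper- and lower-case letters.
ALPHABET = string.digits + "_-" + string.ascii_uppercase + string.ascii_lowercase
# Alphabet T: the 16 characters allowed in the 11th position.
LAST_CHARS = "048AEIMQUYcgkosw"

# Size of the ID space: 10 free positions from S, one position from T.
ID_SPACE = len(ALPHABET) ** 10 * len(LAST_CHARS)


def random_prefix(length=4, rng=random):
    """Draw a random ID prefix that neither starts nor ends with '-'.

    Assumption: prefixes with a leading or trailing '-' are simply redrawn.
    """
    while True:
        prefix = "".join(rng.choice(ALPHABET) for _ in range(length))
        if not (prefix.startswith("-") or prefix.endswith("-")):
            return prefix


def search_term(prefix):
    """Build the keyword for the prefix search, including the quotation marks."""
    return f'"watch?v={prefix}"'


rng = random.Random(42)  # seeded only to make the sketch repeatable
terms = [search_term(random_prefix(rng=rng)) for _ in range(3)]
print(ID_SPACE == 2 ** 64, terms)
```

Each generated term would then be passed as the keyword of one search
request; with a prefix length of 4, the text reports a mean of 21 videos per
request.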
   In addition, the random search can be enriched by applying search filters
to the API request before the query is executed. This results in a large
performance benefit, because the requests are filtered up front and not after
all videos have been crawled. Our API v3 crawler has built-in location and
radius filters as well as category, language, year, region, definition and
type filters. Thus, the user can also search for a representative sample of
all videos available within the specified filters. The filters are described
in more detail in the previous design section of chapter IV.

   3) Jsoup Crawler: The Jsoup crawler is independent of the YouTube API. It
uses the fact that each YouTube video page contains multiple references to
related videos as suggestions for what the user could view next. In an HTML
document, these reference links are identified by the <a href="url"> tag and
can easily be extracted by parsing the HTML content with the jsoup library.
To ensure that the crawler does not follow links outside of YouTube, the
Jsoup crawler only looks for links containing the structure described in the
introduction of this section. The HTML content of a particular web page is
retrieved by issuing a GET request for its URL, which downloads the whole
HTML file; the tool uses this to its advantage by also parsing the HTML for
the video metadata. The metadata is saved in the desired file format, either
XML or CSV. In contrast to the API crawler, the Jsoup crawler supports
neither filtering nor the extraction of comments, because YouTube does not
embed the comments in the HTML code.

   The Jsoup crawler lets the user select the page to start crawling from.
This can be the URL of a YouTube video, the YouTube main page or a YouTube
video ID. The crawler also remembers which links it has crawled, so that it
does not crawl the same page twice in one run and the downloaded metadata
only consists of unique video entries. It is important to be aware that the
links the Jsoup crawler follows are placed on a page by YouTube. These links
serve as suggestions to the viewer of what to watch next. How YouTube decides
on the relation between a video and its suggested videos is not officially
documented. In the YouTube API, it is called "relevance filter". On the other
hand, the Jsoup crawler makes it possible to inspect these relations and
gives an image of what the user is met with when using YouTube.

                       V. EVALUATION

   As described in the previous section, we have two data collectors using
distinct gathering methods. The next step is to analyze the datasets
according to different metrics. We start with the crawler performance and
then move to the quality of the crawled dataset. Performance is a key
indicator of how well a crawler executes its task. The overall goal of this
tool was not to create a massive database up front and then allow analysis on
already collected, probably outdated data. With our approach, the user can
collect a large number of current videos on the spot. Therefore, the speed at
which the crawler returns the video links with the corresponding metadata for
each video is crucial. To compare the performance of both crawlers, we
analyze several datasets of between 1.000 and 100.000 crawls.

   Another key factor in evaluating a crawler is the quality of the results.
A crawler can be extremely fast, but if it only collects bad data, the speed
is worthless. The Jsoup crawler specializes in the YouTube website and
represents a good overview of the videos a normal user would get when
clicking through the YouTube website. The API crawler represents a typical
video distribution of the whole YouTube database.

A. Performance

   The performance and functionality of the two crawlers differ considerably.
While the API crawler can run in multiple threads, the Jsoup crawler is not
able to utilize multithreading. Table III shows that the missing thread
support results in a much slower crawling performance for the Jsoup crawler.
Besides the missing thread functionality, the Jsoup crawler spends most of
the crawling time establishing a connection to the next web page. As soon as
the connection is established and the HTML code of the page is downloaded,
getting all the metadata is relatively fast. Expressed in numbers, crawling
10.000 video page links takes 66 minutes, which means that on average 2,5
video links are crawled per second. In contrast, the API crawler with the
standard setting of 10 threads runs 10.000 videos in 64 sec, which results in
more than 156 videos per second.

   The API crawling speed is not affected by longer crawling times. Table II
shows that 100.000 videos take almost 10 times as long as 10.000 videos. This
does not hold for the Jsoup crawler, whose download speed decreases
considerably as the number of requests grows. The reason is that the longer
the crawler is searching, the more duplicates YouTube provides among its
related videos. For 100.000 crawling attempts, our tool had to sort out
14.521 duplicates. This results in a much longer crawling time caused by the
relatively long connection time.
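The effect of duplicates on the Jsoup crawler can be illustrated with a small
sketch. It is not the tool's code: the link graph below is a made-up stand-in
for the related-video links that a real crawler would obtain by fetching and
parsing each page, and `crawl` is a hypothetical helper. The sketch follows
links breadth-first, remembers the IDs it has already seen, and counts how
many discovered links had to be discarded as duplicates.

```python
from collections import deque


def crawl(start, related, limit):
    """Breadth-first crawl over related-video links with duplicate removal.

    start   -- video ID to begin with
    related -- mapping of video ID to related video IDs (stand-in for
               fetching a page and extracting its links)
    limit   -- maximum number of unique videos to collect
    Returns (unique IDs in crawl order, number of duplicate links skipped).
    """
    seen = {start}
    queue = deque([start])
    visited = []
    duplicates = 0
    while queue and len(visited) < limit:
        video = queue.popleft()
        visited.append(video)  # a real crawler would store the metadata here
        for link in related.get(video, []):
            if link in seen:
                duplicates += 1  # already crawled or already queued
            else:
                seen.add(link)
                queue.append(link)
    return visited, duplicates


# Made-up link graph in which the same videos are suggested repeatedly.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["a", "c"],
}
order, skipped = crawl("a", graph, limit=10)
print(order, skipped)
```

In this toy graph, four unique videos are found while six suggested links are
discarded; every page still has to be fetched once, and in a real crawl each
fetch pays the connection time.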
   Another benefit of the API crawler is that it can apply filters and
retrieve comments during the crawling process. Applying filters decreases the
search speed. How strong the decrease is depends on how many filters are
applied and how restrictive they are.

   The work for a search request with a specific keyword cannot be separated
and distributed between different threads. As a result, the API crawler is
limited to one thread in the keyword search. Hence, the results are not
directly comparable with the previous results from the random prefix search.

Fig. 7. Category distribution for 100.000 crawled videos

   All those crawling runs were done without the comment integration to allow
a better comparison with the Jsoup crawler. Including the comments increases
the crawling time significantly, depending on the number of retrieved
comments, even though not every video has comments. For crawling 10.000
videos, the processing time is almost doubled. The same effect can be seen
for crawling 100.000 videos. The YouTube platform restricts a comment to
10.000 characters, so one comment can be up to 5 times as large as all other
metadata together. Although the comment size is capped, there is no limit on
the number of comments for a video. A great many videos have several thousand
comments or more; in the most extreme case, the music video "PSY - Gangnam
Style" has roughly 5 million comments. As a result, we are forced to limit
the number of retrieved comments.
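The need for a comment limit follows from simple arithmetic, sketched below.
The figures are taken from the text (10.000 characters per comment, roughly
1700 characters of other metadata per video, five retrieved comments, about 5
million comments in the extreme case); the helper `worst_case_chars` is
hypothetical and only computes an upper bound, not the size the tool actually
stores.

```python
COMMENT_CAP_CHARS = 10_000  # maximum length of one comment
METADATA_CHARS = 1_700      # approximate size of one entry without comments
MAX_COMMENTS = 5            # number of comments the tool retrieves per video


def worst_case_chars(num_comments, max_comments=None):
    """Upper bound on the characters stored for one video."""
    if max_comments is None:
        kept = num_comments
    else:
        kept = min(num_comments, max_comments)
    return METADATA_CHARS + kept * COMMENT_CAP_CHARS


capped = worst_case_chars(5_000_000, MAX_COMMENTS)
uncapped = worst_case_chars(5_000_000)
print(capped, uncapped)
```

With the limit, one video entry stays below roughly 52 thousand characters in
the worst case; without it, the extreme case would be bounded only at about
50 billion characters.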
   The API crawler can also download the videos themselves. The Jsoup crawler
cannot, because this function is not implemented yet. Downloading the videos
is a very time- and bandwidth-consuming task. Hence, the download of the
videos also needs to be confirmed by the user beforehand. If it is confirmed,
the video links are put on a list and a download thread starts to download
one video after the other. The list ensures that the crawled videos are
downloaded one by one rather than all concurrently. The download and crawling
time in general depends heavily on the processing power and internet
connection of the user. This makes it rather hard to compare exact times and
to create a thorough performance analysis. Regarding performance, the Jsoup
crawler is far behind the API crawler in every aspect.

Fig. 8. API crawler category change with 1.000 and 100.000 videos

B. Quality

   To evaluate the quality of the tool, we gathered datasets of 1.000, 10.000
and 100.000 videos with both crawlers. The statistics generated are the
distribution of the categories, the distribution of the upload year and the
average number of views per video. Since a significant change in the dataset
is easiest to see when going from 1.000 to 100.000 videos, only these two
sizes are included in this part. This section first looks into the category
statistics, then at the distribution of the year, and finally compares the
average view count.

   1) Category statistics: Figure 7 depicts the distribution of the
categories in the dataset of 100.000 videos for each crawler. It shows that
the most dominant category for the API crawler is "People & Blogs". The
reason is probably that "People & Blogs" is the default setting when a user
uploads a video to YouTube, and the user has to go into the advanced settings
to assign another category to the video. Figure 8 shows that for the API
crawler it does not matter whether the dataset comprises 1.000 or 100.000
videos. The distribution is nearly the same, which indicates that the API
crawler samples randomly even for small datasets.

   We also performed a search to identify when "People & Blogs" became the
largest category, in order to see when YouTube changed its default option for
new videos. This was done by requesting only videos from a specific year and
comparing the distribution of the categories. The result, shown in appendix
D, is that it has been the largest category since 2010.

   For the Jsoup crawler, on the other hand, Entertainment is the most
significant category. When inspecting figure 7, it is important to be aware
that the initial page of the crawler was the YouTube main page. To test
whether the crawler is biased towards the start page, we gathered a dataset
where the initial page was a video in the sports category from 2008
(VideoID = 4az-U8wTj2k). Comparing figure 7 and figure 10 makes it obvious
that an initial crawl with the Jsoup crawler is biased towards the start
page. Figures 9 and 10 show that, as more pages are crawled, the initial
category becomes less significant,


Fig. 9. Jsoup category distribution

Fig. 10. Jsoup category distribution when starting at a sports video from 2008

Fig. 11. Comparing publishing year for API and Jsoup crawler [years 2005-2015; series: API 100.000, Jsoup 100.000]

Fig. 12. Comparing publishing year for API crawler [years 2005-2015; series: API 1000, API 100.000]

and this could be an indication that crawling deeper would yield a
distribution that better reflects the YouTube database. At the same
time, it is important to note that the links found on a YouTube page
are placed there by YouTube as related videos, and how YouTube defines
related videos is not documented.
   2) Year statistics: For the distribution of uploaded videos per
year, the Jsoup crawler and the API crawler are more closely aligned
(see Figure 11), and for both crawlers the year 2015 is significantly
larger than the rest. This distribution can be explained by the
enormous growth of YouTube content, as shown in the statistics from
Statistica from 2014 [4]. When the API dataset is increased from 1,000
videos by a factor of 100, the distribution changes very little. This
again indicates that the API provides a good random distribution (see
Figure 12).
   For the Jsoup crawler, the situation is similar to that for the
categories when looking at a dataset that starts at a specific video.
Figure 13 is from a dataset whose crawl started at a sports video from
2008, and it shows that the years close to 2008 are also strongly
represented. The reason is again the way YouTube links related videos
to each other. The same figure also shows the same trend as earlier:
as the dataset grows, the result moves towards the dataset of the API
search, but there is no proof that this would eventually end up in a
distribution that reflects the whole YouTube database.
   Another strong indication of this biased behavior is that the
number of views per video is much larger when performing the Jsoup
search. Crawling with the API, we get videos with around 15,000 views
per video, but with the Jsoup search we get over 1,634,570 views per
video when crawling 100,000 videos starting at the sports videos. It
is very likely that popular videos are the videos YouTube suggests for
the user to watch next. This underlines our statement that popular
videos are getting more popular while unpopular videos will not get
any recommendations by YouTube.

                 VI. CONCLUSION AND FUTURE WORK

   The objective of this work was to design and develop a research
tool for gathering independent YouTube videos within given parameters
and providing their metadata as well as the video download files. This
corpus of videos should be a representative sample of all available
videos within the YouTube video space and the chosen filter parameters.
   The intention behind such a tool is to provide easy access for
researchers to collect this data from YouTube and allow

                                                            APPENDIX B
                                                        YOUTUBE MPD FILES

Year 2014    Duration 20min    Quality
                               SD    HD    VideoID       Period  mp4/Audio  mp4/Video  Audio/webm  Video/webm
Yes  No      Yes  No           Yes   No    FN-h2tLQmxU   1       2          2          1           1
Yes  No      Yes  No           Yes   No    k70-MAIW2Uo   1       2          2          2           2
Yes  No      Yes  No           Yes   No    Vt7g-VcAFvm   1       2          4          no          no
Yes  No      Yes  No           No    Yes   aH1OBsYEFIU   1       2          5          1           5
Yes  No      Yes  No           No    Yes   K7yOpj29YQo   1       2          5          1           5
Yes  No      Yes  No           No    Yes   XmXoyQ-PjkU   1       2          5          1           5
Yes  No      No   Yes          Yes   No    hBqxNfCrfL8   1       2          4          1           4
Yes  No      No   Yes          Yes   No    3FTwaojNkXw   1       2          2          1           3
Yes  No      No   Yes          Yes   No    U0qTkTcsz0I   1       2          2          1           3
Yes  No      No   Yes          No    Yes   3v7RcHviRdU   1       2          5          1           5
Yes  No      No   Yes          No    Yes   Fj_FUQ2mXy4   1       2          5          1           5
Yes  No      No   Yes          No    Yes   9Z6RworZrLQ   1       2          5          1           5
No   Yes     Yes  No           Yes   No    iEc8-83aywc   1       1          4          1           4
No   Yes     Yes  No           Yes   No    owaF-6Ko0ic   1       2          4          no          no
No   Yes     Yes  No           Yes   No    98Z-n-yPTn8   1       2          4          no          no
No   Yes     Yes  No           No    Yes   MrJr-dn-7Rs   1       2          7          no          no
No   Yes     Yes  No           No    Yes   N2CO-xlgD9g   1       2          5          no          no
No   Yes     Yes  No           No    Yes   rfzU-Iigzgw   1       1          5          1           5
No   Yes     No   Yes          Yes   No    xDN_-ihLcmo   1       2          4          1           4
No   Yes     No   Yes          Yes   No    nku-pRuftDg   1       1          3          1           3
No   Yes     No   Yes          Yes   No    0eUm-V8vxJ0   1       1          5          no          no
No   Yes     No   Yes          No    Yes   dp6-T6jNIhy   1       1          5          1           5
No   Yes     No   Yes          No    Yes   Akt-jf0L5zQ   1       1          5          1           5
No   Yes     No   Yes          No    Yes   Qpy-HHOMZ1    1       3          5          no          no

Fig. 14. Content of inspected MPD files
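
Counts such as those in Fig. 14 could be obtained by parsing an MPD file and tallying its Period elements and its Representation elements per MIME type. The following Python sketch shows one way to do this. It assumes the standard DASH namespace urn:mpeg:dash:schema:mpd:2011, assumes that the table columns count Representation elements, and uses a made-up, heavily shortened manifest rather than one of the inspected files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by MPEG-DASH manifests (assumed to apply to the inspected MPDs).
DASH_NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

# Hypothetical manifest for illustration only; ids and bandwidths are invented.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet mimeType="audio/mp4">
      <Representation id="a1" bandwidth="128000"/>
      <Representation id="a2" bandwidth="256000"/>
    </AdaptationSet>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="v1" bandwidth="250000"/>
      <Representation id="v2" bandwidth="600000"/>
    </AdaptationSet>
    <AdaptationSet mimeType="audio/webm">
      <Representation id="w1" bandwidth="128000"/>
    </AdaptationSet>
  </Period>
</MPD>"""


def summarize_mpd(mpd_text):
    """Return (number of periods, {mime type: number of representations})."""
    root = ET.fromstring(mpd_text)
    periods = root.findall("mpd:Period", DASH_NS)
    counts = Counter()
    for period in periods:
        for adaptation_set in period.findall("mpd:AdaptationSet", DASH_NS):
            set_mime = adaptation_set.get("mimeType")
            for rep in adaptation_set.findall("mpd:Representation", DASH_NS):
                # mimeType may sit on the Representation or be inherited
                # from the enclosing AdaptationSet.
                mime = rep.get("mimeType") or set_mime or "unknown"
                counts[mime] += 1
    return len(periods), dict(counts)


periods, per_mime = summarize_mpd(SAMPLE_MPD)
print(periods, per_mime)
```

A MIME type that does not occur in the manifest is simply absent from the result, which corresponds to the "no" entries in the table.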

                        APPENDIX C
                DOWNLOADED JSON STRUCTURE

  The following listing shows the output format of the tool when
using JSON as the export format.

{
    "kind": "youtube#video",
    "etag": etag,
    "id": string,
    "snippet": {
        "publishedAt": datetime,
        "channelId": string,
        "title": string,
        "description": string,
        "thumbnails": {
            (key): {
                "url": string,
                "width": unsigned integer,
                "height": unsigned integer
            }
        },
        "channelTitle": string,
        "tags": [
            string
        ],
        "categoryId": string,
        "liveBroadcastContent": string,
        "defaultAudioLanguage": string
    },
    "contentDetails": {
        "duration": string,
        "dimension": string,
        "definition": string,
        "caption": string,
        "licensedContent": boolean,
        "regionRestriction": {
            "allowed": [
                string
            ],
            "blocked": [
                string
            ]
        },
        "contentRating": {
            "acbRating": string,
            "agcomRating": string,
            "anatelRating": string,
            .
            .
            .
        }
    },
    "status": {
        "uploadStatus": string,
        "failureReason": string,
        "rejectionReason": string,
        "privacyStatus": string,
        "publishAt": datetime,
        "license": string,
        "embeddable": boolean,
        "publicStatsViewable": boolean
    },
    "statistics": {
        "viewCount": unsigned long,
        "likeCount": unsigned long,
        "dislikeCount": unsigned long,
        "favoriteCount": unsigned long,
        "commentCount": unsigned long
    },
    "player": {
        "embedHtml": string
    },
    "topicDetails": {
        "topicIds": [
            string
        ],
        "relevantTopicIds": [
            string
        ]
    },
    "recordingDetails": {
        "locationDescription": string,
        "location": {
            "latitude": double,
            "longitude": double,
            "altitude": double
        },
        "recordingDate": datetime
    },
    "comments": {
        "comment": [ {
            "kind": "youtube#comment",
            "etag": etag,
            "id": string,
            "snippet": {
                "channelId": string,
                "videoId": string,
                "textDisplay": string,
                "textOriginal": string,
                "parentId": string,
                "authorDisplayName": string,
                "authorProfileImageUrl": string,
                "authorChannelUrl": string,
                "authorChannelId": {
                    "value": string
                },
                "authorGoogleplusProfileUrl": string,
                "canRate": boolean,
                "viewerRating": string,
                "likeCount": unsigned integer,
                "moderationStatus": string,
                "publishedAt": datetime,
                "updatedAt": datetime
            }
        } ]
    },
    "videoLinks": {
        "singleDownloadLink": [ {
            "itag": integer,
            "url": string
        } ]
    }
}
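
A record in this format can be read with any JSON library. The Python sketch below pulls out the fields used in the evaluation (publishing year, category, view count) together with the itags of the download links. It assumes that the tool writes one such object per video with real values in place of the type names. The sample record and all of its values are invented for illustration, and the counters are converted defensively because they may be serialized as strings, as in the YouTube Data API.

```python
import json
from datetime import datetime

# Hypothetical record following the structure of the listing; values are made up.
SAMPLE_RECORD = """
{
  "kind": "youtube#video",
  "id": "example-id",
  "snippet": {
    "publishedAt": "2015-03-01T12:00:00.000Z",
    "title": "Example video",
    "categoryId": "17"
  },
  "statistics": {"viewCount": "15000", "likeCount": "120"},
  "videoLinks": {
    "singleDownloadLink": [
      {"itag": 18, "url": "https://example.invalid/video"}
    ]
  }
}
"""


def summarize_record(record):
    """Extract year, category, views and itags from one exported record."""
    snippet = record.get("snippet", {})
    stats = record.get("statistics", {})
    # Keep only the "YYYY-MM-DDTHH:MM:SS" part so fractional seconds
    # and the trailing "Z" do not matter.
    published = datetime.strptime(snippet["publishedAt"][:19],
                                  "%Y-%m-%dT%H:%M:%S")
    links = record.get("videoLinks", {}).get("singleDownloadLink", [])
    return {
        "id": record["id"],
        "year": published.year,
        "categoryId": snippet.get("categoryId"),
        # int() accepts both "15000" and 15000.
        "views": int(stats.get("viewCount", 0)),
        "itags": [link["itag"] for link in links],
    }


summary = summarize_record(json.loads(SAMPLE_RECORD))
print(summary)
```

Optional parts such as statistics or videoLinks are read with defaults, so a record that lacks them still yields a summary.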

                                                            APPENDIX D
                                                LARGEST YOUTUBE CATEGORY BY YEAR




[Figure: largest category per publishing year, 2007 to 2014. Legend: People & Blogs, Music, Comedy, Gaming, News & Politics.]

Fig. 15. The change in largest category on YouTube by year
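
A figure of this kind can be derived from the exported records by counting categories per publishing year and keeping the most frequent one. The Python sketch below illustrates the idea. The (year, category) pairs are made up and are not data from this paper, the share is reported in percent as an assumption about the vertical axis, and ties are resolved in favor of the category seen first.

```python
from collections import Counter, defaultdict

# Hypothetical (publishing year, category name) pairs for illustration.
SAMPLE = [
    (2007, "Music"), (2007, "Music"), (2007, "Comedy"),
    (2008, "Comedy"), (2008, "Music"), (2008, "Comedy"),
    (2014, "Gaming"), (2014, "Gaming"), (2014, "People & Blogs"),
]


def largest_category_by_year(pairs):
    """Map each year to (most frequent category, its share in percent)."""
    per_year = defaultdict(Counter)
    for year, category in pairs:
        per_year[year][category] += 1

    result = {}
    for year in sorted(per_year):
        category, count = per_year[year].most_common(1)[0]
        total = sum(per_year[year].values())
        result[year] = (category, round(100.0 * count / total, 1))
    return result


largest = largest_category_by_year(SAMPLE)
for year, (category, share) in largest.items():
    print(year, category, share)
```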

