Obtaining a Video Dataset from YouTube via DASH

Ida Marie Frøseth, Stefan Leicht, Richard Reimer, Viet Thi Tran

(Ida Marie Frøseth and Viet Thi Tran are with the University of Oslo, Norway. S. Leicht and R. Reimer are with the University of Mannheim, Germany.)

Abstract—Multi-modal video analysis considers both the video itself and the surrounding metadata at the same time to extract information from the video. In online video databases, metadata may consist of ratings, comments or information regarding the uploader. As such, in order to test video analysis algorithms, a large corpus of videos and metadata is necessary. The corpus should be a representative sample of all available videos (within the chosen parameters), rather than a limited set of uploaders or categories, as the latter would be biased. Consequently, this paper proposes a tool for the YouTube video platform that uses two distinct approaches to achieve this representative sample of all videos, extracting their metadata and exporting the information in a JSON, CSV or XML format. Furthermore, the video download itself is supported by this tool.

Index Terms—DASH, HTTP, YouTube, Video, Download tool, Dataset

I. INTRODUCTION

The rapid technological progress over the past decade, in addition to the increasing popularity of Web 2.0 applications, has led to a vast increase of user-generated content. This holds especially true for YouTube, an online video service created in 2005 and purchased by Google in November 2006.

A YouTube video is surrounded by a wide range of metadata. This metadata can be grouped into three categories: 1) information regarding the author/uploader; 2) general information on the video itself; and 3) information concerning user communication. An example of the latter is the comment section, where users can openly state their opinions and feelings on any given video.

By analysing these data, general statistics, behavioural patterns and even a person's emotions, and possibly the emotions of an entire city, can be detected. In 2014, Guthier et al. conducted a multimodal analysis of the Twitter platform. They developed a system to detect emotions and visualize them on a map, performing a text analysis of messages and using the geolocation tag in the metadata to determine the location of each message [1].

In order to perform an accurate multimodal analysis of a user-generated database like YouTube, a large corpus of data that reflects the entire YouTube database must be retrieved [2][3]. Since YouTube's growth rate drastically increases every year [4], the dataset has to be relatively fresh to conduct an up-to-date analysis.

In the following chapters, the functionality and development of a tool to retrieve such a dataset from YouTube, including videos, metadata and user interaction, will be described in detail. The information in its entirety is stored in a file format that can be used later on to perform multimodal analysis. Chapter II discusses the related work in the field of YouTube data gathering research. In Chapter III, the Dynamic Adaptive Streaming over HTTP (DASH) protocol is described in greater detail, with additional information on how YouTube utilizes DASH. The proposed tool supports two approaches for video crawling: one using the YouTube API with a default key search as input, and the other using a traditional web crawling technique. The design and implementation of the tool, including both crawling strategies, are described in Chapter IV, the evaluation of the tool is included in Chapter V, and Chapter VI concludes the paper with an outlook on future work.
II. RELATED WORK

There are a number of studies that use different crawling methods to gather and analyse YouTube video metadata. Cha et al. [5] analysed the popularity distribution, popularity evolution and content duplication of user-generated video content on YouTube in 2007. They crawled 1.7 million videos from the Entertainment category and 250,000 from the Science category. Their results showed that the recommendation engine of YouTube favors a small number of popular items, pointing the user away from unpopular ones. This observation is congruent with web search engines: J. Cho and S. Roy [6] proved in a seven-month experimental study that popular websites are preferred by search engines. As a consequence, popular websites get more popular while unpopular websites get less popular.

Cheng et al. [7] also collected metadata for three million videos in 2007 to examine the popularity of YouTube videos. Their approach for data gathering was a breadth-first search starting with a primary video and then traversing through all related videos down to the fourth depth. They revealed that there exists a clustering coefficient for related videos, indicating that a grouping of videos exists. This implies that video search results are biased towards the initial video: starting with a music video, most of the related links will also be music videos. In addition, Cheng et al. [7] revealed characteristics with lower impact on the search result which show that, to some degree, any two videos can be linked through YouTube's related videos.

To gather a potentially unbiased dataset for measuring the popularity and view counts of videos, Szabo and Huberman [8] examined the newly added YouTube videos daily over a 30-day period. After a 10-day examination period they could predict the future popularity of a YouTube video with 90% accuracy.
Zhou et al. [9] used random prefix sampling to gather information via the YouTube Search API, with evidence to support that their method provides a random sample of all videos. Although they focused their research effort on estimating the total number of YouTube videos in 2011, their approach looks most promising for our purpose of gathering a representative sample of all videos.

III. DYNAMIC ADAPTIVE STREAMING OVER HTTP

Media streaming is by far the largest application using the Internet, and it is still growing. Cisco predicts that by 2020 Internet video will contribute around 90 percent of the network load [10]. The traditional way of streaming uses the stateful Real-Time Protocol, but there has been a dramatic increase in streaming over HTTP over the last couple of years [11]. HTTP is a stateless protocol, and traditionally, when a user requested a video, the entire video stream would be downloaded regardless of whether the user switched views during playback. This leads to the obvious downside of potential bandwidth waste, and even network congestion [11]. This, among other reasons, has led to the development of the Dynamic Adaptive Streaming over HTTP protocol. Another big advantage of streaming over HTTP is that developers do not have to worry about firewalls and NATs. The following sections give an overview of the protocol and the Media Presentation Description, and finally describe how YouTube uses DASH.

A. DASH protocol overview

Fig. 1. DASH overall description [12]

Figure 1 shows a very simplified view of how DASH works. When the video is uploaded to the server, the server encodes and stores the video in various qualities. The video is also split into segments so that the client can adapt the video quality to fit the network bandwidth. When the user starts to view the video, the client issues a GET request for each sequence of video. If the user switches the view during playback, the stream stops with the last GET request. To make this adaptive behaviour possible, DASH uses a file called the Media Presentation Description (MPD), or DASH manifest, to tell the client what sequences are available at what quality. With DASH, more logic and control is moved to the client, and the client does not have to negotiate with the server to get the suited stream [11]. The server side can also use proxies to cache the streams, or the streams of different qualities can even be distributed between multiple servers.
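To make the adaptation loop concrete, the following minimal sketch shows the essence of the client-side logic described above. The segment naming scheme, the two quality URLs and the bandwidth threshold are all invented for illustration; a real player adds buffering, error handling and smarter rate estimation.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class NaiveDashClient {

    // Download one media segment with a plain HTTP GET.
    static byte[] fetchSegment(String baseUrl, int index) throws Exception {
        // "seg-<n>.m4s" is a hypothetical naming scheme.
        URL url = new URL(baseUrl + "/seg-" + index + ".m4s");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical base URLs for two representations of the same video.
        String[] qualities = {"http://example.com/video/360p",
                              "http://example.com/video/720p"};
        double throughputBps = 0;
        for (int i = 0; i < 10; i++) {
            // Pick the representation the measured bandwidth can sustain.
            String base = throughputBps > 2_000_000 ? qualities[1] : qualities[0];
            long t0 = System.nanoTime();
            byte[] segment = fetchSegment(base, i);
            double seconds = (System.nanoTime() - t0) / 1e9;
            throughputBps = segment.length * 8 / seconds; // update the estimate
            // ... hand the segment to the decoder ...
        }
    }
}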
B. Media Presentation Description – the heart of DASH

Fig. 2. DASH MPD file structure [13]

The one thing enabling the adaptive manner of DASH is the MPD, or DASH manifest. The MPD tells the client how the segmentation is done and which encodings and resolutions are available for a particular video entry. The MPD file format is XML, and its structure is shown in figure 2.

As figure 2 shows, the MPD contains one or more periods, each denoted with a start time. Inside a period you will find one or more adaptation sets. Every adaptation set contains different representations; for example, there can be one adaptation set for the available video streams, one for audio and one for text. There are also multiple adaptation set containers, like WebM and ISO BMFF. Within one adaptation set you will find one or more representations, and multiple representations within one adaptation set are alternatives to each other. Each representation has some meta-information about the available stream, like the codec, resolution, required bandwidth and, most importantly, the base URL to retrieve the stream. As soon as the client has chosen the best suited stream segment, it downloads the segment by issuing a GET request to the provided base URL. Each representation must contain at least one segment element and an initialization segment that represents the segment information. The segment information can either be inherited from the higher-level segment information of the adaptation set or period, or it can be placed in the representation itself. The DASH documentation lists three different segment information element types, depending on the use case: SegmentBase, SegmentTemplate and SegmentList [14]. The next section outlines in detail how YouTube uses DASH, and in particular how YouTube uses the SegmentBase type.
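As a minimal sketch of this structure, the listing below shows what such a manifest can look like for one period with a video and an audio adaptation set. The element names follow the MPEG-DASH schema, while all attribute values (URLs, byte ranges, bandwidths, IDs) are made up for illustration:

<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static"
     mediaPresentationDuration="PT3M30S">
  <Period start="PT0S">
    <AdaptationSet mimeType="video/mp4">
      <!-- alternative representations of the same content -->
      <Representation id="1" codecs="avc1.640028" width="1920" height="1080"
                      bandwidth="4000000">
        <BaseURL>https://example.com/video_1080p.mp4</BaseURL>
        <SegmentBase indexRange="708-1183">
          <Initialization range="0-707"/>
        </SegmentBase>
      </Representation>
      <Representation id="2" codecs="avc1.4d401f" width="1280" height="720"
                      bandwidth="2000000">
        <BaseURL>https://example.com/video_720p.mp4</BaseURL>
        <SegmentBase indexRange="708-1183">
          <Initialization range="0-707"/>
        </SegmentBase>
      </Representation>
    </AdaptationSet>
    <AdaptationSet mimeType="audio/mp4">
      <Representation id="3" codecs="mp4a.40.2" bandwidth="128000">
        <BaseURL>https://example.com/audio.mp4</BaseURL>
        <SegmentBase indexRange="592-1011">
          <Initialization range="0-591"/>
        </SegmentBase>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>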
C. DASH in YouTube

When YouTube was launched, it used progressive download over HTTP to deliver its video content. With progressive download, the entire video is downloaded as a fully runnable file. The drawback of this approach is that the user could not start to view the video before the entire stream was downloaded. YouTube is one of many service providers that have adopted DASH to ensure a better quality of their service. In addition to DASH, YouTube also supports regular streaming. The difference between regular streaming and DASH is that instead of slicing the stream into multiple segments, a regular stream requests the entire video in one slice (see http://www.onlinevideo.net/2011/05/streaming-vs-progressive-download-vs-adaptive-streaming/, retrieved 31.10.2015). YouTube started using DASH after Google I/O 2013 (https://en.wikipedia.org/wiki/YouTube#Video_technology, retrieved 31.10.2015). The largest advantage of adaptive streaming is, as mentioned, that the client can switch to a suitable video quality during playback if the bandwidth gets better or worse, or even based on the CPU load of the user's device. More control and logic are moved to the client side with DASH, and the next few sections describe how YouTube makes this work.

1) The YouTube itag: Each video on YouTube may have several related download streams, which means the user or client application can change to the quality of the content that suits the current quality of the network. YouTube provides its videos using Adobe Dynamic Streaming for Flash, which supports dynamic streaming over HTTP, but it does not purely adopt the international MPEG-DASH standard. Instead, YouTube embeds a DASH manifest in the video information in the HTML content and encodes each URL with a so-called itag, which identifies the different types of streams and qualities. We did not manage to find an official document describing these itags, but on Wikipedia users have compiled a comprehensive table of itags. Figure 3 shows the itags for DASH videos (https://en.wikipedia.org/wiki/YouTube#Quality_and_formats, retrieved 28.10.2015); in addition, there are similar tables for non-DASH streams and live streaming videos.
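For illustration, a decoded stream URL carries the itag as one of its query parameters. The link below is a schematic placeholder with most parameters elided, not a working URL; according to the Wikipedia table, itag 137 for instance denotes the 1080p MP4 DASH video stream:

https://redirector.googlevideo.com/videoplayback?itag=137&mime=video%2Fmp4&expire=...&signature=...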
2) YouTube's DASH manifest: Since YouTube embeds its MPD information in the HTML content, the player does not have to download the manifest; it only has to be aware of the itags of the particular streams. If the client does not know these itags, Google has an API where it is possible to download the manifest for a particular video stream based on the URL of that stream. To get a profound understanding of how DASH is utilized in YouTube, we inspected several MPD files; appendix B shows an overview of our results. To make sure that the inspected MPD files were not biased in some way, a filtered random search was performed to retrieve videos with a varied set of attributes, consisting of age, duration and quality. The dataset was relatively small, but it shows that all the inspected videos contain only one period. It also shows that YouTube supports two different representation containers for adaptive streaming, WebM and ISO BMFF (see http://www.streamingmediaglobal.com/Articles/Editorial/Featured-Articles/The-State-of-MPEG-DASH-Deployment-96144.aspx, retrieved 20.10.2015). Hence, in a YouTube MPD you will usually find two adaptation sets for audio and two for video, namely audio/webm and audio/mp4, and correspondingly video/webm and video/mp4. By inspecting different DASH manifests we could also see that YouTube places its segment information within the representation, using the BaseURL element and the SegmentBase type with an initialization element such as:

<Initialization range="0-234"/>

3) Automatically switching view during playback: YouTube supports both switching quality during playback and seeking in the video content, i.e. going back and forth in time. It does this by using a byte-versus-time mapping (Google I/O 2013, from min 8:50 to min 9:45, https://www.youtube.com/watch?v=UklDSMG9ffU, retrieved 31.10.2015). As mentioned in the DASH section, the video is split into segments. When playing a YouTube video, each segment of video is displayed between two yellow bars on the progress bar. These yellow bars indicate where the information segments, or keyframes as YouTube calls them, are located. If a user changes the quality during playback, playback has to restart at one of these information segment boundaries. This means that if the user changes the quality between two information segments, the playback will jump back or forward in time to the closest information segment. When the user hits play, the video segments are cached by the browser, which is displayed as a grey shadow in the progress bar for the part that has been cached. All these features make the user experience much smoother, and YouTube also experienced huge savings in network traffic (Google I/O 2013, https://www.youtube.com/watch?v=UklDSMG9ffU, retrieved 31.10.2015).

IV. IMPLEMENTATION

This chapter first gives an overall description of the tool design and what features each of the tool's modules supports, followed by a section that presents in detail how the video download function is realized (section IV-B), before describing the strategies for getting a representative dataset of YouTube in more detail (section IV-C).
A. Design

Fig. 4. YouTube Downloader design

The YouTube downloader tool is designed with four modules: Search, Information Extractor, graphical user interface (GUI) and YTManagement; see Figure 4. The Search module is the most important module, since it has the responsibility of getting a representative dataset of unique video IDs from YouTube. Next, the Information Extractor is fed with these video IDs and downloads both the metadata and the video in the desired format. The GUI exposes these features in an intuitive way and is additionally able to compute and display some statistics of the result after a completed search request. Lastly, all interaction between the modules is handled by the YTManager.

1) Search module – getting a representative dataset: As figure 4 shows, the Search module supports two approaches for getting a dataset of video IDs: the API Search and the jsoup crawler. The first uses the YouTube API, while the latter uses a traditional web crawling technique and the jsoup library to parse the HTML documents. The API search approach supports filtering, while the jsoup crawler does not; see section IV-C for details on each crawler. The YouTube API supports filtering on a vast number of parameters (https://developers.google.com/youtube/v3/, retrieved 26.10.2015). This tool only includes the most used filter items, but it can easily be expanded to all available filters. When using the filters, it is important to notice that most of these attributes are user defined, so there is no guarantee that a video is a music video even though it has that category tag assigned to it. It is also important to notice the default values of each parameter when the user uploads a video. These influence the result, because it seems that not all users alter the default values, as some of the attributes are only accessible through the advanced settings when uploading a video.

Fig. 5. Tool supported features

Figure 5 shows an overview of the supported features and which crawler implements which feature. The three columns keyword, filter and random show the attributes that can be changed, and each row identifies a combination of attributes. For example, row number two shows that the API crawler supports a random search with no filter or keyword. The features that are not supported are denoted with a "NO" and their rows are grayed out. These features are: 1) a non-random search with a filter applied, 2) a random search with a keyword applied, and 3) a random search with both a keyword and a filter applied.
The filters supported by the API search are the following; a sketch of a filtered request follows the list.

• Keyword: adding this filter results in videos that have the keyword either in the title, description or tags. Be aware that by adding this filter, the result will no longer be random, since the random crawler alters this parameter to make the search random.
• Location and radius: defines a circular geographic area and restricts the search to videos that specify, in their metadata, a geographic location that falls within that area. The radius must be followed by one of the measurement units m, km, ft or mi. When no unit is given, the standard value of km is applied.
• Period: restricts the search to retrieve only videos from a specified period. The default value is from all time to the current date, and it is also possible to configure a specific day and month.
• Category: gives a result within the specified category. The category is a value the uploader defines in the advanced settings, and the default value is People & Blogs.
• Language: returns results relevant for the specified language.
• Region: returns the results for the specified country.
• Duration: returns the videos that are within the specified duration.
• Definition: returns only videos that support the specified definition, either SD, HD or both.
• Type: a video can be tagged as either Episode, Show or Movie, and this filter returns only videos that fit this parameter. Configuring the type parameter is optional; it is located in the advanced settings tab when uploading a video to YouTube.
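As an illustration of how such a filtered request can be assembled, the sketch below builds a search.list URL for the YouTube Data API v3. The parameter names (location, locationRadius, publishedAfter, videoCategoryId, regionCode, videoDuration, videoDefinition) are taken from the v3 documentation; the API key, coordinates and filter values are placeholders, and this is not the tool's actual code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class FilteredSearch {
    private static final String API_KEY = "YOUR_API_KEY"; // placeholder

    public static String search(String keyword) throws Exception {
        String query = "https://www.googleapis.com/youtube/v3/search"
                + "?part=id&type=video&maxResults=50"
                + "&q=" + URLEncoder.encode(keyword, "UTF-8")
                + "&location=59.91,10.75&locationRadius=50km"   // circular area filter
                + "&publishedAfter=2014-01-01T00:00:00Z"        // period filter
                + "&videoCategoryId=10"                         // category filter (10 = Music)
                + "&regionCode=NO&videoDuration=short&videoDefinition=high"
                + "&key=" + API_KEY;
        HttpURLConnection conn = (HttpURLConnection) new URL(query).openConnection();
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) json.append(line);
        }
        return json.toString(); // JSON response with the matching video IDs
    }
}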
2) Information extractor: As figure 4 shows, there are two information extractors: the Metadata extractor and the Video downloader. The Metadata extractor uses the YouTube API. Support for downloading a video, however, was removed when YouTube moved from APIv2 to APIv3; hence the Video downloader has to parse the HTML to extract the download link, and the description of how to download a video is therefore awarded its own section, see section IV-B. The tool enables the user to choose what type of information to include in the download, because adding more information takes more time and more space. The user basically has the following choices:

1) Only download the video metadata.
2) Include comments in the video metadata.
3) Include the video download link in the metadata.
4) Include the video in all available formats and add the video link to the metadata.

Options two through four give a metadata file that has the comments and/or the download URL included. The YouTube API implements a RESTful API that uses JSON as its data representation format. Therefore, the default download format is JSON, but the tool also supports conversion to XML and CSV. YouTube keeps all the metadata for a video in a video object, which is fetched by using the YouTube Video List API; this results in the information about the video. To fetch the comments, a second GET request has to be issued, since the comments are not stored in the video object itself but within a comment object. This returns the desired number of top-level comments for the video. Which comments are marked as top level depends on the setting the user has selected, either most popular comments or most recent comments. The comments also come with some metadata, like the author and the publication date; see appendix C for details. The tool limits the number of comments to five; the reason for this is to limit the amount of data and ensure that the download finishes within a reasonable time. Each comment can potentially contain up to 10,000 characters (about 10 kB), while one video entry without comments is around 1,700 characters and upwards, so one comment could potentially use the space of five videos. Another issue with comments is that only about 30% of the videos have comments; where no comments are available, they are either disabled or no one has commented on the video yet.
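The video object and the comments are thus retrieved with two separate requests. The sketch below assumes the v3 videos.list and commentThreads.list endpoints; the API key and video ID are placeholders, and order=relevance corresponds to "most popular" while order=time corresponds to "most recent":

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetadataFetcher {
    private static final String API_KEY = "YOUR_API_KEY"; // placeholder

    static String get(String url) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = c.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        String videoId = "dQw4w9WgXcQ"; // any valid 11-character ID
        // One request for the video object ...
        String video = get("https://www.googleapis.com/youtube/v3/videos"
                + "?part=snippet,contentDetails,statistics,recordingDetails"
                + "&id=" + videoId + "&key=" + API_KEY);
        // ... and a second one for up to five top-level comments.
        String comments = get("https://www.googleapis.com/youtube/v3/commentThreads"
                + "?part=snippet&videoId=" + videoId
                + "&maxResults=5&order=relevance&key=" + API_KEY);
        System.out.println(video);
        System.out.println(comments);
    }
}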
3) Graphical user interface: The GUI uses a tabbed pane to present the user with three views: Search, Result and Statistics; see figure 6.

Fig. 6. Tool Graphical user interface

The Result and Statistics views are initially empty and are populated when a search has been executed. The Statistics view displays the distribution of categories, years and likes, while the Result view displays a list of the fetched videos; the user can click on one of the videos to look at the fetched metadata. Be aware that the comments and URLs are not included in this view. As figure 4 depicts, there are also two different search views, one for the API crawler approach and one for the jsoup crawler. This distinct separation between the two ensures that the user is aware of which crawler is being used. When the user has chosen the appropriate search filters and settings and the search is started, the tool enables the user to stop the crawling. When the crawling is cancelled, all the metadata gathered up to the point of cancellation is saved, and the GUI statistics are calculated and drawn.

B. YouTube Video Download

YouTube API v3 has no support for downloading videos; instead it offers three YouTube player APIs to embed a YouTube video player: the IFrame API [15], the Android Player API [16] and the iOS Helper Library [17]. Be aware that downloading a video is actually against YouTube's policy, but since this was a part of the task of this project, we implemented it as a proof of concept.

1) Extracting and identifying a URL stream from the HTML content: Since the video player at www.youtube.com is an HTML-embedded video player, it is possible to parse the HTML content of a YouTube page representing a video and extract the available streams. All the adaptive streams follow the tag adaptive_fmts, and the regular streams are located after the tag url_encoded_fmt_stream_map. By searching for these patterns within the HTML content, it is possible to extract all the video links and decode them using a URL decoding technique. As mentioned in section III, YouTube supports a whole range of video formats and containers, and there are both DASH videos and regular streams. All the available formats for a video entry are included in the HTML document of that particular video. To identify the format and quality, YouTube uses an itag (https://en.wikipedia.org/wiki/YouTube#Quality_and_formats, retrieved 28.10.2015), and this tag is added to the specific URL of each stream.
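The following sketch shows this extraction step, assuming the 2015-era watch-page layout in which the two fields appear as JSON string values inside the embedded player configuration. The regular expression and surrounding code are illustrative only and will not work against today's YouTube pages:

import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamMapExtractor {
    // "adaptive_fmts" holds the DASH streams,
    // "url_encoded_fmt_stream_map" the regular (muxed) streams.
    private static final Pattern FIELD = Pattern.compile(
            "\"(adaptive_fmts|url_encoded_fmt_stream_map)\":\\s*\"([^\"]+)\"");

    public static void extract(String watchPageHtml) throws Exception {
        Matcher m = FIELD.matcher(watchPageHtml);
        while (m.find()) {
            // Streams are comma-separated; each stream is a url-encoded
            // key=value list containing url, itag, quality, ...
            for (String stream : m.group(2).split(",")) {
                String decoded = URLDecoder.decode(stream, "UTF-8");
                System.out.println(m.group(1) + " -> " + decoded);
            }
        }
    }
}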
2) Handling encrypted YouTube signatures: Each stream has a signature to ensure integrity. This signature is denoted signature, sig, s or RTMPE (https://en.wikipedia.org/wiki/Real_Time_Messaging_Protocol#Encryption, retrieved 28.10.2015). If the signature is identified with an "s" or with RTMPE, it means that the signature is encrypted using Adobe's own security mechanism. There is no official report on how YouTube decrypts the signatures of its videos, but there are many discussions about how to decrypt the signature of a YouTube download link, and it has been found that YouTube also ships the function to decrypt the signature (http://stackoverflow.com/questions/23975878/getting-the-signature-of-a-youtube-video, retrieved 30.10.2015). The downloader tool does not support streams that have an encrypted signature; hence it only supports URLs whose signature is denoted signature or sig.

3) Downloading the video file in parallel with the video info: Since a video is pretty large compared to its metadata, it takes much more time to download the video than to retrieve the metadata. To ensure that the metadata gathering is not slowed down by the video download, the tool has a separate thread that handles the download process. This thread has a monitor which keeps a queue of videos that should be downloaded. Whenever the crawler discovers a new video, it puts the video in this download queue. The download thread handles one video at a time until the queue is empty. The downloader also creates one thread for each available video quality, so it can download all the available streams concurrently. In addition, whenever a video is added to the queue, the Video downloader responds with all the URLs of that video so they can be saved in the video information file. It is also important to notice that the download links have an expiry time. As a result, the download links do not stay alive forever, and each time the tool needs to download videos it must re-fetch the download links; it is not possible to download videos after the expiry time.
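The queue-and-worker scheme described above can be sketched with a BlockingQueue; the class and method names here are illustrative, not the tool's actual code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class VideoDownloadWorker extends Thread {
    // One entry per video: the list of stream URLs (one per quality).
    private final BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();

    // Called by the crawler whenever it discovers a new video.
    public void enqueue(List<String> streamUrls) {
        queue.add(streamUrls);
    }

    @Override
    public void run() {
        try {
            while (true) {
                List<String> urls = queue.take(); // blocks until work arrives
                // One thread per available quality, so all streams of this
                // video are fetched concurrently.
                List<Thread> fetchers = new ArrayList<>();
                for (String url : urls) {
                    Thread t = new Thread(() -> download(url));
                    fetchers.add(t);
                    t.start();
                }
                for (Thread t : fetchers) {
                    t.join(); // move to the next video only when all are done
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void download(String url) {
        // HTTP GET to the (time-limited) stream URL; body omitted here.
    }
}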
C. Crawling strategies

Gathering and analysing the metadata of YouTube videos can be of great interest not only from a social perspective, e.g. detecting user emotions by inspecting the comments of videos uploaded in a given area, but also from a technical perspective, e.g. how many videos are uploaded to YouTube every day, and how much traffic do they cause?

Unfortunately, this information and further statistics are not publicly available; YouTube only publishes a few general statistics about its number of users, mobile usage percentages and advertisement (https://www.youtube.com/yt/press/en/statistics.html, retrieved 26.10.2015). Attaining more profound information is not an effortless task and has to be done either by crawling the YouTube webpage or by using the official YouTube API v3. Before we elaborate on both data gathering approaches in more detail, we present an introduction on how YouTube uniquely identifies its videos.

1) YouTube video IDs: Each YouTube video link is determined by a unique 11-character identifier (the YouTube video ID). The first 10 characters of the ID consist of any of the 64 characters in S = {0-9, A-Z, a-z, -, _}. The 11th and last character is one of the 16 characters in T = {A, E, I, M, Q, U, Y, c, g, k, o, s, w, 0, 4, 8}. In total, the ID space has 64^10 * 16 = 2^64 possible IDs. Zhou et al. [9] showed in an experimental setting with 2 million video IDs that these IDs are randomly generated from the ID space and do not follow any sequence or pattern. For each new video upload, YouTube selects an unused ID from this pool.

2) YouTube API v3 crawler: The YouTube API v3 offers an undocumented way to alter the search result. When the API keyword search is used with a string of the format "watch?v=x...z" (including the quotation marks), where "x...z" is a prefix of size 1-11 with the properties of the sets S and T, the API returns videos whose IDs start with this prefix. For example, the keyword search for "watch?v=fXEz" results in 24 mutually unrelated videos that were uploaded between two weeks and six years ago, with 0 to 5,000 views; all of the returned video IDs start with "fXEz". We noticed that there is an exception for the "-" literal: at the beginning or end of a search term, "-" serves a special function as a whitespace character. When we use "watch?v=-XEz", the YouTube API derives video IDs that merely start with "XEz", which rapidly increases the number of returned videos from 24 to over 850. The initially generated search prefix therefore has to be adjusted to not include a "-" at the beginning or end. YouTube video IDs are furthermore not case sensitive, so a query for "watch?v=fXEz" returns the same values as "watch?v=FXEZ".

It is evident that the prefix size determines the number of search results. In Table I, the prefix sizes with the corresponding number of result videos are displayed for 1,000 search requests.

TABLE I
NUMBER OF RETURNED VIDEOS PER PREFIX LENGTH FOR 1,000 API REQUESTS

Prefix length   Number of videos   Mean number per request
2               16,245,872         16,245.87
3               801,220            801.22
4               20,997             21.00
5               625                0.63
6               23                 0.02

When the prefix length is too long, the search engine might not return any results, because the probability that the prefix string is included in the YouTube ID space is very small; this especially holds true for prefix sizes greater than 5. In contrast, a small prefix size of 1-3 returns more results than can actually be retrieved, because YouTube limits the result list of each API search to a maximum of 500 videos (https://code.google.com/p/gdata-issues/issues/detail?id=4282#c24, retrieved 26.10.2015). Consequently, prefix searches with prefix sizes smaller than 4 are unsuitable, since not all results can be retrieved; by only retrieving the first 500 videos, the result list is biased towards more popular videos, because YouTube applies a relevance filter to every search by default. A prefix length of 4 returns a mean of 21 videos per search request, tested with 1,000 random requests. This is an optimal size for traversing the random YouTube video ID space, because it does not interfere with the result limit set by the API.

For our YouTube API v3 crawler, we take advantage of the explained search function modification to gather a representative sample of all available YouTube videos, by randomly generating strings of size 4 that satisfy the previously described characteristics. The performance and quality of this random API crawler are discussed in the evaluation, section V.

In addition, the random search can be enriched by applying search filters to the API request before the query is executed. This results in a huge performance benefit, because the requests are filtered upfront and not after all videos are crawled. Our API v3 crawler has built-in location and radius filters, as well as category, language, year, region, definition and type filters. Thus, the user is also able to search for a representative sample of all videos available within the specified filters. The filters are described in more detail in the design section of chapter IV.
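A sketch of the prefix generation: draw a 4-character prefix from the ID alphabet S, avoiding "-" at the edges (where the API treats it as whitespace), and wrap it in the quoted query string. The class and method names are illustrative:

import java.security.SecureRandom;

public class PrefixGenerator {
    private static final String S =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_";
    private static final SecureRandom RND = new SecureRandom();

    public static String randomPrefix(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            char c;
            do {
                c = S.charAt(RND.nextInt(S.length()));
                // re-draw if "-" would fall on the first or last position
            } while (c == '-' && (i == 0 || i == length - 1));
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The quotation marks are part of the query, as described above.
        String q = "\"watch?v=" + randomPrefix(4) + "\"";
        System.out.println(q); // pass as the q parameter of search.list
    }
}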
3) Jsoup crawler: The jsoup crawler is independent of the YouTube API. It uses the fact that each YouTube video page contains multiple references to related videos, as suggestions for what the user could view next. In an HTML document these reference links are identified by "<a href=url>" tags and can easily be extracted using the jsoup library to parse the HTML content. To ensure that the crawler does not follow links outside of YouTube, the jsoup crawler only looks for links containing the watch-URL structure described in the introduction of this section. The HTML content of a particular webpage is retrieved by issuing a GET request for that specific URL, which results in the whole HTML file being downloaded; the tool uses this to its advantage by also parsing the HTML for the video metadata. The metadata is saved in the desired file format, either XML or CSV.

In comparison to the API crawler, the jsoup crawler supports neither filtering nor the extraction of comments, because YouTube does not embed comments in the HTML code. The jsoup crawler enables the user to select which page to start crawling from: either the URL of a YouTube video, the YouTube main page, or a YouTube video ID. The crawler also remembers which links it has crawled, so it does not crawl the same page twice in one run, and the downloaded metadata consists only of unique video entries. It is important to be aware that the links the jsoup crawler follows are posted on each page by YouTube; they serve as suggestions to the viewer of what to watch next. How YouTube decides these relations between a video and its suggested videos is not officially documented; in the YouTube API it is called a "relevance filter". The jsoup crawler can thus potentially inspect these relations and give an image of what the user is met with when using YouTube.
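One crawl step can be sketched with jsoup as follows: download a watch page, collect the anchors that point to other videos, and read metadata from the same document. This is a simplified illustration under the assumption that related videos appear as ordinary /watch links, not the tool's actual code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupStep {
    public static void crawl(String startUrl) throws Exception {
        Document doc = Jsoup.connect(startUrl).get();   // HTTP GET + parse
        // Candidate related videos: anchors whose href starts with /watch.
        for (Element a : doc.select("a[href^=/watch]")) {
            String next = a.attr("abs:href");           // absolute URL
            System.out.println("related: " + next);     // enqueue if unseen
        }
        // The same document also carries metadata, e.g. the page title.
        System.out.println("title: " + doc.title());
    }
}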
V. EVALUATION

As described in the previous section, we have two data collectors using distinct gathering methods. The next step is to analyze the resulting datasets according to different metrics. We start with the crawler performance and then move to the quality of the crawled datasets. Performance is a key indicator of how well a crawler executes its task. The overall goal of this tool was not to create a massive database upfront and then allow analysis of already collected, probably outdated, data; with our approach, the user can collect a huge number of current videos on the spot. For this, the speed at which the crawler returns the video links with the corresponding metadata for each video is crucial. To compare the performance of both crawlers, we analyze several datasets between 1,000 and 100,000 crawls. Another key driver for evaluating a crawler is the quality of the results: a crawler can be insanely fast, but if it only collects bad data, the speed is valueless. The jsoup crawler specializes on the YouTube website to represent a good overview of the videos a normal user would get when clicking through the YouTube webpage. The API crawler represents a typical video distribution of the whole YouTube database.

A. Performance

TABLE II
API CRAWLER SPEED

API crawler    1,000 requests   10,000 requests   100,000 requests
1 thread       3 min 8 sec      30 min 12 sec     4 h 39 min
5 threads      32.7 sec         5 min 6 sec       49 min 31 sec
10 threads     14.7 sec         2 min 47 sec      30 min 24 sec
25 threads     8.6 sec          1 min 4 sec       10 min 6 sec
50 threads     5.2 sec          38 sec            6 min 17 sec
100 threads    5.0 sec          29 sec            4 min 11 sec

TABLE III
JSOUP CRAWLER SPEED

Jsoup crawler  1,000 requests   10,000 requests   100,000 requests
1 thread       6 min            1 h 6 min         25 h 31 min

The performance and functionality of the two crawlers contrast strongly. While the API crawler can run in multiple threads, the jsoup crawler is not able to utilize multithreading. Table III shows that the missing threading results in a much slower crawling performance for the jsoup crawler. Besides the missing thread functionality, the jsoup crawler loses most of the crawling time establishing a connection to the next website; as soon as the connection is established and the HTML code of the website is downloaded, getting all the metadata is relatively fast. Expressed in numbers, crawling 10,000 video website links takes 66 minutes, i.e. on average 2.5 video links are crawled every second. In contrast, the API crawler with the standard setting of 10 threads handles 10,000 videos in 64 seconds, which results in more than 156 videos per second.

The API crawling speed is not affected by longer crawling runs: Table II shows that 100,000 videos take almost exactly 10 times as long as 10,000 videos. This does not hold true for the jsoup crawler, where the crawling speed decreases considerably as the request amount grows. This is caused by YouTube providing many duplicates among the related videos the longer the crawler searches; for 100,000 crawling attempts, our tool had to sort out 14,521 duplicates. This results in a much longer crawling time, caused by the relatively long connection times.
Another benefit of the API crawler is that it can apply filters and include comments in the crawling process. Applying filters decreases the search speed; how strong the decrease is depends on how many filters are applied and how restrictive they are.

The procedural work for a search request with a specific keyword cannot be separated and distributed between different threads. As a result, the API crawler is limited to one thread for the keyword search, so those results are not directly comparable with the previous results from the random prefix search.

All the above crawling runs were done without comment integration, to have a better comparability with the jsoup crawler. When including the comments (even though not every video has comments), the crawling time increases significantly, depending on the number of retrieved comments. For crawling 10,000 videos, the processing time is almost doubled; the same effect can be seen for crawling 100,000 videos. The YouTube platform places a 10,000-character restriction on a single comment, which can be up to 5 times as large as all the other metadata together. Although the comment size is capped, there exists no limit on the number of comments for a video. A great deal of videos have several thousand or more comments; in the most extreme case, the music video "PSY - Gangnam Style" has roughly 5 million comments. As a result, we are forced to limit the number of retrieved comments.

The API crawler can also download the videos themselves. This cannot be done by the jsoup crawler, because this function is not implemented yet. Downloading the videos is a really time- and bandwidth-consuming task; hence the download of the videos must also be confirmed by the user beforehand. If it is confirmed, the video links are put on a list and a download thread starts to download one video after the other. The list avoids all crawled videos being downloaded concurrently; rather, they are downloaded one by one. The download and crawling time in general is heavily dependent on the processing power and Internet connection of the user, which makes it rather hard to compare exact times and create a thorough performance analysis. Regarding performance, the jsoup crawler is far behind the API crawler in every aspect.

B. Quality

To evaluate the quality of the tool, we gathered datasets of 1,000, 10,000 and 100,000 videos with both crawlers. The statistics that were generated are the distribution of the categories and upload years, and the average number of views per video. Since it is easiest to see a significant change in the dataset when going from 1,000 to 100,000, those two are the only ones included in this part. This section first looks into the category statistics, then at the distribution of the years, and at the end it compares the average view counts.

1) Category statistics:

Fig. 7. Category distribution for 100,000 crawled videos

Figure 7 depicts the distribution of the categories for the datasets of 100,000 videos from each crawler. It shows that the most dominant category for the API crawler is the "People & Blogs" category. The reason is probably that "People & Blogs" is the default setting when a user uploads a video to YouTube, and the user has to go into the advanced settings to force another category to be assigned to the video. Looking at figure 8, it does not matter for the API crawler whether the dataset contains only 1,000 or 100,000 videos: the distribution is nearly the same, which indicates the profoundly random distribution of the API crawler, even for small samples.

Fig. 8. API crawler category change with 1,000 and 100,000 videos
We also performed a search to identify when "People & Blogs" became the largest category, to see when YouTube changed its default option for new videos. This was done by requesting only videos from a specific year and comparing the distribution of the categories. The result is shown in appendix D and shows that it has been the largest category since 2010.

For the jsoup crawler, on the other hand, the Entertainment category is the most significant category. When inspecting figure 7, it is important to be aware that the initial page of the crawler was the YouTube main page. To test whether the crawler was biased towards the start page, we gathered a dataset where the initial page was a video in the Sports category from 2008 (VideoID = 4az-U8wTj2k). By comparing figure 7 and figure 10 it is obvious that an initial crawl with the jsoup crawler is biased towards the start page. Figures 9 and 10 show that by crawling more pages, the initial category becomes less significant. This could be an indication that by crawling deeper, the distribution would better reflect the YouTube database; but at the same time it is important to notice that the links found on a YouTube page are posted there by YouTube as related videos, and how YouTube defines related videos is not documented.

Fig. 9. Jsoup category distribution

Fig. 10. Jsoup category distribution when starting at a sports video from 2008
Fig. 11. Comparing publishing year for API and Jsoup crawler

Fig. 12. Comparing publishing year for API crawler
2) Year statistics: For the distribution of uploaded videos per year, the jsoup crawler and the API crawler are more aligned, see figure 11, and for both crawlers the year 2015 is significantly larger than the rest. This distribution can be explained by the enormous growth of YouTube content, as shown in the statistics from Statista from 2014 [4]. When comparing the result distributions for the API crawler while increasing the dataset from 1,000 videos by a factor of 100, the change in the distribution is very small. This again indicates that the API crawler has a good random distribution, see figure 12.

For the jsoup crawler, the situation is quite similar to the categories when looking at a dataset that starts at a specific video. Figure 13 shows a dataset whose crawl started at a sports video from 2008, and it shows that the years close to 2008 are also strongly represented. The reason for this is again the way YouTube links related videos to each other. The same figure also shows the same trend as earlier: by increasing the dataset, the result moves towards the dataset of the API search, but there is no proof that this would eventually end up in a distribution that reflects the whole YouTube database.

Fig. 13. Comparing year distribution for Jsoup crawler when starting at a sports video from 2008

Another strong indication of this biased behavior is that the number of views per video is much larger when performing a jsoup search. Crawling with the API, we get videos with around 15,000 views per video, but with the jsoup search we get over 1,634,570 views per video when crawling 100,000 videos starting at the sports video. It is very likely that popular videos are those YouTube suggests for the user to see next. This underlines our earlier statement that popular videos get more popular, while unpopular videos will not get any recommendations from YouTube.

VI. CONCLUSION AND FUTURE WORK

The objective of this work was to design and develop a research tool for gathering independent YouTube videos within given parameters and providing their metadata as well as the video download files. This corpus of videos should be a representative sample of all available videos within the YouTube video space and the chosen filter parameters. The intention behind such a tool is to provide easy access for researchers to collect this data from YouTube, and to allow a convenient way to export the information so it can be further analysed. Our tool achieves this by allowing the user to choose between three different export formats: CSV, JSON and XML.
To collect these datasets, two distinct data gathering methods were developed and included in the tool: the API v3 crawler, which relies strongly on the YouTube API, and the jsoup crawler, which is completely independent of the YouTube API and uses the jsoup library to crawl the YouTube webpage. We showed that, depending on the input filters and collection methods, the results differ considerably regarding their qualitative distribution as well as their performance. While the jsoup crawler is heavily biased towards the initial page, the random prefix sampling with the API v3 crawler provides a way to study the YouTube meta-information without bias.

Nevertheless, despite a proper number of benefits and functions, the proposed tool has some room for improvement as well. As revealed in the evaluation, the jsoup crawler succumbs to the API crawler in every aspect. Even its only benefit, not depending on the YouTube API, can be argued to be a major drawback: every time the YouTube webpage changes, the jsoup crawler has to be adjusted accordingly, while the YouTube API can be expected to run for years, with backwards compatibility when a new version is released. The jsoup crawler can be used to inspect how YouTube decides what the related videos are, and what videos are exposed to the user by the standard YouTube search; consequently, it does not depict a representative sample of all available YouTube videos.

Even though the related videos do not provide a representative sample, they can give further insights for future research. To benefit from this fact, a next step could be to implement the related-video gathering method of the jsoup crawler within the API crawler. It is expected that the search speed would be drastically improved through code optimization and multithreading. Subsequently, all functionality of the jsoup crawler could be realized in the API crawler, and the jsoup crawler could be omitted completely.

APPENDIX A
RESPONSIBILITIES

Acronyms:
• Ida Marie Frøseth: IMF
• Stefan Leicht: SL
• Richard Reimer: RR
• Viet Thi Tran: VTT

A. Research
• DASH: IMF, VTT
• Data Collection: SL, RR
• Video Download: VTT
• User Interface: IMF, RR, VTT

B. Design
• Architecture: IMF, SL, RR, VTT

C. Implementation
• API Crawler: IMF, RR
• Jsoup Crawler: IMF, SL
• Video Download: VTT
• User Interface: IMF, RR, VTT

D. Evaluation
• Performance: IMF, SL, RR
• Quality: IMF, SL, RR

E. Paper Writing
• Abstract: RR
• Introduction: IMF, RR
• Related Work: RR
• DASH: IMF, VTT
• Implementation: IMF, SL, RR, VTT
• Evaluation: IMF, SL, RR
• Conclusion & Future Work: RR

F. Figures
• Figures 3, 4, 5, 6, 7, 8, 9, 10: IMF
• Figures 11, 12, 13: SL

G. Tables
• Tables 1, 2, 3: RR

H. Appendix
• Appendix A: RR
• Appendices B, C, D: IMF
APPENDIX B
YOUTUBE MPD FILES

Fig. 14. Content of inspected MPD files (columns: VideoID, year, duration, quality, number of periods, and the number of audio/video representations in the mp4 and WebM containers)
APPENDIX C
DOWNLOADED JSON STRUCTURE

The following listing shows the output format of the tool when using JSON as the export format.

{
  "kind": "youtube#video",
  "etag": etag,
  "id": string,
  "snippet": {
    "publishedAt": datetime,
    "channelId": string,
    "title": string,
    "description": string,
    "thumbnails": {
      (key): {
        "url": string,
        "width": unsigned integer,
        "height": unsigned integer
      }
    },
    "channelTitle": string,
    "tags": [ string ],
    "categoryId": string,
    "liveBroadcastContent": string,
    "defaultAudioLanguage": string
  },
  "contentDetails": {
    "duration": string,
    "dimension": string,
    "definition": string,
    "caption": string,
    "licensedContent": boolean,
    "regionRestriction": {
      "allowed": [ string ],
      "blocked": [ string ]
    },
    "contentRating": {
      "acbRating": string,
      "agcomRating": string,
      "anatelRating": string,
      ...
    }
  },
  "status": {
    "uploadStatus": string,
    "failureReason": string,
    "rejectionReason": string,
    "privacyStatus": string,
    "publishAt": datetime,
    "license": string,
    "embeddable": boolean,
    "publicStatsViewable": boolean
  },
  "statistics": {
    "viewCount": unsigned long,
    "likeCount": unsigned long,
    "dislikeCount": unsigned long,
    "favoriteCount": unsigned long,
    "commentCount": unsigned long
  },
  "player": {
    "embedHtml": string
  },
  "topicDetails": {
    "topicIds": [ string ],
    "relevantTopicIds": [ string ]
  },
  "recordingDetails": {
    "locationDescription": string,
    "location": {
      "latitude": double,
      "longitude": double,
      "altitude": double
    },
    "recordingDate": datetime
  },
  "comments": {
    "comment": [ {
      "kind": "youtube#comment",
      "etag": etag,
      "id": string,
      "snippet": {
        "channelId": string,
        "videoId": string,
        "textDisplay": string,
        "textOriginal": string,
        "parentId": string,
        "authorDisplayName": string,
        "authorProfileImageUrl": string,
        "authorChannelUrl": string,
        "authorChannelId": {
          "value": string
        },
        "authorGoogleplusProfileUrl": string,
        "canRate": boolean,
        "viewerRating": string,
        "likeCount": unsigned integer,
        "moderationStatus": string,
        "publishedAt": datetime,
        "updatedAt": datetime
      }
    } ]
  },
  "videoLinks": {
    "singleDownloadLink": [ {
      "itag": integer,
      "url": string
    } ]
  }
}
APPENDIX D
LARGEST YOUTUBE CATEGORY BY YEAR

Fig. 15. The change in largest category on YouTube by year (share in % per year, 2007-2014, for the categories People & Blogs, Music, Entertainment, Sports, Comedy, Gaming, and News & Politics)
REFERENCES

[1] Benjamin Guthier, Rajwa Alharthi, Rana Abaalkhail, and Abdulmotaleb El Saddik. Detection and visualization of emotions in an affect-aware city. In Proceedings of the 1st International Workshop on Emerging Multimedia Applications and Services for Smart Cities, EMASC '14, pages 23-28, New York, NY, USA, 2014. ACM.
[2] Juan Cao, Yong-Dong Zhang, Yi-Cheng Song, Zhi-Neng Chen, Xu Zhang, and Jin-Tao Li. MCG-WEBV: A benchmark dataset for web video analysis. Beijing: Institute of Computing Technology, 10:324-334, 2009.
[3] Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345-379, 2010.
[4] YouTube. YouTube: hours of video uploaded every minute 2014 | Statistic. http://www.statista.com/statistics/259477/hours-of-video-uploaded-to-youtube-every-minute/.
[5] Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, and Sue Moon. I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pages 1-14. ACM, 2007.
[6] Junghoo Cho and Sourashis Roy. Impact of search engines on page popularity. In Proceedings of the 13th International Conference on World Wide Web, pages 20-29. ACM, 2004.
[7] Xu Cheng, Cameron Dale, and Jiangchuan Liu. Understanding the characteristics of internet short video sharing: YouTube as a case study. arXiv preprint arXiv:0707.3670, 2007.
[8] Gabor Szabo and Bernardo A. Huberman. Predicting the popularity of online content. Communications of the ACM, 53(8):80-88, 2010.
[9] Jia Zhou, Yanhua Li, Vijay Kumar Adhikari, and Zhi-Li Zhang. Counting YouTube videos via random prefix sampling. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement, pages 371-380. ACM, 2011.
[10] Cisco. Cisco visual networking index: Forecast and methodology, 2014-2019 white paper. Technical report, Cisco, May 2015.
[11] Thomas Stockhammer. Dynamic adaptive streaming over HTTP: standards and design principles. In Proceedings of the Second Annual ACM Conference on Multimedia Systems, MMSys '11, pages 133-144, New York, NY, USA, 2011. ACM.
[12] Christian Trimmer. Dynamic adaptive streaming over HTTP (DASH): Past, present, and future. http://www.streamingmediaglobal.com/Articles/Editorial/Featured-Articles/Dynamic-Adaptive-Streaming-over-HTTP-(DASH)-Past-Present-and-Future-93275.aspx.
[13] Sotiris Antoniadis. MPEG-DASH - multimedia streaming over wireless/mobile networks. http://santoniadis.blogspot.no/2014/01/mpeg-dash-multimedia-streaming-over.html, 2014.
[14] ISO/IEC 23009-1:2014. Information technology - dynamic adaptive streaming over HTTP (DASH) - part 1: Media presentation description and segment formats, 2014.
[15] YouTube. YouTube player API reference for iframe embeds. https://developers.google.com/youtube/iframe_api_reference, 2014.
[16] YouTube. YouTube Android player API. https://developers.google.com/youtube/players/android_player_api, 2015.
[17] YouTube. Embed YouTube videos in iOS applications with the YouTube helper library. https://developers.google.com/youtube/v3/guides/ios_youtube_helper, 2014.