Deliverable 2.1 Report on Qualitative Crowdsourced and Open Data Filtering Methodology

Ref. Ares(2018)3483820 - 30/06/2018

            Deliverable 2.1
Report on Qualitative Crowdsourced and
   Open Data Filtering Methodology

 This project has received funding from the European Union’s Horizon 2020 Research and Innovation
                            Programme under Grant Agreement No. 780121
PTwist – GA No. 780121                               D2.1 – Report on Qualitative Crowdsourced and
 H2020 ICT-11-2017                                                  Open Data Filtering Methodology

Copyright
© Copyright 2018 The PTwist Consortium

Consisting of:

    •   ARISTOTELIO PANEPISTIMIO THESSALONIKIS
    •   FACHHOCHSCHULE ZENTRALSCHWEIZ - HOCHSCHULE LUZERN
    •   NUROGAMES GMBH
    •   BETTER FUTURE FACTORY BV
    •   ALMERYS
    •   EOLAS S.L.
    •   DIKTYO MESOGEIOS SOS
    •   STICHTING BLUECITY
    •   TEKNOLOJI ARASTIRMA GELISTIRME ENDUSTRIYEL URUNLER BILISIM TEKNOLOJILERI SANAYI VE
        TICARET ANONIM TICARET

This document may not be copied, reproduced, or modified in whole or in part for any purpose without
written permission from the PTwist Consortium. In addition, an acknowledgement of the authors of the
document and all applicable portions of the copyright notice must be clearly referenced.

All rights reserved.

This document may change without notice.


 Document Classification

 Title                      Report on Qualitative Crowdsourced and Open Data Filtering Methodology
 Deliverable                D2.1
 Type                       R: Report
 Work Package               WP2 – Pilots Requirements and Data Modelling
 Partners                   AUTH
 Authors                    Ilias Dimitriadis
 Dissemination Level        PU (Public)

 Abstract

This document describes the topic detection filtering methodology that has been developed, based on the data
collected from social media and open data sources. It includes an analysis of the requirements as described in
the PTwist project proposal and the overall system design for delivering high quality content. Moreover, it
includes a detailed evaluation of the whole process, followed by specific examples and charts.

 Version Control

 Version    Description                                          Name                      Date
 1.0        Initial draft                                        Ilias Dimitriadis         11 Jun 2018
 1.1        Added Primitive Filtering section                    Ilias Dimitriadis         18 Jun 2018
 2.0        Total update of document                             Ilias Dimitriadis         25 Jun 2018


Table of Contents

1. Executive Summary
2. Introduction
3. Primitive Filtering
   3.1. Social Media Sources
       3.1.1. Facebook
       3.1.2. Twitter
       3.1.3. Flickr
   3.2. Open Data Sources
       3.2.1. Precious Plastics
       3.2.2. Thingiverse
4. Dynamic Content Filtering
   4.1. Identifying Content in Twitter
       4.1.1. First evaluation of the collected data
       4.1.2. Profanity Filtering
       4.1.3. Filtering data from specific Twitter users
       4.1.4. Filtering data from specific Facebook Pages
       4.1.5. Filtering data from Thingiverse and Flickr
5. Data Streams & User Filtering
   5.1. Identifying Influential Users
       5.1.1. Evaluation of the Influencer filtering process
   5.2. Selecting high quality content
       5.2.1. Selecting top tweets
       5.2.2. Selecting top URLs
       5.2.3. Filtering content to produce tag clouds
       5.2.4. Filtering Content to produce location heatmaps
       5.2.5. Topic extraction
6. Conclusions and Future Work
7. References


Table of Figures
Figure 1: PTwist architecture
Figure 2: Social Network Statistics
Figure 3: Facebook users
Figure 4: Twitter active users
Figure 5: Twitter users’ age distribution
Figure 6: Crowdsourcing tool components
Figure 7: Twitter JSON object
Figure 8: Database Schema
Figure 9: English Tweets DB
Figure 10: Dutch Tweets DB
Figure 11: German Tweets DB
Figure 12: Greek Tweets DB
Figure 13: Before and after updating the keywords
Figure 14: Updating influential users’ posts
Figure 15: Thingiverse DB schema
Figure 16: Open data repository example
Figure 17: Sample Image of the 100 most influential users in a social network
Figure 18: Twitter data to MongoDB flow
Figure 19: Nodes and Edges in JSON document
Figure 20: NetShield pseudo-algorithm
Figure 21: Influencer Detection Demo
Figure 22: Top Tweets Demo
Figure 23: Top tweets favourite count
Figure 24: Top tweets replies count
Figure 25: Top URLs Demo
Figure 26: JSON schema for wordcount
Figure 27: Tag cloud Demo
Figure 28: Location heatmap
Figure 29: LDA topic modelling Demo


1. Executive Summary

This document presents the data filtering methodology that has been used so far in the Crowdsourced Data
Collection and Analysis process, as described in Task 2.1 of the official PTwist proposal. The filtering
process consists of three main stages:

    1. Primitive Filtering, in charge of identifying the Social Media sources which will be used for the
       collection of data.
    2. Dynamic Content Filtering, in charge of identifying terms related to plastic and plastics reuse.
    3. Data Streams & User Filtering, in charge of:
           a. Identifying influential (expert) users on plastic re-use topics.
           b. Filtering incoming data in order to produce high quality content.
           c. Updating the keywords used to collect data.

In addition to describing the filtering process, the deliverable provides a detailed evaluation of the
filtering methodology. The evaluation gives specific examples for each of the filtering phases mentioned
above, except for the primitive filtering phase.

Considering the three-layer approach of the PTwist platform, Data Filtering is the responsibility of the
Crowdsourcing Component. Crowdsourcing will be used as an insightful barometer of plastic-related topics, able
to detect new trends, identify interesting content, spot influential users and generally raise awareness of
the problem of plastic overuse. The crowdsourcing tool as a whole will be presented in month 10 of the
project.


2. Introduction

As a reminder, the PTwist architecture is composed of three layers: the Application Layer, the Processing
Middleware Layer and the Peer to Peer Blockchain Layer. This deliverable refers to parts of the first two
layers, as presented in Figure 1.

                                         Figure 1: PTwist architecture

The Plastics crowdsourced topic observatory and the Open data & plastics designs collection are built upon
the intelligence extracted from data collected from Social Media and Open Data sources.

   •   The first filtering phase, described in section 3, focuses on the data sources that will be used and
       explains the reasons behind the final selection.
   •   The second filtering stage, described in section 4, presents the initial set of terms which will be
       used to collect relevant content. It also presents a detailed evaluation of the data that have been
       collected so far using this methodology.
   •   To make sure that the content of both components is reliable and of high quality, the data
       collected in raw format must be filtered and processed. This processing takes place in the Middleware
       layer, which is responsible for all processing of the data produced in the PTwist platform. The
       third filtering stage, described in section 5, presents:


            ❖ The methodology behind the discovery of influential users, i.e. the algorithms that filter
              the whole set of users making plastic related posts in order to detect those who are
              considered experts in their field (in our case plastic re-use, plastic pollution, etc.).
            ❖ The process behind selecting content based on its popularity among the social network
              users.
            ❖ The methodology behind the iterative process of updating the filters used to collect data.

Finally, section 6 (Conclusions and Future Work) recapitulates what has been presented, explains how it is
linked with the work carried out in work package two (WP2) and outlines the plan for the next steps.


3. Primitive Filtering

This section reviews the accessibility of data from Social Media and Open Data sources, discusses their
advantages and disadvantages, and concludes with the final selection of the sources that will be used
throughout the PTwist project.

3.1. Social Media Sources
The crowdsourcing tool will be built upon the intelligence extracted from users’ posts in Social Networks.
However, very few of the popular Social Networks give developers access to their data, even when the posts
are public. Figure 2 depicts the ranking of the most popular social networks based on the number of active
users.

                                        Figure 2: Social Network Statistics


Most of the Social Networks in the list above are either not related to the scope of the PTwist project or
do not offer an official API that provides access to publicly shared posts or images. Examples include the
following:

    ▪   Instagram [1]: Although access to Instagram’s public posts would be really valuable, the official
        Instagram API [2] does not allow anyone other than the user who created a post to retrieve
        it.
    ▪   Pinterest [3]: Pinterest could be used as a pool of plastic re-use ideas, because of the creative nature
        of most of its users. Although version 1 of the official Pinterest API allowed developers to
        search for and collect publicly published posts, the latest version 2 of the Pinterest API [4] does not
        allow searching for posts by a given “pin” (a pin is a tag for each image posted on
        Pinterest).
    ▪   Tumblr [5]: Tumblr is a microblogging and social networking website, where each user can have
        his/her own blog. Tumblr could be used as a source of opinions and ideas regarding plastic. It does
        offer an open API [6], but it is quite restricted. Although it allows the developer to make
        1000 calls/hour or 5000 calls/day using certain tags as query terms, it returns a maximum of 20
        results, which remain the same after each call. Collecting data that refer to the same 20 accounts
        would produce a biased dataset. Moreover, a manual search for topics related to plastic reuse and
        recycling showed that Tumblr is not particularly popular for posting such content.

Therefore, the PTwist crowdsourcing tool will offer information and knowledge that has been collected by
analyzing data on the following social networks:

    •   Facebook
    •   Twitter
    •   Flickr

3.1.1. Facebook
Facebook [7] is the most popular social network globally, with the highest user engagement worldwide (Figure
3). More than 500 Terabytes of data are stored on a daily basis, which suggests that a great deal of
information is available on almost any subject.

                                             Figure 3: Facebook users


However, its official API [8] does not offer search by a specific term, so even public posts containing the
term are not discoverable. It allows access only to publicly open Facebook pages. Moreover, it has a very
strict rate limit (200 calls per user), which makes data collection an ongoing challenge. In the PTwist
project, data are collected from certain Facebook pages that have been provided as input by the pilot
partners. Further information regarding these specific pages can be found in section 4.
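
As an illustration of working within such a limit, the sketch below splits a fixed call budget as evenly as
possible across the pages supplied by the pilot partners. The function name `plan_page_calls`, the page
identifiers and the default budget of 200 are assumptions made for this example; the time window to which the
limit applies is not specified here, and the code does not call the Facebook API.

```python
def plan_page_calls(page_ids, call_budget=200):
    """Split a fixed call budget as evenly as possible across the given pages.

    Hypothetical helper: it only computes an allocation and performs no API calls.
    """
    if not page_ids:
        return {}
    base, extra = divmod(call_budget, len(page_ids))
    # The first `extra` pages receive one additional call so the whole budget is used.
    return {page: base + (1 if i < extra else 0) for i, page in enumerate(page_ids)}


# Placeholder page identifiers, not real Facebook pages.
plan = plan_page_calls(["page_a", "page_b", "page_c"])
print(plan)
```

With three pages and a budget of 200 calls, two pages receive 67 calls and one receives 66, so that no call
is left unused.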

3.1.2. Twitter
Although Twitter [9] is not among the top ten Social Networks worldwide (based on the number of users),
it remains the most popular one for social media research, both in academia and in industry. This does
not mean that its number of active users is low; on the contrary, during the 1st quarter of 2018 the
number of monthly active users was around 336 million (Figure 4).

                                           Figure 4: Twitter active users

Twitter is unique in the sense that its official API [10] provides almost 100% coverage of its data.
Another important fact about Twitter is the age distribution of its users. As presented in Figure 5, the vast
majority of Twitter’s users are between the ages of 18 and 64, a more mature audience than that offered
by other social networks.


                                       Figure 5: Twitter users’ age distribution

In the PTwist project, Twitter will play the role of the main data source for the crowdsourcing tool. More
information regarding data that have been collected using a term filtering method is provided in section 4.

3.1.3. Flickr
Flickr [11] is a photo sharing platform and social network where users upload photos for others to see.
Although it is not the most popular photo sharing network, it still holds a community of 120 million users.
Flickr also offers an official API [12], which provides access to all the publicly shared images on the platform.
The Flickr API will provide access to images that have been tagged with plastic related terms. These images
will be used in the open data and plastic design collection.
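
A minimal sketch of how such a tag-based query could be expressed, assuming the `flickr.photos.search` method
of the Flickr REST API. The helper name `build_flickr_tag_search_url`, the placeholder API key and the chosen
tags are illustrative assumptions; the code only builds the request URL and does not contact Flickr.

```python
from urllib.parse import urlencode

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_flickr_tag_search_url(api_key, tags, per_page=100, page=1):
    """Build a flickr.photos.search URL for photos carrying any of the given tags."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "tags": ",".join(tags),  # comma-delimited list of tags
        "tag_mode": "any",       # a photo matches if it carries at least one tag
        "per_page": per_page,
        "page": page,
        "format": "json",
        "nojsoncallback": 1,     # ask for plain JSON instead of a JSONP wrapper
    }
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder; the tags are taken from the English keyword list.
url = build_flickr_tag_search_url("YOUR_API_KEY", ["plastic", "upcycling", "recycle"])
print(url)
```

Setting `tag_mode` to `any` keeps the query broad; switching it to `all` would return only the images that
carry every listed tag.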

3.2. Open Data Sources
Data collected from such sources will form the content of the open data and plastic design collection
repository. Since the repository will provide access to plastic reuse ideas, 3D printer designs, etc., the
available data sources are limited.

3.2.1. Precious Plastics
Precious Plastic [13] is a global community of hundreds of people working towards a solution to plastic
pollution. Knowledge, tools and techniques are shared online, for free. PTwist will offer access to this
knowledge by inter-linking the open data repository with the one provided by Precious Plastic.


3.2.2. Thingiverse
Thingiverse [14] is a thriving design community for discovering, making, and sharing 3D printable things and
the world's largest 3D printing community. Thousands of 3D printing designs are available through its open
API [15]. PTwist offers direct access to the Thingiverse database, including designs that will be implemented
by PTwist users and pilots in a specific Thingiverse group.

                                     Figure 6: Crowdsourcing tool components


4. Dynamic Content Filtering

This section describes in detail the process of filtering the data that will be used in the crowdsourcing
tool, based on their content. The PTwist thematic focuses on plastic waste, plastic recycling, plastic reuse,
etc., as described in the official PTwist proposal. Collecting data from various sources requires specific
filters, so that the content of the collected data remains as close to the “plastic” thematic as possible.

4.1. Identifying Content in Twitter
Although the original goal of Social Networks, in this case Twitter, was to strengthen the social interactions
between their users, this has changed over time. Twitter can now be considered a powerful real-time
human-powered sensor, which is hard to dispute given its 336 million monthly active users. Its content is
updated incessantly and can affect opinions and behaviours; it can be used as a prediction tool, and the
real-time information that derives from microblogs like Twitter is useful for many kinds of applications. In
this case, it will be used as a topic observatory for plastic.

Twitter allows users to search for specific keywords or user accounts. Since PTwist focuses on plastic, the
pilot partners provided AUTH with a list of specific keywords and a list of user accounts that have
been classified as experts. The multidisciplinary nature of the project, in which each pilot approaches
plastic from a different perspective, allows the creation of a keyword list that is both wide and focused.
Each pilot has contributed to the creation of that list, which is then translated into the native language of
each pilot in order to collect data that reflect trends and interesting topics in each pilot’s country. These
keywords are then classified using a taxonomy provided by the pilot partners, offering an initial set of
different topics. The taxonomy and the keywords for each class are presented below:

English                Dutch            German                  Greek                                               Groups
plastic                plastic          Plastik                 πλαστικό                                            General terms
single use plastic                      Einwegplastik           πλαστικό μιας χρήσης                                General terms
reuse                  hergebruik       Wiederverwendung        επαναχρησιμοποίηση                                  General terms
reduce                 verminderen      Reduktion               μείωση                                              General terms
recycle                recycleren       rezyklieren             ανακύκλωση                                          General terms
upcycling              upcycling        upcyclen                upcycling                                           General terms
downcycling                                                     downcycling                                         General terms
waste                  afval            Abfall                  απόβλητα                                            General terms
litter                 zwerfafval       Abfall                  σκουπίδια / απορρίμματα                             General terms
plastic soup           plastic soep     Plastiksuppe            πλαστική σούπα                                      General terms
zero-waste             zero-waste       Null-Abfall             μηδενικά απόβλητα                                   General terms
no-waste               afvalvrij                                χωρίς απόβλητα                                      General terms
plastic free           plastic vrij     frei von Plastik        χωρίς πλαστικό                                      General terms
virgin plastics        virgin plastics                          καθαρό/ καινούριο/ πρωτογενές πλαστικό              General terms
deposit fee            statiegeld       ohne Inhaltsstoffe      τέλος ταφής                                         General terms
recycling fee                                                   τέλος ανακύκλωσης                                   General terms
deposit return system                                           σύστημα επιστροφής, συλλογής / σύστημα εγγυοδοσίας  General terms
pollution              vervuiling       Umweltverschmutzung     ρύπανση                                             General terms
packaging                                                       συσκευασία                                          General terms
Eco-design                              Ökodesign                                                                   General terms
End-of-waste                                                                                                        General terms
Microplastic                            Mikroplastik                                                                General terms
"Unrecyclable"                          Nichtrezyklierbar                                                           General terms
Jetsam                                  Strandgut                                                                   General terms
single-use products                     Einwegprodukt                                                               General terms
Bag                                     Tüte                                                                        General terms
Plastic tax                             Plastiksteuer                                                               General terms
Bio-based packaging                     Biobasierte Verpackung                                                      General terms
Recyclability                           Rezyklierbarkeit                                                            General terms
Waste recovery                                                                                                      General terms
Anthropogenic Litter                                                                                                General terms
Incineration                                                                                                        General terms
Trash                                   Müll                    σκουπίδια                                           General terms
                                                  Table 1: General Terms

English           Dutch            German                      Greek                  Groups
Straws            Rietje           Strohhalme                  Καλαμάκια              Product
plastic cup       Plastic beker    Plastiktassen               Κύπελλα / Ποτήρια      Product
plastic bottle    plastic fles     Plastikflasche              Μπουκάλι               Product
plastic cap       plastic dop      Plastikdeckel               Καπάκι                 Product
Wrapping          Verpakking       Verpackung                  Περιτύλιγμα            Product
Foil              Folie            Folie                       Αλουμινόχαρτο          Product
Filament          Filament         Filament                    Νήμα                   Product
plastic bag       plastieken zak   Plastiktüte                 πλαστικές σακούλες     Product
Sachet            Zakje            Beutel                      Φακελάκι               Product
                                          Table 2: Products

English           Dutch            German                      Greek                  Groups
Shredder          Vermaler         Reisswolf                   Τεμαχιστής             Machines
Extruder                           Extruder                    Extruder               Machines
Ultimaker                          Ultimaker                   Ultimaker              Machines
3D printer                         3D-Drucker                  Τρισδιάστατος εκτυπωτής Machines
Container         Container        Container                   Περιέκτης              Machines
                                          Table 3: Machines

 English             Dutch               German                           Greek               Groups
 Granulation         Granuleren          Granulation                      Κοκκοποίηση         Process
 Molding             Spuitgieten         Formen                           Καλούπι             Process
 Injection           Injectie            Injektion                        Έγχυση              Process
                                               Table 4: Processes

 English                Dutch                     German                     Greek                     Groups
 Compostable            Composteerbaar            compostierbar              Κομποστοποιήσιμο          Innovations
 Biodegradable          biologisch afbreekbaar    biologisch abbaubar        Βιοαποικοδομήσιμο         Innovations
 Coating                Coating                   Beschichtung               Επικάλυψη                 Innovations
 Bioplastics            Bioplastic                biologischer Kunststoff    Βιοπλαστικά               Innovations
 Biobased               Biobased                  biobasiert                 Biobased                  Innovations
 Sea-weed packaging     Zeewierverpakking         Seegrasverpackung          Συσκευασία από φύκια      Innovations
                        Meelmotlarwe                                                                   Innovations
                        Pyrolyse                                                                       Innovations
                                              Table 5: Innovations

Twitter’s Streaming API has been used to collect all tweets that contain any of the keywords. However, as
already mentioned, the keywords span four different languages, so language must be taken into account as an
extra filtering parameter. Each response returned by the Twitter Streaming API is in JSON format and
corresponds to a single tweet, packed together with multiple types of information, as presented in Figure 7.

                                          Figure 7: Twitter JSON object

The “text” field of the JSON document holds the textual content of each tweet. If any of the keywords
matches any of the words in this field, the tweet is collected. The second part of the filtering process concerns
the language of the tweet, which is given in the “lang” field. For each language there is a separate
MongoDB [16] database with five collections, one per group of keywords. If the content of “lang” is one of the
supported languages [el, en, nl, de], the collected tweet is stored in the corresponding database, in the
collection that matches the group of the keyword it contains, as shown in Figure 8.
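
The routing logic described above can be sketched as follows. This is a minimal illustration, not the crawler used in the project: the function name route_tweet, the group names and the short keyword excerpts are hypothetical, and the sketch only decides where a tweet would be stored (language database and keyword-group collections) without touching a real database.

```python
SUPPORTED_LANGS = {"el", "en", "nl", "de"}

# Hypothetical excerpt of the keyword groups (the full lists are in Tables 1-5).
KEYWORD_GROUPS = {
    "general_terms": {"pollution", "packaging", "trash"},
    "products": {"straws", "sachet"},
    "machines": {"shredder", "extruder"},
    "processes": {"granulation", "molding"},
    "innovations": {"compostable", "biodegradable"},
}


def route_tweet(tweet, keyword_groups=KEYWORD_GROUPS):
    """Return (language, [collections]) for a tweet, or None if it is discarded."""
    lang = tweet.get("lang")
    if lang not in SUPPORTED_LANGS:
        return None  # language filter: only el, en, nl, de are kept
    words = set(tweet.get("text", "").lower().split())
    # A tweet goes to every collection whose keyword group it matches.
    groups = sorted(g for g, kws in keyword_groups.items() if words & kws)
    if not groups:
        return None  # keyword filter: no keyword found in the text
    return lang, groups
```

The returned language would select the database and each returned group name the collection in which the full JSON document is stored.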

                                            Figure 8: Database Schema

4.1.1. First evaluation of the collected data
Using the process described above, AUTH started collecting and filtering data on 12 March 2018. After
two months, a total of 12.5 million tweets had been collected, of which more than 90% were in English.
Summary statistics per language database are presented below:

[Pie chart: Tweets - English DB. Legend: Innovations, Machines, Processes, Products, General terms.
Slice values as extracted: 8161, 695577, 271234, 2186228, 9018546.]

                                          Figure 9: English Tweets DB

[Pie chart: Tweets - Dutch DB. Legend: Innovations, Machines, Processes, Products, General terms.
Slice values as extracted: 1543, 8200, 1573, 19993, 97585.]

                                          Figure 10: Dutch Tweets DB

[Pie chart: Tweets - German DB. Legend: Innovations, Machines, Processes, Products, General terms.
Slice values as extracted: 58, 10535, 18636, 57642, 17492.]

                                  Figure 11: German Tweets DB

[Pie chart: Tweets - Greek DB. Legend: Innovations, Machines, Processes, Products, General terms.
Slice values as extracted: 53, 163, 13579, 22256, 0.]

                                   Figure 12: Greek Tweets DB

By applying some simple techniques to get an initial picture of the crowd’s stance on plastic, AUTH
produced some basic tag clouds, which revealed problems with the terms used to collect the data.
The main problem was that some of the terms were too generic; for example, the term “bag” did not cover
just the plastic thematic but also returned tweets about fashion. For this reason, all the tweets stored up to
that moment were filtered so that only tweets containing both the word “plastic” AND one of the proposed
keywords were kept. Once this filtering of the existing data had finished, the keywords used for collection
were updated accordingly. After applying this filter, updating the terms and three months of data collection,
the dataset was reduced to a total of approximately 1.5 million tweets. Figure 13 depicts the difference in the
number of collected tweets before and after the filtering process.

                                  Figure 13: Before and after updating the keywords
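
The refined rule (the word “plastic” AND one of the proposed keywords) can be illustrated with a short sketch. The function name matches_refined_filter is hypothetical, the keyword set is assumed to hold lowercase single-word terms, and the anchor word “plastic” is the English one; the equivalent anchor for the other languages is an assumption that would have to be supplied.

```python
import re


def matches_refined_filter(text, keywords, anchor="plastic"):
    """Return True if the text contains the anchor word AND at least one keyword."""
    # Split on non-word characters so that "bag," or "bag." still match "bag".
    tokens = set(re.findall(r"\w+", text.lower()))
    return anchor in tokens and bool(tokens & keywords)
```

With this rule, a tweet such as “Love my new leather bag” is discarded, while “New plastic bag ban announced” is kept.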

4.1.2. Profanity Filtering
A large percentage of the content shared on Twitter may be noisy, spam or offensive. As such, in order to
ensure the quality of the final delivered content, the use of a profanity filter is considered necessary.
Twitter itself does not provide any automatic filtering, except for cases in which other users flag a certain
post as possibly offensive. Moreover, the JSON document of each tweet contains a sensitive-content field, but
it is set by the posting user and thus cannot be considered a reliable indicator of inappropriate
content.

In our initial evaluation of the collected data we did not notice any abusive content, apart from some
swearing. This is mainly because the plastic thematic does not offer fertile ground for adult
content. Nevertheless, a typical text filter has been used to remove from the dataset tweets whose text
includes inappropriate words. After detecting such tweets, the second step is to identify the user that posted
each of them. All users identified as possibly “malicious” are blacklisted and are subsequently ignored
by the Twitter crawler.
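
A minimal sketch of this two-step process (word-based filtering, then blacklisting of the author) is given below. The function name filter_profanity is hypothetical and the list of inappropriate words is assumed to be supplied by the caller as a set of lowercase words; the actual word list used in the project is not specified here.

```python
import re


def filter_profanity(tweets, bad_words, blacklist=None):
    """Drop tweets containing any word in bad_words and blacklist their authors.

    Returns the list of clean tweets and the updated set of blacklisted user ids.
    """
    blacklist = set() if blacklist is None else set(blacklist)
    clean = []
    for tweet in tweets:
        user_id = tweet["user"]["id"]
        if user_id in blacklist:
            continue  # authors already blacklisted are ignored entirely
        tokens = set(re.findall(r"\w+", tweet["text"].lower()))
        if tokens & bad_words:
            blacklist.add(user_id)  # step 2: remember who posted the tweet
            continue
        clean.append(tweet)
    return clean, blacklist
```

The returned blacklist can be passed back in on the next run, so that tweets from previously flagged users are skipped before any text matching takes place.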

4.1.3. Filtering data from specific Twitter users
Apart from collecting streaming tweets, AUTH has collected data for a number of users individually using
Twitter’s Search API, which allows the collection of the last 3600 tweets for each specified account. This initial
set of users was also selected by the pilot partners and is listed in Table 6:

  Account                     Type                    Account                         Type
  Supporter van Schoon        Government Program      PlasticWhale                    Social Enterprise
  Nedvang                     Government Program      TheOceanCleanUp                 Social Enterprise
  PlasticsHeroes              Government Program      Ioniqa                          Industry
  Beach Cleanup               NGO                     GreenWavePlastics               Industry
  Searious Business           NGO                     Qualitive Circular Polymers     Industry
  PlasticOceanFoundation      NGO                     Renewi                          Industry
  PlasticSoupFoundation       NGO                     Coolrec                         Industry
  PlasticSoupSurfers          NGO                     Van Gansewinkel                 Industry
  WasteFreeOceans             NGO                     LogicWaste                      Industry
  PreciousPlastics            NGO                     Eastman Chemical Company        Industry
  WASTED                      NGO                     RDM Makerspace                  Fablab
  Recycled Park               NGO                     De Waag Society                 Fablab
  Zwerfie                     NGO                     Bouwkeet                        Fablab
  BetterFutureFactory         Social Enterprise       Stadslab Rotterdam              Fablab
  Refil                       Social Enterprise       Makerversity                    Fablab
  PerpetualPlasticProject     Social Enterprise       Mediamatic                      Fablab
  NewMarble                   Social Enterprise       The Green Village               Fablab
  Dopper                      Social Enterprise       PreciousPlasticsGreece          NGO
  Plastic Circle              Social Enterprise       EllenMcArthurFoundation         Thought Leader
  #dHubs                      Social Enterprise
  Biofutura                   Social Enterprise
  Milgro                      Social Enterprise
  Van Plestik                 Social Enterprise
  Community Plastics          Social Enterprise
  GreatBubbleBarrier          Social Enterprise
                                             Table 6: Initial set of users

These data were not filtered further, because they will be used to infer interesting topics that may
be related to plastic. However, since it is important to track the impact that posts published by influential
users have on other users, a method for updating the most recent posts has been developed (see Figure 14).

                                     Figure 14: Updating influential users’ posts

An iterative process makes sure that the impact of each post is tracked. For example, a post published by
National Geographic may have been collected within the same hour it was posted, before it had enough time
to spread across the Twitter user pool. As a result, the number of people who had added this post
to their favourites (the equivalent of Likes on Facebook) at collection time could be quite small. A one-off
evaluation of the post would lead to wrong conclusions (e.g. that the post did not “influence” many
people). Using the methodology described above, we make sure that the impact of all posts is kept up to date,
so that every post is evaluated accordingly.
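
One pass of such an update loop might look like the sketch below. It is only an illustration of the idea: the function name refresh_engagement and the fetch_latest callback (which stands in for whatever API call returns the current version of a post) are hypothetical, while retweet_count and favorite_count are the counter fields found in a tweet’s JSON document.

```python
def refresh_engagement(stored_posts, fetch_latest):
    """Refresh the engagement counters of already stored posts in place.

    fetch_latest(post_id) must return the current JSON document of the post,
    or None if the post is no longer available. Returns the number of posts
    whose counters changed.
    """
    changed_posts = 0
    for post in stored_posts:
        latest = fetch_latest(post["id"])
        if latest is None:
            continue  # deleted or unavailable: keep the last known counters
        changed = False
        for field in ("retweet_count", "favorite_count"):
            new_value = latest.get(field)
            if new_value is not None and new_value != post.get(field):
                post[field] = new_value
                changed = True
        changed_posts += changed
    return changed_posts
```

Running this pass periodically (for example from a scheduled job) keeps the stored counters close to the live ones, so a post collected minutes after publication is not permanently ranked by its initial, near-zero engagement.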

4.1.4. Filtering data from specific Facebook Pages
Using Facebook’s Graph API, AUTH also collected data from the Facebook pages of the accounts listed in
Table 6 (where available). As with the data collected from Twitter for these specific users, the data did not
pass through any filtering process at this point, as they will be used later on for the extraction of
plastic-related topics. The iterative update process described in Figure 14 is applied to these data as well.

4.1.5. Filtering data from Thingiverse and Flickr
AUTH has used the Thingiverse API and the Flickr API to develop an open data repository, which will be
accessed via the PTwist Crowdsourcing platform. Both APIs return data in JSON format (see Figure 15).

                                          Figure 15: Thingiverse Db schema

Regarding Thingiverse, no filtering has yet been applied, since Thingiverse is a repository of open designs
under free licences. All the designs can be accessed via the PTwist platform. At present, the demo version of
the crowdsourcing tool displays information for the top 30 designs (see Figure 16), sorted by number of
downloads (the number of downloads is included as an attribute in the JSON document that describes each
design).

                                       Figure 16: Open data repository example
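
Selecting the top 30 designs by downloads is a simple ranking step, sketched below. The attribute name download_count is an assumption used for illustration; the actual name is the one shown in the schema of Figure 15.

```python
import heapq


def top_designs(designs, n=30, key="download_count"):
    """Return the n designs with the most downloads, highest first."""
    # nlargest avoids sorting the whole repository when only n items are shown.
    return heapq.nlargest(n, designs, key=lambda d: d.get(key, 0))
```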

In relation to Flickr, up to now no filtering process has taken place. A detailed description of the future plans
regarding this task is included in Section 6.

5. Data Streams & User Filtering

This section describes the main filtering methods that have been implemented up to this point. The first part
refers to the filtering of users whose tweets have been collected by the Twitter crawler and who can be
identified as experts or influencers. The second part presents the methodology used to identify posts / tweets
that have attracted people’s attention, more specifically the filtering process used to select high quality
posts. The third part covers the methodology behind the iterative process of updating the already known
keywords, i.e. a filtering process that detects newly emerging hashtags and keywords.

5.1. Identifying Influential Users
This part is about detecting influential users in the Twitter dataset that has been collected so far.
Influential users (experts) are expected to be the ones that diffuse information regarding plastic more
efficiently and share content of higher quality than other users. Identifying influencers
(see Figure 17) has attracted the interest of many researchers since the dominance of social networks started
to affect traditional marketing strategies.

                       Figure 17: Sample Image of the 100 most influential users in a social network

Twitter is based on a social-networking model in which users can choose whom to follow or interact with (via
retweets, mentions or replies). Based on that notion, we can think of Twitter as a large graph G(U,E), where
U stands for the whole set of users (nodes) and E for all the interactions between users (edges). Our
goal is to use a filtering methodology capable of detecting the users that best diffuse plastic-related
information throughout the network. In this case, the graph consists of all the users that have posted a
tweet regarding plastic that has been collected by the Twitter crawler. Every retweet, mention or reply
creates an edge between the author of the original tweet and the user who retweeted, mentioned or replied
to them. For example, let user A post a tweet regarding plastic reuse in Rotterdam. If users B, C and D
retweet the original post of user A, a small graph of 4 nodes is created, in which users B, C and D are each
connected directly to user A.
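
The graph construction can be sketched as follows. This is an illustrative reading of the description above, not the project code: the function names are hypothetical, edges are treated as undirected, and the interaction targets are read from the retweeted_status, in_reply_to_user_id and entities.user_mentions fields of the tweet’s JSON document.

```python
from collections import defaultdict


def interaction_edges(tweet):
    """Return the set of (author, other_user) pairs implied by one tweet."""
    author = tweet["user"]["id"]
    targets = set()
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        targets.add(retweeted["user"]["id"])          # retweet
    if tweet.get("in_reply_to_user_id") is not None:
        targets.add(tweet["in_reply_to_user_id"])     # reply
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        targets.add(mention["id"])                    # mention
    targets.discard(author)  # ignore self-interactions
    return {(author, target) for target in targets}


def build_graph(tweets):
    """Build an undirected adjacency map {user: set of connected users}."""
    adjacency = defaultdict(set)
    for tweet in tweets:
        for a, b in interaction_edges(tweet):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)
```

For the example in the text, three retweets of user A’s post by users B, C and D produce a 4-node star in which A is connected to B, C and D.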

                                     Figure 18: Twitter data to MongoDB flow

As already mentioned, every tweet that fits the criteria described in Section 4 is stored in a
MongoDB database (Figure 18). Since we store the whole JSON document, as returned by the Twitter Streaming
API, a plethora of information is available for each tweet. In the JSON document, the entities field
contains the id of the user that has been mentioned, retweeted or replied to, while the user field contains
the id of the author (see Figure 19).

                                   Figure 19: Nodes and Edges in JSON document

For the graph creation we use all three types of interaction (if and where available): retweets, mentions and
replies. The dataset collected so far for English-language tweets related to the terms presented in
Tables 1-5 contains approximately 1.3 million tweets. Table 7 presents a brief overview of the
graph generated from these data.

                   Number of nodes                        629787
                   Number of edges                        994376
                   Avg degree                             3.15
                                      Table 7: Graph of Twitter-data overview
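
As a quick consistency check on Table 7, and assuming the graph is treated as undirected (so that each edge contributes to the degree of two nodes), the average degree follows directly from the node and edge counts:

```latex
\bar{d} \;=\; \frac{2\,|E|}{|U|} \;=\; \frac{2 \times 994376}{629787} \;\approx\; 3.158
```

This agrees with the value of 3.15 reported in the table, up to truncation of the last digit.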

Research on identifying influencers has focused either on approaches based on the topology of
the graph or on approaches that try to identify experts using specific attributes (in our case, for example,
number of friends and followers, tweeting activity, retweeting activity, etc.). Most studies are based on
Twitter and Facebook, where user engagement is quite high [17] [18] [19] [20] [21] [22] [23] [24] [25].

Since we are not just interested in nodes with high reputation, the PageRank [26] algorithm is not quite
suitable for our case. Influence can be viewed as another instance of an epidemic process, in which a topic,
a hashtag or content in general spreads as a virus would. Tong et al. [27] presented an epidemic immunization
approach called NetShield (see Figure 20), which in our case can be summarised as an algorithm that answers
the following problem: given a graph G of Twitter accounts, find the best x users, i.e. those expected to
propagate a tweet, or information in general, most effectively and quickly.

                                      Figure 20: NetShield pseudo-algorithm

NetShield detects the nodes whose isolation would best immunize the whole network. In other words,
it detects the nodes that are most capable of spreading the virus, i.e. the information, throughout the whole
network. Specifically, NetShield (a) gives an effective immunization strategy and (b) scales linearly with the
size of the graph (number of edges).
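
To make the idea concrete, the sketch below follows the greedy “shield value” scoring published for NetShield by Tong et al. [27]: each node j starts with a score 2λu(j)², where λ and u are the leading eigenvalue and eigenvector of the adjacency matrix, and the score of every remaining candidate is reduced by its eigenvector-weighted links to the nodes already selected. It is a simplified pure-Python illustration under stated assumptions (undirected, unweighted graph without self-loops; plain power iteration instead of a dedicated eigensolver; hypothetical function name netshield), not the implementation used to produce the results in this report.

```python
import math


def netshield(adjacency, k, iterations=200):
    """Greedily select k nodes with a high joint shield value.

    adjacency: dict mapping each node to the set of its neighbours
               (undirected, unweighted, no self-loops).
    """
    nodes = list(adjacency)
    # Power iteration on (A + I): the shift leaves the eigenvectors unchanged
    # and avoids oscillation on bipartite graphs such as stars.
    u = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: u[n] + sum(u[m] for m in adjacency[n]) for n in nodes}
        norm = math.sqrt(sum(x * x for x in nxt.values()))
        u = {n: x / norm for n, x in nxt.items()}
    # Rayleigh quotient u^T A u gives the leading eigenvalue of A.
    lam = sum(u[n] * u[m] for n in nodes for m in adjacency[n])
    base = {n: 2.0 * lam * u[n] ** 2 for n in nodes}

    selected, chosen = [], set()
    for _ in range(min(k, len(nodes))):
        best, best_score = None, -math.inf
        for n in nodes:
            if n in chosen:
                continue
            # Penalise candidates that are linked to already selected nodes.
            overlap = sum(u[m] for m in adjacency[n] if m in chosen)
            score = base[n] - 2.0 * overlap * u[n]
            if score > best_score:
                best, best_score = n, score
        selected.append(best)
        chosen.add(best)
    return selected
```

On the 4-node example of user A retweeted by B, C and D, calling netshield(graph, 1) selects A, the hub of the star.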

5.1.1. Evaluation of the Influencer filtering process
Focusing on the dataset presented above (English tweets for the past three months), we generated the graph
as described previously and applied the NetShield algorithm for x=100 (top-100 influencers). The whole
process took 4 minutes and 19 seconds on a desktop computer, which indicates the efficiency of the
algorithm. The goal of this process was to select the top 100 influential users in a graph containing
629787 nodes (users) and 994376 edges. A small sample of the top influential users, as displayed in the demo
version of the crowdsourcing tool, is presented below:

                                       Figure 21: Influencer Detection Demo

The top six accounts in our dataset, in terms of content quality and influence on other users according to the
NetShield algorithm, are: (1) National Geographic, (2) CoralReefFish, (3) UNEnvironment, (4)
SealScotland, (5) EllenMacArthur, (6) BBCNews. All of these are well-established accounts, which suggests
that the algorithm produces plausible and reliable results.

5.2. Selecting high quality content
This subsection covers the filtering work that aims to efficiently track high quality posts that appear to
have an impact on Twitter users.

5.2.1. Selecting top tweets
The crowdsourcing tool aims to raise awareness of the plastic pollution problem by bringing popular
plastic-related content to the surface. The quality of this content can be tracked by filtering posts based on
other users’ interactions. In the case of Twitter, interactions refer to the number of times a post has been
retweeted, replied to or marked as favourite.

5.2.1.1. Ranking tweets by retweet count
Since all the collected tweets are stored in a MongoDB database, filtering the most retweeted
ones comes down to a database query. In this case the query can be described as:

    1.   From Collection X: get all tweets, sorted by retweet count
    2.   Store the top k
    3.   If duplicates exist, keep the one with the highest retweet count
    4.   Return the top k without duplicates

This process is repeated for all collections (the different groups of keywords) and for all databases
(languages). An example of the results is presented in the figure below:

                                           Figure 22: Top Tweets Demo
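
The ranking steps of this subsection can be sketched in memory as follows. The function name top_tweets is hypothetical, and duplicates are assumed to be copies of the same tweet (same id_str) collected at different times with different counters. Unlike the numbered steps, this sketch removes duplicates before cutting to k, so that k distinct tweets are returned whenever enough are available.

```python
def top_tweets(tweets, k, field="retweet_count"):
    """Return the k distinct tweets with the highest value of the given counter."""
    best = {}
    for tweet in tweets:
        key = tweet["id_str"]
        # Among duplicates, keep the copy with the highest counter.
        if key not in best or tweet.get(field, 0) > best[key].get(field, 0):
            best[key] = tweet
    ranked = sorted(best.values(), key=lambda t: t.get(field, 0), reverse=True)
    return ranked[:k]
```

Passing another counter through the field parameter (for example favorite_count) ranks the same collection by that statistic instead.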

5.2.1.2. Ranking tweets by favourite count / replies
This process follows exactly the methodology of 5.2.1.1 but returns the tweets sorted by favourite count
or reply count respectively. Both statistics are included in the tweet’s JSON document. Figure 23 and Figure
24 depict such examples:

                                       Figure 23: Top tweets favourite count

                                        Figure 24: Top tweets replies count

5.2.2. Selecting top URLs
Twitter posts very often contain URLs. The expanded URL is also included in the tweet’s JSON document.
URLs are collected monthly for each database (language). The selection of the top URLs is an
iterative process that takes place at the end of every month: for each month we calculate the frequency of
each URL published in the tweets of all collections in the database. The map that contains the URLs and their
frequencies is then sorted and stored in our Results DB. An example of this selection is shown below:

                                            Figure 25: Top URLs Demo
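
The monthly URL count can be sketched with a frequency map, as below. The function name top_urls is hypothetical; the expanded URLs are read from the entities.urls list of each tweet’s JSON document, and the input is assumed to be the tweets of one month across all collections of a database.

```python
from collections import Counter


def top_urls(tweets, n=10):
    """Return the n most frequent expanded URLs as (url, frequency) pairs."""
    counts = Counter()
    for tweet in tweets:
        for url in tweet.get("entities", {}).get("urls", []):
            expanded = url.get("expanded_url")
            if expanded:
                counts[expanded] += 1
    # most_common returns the map sorted by descending frequency.
    return counts.most_common(n)
```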

5.2.3. Filtering content to produce tag clouds
A tag cloud is a visual representation of text data; tags are usually single words whose importance is
shown through larger font size and different colour. In the crowdsourcing platform, tag clouds allow the
user to visually identify trending terms related to the plastic thematic groups (pollution,
innovations, general terms, etc.). The text of each tweet usually includes a number of different entities that
add noise to the tag clouds and therefore needs to be cleaned in order to produce more reliable results of
higher quality. The cleaning (filtering) process consists of the following steps:

    •   Remove mentions
    •   Remove hashtags
    •   Remove punctuation and symbols
    •   Remove stopwords (words that are too common and do not add any value)
    •   Remove distinct numbers
    •   Remove social network’s reserved words (e.g. RT)
    •   Remove single letter words
    •   Remove multiple consecutive blank spaces
    •   Tokenize text
    •   Calculate the frequency of each word for all the tweets in each collection
    •   Create a JSON document that includes:
            o A word frequency map
            o A timestamp from (from which date)
            o A timestamp to (until which date)
            o The name of the collection
    •   Store the JSON document in the Results DB [Figure 26]

                                       Figure 26: JSON schema for wordcount
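
The cleaning steps listed above can be sketched as a small pipeline. The function names, the tiny English stopword list and the key names of the resulting document (collection, from, to, words) are assumptions made for illustration; the actual schema is the one shown in Figure 26.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lists; a real run would use full
# per-language stopword lists.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}
RESERVED = {"rt"}


def clean_tokens(text):
    """Apply the cleaning steps to one tweet and return its remaining tokens."""
    text = re.sub(r"@\w+", " ", text)      # remove mentions
    text = re.sub(r"#\w+", " ", text)      # remove hashtags
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation and symbols
    tokens = text.lower().split()          # collapse blanks and tokenize
    return [
        t for t in tokens
        if t not in STOPWORDS              # remove stopwords
        and t not in RESERVED              # remove reserved words such as RT
        and not t.isdigit()                # remove standalone numbers
        and len(t) > 1                     # remove single-letter words
    ]


def wordcount_document(tweets, collection, date_from, date_to):
    """Build the word-frequency document for one collection and period."""
    frequencies = Counter()
    for tweet in tweets:
        frequencies.update(clean_tokens(tweet["text"]))
    return {
        "collection": collection,
        "from": date_from,
        "to": date_to,
        "words": dict(frequencies),
    }
```

For example, the text “RT @user: Plastic bags are 100 % bad for the ocean #plastic a” is reduced to the tokens plastic, bags, bad and ocean.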

When a user asks for the tag cloud of a certain collection X for a certain period t (“from date1 to date2”),
a MongoDB query is executed:

    1.   From the Results DB:
    2.   get all word clouds of collection X with timestamps within period t
    3.   sum all word clouds
    4.   return the total
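
The aggregation performed by this query can be sketched in memory as follows. The function name tag_cloud and the document keys (collection, from, to, words) are hypothetical stand-ins for the schema of Figure 26, and dates are assumed to be ISO-formatted strings (YYYY-MM-DD), which compare correctly as plain strings.

```python
from collections import Counter


def tag_cloud(results, collection, date_from, date_to):
    """Sum the stored word-frequency maps of one collection within a period."""
    total = Counter()
    for doc in results:
        in_period = date_from <= doc["from"] and doc["to"] <= date_to
        if doc["collection"] == collection and in_period:
            total.update(doc["words"])  # add this document's counts
    return dict(total)
```

Because the per-period documents are pre-computed, answering a request only requires summing a handful of small maps rather than re-reading the tweets.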

The process for creating tag clouds based on hashtags is simpler, since hashtags are stored in a specific field
of the tweet’s JSON document. A sample result of such a process is presented in Figure 27:

                                            Figure 27: Tag cloud Demo

5.2.4. Filtering Content to produce location heatmaps
Estimating the real location of a social network user is challenging, especially on Twitter, where messages
are short (originally capped at 140 characters) and location data is sparse. Many studies have addressed
user geolocation, that is, how to infer the location of users by exploiting all available sources of
information [28].

In our case we follow a rather simple location filtering approach:

    •   Collect tweets from users that share their exact location (100% accurate, but only 1-2% of the total
        dataset)
    •   Collect tweets from users that tag a place (high accuracy but low granularity, typically city
        level)
    •   Collect tweets from users that declare their location in their profile (research has shown 50-60%
        accuracy at city level)
    •   Aggregate the total number of tweets collected along with the geospatial information
    •   Visualize the results

Figure 28 shows a demo example of this process:

                                           Figure 28: Location heatmap
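The location filtering approach described above can be sketched as a simple priority rule followed by an aggregation into grid cells. The sketch below assumes tweets in the Twitter REST API v1.1 JSON layout (coordinates, place.bounding_box and user.location). The geocode_profile callback, the use of the bounding-box centre for places, and the rounding-based grid are illustrative assumptions rather than the method used in the project.

```python
from collections import Counter


def tweet_location(tweet, geocode_profile=None):
    """Return (lat, lon, source) for a tweet, or None if nothing is usable."""
    coords = tweet.get("coordinates")
    if coords:                                  # exact point, GeoJSON order [lon, lat]
        lon, lat = coords["coordinates"]
        return lat, lon, "exact"
    place = tweet.get("place")
    if place and place.get("bounding_box"):     # tagged place: use the box centre
        ring = place["bounding_box"]["coordinates"][0]
        lon = sum(p[0] for p in ring) / len(ring)
        lat = sum(p[1] for p in ring) / len(ring)
        return lat, lon, "place"
    profile = (tweet.get("user") or {}).get("location")
    if profile and geocode_profile:             # free text, needs a geocoder
        hit = geocode_profile(profile)
        if hit:
            lat, lon = hit
            return lat, lon, "profile"
    return None


def heatmap_cells(tweets, geocode_profile=None, precision=1):
    """Count tweets per (lat, lon) cell, rounded to `precision` decimals."""
    cells = Counter()
    for tweet in tweets:
        loc = tweet_location(tweet, geocode_profile)
        if loc:
            lat, lon, _ = loc
            cells[(round(lat, precision), round(lon, precision))] += 1
    return cells


# Hypothetical sample data and a stub geocoder standing in for a real service.
sample = [
    {"coordinates": {"type": "Point", "coordinates": [22.94, 40.64]}},
    {"place": {"bounding_box": {"type": "Polygon", "coordinates": [
        [[8.2, 47.0], [8.4, 47.0], [8.4, 47.2], [8.2, 47.2]]]}}},
    {"user": {"location": "Rotterdam"}},
    {"user": {"location": ""}},
]
stub_geocoder = {"Rotterdam": (51.92, 4.48)}.get
cells = heatmap_cells(sample, stub_geocoder)
print(cells)
```

The resulting cell counts can be passed to any mapping library to draw the heatmap; keeping the source label makes it possible to weight or exclude the less accurate profile-based locations.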


5.2.5. Topic extraction
Topic modelling is an unsupervised learning approach used to discover topics based on the content of
documents. In our case, the text of each tweet plays the role of the document. So far we have
implemented two different approaches for topic modelling:

    •   LDA (Latent Dirichlet Allocation): a probabilistic topic modelling method; and
    •   NMF (Non-negative Matrix Factorization): a linear-algebraic topic modelling method

LDA [29] is an unsupervised machine learning technique used for the discovery of latent topic information
in large document collections. It uses a bag-of-words approach that represents each document of the corpus
as a vector of word counts. It relies on two probability distributions: P(word | topic) and
P(topic | document). These are first estimated from a random initial assignment of words to topics; the
topic assignment of each word in each document is then re-evaluated using them. This procedure is repeated
iteratively, updating the probabilities each time, until the algorithm converges.
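To make the iterative procedure more concrete, the following toy example implements LDA inference with collapsed Gibbs sampling in plain Python. This is only one common inference scheme and is chosen here as an assumption; the report does not state which inference algorithm or library underlies the implemented model. The hyperparameters alpha and beta, the fixed number of iterations (used instead of a convergence test) and the sample documents are illustrative.

```python
import random


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of token lists.
    Returns the (document x topic) and (topic x word) count tables."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * k for _ in docs]                        # n(d, t)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(k)]   # n(t, w)
    topic_total = [0] * k                                      # n(t)
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Remove the token from the counts ...
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # ... and resample its topic with weight proportional to
                # P(topic | document) * P(word | topic).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word


docs = [
    ["plastic", "recycling", "waste"],
    ["plastic", "waste", "bottle"],
    ["printer", "filament", "design"],
    ["design", "printer", "model"],
]
doc_topic, topic_word = lda_gibbs(docs, k=2)
for t, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"topic {t}: {top}")
```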

NMF [30], on the other hand, is a linear-algebraic model that factors high-dimensional vectors into a
low-dimensional representation. Like Principal Component Analysis (PCA), it reduces dimensionality;
unlike PCA, NMF takes advantage of the fact that the vectors are non-negative and constrains the
resulting factors to be non-negative as well.

Both NMF and LDA take a bag-of-words matrix (number of documents × number of words) as input. In the
bag-of-words matrix, documents are represented as rows, while words are represented as columns. Both
algorithms also require the number of topics (k) to be derived as a parameter. The output produced by the
topic modelling algorithms consists of two matrices: a documents-to-topics matrix (number of documents ×
k topics) and a topics-to-words matrix (k topics × number of words). Most topic model visualizations use
only the topics-to-words matrix and display the words with the highest weights in each topic.
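The matrix shapes described above can be illustrated with a small pure-Python example that builds a bag-of-words matrix and factorizes it with NMF. The factorization below uses the classic multiplicative update rules, chosen for brevity; this is an assumption and differs from the fast local algorithms of [30]. The sample documents, the value of k and the iteration count are illustrative.

```python
import random


def bag_of_words(docs):
    """Return the (documents x words) count matrix and the vocabulary."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for w in doc:
            matrix[i][index[w]] += 1.0
    return matrix, vocab


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor v (n x m) into w (n x k) and h (k x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


docs = [
    ["plastic", "recycling", "waste"],
    ["plastic", "waste", "bottle"],
    ["printer", "filament", "design"],
    ["design", "printer", "model"],
]
v, vocab = bag_of_words(docs)
w, h = nmf(v, k=2)        # w: documents-to-topics, h: topics-to-words
for t, row in enumerate(h):
    top = [word for _, word in sorted(zip(row, vocab), reverse=True)[:3]]
    print(f"topic {t}: {top}")
```

Reading the largest entries of each row of h gives the characteristic words of a topic, while each row of w indicates how strongly a document is associated with each topic.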

The whole process consists of two steps: training the model and evaluating the results. As an example, we
applied LDA to the corpus of the “English terms” collection to train our model and asked for 10 distinct
topics. Before either approach can be used, the text of each tweet needs to be filtered and pre-processed.
The filtering steps are described below:

    •   Remove mentions
    •   Remove hashtags
    •   Remove punctuation and symbols
    •   Remove stopwords (words that are too common and do not add any value)
    •   Remove distinct numbers
    •   Remove social network’s reserved words (e.g. RT)
    •   Remove single letter words
    •   Remove multiple consecutive blank spaces
    •   Tokenize text
    •   Apply a stemming process (reduce words to their root form)

A demo example of the produced topics and the words related to them is presented in Figure 29. This demo
uses pyLDAvis [31] to produce the graphical visualization of the topics.


                         Figure 29: LDA topic modelling Demo


6. Conclusions and Future Work

This deliverable reports the work done in WP2, focusing on the filtering process that has been used
during the data collection and analysis phase up to this point. Since this is the first deliverable
regarding the crowdsourcing tool (which is to be delivered by Month 10), some of the filtering methods have
not yet been completed. Future work plans include:

    •   Extracting high-quality content from the users that have been identified as experts by our system and
        the pilots, and using it to train an LDA model, which will then be used to classify other users.
    •   Extracting topics per location using topic modelling.
    •   Providing a filtering process for identifying high-quality content in Flickr.
    •   Developing an iterative methodology that builds upon the intelligence extracted from the already
        available high-quality content (top tweets, top URLs) to identify new trends and dynamically update
        the keywords used to track tweets of specific content.


7. References

[1] "Instagram," [Online]. Available: https://www.instagram.com/.

[2] "Instagram API," [Online]. Available: https://www.instagram.com/developer/.

[3] "Pinterest," [Online]. Available: pinterest.com.

[4] "Pinterest API," [Online]. Available: https://developers.pinterest.com/docs/api/pins/?.

[5] "Tumblr," [Online]. Available: https://www.tumblr.com/.

[6] "Tumblr API," [Online]. Available: https://www.tumblr.com/docs/en/api/v2.

[7] "Facebook," [Online]. Available: www.facebook.com.

[8] "Graph API," [Online]. Available: https://developers.facebook.com/docs/graph-api/.

[9] "Twitter," [Online]. Available: www.twitter.com.

[10] "twitter api," [Online]. Available: https://developer.twitter.com/en/docs.

[11] "Flickr," [Online]. Available: www.flickr.com.

[12] "flickr API," [Online]. Available: https://www.flickr.com/services/api/.

[13] "precious plastic," [Online]. Available: https://preciousplastic.com.

[14] "Thingiverse," [Online]. Available: https://www.thingiverse.com/about/.

[15] "Thingiverse API," [Online]. Available: https://www.thingiverse.com/developers.

[16] "MongoDB," [Online]. Available: https://www.mongodb.com/.

[17] E. Dubois and D. Gaffney, "The multiple facets of influence: Identifying political influentials and
     opinion leaders on Twitter," American Behavioral Scientist, vol. 58, pp. 1260-1277, 2014.

[18] W. Chen, S. Cheng, X. He and F. Jiang, "Influencerank: An efficient social influence measurement for
     millions of users in microblog," in Cloud and Green Computing (CGC), 2012 Second International
     Conference on, 2012.

[19] W. Chen, Y. Wang and S. Yang, "Efficient influence maximization in social networks," in Proceedings of
     the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009.

[20] M. Cha, H. Haddadi, F. Benevenuto and P. K. Gummadi, "Measuring user influence in twitter: The
     million follower fallacy," ICWSM, vol. 10, p. 30, 2010.


[21] F. Bonchi, C. Castillo, A. Gionis and A. Jaimes, "Social network analysis and mining for business
     applications," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, p. 22, 2011.

[22] E. Bakshy, J. M. Hofman, W. A. Mason and D. J. Watts, "Everyone's an influencer: quantifying
     influence on twitter," in Proceedings of the fourth ACM international conference on Web search and
     data mining, 2011.

[23] K. Lee, J. Mahmud, J. Chen, M. Zhou and J. Nichols, "Who will retweet this?: Automatically identifying
     and engaging strangers on twitter to spread information," in Proceedings of the 19th international
     conference on Intelligent User Interfaces, 2014.

[24] F. Riquelme and P. González-Cantergiani, "Measuring user influence on Twitter: A survey,"
     Information Processing & Management, vol. 52, pp. 949-975, 2016.

[25] T. Rodrigues, F. Benevenuto, M. Cha, K. Gummadi and V. Almeida, "On word-of-mouth based
     discovery of the web," in Proceedings of the 2011 ACM SIGCOMM conference on Internet
     measurement conference, 2011.

[26] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer
     networks and ISDN systems, vol. 30, pp. 107-117, 1998.

[27] H. Tong, B. A. Prakash, C. Tsourakakis, T. Eliassi-Rad, C. Faloutsos and D. H. Chau, "On the vulnerability
     of large graphs," in Data Mining (ICDM), 2010 IEEE 10th International Conference on, 2010.

[28] D. Jurgens, T. Finethy, J. McCorriston, Y. T. Xu and D. Ruths, "Geolocation Prediction in Twitter Using
     Social Networks: A Critical Analysis and Review of Current Practice," ICWSM, vol. 15, pp. 188-197,
     2015.

[29] D. M. Blei, A. Y. Ng and M. I. Jordan, "Latent dirichlet allocation," Journal of machine Learning
     research, vol. 3, pp. 993-1022, 2003.

[30] A. Cichocki and A.-H. Phan, "Fast local algorithms for large scale nonnegative matrix and tensor
     factorizations," IEICE transactions on fundamentals of electronics, communications and computer
     sciences, vol. 92, pp. 708-721, 2009.

[31] C. Sievert and K. Shirley, "LDAvis: A method for visualizing and interpreting topics," in Proceedings of
     the workshop on interactive language learning, visualization, and interfaces, 2014.
