My-Trac: System for Recommendation of Points of Interest on the Basis of Twitter Profiles
electronics — Article

My-Trac: System for Recommendation of Points of Interest on the Basis of Twitter Profiles

Alberto Rivas 1,2,*, Alfonso González-Briones 1,2,3, Juan J. Cea-Morán 1, Arnau Prat-Pérez 4 and Juan M. Corchado 1,2

1 BISITE Research Group, University of Salamanca, Edificio I+D+i, Calle Espejo 2, 37007 Salamanca, Spain; alfonsogb@usal.es (A.G.-B.); juanju_97@usal.es (J.J.C.-M.); corchado@usal.es (J.M.C.)
2 Air Institute, IoT Digital Innovation Hub, Carbajosa de la Sagrada, 37188 Salamanca, Spain
3 Research Group on Agent-Based, Social and Interdisciplinary Applications (GRASIA), Complutense University of Madrid, 28040 Madrid, Spain
4 Sparsity-Technologies, 08034 Barcelona, Spain; arnau@sparsity-technologies.com
* Correspondence: rivis@usal.es

Abstract: New mapping and location applications focus on offering improved usability and services based on multi-modal, door-to-door passenger experiences. This helps citizens develop greater confidence in and adherence to multi-modal transport services. These applications adapt to the needs of users during their journeys through the data, statistics and trends extracted from their previous uses of the application. The My-Trac application is dedicated to the research and development of these user-centered services to improve the multi-modal experience using various techniques. Among these techniques are preference extraction systems, which extract user information from social networks such as Twitter. In this article, we present a system that builds a profile of the preferences of each user on the basis of the tweets published on their Twitter account. The system extracts the tweets from the profile, analyzes them using the proposed algorithms, and returns the result in a document containing the categories and the degree of affinity that the user has with each category. In this way, the My-Trac application includes a recommender system through which the user receives preference-based suggestions about activities or services on the route to be taken.

Keywords: users' profiling; data extraction; natural language processing; recommender system; mapping application

Citation: Rivas, A.; González-Briones, A.; Cea-Morán, J.J.; Prat-Pérez, A.; Corchado, J.M. My-Trac: System for Recommendation of Points of Interest on the Basis of Twitter Profiles. Electronics 2021, 10, 1263. https://doi.org/10.3390/electronics10111263

Academic Editors: Dimitris Apostolou and Osvaldo Gervasi
Received: 13 April 2021; Accepted: 20 May 2021; Published: 25 May 2021

1. Introduction

Humans are social beings; we always seek to be in contact with other people and to have as much information as possible about the world around us. The philosopher Aristotle (384–322 B.C.), in his phrase "Man is a social being by nature", states that human beings are born with the social characteristic and develop it throughout their lives, as they need others in order to survive. Socialization is a learning process; the ability to socialize means we are capable of relating with other members of society with autonomy, self-realization and self-regulation. For example, the incorporation of rules associated with behavior, language, and culture improves our communication skills and the ability to establish relationships within a community.

In the search for improvement, communication, and relationships, human beings seek to get in contact with other people and to obtain as much information as possible about the environment in order to achieve the above objectives. The emergence of the Internet has made it possible to define new forms of communication between people. It has also made it possible to make a large amount of information on any subject available to the average user at any time. This is materialized in the development of social networks. The concept of social networking emerged in the 2000s as a place that allows for interconnection between people, and, very soon, the first social networking platforms appeared on the Internet that
served to bring people together. Among the first platforms that emerged were Fotolog, MySpace, Hi5, Buzz, and SecondLife; however, most of them have declined in popularity or disappeared. Today, Facebook, Twitter, and Instagram stand out; they are used by millions of people all over the world. Thanks to these technologies, people from different parts of the world can engage in conversation, post photos of their latest trip, or keep their followers updated by sharing their opinions or experiences.

With regard to writing opinions or experiences, Twitter is the social network par excellence. Twitter is based on the concept of microblogging, i.e., users can post messages about their opinions, preferences, experiences, etc., with a maximum of 280 characters. Twitter allows its users to follow other accounts that interest them, or to comment on events in real time using hashtags. All this translates into one word: information. The information that users provide on social networks can be used in a variety of ways, many of them negative. Exposure on the Internet means that anyone can access the users' data and use it for financial gain. However, it can also be used to make life easier for users who choose to do so, always bearing in mind that there must be express consent on their part. This is precisely the case of the work presented here.

Since the emergence of the first social networks, much progress has been made towards the current state of maturity of the social network life cycle. As presented above, they are of vital importance to society, as they fulfill the innate communication function of human beings. In addition to their use as a means of communication, they have begun to be exploited for business purposes in order to profit from the enormous amount of information that is generated on a daily basis. This information, which is generated by society, is of great value once analyzed and processed correctly.

The data generated by users on social networks allows for the development of commercial and advertising actions that are much more effective than with traditional formats. Advertisement platforms have developed the ability to segment advertising on the basis of the behavior of each user, to show products and services to those who are really interested in them. This increases the effectiveness of advertisements. Social network users publish practically everything that happens in their daily lives: their opinions, where they are, what they eat, what they would like to buy, where they are going on holiday, and a long list of behaviors that are transformed into valuable information for analysis. The information obtained through the analysis is very interesting as it allows for the elaboration of demographic, socio-economic, and consumer trend profiles. The companies that own these social networks sell high-value information to other companies to enable them to carry out much more powerful and effective marketing and advertising strategies. Another remarkable aspect of data analytics on social networks is the ability to perform real-time analysis of the information to offer products and services according to the users' characteristics.

Information is a very precious commodity, and, as presented above, Twitter is a great source of data when analyzing human behavior and interactions or when learning about the opinion of certain users on certain topics.
This information can be used to improve the multi-modal experience of users when they use the My-Trac application. Therefore, an adaptation of these systems for adoption in mapping applications is proposed.

The European My-TRAC project focuses on providing user-centered services to improve the multi-modal experience of passengers from door to door. This helps citizens develop greater confidence in and adherence to multi-modal transport services. In addition, My-TRAC improves customization to users' needs through data, statistics, and trends provided by passengers' experiences when using the proposed platform. Part of the tailoring of services and recommendations to users is determined by the knowledge obtained from their Twitter posts through the use of NLP techniques to classify and understand users.

There are other services that offer similar functionalities to My-Trac, such as ROSE [1], CTRR- and CTRR+-based systems for city-based tourism [2], or systems to participate in solidarity projects in rural environments [3]. From among the above, ROSE (ROuting SErvice) stands out, which is a mobile phone application that suggests events and places to the user and
guides them via public transport. There are many different systems that incorporate both recommendation and navigation; however, no other system combines event recommendation and pedestrian navigation with (real-time) public transport. Nevertheless, ROSE does not employ multi-modal navigation between different public transport modes (bus, train, carpooling, plane, etc.) in different countries, nor does it use information from the user's social network profile. Instead, current systems utilize a set of information initially entered into the application which is not updated afterwards. Finally, Tables 1 and 2 present a review of similar works.

Table 1. Review of similar works: Part I.

ROSE (ROuting SErvice) [1]
Functionality: Mobile phone application that suggests events and places to the user and guides them via public transport.
Advantages: No other system combines event recommendation and pedestrian navigation with (real-time) public transport.
Shortcomings: It does not employ multi-modal navigation between different public transport modes (bus, train, carpooling, plane, etc.) in different countries, nor does it use information from the user's social network profile; the information initially entered into the application is not updated afterwards.

Systems for city-based tourism [2]
Functionality: A personalized travel route recommendation based on the road networks and users' travel preferences.
Advantages: The experimental results show that the proposed methods achieve better results for travel route recommendations compared with the shortest distance path method.
Shortcomings: It does not use information from public transport services in route recommendations.

Tourism routes as a tool for the economic development of rural areas—vibrant hope or impossible dream? [3]
Functionality: The paper argues that the clustering of activities and attractions, and the development of rural tourism routes, stimulates co-operation and partnerships between local areas. It further discusses the development of rural tourism routes in South Africa and highlights the factors critical to its success.
Advantages: The article analyzes the realization of routes that include activities and attractions in a way that encourages and enhances rural development in Africa.
Shortcomings: Preliminary project that requires public cooperation (institutions, transport, services) for a comprehensive improvement of the proposal.

This article improves on the previous system for the extraction of information regarding Twitter users [4]. The system is capable of obtaining information about a particular user and of elaborating a profile with the user's preferences in a series of pre-established categories. A review of the natural language processing techniques applied to Twitter profiles is presented in Section 2. Section 3 describes the proposal. Section 4 presents the assessment made with synthetic data. Section 5 shows how the system is integrated into the My-Trac app. Finally, Section 6 presents the conclusions.
Table 2. Review of similar works: Part II.

Social Recommendations for Events [5]
Functionality: The Outlife recommender assists in finding the ideal event by providing recommendations based on the user's personal preferences.
Advantages: In addition to the user's preferences, the recommender uses information from the user's group of friends to make event recommendations more satisfactory.
Shortcomings: Although it uses information from the user's groups of friends, no use is made of information from the user's social networks to complement the analysis and recommendation.

Smart Discovery of Cultural and Natural Tourist Routes [6]
Functionality: This paper presents a system designed to utilize innovative spatial interconnection technologies for sites and events of environmental, cultural and tourist interest. The system discovers and consolidates semantic information from multiple sources, providing the end-user with the ability to organize and implement integrated and enhanced tours.
Advantages: The system adapts the services offered to meet the needs of specific individuals, or groups of users who share similar characteristics, such as visual, acoustic, or motor disabilities. Personalization is done in a dynamic way that takes place at the time and place of the service.
Shortcomings: A very comprehensive system that uses external services, scraping, crawling and geo-positioning, but it does not include information from social networks to complement the analysis and recommendation of events.

Enhancing cultural recommendations through social and linked open data [7]
Functionality: Hybrid recommender system (RS) in the artistic and cultural heritage area, which takes into account the activities on social media performed by the target user and her friends.
Advantages: The system integrates collaborative filtering and community-based algorithms with semantic technologies to exploit linked open data sources in the recommendation process. Furthermore, the proposed recommender provides the active user with personalized and context-aware itineraries among cultural points of interest.
Shortcomings: The main drawback is the absence of extensive control over the semantics, which are not taken into account. This generates difficulties in justifying, explaining, and hence analyzing the resulting scores.

Personalized Tourist Route Generation [8]
Functionality: Intelligent routing system able to generate and customize personalized tourist routes in real time, taking public transportation into account.
Advantages: The tourist planning problem, integrating public transportation, is modeled as the Time Dependent Team Orienteering Problem with Time Windows (TDTOPTW), and a heuristic is designed that is able to solve it in real time by precalculating the average travel times between each pair of POIs in a preprocessing step.
Shortcomings: Future work consists of extending the system to more cities with different public transport network topologies and of integrating an advanced recommendation system in a fully functional PET. The system does not use social network capabilities, which would allow travel experiences to be stored, shared and added to better help tourists at the destination.

2. Natural Language Processing Techniques Applied to Twitter Profiles

In this section, we review the main techniques applied in the analysis that make it possible to get to know the users' preferences through their tweets. This allows for recommendations to be made according to the user profile.
2.1. Word Embedding Techniques

NLP techniques allow computers to analyze human language, interpret it, and derive its meaning so that it can be used in practical ways. These techniques allow tasks such as automatic text summarization, language translation, relation extraction, sentiment
analysis, speech recognition, and item classification to be carried out. Currently, NLP is considered to be one of the great challenges of artificial intelligence, as it is one of the fields with the highest development activity and it presents tasks of great complexity: how can the meaning of a text really be understood, and how can neologisms, irony, jokes, or poetry be intuited? It is a challenge to apply the techniques and algorithms that allow us to obtain the expected results.

One of the most commonly used NLP techniques is Topic Modeling. This technique is a type of statistical modeling that is used to discover the abstract "topics" that appear in a series of input texts. Topic modeling is a very useful text mining tool for discovering hidden semantic structures in texts. Generally, the text of a document deals with a particular topic, and the words related to that topic are likely to appear more frequently in the document than those that are unrelated to the text. Topic Modeling collects the set of more frequent words in a mathematical framework, which allows one to examine a set of text documents and discover, on the basis of the statistics of the words in each one, what the topics may be and what the balance is between the topics in each document.

The input of topic modeling is a document-term matrix. The order of words does not matter. In a document-term matrix, each row is a document, and each column is a term (or word); we write "0" if that document does not contain that term, "1" if that document contains that term once, "2" if that document contains that term twice, and so on. Algorithms such as Bag-of-words or TF-IDF, among others, make it possible to represent the words used by the models and create the matrix defined above, representing a token in each column and counting the number of times that token appears in each sentence (represented in each row).

• Bag-of-words. This model allows the characteristics of texts (and also of images, audio, etc.) to be extracted. It is, therefore, a feature extraction model. The model consists of two parts: a representation of all the words in the text and a vector representing the number of occurrences of each word throughout the text. That is why it is called Bag-of-words. This model completely ignores the structure of the text; it simply counts the number of times words appear in it. It has been implemented through the Gensim library [9].

• Term Frequency-Inverse Document Frequency (TF-IDF). This is the product of two measures that indicate, numerically, the degree of relevance that a word has in a document within a collection of documents [10]. It is broken down into two parts:

– Term frequency: Measures the frequency with which certain terms appear in a document. There are several measurement options, the simplest being the gross frequency, i.e., the number of times a term t appears in a document d. However, in order to avoid a predisposition towards long documents, the normalized frequency is used:

tf(t, d) = f(t, d) / max{ f(t', d) : t' ∈ d }.  (1)

As shown in Equation (1), the frequency of the term is divided by the maximum frequency of the terms in the document.

– Inverse document frequency: If a term appears very frequently in all of the analyzed documents, its weight is reduced. If it appears infrequently, it is increased.

idf(t, D) = log( |D| / |{ d ∈ D : t ∈ d }| ).  (2)

As shown in Equation (2), the total number of documents is divided by the number of documents containing the term.
– Term frequency-inverse document frequency: The entire formula is shown in Equation (3):

tf-idf(t, d, D) = tf(t, d) × idf(t, D).  (3)
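To make Equations (1)–(3) concrete, the following minimal sketch computes tf-idf weights directly from the definitions above. The toy documents and function names are illustrative only and do not come from this work; note that library implementations such as scikit-learn's TfidfVectorizer (used later in Listing 2) apply slightly different normalization and smoothing, so their values will not match this sketch exactly.

import math
from collections import Counter

# Toy corpus: each "document" is a tokenized tweet (made-up example data).
docs = [["road", "trip", "with", "my", "dog"],
        ["budget", "hotel", "for", "a", "road", "trip"],
        ["my", "dog", "loves", "the", "beach"]]

def tf(term, doc):
    # Equation (1): raw count divided by the count of the most frequent term.
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, docs):
    # Equation (2): log of total documents over documents containing the term.
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    # Equation (3): product of the two measures.
    return tf(term, doc) * idf(term, docs)

print(tf_idf("road", docs[0], docs))   # "road" occurs in 2 of 3 documents -> lower weight
print(tf_idf("beach", docs[2], docs))  # "beach" occurs in 1 of 3 documents -> higher weight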
Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that the words that are closer in the vector space are expected to be similar in meaning [11]. Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers [12].

• Word2vec. This technique uses huge amounts of text as input and is able to identify which words appear to be similar in various contexts [13–15]. Once trained on a sufficiently big dataset, 300-dimensional vectors are generated for each word, forming a new vocabulary where "similar" words are placed close to each other. Pre-trained vectors are used, providing a wealth of information from which to understand the semantic meaning of the texts.

• Doc2vec. This technique is an extension of Word2Vec and is applied to a document as a whole instead of to individual words; it uses an unsupervised learning approach to better understand documents as a whole [16]. The Doc2Vec model, as opposed to the Word2Vec model [17], is used to create a vectorized representation of a group of words taken collectively as a single unit. It does not simply give the average of the words in the sentence.

2.2. Topic Modeling

As already presented in the previous section, topic modeling is a tool that takes an individual text (or corpus) as input and looks for patterns in word usage; it is an attempt to find semantic meaning in the vocabulary of that text (or corpus). This set of tools enables the extraction of topics from texts; a topic is a list of words that is presented in a way that is statistically significant. Topic modeling programs do not know anything about the meaning of the words in a text. Instead, they assume that each text fragment is composed (by an author) through the selection of words from possible word baskets, where each basket corresponds to a topic. If that is true, then it is possible to mathematically decompose a text into the baskets from which the words that compose it are most likely to come. The tool repeats the process over and over again until the most probable distribution of words within the baskets, the so-called topics, is established.

The techniques executed by the proposed system are used to discover word usage patterns of each user on Twitter, and they make it possible to group users into different categories. To this end, a thorough review of the main tools for topic modeling has been carried out. Most of the algorithms are based on the paradigm of unsupervised learning. These algorithms return a set of topics, as many as indicated in the training. Each topic represents a cluster of terms that must be related to one of those categories. Precisely for this reason, a large number of tweets have been retrieved as training data, and keywords have been searched for each category. As part of this research, a total of three algorithms have been evaluated: LDA, LSI, and NMF. The best results were obtained with NMF, although the techniques applied in other works have been reviewed in order to contrast their results with this method. Apart from the comparison itself, there are numerous studies that have made similar comparisons between these techniques, so the decision is supported by similar studies.
In the work of Tunazzina Islam (2019), an experiment similar to the one proposed in this paper was carried out [18]. In that work, Apache Kafka is employed to handle the big streaming data from Twitter. Tweets on yoga and veganism are extracted and processed in parallel with data mining by integrating Apache Kafka and Spark Streaming. Topic modeling is then used to obtain the semantic structure of the unstructured data (i.e., tweets). A comparison of three different algorithms, LSA, NMF, and LDA, is then performed, with NMF being the best performing model.

Another noteworthy work is that of Chen et al. [19], in which an experiment was conducted to detect topics in small text fragments.
This is similar to the proposal made in this paper, since tweets can be considered small texts. In that work, a comparison is made between the LDA and NMF methods, the latter being the one that provided the best results.

• Latent Dirichlet Allocation (LDA). This is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar [20–22]. For example, if observations are collections of words in documents, each document is a mixture of a small number of topics, and each word's presence is attributable to one of the document's topics. LDA is an example of a topic model and belongs to the machine learning toolbox and, in a wider sense, to the artificial intelligence toolbox.

• Nonnegative Matrix Factorization (NMF). This is an unsupervised learning algorithm belonging to the field of linear algebra. NMF reduces the dimensionality of an input matrix by factoring it into two matrices whose product approximates the original one with a smaller rank. The formula is V ≈ WH. Let us suppose, observing Equation (4), a vectorization of P documents with an associated dictionary of N terms (weights). That is, each document is represented as a vector of N dimensions. All documents, therefore, correspond to a matrix

V ∈ R^(N×P),  (4)

where N is the number of rows in the matrix, each of which represents a term, while P is the number of columns in the matrix, each of which represents a document. Equations (5) and (6) show matrices W and H. The value r marks the number of topics to be extracted from the texts. Matrix W contains the characteristic vectors that make up these topics. The number of characteristics (dimensionality) of these vectors is identical to that of the data in the input matrix V. Since only a few topic vectors are used to represent many data vectors, it is ensured that these topic vectors discover latent structures in the text. The H matrix indicates how to reconstruct an approximation of the V matrix by means of a linear combination with the columns of W.

W ∈ R^(N×r),  (5)

where N is the number of rows in matrix W, each of which represents a term (weight), and r is the number of columns in matrix W, r being the number of characteristics to be extracted.

H ∈ R^(r×P),  (6)

where r is the number of rows in matrix H, r being the number of characteristics to be extracted, and P is the number of columns, with one column for each document. The result of the matrix product between W and H is, therefore, a matrix of dimensions N×P corresponding to a compressed version of V.
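As a small numerical illustration of Equations (4)–(6), the sketch below factorizes a made-up 6-term by 4-document matrix with scikit-learn's NMF; the values are invented for illustration and are not data from this work. Fitting directly on the term-document matrix yields factors with exactly the shapes used in the notation above.

import numpy as np
from sklearn.decomposition import NMF

# Toy term-document matrix V: N = 6 terms (rows) x P = 4 documents (columns).
# The entries play the role of tf-idf weights; the numbers are illustrative only.
V = np.array([[0.8, 0.0, 0.7, 0.0],
              [0.6, 0.1, 0.9, 0.0],
              [0.0, 0.7, 0.0, 0.8],
              [0.1, 0.9, 0.0, 0.6],
              [0.0, 0.0, 0.2, 0.1],
              [0.3, 0.2, 0.1, 0.0]])

r = 2  # number of latent topics to extract
model = NMF(n_components=r, init='nndsvda', max_iter=500)

W = model.fit_transform(V)  # N x r, Equation (5): term weights of each topic
H = model.components_       # r x P, Equation (6): topic mixture of each document

print(W.shape, H.shape)     # (6, 2) (2, 4)
print(np.round(W @ H, 2))   # W @ H approximately reconstructs V, Equation (4)

In the pipeline described later in Section 3, scikit-learn's document-term convention transposes the roles of rows and columns, but the underlying factorization is the same.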
The use of Machine Learning techniques for the analysis of information extracted from Twitter is a very common case study today. It is convenient to study what kind of research is being carried out on this subject. One of the main applications is the use of Twitter and Natural Language Processing techniques in order to extract a user's opinion about what is being tweeted at a given time. The article "A system for real-time Twitter sentiment analysis of 2012 U.S. presidential election cycle", written by Hao Wang et al. [23], presents a system for real-time polarity analysis of tweets related to candidates for the 2012 U.S. elections. The system collects tweets in real time, tokenizes and cleans them, identifies which user is being talked about in the tweet, and analyzes the polarity. For training, it applies Naïve Bayes, a statistical classifier, and uses hand-categorized tweets as input. Another study similar to this one is the one proposed by J. M. Cotelo et al. from the University of Seville: "Tweet Categorization by combining content and structural knowledge" [24]. It proposes a method to extract the users' opinion about the two main Spanish parties in the 2013 elections. It uses two processing pipelines, one based on the structural analysis of the tweets, and the other based on the analysis of their content.

Another possible line of research is based on categorizing Twitter content. This is the case of the article "Twitter Trending Topic Classification" written by Kathy Lee et al. [25]. It studies the way to classify trending topics (highlighted hashtags) into 18 different categories. To this end, Topic Modeling techniques were used. The key point lies in providing a solution based on the analysis of the network underlying the hashtags and not only the text: "our main contribution lies in the use of the social network structure instead of using only textual information, which can often be noisy considering the social network context".

As can be seen, there are many studies currently oriented to the analysis of Twitter using Machine Learning tools. The challenge to be faced in this work is to find the optimal way of classifying users according to their tweets. The sections that follow describe the objectives of the project and detail the research and testing that led to the construction of a stable system fit for the purpose for which it has been designed.

3. Proposal

This section proposes a system for the extraction of information about Twitter users. The system is capable of obtaining information about a particular user and of elaborating a profile with the user's preferences in a series of pre-established categories. From an abstract point of view, the proposal can be seen as a processing pipeline, as shown in Figure 1. The different phases of this pipeline contribute to the achievement of the main objective: user classification.

Figure 1. Pipeline representing the system processing steps.

3.1. Category Definition

Matching a given profile to a specific category or topic is one of the objectives of NLP algorithms. As a starting point, it is necessary to prepare the training dataset that is used when investigating the algorithmic model. The strategy followed is based on the model of the Interactive Advertising Bureau (IAB) association [26]. Today, IAB is a benchmark standard for the classification of digital content. In particular, the IAB Tech Lab has developed and released a content taxonomy on which the present categorization is based. This taxonomy proposes a total of 23 categories with their corresponding subcategories covering the main topics of interest.
In this way, 8000 tweets from each of these categories have been ingested. As a result, 23 datasets with examples of tweets related to each category
were obtained; these datasets have been used to train the system at a later stage. Specifically, the list of topics is shown in Table 3.

Table 3. Categories taxonomy.

Arts & Entertainment; Automotive; Business; Careers; Education; Family & Parenting; Health & Fitness; Food & Drink; Hobbies & Interests; Home & Garden; Law, Gov't & Politics; News; Personal Finance; Society; Science; Pets; Sports; Style & Fashion; Technology & Computing; Travel; Real Estate; Shopping; Religion & Spirituality.

3.2. Twitter Data Extraction

The Twitter data extraction mechanism is a fundamental element of the system. The goal of this mechanism is to recover two types of data. On the one hand, the system extracts a set of anonymous tweets related to each of the defined preference categories; these tweets are used to train the data classification algorithms. On the other hand, the mechanism extracts information about the given user for the analysis of their preferences.

Twitter's API enables developers to perform all kinds of operations on the social network. It is, therefore, necessary for our system to use this powerful API. The API could be used by developing a module that makes HTTP requests so that the endpoints of interest are executed; however, this involves a remarkably high development cost. Another option is to make use of one of the multiple Python libraries that encapsulate this logic and offer a simple interface to developers. The latter option has been chosen for the development of this system, more specifically, the Tweepy library [27].
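The extraction code itself is not reproduced in the paper; the following is a minimal sketch of how the two kinds of data described above might be retrieved with Tweepy, assuming Tweepy 4.x, valid developer credentials, and standard API access. The function names, the example keyword and the limits are illustrative; the standard endpoints cap a user timeline at roughly the most recent 3200 tweets and keyword search at recent tweets only.

import tweepy

# Placeholder credentials; real keys must be obtained from the Twitter developer portal.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_user_tweets(screen_name, n=1200):
    # Tweets from one user's timeline, used to analyze that user's preferences.
    return [status.full_text
            for status in tweepy.Cursor(api.user_timeline,
                                        screen_name=screen_name,
                                        tweet_mode="extended").items(n)]

def fetch_category_tweets(keyword, n=8000):
    # Anonymous tweets matching a category keyword, used as training data.
    return [status.full_text
            for status in tweepy.Cursor(api.search_tweets,
                                        q=keyword, lang="en",
                                        tweet_mode="extended").items(n)]

travel_training = fetch_category_tweets("road trip")  # hypothetical category keyword
user_tweets = fetch_user_tweets("Tesla")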
3.3. Preprocessing of Tweets

Once the data has been extracted, it must be prepared for the classification algorithms. Cleaning and preprocessing techniques must be applied so that the text is ready for the topic modeling algorithms. Libraries such as NLTK and spaCy have been used, as can be observed in Listing 1. The first step involves cleaning tweets by removing content that does not provide information for language processing. More specifically, this task consists in eliminating URLs, hashtags, mentions, punctuation marks, etc.

Another of the techniques applied to obtain more information from tweets is the transformation of the emojis contained in the text into a format from which it is possible to extract information. To do this, a dictionary of emojis is used as a starting point for the conversion of the data. This dictionary contains a series of values that interpret each of the existing emojis when applying the corresponding analysis. In this way, it has been possible to identify and give a certain value to each emoji for its treatment.

The key activity performed during the preprocessing consists of eliminating stopwords and tokenization. Whether it is a paragraph, an entire document or a simple tweet, every text contains a set of empty words or stopwords. This set of words is characterized by its continuous repetition in the document and its low value within the analysis. These words are mainly articles, determiners, synonyms, conjunctions, and others.

Listing 1. Preprocessing step pseudocode.

from nltk.tokenize import word_tokenize
import spacy

sp = spacy.load('en_core_web_sm')
stopwords_dict = sp.Defaults.stop_words

def tweet_preprocessing(tweet):
    # Remove content that carries no information for language processing
    tweet = hashtag_removal(tweet)
    tweet = mentions_removal(tweet)
    tweet = url_removal(tweet)
    tweet = html_removal(tweet)
    tweet = punctuation_removal(tweet)
    tweet = emojis_removal(tweet)
    # Tokenize and drop stopwords
    tweet = word_tokenize(tweet)
    tweet = [word for word in tweet if word not in stopwords_dict]
    return tweet

Table 4 shows the results obtained after the tweets have gone through the preprocessing and preparation process, which had been carried out using the tools listed above.

3.4. Vectorization

Vectorization is the application of models that convert texts into numerical vectors so that the algorithms can work with the data. Two algorithms have been considered for the performance of this task: Bag-of-Words and TF-IDF. Both are widely used in the field of NLP but, in general, the creation of tf-idf weights from text works well and is not very expensive computationally. Moreover, NMF expects a term-document matrix as input, typically normalized with tf-idf. The vectorizer has been tuned manually with some parameters according to the dataset, as can be observed in Listing 2. The min_df parameter was set to 100 to ignore words that appear in fewer than 100 tweets. In the same way, max_df was set to 0.85 to ignore words that appear in more than 85% of the tweets. Thanks to these settings, it is possible to remove words that introduce noise into the model. Finally, the algorithm only takes single words into account by default, so, in order to include bigrams, the parameter ngram_range was set to (1, 2).
Table 4. Preprocessing results using NLTK tokenization.

0. Text: "Read This Before Taking a Road Trip with a Pet"; Tokens: [read, taking, road, trip, pet]
1. Text: "@kenwardskorner @Senators @Canucks In addition, does . . ."; Tokens: [also, name, imply, take, acid, road, trips, i, . . .]
2. Text: "Our Art is our Passion \n#apnatruckart #truck"; Tokens: [art, passion, apnatruckart, truck, art, uniqu. . .]
3. Text: "Lelang drop acc budget 40–50 k dong?"; Tokens: [lelang, drop, acc, budget, dong]
4. Text: "We agree. . . and want everyone to know that ou"; Tokens: [agree, want, everyone, know, tours, relaxed, . . .]
5. Text: "Choosing a hotel for a break away with the fam. . ."; Tokens: [choosing, hotel, break, away, family, special. . .]
6. Text: "@_JassyJass Are you camping?"; Tokens: [camping]
7. Text: "How to Pack Your Electronics for Air Travel ht"; Tokens: [pack, electronics, air, travel]
8. Text: "Dasar low budget! https://t.co/2YUmUrGjj5"; Tokens: [dasar, low, budget]

Listing 2. Vectorization step pseudocode.

from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_vectorization(tweets):
    # tweets: list of token lists produced by the preprocessing step
    vectorizer = TfidfVectorizer(min_df=100,
                                 max_df=0.85,
                                 ngram_range=(1, 2),
                                 preprocessor=' '.join,
                                 use_idf=True)
    vectorized_tweets = vectorizer.fit_transform(tweets)
    return vectorized_tweets

3.5. Topic Modeling

Topic Modeling is a typical NLP task that aims to discover abstract topics in texts. It is widely used to discover hidden semantic structures. In the present work, this technique has been used to discover the main topics of interest of the My-Trac application users based on their Twitter profiles, which should correspond to some of the previously defined categories.

Regarding the features of the model, in Section 3.4 the training tweets were vectorized to create a term-document matrix, which has been the input of the NMF model. In addition, NMF needs one important parameter, the number of topics to be discovered, "n_components". In this case, n_components was set manually to 23, which is the number of topics that were defined initially in the categories taxonomy. Following this approach,
the algorithm is trained with 184,000 tweets (8000 per category) with the aim of obtaining as many topics as categories were defined in the taxonomy. Once the model has been trained, it has been possible to determine in which topics a user's profile fits on the basis of their tweets. The implementation of the topic modeling algorithm has been carried out on the basis of NMF using the scikit-learn library, as detailed in Listing 3.

Listing 3. NMF scikit-learn implementation.

from sklearn.decomposition import NMF

def train_model(vectorized_tweets):
    # 23 components: one topic per category defined in the taxonomy
    nmf_model = NMF(n_components=23,
                    alpha=.1,
                    l1_ratio=.5,
                    init='nndsvda')
    nmf_model.fit(vectorized_tweets)
    return nmf_model

Finally, it is worth mentioning some extra parameters that were set in the implementation of the model. The method used to initialize the procedure was set to "nndsvda", which works better with the tweet dataset since sparsity is not desired in this case. Alpha and l1_ratio are parameters that help define the regularization.

4. Evaluation and Results

In order to evaluate the results of the algorithm, the most relevant terms have been identified for each resulting topic. Then, by reviewing the main terms for each topic, it is possible to determine whether those words really represent the content of the topic. An example is shown in Figure 2, where the most relevant terms have been identified for four different topics, showing how well the algorithm identifies the terms associated with each one. As can be seen, all of them are unambiguously related to their defined categories.

Figure 2. Example topics generated by NMF: Topic 1 (Travel), Topic 2 (Arts & Entertainment), Topic 9 (Personal Finance), and Topic 10 (Pets).
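The top terms that appear in Figure 2 and in Table 5 below can be read off the fitted model. The sketch that follows assumes that the fitted TfidfVectorizer of Listing 2 and the trained NMF model of Listing 3 are both available in scope (Listing 2 as written returns only the transformed matrix, so the vectorizer object would also need to be kept); the helper name is ours.

import numpy as np

def top_terms_per_topic(nmf_model, vectorizer, n_top=10):
    # Each row of components_ holds the term weights of one topic.
    terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
    topics = {}
    for topic_idx, weights in enumerate(nmf_model.components_):
        top_idx = np.argsort(weights)[::-1][:n_top]
        topics[topic_idx] = [terms[i] for i in top_idx]
    return topics

for idx, words in top_terms_per_topic(nmf_model, vectorizer).items():
    print(idx, words)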
The full list of topics and their top 10 related keywords identified by the algorithm can be seen in Table 5. It should be noted that some of the categories previously defined in Table 3 were removed during the evaluation of this model. This is due to the lack of tweets that would fit into those categories, as well as to the fact that some topics overlapped considerably. The initially defined categories that were removed during the training process and evaluation are: "Home & Garden", "Real Estate", "Society", and "News". In the same way, the algorithm has been able to discover new categories related to the original ones, such as "Movies", "Videogames", "Music", "Events", and "Medicine & Health", leaving a total of 23 categories in the system.

Table 5. Topics obtained by the algorithm (topic and top 10 words).

1. Travel: travel, trip, budget, air, hotel, adventure, road, camp, family, day
2. Movies: movie, horror, action, comedy, movies, romance, watch, animation, family, drama
3. Videogames: game, xbox, pc, mmo, nintendo, videogame, esport, rpg, play, console
4. Careers: apprenticeship, internship, job, search, career, interview, vocational, training, remote, advice
5. Events: amusement, concert, cinema, restaurant, birthday, match, holiday, football, funeral, park
6. Health & Fitness: health, nutrition, therapy, physical, fitness, workout, exercise, wellness, medicine, weight
7. Religion & Spirituality: islam, christianity, hinduism, judaism, buddhism, spirituality, religion, astrology, atheism, sikhism
8. Shopping: grocery, lotto, shopping, gift, discount, sale, card, coupon, sales, code
9. Personal Finance: insurance, loan, credit, option, card, student, financial, service, assistance, phone
10. Pets: veterinary, dog, pug, puppy, aquarium, pet, cat, reptile, dolphin, hamster
11. Automotive: truck, car, auto, motorcycle, tesla, van, scooter, pickup, luxury, minivan
12. Science: chemistry, geography, geology, biology, physics, genetic, astronomy, environment, math, science
13. Law, Gov't & Politics: election, political, law, issue, vote, news, state, people, country, trump
14. Education: preschool, college, university, exam, electoral, education, homework, student, school, language
15. Food & Drink: coffee, vegetarian, beer, eat, vegan, cook, drink, wine, tea, dining
16. Family & Parenting: marriage, parent, single, baby, daycare, teen, date, life, toddler, adopt
17. Style & Fashion: wear, beauty, clothing, perfume, deodorant, wallet, casual, fashion, shave, trainer
18. Technology & Computing: software, app, developer, mongodb, database, email, android, internet, computer, ai
19. Hobbies & Interests: meme, draw, puzzle, collect, comic_strip, antique, guitar, art, woodworke, painting
20. Sports: martial, rugby, golf, sport, climb, pool, racing, cricket, skating, basketball
21. Business: industry, agriculture, construction, startup, recall, economy, business, automotive, butterball, turkey
22. Medicine & Health: vaccine, menopause, pregnancy, health, mental, surgery, injury, disease, psychology, substance
23. Music: music, radio, rock, funk, pop, soul, classic, songwriter, listen, classical

Once the resulting model has been evaluated and verified, the next step is to check the effectiveness of the model with real Twitter profiles. The tests have been performed by extracting 1200 tweets from different users and predicting for each user the most related topics based on their tweets.
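The paper does not detail how the per-tweet topic weights are aggregated into the percentages reported below; the sketch that follows shows one plausible aggregation (summing the per-tweet topic weights returned by the model and normalizing them), with helper and variable names of our own. It assumes the fitted vectorizer and NMF model from Listings 2 and 3, preprocessed tweets given as token lists, and a topic_labels list holding the 23 category names of Table 5 in the order learned by the model.

import numpy as np

def user_profile(user_tweets, vectorizer, nmf_model, topic_labels, top_n=3):
    # user_tweets: list of preprocessed, tokenized tweets from one profile.
    X = vectorizer.transform(user_tweets)         # same tf-idf space as training
    weights = nmf_model.transform(X).sum(axis=0)  # aggregate topic weights over all tweets
    weights = weights / weights.sum()             # normalize to a percentage-like distribution
    ranked = np.argsort(weights)[::-1][:top_n]
    return [(topic_labels[i], round(100 * weights[i], 2)) for i in ranked]

# Hypothetical usage; compare Table 7, which reports Automotive (70.92%),
# Tech & Computing (10.07%) and Travel (3.86%) for the Tesla account.
print(user_profile(tesla_tweets, vectorizer, nmf_model, topic_labels))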
The final test results are shown in Table 6, where it can be observed how each profile name matches with related topics according to the profile.
As an example, the main topics for the profile "Tesla" are "Automotive", "Technology & Computing", and "Travel". Finally, in order to suggest the main topics of a specific user in the My-Trac app, the model returns, for each user, the associated categories along with the percentage of weight that each category has for the user. The lower the percentage, the less relation the user has with the category. The results of the final classification using some known Twitter accounts are given in Table 7. It should be noted that only the three main categories are shown in the table (together with their associated percentage), as they are the most accurate for categorizing the user.

Table 6. NMF evaluation with data from real Twitter profiles (Cat 1, Cat 2 and Cat 3 are the three main categories).

Pontifex: Religion & Spirituality; Family & Parenting; Tech & Computing
Tesla: Automotive; Tech & Computing; Travel
BBCNews: Law, Gov't & Politics; Sports; Family & Parenting
NintendoAmerica: Videogames; Hobbies & Interests; Sports
Theresa_may: Law, Gov't & Politics; Business; Personal Finance
Oprah: Events; Family & Parenting; Sports
SkyFootball: Sports; Events; Hobbies & Interests
ScuderiaFerrari: Sports; Automotive; Law, Gov't & Politics
IMDb: Events; Movies; Hobbies & Interests
ScienceMagazine: Science; Medicine & Health; Tech & Computing
Spotify: Music; Hobbies & Interests; Family & Parenting
Airbnb: Travel; Events; Careers

Table 7. Final results with different accounts (first, second and third category with their associated weights).

Tesla: Automotive (70.92%); Tech & Computing (10.07%); Travel (3.86%)
RealDonaldTrump: Law, Gov't & Politics (28.02%); Business (11.36%); Sports (8.02%)
ScienceMagazine: Science (24.41%); Medicine & Health (16.73%); Tech & Computing (12.03%)
NintendoAmerica: Videogames (41.65%); Hobbies & Interests (12.19%); Sports (9.74%)

5. Final System Integration in the My-Trac Application

Having completed the entire research and evaluation process, a trained algorithm capable of classifying different Twitter accounts according to defined and discovered categories has been obtained. In addition, a reliable data extraction method has been developed. Therefore, the next step consists of applying the algorithm to the My-Trac app to create a system that provides recommendations to My-Trac users based on their Twitter profiles, which is the objective of the present work.

The final system for the My-Trac app consists of a mobile app where the user logs in, as can be seen in Figure 3, and is asked to grant access to their Twitter data. Once the user signs in to the application, My-Trac searches for the optimal means of transport to reach a specific destination given by the user and suggests the best conveyance for the trip, as Figures 4 and 5 show.
Finally, when the user chooses the route and means of transport that best fit their trip, the My-Trac app, based on the present work, recommends activities and points of interest for the user along the way, based on their Twitter information, as can be observed in Figures 6 and 7.

Figure 3. My-Trac application.

Figure 4. Trip planning using the My-Trac app.
Figure 5. My-Trac suggests the optimal means of transport for the destination.

Figure 6. My-Trac recommends activities and points of interest for the user using their Twitter information.
Figure 7. My-Trac suggestions.

Moreover, it is possible to get detailed information for each recommended activity, as Figure 8 shows. In this way, thanks to the My-Trac app, the user can improve their experience not only by receiving suggestions for the best conveyance for the trip but also by receiving customized activity recommendations and points of interest.

Figure 8. Detailed information about a suggested point of interest.

6. Conclusions and Future Work

This article presents a novel approach to extracting preferences from a Twitter profile by analyzing the tweets published by the user, for use in mapping applications. This approach has successfully defined a consistent and representative list of categories, and the mechanisms needed for information extraction have been developed, both for model training and end-user analysis. It is a unique system, with which it has been possible to
develop an important feature in the My-Trac app, whereby it is possible to recommend relevant points of interest to the end users.

Regarding future work on this system, many areas of improvement and development have been identified. Tweets are not the only source of information that allows the interests of a profile to be discerned. It may be the case that a user only writes about football but follows many news-related and political accounts; the current system would only be able to extract the sports category. Therefore, one of the improvements would be the implementation of a model that analyzes followed users. This has been started by extracting the followed accounts and creating wordclouds with the most relevant ones. Similarly, hashtags also provide additional information suitable for analysis. Another line of research is the training of a model that allows tweets to be analyzed individually. This would open the door to performing a polarity analysis that would allow us to know whether a user who writes about a certain category does so in a positive, negative, or neutral way.

As for the limitations of the system, it is possible that, in some regions, there may be restrictive regulations on the use of information published on social networks for this type of analysis. Therefore, a study of data protection and the legal framework should be carried out for each region where the service is to be provided. Furthermore, in terms of performance, it is possible that specific context-dependent systems training an algorithm for each individual user may perform slightly better than the proposed solution.

Author Contributions: Conceptualization, A.R. and A.G.-B.; methodology, A.R. and A.G.-B.; software, A.R. and J.J.C.-M.; validation, A.R., A.G.-B., J.J.C.-M. and A.P.-P.; formal analysis, A.R. and J.J.C.-M.; investigation, A.R., A.G.-B. and A.P.-P.; resources, A.R., A.G.-B., J.J.C.-M.; data curation, J.J.C.-M.; writing—original draft preparation, A.R., A.G.-B., J.J.C.-M.; writing—review and editing, A.R., A.G.-B., J.J.C.-M., A.P.-P. and J.M.C.; visualization, A.R. and J.J.C.-M.; supervision, A.G.-B., A.P.-P. and J.M.C.; project administration, A.G.-B., A.P.-P. and J.M.C.; funding acquisition, J.M.C. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the Spanish Ministerio de Ciencia e Innovación under grant number TIN2017-89314-P.

Acknowledgments: This research has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 777640.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Ludwig, B.; Zenker, B.; Schrader, J. Recommendation of Personalized Routes with Public Transport Connections. In Proceedings of the International Conference on Intelligent Interactive Assistance and Mobile Multimedia Computing, Rostock-Warnemünde, Germany, 9–11 November 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 97–107.
2. Cui, G.; Luo, J.; Wang, X. Personalized travel route recommendation using collaborative filtering based on GPS trajectories. Int. J. Digit. Earth 2018, 11, 284–307. [CrossRef]
3. Briedenhann, J.; Wickens, E. Tourism routes as a tool for the economic development of rural areas—Vibrant hope or impossible dream? Tour. Manag. 2004, 25, 71–79. [CrossRef]
4. Cea-Morán, J.J.; González-Briones, A.; De La Prieta, F.; Prat-Pérez, A.; Prieto, J. Extraction of Travellers' Preferences Using Their Tweets.
In Proceedings of the International Symposium on Ambient Intelligence, L'Aquila, Italy, 17–19 June 2020; pp. 224–235.
5. De Pessemier, T.; Minnaert, J.; Vanhecke, K.; Dooms, S.; Martens, L. Social recommendations for events. In Proceedings of the CEUR Workshop Proceedings, Miami, FL, USA, 1 October 2013; Volume 1066.
6. Stathopoulos, E.A.; Paliokas, I.; Meditskos, G.; Diplaris, S.; Tsafaras, S.; Valkouma, E.; Pehlivanides, G.; Riggas, C.; Vrochidis, S.; Votis, K.; et al. Smart discovery of cultural and natural tourist routes. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence-Companion, Thessaloniki, Greece, 14–17 October 2019; pp. 208–214.
7. Sansonetti, G.; Gasparetti, F.; Micarelli, A.; Cena, F.; Gena, C. Enhancing cultural recommendations through social and linked open data. User Model. User-Adapt. Interact. 2019, 29, 121–159. [CrossRef]
8. Garcia, A.; Arbelaitz, O.; Linaza, M.T.; Vansteenwegen, P.; Souffriau, W. Personalized tourist route generation. In Proceedings of the International Conference on Web Engineering, Vienna, Austria, 5–9 July 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 486–497.
9. University, Y. About Yale: Yale Facts. 2017. Available online: https://www.yale.edu/about-yale/yale-facts (accessed on 24 May 2021).
10. Demestichas, K.; Kosmides, P. An offline, statistical method for cost efficient design of experiments and field trials involving electric vehicles. In Proceedings of the 11th ITS European Congress, Glasgow, Scotland, 6–9 June 2016.
11. Ferrari, A.; Donati, B.; Gnesi, S. Detecting domain-specific ambiguities: An NLP approach based on Wikipedia crawling and word embeddings. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW), Lisbon, Portugal, 4–8 September 2017; pp. 393–399.
12. Wallace, E.; Wang, Y.; Li, S.; Singh, S.; Gardner, M. Do NLP models know numbers? Probing numeracy in embeddings. arXiv 2019, arXiv:1909.07940.
13. Church, K.W. Word2Vec. Nat. Lang. Eng. 2017, 23, 155–162. [CrossRef]
14. Rong, X. word2vec parameter learning explained. arXiv 2014, arXiv:1411.2738.
15. Lilleberg, J.; Zhu, Y.; Zhang, Y. Support vector machines and word2vec for text classification with semantic features. In Proceedings of the 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), Beijing, China, 6–8 July 2015; pp. 136–140.
16. Bilgin, M.; Şentürk, İ.F. Sentiment analysis on Twitter data with semi-supervised Doc2Vec. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), London, UK, 5–7 July 2017; pp. 661–666.
17. Chen, Q.; Sokolova, M. Word2Vec and Doc2Vec in unsupervised sentiment analysis of clinical discharge summaries. arXiv 2018, arXiv:1805.00352.
18. Islam, T. Yoga-veganism: Correlation mining of Twitter health data. arXiv 2019, arXiv:1906.07668.
19. Chen, Y.; Zhang, H.; Liu, R.; Ye, Z.; Lin, J. Experimental explorations on short text topic mining between LDA and NMF based schemes. Knowl.-Based Syst. 2019, 163, 1–13. [CrossRef]
20. Resnik, P.; Armstrong, W.; Claudino, L.; Nguyen, T.; Nguyen, V.A.; Boyd-Graber, J. Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 99–107.
21. Mehrotra, R.; Sanner, S.; Buntine, W.; Xie, L. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 28 July–1 August 2013; pp. 889–892.
22. Tajbakhsh, M.S.; Bagherzadeh, J. Semantic knowledge LDA with topic vector for recommending hashtags: Twitter use case. Intell. Data Anal. 2019, 23, 609–622. [CrossRef]
23. Wang, H.; Can, D.; Kazemzadeh, A.; Bar, F.; Narayanan, S. A system for real-time Twitter sentiment analysis of 2012 US presidential election cycle. In Proceedings of the ACL 2012 System Demonstrations, Association for Computational Linguistics, Jeju Island, Korea, 8–14 July 2012; pp. 115–120.
24. Cotelo, J.M.; Cruz, F.L.; Enríquez, F.; Troyano, J. Tweet categorization by combining content and structural knowledge. Inf. Fusion 2016, 31, 54–64. [CrossRef]
25. Lee, K.; Palsetia, D.; Narayanan, R.; Patwary, M.M.A.; Agrawal, A.; Choudhary, A. Twitter trending topic classification. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, Vancouver, BC, Canada, 11 December 2011; pp. 251–258.
26. IAB Categories | MoPub Publisher UI | MoPub Developers. Available online: https://developers.mopub.com/publishers/ui/iab-category-blocking/ (accessed on 4 May 2021).
27. Tweepy. Available online: https://www.tweepy.org/ (accessed on 4 May 2021).