TweeNLP: A Twitter Exploration Portal for Natural Language Processing
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
TweeNLP: A Twitter Exploration Portal for Natural Language Processing Viraj Shah, Shruti Singh and Mayank Singh Indian Institute of Technology Gandhinagar, Gujarat, India {shah.viraj, singh shruti, singh.mayank}@iitgn.ac.in Abstract Although several mailing lists, slack channels, We present T WEE NLP, a one-stop portal that and subreddits exist for communication, most natu- organizes Twitter’s natural language process- ral language processing (henceforth NLP) commu- arXiv:2106.10512v1 [cs.CL] 19 Jun 2021 ing (NLP) data and builds a visualization and nity discussions are primarily carried out on Twit- exploration platform. It curates 19,395 tweets ter due to its open accessibility and wider reach. (as of April 2021) from various NLP confer- Announcements of calls for papers and submis- ences and general NLP discussions. It sup- sion deadlines, recently accepted papers, interest- ports multiple features such as TweetExplorer ing talks and seminars, lecture videos, and tutori- to explore tweets by topics, visualize insights from Twitter activity throughout the organiza- als on various topics are often posted on Twitter. tion cycle of conferences, discover popular re- These are a great medium to stay updated on the search papers and researchers. It also builds recent developments in the NLP field. It is also a timeline of conference and workshop sub- a medium for researchers to engage in informal mission deadlines. We envision T WEE NLP research discussions which might be unreported to function as a collective memory unit for in official publications. We present a sample of the NLP community by integrating the tweets diverse NLP tweets in Figure 1 to emphasise the pertaining to research papers with the NLPEx- plorer scientific literature search engine. The utility of the platform. current system is hosted at URL. However, unlike subreddits or communities like Stack Overflow and AskUbuntu, Twitter is not an 1 Introduction exclusive channel for NLP discussions. Exclusive Online communication channels have become pop- channels provide users a one-stop destination for ular in the Internet era, and several online commu- their interests and allow extremely topic-specific nities of like-minded people have evolved around exploration. While Twitter allows search by hash- these channels. For example, communities such tags to narrow down to specific topics, the usage as Stack Overflow and AskUbuntu are question- of hashtags is highly irregular. Furthermore, Twit- answering forums; Twitter and Reddit are content- ter is more suited to live discussions and less suit- sharing forums. These forums over the years have able for maintaining a snapshot of the discussions provided a platform for novice users to learn from taking place in the online community. Relevant the experts, facilitated discussions among the com- Twitter discussions about specific research papers munity members, and have over the years accumu- are often forgotten in the long run because there lated a rich database of questions, answers, and is no infrastructure to link these discussions with discussions. the papers on the proceedings archives or research According to the theory of diffusion of inno- paper search engines. In an attempt to address vation proposed by Rogers (2003), the communi- these issues, we extend the functionality of NLP- cation channel is one of the four main elements Explorer (Parmar et al., 2020) platform by integrat- influencing the spread of a new idea. Notably, the ing T WEE NLP with it. NLPExplorer is a portal for communication channel serves as a collective long- searching, and visualizing NLP research volume term memory or a knowledge archive of the com- based on the ACL Anthology (ACL Anthology). In munity, which any member can access to study the our current work, we build an automatic pipeline community’s stance on diverse topics at any point for curating NLP tweets and build a one-stop por- in time. tal - T WEE NLP, for the search and browsing of
Figure 1: A sample of diverse natural language processing tweets. NLP discussion on Twitter. The system has curated papers. 19,395 NLP tweets as of April 2021. 2 NLPExplorer T WEE NLP organizes NLP tweets into topics: (i) New paper announcements, (ii) Call for Pa- NLPExplorer1 (Parmar et al., 2020) is an auto- per announcements, (iii) Reading Materials & Tu- matic portal for indexing, searching, and visualiz- torials, (iv) Career Opportunities, (v) Talks & ing Natural Language Processing research volume. Seminars, and (vi) Others. Topic-wise tweets It presents multiple paper, venue, and author statis- are presented via dashboards for easy exploration. tics, including paper citation distribution, paper T WEE NLP supports dashboards to browse through topic distribution, authors, their field of study, their popular NLP tweets in the previous week and the citation distributions, etc. It also presents category month. We construct a CFP Timeline from ‘Call information of research papers into various topics for Papers’ announcements on Twitter and arrange broadly arranged in five categories: (i) Linguistic it according to the upcoming submission deadlines Target (Syntax, Discourse, etc.), (ii) Tasks (Tag- of various workshops and conferences. We link the ging, Summarization, etc.), (iii) Approaches (un- research paper tweets to the research paper’s meta- supervised, supervised, etc.), (iv) Languages (En- data, accessible via the NLPExplorer paper discov- glish, Chinese, etc.) and (v) Dataset types (news, ery feature. We also build live Conference Visu- clinical notes, etc.). The current snapshot consists alization dashboards, which curate tweets about of 75k research papers and 50k authors. Since its the conference schedule, ongoing talks, poster ses- inception, it has been accessed by more than 7.3k sions, and interesting papers at the conference, and users having a close to 9.7k sessions. present statistics such as popular hashtags, users, tweet languages, etc. 3 Dataset We integrate T WEE NLP with NLPExplorer We curate the dataset from two primary sources: (Section 2) to build a joint-portal that aims to bridge 3.1 Twitter the gap between published research and its infor- mal communication on the social media platform We curate the Twitter data using the open-source Twitter. Our automatic data curation pipeline and library Twint2 by retrieving tweets with the hashtag the architecture of the system is described in Sec- NLProc. We also curate tweets with NLP confer- tion 3 and Section 4 respectively. We describe the ence hashtags such as #acl2020, #emnlp2020, etc. features of T WEE NLP in detail in Section 5. In The list of NLP conferences is compiled via ACL Section 6, we discuss previous works in organiz- 1 http://nlpexplorer.org/ 2 ing the NLP literature and visualization of research https://github.com/twintproject/twint
sented in Section 5.1. We experiment by fine- tuning a BERT-base(Devlin et al., 2019) clas- sifier and twitter-roberta-base(Barbieri et al., 2020) to predict the tweet topics. The BERT- base model4 obtains the best test accuracy of 75% on a small manually annotated dataset5 . 2. Conference Page Builder: The Conference Page Builder classifies a tweet either as dis- cussing an ongoing conference or other topics. The module builds specific conference pages Figure 2: The architecture of T WEE NLP. Arrow direc- using such tweets. tions denote the flow of data. AAD represents the ACL 3. CFP Timeline Builder: The module processes Anthology Dataset which is the other data source apart ‘Call for Papers’ tweets identified by the Tweet from Twitter. Classifier module. It extracts the conference (and workshop) name by regex-based key- Anthology. Our system is scheduled to download word matching against a pre-compiled list of the Twitter data for each day automatically. For on- venues. The submission date are extracted going conferences, our system curates new tweets from the tweets by labeling dates using the every hour to continually update the Conference Spacy6 library. The tweets are arranged in a Visualizer page. The current snapshot (as of April timeline sorted by the submission deadline. 2021) contains data since October 2017 (around 4. Paper Tweet Linker: The Paper Tweet Linker 1300 days) and consists of 19,395 tweets. module maps specific tweets to research pa- pers using regex matching of the paper title 3.2 ACL Anthology and paper URL. The Paper Tweet Visualizer We curate the conference and journal names and uses these mappings to embed the tweets on URLs from the ACL Anthology github reposi- the research paper page on NLPExplorer. tory3 . We also curate the paper titles and their links. The pipeline then stores the tweets in the database Tweets are collected periodically every day, and the after processing by the above modules. We sched- system checks for paper mentions in the tweets by ule our system to automatically curate the Twitter substring matching the paper URLs collected from data daily and increase it to an hourly frequency the ACL Anthology github repository. during ongoing conferences. 4 Architecture 5 T WEE NLP Features We present the pipeline of our system in Figure 2. 5.1 Tweet Explorer The Data Curator module curates tweets daily. The We present a Tweet Explorer dashboard that allows curated tweets are processed before we perform fur- a user to browse tweets from specific topics such ther steps. The following modules process tweets: as: (i) Tweet Classifier, (ii) Conference Page Builder, 1. New paper announcements: This topic orga- (iii) CFP Timeline Builder, and (iv) Paper Tweet nizes tweets about recent papers, which often Linker. We describe the tweet processing modules involve the summary or a short introduction in detail below: of the research paper. These twitter threads 1. Tweet Classifier: The Tweet Classifier module facilitate other researchers to communicate in- classifies a tweet into one of the six topics: (i) formally with the paper authors. These also New Paper Announcements, (ii) Call for Pa- contain interesting discussions by the commu- per announcements, (iii) Reading Materials & nity on the insights, merits, and critiques of Tutorials, (iv) Career Opportunities, (v) Talks the research paper, and post questions about & Seminars, and (vi) Others. The Tweet Ex- plorer feature utilizes these tweet categories. 4 We also experimented with a zero-shot classifier but it The detailed description of each topic is pre- underperformed the BERT-base classifier. 5 each tweet was annotated by two ML/NLP students and 3 inter annotator agreement computed using Cohen’s κ=0.68 https://github.com/acl-org/ 6 acl-anthology https://spacy.io/
the work. The authors’ short introductions of- Topic Tweet Count New Paper Announcements 6,337 fer an informal account of the paper compared Call for Papers 972 to the paper alert services that usually present Reading Materials & Tutorials 1,400 the title and the abstract of the research paper. NLP Career Opportunities 681 NLP Talks & Seminars 2,382 2. Call for Papers (CFPs) by various confer- Others 7,623 ences and workshops: Users can view the Total Tweets 19,395 announcements for call for papers and sub- mission deadlines by various workshops and Table 1: Distribution of tweets (curated since October 2017) into various topics. conferences. 3. Reading Materials & Tutorials: It lists var- ious study material, such as lecture slides tags, top linked URLs, and top discussed papers and videos, tutorials, online courses, and blog in tweets. We present the most popular hashtags, posts. mentions, URLs, and highly discussed papers for 4. NLP Career Opportunities: Individuals fre- ACL2020 in Table 2. A summary of Twitter activ- quently advertise opportunities for various po- ity from the Conference Visualizer page for ACL sitions such as interns, full-time, Ph.D., post- 2020 is presented in Table 3. Apart from Twitter doctoral fellows, and research fellows on Twit- discussions about a conference in a specific month, ter. we also show insights from the conferences across 5. NLP Talks & Seminars: Various online NLP the year. The insights from ACL conference over talks and seminars can be accessed using the time is presented in Figure 4. We also present other NLP Talks & Seminars filter on the Tweet conference-specific statistics such as the number Explorer dashboard. of tweets per month, daily distribution of tweets 6. Others: This category contains the NLP in the conference month, most active users tweet- tweets which do not belong to any of the above ing about the conference, and a distribution of the topics. tweet languages other than English. The Tweet Explorer feature allows users to specif- ically browse through tweets by topics and filter 5.3 Popular Paper Visualizer them based on their immediate interests. A snap- shot of the same is presented in Figure 3. We We showcase widely discussed papers on Twitter in present the distribution of tweets in the six cate- the Popular Paper Visualizer dashboard. It presents gories from tweets curated by the system in the last the titles and provides direct links to the full-text 1,300 days in Table 1. of the top discussed papers for quick reference. The system extracts tweets mentioning research pa- pers and assigns a popularity score to each paper based on the count of tweets that mention it, and the likes, retweets, and replies on the paper tweets. We present a snapshot of few popular papers iden- tified by our platform in Figure 5. It also presents the most active users tweeting about #NLProc on Twitter. Popular Paper Visualizer dashboard also supports exploration of most liked and retweeted Figure 3: Tweet Explorer feature of T WEE NLP which #NLProc tweets of all times and in the last month. facilitates browsing tweets by six different topics. 5.4 CFP Timeline 5.2 Conference Visualizer – Near real-time T WEE NLP presents a timeline of the upcoming view for conferences submission deadlines. The timeline is created by T WEE NLP supports real-time statistics for mul- identifying ‘Call for Papers’ tweets using keyword tiple top conferences and the popular #NLProc based filtering of tweets and also lists the confer- hashtag. The information is updated hourly for ence/workshop website. The details are described live events and weekly for past events. Some of in the CFP Timeline Builder module 3. We present the statistics presented are top mentions, top hash- a snapshot of the timeline in Figure 6.
Top Hashtags Top Mentions Top URLs Top Papers Discussed Beyond Accuracy: Behavioral Testing #acl2020nlp @aclmeeting virtual.acl2020.org/socials.html of NLP models with CheckList virtual.acl2020.org/plenary Photon: A Robust Cross-Domain Text- #acl2020en @emilymbender session keynote kathy mckeown.html to-SQL System Climbing towards NLU: On Meaning, #nlproc @akoller virtual.acl2020.org/paper main.701.html Form, and Understanding in the Age of Data Language Models as an Alternative Eval- #acl2020zht @winlpworkshop virtual.acl2020.org/workshop W1.html uator of Word Order Hypotheses: A Case Study in Japanese www.aclweb.org/anthology/ Don’t Stop Pretraining: Adapt Language #acl2020hi @xandaschofield 2020.acl-main.442/ Models to Domains and Tasks The State and Fate of Linguistic Diver- #mt @gneubig virtual.acl2020.org/workshop W10.html sity and Inclusion in the NLP World Table 2: ACL 2020 Twitter Coverage: Top discussed papers, mentions and URLs and popular hashtags. Tweet Counter Likes Counter Retweet Counter Unique Mentions Unique Paper Mentions 5,343 58,160 11,440 907 251 Table 3: ACL 2020 Twitter Statistics of various activities. Figure 4: Conference Visualizer: ACL2020 Statistics. (a) Distribution of tweets across different months. (b) Daily distribution of ACL2020 tweets in July 2020. (c) Distribution of tweet languages except English. (d) Twitter users with highest ACL2020 tweets. 5.5 Paper Tweet Visualizer ally, it also provides interesting insights like similar NLPExplorer supports a research paper search papers, topical distribution and mentioned URLs. interface and builds research paper pages which We map research paper discussion tweets on Twit- showcase standard paper related statistics such as ter to the NLPExplorer paper page. This feature the publication year and venue, author information, allows users to browse through discussions about citations, citation distribution over the years and the paper along with the metadata of the paper. We the link to the corresponding PDF article. Addition- present a snapshot of the feature in Figure 7.
Figure 5: Popular papers identified T WEE NLP based on twitter activity. the ACL Anthology Network (AAN) by manual annotation of the references to complete the cita- tion network and analysed the network to present central papers, authors and other network statis- tics (Radev et al., 2016). Works by Schäfer et al. (2011) and Parmar et al. (2020) provide a compre- hensive search interface to browse through the NLP based on parameters such as author, full text, year of publication, title, and the field of study. Moham- mad (2020) built the NLPScholar platform which Figure 6: CFP Timeline built from tweets. ‘W’ on top- consists of interactive dashboards that present vari- right denotes Workshop. ous aspects of NLP research papers. The platform uses ACL Anthology and Google Scholar as the information source. Few works have analysed Twitter data to predict scholarly impact. Shuai et al. (2012) report a sta- tistical correlation between high volume of Twitter mentions and arXiv downloads and early citations (i.e., citations occurring less than seven months af- ter the publication of a preprint). However, they also point out that Twitter mentions cannot be di- Figure 7: Paper Tweet Visualizer curates tweets and rectly concluded to be causative of higher levels of metadata of a research paper on a joint portal. The download and early citations. Several other works image background is a ‘Paper’ page from NLPEx- such as Eysenbach (2011), Thelwall et al. (2013), plorer which lists paper metadata, citing papers, field- and Haustein et al. (2014) have tried to analyze of-study tags, and similar papers alongwith the associ- whether tweets correlate with citations. ated tweets. However, to the best of our knowledge, no prior work has tried to curate NLP discussions data from 5.6 Popular Last Week Twitter in an attempt to organize it and link it to re- Lastly, we present popular tweets in the NLP com- search papers via a search engine or a visualization munity on Twitter (also referred as NLP Twitter). portal. This feature allows researchers to catch up with the recent NLP-related Twitter discussions in a single 7 Future Scope and Extensions dashboard without searching for them specifically Currently, the system is implemented only for NLP in the Twitter feed. papers present in the ACL Anthology. The system 6 Related Works could be extended to papers from NeurIPS, ICLR, and CVPR as the data for these conferences is avail- Bird et al. (2008) curated the ACL Anthology Ref- able publicly. The system is versatile and can be erence Corpus (ACL ARC) of research papers in easily extended to other domains. T WEE NLP pro- NLP and CL. Radev et al. (2009, 2013) constructed vides basic visualization graphs over Twitter activ-
ity. Over time, these discussions could be used to Monarch Parmar, Naman Jain, Pranjali Jain, P Jayakr- build a timeline of evolution of research in various ishna Sahit, Soham Pachpande, Shruti Singh, and Mayank Singh. 2020. NLPExplorer: exploring the domains of NLP based on the Twitter activity of universe of nlp papers. In European Conference on researchers. Tweets by popular users attain likes Information Retrieval, pages 476–480. Springer. and retweets at a higher rate in comparison to new users (or users with less followers) of the commu- Dragomir R Radev, Mark Thomas Joseph, Bryan Gib- son, and Pradeep Muthukrishnan. 2016. A biblio- nity. T WEE NLP currently only presents popular metric and network analysis of the field of computa- tweets based on retweets and likes count which tional linguistics. Journal of the Association for In- can bias the conversations, understanding and pre- formation Science and Technology, 67(3):683–706. sentation of ideas by emphasising the tweets of a Dragomir R Radev, Pradeep Muthukrishnan, and Va- small set of popular users. Future work includes hed Qazvinian. 2009. The ACL Anthology Network identifying novel alternative ideas and perspectives Corpus. ACL-IJCNLP, page 54. by adjusting user popularity to create an inclusive Dragomir R Radev, Pradeep Muthukrishnan, Vahed space for the community. Qazvinian, and Amjad Abu-Jbara. 2013. The ACL anthology network corpus. Language Resources and Evaluation, 47(4):919–944. References Everett M. Rogers. 2003. Diffusion of innovations, 5th ACL Anthology. ACL Anthology. https://www. edition. Free Press, New York, NY [u.a.]. aclweb.org/anthology/. Accessed: 2021-03- 01. Ulrich Schäfer, Bernd Kiefer, Christian Spurk, Jörg Steffen, and Rui Wang. 2011. The ACL Anthology Francesco Barbieri, Jose Camacho-Collados, Luis searchbench. In Proceedings of the ACL-HLT 2011 Espinosa-Anke, and Leonardo Neves. 2020. Tweet- System Demonstrations, pages 7–13, Portland, Ore- Eval:Unified Benchmark and Comparative Evalua- gon. Association for Computational Linguistics. tion for Tweet Classification. In Proceedings of Xin Shuai, Alberto Pepe, and Johan Bollen. 2012. How Findings of EMNLP. the scientific community reacts to newly submitted preprints: Article downloads, twitter mentions, and Steven Bird, Robert Dale, Bonnie J Dorr, Bryan Gib- citations. PloS one, 7(11):e47523. son, Mark Thomas Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R Radev, and Yee Fan Mike Thelwall, Stefanie Haustein, Vincent Larivière, Tan. 2008. The ACL Anthology Reference Corpus: and Cassidy R Sugimoto. 2013. Do altmetrics work? A Reference Dataset for Bibliographic Research in twitter and ten other social web services. PloS one, Computational Linguistics. Language Resources 8(5):e64841. and Evaluation Conference. J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidi- rectional transformers for language understanding. In NAACL-HLT. Gunther Eysenbach. 2011. Can tweets predict cita- tions? Metrics of social impact based on Twit- ter and correlation with traditional metrics of scien- tific impact. Journal of medical Internet research, 13(4):e123. Stefanie Haustein, Isabella Peters, Cassidy R Sugimoto, Mike Thelwall, and Vincent Larivière. 2014. Tweet- ing biomedicine: An analysis of tweets and cita- tions in the biomedical literature. Journal of the As- sociation for Information Science and Technology, 65(4):656–669. Saif M. Mohammad. 2020. NLP scholar: An interac- tive visual explorer for natural language processing literature. In Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics: System Demonstrations, pages 232–255, Online. As- sociation for Computational Linguistics.
You can also read