The curious case of posts on Stack Overflow - Shailja Shukla - DiVA portal
The curious case of posts on Stack Overflow

Shailja Shukla

Subject: Information Systems
Corresponds to: 30 hp
Presented: VT 2020
Supervisor: Mudassir Imran Mustafa
Department of Informatics and Media
Contents

Abstract
Acknowledgements
Chapter 1
1. Introduction
1.1 Background
1.2 Motivation
1.3 Research Questions
1.4 Delimitation:
1.5 Limitation:
Chapter 2
2. Theory
2.1 Topic Modelling:
2.2 Latent Dirichlet Allocation (LDA):
2.3 Related Work
Chapter 3
3. Methodology:
3.1 Data Collection:
3.2 Data Extraction:
3.2.1 Schema:
3.3 Data Pre-processing:
3.3.1 Subset corpus data:
3.3.2 Remove code snippets:
3.3.3 Combine related documents to form a single corpus:
3.3.4 Tokenization:
3.3.5 Lowercasing:
3.3.6 Remove punctuations:
3.3.7 Text Standardization/Replace Contractions:
3.3.8 Remove stop words:
3.3.9 Remove URLs:
3.3.10 Minimum size words:
3.3.11 Remove multiple whitespaces:
3.3.12 Generate N-Grams:
3.3.13 Stemming:
3.3.14 Lemmatisation:
3.4 Create Dictionary and Term Document Frequency:
3.5 Run the LDA model:
Chapter 4
4. Analysis:
Chapter 5
5. Result
5.1 RQ1 - What are the popular discussion topics in Stack Overflow?
5.1.1 Web as a recurring discussion topic:
5.1.2 UI Development as a recurring discussion topic:
5.1.3 Data management as a recurring discussion topic:
5.2 RQ2 - How does the developer's interest change over time?
5.3 RQ3 - How do the interests in specific technologies change over time?
5.3.1 React vs Angular
5.3.2 Python vs JavaScript
5.3.3 Popular discussion topics related to Web technologies
5.3.4 Relational Databases (RDBMS)
5.3.5 Android vs iOS
5.3.6 Object-Oriented Programming
5.3.7 Machine Learning
Chapter 6
6. Validity of research and experiences:
Chapter 7
7. Conclusion:
Chapter 8
8. Discussion & Future Work:
Appendix 1: Tools and technology
Appendix 2: Popular discussion topics lists among developers
Appendix 3: Acronym / Abbreviation Table
References
Table of Figures:

Figure 1: Venn Diagram of the intersection of the Text Mining and six related fields (Miner et al., 2012)
Figure 2: Schematic Overview of LDA (Debortoli et al., 2016)
Figure 3: Methodology Model
Figure 4: Sample user post before cleaning of code snippet from the text content
Figure 5: Sample user post after cleaning of code snippet from the text content
Figure 6: Title of sample user post
Figure 7: Body of sample user post
Figure 8: Combined title and body of sample user post text
Figure 9: Sample text before pre-processing
Figure 10: Sample text after partial pre-processing
Figure 11: Sample text before stemming and lemmatisation
Figure 12: Sample text after stemming and lemmatisation
Figure 13: Sample pre-processed text
Figure 14: Term Document Frequency of sample text, generated from a dictionary
Figure 15: Post types count
Figure 16: Question Answer Ratio
Figure 17: Coherence score for different values of K (number of topics)
Figure 18: Intertopic distance map
Figure 19: Sample bar chart showing top 30 relevant terms for the topic (Topic 18 - Function)
Figure 20: Top 20 trending tags
Figure 21: React vs Angular trend
Figure 22: Python vs JavaScript trend
Figure 23: Web technology trends
Figure 24: Relational DBMS trend
Figure 25: Android vs iOS trend
Figure 26: Object-Oriented Programming language trend
Figure 27: Machine Learning language trend
Figure 28: Topic 1 - Machine Learning
Figure 29: Topic 2 - JavaScript UI development
Figure 30: Topic 3 - Relational DBMS
Figure 31: Topic 4 - UI development
Figure 32: Topic 5 - Object-Oriented Programming
Figure 33: Topic 6 - Web Design
Figure 34: Topic 7 - Web Development
Figure 35: Topic 8 - Data warehousing
Figure 36: Topic 9 - Mobile Development
Figure 37: Topic 10 - Text processing
Figure 38: Topic 11 - Coding style / practice
Figure 39: Topic 12 - CLI programming
Figure 40: Topic 13 - Web Service Application
Figure 41: Topic 14 - Tabular data
Figure 42: Topic 15 - Security / Authentication
Figure 43: Topic 16 - Version Control Management
Figure 44: Topic 17 - File Operations
Figure 45: Topic 18 - Function
Figure 46: Topic 19 - Cloud / Container technologies
Figure 47: Topic 20 - Server-Side Scripting

List of Tables

Table 1: Stack Overflow Posts schema
Table 2: Post type and Post type Id
Table 3: Questions with Answers per Year
Table 4: Coherence Scores of generated models with varying number of topics
Table 5: Discovered Latent Topics
Abstract

Stack Overflow, a community website for programming-related Q&A (questions and answers), serves as a popular platform where users ask questions and receive responses from other community members. Over time, user posts on Stack Overflow have turned into a source of valuable information for programmers and the programming industry. By understanding the essential topics of discussion among developers, new insights can be found about developers' changing trends and needs. This thesis proposes an analysis of user posts on Stack Overflow to find the topics of user posts. Distributed topics in the text content of user posts are extracted using topic modelling. Latent Dirichlet Allocation (LDA) is applied for topic discovery, and an optimal number of topics is extracted. The trend of developer interest is derived by combining the view count of questions with the discovered topics. Based on the analysis within the thesis's scope, developers discuss topics ranging from programming languages, language runtimes, and storage to cloud and networking. Scripting languages are discussed more than non-scripting languages. The discovered topics consist of several recurring categories, i.e., Web Development, Data Management, and UI Development. According to our findings, Machine Learning is gaining popularity, as are data processing and analytics solutions. Mobile development is another favoured subject among developers. The analysis of the research findings infers that one technology's popularity is also reflected in a related technology's popularity trend.
Acknowledgements

Thanks to my supervisor at Uppsala University, Mudassir Imran Mustafa, for his support, feedback, and motivation during the thesis. Thanks also to David Johnson for sharing the idea of analyzing Stack Overflow content during our thesis meeting, and to Ruth Lochan, our course coordinator, for approving the proposal. Finally, I would like to thank my spouse for his support. I am glad to present this thesis work; it was a wonderful experience. Thanks to all for the support.

Shailja Shukla
Uppsala, 2021-04-21
Chapter 1

1. Introduction

1.1 Background

Programmers all around the world look to solve the problems at hand. Sometimes tasked with solving technical issues, and other times looking to help others or satisfy their curiosity, programmers often turn towards web-based online "social question and answer developer community platforms". On such platforms, programmers can interact with other developers and find answers to their questions. With the spread of technology into novel fields, software developer communities are also growing day by day. There are many forums where developers can post questions related to technical issues. Stack Overflow is one such website, where a developer can post questions and answers and add comments on answers provided by other developers. For this thesis's scope, visitors or users of the Stack Overflow website are referred to as developers.

Stack Overflow is the most popular Q&A website among software developers and serves as a platform for knowledge sharing and acquisition (Alrashedy et al., 2020). It is a valuable source of support for developers seeking probable solutions from the web (Rubei et al., 2020). Stack Overflow allows its users to ask questions, tag a question with keywords to categorize it, provide answers, comment on questions or answers, and up- or down-vote a question or answer. Over time, Stack Overflow has become a community knowledge base for programming-related subjects, and this knowledge base remains highly popular in its current form. It can be accessed through a web search or through the internal search functionality provided by the Stack Overflow website navigation. Navigating the website through tags is one of the navigation options. User-created tags lead to tag explosion, which is challenging to manage (Li et al., 2019). Developers find interesting posts with the help of the tags associated with a post.
Since tags are user-created, they may sometimes be missing from specific posts or be irrelevant. The content of a post is not itself used to assign tags to it, which leaves a gap of untapped opportunity to find interesting insights in the corpus of content posted by users of the website (Barua, Thomas and Hassan, 2012).
User posts on Stack Overflow are expressed in rich and ambiguous natural language (Debortoli et al., 2016). One way to analyze natural language is qualitative data analysis using manual coding; however, the size of text data sets obtained from the Internet makes manual analysis virtually impossible (Debortoli et al., 2016). "Text mining" and "text analytics" are broad umbrella terms describing a variety of technologies for analyzing and processing semi-structured and unstructured text data (Miner et al., 2012). Text mining techniques make it possible to automatically extract implicit, previously unknown, and potentially practical knowledge from enormous amounts of unstructured textual data in a scalable and repeatable way (Debortoli et al., 2016). Automated text mining allows information systems researchers to overcome the limitations of manual approaches to qualitative data analysis; a study can be repeated more easily and faster, and it yields insights that could otherwise not be found (Debortoli et al., 2016). Text mining is divided into seven practice areas depending on each area's unique characteristics (Miner et al., 2012). These text mining divisions are interrelated and often require skills in more than one area (Miner et al., 2012). A topic model is a probabilistic generative model used broadly in computer science, with a specific focus on text mining and information retrieval (Liu et al., 2016). The position and relations of the topic modelling/information retrieval practice area are illustrated in the following diagram:

Figure 1: Venn Diagram of the intersection of the Text Mining and six related fields (Miner et al., 2012)
Optimal text mining results require skill sets from both computer science and linguistics, and an IS researcher might not be equipped with knowledge and skills in all these fields. Much technical literature exists on the ideas and methods underlying specific text-mining algorithms, such as topic modelling (Debortoli et al., 2016).

1.2 Motivation

Stack Overflow is not only a famous programming question-and-answer community platform but also an information system based on user-generated content and a ranking-based moderation system. Search is the primary tool for finding solutions among already answered questions and for finding questions to answer or comment on Stack Overflow. Another method of finding content is navigating through the tags associated with questions. Tags are helpful for navigation and for discovering interesting content within the website, but they have shortcomings. The wild nature of user-generated tagging makes tags prone to inconsistencies caused by spelling variations, synonyms, acronyms, and hyponyms. These affect tag quality, and as a result, tags do not entirely represent the underlying topics in the text content of user posts. These inconsistencies might cause "tag explosion", which means a small subset of tags is overused (Joorabchi, English and Mahdi, 2015). "Tag explosion describes the phenomenon that the number of tags dramatically increases along with continuous additions of software objects" (Li et al., 2019). Two factors that affect tag quality are tag synonyms and tag explosion. A tag synonym arises when a post is tagged with similar tags, e.g., "javascript" and "java-script", "c#" and "c-sharp", "ios" and "i-os", ".net" and "dotnet". Tag explosion is caused by the continual addition of new tags to the system, making it hard to navigate and to analyze content topics manually (Li et al., 2019). Tags cannot be explicitly associated with an answer or comment, which leaves much user-generated content untagged.
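The tag-synonym problem described above can be illustrated with a small normalization sketch. This is not part of the thesis's method; the synonym mapping and tag lists below are invented examples of how spelling variants could be collapsed onto one canonical tag:

```python
# Hypothetical mapping of tag spelling variants to canonical tags.
# The entries are illustrative examples, not data from the study.
SYNONYMS = {
    "java-script": "javascript",
    "c-sharp": "c#",
    "i-os": "ios",
    "dotnet": ".net",
}

def normalize_tags(tags):
    """Map each tag to its canonical form and drop duplicates."""
    canonical = [SYNONYMS.get(t.lower(), t.lower()) for t in tags]
    return sorted(set(canonical))

print(normalize_tags(["JavaScript", "java-script", "dotnet", ".net"]))
# ['.net', 'javascript']
```

Without such normalization, each spelling variant counts as a distinct tag, which is exactly what inflates the tag space and degrades tag-based navigation.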
However, the user-generated tags for each "question" post might also be perceived as representative of the "answer" posts and other post types associated with the same question. With tag misuse or incorrect tagging, the tags might not represent the posts' content (Meta Stack Overflow, n.d.). Thus, it is concluded that the tags associated with posts do not stand for an extensive part of total user posts. Analysis of this untagged data can give several types of insights and answer many different questions. This finding makes the use of tags as indexing metadata unsuitable for Stack
Overflow data-based Information Systems. Topic modelling is an alternative solution that finds the topics associated with each user post through unsupervised learning. It helps in categorizing posts into broad categories. Topic modelling is an information retrieval technique used to find structure in a collection of documents; these techniques were developed to make browsing an enormous collection of documents more accessible (Eickhoff and Neuss, 2017). User-generated content can be analyzed through a topic model to get more insight into the conversations on the community platform. Similar topic modelling studies on Stack Overflow have been conducted in the past, first by Barua, Thomas and Hassan in 2012 and then by Verma, Sardana and Lal in 2019, on Stack Overflow posts from different periods. Our motivation for conducting this study is to see how technological trends and discussion topics have changed over time, using a different period.

1.3 Research Questions

The text content of user posts from the Stack Overflow website is used to find technology trends over time and the discussion topics among developers. The research questions are inspired by a similar study on Stack Overflow data conducted by Barua, Thomas and Hassan (2012).

1. What are the popular discussion topics in Stack Overflow?

Knowledge of the popular discussion topics among programmers can be a valuable piece of information. It gives vital insights to technology analysts, vendors, authors, educators, and companies in general, which helps them make decisions about their work and products. Topics generated through LDA can help in improving the accessibility of the discovered topics; the Stack Overflow website could benefit from this knowledge by improving the reach of LDA topics through navigation. These benefits motivated us to find the main discussion topics among developers on Stack Overflow.

2. How do the developer's interests change over time?
By observing the most active topics, businesses, professionals, book authors, and institutes may better assess newfound opportunities and risks, predict trends, and shift their focus in a direction better suited to their respective goals. Information systems help by recording such experiences in a knowledge base.
3. How do the interests in specific technologies change over the period?

Observing changes in the popularity of topics helps maintainers of software libraries understand growing or waning interest in their released work. If an open-source JavaScript-based library, "React" from Facebook, starts gaining more developer interest against another popular library from Google called "Angular", then analyzing this trend might help the maintainers of the Angular library.

1.4 Delimitation:

This study is confined to Stack Overflow data and does not depend on any other data. The study uses an English-language dictionary for natural language processing and topic modelling. It is limited to performing topic modelling using Latent Dirichlet Allocation (LDA) on data from 01 January 2018 to 01 March 2020 and analyzing the result.

1.5 Limitation:

The computing resource used in the study is a MacBook Pro with 16 GB RAM and a 2.6 GHz 6-core Intel Core i7, running macOS, resulting in restricted computational power. This affects the LDA model and the evaluation of models. The study is based on 20 topics, and it does not include the discovery of a dominant topic in each document. A mix of custom and default parameters of algorithms from the open-source Python library Gensim is used to clean up our dataset. For example, to remove stop words from the text, a custom list of words is used; this list is not exhaustive enough to ensure the removal of all meaningless words. The applied natural language processing techniques are based on popular best practices, and a more suitable approach might be possible. For stemming, an English-language dictionary is used; a custom dictionary more suited to the subject would be expected to give better results.
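As a concrete illustration of the custom stop-word removal mentioned above, here is a minimal pure-Python sketch. The stop-word set is a tiny invented sample for demonstration, not the study's actual custom list, and the study used the Gensim library rather than this hand-rolled filter:

```python
# Tiny illustrative stop-word set -- NOT the study's custom list.
CUSTOM_STOPWORDS = {"the", "a", "an", "is", "to", "use", "using", "would"}

def remove_stopwords(tokens):
    """Keep only tokens that are not in the custom stop-word set."""
    return [t for t in tokens if t.lower() not in CUSTOM_STOPWORDS]

tokens = "how to use the javascript fetch api".split()
print(remove_stopwords(tokens))
# ['how', 'javascript', 'fetch', 'api']
```

The limitation noted in the text follows directly from this design: any meaningless word missing from the set passes through the filter unchanged.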
Chapter 2

2. Theory

2.1 Topic Modelling:

Topic modelling is a popular information retrieval method for finding and extracting essential terms from a collection of many documents with little or no human intervention. It helps in deriving the structure of relationships between documents (Arora et al., 2013). The principle behind topic modelling is that each document in a collection is a mixture of topics, where a topic refers to a probability distribution over words (Alghamdi and Alfalqi, 2015). Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents, and they can organize the collection according to the discovered themes (Blei, 2012). There are many topic modelling techniques. The first probabilistic topic model was Probabilistic Latent Semantic Analysis (PLSA), introduced by T. Hofmann in 1999. In 2003, D. Blei, A. Ng and M. Jordan proposed its Bayesian extension, named Latent Dirichlet Allocation (LDA) (Kochedykov et al., 2017). Since then, topic modelling has been developed within graphical models and Bayesian learning (Kochedykov et al., 2017). The use of "vanilla" LSA or LDA is prevalent in IS research for topic modelling due to the lack of publicly available implementations of many specialized topic modelling methods (Eickhoff and Neuss, 2017). LSA extracts the underlying topics from a term-document matrix by applying singular value decomposition (SVD), an approach that contradicts human intuition about topics; LDA evolves from LSA and pLSA by imposing Dirichlet-distributed priors on its word-to-topic and topic-to-document distributions to produce results more in line with human intuition (Eickhoff and Neuss, 2017). Apart from its general use in research and the verified results of LDA in empirical topic modelling studies, LDA has been implemented in many programming languages, such as Python, Java and R.
Several implementations of LDA are publicly available as open-source and free software (Debortoli et al., 2016). Latent Dirichlet Allocation (LDA) was selected for topic modelling within the study's scope: a
generative probabilistic model of a corpus (Blei, Ng and Jordan, 2003). The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words (Blei, Ng and Jordan, 2003). LDA tries to find the proper assignment of a topic to every word such that the parameters of the generative model are maximized (Arun et al., 2010).

2.2 Latent Dirichlet Allocation (LDA):

LDA uses an imaginary generative process that assumes that authors composed documents by choosing a discrete distribution of t topics to write about and drawing w words from a discrete distribution of typical words for each topic (see Figure 2) (Debortoli et al., 2016).

Figure 2: Schematic Overview of LDA (Debortoli et al., 2016).

The LDA algorithm computationally estimates the hidden topic and word distributions given the observed per-document word occurrences (Debortoli et al., 2016). LDA can perform this estimation via sampling approaches (e.g., Gibbs sampling) or optimization approaches (e.g., Variational Bayes) (Debortoli et al., 2016). Based on these features of the LDA algorithm, it was derived that LDA is the most suitable topic modelling method to apply to Stack Overflow data to extract topics. In 2010
David M. Blei published another optimization of LDA as "Online Learning for Latent Dirichlet Allocation". Online LDA uses an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (Hoffman, Bach and Blei, 2010). This development enhances the capability of the LDA algorithm for large sets of documents, and Online LDA can even be applied to streaming text. To benefit from this performance boost, Online LDA has been used in the study.

2.3 Related Work

Many research studies have been conducted with Stack Overflow as a subject. In 2012, research conducted by Barua, Thomas and Hassan analyzed the questions and answers posted on Stack Overflow from June 2008 to September 2010 to find the main discussion topics on Stack Overflow, changes in developer interests over time, and changes in specific technologies. They used the Latent Dirichlet Allocation topic modelling technique to find topics, discovering around 40 different topics, and used the Cox-Stuart test to analyze changing trends over time. Among the findings were that mobile application development is on the rise, growing faster than web development; Android and iPhone development are far more prevalent than Blackberry development; the PHP scripting language is becoming extremely popular, much more so than, say, Perl; Java is a continuing player within the programming languages and APIs sector, while the .NET framework is decreasing slightly; Git has surpassed SVN in the VCS popularity contest; and MySQL is the hottest DBMS of the last few years (Barua, Thomas and Hassan, 2012). In 2018, Johri and Bansal analyzed Stack Overflow data for 2014 and 2015 to gain insight into trends in technologies for different subdisciplines of computer science and programming languages. They found that Website Design/CSS is the most impactful topic.
Data Analysis/Visualization and Mobile App Development are hot topics whose popularity is increasing, while the impact of Object-Oriented Programming and Coding Style/Practice has decreased over time. On the other hand, topics like Authentication/Security and UI Development have shown steady trends over time. Furthermore, R and Python dominate the Data Analysis/Visualization topic, Oracle and MySQL are the most popular database platforms, and Python is the most impactful scripting language (Johri and Bansal, 2018). In 2019, Verma, Sardana and Lal analyzed the questions and answers posted on Stack Exchange for 2015, 2016 and 2017 to find the key discussion topics, how developers' interests change over time, and how interest in specific technologies changes over
time. They found that the popular discussion topics across all three years were programming skills, object-oriented design, and design & development. The topics were labelled meaningfully based on the top words assigned by LDA. The leading technologies about which questions were raised were Java and C# (Verma, Sardana and Lal, 2019). Our study is related to the topic modelling of developer posts on Stack Overflow.
Chapter 3
3. Methodology:
The goal is to find the main discussion topics among developers on the Stack Overflow website. To achieve it, topic modelling is performed on user posts retrieved from the Stack Overflow dataset. With topic models, it is possible to retrieve topics from a collection of texts without document metadata. The methodology is motivated by similar studies conducted in the past by Barua, Thomas and Hassan (2012) and Verma, Sardana and Lal (2019), referenced in section 2.3. It does not replicate their steps exactly; a few changes were made according to the requirements of this study, such as customized stop words and removal of contractions. The online variational Bayes algorithm for latent Dirichlet allocation (Online LDA) by Hoffman, Bach and Blei (2010) is used for topic modelling. Topic modelling is performed in the following steps, shown in figure 3:

Figure 3: Methodology Model
3.1 Data Collection:
The popular programming community Q&A website Stack Overflow is part of the Stack Exchange network, which comprises 173 Q&A communities (About - Stack Exchange, 2020). Stack Overflow serves over 120 million visitors every month (About, 2020). The data for this research was downloaded from archive.org. The Internet Archive (archive.org) is a non-profit that builds a digital library of Internet sites and other cultural artefacts in digital form (Internet Archive: About IA, 2020). Stack Exchange provides a quarterly data dump of Stack Exchange network sites at [https://archive.org/details/stackexchange]. The Stack Overflow data is shared publicly by Stack Exchange under a Creative Commons licence. Eight datasets are available from Stack Overflow: Badges.7z, Comments.7z, PostHistory.7z, PostLinks.7z, Posts.7z, Tags.7z, Users.7z, Votes.7z (Stack Exchange Data Dump: Stack Exchange, Inc.: Free Download, Borrow, and Streaming: Internet Archive, 2020).

3.2 Data Extraction:
For topic modelling, the file Posts.xml is selected. It is the data dump file best suited as the data source for this thesis because it contains the contents of questions and answers as posts; this collection of posts serves as the corpus for topic modelling. The archive has posts until 01 March 2020, since the data is updated quarterly and only the most recent data dump is made available at the given URL [https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z]. The data source used in the thesis is available at the URL below: [https://archive.org/details/stackoverflow.com-Posts.7z]. The MD5 checksum of the 7zip archive is e5c0b370d5f9a6905c88fdb5971b145a. The size of the archive Posts.7z on disc is 14.6 GB; after extraction, the file Posts.xml is approx. 75 GB. It contains a total of 47,931,101 posts from 2008 until 01 March 2020.
3.2.1 Schema:
The schema of Posts.xml is as follows (Stack Exchange Data Dump, 2020):

Posts.xml
- Id
- PostTypeId (1 = Question, 2 = Answer)
- ParentId (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName
- LastEditDate
- LastActivityDate
- CommunityOwnedDate
- ClosedDate
- Title
- Tags
- AnswerCount
- CommentCount
- FavoriteCount

Table 1: Stack Overflow Posts schema

For the scope of the study, data from 01 January 2018 onwards is queried. MySQL supports loading an XML file directly into a table. The following SQL command is used to import Posts.xml into a MySQL database.
load XML local infile '/Path/To/stackexchange/Posts.xml'
into table posts
rows identified by '<row>';

The following SQL query selects posts from 01 January 2018 onwards:

select * from posts where CreationDate >= '2018-01-01' order by CreationDate desc;

The query returned 9,747,021 posts, out of which 4,349,023 posts relate to questions and 5,397,998 to answers, as determined by the value of PostTypeId. The initial dataset is created from the exported query result as a SQL data dump; the exported SQL dump is 4.3 GB gzipped. To prepare the data for pre-processing, the data format needs to be one of the file formats supported by the data pre-processing and cleaning software. CSV, Avro, JSON and Parquet are among the file formats most widely supported by data processing applications. Google BigQuery is used to import the SQL data and export it in the Avro file format; the Avro files were then converted to the Parquet file format supported by the Python data processing library Pandas. The Online LDA implementation from the Python library Gensim is used for LDA topic modelling.

3.3 Data Pre-processing:
The exported SQL dump is used as the source input to create and export the dataset for processing. Before processing in the LDA model, the data needs pre-processing to obtain meaningful results. Data pre-processing consists of a sequence of steps to transform the raw data derived from data extraction into a clean and tidy dataset before analysis (Malley, Ramazzotti and Wu, 2016), reducing noise in the dataset. Data cleaning is performed in the following steps:

3.3.1 Subset corpus data:
For creating the corpus, data from the "Title" and "Body" columns is retrieved and the rest of the columns are discarded. Only posts that are questions have a non-empty value in the "Title" column, because Stack Overflow only allows questions to have a title.
Column "Id" is also kept for backtracking to the full dataset, so that processed values can be mapped to the unprocessed dataset table if needed.
3.3.2 Remove code snippets:
To automate topic modelling, any part of the corpus that is not meaningful needs to be cleaned up. In Stack Overflow posts, code snippets are a regular part of the post content, but these snippets add noise for the Natural Language Processing algorithms. Code snippets are enclosed inside <code> and <pre> HTML tags. Code snippets, including the enclosing "code" and "pre" tags, are removed. The post corpus still contains many other HTML tags; these tags are removed, but the enclosed text is kept. A sample user post before (figure 4) and after (figure 5) clean-up of code snippets from the text content is shown below.

Figure 4: Sample user post before cleaning of code snippet from the text content.

Figure 5: Sample user post after cleaning of code snippet from the text content.

After performing these steps, the dataset size is reduced to 4.3 GB.
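A sketch of this clean-up with regular expressions follows; the thesis does not specify its exact patterns, so these regexes are a simplification (they assume well-formed, non-nested tags):

```python
import re

# Drop <code>/<pre> blocks entirely, then strip remaining HTML tags while
# keeping their inner text, as described above.
CODE_BLOCK = re.compile(r"<(pre|code)[^>]*>.*?</\1>", re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")

def clean_post(html: str) -> str:
    no_code = CODE_BLOCK.sub(" ", html)       # remove snippet and enclosing tags
    return ANY_TAG.sub("", no_code).strip()   # remove tags, keep enclosed text

post = "<p>Loop fails:</p><pre><code>for i in x: pass</code></pre><p>Why?</p>"
print(clean_post(post))  # → "Loop fails: Why?"
```

A real pipeline might use an HTML parser instead of regexes, but for the constrained HTML in Stack Overflow posts this pattern-based approach is common.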
3.3.3 Combine related documents to form a single corpus:
The dataset table is converted into a Pandas data frame for further data processing; posts are still divided into "Title" and "Body" columns. Both columns are combined to form a single corpus, and null or empty values are replaced by an empty string. The title of the sample user post is shown below in figure 6.

Figure 6: Title of sample user post

The body text content of the sample user post is shown below in figure 7.

Figure 7: Body of sample user post

The combined sample text of title and body is shown below in figure 8.

Figure 8: Combined title and body of sample user post text

3.3.4 Tokenization:
In NLP studies, the focus is on analysis rather than on the basic units called tokens, i.e., words, but without clear segregation of words it is impossible to carry out analysis on documents written in natural languages (Webster and Kit, 1992). The text to be analyzed is converted into a list of meaningful segments called tokens (Bhargav Srinivasa-Desikan, 2018). These segments may be words, punctuation, numbers, or other special characters that are the building blocks of a sentence (Bhargav Srinivasa-Desikan, 2018). Tokens are units that need not be decomposed in further processing. Automatic segmentation is achieved by constructing a dictionary and applying strategies for disambiguation (Webster and Kit, 1992). The study uses whitespace as the delimiter; this can be difficult for some non-English languages, but since the study is scoped to English-language text, delimiting on whitespace is not a problem.
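The combine-and-tokenize steps can be sketched with Pandas as follows (the two-row frame is illustrative; the real column names come from the Posts schema):

```python
import pandas as pd

# Combine "Title" and "Body" into one text field and tokenize on whitespace.
df = pd.DataFrame({
    "Title": ["How to merge two dicts?", None],     # answers have no title
    "Body": ["I have two dicts and want one.", "Use the update method."],
})
df = df.fillna("")                       # null titles become empty strings
df["text"] = (df["Title"] + " " + df["Body"]).str.strip()
df["tokens"] = df["text"].str.split()    # whitespace tokenization

print(df["tokens"].iloc[1])  # → ['Use', 'the', 'update', 'method.']
```

Note that `str.split()` with no argument splits on any run of whitespace, which also absorbs the extra space left when the title is empty.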
3.3.5 Lowercasing:
Changing the letter case is part of text pre-processing, so that tokens are cleaned of case-related ambiguity (Kulkarni and Shivananda, 2019). User post content from the Stack Overflow dataset is natural language text, and when using automated natural language processing tools, case sensitivity can produce overwhelmingly fragmented results. Changing all text to lower case reduces this ambiguity; e.g., "Sass" and "SASS" both become "sass", and "JavaScript" and "Javascript" both become "javascript" (Bhargav Srinivasa-Desikan, 2018).

3.3.6 Remove punctuation:
Punctuation does not add meaning or supply additional value to the text being pre-processed for topic modelling. Removing punctuation from the enormous text collection also reduces its size; since processing more text requires more computing resources, the smaller collection enables pre-processing with reduced computing resources (Bhargav Srinivasa-Desikan, 2018). The most popular method to remove punctuation from text documents is to use a regular expression along with a list of punctuation characters to be removed; Python supplies such a list as part of the standard library. However, several programming languages have punctuation in their names, e.g., the "+" in "C++" and the "#" in "C#". Our topic model's results might be skewed if punctuation were erroneously stripped from these technology keywords. For example, if all punctuation were removed, the technological words "C++, C, C#" would all become "c, c, c". This transformation might triple the probabilistic weight of the token "c" and incorrectly remove "c#" and "c++" from the text and, later, from the topic model. Therefore, punctuation is removed from all tokens except those on a non-exhaustive exception list of such programming keywords.
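A sketch of punctuation removal with such an exception list follows; the whitelist here is illustrative, not the study's full list:

```python
import re
import string

# Punctuation removal with a whitelist of programming terms, mirroring the
# exception-list approach described above.
KEEP = {"c++", "c#", "f#", ".net"}
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def strip_punct(tokens):
    # Whitelisted tokens pass through unchanged; all others lose punctuation.
    return [t if t in KEEP else PUNCT.sub("", t) for t in tokens]

tokens = ["c++", "c#", "hello,", "world!", "c"]
print(strip_punct(tokens))  # → ['c++', 'c#', 'hello', 'world', 'c']
```

This keeps "c++" and "c#" distinct from "c", avoiding the skew described above.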
3.3.7 Text Standardization/Replace Contractions:
Text standardization converts a raw corpus into a canonical, standard form to ensure that the textual input is consistent before analysis and processing (Bokka et al., 2019). Contractions are shortened versions of words or syllables (Sarkar, 2019). Shortened versions of existing words are created
by removing specific letters and sounds, and contractions pose a problem for NLP and text analytics (Sarkar, 2019). Stack Overflow is primarily a social community; users interact in natural languages, often not in their primary spoken language. There is a high possibility of people using short words and abbreviations to stand for the same meaning, and in many cases words might be misspelt or popular slang substitutes used. Abbreviations in the corpus have been expanded to their canonical forms, e.g., gotta -> got to, brb -> be right back. Spellings of tokens were not corrected, because dictionary-based auto-correction might remove non-English technical terms, and that would misrepresent token sparsity.

3.3.8 Remove stop words:
As a next step, stop words are removed. Stop words are common words that appear to be of little value in helping to select documents matching a user's need and are excluded from the vocabulary entirely (Manning, Raghavan and Schütze, 2008), for example: 'a', 'an', 'the', 'of', 'else'. By removing the commonly used words, the focus shifts to the essential keywords instead (Bhargav Srinivasa-Desikan, 2018). For example, in the text "How do I atomically move a 64bit value in x86 ASM?", the common English words "How," "do," "I," "a," and "in" are removed. For stop word removal, the study uses a custom list of stop words that includes additional tokens whose removal improves topic coherence.

3.3.9 Remove URLs:
URLs and email addresses are frequently occurring tokens that add noise to the quality of the pre-processed text. URLs are removed from the text using a Python regular expression.

3.3.10 Minimum size words:
Words shorter than a certain number of letters are often not useful. Any word of fewer than two letters is removed, except for the whitelisted technological terms 'c++', 'c#', 'f#', 'r', 'c'.
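The standardization steps above (contraction expansion, URL removal, stop word removal, minimum-length filtering with a whitelist) can be sketched together; the small dictionaries are illustrative, not the study's full lists:

```python
import re

# Illustrative lookup tables; the study used larger, custom lists.
CONTRACTIONS = {"gotta": "got to", "brb": "be right back"}
STOP_WORDS = {"a", "an", "the", "of", "else", "how", "do", "i", "in", "to"}
WHITELIST = {"c", "r", "c#", "c++", "f#"}
URL = re.compile(r"https?://\S+")

def standardize(text):
    text = URL.sub(" ", text.lower())      # lowercase, then drop URLs
    tokens = []
    for tok in text.split():
        tokens.extend(CONTRACTIONS.get(tok, tok).split())  # expand contractions
    # Drop stop words and one-letter tokens, except whitelisted terms.
    return [t for t in tokens
            if t not in STOP_WORDS and (len(t) >= 2 or t in WHITELIST)]

print(standardize("How do I gotta move a 64bit value in c https://example.com"))
# → ['got', 'move', '64bit', 'value', 'c']
```

Note that "c" survives the minimum-length filter only because it is whitelisted, while the expanded "to" from "gotta" is dropped as a stop word.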
3.3.11 Remove multiple whitespaces:
A whitespace delimiter is used for tokenization, so multiple consecutive whitespace characters might create null tokens. A Python regular expression is used to collapse multiple whitespaces. Samples of a text document before (figure 9) and after (figure 10) partial pre-processing are shown below.

Figure 9: Sample text before pre-processing

Figure 10: Sample text after partial pre-processing

3.3.12 Generate N-Grams:
Every token in Natural Language Processing is considered a feature, and an n-gram is a contiguous sequence of n features in the text (Bhargav Srinivasa-Desikan, 2018). For a single feature, the value of n is 1; this representation of features is called a "unigram". Sometimes a token derives meaning by combining with the previous or next feature: when the tokens "Java" and "Script" are found together, capturing them derives the feature "JavaScript". This process is called the generation of n-grams, where "n" denotes the number of tokens captured to form a new feature. When two tokens form a feature it is called a "bigram"; when three tokens form a feature it is called a "trigram"; and so on (Bhargav Srinivasa-Desikan, 2018). Bigrams and trigrams were generated for the scope of the study.

3.3.13 Stemming:
Stemming refers to a crude heuristic process that chops off the ends of words, in the hope of achieving the correct base form most of the time, and often includes the removal of derivational affixes (Manning, Raghavan and Schütze, 2008). Stemming is a process of extracting a root word; for example, "fish," "fishes," and "fishing" are all stemmed to "fish" (Bhargav Srinivasa-Desikan, 2018).
The Snowball stemming algorithm, implemented in the Python library NLTK and used in this study, is an improvement over the Porter stemming algorithm. The Snowball algorithm stems words to their base form. Snowball is a small string-processing language designed for creating stemming algorithms for use in Information Retrieval (Snowball, n.d.).

3.3.14 Lemmatisation:
Lemmatisation refers to doing things properly, using a vocabulary and morphological analysis of words, usually aiming to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma (Manning, Raghavan and Schütze, 2008). Like stemming, lemmatisation is a process of extracting a root word, but it takes the vocabulary into account; for example, "good," "better," and "best" are lemmatised to "good" (Bokka et al., 2019). While stemming may return a root that is not a word, a lemmatised word must be a valid dictionary word (Bokka et al., 2019). The WordNet lemmatiser implemented in the Python library NLTK is used in this study. An example before n-gram generation, stemming, and lemmatisation is shown below in figure 11.

Figure 11: Sample text before stemming and lemmatisation

An example after n-gram generation, stemming, and lemmatisation is shown below in figure 12.

Figure 12: Sample text after stemming and lemmatisation

3.4 Create Dictionary and Term Document Frequency:
A dictionary is generated from the lemmatized data using the dictionary module of the Python library Gensim. The dictionary is a list of the unique words found in the collection of documents, with each word assigned an index value. The word-to-integer-id mapping made by the dictionary is also referred to as the "word id" (Bhargav Srinivasa-Desikan, 2018). Dictionary-based text categorization relies on experts assembling lists of words and phrases that are likely to indicate that a chunk of text belongs to a specific category (Debortoli et al., 2016). The dictionary encapsulates the mapping between normalized words and their integer ids (Řehůřek, 2019).
The dictionary is used to create the Term Document Frequency input for the LDA topic model. The Term Document Frequency is the corpus that contains each word id and its frequency in each document.
Each document in our collection is converted into a bag of words using the doc2bow method of the dictionary created with Gensim. The result is a list of lists, where each inner list is a document's bag-of-words representation (Bhargav Srinivasa-Desikan, 2018). The bag-of-words model represents each text document as a numeric vector where each dimension corresponds to a specific word from the corpus and the value is its frequency in the document, its occurrence denoted by 1 or 0, or even a weighted value (Sarkar, 2019). The sample pre-processed text used to create the Term Document Frequency is shown below in figure 13.

Figure 13: Sample pre-processed text

The Term Document Frequency (figure 14) of the sample text (figure 13) is generated from the dictionary.

Figure 14: Term Document Frequency of sample text, generated from a dictionary
3.5 Run the LDA model:
There are several topic modelling algorithms; for this study, the Latent Dirichlet Allocation (LDA) model has been selected. It is crucial to find an optimal number of latent topics for generating a topic model from a given corpus; the idea is that a small number of latent topics is enough to effectively represent a large corpus (Arun et al., 2010). The computational task of the LDA algorithm is to estimate the hidden topic and word distributions given the observed per-document word occurrences; this estimation can be done either via sampling approaches (e.g., Gibbs sampling) or optimization approaches (e.g., variational Bayes) (Debortoli et al., 2016). The Online LDA model implemented in the Python-based natural language processing library Gensim is used for topic modelling in this study. The LDA model was run multiple times; based on its results, the model parameters were tweaked and the text pre-processing steps revised, e.g., by creating custom stop word and punctuation lists. The initial topic model was run with 50 topics. Afterwards, models were run with different numbers of topics, and each model's coherence score was calculated using the CoherenceModel from Gensim; the model with the best coherence score was selected for further analysis. Tools and libraries used in the study are listed in Appendix 1.
Chapter 4
4 Analysis:
Upon analyzing the dataset, a total of 9,747,021 posts were found. PostTypeId is used to determine whether a post is a question, an answer, or another type of post. The mapping between post type and PostTypeId is shown in the table below (SEDE, 2020).

PostTypeId  Post Type
1           Question
2           Answer
3           Orphaned tag wiki
4           Tag wiki excerpt
5           Tag wiki
6           Moderator nomination
7           Wiki placeholder
8           Privilege wiki

Table 2: Post type and Post type Id

Out of 9,747,021 posts, there are 4,337,053 (44.5%) "Question" and 5,397,998 (55.4%) "Answer" type posts. These two types contribute 99.9% of the total posts; the remaining 11,970 (0.1%) posts are distributed among the other types.
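The post-type breakdown can be sketched with Pandas; the five-row frame stands in for the ~9.7M-row dataset:

```python
import pandas as pd

# Count posts per PostTypeId and map ids to names, as in the analysis above.
POST_TYPES = {1: "Question", 2: "Answer"}
posts = pd.DataFrame({"PostTypeId": [1, 2, 2, 1, 2]})

counts = posts["PostTypeId"].map(POST_TYPES).value_counts()
share = (counts / len(posts) * 100).round(1)
print(counts.to_dict())  # → {'Answer': 3, 'Question': 2}
print(share.to_dict())   # → {'Answer': 60.0, 'Question': 40.0}
```

`value_counts` sorts by frequency, which reproduces the answer-heavy split seen in the real data.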
Figure 15: Post types count

The analysis finds the number of questions asked and answered during each year. In 2018, a total of 1,907,440 questions were asked, of which 34.7% were answered. Similarly, in 2019, 32.0% of the total 2,056,068 questions were answered. Complete data for 2020 is not available; the dataset contains data until 01 March 2020, during which 373,545 questions were asked, with an answer rate of 29.6%.

Year  Number of questions  Questions with answers
2018  1,907,440            34.7 %
2019  2,056,068            32.0 %
2020  373,545              29.6 %

Table 3: Questions with Answers per Year

Analysis of the count of posts over the period, based on post type, reveals that the count of answers posted has been continuously higher than the count of questions asked. At the same time, the gap between question and answer counts is continuously narrowing; in fact, in February 2020 the count of questions crossed the count of answers. The upward trend in answer count turns downward towards the end of the dataset period, but this downward trend covers too short an interval to conclude that the
answer upward trend is short-lived. In the figure below, the question series is turquoise blue and the answer series is navy blue; the legend shows the colour for each PostTypeId.

Figure 16: Question Answer Ratio

Since LDA is an unsupervised topic modelling technique, it is not known before running the model how many topics the corpus consists of. The number of topics, represented by K, determines the granularity of the discovered topics. K's optimal value is derived through the topic coherence score; the higher the coherence, the more optimal the value of K (Verma, Sardana and Lal, 2019). Larger values of K produce finer-grained, more complex topics, while smaller values of K produce coarser-grained, more general topics (Barua, Thomas and Hassan, 2012). To find the optimal number of topics, multiple models were generated with 2, 8, 14, 20, 26, 32, 38 and 44 topics, and their coherence scores were compared. CoherenceModel from the Gensim library is used to derive the coherence score of the generated topic models. For the scope of the study, the number of topics to be discovered is limited to a maximum of 50; for every generated model, the value of K was increased by six from the value used in the previous model, starting from K = 2.