Newcomer Candidate: Characterizing Contributions of a Novice Developer to GitHub
IFraz Rehman · Dong Wang · Raula Gaikovina Kula · Takashi Ishio · Kenichi Matsumoto

arXiv:2101.08903v1 [cs.SE] 22 Jan 2021

Abstract

The ability of an Open Source Software (OSS) project to attract, onboard, and retain newcomers is vital to its livelihood. Evidence suggests that more new users are joining GitHub; however, the extent to which they contribute to OSS projects is unknown. In this study, we coin the term ‘newcomer candidate’ to describe a novice developer that is a new user to the GitHub platform, with the intention to later onboard an OSS project. Our objective is to track and characterize their initial contributions using a mixed-method approach. Our results show that 68% of newcomer candidates are more likely to practice non-social coding, 86% tend to work on forward-engineering activities in their first commits, and 53% show interest in targeting non-software repositories. Our quantitative analysis matched only 3% of newcomer candidates' contributions to established OSS repositories, yet 70% of newcomer candidates claim to have already onboarded an OSS project. This study opens up new avenues for future work, especially in terms of targeting potential contributors to onboard an existing OSS project. More practical applications would be tool support to (i) recommend practical examples that OSS project teams can use to lower their barriers for a newcomer candidate to successfully make a contribution and (ii) recommend suitable repositories for newcomer candidates based on their preference. Researchers can explore strategies to sustain newcomer candidate activities until they are ready to onboard an OSS project.
Keywords: Newcomer, Open Source Projects, GitHub

IFraz Rehman, Dong Wang, Raula Gaikovina Kula, Takashi Ishio, Kenichi Matsumoto
Nara Institute of Science and Technology, Japan
E-mail: {rehman.ifraz.qy4,wang.dong.vt8,raula-k,ishio,matumoto}@is.naist.jp

1 Introduction

The success of Open Source Software (OSS) has always depended on the continuous influx of newcomers and their active involvement (Park and Jensen, 2009).
Recent studies have shown that many contemporary projects are at risk of failure, one of the reasons being an inability to attract and retain newcomers (Fang and Neufeld, 2009; Valiev et al, 2018). For example, Coelho and Valente (2017) proposed two strategies that involve newcomers: transferring the project to new maintainers and accepting new core developers. In another study, Steinmacher et al (2014a) presented a model of the forces that draw newcomers toward a project or push them away from it.

Most of this work revolves around newcomers onboarding OSS projects. Newcomers can be novice developers who are starting their careers, experienced developers from industry who are new to OSS projects, or developers who migrated from other OSS projects. The term newcomer has usually been used loosely in the literature (Steinmacher et al, 2014b). Inspired by incubation projects of OSS, we coin the term newcomer candidate for “a novice developer that is a new user to the GitHub platform, with the intention to later onboard an OSS project”.

Interestingly, GitHub reported around 10 million new users in 2019 [1]. With this upsurge in newcomer candidate activity, the extent to which these contributions assist OSS projects is unknown. In addition, GitHub [2], as a social coding platform, allows over 40 million developers to showcase their skills to the world’s largest community (44 million repositories). Although a substantial body of work has studied the barriers and struggles of newcomers, none has explored the contribution types of newcomer candidates.

To fill this gap, our paper executes the research protocols of a registered report (Rehman et al, 2020) to investigate the contributions of newcomer candidates. We study 177 newcomer candidates who were verified as not having any experience of contributing to OSS projects.
We formulate four research questions, along with their motivations, to guide our study:

– (RQ1) To what extent does a newcomer candidate practice social coding? Scacchi (2002) showed that newcomers are more likely to learn on their own. The motivation for our first research question is to understand whether or not a newcomer candidate tends to collaborate with other users. Since GitHub is a social platform, it is unclear whether newcomer candidates practice social coding or learn on their own. Thus, we raise the following hypothesis: (H1) A newcomer candidate is more likely to practice social coding on GitHub.
– (RQ2) What are the kinds of initial contributions that come from a newcomer candidate? We would like to investigate the typical activities that newcomer candidates engage in. Answering this research question will allow us to understand the nature of their initial contributions. Our hypothesis is (H2) A contribution to a GitHub repository by a newcomer candidate is more likely to add new content.
– (RQ3) What kinds of repositories does a newcomer candidate target? Kalliamvakou et al (2014) showed that most repositories on GitHub are non-software related and are for personal use. Thus, the motivation is to understand the kinds of projects that attract the interest of a newcomer candidate. Our hypothesis is (H3) A newcomer candidate is more likely to target software repositories.
– (RQ4) What proportion of newcomer candidates eventually onboard an OSS project? In this exploratory research question, we investigate the proportion of newcomer candidates that eventually onboard an OSS project. Additionally, we validate what kinds of barriers newcomer candidates face when onboarding.

[1] Statistics from https://octoverse.github.com, accessed January 2020
[2] https://github.com

Key results for each RQ are as follows. For RQ1, we show that 68% of newcomer candidates do not practice social coding after joining GitHub. These results indicate that newcomer candidates are less likely to collaborate with other developers in their initial contributions. For RQ2, we identified that 86% of newcomer candidates’ contributions add new features and requirements (i.e., forward-engineering activities). For RQ3, results show that 53% of newcomer candidates are likely to target non-software based repositories, with Documentation (21%, clone and push workflow) and Experimental (24%, fork and PR workflow) being the most frequently targeted repository kinds. For RQ4, although our quantitative analysis matched only 3% of newcomer candidates as having onboarded established OSS repositories, 70% of newcomer candidates claimed in the survey that they had already started to contribute to OSS repositories. Furthermore, newcomer candidates strongly agree that they face the barrier of finding a way to start, while social interaction received the most mixed responses as a barrier.

This study has the following implications and recommendations. We recommend that newcomer candidates read the social coding related guidelines and become familiar with the environment.
More practical applications would be tool support to (i) recommend practical examples that OSS project teams can use to lower their barriers for a newcomer candidate to successfully make a contribution and (ii) recommend suitable repositories for newcomer candidates based on their preference. Researchers can explore strategies to sustain newcomer candidate activities until they are ready to onboard an OSS project.

The remainder of this paper is organized as follows: Section 2 introduces the concept of making a contribution to GitHub. Section 3 describes the data preparation, which includes preliminary survey verification and mining newcomer candidate repositories. Section 4 and Section 5 report the approaches and results of our empirical study, while Section 6 discusses the implications of our findings. Section 7 discloses the threats to validity, and Section 8 presents related work. Finally, we conclude the paper in Section 9.

2 Making a Contribution to GitHub

To contribute to a GitHub repository, we first need to understand the workflow of contributions. This section describes two contribution workflows and then defines how we characterize a GitHub contribution.
Fig. 1: Two workflows for GitHub contributions: (i) Fork and PR and (ii) Clone and Push.

Fork and Pull Request (PR) Workflow. Figure 1 shows a conceptual diagram of the fork and PR workflow for an author AUT. Let R denote a repository, C a set of commit changes, and PR a pull request. We now detail each step:

1. Forking a repository. In order to make changes, an author has to create an online copy of the repository to which they intend to contribute. As shown in the figure, AUT_A makes a fork of repository R. We call this forked repository R′.
2. Cloning a forked repository. Once a fork is made, the author downloads a local copy of the forked repository, which can then be kept in sync with the fork. As shown in the figure, AUT_A clones R′ onto their local computer, which becomes the clone repository R′_C.
3. Committing changes to a forked repository. Once the local copy is cloned, the author can change the local git repository, which involves individual changes such as adding, deleting, or modifying files. These sets of changes are known as commit changes. As shown in the figure, the author makes a set of commit changes C_1 to repository R′_C.
4. Submitting changes as a Pull Request (PR). Finally, in order to get the changes into the original repository, the author needs to submit a PR. The PR allows the author to inform others about changes they have pushed to a branch in a repository hosted on GitHub. The owner of the original repository then decides whether or not to accept the PR. As shown in the figure, the pull request PR contains the set of commit changes C that will be submitted to the original repository R, thus completing the workflow.

Clone and Push Workflow. We now detail each step of the clone and push workflow:

1. Cloning a repository. Similar to step two of the fork and PR workflow, the author downloads a local copy of the repository. As shown in the figure, AUT_B directly downloads a local copy of repository R.
2. Committing changes to a repository. Similar to step three of the fork and PR workflow, the author can make changes to the local git repository. In this example, the author pushes a set of commit changes C_2 to repository R.

2.1 Characterizing GitHub Contributions

To characterize each newcomer candidate contribution, we measure characteristics along three dimensions: social coding, kinds of contributions, and kinds of repositories (i.e., software and non-software).

Fig. 2: An example of social coding, where more than one author contributes to the git.gemspec file. The example is available at https://github.com/ruby-git/ruby-git/blame/master/git.gemspec.

Social Coding. Figure 2 illustrates an example of how we measure whether or not a contribution to a GitHub repository is social. As shown in the figure, there are two authors (i.e., author A for lines 1-3 and author B for line 4) that contribute to a single file (i.e., git.gemspec) in a GitHub repository (i.e., ruby-git). Since more than one author has modified the file, we conclude that both authors make a social contribution. To identify the author of each line of a file, we use the git-blame command [3].

Kinds of Contributions. Purushothaman and Perry (2005) used Swanson's classification of maintenance activities to analyze very small changes, while Hindle et al (2008) performed a similar study for large commits.
For this purpose, we adopt the same kinds of contribution proposed by Hattori and Lanza (2008): (a) Forward Engineering, (b) Re-engineering, (c) Corrective Engineering, and (d) Management.

[3] https://www.atlassian.com/git/tutorials/inspecting-a-repository/git-blame
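The social-coding criterion described above (a contribution is social when the contributor has touched a file that more than one author has modified) can be sketched in Python. This is a minimal illustration rather than the tooling used in the study: it assumes the blame data comes from `git blame --line-porcelain`, whose per-line headers include an `author <name>` field, and the helper names `blame_porcelain`, `extract_authors`, and `is_social` are hypothetical.

```python
import subprocess


def blame_porcelain(repo_dir: str, path: str) -> str:
    """Return `git blame --line-porcelain` output for one file (requires git)."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_authors(porcelain: str) -> set[str]:
    """Collect the distinct author names in porcelain blame output.

    Header lines look like "author Jane Doe"; file content lines start with
    a tab, and "author-mail"/"author-time" do not match the "author " prefix.
    """
    prefix = "author "
    return {
        line[len(prefix):].strip()
        for line in porcelain.splitlines()
        if line.startswith(prefix)
    }


def is_social(author: str, porcelain: str) -> bool:
    """True if `author` contributed to the file and at least one other author did too."""
    authors = extract_authors(porcelain)
    return author in authors and len(authors) > 1


# Hypothetical blame output for a four-line file with two authors.
SAMPLE = (
    "1111 1 1 3\nauthor Author A\nauthor-mail <a@example.org>\n\tline 1\n"
    "1111 2 2\nauthor Author A\n\tline 2\n"
    "1111 3 3\nauthor Author A\n\tline 3\n"
    "2222 4 4 1\nauthor Author B\nauthor-mail <b@example.org>\n\tline 4\n"
)
```

With the sample above, `is_social("Author A", SAMPLE)` holds because both Author A and Author B appear in the blame output. Matching on display names is a simplification; matching on e-mail addresses or account identifiers would be a reasonable alternative.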
Software vs. Non-software. Following Munaiah et al (2016), we first distinguish between software projects (i.e., an engineered software project with documentation, testing, and project management) and non-software repositories. Concretely, we classify software repositories based on the Borges et al (2016) classifications: (a) Application Software, (b) System Software, (c) Web-based-application, libraries, and frameworks, (d) Non-web libraries and frameworks, (e) Software tools, and (f) Documentation. We use the Kalliamvakou et al (2014) classifications for non-software repositories: (a) Experimental, (b) Storage, (c) Academic, (d) Web, (e) No longer accessible, and (f) Empty.

3 Data Preparation

To ensure that participants are newcomer candidates, we conducted a preliminary survey to explicitly verify their experience with OSS repositories.

3.1 Preliminary Survey: newcomer candidate verification

Survey Design. Table 1 shows the three survey questions. Apart from the explicit verification of the requirements for being a newcomer candidate, respondents were asked about their motivations, interests, and perception of their programming skill.

Table 1: Survey questions sent to potential respondents (survey questions for newcomer candidates)
Q1) What is your motivation to make a contribution to GitHub?
    (a) Learning to Code.
    (b) Assignment or Experiment Project.
    (c) Intend to contribute to an Open Source.
    (d) Use to showcase my programming skills.
    (e) Others.
Q2) Did you have prior experience contributing to an OSS before GitHub? (Yes/No)
Q3) List your programming knowledge/interests? (short answer)

To find potential respondents, with the consent of the repository owners, we mined the community of the first-contributions repository [4]. From the repository, we were able to collect 10,000 emails. The survey was sent out by email over a four-week period, and anonymous responses were collected [5]. In the end, we received 219 responses.
Table 2b details the respondents' prior experience, showing that 85% of respondents (i.e., 187 responses) do not have any experience, while only 15% (i.e., 32 responses) have experience contributing to an OSS. From the results, we find that 187 respondents are recognized as newcomer candidates by our definition, i.e., a newcomer candidate is a novice developer that is a new user to the GitHub platform. Furthermore, 82% of respondents were motivated by the intent to contribute to an OSS project (Table 2a).

[4] https://github.com/firstcontributions/first-contributions
[5] Our questionnaire is available at https://tinyurl.com/r7acxvn

Table 2: Table 2a shows that most respondents were motivated by the intent to contribute to an OSS project, and Table 2b shows that most respondents do not have any prior experience contributing to an OSS before GitHub.

(a) Answers to Q1 of the survey
What is the motivation to contribute? | Percent
(a) Learning to Code. | 58%
(b) Assignment or Experiment Project. | 21%
(c) Intend to contribute to an Open Source. | 82%
(d) Use to showcase my programming skills. | 42%
(e) Others | 5%

(b) Answers to Q2 of the survey
Have you had any prior OSS experience? | Percent
No | 85%
Yes | 15%

3.2 Mining Newcomer Candidate Repositories

To construct our dataset, we map our verified newcomer candidate information to their GitHub repository contributions. To do so, we use the GitHub REST API (GitHub, 2020) to retrieve newcomer candidate related information (i.e., contributed repositories, submitted commits) for the GitHub accounts that respondents left in the survey. In the end, we successfully matched 177 newcomer candidates with their 2,437 contributed repositories. Note that these 2,437 repositories are unique.

Fig. 3: An overview of sub-dataset preparation. Two sub-datasets are constructed based on the newcomer candidate dataset: the first commit dataset (after filtering out non-contributing newcomer candidates) and the representative repository dataset (after distinguishing Clone and Push from Fork and PR repositories).
Table 3: Dataset summary. 177 newcomer candidates are studied.
Dataset | Measure | Count
Newcomer Candidate Dataset | # Newcomer candidates | 177
Newcomer Candidate Dataset | # Contributed repositories | 2,437
First Commit Dataset | # Commits | 174
Representative Repository Dataset | # Fork and PR repositories | 274
Representative Repository Dataset | # Clone and Push repositories | 305

Figure 3 shows an overview of our sub-dataset preparation as discussed below, while Table 3 shows the details of our newcomer candidate datasets:

First commit dataset. First, we construct a dataset consisting of the first commits that newcomer candidates contributed. To do so, we first cloned the earliest GitHub repository of each of the 177 newcomer candidates. We then extracted the first commit id (i.e., sha) from each repository’s commit log as their first-ever contribution. We applied a filter to remove newcomer candidates who did not make any contribution of their own after joining GitHub, which excluded three newcomer candidates who had only forked a repository. This leaves 174 newcomer candidates who committed to their GitHub repositories, i.e., 174 first commits, as shown in Table 3.

Representative repository dataset. We construct another dataset for a qualitative analysis of the repositories, drawn from 955 fork and PR workflow repositories and 1,482 clone and push workflow repositories. To do so, from the 2,437 repositories of the 177 newcomer candidates, we draw a statistically representative sample (i.e., a confidence level of 95% and a confidence interval of 5 [6]). The calculation of statistically significant sample sizes based on population size, confidence interval, and confidence level is well established (Krejcie and Morgan, 1970). We randomly sampled 274 fork and PR repositories and 305 clone and push repositories to get a representative repository dataset that consists of 579 sample repositories, as shown in Table 3.
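The sample sizes above can be reproduced with a short calculation. The sketch below assumes Cochran's formula with a finite population correction, a z-score of 1.96 for the 95% confidence level, and the conservative proportion p = 0.5; whether the cited calculator uses exactly this formula is an assumption, and the function name `sample_size` is hypothetical.

```python
def sample_size(population: int, margin: float = 0.05,
                z: float = 1.96, p: float = 0.5) -> int:
    """Sample size for a finite population (Cochran's formula with correction).

    margin: confidence interval as a fraction (0.05 means +/- 5 points)
    z:      z-score of the confidence level (1.96 for 95%)
    p:      assumed proportion; 0.5 gives the largest (most conservative) size
    """
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it to account for the finite population.
    return round(n0 / (1 + (n0 - 1) / population))


fork_and_pr = sample_size(955)        # 274
clone_and_push = sample_size(1482)    # 305
```

Under these assumptions the two strata yield 274 and 305 repositories, which matches the 579 sampled repositories reported in Table 3.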
4 Approach

In this section, we follow the protocols highlighted in our registered report (Rehman et al, 2020) to answer the research questions.

4.1 Answering RQ1

To answer RQ1, we use a quantitative method to identify whether newcomer candidates practice social coding. To do so, we adopt the first commit dataset (see Section 3.2), which includes the first commits of all 174 newcomer candidates. We then identify social coding using Algorithm 1.

[6] https://www.surveysystem.com/sscalc.html

Algorithm 1: Our algorithm to classify a first commit as social or non-social
    Input:  first_commit performed by an author au
    Output: Contribution type of the first commit: social or non-social
    1  F ← the set of files modified by first_commit
    2  Type ← non-social
    3  for f ∈ F do
    4      D ← extract_authors(git-blame(f))
    5      if au ∈ D and |D| > 1 then
    6          Type ← social
    7      end
    8  end
    9  return Type

Algorithm 1 details our procedure to identify social coding. We first extract the files contained in the first commit of a newcomer candidate (line 1). Second, by default, we label the type of every first commit as non-social (line 2). Then, we apply the git-blame command to each file in the commit to check whether the file received changes from more than one unique author (lines 3-4). Next, we classify the first commit as social if the newcomer candidate changed a file edited by other authors (lines 5-7). Otherwise, the type of contribution remains non-social (line 9).

To validate our hypothesis (H1) A newcomer candidate is more likely to practice social coding on GitHub, we use the one proportion Z-test (Paternoster et al, 1998). The one proportion Z-test compares an observed proportion to a theoretical one when the categories are binary.

4.2 Answering RQ2

To answer RQ2, we use a semi-automatic approach to identify the different kinds of first contributions made by newcomer candidates, as described in Section 2.1. To do so, we use the same first commit dataset as in RQ1, which includes the 174 first commits of all newcomer candidates from their first projects.

Our approach consists of two rounds. In the first round, we applied the keyword list from Hattori and Lanza (2008) to automatically classify commits into a particular category, successfully matching 158 commits to a kind. In the second round, we performed an additional manual check for the remaining 16 commits not covered by the keyword list.
Our new predefined keyword list includes (Forward Engineering: first), (Corrective Engineering: solution, break), (Re-engineering: revisi, reforma, chang, simpl), and (Management: note).

To validate our hypothesis (H2) A contribution to a GitHub repository by a newcomer candidate is more likely to add new content, we use, as in RQ1, the one proportion Z-test (Paternoster et al, 1998). Note that Corrective Engineering, Re-Engineering, and Management are merged into Non-Forward Engineering in our significance test.
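As a rough illustration of the significance test used for H1 and H2, the sketch below computes a one proportion Z-test against a theoretical proportion of 0.5. Two points are assumptions and not taken from the study: the p-value is computed as two-sided, and the count of 118 out of 174 is only an approximation derived from a rounded percentage (68%), not a reported figure. The function name `one_proportion_z_test` is hypothetical.

```python
import math


def one_proportion_z_test(successes: int, n: int,
                          p0: float = 0.5) -> tuple[float, float]:
    """Return (z, two-sided p-value) for an observed proportion versus p0."""
    p_hat = successes / n
    # Standard error under the null hypothesis that the true proportion is p0.
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative input only: roughly 68% of 174 first commits.
z, p_value = one_proportion_z_test(118, 174)
```

With these illustrative counts the statistic is well above 4 and the p-value falls below 0.001; a proportion of exactly one half would give z = 0 and a p-value of 1.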
4.3 Answering RQ3

To answer RQ3, we use a qualitative method to identify the different kinds of repositories described in Section 2.1. We use the representative repository dataset, as described in Section 3.2. For the manual classification, the three authors of this paper first validated the coding on 30 samples. We then measured the inter-rater agreement using Cohen’s Kappa. The Kappa agreement score for classifying fork and PR workflow projects is 0.91, which is interpreted as “almost perfect”, while the Kappa agreement score for classifying clone and push workflow projects is 0.76, which is interpreted as “substantial agreement” (Viera et al, 2005). After the validation, the two authors completed the manual coding for the remaining repositories in the representative sample.

To validate our hypothesis (H3) A newcomer candidate is more likely to target software repositories, we use, as in RQ1, the one proportion Z-test (Paternoster et al, 1998).

4.4 Answering RQ4

To answer RQ4, we perform both a quantitative and a qualitative analysis. In the quantitative analysis, we use the total of 2,437 projects (see Section 3.2) of the 177 newcomer candidates. Using a curated dataset of engineered software repositories provided by Munaiah et al (2016), we classify whether or not a newcomer candidate has onboarded an engineered software project.

To complement this quantitative analysis, we conducted a survey as a qualitative analysis to acquire the perception of newcomer candidates. The perception is split into two questions. The first question asks whether or not newcomer candidates have onboarded. In the second question, inspired by the previous work of Steinmacher et al (2014b), we validate the barriers faced by newcomer candidates when placing their initial contributions to OSS projects.
We focused on the same five popular barriers as Steinmacher et al (2014b): (a) Social Interaction, (b) Newcomer Previous Knowledge, (c) Finding a Way to Start, (d) Technical Hurdles, and (e) Documentation. For the answer options, we set levels of agreement on a five-point Likert scale (from "strongly disagree" to "strongly agree"). Our survey details are available at https://forms.gle/JQiVamovUXdJiy8z5.

5 Results

In this section, we present the results for each of our research questions.

5.1 (RQ1) To what extent does a newcomer candidate practice social coding?

Social Coding. The majority of the newcomer candidates do not practice social coding after joining GitHub. Table 4 presents the frequency of social and non-social contributions made by newcomer candidates. We find that 68% of newcomer candidates make non-social-based initial contributions after joining GitHub, while 32% of newcomer candidates make social-based initial contributions. The results suggest that newcomer candidates are less likely to collaborate with other developers when placing their first GitHub contributions.

Table 4: Frequency of newcomer candidates' social and non-social contributions. 68% of newcomer candidates make non-social based initial contributions after joining GitHub.
Coding Category | Percent (%)
Non-Social | 68
Social | 32

Our statistical test reveals that a significant difference exists between the proportion of social and non-social based contributions, with a p-value < 0.001. Newcomer candidates are more likely to practice non-social coding. The result indicates that our proposed hypothesis, i.e., (H1) A newcomer candidate is more likely to practice social coding on GitHub, is not established.

RQ1 Summary: Our results show that 68% of the newcomer candidates do not practice social coding (i.e., newcomer candidates are less likely to collaborate with other developers in their initial contributions) after joining GitHub. This indicates that our proposed hypothesis that a newcomer candidate is more likely to practice social coding on GitHub is not established.

5.2 (RQ2) What are the kinds of initial contributions that come from a newcomer candidate?

Frequency of initial contribution kinds. 86% of newcomer candidates typically engage in a forward-engineering activity. Table 5 depicts the distribution of the kinds of initial contributions that come from newcomer candidates. The table reveals that newcomer candidates are most likely to engage in development activities related to incorporating new features and implementing new requirements.
The next most frequent activity among newcomer candidates is the maintenance activity related to refactoring and redesign (i.e., re-engineering), at 8%. On the other hand, we observe that only 1% of newcomer candidates contribute to each of corrective-engineering and management. The results indicate that newcomer candidates are less likely to engage in maintenance activities related to handling defects, formatting code, cleaning up, and updating documentation. In addition, 5% of initial contributions are classified as Others. Through our manual check, we find that these initial contributions are either inaccessible (i.e., 404 errors in first commit links) or cannot be classified into any category based on our keyword list.
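The keyword-based round of the classification can be sketched as follows. The sketch uses only the supplementary keyword stems listed in Section 4.2; the original keyword list of Hattori and Lanza (2008) is not reproduced here. Case-insensitive substring matching, the first-match-wins order, and the name `classify_commit` are assumptions made for illustration.

```python
# Supplementary keyword stems from Section 4.2, keyed by contribution kind.
KEYWORDS = {
    "Forward Engineering": ["first"],
    "Corrective Engineering": ["solution", "break"],
    "Re-engineering": ["revisi", "reforma", "chang", "simpl"],
    "Management": ["note"],
}


def classify_commit(message: str) -> str:
    """Return the first contribution kind whose stem occurs in the commit message."""
    text = message.lower()
    for kind, stems in KEYWORDS.items():
        if any(stem in text for stem in stems):
            return kind
    # No stem matched: left for manual inspection, reported as "Others".
    return "Others"
```

For example, `classify_commit("First commit")` yields "Forward Engineering", while a message that matches no stem falls through to "Others". A message containing stems from two kinds is assigned to whichever kind is listed first, which is one reason a manual check remains useful.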
Table 5: Frequency for initial contribution kinds from newcomer candidates. 86% of newcomer candidates typically engage in forward-engineering activity.
Initial contribution kinds | Percent (%)
Forward-Engineering | 86
Re-Engineering | 8
Management | 1
Corrective-Engineering | 1
Others | 5

Our statistical test confirms a significant difference between the proportion of forward-engineering and non-forward-engineering contributions, with a p-value < 0.001. A newcomer candidate is more likely to add new content (i.e., forward-engineering) in their first contributions. This result indicates that our hypothesis, i.e., (H2) A contribution to a GitHub repository by a newcomer candidate is more likely to add new content, is established.

RQ2 Summary: We find that 86% of newcomer candidates’ contributions are new features and requirements (i.e., forward-engineering activities), statistically confirming our hypothesis that a contribution to a GitHub repository by a newcomer candidate is more likely to add new content.

5.3 (RQ3) What kinds of repositories does a newcomer candidate target?

Frequency for kinds of repositories targeted. Around 53% of newcomer candidates target repositories that are non-software based. Table 6 shows the proportion of software and non-software based repositories that newcomer candidates target. We find that newcomer candidates are less likely to target software-based repositories that leverage sound software engineering practices in each of their dimensions; these account for 47%. Upon closer inspection of the two workflows (i.e., fork and PR, clone and push), we observe that the dominant workflow for software-based repositories is clone and push, i.e., 56%. In non-software based repositories, by contrast, neither workflow dominates: clone and push and fork and PR each account for 50%.
We now further examine what kinds of repositories newcomer candidates target within each of the two workflows (i.e., clone and push, fork and PR). Based on manual coding of a statistically representative sample, Figure 4 shows that Documentation (21%), Experimental (15%), and Web-based-application, libraries, and frameworks (15%) are the most frequently targeted repository kinds in the clone and push workflow. The other kinds of repositories that newcomer candidates frequently target are Academic (12%), Web (10%), and Application Software (9%).
Specifically, we do not find any repositories related to System Software.

On the other hand, in the fork and PR workflow, we find that Experimental (24%) and Web-based-application, libraries, and frameworks (16%) are the most commonly targeted repository kinds. The other kinds of repositories commonly targeted are Documentation (13%) and Academic (12%).

Fig. 4: Frequency for contributed repository kinds within the Clone and Push and Fork and PR workflows. Documentation and Experimental are the most frequently targeted repository kinds in the two workflows, i.e., 21% (Clone and Push) and 24% (Fork and PR), respectively. [Bar chart with one panel per workflow; x-axis: Percent (%); y-axis: repository kinds, grouped into Non-Software (Web, Storage, No longer accessible, Experimental, Empty, Academic) and Software (Web-based-application, etc., System Software, Software tools, Non-web libraries and frameworks, Documentation, Application Software).]

Our statistical test finds no significant difference between the proportion of software and non-software based repositories that newcomer candidates target, with a p-value > 0.05. The result indicates that our proposed hypothesis, i.e., (H3) A newcomer candidate is more likely to target software repositories, is not established.

RQ3 Summary: Results show that 53% of newcomer candidates targeted non-software based repositories. Statistically, we cannot determine whether newcomer candidates are more likely to choose software repositories over non-software repositories or vice versa.

Table 6: The proportion of software and non-software repositories targeted by newcomer candidates. Around 53% of newcomer candidates targeted Non-Software repositories.
Category | Percent (%) | Contribution Workflow (%)
Software | 47 | Clone and Push (56), Fork and PR (44)
Non-Software | 53 | Clone and Push (50), Fork and PR (50)
5.4 (RQ4) What proportion of newcomer candidates eventually onboard an OSS project?

Onboard OSS. We now discuss whether newcomer candidates onboard OSS projects. Table 7a presents the distribution of newcomer candidates that onboard OSS projects according to the quantitative analysis. The quantitative results show that only 3% of newcomer candidates onboard OSS projects, while 97% of newcomer candidates do not. One explanation for such low matching is that the curated engineered OSS projects are a smaller and outdated subset of OSS projects. On the other hand, our qualitative analysis shows that 70% of newcomer candidates claim that they have successfully contributed to OSS projects since joining GitHub. Table 7b shows the distribution of newcomer candidates that onboard OSS projects according to the qualitative analysis.

Table 7: Frequency of newcomer candidates that onboard OSS projects from quantitative and qualitative analysis.

(a) Newcomer candidates onboard OSS from quantitative analysis
Onboarded by Munaiah et al (2016) | Percent
Onboard | 3%
Not-Onboard | 97%

(b) Newcomer candidates onboard OSS from qualitative analysis
Onboarded by survey response | Percent
Onboard | 70%
Not-Onboard | 30%

Barriers faced by newcomer candidates. We now further validate the barriers faced by 27 surveyed newcomer candidates. Figure 5 shows the results of our Likert-scale question related to barriers. The figure shows that finding a way to start is the most crucial barrier, with 22 positive responses (i.e., 12 agree responses and 10 strongly agree responses). The second most agreed-upon barrier is technical hurdles, receiving 18 positive responses (i.e., 15 agree responses and 3 strongly agree responses). Newcomer previous knowledge is the third most agreed-upon barrier, with 16 positive responses (i.e., 10 agree responses and 6 strongly agree responses).
On the other hand, respondents are more likely to disagree with the statements that social interaction and documentation are barriers to onboarding OSS projects (i.e., 7 negative responses for each barrier).
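The ranking of barriers follows directly from the agree and strongly agree counts reported in the text. The short sketch below tallies them; the counts and the total of 27 respondents come from the survey results above, while the function and variable names are illustrative only.

```python
# (agree, strongly agree) counts as reported in the text, out of 27 respondents.
POSITIVE_RESPONSES = {
    "Finding a way to start": (12, 10),
    "Technical hurdles": (15, 3),
    "Newcomer previous knowledge": (10, 6),
}
TOTAL_RESPONDENTS = 27


def rank_barriers(counts, total):
    """Return (barrier, positive_count, percent) tuples, most positive first."""
    rows = []
    for barrier, (agree, strongly_agree) in counts.items():
        positive = agree + strongly_agree
        rows.append((barrier, positive, round(100 * positive / total, 1)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = rank_barriers(POSITIVE_RESPONSES, TOTAL_RESPONDENTS)
for barrier, positive, percent in ranking:
    print(f"{barrier}: {positive}/{TOTAL_RESPONDENTS} positive ({percent}%)")
```

Only the three barriers with fully reported positive counts are included; social interaction and documentation are left out because the text gives only their negative counts.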
[Figure 5: diverging stacked bar chart of Likert responses for five barriers (Social Interaction, Newcomer Previous Knowledge, Finding a Way to Start, Technical Hurdles, Documentation). x-axis: Count. Legend: Strongly Disagree, Partially Disagree, Neutral, Partially Agree, Strongly Agree.]

Fig. 5: Barriers faced by newcomer candidates. Most newcomer candidates (i.e., 22 out of 27 responses) agree or strongly agree that finding a way to start is a barrier.

RQ4 Summary: Although our quantitative analysis matched only 3% of newcomer candidates to established OSS repositories, 70% of newcomer candidates claimed that they had already started to contribute to OSS repositories. Furthermore, newcomer candidates agree most strongly that finding a way to start is a barrier, while social interaction received the most mixed responses as a barrier.

6 Implications

We now discuss the implications of our results and provide suggestions for newcomer candidates, OSS projects, and researchers:

Suggestions for Newcomer Candidates. RQ1 shows that most newcomer candidates are not practicing social coding while making their initial contribution, with Table 4 showing that 68% of newcomer candidates’ initial contributions are non-social. These results indicate that newcomer candidates tend to stick to their solo projects and personal activities even after joining the GitHub platform. Although recent studies have shown evidence that social coding improves collaboration among developers (Thung et al, 2013), our results show that most newcomer candidates do not yet practice it. Our practical suggestion is for newcomer candidates to actively read documentation such as contributing guidelines and to engage in discussions and threads. Doing so may increase their confidence and the likelihood of engaging in social coding interactions on GitHub.
There are also initiatives such as Hacktoberfest (see footnote 7) that encourage contributions, especially from newcomers.

Our qualitative analysis for RQ2 and RQ3 helps to understand the contribution and repository kinds. This analysis gives newcomer candidates insights for choosing suitable repositories that match their prior contributions. The complementary results of RQ2 and RQ3 reveal that, after joining GitHub, newcomer candidates prefer to add new content to non-software experimental repositories. The results show that these repositories serve the essential purpose of engaging newcomer candidates and could be crucial to keeping them motivated before they move to a real OSS project.

The newcomer candidate responses in RQ4 reveal which barriers explain why some newcomers never end up contributing to an OSS project. As the responses show, finding a way to start is one of the most challenging barriers. To this end, newcomer candidates could follow the suggestions of Subramanian et al (2020) for first-timer-friendly tasks, including minor feature additions (a change of around 36 lines of code), minor documentation changes, and selected bug fixes, which could reduce this problem. Furthermore, there are online resources (see footnote 8) that help newcomer candidates find easy issues or other opportunities to make a contribution.

Suggestions for OSS Projects. Our findings provide practical implications to assist with the onboarding process. The results for RQ2 and RQ3 show that the repository and contribution kinds give newcomer candidates insights into selecting projects to contribute to, which plays a role in attracting potential contributors. Therefore, OSS projects that want to attract newcomer candidates can use our results to find the most prominent contribution and repository kinds. However, many practical problems and difficulties still exist.
Thus, OSS projects may benefit from offering the right contributions to target a specific type of newcomer candidate (e.g., documentation opportunities or a particular type of forward engineering). Given the barriers highlighted in RQ4, OSS project teams should identify practical examples that lower these barriers for a newcomer candidate to contribute. Tan et al (2020) showed that OSS projects now highlight specific issues that are potentially good first issues for newcomer candidates to target. We propose that similar strategies be highlighted, especially targeting non-software components such as documentation.

Suggestions for Researchers. We envision researchers building on our results and open research questions to widen our understanding and to develop strategies that encourage newcomer candidates’ onboarding process. For example, based on the manual classification results obtained in RQ3 (53% non-software repositories and 47% software repositories), we have an idea of these newcomer candidates’ advertised skill levels. At this stage, our classifications are rather generic. We envision that future work could include concrete examples of newcomer candidate source code patches and identify the minimal skill levels required for a newcomer candidate to onboard into the OSS world. Interestingly, we find that newcomer candidates’ perception of OSS projects may differ from what the research community regards as an OSS project. Hence, further research is needed to understand what constitutes an OSS project, as this definition may be changing over time. Future research could develop tool support that matches skill levels with OSS repositories that seek those skills. Other interesting avenues would be to explore the different motivations of GitHub users (i.e., advertising their skills for a job, practicing skills, or learning and educational purposes), and to identify the minimal skills to teach these newcomer candidates to help them become successful contributing members of the different OSS projects.

Footnote 7: https://hacktoberfest.digitalocean.com/
Footnote 8: https://www.firsttimersonly.com/

7 Threats to Validity

We now discuss threats to the validity of our empirical study.

External Validity. We perform an empirical study on newcomer candidates relying on the GitHub platform. Our key limitation is that our newcomer candidates are restricted to the GitHub users collected from our preliminary survey. Newcomer candidates may have been active on platforms other than GitHub, whereas our approach picks up only a newcomer candidate’s first GitHub contribution.

Construct Validity. We summarize two threats regarding construct validity. First, in our qualitative analysis, especially for projects targeted by newcomer candidates (RQ3), categories may be miscoded due to the subjective nature of our coding approach. To mitigate this threat, we took a systematic approach and first tested our comprehension on 30 samples, using Kappa agreement scores among three separate individuals. Only after the Kappa score reached more than 0.91 for fork and PR workflow projects and 0.76 for clone and push workflow projects did we complete the coding of the rest of the sample dataset.
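To make the agreement check concrete, the sketch below computes Cohen's kappa for two raters. This is a simplification: the study involved three individuals, and this section does not state which kappa variant was used. The repository labels in the example are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used one identical label throughout.
    return (observed - expected) / (1 - expected)


# HYPOTHETICAL labels for 10 repositories (illustration only).
rater_a = ["Experimental"] * 4 + ["Documentation"] * 3 + ["Academic"] * 3
rater_b = ["Experimental"] * 4 + ["Documentation"] * 2 + ["Academic"] * 4
kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

In a workflow like the one described, coding of the remaining sample would start only once the score passes the chosen threshold. With three raters, one option is to compute the score pairwise; a multi-rater statistic such as Fleiss' kappa is another.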
The second possible threat lies in our quantitative analysis of RQ4, which determines what proportion of newcomer candidates onboard an OSS project. We matched newcomer candidates’ projects against the curated dataset of engineered software projects provided by Munaiah et al (2016), which was last updated in 2017. We might obtain a different proportion of newcomer candidates who onboard OSS projects if the curated dataset were updated.

Internal Validity. Newcomer candidates have full control over the repositories listed in the owned repositories section, so if they decide to remove their first contribution or first project from the page, we cannot pick up their actual first project or contribution. However, we do not know why a newcomer candidate would do so.

Another internal threat to validity relates to the results of the quantitative analysis of RQ1. According to these results, 32% of newcomer candidates’ initial contributions involve social coding. Using the git-blame command, we count the number of developers on the committed files in the initial contribution and regard that contribution as social if we find changes made by more than one author. However, we observed that in some initial contributions the same newcomer candidate used different IDs, which makes the first contribution appear social. Thus, future in-depth qualitative analysis or experimental studies are needed to better understand the reason for this behavior.

8 Related Work

In this section, we present significant findings from related work on newcomers.

Motivation for Newcomers and OSS Projects. Motivation and a project’s attractiveness play a vital part in drawing outsiders into joining a project. Developer motivation and project attractiveness are well-explored OSS research topics (Meirelles et al, 2010; Santos et al, 2013; Shah, 2006; Ye and Kishida, 2003). Other studies investigate how newcomers join projects in order to become core project members (Ducheneaut, 2005; Fang and Neufeld, 2009; Krogh et al, 2003; Marlow et al, 2013; Nakakoji et al, 2003). From a more positive angle, Choi et al (2010) found that a welcome message, technical assistance, and constructive criticism delayed the natural decline of newcomer editing. Other parts of the literature focus on the forces of motivation and attractiveness that drive newcomers toward projects. For example, Lakhani and Wolf (2003) found that external benefits (e.g., better work, career advancement) primarily motivate new contributors, along with fun, code-based challenges, and improved programming skills.

Onboarding OSS Projects. Onboarding OSS projects has been extensively studied (Krogh et al, 2003; Nakakoji et al, 2003).
Fagerholm et al (2013) report preliminary results of a study that deals directly with the process of onboarding OSS projects. Newcomer onboarding also affects commercial software development settings, as described by Begel and Simon (2008) and Dagenais et al (2010). Considering the perspective of individual developers, Ducheneaut (2005) approached onboarding from a sociological point of view.

Mentorship is recognized as an important activity for supporting the onboarding of newcomers to OSS. Swap et al (2001) describe mentoring as a basic knowledge transfer mechanism in the enterprise. Sim et al (1998) present a study in which mentoring patterns occur when new developers are integrated into software projects. Krogh et al (2003) proposed a joining script for developers who want to participate in a project. Nakakoji et al (2003) also studied OSS projects and proposed eight possible joining roles arranged in concentric layers, called "the onion patch". Furthermore, Zhou and Mockus (2015) found that an individual’s willingness and the project’s climate were associated with the odds that the individual would become a long-term contributor.

Barriers for Newcomers. Newcomers are important to the survival, long-term success, and continuity of OSS projects (Kula and Robles, 2019). However, newcomers face many difficulties when making their first contribution to a project. OSS project newcomers are usually expected to learn about the project on their own (Scacchi, 2002). As a result, some newcomers send contributions that are not incorporated into the source code and give up trying (Steinmacher et al, 2015; Steinmacher et al, 2015). As discussed by Zhou and Mockus (2010), the transfer of entire projects, the renewal of core developers, and participation in OSS projects present similar challenges of rapidly increasing newcomer competence in software projects.

Several previous research efforts have addressed reducing the barriers for newcomers. Steinmacher et al (2014a) proposed a developer joining model that represents the stages that are common to newcomers and the forces that draw them toward or push them away from a project. Steinmacher et al (2016) created a portal called FLOSScoach, based on a conceptual model of barriers, to support newcomers. The evaluation shows that FLOSScoach played an important role in guiding newcomers and in lowering barriers related to the orientation and contribution process. Besides these studies, in terms of barriers, our research complements the work of Steinmacher et al (2014b) by highlighting the most crucial barrier, i.e., finding a way to start, which makes it difficult for newcomer candidates to contribute to OSS projects.
Compared to other work, our study takes a first look at these candidates to better understand their social interaction, initial contribution kinds, targeted repositories, and onboarding issues and barriers. Other work has extensively investigated the nature of newcomers, but none focuses on newcomer candidates, i.e., novice developers with the intention of later onboarding OSS projects.

9 Conclusion

This paper analyzes a new category of potential contributors to OSS projects (i.e., newcomer candidates). Our results show that these newcomer candidates are more likely to practice non-social coding (i.e., 68%), and they tend to work on forward-engineering activities (i.e., 86%) in their first commits. Nevertheless, we cannot determine whether newcomer candidates are likely to choose software repositories over non-software ones or vice versa. Regarding onboarding, although very few (i.e., 3%) newcomer candidates onboard established engineered OSS repositories, 70% of newcomer candidates claim they already contribute to an OSS project, citing finding a way to start as a key barrier to onboarding.

As GitHub continues to grow, so does the potential for the newcomer candidate. This study opens up new avenues for future work, especially targeting potential contributors to onboard existing OSS projects. Researchers can also analyze how to sustain newcomer candidates’ activities until they are ready to successfully onboard. More practical applications would be tool support to (i) recommend suitable repositories for newcomer candidates and (ii) identify practical examples that OSS project teams can use to lower their barriers for a newcomer candidate to contribute.

Acknowledgement

This work is supported by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Numbers 18H04094, 20K19774, and 20H05706.

References

Begel A, Simon B (2008) Novice software developers, all over again. ICER’08 - Proceedings of the ACM Workshop on International Computing Education Research
Borges H, Hora A, Valente MT (2016) Understanding the factors that impact the popularity of GitHub repositories. In: ICSME
Choi B, Alexander K, Kraut RE, Levine JM (2010) Socialization tactics in Wikipedia and their effects. In: Proceedings of the 2010 ACM conference on Computer supported cooperative work, pp 107–116
Coelho J, Valente MT (2017) Why modern open source projects fail. In: FSE
Dagenais B, Ossher H, Bellamy RKE, Robillard MP, de Vries JP (2010) Moving into a New Software Project Landscape, Association for Computing Machinery, pp 275–284
Ducheneaut N (2005) Socialization in an open source software community: A socio-technical analysis. Computer Supported Cooperative Work (CSCW) 14:323–368
Fagerholm F, Johnson P, Guinea A, Borenstein J, Münch J (2013) Onboarding in open source software projects: A preliminary analysis. In: 2013 IEEE 8th International Conference on Global Software Engineering Workshops
Fang Y, Neufeld D (2009) Understanding sustained participation in open source software projects. J Manage Inf Syst
GitHub (2020) URL https://developer.github.com/v3/
Hattori LP, Lanza M (2008) On the nature of commits. In: ASE
Hindle A, German DM, Holt R (2008) What do large commits tell us? A taxonomical study of large commits. In: Proceedings of the 2008 international working conference on Mining software repositories, pp 99–108
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014) The promises and perils of mining GitHub. In: MSR
Krejcie RV, Morgan DW (1970) Determining sample size for research activities. Educational and Psychological Measurement 30(3):607–610
Krogh G, Spaeth S, Lakhani K (2003) Community, joining, and specialization in open source software innovation: A case study. Research Policy 32:1217–1241
Kula RG, Robles G (2019) The Life and Death of Software Ecosystems, Springer, pp 97–105
Lakhani K, Wolf R (2003) Why hackers do what they do: Understanding motivation and effort in free/open source software projects. Perspectives on Free and Open Source Software
Marlow J, Dabbish L, Herbsleb J (2013) Impression formation in online peer production: Activity traces and personal profiles in GitHub. In: Proceedings of the 2013 conference on Computer supported cooperative work, Association for Computing Machinery, New York, NY, USA, CSCW ’13, pp 117–128
Meirelles P, Santos Jr C, Miranda J, Kon F, Terceiro A, Chavez C (2010) A study of the relationships between source code metrics and attractiveness in free software projects. In: 2010 Brazilian Symposium on Software Engineering, pp 11–20
Munaiah N, Kroh S, Cabrey C, Nagappan M (2016) Curating GitHub for engineered software projects. EMSE
Nakakoji K, Yamamoto Y, Nishinaka Y, Kishida K, Ye Y (2003) Evolution patterns of open-source software systems and communities. International Workshop on Principles of Software Evolution (IWPSE)
Park Y, Jensen C (2009) Beyond pretty pictures: Examining the benefits of code visualization for open source newcomers. In: VISSOFT
Paternoster R, Brame R, Mazerolle P, Piquero A (1998) Using the correct statistical test for the equality of regression coefficients. Criminology 36(4):859–866
Purushothaman R, Perry DE (2005) Toward understanding the rhetoric of small source code changes. IEEE Transactions on Software Engineering 31(6):511–526
Rehman I, Wang D, Kula RG, Ishio T, Matsumoto K (2020) Newcomer candidate: Characterizing contributions of a novice developer to GitHub. In: 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp 855–855
Santos C, Kuk G, Kon F, Pearson J (2013) The attraction of contributors in free and open source software projects. J Strateg Inf Syst 22(1):26–45
Scacchi W (2002) Understanding the requirements for developing open source software systems. IEE Proc Soft
Shah S (2006) Motivation, governance, and the viability of hybrid forms in open source software development. Management Science 52:1000–1014
Sim S, Richard S, Holt C (1998) The ramp-up problem in software projects: A case study of how software immigrants naturalize. Proceedings of the 20th international conference on Software engineering, pp 361–370
Steinmacher I, Gerosa MA, Redmiles D (2014a) Attracting, onboarding, and retaining newcomer developers in open source software projects. In: CSCW
Steinmacher I, Graciotto Silva MA, Gerosa MA, Redmiles D (2014b) A systematic literature review on the barriers faced by newcomers to open source software projects. IST
Steinmacher I, Conte T, Gerosa MA, Redmiles DF (2015) Social barriers faced by newcomers placing their first contribution in open source software projects. In: CSCW
Steinmacher I, Conte TU, Gerosa MA (2015) Understanding and supporting the choice of an appropriate task to start with in open source software communities. In: HICSS
Steinmacher I, Conte TU, Treude C, Gerosa MA (2016) Overcoming open source project entry barriers with a portal for newcomers. In: ICSE
Subramanian VN, Rehman I, Nagappan M, Kula RG (2020) Analyzing first contributions on GitHub: What do newcomers do. IEEE Software pp 0–0
Swap W, Leonard D, Shields M, Abrams L (2001) Using mentoring and storytelling to transfer knowledge in the workplace. J of Management Information Systems 18:95–114
Tan X, Zhou M, Sun Z (2020) A First Look at Good First Issues on GitHub, Association for Computing Machinery, New York, NY, USA, pp 398–409
Thung F, Bissyande TF, Lo D, Jiang L (2013) Network structure of social coding in GitHub. In: 2013 17th European conference on software maintenance and reengineering, IEEE, pp 323–326
Valiev M, Vasilescu B, Herbsleb J (2018) Ecosystem-level determinants of sustained activity in open-source projects: A case study of the PyPI ecosystem. In: FSE
Viera AJ, Garrett JM, et al (2005) Understanding Interobserver Agreement: The Kappa Statistic. Family Medicine 37(5):360–363
Ye Y, Kishida K (2003) Toward an understanding of the motivation open source software developers. In: Proceedings of the 25th International Conference on Software Engineering, IEEE Computer Society, USA, ICSE ’03, pp 419–429
Zhou M, Mockus A (2010) Growth of newcomer competence: Challenges of globalization. In: FoSER
Zhou M, Mockus A (2015) Who will stay in the FLOSS community? Modeling participant’s initial behavior. TSE