Quantitative empirical methods – Second intro week Master's, 7 January 2021. Dag Sjøberg and Gunnar Bergersen
Second intro week Master’s, 7 January 2021. Quantitative empirical methods. Dag Sjøberg and Gunnar Bergersen 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 1
About Me
• Current position: Professor at University of Oslo
  – Software Process Improvement, Agile and Lean Methods, Software Quality, Empirical Research Methods
• Education:
  – MSc, University of Oslo, 1987
  – PhD, University of Glasgow, Computing Science, 1993
• Work experience
  – University of Oslo since 1993
  – SINTEF ICT (20 %) 2014–2019
  – Simula Research Laboratory, 2001–2009
  – University of Oslo 1993–2000
  – Statistics Norway (SSB) 1987–1990
  – National University Hospital (Rikshospitalet) 1985–1986
• Startup
  – Member of steering committee and co-owner of four startup companies
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 3
About me
• Current:
  – Associate professor II at Programming & Software Engineering (20%)
  – Chief Product Officer, Technebies (skill testing of developers)
• Education
  – MSc (2001) and PhD (2015) from University of Oslo
  – PhD thesis: “Measuring programming skill”
• Prior work experience
  – Programmer
  – IT project leader in two companies
  – CEO of three companies
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 4
Writing a Master’s thesis
You may wish to
• propose or develop a new method, tool, technique, language, practice, etc. that is supposed to be better than what exists, or
• find out whether an existing X is better than an existing Y, or
• find out how to improve X
Most Master’s theses should include some of this.
• Thesis: a scientific statement
• A Master’s (or PhD) thesis: a justification of that statement
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 5
How to find out whether something is better? • Investigate what works best in practice, that is, perform an empirical study – Experiment – Survey that asks people about their opinions – Other studies such as case studies, action research, ethnographic studies 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 6
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 7
What does it mean to be better? – How much? – How many? – How large? – How old? – How long? – How high? – How warm? – How thick, firm, etc. Answers are measured in terms of quantitative data 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 8
Quantitative data
• Data expresses quantity
• Data expressed as numbers
• Used in statistics
Qualitative data
• Data expresses quality in some sense
• Data expressed as text, images and forms other than numbers
• Quantitative data can be obtained indirectly if a mapping exists from qualitative to quantitative data
• Not used in statistics
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 9
Quantitative empirical methods • Experiments and surveys typically collect quantitative data • Therefore called “quantitative empirical methods” Empirical means using evidence based on observation or experience rather than theory or pure logic 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 10
Please spend 3 minutes writing in the chat your answers to these two questions:
• What kind of data do you plan to collect during your master’s thesis?
• Which challenges do you have regarding choice of research method and data collection?
You may wish to
• propose or develop a new method, tool, technique, language, practice, etc. that is supposed to be better than what exists, or
• find out whether an existing X is better than an existing Y, or
• how to improve X – or something else
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 11
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 12
You may wish to
• propose or develop a new method, tool, technique, language, practice, etc. that is supposed to be better than what exists, or
• find out whether an existing X is better than an existing Y, or
• how to improve X
What do you believe is the case? That is, formulate a hypothesis.
Test the hypothesis: given that the hypothesis is correct, which results do you expect (predict)?
If the hypothesis is confirmed, what you believed is not hypothetical any more; you have a thesis.
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 14
Experiment
• Independent variables (treatment): method, tool, technique, language, practice, etc.
• Dependent variables (outcome): time, quality, costs, etc.
• Moderator variables (context)
The independent variables (treatments) are manipulated to produce an effect on the dependent variables. An experiment is a study in which an intervention (treatment) is introduced to observe its effects.
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 15
Controlled experiments • An experiment investigates cause-effect relations • Experiments are conducted when the investigator wants control over the situation, with direct, precise, and systematic manipulation of the behavior of the phenomenon to be studied • Difference in the outcome is supposed to be caused by the different treatments (or by the treatment compared to no treatment, that is, the control group) 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 16
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 17
What is best? Pair programming or solo programming*
• 295 junior, intermediate and senior professional Java consultants from 29 companies were paid to participate (one work day)
• 99 individuals; 98 pairs
• The pairs and individuals performed the same Java maintenance tasks on either:
  – a ”simple” system (centralized control style), or
  – a ”complex” system (delegated control style)
• We measured:
  – duration (elapsed time)
  – effort (cost)
  – quality (correctness) of their solutions
*E. Arisholm, H. Gallis, T. Dybå, and D. Sjøberg, “Evaluating Pair Programming with Respect to System Complexity and Programmer Expertise,” IEEE Transactions on Software Engineering, 2007, 33(2): 65–86.
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 18
Delegated vs. centralized control – expert opinions
• The Delegated Control Style:
  – Rebecca Wirfs-Brock: “A delegated control style ideally has clusters of well defined responsibilities distributed among a number of objects. To me, a delegated control architecture feels like object design at its best…”
  – Alistair Cockburn: “[The delegated coffee-machine design] is, I am happy to see, robust with respect to change, and it is a much more reasonable ‘model of the world’.”
• The Centralized Control Style:
  – Rebecca Wirfs-Brock: “A centralized control style is characterized by single points of control interacting with many simple objects. To me, centralized control feels like a ‘procedural solution’ cloaked in objects…”
  – Alistair Cockburn: “Any oversight in the ‘mainframe’ object (even a typo!) [in the centralized coffee-machine design] means potential damage to many modules, with endless testing and unpredictable bugs.”
[The slide also shows object-interaction diagrams illustrating the delegated and centralized control styles, with labels such as Responsibility Driven Design, Role Modelling, Use-Case Driven Design and Data Driven Design.]
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 19
Evaluating the effect of a delegated vs. centralized control style on the maintainability of object-oriented software
“Assuming that it is not only highly skilled experts who are going to maintain an object-oriented system, a viable conclusion from the controlled experiment reported in this paper is that a design with a centralized control style may be more maintainable than is a design with a delegated control style.”
Erik Arisholm and Dag Sjøberg, IEEE Transactions on Software Engineering, vol. 30, no. 8, August 2004, pp. 521–534.
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 20
Total Effect of PP
[Bar chart, difference from individuals: Duration −8 %, Effort +84 %, Correctness +7 %.]
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 21
Effect of PP for Juniors
[Bar chart, difference from individuals: Duration +5 %, Effort +111 %, Correctness +73 %.]
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 22
Moderating Effect of System Complexity for Juniors
[Bar chart: difference from individuals for Duration, Effort and Correctness, shown separately for the CC (easy) and DC (complex) system; values shown are 6 %, 4 %, 112 %, 109 %, 32 % and 149 %.]
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 23
Effect of PP for Seniors
[Bar chart, difference from individuals: Duration −9 %, Effort +83 %, Correctness −8 %.]
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 24
Moderating Effect of System Complexity for Seniors
[Bar chart: difference from individuals for Duration, Effort and Correctness, shown separately for the CC (easy) and DC (complex) system; values shown are 8 %, −2 %, 55 %, 115 %, −13 % and −23 %.]
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 25
So, when should we use PP?
Programmer expertise | Task complexity | Use PP? | Comments
Junior               | Easy            | Yes     | Provided that increased quality is the main goal
Junior               | Complex         | Yes     | Provided that increased quality is the main goal
Intermediate         | Easy            | No      |
Intermediate         | Complex         | Yes     | Provided that increased quality is the main goal
Expert               | Easy            | No      |
Expert               | Complex         | No      | Unless you are sure that the task is too complex to be solved satisfactorily even by solo seniors
The question of whether PP is beneficial or not is meaningless!
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 26
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 27
A/B testing 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 28
A/B testing
• Background
  – Historically used in marketing, design, games
  – Data-driven product development & profitability tuning
  – Assumption: ”we’re all wrong most of the time”
• Typical use
  – Fast release cycles
  – Bottom-up and context dependent (i.e., ”I want to improve X for part Y”)
  – Between-subject design
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 29
Requires
• Two (or more) versions to be compared (A, B, …)
• At least one outcome variable (KPI or other metric or measure) to improve
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 30
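To make the mechanics concrete, here is a minimal sketch of how such a comparison might be analysed, assuming a hypothetical web feature where the outcome variable is a conversion count per variant; the counts, variant names and the choice of a chi-square test are illustrative assumptions, not taken from the lecture.

```python
# Illustrative A/B test analysis (hypothetical numbers, not from the lecture).
# Each user is randomly assigned to variant A or B (between-subject design);
# the outcome variable is whether the user converted (e.g., clicked "buy").
from scipy.stats import chi2_contingency

# conversions and non-conversions per variant (made-up counts)
observed = [
    [120, 880],   # variant A: 120 converted, 880 did not
    [151, 849],   # variant B: 151 converted, 849 did not
]

chi2, p_value, dof, expected = chi2_contingency(observed)

rate_a = 120 / 1000
rate_b = 151 / 1000
print(f"conversion rate A: {rate_a:.1%}, B: {rate_b:.1%}")
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")
# If p_value is below the chosen significance level (e.g., 0.05),
# we reject the null hypothesis that A and B convert equally well.
```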
Experiments
Advantages
• They are a well established strategy, seen by many as the most ‘scientific’ approach
• The only research strategy that can prove cause-effect relationships
• Laboratory experiments permit high levels of precision in measuring outcomes and in analyzing data
Disadvantages
• Laboratory experiments often create artificial situations that are not comparable to real-world situations
• Often difficult or impossible to control all the relevant variables
• It is often difficult to recruit a representative sample of participants
• It may be necessary to conceal from the participants the purpose of the research so they do not skew the results
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 31
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 32
Surveys 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 33
Surveys are common in society
• Require relatively few resources to include many people
• Create statistics and test hypotheses about characteristics of the target group (the population being investigated)
• Obtain information about people’s opinions – what, how much, how many, how and why – or about what people say they do
  – As opposed to case studies and ethnography, one does not observe
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 34
Types of surveys • Cross-sectional surveys are used to gather information on a population at a single point in time. • Longitudinal surveys gather data over a period of time. The researcher may then analyze changes in the population and attempt to describe and/or explain them. – Trend studies focus on a particular population, which is sampled and scrutinized repeatedly. While samples are of the same population, they are typically not composed of the same people. – Cohort studies also focus on a particular population, sampled and studied more than once. A cohort study would sample the same group of people, every time. – Panel studies allow the researcher to find out why changes in the population are occurring, since they use the same sample of people every time. 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 35
The importance of wording …
Two Catholic priests wondered whether one is allowed to smoke when one prays. They both sent a letter to the Pope:
P1: “Is it allowed to smoke when one prays?” Answer: NO – the prayer should get full attention
P2: “Is it allowed to pray when one smokes?” Answer: YES – it is always a good thing to pray
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 36
Question formats • Classification of objects or individuals “Are you: Male___ Female ___ Other ___?” • Ranking of items in order to reflect the relative ordering of phenomena “Please rank the following factors in order of importance (1-4)” • Pairwise comparison “Which do you prefer” • Rating of characteristics Simple, single-item scales, e.g., “Programming is a terrific course (check one)” “Strongly agree__, agree__, neither__, disagree__, strongly disagree__” 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 37
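As a small illustration of how the rating format above becomes quantitative data, the sketch below codes Likert-style responses as numbers before summarising them; the responses and the 1–5 coding are invented for illustration and are not prescribed by the slides.

```python
# Coding Likert-style ratings as numbers so they can be summarised statistically.
# The responses below are invented for illustration.
likert_scale = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither": 3,
    "agree": 4,
    "strongly agree": 5,
}

responses = ["agree", "strongly agree", "neither", "agree", "disagree"]
scores = [likert_scale[r] for r in responses]

mean_score = sum(scores) / len(scores)
print(f"coded scores: {scores}, mean = {mean_score:.2f}")
```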
Advantages
• They provide a wide and inclusive coverage of people or events
• They can be administered from remote locations using internet tools or email
• They can provide a lot of data in a short time at a reasonable cost
• They can be quantitatively analysed
• They can be replicated
Disadvantages
• They lack depth
• They tend to focus on what can be counted or measured
• They do not establish cause-effect
• They cannot judge the accuracy or honesty of people’s responses by observing their body language
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 38
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 39
Hypothesis testing • Null hypothesis: – ”there is no difference between the effect of treatments (experiments) or between groups (surveys)” – ”there is no difference between individual and pair programming” 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 40
P-value
• If it is very unlikely, for example < 5 % (the significance level), that we would have obtained the results we actually got if the null hypothesis were true, then we reject the null hypothesis and
  – claim the alternative hypothesis (”there is a difference …”)
• The likelihood of obtaining the results we did, assuming the null hypothesis is true, is called the p-value (probability value)
• One uses statistical methods to test null hypotheses
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 41
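As a concrete illustration of the definition above, here is a minimal sketch of computing a p-value with a simple permutation test on two invented groups of correctness scores; the data and the choice of test are assumptions for illustration only, not the statistical methods used in the studies cited in this lecture.

```python
# Permutation test: how likely is a difference in group means at least this
# large if the null hypothesis ("no difference between treatments") is true?
import random

random.seed(1)

pair_scores = [0.8, 0.9, 0.7, 0.85, 0.95, 0.75]   # invented correctness scores
solo_scores = [0.6, 0.7, 0.8, 0.65, 0.7, 0.75]

observed_diff = (sum(pair_scores) / len(pair_scores)
                 - sum(solo_scores) / len(solo_scores))

pooled = pair_scores + solo_scores
n_pair = len(pair_scores)
n_extreme = 0
n_permutations = 10_000

for _ in range(n_permutations):
    random.shuffle(pooled)                      # re-label subjects at random
    diff = (sum(pooled[:n_pair]) / n_pair
            - sum(pooled[n_pair:]) / (len(pooled) - n_pair))
    if abs(diff) >= abs(observed_diff):         # two-sided test
        n_extreme += 1

p_value = n_extreme / n_permutations
print(f"observed difference = {observed_diff:.3f}, p-value = {p_value:.3f}")
```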
Effect size
• The p-value is about how likely it is that there is a difference
• Effect size is about how large the difference actually is
  – Is the difference large enough to have any meaning in practice?
V.B. Kampenes, T. Dybå, J.E. Hannay and D.I.K. Sjøberg. A Systematic Review of Effect Size in Software Engineering Experiments, Information and Software Technology 49(11-12):1073–1086, 2007
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 42
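One widely used effect-size measure for the difference between two group means is Cohen's d (the standardised mean difference); the sketch below computes it for the same kind of invented data as above, assuming the pooled-standard-deviation variant.

```python
# Cohen's d: difference between group means in units of the pooled standard deviation.
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    n_a, n_b = len(group_a), len(group_b)
    # pooled (sample) standard deviation of the two groups
    pooled_sd = (((n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2)
                 / (n_a + n_b - 2)) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd

pair_scores = [0.8, 0.9, 0.7, 0.85, 0.95, 0.75]   # invented data
solo_scores = [0.6, 0.7, 0.8, 0.65, 0.7, 0.75]

print(f"Cohen's d = {cohens_d(pair_scores, solo_scores):.2f}")
# Rough convention: d around 0.2 is small, 0.5 medium, 0.8 large.
```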
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 43
Validity of empirical studies
• Internal validity
  – Is the difference that we observed between the groups that received the treatments actually caused by the treatments, or may there be other causes for the difference?
• External validity
  – Can we generalize the results we found; that is, is it likely that we would obtain the same result in other settings?
• Construct validity
  – If a complex concept is measured, does the measure represent the concept in a satisfactory way? For example, is it OK to measure the quality of a software system only in terms of the number of bugs found?
• Statistical conclusion validity
  – Are the correct statistical methods used?
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 44
Achieving internal validity: Randomized vs. quasi-experiment
• To achieve internal validity, we would ideally run randomized experiments
• A randomized (or true) experiment is characterized by random assignment of subjects to experimental groups to infer the effect of the treatment
• A quasi-experiment does not use random assignment of treatment to groups
  – Instead, the comparisons depend on nonequivalent groups that differ from each other in other ways than the presence of the treatment. For example, consider two methods to be compared. If the subjects only know one of the methods, a quasi-experiment is appropriate if learning the other method is infeasible.
  – Challenge: How to separate the effect of the treatment from the effect of the differences among the groups?
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 45
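Below is a minimal sketch of the random assignment that characterises a randomized experiment; the subject names and group sizes are invented for illustration.

```python
# Random assignment of subjects to treatment groups (the defining feature of a
# randomized experiment). Subject names are invented for illustration.
import random

random.seed(42)

subjects = [f"subject_{i}" for i in range(1, 21)]
random.shuffle(subjects)

half = len(subjects) // 2
treatment_group = subjects[:half]     # e.g., uses the new method
control_group = subjects[half:]       # e.g., uses the existing method

print("treatment:", treatment_group)
print("control:  ", control_group)
```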
External validity is about generalization W.M. Trochim, The Research Methods Knowledge Base, http://www.socialresearchmethods.net/kb/ 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 46
Sampling
• Representativeness
  – https://www.aftenposten.no/viten/i/OnK87b/Psykologiforskning-gjelder-bare-for-noen-fa-mennesker--Nina-Kristiansen
  – “Ninety percent of the participants in psychology studies come from Europe and the USA, but they make up only 18 percent of the world’s population, … moreover, psychology researchers recruit students as participants in their studies. These white, urban youths are not even representative of the population in their own country, according to Gurven.”
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 47
Construct validity
[Diagram: at the conceptual level, an “independent” concept is linked by a cause-effect relation to a “dependent” concept, with a moderator concept as context; through operationalization, these correspond at the operational level to the independent variable(s), dependent variable(s) and moderator variable – the “independent” and dependent constructs.]
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 48
Structure • Quantitative vs. qualitative data • The research life cycle • Controlled experiments – Experiment example – AB-Experiments • Surveys • Quality of experiments and surveys – Hypothesis testing and effect size – Validity • Summary 7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 49
Empirical research methods – literature
The Future of Empirical Methods in Software Engineering Research. Dag I. K. Sjøberg, Tore Dybå and Magne Jørgensen. Future of Software Engineering (FOSE’07), 2007.
Dag I.K. Sjøberg received the MSc degree in computer science from the University of Oslo in 1987 and the PhD degree in computing science from the University of Glasgow in 1993. He has five years of industry experience as a consultant and group leader. He is now research director of the Department of Software Engineering, Simula Research Laboratory, and a professor of software engineering in the Department of Informatics, University of Oslo. Among his research interests are research methods in empirical software engineering, software processes, software process improvement, software effort estimation, and object-oriented analysis and design. He is a member of the International Software Engineering Research Network, the IEEE, and the editorial board of Empirical Software Engineering.
Tore Dybå received the MSc degree in electrical engineering and computer science from the Norwegian Institute of Technology in 1986 and the PhD degree in computer and information science from the Norwegian University of Science and Technology in 2001. He is the chief scientist at SINTEF ICT and a visiting scientist at the Simula Research Laboratory. Dr. Dybå worked as a consultant for eight years in Norway and Saudi Arabia before he joined SINTEF in 1994. His research interests include empirical and evidence-based software engineering, software process improvement, and organizational learning. He is on the editorial board of Empirical Software Engineering and he is a member of the IEEE and the IEEE Computer Society.
Magne Jørgensen received the Diplom Ingenieur degree in Wirtschaftswissenschaften from the University of Karlsruhe, Germany, in 1988 and the Dr. Scient. degree in informatics from the University of Oslo, Norway, in 1994. He has about 10 years of industry experience as a software developer, project leader and manager. He is now a professor in software engineering at the University of Oslo and a member of the software engineering research group of Simula Research Laboratory in Oslo, Norway. His research focus is on software cost estimation.
7.1.2021, Dag Sjøberg and Gunnar Bergersen Slide 50