R Workshop for Postgraduate Students - School of Life Sciences Prof. Ursula Scharler and Dr. Anna Bastian 1 November 2018 and 13 November 2018 ...
R – Workshop Nov 2018 School of Life Sciences, UKZN Prof. Ursula Scharler & Dr. Anna Bastian

R Workshop for Postgraduate Students
School of Life Sciences
Prof. Ursula Scharler and Dr. Anna Bastian
1 November 2018 and 13 November 2018

This workshop is designed to introduce School of Life Sciences postgraduate students to R, and to provide you with basic skills in data management, analysis and visualisation.

___________________________________________________________________________

2019: If you are interested in continuing to work with R and extending your skills beyond the basics, we will hold bi-weekly meetings in 2019. These will take the form of discussions in which we find out together about functions, packages, visualisations and analyses in R. Announcements for these will follow in January.
WELCOME to R!

Here are the basic themes that we will be working on today:
1. DATA MANAGEMENT
2. R AND R-STUDIO
3. SIMPLE OPERATIONS, MAKING OBJECTS
4. IMPORTING/EXPORTING FILES, WORKING DIRECTORY
5. WHAT IS AN R PACKAGE?
6. VISUALISATIONS - EXAMPLES OF SIMPLE GRAPHS
7. EXAMPLES OF UNIVARIATE STATISTICAL ANALYSIS
8. WHERE TO LOOK FOR HELP
9. ADDITIONAL PLOT FUNCTIONS
Contents
1 DATA MANAGEMENT:
  1.1 BEST PRACTICE: SETTING UP AND MAINTAINING A DATABASE
    1.1.1 Criteria of good data management:
  1.2 DATA FORMATS (SAVE AS…)
2 R AND R-STUDIO:
  2.1 DOWNLOAD AND INSTALL R
  2.2 DOWNLOAD AND INSTALL RSTUDIO
  2.3 RSTUDIO LAYOUT:
3 SIMPLE OPERATIONS IN R, AND MAKING OBJECTS:
  3.1 FUNCTIONS
  3.2 VECTORS, MATRICES, ARRAYS, DATA FRAMES, AND LISTS:
    3.2.1 Data types:
    3.2.2 Manipulating vectors and matrices:
    3.2.3 Summary statistics for vectors and matrices:
4 IMPORTING AND EXPORTING FILES, SETTING WORKING DIRECTORY, SCRIPT:
  4.1 SCRIPT
5 WHAT ARE ‘R PACKAGES’?
  5.1 SOME USEFUL PACKAGES
6 VISUALISATION - EXAMPLES OF SIMPLE GRAPHS
  6.1 SCATTERPLOT
  6.2 BOXPLOT
7 EXAMPLES OF UNIVARIATE STATISTICAL ANALYSIS
  7.1 ANOVA
    7.1.1 Assumptions to check before running an ANOVA
      Testing whether a distribution is normal (Shapiro-Wilk’s test)
      Testing for homogeneity of variance (Levene’s test)
  7.2 REGRESSION ANALYSIS
  7.3 Assumptions to check before running a Regression Analysis:
8 WHERE TO LOOK FOR HELP
9 GRAPHICS VISUALISATION WITH GGPLOT2
  9.1 AN EXAMPLE: SCATTERPLOT
1 DATA MANAGEMENT:

Data are the basic ingredients of your research. They should be stored in such a way that anyone is clear on the following points:
1. Which type of data are they?
2. What type of study do they relate to?
3. When, where and why were the measurements taken/data produced?
4. By whom were the measurements taken/data produced?
5. What units do the data have?
6. Do the data show original measurements or transformed data?
7. Where is the backup stored?
8. Who has knowledge of/access to the backup?
9. … and others depending on the study.

1.1 BEST PRACTICE: SETTING UP AND MAINTAINING A DATABASE

We are using MS Excel to capture the collected data as it is user friendly, widely used, and can save data in various formats for downstream applications. However, any other spreadsheet software, including open-source alternatives, will essentially perform the same tasks.

1.1.1 Criteria of good data management:
1. The dataset contains metadata. This is a basic description of the data, including basic information on the study or programme the study is a part of.
2. Each datapoint is identifiable in terms of when, where and by whom it was taken.
3. Each datapoint has a unit attached to it. Make sure to use SI units and/or ISO standards, and to be consistent.
4. Variables (horizontal): name the columns properly, for example “Temperature (°C)” or “Diameter (m)”. Cases (vertical): unique identifiers are crucial as they link the different variable values to each case, for example “Population 1” or “Treatment A”.
5. No cell contains more than one piece of information, e.g. date and sampling site, or measurement and unit, should NOT be in the same cell. Different bits of information should be captured in separate columns, e.g. sampling information such as “caught on 01-11-2018 at 10:45am” should appear in at least two columns, “Sampling date (YYYY-MM-DD)” and “Sampling time (hh:mm:ss)”. This is important because a datapoint cannot be identified for analysis when it is mixed with other information.
6. Always keep a master copy of your data, i.e. a version of your original dataset that you do not change.
7. Databases are usually updated continuously. To avoid losing crucial information, make backups before any major changes. In addition, keep an accompanying log file to track the changes that have been made.
8. Keep track of any changes you make to the original dataset. For instance, you might want to change datapoints after you have applied some form of quality control. Or you might want to exclude datapoints when you are not sure of their validity (e.g. faulty equipment, or measurements taken by someone who was inexperienced or careless). Also list the reasons why you do not trust a datapoint, or why you decided to remove it altogether.
9. Depending on the amount of data you produce, consider using dedicated data management software.

1.2 DATA FORMATS (SAVE AS…)

Formats such as .xlsx (Excel Workbook) contain hidden formatting information written into the file, which can interfere with other software programmes. For instance, ‘cells’ in Excel and in RStudio do not contain the same hidden formatting. It is recommended to save the database in the default .xlsx format and to export the Excel workbook as a tab-delimited plain text file (.txt) or a comma-delimited plain text file (.csv). Plain text files contain very little hidden formatting, and they can also be imported back into Excel (and SPSS). When you intend to move between software, first check which formats the receiving software can read, and which formats the source software can save.
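Criterion 5 in section 1.1.1 can be illustrated with a minimal sketch in R, the language used in the rest of this workshop. The column names and values are invented for illustration only. When a measurement and its unit share a cell, R reads the column as text and cannot calculate with it; once the unit is moved into the column name, the values are numeric.

```r
# Hypothetical example data: value and unit mixed in one cell
mixed <- data.frame(diameter = c("1.2 m", "0.8 m", "1.5 m"),
                    stringsAsFactors = FALSE)
class(mixed$diameter)       # "character": no arithmetic possible

# Same data with the unit in the column name and one value per cell
clean <- data.frame(diameter_m = c(1.2, 0.8, 1.5))
class(clean$diameter_m)     # "numeric"
mean(clean$diameter_m)      # the mean can now be calculated
```

The same logic applies to dates and times: one piece of information per column keeps every value usable for analysis.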
2 R AND R-STUDIO:

2.1 DOWNLOAD AND INSTALL R

R is entirely free. You can download it from:
If you work in Windows, go to 'Download R for Windows' (choose the most recent version) and follow the steps you are prompted for. Choose the default answers for all questions. If you work on a different operating system, choose the version for that operating system. If you need to install a later version of R over an older version of R, follow the instructions on .
Make sure you are working with the latest version (3.5.1 at the time of this workshop), as some packages will not install or run properly if you are using an older version of R.

2.2 DOWNLOAD AND INSTALL RSTUDIO

You can work with R from R Commander or from RStudio, depending on the application (i.e. what you use R for). Whereas R Commander is a GUI (Graphical User Interface) to R with drop-down menus for e.g. statistical analyses, RStudio is an IDE (Integrated Development Environment) that allows you to develop programmes in R, and of course to run them. You can also run all of your statistical analyses from RStudio.
To install RStudio, go to:
Choose the most recent version. If you work in Windows, click Download RStudio Desktop, and choose the free version of RStudio Desktop.

2.3 RSTUDIO LAYOUT:

RStudio consists of four windows, showing your work and the output of your work. You can change the size of each window by dragging the dividing line.
On the bottom left is the console (or command) window. Here you can type and execute commands.

On the top left is the script (or editor) window. Here you can type, edit and save your commands (or scripts). This is a very convenient window, since it allows you to type and edit your commands before you execute them. If the window does not show when you start RStudio, you can open it by clicking on File > New File > R Script. A command typed into the script window does not run automatically; you need to run it with Ctrl+Enter, or by clicking Run at the top of the script window. One line, several lines, or an entire script can be run at once.

On the top right is the Environment/History window. In this workspace you can see all the code R has in its memory (History tab), and the data that are in the memory (Environment tab). You can click on any of them to view and edit. In this window, you can also import datasets into R (see steps below).

On the bottom right is the Files/Plots/Packages/Help window. Here you can see and open the files that you recently had open, view plots that you made, load and install packages, and access Help.

3 SIMPLE OPERATIONS IN R, AND MAKING OBJECTS:

Add two numbers:
> 3 + 3
[1] 6

Make an object of the sum:
> x <- 3 + 3
> x
[1] 6

This is the same as:
> x = 3 + 3
> x
[1] 6

Note that “=” and “<-” both assign a value to an object.

Divide two numbers:
> 25/5
[1] 5

Making an object:
> y = 25/5
> y
[1] 5

Subtract two objects:
> x - y
[1] 1

Subtracting y from x and making a third object:
> z = x - y
> z
[1] 1

Having made the objects 'x' and 'y' saves us from typing the operation in a more cumbersome way, i.e.:
> (3+3)-(25/5)
[1] 1

… and our objects are re-usable for other operations, e.g.:
> x/y + z
[1] 2.2

One often creates several objects during a working session, and it is easy to lose track of them. To find out which objects you have created, type:
> ls() # “ls” stands for list

If you want to remove an object from your workspace, type:
> rm() # “rm” stands for remove

I.e. if you want to remove the object x that we created previously, type:
> rm(x)

If you want to remove all objects in your workspace (the entire list), type:
> rm(list = ls())

What names can objects have?
Letters in upper and lower case (e.g. x, y, X, Y); object names are case sensitive.
Symbols (e.g. ‘.’)
Numbers (following a letter)
NB: object names cannot contain spaces, i.e. ‘my_data’ is a valid name, whereas ‘my data’ is not.
Keep the object name short (otherwise the name is often difficult to distinguish from other, similar names, and a long name is more cumbersome to type than a short one).
Avoid using function names (e.g. data, factor, sqrt) as object names. It gets confusing to call a function of a certain name on an object of the same name.
There are reserved words and signs in programming languages. To see the reserved words/signs, type
> ?reserved
and R returns a list of all reserved words.

Typical errors:
Incomplete operation: You may notice a plus sign at the beginning of a new line after you hit Enter. This means the command is not complete yet and R is waiting for further input.
Note that R is tolerant when it comes to spacing: “3 + 3” is the same as “3+3”. However, this is NOT the case for the names of functions. Names of functions are fixed, and R will not perform a task if the name is misspelled, which includes spaces added inside the name:
> sqrt(9)
[1] 3
> sqrt (9)
[1] 3
> sq rt(9)
Error: unexpected symbol in "sq rt"

3.1 FUNCTIONS

We can perform more complex calculations following the general order of operators in algebra. To do so we need to know the code for the operators in R, as otherwise we will get an error message:
> 10+(3X5^2+4-2)
Error: unexpected symbol in "10+(3X5"

To go back and correct the “X” to “*”, which is the correct operator for multiplication, press the up-arrow key on your keyboard, then use the left-arrow key to move to the “X”. The up-arrow key lets you scroll back through all the commands you have used before.
> 10+(3*5^2+4-2)
[1] 87
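The order in which R evaluates the operators in the expression above can be made explicit with a short sketch. The object names are chosen only for illustration.

```r
# R follows the usual order of operations:
# exponent first, then multiplication, then addition and subtraction
power   <- 5^2               # 25
product <- 3 * power         # 75
inner   <- product + 4 - 2   # 77
total   <- 10 + inner        # 87

# The step-by-step result matches the one-line expression
total == 10 + (3 * 5^2 + 4 - 2)
```

Breaking a long calculation into named objects like this also makes it easier to find the step where an error occurs.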
Functions help us to avoid repetitive tasks. Functions are sequences of logical statements which perform a named calculation or computation. The function which takes the square root of a number is “sqrt()”. There are many built-in functions, such as “round()” for rounding a number. The number which we want to round is placed inside the brackets. Everything inside the brackets is called an argument. For example, we can specify the output further by stating that we want two decimal places:
> round(4.34567789, 2)
[1] 4.35

This is the same as:
> round(x=4.34567789, digits=2)
[1] 4.35

Knowing the different options for specifying arguments is very helpful when it comes to more complex functions.
Learning the names of functions and what they do is basically the same as learning the vocabulary of a new language, or expanding the vocabulary of your spoken or written language.
Functions can be queried by using a ‘?’. The help page shows how to use the function and what its arguments are. For example:
> ?round

3.2 VECTORS, MATRICES, ARRAYS, DATA FRAMES, AND LISTS:

Some datasets consist of matrices and vectors. Here we will learn a few operations dealing with matrices and vectors, and learn about a few other data formats.

3.2.1 Data types:

Vectors and matrices are data structures. You can make your own vectors (enter them by hand) by using ‘:’ or ‘c()’. Here is a vector consisting of whole numbers:
> 1:7
[1] 1 2 3 4 5 6 7

Or:
> c(1, 2, 3, 4, 5, 6, 7) # “c” stands for combine
[1] 1 2 3 4 5 6 7
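One detail is worth checking for yourself: the two ways of creating the vector look the same when printed, but R stores them differently. A minimal sketch, using the hypothetical object names a and b:

```r
a <- 1:7                      # the colon operator creates integers
b <- c(1, 2, 3, 4, 5, 6, 7)   # numbers typed by hand are stored as doubles

class(a)          # "integer"
class(b)          # "numeric"
all(a == b)       # TRUE: the values are equal
identical(a, b)   # FALSE: the storage types differ

# Adding L after a number tells R to store it as an integer
identical(a, c(1L, 2L, 3L, 4L, 5L, 6L, 7L))   # TRUE
```

For most analyses the difference does not matter, but it explains why class() can report "integer" for one vector and "numeric" for another that prints identically.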
You can also make it an object:
> x = 1:7
> x
[1] 1 2 3 4 5 6 7
(We have now simply overwritten our previous object 'x', which was 3 + 3.)

You can connect vectors into columns. But first, create a second vector called y:
> y = 1:7
> cbind(x,y) # “cbind” takes a sequence of arguments and combines it by columns.
     x y
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
[7,] 7 7

You can connect the same vectors into rows:
> rbind(x,y) # “rbind” takes a sequence of arguments and combines it by rows.
  [,1] [,2] [,3] [,4] [,5] [,6] [,7]
x    1    2    3    4    5    6    7
y    1    2    3    4    5    6    7

This is how one can create a matrix, here called M:
> M = matrix(data = 1:16, nrow = 4, ncol = 4)
> M
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Another data type is an array, which can have more dimensions than a matrix (see the figure below).
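An array is easiest to understand by building one, so here is a minimal sketch. The object name A and the dimensions are chosen only for illustration.

```r
# An array with 2 rows, 3 columns and 4 "layers"
A <- array(data = 1:24, dim = c(2, 3, 4))

dim(A)       # the three dimensions: 2 3 4
A[, , 1]     # the first layer is an ordinary 2 x 3 matrix
A[2, 3, 4]   # a single value: row 2, column 3, layer 4
```

Indexing works as for a matrix, with one additional position between the square brackets for each additional dimension.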
In a data frame, columns can have different data formats, or modes, i.e. they can be numeric, character, factor, and others.
How can you check which class and mode of data you have? Here is a check of our vectors x and y, and the matrix M.
> class(x)
[1] "integer"
> mode(x)
[1] "numeric"
> class(y)
[1] "integer"
> mode(y)
[1] "numeric"
> class(M)
[1] "matrix"
> mode(M)
[1] "numeric"
(In R 4.0.0 and later, class(M) returns "matrix" "array".)

You can change the data mode if you wish, e.g. change the numeric mode of the vector x to character mode.
> x = as.character(x)
> x
[1] "1" "2" "3" "4" "5" "6" "7"
> class(x)
[1] "character"
> mode(x)
[1] "character"

Note that the inverted commas mark the data as character data. Be aware that if your data are character data, you will not be able to apply functions to them which require numeric data.
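To illustrate the warning above, the following sketch shows what happens when a calculation is attempted on character data, and how to convert the data back. The object names x_chr and x_num are hypothetical.

```r
x_chr <- as.character(1:7)

# Arithmetic on character data fails; try() lets the script carry on
result <- try(x_chr + 1, silent = TRUE)
inherits(result, "try-error")   # TRUE: the addition produced an error

# Converting back to numeric makes calculations possible again
x_num <- as.numeric(x_chr)
mean(x_num)
```

If an imported column unexpectedly behaves like this, checking it with class() is a good first step.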
Here is how to create a data frame with a numeric and a character vector:
> x = 1:5
> y = c("a", "b", "c", "d", "e")
> data.frame(x, y)
  x y
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

Another data type is a list. This is a collection of objects (components). A variety of objects that are possibly unrelated can be collected in a list, which is given a new name. You can for instance create a list of matrix M and vector x:
> list(M, x)
[[1]]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

[[2]]
[1] 1 2 3 4 5

3.2.2 Manipulating vectors and matrices:

Vectors and matrices can be manipulated in R. Here are a few operations.
Displaying a single value from a vector:
> x = 1:7
> x[4]
[1] 4
This displays the 4th element of the vector, which in this case is 4.

If you want to take out several values from the vector:
> x[c(2, 4, 6)]
[1] 2 4 6

If the values you want to pull out are in a sequence, you can use:
> x[2:5]
[1] 2 3 4 5
To pull out all values of the vector x that are smaller than 3:
> x[x < 3]
[1] 1 2

The operations conducted above on a vector can also be conducted on a matrix, e.g. pulling out one number from our matrix M by giving the row and column where the value is to be found:
> M
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
> M[1,4]
[1] 13

One can also call up a whole row or column from a matrix, demonstrated again on the matrix M:
> M[1, ]
[1] 1 5 9 13
> M[, 3]
[1] 9 10 11 12

… and find all values of a row that are higher than 4:
> M[1, ]
[1] 1 5 9 13

Are there values >4?
> M[1,] >4
[1] FALSE TRUE TRUE TRUE
Which are the values that are >4?
> M[1, M[1, ] > 4]
[1] 5 9 13

Sorting the matrix by the first column. order() is a function that allows you to sort variables:
> order(M[,1])
[1] 1 2 3 4
The above result shows the current order of the values in column 1. We can see that the smallest value of column 1 is in position 1, the next largest value is in position 2, etc.

> order(M[,1], decreasing = TRUE)
[1] 4 3 2 1
The above result shows the order of the values in column 1, now in the opposite direction. We can see that the largest value of column 1 is in position 4, the next largest value is in position 3, etc., and the smallest value is in position 1. Check:
> M
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

This command orders and displays the sorted matrix:
> M[order(M[,1], decreasing = TRUE), ]
     [,1] [,2] [,3] [,4]
[1,]    4    8   12   16
[2,]    3    7   11   15
[3,]    2    6   10   14
[4,]    1    5    9   13

If your matrix has column headings, you can specify the heading of the column to be sorted instead of the number of the column (in this case it was column 1).
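Sorting by column heading, as described in the last sentence, could look like the sketch below. The column headings are hypothetical and only added here for illustration.

```r
M <- matrix(data = 1:16, nrow = 4, ncol = 4)

# Hypothetical column headings, for illustration only
colnames(M) <- c("site", "depth", "temp", "length")

# Sort all rows by the column called "site", largest value first
M_sorted <- M[order(M[, "site"], decreasing = TRUE), ]
M_sorted
```

Using the heading instead of the column number keeps the code working even if the columns are later rearranged.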
3.2.3 Summary statistics for vectors and matrices:

Summary statistics can be calculated for vectors and matrices. Here are, for instance, various operations on vectors (x is still the vector 1:7 created above):
> y = 2:8
> length(x)
[1] 7
> mean(x)
[1] 4
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
> sd(x)
[1] 2.160247
> var(x)
[1] 4.666667

A quick check on their correlation:
> cor(x,y)
[1] 1
> cor(x,sqrt(y))
[1] 0.9953148

How do we calculate summary statistics for matrices? Here is an example of the mean for all rows in our matrix M, where MARGIN = 1 indicates we want to apply the function to rows, and MARGIN = 2 indicates we want to apply it to columns.
> M
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
> apply(M, MARGIN = 1, FUN = mean)
[1] 7 8 9 10

And here for columns:
> apply(M, MARGIN = 2, FUN = mean)
[1] 2.5 6.5 10.5 14.5
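As a small extension of the apply() examples above, the sketch below shows the base R shortcuts rowMeans() and colMeans(), and apply() with a different function. The object name col_sd is chosen only for illustration.

```r
M <- matrix(data = 1:16, nrow = 4, ncol = 4)

# Shortcuts for the two apply() calls
rowMeans(M)   # same result as apply(M, MARGIN = 1, FUN = mean)
colMeans(M)   # same result as apply(M, MARGIN = 2, FUN = mean)

# apply() also works with other functions,
# e.g. the standard deviation of each column
col_sd <- apply(M, MARGIN = 2, FUN = sd)
col_sd
```

apply() is the more general tool; the shortcuts are simply quicker to type for the common case of means.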
4 IMPORTING AND EXPORTING FILES, SETTING WORKING DIRECTORY, SCRIPT:

Setting a working directory means specifying the place where your files will be saved and stored. After you set a working directory, you do not have to type in, or look for, the directory when saving or importing a file.

Datasets can be imported into R either as a text file or as an Excel file. People often prefer to work with their data in text format, because text files do not contain formatting (usually invisible on the screen) that may interfere with the data format. To have your data in text format, make sure your dataset is saved in .csv format, e.g. if you work in Excel, save your file as .csv, which is a text file that can be read in easily.

There are two ways you can set your working directory:
1. > setwd("D:/Documents/YourFavouriteDirectory")
2. Go to Session in the top menu, go to Set Working Directory, and Choose Directory.

To see where the working directory is, type:
> getwd()

It is good practice to make a working directory before you start a new project, and to make a data folder within that working directory. You can save the script in the working directory, and import from/export to the data folder.

To import a file, go to the Environment/History panel, click Import Dataset, and choose:

From Excel. To import your data from Excel into R, install and load the package readxl:
> install.packages("readxl")
> library(readxl)
Choose the file you want to import. In the panel, specify how the first row and column of the xlsx file will be displayed in your R dataset. Alternatively, you can type
> read_excel("YourFile.xlsx")

From Text. Choose the file you want to import. This interface looks slightly different to the one for importing an Excel file, but contains the same information. In the panel, specify how the first row and column of the csv file will be displayed in your R dataset. Alternatively, you can type
> read.csv("YourFile.csv")
Since you set the working directory, you do not have to specify the entire path, including all the directories, where the file is found. The file name is sufficient.

Saving your imported data as an object allows for better functionality going forward and saves you from having to read the file in every time you want to use it, e.g.
> my_data <- read.csv("YourFile.csv")

To export an object, for example our matrix M, to a text file:
> write.csv(M,'Mnew',row.names=FALSE)

Again, since you set your working directory, you only need to specify the name of the file, here 'Mnew'. 'Mnew' is a text file in comma-separated format, so you can open it in Excel. There you need to specify which separator the file uses between the values; for .csv files, that is a comma. If you add the .csv extension to the file name, you can open the file in Excel directly:
> write.csv(M,'Mnew.csv',row.names=FALSE)

Needless to say, save your work FREQUENTLY. In RStudio, you can save not only your script and the figures you produce, but also the entire working environment.
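The import and export steps of this section can be tried out together in one small round trip. This is only a sketch under stated assumptions: a temporary folder stands in for your working directory, and the names project_dir, data_dir and out_file are hypothetical.

```r
# A temporary folder stands in for your working directory (assumption:
# in real use you would call setwd() on your own project folder instead)
project_dir <- tempdir()
data_dir <- file.path(project_dir, "data")
dir.create(data_dir, showWarnings = FALSE)

M <- matrix(data = 1:16, nrow = 4, ncol = 4)
out_file <- file.path(data_dir, "Mnew.csv")

write.csv(M, out_file, row.names = FALSE)   # export
my_data <- read.csv(out_file)               # import and keep as an object

class(my_data)   # read.csv() returns a data frame, not a matrix
dim(my_data)     # 4 rows and 4 columns, as in M
```

file.path() builds the path to the data folder with the correct separators, so the same script runs on Windows, macOS and Linux.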
4.1 SCRIPT

You might find that your first analyses in R proceed by trial and error, even when following instructions. It is therefore good practice to set up your own script before starting, tailored to your data types, file names, object names etc. This can also serve as a template for future analyses, or when you need to make minor changes to the analyses. Although it might seem time consuming, it is important to keep track of your work in R.

In order to remember what your code means, it is customary to make notes. These should be descriptive and remind you of what you scripted. However, they should also be concise.
If you add a #, everything after it on that line will be disregarded by R, i.e. it will not be part of the code. For instance:
# This is the code for writing my matrix to a .csv format, excluding row names.
> write.csv(M,'Mnew.csv',row.names=FALSE)

Or:
> 3 + 3 # addition
[1] 6
> x <- 3 + 3
> x
[1] 6
> 25/5 # division
[1] 5

You can also write notes regarding common errors and how to avoid or fix them.

To write a script, you can use a text editor, such as Notepad or Notepad++, or the built-in editor in RStudio (File > New File > R Script).
Notepad is included in MS Windows.
Notepad++ can be downloaded for free from: https://notepad-plus-plus.org/

At a more advanced stage, you can write an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. You can also embed code in a Markdown document. For more details on using R Markdown see .
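Returning to plain R scripts, a commented template might look like the sketch below. The section headers, object names and example values are only suggestions made for illustration, not a fixed convention.

```r
# ---------------------------------------------------------------
# Project : (name of your study)
# Purpose : summarise example measurements
# Author  : (your name), Date: (YYYY-MM-DD)
# ---------------------------------------------------------------

# 1. Data (typed in here; normally imported with read.csv())
weights_g <- c(12.1, 13.4, 11.8, 12.9)   # hypothetical weights in grams

# 2. Analysis
mean_weight <- mean(weights_g)   # average weight
sd_weight   <- sd(weights_g)     # spread around the average

# 3. Output, rounded to two decimal places
round(c(mean = mean_weight, sd = sd_weight), digits = 2)
```

Numbered sections with short comments make it easier to re-run only one part of the script, or to adapt it to a new dataset later.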
5 WHAT ARE ‘R PACKAGES’?

In R, packages are collections of functions which enable us to complete certain tasks. These can range from very general tasks, such as data organisation and visualisation or plotting, to very specific niche tasks, such as certain types of data analyses. Anyone can make a new package and offer it free of charge to R users. Depending on your objective, you may require certain packages to conduct a certain analysis. Some packages come with instructional documentation, called vignettes, which provide the user with a breakdown of the package capabilities, complete with examples of functions and their outputs.

To install a package you can make use of the GUI in RStudio. In the bottom right panel select the ‘Packages’ tab. Click on ‘Install’, and a window will pop up. The repository drop-down defaults to CRAN, which will almost always be the desired repository. Type in the name of the package you require. Checking ‘Install Dependencies’ means that any packages required for your desired package to function will be installed automatically. Click Install.

Alternatively, you can type
> install.packages("package_name")
This does the same as the steps described above.

Once you have a package installed, you need to call it from your package library in order to load it into your active R session. This is done by:
> library(package_name)
Note that quotation marks are not needed here.

5.1 SOME USEFUL PACKAGES

dplyr – subsetting, summarising and joining of datasets
tidyr – changing the layout of datasets, keeping your data tidy
stringr – regular expressions and character string manipulation
ggplot2 – sleek customisable plotting

These are just a few examples of the many packages available, so search for the ones appropriate for what you are trying to do and refer to their respective vignettes for guidelines.
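A common pattern in scripts is to install a package only when it is missing, and then load it. The sketch below assumes the package name is stored in a hypothetical object called pkg; "stats" ships with R and is used only as a stand-in, so nothing is actually downloaded when you run it.

```r
# "stats" ships with R and is used here only as a stand-in;
# replace it with the name of the package you actually need
pkg <- "stats"

# Install only if the package is not yet available on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Because the name is stored in an object (as text),
# library() needs character.only = TRUE
library(pkg, character.only = TRUE)
```

This avoids re-installing a package every time the script is run, which saves time and does not require an internet connection once the package is installed.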
6 VISUALISATION - EXAMPLES OF SIMPLE GRAPHS

One of the major advantages of working in R is its versatility for data visualisations. Here we provide two simple examples.
Take the data frame chickwts, which is available in the automatically loaded datasets package. The data come from an experiment which was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. Newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types.
> library(help = "datasets")
> chickwts

Have a quick look at the summary data:
> summary(chickwts)

The basic version of R comes with a plot() function, which can create a wide variety of graphs (type ?plot in the command line for details), and the lattice package is also helpful. However, the ggplot2 package is most commonly used for graphics.
> install.packages("ggplot2")
> library(ggplot2)

Below you’ll find two different ways of generating plots.
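Before plotting, it can help to inspect the structure of the dataset. The following sketch uses only base R functions; the object name group_means is chosen for illustration.

```r
str(chickwts)          # structure: number of observations and variable types
head(chickwts)         # the first six rows
table(chickwts$feed)   # number of chicks per feed type

# Mean weight (g) per feed type
group_means <- tapply(chickwts$weight, chickwts$feed, mean)
round(group_means, 1)
```

str() shows that feed is stored as a factor, which is what later makes plot() choose a boxplot when feed is on the x-axis.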
6.1 SCATTERPLOT

Although the dataset is not ideally visualised through a scatterplot (boxplots are the better choice), you can plot it by using the function plot():
> plot(chickwts$weight,chickwts$feed,xlab="Weight",ylab="Feed")
# By using the "$" symbol, R returns all of the values in the column labelled "weight".

Using ggplot will produce a similar scatterplot:
> scatter <- ggplot(chickwts, aes(x = weight, y = feed))
> scatter + geom_point() + labs(x = "Weight (g)", y = "Feed")
Note that the variables are presented differently: weight is on the y-axis here, whereas it was on the x-axis before. The reason is that when the x-variable is a factor, the built-in function plot() automatically produces boxplots (this can be seen in the next section). This would be the correct way to choose the data, with weight on the y-axis:
> plot(chickwts$feed,chickwts$weight,xlab="Feed",ylab="Weight")

ggplot allows the user to choose which graph type is used.
6.2 BOXPLOT

Still using the same dataset and the same function, we switch the x- and y-axis:
> plot(chickwts$feed,chickwts$weight,xlab="Feed",ylab="Weight (g)")
Compare this command to the plot() command we used above to produce a scatterplot.

Using ggplot we produce a similar boxplot:
> box <- ggplot(chickwts, aes(x = feed, y = weight))
> box + geom_boxplot() + labs(x = "Feed", y = "Weight (g)")
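The base R boxplot can also be written with formula notation and saved to a file, for example for a thesis or report. This is a minimal sketch: the temporary file and the object name plot_file are assumptions made for illustration, and in practice you would give a file name in your working directory.

```r
# Assumption: a temporary file is used here; in practice give a file name
# in your working directory, e.g. "chick_boxplot.pdf"
plot_file <- tempfile(fileext = ".pdf")

pdf(plot_file, width = 7, height = 5)   # open a PDF graphics device
boxplot(weight ~ feed, data = chickwts,
        xlab = "Feed", ylab = "Weight (g)")
dev.off()                               # close the device to write the file

file.exists(plot_file)
```

Everything plotted between pdf() and dev.off() goes into the file instead of the Plots window, so do not forget to close the device.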
7 EXAMPLES OF UNIVARIATE STATISTICAL ANALYSIS

7.1 ANOVA

R allows you to easily construct an ANOVA table for the chick weight test using the built-in aov function:

> chick.anova <- aov(weight ~ feed, data = chickwts)
> summary(chick.anova)

Note that you employ the formula notation weight ~ feed to specify the measurement variable of interest, i.e. weight, as modelled by the categorical (nominal) variable of interest, feed type.
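If you need the numbers from the ANOVA table for further use (for example in a report), they can be extracted from the summary object. The following is a minimal sketch; anova_table, f_value, p_value and df are illustrative names.

```r
chick.anova <- aov(weight ~ feed, data = chickwts)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(chick.anova)[[1]]

f_value <- anova_table[["F value"]][1]   # F statistic for feed
p_value <- anova_table[["Pr(>F)"]][1]    # corresponding p-value
df      <- anova_table[["Df"]]           # degrees of freedom: feed, residuals

cat("F =", round(f_value, 2), "on", df[1], "and", df[2], "df; p =",
    signif(p_value, 3), "\n")
```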
By default, R annotates model-based summary output like this with significance stars. These mark intervals of significance: a dot marks a p-value below 0.1, and one, two or three stars mark p-values below 0.05, 0.01 and 0.001 respectively. In our example, the very small p-value provides strong evidence against the null hypothesis that the mean chick weights are the same for the different diets. 5.94e-10 is scientific notation for 0.000000000594.

7.1.1 Assumptions to check before running an ANOVA

Before we can test our hypothesis using an ANOVA, we have to confirm that the dataset is suitable for this kind of analysis. The assumptions under which the F-statistic is reliable are the same as for all parametric tests based on the normal distribution. That is, the variances in each experimental condition need to be fairly similar (homogeneity of variance; violating this assumption matters mainly if you have unequal group sizes), observations should be independent, and the dependent variable should be measured on at least an interval scale. In terms of normality, what matters is that the distributions within groups are normal.

Testing whether a distribution is normal (Shapiro-Wilk test)

If the test is non-significant (p > .05), it tells us that the distribution of the sample is not significantly different from a normal distribution.

> shapiro.test(chickwts$weight)

This will give you the overall test of normality for “weight”.
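The decision rule for the Shapiro-Wilk test can be written out in code. This is a sketch only: the threshold alpha = 0.05 is the conventional choice assumed here, and the Q-Q plot is an additional visual check that is not part of the text above.

```r
sw <- shapiro.test(chickwts$weight)

# The result is a list; the p-value can be used in a decision rule
alpha <- 0.05  # conventional threshold, an assumption of this sketch
if (sw$p.value > alpha) {
  message("No significant departure from normality (p = ",
          signif(sw$p.value, 3), ")")
} else {
  message("Distribution differs significantly from normal (p = ",
          signif(sw$p.value, 3), ")")
}

# A Q-Q plot is a useful visual companion to the test
qqnorm(chickwts$weight)
qqline(chickwts$weight)
```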
To check whether the residuals of the data for each group (“feed”) are normally distributed, we first need to create a new object, specify that the test is done on residuals, and then extract the p-values for each group:

> SW_norm <- lm(weight ~ feed, data = chickwts)  # assumed model call: the right-hand side is missing in this copy
> shapiro.test(residuals(SW_norm))
> do.call("rbind", with(chickwts, tapply(weight, feed, function(x) unlist(shapiro.test(x)[c("statistic", "p.value")]))))

Testing for homogeneity of variance (Levene’s test)

If Levene’s test is significant (Pr(>F) in the R output is less than .05), then the variances are significantly different in different groups.
To use Levene’s test, we use the leveneTest() function from the car package:

> install.packages("car")
> library(car)

We enter two variables into the function: first, the outcome variable of which we want to test the variances; and second, the grouping variable, which must be a factor.

> leveneTest(chickwts$weight, chickwts$feed)

Run the ANOVA again:

> anova(SW_norm)
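For readers curious about what leveneTest() computes: with its default setting (centring on the group medians) it amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The following base-R sketch reproduces that idea without the car package; group_median, abs_dev and levene_by_hand are illustrative names.

```r
# Median weight of each feed group, repeated for every observation
group_median <- ave(chickwts$weight, chickwts$feed, FUN = median)

# Absolute deviation of each chick from its group median
abs_dev <- abs(chickwts$weight - group_median)

# One-way ANOVA on the deviations: a significant F suggests unequal variances
levene_by_hand <- anova(lm(abs_dev ~ chickwts$feed))
print(levene_by_hand)
```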
7.2 REGRESSION ANALYSIS

For this analysis, we will use another one of the loaded datasets, namely ChickWeight. It provides four variables, but we will only use two of them (weight, and Time as a proxy for age).

First, plot your data. To do so, we can use the scatterplot code we have learned already. We specify the name of the dataset (ChickWeight) and of the two variables (Time, weight):

> scatterchicks <- ggplot(ChickWeight, aes(Time, weight))
> scatterchicks + geom_point() + labs(x = "Time", y = "weight")

To plot your data including a regression line, use e.g.:

> scatterchicks + geom_point() + labs(x = "Time", y = "weight") + stat_smooth()

For a regression analysis, you can use the following simple function:

> myregr <- lm(weight ~ Time, data = ChickWeight)
> summary(myregr)

7.3 Assumptions to check before running a Regression Analysis:

The assumptions for a simple linear regression are that the two variables are linearly related, and that the residuals are normally distributed and homoscedastic. You can plot the residuals in different ways. Calling plot() on the fitted model produces four diagnostic plots. To view them all at once, use:

> par(mfrow = c(2, 2))
> plot(myregr)

The four plots show the following (look up further information yourself; the plots are not exclusive to R):
- Residuals vs fitted: checks the assumption of a linear relationship
- Normal Q-Q: are the residuals normally distributed?
- Scale-location: homogeneity of variance
- Residuals vs leverage: are extreme values influencing the analysis?

You can statistically test the assumptions with the functions introduced in the ANOVA section.
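As an illustration of that last point, the Shapiro-Wilk test from the ANOVA section can be applied to the regression residuals. This is a minimal sketch that assumes the regression is fitted with lm() on weight and Time.

```r
myregr <- lm(weight ~ Time, data = ChickWeight)

coef(myregr)                 # intercept and slope (weight change per unit of Time)
summary(myregr)$r.squared    # proportion of variance explained

# Normality of the residuals, using the test from the ANOVA section
shapiro.test(residuals(myregr))
```

A significant result here would indicate that the residuals depart from normality, in which case the diagnostic plots are worth inspecting more closely.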
8 WHERE TO LOOK FOR HELP

There is a lot of help for R users, and it is easy to access. When you have a question on how to do a certain operation in R, or you are looking for an explanation of an error message and information on how to fix it, simply search the web (e.g. Google) with your query. Your question will most likely have been posted already by someone else, which means you are likely to find a response and a solution.

For operations that have to do with R itself, or any of its packages, use the R website and read through the package vignettes. They contain the code and instructions on how to use the packages.

The query function ‘?’ will provide you with information on a function or package (e.g. ?read.csv will give information on the read.csv function). This information will appear in the help window (bottom right) of RStudio. Using ‘??’ (shorthand for help.search()) will search the help pages of all installed packages for a keyword, which is useful when you do not know the exact name of a function.

Some websites:
Quick-R
Stackoverflow.com
www.rdocumentation.org
www.r-bloggers.com
stat.ethz.ch
Github: https://github.com/trending/r
Ggplot2: https://ggplot2.tidyverse.org/
http://manuals.bioinformatics.ucr.edu/home/programming-in-r#TOC-Debugging-Utilities

Books:
Springer Series: Use R! (https://www.springer.com/series/6991)

Tutorials and courses can be found online:
DataCamp: https://www.datacamp.com/courses/free-introduction-to-r
Coursera
Udemy

There are also video tutorials available on YouTube:
UTSSC, How to R ... and many more
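The built-in help tools mentioned in this section can be tried out as follows. This is a short sketch using only functions that ship with base R; the search terms are arbitrary examples.

```r
# Opening help pages only makes sense in an interactive session
if (interactive()) {
  help("read.csv")                      # same as ?read.csv
  help.search("analysis of variance")   # same as ??"analysis of variance"
  vignette(package = "grid")            # list the vignettes of an installed package
}

# apropos() lists objects on the search path whose names match a pattern
matches <- apropos("^read\\.")
print(matches)

# example() runs the examples from a function's help page
example(mean)
```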
9 GRAPHICS VISUALISATION WITH GGPLOT2

The graphs we have produced in the previous part can be improved! ggplot is very versatile and provides many ways to visualise data and to customise graphs.

In ggplot a graph is made up of a series of layers. You can think of a layer as a plastic transparency with something printed on it, such as text, data points, lines, bars and so on. To make a final graph, these layers are placed on top of each other. Each layer contains visual objects such as bars, data points or text.

Visual elements are known as geoms (short for ‘geometric objects’). Most common diagram types (for a full list see http://had.co.nz/ggplot2/):
• geom_point()
• geom_bar()
• geom_histogram()
• geom_density()
• geom_line()
• geom_boxplot()
• geom_text(): creates a layer with text on it.

These geoms also have aesthetic properties that determine what they look like and where they are plotted. These aesthetics (aes() for short) control the appearance of graph elements (for example, their colour, size, style and location). Aesthetics can be defined in general for the whole plot, or individually for a specific layer.

> ggplot((dataset), aes(x = (x-coord.), y = (y-coord.),
+        colour = (variable),
+        fill = (variable),
+        shape = (factor),
+        linetype = (factor),
+        group = (factor)))

If you want to set an aesthetic to a specific value, then you don’t specify it within the aes() function; but if you want an aesthetic to vary with the data, then you need to place the instruction within aes().
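The difference between setting and mapping an aesthetic can be illustrated with the chickwts data used earlier. This is a minimal sketch assuming ggplot2 is installed; p_fixed and p_mapped are illustrative names.

```r
library(ggplot2)

# Setting: colour is a fixed value, so it goes outside aes()
p_fixed <- ggplot(chickwts, aes(x = feed, y = weight)) +
  geom_point(colour = "blue", size = 2)

# Mapping: colour varies with a variable, so it goes inside aes()
p_mapped <- ggplot(chickwts, aes(x = feed, y = weight, colour = feed)) +
  geom_point(size = 2)

print(p_fixed)    # all points blue, no legend
print(p_mapped)   # one colour per feed type, with a legend
```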
9.1 AN EXAMPLE: SCATTERPLOT

The dataset we are using is called “Exam Anxiety.xlsx”. Save it as a tab-delimited text file and read it into R:

examData
element_rect() to change the appearance of the rectangle elements. Rectangle elements: plot background, panel background, legend background, etc.

Removing the background:

scatter + theme(panel.background = element_rect(fill = "white"),
                axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red", fill = "Red") +
  labs(x = "Exam Anxiety", y = "Exam Performance %")

Larger dots:

scatter + theme(panel.background = element_rect(fill = "white"),
                axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", colour = "Red", fill = "Red") +
  labs(x = "Exam Anxiety", y = "Exam Performance %")

Regression line (+CI) changed:

scatter + theme(panel.background = element_rect(fill = "white"),
                axis.line = element_line(size = 0.5, linetype = "solid", colour = "black")) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", colour = "Red", fill = "Red", linetype = "dashed", alpha = 0.1) +
  labs(x = "Exam Anxiety", y = "Exam Performance %")

Save the graph by exporting it as a vector file (always the best format to keep backups of graphs): save it as an image (format “.svg”) and as a .pdf file. Alternatively, you can type:

ggsave("Exam Anxiety Plot2.pdf")

Before quitting RStudio you can save your work by saving the History and/or by saving the Environment. The saved History file (.R format) can be loaded into RStudio, sent to the Source pane and run.
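When saving with ggsave(), it is often safer to name the plot object and the output size explicitly rather than relying on the last plot shown. The following sketch uses the built-in chickwts data and a temporary folder so that it runs anywhere; the file name and dimensions are arbitrary choices for illustration.

```r
library(ggplot2)

# Build and store the plot in an object
p <- ggplot(chickwts, aes(feed, weight)) +
  geom_boxplot() +
  labs(x = "Feed", y = "Weight (g)")

# Write a vector PDF; width and height are in inches by default
out_file <- file.path(tempdir(), "chickwts_boxplot.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)

file.exists(out_file)
```

The output format is chosen from the file extension; saving as ".svg" works the same way but may require an additional package to be installed.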