Chapter 2: Introduction to the research process using R
Figure 2.1
Figure 2.1 Code
# Tell R where you will be working. R will look here for your data:
setwd("~/Documents/Books/Presenting/data/rData/")
# On a machine running Windows, the path would be something like:
# setwd("C:\\Projects\\Books\\Presenting\\data\\rData\\")

# Load the dataset rhinitis out of the current folder/directory into R's
# interactive working memory/environment. If you have not set the working
# directory with the setwd() function, you could type something like this on a
# Mac:
# load("~/Documents/Books/Presenting/data/rData/rhinitis.RData")
# or this on a Windows machine:
# load("C:\\Projects\\Books\\Presenting\\data\\rData\\rhinitis.RData")
load("rhinitis.RData")

# The head() function shows the beginning of an object. Here it shows the first
# 14 records of a dataframe. You can also use head() to show the first few
# values in other kinds of objects; for example a vector of numbers or
# character strings.
head(rhinitis, n = 14) # Display the first 14 records.
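The load-and-inspect steps above can be tried without the book's data files. The sketch below is only an illustration: the dataframe toy, its columns, and the file name toy.RData are hypothetical stand-ins for rhinitis and rhinitis.RData, and the file is written to a temporary folder rather than a working directory set with setwd().

```r
# Hypothetical stand-in for rhinitis.RData: a tiny dataframe saved to a
# temporary folder, so the example runs without the book's data files.
toy <- data.frame(id = 1:20, pulse1 = seq(60, 98, by = 2))
toyFile <- file.path(tempdir(), "toy.RData")
save(toy, file = toyFile)

rm(toy)                  # remove the object from the workspace
loaded <- load(toyFile)  # load() restores the object and returns its name
loaded

head(toy, n = 3)         # first 3 records of the dataframe
head(letters, n = 3)     # head() also works on a vector
```

Capturing the value returned by load() is a convenient way to check which objects a .RData file actually contained.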
Figure 2.3
Figure 2.3 Code
Figure 2.4
Figure 2.4 Code
setwd("~/Documents/Books/Presenting/data/rData/")
load("rhinitis.RData")

# The code below adds a variable called whenrhinFactor to the rhinitis dataset.
# The categorical factor variable has the values 1, 2 and 3, which are labeled
# with phrases describing the 3 seasons.

# The factor() function is applied to the whenrhin variable in the rhinitis
# dataframe. factor() uses a comma-delimited set of details. A programmer calls
# the details that a function understands "arguments". Notice the ( following
# the word factor and the commas ending the lines with the arguments until
# the closing ) .
rhinitis$whenrhinFactor <- factor(rhinitis$whenrhin,
                                  # Actual values in the variable are 1, 2, 3.
                                  # The c() function can glue together a set of
                                  # similar things, for example, a bunch of
                                  # numbers, to make a vector. The levels
                                  # argument is expecting a vector.
                                  levels = c(1, 2, 3),
                                  # Remember that the c() function stores
                                  # similar things as a vector. In this case,
                                  # the vector is storing character strings for
                                  # the labels argument.
                                  labels = c("Dry season",
                                             "Wet season",
                                             "Anytime")
                                  ) # This ) is the end of the factor function.

# The comment() function is used to add a descriptive label to the new variable.
# Sadly, it is not typically automatically shown in output. To later display the
# comment, type comment(rhinitis$whenrhinFactor).
comment(rhinitis$whenrhinFactor) <- "When get rhinitis"

rhinitis$rhinitisFactor <- factor(rhinitis$rhinitis,
                                  levels = c(0, 1),
                                  labels = c("No", "Yes")
                                  )
comment(rhinitis$rhinitisFactor) <- "Rhinitis with a cold in last 12 months"

# The code below makes a simple summary table.

# The with() function takes two arguments, the name of a dataframe and an
# expression (here, a function call) that is evaluated using the variables in
# that dataframe.
# The with() function, used below, allows you to work with the variables within
# a dataframe without having to repeatedly mention the data frame.
# The code below could have been written as:
# table(rhinitis$whenrhinFactor, # rows of the table
#       rhinitis$rhinitisFactor, # column of the table
#       dnn = c(comment(rhinitis$whenrhinFactor), # row label
#               comment(rhinitis$rhinitisFactor)) # column label
#       ) # closing ) for table function
#
# Another option is to attach the dataframe at the "top" of the search list with
# code like this:
# attach(rhinitis)
# table(whenrhinFactor, # rows of the table
#       rhinitisFactor, # column of the table
#       dnn = c(comment(whenrhinFactor), # row label
#               comment(rhinitisFactor)) # column label
#       ) # closing ) for table function
# detach(rhinitis)
# That strategy can work if you are only working with a single dataframe but it
# can lead to problems. Problems will ensue if you do not detach the data when
# you have variables with the same name in multiple dataframes. For example,
# if you are working to predict who has a sexually transmitted infection and
# if you have a variable called sex in a dataframe called coitus and another
# variable called sex in a dataframe called demographics, you can easily
# accidentally use a person's gender when you intended to use an indicator for
# sexual activity. Look here for a good brief introduction:
# http://www.r-bloggers.com/to-attach-or-not-attach-that-is-the-question/
with(rhinitis, # name of data frame with the data
     # The table() function does simple cross tabulation tables.
     table(whenrhinFactor, # rows of the table
           rhinitisFactor, # column of the table
           dnn = c(comment(whenrhinFactor), # row label
                   comment(rhinitisFactor)) # column label
           ) # closing ) for table function
     ) # closing ) for with function

# The code below makes a good looking summary table.

# The library() function attaches a package full of functions (and sometimes
# dataframes) to R's "search list". When you type CrossTable(), R searches that
# list to figure out what to do.
# If you have not attached the package that includes the CrossTable()
# function, R will complain that it can not find the function "CrossTable".

# You can encounter problems when two packages include functions with the same
# name. R will give a warning like:
#   The following object(s) are masked ...
# when you attach a package that includes a function that is already in the
# search list. R calls this problem masking. The detach() function, shown
# below, can be used to remove packages from the search list and help avoid this
# problem. The CrossTable() function is unique enough that I do not bother to
# detach and reattach it repeatedly.

# Load the gmodels package to get the fancy CrossTable() function.
library(gmodels)
with(rhinitis,
     # The CrossTable() function generates frequency tables and it can do
     # many common cross tabulation statistics.
     CrossTable(whenrhinFactor, # rows of the table
                rhinitisFactor, # columns of the table
                prop.r = FALSE, # don't show row percentages
                prop.c = FALSE, # don't show column percentages
                prop.t = FALSE, # don't show total percentages
                # don't show each cell's impact on a chi-square test
                prop.chisq = FALSE,
                dnn = c(comment(whenrhinFactor), # row label
                        comment(rhinitisFactor)  # column label
                        )
                )
     )

# Make a dataset (an R dataframe object) that has the heart variables.

# The subset() function takes a dataframe as its first argument and it can use
# an argument called select to select only a subset of variables/columns. It
# can also include a "logic check" argument called subset which is used to
# determine which records/rows to include. That option appears in later
# chapters.
hearts <- subset(rhinitis, # the dataframe name
                 select = c(syst1, syst2, syst3, # variables to keep
                            diast1, diast2, diast3,
                            pulse1, pulse2, pulse3
                            )
                 )
# People who hate typing could type this instead:
# hearts <- rhinitis[, 1:7]
# The [ , ] after the dataframe name can be used to select which rows and
# columns are used.
# The "[," indicates that there is no filter on the rows. So, include all
# subjects. The ", 1:7]" is a filter on which columns to include. The analysis
# will be limited to the 1st through the 7th columns. The positions must match
# where the heart variables sit in the dataframe; names(rhinitis) lists the
# columns in order.

# The summary() function pays attention to the type of object it is being asked
# to summarize. Here, because the object is a data frame, it produces a
# five-number summary plus the mean for each numeric variable. It is missing
# the standard deviation. So another function needs to be tried (the describe
# function). In general it is a good plan to try the summary() function to
# describe objects.
summary(hearts)

# Load the psych package to get the describe() function. Loading it masks many
# objects in other packages, including the widely used describe() function
# which is in the Hmisc package.
library(psych)
# The describe() function in the psych package produces a more complete
# summary, including a standard deviation.
describe(hearts)
# Detach the package to prevent masking conflicts.
detach("package:psych", unload = TRUE)

with(rhinitis,
     hist(diast2) # The hist() function draws a simple histogram.
     )

# Load the ggplot2 package to get fancy graphics functions including qplot()
# and theme_bw() which are used below. The syntax/grammar used by ggplot2 is
# somewhat different from the rest of R. The basic idea with ggplot2 is that
# you add more and more details to get basic geometric objects to behave. It,
# perhaps even more than the rest of R, is evolving and features get deprecated.
# To learn more about ggplot2 look here:
# http://ggplot2.org/
# http://www.statmethods.net/advgraphs/ggplot2.html
library(ggplot2)
with(rhinitis,
     qplot(diast2, # make a quickplot of the diast2 variable
           geom = "histogram", # the geometry is a histogram
           main = "ggplot2 histogram", # the main title
           xlab = "Diastolic 2", # x-axis label
           ylab = "Frequency", # y-axis label
           binwidth = 10 # how wide to make the bins/bars
           ) + # finish quickplot but add stuff
       theme_bw() # change to black on white color scheme
     )
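The labeling, tabulating, and subsetting steps in the listing above can be tried on a small made-up dataframe. This is a minimal sketch under stated assumptions: the dataframe toy and every value in it are invented for illustration, only base R is used (no gmodels, psych, or ggplot2), and the variable names simply mirror those in the rhinitis data.

```r
# Hypothetical toy data; the real rhinitis dataframe is not needed.
toy <- data.frame(whenrhin = c(1, 2, 3, 2, 1, 3),
                  rhinitis = c(0, 1, 1, 0, 1, 1),
                  syst1    = c(118, 124, 131, 110, 127, 140),
                  diast1   = c(76, 80, 85, 70, 82, 90))

# Numeric codes become labeled factor levels.
toy$whenrhinFactor <- factor(toy$whenrhin,
                             levels = c(1, 2, 3),
                             labels = c("Dry season", "Wet season", "Anytime"))
toy$rhinitisFactor <- factor(toy$rhinitis,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))

# comment() stores a description that is not printed automatically.
comment(toy$whenrhinFactor) <- "When get rhinitis"
comment(toy$rhinitisFactor) <- "Rhinitis with a cold in last 12 months"

# with() lets table() refer to the columns by their bare names.
tab <- with(toy,
            table(whenrhinFactor, rhinitisFactor,
                  dnn = c(comment(whenrhinFactor), comment(rhinitisFactor))))
tab

# subset() with select keeps only the named columns and all rows.
bp <- subset(toy, select = c(syst1, diast1))
names(bp)
summary(bp)
```

Selecting columns by name, as subset() does here, avoids depending on column positions, which can shift if variables are added to the dataframe.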