Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016
Data splitting in R refers to calls that split data into subgroups for analysis based on one or more criteria. The functions that split data in R are very powerful. They accept not only R base functions (e.g., mean(), sd()), but also customized functions related to particular research objectives.
Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them first into your workspace.
# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
getwd() # in correct directory ?
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
list.files(pattern = ".RData") # is it there ?
## [1] "mod3data.RData"
load("mod3data.RData") # load it
ls() # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"
If the objects are not there, or you did not save an .RData file from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.
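The check-then-load steps above can be rehearsed without the course file. The following is only a sketch: the object toy and the file name toy_mod3data.RData are made up for illustration, and the file is written to a temporary directory rather than your class directory.

```r
# hypothetical object standing in for the Exercise #6 data
toy <- data.frame(id = 1:3, tmax01 = c(14, 10, 12))

# temporary file, NOT the course file mod3data.RData
rdata <- file.path(tempdir(), "toy_mod3data.RData")
save(toy, file = rdata)   # write the object to disk
rm(toy)                   # remove it from the workspace

if (file.exists(rdata)) {
  loaded <- load(rdata)   # load() returns the names of the restored objects
} else {
  stop("file not found; re-import the data first")
}
loaded                    # which objects came back ?
exists("toy")             # is toy in the workspace again ?
```

Capturing the return value of load() is a convenient way to see exactly which objects a file restored, rather than scanning the full ls() listing.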
apply()
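The simplest of splits is to calculate a summary statistic for rows and/or columns. The function apply(ObjectVar, Margins, Function2Apply) summarizes data by applying a specified function to the rows or columns of a data object, where ObjectVar is the data object (or the rows and columns selected from it), Margins is 1 for rows or 2 for columns, and Function2Apply is the function to apply.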
Any existing R function, or any function you construct, can be applied to the data object rows and columns.
# assume data w1 from Exercise #6; a series of maximum temperature data by month number
head(w1[, (3:14)], 2) # examine the data; 1st 2 lines
## tmax01 tmax02 tmax03 tmax04 tmax05 tmax06 tmax07 tmax08 tmax09 tmax10
## 1 14 17 20 24 29 34 34 32 30 25
## 2 14 17 20 24 29 34 34 32 30 25
## tmax11 tmax12
## 1 19 14
## 2 19 15
# the apply function
apply(w1[, (3:14)], 1, sum) # apply sum to rows; call vars by column number
## [1] 292 293 309 274 296 230 266 225 231 223
apply(w1[, c("tmax01", "tmax02")], 2, sum) # apply sum to columns; vars by name
## tmax01 tmax02
## 119 143
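Because any function you construct can be passed to apply(), here is a minimal sketch with a user-built function. The data frame toy and the helper rng_width() are hypothetical, with invented values; they are not part of the Exercise #6 objects.

```r
# hypothetical stand-in for w1; values are invented for illustration
toy <- data.frame(tmax01 = c(14, 10, 12),
                  tmax02 = c(17, 13, 15),
                  tmax03 = c(20, 15, 18))

# user-built function: width of the range (max minus min)
rng_width <- function(x) max(x) - min(x)

row_width <- apply(toy, 1, rng_width)  # margin 1: one value per row
col_width <- apply(toy, 2, rng_width)  # margin 2: one value per column
row_width
col_width
```

The only thing that changes between the two calls is the margin; the function itself does not need to know whether it is receiving a row or a column.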
lapply()
lapply() calculates a summary statistic for data objects and returns a list. The minimal required options are lapply(ObjectVar, Function2Apply), where, as with apply(), ObjectVar is the columns (variables) selected for analysis, and Function2Apply is the function to apply to the selected ObjectVar. As before, any existing R function, or any function you construct, can be applied to the data object.
lapply() returns a list.
# assume data w1 from Exercise #6; a series of maximum temperature data by month number
s1 <- lapply(w1[, 3:5], sum); s1 # list of column sums vars 3-5
## $tmax01
## [1] 119
##
## $tmax02
## [1] 143
##
## $tmax03
## [1] 169
is.list(s1) # check to see if list
## [1] TRUE
lapply(w1[c("tmax01", "tmax02")], sum) # list of sums by col names
## $tmax01
## [1] 119
##
## $tmax02
## [1] 143
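Arguments placed after the function name in lapply() are passed on to that function. The sketch below assumes a made-up data frame toy with one missing value and a hypothetical helper cv_pct(); neither comes from Exercise #6.

```r
# hypothetical data with one NA; values are invented for illustration
toy <- data.frame(tmax01 = c(14, NA, 12),
                  tmax02 = c(17, 13, 15))

# user-built function: coefficient of variation, in percent
cv_pct <- function(x, na.rm = FALSE) {
  100 * sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# na.rm = TRUE is handed to cv_pct() for every column
cv_toy <- lapply(toy, cv_pct, na.rm = TRUE)
cv_toy
is.list(cv_toy)   # still a list, one element per column
```

Without na.rm = TRUE the tmax01 element would be NA, because sd() and mean() both return NA when a missing value is present.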
WARNING!!
Double brackets [[ ]] are required to access values in a list. Single brackets [ ] return a list containing the element, not the value itself.
s1 # recall the list structure; NOTE element names shown as $, e.g., $tmax01
## $tmax01
## [1] 119
##
## $tmax02
## [1] 143
##
## $tmax03
## [1] 169
# examine and access lapply list output
is.list(s1[1]) # is element [] of class = list? TRUE
## [1] TRUE
is.numeric(s1[1]) # is element [] of class = numeric? FALSE
## [1] FALSE
s1[1] # list element []
## $tmax01
## [1] 119
is.list(s1[[1]]) # is element [[]] of class = list? FALSE
## [1] FALSE
is.numeric(s1[[1]]) # is element [[]] of class = numeric? TRUE
## [1] TRUE
s1[[1]] # list value [[]]
## [1] 119
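List values can also be reached by element name, and unlist() collapses the whole list to a named vector. A minimal sketch, assuming a made-up data frame toy rather than w1:

```r
# hypothetical data; values are invented for illustration
toy <- data.frame(tmax01 = c(14, 10, 12),
                  tmax02 = c(17, 13, 15))
s_toy <- lapply(toy, sum)   # list of column sums

s_toy[["tmax01"]]           # value by name, double brackets
s_toy$tmax02                # value by name, $ shortcut
is.list(s_toy["tmax01"])    # single brackets still return a list

v <- unlist(s_toy)          # collapse list to a named numeric vector
v
is.numeric(v)
```

Access by name is usually safer than access by position, since it keeps working if the column order changes.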
sapply()
sapply(ObjectVar, Function2Apply) works like lapply() but simplifies the result to a vector (or matrix) where possible instead of returning a list. As before, ObjectVar is the selected columns (variables) in the data object, and Function2Apply is the function to apply to the selected ObjectVar. Again, you can apply a customized function.
The most common use of sapply() is to return numeric values.
# assume data w1 from Exercise #6; a series of maximum temperature data by month number
s2 <- sapply(w1[, 3:5], sd); s2 # column sd of vars 3-5
## tmax01 tmax02 tmax03
## 2.846050 2.869379 3.348300
is.list(s2) # is s2 a list? FALSE ...
## [1] FALSE
class(s2) # what is it then? class of numeric ...
## [1] "numeric"
is.numeric(s2) # is s2 a numeric? TRUE as well ...
## [1] TRUE
str(s2) # let's check using str()
## Named num [1:3] 2.85 2.87 3.35
## - attr(*, "names")= chr [1:3] "tmax01" "tmax02" "tmax03"
As noted above, the functions used in sapply() do not have to be numerically based or return only numerics. Here we ask for the class of each variable in a data object.
# assume f1 from Exercise #6; fish species captured by osprey
sapply(f1, class) # example of utility of R splitting calls; here class of each variable is returned
## FishSpp Male Female
## "factor" "integer" "integer"
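When the applied function returns more than one value per column, sapply() returns a matrix rather than a vector. A minimal sketch, assuming a made-up data frame toy (not an Exercise #6 object):

```r
# hypothetical data; values are invented for illustration
toy <- data.frame(tmax01 = c(14, 10, 12),
                  tmax02 = c(17, 13, 15))

# range() returns two values (min, max) for each column
r_toy <- sapply(toy, range)
r_toy
is.matrix(r_toy)   # one matrix column per variable
dim(r_toy)         # 2 rows (min, max) by 2 columns
```

The same call with lapply() would return a list of two-element vectors; sapply() simply binds them into columns.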
tapply()
tapply(ObjectVar, Factor, Function2Apply) applies a specified function within the levels of a factor, but is restricted to a single variable. ObjectVar is the selected column; Factor is the factor column having discrete levels, with multiple factors supplied as Factor = list(Factor1, … , FactorN); and Function2Apply is the function.
Any existing R function, or any function you construct, can be applied to the data object.
# assume data w1 from Exercise #6; a series of maximum temperature data by month number
unique(w1$prab) # factor prab with two levels [0,1]
## [1] 0 1
tapply(w1$tmax01, w1$prab, mean) # mean tmax01 by prab=0,1
## 0 1
## 14.4 9.4
tapply(w1$tmax06, w1$prab, mean) # mean tmax06 by prab=0,1
## 0 1
## 34 29
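Supplying several factors as a list gives one cell per combination of levels. The sketch below assumes a made-up data frame toy; the column site is hypothetical and does not exist in w1.

```r
# hypothetical data; values and the site column are invented for illustration
toy <- data.frame(tmax01 = c(14, 15, 9, 10, 13, 8),
                  prab   = c(0, 0, 1, 1, 0, 1),
                  site   = c("a", "a", "a", "b", "b", "b"))

# mean tmax01 for every prab x site combination
m_toy <- tapply(toy$tmax01, list(prab = toy$prab, site = toy$site), mean)
m_toy
m_toy["0", "a"]   # pull one cell by level names
```

With two factors the result is a matrix, rows for the levels of the first factor and columns for the levels of the second.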
aggregate()
aggregate() is the most powerful of the R splitting functions. It allows you to apply a function to factors with different levels for multiple variables. The basic call is: aggregate(ObjectVar, by = list(Factor), FUN = Function2Apply), where ObjectVar is 1 or more selected variables for analysis; by = list(Factor) is a list of 1 or more factors with discrete levels; and Function2Apply is the function to be applied to the variables within factor levels.
With the by = interface used here, aggregate() omits rows that have NA in any of the by factors; NA values in the analysis variables are passed to the function, so supply na.rm = TRUE where the function supports it. Note that in the output the factor levels are reassigned names Group.1, …, Group.N. You can assign names to each of the factors to override the default Group.N naming convention.
# assume data w1 from Exercise #6; a series of maximum temperature data by month number
# calculate mean tmax01-06 (6 variables) by prab=0,1 (1 factor, 2 levels)
aggregate(w1[, 3:8], by = list(presab = w1$prab), FUN = mean) # NOTE name presab assigned to factor w1$prab
## presab tmax01 tmax02 tmax03 tmax04 tmax05 tmax06
## 1 0 14.4 16.8 19.8 24.2 29.0 34
## 2 1 9.4 11.8 14.0 19.0 23.6 29
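The same call extends to more than one factor, and extra arguments after FUN are passed on to the function. A minimal sketch, assuming a made-up data frame toy with a hypothetical site column and one missing value:

```r
# hypothetical data; values and the site column are invented for illustration
toy <- data.frame(tmax01 = c(14, NA, 9, 10, 13, 8),
                  tmax02 = c(17, 16, 12, 11, 15, 13),
                  prab   = c(0, 0, 1, 1, 0, 1),
                  site   = c("a", "a", "a", "b", "b", "b"))

# mean of 2 variables by 2 named factors; na.rm = TRUE is passed on to mean()
a_toy <- aggregate(toy[, c("tmax01", "tmax02")],
                   by = list(presab = toy$prab, site = toy$site),
                   FUN = mean, na.rm = TRUE)
a_toy
names(a_toy)   # factor columns come first, under the names given in by =
```

Here na.rm = TRUE drops the missing tmax01 value within its group before the mean is taken, so that group still reports a number instead of NA.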
Basic calls related to splitting data are:
apply() => Summarize data by rows or columns; numeric
lapply() => Apply function to data objects; creates a list
sapply() => Apply function to numeric or non-numeric data objects
tapply() => Apply function to data object class=factor with ≥1 levels
aggregate() => Apply function to ≥1 data objects of class=factor for multiple variables with ≥1 levels
Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.
Import the dataset gsg_leks from the dwr_data directory. This dataset consists of lek complexes, leks within those complexes, and a count of males on those leks by year of survey. Disturbance (Y,N) is also recorded.