MODULE 4.4 Data Splitting

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • Split data into sub-groups for analysis



Starting once again with …

Some Learning Questions

  • I have a variable with 3 levels - how can I calculate a mean (or other summary statistics) for each level?

  • Can data splitting calls be applied to lists of R data objects?

  • Can I use splitting functions to estimate values for nested strata?


Some Background


Data splitting in R refers to calls that split data into subgroups for analysis based on one or more criteria. The functions that split data in R are very powerful. They not only accept R base functions (e.g., mean(), sd()), but also can accept customized functions related to particular research objectives.

  • EXAMPLE: Calculate the sum of counts of fish species where each variable (a column in the data object) represents a different species of fish.
  • EXAMPLE: Calculate the mean and standard deviation of weight for males and females.
  • EXAMPLE: Estimate the proportion of forbs and grasses in study plots subjected to two types of grazing cross-classified by three types of habitat manipulation.
  • EXAMPLE: Apply a regression model where the predictors are the same to three different age classes.

Some Initialization Before We Proceed …


Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them into your workspace first.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?  
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
  list.files(pattern = "\\.RData$")  # is it there ? (pattern is a regular expression)
## [1] "mod3data.RData"
  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.


Applying Splits to Dataframe Rows and Columns: The apply()


The simplest of splits is to calculate a summary statistic for rows or columns. The function apply(ObjectVar, Margins, Function2Apply) summarizes data by applying a specified function to the rows or the columns of a data object, where:

  • ObjectVar is the columns (variables) selected for analysis;
  • Margins are defined as 1 = rows, 2 = columns; and
  • Function2Apply is the function to apply to the selected ObjectVar.

Any existing R function, or any function you construct, can be applied to the data object rows and columns.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
head(w1[, (3:14)], 2)  # examine the data; 1st 2 lines
##   tmax01 tmax02 tmax03 tmax04 tmax05 tmax06 tmax07 tmax08 tmax09 tmax10
## 1     14     17     20     24     29     34     34     32     30     25
## 2     14     17     20     24     29     34     34     32     30     25
##   tmax11 tmax12
## 1     19     14
## 2     19     15
# the apply function
apply(w1[, (3:14)], 1, sum)  # apply sum to rows; call vars by column number
##  [1] 292 293 309 274 296 230 266 225 231 223
apply(w1[, c("tmax01", "tmax02")], 2, sum)  # apply sum to columns; vars by name
## tmax01 tmax02 
##    119    143
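
The point that any function you construct can be applied is easier to see with a small case. The sketch below is only an illustration: the dataframe toy and its values are made up to stand in for w1, and row_means and col_range are hypothetical object names. It assumes you want to skip missing values, and shows an extra argument (na.rm = TRUE) being passed through apply() as well as a user-written function applied to columns.

```r
# hypothetical toy data standing in for w1; values are made up
toy <- data.frame(tmax01 = c(14, 10, NA),
                  tmax02 = c(17, 12, 15),
                  tmax03 = c(20, 16, 18))

# arguments placed after the function name are passed on to that function
row_means <- apply(toy, 1, mean, na.rm = TRUE)  # margin 1 = rows
row_means

# a user-written function: spread (max - min) of each column
col_range <- apply(toy, 2, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
col_range  # margin 2 = columns; result is named by column
```

Because apply() converts a dataframe to a matrix before it works, it is safest to pass it columns of a single type, as done here with all-numeric columns.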

Applying Splits to Dataframe Columns as Nonnumeric Lists: The lapply()


lapply() applies a function to each element of a data object (for a dataframe, each column) and returns the results as a list. The minimal call is lapply(ObjectVar, Function2Apply), where, as with apply(), ObjectVar is the columns (variables) selected for analysis, and Function2Apply is the function to apply to the selected ObjectVar. As before, any existing R function, or any function you construct, can be applied to the data object.

lapply() always returns a list.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
s1 <- lapply(w1[, 3:5], sum); s1  # list of column sums vars 3-5
## $tmax01
## [1] 119
## 
## $tmax02
## [1] 143
## 
## $tmax03
## [1] 169
is.list(s1)   # check to see if list
## [1] TRUE
lapply(w1[c("tmax01", "tmax02")], sum)  # list of sums by col names
## $tmax01
## [1] 119
## 
## $tmax02
## [1] 143
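
One of the learning questions asks whether splitting calls can be applied to lists of R data objects. A minimal sketch suggests they can, since lapply() simply visits each element of whatever list it is given. The list plots, its element names siteA and siteB, and all values are hypothetical and made up for illustration only.

```r
# hypothetical list holding two small dataframes; values are made up
plots <- list(siteA = data.frame(count = c(3, 5, 8)),
              siteB = data.frame(count = c(2, 4)))

# apply an existing function to every dataframe in the list
n_rows <- lapply(plots, nrow)
n_rows

# apply a user-written function to every dataframe in the list
mean_count <- lapply(plots, function(d) mean(d$count))
mean_count
```

Each result is itself a list with one element per dataframe, named siteA and siteB.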

WARNING!!
Double brackets [[ ]] are required to access the values in a list. A single [ ] returns a list containing the element, not the value itself.

s1  # recall the list structure; NOTE element names as $, e.g., $tmax01
## $tmax01
## [1] 119
## 
## $tmax02
## [1] 143
## 
## $tmax03
## [1] 169
# examine and access lapply list output
is.list(s1[1])  # is element [] of class = list? TRUE
## [1] TRUE
is.numeric(s1[1])  # is element [] of class = numeric? FALSE
## [1] FALSE
s1[1]  # list element []
## $tmax01
## [1] 119
is.list(s1[[1]])  # is element [[]] of class = list? FALSE
## [1] FALSE
is.numeric(s1[[1]])  # is element [[]] of class = numeric? TRUE
## [1] TRUE
s1[[1]]   # list value [[]]
## [1] 119
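
The practical consequence of [ ] versus [[ ]] shows up as soon as you try arithmetic. The sketch below uses a hypothetical list s_demo built by hand with two of the sums shown above, so it runs without w1; v_demo is likewise a made-up name. It also shows two alternatives that may be convenient: access by name with $, and unlist() to collapse the whole list to a vector.

```r
# hypothetical stand-in for s1, built by hand
s_demo <- list(tmax01 = 119, tmax02 = 143)

s_demo$tmax01     # access by name; same value as s_demo[[1]]
s_demo[[1]] + 1   # works: [[ ]] returns a number

# single [ ] returns a list, so arithmetic on it fails
res <- try(s_demo[1] + 1, silent = TRUE)
class(res)        # "try-error"

# collapse the whole list to a named numeric vector
v_demo <- unlist(s_demo)
v_demo
```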

Applying Splits to Dataframe Columns as Numeric and Nonnumeric Lists: The sapply()


sapply(ObjectVar, Function2Apply) works like lapply() but, where possible, simplifies the result to a vector rather than a list. As before, ObjectVar is the selected columns (variables) in the data object, and Function2Apply is the function to apply to the selected ObjectVar. Again, you can apply a customized function.

The most common use of sapply() is to return numeric values.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
s2 <- sapply(w1[, 3:5], sd); s2  # column sd of vars 3-5
##   tmax01   tmax02   tmax03 
## 2.846050 2.869379 3.348300
is.list(s2)  # is s2 a list? FALSE ...
## [1] FALSE
class(s2)  # what is it then?  class of numeric ...
## [1] "numeric"
is.numeric(s2)  # is s2 a numeric? TRUE as well ...
## [1] TRUE
str(s2)  # let's check using str()
##  Named num [1:3] 2.85 2.87 3.35
##  - attr(*, "names")= chr [1:3] "tmax01" "tmax02" "tmax03"

The functions used in sapply() do not have to be numerically based, nor must they return numerics. Here we ask for the class of each variable in a data object.

# assume f1 from Exercise #6; fish species captured by osprey
sapply(f1, class)  # example of utility of R splitting calls; here class of each variable is returned 
##   FishSpp      Male    Female 
##  "factor" "integer" "integer"
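
When the applied function returns more than one value per column, sapply() typically arranges the results as a matrix. The sketch below is a small illustration of that behavior under stated assumptions: the dataframe toy and its values are made up, and stats is a hypothetical object name.

```r
# hypothetical toy data; values are made up
toy <- data.frame(tmax01 = c(14, 10, 12),
                  tmax02 = c(17, 12, 16))

# user-written function returning two named values per column
stats <- sapply(toy, function(x) c(mean = mean(x), sd = sd(x)))
stats              # one column per variable, one row per statistic

is.matrix(stats)           # TRUE
stats["mean", "tmax01"]    # pull a single value by row and column name
```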

Splitting Data Object Over Factors with Levels and a Single Variable: The tapply()


tapply(ObjectVar, Factor, Function2Apply) applies a specified function to a single variable within each level of a factor, where ObjectVar is the selected column; Factor is the factor column having discrete levels, with multiple factors supplied as list(Factor1, … , FactorN); and Function2Apply is the function.

Any existing R function, or any function you construct, can be applied to the data object.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
unique(w1$prab)  # factor prab with two levels [0,1]
## [1] 0 1
tapply(w1$tmax01, w1$prab, mean)  # mean tmax01 by prab=0,1
##    0    1 
## 14.4  9.4
tapply(w1$tmax06, w1$prab, mean)  # mean tmax06 by prab=0,1
##  0  1 
## 34 29
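
Supplying more than one factor as a list gives a cross-classified table of results. The sketch below is a minimal illustration, not drawn from the course data: the dataframe toy, its variables wt, sex, and age, and all values are hypothetical.

```r
# hypothetical toy data; values are made up
toy <- data.frame(wt  = c(10, 12, 20, 22, 14, 24),
                  sex = c("F", "F", "M", "M", "F", "M"),
                  age = c("juv", "ad", "juv", "ad", "ad", "ad"))

# mean weight for each sex-by-age combination; factors passed as a list
cell_means <- tapply(toy$wt, list(sex = toy$sex, age = toy$age), mean)
cell_means            # rows = sex, columns = age

cell_means["F", "ad"] # mean weight of adult females
```

Naming the list elements (sex =, age =) labels the dimensions of the output, which helps when more than two factors are involved.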

Splitting Data Object Over Multiple Factors and Variables: The aggregate()


aggregate() is the most powerful of the R splitting functions. It allows you to apply a function to multiple variables within the levels of one or more factors. The basic call is: aggregate(ObjectVar, by = list(Factor), FUN = Function2Apply), where ObjectVar is 1 or more selected variables for analysis; by = list(Factor) is a list of 1 or more factors with discrete levels; and Function2Apply is the function to be applied to the variables within the factor levels.

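With the by = list() call shown here, aggregate() omits rows in which a grouping factor is NA; NA values in the analysis variables are still passed to the function, so add na.rm = TRUE after FUN if they should be ignored. Note that in the output the factor columns are given the default names Group.1, …, Group.N. You can assign names to each of the factors to override the default Group.N naming convention.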

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
# calculate mean tmax01-06 (6 variables) by prab=0,1 (1 factor, 2 levels)
aggregate(w1[, 3:8], by = list(presab = w1$prab), FUN = mean)  # NOTE name presab assigned to factor w1$prab
##   presab tmax01 tmax02 tmax03 tmax04 tmax05 tmax06
## 1      0   14.4   16.8   19.8   24.2   29.0     34
## 2      1    9.4   11.8   14.0   19.0   23.6     29
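
For nested or cross-classified strata, more than one factor can be placed in the by = list(). The sketch below is an illustration under stated assumptions: the dataframe toy, its variables wt, len, sex, and age, and all values are hypothetical. It also shows the formula form of aggregate(), which is a base R alternative to the by = list() form and not the call used above.

```r
# hypothetical toy data; values are made up
toy <- data.frame(wt  = c(10, 12, 20, 22, 14, 24),
                  len = c(1, 2, 3, 4, 5, 6),
                  sex = c("F", "F", "M", "M", "F", "M"),
                  age = c("juv", "ad", "juv", "ad", "ad", "ad"))

# two variables summarized within two named factors
by_list <- aggregate(toy[, c("wt", "len")],
                     by = list(sex = toy$sex, age = toy$age), FUN = mean)
by_list

# same summary written with the formula interface
by_formula <- aggregate(cbind(wt, len) ~ sex + age, data = toy, FUN = mean)
by_formula
```

Both calls return a dataframe with one row per sex-by-age combination, so the output can be subset or saved like any other dataframe.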

Summary of Module 4.4 functions


Basic calls related to splitting data are:

  • apply() => Summarize data by rows or columns; numeric
  • lapply() => Apply function to data objects; creates a list
  • sapply() => Apply function to numeric or non-numeric data objects
  • tapply() => Apply function to a single variable by the levels of ≥1 factors
  • aggregate() => Apply function to multiple variables by the levels of ≥1 factors

Exercise #14


Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.

Import the dataset gsg_leks from the dwr_data directory. This dataset consists of lek complexes, leks within those complexes, and a count of males on those leks by year of survey. Disturbance (Y,N) is also recorded.

  • Compute general statistics min, max, mean, and sd of tot_male for:
    • lek_id within complex;
    • disturbance; and
    • year within complex.
    Use any of the appropriate R data splitting functions, and assign logical column names where it seems appropriate.
  • Save the last of the 3 analyses (year within complex) as separate .RData objects for each of the computed statistics (there will be 4).

END MODULE 4.4

