MODULE 4.4 Data Splitting

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016

Objective:

Split data into sub-groups for analysis

Starting once again with …

Some Learning Questions

I have a variable with 3 levels - how can I calculate a mean (or other summary statistics) for each level?
Can data splitting calls be applied to lists of R data objects?
Can I use splitting functions to estimate values for nested strata?

Some Background

Data splitting in R refers to calls that split data into subgroups for analysis based on one or more criteria. The functions that split data in R are very powerful. They accept not only base R functions (e.g., mean(), sd()), but also customized functions written for particular research objectives.

EXAMPLE: Calculate the sum of counts of fish species where each variable (a column in the data object) represents a different species of fish.
EXAMPLE: Calculate the mean and standard deviation of weight for males and females.
EXAMPLE: Estimate the proportion of forbs and grasses in study plots subjected to two types of grazing cross-classified by three types of habitat manipulation.
EXAMPLE: Apply a regression model where the predictors are the same to three different age classes.
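
As a rough illustration of the second example, a customized function can be handed to a splitting call in the same way as a base R function. The sketch below assumes a small made-up data frame (wt) and a made-up helper (mean_sd); neither is one of the course objects, and split() is a base R function not otherwise covered in this module.

```r
# Hypothetical toy data; NOT one of the course objects
wt <- data.frame(sex    = c("F", "F", "F", "M", "M", "M"),
                 weight = c(10, 12, 14, 20, 22, 27))

# A customized function: mean and standard deviation in one call
mean_sd <- function(x) c(mean = mean(x), sd = sd(x))

# Split weight by sex, then apply the customized function to each group
wt_stats <- sapply(split(wt$weight, wt$sex), mean_sd)
wt_stats  # rows = statistics (mean, sd); columns = levels of sex (F, M)
```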

Some Initialization Before We Proceed …

Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them first into your workspace.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?

## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"

  list.files(pattern = "\\.RData$") # is it there ? (pattern is a regular expression)

## [1] "mod3data.RData"

  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?

## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.
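
If it is unclear how objects end up in an .RData file in the first place, the round trip can be sketched as follows. The objects a1 and b1 and the temporary file are made up for illustration and are not part of the course data.

```r
# Hypothetical objects standing in for the course data
a1 <- data.frame(id = 1:3, count = c(4, 0, 7))
b1 <- c(2.5, 3.5)

# Save both objects to a single .RData file (a temporary file is used here)
rdata_file <- tempfile(fileext = ".RData")
save(a1, b1, file = rdata_file)

# Remove them from the workspace, then restore them from the file
rm(a1, b1)
restored <- load(rdata_file)  # load() returns the names of the objects restored
restored
exists("a1")  # is the object back in the workspace ?
```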

Applying Splits to Dataframe Rows and Columns: The `apply()`

The simplest of splits is to calculate a summary statistic for rows and/or columns. The function apply(ObjectVar, Margins, Function2Apply) summarizes data by applying a specified function to rows and columns in a data object, where:

ObjectVar is the columns (variables) selected for analysis;
Margins are defined as 1 = rows, 2 = columns; and
Function2Apply is the function to apply to the selected ObjectVar.

Any existing R function, or any function you construct, can be applied to the data object rows and columns.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
head(w1[, (3:14)], 2)  # examine the data; 1st 2 lines

##   tmax01 tmax02 tmax03 tmax04 tmax05 tmax06 tmax07 tmax08 tmax09 tmax10
## 1     14     17     20     24     29     34     34     32     30     25
## 2     14     17     20     24     29     34     34     32     30     25
##   tmax11 tmax12
## 1     19     14
## 2     19     15

# the apply function
apply(w1[, (3:14)], 1, sum)  # apply sum to rows; call vars by column number

##  [1] 292 293 309 274 296 230 266 225 231 223

apply(w1[, c("tmax01", "tmax02")], 2, sum)  # apply sum to columns; vars by name

## tmax01 tmax02 
##    119    143
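
The statement that any function you construct can be applied is easiest to see with a small sketch. The data frame tmx below is a made-up stand-in for w1 (it is not the course data), and the anonymous range-width function is just one possible customized function.

```r
# Hypothetical toy data: 3 sites (rows) by 4 months (columns); NOT the course w1
tmx <- data.frame(tmax01 = c(14, 10, 8),
                  tmax02 = c(17, 12, NA),
                  tmax03 = c(20, 15, 13),
                  tmax04 = c(24, 19, 18))

# Margin 1 (rows): customized function giving max minus min for each site
width <- apply(tmx, 1, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
width  # 10 9 10

# Margin 2 (columns): arguments listed after the function are passed on to it
col_means <- apply(tmx, 2, mean, na.rm = TRUE)
col_means
```

Passing na.rm = TRUE after the function name is a convenient way to handle missing values without writing a wrapper function.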

Applying Splits to Dataframe Columns as Nonnumeric Lists: The `lapply()`

The lapply() applies a function to each element of a data object (for a data frame, each column) and returns a list. The minimal call is lapply(ObjectVar, Function2Apply), where, as with apply(), ObjectVar is the columns (variables) selected for analysis, and Function2Apply is the function to apply to the selected ObjectVar. As before, any existing R function, or any function you construct, can be applied to the data object.

lapply() always returns a list, even when each result is a single value.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
s1 <- lapply(w1[, 3:5], sum); s1  # list of column sums vars 3-5

## $tmax01
## [1] 119
## 
## $tmax02
## [1] 143
## 
## $tmax03
## [1] 169

is.list(s1)   # check to see if list

## [1] TRUE

lapply(w1[c("tmax01", "tmax02")], sum)  # list of sums by col names

## $tmax01
## [1] 119
## 
## $tmax02
## [1] 143
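
lapply() is not limited to data frame columns. As a minimal sketch, assuming a made-up list (objs) that holds three different kinds of R data objects, the same call works on each list element:

```r
# Hypothetical list holding three different R data objects
objs <- list(counts = c(4, 0, 7, 2),
             temps  = data.frame(tmax01 = c(14, 10), tmax02 = c(17, 12)),
             labels = c("a", "b", "c"))

# Apply the same function to every element; the result is again a list
obj_class <- lapply(objs, class)
obj_class

obj_len <- lapply(objs, length)  # for a data frame, length() is the number of columns
obj_len
```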

WARNING!!
Double brackets [[ ]] are required to access the value stored in a list element. Single brackets [ ] return a list containing that element, not the value itself.

s1  # recall the list structure; NOTE element names shown as $, eg $tmax01

## $tmax01
## [1] 119
## 
## $tmax02
## [1] 143
## 
## $tmax03
## [1] 169

# examine and access lapply list output
is.list(s1[1])  # is element [] of class = list? TRUE

## [1] TRUE

is.numeric(s1[1])  # is element [] of class = numeric? FALSE

## [1] FALSE

s1[1]  # list element []

## $tmax01
## [1] 119

is.list(s1[[1]])  # is element [[]] of class = list? FALSE

## [1] FALSE

is.numeric(s1[[1]])  # is element [[]] of class = numeric? TRUE

## [1] TRUE

s1[[1]]   # list value [[]]

## [1] 119
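
List values can also be reached by element name, and a whole list can be collapsed to a vector. The sketch below rebuilds a list shaped like s1 by hand (the name s_demo is made up) so that it runs without the course data.

```r
# Hypothetical list shaped like the lapply() output s1
s_demo <- list(tmax01 = 119, tmax02 = 143, tmax03 = 169)

s_demo[["tmax02"]]  # value by name, double brackets
s_demo$tmax02       # same value using $
s_demo[2:3]         # single brackets: still a list, now with elements 2 and 3 only

s_vec <- unlist(s_demo)  # collapse the list to a named numeric vector
s_vec
```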

Applying Splits to Dataframe Columns as Numeric and Nonnumeric Lists: The `sapply()`

sapply(ObjectVar, Function2Apply) works like lapply() but simplifies the result to a vector or matrix where possible. As before, ObjectVar is the selected columns (variables) in the data object, and Function2Apply is the function to apply to the selected ObjectVar. Again, you can apply a customized function.

Most applications of sapply() return numerics.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
s2 <- sapply(w1[, 3:5], sd); s2  # column sd of vars 3-5

##   tmax01   tmax02   tmax03 
## 2.846050 2.869379 3.348300

is.list(s2)  # is s2 a list? FALSE ...

## [1] FALSE

class(s2)  # what is it then?  class of numeric ...

## [1] "numeric"

is.numeric(s2)  # is s2 a numeric? TRUE as well ...

## [1] TRUE

str(s2)  # let's check using str()

##  Named num [1:3] 2.85 2.87 3.35
##  - attr(*, "names")= chr [1:3] "tmax01" "tmax02" "tmax03"

As noted above, the functions used in sapply() do not have to be numerically-based and return numerics only. Here we ask for the class of each variable in a data object.

# assume f1 from Exercise #6; fish species captured by osprey
sapply(f1, class)  # example of utility of R splitting calls; here class of each variable is returned

##   FishSpp      Male    Female 
##  "factor" "integer" "integer"
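
The shape of what sapply() returns depends on what the applied function returns. A minimal sketch, assuming a made-up data frame tmx rather than the course data:

```r
# Hypothetical toy data; NOT the course w1
tmx <- data.frame(tmax01 = c(14, 10, 8),
                  tmax02 = c(17, 12, 11),
                  tmax03 = c(20, 15, 13))

# One value per column: sapply() simplifies to a named vector
col_means <- sapply(tmx, mean)
col_means

# Two values per column (min and max): sapply() simplifies to a matrix
rng <- sapply(tmx, range)
rng
dim(rng)  # 2 rows (min, max) by 3 columns (variables)
```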

Splitting Data Object Over Factors with Levels and a Single Variable: The `tapply()`

The tapply(ObjectVar, Factor, Function2Apply) applies a specified function within each level of a factor, but is restricted to a single variable. ObjectVar is the selected column; Factor is the factor column having discrete levels, with multiple factors supplied as Factor = list(Factor1, … , FactorN); and Function2Apply is the function.

Any existing R function, or any function you construct, can be applied to the data object.

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
unique(w1$prab)  # factor prab with two levels [0,1]

## [1] 0 1

tapply(w1$tmax01, w1$prab, mean)  # mean tmax01 by prab=0,1

##    0    1 
## 14.4  9.4

tapply(w1$tmax06, w1$prab, mean)  # mean tmax06 by prab=0,1

##  0  1 
## 34 29
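
Supplying more than one factor as a list gives one value per combination of levels, which is one way to summarize cross-classified or nested strata. The sketch below assumes a made-up data frame (plots) loosely modeled on the grazing-by-habitat example in the background section; it is not course data.

```r
# Hypothetical toy data: forb counts by habitat and grazing treatment
plots <- data.frame(habitat = c("A", "A", "B", "B", "A", "B"),
                    grazing = c("low", "high", "low", "high", "low", "high"),
                    forbs   = c(10, 4, 8, 2, 12, 6))

# Two factors supplied as a list: result is a habitat x grazing table of means
forb_means <- tapply(plots$forbs, list(plots$habitat, plots$grazing), mean)
forb_means
```

Rows of the result correspond to the first factor in the list and columns to the second.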

Splitting Data Object Over Multiple Factors and Variables: The `aggregate()`

aggregate() is the most powerful of the R splitting functions. It allows you to apply a function to multiple variables within the levels of one or more factors. The basic call is: aggregate(ObjectVar, by = list(Factor), FUN = Function2Apply), where ObjectVar is 1 or more selected variables for analysis; by = list(Factor) is a list of 1 or more factors with discrete levels; and Function2Apply is the function to be applied to the variables in factor levels.

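By default aggregate() omits rows that have NA in any of the by factors; NA values in the selected variables are passed on to the function, so add an argument such as na.rm = TRUE where the function supports it. Note that in the output the factor columns are assigned the default names Group.1, …, Group.N. You can assign names to each of the factors to override the default Group.N naming convention.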
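
How aggregate() treats NA depends on where the NA occurs, which a small sketch can make concrete. The data frame d below is made up for illustration, with one NA in the grouping factor and one in the measured variable.

```r
# Hypothetical toy data with NA in both the grouping factor and the measured variable
d <- data.frame(grp = c("a", "a", "b", "b", NA),
                y   = c(1, NA, 3, 5, 9))

# The row with grp = NA is dropped; the NA in y is passed to mean(),
# so the mean for group "a" is NA
r1 <- aggregate(d["y"], by = list(grp = d$grp), FUN = mean)
r1

# Arguments after FUN are passed on to it; na.rm = TRUE drops the NA in y
r2 <- aggregate(d["y"], by = list(grp = d$grp), FUN = mean, na.rm = TRUE)
r2
```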

# assume data w1 from Exercise #6; a series of maximum temperature data by month number
# calculate mean tmax01-06 (6 variables) by prab=0,1 (1 factor, 2 levels)
aggregate(w1[, 3:8], by = list(presab = w1$prab), FUN = mean)  # NOTE name presab assigned to factor w1$prab

##   presab tmax01 tmax02 tmax03 tmax04 tmax05 tmax06
## 1      0   14.4   16.8   19.8   24.2   29.0     34
## 2      1    9.4   11.8   14.0   19.0   23.6     29
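
The same call extends to several factors at once, for example when one stratum is nested within or crossed with another. The sketch below assumes a made-up data frame (plots) with two factors and two measured variables; it is not course data.

```r
# Hypothetical toy data: forb and grass counts by habitat and grazing treatment
plots <- data.frame(habitat = c("A", "A", "B", "B", "A", "B"),
                    grazing = c("low", "high", "low", "high", "low", "high"),
                    forbs   = c(10, 4, 8, 2, 12, 6),
                    grass   = c(20, 30, 25, 35, 22, 31))

# Two named factors in by = list(); mean of two variables per combination of levels
agg <- aggregate(plots[, c("forbs", "grass")],
                 by = list(habitat = plots$habitat, grazing = plots$grazing),
                 FUN = mean)
agg  # one row per habitat x grazing combination present in the data
```

Unlike tapply(), the result is a data frame with one column per factor and one per summarized variable, which is often easier to save or merge with other data.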

Summary of Module 4.4 functions

Basic calls related to splitting data are:

apply() => Summarize data by rows or columns; numeric
lapply() => Apply function to data objects; creates a list
sapply() => Apply function to numeric or non-numeric data objects
tapply() => Apply function to a single variable by ≥1 factors, each with ≥1 levels
aggregate() => Apply function to multiple variables by ≥1 factors, each with ≥1 levels

Exercise #14

Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.

Import the dataset gsg_leks from the dwr_data directory. This dataset consists of lek complexes, leks within those complexes, and counts of males on those leks by year of survey. Disturbance (Y,N) is also recorded.

Compute general statistics min, max, mean, and sd of tot_male for:
- lek_id within complex
- disturbance; and
- year within complex
Use any of the appropriate R data splitting functions, and assign logical column names where it seems appropriate.
Save the last of the 3 analyses (year within complex) as separate .RData objects for each of the computed statistics (there will be 4).


END MODULE 4.4