MODULE 3.6 Data Coercing in R

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • How to change - that is, coerce - R data characteristics from one type to another



Let’s begin with …

Some learning questions about data coercing in R include:

  • Why is having the correct class and/or mode of an object important in R?

  • How do I determine the class and/or mode of objects in R?

  • How do I change an R object from one class to another?

  • How do I change an R object from one mode to another?


Some Background - Why You Should Care


Many analytical procedures in R require that objects be of a certain data class or mode. Data imported from external sources often arrive with a class or mode that the desired R function cannot use.

For example, an analysis of variance may require a variable class to be factor when data were read into the data object as class numeric. While we learned how to control some of this during data import (Module 3.4), there often arise situations where data characteristics need to be changed after the initial data importation.

Coercing is the name of this process in R.
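
A minimal sketch of why this matters, using made-up values (the names site, count, and site_f are hypothetical and not part of the course data). A grouping code stored as numeric is treated as a measurement, so a model fits a single slope; the same code stored as a factor is treated as a set of groups.

```r
# hypothetical data: 'site' is a grouping code that was read in as numeric
site  <- c(1, 1, 2, 2, 3, 3)
count <- c(12, 15, 30, 28, 7, 9)

class(site)                        # "numeric": codes are treated as measurements
length(coef(lm(count ~ site)))     # 2 coefficients: intercept plus one slope

site_f <- factor(site)             # coerced copy, stored under a new name
class(site_f)                      # "factor": codes are treated as groups
nlevels(site_f)                    # 3 groups
length(coef(lm(count ~ site_f)))   # 3 coefficients: one per group
```

The two model fits answer different questions, which is why the class of a variable should be checked before analysis.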


Some Initialization …


Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them into your workspace first.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?  
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
  list.files(pattern = "\\.RData$") # is it there ?  (pattern is a regular expression)
## [1] "mod3data.RData"
  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.


Determining Object Class or Mode Prior to Coercing


Data imported from external sources (e.g., with read.csv()) are assigned a specific R class depending on the nature of element values (i.e., the recorded measurements). Once imported, the data will have a mode as well.

Determining class:
Rule sets for this class assignment are not always obvious, but the easiest way to check is to apply the calls class(DataObject) or class(DataObject$VariableName).

These calls return the class of the data object and the class of the named variable in the data object, respectively.

# assume object f1 from Exercise #6
class(f1)  # class of data object
## [1] "data.frame"
names(f1)  # names in data object
## [1] "FishSpp" "Male"    "Female"
class(f1$FishSpp)  # class of var1
## [1] "factor"
class(f1$Male)  # class of var2
## [1] "integer"
class(f1$Female)  # class of var3
## [1] "integer"

In addition to calls that return the object class, data objects can be queried using is.OfClass(), where OfClass is a known R class (see Module 3.2). Each query returns a logical value, TRUE or FALSE. Some common queries are:

is.numeric() | is.character() | is.factor() | is.vector() | is.matrix() | is.data.frame()

# assume object f1 from Exercise #6
is.data.frame(f1)  # is f1 a dataframe?
## [1] TRUE
is.vector(f1$FishSpp)  # is f1$FishSpp a vector?
## [1] FALSE
is.numeric(f1$FishSpp)  # is f1$FishSpp numeric?
## [1] FALSE
is.factor(f1$FishSpp)  # is f1$FishSpp factor?
## [1] TRUE

Let’s learn a shortcut …
Rather than testing each variable independently as before, use str() to determine the class of all variables in a data object at once. This is much simpler than querying each variable one-by-one.

# assume object f1 from Exercise #6
names(f1)  # what are the variable names in f1?
## [1] "FishSpp" "Male"    "Female"
f1  # examine data
##   FishSpp Male Female
## 1 Sunfish   59     72
## 2    Bass   14     21
## 3    Shad  189    138
str(f1)  # find variable classes in data object f1
## 'data.frame':    3 obs. of  3 variables:
##  $ FishSpp: Factor w/ 3 levels "Bass","Shad",..: 3 1 2
##  $ Male   : int  59 14 189
##  $ Female : int  72 21 138

In the example above, str() returns output indicating the data object is of class data.frame, while the variables FishSpp, Male, and Female are of class Factor, int, and int (int is short for integer), respectively.
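
When only the classes are wanted, as a compact vector rather than the full str() display, one option is to apply class() to every variable with sapply(). The sketch below builds a small stand-in named f1_demo (a hypothetical name) that mimics f1, so it runs without mod3data.RData.

```r
# stand-in for f1; values copied from the f1 listing shown earlier
f1_demo <- data.frame(FishSpp = factor(c("Sunfish", "Bass", "Shad")),
                      Male    = c(59L, 14L, 189L),
                      Female  = c(72L, 21L, 138L))

# apply class() to each variable; returns a named character vector
sapply(f1_demo, class)
##   FishSpp      Male    Female
##  "factor" "integer" "integer"
```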

Determining mode:
Just as with class(), you can use mode() to ascertain the R mode of data objects and the elements they contain.

# assume object f1 from Exercise #6
mode(f1)  # mode of data object
## [1] "list"
mode(f1$FishSpp)  # mode of var1
## [1] "numeric"
mode(f1$Male)  # mode of var2
## [1] "numeric"
mode(f1$Female)  # mode of var3
## [1] "numeric"
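
The mode of "numeric" for the factor FishSpp may look surprising. A factor is stored as integer codes plus a set of character levels, so class() and mode() report different things. A short sketch, using a hypothetical vector spp with the same values as f1$FishSpp:

```r
# hypothetical factor with the same values as f1$FishSpp
spp <- factor(c("Sunfish", "Bass", "Shad"))

class(spp)               # "factor": how R functions treat the object
mode(spp)                # "numeric": how the values are stored
typeof(spp)              # "integer": the internal storage type
as.integer(spp)          # 3 1 2: the integer codes, one per element
levels(spp)              # "Bass" "Shad" "Sunfish": labels the codes point to
mode(as.character(spp))  # "character" once the labels are extracted
```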

Coercing Object Class and Mode


Data may be coerced to appropriate data class or mode using as.NewClassOrMode(), where some options for NewClassOrMode are:

as.numeric() | as.character() | as.factor() | as.vector() | as.matrix() | as.data.frame()

WARNING !!
as.NewClassOrMode() returns a coerced copy and leaves the original object unchanged, so the coercion is temporary unless the result is assigned. To make it permanent, the variable in the R data object, or the data object itself, must be overwritten.

# an example of temporary vs. permanent coercion
x <- c("pied", "pimo"); x  # simple vector of characters
## [1] "pied" "pimo"
is.factor(x)  # are values in x factors?
## [1] FALSE
str(x)  # what is class of x?
##  chr [1:2] "pied" "pimo"
as.factor(x)  # coerce into factors
## [1] pied pimo
## Levels: pied pimo
is.factor(x)  # factor yet?  why not?
## [1] FALSE
x <- as.factor(c("pied", "pimo")); x  # MUST make permanent as factor
## [1] pied pimo
## Levels: pied pimo
is.factor(x)  # finally !!
## [1] TRUE
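
The same overwrite rule applies to variables inside a data object. The sketch below uses a small made-up data frame (d_demo, plot, and elev are hypothetical names) and also shows a common trap: as.numeric() applied directly to a factor returns the internal integer codes, not the values shown as labels.

```r
# hypothetical data: 'plot' should be a factor, 'elev' should be numeric
d_demo <- data.frame(plot = c(10, 20, 20, 30),
                     elev = factor(c("1878", "3230", "3047", "3047")))

d_demo$plot <- as.factor(d_demo$plot)  # overwrite the variable to make it permanent
is.factor(d_demo$plot)                 # TRUE

as.numeric(d_demo$elev)                # 1 3 2 2: integer codes, NOT the elevations
as.numeric(as.character(d_demo$elev))  # 1878 3230 3047 3047: go through character first

d_demo$elev <- as.numeric(as.character(d_demo$elev))  # permanent coercion
str(d_demo)                            # confirm the new classes
```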

Class can also be specified during data import using the option colClasses = c() within the read.FileType() calls. Some basic, commonly used class options (see Module 3.2) are logical, integer, numeric, character, factor, and Date.

# set variable class during import; make sure you're in the data directory !!
m1c <- read.csv("m1.csv", header = TRUE,
  colClasses = c("factor", "factor", "integer", "numeric", "numeric", "numeric"))
str(m1c)  # check class in dataframe
## 'data.frame':    19 obs. of  6 variables:
##  $ catno : Factor w/ 19 levels "17573","17574",..: 19 1 2 3 4 5 6 7 8 9 ...
##  $ sex   : Factor w/ 2 levels "F","M": 2 1 2 1 1 2 1 1 2 1 ...
##  $ elev  : int  1878 3230 3230 3047 3047 3047 3047 3047 3047 3047 ...
##  $ conlen: num  22.4 NA NA 20.4 21.7 ...
##  $ zygbre: num  12.6 12.4 11.8 11.4 12.1 ...
##  $ lstiob: num  4.83 4.28 4.45 4.36 4.51 4.45 4.13 4.26 4.32 3.94 ...

Although class can be assigned during import, doing so for a dataset with many variables is cumbersome unless the import is a repeated exercise, such as weather data received on a daily basis. For one-time imports you are better off using str() to examine the imported data, and then adjusting class or mode as needed.
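
When only a few variables need a specific class, one way to reduce the typing is to pass colClasses as a named vector, so that unnamed columns are left for R to guess. The sketch below reads a few made-up rows from a text string rather than from m1.csv, so that it runs anywhere; csv_text and m_demo are hypothetical names, and the values are for illustration only.

```r
# made-up rows with the same first three column names as m1.csv
csv_text <- "catno,sex,elev
17573,F,3230
17574,M,3230
17575,F,3047"

# name only the columns to control; elev is left for R to determine
m_demo <- read.csv(text = csv_text, header = TRUE,
                   colClasses = c(catno = "factor", sex = "factor"))
str(m_demo)  # catno and sex are factors; elev is read as integer
```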


Summary of Module 3.6 functions


Basic calls related to class and mode identification and coercion are:

  • class() => Returns the class of an object or variable
  • str() => Displays the structure of an object, including the class of each variable
  • mode() => Returns the mode of an object or variable
  • is.ClassOrMode() => Tests whether an object is of the specified class or mode; returns TRUE or FALSE
  • as.NewClassOrMode() => Returns a copy of an object coerced to the specified class or mode

Exercise #7


Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.

Examine the zapusmorph.csv file. The “.” (periods) represent “missing values” in the input .csv file.

  • Convert all “.” in the data object to numeric NA (missing value) in R, not character (missing value). HINT: recall option for “.” when using read.FileType() call.
  • What is the mode of the variables with NA?
  • There are 5 variables in the data object built from coyotebehav.csv.
  • Which are numeric, and which are factors?
  • How many levels are in each of the variables that are factors?
  • Thinking logically, which of the variables that is not a factor should be coerced to a factor? Make this variable a permanent coercion in the R data object.

Two columns in the coyotebehav.csv data object are labeled “date” and “time.” These are EXCEL-based.

  • Convert the EXCEL date and time into three R formats: R date, R time, and R date:time
  • Save each date, time, and date:time as new variables in the R data object
  • Determine the number of Julian days between 01Jan1991 and the start (i.e., the first) date found in the coyote data object
  • Determine the number of hours and tenths of minutes between the beginning and end of the observation period

Challenge Exercise:
The data object built with fish_recapture.csv has a tag date and a recapture date by tag id (an individual fish).

  • Determine the days between tag and recapture for each fish

HINT: Each day will need to be subtracted from the next day in the date column. See if you can use some of what you learned in Module 3.5 to solve this problem.


END MODULE 3.6

