MODULE 3.8 Data Checking in R

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016

Objective:

How to check data in objects for anomalies

Let’s begin by exploring …

Some learning questions about checking data in R:

I’ve instituted data input controls on my external data management software - why should I worry about data checking once my data are imported into R?
Is there a set of basic R calls that can help check my data for anomalies?

Some Background - Why You Should Care

Four reasons to never assume data to be analyzed are correct:

Number of observations and variable names may be incomplete / wrong
Coding errors during entry, e.g., character transposition
Data class / mode from import may be incorrect for R analysis
Technicians. Technicians. Technicians. Technicians. (And an occasional hungry dog …)

Some Initialization …

Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them into your workspace first.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?

## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"

  list.files(pattern = "\\.RData$") # is it there ? (pattern is a regular expression)

## [1] "mod3data.RData"

  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?

## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.
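
A slightly more defensive version of the load step might look like the sketch below. It is an illustration only: the objects a1 and b1 and the temporary file are hypothetical stand-ins for the Exercise #6 objects and mod3data.RData, and loading into a separate environment is one possible choice, not something the course requires.

```r
# hypothetical stand-ins for the Exercise #6 objects
a1 <- data.frame(id = 1:3, sex = c("M", "F", "M"))
b1 <- c(10.5, 11.2, 9.8)

# save to a temporary file so the sketch runs in any directory
tmp <- tempfile(fileext = ".RData")
save(a1, b1, file = tmp)

# confirm the file exists before attempting to load it
if (!file.exists(tmp)) stop("file not found: ", tmp)

# load into a fresh environment so nothing in the workspace is overwritten
e <- new.env()
loaded <- load(tmp, envir = e)  # load() returns the names of the loaded objects
loaded      # which objects came in ?
dim(e$a1)   # quick check on one of them
```

Because load() silently overwrites any workspace object with the same name, capturing its return value is a cheap way to see exactly what a file brought in.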

Checking Characteristics of Your R Data Object

Immediately after import, ALWAYS check your data object and determine:

The dimensions of the object, i.e., rows, columns, using dim(DataObject)
The variable names using names(DataObject) or ls(DataObject) (note that ls() lists them alphabetically rather than in column order); and
The characteristics of the data object with str(DataObject) (easiest), or by using calls like class(DataObject), mode(DataObject), and class(DataObject$VariableName)

# assume t1 from Exercise #6
dim(t1)  # dimension [rows x columns] of t1

## [1] 10  8

names(t1)  # variable names in t1 in true sequence

## [1] "locid"   "genus"   "epithet" "elev"    "aspect"  "slope"   "rough"  
## [8] "presabs"

str(t1)  # what are classes of vars in t1?

## 'data.frame':    10 obs. of  8 variables:
##  $ locid  : int  1 2 3 4 5 6 7 8 9 10
##  $ genus  : Factor w/ 1 level "Juniperus": 1 1 1 1 1 1 1 1 1 1
##  $ epithet: Factor w/ 1 level "monosperma": 1 1 1 1 1 1 1 1 1 1
##  $ elev   : int  1304 1623 1300 1166 1778 1644 1340 1378 1691 1184
##  $ aspect : num  25 30.6 28.9 67.3 62.8 ...
##  $ slope  : num  1.03 2.19 0.52 1.87 1.87 ...
##  $ rough  : num  5.16 8.64 2.85 7.14 13.72 ...
##  $ presabs: int  0 0 0 0 0 1 1 1 1 1

class(t1)  # class of t1

## [1] "data.frame"

class(t1$genus)  # class of genus in t1; replace for others as found in names()

## [1] "factor"
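
When a data object has many variables, calling class() one column at a time gets tedious. A minimal sketch of one alternative follows, assuming a small invented data frame (toy) that borrows a few column names from t1 purely for illustration.

```r
# hypothetical toy data frame; values are invented for illustration
toy <- data.frame(
  locid   = 1:4,
  genus   = factor(rep("Juniperus", 4)),
  elev    = c(1304L, 1623L, 1300L, 1166L),
  slope   = c(1.03, 2.19, 0.52, 1.87),
  presabs = c(0L, 0L, 1L, 1L)
)

dim(toy)                         # rows, columns
var_class <- sapply(toy, class)  # class of every variable in one call
var_class

# names of the variables that are neither numeric nor integer
names(toy)[!sapply(toy, is.numeric)]
```

Be aware that sapply() returns a list rather than a named character vector if any column carries more than one class (an ordered factor, for example), so str() remains the safer first look.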

Checking for character and factor miscodes:
Another call, unique(DataObject$ObjectVariable), is very useful when applied to variables of class factor or character. It returns the unique values in the data, and is a quick means of determining whether there are any miscodes.

# a miscode ("pim0" rather than "pimo") is deliberately introduced below
t <- c("pied", "pimo", "pim0", "pimo", "pimo", "pied") # vector spp codes
unique(t)  # returns all unique character codes; NOTE miscode "pim0"

## [1] "pied" "pimo" "pim0"

You can also apply unique() to columns in a data object.

# apply unique() to a column in a data object
unique(m1$sex)  # returns all unique character codes, here M and F

## [1] M F
## Levels: F M
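
unique() shows which codes occur but not how often. As a hedged sketch, table() adds the counts, and comparing against a list of legal codes flags anything unexpected. The vector spp and the list valid are assumptions made up for this example.

```r
# hypothetical species codes with one deliberate miscode ("pim0")
spp <- c("pied", "pimo", "pim0", "pimo", "pimo", "pied")
valid <- c("pied", "pimo")          # assumed list of legal codes

counts <- table(spp)                # frequency of each unique code
counts                              # a code seen only once is worth a second look

bad <- setdiff(unique(spp), valid)  # codes that are not in the legal list
bad
n_bad <- sum(!(spp %in% valid))     # number of entries carrying an illegal code
n_bad
```

A rare code is not automatically wrong, but a code that appears once among hundreds of records is a common signature of a typing error.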

Checking for obvious numerical “outliers”:
You can apply the functions min(DataObject$Variable) and max(DataObject$Variable) to see if any numerical values are “outliers” and fall outside of an expected range of measurements.

y <- c(1, 7, 5, 6, 0, 3, 9, 68, 5, 3)  # assume values in y should be 0<=y<=10
min(y)  # check for min value; None <0

## [1] 0

max(y)  # check for max value; NOTE max of 68 >10

## [1] 68
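
The two calls above can be combined, and the expected limits tested directly. The sketch below assumes the same values as y (stored here under the hypothetical name y2) and limits of 0 and 10; the names lo and hi are likewise invented for the example.

```r
y2 <- c(1, 7, 5, 6, 0, 3, 9, 68, 5, 3)  # same values as y; expect 0 <= y2 <= 10
lo <- 0                                  # assumed lower limit
hi <- 10                                 # assumed upper limit

rng <- range(y2)    # min and max in a single call
rng
summary(y2)         # quartiles and mean also hint at extreme values

out_of_range <- y2 < lo | y2 > hi  # TRUE wherever a value violates the limits
any(out_of_range)   # is anything outside the limits ?
sum(out_of_range)   # how many values are outside ?
```

If a variable contains missing values, min(), max(), and range() all return NA unless na.rm = TRUE is supplied, which is itself a useful signal that NAs are present.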

While unique() can be applied to a column of numerics or integers, it merely returns every distinct measurement with the duplicates dropped. Consequently it is rarely used for this purpose.

However, pairing a relational operator with which() goes beyond a simple listing and indicates the observations that meet one or more conditions.

# find all elev < 3000 m in dataset m1
which(m1$elev < 3000)  # returns observation row numbers meeting condition

## [1]  1 12 13
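
Conditions can also be joined, for example to flag values that are either too low or too high in one pass. The sketch below uses an invented data frame (dat) and assumed elevation limits of 0 and 4500 m; neither comes from the course data.

```r
# hypothetical data frame; elev values are invented for illustration
dat <- data.frame(
  site = c("a", "b", "c", "d", "e"),
  elev = c(2950, 3120, -5, 3300, 9999)
)

# assumed plausible limits (m) for this invented example; | means "or"
rows <- which(dat$elev < 0 | dat$elev > 4500)
rows          # row numbers of the suspect observations
dat[rows, ]   # inspect the full records for the flagged rows
```

Looking at the whole record, not just the offending value, often reveals the cause, such as a shifted column or a missing-value code like 9999.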

We will learn more about operators in Module 4.1.

Some simple methods for correcting miscodes:
I digress momentarily.

Correcting data is both a philosophical and operational issue. One doesn’t “correct” data in the sense that a measurement is erased and replaced with a new value. This is particularly true regarding the so-called “raw” or “archival” data collection sheets. Even in today’s digital age of logger-collected data, the original should be archived, but never altered.

It is valid to go back and determine whether an anomaly occurred - loss of power to a solar-powered data logger, or a length measurement that was obviously made in English rather than metric units (the funniest example of this latter error will forever be linked to a Mars Orbiter in 1999). In these cases note(s) should be made about possible sources of the presumed error, but the original data should never be altered.

Operationally - that is, once the data have been imported as an R data object - corrections based on obvious miscoding as above can and should be made. Other issues, such as how far outside the distribution of your sample data a single observation must be before it is an “outlier” worthy of correction, are more problematic. As a general rule of thumb, a measurement that seems an outlier but is within the range of conceivable possibility should be retained. Eliminating it could be construed as “cherry-picking”, a serious mistake in a sound and defensible scientific process.

Observation corrections in R range from complex to simple. In simple circumstances, the first step once a miscode like those above is detected is to determine where the error is in the data object. The call which(Condition2Meet) does this quite nicely, where “Condition2Meet” is a logical test for the miscode being searched for. which() returns the position(s) of the error(s) in a vector, or the row number(s) when applied to a column of a data object.

# let's find the error
which(t == "pim0"); t  # where is the "pim0" error ?

## [1] 3

## [1] "pied" "pimo" "pim0" "pimo" "pimo" "pied"

which(y == 68); y # determine location of error; returns location

## [1] 8

##  [1]  1  7  5  6  0  3  9 68  5  3
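
When the object being searched is a matrix rather than a vector, which() by default returns a single running position counted down the columns. A brief sketch, assuming an invented 3 x 2 matrix m, shows how the arr.ind argument returns the row and column instead.

```r
# hypothetical 3 x 2 matrix, filled column by column
m <- matrix(c(1, 7, 5, 68, 0, 3), nrow = 3)

idx <- which(m == 68)                    # running position down the columns
idx
pos <- which(m == 68, arr.ind = TRUE)    # row and column of the match
pos
```

which() also treats NA as "condition not met", so missing values are skipped silently and need their own check with is.na().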

The errors were located at positions 3 and 8 in the t and y vectors, respectively. Use a simple assignment for the selected location and replace.

# let's correct the error
y[8] <- 6  # correct error w/ explicit assignment; replace 68 with 6
y # check corrected object

##  [1] 1 7 5 6 0 3 9 6 5 3

t[3] <- "pimo"
t # check corrected object

## [1] "pied" "pimo" "pimo" "pimo" "pimo" "pied"
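
Hard-coding a position such as [3] works for a single known error, but a logical test replaces every occurrence at once and does not depend on row order. The sketch below is one possible approach, not the only one; spp_raw, spp_fix, and spp_fac are hypothetical names, and the raw vector is left untouched in keeping with the philosophy above.

```r
# hypothetical raw codes; this object is never modified
spp_raw <- c("pied", "pimo", "pim0", "pimo", "pimo", "pied")

spp_fix <- spp_raw                      # work on a copy
spp_fix[spp_fix == "pim0"] <- "pimo"    # replace every occurrence of the miscode
unique(spp_fix)                         # confirm the miscode is gone

# the same correction when the variable is a factor: rename the level
spp_fac <- factor(spp_raw)
levels(spp_fac)[levels(spp_fac) == "pim0"] <- "pimo"  # merges into existing level
levels(spp_fac)
```

The factor case matters because imported character columns are often stored as factors. Assigning a value that is not already one of the levels produces NA with a warning, and a corrected factor can keep an empty level for the old miscode unless the level itself is renamed or dropped with droplevels().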

We will see code for more complex operations, allowing not only for correction but also for collapsing variables or creating entirely new ones, in the Module 4 series of baseR.

Summary of Module 3.8 functions

Basic calls related to data checking are:

dim() => Returns data object row:column dimensions
names() => Returns names of variables in data object
str() => Returns class of all variables in data object
class() => Returns object class
mode() => Returns object mode
unique() => Returns all unique character strings or factors
min(), max() => Return smallest and largest values of a numeric variable
which(Condition2Meet) => Returns observation number (or row) meeting condition

Exercise #9

Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.

Import the file called grazing_error.txt.

How many observations and variables are in the file?
What are the variable names? Which are categorical and which are numeric?
Determine whether the first 4 variables are factor or numeric.
Are there any errors or anomalies, obvious or suspected, in the data? If so, identify these and correct them in a new R data object.
Hint: Any numeric errors should be obvious
How many total errors are there in the R data object?

Save this corrected dataset as grazing_impacts.RData
