MODULE 3.4 Data Distinctions in R

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • How to import data into R



Let’s begin with some …

Common questions about importing external data into R:

  • From what external data management software can data be imported into R?

  • My data have been entered into an external data management software package - how do I import those data into R?

  • Are there any concerns about importing externally generated data into R?


Some Background on Data Input in R


R has considerable flexibility for creating data objects internally and by importing external data. However, not all options are equally flexible. Basic options for data input into R include:

  • Internal to R - useful for small data structures only
  • External input from non-R source

The most common - and stable - external input form is the so-called “flat file” .csv format. Options for import do exist for data stored in EXCEL, SAS, SPSS, S+, and other formats, as well as from relational databases like ORACLE and ACCESS.


If you insist on EXCEL … (but NOT!! Please!!)


The EXCEL (.xls/.xlsx) formats are common data structures for biologists, and there is often a desire to import directly into R from .xls/.xlsx files. But … quoting directly from the R Development Team: “The first piece of advice is to avoid doing so if possible!”

Even though EXCEL options exist, and are introduced below, the most stable import format is .csv. So follow the advice of the R Development Team and use .csv rather than .xls/.xlsx if at all possible.

Inevitably, some of you will ask “why?” Some (real! not just my Dinosaur perspective) reasons are:

  • Uncertainty in the importation and conversion process, e.g., embedded formulae where cell values cascade from one another and among EXCEL sheets; and
  • The need for additional external software, e.g., Perl for the R package gdata, Java for the R package xlsx.

Note, however, that even though some of the R GUI/IDE shells have simple .xls/.xlsx imports, the advice still holds!! Nonetheless, a basic recommendation is to save, organize, and even manipulate your data in EXCEL .xls/.xlsx if you desire, but convert to a .csv format for import into R.


The special case of GIS-derived data


ArcGIS data are often exported in several formats (e.g., .dbf in addition to .xlsx) that can cause non-fatal but often irritating problems when importing into R. For example, integer values in ArcGIS can come across as floating point and, in many cases, as character rather than numeric values. If exporting data from ArcGIS for use in R, I recommend that data be exported as .dbf rather than .xls/.xlsx; the R import of .dbf avoids the .xls/.xlsx problems noted above.
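A minimal sketch of the repair when an integer variable arrives as character (the object and variable names here are invented for illustration, not from the course data):

```r
# Hypothetical GIS export where the integer variable elev arrived as character
g <- data.frame(site = c("A", "B", "C"),
                elev = c("1878", "3230", "3047"),
                stringsAsFactors = FALSE)
str(g)                        # elev is chr, not int
g$elev <- as.numeric(g$elev)  # coerce back to numeric
# if elev had instead been imported as a factor, use:
# g$elev <- as.numeric(as.character(g$elev))
str(g)                        # elev is now num
```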


Data from other statistical packages


Specific R packages and functions are available for importing data from SAS, SPSS, Systat, Minitab, and ACCESS, as well as others. Example packages with import capabilities include foreign, RODBC and gdata. Use help(PackageName) or ??PackageName in R or web searches on the package name for additional guidance on how to use these packages.

Note that these packages and their options are not covered in this course. Additional information on data importation for a variety of other external databases, e.g., ORACLE, can be found at R-project.org.


Basic data import functions


The generic call for import is read.FileType(), where FileType is the . (dot) extension representing the file type (e.g., .csv, .dbf). Differences in options among the read.FileType() calls are based on:

  • The underlying data source / structure, i.e., the FileType; and
  • How headers are handled.

The basic read.FileType()’s not requiring a specific package are:

  • read.table(): default assumes no header in data, and whitespace between values in each column.
  • read.delim(): default assumes a header in data, and that tabs separate values in each column.
  • read.csv(): is the most commonly used (and recommended) input call, because .csv is an export format readily obtained from any EXCEL file.

Two other common import functions, read.dbf() and read.xlsx(), require the specific packages foreign and xlsx, respectively, in your personal library.

Available options make these calls highly flexible and capable of handling many different data variations. A common option for most data is header = T, which indicates that the first line of the data contains (typically) the variable names; hence the “header.” The sep = option indicates how the data are separated, with "" and "\t" representing whitespace and tabs, respectively.
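As a self-contained sketch of these options (the file and variable names are invented for illustration), a temporary whitespace-delimited file can show header = T and sep = "" in action:

```r
# Write a small whitespace-delimited snippet to a temporary file,
# then import it with read.table()
tmp <- tempfile(fileext = ".dat")
writeLines(c("catno sex elev",
             "9316 M 1878",
             "17573 F 3230"), tmp)
d <- read.table(tmp, header = T, sep = "")  # header = T reads line 1 as names
names(d)  # "catno" "sex"   "elev"
dim(d)    # 2 observations, 3 variables
```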

Let’s import some data and assign them as objects. We use the data objects later in this module.

# check and ensure in correct data directory
# NOTE yours will be specific to you; this is mine as instructor (I'm long-winded ...)
getwd()  # should be something like ../baseR-V2016.2/data/powerpoint_dat
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
# import whitespace-, tab-delim data
g1 <- read.table("g1.dat", header = T, sep = "")
# import tab-delimited data
m5 <- read.delim("m5.txt", header = T, sep = "\t")
# import comma .csv data
f1 <- read.csv("f1.csv", header = T)

These data now reside in your workspace as named objects g1, m5, and f1. Check this using the ls() call.

ls()  # are the data objects g1, m5, & f1 present?
## [1] "f1" "g1" "m5"

Some important and frequently used import options


Data file description lines:
Many .csv files contain additional information about the file in the first several lines. Below, for example, the first line in the data is “These are my mouse data,” a (somewhat silly but) representative example of the kind of information, such as sample site location, investigator, or sample date, often placed above the data. These lines must be skipped, or R will treat them as “data” during the .csv import.

The option skip = N, where N is the number of lines to skip before the header, is used to ignore these kinds of description lines in *.csv files. Let’s see what happens if you forget to use skip =.

# .csv data; extra lines in file => ERROR: corrupted dataframe since skip= not used
m6 <- read.csv("m6.csv", header = T)
head(m6) # examine the error
##   These.are.my.mouse.data   X  X.1    X.2    X.3    X.4
## 1                   catno sex elev conlen zygbre lstiob
## 2                    9316   M 1878  22.37  12.64   4.83
## 3                   17573   F 3230         12.38   4.28
## 4                   17574   M 3230         11.75   4.45
## 5                   17575   F 3047  20.41  11.44   4.36
## 6                   17576   F 3047   21.7  12.06   4.51
# skip option used to ignore the description lines
m6 <- read.csv("m6.csv", header = T, skip = 1)  # .csv data; skip= option

Note that your first column has been named “These.are.my.mouse.data”, and the first row of “data” actually contains the column names you desire.

Blanks or other characters as missing values:
The data snippet shown above also has “blank” values under the variable conlen. These blanks represent missing values; that is, data were not measured on conlen for those specific observations. As noted in Module 3.3, R requires NA to represent missing values correctly. read.csv() automatically assigns NA to “blank” values.

head(m6)  # correct dataframe since skip= used; NOTE also correct NA for missing
##   catno sex elev conlen zygbre lstiob
## 1  9316   M 1878  22.37  12.64   4.83
## 2 17573   F 3230     NA  12.38   4.28
## 3 17574   M 3230     NA  11.75   4.45
## 4 17575   F 3047  20.41  11.44   4.36
## 5 17576   F 3047  21.70  12.06   4.51
## 6 17577   M 3047  21.54  11.92   4.45

But what if a character other than “blank” was used to represent missing values, such as the “.” (dot)? If so, use the option na.strings = "MissingVarSymbol", where MissingVarSymbol is the string used in the .csv file to represent missing values.

# NOT RUN; shown as example
m6 <- read.csv("m6.csv", header = T, na.strings = ".") # make "." NA during import
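Since the call above is not run, here is a self-contained sketch of the same idea using a temporary file (values invented for illustration):

```r
# Temporary .csv where "." marks the missing conlen value
tmp <- tempfile(fileext = ".csv")
writeLines(c("catno,conlen",
             "9316,22.37",
             "17573,."), tmp)
d <- read.csv(tmp, header = T, na.strings = ".")
d$conlen  # 22.37    NA  -- the "." became NA on import
```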

Importing .dbf data


Importing data type .dbf requires the package foreign be installed in your library and loaded using library(foreign).

library(foreign)  # package foreign for .dbf
p1 <- read.dbf("p1.dbf")  # import .dbf data

By default read.dbf() converts all character variables to R factors. To retain these variables as class character and not factor, use the option as.is = T.

p1 <- read.dbf("p1.dbf", as.is = T)  # import .dbf data
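A self-contained sketch of the difference (using foreign's write.dbf() to create a throwaway .dbf first, so no external file is needed):

```r
library(foreign)
# Create a throwaway .dbf so the example is self-contained
tmp <- tempfile(fileext = ".dbf")
write.dbf(data.frame(sex = c("M", "F")), tmp)

p_default <- read.dbf(tmp)          # default: character -> factor
class(p_default$sex)                # "factor"
p_char <- read.dbf(tmp, as.is = T)  # as.is = T keeps class character
class(p_char$sex)                   # "character"
```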

If You Insist on .xls/.xlsx Files …


The package xlsx (and its dependencies) must first be installed. In addition, xlsx requires a correct version of Java or it will not work. See MakingJavaWorkWithR-15Aug2016 for details on how to ensure the correct version of Java is operating on your computer.

The basic call is read.xlsx("FileName", SheetNameIndex, header = T). FileName requires the explicit .xls/.xlsx extension, and you must specify the EXCEL sheet either by name (SheetNameIndex) or by numerical sequence. Use the startRow = option to skip lines; it is similar to the skip = option in read.csv(). NOTE: startRow = can be buggy; if you need to skip rows, use read.xlsx2() rather than read.xlsx(). My experience is that read.xlsx2() is more stable when skipping lines.

library(xlsx)  # load xlsx library
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 3.2.3
## Loading required package: xlsxjars
t1 <- read.xlsx("t1.xlsx", 1, header = T)  # sheet by numerical sequence (1st)
t2 <- read.xlsx("t1.xlsx", "t1", header = T)  # sheet by name inside quotes
w1 <- read.xlsx("w1.xls", 1, header = T)  # older .xls files work too

These data objects should now be in your workspace. Use ls() to check.

ls()  # are t1, t2 & w1 in workspace?
## [1] "f1" "g1" "m5" "m6" "p1" "t1" "t2" "w1"

My bottom-line experience makes me uneasy using the xlsx package, so I prefer to convert .xls/.xlsx to .csv formats prior to data import.


Checking your data import


Once data are imported, check that the import occurred correctly.

ls() will show if the import created the named R object in your workspace. head(DataObject) returns the variable names and first 6 observations, a quick way to check that the top of the data is correct; tail(DataObject) does the same for the last 6 observations. dim(DataObject) returns the number of rows and columns, which can then be checked against the size of your external data structure.

ls()  # all data structures in workspace?
## [1] "f1" "g1" "m5" "m6" "p1" "t1" "t2" "w1"
head(m6)  # head to check var names and some obs
##   catno sex elev conlen zygbre lstiob
## 1  9316   M 1878  22.37  12.64   4.83
## 2 17573   F 3230     NA  12.38   4.28
## 3 17574   M 3230     NA  11.75   4.45
## 4 17575   F 3047  20.41  11.44   4.36
## 5 17576   F 3047  21.70  12.06   4.51
## 6 17577   M 3047  21.54  11.92   4.45
dim(p1)  # do data dimensions match expectations?
## [1] 211836     26

One of the more useful quick data checks is str(). This function returns the names, class, and mode of the imported data object. If anything seems wrong, such as a variable being imported as character when it should be numeric, re-check the import call and adjust as necessary until names, class, and mode are all correct relative to your anticipated analysis.

str(m6)  # check class of data and variables
## 'data.frame':    19 obs. of  6 variables:
##  $ catno : int  9316 17573 17574 17575 17576 17577 17578 17579 17580 17581 ...
##  $ sex   : Factor w/ 2 levels "F","M": 2 1 2 1 1 2 1 1 2 1 ...
##  $ elev  : int  1878 3230 3230 3047 3047 3047 3047 3047 3047 3047 ...
##  $ conlen: num  22.4 NA NA 20.4 21.7 ...
##  $ zygbre: num  12.6 12.4 11.8 11.4 12.1 ...
##  $ lstiob: num  4.83 4.28 4.45 4.36 4.51 4.45 4.13 4.26 4.32 3.94 ...

Summary of Module 3.4 functions


Basic read.FileType() calls are:

  • read.table() => Import whitespace-delimited data
  • read.delim() => Import generic tab-delimited data
  • read.csv() => Import comma separated value data
  • read.dbf() => Import .dbf format data
  • read.xlsx() => Import .xls/.xlsx format data

Some frequently used options for read.*() calls

  • header = T => Reads header as column names
  • sep = "Delim" => Delim is how data are separated; whitespace "" or tab "\t"
  • as.is = T => Retains characters as character not factor during .dbf import
  • skip = N => N is the number of description lines above data
  • na.strings = "Symbol" => Symbol is converted to NA during import

Exercise #6


Import into your R workspace the datasets:

  • coyotebehav.xlsx;
  • zapusmorph.csv;
  • fish_recapture.xlsx; and
  • tmax_all.dbf

and assign each an R data object name.

The datasets are in: ../baseR-V2016.2/data/exercise_dat. Once the data have been imported using the correct import function:

  • Check the data characteristics of each dataset
  • What are the variable names in each data object?
  • What is the data dimension of each data object?
  • Anything weird seem to happen with any of the data sets?

HINT: Compare original file variable names with R data object file variables.

In addition, import the files below, which are found in your ../baseR-V2016.2/data/powerpoint_dat directory. Assign them the names shown. Once the data are imported, save all seven as an .RData object called mod3data.RData. Make sure the .RData file is written to your ../baseR-V2016.2/data/powerpoint_dat directory.

  • m1 => subset of the zapusmorph file
  • m2 => subset of the zapusmorph file
  • m3 => subset of the zapusmorph file
  • m4 => subset of the zapusmorph file
  • t1 => topography and Juniperus (a conifer) presence/absence
  • w1 => max temperature and presence/absence
  • f1 => fish species captures (tallies) by sex

We will be using these example data sets throughout all successive Modules, so you will need to be able to load them for every Module. See Module 2.5 if you need a refresher on how to save data objects as a .RData file.
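A minimal sketch of the save step (the object contents below are stand-ins; in the exercise they are your seven imported datasets, and the file should be written to your ../baseR-V2016.2/data/powerpoint_dat directory rather than tempdir()):

```r
# Stand-in objects; in the exercise these are your seven imported datasets
m1 <- m2 <- m3 <- m4 <- data.frame(x = 1)
t1 <- w1 <- f1 <- data.frame(x = 2)

# Bundle all seven into a single .RData file
out <- file.path(tempdir(), "mod3data.RData")
save(m1, m2, m3, m4, t1, w1, f1, file = out)

# Later sessions can restore all seven at once:
# load(out)
```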


END MODULE 3.4

