Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016
R has considerable flexibility for creating data objects internally and by importing external data. However, not all options are equally flexible. The basic options for data input into R are outlined below.
The most common - and stable - external input form is the so-called “flat file” .csv format. Options for import do exist for data stored in EXCEL, SAS, SPSS, S+, and other formats, as well as from relational databases like ORACLE and ACCESS.
The EXCEL (.xls/.xlsx) formats are common data structures for biologists, and there is often a desire to import directly into R from .xls/.xlsx files. But … quoting directly from the R Development Team: “The first piece of advice is to avoid doing so if possible!”
Even though EXCEL options exist, and are introduced below, the most stable import format is .csv. So follow the advice of the R Development Team and use .csv rather than .xls/.xlsx if at all possible.
Inevitably, some of you will ask "why?" There are real reasons (not just my Dinosaur perspective) for this advice.
Note, however, that even though some of the R GUI/IDE shells have simple .xls/.xlsx imports, the advice still holds! Nonetheless, a basic recommendation is to save, organize, and even manipulate your data in EXCEL .xls/.xlsx if you desire, but convert to a .csv format for import into R.
ArcGIS data are often exported in several formats (e.g., .dbf in addition to .xlsx) that can cause non-fatal but often irritating problems when imported into R. For example, integer values in ArcGIS can come across as floating point and, in many cases, as character rather than numeric values. If exporting data from ArcGIS for use in R, I recommend that data be exported as .dbf rather than .xls; the R import of .dbf is easier than that of .xls/.xlsx, as noted below.
Specific R packages and functions are available for importing data from SAS, SPSS, Systat, Minitab, and ACCESS, among others. Example packages with import capabilities include foreign, RODBC, and gdata. Use help(package = "PackageName") or ??PackageName in R, or web searches on the package name, for additional guidance on how to use these packages.
Note that these packages and their options are not covered in this course. Additional information on data importation from a variety of other external databases (e.g., ORACLE) can be found at R-project.org.
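As a brief, hedged illustration of one of these packages (the file name mydata.sav is hypothetical), an SPSS import via foreign might look like:

```r
library(foreign)                # package with read.spss(), read.dbf(), etc.

# NOT RUN; import an SPSS .sav file as a data frame (file name is hypothetical)
mydat <- read.spss("mydata.sav", to.data.frame = TRUE)
```

The to.data.frame = TRUE option returns a data frame rather than a list, which is usually what you want for subsequent analysis.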
The generic call for import is read.FileType(), where FileType is the . (dot) extension representing the file type (e.g., .csv, .dbf). The read.FileType() calls differ chiefly in their default options. The basic read.FileType() calls that do not require a specific package are:
read.table(): default assumes no header in the data, and whitespace between values in each column.
read.delim(): default assumes a header in the data, and that tabs separate values in each column.
read.csv(): the most commonly used (and recommended) input call, because a .csv export is readily obtained from any EXCEL file.
Two other common imports requiring specific packages in your personal library are read.dbf() and read.xlsx(), with their associated packages foreign and xlsx, respectively.
Available options make these calls highly flexible and capable of handling many different data variations. A common option for most data is header = T, which indicates that the first line of the data contains (typically) the variable names; hence the "header." sep = indicates how the data are separated, with "" and "\t" representing whitespace and tabs, respectively.
Let’s import some data and assign them as objects. We use the data objects later in this module.
# check and ensure in correct data directory
# NOTE yours will be specific to you; this is mine as instructor (I'm long-winded ...)
getwd() # should be something like ../baseR-V2016.2/data/powerpoint_dat
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
# import whitespace-, tab-delim data
g1 <- read.table("g1.dat", header = T, sep = "")
# import tab-delimited data
m5 <- read.delim("m5.txt", header = T, sep = "\t")
# import comma .csv data
f1 <- read.csv("f1.csv", header = T)
These data now reside in your workspace as named objects g1, m5, and f1. Check this using the ls() call.
ls() # are the data objects g1, m5, & f1 present?
## [1] "f1" "g1" "m5"
Data file description lines:
Many .csv files contain additional information about the file in the first several lines. Below, for example, the first line in the data is "These are my mouse data," a (somewhat silly but) representative example of the circumstance where information on data characteristics, such as sample site location, investigator, or sample date, is often placed. These lines must be skipped or R will consider them "data" during the .csv import.
The option skip = N, where N is the number of lines to skip before the header, is used to ignore these kinds of description lines in .csv files. Let's see what happens if you forget to use skip =.
# .csv data; extra lines in file => corrupted dataframe since skip= not used
m6 <- read.csv("m6.csv", header = T)
head(m6) # examine the error
## These.are.my.mouse.data X X.1 X.2 X.3 X.4
## 1 catno sex elev conlen zygbre lstiob
## 2 9316 M 1878 22.37 12.64 4.83
## 3 17573 F 3230 12.38 4.28
## 4 17574 M 3230 11.75 4.45
## 5 17575 F 3047 20.41 11.44 4.36
## 6 17576 F 3047 21.7 12.06 4.51
# skip option used to ignore the description lines
m6 <- read.csv("m6.csv", header = T, skip = 1) # .csv data; skip= option
Note that the first column has been named "These.are.my.mouse.data", and the first row of "data" actually contains the column names you desire.
Blanks or other characters as missing values:
The data snippet shown above also has "blank" values under the variable conlen. These blanks represent missing values; that is, conlen was not measured for those specific observations. As noted in Module 3.3, R requires NA to correctly represent missing values. read.csv() automatically assigns NA to "blank".
head(m6) # correct dataframe since skip= used; NOTE also correct NA for missing
## catno sex elev conlen zygbre lstiob
## 1 9316 M 1878 22.37 12.64 4.83
## 2 17573 F 3230 NA 12.38 4.28
## 3 17574 M 3230 NA 11.75 4.45
## 4 17575 F 3047 20.41 11.44 4.36
## 5 17576 F 3047 21.70 12.06 4.51
## 6 17577 M 3047 21.54 11.92 4.45
But what if characters other than "blank" were used to represent missing values, such as the "." (dot)? If so, use na.strings = "MissingVarSymbol", where MissingVarSymbol is the string used in the .csv file to represent missing values.
# NOT RUN; shown as example
m6 <- read.csv("m6.csv", header = T, na.strings = ".") # make "." NA during import
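If a file mixes several placeholder symbols, na.strings also accepts a character vector, so all of them can be mapped to NA in one call (a sketch, not run; the "-999" code is a hypothetical example):

```r
# NOT RUN; map ".", empty strings, and -999 codes to NA during import
m6 <- read.csv("m6.csv", header = T, na.strings = c(".", "", "-999"))
```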
Importing data of type .dbf requires that the package foreign be installed in your library and loaded using library(foreign).
library(foreign) # package foreign for .dbf
p1 <- read.dbf("p1.dbf") # import .dbf data
By default read.dbf() converts all character variables to R factors. To retain these variables as class character rather than factor, use the option as.is = T.
p1 <- read.dbf("p1.dbf", as.is = T) # import .dbf data
The package xlsx (and its dependencies) must first be installed. In addition, xlsx requires a correct version of Java or it will not work. See MakingJavaWorkWithR-15Aug2016 for details on how to ensure the correct version of Java is operating on your computer.
The basic call is read.xlsx("FileName", SheetNameIndex, header = T). FileName requires the explicit .xls/.xlsx extension, and you must specify the EXCEL sheet either by name (SheetNameIndex) or by numerical sequence. Use the startRow = option to skip lines; it is similar to the skip = option in read.csv(). NOTE: startRow = can be buggy; if you need to skip rows, use the read.xlsx2() function rather than read.xlsx(). My experience is that read.xlsx2() is more stable when skipping lines.
library(xlsx) # load xlsx library
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 3.2.3
## Loading required package: xlsxjars
t1 <- read.xlsx("t1.xlsx", 1, header = T) # sheet by numerical sequence (1st)
t2 <- read.xlsx("t1.xlsx", "t1", header = T) # sheet by name inside quotes
w1 <- read.xlsx("w1.xls", 1, header = T) # older .xls files work too
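Because startRow = can be unreliable in read.xlsx(), here is a hedged sketch of the more stable read.xlsx2() alternative; it assumes a hypothetical sheet with one description line above the header (t3 is an illustrative object name, not one used elsewhere in this module):

```r
# NOT RUN; skip one description line, so reading starts at row 2 (the header)
t3 <- read.xlsx2("t1.xlsx", 1, startRow = 2, header = T)
```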
These data objects should now be in your workspace. Use ls() to check.
ls() # are t1, t2 & w1 in workspace?
## [1] "f1" "g1" "m5" "m6" "p1" "t1" "t2" "w1"
My bottom-line experience makes me uneasy about relying on the xlsx package, so I prefer to convert .xls/.xlsx to .csv formats prior to data import.
Once data are imported, check that the import occurred correctly. ls() will show whether the import created the named R object in your workspace. head(DataObject) returns the variable names and first 6 observations and is a quick way to see if the top data lines are correct; tail(DataObject) does the same for the last 6 observations. dim(DataObject) returns the number of rows and columns, which can then be checked against the size of your external data structure.
ls() # all data structures in workspace?
## [1] "f1" "g1" "m5" "m6" "p1" "t1" "t2" "w1"
head(m6) # head to check var names and some obs
## catno sex elev conlen zygbre lstiob
## 1 9316 M 1878 22.37 12.64 4.83
## 2 17573 F 3230 NA 12.38 4.28
## 3 17574 M 3230 NA 11.75 4.45
## 4 17575 F 3047 20.41 11.44 4.36
## 5 17576 F 3047 21.70 12.06 4.51
## 6 17577 M 3047 21.54 11.92 4.45
dim(p1) # do data dimensions match expectations?
## [1] 211836 26
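tail() was described above but not demonstrated; it mirrors head() for the bottom of the data:

```r
tail(m6) # check var names and the last 6 observations
```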
One of the more useful quick data checks is str(). This function returns information on the names, class, and mode of the imported data object. If anything seems wrong, such as a variable imported as character when it should be numeric, re-check the import call and adjust as necessary until names, class, and mode are all correct relative to your anticipated analysis.
str(m6) # check class of data and variables
## 'data.frame': 19 obs. of 6 variables:
## $ catno : int 9316 17573 17574 17575 17576 17577 17578 17579 17580 17581 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 2 1 1 2 1 1 2 1 ...
## $ elev : int 1878 3230 3230 3047 3047 3047 3047 3047 3047 3047 ...
## $ conlen: num 22.4 NA NA 20.4 21.7 ...
## $ zygbre: num 12.6 12.4 11.8 11.4 12.1 ...
## $ lstiob: num 4.83 4.28 4.45 4.36 4.51 4.45 4.13 4.26 4.32 3.94 ...
Basic read.FileType() calls are:
read.table() => Import whitespace-delimited data
read.delim() => Import generic tab-delimited data
read.csv() => Import comma-separated value data
read.dbf() => Import .dbf format data
read.xlsx() => Import .xls/.xlsx format data
Some frequently used options for read.*() calls are:
header = T => Reads header as column names
sep = "Delim" => Delim is how data are separated; whitespace "" or tab "\t"
as.is = T => Retains characters as character, not factor, during .dbf import
skip = N => N is the number of description lines above the data
na.strings = "Symbol" => Symbol is converted to NA during import
Import the datasets into your R workspace and assign each an R data object name.
The datasets are in: ../baseR-V2016.2/data/exercise_dat. Once the data have been imported using the correct import function, check each R data object against its external source.
HINT: Compare original file variable names with R data object file variables.
In addition, import the files below, which are found in your ../baseR-V2016.2/data/powerpoint_dat directory, and assign them the names as shown. Once the data are imported, save all seven as an .RData object called mod3data.RData. Make sure the .RData file is written to your ../baseR-V2016.2/data/powerpoint_dat directory.
We will be using these example data sets throughout all successive Modules, so you will need to be able to load them for every Module. See Module 2.5 if you need a refresher on how to save data objects as an .RData file.
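As a refresher sketch of that save() call (the object names shown are those used in this module; substitute the seven names you actually assigned during import):

```r
# NOT RUN; make sure getwd() points to ../baseR-V2016.2/data/powerpoint_dat first
save(g1, m5, f1, m6, p1, t1, t2, file = "mod3data.RData")
```

The saved objects can later be restored in any session with load("mod3data.RData").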