Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016
These structures are collectively referred to as R objects. R objects are the basis of all data management and analysis in R.
A scalar is an object (by analogy, think variable) to which a value has been assigned. The value can be simple or the result of a complex formula; it is commonly referred to as a 0-dimensional array, or point value. It can be any of the R data classes (see Module 3.2).
Some examples of scalars:
# some scalars of different classes
s1 <- 3.14 # a numeric scalar
s2 <- "Pinus edulis" # a character scalar
s3 <- TRUE # a logical scalar
ls() # return objects in current workspace
## [1] "s1" "s2" "s3"
s1 # recall object s1
## [1] 3.14
s2 # recall object s2
## [1] "Pinus edulis"
s3 # recall object s3
## [1] TRUE
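R stores a scalar as a vector of length 1 rather than as a separate type. The short sketch below checks this with class(), length(), and is.vector(); the object names simply repeat those used above.

```r
# scalars are vectors of length 1
s1 <- 3.14            # numeric
s2 <- "Pinus edulis"  # character
s3 <- TRUE            # logical

class(s1)      # "numeric"
class(s2)      # "character"
class(s3)      # "logical"
length(s1)     # 1
is.vector(s1)  # TRUE
```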
A vector is a 1-dimensional representation of values having numerical, character, or logical class. Vectors are built using the c(InputValues) call, where InputValues are the values that will become the vector. Values within c() are separated by , (comma) during entry.
Some examples of vectors:
# some vectors of different classes
v1 <- c(1, 2, 3, 4) # a numeric vector
v2 <- c("A", "B", "C", "D") # a character vector
v3 <- c(TRUE, FALSE, FALSE, FALSE) # a logical vector
v1 # recall object v1
## [1] 1 2 3 4
v2 # recall object v2
## [1] "A" "B" "C" "D"
v3 # recall object v3
## [1] TRUE FALSE FALSE FALSE
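Because a vector holds a single class, mixing classes inside c() does not fail; R silently converts every value to one common class. A small sketch of that behavior (the object name mixed is only illustrative):

```r
# mixing classes in c() coerces all values to a common class
mixed <- c(1, "A", TRUE)
mixed              # "1" "A" "TRUE"
class(mixed)       # "character"

c(1, TRUE, FALSE)  # logicals become numeric: 1 1 0
```

Checking class() after building a vector is a quick way to catch unintended coercion.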
You can build a vector of specified length using rep(Value, NumberReps), where Value is a specified value (e.g., 0 or “p”) and NumberReps is the number of elements (i.e., times repeated) in the vector.
rep(0, 10) # vector of 10 zeros; no assignment so returned to console window
## [1] 0 0 0 0 0 0 0 0 0 0
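A few more rep() variations may help, sketched here with illustrative values. The times = and each = arguments are part of base R's rep(); the object name z is an assumption for the example.

```r
# rep() with a character value and with a vector of values
rep("p", 3)              # "p" "p" "p"

z <- rep(0, 10)          # assign the vector of zeros to an object
length(z)                # 10

rep(c(1, 2), times = 3)  # repeat the whole vector: 1 2 1 2 1 2
rep(c(1, 2), each = 3)   # repeat each element:     1 1 1 2 2 2
```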
The matrix is a 2-dimensional array of data of \(r \times c\) size (row by column). Basic matrix() syntax is:
matrix(object, nrow = r, ncol = c, dimnames = list(c("row_names"), c("col_names")))
where object is the data used to fill the matrix (typically a vector), nrow = and ncol = are the number of rows and columns, respectively, and dimnames = is a list of row and column names. Note that dimnames = is optional and is not required to create a matrix. The minimum syntax is matrix(nrow = r, ncol = c), where r and c are supplied by the user. All values in a matrix must be of the same mode (e.g., numeric, character, or logical); R coerces mixed values to a single common mode rather than keeping them as entered.
Some examples of matrices:
# matrices - type all cmds into console
matrix(nrow = 2, ncol = 3) # returns an empty matrix of NA
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
# empty matrix w/row & column names
matrix(nrow = 2, ncol = 3, dimnames = list(c("R1", "R2"), c("C1", "C2", "C3")))
## C1 C2 C3
## R1 NA NA NA
## R2 NA NA NA
# matrix m1 of 0 w/row & column names
m1 <- matrix(0, nrow = 2, ncol = 3, dimnames = list(c("R1", "R2"), c("C1", "C2", "C3")))
m1 # matrix of 0's; view
## C1 C2 C3
## R1 0 0 0
## R2 0 0 0
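By default, matrix() fills column by column (down the rows of the first column, then the second, and so on), so be careful populating a matrix with values from a vector. Set byrow = TRUE to fill row by row instead.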
# build a matrix from a vector
v4 <- c(17, 19, 20, 7, 45, 64) # note value assignment sequence
matrix(v4, nrow = 2, ncol = 3) # apply v4 to matrix; note row before column number sequence
## [,1] [,2] [,3]
## [1,] 17 20 45
## [2,] 19 7 64
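To see how fill order changes the result, the sketch below builds the same vector into a matrix twice, once with the default fill and once with byrow = TRUE. The object names m_col and m_row are illustrative only.

```r
# same vector, two fill orders
v4 <- c(17, 19, 20, 7, 45, 64)

m_col <- matrix(v4, nrow = 2, ncol = 3)                # default: fill down columns
m_row <- matrix(v4, nrow = 2, ncol = 3, byrow = TRUE)  # fill across rows

m_col[1, ]  # first row: 17 20 45
m_row[1, ]  # first row: 17 19 20
m_row[2, ]  # second row: 7 45 64
```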
An array is an n-dimensional matrix of data. The array() call syntax is similar to the matrix() syntax:
array(object, dim = c(r, c, d), dimnames = list(c("r_names"), c("c_names"), c("d_names")))
where object is the data used to fill the array, dim = assigns the dimensions r, c, and d, which are the sizes of the 1st, 2nd, and 3rd dimensions, respectively, and dimnames = is a list of dimension names. Note that dimnames = is optional and is not required to create an array. All values in an array must be of the same mode (e.g., numeric, character, or logical); R coerces mixed values to a single common mode.
# build a 4x3x2 array of values 1 to 24
# NOTE value assignment sequence
# dimension r [r,] first
# dimension c [,c] second
# dimension d [,,d] third
a1 <- array(1:24, dim = c(4, 3, 2))
a1 # note value 1:24 assignment sequence
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 13 17 21
## [2,] 14 18 22
## [3,] 15 19 23
## [4,] 16 20 24
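Individual values and whole slices of an array can be pulled out with [r, c, d] indexing. A brief sketch using the same a1 array follows; the index choices are arbitrary examples.

```r
# index into a 4x3x2 array
a1 <- array(1:24, dim = c(4, 3, 2))

dim(a1)       # 4 3 2
a1[2, 3, 1]   # row 2, column 3, slice 1: 10
a1[4, 1, 2]   # row 4, column 1, slice 2: 16
a1[, , 2]     # the entire second 4x3 slice
```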
The dataframe is probably the most commonly used data class in R, given that it is closely aligned with the concept of a spreadsheet. Unlike the classes of vector, matrix, and array, the data.frame() can mix data modes. Thus, one column in a dataframe can be numeric, another integer, another factor, and so on. R dataframes have an implicit row numbering sequence as well.
# assorted vectors of different modes
v1 <- c(1, 2, 3, 4) # a numeric vector
v2 <- c("A", "B", "C", "D") # a character vector
v3 <- c(TRUE, FALSE, FALSE, FALSE) # a logical vector
# dataframe of different data modes
d1 <- data.frame(v1, v2, v3)
d1
## v1 v2 v3
## 1 1 A TRUE
## 2 2 B FALSE
## 3 3 C FALSE
## 4 4 D FALSE
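One way to confirm that a dataframe really holds mixed classes is to ask for the class of each column. The sketch below rebuilds the dataframe with explicit column names and sets stringsAsFactors = FALSE, which is an addition here: in R versions before 4.0 character columns are converted to factors by default, so the explicit setting keeps the result the same across versions.

```r
# inspect column classes in a dataframe
d1 <- data.frame(v1 = c(1, 2, 3, 4),
                 v2 = c("A", "B", "C", "D"),
                 v3 = c(TRUE, FALSE, FALSE, FALSE),
                 stringsAsFactors = FALSE)  # keep characters as characters

sapply(d1, class)  # v1 numeric, v2 character, v3 logical
dim(d1)            # 4 rows, 3 columns
d1$v2              # extract one column by name
d1[2, ]            # extract the second row
```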
The list() builds groupings of data structures. Simple lists can be all of the same class, such as the characters below in list.1, or compilations of objects having the same or different modes, as in list.2. Many functions in R return output as a list, and understanding how to access elements of a list is fundamental to R. Lists are also powerful data structures given their ability to organize mixtures of R data class and mode.
# a simple list
list.1 <- list("jumo", "jude", "juos")
list.1
## [[1]]
## [1] "jumo"
##
## [[2]]
## [1] "jude"
##
## [[3]]
## [1] "juos"
# a complex list using objects built from above
list.2 <- list(spp = s2, mat = m1, logical = v3, dat1 = d1) # list of objects
list.2
## $spp
## [1] "Pinus edulis"
##
## $mat
## C1 C2 C3
## R1 0 0 0
## R2 0 0 0
##
## $logical
## [1] TRUE FALSE FALSE FALSE
##
## $dat1
## v1 v2 v3
## 1 1 A TRUE
## 2 2 B FALSE
## 3 3 C FALSE
## 4 4 D FALSE
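Since accessing list elements is so central, here is a minimal sketch of the three usual access forms: $name, [["name"]] or [[position]], and single brackets. It rebuilds a shortened version of list.2 so the block runs on its own.

```r
# build a small named list, then access its elements
s2 <- "Pinus edulis"
m1 <- matrix(0, nrow = 2, ncol = 3)
v3 <- c(TRUE, FALSE, FALSE, FALSE)
list.2 <- list(spp = s2, mat = m1, logical = v3)

names(list.2)    # "spp" "mat" "logical"
list.2$spp       # by name with $: "Pinus edulis"
list.2[["mat"]]  # by name with [[ ]]: the matrix itself
list.2[[3]][1]   # by position, then first value: TRUE
list.2["spp"]    # single bracket returns a list of length 1
```

The distinction between [[ ]] (returns the element) and [ ] (returns a smaller list) is a frequent source of confusion, so it is worth testing both at the console.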
Factors in R are labels to identify groups of data, and can be considered analogous to levels in ANOVA or strata in a sample-survey design.
factor() converts any vector of numerical, character, or logical values into class factor, with the number of levels being the number of unique data groupings.
# a simple factor; note change of numerics to levels in the factor
factor(c(1, 3, 2, 1, 2, 3, 3, 1, 1)) # vector of numerical values as factor levels
## [1] 1 3 2 1 2 3 3 1 1
## Levels: 1 2 3
Labels can be assigned to factor levels using the labels = c("LevelNames") option. There is an important caution to be aware of when assigning labels: labels are matched to the sorted (alpha-numeric) sequence of the levels, not to the order in which values first appear in the vector.
In the example below, the sorted levels are 1, 2, 3, so the numeric 1 is assigned label “REF”, 2 is assigned “2N” because as the second level it matches the second label name, and 3 becomes “4N” even though 3 comes second in the vector sequence.
factor(c(1, 3, 2, 1, 2, 3, 3, 1, 1), labels = c("REF", "2N", "4N")) # w/labels
## [1] REF 4N 2N REF 2N 4N 4N REF REF
## Levels: REF 2N 4N
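If a different pairing of values and labels is wanted, one option is to state the level order explicitly with the levels = argument of factor(). The sketch below compares the default pairing with an explicit one; the object names x, f_default, and f_explicit are illustrative.

```r
# control which value receives which label
x <- c(1, 3, 2, 1, 2, 3, 3, 1, 1)

# default: labels follow the sorted levels 1, 2, 3
f_default <- factor(x, labels = c("REF", "2N", "4N"))

# explicit: labels follow the stated level order 1, 3, 2
f_explicit <- factor(x, levels = c(1, 3, 2), labels = c("REF", "2N", "4N"))

as.character(f_default)[2]   # x[2] is 3, labeled "4N"
as.character(f_explicit)[2]  # x[2] is 3, now labeled "2N"
table(x, f_explicit)         # cross-check values against labels
```

A table() of the original values against the factor is a quick way to verify that every value received the intended label.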
There are two common options used for two different types of dates and times:
as.Date() for year:month:day format
chron() (in package chron) for date and time format
The as.Date() call:
as.Date("year-month-day") is the basic R call, where year is given as the full 4 digits, month is 1:12, and day ranges from 1:28 to 1:31, depending on month and year.
Dates are often imported into R from .csv files generated using MS Excel, which has its own internal defaults for dates. For example, Excel automatically converts a date entered as 12-25-2000 or 12-25-00 into 12/25/2000.
Just about any date using characters in the month (e.g., 25 December 2000, 25Dec00, 25DEc2000, 25-Dec-2000) is automatically converted to an Excel date of 25-Dec-00. Consequently, Excel dates must be converted using the option format = in the as.Date() call before they become an R date. Failure to convert using the format = option will lead to an error.
Different date options for format = in as.Date() include:
Code | Format Style | Code | Format Style |
---|---|---|---|
%d | Day of month (numeric) | %B | Month (full name) |
%m | Month (numeric) | %y | Year (2 numeric) |
%b | Month (3-character) | %Y | Year (4 numeric) |
The correct R date for 25 December 2000 is 2000-12-25; the examples below show some different format = options to ensure an Excel-derived date converts to an R date.
as.Date("2000-12-25") # note output format mimics input because it is the correct R format
## [1] "2000-12-25"
The next two dates, 12/25/2000 and 25-Dec-00, are date formats that R will not recognize. Thus they require use of the format = option for proper conversion.
as.Date("12/25/2000", format = "%m/%d/%Y") # provide EXCEL format w/format = option; now correct R format
## [1] "2000-12-25"
as.Date("25-Dec-00", format = "%d-%b-%y") # provide EXCEL format w/format = option; now correct R format
## [1] "2000-12-25"
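The same conversion works on a whole vector of dates, and once converted the values behave as true dates. The sketch below uses two made-up Excel-style dates (hypothetical values, not from any dataset) and also shows that a format = string that does not match the input returns NA.

```r
# convert a vector of Excel-style dates, then use them as dates
excel_dates <- c("12/25/2000", "01/15/2001")  # hypothetical input values
r_dates <- as.Date(excel_dates, format = "%m/%d/%Y")

r_dates                      # "2000-12-25" "2001-01-15"
class(r_dates)               # "Date"
r_dates[2] - r_dates[1]      # time difference of 21 days
format(r_dates, "%d/%m/%Y")  # write back out as day/month/year text

# a format = string that does not match the input yields NA
as.Date("25-Dec-00", format = "%m/%d/%Y")  # NA
```

Checking for NA with is.na() after a conversion is a simple guard against a mismatched format = string.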
Date and time in package chron:
The chron package has greater flexibility for dates and time. The basic syntax is: chron(dates = D, times = T), where dates = and times = perform the conversion of D (a vector of dates in format month/day/4-number year) and T (a vector of time in format hour:min:sec) into a format that R understands. Note that hour uses the 24 hr clock.
Load the library chron. Make sure you have already installed chron in your personal library! If not, see Module 2.4.
library(chron) # load chron; if not installed see Module 2.4
# assume Excel-based dates
d <- dates(c("6/20/2015", "6/25/2015")) # convert date for chron
# assume some times
h <- times(c("17:10:00", "18:30:32")) # convert time for chron
# bind chron date:time as object
dh <- chron(dates = d, times = h)
# examine the outputs; d=chron date, h= chron time, dh= chron date:time
d; h; dh
## [1] 06/20/15 06/25/15
## [1] 17:10:00 18:30:32
## [1] (06/20/15 17:10:00) (06/25/15 18:30:32)
Once you have converted date and time to a chron-based format you can apply a variety of functions to calculate date:time differences. Two of the more commonly applied include julian and difftime, which calculate Julian days (the number of days since a specified origin date) and the difference between successive date:times, respectively.
Try those two functions out on the date:time objects just created. Remember to use help(FunctionName) to determine correct syntax for the functions.
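As a warm-up that does not require chron, the sketch below applies julian() and difftime() to base R Date and POSIXct objects holding the same dates and times. This is an assumption-laden stand-in: the object names and the choice of 1 January 2015 as origin are illustrative, and the syntax for chron objects may differ, so check help(julian) and help(difftime).

```r
# julian() and difftime() on base R dates (not chron objects)
day1 <- as.Date("2015-06-20")
day2 <- as.Date("2015-06-25")

julian(day2, origin = as.Date("2015-01-01"))  # 175 days since the origin
difftime(day2, day1, units = "days")          # time difference of 5 days

# date plus time, using an explicit time zone to avoid local-clock effects
t1 <- as.POSIXct("2015-06-20 17:10:00", tz = "UTC")
t2 <- as.POSIXct("2015-06-25 18:30:32", tz = "UTC")
difftime(t2, t1, units = "hours")             # about 121.34 hours
```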
Missing values are assigned NA for numerical, character, and logical data modes. You will occasionally encounter a NaN. NA generally represents a missing value while NaN is “not a number,” such as what happens when trying to divide 0 by 0.
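The difference between NA and NaN, and the way a single NA propagates through a calculation, can be seen in a few console lines. This is a minimal sketch; the vector x is illustrative.

```r
# NaN versus NA
0 / 0          # NaN
1 / 0          # Inf, not NaN
is.nan(0 / 0)  # TRUE
is.na(0 / 0)   # TRUE: is.na() also flags NaN
is.nan(NA)     # FALSE: NA is missing, not "not a number"

# one NA makes the result of many functions NA
x <- c(1, 2, NA, 4)
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2.333333, computed after dropping the NA
```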
NA causes problems in R for two basic reasons:
NA consistently.
is.na() returns logical (TRUE, FALSE) identifying missing values in an object.
v1 <- c(1, 2, NA, 4) # vector of numeric values with 3rd obs missing=NA
is.na(v1) # find missing values; note logical TRUE for 3rd obs
## [1] FALSE FALSE TRUE FALSE
Note that blanks for values meant to be missing in analysis are not interpreted as NA by R.
v2 <- c("X", "Y", "", "Z") # vector of characters with 3rd obs meant to be NA
is.na(v2) # find missing values; note logical FALSE for 3rd obs, i.e., not an R NA
## [1] FALSE FALSE FALSE FALSE
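One possible way to turn such blanks into true missing values is logical indexing, sketched below on the same vector. This is only one approach among several.

```r
# replace blank strings with NA using logical indexing
v2 <- c("X", "Y", "", "Z")
v2[v2 == ""] <- NA  # assign NA wherever the value is an empty string

v2         # "X" "Y" NA "Z"
is.na(v2)  # FALSE FALSE TRUE FALSE
```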
Some commonly used functions for eliminating NA from an object are:
na.omit(DataObject), which removes all rows with NA from the DataObject;
DataObject[complete.cases(DataObject), ], which operates like na.omit(); and
DataObject[, Columns2Remove], which removes specified columns if an NA is present
# dataframe w/NA
v3 <- data.frame(x = c(1, 2, NA), y = c(4, NA, 6), z = c(7, 8, 9)); v3
## x y z
## 1 1 4 7
## 2 2 NA 8
## 3 NA 6 9
na.omit(v3) # omit all rows w/NA
## x y z
## 1 1 4 7
v3[complete.cases(v3), ] # omit all rows w/NA (same as na.omit())
## x y z
## 1 1 4 7
v3[complete.cases(v3[, 1]), ] # omit rows w/NA in col=1
## x y z
## 1 1 4 7
## 2 2 NA 8
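The examples above drop rows. To drop columns that contain an NA instead, one option is to count the NA values per column and keep only the columns with none. The sketch below assumes the same v3 dataframe; the object name keep is illustrative.

```r
# drop columns that contain any NA
v3 <- data.frame(x = c(1, 2, NA), y = c(4, NA, 6), z = c(7, 8, 9))

colSums(is.na(v3))               # NA count per column: x = 1, y = 1, z = 0
keep <- colSums(is.na(v3)) == 0  # TRUE for columns with no NA
v3[, keep, drop = FALSE]         # only column z remains; still a dataframe
```

The drop = FALSE argument keeps the result as a dataframe even when a single column is left.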
External programs often have specific conventions for missing values that must be converted to NA in R. Some examples include the ArcGIS missing value -9999, the SAS missing value of ., or use of a blank (empty) in many spreadsheets.
We will explore methods for dealing with these different types of external missing values in Module 3.4 next.
Data class and mode calls include:
c() => Construct vector of values; must be same class
rep() => Create vector of set length with same value, class
matrix() => Build a 2-dimensional matrix; must be same class
array() => Build an n-dimensional array; must be same class
data.frame() => Create a dataframe; can combine different classes
list() => Combine different class and mode of data into list
The specialized class call for factor is:
factor() => Create a Factor containing specified levels
The specialized calls for date and time include:
chron() => Formats external date and time to R date and time (also a package name)
as.Date() => Convert external date format to R date format
julian() => Formats R date to Julian days since specific date
difftime() => Computes date:time differences between dates