MODULE 3.3 Data Distinctions in R

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • Understand the distinctions among R data objects



Let’s begin by asking …

Some common questions about data distinctions:

  • What are the different types of R objects I must distinguish among?

    • What is a scalar?
    • What is a vector?
    • How does a matrix differ from an array?
    • What is a list?
    • What is a data.frame?
  • What are common specialized data classes in R?

  • Why is each important?

  • What is the factor in R?

  • How does R deal with date and time?

  • How are missing values NA handled in R?

These structures are collectively referred to as R objects. R objects are the basis of all data management and analysis in R.


The scalar


A scalar is an object (by analogy, think variable …) to which a single value has been assigned. The value can be simple or the result of a complex formula; it is commonly referred to as a 0-dimensional array, or point value. R stores a scalar as a vector of length 1. It can be any of the R data classes (see Module 3.2).

Some examples of scalars:

# some scalars of different classes
  s1 <- 3.14  # a numeric scalar
  s2 <- "Pinus edulis"  # a character scalar
  s3 <- TRUE  # a logical scalar
  ls()  # return objects in current workspace
## [1] "s1" "s2" "s3"
  s1  # recall object s1
## [1] 3.14
  s2  # recall object s2
## [1] "Pinus edulis"
  s3  # recall object s3
## [1] TRUE
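
The point that a scalar is simply a single stored value can be checked directly. The lines below are a minimal sketch; the object name sc is illustrative only and is not used elsewhere in this module.

```r
# a scalar is stored by R as a vector with exactly one element
sc <- 3.14       # illustrative name, not part of the module's objects
length(sc)       # number of elements held by the object
is.vector(sc)    # TRUE because a scalar is a length-1 vector
class(sc)        # class of the stored value
```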

The vector


A vector is a 1-dimensional representation of values having numerical, character, or logical class. Vectors are built using the c(InputValues) call, where InputValues are the values that will become the vector. Values within c() are separated by , (comma) during entry.

Some examples of vectors:

# some vectors of different classes
  v1 <- c(1, 2, 3, 4)  # a numeric vector
  v2 <- c("A", "B", "C", "D")  # a character vector
  v3 <- c(TRUE, FALSE, FALSE, FALSE)  # a logical vector
  v1 # recall object v1
## [1] 1 2 3 4
  v2 # recall object v2
## [1] "A" "B" "C" "D"
  v3 # recall object v3
## [1]  TRUE FALSE FALSE FALSE
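
A vector holds only one mode at a time. As a small sketch of what happens when modes are mixed inside c() (the names mix1 and mix2 are illustrative), R converts every element to one common mode instead of stopping:

```r
# mixing modes in c() coerces all elements to a single common mode
mix1 <- c(1, "A", TRUE)    # numeric + character + logical
mix2 <- c(1, TRUE, FALSE)  # numeric + logical
class(mix1)  # everything has become character
mix1
class(mix2)  # logicals have become numeric (TRUE = 1, FALSE = 0)
mix2
```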

You can build a vector of specified length using rep(Value, NumberReps), where Value is a specified value (e.g., 0 or “p”) and NumberReps is the number of elements (i.e., times repeated) in the vector.

  rep(0, 10)  # vector of 10 zeros; no assignment so returned to console window
##  [1] 0 0 0 0 0 0 0 0 0 0
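
rep() also accepts a character value or a whole vector as its first argument. A brief sketch, using illustrative object names r1 to r3 and the times = and each = arguments of rep():

```r
# rep() with a character value and with a vector as input
r1 <- rep("p", 3)               # repeat a single character value 3 times
r2 <- rep(c(1, 2), times = 3)   # repeat the whole vector 3 times
r3 <- rep(c(1, 2), each = 3)    # repeat each element 3 times before moving on
r1; r2; r3
```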

The matrix


The matrix is a 2-dimensional array of data of \(r \times c\) size (row by column). Basic matrix() syntax is:

matrix(object, nrow = r, ncol = c, dimnames = list(c("row_names"), c("col_names")))

where object is the data used to fill the matrix (typically a vector), nrow = and ncol = are the number of rows and columns, respectively, and dimnames = is a list of row and column names. Note that dimnames = is optional and is not required to create a matrix. The minimum syntax is matrix(nrow = r, ncol = c), where r and c are supplied by the user. All values in a matrix must be of the same mode (e.g., numeric, character, or logical); if modes are mixed, R coerces every value to a single common mode rather than stopping with an error.

Some examples of matrices:

# matrices - type all cmds into console
  matrix(nrow = 2, ncol = 3) # returns an empty matrix of NA
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
# empty matrix w/row & column names
  matrix(nrow = 2, ncol = 3, dimnames = list(c("R1", "R2"), c("C1", "C2 ", "C3 ")))  
##    C1 C2  C3 
## R1 NA  NA  NA
## R2 NA  NA  NA
# matrix m1 of 0 w/row & column names
   m1 <- matrix(0, nrow = 2, ncol = 3, dimnames = list(c("R1", "R2"), c("C1", "C2 ", "C3 ")))
   m1  # matrix of 0's; view
##    C1 C2  C3 
## R1  0   0   0
## R2  0   0   0

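matrix() fills column by column by default (byrow = FALSE): values from the vector run down the rows of column 1, then column 2, and so on. Be careful when populating a matrix with values from a vector.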

# build a matrix from a vector
  v4 <- c(17, 19, 20, 7, 45, 64)  # note value assignment sequence
  matrix(v4, nrow = 2, ncol = 3)  # apply v4 to matrix; note values fill down each column in turn
##      [,1] [,2] [,3]
## [1,]   17   20   45
## [2,]   19    7   64
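
If the values should instead run across each row, matrix() has a byrow = argument. The sketch below compares the two fill orders using the same six values; the names vals, m_col, and m_row are illustrative.

```r
# compare column-wise (default) and row-wise filling of a matrix
vals  <- c(17, 19, 20, 7, 45, 64)
m_col <- matrix(vals, nrow = 2, ncol = 3)                # default: fill down columns
m_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)  # fill across rows
m_col[1, ]  # first row when filled by column
m_row[1, ]  # first row when filled by row
m_row
```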

The array


An array is an n-dimensional generalization of a matrix. The array() call syntax is similar to the matrix() syntax:

array(object, dim = c(r, c, d), dimnames = list(c("r_names"), c("c_names"), c("d_names")))

where object is the data used to fill the array, dim = assigns the dimensions r, c, and d, which are the sizes of the 1st, 2nd, and 3rd dimensions, respectively, and dimnames = is a list of dimension names. Note that dimnames = is optional and is not required to create an array. All values in an array must be of the same mode (e.g., numeric, character, or logical); R coerces mixed modes to a single common mode rather than stopping with an error.

# build a 4x3x2 array of values 1 to 24
# NOTE value assignment sequence
#  dimension r [r,] first
#  dimension c [,c] second
#  dimension d [,,d] third
  a1 <- array(1:24, dim = c(4, 3, 2))
  a1  # note value 1:24 assignment sequence
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   13   17   21
## [2,]   14   18   22
## [3,]   15   19   23
## [4,]   16   20   24
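
Elements of an array are reached with one index per dimension, in the same r, c, d order used to build it. A minimal sketch on an array built the same way as a1 (the name arr is illustrative):

```r
# index a 4x3x2 array by [row, column, layer]
arr <- array(1:24, dim = c(4, 3, 2))
dim(arr)       # sizes of the three dimensions
arr[2, 3, 1]   # row 2, column 3 of the first layer
arr[4, 3, 2]   # last element of the second layer
arr[, , 2]     # the whole second layer, returned as a 4x3 matrix
```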

The dataframe


The dataframe is probably the most commonly used data class in R because it is closely aligned with the concept of a spreadsheet. Unlike a vector, matrix, or array, a data.frame() can mix data modes across columns. Thus, one column in a dataframe can be numeric, another integer, another factor, and so on, although all values within a single column share one mode. R dataframes have an implicit row numbering sequence as well.

# assorted vectors of different modes
  v1 <- c(1, 2, 3, 4)  # a numeric vector
  v2 <- c("A", "B", "C", "D")  # a character vector
  v3 <- c(TRUE, FALSE, FALSE, FALSE)  # a logical vector
# dataframe of different data modes
  d1 <- data.frame(v1, v2, v3)
  d1
##   v1 v2    v3
## 1  1  A  TRUE
## 2  2  B FALSE
## 3  3  C FALSE
## 4  4  D FALSE
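
To confirm that each column keeps its own class, the columns can be inspected one at a time. This is a sketch with illustrative names (dd, num, chr, lgl). Note that in R versions before 4.0, including those this module was tested on, data.frame() converts character columns to factors unless stringsAsFactors = FALSE is given, so the option is set explicitly here.

```r
# inspect the class of each column in a dataframe
dd <- data.frame(num = c(1, 2, 3, 4),
                 chr = c("A", "B", "C", "D"),
                 lgl = c(TRUE, FALSE, FALSE, FALSE),
                 stringsAsFactors = FALSE)  # keep characters as character
sapply(dd, class)  # one class per column
str(dd)            # compact summary of structure
dd$num             # extract one column by name
dd[2, "chr"]       # row 2 of column chr
```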

The list


list() builds groupings of data structures. Simple lists can hold elements all of the same class, such as the character strings in list.1 below, or compile objects of different classes and modes, as in list.2. Many functions in R return output as a list, and understanding how to access elements of a list is fundamental to R. Lists are also powerful data structures given their ability to organize mixtures of R data classes and modes.

# a simple list
  list.1 <- list("jumo", "jude", "juos")
  list.1
## [[1]]
## [1] "jumo"
## 
## [[2]]
## [1] "jude"
## 
## [[3]]
## [1] "juos"
# a complex list using objects built from above
  list.2 <- list(spp = s2, mat = m1, logical = v3, dat1 = d1)  # list of objects
  list.2                
## $spp
## [1] "Pinus edulis"
## 
## $mat
##    C1 C2  C3 
## R1  0   0   0
## R2  0   0   0
## 
## $logical
## [1]  TRUE FALSE FALSE FALSE
## 
## $dat1
##   v1 v2    v3
## 1  1  A  TRUE
## 2  2  B FALSE
## 3  3  C FALSE
## 4  4  D FALSE
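
Because accessing list elements is so common, here is a short sketch of the usual ways to do it. The list lst and its element names are illustrative and are built fresh here so the example stands alone.

```r
# access list elements by name and by position
lst <- list(spp   = "Pinus edulis",
            flags = c(TRUE, FALSE, FALSE, FALSE),
            mat   = matrix(0, nrow = 2, ncol = 3))
names(lst)           # element names
lst$spp              # extract one element by name with $
lst[["flags"]]       # extract one element by name with [[ ]]
lst[[3]][1, 2]       # third element (a matrix), then row 1, column 2
class(lst["spp"])    # single brackets return a (smaller) list
class(lst[["spp"]])  # double brackets return the element itself
```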

The factor


Factors in R are labels to identify groups of data, and can be considered analogous to levels in ANOVA or strata in a sample-survey design, as in:

  • Factor1 => Sex with two levels [M, F]
  • Factor2 => Habitat [3 kinds], within => Factor1 [5 National Forests]

factor() converts any vector of numerical, character, or logical values into class factor, with the number of levels being the number of unique data groupings.

# a simple factor; note change of numerics to levels in the factor
factor(c(1, 3, 2, 1, 2, 3, 3, 1, 1))  # vector of numerical values as factor levels
## [1] 1 3 2 1 2 3 3 1 1
## Levels: 1 2 3

Labels can be assigned to factor levels using the labels = c("LevelNames") option. There is an important caution to be aware of when assigning labels: labels are matched to the levels in their sorted (alpha-numeric) order, not to the order in which values first appear in the vector.

In the example below, the sorted levels are 1, 2, 3, so the numeric 1 is assigned the label “REF”, 2 is assigned “2N” because it is the second level, and 3 becomes “4N” because it is the third level, even though 3 appears before 2 in the vector.

factor(c(1, 3, 2, 1, 2, 3, 3, 1, 1), labels = c("REF", "2N", "4N")) # w/labels
## [1] REF 4N  2N  REF 2N  4N  4N  REF REF
## Levels: REF 2N 4N
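
When a different pairing of values and labels is wanted, the levels = argument of factor() can state the level order explicitly. The sketch below uses illustrative names (x, f_default, f_explicit) and contrasts the default sorted matching with an explicit order.

```r
# control which value receives which label by giving levels = explicitly
x <- c(1, 3, 2, 1, 2, 3, 3, 1, 1)
f_default  <- factor(x, labels = c("REF", "2N", "4N"))   # levels sorted: 1, 2, 3
f_explicit <- factor(x, levels = c(1, 3, 2),
                     labels = c("REF", "2N", "4N"))      # levels as given: 1, 3, 2
table(x, f_default)    # cross-check: 1 = REF, 2 = 2N, 3 = 4N
table(x, f_explicit)   # cross-check: 1 = REF, 3 = 2N, 2 = 4N
```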

Date and time


There are two common options used for two different types of dates and times:

  • as.Date() for year:month:day format
  • Package chron for year:month:day:hour:minutes:seconds format

The as.Date() call:
as.Date("year-month-day") is the basic R call, where year is given as the full 4 digits, month is 1:12, and day ranges from 1:28 to 1:31, depending on month and year.

Dates are often imported into R from .csv files generated using MS Excel, which has its own internal defaults for dates. For example, Excel automatically converts a date entered as 12-25-2000 or 12-25-00 into 12/25/2000.

Just about any date using characters in the month (e.g., 25 December 2000, 25Dec00, 25DEc2000, 25-Dec-2000) is automatically converted to an Excel date of 25-Dec-00. Consequently, Excel dates must be converted using the format = option in the as.Date() call before they become an R date. Failure to convert using the format = option will lead to an error or an incorrect date.

Different date options for format = in as.Date() include:

Code   Format style
%d     Day of month (numeric)
%m     Month (numeric)
%b     Month (3-character abbreviation)
%B     Month (full name)
%y     Year (2 digits)
%Y     Year (4 digits)

The correct R date for 25 December 2000 is 2000-12-25; the examples below show some different format = options to ensure an Excel-derived date converts to an R date.

as.Date("2000-12-25") # note output format mimics input because it is the correct R format
## [1] "2000-12-25"

The next two dates - 12/25/2000 and 25-Dec-00 - are date formats that R will not recognize. Thus they require use of the format = option for proper conversion.

as.Date("12/25/2000", format = "%m/%d/%Y")  # provide EXCEL format w/format = option; now correct R format
## [1] "2000-12-25"
as.Date("25-Dec-00", format = "%d-%b-%y")  # provide EXCEL format w/format = option; now correct R format
## [1] "2000-12-25"
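
The same format codes work in the other direction through format(), which writes an R date back out in a chosen style. A small sketch with illustrative names; month names (%b, %B) depend on the session locale, and an English locale is assumed here.

```r
# read dates with format codes, then write them back out with format()
d_short <- as.Date("12/25/00", format = "%m/%d/%y")          # 2-digit year
d_full  <- as.Date("25 December 2000", format = "%d %B %Y")  # full month name (English locale assumed)
d_short
d_full
format(d_short, "%Y/%m/%d")   # numeric output, locale independent
format(d_short, "%d-%b-%Y")   # abbreviated month name, locale dependent
```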

Date and time in package chron:
The chron package has greater flexibility for dates and time. The basic syntax is: chron(dates = D, times = T), where dates = and times = perform the conversion of D (a vector of dates in format month/day/4-digit year) and T (a vector of times in format hour:min:sec) into a format that R understands. Note that hour uses the 24-hour clock.

Load the library chron. Make sure you have already installed chron in your personal library!! If not, see Module 2.4.

library(chron)  # load chron; if not installed see Module 2.4
# assume Excel-based dates
d <- dates(c("6/20/2015", "6/25/2015")) # convert date for chron
# assume some times
h <- times(c("17:10:00", "18:30:32"))  # convert time for chron
# bind chron date:time as object
dh <- chron(dates = d, times = h)
# examine the outputs; d=chron date, h= chron time, dh= chron date:time
d; h; dh
## [1] 06/20/15 06/25/15
## [1] 17:10:00 18:30:32
## [1] (06/20/15 17:10:00) (06/25/15 18:30:32)

Once you have converted date and time to a chron-based format you can apply a variety of functions to calculate date:time differences. Two of the more commonly applied are julian(), which returns the number of days since an origin date (often loosely called a Julian date, and not the same as day-of-year unless the origin is the start of that year), and difftime(), which returns the difference between two date:times.

Try those two functions out on the date:time objects just created. Remember to use help(FunctionName) to determine correct syntax for the functions.
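
For comparison, the same two functions also work on base R date classes. The sketch below is an assumption-laden illustration, not the chron solution: it uses as.Date() and as.POSIXct() objects rather than chron objects, and the object names are illustrative.

```r
# julian() and difftime() on base R Date and POSIXct objects (not chron objects)
day_a <- as.Date("2015-06-20")
day_b <- as.Date("2015-06-25")
julian(day_a)                                      # days since the default origin, 1970-01-01
julian(day_a, origin = as.Date("2015-01-01"))      # days since 1 January 2015
format(day_a, "%j")                                # day of year, as a character string
difftime(day_b, day_a, units = "days")             # difference between the two dates
t_a <- as.POSIXct("2015-06-20 17:10:00", tz = "UTC")
t_b <- as.POSIXct("2015-06-25 18:30:32", tz = "UTC")
difftime(t_b, t_a, units = "hours")                # difference between two date:times
```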


Missing value NA


Missing values are assigned NA for numerical, character, and logical data modes. You will occasionally encounter a NaN. NA generally represents a missing value while NaN is “not a number,” such as the result of the undefined operation 0/0. Dividing a non-zero number by 0 returns Inf rather than NaN.
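
The distinction among NA, NaN, and Inf can be checked at the console. A minimal sketch:

```r
# NA, NaN, and Inf are different things
0 / 0          # NaN: the result is undefined
1 / 0          # Inf: not NaN
is.nan(0 / 0)  # TRUE
is.na(0 / 0)   # TRUE: is.na() flags NaN as well as NA
is.nan(NA)     # FALSE: a plain missing value is not NaN
```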

NA causes problems in R for two basic reasons:

  • Origin of the data imported into R, e.g., spreadsheets or other databases, and
  • Not all R packages treat NA consistently.

is.na() returns a logical vector (TRUE, FALSE) identifying missing values in an object.

v1 <- c(1, 2, NA, 4)  # vector of numeric values with 3rd obs missing=NA
is.na(v1)  # find missing values; note logical TRUE for 3rd obs
## [1] FALSE FALSE  TRUE FALSE

Note that blanks for values meant to be missing in analysis are not interpreted as NA by R.

v2 <- c("X", "Y", "", "Z")  # vector of characters with 3rd obs meant to be NA
is.na(v2)  # find missing values; note logical FALSE for 3rd obs, i.e., not an R NA
## [1] FALSE FALSE FALSE FALSE

Some commonly used functions for eliminating NA from an object are:

  • na.omit(DataObject), which removes all rows with NA from the DataObject;
  • DataObject[complete.cases(DataObject), ], which operates like na.omit(); and
  • DataObject[complete.cases(DataObject[, Columns]), ], which removes only those rows with NA in the specified columns

# dataframe w/NA
v3 <- data.frame(x = c(1, 2, NA), y = c(4, NA, 6), z = c(7, 8, 9)); v3 
##    x  y z
## 1  1  4 7
## 2  2 NA 8
## 3 NA  6 9
na.omit(v3)  # omit all rows w/NA
##   x y z
## 1 1 4 7
v3[complete.cases(v3), ]  # omit all rows w/NA (same as na.omit())
##   x y z
## 1 1 4 7
v3[complete.cases(v3[, 1]), ]  # omit rows w/NA in col=1
##   x  y z
## 1 1  4 7
## 2 2 NA 8

External programs often have specific conventions for missing values that must be converted to NA in R. Some examples include the ArcGIS missing value -9999, the SAS missing value of ., or use of a blank (empty) in many spreadsheets.
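
As a brief preview only, external missing-value codes can be recoded with logical indexing. The sketch below uses made-up values and illustrative names (elev, site); it is one simple approach, not necessarily the one used in Module 3.4.

```r
# recode external missing-value conventions to NA (hypothetical data)
elev <- c(1200, -9999, 1350)   # -9999 used as an ArcGIS-style missing code
elev[elev == -9999] <- NA      # replace the code with NA
site <- c("X", "Y", "", "Z")   # blank meant to be missing
site[site == ""] <- NA         # replace blanks with NA
is.na(elev)
is.na(site)
mean(elev, na.rm = TRUE)       # summary that skips the NA
```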

We will explore methods for dealing with these different types of external missing values in Module 3.4 next.


Summary of Module 3.3 functions


Data class and mode calls include:

  • c() => Construct vector of values; must be same class
  • rep() => Create vector of set length with same value, class
  • matrix() => Build a 2-dimensional matrix; must be same class
  • array() => Build an n-dimensional array; must be same class
  • data.frame() => Create a dataframe; can combine different classes
  • list() => Combine different class and mode of data into list

The specialized class call for factor is:

  • factor() => Create a factor containing specified levels

The specialized calls for date and time include:

  • chron() => Formats external date and time to R date and time (also a package name)
  • as.Date() => Convert external date format to R date format
  • julian() => Converts an R date to the number of days since a specified origin date
  • difftime() => Computes the difference between two dates or date:times

Exercise #5


  • Build a vector of 10 zeros
  • Similarly, build a 10 × 12 matrix of 0’s
  • Label the rows f1 … f10 and the 12 columns t1 … t12.
  • Write code that generates a list of the following species names: Juniperus pinchotii, Juniperus ashei, Juniperus deppeana, Juniperus occidentalis, Juniperus osteosperma, Juniperus scopulorum, Juniperus monosperma.
  • Create a data frame with separate columns (appropriately titled) for the genus and epithet of the species above.

END MODULE 3.3

