MODULE 3.5 Accessing Variables in R

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • How to access variables in an R data object



Let’s begin by asking …

Some questions about accessing data after importing external data into R:

  • How do I determine the column and row names of my data after importation into R?

  • After data importation, how do I access specific columns that I have defined as my variables?

  • What about specific rows that have labels or names as well?

  • How are intersections of specific rows and columns (i.e., dataframe cells) accessed?


Some Background - Why You Should Care


Manipulating data effectively in R requires knowing how to access elements in R objects.

Often an external data structure successfully imported into R has many columns and rows that are extraneous to the immediate analysis. Subsetting these columns and rows, and the elements they contain, is often one of the first steps in data management. In addition, accessing specific rows and columns is a precursor to extracting data that meet specific criteria within a column or row (see Module 4.5).


Some Initialization …


Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. These objects will be needed, so load them into your workspace first.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?  
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"

Let’s learn some R trickery!!

We previously used the ls() call to see what named objects are in the R workspace. But what if you wanted to see which files are external to your workspace; that is, files in the current working directory on your computer?

Above, getwd() returned the current external working directory. The list.files() call returns a list of the external files in that directory. When modified with the pattern = option, the call becomes a convenient means of determining whether a file of interest resides in the current working directory. In this case, we want to know whether the file mod3data.RData is present, so we search for files with .RData in their name.

# search for and load objects from Exercise #6; should have saved as mod3data.RData
  list.files(pattern = ".RData") # is it there ?
## [1] "mod3data.RData"
  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.
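
The check-then-load sequence above can be made slightly more defensive. The sketch below is illustrative only: it works in a temporary directory with a hypothetical file demo.RData and toy objects a1 and b1, none of which are part of the course data. It also relies on the fact that pattern = is interpreted as a regular expression, so an escaped and anchored pattern such as "\\.RData$" matches only file names that end in .RData.

```r
# Illustrative sketch only: demo.RData, a1, and b1 are hypothetical
tmp <- tempfile("mod3demo")            # throwaway directory for the demo
dir.create(tmp)
a1 <- 1:3
b1 <- c("x", "y")
rdata <- file.path(tmp, "demo.RData")
save(a1, b1, file = rdata)             # write both objects to one .RData file

# pattern is a regular expression; escape the dot and anchor at end of name
found <- list.files(path = tmp, pattern = "\\.RData$")
print(found)

rm(a1, b1)                             # remove objects so load() restores them
if (file.exists(rdata)) {
  loaded <- load(rdata)                # load() returns the restored object names
  print(loaded)
} else {
  stop("demo.RData not found; check getwd() and re-save the objects")
}
```

Capturing the return value of load() is a convenient way to confirm exactly which objects a file restored, without scanning the whole ls() listing.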


Determining Variable Names in Existing Objects


Both names(DataObject) and colnames(DataObject) return the names of variables in the sequence found in the DataObject. ls(DataObject) also returns the variable names, but sorted in alpha-numeric sequence. rownames(DataObject) returns row names if they exist; otherwise it returns the observation numbers as character strings.

# assume data m1 per Exercise #6 in your workspace
  names(m1)  # determine names of object m1
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"
  colnames(m1)  # alternative method to determine names; identical to above
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"
  ls(m1)  # names in alpha-numeric sequence
## [1] "catno"  "conlen" "elev"   "lstiob" "sex"    "zygbre"
  rownames(m1)  # determine row names; none so obs numbers instead
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19"
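
If m1 is not at hand, the same calls can be tried on a small stand-in. The sketch below assumes a hypothetical data frame toy built from the first three observations of three m1 variables; the row labels specA, specB, and specC are invented purely to show what rownames() returns once labels exist.

```r
# Toy stand-in for m1 (hypothetical; three observations, three variables)
toy <- data.frame(catno = c(9316, 17573, 17574),
                  sex   = c("M", "F", "M"),
                  elev  = c(1878, 3230, 3230))

names(toy)                             # column names in stored order
identical(names(toy), colnames(toy))   # both calls agree for a data frame
ls(toy)                                # same names, sorted alphabetically
rownames(toy)                          # default labels: observation numbers

rownames(toy) <- c("specA", "specB", "specC")  # assign hypothetical row labels
rownames(toy)                          # now returns the assigned labels
```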

An alternative, quick view of variable names and associated data that we’ve seen before is obtained from head(DataObject), which returns the first 6 observations by default. Use head(DataObject, N) to specify the number of observations, N, to return. tail(DataObject, N) works the same way but returns the last N observations.

  head(m1)  # naive call; first 6 obs in m1 returned
##   catno sex elev conlen zygbre lstiob
## 1  9316   M 1878  22.37  12.64   4.83
## 2 17573   F 3230     NA  12.38   4.28
## 3 17574   M 3230     NA  11.75   4.45
## 4 17575   F 3047  20.41  11.44   4.36
## 5 17576   F 3047  21.70  12.06   4.51
## 6 17577   M 3047  21.54  11.92   4.45
  head(m1, 2)  # first 2 observations in m1 returned
##   catno sex elev conlen zygbre lstiob
## 1  9316   M 1878  22.37  12.64   4.83
## 2 17573   F 3230     NA  12.38   4.28
  tail(m1, 2)  # last 2 observations in m1 returned
##    catno sex elev conlen zygbre lstiob
## 18 27670   M 3181  21.28  12.10   4.42
## 19 91354   M 3002  21.65  12.06   4.66
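
As a quick check of the defaults described above, the following sketch uses a hypothetical 10-observation data frame toy (not course data). Note that tail() keeps the original row labels rather than renumbering from 1.

```r
# Toy data frame (hypothetical): 10 observations, 2 variables
toy <- data.frame(id = 1:10, value = seq(0.5, 5, by = 0.5))

nrow(head(toy))   # default call returns the first 6 observations
head(toy, 3)      # first 3 observations
tail(toy, 3)      # last 3 observations; row labels 8, 9, 10 are kept
```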

Accessing Specific Variables in an Object


Accessing a particular variable (i.e., column) in a data object is simple: DataObject$VarName, where DataObject is the data object and VarName the variable desired. The $ (dollar) symbol is how R links the requested variable to the data object. A single accessed variable is returned as a vector.

  m1$sex  # access variable=sex in object m1; NOTE output as factor
##  [1] M F M F F M F F M F F F F F M F M M M
## Levels: F M
  m1$elev  # access variable=elev; NOTE output as numeric
##  [1] 1878 3230 3230 3047 3047 3047 3047 3047 3047 3047 3047 2682 2712 3181
## [15] 3181 3181 3181 3181 3002

A sneak peek of what’s ahead …
In addition to the simple variable access shown above, the results of analyses in R, such as those from a t.test, can be assigned to an object. The results of the analysis, such as the p-value, test statistic, and confidence intervals, then become elements of the results object. They can be accessed with $ just as a specific variable in a data object is accessed with DataObject$VarName.

We’ll learn more about this when we reach the statistical analysis Modules statR1 - statR4.
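
As a small preview, the sketch below stores the result of a t.test() in an object and pulls out individual elements with $. The two vectors grp1 and grp2 are made-up measurements used only for illustration; they are not course data.

```r
# Hypothetical measurements for two groups; not course data
grp1 <- c(12.6, 12.4, 11.8, 11.4, 12.1, 11.9)
grp2 <- c(12.9, 13.1, 12.7, 13.4, 12.8, 13.0)

res <- t.test(grp1, grp2)   # store the analysis results as an object
names(res)                  # elements available in the results object
res$p.value                 # p-value, accessed like a column with $
res$statistic               # test statistic (t)
res$conf.int                # confidence interval for difference in means
```

Calling names() on a results object is usually the quickest way to discover which elements are available for access.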


Accessing Specific Rows and Columns


An object of class data.frame is laid out like a matrix of values (think .xls/.xlsx type spreadsheet). Specific rows and/or columns in the data object can be accessed by referencing the location(s) in the matrix. DataObject[Row, Column] is the basic call.

For rows only, use DataObject[Row, ], where the empty position after the , acts as a wildcard for all columns in that row. Columns only are the reverse, DataObject[, Column], where the empty position in front of the , is the wildcard for all rows in that column. A column name can be specified as in DataObject[, "ColumnName"]; the name must be inside “” (quotes). Specific cells in the data object matrix are accessed as DataObject[Row, Column].

  m1[2, ]  # access row=2
##   catno sex elev conlen zygbre lstiob
## 2 17573   F 3230     NA  12.38   4.28
  m1[, 2]  # access column=2
##  [1] M F M F F M F F M F F F F F M F M M M
## Levels: F M
  m1[, "elev"] # access column by name
##  [1] 1878 3230 3230 3047 3047 3047 3047 3047 3047 3047 3047 2682 2712 3181
## [15] 3181 3181 3181 3181 3002
  m1[2, 3]  # access cell row=2 column=3
## [1] 3230

Sequences of multiple rows and columns by number can be accessed using the : (colon) to link rows and columns, as in DataObject[StartRow:StopRow, StartColumn:StopColumn]. Multiple column names are called as c("ColumnName1", "ColumnName2"). Non-sequential column numbers also require use of c(), with the numbers unquoted, as in c(1:5, 7, 9) for columns 1 through 5, 7, and 9.

  m1[2:4, ]  # access multiple rows 2 to 4
##   catno sex elev conlen zygbre lstiob
## 2 17573   F 3230     NA  12.38   4.28
## 3 17574   M 3230     NA  11.75   4.45
## 4 17575   F 3047  20.41  11.44   4.36
# m1[, 3:5]  # NOT RUN; access multiple columns 3 to 5
  m1[2:4, 3:5]  # access rows=2 to 4, columns=3 to 5
##   elev conlen zygbre
## 2 3230     NA  12.38
## 3 3230     NA  11.75
## 4 3047  20.41  11.44
# m1[, c("sex", "elev")]  # NOT RUN; multiple columns by name
# m1[, c(1, 3:5)] # NOT RUN; non-sequential column access
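
The NOT RUN calls can be tried safely on a small stand-in. The sketch below assumes a hypothetical data frame toy built from the first four observations of four m1 variables, and it also shows what kind of object each form of indexing returns.

```r
# Toy stand-in for m1 (hypothetical; four observations, four variables)
toy <- data.frame(catno  = c(9316, 17573, 17574, 17575),
                  sex    = c("M", "F", "M", "F"),
                  elev   = c(1878, 3230, 3230, 3047),
                  conlen = c(22.37, NA, NA, 20.41))

toy[2, ]                     # one row, all columns: still a data frame
toy[, "elev"]                # one column by name: returned as a vector
toy[2, 3]                    # single cell: row 2, column 3
toy[2:3, c("sex", "elev")]   # rows 2 to 3, two columns by name
toy[, c(1, 3:4)]             # non-sequential columns: 1, then 3 to 4
```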

How to Avoid Extra Variable Name Typing


Rather than using the $ (dollar) link for each variable, all variables in an object can be released and made available globally with attach(DataObject). This means they can be accessed by name alone, without having to specify the data object.

  m1$sex  # access variable=sex in object m1
##  [1] M F M F F M F F M F F F F F M F M M M
## Levels: F M
  attach(m1)  # make m1 variables global
  sex  # access var=sex; same as m1$sex and m1[,2]
##  [1] M F M F F M F F M F F F F F M F M M M
## Levels: F M

Use detach(DataObject) to remove global access of variables.

  detach(m1)  # no longer global access; must link to data object with $
# sex # NOT RUN; returns an error once m1 is detached, try it and see
  m1$sex  # works when linked to the data object with $
##  [1] M F M F F M F F M F F F F F M F M M M
## Levels: F M

WARNING !!
attach(DataObject) provides short-hand coding benefits, but a process called masking occurs when two data objects have been attached and both have variables of the same name. In all cases the R default is to override the variable names in the older dataframe(s) with the variable names of the most recently attached dataframe. Multiple attach() calls on multiple data objects can lead to confusion as to which data object a variable is referencing, so be careful.

For example, assume two objects, x and y as below:

x  # view x
##   a b
## 1 1 3
## 2 2 4
y  # view y
##   x b
## 1 5 7
## 2 6 8

Note both have a variable b. Run the code and observe how the sequential attach() calls mask variable b in the first data object x. Calling b returns the values from the second data object; those from the first have been masked.

attach(x); b  # make vars in x global; call var b
## [1] 3 4
attach(y); b  # make vars in y global; call var b and observe warning
## The following object is masked _by_ .GlobalEnv:
## 
##     x
## The following object is masked from x:
## 
##     b
## [1] 7 8

Apply detach() to both data objects to re-link the variable b to each respective data object.
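
The whole attach/mask/detach cycle can be reproduced from scratch. The sketch below rebuilds x and y to match the listings above (the construction with data.frame() is an assumption, since the original creation code is not shown) and stores the value of b at each step in hypothetical helper objects b_first, b_second, and b_after.

```r
# Rebuild the two example objects (assumed construction)
x <- data.frame(a = 1:2, b = 3:4)
y <- data.frame(x = 5:6, b = 7:8)

attach(x)
b_first <- b        # b comes from x: 3 4
attach(y)           # messages report the masking of x and b
b_second <- b       # b now comes from y: 7 8
search()[1:3]       # y sits ahead of x on the search path

detach(y)           # remove the most recently attached object
b_after <- b        # b comes from x again: 3 4
detach(x)           # global access removed for both objects

x$b                 # explicit $ access is never ambiguous
y$b
```

Inspecting search() after each attach() is a simple way to see which data object R will consult first when a bare variable name is called.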


Summary of Module 3.5 functions


Basic calls for accessing variables are:

  • names() => Returns column names in object in sequence
  • colnames() => Same as above
  • ls(DataObject) => Returns column names in object in alphabetic sequence
  • rownames() => Returns row names in object in sequence
  • DataObject$Name => $ returns specified column Name from object
  • DataObject[row(s),column(s)] => Returns specified sequence of rows & columns; wildcard when rows or columns not specified
  • attach() => Make data object variable names global
  • detach() => Remove global access to variable names

END MODULE 3.5

