MODULE 4.3 Extracting Subsets of Data

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • Extract subsets of data from an R data object



Let’s begin with the …

Learning Questions:

  • How do I extract rows (observations) from a data object?

  • How do I extract columns (variables) from a data object?

  • I have a large dataset with many, many columns (variables) - is it easier to drop or keep columns?

  • Is there a method for keeping subsets of a column (variable) that meet one or more conditions?

  • How do I extract a random sample of observations from my data object?


Some Background


Common operations involving subsetting of data from a data object are to:

  • Extract rows (observations), and
  • Extract both rows and columns (variables) meeting two or more criteria

The ultimate goal is to reduce large data sets (i.e., many rows of observations and many columns of variables) to smaller data objects for analysis.


Some Initialization Before We Proceed …


Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them into your workspace first.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?  
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
  list.files(pattern = "\\.RData$") # is it there ? (pattern is a regular expression)
## [1] "mod3data.RData"
  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.
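
The check-then-load steps above can also be written defensively. The sketch below is only an illustration: it saves a toy object to a temporary file so it runs anywhere, and the names demo_obj, demo_file, and restored are made up for this example (they are not part of the Exercise #6 data).

```r
# toy object saved to a temporary file so the sketch runs anywhere
# (demo_obj and demo_file are hypothetical names, not course data)
demo_obj  <- data.frame(id = 1:3)
demo_file <- file.path(tempdir(), "demo.RData")
save(demo_obj, file = demo_file)
rm(demo_obj)  # remove it so we can see load() bring it back

if (file.exists(demo_file)) {
  restored <- load(demo_file)  # load() returns the names of the restored objects
  print(restored)
} else {
  message("Cannot find ", demo_file)
}
```

Capturing the return value of load() gives a quick record of exactly which objects were restored, without having to scan the full ls() listing.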


Extracting Rows from a Dataframe


Here, the objective is to extract only those rows (observations) that meet a condition applied to the data object. The subsetting process uses subset(Object, Condition), where Object is a data object and Condition is a logical expression stating one or more criteria. The Condition is built from comparison and logical operators (see Module 4.1).

# assume m1 from Exercise #6 
  head(m1, 2)  # a quick refresh of what m1 looks like 
##   catno sex elev conlen zygbre lstiob
## 1  9316   M 1878  22.37  12.64   4.83
## 2 17573   F 3230     NA  12.38   4.28
# some subsets of different conditions ....
  subset(m1, elev < 3000)  #  extract obs where elev<3000
##    catno sex elev conlen zygbre lstiob
## 1   9316   M 1878  22.37  12.64   4.83
## 12 26500   F 2682  21.93  12.49   4.24
## 13 26566   F 2712  22.05  12.84   4.53
  subset(m1, elev < 3000 & elev > 2000)  # extract obs where 2000 < elev < 3000
##    catno sex elev conlen zygbre lstiob
## 12 26500   F 2682  21.93  12.49   4.24
## 13 26566   F 2712  22.05  12.84   4.53
  subset(m1, sex == "F")  # extract obs where sex=F
##    catno sex elev conlen zygbre lstiob
## 2  17573   F 3230     NA  12.38   4.28
## 4  17575   F 3047  20.41  11.44   4.36
## 5  17576   F 3047  21.70  12.06   4.51
## 7  17578   F 3047  20.89  11.08   4.13
## 8  17579   F 3047     NA  11.43   4.26
## 10 17581   F 3047  21.11     NA   3.94
## 11 17582   F 3047  20.32  11.38   4.15
## 12 26500   F 2682  21.93  12.49   4.24
## 13 26566   F 2712  22.05  12.84   4.53
## 14 27666   F 3181  22.66  12.30   4.53
## 16 27668   F 3181  22.35  12.74   4.67
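
One detail worth knowing when the condition column contains NA: subset() and bracket indexing behave differently. The sketch below uses a small made-up dataframe called toy (hypothetical values, not the m1 data) to show the difference.

```r
# toy dataframe with missing elevations (hypothetical values, not m1)
toy <- data.frame(id   = 1:5,
                  elev = c(1800, NA, 3100, 2500, NA),
                  sex  = c("M", "F", "F", "M", "F"))

# subset() drops rows where the condition evaluates to NA
by_subset <- subset(toy, elev < 3000)

# bracket indexing keeps those rows, filled entirely with NA
by_bracket <- toy[toy$elev < 3000, ]

# wrapping the condition in which() makes brackets behave like subset()
by_which <- toy[which(toy$elev < 3000), ]

nrow(by_subset)   # 2
nrow(by_bracket)  # 4
nrow(by_which)    # 2
```

If you move between subset() and bracket notation, checking nrow() on the result is a cheap way to catch unexpected NA rows.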

Extracting Rows and Columns from a Dataframe


Extracting both rows and columns that meet conditions requires only a simple modification to the subset() call: the addition of the option select = c(Vars2Select). The select = option is where the columns to extract, i.e., the Vars2Select, are specified. Columns can be specified by number or by name.

# assume m1 from Exercise #6 
# extract rows where elev<3000, and keep only the 2 vars catno & sex
  subset(m1, elev < 3000, select = c(catno, sex))  # we did not keep the var elev here
##    catno sex
## 1   9316   M
## 12 26500   F
## 13 26566   F
  subset(m1, elev < 3000, select = c(elev, catno, sex))  # here we keep the var elev
##    elev catno sex
## 1  1878  9316   M
## 12 2682 26500   F
## 13 2712 26566   F
# extract observations where sex=M, and the var elev
  m1$elev[m1$sex == "M"]  # returns a vector of elev values for sex=M
## [1] 1878 3230 3047 3047 3181 3181 3181 3002

Using column numbers to subset requires that you know the sequence of column (variable) names.

# extract rows where elev<3000, and keep columns 1:2, & 5
  names(m1)  # listing of the column names; let's pick 1:2 & 5
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"
  subset(m1, elev < 3000, select = c(1:2, 5))  # column extraction by column number
##    catno sex zygbre
## 1   9316   M  12.64
## 12 26500   F  12.49
## 13 26566   F  12.84

Note that in the select = option we used c() to specify columns 1:2, where the : (colon) symbol indicates a sequence of 1 through 2, plus 5 for the fifth column.
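
Within select =, the : (colon) sequence also works with column names, which avoids having to count column positions. A minimal sketch, using a made-up dataframe called toy (hypothetical values; only the column names echo m1):

```r
# toy dataframe (hypothetical values)
toy <- data.frame(catno  = c(101, 102, 103),
                  sex    = c("M", "F", "F"),
                  elev   = c(1900, 2700, 3100),
                  zygbre = c(12.6, 12.4, 11.9))

# a run of adjacent columns given by name: catno through elev
by_name <- subset(toy, elev < 3000, select = catno:elev)

# the same columns given by position
by_number <- subset(toy, elev < 3000, select = 1:3)

identical(by_name, by_number)  # TRUE
```

The name-based form keeps working if columns are later added to the left of catno, whereas the numeric form would silently pick different columns.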


Dropping and Keeping Rows and Columns


The decision to drop or keep variables depends on which is easier. If you want only a few variables from a long list, then keep them, as in Object[c(KeepVar)], where KeepVar is the variable or variables to be retained in the data object. To drop, do the reverse, placing the - (minus) symbol in front of the variables to drop.

# assume m1 from Exercise #6
  names(m1)  # names in data object sequence
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"
# keeping by variable number; NOTE using head() to restrict lines of output
  head(m1[c(1:3)], 3)  # keep columns 1 to 3
##   catno sex elev
## 1  9316   M 1878
## 2 17573   F 3230
## 3 17574   M 3230
# dropping by variable number: head() again and NOTE "-" symbol for drop
  head(m1[c(-3:-6)], 3)  # drop columns 3 to 6; NOTE - (minus) sign to drop
##   catno sex
## 1  9316   M
## 2 17573   F
## 3 17574   M

You can also keep (drop) variables referenced by name.

# keep by variable name; NOTE using head() to restrict lines of output
  head(m1[c("catno", "sex", "elev")], 3)  # keep columns 1 to 3 by name
##   catno sex elev
## 1  9316   M 1878
## 2 17573   F 3230
## 3 17574   M 3230
# drop by variable name; head() again to restrict output view
  head(subset(m1, select = -c(catno, sex, elev)), 3)  # drop columns 1 to 3; NOTE - (minus) sign to drop
##   conlen zygbre lstiob
## 1  22.37  12.64   4.83
## 2     NA  12.38   4.28
## 3     NA  11.75   4.45

WARNING !!
When referencing names of variables to drop, the - (minus) symbol goes outside the c() vector of unquoted names, and this form works only in the select = option of subset(). This differs from dropping by number, where the - (minus) symbol can go inside the c() vector, as in the example above.
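
Bracket notation cannot negate quoted names, so dropping by name without subset() needs a different route. The sketch below shows two common base R idioms on a made-up dataframe; toy and drop_vars are hypothetical names introduced for this example.

```r
# toy dataframe (hypothetical values)
toy <- data.frame(catno  = 1:3,
                  sex    = c("M", "F", "F"),
                  elev   = c(1900, 2700, 3100),
                  conlen = c(22.4, 21.9, 20.4))

drop_vars <- c("sex", "elev")  # hypothetical list of columns to drop

# option 1: build the list of names to keep
kept1 <- toy[setdiff(names(toy), drop_vars)]

# option 2: logical test on the column names
kept2 <- toy[!(names(toy) %in% drop_vars)]

# subset() accepts the minus sign in front of c() of unquoted names
kept3 <- subset(toy, select = -c(sex, elev))

names(kept1)  # "catno" "conlen"

# negating quoted names inside brackets is an error
res <- try(toy[-c("sex", "elev")], silent = TRUE)
inherits(res, "try-error")  # TRUE
```

Storing the names in a vector such as drop_vars is convenient when the list of columns to drop is long or is built elsewhere in a script.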


Selecting Random Samples from a Data Object


You can randomly sample from a data object using sample(DataObject, NumberSamples), where DataObject is the data object from which the sample is to be drawn, and NumberSamples is the desired sample size. In the examples below the DataObject is a single column (a vector) of the dataframe. An important option is sampling with (replace = T) or without (replace = F) replacement. The default is without replacement.

WARNING!!
NA can be sampled using sample(). If there is no desire to have NA in the sample, remove observations with NA from the data object before sampling (e.g., na.omit(), see Module 3.3).
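
A minimal sketch of removing NA before sampling follows. The vector x and dataframe toy are made up for illustration, and the seed value 42 is arbitrary.

```r
# toy vector with a missing value (hypothetical data)
x <- c(10, 20, NA, 40, 50)

x_ok <- x[!is.na(x)]   # for a vector, drop the NA entries
set.seed(42)           # arbitrary seed so the draw can be repeated
s <- sample(x_ok, 3)
anyNA(s)               # FALSE

# for a dataframe, na.omit() drops every row containing any NA
toy    <- data.frame(id = 1:5, val = x)
toy_ok <- na.omit(toy)
nrow(toy_ok)           # 4
```

Keep in mind that na.omit() on a dataframe removes a row if any column is NA, which may discard more observations than intended when only one variable matters.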

More complex sampling strategies (e.g., multi-stage) can be accomplished using the sampling and survey packages.

# assume m1 from Exercise #6
  sample(m1$catno, 7)  # sample 7 obs from var=catno w/out replacement (default)
## [1] 26500 91354 27669 26566 27670 17582 27667
  sample(m1$catno, 7, replace = T)  # sample 7 obs with replacement
## [1] 17580 17573 17576 17581 17579 26566 17582

Note that in the first output vector each CATNO was selected only once. With replacement, a CATNO can be selected more than once, although in the second output shown here none happened to repeat.
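
The examples above sample values from a single column. To sample whole rows (observations), one common approach is to sample row numbers and use them as a row index; handing an entire dataframe to sample() would sample its columns instead. The sketch below uses a made-up dataframe toy and an arbitrary seed.

```r
# toy dataframe with 10 observations (hypothetical values)
toy <- data.frame(id = 1:10, elev = seq(1000, 2800, by = 200))

set.seed(42)                      # arbitrary seed for repeatability

# without replacement: draw 4 distinct row numbers, then index rows
rows <- sample(nrow(toy), 4)
samp <- toy[rows, ]
nrow(samp)                        # 4

# with replacement: sample size may exceed the number of rows
rows_rep <- sample(nrow(toy), 15, replace = TRUE)
samp_rep <- toy[rows_rep, ]
anyDuplicated(samp_rep$id) > 0    # TRUE: 15 draws from 10 rows must repeat
```

All columns travel with the sampled rows, so the sampled object can be analyzed exactly like the original.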

In addition, each run of sample() generates a new selection of samples. This is because sample(), like the other R random sampling functions, draws on the system clock to initialize the random number seed when none has been set. If you wish to ensure that you can duplicate the sample, you need to use set.seed(YourNumbers). The same YourNumbers can then be applied using the set.seed() function each time you sample. This ensures repeatability in your samples.

WARNING!!
If for any reason whatsoever you feel there might be a post-analysis request to perfectly reproduce the samples selected in your work, use set.seed(). Otherwise there is virtually no way you can recover the internal CPU clock seed.

# set a seed; make it a simple number you can always remember
  set.seed(1234)  # hard to forget this one ...
# run with seed set as 1234
  sample(m1$catno, 7)  # sample 7 obs from var=catno w/out replacement (default)
## [1] 17574 26500 17582 17581 26566 17580  9316
# run again w/out seed set; NOTE different catno's
  sample(m1$catno, 7)  # sample 7 obs from var=catno w/out replacement (default)
## [1] 17576 26500 17580 27670 27669 17575 26566
# run with seed set as 1234
  set.seed(1234)  # set seed; NOTE catno's identical with first sample
  sample(m1$catno, 7)  # sample 7 obs from var=catno w/out replacement (default)
## [1] 17574 26500 17582 17581 26566 17580  9316
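
Rather than comparing printed output by eye, repeatability can be checked directly with identical(). The sketch below uses a made-up vector ids in place of m1$catno. Be aware that the specific values drawn for a given seed can differ between R versions (the default sampling algorithm changed in R 3.6.0), so the outputs printed above may not match a newer installation even with the same seed.

```r
ids <- 101:120             # hypothetical IDs standing in for m1$catno

set.seed(1234)
first <- sample(ids, 7)    # draw with seed 1234

second <- sample(ids, 7)   # no reseed: continues the random number stream

set.seed(1234)
third <- sample(ids, 7)    # reseeded with 1234: repeats the first draw

identical(first, third)    # TRUE
```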

Summary of Module 4.3 functions


Basic calls related to subsetting data are:

  • subset() => Extract rows and/or columns based on condition
  • [rows, columns] => Data object rows, columns identified by numeric sequence or name to keep or drop
  • sample() => Random sample of selected size from a data object

Exercise #12


Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.

The dataset lichens_environ.csv includes attribute information on locations of lichens in the Pacific Northwest. From these data:

  • Extract the topographic variables for all observations where species LobaOreg=1
  • What is the size of this resultant data object?
  • Create a new lichen data object of all Pseu* species, PlotNum, and all climate variables starting with the letter=t (for temperature) ONLY
  • What is the size of this resultant data object?
  • Note there are seven 8-character species names (columns 2:8).
    Convert these to 4-character names, using the first 2 letters of each half of the name (e.g., LobaOreg would become LoOr), and replace the original names in the dataset with the new ones
  • Obtain a sample of size 210 from all the data (i) with and (ii) without replacement (use variable PlotNum as sample key)
  • Save these samples as two separate data objects
  • What is the number of unique observations (i.e., PlotNum) in each data object?

END MODULE 4.3