Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016
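Common operations involving subsetting of data from a data object are to extract rows meeting a condition, extract rows and columns together, keep or drop variables, and draw random samples.
The ultimate goal is to reduce large data sets (i.e., many rows of observations and many columns of variables) to smaller data objects for analysis.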
Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them first into your workspace.
# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
getwd() # in correct directory ?
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
list.files(pattern = "\\.RData$") # is it there ? (pattern is a regular expression)
## [1] "mod3data.RData"
load("mod3data.RData") # load it
ls() # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"
If the objects are not there, or you did not save an .RData file from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.
Here, the objective is to extract only those rows (observations) meeting a condition that is applied to the data object. The subsetting process uses subset(Object, Condition), where Object is a data object and Condition is one or more criteria that each row must satisfy. The Condition is built from comparison and logical operators (see Module 4.1).
# assume m1 from Exercise #6
head(m1, 2) # a quick refresh of what m1 looks like
## catno sex elev conlen zygbre lstiob
## 1 9316 M 1878 22.37 12.64 4.83
## 2 17573 F 3230 NA 12.38 4.28
# some subsets of different conditions ....
subset(m1, elev < 3000) # extract obs where elev<3000
## catno sex elev conlen zygbre lstiob
## 1 9316 M 1878 22.37 12.64 4.83
## 12 26500 F 2682 21.93 12.49 4.24
## 13 26566 F 2712 22.05 12.84 4.53
subset(m1, elev < 3000 & elev > 2000) # extract obs where 2000 < elev < 3000
## catno sex elev conlen zygbre lstiob
## 12 26500 F 2682 21.93 12.49 4.24
## 13 26566 F 2712 22.05 12.84 4.53
subset(m1, sex == "F") # extract obs where sex=F
## catno sex elev conlen zygbre lstiob
## 2 17573 F 3230 NA 12.38 4.28
## 4 17575 F 3047 20.41 11.44 4.36
## 5 17576 F 3047 21.70 12.06 4.51
## 7 17578 F 3047 20.89 11.08 4.13
## 8 17579 F 3047 NA 11.43 4.26
## 10 17581 F 3047 21.11 NA 3.94
## 11 17582 F 3047 20.32 11.38 4.15
## 12 26500 F 2682 21.93 12.49 4.24
## 13 26566 F 2712 22.05 12.84 4.53
## 14 27666 F 3181 22.66 12.30 4.53
## 16 27668 F 3181 22.35 12.74 4.67
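The same row extraction can also be written with bracket indexing, and it may help to see how the two forms treat NA in the condition. The sketch below uses a small made-up data frame called toy (hypothetical values, not the m1 object) so that it runs on its own.

```r
# toy data frame with hypothetical values; NOT the m1 object from Exercise #6
toy <- data.frame(id   = 1:4,
                  sex  = c("M", "F", "F", "M"),
                  elev = c(1878, 3230, NA, 2682))

# subset() silently drops rows where the condition evaluates to NA
by_subset <- subset(toy, elev < 3000)

# a bare logical index keeps a placeholder row of NAs for each NA condition
by_bracket <- toy[toy$elev < 3000, ]

# wrapping the condition in which() discards NA, matching subset()
by_which <- toy[which(toy$elev < 3000), ]

nrow(by_subset)   # 2
nrow(by_bracket)  # 3
nrow(by_which)    # 2
```

If a variable used in the condition contains NA, subset() or which() is usually the safer choice.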
Extracting both rows and columns meeting conditions requires just a simple modification to the subset() call: the addition of the option select = c(Vars2Select). The select = option is where the columns to extract, i.e., the Vars2Select, are specified. Specification can be by column number or by column name.
# assume m1 from Exercise #6
# extract rows where elev<3000, and keep only the 2 vars catno & sex
subset(m1, elev < 3000, select = c(catno, sex)) # we did not keep the var elev here
## catno sex
## 1 9316 M
## 12 26500 F
## 13 26566 F
subset(m1, elev < 3000, select = c(elev, catno, sex)) # here we keep the var elev
## elev catno sex
## 1 1878 9316 M
## 12 2682 26500 F
## 13 2712 26566 F
# extract observations where sex=M, and the var elev
m1$elev[m1$sex == "M"] # returns a vector of elev values for sex=M
## [1] 1878 3230 3047 3047 3181 3181 3181 3002
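Because select = treats unquoted column names like column positions, a run of adjacent columns can also be given as a name range. A minimal sketch, assuming a made-up data frame toy (hypothetical values) that borrows the column names of m1:

```r
# toy data frame with hypothetical values; column names mimic m1
toy <- data.frame(catno  = c(101, 102, 103),
                  sex    = c("M", "F", "F"),
                  elev   = c(1878, 3230, 2682),
                  conlen = c(22.4, 20.4, 21.9))

# catno:elev means "columns catno through elev" inside select =
res <- subset(toy, elev < 3000, select = catno:elev)

names(res)  # "catno" "sex" "elev"
nrow(res)   # 2
```

The name range only works inside subset(); with plain bracket indexing the columns must be given as numbers or quoted names.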
Using column numbers to subset requires that you know the sequence of column (variable) names.
# extract rows where elev<3000, and keep columns 1:2, & 5
names(m1) # listing of the column names; let's pick 1:2 & 5
## [1] "catno" "sex" "elev" "conlen" "zygbre" "lstiob"
subset(m1, elev < 3000, select = c(1:2, 5)) # column extraction by column number
## catno sex zygbre
## 1 9316 M 12.64
## 12 26500 F 12.49
## 13 26566 F 12.84
Note that in the select = option we specified, using c(), the columns 1:2, where the : (colon) symbol indicates a sequence of 1 through 2, and 5 for the fifth column.
The decision to drop or keep variables depends on which is easiest. If you want only a few variables from a long list, then keep them, as in Object[c(KeepVar)], where KeepVar is/are the variable(s) to be retained in the data object. To drop, do the reverse, placing the - (minus) symbol in front of the variable(s) to drop.
# assume m1 from Exercise #6
names(m1) # names in data object sequence
## [1] "catno" "sex" "elev" "conlen" "zygbre" "lstiob"
# keeping by variable number; NOTE using head() to restrict lines of output
head(m1[c(1:3)], 3) # keep columns 1 to 3
## catno sex elev
## 1 9316 M 1878
## 2 17573 F 3230
## 3 17574 M 3230
# dropping by variable number: head() again and NOTE "-" symbol for drop
head(m1[c(-3:-6)], 3) # drop columns 3 to 6; NOTE - (minus) sign to drop
## catno sex
## 1 9316 M
## 2 17573 F
## 3 17574 M
You can also keep (drop) variables referenced by name.
# keep by variable name; NOTE using head() to restrict lines of output
head(m1[c("catno", "sex", "elev")], 3) # keep columns 1 to 3 by name
## catno sex elev
## 1 9316 M 1878
## 2 17573 F 3230
## 3 17574 M 3230
# drop by variable name; head() again to restrict output view
head(subset(m1, select = -c(catno, sex, elev)), 3) # drop columns 1 to 3; NOTE - (minus) sign to drop
## conlen zygbre lstiob
## 1 22.37 12.64 4.83
## 2 NA 12.38 4.28
## 3 NA 11.75 4.45
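WARNING !!
When referencing names of variables to drop in subset(), the - (minus) symbol is outside the c() vector of names, as in select = -c(catno, sex, elev). This differs from the numeric example above, where the - (minus) symbols are inside the c() vector, as in c(-3:-6). The - (minus) symbol cannot be applied to quoted names: m1[-c("catno")] returns an error.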
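When the variables to drop are held as quoted names, one workable approach is to keep everything that is not in the drop list. The sketch below assumes a made-up data frame toy and a hypothetical vector drop_vars; neither is part of the exercise data.

```r
# toy data frame with hypothetical values; column names mimic m1
toy <- data.frame(catno  = c(101, 102),
                  sex    = c("M", "F"),
                  elev   = c(1878, 3230),
                  conlen = c(22.4, 20.4))

drop_vars <- c("catno", "sex")   # quoted names of the variables to drop

# option 1: keep the names that are NOT in drop_vars
kept <- toy[setdiff(names(toy), drop_vars)]

# option 2: same result using a logical mask over the column names
kept2 <- toy[!(names(toy) %in% drop_vars)]

names(kept)   # "elev" "conlen"
names(kept2)  # "elev" "conlen"
```

Both forms work with bracket indexing alone, so they are handy when the list of names to drop is built elsewhere in a script.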
You can randomly sample from a data object using sample(DataObject, NumberSamples), where DataObject is the data object from which the sample is to be drawn, and NumberSamples is the desired sample size. An important option is sampling with (replace = TRUE) or without (replace = FALSE) replacement. The default is without replacement.
WARNING!!
NA can be sampled using sample(). If you do not want NA in the sample, remove observations with NA from the data object before sampling (e.g., na.omit(), see Module 3.3).
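A minimal sketch of removing NA before sampling, assuming a made-up numeric vector x (hypothetical values, not from the exercise data):

```r
# hypothetical vector containing missing values
x <- c(10, NA, 30, 40, NA, 60)

# strip the NA values first; as.numeric() drops the attributes na.omit() adds
clean <- as.numeric(na.omit(x))

# sample only from the non-missing values
s <- sample(clean, 3)

length(clean)   # 4
any(is.na(s))   # FALSE
```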
More complex sampling strategies (e.g., multi-stage) can be accomplished using the sampling and survey packages.
# assume m1 from Exercise #6
sample(m1$catno, 7) # sample 7 obs from var=catno w/out replacement (default)
## [1] 26500 91354 27669 26566 27670 17582 27667
sample(m1$catno, 7, replace = TRUE) # sample 7 obs with replacement
## [1] 17580 17573 17576 17581 17579 26566 17582
Note that in the first output vector each CATNO was selected only once. When sampling with replacement a CATNO can be selected more than once, although in this particular draw none happened to repeat.
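The examples above sample values from a single variable. To draw whole observations (rows) instead, one common approach is to sample row numbers and then index the data object with them; calling sample() directly on a data frame would sample its columns, not its rows. The sketch below assumes a made-up data frame toy (hypothetical values).

```r
# toy data frame with hypothetical values
toy <- data.frame(catno = 101:110,
                  elev  = seq(2000, 2900, by = 100))

# draw 4 row numbers without replacement, then extract those rows
rows <- sample(nrow(toy), 4)
samp <- toy[rows, ]

nrow(samp)   # 4
```

Each sampled row keeps all of its variables together, which is usually what is wanted when building a smaller data object for analysis.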
In addition, each run of sample() generates a new selection of samples. This is because, when no seed has been set, sample() (like all of the other R sampling functions) relies on a random number generator that R initializes from the system clock, and every call then advances that generator. If you wish to ensure that you can duplicate the sample, you need to use set.seed(YourNumbers). The same YourNumbers can then be applied using the set.seed() function each time you sample. This ensures repeatability in your samples.
WARNING!!
If for any reason whatsoever you feel there might be a post-analysis request to perfectly reproduce the samples selected in your work, use set.seed(). Otherwise there is virtually no way to recover the clock-based seed.
# set a seed; make it a simple number you can always remember
set.seed(1234) # hard to forget this one ...
# run with seed set as 1234
sample(m1$catno, 7) # sample 7 obs from var=catno w/out replacement (default)
## [1] 17574 26500 17582 17581 26566 17580 9316
# run again w/out seed set; NOTE different catno's
sample(m1$catno, 7) # sample 7 obs from var=catno w/out replacement (default)
## [1] 17576 26500 17580 27670 27669 17575 26566
# run again with seed reset to 1234
set.seed(1234) # set seed; NOTE catno's identical with first sample
sample(m1$catno, 7) # sample 7 obs from var=catno w/out replacement (default)
## [1] 17574 26500 17582 17581 26566 17580 9316
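Rather than comparing the output by eye, repeatability can be checked with identical(). A minimal sketch, assuming a made-up vector ids (hypothetical values) in place of m1$catno:

```r
# hypothetical ID values standing in for m1$catno
ids <- 1001:1016

set.seed(1234)          # set the seed
s1 <- sample(ids, 7)    # first draw

s2 <- sample(ids, 7)    # second draw, seed NOT reset

set.seed(1234)          # reset to the same seed
s3 <- sample(ids, 7)    # third draw

identical(s1, s3)       # TRUE: same seed, same sample
```

Recording the seed value alongside the analysis script is what makes the sample recoverable later.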
Basic calls related to subsetting data are:
subset() => Extract rows and/or columns based on condition
[rows, columns] => Data object rows, columns identified by numeric sequence or name to keep or drop
sample() => Random sample of selected size from a data object
Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.
The dataset lichens_environ.csv includes attribute information on locations of lichens in the Pacific Northwest. From these data: