MODULE 4.6 Merging and Combining Data Objects

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • Merge and combine multiple R data objects



Let’s begin with …

Our Learning Questions:

  • How do I merge 2 data objects?

  • What is the difference between merging, row binding, and column binding?

  • Can I merge >2 data objects at 1 time?

  • What happens if I have rows in common between 2 objects I am merging?

  • I want to keep all observations when merging 2 data objects – can that be done in R?

  • Can I merge by 2 or more columns (variables)?


Some Background


Data Objects with common variables, and values can be merged by common variables and values, or combined through concatenation by columns, by rows, or by both. Difficulties arise when the columns, rows, and variables differ among data objects to be merged. Careful attention to R defaults is required.


Some Initialization Before We Proceed …


Data from Exercise #6 (objects f1, m1, m2 ,m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them first into your workspace.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?  
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
  list.files(pattern = ".RData") # is it there ?
## [1] "mod3data.RData"
  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.


The Simple Merge


The simplest merge is of two data objects having the same number of rows. merge(Object1, Object2, by.y = MergeVar) calls two or more data object and merges them row by row based on MergeVar. The by.y = option can accept a list of multiple variables for the merge, as in by.y = c(Var1, Var2, ..., VarN).

Some important caveats:

  • The data objects must have 1 or more columns (variables) in common
  • Use of by.y = MergeVar is not actually required.
    However, failure to use it can lead to problems unless you know with absolute certainty that number of observations is equal in the data objects being merged.

It is recommended that the option by.y = MergeVar always be used.

# assume data m1, m2 from Exercise #6; skull characteristics of jumping mice
# examine each data object berfore the merge
dim(m1); names(m1)  # view dim, names of m1
## [1] 19  6
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"
dim(m2); names(m2) # view dim, names of m2
## [1] 19  3
## [1] "catno"  "palbre" "pallen"

The dim()’s of m1 and m2 both show 19 rows, although the number of columns is six and three, respectively. names()’s of each data object show that the variable catno is common between both. These are ideal circumstances for a merge using the variable catno as the common variable to merge on. The remaining variables in each object are not in common.

# simple merge by var=catno
c1 <- merge(m1 ,m2, by.y = "catno")  # NOTE use of by.y = to identfy var to merge on
names(c1); dim(c1)  # view names, dim new object c1
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob" "palbre" "pallen"
## [1] 19  8
head(c1, 2)  # examine new merged object
##   catno sex elev conlen zygbre lstiob palbre pallen
## 1  9316   M 1878  22.37  12.64   4.83   3.89   9.37
## 2 17573   F 3230     NA  12.38   4.28   3.93   8.70

The merged data object, c, still has 19 rows, and shows eight variables.


The Complex Merge - Different Numbers of Rows


But what happens if you have data objects with different numbers of rows?

If number of rows differs between the data objects, the simple merge() drops rows not in common. Smaller sized data objects essentially over-ride larger sized data objects, with the final number of rows merged equal to the number of rows in smallest data object. The merge option all.x = T ensures all rows are retained.

# assume data m1, m2 from Exercise #6; skull characteristics of jumping mice
# examine each data object berfore the merge
dim(m1); dim(m3)  # dimensions of m1, m3; m3 smallest at 17 rows vs. 19 for m1
## [1] 19  6
## [1] 17  3
c2 <- merge(m1, m3, by.y = "catno")  # merge by common var=catno
dim(c2)  # merge m1,m3; dim c2 smallest 
## [1] 17  8

Note that smaller sized data object m3 (17 rows, 8 columns) over-rides the larger sized m1 (19 rows, 8 columns), resulting in the data object c2 having only 17 rows but still 8 columns.

# examine the data objects to merge
head(m1, 2)  # call 1st 2 data lines m1; NOTE m1 starts w/catno=9316
##   catno sex elev conlen zygbre lstiob
## 1  9316   M 1878  22.37  12.64   4.83
## 2 17573   F 3230     NA  12.38   4.28
head(m3, 2)  # call 1st 2 data lines m3; NOTE m3 starts w/catno=17547
##   catno palbre pallen
## 1 17574   3.43     NA
## 2 17575   3.50   8.87

Including the option all.x = T keeps all the rows, resulting in a larger data object of 19 rows.

# complex merge by var=ctano; keep all rows option used
c3 <- merge(m1, m3, by.y = "catno", all.x = T); dim(c3) # merge m1 m3; call dim c3
## [1] 19  8
head(c3, 4)  # call 1st 4 data lines c3; NOTE NA assigned to variables not in common
##   catno sex elev conlen zygbre lstiob palbre pallen
## 1  9316   M 1878  22.37  12.64   4.83     NA     NA
## 2 17573   F 3230     NA  12.38   4.28     NA     NA
## 3 17574   M 3230     NA  11.75   4.45   3.43     NA
## 4 17575   F 3047  20.41  11.44   4.36   3.50   8.87

Notice what happened to the rows not in common. Recall that m1 had a catno=9316 and m3 a catno=17547. These catno’s were not in common in both data objects. This also means they did not have values for the variables being merged. As a consequence, the merge assigned NA to those variables.


The Complex Merge - Different Numbers of Columns


Merge gets more complex if you have both different numbers of rows as well as columns between two data objects.

Consider the data objects, m3 and m4, each with 17 and three, and four and six, rows and columns, respectively. With the exception of the variable catno, on which the merge will occur, each has a different set of variables.

# assume m3, m4 from Exercise #6; skull chaaracteristics of jumping mice
dim(m3); dim(m4)  # dims of objects to merge, respectively
## [1] 17  3
## [1] 4 6
colnames(m3); colnames(m4)  # colnames of objects m3 & m4, respectively
## [1] "catno"  "palbre" "pallen"
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"

In addition, if we examine the catno’s using unique()we see very few (4) in common. This means some variables of each data object simply do not occur in the other. A merge, even if we specify the all.x = T option, will not be correct because not only do the objects not have a common catno, they also have no variables in common.

unique(m3$catno)  # what are the catno's in m3 ??
##  [1] 17574 17575 17576 17577 17578 17579 17580 17581 17582 26500 26566
## [12] 27666 27667 27668 27669 27670 91354
unique(m4$catno)  # what are the catno's in m4 ??
## [1] 27668 27669 27670 91354
# apply merge with all.x=T and examine merge output
c4 <- merge(m4, m3, all.x = T) # merge w/out all.x=T; view
dim(c4)  # NOTE small dim
## [1] 4 8

Just like the all.x = T there is an all.y = T option. This will ensure that all columns are retained as well.

# use the all.y=T option
c5 <- merge(m4, m3, all.y = T)  # merge w/all.x=T
c5  # examine entire merged data object; NOTE all the NA
##    catno  sex elev conlen zygbre lstiob palbre pallen
## 1  17574 

The largest of all possible data objects could be created by using both the all.x =T and “all.y = T options together.


Merging by Row Concatenation


As long as two data objects have identical columns, you can use rbind(DataObject1, DataObject2) to bind the objects together. An error is returned if the columns differ.

If we examine the three data objects, m1, m2, and m4, we see that m1 and m4 have columns in common. rbind() would work with m1 and m4.

names(m1); names(m2); names (m4)  # check names in m1, m2, and m4
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"
## [1] "catno"  "palbre" "pallen"
## [1] "catno"  "sex"    "elev"   "conlen" "zygbre" "lstiob"
c4 <- rbind(m1, m4); dim(c4)  # rbind and check no. rows
## [1] 23  6

It will not work if m1 or m4 are bound to m2 with rbind(). You can check this out yourself.

WARNING!!
Use of rbind() to combine data objects will retain duplicate records, if they exist. The function duplicated(DataObject) returns the logical (TRUE, FALSE) of duplicates, where TRUE indicates a duplicate observation.

# recall c4 merge from above; are there duplicates ??
duplicated(c4)  # return logical of duplicate records
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [23]  TRUE

The return indicates we have four duplicates.

unique() can be used to delete duplicate rows or columns. The default is fromLast = F; switching to = T eliminates duplicate rows from bottom first, or duplicate columns from right first.

# use unique to eliminate duplicate observations
c5 <- unique(c4)  # eliminate using default
dim(c4); dim(c5)  # remove unique obs; check dims of c4 and c5
## [1] 23  6
## [1] 19  6
duplicated(c5)  # any duplicates left ??  Nope ....
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Summary of Module 4.6 Functions


Basic calls related to merging and combining data objects are:

  • merge() => Merge 2 data objects with equal/different rows and columns
  • rbind() => Combine ≥2 objects w/same number columns
  • duplicated() => Returns logical if row is duplicated in object
  • unique() => Identify duplicate rows/columns, object elements

Some important options for merge() include:

  • by.y = => List of variable(s) for merge
  • all.x = T => Ensures all rows are retained
  • all.y = T => Ensures all columns are retained

Exercise #16


In Exercise #14, you created 4 separate .RData Objects of mean, sd, min, and max of male grouse by year within lek complex. Recall those. If you did not save them you will need to return to Exercise #14 and rebuild the objects.

  • Now add n (number of samples) per year and complex combination as well.
  • Merge these into single data object, and export as a .csv file.

HINT: Think about the call used to determine how long a column of elements is as a means of getting at n.


END MODULE 4.6


Printable Version