MODULE 4.8 Looping in R

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • Construct loops for repeated operations of R statements



Let’s begin with …

Some Learning Objectives:

  • I must repeat the same R statements on objects that change in an orderly fashion - is this appropriate for a loop?

  • How do I start and stop a loop?

  • Can I write code for nested loops?


Some Background


Looping involves control structures (or conditionals per Module 4.5) and one or more statements that are repeated during each loop. The number of times a loop occurs (n) is determined by start and stop conditions. The benefit of the loop is to repeat statements without having to change statement parameters. The most obvious case for a loop is when output from a previous loop is used in a subsequent loop.

Generally speaking, loops are not as efficient as functions (see Module 4.9) when applied to vectors unless the number of iterations (loops) and statements is small and fixed. In many cases the difference between waiting for a 3 min loop operation versus 2.5 min for a vectorized operation is trivial.

However, 1000 repeats at 0.5 min difference between a loop and vectorized function is an additional 8+ hours or so. A general Rule-of-Thumb to follow is that an order of magnitude difference in time is worth pursuing; less than that may not be worth the time investment to fully optimize the code.
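
To make the loop-versus-vectorized comparison concrete, here is a minimal sketch showing that the two approaches return the same result. The object names (x, sq_loop, sq_vec) are made up for illustration and are not part of the course data.

```r
# hypothetical data object for illustration
x <- c(1, 2, 3, 4, 5)

# loop version: square one element per iteration
sq_loop <- numeric(length(x))      # empty object, created outside the loop
for (i in seq_along(x)) {
  sq_loop[i] <- x[i]^2
}

# vectorized version: the operator is applied to the whole vector at once
sq_vec <- x^2

identical(sq_loop, sq_vec)         # both approaches give the same values
```

Which version is faster, and by how much, depends on the size of the data and the statements involved, so it is worth timing both before deciding.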

There are some coding considerations that can increase the efficiency of loops. One of these is not to “grow” a data object during the loop. “Growing” means appending, through a rbind(), cbind(), or similar call, loop output onto an existing data object. It is better to create an outside “empty” data object and fill that with output from the loop.
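
A short sketch of the difference between "growing" and "filling" follows. The names n, grown, and filled are assumptions for this example only.

```r
n <- 1000   # hypothetical number of iterations

# "growing": the object is re-created and lengthened at every iteration
grown <- c()
for (i in 1:n) {
  grown <- c(grown, i^2)
}

# "filling": create the empty object once, outside the loop, then fill by index
filled <- numeric(n)
for (i in 1:n) {
  filled[i] <- i^2
}

identical(grown, filled)   # same result; only the construction differs
```

With only 1000 iterations the time difference is unlikely to be noticeable; the filling approach tends to matter as the number of iterations gets large.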

In addition, use matrices rather than dataframes. Matrices are more efficient. You can always coerce a matrix to a dataframe on conclusion of the loop (Module 3.6), if you prefer to work with dataframes rather than matrices.
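
One way this could look in practice is sketched below: a matrix is created before the loop, filled row by row, and coerced to a dataframe afterwards. The object names and the column names (sq, x10) are hypothetical.

```r
n <- 5   # hypothetical number of iterations

# empty matrix created outside the loop, one row per iteration
res <- matrix(NA_real_, nrow = n, ncol = 2,
              dimnames = list(NULL, c("sq", "x10")))

for (i in 1:n) {
  res[i, ] <- c(i^2, i * 10)   # fill row i with the two results
}

# coerce to a dataframe once, after the loop has finished
res_df <- as.data.frame(res)
res_df
```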

Nonetheless, loops serve a useful purpose in R, and oftentimes they provide a logical framework for repeated operations that can then be considered for more optimized operations. My personal bottom line is that if the time difference is not huge, build a loop and perform the analysis.

After all, any free time waiting can always be spent drinking coffee.


Some Initialization Before We Proceed …


Data from Exercise #6 (objects f1, m1, m2, m3, m4, t1, and w1) were saved as mod3data.RData. Some of these objects will be needed, so load them first into your workspace.

# load objects from Exercise #6; should have saved as mod3data.RData
# REMEMBER your directory path will be different .. I'm a long-winded instructor
  getwd()  # in correct directory ?  
## [1] "C:/Users/tce/Documents/words/classes/baseR_ALLversions/baseR-V2016.2/data/powerpoint_dat"
  list.files(pattern = ".RData") # is it there ?
## [1] "mod3data.RData"
  load("mod3data.RData")  # load it
  ls()  # check workspace; objects present ?
## [1] "f1" "m1" "m2" "m3" "m4" "t1" "w1"

If the objects are not there, or you did not save an .RData from Exercise #6, you will need to return to Module 3.4, Exercise #6, and re-import the data before proceeding further.


Some Free Advice on Looping


First consider whether a loop is appropriate. Answering “yes” to the question posed below is a good start.

  • Will a set of one or more R statements be repeatedly applied to one or more data objects?

Next, the loop construction itself requires a logical approach:

  • Start with a bulleted approach to loop operations, outlining what is a logical outcome of an operation on a data object
  • Convert these operations to R statement(s) and make sure each works
  • Consider how to configure the index (start and stop as above), and determine what the actual values of the index will be
  • How will the index increment?
  • Is the index numeric? A character string? Read from a list?
  • Last, test the loop with a fixed index value
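
The last step, testing with a fixed index value, might be sketched as follows. The objects x1 and x2 mirror the simple squaring loop used in this module; check is a hypothetical name.

```r
x1 <- c(1, 2, 3, 4, 5)   # data object to be fed to the loop

# step 1: fix the index by hand and run the statement outside any loop
i <- 3
check <- x1[i]^2
check                    # 9, as expected for the 3rd element

# step 2: once the statement works, wrap it in the loop
x2 <- numeric(length(x1))
for (i in seq_along(x1)) {
  x2[i] <- x1[i]^2
}
x2
```

If the statement fails with the fixed index, it will fail inside the loop too, and it is much easier to debug outside the loop.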

It is also a good strategy to time how long a loop takes. Two options are system.time() and date(). Each R statement in your developing loop can be nested inside system.time(), and R will return the time elapsed for each statement.

If you are simply interested in the total time required to run numerous statements, run date() at the start and end of a test loop. Either way you will obtain an idea of how long your proposed loop process will require.
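
Both timing options can be sketched as below. The data object x and the result names are made up for the example, and the times returned will differ from machine to machine, so no timings are shown.

```r
x <- runif(1e5)                  # hypothetical data: 100,000 random numbers
out <- numeric(length(x))        # empty object to hold loop output

# option 1: nest the statement(s) inside system.time()
t_loop <- system.time(
  for (i in seq_along(x)) out[i] <- sqrt(x[i])
)
t_vec <- system.time(out_vec <- sqrt(x))
t_loop["elapsed"]                # seconds elapsed for the loop
t_vec["elapsed"]                 # seconds elapsed for the vectorized call

# option 2: date() stamps at the start and end of a test run
started <- date()
for (i in seq_along(x)) out[i] <- sqrt(x[i])
finished <- date()
started; finished                # compare the two stamps by eye
```

Note that date() returns a character string, so it is best suited to runs that take long enough to read the difference directly from the two stamps.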

After that you can balance your personal need for efficiency versus just getting the analysis completed.


Looping Control Structures - The for


The for control structure builds a loop that repeats statements for a specified number of iterations. The structure is: for (StartStop) {Statement(s)}. Any single R statement or function, or multiple statements, can be placed inside the { } (curly brackets).

One of the simpler for structures is: for (i in 1:3) {Statement(s)}, where i is the counter, and thus i = 1 is the Start, and i = 3 is the Stop. Numerous alternatives, especially lists, exist for the (StartStop) structure.

Consider the following illustrative (although nonsensical) loop. It starts with an outside data object x1 which consists of 5 numbers. These numbers are to be squared. Each iteration of the loop causes the index counter i to increase from i = 1 until it reaches i = 5, at which time the loop stops. Note that 5 is exactly how many elements are in x1.

At each iteration of i, the loop statement takes the ith observation in x1 and squares it. Output is directed to a new object x2, which we created as an empty object, and should result in a vector of values [1, 4, 9, 16, 25].

# a simple loop 
  x1 <- c(1, 2, 3, 4, 5)  # data object fed to loop
  x2 <- {}  # use {} to create an empty vector x2
  for (i in 1:5) {     # loop control for & statement start {
    x2[i] <- x1[i]^2   # single loop statement
    }                  # statement end }
  x2  # examine; should be 1,4,9,16,25
## [1]  1  4  9 16 25

Assume you did not know how many elements were in x1. You could modify the control structure as:

for (i in 1:length(x1)) {}  # use length() to determine stop condition

where the stop is now determined by another function, length(), which in this case is equal to five.
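
One caution with 1:length() is worth a small sketch: if the object happens to be empty, 1:length() counts backwards from 1 to 0 and the loop still runs. The base R function seq_along() is a common alternative, shown here as an option rather than as this module's method. The objects empty, count_bad, count_ok, and total are hypothetical.

```r
x1 <- c(1, 2, 3, 4, 5)
empty <- numeric(0)      # a vector with no elements

1:length(empty)          # returns 1 0, not an empty sequence
seq_along(empty)         # returns integer(0)

count_bad <- 0
for (i in 1:length(empty)) count_bad <- count_bad + 1   # runs twice (i = 1, then i = 0)

count_ok <- 0
for (i in seq_along(empty)) count_ok <- count_ok + 1    # never runs

# the index does not have to be a counter; it can step through the values themselves
total <- 0
for (v in x1) total <- total + v
total                    # 15
```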

Let’s consider a more realistic loop, one that imports into your workspace a series of external data structures, such as all your .csv files. Here, your steps are to:

  • Determine the number and names of external .csv files for import;
  • Import them into your workspace using read.csv(), and assign them names that reflect the actual names of each .csv file.

This can all be accomplished using a command we saw in Module 4.2, the assign().

# assume working directory with all the data for this Module.
# return the files in your working dir with .csv in the filename
  files <- list.files(pattern = ".csv")  # return is assigned to object "files"
  files  # what files are there ??
##  [1] "conifer_anntemps.csv" "coyote_drugs.csv"     "f1.csv"              
##  [4] "f1mod.csv"            "grouse_lekmales.csv"  "loopdat.csv"         
##  [7] "m1.csv"               "m2.csv"               "m3.csv"              
## [10] "m4.csv"               "m6.csv"               "m7.csv"              
## [13] "ospreyprey_bysex.csv" "r1.csv"               "sculpin_eggs.csv"    
## [16] "t1.csv"               "w1.csv"               "zapusmorph.csv"

Build the loop:

# check your workspace 1st
  ls()
##  [1] "f1"    "files" "i"     "m1"    "m2"    "m3"    "m4"    "t1"   
##  [9] "w1"    "x1"    "x2"
# all new files will have "XX" in their names
  length(files)  # how many objects in files ???
## [1] 18
# build a loop using "files" for StartStop
  for (i in 1:length(files)) {
    assign(paste(substr(files[i], 1, 2), "XX", sep = ""),
      read.csv(files[i], header = T))
  }
# are the new "XX" files there ??
  ls(pattern = "XX")  # find files with "XX" in name
##  [1] "coXX" "f1XX" "grXX" "loXX" "m1XX" "m2XX" "m3XX" "m4XX" "m6XX" "m7XX"
## [11] "osXX" "r1XX" "scXX" "t1XX" "w1XX" "zaXX"

The loop operated by setting the index i = 1 as the start, and it continued until it had reached the value determined by length(), which was 18. The first part of assign() substringed the first two characters from each element of the files object, added XX, and pasted that together. The second part of assign() used read.csv() to import each external .csv file, which was then assigned the name created in the first part of assign(). The result? Rather than writing 18 read.csv() statements, a loop was used to get the data into your workspace. Note, however, that only 16 XX objects appear in the workspace: the two files beginning with “co” both map to coXX, and f1.csv and f1mod.csv both map to f1XX, so in each pair the later import overwrote the earlier one.
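
Because the loop above names objects from only the first two characters of each filename, files such as f1.csv and f1mod.csv end up with the same workspace name. One possible way around this, sketched here as an alternative rather than as the module's approach, is to build names from the full filename without its extension and store the imports in a named list. The directory demo_dir and its two small files are hypothetical stand-ins written to a temporary folder so the example is self-contained.

```r
# hypothetical demo files written to a temporary directory
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:3), file.path(demo_dir, "f1.csv"), row.names = FALSE)
write.csv(data.frame(a = 4:6), file.path(demo_dir, "f1mod.csv"), row.names = FALSE)

# "\\.csv$" matches only names that end in .csv
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# the two-character names collide ...
short <- substr(basename(files), 1, 2)
short

# ... while the full names (extension removed) stay unique
full <- tools::file_path_sans_ext(basename(files))

# empty list created outside the loop, one named slot per file
dat <- vector("list", length(files))
names(dat) <- full
for (i in seq_along(files)) {
  dat[[i]] <- read.csv(files[i], header = TRUE)
}
names(dat)
```

Each dataset is then available by name, for example dat[["f1mod"]]. The same unique names could be passed to assign() instead if separate workspace objects are preferred.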

Most loops are not as simple as these two loops, but the process for creating a loop that performs repeated operations through a set of R statements is the same. It is just the control structure that varies.


Looping Control Structures - The while


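Like for, while operates from a Start until it reaches a Stop condition. The basic syntax is while(UntilStop) {Statements}. The start condition for the loop is usually set outside the loop. When the stop condition is indexed by a counter, as in the example below, the counter must be incremented inside the loop; otherwise the stop condition is never reached and the loop runs forever.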

# assume x2 from the for loop
# now we'll take the square root
  length(x2)  # check length of x2
## [1] 5
  x3 <- {}  # create empty vector for results
  i <- 1  # start condition for i
# start the while loop
  while (i <= length(x2)) {   # loop control; NOTE stop condition based on length
    x3[i] <- sqrt(x2[i])     #   loop statement w/operator
    i <- i+1 }               #   counter to increase i+1
# output from while loop
  x3  # examine; should be 1,2,3,4,5
## [1] 1 2 3 4 5
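
Two behaviors of while are easy to miss, so here is a small sketch under the same setup. The condition is checked before each pass, which means the counter finishes one step past the stop value, and a loop whose start is already past the stop never runs at all. The names j and passes are hypothetical.

```r
x2 <- c(1, 4, 9, 16, 25)
x3 <- numeric(length(x2))    # empty object for results
i <- 1                       # start condition
while (i <= length(x2)) {
  x3[i] <- sqrt(x2[i])
  i <- i + 1                 # without this line the loop would never stop
}
i                            # 6: one past the stop value of 5

# a start condition that is already past the stop: the body is skipped entirely
j <- 10
passes <- 0
while (j <= length(x2)) {
  passes <- passes + 1
  j <- j + 1
}
passes                       # 0
```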

As a quick exercise on your own, use the while structure to mimic the importation of the .csv files as we did with the for loop. Use “YY” instead of “XX” in the workspace names this time.


Looping Control Structures - The repeat


The repeat{Statement(s), UntilStop} has slightly different syntax than for() {} or while() {}. All operations occur inside {}, including the Stop condition.

As with while() {}, the Start condition is typically set outside the loop. Once the Stop condition is met, the R call break stops the loop from continuing. break is the analog of the control structure in the for () {} and while () {} loops.

WARNING!!
The repeat{} call can lead to infinite loops. Think carefully about the loop logic, especially the Stop condition, before using. Personally, I avoid using repeat{}.

# assume x3 from the while loop
# now we'll multiply by 2
length(x3)  # check length of x3
## [1] 5
x4 <- {}  # create empty vector for results
i <- 1  # start condition for i
repeat {                            # loop start
    x4[i] <- x3[i]*2                # loop statement
    i <- i+1                        # counter to increase i+1
    if (i > length(x3)) break }     # loop stop condition w/break call
x4  # examine; should be 2,4,6,8,10
## [1]  2  4  6  8 10
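
If you do use repeat{}, one defensive option is a safety cap on the number of passes, so that a faulty Stop condition ends in an error instead of an endless loop. This is a sketch of that idea, not part of the original example; max_iter and n_iter are hypothetical names and the cap of 1000 is arbitrary.

```r
x3 <- c(1, 2, 3, 4, 5)
x4 <- numeric(length(x3))    # empty object for results
i <- 1                       # start condition
max_iter <- 1000             # hypothetical safety cap on the number of passes
n_iter <- 0                  # counts how many passes have been made

repeat {
  n_iter <- n_iter + 1
  if (n_iter > max_iter) stop("safety cap reached; check the Stop condition")
  x4[i] <- x3[i] * 2         # loop statement
  i <- i + 1                 # counter to increase i+1
  if (i > length(x3)) break  # intended Stop condition
}
x4                           # 2, 4, 6, 8, 10
n_iter                       # 5 passes were needed
```

Keep in mind that the body of a repeat{} always runs at least once, because the Stop condition is only checked inside the loop.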

Again, use the repeat structure to mimic the importation of the .csv files as we did with the for loop. Add “ZZ” to the workspace names this time.


Controlling Output of Looping Operations


Output from a loop can be directed to the console, to a named object in the workspace, or to an external file. A print(LoopOperation) call within the loop sends output to the console, where LoopOperation is the R statement. There is one output per loop iteration.

You can also “build” a data object using c(Object, LoopOperation). Here, Object is the workspace object being constructed during the loop, and LoopOperation is the R statement(s) generating new output. Any R object class, e.g., data.frame or matrix, can be built in this manner.

Last, the write.FileType(Object, file = "FileName") group of functions can be used to write to an external file, where Object is the workspace object and FileName is the name of the external file.

# assume x1 from above
# create two blank vectors, d1 and d2
d1 <- {}  # set blank vector
d2 <- {}  # set blank data object
# view all objects
x1; d1; d2  # view objects
## [1] 1 2 3 4 5
## NULL
## NULL

Note that x1 has values (1, 2, 3, 4, 5) while both of the empty vectors - d1, and d2 - return NULL indicating no elements in the objects.

# start the loop
  for (i in 1:length(x1)) {
    v1 <- x1[i]^2         # operation #1; can be any R statement
    v2 <- x1[i] * 10      # operation #2; can be any R statement
    v3 <- cbind(v1, v2)   # operation #3; build row for data object
  
    print(v1)  # write to console
    d1 <- c(d1, v1)  # write to named object; vector ex.
    # build data object with if - else conditional
    if(i == 1) d2 <- v3 else d2 <- rbind(d2, v3)  # build output data object
    # if at end of loop write out .csv file
    if(i == length(x1)) write.csv(d2, file="loopdat.csv", row.names = F) # output object d2 as .csv
  }
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25

Note values of v1 - 1, 4, 9, 16, 25 - were written out to the console.

The remainder of the output, the objects d1 and d2, as well as the external file loopdat.csv, were successfully created, too.

# examine loop output
d1; d2  # view loop objects
## [1]  1  4  9 16 25
##      v1 v2
## [1,]  1 10
## [2,]  4 20
## [3,]  9 30
## [4,] 16 40
## [5,] 25 50
# check to see if external file was written ... yes it was
list.files(pattern = "loopdat")
## [1] "loopdat.csv"
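
The rbind() inside the loop above grows d2 at every iteration, which the Background section advises against for long loops. A possible alternative, sketched here with the same x1 values, is to store each row in a list created before the loop and bind the rows once at the end. The names rows, out_file, and back are hypothetical, and the file is written to a temporary folder so nothing in your working directory is overwritten.

```r
x1 <- c(1, 2, 3, 4, 5)

# empty list created outside the loop, one slot per iteration
rows <- vector("list", length(x1))
for (i in seq_along(x1)) {
  rows[[i]] <- c(v1 = x1[i]^2, v2 = x1[i] * 10)   # one named row per pass
}

# bind all rows once, after the loop
d2 <- do.call(rbind, rows)
d2

# write the finished object to an external file, then read it back to check
out_file <- file.path(tempdir(), "loopdat_demo.csv")
write.csv(d2, file = out_file, row.names = FALSE)
back <- read.csv(out_file)
back
```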

Summary of Module 4.8 Functions


Basic calls related to looping are:

  • for () {} => Repeat statements for specified iterations
  • while () {} => Repeat statements until stop criterion is reached
  • repeat {} => Repeat statements until break is called
  • print() => Sends output to console from inside loop

Exercise #17


Data for this exercise are in: ../baseR-V2016.2/data/exercise_dat.

Import the dataset bearclawpoppy.csv from the data directory. This dataset consists of presence-absence locations of the bearclaw poppy, a rare plant of the Mojave desert, and associated environmental covariates.

Write a loop that:

  • Feeds multiple statistical functions (i.e., mean, sd, length) to;
    • Three topographic variables (elev, slope, aspect);
    • By presab in the bearclawpoppy data; and
  • Exports these results as 3 separate R objects in your workspace, one for elev, slope, and aspect, respectively.

Write a loop to read in the 4 m1.csv, … , m4.csv datasets. Some Challenges:

  • How to change the m1.csv, m2.csv, m3.csv and m4.csv (i.e., filename) given there is only a single read.csv() in the loop?
  • How to ensure each new object is assigned a unique name (e.g., m1, … ,m4)?

HINTS for both exercises:

  • Think lists and indexing within a list
  • Think pastes to create new character strings
  • Think about ways new objects can be assigned
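
To illustrate the three hints without giving away the exercise, here is a sketch on a made-up dataframe (toy, with columns a and b, neither of which is in the exercise data). It uses a nested loop: the outer loop steps through variable names, and the inner loop indexes a list of functions. Grouping by a third variable is left for you to work out.

```r
# hypothetical toy data; NOT the bearclawpoppy data
toy <- data.frame(a = c(1, 3, 5, 7), b = c(2, 4, 6, 8))

funs <- list(mean = mean, sd = sd, length = length)  # functions stored in a list
vars <- c("a", "b")                                  # variable names to loop over

for (v in vars) {                                    # outer loop: one pass per variable
  res <- matrix(NA_real_, nrow = length(funs), ncol = 1,
                dimnames = list(names(funs), v))
  for (j in seq_along(funs)) {                       # inner loop: one pass per function
    res[j, 1] <- funs[[j]](toy[[v]])                 # index the list, apply the function
  }
  assign(paste(v, "stats", sep = "_"), res)          # paste a new name, then assign
}

a_stats   # results for variable a
b_stats   # results for variable b
```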

END MODULE 4.8

