MODULE 3.1 Data Caveats and Cautions in R

baseR-V2016.2 - Data Management and Manipulation using R

Tested on R versions 3.0.X through 3.3.1
Last update: 15 August 2016


Objective:

  • Data in R can be of many different types and forms - these differences must be understood before working in R



Let’s begin with …

Some basic R data caveats you ignore at your own peril


The most challenging aspect of understanding data in R is accepting that data in R are NOT spreadsheet-based. If you approach data management (and manipulation) only from a spreadsheet-based (e.g., MS Excel) point of view, you will have trouble working in R. You need to think outside the spreadsheet box.

Some particular data caveats to be aware of include:

  • R is not well suited for massive data manipulation.
    If data are in a large, relational structure, then use a data management program designed for that, e.g., MS Access or Oracle. R has excellent interface capabilities to extract data (e.g., Java, Perl, Python), but not to manage data per se.
  • Avoid use of whitespace in variable names.
    Yes, you can use whitespace, but R treats whitespace as a literal break. It is recommended that you use: (i) names run together as one word (with capitals if desired, e.g., SppName); or (ii) an underscore “_” to connect names when entering data (e.g., spp_name). During import from some external file types (e.g., .csv) with whitespace in column names, R will add a “.” to connect the two words. For example, “Species name” will become “Species.name”.
    However, I have experienced errors with this, and I encourage you to remove the whitespace in all column names.
  • Do NOT use whitespace in file names.
    Actually you can, but R can get quite snotty about it.
  • Be CAREFUL aBout leTter CasE.
    R is sensitive to letter case. If you have variables named aBout or CasE then R will search for those case-specific variable names only. I personally tend to use all lowercase to avoid this issue.
  • Do NOT start variable names with numbers or special characters.
    It simply will not work in R. Numbers are okay after a leading alphabetic character, e.g., x1, spp106, but 106spp or 1x will not work.
  • Avoid long-winded variable and data names.
    Consider Juniperus_monosperma. As a variable name, it is long-winded and a pain to type. Instead, use a code like jumo or spp106. If your data structure contains numerous plants/animals, consider a naming convention like sppXXX, where XXX represents numeric codes from, say, 001 to 999. NOTE: Sequential naming conventions like this have programming benefits in R.
  • Variable assignment in R is different.
    Assignment can be made with either “<-” or “=”, e.g., “x <- y” or “x = y”. “Old-timers” like “<-”, and many textbooks use “<-” exclusively, but since 2001 (R 1.4.0) R has allowed use of “=” for assignment. See my r_notes 04-IsAssignedVersusEqualSymbology-16Jul2015.pdf for an expanded explanation of assignment in R.
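
Several of the caveats above can be checked directly at the R prompt. The sketch below is only an illustration: the object names (fixed, about, spp_codes, x, y) are made up for this example and are not part of any course data set.

```r
# Whitespace and leading digits make a name non-syntactic.
# make.names() shows how R would repair such names.
fixed <- make.names(c("Species name", "106spp"))
print(fixed)        # "Species.name" "X106spp"

# Backticks allow a name containing whitespace, but every later
# reference needs the backticks too, which quickly becomes a nuisance.
`spp name` <- "jumo"
print(`spp name`)

# Letter case matters: 'about' and 'aBout' are different names.
about <- 1
has_lower <- exists("about", inherits = FALSE)
has_mixed <- exists("aBout", inherits = FALSE)
print(c(has_lower, has_mixed))

# Sequential codes such as spp001, spp002, ... are easy to generate.
spp_codes <- sprintf("spp%03d", 1:3)
print(spp_codes)    # "spp001" "spp002" "spp003"

# Both assignment symbols produce the same result.
x <- 5
y = 5
print(identical(x, y))
```

Because sequential codes can be built with sprintf(), they can also be looped over or matched by pattern, which is one of the programming benefits mentioned above.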

Paying attention to these data conventions in R, especially if you routinely enter and organize your raw data in a spreadsheet-based program like MS Excel, will ease the importation of externally managed data into your R working environment.
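
To see what happens to spreadsheet-style column names on import, here is a minimal sketch. The CSV content (csv_text) and its column names are invented for illustration, and the clean-up at the end is just one possible approach, not a required step.

```r
# A tiny CSV, held in a string, with whitespace in its column names.
# These names and values are hypothetical.
csv_text <- "Species name,Plot ID\njumo,1\npied,2\n"

# With check.names = FALSE the names are kept exactly as typed.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
print(raw_names)

# By default read.csv() repairs the names, replacing whitespace with ".".
dat <- read.csv(text = csv_text)
default_names <- names(dat)
print(default_names)

# One possible clean-up: swap "." for "_" and use all lowercase.
names(dat) <- tolower(gsub(".", "_", names(dat), fixed = TRUE))
print(names(dat))
```

Fixing the column names in the spreadsheet before export avoids the need for this kind of repair inside R.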


END MODULE 3.1

