New website!

Thank you for reading the cougrstats blog! We’ve moved our blog to a new site as of May 2021. All of our posts up to April 2021 should be available on both sites, but new posts will only be on the new blog. You can find the new blog here:

https://cougrstats.netlify.app/

Tell your friends!

Using Python in R Studio with Reticulate

By Sarah Murphy

Download all code used below from the GitHub repository!

There are four ways to use Python code in your R workflow:

  • Python code blocks in R Markdown
  • Interactively in the console
  • Importing Python packages and using the commands within your R scripts
  • Importing Python scripts and using user-defined functions within your R scripts

All of these require reticulate.

Reticulate

Reticulate is a library that allows you to open a Python environment within R. You can also load Python packages and use them within your R script using a mix of Python and R syntax. All data types will be converted to their equivalent type when being handed off between Python and R.
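
For example, reticulate exposes r_to_py() and py_to_r() for converting objects explicitly, which is a handy way to see how this translation works. A minimal sketch (it assumes reticulate is already installed, as described in the next section):

library(reticulate)

# Explicitly convert an R object to its Python equivalent...
py_obj <- r_to_py(c(1, 2, 3))
class(py_obj)   # a Python object (a multi-element vector becomes a Python list)

# ...and convert it back to an R object
r_obj <- py_to_r(py_obj)
class(r_obj)    # back to an R type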

Installation

First, install the reticulate package:

install.packages("reticulate")

When you install reticulate, it will also offer to install Miniconda, a lightweight Python distribution and package manager, the first time a Python environment is needed. This creates a new Python environment on your machine called r-reticulate. You do not need to use this environment, but I will be using it for the rest of this post. More information about Python environments can be found here.

Python Code Blocks in R Markdown

R Markdown can contain both Python and R code blocks. When in an R Markdown document you can either manually create a code block or click on the Insert dropdown list in RStudio and select Python. Once you have a code block, you can write typical Python syntax inside it. Python code blocks will run as Python code when you knit your document.

## 6

Working with Python Interactively

We can use Python interactively within the console in RStudio. To do this, we will use the repl_python() command. This opens a Python prompt.

# Load library
library(reticulate)

# Open an interactive Python environment
# Type this in the console to open an interactive environment
repl_python()

# We can do some simple Python commands
python_variable = 4
print(python_variable)

# To exit the interactive environment
exit

You’ll notice that when you’re using Python >>> is displayed in the console. When you’re using R, you should see >.
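
Objects also move between the two languages. Anything you create at the Python prompt is available back in R through reticulate’s py object, and R objects are visible at the Python prompt through the r object (for example, r.my_variable). A minimal sketch, run from the R console after exiting the Python prompt above:

# The variable created at the Python prompt is accessible from R
py$python_variable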

Changing Python environments

Using repl_python() will open the r-reticulate Python environment by default. If you’ve used Python before and have an environment with packages installed that you’d like to use, you can load it using the following commands.

# I can see all my available Python versions and environments
reticulate::py_config()

# Load the Python installation you want to use
use_python("/Users/smurphy/opt/anaconda3/envs/r-reticulate/bin/python")

# If you have a named conda environment you want to use, you can load it by name
use_condaenv("r-reticulate")

Installing Python packages

Next, we need to install some Python packages:

conda_install("r-reticulate", "cartopy", forge = TRUE)
conda_install("r-reticulate", "matplotlib")
conda_install("r-reticulate", "xarray")

The conda_install function uses Miniconda to install Python packages. We specify "r-reticulate" because that is the Python environment we want to install the packages into. We are going to install three different packages here: cartopy (for making maps), matplotlib (a common plotting package), and xarray (to import data).
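
If you want to confirm the installations worked before moving on, reticulate can check whether a module is importable. A quick sketch, assuming the r-reticulate conda environment from above:

library(reticulate)

# Point reticulate at the environment the packages were installed into
use_condaenv("r-reticulate")

# Each of these should return TRUE
py_module_available("cartopy")
py_module_available("matplotlib")
py_module_available("xarray")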

Using Python Commands within R Scripts

We can now load our newly installed Python libraries in R. Open a new R script and import these packages using the import command. We are assigning these to some common nicknames for these packages so they are easier to reference later on.

library(reticulate)
## Warning: package 'reticulate' was built under R version 4.0.5
plt <- import('matplotlib.pyplot')
ccrs <- import('cartopy.crs')
xr <- import('xarray')
feature <- import('cartopy.feature')

Xarray comes with some tutorial datasets. We are going to use the air temperature tutorial dataset here. To import it, we will use the xarray open_dataset command. We can now use all the usual Python commands, substituting R’s $ for the . we would use in Python.

air_temperature <- xr$tutorial$open_dataset("air_temperature.nc")

air_temperature
## <xarray.Dataset>
## Dimensions:  (lat: 25, lon: 53, time: 2920)
## Coordinates:
##   * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
##   * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
##   * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
## Data variables:
##     air      (time, lat, lon) float32 ...
## Attributes:
##     Conventions:  COARDS
##     title:        4x daily NMC reanalysis (1948)
##     description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
##     platform:     Model
##     references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Next, we will create our figure. This is a map of air temperature over the United States for any day in 2013 specified in the plotting command.

Python users define lists with square brackets. The equivalent R type is a multi-element vector, so we must define it as such. The first line of the following code block is an example. In Python, we would specify the size of a figure using fig = plt.figure(figsize = [15, 5]), but when we use this command in R with reticulate we must use types and syntax that R recognizes: the assignment = becomes <-, . becomes $, and [] becomes c(), giving us fig <- plt$figure(figsize = c(15, 5)).

The Python equivalent of the following R code can be seen in the next section about sourcing Python scripts.

# Creating the figure
fig <- plt$figure(figsize = c(15, 5))

# Defining the axes projection
ax <- plt$axes(projection = ccrs$PlateCarree())

# Setting the latitude and longitude boundaries
ax$set_extent(c(-125, -66.5, 20, 50))

# Adding coastlines
ax$coastlines()

# Adding state boundaries
ax$add_feature(feature$STATES);

# Drawing the latitude and longitude
ax$gridlines(draw_labels = TRUE);

# Plotting air temperature
plot <- air_temperature$air$sel(time='2013-04-14 00:00:00', method = 'nearest')$plot(ax = ax, transform = ccrs$PlateCarree())

# Giving our figure a title
ax$set_title('Air Temperature, 14 April 2013');

plt$show()
## <cartopy.mpl.feature_artist.FeatureArtist object at 0x0000028DECE6AC50>
## <cartopy.mpl.gridliner.Gridliner object at 0x0000028DECE6ABE0>
## Text(0.5, 1.0, 'Air Temperature\n2013-04-14 00:00:00')
## <cartopy.mpl.feature_artist.FeatureArtist object at 0x0000028DECE5F978>

02

Sourcing Python Scripts

We can easily import a Python script and use user-defined functions from it using source_python. The Python script below creates a function that does the same thing as the above R script. This function accepts a date string and plots the corresponding map.

import xarray as xr
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs


def PlotAirTemp(usertime):
    """ Plot air temperature
    Args:
        usertime (str): "yyyy-mm-dd hh:mm:ss" 
                (for example '2013-04-14 00:00:00')
    """
    air_temperature = xr.tutorial.open_dataset("air_temperature.nc")
    fig = plt.figure(figsize = (15,5))
    ax = plt.axes(projection = ccrs.PlateCarree())
    ax.set_extent([-125, -66.5, 20, 50])
    ax.coastlines()
    ax.gridlines(draw_labels=True)
    plot = air_temperature.air.sel(time=usertime).plot(ax = ax, transform = ccrs.PlateCarree())
    ax.set_title('Air Temperature\n' + usertime)
    ax.add_feature(cartopy.feature.STATES)
    plt.show()

I have named the above file Python_PlotAirTemp.py. Once we source this file, we can use the function. Our function is called PlotAirTemp.

library(reticulate)

# Source the Python file to import the functions
source_python('../Python_PlotAirTemp.py')

# Call the function and give it the appropriate arguments
PlotAirTemp('2013-04-14 00:00:00')
## <cartopy.mpl.feature_artist.FeatureArtist object at 0x0000028DED1D8710>
## <cartopy.mpl.gridliner.Gridliner object at 0x0000028DED1D8860>
## Text(0.5, 1.0, 'Air Temperature\n2013-04-14 00:00:00')
## <cartopy.mpl.feature_artist.FeatureArtist object at 0x0000028DED1A48D0>

03

Sources & Resources

Introduction to mapping in R

By Matt Brousil

In this post I’ll walk through some of the basics for getting started importing, formatting, and creating maps with spatial data in R. Many students and researchers I know (myself included) have taken GIS coursework and are familiar with tools like ArcMap, but want to work with their data in R. R provides a good alternative to costly and potentially less reproducibility-friendly GIS tools, but it can be challenging to get started! This walkthrough is by no means exhaustive, but hopefully it will provide enough to give you a foothold and begin learning independently. I’ll assume you know a bit about GIS to begin with.

In order to follow along with this walkthrough you’ll want to install and load the following packages:

library(tidyverse)
library(sf)
library(cowplot)
library(ggspatial)

1. Quickly plot data for coordinate points

One of the more straightforward things you might want to do with mapping in R is to plot some points.

The built-in quakes dataset in R is an easy place for us to start. This dataset provides info on earthquakes near Fiji.

Take a look:

head(quakes)
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

We have a column for the latitude of earthquake observations, one for the longitude, and then additional ones for the depth, magnitude and the number of stations reporting the quake.

Without even loading a package, you would be able to make a very basic map of these points using the plot function:

plot(quakes$long, quakes$lat)

fig_1

However, if you’re familiar with ggplot2 you can plot them a little more elegantly with a little bit more code:

ggplot(quakes) +
  geom_point(aes(x = long, y = lat))

fig_2

Perhaps you’d like to make the size of the points dependent on the magnitude of the earthquake:

ggplot(quakes) +
  geom_point(aes(x = long, y = lat, size = mag, alpha = 0.4))

fig_3

Note that right now ggplot isn’t considering this spatial data yet, but just some points with coordinates.

From here you could take things further and format with different colors, plot themes, and plenty more. We’ll dive into some of this in the following sections. But let’s look at how to import our own data, plot it, and format the plot.

2. Build a map using external point and polygon files and add an inset

For the rest of this walkthrough I’ll be using a dataset by Pagano et al. (2019) containing polar bear satellite locations obtained from GPS collars. It’s available here. If you’re following along, you’ll want to download and unzip this folder.

We’ll be working mostly with the sf package now. There are several good vignettes on the package provided here and an explanation of the purpose of the package provided in this publication. Some of the basics are:

  • sf is built around “simple features”. Simple features are a standard for storing spatial data
  • sf was authored by some of the same people as the earlier spatial package sp and allows us to work with spatial data using simple features much more effectively
  • One major benefit is that sf has been written to coexist well with packages in the tidyverse
  • sf doesn’t work well with time series or raster data (see the stars and raster packages)
  • sf stores spatial data in data frames (see tidyverse comment above)

2.1 Import, format, test

Let’s dive in. To start, we’ll load in the dataset. It contains GPS collar data from several adult female polar bears.

bears <- read_csv(file = "../data/polarBear_argosGPSLocations_alaska_pagano_2014.csv") %>%
  select(Bear:LocationQuality)
## Warning: Missing column names filled in: 'X7' [7], 'X8' [8], 'X9' [9]
## 
## -- Column specification -----------------------------------------------------------------
## cols(
##   Bear = col_double(),
##   Type = col_character(),
##   Date = col_datetime(format = ""),
##   latitude = col_double(),
##   longitude = col_double(),
##   LocationQuality = col_character(),
##   X7 = col_logical(),
##   X8 = col_logical(),
##   X9 = col_datetime(format = "")
## )

You’ll notice I included the select statement above. When reading this file in I found that it contained extra empty columns. A good reminder that it’s important to look over your data when you import it into R.

What do the data look like?

head(bears)
## # A tibble: 6 x 6
##    Bear Type  Date                latitude longitude LocationQuality
##   <dbl> <chr> <dttm>                 <dbl>     <dbl> <chr>          
## 1     1 GPS   2014-04-10 02:00:38     70.4     -146. G              
## 2     1 Argos 2014-04-10 02:19:12     70.4     -146. L3             
## 3     1 Argos 2014-04-10 02:50:59     70.4     -146. LB             
## 4     1 Argos 2014-04-10 02:53:19     70.4     -146. L1             
## 5     1 GPS   2014-04-10 03:00:38     70.4     -146. G              
## 6     1 GPS   2014-04-10 04:00:38     70.4     -146. G

Just like the earthquake example earlier, we can plot these to get a sense for how they’re arranged spatially.

ggplot(bears) +
  geom_point(aes(x = longitude, y = latitude))

fig_4

Knowing that there are multiple bears present in the dataset, we might want to color their points by their ID number. We’ll do so by converting the Bear column to a factor so that R doesn’t treat it like an ordinal or continuous variable when assigning colors.

ggplot(bears) +
  geom_point(aes(x = longitude, y = latitude, color = as.factor(Bear)))

fig_5

Another option would be to create a separate map for each bear in the study, and to indicate the progression of time using color. We’ll facet by bear ID now and color by date.

ggplot(bears) +
  geom_point(aes(x = longitude, y = latitude, color = Date)) +
  facet_wrap(~ Bear)

fig_6

It’s very handy to be able to plot almost directly from a .csv file like this, but we could also convert this dataset into a more spatial-friendly sf object. Note that we provide a CRS, or coordinate reference system here. 4326 corresponds to WGS84.

bears_sf <- st_as_sf(x = bears,
                     coords = c("longitude", "latitude"),
                     # WGS84
                     crs = 4326)

NCEAS has a helpful info sheet about working with CRS in R here if you want to learn more.

What’s the result of us creating this new object?

bears_sf
## Simple feature collection with 9583 features and 4 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -158.9622 ymin: 70.0205 xmax: -140.4845 ymax: 75.33249
## geographic CRS: WGS 84
## # A tibble: 9,583 x 5
##     Bear Type  Date                LocationQuality             geometry
##  * <dbl> <chr> <dttm>              <chr>                    <POINT [°]>
##  1     1 GPS   2014-04-10 02:00:38 G               (-146.3468 70.39948)
##  2     1 Argos 2014-04-10 02:19:12 L3                   (-146.332 70.4)
##  3     1 Argos 2014-04-10 02:50:59 LB                 (-146.328 70.423)
##  4     1 Argos 2014-04-10 02:53:19 L1                 (-146.366 70.395)
##  5     1 GPS   2014-04-10 03:00:38 G               (-146.3469 70.39932)
##  6     1 GPS   2014-04-10 04:00:38 G               (-146.3455 70.39947)
##  7     1 Argos 2014-04-10 04:31:44 L3                 (-146.351 70.396)
##  8     1 Argos 2014-04-10 04:32:30 L2                 (-146.364 70.397)
##  9     1 GPS   2014-04-10 05:00:38 G               (-146.3459 70.39945)
## 10     1 GPS   2014-04-10 08:00:01 G               (-146.3419 70.40024)
## # ... with 9,573 more rows

We can see that the object is a tibble (data frame) but also has spatial data associated with it. The coordinate data is now stored in the geometry list column. If you want to pull the coordinates out of that object you can access them with st_coordinates().
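
For example, here is a quick look at the coordinate matrix for the bear data (just the first few rows):

# st_coordinates() returns a matrix of X (longitude) and Y (latitude) values
head(st_coordinates(bears_sf))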

Plotting sf objects with ggplot is remarkably simple using a geom_sf() layer:

ggplot(bears_sf) +
  geom_sf(aes(color = as.factor(Bear)))

fig_7

2.2 More complex plotting

Let’s focus our efforts on plotting the GIS data for just one bear. We’ll filter for bear 2’s data just like we would if we were doing this operation on a normal data frame.

bear_two <- bears_sf %>%
  filter(Bear == 2)

I’d also like to include shapefiles in my map. In this instance, we could use these to add polygon layers for the state of Alaska and for sea ice at the time the bear was being tracked. We can use the st_read() function to bring in shapefiles.

# Multipolygon geometry
alaska <- st_read(dsn = "../data/cb_2018_us_state_5m/cb_2018_us_state_5m.shp") %>%
  filter(NAME == "Alaska")
## Reading layer `cb_2018_us_state_5m' from data source `C:\Users\matthew.brousil\Documents\R-talks\intro_to_mapping\data\cb_2018_us_state_5m\cb_2018_us_state_5m.shp' using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 9 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -179.1473 ymin: -14.55255 xmax: 179.7785 ymax: 71.35256
## geographic CRS: NAD83
# Sea ice from the first month that the bear was tracked
ice <- st_read(dsn = "../data/extent_N_201404_polygon_v3.0/extent_N_201404_polygon_v3.0.shp")
## Reading layer `extent_N_201404_polygon_v3.0' from data source `C:\Users\matthew.brousil\Documents\R-talks\intro_to_mapping\data\extent_N_201404_polygon_v3.0\extent_N_201404_polygon_v3.0.shp' using driver `ESRI Shapefile'
## Simple feature collection with 145 features and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -3225000 ymin: -4950000 xmax: 3425000 ymax: 5700000
## projected CRS:  NSIDC Sea Ice Polar Stereographic North

Now these files have been brought in largely ready-to-use. Let’s preview:

ggplot(ice) + 
  geom_sf()

fig_8

As a quick sidenote, the tigris package for R is an alternative resource for polygon data for US boundaries that you could use. It accesses shapefiles from the US Census and loads them as sf objects.
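
As a rough sketch of that alternative (this assumes the tigris package is installed and an internet connection is available; recent versions of tigris return sf objects by default, and cb = TRUE requests the smaller cartographic boundary files):

library(tigris)

# Download state boundaries from the US Census as an sf object
states_sf <- states(cb = TRUE)

# Filter to Alaska, just as we did with the downloaded shapefile
alaska_tigris <- states_sf %>%
  filter(NAME == "Alaska")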

Now that we have three separate layers loaded in R it’s worth looking to see if they have matching coordinate reference systems. For example, if you wanted to do any direct comparisons of the layers to one another for spatial analysis you’d want to make sure they were compatible. st_crs() will return each layer’s CRS for us.

st_crs(bear_two)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World"],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
st_crs(ice)
## Coordinate Reference System:
##   User input: NSIDC Sea Ice Polar Stereographic North 
##   wkt:
## PROJCRS["NSIDC Sea Ice Polar Stereographic North",
##     BASEGEOGCRS["Unspecified datum based upon the Hughes 1980 ellipsoid",
##         DATUM["Not specified (based on Hughes 1980 ellipsoid)",
##             ELLIPSOID["Hughes 1980",6378273,298.279411123064,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4054]],
##     CONVERSION["US NSIDC Sea Ice polar stereographic north",
##         METHOD["Polar Stereographic (variant B)",
##             ID["EPSG",9829]],
##         PARAMETER["Latitude of standard parallel",70,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8832]],
##         PARAMETER["Longitude of origin",-45,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8833]],
##         PARAMETER["False easting",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["easting (X)",south,
##             MERIDIAN[45,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["northing (Y)",south,
##             MERIDIAN[135,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World - N hemisphere - north of 60°N"],
##         BBOX[60,-180,90,180]],
##     ID["EPSG",3411]]
st_crs(alaska)
## Coordinate Reference System:
##   User input: NAD83 
##   wkt:
## GEOGCRS["NAD83",
##     DATUM["North American Datum 1983",
##         ELLIPSOID["GRS 1980",6378137,298.257222101,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4269]]

Alas, the CRS of the three layers don’t match.

Because the three layers all do have a CRS already, we’ll want to do transformations to put them in the same system. For this we’ll use st_transform(). If instead you wanted to assign a CRS to an object that did not have one you would use st_crs().

bear_two_pcs <- st_transform(x = bear_two, crs = 3467)

alaska_pcs <- st_transform(x = alaska, crs = 3467)

ice_pcs <- st_transform(x = ice, crs = 3467)
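
For the other case mentioned above, here is a hedged sketch of assigning (rather than transforming) a CRS, using a small made-up layer whose CRS is missing. No reprojection happens here; we are only filling in metadata we already know:

# A toy sf object built without a CRS
points_no_crs <- st_as_sf(data.frame(x = -150, y = 71),
                          coords = c("x", "y"))

st_crs(points_no_crs) # NA: sf doesn't know the reference system yet

# Assign the CRS the coordinates were recorded in (WGS84)
st_crs(points_no_crs) <- 4326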

Note that these layers would have plotted successfully even if we didn’t transform their CRS. But the plotting would not have been pretty.

With transformation:

ggplot() +
  geom_sf(data = alaska_pcs) +
  geom_sf(data = ice_pcs) +
  geom_sf(data = bear_two_pcs)  

fig_9

Without transformation:

ggplot() +
  geom_sf(data = alaska) +
  geom_sf(data = ice) +
  geom_sf(data = bear_two)  

fig_10

Next we’ll want to zoom in a bit. It would help to know the rough spatial boundaries of the polar bear data in order to do this well. We can use st_bbox() to get this
information in the form of a bounding box, save it, then use it to define the boundaries of our plot. I’ve also added colors to differentiate the sea ice from Alaska’s land area.

limits <- st_bbox(bear_two_pcs)

bear_plot <- ggplot() +
  geom_sf(data = alaska_pcs, fill = "seashell", color = "snow3") +
  geom_sf(data = ice_pcs, fill = "lightcyan", color = "snow3") +
  geom_sf(data = bear_two_pcs, alpha = 0.3) +
  coord_sf(xlim = c(limits["xmin"], limits["xmax"]),
           ylim = c(limits["ymin"], limits["ymax"])) +
  theme_bw()

Now we have a plot that is starting to look usable:

bear_plot

fig_11

You might want to add things like a north arrow or a scale bar. The ggspatial package provides some options for doing this with the functions annotation_scale() and annotation_north_arrow().

bear_plot <- bear_plot +
  # Place a scale bar in the top right
  annotation_scale(location = "tr") +
  annotation_north_arrow(location = "tl",
                         # Use true north
                         which_north = "true",
                         height = unit(0.75, "cm"),
                         width = unit(0.75, "cm"))

bear_plot

fig_12

For the next step we’ll add an inset map to show the geographic context of where this study took place. This map will be zoomed out and show just the state of Alaska for reference. In order to do this we’ll need to create a second ggplot object and then use cowplot::ggdraw() to combine our two maps.

First, create the inset map. We want something pretty basic. The limits object we made from st_bbox() earlier can be plotted if we run it through st_as_sfc() to create a polygon from it.

inset_map <- ggplot(data = alaska_pcs) +
  geom_sf() +
  geom_sf(data = st_as_sfc(limits), fill = "red") +
  theme_bw() +
  xlab("") +
  ylab("") +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.background = element_blank())

inset_map

fig_13

Now that we’ve successfully made a map that we want to place as an inset it’s time to combine our two ggplot objects into one using ggdraw(). You’ll want to play around with the arguments to your inset ggdraw() call until you’ve arranged it in a way you like.

ggdraw() +
  draw_plot(bear_plot) +
  draw_plot(plot = inset_map,
            x = 0.13, # x location of inset placement
            y = 0.009, # y location of inset placement
            width = .455, # Inset width
            height = .26, # Inset height
            scale = .66) # Inset scale

fig_14

2.3 Taking this further

There are a couple other places you might want to make changes to the map we’ve created. For one, you might want to use color to code the bear’s path by date to show how long it lingered in some locations.

We first will need to create breakpoints that we can feed to ggplot for this. We’ll use pretty() from base R to generate equally spaced dates and corresponding labels for the legend.

date_breaks <- tibble(dates_num = as.numeric(pretty(bear_two_pcs$Date)),
                      dates_lab = pretty(bear_two_pcs$Date))

date_breaks
## # A tibble: 7 x 2
##    dates_num dates_lab          
##        <dbl> <dttm>             
## 1 1396310400 2014-04-01 00:00:00
## 2 1398902400 2014-05-01 00:00:00
## 3 1401580800 2014-06-01 00:00:00
## 4 1404172800 2014-07-01 00:00:00
## 5 1406851200 2014-08-01 00:00:00
## 6 1409529600 2014-09-01 00:00:00
## 7 1412121600 2014-10-01 00:00:00

You can use scale_color_viridis_c() (or another scale_color_ option if you want) and provide it dates_num as the color breaks and dates_lab as the break labels. Note that I also add in a fill color for panel.background to give the appearance of water.

bear_time_plot <- ggplot() +
  geom_sf(data = alaska_pcs, fill = "seashell", color = "snow3") +
  geom_sf(data = ice_pcs, fill = "lightcyan", color = "snow3") +
  geom_sf(data = bear_two_pcs, aes(color = as.numeric(Date)), alpha = 0.3) +
  coord_sf(xlim = c(limits["xmin"], limits["xmax"]),
           ylim = c(limits["ymin"], limits["ymax"])) +
  scale_color_viridis_c(name = "Date",
                        breaks = date_breaks$dates_num,
                        labels = date_breaks$dates_lab) +
  theme_bw() +
  theme(panel.background = element_rect(fill = "lightseagreen"),
        panel.grid = element_blank()) +
  # Place a scale bar in the top right
  annotation_scale(location = "tr") +
  annotation_north_arrow(location = "tl",
                         # Use true north
                         which_north = "true",
                         height = unit(0.75, "cm"),
                         width = unit(0.75, "cm"))

bear_time_plot

fig_15

And lastly, you might want to represent the GPS data as a line instead of a series of points. (Read here for a discussion on converting to LINESTRINGS). You’ll need to convert the points first and then you can plot as normal.

# Convert points to LINESTRING
bear_two_line <- bear_two_pcs %>%
  summarise(do_union = FALSE) %>%
  st_cast("LINESTRING")

bear_time_plot +
  geom_sf(data = bear_two_line, color = "black") +
  coord_sf(xlim = c(limits["xmin"], limits["xmax"]),
           ylim = c(limits["ymin"], limits["ymax"]))
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

fig_16

ggdraw() +
  draw_plot(bear_time_plot +
              geom_sf(data = bear_two_line, color = "black") +
              coord_sf(xlim = c(limits["xmin"], limits["xmax"]),
                       ylim = c(limits["ymin"], limits["ymax"]))) +
  draw_plot(plot = inset_map,
            x = 0.03, # x location of inset placement
            y = 0.009, # y location of inset placement
            width = .455, # Inset width
            height = .26, # Inset height
            scale = .66) # Inset scale
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

fig_17

References

  • Fetterer, F., K. Knowles, W. N. Meier, M. Savoie, and A. K. Windnagel. 2017, updated daily. Sea Ice Index, Version 3. [Monthly ice extent ESRI shapefile, April 2014]. Boulder, Colorado USA. NSIDC: National Snow and Ice Data Center. doi: https://doi.org/10.7265/N5K072F8. [Accessed 2021-02-07].
  • Pagano, A. M., Atwood, T. C. and Durner, G. M., 2019, Satellite location and tri-axial accelerometer data from adult female polar bears (Ursus maritimus) in the southern Beaufort Sea, April-October 2014: U.S. Geological Survey data release, https://doi.org/10.5066/P9VA5I0M.
  • Pebesma E (2018). “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal, 10(1), 439–446. doi: 10.32614/RJ-2018-009, https://doi.org/10.32614/RJ-2018-009.
  • https://github.com/r-spatial/sf/issues/321
  • 2018 TIGER/Line Shapefiles (machine-readable data files), prepared by the U.S. Census Bureau, 2018

Research profiles with Shiny Dashboard: A case study in a community survey for antimicrobial resistance in Guatemala

October 28th, 2020.

Romero Juan Carlos (jromero@ces.uvg.edu.gt), Alvis Juan Pablo (jalvis@ces.uvg.edu.gt), Grajeda Laura María (lgrajeda@ces.uvg.edu.gt) from Centro de Estudios en Salud, Universidad del Valle de Guatemala.

The successful implementation of community surveys requires close and timely monitoring of study activities. To monitor the performance of an antimicrobial resistance two-stage cluster survey in communities of the highlands of Guatemala, we developed a web-based application with R Shiny dashboards. Our password-protected application displays study progress indicators, approaching activities, and unresolved data quality issues using maps, tables, graphics, dynamic texts and downloadable spreadsheets. Scientists use this information to assess study performance against upcoming milestones, to address deviations in a timely manner, to assist real-time planning of field activities and to serve as a communication tool within the study team and with funding organizations.

We captured geographic data with GPS devices, and demographic, clinical and laboratory data with the REDCap system. All data were stored in a SQL-structured database. An ecosystem of scripts, each dedicated to a particular map, table or graphic of the dashboard, produces data structures ready for R packages to create each object. We programmed the process to repeat at 12-hour intervals to feed the Shiny dashboard with updated data and reduce loading time. We published the dashboard using the open-source Shiny Server platform on a dedicated Linux server accessed through an intranet IP. To facilitate collaborative work in creating and maintaining the dashboard, we implemented Apache Subversion on a centralized server for version control, using the open-source TortoiseSVN client.

A description of our approach to create a Shiny Dashboard and a discussion of alternatives follow in the video. Comments and discussion are very welcome.

YouTube Link: https://youtu.be/GryIw3OHXyc

Formatting data for use with R

By Matt Brousil

This post walks through some best practices for formatting and preparing your data for use with R. Some of the suggestions here are drawn from Data Carpentry, Broman and Woo (2018), and the BES Guide to Reproducible Code. The required dataset is available here. A previous version of this post that deals with cleaning both data and scripts is available here.

library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)

1. Use .csv instead of .xlsx or other proprietary file formats when possible.

The .csv file type can be read easily by many software packages, including base R. This isn’t necessarily the case with .xlsx files, which is why we use the readxl package here. In my experience, other packages for reading Excel files can have issues installing on many computers due to their installation requirements.

raw_data <- read_excel(path = "cool project 2020.xlsx")

Additional notes:

  • If your .xlsx has multiple sheets, then make multiple files. This is a more reliable storage method, and .csv files don’t allow for multiple sheets. (One way to split a workbook into separate .csv files is sketched after this list.)
  • Don’t use formatting to store data. R can’t easily decipher things like highlighting, and ways of documenting what formatting means often leave a lot to be desired. But if you need to detect Excel formatting from R, check out this blog post.
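
If you do receive a multi-sheet workbook, one hedged sketch for splitting it into separate .csv files is below. The workbook name is just the example file from this post, and purrr’s walk() plus readr’s write_csv() load with the tidyverse:

library(tidyverse)
library(readxl)

workbook <- "cool project 2020.xlsx"

# Read each sheet by name and write it out to its own .csv file
walk(excel_sheets(workbook), function(sheet_name) {
  read_excel(path = workbook, sheet = sheet_name) %>%
    write_csv(paste0(sheet_name, ".csv"))
})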

2. File naming

cool project 2020.xlsx isn’t a good name! There are a few points we want to keep in mind with file naming:

  • Machine readability
    • Don’t use spaces in filenames
    • Don’t use characters other than letters, numbers, underscores (_), or hyphens (-)
  • Human readability
    • Make sure the name of the file is descriptive, try to follow a consistent naming convention

Something like brousil_thesis_data_2020.csv is a better name, although you might want to make yours more specific.

3. Column naming

Let’s take a look at our column names:

names(raw_data)
## [1] "start date"   "end date"    
## [3] "which group?" "value"       
## [5] "\"Notes\""    "lat"         
## [7] "long"

Here’s what the dataset itself looks like:

head(raw_data)
## # A tibble: 6 x 7
##   `start date` `end date`         
##          <dbl> <dttm>             
## 1        40142 2010-05-05 00:00:00
## 2        40143 2010-05-06 00:00:00
## 3        40143 2010-05-06 00:00:00
## 4           NA 2010-05-08 00:00:00
## 5           NA 2010-05-09 00:00:00
## 6        40142 2010-05-05 00:00:00
## # ... with 5 more variables: `which
## #   group?` <chr>, value <dbl>,
## #   `"Notes"` <chr>, lat <chr>, long <chr>

The column naming in this dataset is a mess! Note that the formatting of the column names changes between the vector output of names() and the tibble version from head(). We could have avoided this instability. When naming columns in a spreadsheet for use in R,
we should avoid:

  • Spaces
  • Special characters
  • Capitalization

Spaces in particular make it very difficult to refer to column names, because you have to use backticks or quotes when referencing them. Capitalization isn’t a huge deal, but it adds extra keystrokes to your names and also makes it easier to misspell them.

raw_data$`which group?`

# This won't work 
raw_data %>% select(which group?) 

# This will work 
raw_data %>% select("which group?") 

We can take care of the names quickly using clean_names() from janitor.

raw_data <- clean_names(raw_data)

names(raw_data)
## [1] "start_date"  "end_date"    "which_group"
## [4] "value"       "notes"       "lat"        
## [7] "long"

4. Layout of the data

The next step is to consider the layout of the data. For example, is the dataset in tidy format? Let’s take a look:

raw_data
## # A tibble: 10 x 7
##    start_date end_date            which_group
##         <dbl> <dttm>              <chr>      
##  1      40142 2010-05-05 00:00:00 A          
##  2      40143 2010-05-06 00:00:00 A          
##  3      40143 2010-05-06 00:00:00 B          
##  4         NA 2010-05-08 00:00:00 B          
##  5         NA 2010-05-09 00:00:00 A          
##  6      40142 2010-05-05 00:00:00 A          
##  7      40148 2010-05-11 00:00:00 A          
##  8      40148 2010-05-11 00:00:00 A          
##  9      40149 2010-05-12 00:00:00 B          
## 10      40142 2010-05-05 00:00:00 A          
## # ... with 4 more variables: value <dbl>,
## #   notes <chr>, lat <chr>, long <chr>

This dataset is close, but not quite in tidy format.

Good qualities of the layout:

  • A single table
  • Variables are in columns
  • Observations are in rows

Bad qualities of the layout:

  • Multiple pieces of data in the notes column
  • Duplicated rows

To start, let’s separate the notes column into multiple columns.

clean_data <- separate(data = raw_data, col = notes, into = c("type", "location"),
                    sep = ", ", remove = FALSE)

head(clean_data)
## # A tibble: 6 x 9
##   start_date end_date            which_group
##        <dbl> <dttm>              <chr>      
## 1      40142 2010-05-05 00:00:00 A          
## 2      40143 2010-05-06 00:00:00 A          
## 3      40143 2010-05-06 00:00:00 B          
## 4         NA 2010-05-08 00:00:00 B          
## 5         NA 2010-05-09 00:00:00 A          
## 6      40142 2010-05-05 00:00:00 A          
## # ... with 6 more variables: value <dbl>,
## #   notes <chr>, type <chr>, location <chr>,
## #   lat <chr>, long <chr>

Next, let’s investigate the duplicate rows. We’ll print any rows that aren’t unique using get_dupes() from janitor and then remove the repeated row with help from duplicated().

# From janitor
get_dupes(clean_data)
## No variable names specified - using all columns.
## # A tibble: 2 x 10
##   start_date end_date            which_group
##        <dbl> <dttm>              <chr>      
## 1      40142 2010-05-05 00:00:00 A          
## 2      40142 2010-05-05 00:00:00 A          
## # ... with 7 more variables: value <dbl>,
## #   notes <chr>, type <chr>, location <chr>,
## #   lat <chr>, long <chr>, dupe_count <int>
# From base R. Note that it prints out a logical vector.
duplicated(clean_data)
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [8] FALSE FALSE FALSE

Drop the extra row using duplicated():

clean_data <- clean_data[!duplicated(clean_data), ]

5. Dates

Dates are tricky in Excel. It has a lot of automatic formatting options that are intended to be helpful. They often lead to problems, though. Moreover, things that aren’t supposed to be dates are interpreted by Excel to be dates.

This dataset shows multiple potential pitfalls with dates.

  1. The start_date column contains Excel serial number dates that haven’t translated over to R.
  2. The end_date column has imported correctly! But when it was entered into Excel, it was formatted like this: 5/11/10. This isn’t trustworthy, and 5, 11, and 10 are all realistic values for the year, month, or day.

Some of the safest ways to enter dates are as follows:

  • As separate columns in your spreadsheet: year, month, day
  • As ISO 8601: YYYY-MM-DD (e.g., 2020-10-21); a short parsing sketch follows this list
    • ISO = International Organization for Standardization
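
As a short sketch of why these formats work well in R: ISO 8601 strings parse unambiguously with lubridate, and separate year/month/day columns recombine easily. The values below are made up for illustration:

# An ISO 8601 string parses without any guessing about month vs. day
ymd("2020-10-21")

# Separate year, month, and day columns combine cleanly into a date
make_date(year = 2020, month = 10, day = 21)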

Luckily we can clean up the Excel serial formatted dates:

clean_data$start_date <- excel_numeric_to_date(as.numeric(clean_data$start_date))

If we want to save the data as a .csv for later use in Excel, we might still want to split the dates into separate columns and drop the originals.

clean_data <- clean_data %>%
  mutate(start_year = year(start_date),
         start_mo = month(start_date),
         start_day = day(start_date),
         end_year = year(end_date),
         end_mo = month(end_date),
         end_day = day(end_date)) %>%
  select(-start_date, -end_date)


head(clean_data)
## # A tibble: 6 x 13
##   which_group value notes type  location lat  
##   <chr>       <dbl> <chr> <chr> <chr>    <chr>
## 1 A               1 expe~ expe~ greenho~ -27.~
## 2 A               5 expe~ expe~ greenho~ -27.~
## 3 B              75 expe~ expe~ lab      -27.~
## 4 B               4 expe~ expe~ greenho~ -27.~
## 5 A               2 expe~ expe~ lab      -27.~
## 6 A               1 expe~ expe~ lab      -27.~
## # ... with 7 more variables: long <chr>,
## #   start_year <dbl>, start_mo <dbl>,
## #   start_day <int>, end_year <dbl>,
## #   end_mo <dbl>, end_day <int>

6. Missing values

Many datasets you work with will likely contain missing values. It’s important to know why data can be missing, and how to treat it differently based on the reason for its absence.

  • Make sure that you enter 0 when your data is supposed to contain zero! Often people will leave zeroes blank during data entry, and it’s not clear to the user whether these are intended to be missing data or actual 0 values.
  • There’s no uniform agreement about how to code truly missing data, but there are multiple options. Either a blank cell or a manually entered NA value is best for data that truly are gaps in the dataset. My thought is that NA is probably best, i.e., it is clear that the data are missing rather than simply forgotten during entry (Broman & Woo, 2018). A quick way to tally missing values per column is sketched after this list.
    • e.g., An experimental replicate was lost, data logger corrupted, etc.
  • Always fill in values, even when they are just repeats of dates in the rows above!
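
Once the data are in R, here is a quick sketch for tallying how much is missing in each column, using the dataset from this post (the across() version needs dplyr 1.0 or later):

# Base R: count missing values in each column
colSums(is.na(clean_data))

# tidyverse equivalent
clean_data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))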

7. Miscellaneous

Lastly, you’ll want to avoid including special characters or units inside of cells with numeric data (e.g., lat/long). There are a few reasons for this:

  • These characters or units won’t typically add anything to your analytical abilities in R
  • Special characters may be difficult to wrangle in R
  • The presence of a special character in a numeric column will likely force that column to be character rather than numeric data

My suggestion is to either leave them out entirely or reference them in the column name, e.g., distance_m.

We can drop them from our dataset using gsub():

clean_data$lat <- gsub(pattern = "°", replacement = "", x = clean_data$lat)
clean_data$long <- gsub(pattern = "°", replacement = "", x = clean_data$long)

8. Keep your raw data untouched and separate from cleaned data!

In case a transcription or cleaning error happens, you need to be able to compare your original dataset (untouched) to the cleaned version. Save a new .csv separately from the old data.

write.csv(x = clean_data,
          file = "brousil_cleaned_thesis_data.csv",
          row.names = FALSE)

Sources:

  • Cooper, N., & Hsing, P. Y. (2017). A guide to reproducible code in ecology and evolution. Technical report, British Ecological Society. Available at https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf.
  • Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
  • Peter R. Hoyt, Christie Bahlai, Tracy K. Teal (Eds.), Erin Alison Becker, Aleksandra Pawlik, Peter Hoyt, Francois Michonneau, Christie Bahlai, Toby Reiter, et al. (2019, July 5). datacarpentry/spreadsheet-ecology-lesson: Data Carpentry: Data Organization in Spreadsheets for Ecologists, June 2019 (Version v2019.06.2). Zenodo. http://doi.org/10.5281/zenodo.3269869

ggplot2 tutorial

By Mikala Meize

Libraries used

library(tidyverse)
library(ggthemes)

The ggplot() function used in this tutorial comes from the ggplot2 package, which is loaded as part of the tidyverse library. The ggthemes library is something we’ll use later on in the tutorial.

There are several ways to set up a plot using ggplot(). The first is to simply run:

ggplot()

1

This tells R that you are prepping the plot space. You’ll notice that the output is a blank space. This is because we have not specified an x or y axis, the variables we want to use, etc.

When building a plot with ggplot(), you add pieces of information in layers, separated by a plus sign (+). Whatever you include in the first layer will carry through to the following layers. For example, if we include a dataframe in the first layer:

ggplot(data = mtcars)

2

Then each layer following will use that dataframe. Notice that even though we added a dataset, the output is still a blank space.

Something else to note about ggplot(): you can get the same plot using several different chunks of code. In the next example, I will plot the mpg variable from the mtcars dataset using two different sets of code. These both result in the same plot:

ggplot(data = mtcars) +
  geom_boxplot(aes(x = mpg)) #specified the data in the first layer and the x variable in the second

3

ggplot(data = mtcars, aes(x = mpg)) +
  geom_boxplot() #specified both data and x variable in the first layer and the type of plot alone in the second

4

These plots are the same, but the x variable was specified inside aes() in different layers. ‘aes’ is short for aesthetic.

There are different reasons to include the data, the x, or the y variables in the first layer. If you want to layer data from different dataframes together in a plot, you’ll likely start with ggplot() alone in the first layer. If you want to layer multiple x variables from the same dataset, then you might specify the data in the first layer: ggplot(data = df). Then, if you want to keep the same x or y variable as you add layers, you can specify that as well: ggplot(data = df, aes(x = variable, y = variable)).

Plotting variables together

Using the dataset mtcars, mpg (miles per gallon) will be on the x axis and hp (horsepower) will be on the y axis.

ggplot(data = mtcars, aes(x = mpg, y = hp))

5

I have not clarified how I want these variables plotted (lines, points, etc.) so the x and y axes are labeled, but there is no data in the plot space. In the next layer, I specify that I want the data plotted as points.

ggplot(data = mtcars, aes(x = mpg, y = hp)) +
  geom_point()

6

I am interested in the relationship between displacement and mpg too. By adding another layer, I can add a second y variable.

ggplot(data = mtcars, aes(x = mpg)) +
  geom_point(aes(y = hp)) +
  geom_point(aes(y = disp))

7

Because I have two y variables, I removed the y specification from the first layer and added the separate y variables in their own layers. To differentiate one set of points from another, I can request different shapes for the data points based on variable.

ggplot(data = mtcars, aes(x = mpg)) +
  geom_point(aes(y = hp, shape = 'Horsepower')) +
  geom_point(aes(y = disp, shape = 'Displacement'))

8

Now I can tell the difference between horsepower and displacement, and there is a legend off to the side explaining this. You can do the same thing with lines instead of points.

ggplot(data = mtcars, aes(x = mpg)) +
  geom_line(aes(y = hp)) +
  geom_line(aes(y = disp))

9

Instead of shape, use linetype to differentiate between variables.

ggplot(data = mtcars, aes(x = mpg)) +
  geom_line(aes(y = hp, linetype = 'Horsepower')) +
  geom_line(aes(y = disp, linetype = 'Displacement'))

10

You can change the color of each plotted variable too. If you add the label inside aes(), then R picks the line color for you.

ggplot(data = mtcars, aes(x = mpg)) +
  geom_line(aes(y = hp, color = 'Horsepower')) +
  geom_line(aes(y = disp, color = 'Displacement'))

11

You can specify the color of each line by including the color outside aes().

ggplot(data = mtcars, aes(x = mpg)) +
  geom_line(aes(y = hp), color = 'blue') +
  geom_line(aes(y = disp), color = 'green')

12

Publishable Plots

In the following example, I am using the economics dataset that comes loaded with R. I want to plot variables across time, so I’ll use ‘date’ as the x axis.

ggplot(data = economics, aes(x = date))

13

The first variable I’ll plot is ‘psavert’ (Personal Savings Rate), and I’ll plot it as a line.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert))

14

The second variable I’ll plot is ‘uempmed’ (Duration of Unemployment measured in weeks).

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert)) +
  geom_line(aes(y = uempmed))

15

As before, the lines are indistinguishable. For this example, I want to make a black and white plot that I could publish with. So I’ll let R choose the line type for these two variables.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)'))

16

I have no need for the grid in the background so I can use the theme() layer to change this. You can remove some of the grid lines, or all of the grid lines.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme(panel.grid.major = element_blank())

17

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

18

If you want zero grid lines, you can skip a step and do this:

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme(panel.grid = element_blank())

19

I want to add axis lines for both x and y, and I want them to be black.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme(panel.grid = element_blank(),
        axis.line = element_line(color = 'black'))

20

We can also use preset themes. Most of these themes come with ggplot2 (loaded via the tidyverse library), but some of the more unique themes are part of the ggthemes library. I tend to use theme_bw() or theme_classic() when building my publishable plots.

Note: the premade/preset theme must go before the theme() specifications you make.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme_bw() +                                                                    #this theme adds a border around the plot
  theme(panel.grid = element_blank(),
        panel.border = element_blank(),                                           #this code removes the border
        axis.line = element_line(color = 'black'))

21

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme_classic()

22

theme_classic() does all of the things I did manually above, but in one line of code instead of several.

Currently the plot is very wide with the legend on the right, so I am going to move it to the bottom of the plot using the theme() options.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme_classic() +
  theme(legend.position = 'bottom')

23

This looks much better, but the axis titles and the legend title are still not publishable quality. I can fix the axis and plot titles using the labs() layer.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme_classic() +
  theme(legend.position = 'bottom') +
  labs(x = 'Date',
       y = NULL,
       title = 'Savings and Unemployment',
       subtitle = 'US Economic Data')

24

My plot now has no Y axis title, a grammatically correct x axis title, a plot title, and a subtitle. In this next step, I’ll go back to the theme() options and center the plot title and get rid of the legend title.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme_classic() +
  theme(legend.position = 'bottom',
        plot.title = element_text(hjust = 0.5),        #Center the title
        plot.subtitle = element_text(hjust = 0.5),     #Center the subtitle
        legend.title = element_blank()) +              #Remove the legend title
  labs(x = 'Date',
       y = NULL,
       title = 'Savings and Unemployment',
       subtitle = 'US Economic Data')

25

Now I have a beautiful black and white plot, with no raw variable names or code leaking into the labels. This can be exported from R as an image or as a PDF.
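
One way to do that export is with ggsave(), which saves the most recently displayed ggplot by default. A minimal sketch (the filenames and dimensions here are just examples):

# Save the last plot as a PNG and as a PDF
ggsave(filename = "savings_unemployment.png", width = 8, height = 5, dpi = 300)
ggsave(filename = "savings_unemployment.pdf", width = 8, height = 5)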

Sometimes a journal will want you to match the font of your plots to the font of your text. I’ve found a nice theme (from ggthemes library) I like to use for this.

ggplot(data = economics, aes(x = date)) +
  geom_line(aes(y = psavert, linetype = 'Personal Savings Rate')) +
  geom_line(aes(y = uempmed, linetype = 'Duration of Unemployment (weeks)')) +
  theme_tufte() +
  theme(legend.position = 'bottom',
        plot.title = element_text(hjust = 0.5),        
        plot.subtitle = element_text(hjust = 0.5),     
        legend.title = element_blank(),
        axis.line = element_line(color = 'black')) +    #This theme removes all axis lines, I've added them back here.        
  labs(x = 'Date',
       y = NULL,
       title = 'Savings and Unemployment',
       subtitle = 'US Economic Data')

26

There are so many more things you can do with ggplot; this is only a start to the possibilities. I highly recommend browsing the ggplot2 website and its posted cheat sheets to learn more: https://ggplot2.tidyverse.org/

Tidyverse Data Wrangling

By Ben Leonard

What is Data Wrangling?

Most of the time datasets will not come “out of the box” ready to use for analysis. Summarizing data and presenting audience friendly tables and figures requires quite a bit of data manipulation. A data wrangling task is performed anytime a dataset is cleaned, tweaked, or adjusted to fit a subsequent analytical step.

Why Use R?

Programs that provide an easy-to-use graphical user interface (GUI), such as Excel, can be used to perform data wrangling tasks. Since it can be faster to manipulate data in Excel, why should anyone take the extra time to write code in R?

  1. Repeatability. A cornerstone of the scientific process is producing a replicable result. If a dataset has to be heavily massaged to produce the desired result these steps must be documented. Otherwise it is easy for errors to creep into the final output. Without transparency it is difficult for reviewers to differentiate between honest mistakes and purposeful data fabrication. It is also difficult to perform any type of quality assurance protocol without proper documentation. By coding all analytical steps in R it is actually easier to leave a trail of evidence showing data wrangling steps than trying to do so in Excel by heavily annotating workbooks.

  2. Data Integrity. Excel often does unexpected things to data. For example, Excel will attempt to auto-recognize data types when entering data. If I try to enter the value "1-15", this automatically becomes the date "January 15, 2020". Excel also does not always respect the preservation and display of significant digits. Some formulas don’t follow sound statistical philosophy. Many advanced data maneuvers are absent from, or really difficult to perform in, Excel.

  3. Workflow. If R is required eventually for statistics, modeling, rmarkdown, shiny, etc., then keeping all steps "in house" just makes sense. Since R and RStudio provide most of the tools a data scientist needs, why muddy the waters by switching back and forth with Excel? However, it should be noted that there are many packages that make it easy to integrate the two approaches. For example, readxl is a tidyverse package for reading Excel workbooks. It is also possible to code Excel procedures using Visual Basic for Applications (VBA) and then programmatically execute this code from R. Many choose a hybrid approach in which Excel is used as a data exploration tool to figure out which steps need to be taken to accomplish a task in R.

Tidyverse Packages

# Installing tidyverse packages
# install.packages("tidyverse")

# Loading tidyverse packages
# library("tidyverse")

# Tidyverse information
# ?tidyverse

https://www.tidyverse.org/

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Install the complete tidyverse with:

install.packages("tidyverse")

Loading the tidyverse library is a great way to load many useful packages in one simple step. Several packages included in this distribution are great tools for data wrangling. Notably, dplyr, readr, and tidyr are critical assets. The general philosophy is that if a number of steps are needed for a data wrangling task, they can be modularized into a streamlined set of human-readable commands.

https://dplyr.tidyverse.org/

https://readr.tidyverse.org/

https://tidyr.tidyverse.org/

Criticism

Some argue that tidyverse syntax is not intuitive and that it should not be taught to students first learning R. Much of the basis for this comes from the avoidance of certain base R staples such as the “$” operator, “[[]]”, loops, and the plot command.

https://github.com/matloff/TidyverseSkeptic

Pipe Operator

“%>%”

The pipe operator is the crux of all tidyverse data operations. It strings together multiple functions without the need for nesting calls or creating intermediate objects. If a number of operations need to be performed on a single dataframe, this approach simplifies the process. The following image shows the mathematical principle of how the pipe operator works by bridging functions without explicitly nesting them. In this case, the steps A %>% B %>% C represent the tidyverse “pipe syntax” while f(g(x)) is the base R approach.

The examples below show how to perform a few simple tasks in both base R and tidyverse “pipe syntax”. Note that for both of these cases the base R syntax may actually be preferred in terms of brevity, but the tidyverse approaches are slightly more readable.

## Task #1: Determine the mean of a vector of three doubles.

# Create a vector of values
a <- c(12.25, 42.50, 75.00)

# Base R
round(mean(a), digits = 2)
## [1] 43.25
# Tidyverse
a %>%
  mean() %>%
  round(digits = 2)
## [1] 43.25
## Task #2: Extract the unique names of all storms in the "storms" dataset and view the first few results.

# Base R
head(unique(storms$name))
## [1] "Amy"      "Caroline" "Doris"    "Belle"    "Gloria"   "Anita"
# Tidyverse
storms %>%
  pull(name) %>%
  unique() %>%
  head()
## [1] "Amy"      "Caroline" "Doris"    "Belle"    "Gloria"   "Anita"

Reading Data

read_csv() and write_csv()

While base R does provide analogous functions (read.csv and write.csv), the versions included in readr are 5 to 10 times faster. This speed, in addition to a progress bar, makes these functions preferable for large files. The dataframes produced are also always tibbles and don’t convert columns to factors or use rownames by default. Additionally, read_csv is able to make better determinations about data types and has more sensible defaults.

The example below shows how read.csv is unable to identify time series data automatically and assigns a column class of “character”. However, read_csv correctly reads in the data without having to specify any column data types.

## Task #1: Read in a dataset of time series data and determine the default class assignments for each column data type.

# Base R
sapply(read.csv("weather.csv"), class)
##   timestamp       value 
## "character"   "numeric"
# Tidyverse
read_csv("weather.csv") %>%
  sapply(class)
## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   value = col_double()
## )
## $timestamp
## [1] "POSIXct" "POSIXt" 
## 
## $value
## [1] "numeric"

tribble()

Sometimes it is necessary to define a small dataframe in your code rather than reading data from an external source such as a csv file. The tribble() function makes this very easy by allowing row-wise definition of dataframes in tibble format.

Note. A tibble is a type of dataframe specific to the tidyverse. See the following description:

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

https://tibble.tidyverse.org/

The example below shows how much more intuitive it is to create data frames using tribble. Row-wise definitions make it simple to enter tabular data and manually update datasets when need be. The use of tribble also guarantees a tibble output which is then going to be fully compatible with all tidyverse functions. A tibble is also achieved without row-wise syntax using tibble() (formerly data_frame()).

## Task #2: Manually create a data frame.

# Base R
class(data.frame("col1" = c(0.251, 2.253),
                 "col2" = c(1.231, 5.230)))
## [1] "data.frame"
# Tidyverse #1
class(tibble("col1" = c(0.251, 2.253),
             "col2" = c(1.231, 5.230)))
## [1] "tbl_df"     "tbl"        "data.frame"
# Tidyverse #2
tribble(
  ~col1, ~col2,
  0.251, 1.231,
  2.253, 5.230,
  1.295, 3.192, # New Data
  ) %>%
  class()
## [1] "tbl_df"     "tbl"        "data.frame"

Manipulating Data

Much of data wrangling comes down to data manipulations performed over and over again. A compact set of functions makes this easy in the tidyverse without having to nest functions or define a bunch of intermediate variables. While many base R equivalents exist, the example below performs a set of commands using only tidyverse syntax.

select() extracts columns and returns a tibble.

arrange() changes the ordering of the rows.

filter() picks cases based on their values.

mutate() adds new variables that are functions of existing variables.

rename() easily changes the name of one or more columns.

pull() extracts a single column as a vector.

# Task #1: Perform numerous data manipulation operations on a dataset.

ToothGrowth %>%
  mutate(dose2x = dose*2) %>% # Double the value of dose
  select(-dose) %>% # Remove old dose column
  arrange(-len) %>% # Sort length from high to low
  filter(supp == "OJ") %>% # Filter for only orange juice supplement
  rename(length = len) %>% # Rename len to be more explicit
  pull(length) %>% # Extract length as a vector
  head()
## [1] 30.9 29.4 27.3 27.3 26.4 26.4

Summarizing Data

group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”; most data operations are done on groups defined by variables. ungroup() removes grouping.

summarize() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

summarise() and summarize() are synonyms.

pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.

Data summarization often requires some difficult maneuvering. Since data summaries may be used for subsequent analyses, and may even be a deliverable themselves, it is important to have a well-documented and clear set of procedures. The following procedures create tabular data summaries much like pivot tables in Excel. However, unlike pivot tables in Excel, the steps here are well documented.

## Task #1: Summarize the prices of various diamond cuts using a variety of statistical measures and then properly format the results.

# Note: dollar() below comes from the scales package (library(scales) or scales::dollar)
diamonds %>%
  group_by(cut) %>%
  summarize(mean = mean(price)) %>%
  mutate(mean = dollar(mean))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   cut       mean     
##   <ord>     <chr>    
## 1 Fair      $4,358.76
## 2 Good      $3,928.86
## 3 Very Good $3,981.76
## 4 Premium   $4,584.26
## 5 Ideal     $3,457.54
# Tip: Using _at will apply one or more functions to one or more variables.
diamonds %>%
  group_by(cut) %>%
  summarize_at(vars(price), funs(mean, sd, min, median, max)) %>%
  mutate_at(vars(-cut), dollar)
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 5 x 6
##   cut       mean      sd        min   median    max    
##   <ord>     <chr>     <chr>     <chr> <chr>     <chr>  
## 1 Fair      $4,358.76 $3,560.39 $337  $3,282.00 $18,574
## 2 Good      $3,928.86 $3,681.59 $327  $3,050.50 $18,788
## 3 Very Good $3,981.76 $3,935.86 $336  $2,648.00 $18,818
## 4 Premium   $4,584.26 $4,349.20 $326  $3,185.00 $18,823
## 5 Ideal     $3,457.54 $3,808.40 $326  $1,810.00 $18,806
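
As the warning above notes, funs() is deprecated. Here is a sketch of the same summary written with across() (dplyr 1.0 or later); dollar() is again assumed to come from the scales package.

# The same statistics as above, computed with across() instead of the deprecated funs()
diamonds %>%
  group_by(cut) %>%
  summarize(across(price, list(mean = mean, sd = sd, min = min,
                               median = median, max = max),
                   .names = "{.fn}")) %>% # keep the short column names
  mutate(across(-cut, scales::dollar))
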
## Task #2: a) Create a presence/absence table for fish encounters where each station is a different column. b) Then condense the table to get total fish counts per station.

# a)
df <- fish_encounters %>%
  mutate(seen = "X") %>% # Set value to X for seen
  pivot_wider(names_from = station, values_from = seen) # Move stations to columns

df %>% head()
## # A tibble: 6 x 12
##   fish  Release I80_1 Lisbon Rstr  Base_TD BCE   BCW   BCE2  BCW2  MAE   MAW  
##   <fct> <chr>   <chr> <chr>  <chr> <chr>   <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4842  X       X     X      X     X       X     X     X     X     X     X    
## 2 4843  X       X     X      X     X       X     X     X     X     X     X    
## 3 4844  X       X     X      X     X       X     X     X     X     X     X    
## 4 4845  X       X     X      X     X       <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 5 4847  X       X     X      <NA>  <NA>    <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 6 4848  X       X     X      X     <NA>    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
# b)
df %>%
  pivot_longer(-fish, names_to = "station") %>% # Move back to long format
  group_by(station) %>%
  filter(!is.na(value)) %>% # Remove NAs
  summarize(count = n()) # Count number of records
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 11 x 2
##    station count
##    <chr>   <int>
##  1 Base_TD    11
##  2 BCE         8
##  3 BCE2        7
##  4 BCW         8
##  5 BCW2        7
##  6 I80_1      19
##  7 Lisbon     13
##  8 MAE         5
##  9 MAW         5
## 10 Release    19
## 11 Rstr       12

Complex Example

While this document follows some pretty simple examples, using real-life data can get complicated quickly. In the last example, I use real data from a study looking at tree transpiration. Data received from Campbell data loggers must be heavily manipulated to calculate “v”, the velocity of sap movement, for each timestamp. A function is defined which performs all of the data wrangling steps and allows the user to change the data sources and censoring protocols. While there is no need to fully understand what is going on in this example, it is important to note that each step is documented with in-line comments. This tidyverse recipe, saved as a function, can then be added to a custom package and loaded as needed.

# Define a function that accomplishes numerous data wrangling steps given several input variables.
# In addition to the tidyverse, this function uses lubridate (hour) and zoo (na.approx).
tdp <- function(
  path, # Path to data files
  meta, # Name of metadata file
  min, # Minimum accepted value
  max, # Maximum accepted value
  mslope # Minimum accepted slope
) {

  list.files(path = path, # Define subfolder
                  pattern = "*.dat", # Pattern for file suffix
                  full.names = TRUE, # Use entire file name
                  recursive = TRUE) %>% # Create list of dat files
  map_df(~ read_csv(
    .x,
    skip = 4, # Skip first four lines
    col_names = F, # Don't auto assign column names
    col_types = cols(), # Don't output
    locale = locale(tz = "Etc/GMT+8") # Set timezone
  ) %>% # Read in all dat files
    select(-c(37:ncol(.))) %>% # Remove extra columns
    mutate(temp = as.integer(str_extract(.x, "(?<=_)\\d{1}(?=_TableTC.dat)")))) %>% # Extract site number from file name
  `colnames<-`(c("timestamp", "record", "jday", "jhm", 1:32, "plot_num")) %>% # Manually define column names
  select(plot_num, everything()) %>% # Move plot_num to be the first column
  pivot_longer(
    names_to = "channel",
    values_to = "dTCa",
    -(1:5),
    names_transform = list(channel = as.double)) %>% # Flatten data
  left_join(read_csv(meta, 
                     col_types = cols())) %>% # Merge tree metadata
  filter(!is.na(tree_num)) %>% # Remove NAs associated with channels lacking a tree number
  unite(tree_id,
        c(plot_num, tree_num, tree_type), # Combine three variables
        sep = "-", # Use - as delimiter
        remove = FALSE) %>% # Create tree IDs
  arrange(plot_num, channel, timestamp) %>%
  mutate(date = as.Date(timestamp, tz = "Etc/GMT+8"), # Extract date
         hour = hour(timestamp), # Extract hour of day
         am = if_else(hour < 12, T, F), # True vs False for morning hours
         slope = dTCa - lag(dTCa), # Calculate dTCa slope for each timestep
         flag = ifelse(dTCa < min | dTCa > max | slope > mslope, TRUE, FALSE)) %>% # Flag values that are out of bounds
  filter(!flag) %>% # Filter and remove values that are out of bounds
  group_by(plot_num, channel) %>% # Group data by day/site/channel
  filter(date != min(date)) %>% # Remove first day of data collection
  group_by(plot_num, channel, date, am) %>% # include date and morning in grouping
  mutate(dTM = max(dTCa)) %>% # Calculate maximum temperature difference per day/site/channel
  ungroup() %>%
  mutate(baseline = if_else(dTCa == dTM & am, dTM, NA_real_), # Force dTM to occur in the morning
         diff = difftime(timestamp, lag(timestamp), units = "mins"), # Calculate time difference
         id = if_else(diff != 15 | is.na(diff), 1, 0), # Identify non-contiguous periods of time
         id = cumsum(id)) %>% # Unique IDs will be produced for contiguous periods of data collection
  group_by(plot_num, channel, id) %>%
  mutate(baseline = na.approx(baseline, na.rm = F, rule = 2)) %>% # Interpolate between baselines
  ungroup() %>%
  mutate(k = (baseline - dTCa) / dTCa, # Calculate the sap flux index
         v = 0.0119 * (k ^ 1.231) * 60 * 60, # Calculate sap velocity in cm/hr using Granier formula
         v = replace_na(v, 0)) %>% # Replace NAs with 0s
  group_by(tree_id, date, probe_depth) %>%
  summarize(sum_v = sum(v)) # Sum velocity by tree, date, and probe depth
}

# Perform data wrangling steps
tdp(path = "data", meta = "channels.csv", min = 2, max = 20, mslope = 1) %>%
  head()
## Joining, by = c("plot_num", "channel")
## `summarise()` regrouping output by 'tree_id', 'date' (override with `.groups` argument)
## # A tibble: 6 x 4
## # Groups:   tree_id, date [2]
##   tree_id date       probe_depth sum_v
##   <chr>   <date>           <dbl> <dbl>
## 1 1-1-DF  2020-07-03          15 155. 
## 2 1-1-DF  2020-07-03          25  79.2
## 3 1-1-DF  2020-07-03          50 185. 
## 4 1-1-DF  2020-07-03          90  29.6
## 5 1-1-DF  2020-07-04          15 156. 
## 6 1-1-DF  2020-07-04          25  83.4

An introduction to linear models with R

By Mikala Meize

Libraries used

library(QuantPsyc)
library(lmtest)
library(palmerpenguins)

Linear model information

Linear models are generally coded in this format:
model.name <- modeltype(outcome ~ predictor, data, specifications)

The following code will apply specifically to Ordinary Least Squares (OLS) regression, which uses a continuous outcome/dependent variable.

We will use the mtcars dataset that comes with R and the penguins dataset that comes with the ‘palmerpenguins’ library for these examples.

Linear models with one predictor

Outcome variable: mtcars$mpg

#checking for normal distribution in the outcome variable
hist(mtcars$mpg)

fig1-hist

#checking for outliers
boxplot(mtcars$mpg)

fig2-boxplot

Predictor variable: mtcars$wt
Checking for a linear relationship between outcome and predictor variables using a plot() code:

plot(mtcars$mpg, mtcars$wt)

fig3-scatter

scatter.smooth(mtcars$mpg, mtcars$wt)

fig4-scatterline

Just for some useful information: checking the strength of that linear relationship using correlation functions.
Both of these functions give us the correlation between the variables, but cor.test() also outputs the results of a significance test.

cor(mtcars$mpg, mtcars$wt) #correlation between x and y variables
## [1] -0.8676594
cor.test(mtcars$mpg, mtcars$wt) #correlation with significance test
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594

Code for single predictor linear model

The code for a linear model using a single predictor follows the basic format given above.

  • Model type: lm()
  • Outcome: mpg
  • Predictor: Weight (wt)
  • Data: mtcars
mpg.model <- lm(mpg ~ wt, mtcars)

The model shows up in the global environment. To see the results of the model, run a summary() code.

summary(mpg.model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Interpreting the results of a single predictor linear model

  • The F-test/Anova at the end of the results is statistically significant. This means our model with one predictor is stronger than a null model with no predictors.
  • The Multiple R-Squared value indicates that the predictor (weight) explains about 75% of the variance in the outcome (miles per gallon). The Adjusted R-Squared value has the same interpretation as the Multiple R-Squared value, but penalizes more complex models. As we increase predictors, the multiple R-squared is expected to increase simply because we added something else to explain the outcome; however, the adjusted R-squared value penalizes models because it’s possible we’ve added a variable that does not actually increase the explanatory power of our model.
  • The predictor (weight) is statistically significant and the coefficient is negative. This matches the visual relationship in the plot() code from above. Simply put, as the weight of a car increases, the miles per gallon decreases.

The format of an equation for a linear model is: Outcome = Intercept + Coefficient*Predictor
If we put these values into an equation, it would look like this:

  • mpg = 37.2851 – 5.3445(weight)
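As a quick sanity check of this equation (not part of the original post), predict() evaluates the fitted model at any weight we choose; the value of 3 (i.e., 3,000 lbs) below is purely hypothetical.

# Evaluate the fitted model at a hypothetical weight of 3 (3,000 lbs)
predict(mpg.model, newdata = data.frame(wt = 3))

# The same value by hand from the equation above; both are roughly 21.25 mpg
37.2851 - 5.3445 * 3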

Linear models with two or more predictors

  • Outcome variable: mtcars$mpg
  • Predictor variable 1: mtcars$wt
  • Predictor variable 2: mtcars$hp

The relationship between mpg and wt has not changed in the data, so when we add a predictor, we only run the linearity checks on the second predictor.
Checking for a linear relationship between the outcome and predictor variable 2 using plot():

plot(mtcars$mpg, mtcars$hp)

fig5-scatter

scatter.smooth(mtcars$mpg, mtcars$hp)

fig6-scatterline

Just for some useful information: checking the strength of that linear relationship using correlation functions.
Both of these functions give us the correlation between the variables, but cor.test() also outputs the results of a significance test.

cor(mtcars$mpg, mtcars$hp) #correlation between x and y variables
## [1] -0.7761684
cor.test(mtcars$mpg, mtcars$hp) #correlation with significance test
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684

Code for multiple predictor linear model

The code for a linear model using multiple predictors follows the basic format given above.

  • Model type: lm()
  • Outcome: mpg
  • Predictor(s): weight (wt) + horsepower (hp)
  • Data: mtcars
mpg.model2 <- lm(mpg ~ wt + hp, mtcars)

The model shows up in the global environment. To see the results of the model, run a summary() code.

summary(mpg.model2)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Interpreting the results of a multiple predictor linear model

  • The F-test/Anova at the end of the results is statistically significant. This means our model with two predictors is stronger than a null model with no predictors.
  • We use the Adjusted R-Squared value here because we’ve increased the complexity of our model by adding another predictor. In this case, our model now accounts for about 81% of the variance in miles per gallon. This is an increase over the previous model with only a single predictor.
  • Weight (wt), as a predictor, is still statistically significant and the coefficient is negative. As a car’s weight increases, the miles per gallon decreases, holding horsepower constant (or controlling for horsepower as it’s often written).
  • Horsepower (hp), as a predictor, is statistically significant and the coefficient is negative. As a car’s horsepower increases, the miles per gallon decreases, holding weight constant (or controlling for weight).

If we put these values into an equation, it would look like this:

  • mpg = 37.22727 – 3.87783(weight) – 0.03177(horsepower)

Linear Models with Binary/Dichotomous Predictors (Dummy variables)

Sometimes we want to use a dichotomous variable to predict an outcome. In this example, we’ll use the transmission variable (am) which is coded as 1 = manual, 0 = automatic.

If your dichotomous variable is coded as numeric values, use as.factor() to add it to the model. R will then treat the variable as a factor rather than as numeric.

mpg.model3 <- lm(mpg ~ wt + hp + as.factor(am), mtcars)

This model is now in the global environment. Run a summary() to see the results.

summary(mpg.model3)
## 
## Call:
## lm(formula = mpg ~ wt + hp + as.factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    34.002875   2.642659  12.867 2.82e-13 ***
## wt             -2.878575   0.904971  -3.181 0.003574 ** 
## hp             -0.037479   0.009605  -3.902 0.000546 ***
## as.factor(am)1  2.083710   1.376420   1.514 0.141268    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

Interpreting the results of a linear model using binary/dichotomous variables

  • The F-test/Anova at the end of the results is statistically significant. This means our model with these predictors is stronger than a null model with no predictors.
  • We use the Adjusted R-Squared value here because we’ve increased the complexity of our model by adding another predictor. In this case, our model now accounts for about 82% of the variance in miles per gallon. This is a slight increase over the previous model with two predictors.
  • Weight (wt), as a predictor, is still statistically significant and the coefficient is negative. As a car’s weight increases, the miles per gallon decreases, holding horsepower and transmission type constant (or controlling for horsepower and transmission type, as it’s often written).
  • Horsepower (hp), as a predictor, is statistically significant and the coefficient is negative. As a car’s horsepower increases, the miles per gallon decreases, holding weight and transmission type constant (or controlling for weight and transmission type).
  • Transmission type (am) is not a statistically significant predictor. This means that there is not a statistically significant difference between manual or automatic transmission miles per gallon values, controlling for weight and horsepower. If this were significant, we would interpret the coefficient as the increase in miles per gallon when the variable is 1 (when it’s a manual transmission); this is not an increase as the transmission type increases, because that doesn’t make sense when there are only two options.

If we put these values into an equation, it would look like this:

  • mpg = 34.002875 – 2.878575(weight) – 0.037479(horsepower) + 2.08371(transmission type)

Linear Models With Categorical Predictors (More Than Two Options)

It is common to use a predictor variable that is not numeric, and has more than two options. In this example, we’ll use the penguins dataset from the library ‘palmerpenguins’. The species variable has three options.

summary(penguins$species)
##    Adelie Chinstrap    Gentoo 
##       152        68       124

Let’s predict bill length with species.

  • Outcome: Bill length (bill_length_mm)
  • Predictor: species
  • Data: penguins

If your categorical variable is coded as numeric values, you can use the as.factor() code to have R treat it like a factor rather than a set of related numbers. This is common when using months as a variable.

bill.model1 <- lm(bill_length_mm ~ as.factor(species), penguins)

This model should be in the global environment. To see the results use: summary()

summary(bill.model1)
## 
## Call:
## lm(formula = bill_length_mm ~ as.factor(species), data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  38.7914     0.2409  161.05   <2e-16 ***
## as.factor(species)Chinstrap  10.0424     0.4323   23.23   <2e-16 ***
## as.factor(species)Gentoo      8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.7061 
## F-statistic: 410.6 on 2 and 339 DF,  p-value: < 2.2e-16

If your variable is already coded as a factor, simply use the variable name in the code.

bill.model2 <- lm(bill_length_mm ~ species, penguins)

This model should be in the global environment. To see the results use: summary()

summary(bill.model2)
## 
## Call:
## lm(formula = bill_length_mm ~ species, data = penguins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       38.7914     0.2409  161.05   <2e-16 ***
## speciesChinstrap  10.0424     0.4323   23.23   <2e-16 ***
## speciesGentoo      8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.7061 
## F-statistic: 410.6 on 2 and 339 DF,  p-value: < 2.2e-16

You can see that the results from bill.model1 and bill.model2 are the same.

One other way to use categorical variables in a linear model is to code each option as its own dummy variable.

penguins2 <- penguins
penguins2$adelie <- ifelse(penguins2$species == 'Adelie', 1, 0)
penguins2$chinstrap <- ifelse(penguins2$species == 'Chinstrap', 1, 0)
penguins2$gentoo <- ifelse(penguins2$species == 'Gentoo', 1, 0)

Each of the species is now its own variable. A one (1) indicates that the penguin is that species. If a one (1) is coded for Adelie, then Chinstrap and Gentoo must be zeros (0).

When adding the categorical variable to the model, you must leave out one category. This is your reference category. R automatically chooses the reference category when you use as.factor() or leave the categorical variable as a single factor variable. Using dummy variables is a good option if you want to pick your own reference category.

bill.model3 <- lm(bill_length_mm ~ chinstrap + gentoo, penguins2)

To see the results, run summary()

summary(bill.model3)
## 
## Call:
## lm(formula = bill_length_mm ~ chinstrap + gentoo, data = penguins2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7914     0.2409  161.05   <2e-16 ***
## chinstrap    10.0424     0.4323   23.23   <2e-16 ***
## gentoo        8.7135     0.3595   24.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.7061 
## F-statistic: 410.6 on 2 and 339 DF,  p-value: < 2.2e-16

As you can see, each of these models produces the same results because the reference category in all three is “Adelie.”

Interpreting the results of a linear model using categorical variables

  • The F-test/Anova at the end of the results is statistically significant. This means our model with these predictors is stronger than a null model with no predictors.
  • The multiple r-squared value indicates that about 71% of the variance in bill length of a penguin is explained by the species.
  • The Chinstrap category is statistically significant; the Chinstrap penguins have a longer bill length compared to Adelie penguins.
  • The Gentoo category is statistically significant; the Gentoo penguins have a longer bill length compared to the Adelie penguins.

If we put these values into an equation, it would look like this:

  • bill length = 38.7914 + 10.0424(Chinstrap) + 8.7135(Gentoo)

The intercept, then, is the average bill length of Adelie penguins, because the Chinstrap and Gentoo terms in the equation are both zero (0) when looking at Adelie penguins.
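A quick way to confirm this, using the palmerpenguins data loaded above, is to compute the Adelie mean directly; it should match the intercept (about 38.79):

# Mean Adelie bill length (NAs removed) should equal the model intercept
mean(penguins$bill_length_mm[penguins$species == "Adelie"], na.rm = TRUE)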

If you wanted to use Chinstrap as the reference category, you would use a model like this, leaving out the chosen reference category:

bill.model4 <- lm(bill_length_mm ~ adelie + gentoo, penguins2)
summary(bill.model4)
## 
## Call:
## lm(formula = bill_length_mm ~ adelie + gentoo, data = penguins2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9338 -2.2049  0.0086  2.0662 12.0951 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.8338     0.3589 136.052  < 2e-16 ***
## adelie      -10.0424     0.4323 -23.232  < 2e-16 ***
## gentoo       -1.3289     0.4473  -2.971  0.00318 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.96 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.7061 
## F-statistic: 410.6 on 2 and 339 DF,  p-value: < 2.2e-16

In this case, we can now see that Adelie penguins and Gentoo penguins both have shorter bills than the Chinstrap penguins, as the coefficients for the Adelie and Gentoo penguins are negative.

Comparing models

How do we know whether adding a variable was useful to the model?

One option is to compare the adjusted r-squared values between models. When we look at the results from mpg.model and mpg.model2, there is an increase in the adjusted r-squared value when we added the horsepower variable.

summary(mpg.model)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
summary(mpg.model2)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

However, when we compare mpg.model2 and mpg.model3, the increase in adjusted r-squared is not substantial.

summary(mpg.model2)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
summary(mpg.model3)
## 
## Call:
## lm(formula = mpg ~ wt + hp + as.factor(am), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    34.002875   2.642659  12.867 2.82e-13 ***
## wt             -2.878575   0.904971  -3.181 0.003574 ** 
## hp             -0.037479   0.009605  -3.902 0.000546 ***
## as.factor(am)1  2.083710   1.376420   1.514 0.141268    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

Another option is to compare AIC (Akaike’s information criterion) and BIC (Bayesian information criterion) values. These are comparison statistics that penalize more complex models. When comparing two models, the one with a lower AIC or BIC value is usually the stronger choice.

AIC(mpg.model)
## [1] 166.0294
AIC(mpg.model2)
## [1] 156.6523
AIC(mpg.model3)
## [1] 156.1348
BIC(mpg.model)
## [1] 170.4266
BIC(mpg.model2)
## [1] 162.5153
BIC(mpg.model3)
## [1] 163.4635

The AIC value is lower for mpg.model2 than mpg.model. The AIC is again lower for mpg.model3, but by less than 1. The BIC value for mpg.model2 is lower than mpg.model and mpg.model3.

Lastly, we can compare nested models with an anova/F-test. A significant result would indicate that the larger model is in fact statistically different from the smaller model, so the addition of the predictor variable was useful. A non-significant result would mean the addition of the predictor variable is not useful and should be removed from further models.

anova(mpg.model, mpg.model2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + hp
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     30 278.32                                
## 2     29 195.05  1    83.274 12.381 0.001451 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Mpg.model2 is in fact different from mpg.model. In combination with the adjusted r-squared values and the AIC values, mpg.model2 is the stronger model.

anova(mpg.model2, mpg.model3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ wt + hp
## Model 2: mpg ~ wt + hp + as.factor(am)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     29 195.05                           
## 2     28 180.29  1    14.757 2.2918 0.1413

Mpg.model3 is not different from mpg.model2. In combination with the adjusted r-squared values and the AIC values, mpg.model2 is still the stronger model, and adding the transmission type did not better the model.

Checking model assumptions

Now that we’ve decided mpg.model2 is the best of the mpg models, let’s check the assumptions. We can do this using the plot() function. The first line of code allows us to see all four plots in one window. The second line of code gives us the actual plots. The third line of code is a statistical test for homoscedasticity.

par(mfrow=c(2,2))
plot(mpg.model2)

fig7-assumptions

lmtest::bptest(mpg.model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  mpg.model2
## BP = 0.88072, df = 2, p-value = 0.6438

We use the lower left plot to address the assumption of homoscedasticity. If we see a funnel, then we’ve violated this assumption. The Breusch-Pagan test allows us to check the visual conclusions we might come to. This model does not violate the homoscedasticity assumption either visually or statistically.
We use the upper right plot to address normality (this is sometimes referred to as a Q-Q plot). If the points follow a linear pattern, we’ve met this assumption.

In this case, we’ve met both the homoscedasticity and the normality assumptions.

To check the linearity assumption, we plot the residuals of the model against the predictors of the model.

par(mfrow=c(1,2))
plot(mtcars$wt, (residuals(mpg.model2)))
plot(mtcars$hp, (residuals(mpg.model2)))

fig8-dualassumptions

###add a loess curve
scatter.smooth(mtcars$wt, residuals(mpg.model2))
scatter.smooth(mtcars$hp, residuals(mpg.model2))

fig9-dualassumptions2

We don’t want to see a trend in these plots. If they look random, we’ve met the linearity assumption. At first glance, these plots do look random; after adding a trend line, it’s possible there is an underlying pattern that wasn’t captured in the model. You could try transforming the predictor variables to account for this.
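As one possible illustration of such a transformation (a sketch, not part of the original analysis), the predictors could be log-transformed and the residual plots re-checked:

# One option to explore: log-transform the predictors and refit
mpg.model.log <- lm(mpg ~ log(wt) + log(hp), mtcars)
summary(mpg.model.log)

# Re-check the residuals against the transformed predictors
par(mfrow = c(1, 2))
scatter.smooth(log(mtcars$wt), residuals(mpg.model.log))
scatter.smooth(log(mtcars$hp), residuals(mpg.model.log))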

An Introduction to data.table using synthetic lat/lon data

By Nicholas A. Potter

This is a short introduction to using data.table based on my work with climate data.

Initial setup

Let’s begin by generating some data to work with. We’ll use the following packages:

# If you don't have the packages below, install with:
#install.packages(c("data.table", "readr", "dplyr", "microbenchmark"))

library(data.table) 
library(dplyr) # tidyverse data manipulation
library(readr) # tidyverse reading in data
library(microbenchmark) # for timing code to see which is faster

We will work with a grid of latitude and longitude points, and we’ll generate some fake temperature and precip data:

# grid resolution
res <- 0.0416666666666667

# Grid extents
lats <- seq(  24.358333333333333,  52.909999999999997, by = res)
lons <- seq(-124.933333300000000, -66.016666633333330, by = res)

# Define a data.table
dt <- data.table(expand.grid(lat = lats, lon = lons))

# default printing of data.tables is screen friendly:
dt
##              lat        lon
##      1: 24.35833 -124.93333
##      2: 24.40000 -124.93333
##      3: 24.44167 -124.93333
##      4: 24.48333 -124.93333
##      5: 24.52500 -124.93333
##     ---                    
## 970686: 52.73333  -66.01667
## 970687: 52.77500  -66.01667
## 970688: 52.81667  -66.01667
## 970689: 52.85833  -66.01667
## 970690: 52.90000  -66.01667
# Assume:
# GDD is a function of latitude (Areas closer to the poles are less warm)
# Precip is random

# Equivalent to dplyr::mutate(gdd = ..., prec = ...)
dt[, `:=`(
  gdd = 3000 - 70*(lat-24) + rnorm(n = nrow(dt), sd = 100),
  prec = pmax(0,rnorm(n = nrow(dt), mean = 12, sd = 3)))]
dt
##              lat        lon       gdd
##      1: 24.35833 -124.93333 3056.2775
##      2: 24.40000 -124.93333 3234.5665
##      3: 24.44167 -124.93333 2920.4358
##      4: 24.48333 -124.93333 2850.3544
##      5: 24.52500 -124.93333 2980.4107
##     ---                              
## 970686: 52.73333  -66.01667 1069.3607
## 970687: 52.77500  -66.01667  812.7283
## 970688: 52.81667  -66.01667  920.2619
## 970689: 52.85833  -66.01667 1137.4059
## 970690: 52.90000  -66.01667  946.3984
##             prec
##      1: 10.66249
##      2: 12.14676
##      3: 10.33538
##      4: 10.30929
##      5: 14.82836
##     ---         
## 970686: 12.03940
## 970687: 13.76590
## 970688: 11.02060
## 970689: 11.92815
## 970690: 17.25918
# For comparison, a data.frame
df <- as.data.frame(dt)

A good reference comparing data.table and dplyr:

https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

If you’re wedded to dplyr grammar and don’t want to learn data.table, try dtplyr:

https://dtplyr.tidyverse.org/
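Here is a minimal dtplyr sketch (assuming the dtplyr package is installed), using the dt object defined above: you write dplyr verbs and they are translated into data.table operations.

# A minimal dtplyr sketch; dt is the data.table created earlier
library(dtplyr)

lazy_dt(dt) %>%
  mutate(lat0 = round(lat), lon0 = round(lon)) %>%
  group_by(lat0, lon0) %>%
  summarize(gdd = mean(gdd)) %>%
  as_tibble() # nothing is computed until the result is collected here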

Why data.table?

When is data.table perhaps better than the tidyverse? I like and use both, so I don’t hold with the idea that there is one that is better. Instead, there are specific reasons to use each. I use data.table when:

1. I need complex transformations on large data sets

As the number of rows increases, data.table becomes increasingly faster than a data.frame or tibble. This can turn a several-day process into a day or less, which is huge when you inevitably end up rerunning things. By converting a script to use data.table and sp, and by refactoring functions, I decreased the processing time for one year of climate data from four days to five hours. I had 90 years of data to process for a minimum of six different models…

For example, let’s time a summary of gdd at a more coarse lat/lon grid:

microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon,2))],
  times = 10, unit = "ms")
## Unit: milliseconds
##                                                                             expr
##  dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon,      2))]
##       min       lq     mean   median
##  263.4171 265.3146 302.2987 268.0365
##        uq      max neval
##  355.6186 364.7961    10
microbenchmark(
  df %>%
    #mutate(lat = round(lat, 2), lon = round(lon, 2)) %>%
    group_by(lat = round(lat, 2), lon = round(lon, 2)) %>%
    summarize(gdd = mean(gdd)),
  times = 10, unit = "ms")
## Unit: milliseconds
##                                                                                           expr
##  df %>% group_by(lat = round(lat, 2), lon = round(lon, 2)) %>%      summarize(gdd = mean(gdd))
##       min       lq     mean   median
##  2156.705 2544.885 2627.235 2661.827
##        uq      max neval
##  2745.505 2816.072    10

2. tidyverse grammar gets too long / complex

People will also say this is a negative, because other people can’t read your code as easily. I’m not sure I agree. In my experience as a researcher, we don’t collaboratively write code often. Your most common reader is going to be yourself in 3-6 months when you have to revisit something. So documenting and writing clear code is important, but data.table is clear, it’s just a different language than the tidyverse. It is wonderfully succinct at times. For example, in dplyr you might write:

# dplyr approach
df %>%
  mutate(lat = round(lat), lon = round(lon)) %>%
  group_by(lat, lon) %>%
  summarize(gdd = mean(gdd))

# data.table
dt[, .(gdd = mean(gdd)), by = .(lat = round(lat), lon = round(lon))]

That doesn’t seem like much for one transformation, but if the number of transformations is high either because you have multiple data files or multiple variables that all need a different transformation, the difference in code length is substantial. It’s much easier to look at 20 lines of data.table transformations than it is to look at 50 lines of dplyr transformations to accomplish the same thing.

Using data.table

The grammar of data.table is DT[i,j, other], where i is a logical selector on rows, j is operations on columns, and other is additional arguments for grouping, which columns to perform operations on, etc…
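For example, a single call can use all three parts at once (a small sketch using the dt object from above):

# i filters rows, j summarizes columns, by groups the result
dt[lat < 30, .(gdd = mean(gdd), prec = mean(prec)), by = .(lon0 = round(lon))]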

I Operations (select rows)

# Select using the "i" argument
dt[lat < 25] # equivalent to df %>% dplyr::filter(lat < 25)
##             lat        lon      gdd
##     1: 24.35833 -124.93333 3056.277
##     2: 24.40000 -124.93333 3234.566
##     3: 24.44167 -124.93333 2920.436
##     4: 24.48333 -124.93333 2850.354
##     5: 24.52500 -124.93333 2980.411
##    ---                             
## 22636: 24.81667  -66.01667 2902.917
## 22637: 24.85833  -66.01667 2901.789
## 22638: 24.90000  -66.01667 2822.173
## 22639: 24.94167  -66.01667 2830.030
## 22640: 24.98333  -66.01667 3019.428
##             prec
##     1: 10.662486
##     2: 12.146765
##     3: 10.335384
##     4: 10.309294
##     5: 14.828355
##    ---          
## 22636: 11.866986
## 22637:  9.461943
## 22638: 12.930729
## 22639: 14.225363
## 22640: 12.048109
dt[lat < 25 & lon > -67]
##           lat       lon      gdd      prec
##   1: 24.35833 -66.97500 2847.152 10.236523
##   2: 24.40000 -66.97500 2997.729 10.588268
##   3: 24.44167 -66.97500 2912.901 15.688600
##   4: 24.48333 -66.97500 2782.295 13.240460
##   5: 24.52500 -66.97500 3024.306 13.872522
##  ---                                      
## 380: 24.81667 -66.01667 2902.917 11.866986
## 381: 24.85833 -66.01667 2901.789  9.461943
## 382: 24.90000 -66.01667 2822.173 12.930729
## 383: 24.94167 -66.01667 2830.030 14.225363
## 384: 24.98333 -66.01667 3019.428 12.048109

J Operations (operate on columns)

# Perform an operation on specific columns
dt[, .(
  lat = mean(lat),
  lon = mean(lon),
  gdd = mean(gdd))]
##         lat     lon     gdd
## 1: 38.62917 -95.475 1975.92
# Alternatively, for all columns or just specific ones:
dt[, lapply(.SD, mean, na.rm = TRUE)] # equivalent to dplyr::transmute_all()
##         lat     lon     gdd     prec
## 1: 38.62917 -95.475 1975.92 11.99819
dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("gdd", "prec")]
##        gdd     prec
## 1: 1975.92 11.99819
# A more complicated function
# center GDD by removing the mean
dt[, .(lat, lon, d_gdd = gdd - mean(gdd))]
##              lat        lon      d_gdd
##      1: 24.35833 -124.93333  1080.3575
##      2: 24.40000 -124.93333  1258.6465
##      3: 24.44167 -124.93333   944.5158
##      4: 24.48333 -124.93333   874.4344
##      5: 24.52500 -124.93333  1004.4908
##     ---                               
## 970686: 52.73333  -66.01667  -906.5593
## 970687: 52.77500  -66.01667 -1163.1916
## 970688: 52.81667  -66.01667 -1055.6581
## 970689: 52.85833  -66.01667  -838.5141
## 970690: 52.90000  -66.01667 -1029.5216
# Perform operations on the same data set
dt[, gd_gdd := gdd - mean(gdd)] # equivalent to dplyr::mutate()

# For multiple variables at once:
dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec))]
dt
##              lat        lon       gdd
##      1: 24.35833 -124.93333 3056.2775
##      2: 24.40000 -124.93333 3234.5665
##      3: 24.44167 -124.93333 2920.4358
##      4: 24.48333 -124.93333 2850.3544
##      5: 24.52500 -124.93333 2980.4107
##     ---                              
## 970686: 52.73333  -66.01667 1069.3607
## 970687: 52.77500  -66.01667  812.7283
## 970688: 52.81667  -66.01667  920.2619
## 970689: 52.85833  -66.01667 1137.4059
## 970690: 52.90000  -66.01667  946.3984
##             prec     gd_gdd     gd_prec
##      1: 10.66249  1080.3575 -1.33570660
##      2: 12.14676  1258.6465  0.14857212
##      3: 10.33538   944.5158 -1.66280901
##      4: 10.30929   874.4344 -1.68889873
##      5: 14.82836  1004.4908  2.83016281
##     ---                                
## 970686: 12.03940  -906.5593  0.04120797
## 970687: 13.76590 -1163.1916  1.76771001
## 970688: 11.02060 -1055.6581 -0.97759204
## 970689: 11.92815  -838.5141 -0.07004544
## 970690: 17.25918 -1029.5216  5.26098430
# Or equivalently
dt[, c("gd_gdd", "gd_prec") := .(gdd - mean(gdd), prec - mean(prec))]

# Group transformations
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec)),
   by = .(lat0, lon0)]
dt
##              lat        lon       gdd
##      1: 24.35833 -124.93333 3056.2775
##      2: 24.40000 -124.93333 3234.5665
##      3: 24.44167 -124.93333 2920.4358
##      4: 24.48333 -124.93333 2850.3544
##      5: 24.52500 -124.93333 2980.4107
##     ---                              
## 970686: 52.73333  -66.01667 1069.3607
## 970687: 52.77500  -66.01667  812.7283
## 970688: 52.81667  -66.01667  920.2619
## 970689: 52.85833  -66.01667 1137.4059
## 970690: 52.90000  -66.01667  946.3984
##             prec     gd_gdd    gd_prec lat0
##      1: 10.66249   95.14628 -1.2601424   24
##      2: 12.14676  273.43529  0.2241363   24
##      3: 10.33538  -40.69542 -1.5872448   24
##      4: 10.30929 -110.77682 -1.6133345   24
##      5: 14.82836   45.97717  3.0862385   25
##     ---                                    
## 970686: 12.03940   62.21880  0.2354413   53
## 970687: 13.76590 -194.41356  1.9619434   53
## 970688: 11.02060  -86.87996 -0.7833587   53
## 970689: 11.92815  130.26402  0.1241879   53
## 970690: 17.25918  -60.74347  5.4552177   53
##         lon0
##      1: -125
##      2: -125
##      3: -125
##      4: -125
##      5: -125
##     ---     
## 970686:  -66
## 970687:  -66
## 970688:  -66
## 970689:  -66
## 970690:  -66
# Removing variables
dt[, `:=`(gd_gdd = NULL, gd_prec = NULL)] # dplyr::select(-gd_gdd, -gd_prec)

Other Operations (keys, indices, merges, and renaming)

# Create some 2-digit lat/lon groups
dt[, `:=`(lat2 = round(lat,2), lon2 = round(lon,2))]

# No keys is okay, but operations are slower
key(dt) # No key set
## NULL
microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")
## Unit: milliseconds
##                                          expr
##  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)]
##       min       lq     mean   median
##  155.8982 289.4522 323.1314 303.3228
##       uq     max neval
##  310.706 530.188    10
# Setting the keys that you group by makes grouping faster
setkey(dt, lat2, lon2)
setkeyv(dt, c("lat2", "lon2")) #Equivalent - useful for quoted vars in functions
key(dt) # Now with lat2, lon2
## [1] "lat2" "lon2"
microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")
## Unit: milliseconds
##                                          expr
##  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)]
##      min      lq     mean  median      uq
##  44.2884 46.3645 66.44499 69.0569 79.6183
##     max neval
##  96.594    10
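The heading above also mentions merges and renaming. Here is a minimal sketch of each; the regions lookup table is made up purely for illustration:

# Merging: join a small (made-up) lookup table onto the grid by a shared key
regions <- data.table(lon0 = c(-125, -66), region = c("west", "east"))
dt[, lon0 := round(lon)]
merged <- merge(dt, regions, by = "lon0", all.x = TRUE) # left join, like dplyr::left_join()

# Renaming columns in place (no copy is made)
setnames(merged, old = c("gdd", "prec"), new = c("growing_degree_days", "precip"))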

In Functions (or for loops)

#' Center variables of a data.table.
#' @param d a data.table.
#' @param vars a list of quoted column names to center.
#' @param byvars a list of quoted column names by which to center.
#' @param na.rm exclude NA?
#' @return a data.table with centered variables.
center_dt <- function(d, vars, byvars, na.rm = TRUE) {
  setkeyv(d, byvars)
  d[, lapply(.SD, function(x) { x - mean(x, na.rm = na.rm) }),
    by = byvars, .SDcols = vars]
}
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
center_dt(dt, vars = c("gdd", "prec"), byvars = c("lat0", "lon0"))
##         lat0 lon0         gdd       prec
##      1:   24 -125   95.146283 -1.2601424
##      2:   24 -125  -45.366808  0.7401815
##      3:   24 -125  -59.763709 -1.5278002
##      4:   24 -125    4.492488 -6.7632807
##      5:   24 -125   36.774394  1.3920487
##     ---                                 
## 970686:   53  -66  -72.051125  2.6577200
## 970687:   53  -66   75.705411  5.4217230
## 970688:   53  -66  -57.443669  1.3007194
## 970689:   53  -66 -145.215167  1.6826111
## 970690:   53  -66  -60.743466  5.4552177
# Alternatively, specify just the transformation as a function
center <- function(x) { x - mean(x) }
dt[, lapply(.SD, function(x){ center(x) }), 
   by = c("lat0", "lon0"), .SDcols = c("gdd", "prec")]
##         lat0 lon0         gdd       prec
##      1:   24 -125   95.146283 -1.2601424
##      2:   24 -125  -45.366808  0.7401815
##      3:   24 -125  -59.763709 -1.5278002
##      4:   24 -125    4.492488 -6.7632807
##      5:   24 -125   36.774394  1.3920487
##     ---                                 
## 970686:   53  -66  -72.051125  2.6577200
## 970687:   53  -66   75.705411  5.4217230
## 970688:   53  -66  -57.443669  1.3007194
## 970689:   53  -66 -145.215167  1.6826111
## 970690:   53  -66  -60.743466  5.4552177

Reading and writing data

data.table provides fread and fwrite to quickly read and write data.

fread() also takes a URL if you want to directly read data from http.

microbenchmark(fwrite(dt, file = "dt.csv"), times = 1, unit = "ms")
## Unit: milliseconds
##                         expr      min
##  fwrite(dt, file = "dt.csv") 682.1413
##        lq     mean   median       uq
##  682.1413 682.1413 682.1413 682.1413
##       max neval
##  682.1413     1
microbenchmark(write_csv(dt, path = "dt.csv"), times = 1, unit = "ms")
## Unit: milliseconds
##                            expr      min
##  write_csv(dt, path = "dt.csv") 3856.494
##        lq     mean   median       uq
##  3856.494 3856.494 3856.494 3856.494
##       max neval
##  3856.494     1
microbenchmark(saveRDS(dt, file = "dt.rds"), times = 1, unit = "ms")
## Unit: milliseconds
##                          expr      min
##  saveRDS(dt, file = "dt.rds") 4002.413
##        lq     mean   median       uq
##  4002.413 4002.413 4002.413 4002.413
##       max neval
##  4002.413     1
microbenchmark(fread("dt.csv"), times = 1, unit = "ms")
## Unit: milliseconds
##             expr      min       lq     mean
##  fread("dt.csv") 298.4178 298.4178 298.4178
##    median       uq      max neval
##  298.4178 298.4178 298.4178     1
microbenchmark(read_csv("dt.csv"), times = 1, unit = "ms")
## Unit: milliseconds
##                expr      min       lq
##  read_csv("dt.csv") 2899.474 2899.474
##      mean   median       uq      max neval
##  2899.474 2899.474 2899.474 2899.474     1
microbenchmark(readRDS("dt.rds"), times = 1, unit = "ms")
## Unit: milliseconds
##               expr      min       lq
##  readRDS("dt.rds") 529.9389 529.9389
##      mean   median       uq      max neval
##  529.9389 529.9389 529.9389 529.9389     1