Tutorial on Pivot Tables and other Excel things you can also do in R

19 Jul 2018

R / Excel / tutorial

Introduction

This tutorial was inspired by the R Curious tutorial at useR! 2018, and follows on thematically from the R Curious workshop notes as an extension.

It is aimed at those with a background in Excel, who would also like to use R for data analysis. This tutorial compares the things you would normally do in Excel, but with an equivalent function in R.

This introductory level tutorial assumes you have already installed R and R studio and had a brief introduction to the R basics and R Markdown.

From the R Markdown run each line of code (Ctrl + Enter) or code chunk (Ctrl + Shift +Enter) yourself or Knit -> Knit to HTML to read through the HTML version.

Particularly useful packages for this exercise

# Load packages
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.1

## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## Warning: package 'ggplot2' was built under R version 3.5.1

## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(rpivotTable)

Tip Install the package from CRAN the first time you use the package using install.packages("name of package") else you will get an Error that there is no package.

Tip The latest package version is sometimes available from GitHub so alternatively install from GitHub using thedevtools package. A package’s GitHub README file typically has installation instructions.

Import Witch Trial data

The dataset used in this tutorial is of 43,000 people tried for witchcraft across 21 European countries over a period of five-and-a-half centuries used for this economic journal.

This is what the csv file looks like in Excel:

In R, import the data using the read_csv function from the readr R package which is loaded with the tidyverse, to import the raw data to a data frame called witchdat.

A data frame, which is a 2-dimensional object where contents can be different data types.

# Download raw data from the author's GitHub url https://github.com/JakeRuss/witch-trials/tree/master/data
url <- "https://raw.githubusercontent.com/JakeRuss/witch-trials/master/data/trials.csv"
# The read_csv function also produces default messages describing the parsing of columns which is converting the columns into different types.
witchdat <- read_csv(url)

## Parsed with column specification:
## cols(
##   year = col_integer(),
##   decade = col_integer(),
##   century = col_integer(),
##   tried = col_integer(),
##   deaths = col_integer(),
##   city = col_character(),
##   gadm.adm2 = col_character(),
##   gadm.adm1 = col_character(),
##   gadm.adm0 = col_character(),
##   lon = col_double(),
##   lat = col_double(),
##   record.source = col_character()
## )

Let’s take a quick look at the witchdat data frame in R to get an idea of it’s size and contents.

In R there are some packages that are included by default; take a look at the packages available in the Global Environment drop down. One such package is called base which includes dim function to retrieve the dimensions, the row and column numbers.

# Retrieve the number of rows and columns
dim(witchdat)

## [1] 10940    12

Another included package is utils R package which includes the str and structure functions to view the basic structure.

# Use str function to 'Compactly Display the Structure of an Arbitrary R Object'. All R objects store additional attributes to store metadata about an object. In this str function we set the give.attr to FALSE to exclude these attributes.
str(witchdat, give.attr=FALSE)

## Classes 'tbl_df', 'tbl' and 'data.frame':    10940 obs. of  12 variables:
##  $ year         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ decade       : int  1520 1530 1540 1580 1590 1600 1610 1620 1630 1640 ...
##  $ century      : int  1500 1500 1500 1500 1500 1600 1600 1600 1600 1600 ...
##  $ tried        : int  1 1 5 7 11 6 22 14 25 39 ...
##  $ deaths       : int  1 1 5 5 0 1 18 8 4 10 ...
##  $ city         : chr  NA NA NA NA ...
##  $ gadm.adm2    : chr  NA NA NA NA ...
##  $ gadm.adm1    : chr  NA NA NA NA ...
##  $ gadm.adm0    : chr  "Estonia" "Estonia" "Estonia" "Estonia" ...
##  $ lon          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ lat          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ record.source: chr  "Madar (1990)" "Madar (1990)" "Madar (1990)" "Madar (1990)" ...

# Next we will use structure function to view witchdat 
structure(witchdat)

## # A tibble: 10,940 x 12
##     year decade century tried deaths city  gadm.adm2 gadm.adm1 gadm.adm0
##    <int>  <int>   <int> <int>  <int> <chr> <chr>     <chr>     <chr>    
##  1    NA   1520    1500     1      1 <NA>  <NA>      <NA>      Estonia  
##  2    NA   1530    1500     1      1 <NA>  <NA>      <NA>      Estonia  
##  3    NA   1540    1500     5      5 <NA>  <NA>      <NA>      Estonia  
##  4    NA   1580    1500     7      5 <NA>  <NA>      <NA>      Estonia  
##  5    NA   1590    1500    11      0 <NA>  <NA>      <NA>      Estonia  
##  6    NA   1600    1600     6      1 <NA>  <NA>      <NA>      Estonia  
##  7    NA   1610    1600    22     18 <NA>  <NA>      <NA>      Estonia  
##  8    NA   1620    1600    14      8 <NA>  <NA>      <NA>      Estonia  
##  9    NA   1630    1600    25      4 <NA>  <NA>      <NA>      Estonia  
## 10    NA   1640    1600    39     10 <NA>  <NA>      <NA>      Estonia  
## # ... with 10,930 more rows, and 3 more variables: lon <dbl>, lat <dbl>,
## #   record.source <chr>

# There are other ways of looking at your data. Try the glimpse function from the tibble package. Take a look at the Help description with ?glimpse to find out more
glimpse(witchdat)

## Observations: 10,940
## Variables: 12
## $ year          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decade        <int> 1520, 1530, 1540, 1580, 1590, 1600, 1610, 1620, ...
## $ century       <int> 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600, ...
## $ tried         <int> 1, 1, 5, 7, 11, 6, 22, 14, 25, 39, 10, 10, 8, 8,...
## $ deaths        <int> 1, 1, 5, 5, 0, 1, 18, 8, 4, 10, 1, 4, 1, 3, 3, 0...
## $ city          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm2     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm1     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm0     <chr> "Estonia", "Estonia", "Estonia", "Estonia", "Est...
## $ lon           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ lat           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ record.source <chr> "Madar (1990)", "Madar (1990)", "Madar (1990)", ...

This is a fairly large data frame with columns that are characters, doubles and integers.

Format “cells” in R

In Excel we can rename and format “cells”.

Which functions can we use in R?

From now on we will refer to the columns as variables, with column headers as the variable names.

First we will look at the format of the variable names.

# Take a look at the column or variable names
names(witchdat)

##  [1] "year"          "decade"        "century"       "tried"        
##  [5] "deaths"        "city"          "gadm.adm2"     "gadm.adm1"    
##  [9] "gadm.adm0"     "lon"           "lat"           "record.source"

A tool for data frame data manipulation is the dplyr R package. R packages typically have vignettes which are tutorials and worked examples using the package. There is useful documentation in the Introduction to dplyr vignette.

In order to access the contents of the columns we can use a $ for example witchdat$gadm.adm0. Alternatively we can use another method that joins objects and functions using a pipe %>%. This can be useful if we would like to use multiple dplyr verbs or functions on an object.

We will be use <- to assign the formatting changes to witchdat.

# From inspection of witchdat this cryptic name gadm.adm0 refers to country. Use dplyr to rename a variable. 
witchdat <- witchdat %>% 
  rename(country=gadm.adm0)

# Perform data type conversion of the deaths variable - use dplyr mutate to change this variable
witchdat <- witchdat %>% 
  mutate(deaths=as.numeric(deaths))
# or we can use dplyr mutate_all to change types of all integer variables (is.integer) to numeric (as.numeric). With mutate_if we don't need the () with is.integer and as.numeric functions as mutate_if will operate on all the columns for which the first test function is TRUE.
witchdat <- witchdat %>% 
  mutate_if(is.integer,as.numeric)

Try typing ‘as.’ in the R Console to see what other options are available to change data types.

# View the structure of eventdat again to see these formatting changes
glimpse(witchdat)

## Observations: 10,940
## Variables: 12
## $ year          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decade        <dbl> 1520, 1530, 1540, 1580, 1590, 1600, 1610, 1620, ...
## $ century       <dbl> 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600, ...
## $ tried         <dbl> 1, 1, 5, 7, 11, 6, 22, 14, 25, 39, 10, 10, 8, 8,...
## $ deaths        <dbl> 1, 1, 5, 5, 0, 1, 18, 8, 4, 10, 1, 4, 1, 3, 3, 0...
## $ city          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm2     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm1     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ country       <chr> "Estonia", "Estonia", "Estonia", "Estonia", "Est...
## $ lon           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ lat           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ record.source <chr> "Madar (1990)", "Madar (1990)", "Madar (1990)", ...

For other formatting we can also format the end products (such as plots), using the functions available in those R packages.

“Sort” data in R

In Excel we can sort by columns.

Which functions can we use in R?

R is different to Excel that we can also view data manipulations without assigning changes to an object. In these sorting examples we will just view the changes in-line.

# Sort by character using arrange from dplyr. We can also pipe other packages functions such as head from the utils package.
witchdat %>% 
  arrange(city) %>% 
  head()

## # A tibble: 6 x 12
##    year decade century tried deaths city   gadm.adm2   gadm.adm1   country
##   <dbl>  <dbl>   <dbl> <dbl>  <dbl> <chr>  <chr>       <chr>       <chr>  
## 1  1613   1610    1600     1     NA Aalen  Stuttgart   Baden-Wurt~ Germany
## 2  1564   1560    1500     1      0 Aalsm~ Aalsmeer    Noord-Holl~ Nether~
## 3  1596   1590    1500     1      1 Aalst  Oost-Vlaan~ Vlaanderen  Belgium
## 4  1601   1600    1600     1      1 Aalst  Oost-Vlaan~ Vlaanderen  Belgium
## 5  1658   1650    1600     1      0 Abbek~ Wognum      Noord-Holl~ Nether~
## 6  1499   1490    1400     1     NA Abens~ Mittelfran~ Bayern      Germany
## # ... with 3 more variables: lon <dbl>, lat <dbl>, record.source <chr>

# This sorting also works on numeric variables, sort by year using arrange from dplyr
witchdat %>% 
  arrange(year) %>% 
  head()

## # A tibble: 6 x 12
##    year decade century tried deaths city   gadm.adm2  gadm.adm1   country 
##   <dbl>  <dbl>   <dbl> <dbl>  <dbl> <chr>  <chr>      <chr>       <chr>   
## 1  1300   1300    1300     1      0 Nurnb~ Mittelfra~ Bayern      Germany 
## 2  1300   1300    1300     1     NA Freib~ Freiburg   Baden-Wurt~ Germany 
## 3  1301   1300    1300     1      0 Coven~ West Midl~ England     United ~
## 4  1302   1300    1300     1      0 Exeter Devon      England     United ~
## 5  1302   1300    1300     1      0 Exeter Devon      England     United ~
## 6  1303   1300    1300     1      0 <NA>   <NA>       <NA>        France  
## # ... with 3 more variables: lon <dbl>, lat <dbl>, record.source <chr>

# Or we can sort in descending order
witchdat %>% 
  arrange(desc(year)) %>% 
  head()

## # A tibble: 6 x 12
##    year decade century tried deaths city     gadm.adm2  gadm.adm1  country
##   <dbl>  <dbl>   <dbl> <dbl>  <dbl> <chr>    <chr>      <chr>      <chr>  
## 1  1850   1850    1800     1      0 Vrouwen~ Veere      Zeeland    Nether~
## 2  1835   1830    1800     1      0 Sint-An~ Tholen     Zeeland    Nether~
## 3  1826   1820    1800     1      0 Amsterd~ Amsterdam  Noord-Hol~ Nether~
## 4  1792   1790    1700     1     NA Dinkels~ Mittelfra~ Bayern     Germany
## 5  1790   1790    1700     1     NA Memming~ Mittelfra~ Bayern     Germany
## 6  1790   1790    1700     1      0 <NA>     <NA>       Sachsen    Germany
## # ... with 3 more variables: lon <dbl>, lat <dbl>, record.source <chr>

Create a “formula” in R

In Excel we can create formulas in cells.

Which functions can we use in R?

# Remember R as a calculator? This is a simple formula
3+2

## [1] 5

# Create a new variable in our dataframe called triedbutnotdied as the difference between two variables (formula after the =). Use the mutate function from dplyr
witchdat <- witchdat %>% 
  mutate(triedbutnotdied = tried-deaths)

# In R missing values are typically NA values. A handy base function to know is is.na(). Try typing ?is.na in the console.  Let's use this function in a new formula to test if a variable is populated
witchdat <- witchdat %>% 
  mutate(missingyear = ifelse(is.na(year),"Yup missing","Not missing"))
# View the structure of witchdat again to see these new variables
glimpse(witchdat)

## Observations: 10,940
## Variables: 14
## $ year            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decade          <dbl> 1520, 1530, 1540, 1580, 1590, 1600, 1610, 1620...
## $ century         <dbl> 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600...
## $ tried           <dbl> 1, 1, 5, 7, 11, 6, 22, 14, 25, 39, 10, 10, 8, ...
## $ deaths          <dbl> 1, 1, 5, 5, 0, 1, 18, 8, 4, 10, 1, 4, 1, 3, 3,...
## $ city            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ gadm.adm2       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ gadm.adm1       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country         <chr> "Estonia", "Estonia", "Estonia", "Estonia", "E...
## $ lon             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ lat             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ record.source   <chr> "Madar (1990)", "Madar (1990)", "Madar (1990)"...
## $ triedbutnotdied <dbl> 0, 0, 0, 2, 11, 5, 4, 6, 21, 29, 9, 6, 7, 5, 2...
## $ missingyear     <chr> "Yup missing", "Yup missing", "Yup missing", "...

Filter a “column”

In Excel we can filter by columns.

Which functions can we use in R?

# Filter which witches were tried in decade 1520 by using dplyr filter. Remember to use == for equals in R, whereas = is an assignment operator in R functions.
witchdat %>% 
  filter(decade==1520) %>% 
  head()

## # A tibble: 6 x 14
##    year decade century tried deaths city  gadm.adm2 gadm.adm1   country   
##   <dbl>  <dbl>   <dbl> <dbl>  <dbl> <chr> <chr>     <chr>       <chr>     
## 1    NA   1520    1500     1      1 <NA>  <NA>      <NA>        Estonia   
## 2    NA   1520    1500     1      0 <NA>  <NA>      <NA>        Finland   
## 3    NA   1520    1500     3      1 <NA>  <NA>      <NA>        Hungary   
## 4    NA   1520    1500     2      1 <NA>  <NA>      Basel-Stadt Switzerla~
## 5    NA   1520    1500     2      2 <NA>  <NA>      Bern        Switzerla~
## 6    NA   1520    1500     4      4 <NA>  <NA>      Fribourg    Switzerla~
## # ... with 5 more variables: lon <dbl>, lat <dbl>, record.source <chr>,
## #   triedbutnotdied <dbl>, missingyear <chr>

Calculate on “columns”

In Excel we can add a calculation such as sum or a mean to a column.

Which functions can we use in R?

In R Markdown documents we can “highlight” and comment on insights, calculations or data manipulations.

# Sum the total deaths. Remember there are NA values so we need to tell this sum function to ignore these with argument na.rm=TRUE. Using sum on the 'whole column' ie variable witchdat$deaths is intuitively the same way of calculating this sum in Excel
sum(witchdat$deaths,na.rm=TRUE)

## [1] 16333

# Now use dplyr with the select function to select the deaths variable (also the same as deleting the other columns in Excel) then apply the sum on this object
witchdat %>% 
  select(deaths) %>% 
  sum(na.rm = TRUE)

## [1] 16333

# Now try the dplyr summarise function and sum function then name this sum object as sum_deaths
witchdat %>% 
  summarise(sum_deaths = sum(deaths, na.rm = TRUE))

## # A tibble: 1 x 1
##   sum_deaths
##        <dbl>
## 1      16333

Create a “Pivot Table”

In Excel you can create pivot tables.

Which functions can we use in R?

In R we can use the table function to create a simple pivot table.

# Use th table function to find a simple count by country
witchdat %>% 
  select(country) %>% 
      table()

## .
##        Austria        Belgium Czech Republic        Denmark        Estonia 
##             16            671              1             90             17 
##        Finland         France        Germany        Hungary        Ireland 
##             17            807           3417             26              4 
##          Italy     Luxembourg    Netherlands         Norway         Poland 
##            107             20            314             20              9 
##          Spain         Sweden    Switzerland United Kingdom 
##             29            353           1272           3750

A neat R package that creates interactive pivot tables is rpivotTable. Have a look at this package’s vignette.

This example shows the sum of the “witches” tried by country.

# Create an interactive pivot table using the rpivotTable function and package. 
pivot <- rpivotTable(witchdat,rows="country",vals="tried",aggregator="Sum")
# For this widget to show up  in a blogdown page, we can use the widgetframe package and frameWidget function
widgetframe::frameWidget(pivot, height = '400')

Challenge 1 Can you recreate the pivot table as a new R data frame using dplyr functions? (Hint ?group_by)

Challenge 2 On the pivot table widget move the country above the table so the country is a pivot table column. Can you extend your challenge 1 answer with dplyr to recreate.

Challenge 3 On the pivot table widget move decade and country as pivot table rows. Can you extend your challenge 1 answer with dplyr to recreate.

Save the “spreadsheet”

In Excel you can save your workbooks and spreadsheets. Which functions can we use in R?

Here are some options in R, depending on what output is needed:

The base R write.csv function to write to a csv file
The readr package in the tidyverse, includes the function write_csv, which is faster than write.csv
The xlsx package write.xlsx to write to an Excel workbook

# Save dataframe with write_csv from readr package
write_csv(witchdat,"witchdat.csv")

Acknowledgments

Thanks to Marcia Ferreira @DrMarciaFe, Chris Beltz @BeltzEcology, Duncan Garmonsway @nacnudus and Ivan Leung @urganmax for reviewing and providing suggestions on this tutorial through Twitter.

If you have any further suggestions or corrections please raise an issue in GitHub.

Tutorial on Pivot Tables and other Excel things you can also do in R - Witch Trials data

Introduction

Particularly useful packages for this exercise

Import Witch Trial data

Format “cells” in R

“Sort” data in R

Create a “formula” in R

Filter a “column”

Calculate on “columns”

Create a “Pivot Table”

Save the “spreadsheet”

Acknowledgments