Introduction
This tutorial was inspired by the R Curious tutorial at useR! 2018, and follows on thematically from the R Curious workshop notes as an extension.
It is aimed at those with a background in Excel, who would also like to use R for data analysis. This tutorial compares the things you would normally do in Excel, but with an equivalent function in R.
This introductory level tutorial assumes you have already installed R and R studio and had a brief introduction to the R basics and R Markdown.
From the R Markdown run each line of code (Ctrl + Enter) or code chunk (Ctrl + Shift +Enter) yourself or Knit -> Knit to HTML to read through the HTML version.
Particularly useful packages for this exercise
# Load packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages -------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.5
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## -- Conflicts ----------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(rpivotTable)
Tip Install the package from CRAN the first time you use the package using install.packages("name of package")
else you will get an Error
that there is no package.
Tip The latest package version is sometimes available from GitHub so alternatively install from GitHub using thedevtools
package. A package’s GitHub README file typically has installation instructions.
Import Witch Trial data
The dataset used in this tutorial is of 43,000 people tried for witchcraft across 21 European countries over a period of five-and-a-half centuries used for this economic journal.
This is what the csv file looks like in Excel:
In R, import the data using the read_csv
function from the readr R package which is loaded with the tidyverse, to import the raw data to a data frame called witchdat
.
A data frame, which is a 2-dimensional object where contents can be different data types.
# Download raw data from the author's GitHub url https://github.com/JakeRuss/witch-trials/tree/master/data
url <- "https://raw.githubusercontent.com/JakeRuss/witch-trials/master/data/trials.csv"
# The read_csv function also produces default messages describing the parsing of columns which is converting the columns into different types.
witchdat <- read_csv(url)
## Parsed with column specification:
## cols(
## year = col_integer(),
## decade = col_integer(),
## century = col_integer(),
## tried = col_integer(),
## deaths = col_integer(),
## city = col_character(),
## gadm.adm2 = col_character(),
## gadm.adm1 = col_character(),
## gadm.adm0 = col_character(),
## lon = col_double(),
## lat = col_double(),
## record.source = col_character()
## )
Let’s take a quick look at the witchdat
data frame in R to get an idea of it’s size and contents.
In R there are some packages that are included by default; take a look at the packages available in the Global Environment drop down. One such package is called base which includes dim function to retrieve the dimensions, the row and column numbers.
# Retrieve the number of rows and columns
dim(witchdat)
## [1] 10940 12
Another included package is utils R package which includes the str
and structure
functions to view the basic structure.
# Use str function to 'Compactly Display the Structure of an Arbitrary R Object'. All R objects store additional attributes to store metadata about an object. In this str function we set the give.attr to FALSE to exclude these attributes.
str(witchdat, give.attr=FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 10940 obs. of 12 variables:
## $ year : int NA NA NA NA NA NA NA NA NA NA ...
## $ decade : int 1520 1530 1540 1580 1590 1600 1610 1620 1630 1640 ...
## $ century : int 1500 1500 1500 1500 1500 1600 1600 1600 1600 1600 ...
## $ tried : int 1 1 5 7 11 6 22 14 25 39 ...
## $ deaths : int 1 1 5 5 0 1 18 8 4 10 ...
## $ city : chr NA NA NA NA ...
## $ gadm.adm2 : chr NA NA NA NA ...
## $ gadm.adm1 : chr NA NA NA NA ...
## $ gadm.adm0 : chr "Estonia" "Estonia" "Estonia" "Estonia" ...
## $ lon : num NA NA NA NA NA NA NA NA NA NA ...
## $ lat : num NA NA NA NA NA NA NA NA NA NA ...
## $ record.source: chr "Madar (1990)" "Madar (1990)" "Madar (1990)" "Madar (1990)" ...
# Next we will use structure function to view witchdat
structure(witchdat)
## # A tibble: 10,940 x 12
## year decade century tried deaths city gadm.adm2 gadm.adm1 gadm.adm0
## <int> <int> <int> <int> <int> <chr> <chr> <chr> <chr>
## 1 NA 1520 1500 1 1 <NA> <NA> <NA> Estonia
## 2 NA 1530 1500 1 1 <NA> <NA> <NA> Estonia
## 3 NA 1540 1500 5 5 <NA> <NA> <NA> Estonia
## 4 NA 1580 1500 7 5 <NA> <NA> <NA> Estonia
## 5 NA 1590 1500 11 0 <NA> <NA> <NA> Estonia
## 6 NA 1600 1600 6 1 <NA> <NA> <NA> Estonia
## 7 NA 1610 1600 22 18 <NA> <NA> <NA> Estonia
## 8 NA 1620 1600 14 8 <NA> <NA> <NA> Estonia
## 9 NA 1630 1600 25 4 <NA> <NA> <NA> Estonia
## 10 NA 1640 1600 39 10 <NA> <NA> <NA> Estonia
## # ... with 10,930 more rows, and 3 more variables: lon <dbl>, lat <dbl>,
## # record.source <chr>
# There are other ways of looking at your data. Try the glimpse function from the tibble package. Take a look at the Help description with ?glimpse to find out more
glimpse(witchdat)
## Observations: 10,940
## Variables: 12
## $ year <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decade <int> 1520, 1530, 1540, 1580, 1590, 1600, 1610, 1620, ...
## $ century <int> 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600, ...
## $ tried <int> 1, 1, 5, 7, 11, 6, 22, 14, 25, 39, 10, 10, 8, 8,...
## $ deaths <int> 1, 1, 5, 5, 0, 1, 18, 8, 4, 10, 1, 4, 1, 3, 3, 0...
## $ city <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm0 <chr> "Estonia", "Estonia", "Estonia", "Estonia", "Est...
## $ lon <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ record.source <chr> "Madar (1990)", "Madar (1990)", "Madar (1990)", ...
This is a fairly large data frame with columns that are characters, doubles and integers.
Format “cells” in R
In Excel we can rename and format “cells”.
Which functions can we use in R?
From now on we will refer to the columns as variables, with column headers as the variable names.
First we will look at the format of the variable names.
# Take a look at the column or variable names
names(witchdat)
## [1] "year" "decade" "century" "tried"
## [5] "deaths" "city" "gadm.adm2" "gadm.adm1"
## [9] "gadm.adm0" "lon" "lat" "record.source"
A tool for data frame data manipulation is the dplyr R package. R packages typically have vignettes which are tutorials and worked examples using the package. There is useful documentation in the Introduction to dplyr vignette.
In order to access the contents of the columns we can use a $
for example witchdat$gadm.adm0
. Alternatively we can use another method that joins objects and functions using a pipe %>%
. This can be useful if we would like to use multiple dplyr
verbs or functions on an object.
We will be use <-
to assign the formatting changes to witchdat
.
# From inspection of witchdat this cryptic name gadm.adm0 refers to country. Use dplyr to rename a variable.
witchdat <- witchdat %>%
rename(country=gadm.adm0)
# Perform data type conversion of the deaths variable - use dplyr mutate to change this variable
witchdat <- witchdat %>%
mutate(deaths=as.numeric(deaths))
# or we can use dplyr mutate_all to change types of all integer variables (is.integer) to numeric (as.numeric). With mutate_if we don't need the () with is.integer and as.numeric functions as mutate_if will operate on all the columns for which the first test function is TRUE.
witchdat <- witchdat %>%
mutate_if(is.integer,as.numeric)
Try typing ‘as.’ in the R Console to see what other options are available to change data types.
# View the structure of eventdat again to see these formatting changes
glimpse(witchdat)
## Observations: 10,940
## Variables: 12
## $ year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decade <dbl> 1520, 1530, 1540, 1580, 1590, 1600, 1610, 1620, ...
## $ century <dbl> 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600, ...
## $ tried <dbl> 1, 1, 5, 7, 11, 6, 22, 14, 25, 39, 10, 10, 8, 8,...
## $ deaths <dbl> 1, 1, 5, 5, 0, 1, 18, 8, 4, 10, 1, 4, 1, 3, 3, 0...
## $ city <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ gadm.adm1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ country <chr> "Estonia", "Estonia", "Estonia", "Estonia", "Est...
## $ lon <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ record.source <chr> "Madar (1990)", "Madar (1990)", "Madar (1990)", ...
For other formatting we can also format the end products (such as plots), using the functions available in those R packages.
“Sort” data in R
In Excel we can sort by columns.
Which functions can we use in R?
R is different to Excel that we can also view data manipulations without assigning changes to an object. In these sorting examples we will just view the changes in-line.
# Sort by character using arrange from dplyr. We can also pipe other packages functions such as head from the utils package.
witchdat %>%
arrange(city) %>%
head()
## # A tibble: 6 x 12
## year decade century tried deaths city gadm.adm2 gadm.adm1 country
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 1613 1610 1600 1 NA Aalen Stuttgart Baden-Wurt~ Germany
## 2 1564 1560 1500 1 0 Aalsm~ Aalsmeer Noord-Holl~ Nether~
## 3 1596 1590 1500 1 1 Aalst Oost-Vlaan~ Vlaanderen Belgium
## 4 1601 1600 1600 1 1 Aalst Oost-Vlaan~ Vlaanderen Belgium
## 5 1658 1650 1600 1 0 Abbek~ Wognum Noord-Holl~ Nether~
## 6 1499 1490 1400 1 NA Abens~ Mittelfran~ Bayern Germany
## # ... with 3 more variables: lon <dbl>, lat <dbl>, record.source <chr>
# This sorting also works on numeric variables, sort by year using arrange from dplyr
witchdat %>%
arrange(year) %>%
head()
## # A tibble: 6 x 12
## year decade century tried deaths city gadm.adm2 gadm.adm1 country
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 1300 1300 1300 1 0 Nurnb~ Mittelfra~ Bayern Germany
## 2 1300 1300 1300 1 NA Freib~ Freiburg Baden-Wurt~ Germany
## 3 1301 1300 1300 1 0 Coven~ West Midl~ England United ~
## 4 1302 1300 1300 1 0 Exeter Devon England United ~
## 5 1302 1300 1300 1 0 Exeter Devon England United ~
## 6 1303 1300 1300 1 0 <NA> <NA> <NA> France
## # ... with 3 more variables: lon <dbl>, lat <dbl>, record.source <chr>
# Or we can sort in descending order
witchdat %>%
arrange(desc(year)) %>%
head()
## # A tibble: 6 x 12
## year decade century tried deaths city gadm.adm2 gadm.adm1 country
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 1850 1850 1800 1 0 Vrouwen~ Veere Zeeland Nether~
## 2 1835 1830 1800 1 0 Sint-An~ Tholen Zeeland Nether~
## 3 1826 1820 1800 1 0 Amsterd~ Amsterdam Noord-Hol~ Nether~
## 4 1792 1790 1700 1 NA Dinkels~ Mittelfra~ Bayern Germany
## 5 1790 1790 1700 1 NA Memming~ Mittelfra~ Bayern Germany
## 6 1790 1790 1700 1 0 <NA> <NA> Sachsen Germany
## # ... with 3 more variables: lon <dbl>, lat <dbl>, record.source <chr>
Create a “formula” in R
In Excel we can create formulas in cells.
Which functions can we use in R?
# Remember R as a calculator? This is a simple formula
3+2
## [1] 5
# Create a new variable in our dataframe called triedbutnotdied as the difference between two variables (formula after the =). Use the mutate function from dplyr
witchdat <- witchdat %>%
mutate(triedbutnotdied = tried-deaths)
# In R missing values are typically NA values. A handy base function to know is is.na(). Try typing ?is.na in the console. Let's use this function in a new formula to test if a variable is populated
witchdat <- witchdat %>%
mutate(missingyear = ifelse(is.na(year),"Yup missing","Not missing"))
# View the structure of witchdat again to see these new variables
glimpse(witchdat)
## Observations: 10,940
## Variables: 14
## $ year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decade <dbl> 1520, 1530, 1540, 1580, 1590, 1600, 1610, 1620...
## $ century <dbl> 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600...
## $ tried <dbl> 1, 1, 5, 7, 11, 6, 22, 14, 25, 39, 10, 10, 8, ...
## $ deaths <dbl> 1, 1, 5, 5, 0, 1, 18, 8, 4, 10, 1, 4, 1, 3, 3,...
## $ city <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ gadm.adm2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ gadm.adm1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ country <chr> "Estonia", "Estonia", "Estonia", "Estonia", "E...
## $ lon <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ lat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ record.source <chr> "Madar (1990)", "Madar (1990)", "Madar (1990)"...
## $ triedbutnotdied <dbl> 0, 0, 0, 2, 11, 5, 4, 6, 21, 29, 9, 6, 7, 5, 2...
## $ missingyear <chr> "Yup missing", "Yup missing", "Yup missing", "...
Filter a “column”
In Excel we can filter by columns.
Which functions can we use in R?
# Filter which witches were tried in decade 1520 by using dplyr filter. Remember to use == for equals in R, whereas = is an assignment operator in R functions.
witchdat %>%
filter(decade==1520) %>%
head()
## # A tibble: 6 x 14
## year decade century tried deaths city gadm.adm2 gadm.adm1 country
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 NA 1520 1500 1 1 <NA> <NA> <NA> Estonia
## 2 NA 1520 1500 1 0 <NA> <NA> <NA> Finland
## 3 NA 1520 1500 3 1 <NA> <NA> <NA> Hungary
## 4 NA 1520 1500 2 1 <NA> <NA> Basel-Stadt Switzerla~
## 5 NA 1520 1500 2 2 <NA> <NA> Bern Switzerla~
## 6 NA 1520 1500 4 4 <NA> <NA> Fribourg Switzerla~
## # ... with 5 more variables: lon <dbl>, lat <dbl>, record.source <chr>,
## # triedbutnotdied <dbl>, missingyear <chr>
Calculate on “columns”
In Excel we can add a calculation such as sum or a mean to a column.
Which functions can we use in R?
In R Markdown documents we can “highlight” and comment on insights, calculations or data manipulations.
# Sum the total deaths. Remember there are NA values so we need to tell this sum function to ignore these with argument na.rm=TRUE. Using sum on the 'whole column' ie variable witchdat$deaths is intuitively the same way of calculating this sum in Excel
sum(witchdat$deaths,na.rm=TRUE)
## [1] 16333
# Now use dplyr with the select function to select the deaths variable (also the same as deleting the other columns in Excel) then apply the sum on this object
witchdat %>%
select(deaths) %>%
sum(na.rm = TRUE)
## [1] 16333
# Now try the dplyr summarise function and sum function then name this sum object as sum_deaths
witchdat %>%
summarise(sum_deaths = sum(deaths, na.rm = TRUE))
## # A tibble: 1 x 1
## sum_deaths
## <dbl>
## 1 16333
Create a “Pivot Table”
In Excel you can create pivot tables.
Which functions can we use in R?
In R we can use the table function to create a simple pivot table.
# Use th table function to find a simple count by country
witchdat %>%
select(country) %>%
table()
## .
## Austria Belgium Czech Republic Denmark Estonia
## 16 671 1 90 17
## Finland France Germany Hungary Ireland
## 17 807 3417 26 4
## Italy Luxembourg Netherlands Norway Poland
## 107 20 314 20 9
## Spain Sweden Switzerland United Kingdom
## 29 353 1272 3750
A neat R package that creates interactive pivot tables is rpivotTable. Have a look at this package’s vignette.
This example shows the sum of the “witches” tried by country.
# Create an interactive pivot table using the rpivotTable function and package.
pivot <- rpivotTable(witchdat,rows="country",vals="tried",aggregator="Sum")
# For this widget to show up in a blogdown page, we can use the widgetframe package and frameWidget function
widgetframe::frameWidget(pivot, height = '400')
Challenge 1 Can you recreate the pivot table as a new R data frame using dplyr
functions? (Hint ?group_by)
Challenge 2 On the pivot table widget move the country
above the table so the country
is a pivot table column. Can you extend your challenge 1 answer with dplyr
to recreate.
Challenge 3 On the pivot table widget move decade
and country
as pivot table rows. Can you extend your challenge 1 answer with dplyr
to recreate.
Save the “spreadsheet”
In Excel you can save your workbooks and spreadsheets. Which functions can we use in R?
Here are some options in R, depending on what output is needed:
- The
base
Rwrite.csv
function to write to a csv file
- The
readr
package in the tidyverse, includes the functionwrite_csv
, which is faster than write.csv
- The
xlsx
packagewrite.xlsx
to write to an Excel workbook
# Save dataframe with write_csv from readr package
write_csv(witchdat,"witchdat.csv")
Acknowledgments
Thanks to Marcia Ferreira @DrMarciaFe, Chris Beltz @BeltzEcology, Duncan Garmonsway @nacnudus and Ivan Leung @urganmax for reviewing and providing suggestions on this tutorial through Twitter.
If you have any further suggestions or corrections please raise an issue in GitHub.