15 Oct 2018

R / modeling / timeseries / EDA

Introduction

The motivations for cycling around town can be for fitness, economic or environmental reasons. However in Auckland the weather and the 53 odd volcanoes could be deterrents to would-be cyclists.

Auckland Transport (AT) publishes daily and monthly bike count data from counters around Auckland now available under CC BY 4.0 license.

Kia ora, us again! We've had this updated now: https://t.co/OXIZvvmS8f Thanks for bringing this to our attention 😀^JN
— Auckland Transport (@AklTransport) December 16, 2018

We will analyse and visualise this data, then fit a model to forecast Auckland bike activity and trends to see if cyclists are braving the weather and taking on the hills.

Load Packages

# Load the tidyverse metapackage
library(tidyverse)
# Load skimr for summary statistics
library(skimr)
# Load the leaflet and htmltools packages for maps
library(leaflet)
library(htmltools)
# Load the sf package for map coordinate manipulations
library(sf)
# Load the lubridate package for time variable manipulations
library(lubridate)
# Load the forecast package for time series
library(forecast)
# Load the cowplot package for side by side ggplots
library(cowplot)

Data

Import data

Use the read_csv functions from the readr R package to import the csv files directly from urls. These function can be used for most csv files, reading and parsing the files to dataframes.

# Import the bikes daily counter dataset for August
url <- "https://at.govt.nz/media/1977986/august2018akldcyclecounterdata.csv"
bikes_daily <- read_csv(url)

## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   Date = col_character()
## )

## See spec(...) for full column specifications.

# Import the bikes mounthly counter dataset. Set the counter variables types to numeric
url2 <-"https://at.govt.nz/media/1975708/monthlyakldcyclecountdatanov2010-dec2017.csv"
bikes_monthly <- read_csv(url2, 
                          col_types = cols(.default = col_number(), Month=col_character()))

Data Summaries

View the summary statistics with the skimr R package.

# Produce a skim summary excluding the histograms as these do not plot in R markdown currently
skim_with(numeric = list(hist = NULL))
skim_with(
  numeric = list(p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL,sd = NULL,hist = NULL),
  integer = list(p0 = NULL, p25 = NULL, p50 = NULL, p75 = NULL, p100 = NULL,sd = NULL,hist = NULL)  
)
skim(bikes_daily)

## Skim summary statistics
##  n obs: 32 
##  n variables: 42 
## 
## -- Variable type:character ------------------------------------------------------------
##  variable missing complete  n min max empty n_unique
##      Date       1       31 32  14  15     0       31
## 
## -- Variable type:integer --------------------------------------------------------------
##                             variable missing complete  n    mean
##                  Beach Road Cyclists       1       31 32  275.9 
##  Carlton Gore Cycle Counter Cyclists       1       31 32  225.94
##         Curran Street Total Cyclists       1       31 32  248.23
##         Dominion Road Total Cyclists       1       31 32  295.55
##             East Coast Road Cyclists       1       31 32   88.65
##   GI TO TAMAKI DR SECTION-1 Cyclists       1       31 32   20.19
##              Grafton Bridge Cyclists       1       31 32  550.87
##               Grafton Gully Cyclists       1       31 32  347.23
##                Grafton Road Cyclists       1       31 32   81.29
##        Great North Rd Total Cyclists       1       31 32  226.1 
##        Great South Rd Total Cyclists       1       31 32   80.71
##                   HighBrook Cyclists       1       31 32   37.9 
##             Hopetoun Street Cyclists       1       31 32  159.03
##            Karangahape Road Cyclists       1       31 32  456.9 
##                Lagoon Drive Cyclists       1       31 32  142.13
##             Lake Road Total Cyclists       1       31 32  320.29
##              Mangere Bridge Cyclists       1       31 32  322.68
##         Mangere Safe Routes Cyclists       1       31 32   18.61
##                    Matakana Cyclists       1       31 32   14.97
##               Nelson Street Cyclists       1       31 32  419.9 
##     Nelson Street Lightpath Cyclists       1       31 32  383.23
##       NW Cycleway Kingsland Cyclists       1       31 32  794.23
##         NW Cycleway TeAtatu Cyclists       1       31 32  607.55
##        Oceanview Rd Waiheke Cyclists       1       31 32   92.32
##                  Orewa Path Cyclists       1       31 32  240.35
##           Ormiston Rd Total Cyclists       1       31 32   48.48
##         Pukekohe - Queen St Cyclists       1       31 32    9.06
##         Quay St Eco Display Cyclists       1       31 32  828.58
##   Quay St Spark Arena Total Cyclists       1       31 32  575.45
##                  Rankin Ave Cyclists       1       31 32   23.45
##        Saint Lukes Rd Total Cyclists       1       31 32  170.94
##              SW Shared Path Cyclists       1       31 32  165.26
##            Symonds St Total Cyclists       1       31 32  361.77
##          Tamaki Drive Total Cyclists       1       31 32 1138.06
##   TeAtatu Peninsula Pathway Cyclists       1       31 32  151.55
##  TeWero Bridge Bike Counter Cyclists       1       31 32  582.13
##                Twin Streams Cyclists       1       31 32   83.65
##              Upper Harbour  Cyclists       1       31 32  136.03
##          Upper Queen Street Cyclists       1       31 32  127.52
##  Victoria Street West Total Cyclists       1       31 32  114.94
##    Waterview Unitec Counter Cyclists       1       31 32  174.81

The bikes_daily dataset has the following variables in wide format:

Date : daily from Wed 1 Aug 2018
Bike counter street location names : 41 locations

It seems like the last row consists of NAs, so we can remove this row in the cleaning step.

# Produce a skim summary excluding the histograms as these do not plot in R markdown currently
skim(bikes_monthly)

## Skim summary statistics
##  n obs: 87 
##  n variables: 42 
## 
## -- Variable type:character ------------------------------------------------------------
##  variable missing complete  n min max empty n_unique
##     Month       0       87 87   6  11     0       87
## 
## -- Variable type:numeric --------------------------------------------------------------
##                                   variable missing complete  n     mean
##                                   Beach Rd      48       39 87  8537   
##                                  Canada St      85        2 87  1067   
##                            Carlton Gore Rd      58       29 87  6008.21
##                                  Curran St      61       26 87  8046.12
##             Dominion Rd (near Balmoral Rd)      51       36 87  3391.72
##                 Dominion Rd (near View Rd)      82        5 87  8959.6 
##                              East Coast Rd      24       63 87  3880.1 
##                                 G Sth Road       1       86 87  2684.43
##          Glen Innes (Te Ara ki Uta ki Tai)      73       14 87   658.64
##                             Grafton Bridge      26       61 87 14727.38
##                              Grafton Gully      48       39 87  9556.64
##                                 Grafton Rd      61       26 87  2291.31
##  Gt Nth Rd (citybound only until May 2017)      70       17 87  5030.88
##                                  Highbrook       1       86 87  1162.44
##                                Hopetoun St      61       26 87  4317   
##                                       K Rd      49       38 87 15239   
##                                  Lagoon Dr      25       62 87  5166.73
##                                  Lake Road       1       86 87  8714.14
##                             Mangere Bridge      11       76 87 12440.62
##                        Mangere Safe Routes      70       17 87  1233.41
##                            Matakana Bridge      85        2 87   679   
##                         Nelson St cycleway      62       25 87 10846   
##                        Nelson St Lightpath      62       25 87 14633.92
##                   NW Cycleway \n(Te Atatu)       1       86 87 12861.63
##                    NW Cycleway (Kingsland)       1       86 87 14751.98
##                      Oceanview Rd, Waiheke      86        1 87  6086   
##                                      Orewa       1       86 87  8587.48
##                      Ormiston Rd Flat Bush      85        2 87  1194.5 
##                          Pukekohe Queen St      85        2 87   264   
##                        Quay St Spark Arena      61       26 87 32134.92
##                              Quay St totem      69       18 87 23498.89
##                                SH20 Dom Rd      25       62 87  3140.39
##                               St Luke's Rd      80        7 87  5307.29
##                                 Symonds St      49       38 87 11254.61
##                        Tamaki Dr (EB + WB)      18       69 87 34166.7 
##                             Te Wero Bridge      56       31 87 17058.1 
##                               Twin Streams       1       86 87  3315.06
##                              Upper Harbour       1       86 87  4345.03
##                             Upper Queen St      61       26 87  4198.5 
##                           Victoria St West      61       26 87  3927.35
##                      Waterview SP (Unitec)      82        5 87  4907.4

The bikes_monthly dataset has the following variables in wide format:

Date : monthly from Nov-10
Bike counter street location Names : 41 locations

There are missing values represented as NA’s to be considered in data aggregations.

This analysis will assume that the bike counter data is accurate in counting cyclists as we have no visibility of the apparatus or data storage.

Auckland Population Data

Population estimates for regional council, territorial authority (including Auckland local board), area unit, and urban areas is available from DataHub table viewer. In order to download the data, a request needs to be submitted online but the Stats NZ link doesn’t currently work. As an interim workaround, a manual dataframe is created based on the online query results.

# Create Auckland population dataset
population <- data.frame(Date= c("30-June-11","30-June-12","30-June-13","30-June-14","30-June-15",         "30-June-16","30-June-17"), Auckland = c(1459600, 1476500, 1493200, 1526900, 1569900, 1614500, 1657200   ))

Data Cleaning

The bikes_daily and bikes_monthly counter data do not have coordinates available so we will manually create a dataframe with selected points with the longitude and latitude coordinates from a GPS coordinates lookup website.

Curran_Street
Quay_Street
Grafton_Gully

Feedback has been sent to AT to add the longitude and latitude coordinates to their csv files, so that further mapping and joins with other data can be performed another time.

# Create a streets datafarme
streets <- data.frame(street = c("Curran_Street","Quay_Street","Grafton_Gully"),  longitude=c(174.739289,174.773376,174.7678), latitude=c(-36.840521,-36.844713,-36.8623), stringsAsFactors = FALSE)

Now clean the the bikes_daily and bikes_monthly using dplyr and lubridate R packages.

# Convert the character date variable to a date variable using lubridate
bikes_daily$Date <- dmy(bikes_daily$Date)
# Add a new weekday variable using lubridate and dplyr
bikes_daily <- bikes_daily %>% 
      mutate(weekday =weekdays(Date)) %>% 
      # Remove the last row with NAs in the dataframe using slice using this tip https://stackoverflow.com/questions/38916660/r-delete-last-row-in-dataframe-for-each-group
      slice(-n())

# View the column names by position in bikes_daily to check they are the 3 points we have selected
names(bikes_daily)[4]

## [1] "Curran Street Total Cyclists"

names(bikes_daily)[9]

## [1] "Grafton Gully Cyclists"

names(bikes_daily)[29]

## [1] "Quay St Eco Display Cyclists"

# Rename the 3 points in bikes_daily to match the streets points
names(bikes_daily)[4] <- streets[1,1]
names(bikes_daily)[9] <- streets[2,1]
names(bikes_daily)[29] <- streets[3,1]

# View the column names by position in bikes_monthly to check they are the 3 points we have selected
names(bikes_monthly)[5]

## [1] "Curran St"

names(bikes_monthly)[11]

## [1] "Grafton Gully"

names(bikes_monthly)[32]

## [1] "Quay St totem"

# Rename the 3 points in bikes_monthly to match the streets points
names(bikes_monthly)[5] <- streets[1,1]
names(bikes_monthly)[11] <- streets[2,1]
names(bikes_monthly)[32] <- streets[3,1]

# Add the row sums ignoring the NAs and remove the first row which has a month as Te Atatu
bikes_monthly <- bikes_monthly %>% 
      mutate(sumStreets = rowSums(.[2:dim(bikes_monthly)[2]],na.rm = TRUE)) %>% 
      slice(-1)
# Convert the character date variable to a date variable using lubridate
bikes_monthly$Month <- dmy(paste0("01-", bikes_monthly$Month)) 
# In order to select the bikes columns as the vector of points in the streets dataframe, we need to provide column names as variables which contain strings. We could use select_ as the standard evaluation counterpart of select. However from the docs select_ has been deprecated in favour of tidy eval. https://timmastny.rbind.io/blog/nse-tidy-eval-dplyr-leadr/#expressions. We use the unquote !! https://adv-r.hadley.nz/quasiquotation.html to tell dplyr to select the variables that match the values in the streets$street vector and combine with the Date and weekday variable in the bikes dataset
bikes_subset <- bikes_daily %>% 
      select(!!streets$street) %>% 
      cbind(bikes_daily %>% select(Date,weekday))

# Now convert this dataframe into a tidy dataset using gather, where the Key is the street and the value is the count http://garrettgman.github.io/tidying/
bikes_subset <- bikes_subset %>% 
      gather(key = "street", value="counts",1:3)

# Now join the streets coordinates and the bikes subset dataframe using dplyr function inner_join    
bikes_streets <- streets %>%
      inner_join(bikes_subset,by = "street")

# Creat a sf object from the streets dataframe with points geometries with crs as 2193 https://epsg.io/2193
bikes_streets_sf <- st_as_sf(bikes_streets, 
                             coords = c("longitude", "latitude"), crs=2193)

Visualisations

Bike Counter Locations Daily Data

Let’s visualise the selected bike counter locations using the leaflet R package.

# Create a leaflet map centred at the mean of the streets points
long <- mean(streets$longitude)
lat <- mean(streets$latitude)
m <- leaflet() %>% 
      setView(lng = long, lat = lat, zoom = 13) %>% 
      addTiles()
m %>% 
      addMarkers(data = bikes_streets_sf, 
                 popup = ~htmlEscape(street)) %>% 
      addProviderTiles(providers$ Stamen.TonerBackground)

Let’s view the daily counts by the 3 counter points with scatter plot and labelled scatterplot, with the regression trend line.

# Plot a facetted plot of the  daily bike counts
ggplot(bikes_streets,aes(Date,counts))+ 
      geom_point() +
      geom_smooth(method="lm") +
      facet_wrap(~street) +
      theme_bw()+
      ggtitle("Daily Bicycle Counts and Weekdays for Counter Locations")

# Plot a facetted plot of the  daily bike counts
ggplot(bikes_streets,aes(Date,counts))+ 
      geom_point() +
      geom_label(aes(label=weekday)) +
      geom_smooth(method="lm") +
      facet_wrap(~street) +
      theme_bw()+
      ggtitle("Daily Bicycle Counts and Weekdays for Counter Locations")

It seems like there is a downward trend for August, perhaps this is related to the winter weather in August?

Next create a ggplot of the total aggregated count by weekday and street.

# Create a gpplot of the total aggregated count by weekday and street
bikes_streets %>% 
      group_by(weekday,street) %>% 
      summarise(weekday_sum=sum(counts)) %>% 
      na.omit() %>% 
      # Plot the weekday sum of counts and order the factor levels of the weekday variable but then use forcats fc_rev to reverse the ordering of labels when we add the coord_flip
      ggplot(aes(forcats::fct_rev(factor(weekday, c("Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday","Sunday"))) , weekday_sum, fill=street))  + 
      geom_col(position="dodge") +
      scale_fill_viridis_d()  +     
      xlab("Weekday")+
      ylab("Sum of Weekday Counts") +
      coord_flip()  +
      theme_bw() +
      ggtitle("Total Aggregated Bicyle Count by Weekday and Street")+ 
      labs(caption = "Source: AT https://at.govt.nz/cycling-walking/research-monitoring/")

Weekdays notably Thursday appears to be the most popular day in Grafton Gully and Quay Street, but Sunday appears the most popular day for cycling in Curran Street.

Bike Counter Locations Monthly Aggregated Data

Now lets turn to the monthly bike count dataset bike_monthly and look at this as a time series.

First we will use the base decompose function to break the time series into its components.

# Create a time series object bikes_monthly_ts
bikes_monthly_ts <- ts(bikes_monthly$sumStreets, start=c(2010,11), frequency=12)
decomp <- decompose(bikes_monthly_ts)
# Plot the decomposed time series
autoplot(decomp)

From the decomposed components it appears there is a repetitive seasonality pattern dipping in winter months and an overall upwards trend.

Compare Aggregated Bike Data with another Dataset

Now extract and plot the trend line against another dataset. Here we will use the Auckland population data over a similar time period.

# Plot the Auckland population
population$Date <- dmy(population$Date)
pop_ts <- ts(population$Auckland, start=c(2011), frequency=1)

bike_trend <- decomp$trend
# Plot the Auckland bicycle growth
bike_plot <- autoplot(bike_trend,na.rm=TRUE) +
      ylab("Bicycle Counts") +
      xlab("Date")+
      scale_y_continuous(labels = scales::comma)+
      ggtitle("Auckland Estimated Bicyle Counts") +
      theme_bw() + 
      labs(caption = "Source: AT https://at.govt.nz/cycling-walking/research-monitoring/")
# Plot the Auckland population growth
pop_plot <- autoplot(pop_ts)+xlab("Date") +
      ylab("Population number")+
      scale_y_continuous(labels = scales::comma)+
      ggtitle("Auckland Estimated Population") +
      theme_bw()+ 
      labs(caption = "Source: NZ Stats http://nzdotstat.stats.govt.nz/")
# Plot side by side plots with cowplot packagefor comparison instead of using 2 y-axis
plot_grid(bike_plot,pop_plot, labels = "AUTO")

# save_plot("cowplot.png", p, ncol = 2)

The cycling trend in A looks remarkably similar to Auckland population growth in B?!? It seems like increasing population is cycling regardless of the hilly roads.

Forecast Bike Numbers

Finally create an arima model using the auto.arima function from the forecast R package to see what the growth in cycling could look like.

# Fit an arima model using auto.arima function
fit <- bikes_monthly_ts %>% 
      auto.arima(seasonal=TRUE) 
# Produce a summary of the fitted model
summary(fit)

## Series: . 
## ARIMA(0,1,1)(2,1,0)[12] 
## 
## Coefficients:
##           ma1     sar1     sar2
##       -0.1933  -0.6515  -0.2840
## s.e.   0.1198   0.1203   0.1369
## 
## sigma^2 estimated as 365080369:  log likelihood=-824.49
## AIC=1656.97   AICc=1657.56   BIC=1666.13
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 1445.464 17238.28 12216.95 0.2184844 6.939697 0.2421102
##                      ACF1
## Training set -0.002190182

# Check ACF and PACF plots for model residuals
tsdisplay(residuals(fit), lag.max=45, main='(0,1,1) Model Residuals')

# Plot the forecast using the autoplot function
fit %>% 
      forecast() %>% 
      autoplot() +
      ylab("Bicycle counts") +
      xlab("Date") +
      scale_y_continuous(labels = scales::comma) +
      scale_x_continuous(labels = scales::number) +
      theme_bw()

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.

This model is a seasonal Arima model. The moving average model has p=0 autoregressive terms, d=1 , q=1 moving average terms, where d=1 ie differencing of order 1. The seasonal part of the model has P=2 seasonal autoregressive (SAR) terms, D=1 number of seasonal differences, Q=0 seasonal moving average (SMA) terms.

Conclusions

The bike count data is interesting data to look at, especially the time series components and forecast. Once the coordinates are available in the AT datasets, further geospatial analysis can be performed through delving into the sf package and using other data.

The data cleaning here used new functions and packages including lubridate and cowplot and operators including the unquote !! . It has also been good to start reading up on tidy eval.