Summer of Tech Event and Intern Analysis

29 Jun 2018

R / EDA / networks / sankey / modeling

Introduction
Data
Clean Data
Success Factors
Specific Events Resulting in Internships
Social Impact
Conclusion

Introduction

We were looking for real data from a non-profit organisation for the R-Ladies Auckland Dataviz meetup. We were approached by Summer of Tech (SoT), who volunteered their data for the group to explore.

SoT is a non-profit organisation that connects employers with students and graduates for paid work experience and graduate jobs. Students create a profile to register interest in a number of available summer internships. The students then attend skills and recruitment events, and SoT facilitates the whole process through to job offers and securing these internships.

The specific questions that SoT have are as follows:

What are the success factors for getting hired? These are the measurable factors over and above being intrinsically awesome.
A lot of effort goes into running a large number of events. Which specific events are more effective and result in internships?
In order to understand the social impact, where and how are the under-represented groups faring?
What data and variables are missing and would be useful to collect?

The objectives of this post are to:

Perform an initial exploratory data analysis. This will form a basis for further analysis with additional data provided and feedback gathered.
Provide insights to SoT’s questions (note these insights are not recommendations or predictions).
Use the opportunity to experiment with new R visualisation packages and functions in networkD3 with real data.

The interviewing and hiring process can be subjective and often based on timing and personality fit. Data can tell some of the story but it is also worth considering the individual cases, and any specialist skills, with the analysis.

LOAD PACKAGES

library(tidyverse)
library(DataExplorer)
library(skimr)
library(networkD3)
library(maps)
library(ggmap)
library(gganimate)
library(animation)

Data

The Summer of Tech data is available from the R-Ladies Auckland GitHub repository (R-LadiesAKL/sotdata).

The following datasets have been made available by Summer of Tech for the R-Ladies dataviz meetup:

“2017 event detail report.csv” includes event name and type, location (by city), start/finish time and date, event numbers
“2017 interns per employer.csv” includes the number of interns each employer hired in 2017
“Event attendances data Dec 2017.csv” includes event name and type, and student details
“Institution to region mapper.csv” includes institutions and corresponding region
“Intern data Dec 2017.csv” includes student details and internship details

# Import the "2017 event detail report.csv" 
eventdetail <- read_csv("https://raw.githubusercontent.com/R-LadiesAKL/sotdata/master/2017%20event%20detail%20report.csv")
# Import the "Event attendances data Dec 2017.csv"
event <- read_csv("https://raw.githubusercontent.com/R-LadiesAKL/sotdata/master/Event%20attendances%20data%20Dec%202017.csv")
# Import the intern data "Intern data Dec 2017.csv"
intern <- read_csv("https://raw.githubusercontent.com/R-LadiesAKL/sotdata/master/Intern%20data%20Dec%202017.csv")
#  Only select the first 214 rows, the rest are N/A
intern <- intern[1:214,]
# Import the "Institution to region mapper.csv"
mapper <- read_csv("https://raw.githubusercontent.com/R-LadiesAKL/sotdata/master/Institution%20to%20region%20mapper.csv")
# Import the "2017 interns per employer.csv"
internemployer <- read_csv("https://raw.githubusercontent.com/R-LadiesAKL/sotdata/master/2017%20interns%20per%20employer.csv")

#  Use the base R make.names function to make the column names syntactically valid for further analysis and output in R Markdown. It replaces the spaces in the names with a "."
#  Make syntactically valid names for eventdetail
names(eventdetail) <- make.names(names(eventdetail))
#  Make syntactically valid names for event
names(event) <- make.names(names(event))
#  Make syntactically valid names for intern
names(intern) <- make.names(names(intern))
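As a quick illustration of what make.names does, here is a minimal sketch applied to a few made-up column names. The names in `raw_names` are hypothetical and are not taken from the SoT files.

```r
# Hypothetical column names containing spaces or starting with a digit
raw_names <- c("Student token", "Year of study", "Rate value", "2017 total")

# make.names() replaces invalid characters with "." and prefixes
# names that start with a digit with "X"
clean_names <- make.names(raw_names)
print(clean_names)
```

The cleaned names can then be used with `$` or inside dplyr verbs without backticks.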

A summary of the data in numbers:

From the event attendances data, there are 1381 students from 34 institutions, studying 886 different qualifications, who expressed interest in the events.
From the event detail data, there are 4529 registrations and 3206 attendances across 109 events.
There are 214 internship records for 213 unique students, in 6 programmes and from 18 institutions.
There are 20 different internship role types.
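Counts like these can be derived with a few one-liners. The sketch below uses a small made-up data frame; `attend`, its `status` column and all of its values are assumptions for illustration, not the real SoT columns.

```r
# Toy attendance records: one row per student registration (made-up values)
attend <- data.frame(
  token       = c("a1", "a1", "b2", "c3", "c3", "c3"),
  institution = c("Uni A", "Uni A", "Uni B", "Uni A", "Uni A", "Uni A"),
  status      = c("attended", "registered", "attended",
                  "attended", "attended", "registered"),
  stringsAsFactors = FALSE
)

n_students      <- length(unique(attend$token))        # distinct students
n_institutions  <- length(unique(attend$institution))  # distinct institutions
n_registrations <- nrow(attend)                        # every row is a registration
n_attendances   <- sum(attend$status == "attended")    # rows flagged as attended

print(c(students = n_students, institutions = n_institutions,
        registrations = n_registrations, attendances = n_attendances))
```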

Clean Data

We will first clean the data.

Duplicates

From the summary statistics, there appears to be a student who has two internships. Let’s view and remove the duplicate token.

#  Show the internships that share a duplicated token ID (%in% also handles more than one duplicated token)
intern %>% 
      filter(token %in% token[duplicated(token)]) %>% 
      kableExtra::kable() %>%
      kableExtra::kable_styling()

token	Programme	Region	internship.type	Intern.ethnicity	Intern.gender	Institution	Year.of.study	Final.year	Offer.Created.at	Rate.value	Rate.period
8f4d30d4577a4b7424b4cc75524bc793	SoTJobs (Biz)	Wellington	hr	pakeha	female	Wellington Institute of Technology (Weltec)	3	FALSE	2017-10-02 08:40:39 +1300	20.2	hour
8f4d30d4577a4b7424b4cc75524bc793	Summer of Biz	Wellington	hr	pakeha	female	Wellington Institute of Technology (Weltec)	3	FALSE	2017-10-03 17:17:58 +1300	41000.0	year

# Remove the duplicate token, keeping the first row for each token, using dplyr's distinct function
intern <- intern %>% 
      distinct(token, .keep_all = TRUE)
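The duplicate check and removal can be sketched on toy data in base R. The data frame `toy_interns` and its values are made up; the point is only to show how duplicated() flags the second and later occurrences of a token.

```r
# Toy intern records in which token "t2" appears twice (made-up values)
toy_interns <- data.frame(
  token     = c("t1", "t2", "t2", "t3"),
  programme = c("Tech", "Biz", "Biz (second offer)", "Tech"),
  stringsAsFactors = FALSE
)

# Tokens that occur more than once
dup_tokens <- unique(toy_interns$token[duplicated(toy_interns$token)])

# All rows belonging to a duplicated token, for inspection
dup_rows <- toy_interns[toy_interns$token %in% dup_tokens, ]

# Keep only the first row for each token
deduped <- toy_interns[!duplicated(toy_interns$token), ]
print(deduped)
```

Note that this keeps whichever row comes first, so the row order decides which of the two offers survives.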

The event detail data also contains one Health and Safety event with no registered attendees, which may be a duplicate of another Health and Safety event.

Add successful intern flag

Add a variable to the intern data so that students with a successful internship can be identified after joining with the event data.

# Add a variable to the intern data to identify successful internship students after merging with the event data
intern <- intern %>% 
      mutate(internship="Successful")

Rename column variables

Let’s rename some of the column variables so that we can track which dataset they came from.

# Rename column variables
intern <- intern %>% 
      # Rename the intern Region to Intern.region
      rename_at(vars('Region'), ~ 'Intern.region') %>% 
      # Rename the intern Institution to Intern.institution
      rename_at(vars('Institution'), ~ 'Intern.institution') %>% 
      # Rename the intern Year.of.study to Intern.year.of.study
      rename_at(vars('Year.of.study'), ~ 'Intern.year.of.study')
event <- event %>% 
      # Rename the Student.token variable in event to token so that it matches the intern variable name
      rename_at(vars('Student.token'), ~ 'token') %>% 
      # Rename the event Institution to Event.institution
      rename_at(vars('Institution'), ~ 'Event.institution') 
eventdetail <- eventdetail %>% 
      # Rename the eventdetail Name variable to Event.name 
      rename_at(vars('Name'), ~ 'Event.name') %>% 
      # Rename the eventdetail Location variable to Event.region
      rename_at(vars('Location'), ~ 'Event.region')

Mapper update

Add a missing institution, “Toi Ohomai Institute of Technology”, to the region-institution mapper.

# Add a missing institution to the mapper
mapper <- rbind(mapper,c("Toi Ohomai Institute of Technology","Multiple"))

Join files

Next, join the intern and event files by the common identifier, the student ID “token”.

# Join the intern and event by the common token variable
combine <- full_join(intern,event,by="token")
# Replace the internship NAs with "Unsuccessful". This will include all event attendance states, including late withdrawals
combine$internship <- replace_na(combine$internship,"Unsuccessful")
# Since some students register for multiple events, let's create a flag for whether a student registered for any events
combine <- combine %>% 
      mutate(registered.for.events=ifelse(is.na(Event.region),"no","yes"))
# Create a new student institution variable from the Event and intern institution variables
combine <- combine %>% 
      mutate(Institution=ifelse(is.na(Event.institution),Intern.institution, Event.institution))
# Map institution to the region using the mapper 
combine <- combine %>% 
      left_join(mapper,by="Institution") %>% 
      rename(Student.institution=Institution) %>% 
      rename(Student.region=Region)
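To make the effect of the full join and the two flags concrete, here is a minimal base R sketch on toy data. It uses merge(all = TRUE) as a stand-in for full_join, and the data frames `toy_interns` and `toy_events` are made up for illustration.

```r
# Toy data: t1 and t2 are interns; t1 and t3 registered for events (made-up values)
toy_interns <- data.frame(token = c("t1", "t2"),
                          internship = "Successful",
                          stringsAsFactors = FALSE)
toy_events <- data.frame(token = c("t1", "t1", "t3"),
                         Event.region = c("Auckland", "Wellington", "Auckland"),
                         stringsAsFactors = FALSE)

# Full (outer) join: keep tokens that appear in either table
toy_combined <- merge(toy_interns, toy_events, by = "token", all = TRUE)

# Tokens with no intern record get an NA internship, so recode them
toy_combined$internship[is.na(toy_combined$internship)] <- "Unsuccessful"

# Tokens with no event record get an NA Event.region, so flag them
toy_combined$registered.for.events <-
  ifelse(is.na(toy_combined$Event.region), "no", "yes")

print(toy_combined)
```

In this toy case t2 is a successful intern who never registered for an event, and t3 registered but has no internship.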

Success Factor - Academic Institution

Since we have a flow from one set of values to another i.e. flow from a factor to whether the student was successful or unsuccessful with an internship, we can use flow and network diagrams to visualise potential success factors.

One potential factor is the region of study, i.e. whether where the student studies affects success. Grouping by region consolidates the 34 institutions into a smaller number of categories for visualisation.

The following diagrams are different representations of the total number of students from each region of study to successful or unsuccessful internships.

Sankey Diagram

I initially tried the gvisSankey function from the googleVis R package. The downside is that it creates the plot in a new browser tab and is not easily embedded in R Markdown, so it is not suitable for use here.

Then I tried the networkD3 R package to create a D3 JavaScript Sankey diagram. I used this tutorial for the networkD3 diagrams.

After creating a summary of the data with the renamed Student.region and Intern.region variables, I included a check that the total number of intern tokens in the original file is the same as the sum in this summary after the join.

# Create a summary of the counts of unique students by student region, internship outcome and intern region
regioninterncount <- combine %>%
      distinct(Student.region,internship,token,Intern.region) %>%
      group_by(Student.region,internship,Intern.region) %>%  
      summarize(counts = n()) %>%
      ungroup() %>%  
      arrange(desc(counts)) %>% 
      mutate(internship.long=ifelse(internship=="Unsuccessful","Unsuccessful",paste0(internship," intern in ",Intern.region))) %>% 
      mutate(Student.region.long=paste0("Studies in ",Student.region))
# Check that the total intern tokens in original file is same as sum in this summary
length(unique(intern$token))==
regioninterncount %>% filter(stringr::str_detect(internship.long ,"Successful")) %>% select(counts) %>% sum()

[1] TRUE

# Create the links and nodes for the Sankey and force network diagrams
name_vec <- c(unique(regioninterncount$Student.region.long), unique(regioninterncount$internship.long))
name_vec <- name_vec[!is.na(name_vec)]
lengthid <- length(name_vec)
# Create the nodes, this set up requires the id to start at 0
nodes <- data.frame(name = name_vec,id=0:(lengthid-1),stringsAsFactors = FALSE)
links <- regioninterncount %>%
  left_join(nodes, by = c('Student.region.long' = 'name')) %>%
  rename(origin_id = id) %>%
  left_join(nodes, by = c('internship.long' = 'name')) %>%
  rename(dest_id = id) %>% 
      data.frame()
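The zero-based node ids are the part that is easiest to get wrong, so here is a minimal base R sketch of the same idea on toy flows. The data frame `toy_flows` and its counts are made up; match() is used here instead of the two joins.

```r
# Toy flows from a study region to an internship outcome (made-up counts)
toy_flows <- data.frame(
  source = c("Studies in A", "Studies in A", "Studies in B"),
  target = c("Successful", "Unsuccessful", "Unsuccessful"),
  counts = c(5, 20, 10),
  stringsAsFactors = FALSE
)

# One node per unique label, with ids starting at 0
node_names <- unique(c(toy_flows$source, toy_flows$target))
toy_nodes <- data.frame(name = node_names,
                        id = seq_along(node_names) - 1,
                        stringsAsFactors = FALSE)

# match() returns 1-based positions, so subtract 1 to get the 0-based ids
toy_links <- data.frame(
  origin_id = match(toy_flows$source, toy_nodes$name) - 1,
  dest_id   = match(toy_flows$target, toy_nodes$name) - 1,
  counts    = toy_flows$counts
)
print(toy_links)
```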

See below the interactive Sankey diagram:

s <- sankeyNetwork(Links = links, Nodes = nodes, Source = 'origin_id', Target = 'dest_id', Value = 'counts', NodeID = 'name', 
              # Format
              fontSize = 14,
              nodeWidth = 20)   
# View Sankey s
s

# Save as htmlwidget
# htmlwidgets::saveWidget(s, "sankeynet.html", selfcontained = TRUE)

Force Directed Network

Let’s view this with an interactive and zoom-able force directed network graph using the forceNetwork function from the same networkD3 R package.

# Create and plot the forcenetwork with the links and nodes
f <- forceNetwork(Links = links, Nodes = nodes, Source = 'origin_id', Target = 'dest_id', NodeID = 'name', Group = 'id', 
             # Customise the links
             #Value = 'counts', 
             zoom = TRUE,
             #linkWidth = networkD3::JS("function(d) { return d.value/50; }"),
             charge=-1500,
             # Format
             #width = 600, 
             #height = 400,
             #legend=T,
             fontFamily = "arial",
             opacity = 1, 
             opacityNoHover = 0.5,
             fontSize = 18) %>% 
      htmlwidgets::prependContent(htmltools::tags$h3("Institution Links to Successful and Unsuccessful Internships")) 
# View the f network created
f

Institution Links to Successful and Unsuccessful Internships

# Save as htmlwidget
# htmlwidgets::saveWidget(f, "forcenetwork.html", selfcontained = TRUE)

Map

Since there is regional data, let’s view the flow spatially using a map. In R Markdown the gganimate object is shown as a sequence of panes; however, we can save it as a gif to view it as an animation.

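# Manually overwrite the Intern.region value in row 8 (note: this relies on the row order of the summary above)
regioninterncount$Intern.region[8] <- "Palmerston North"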
# Geocode the region data using mutate_geocode from ggmap R package
regioninterngeo <- regioninterncount %>% 
      filter(internship=="Successful") %>% 
      mutate_geocode(Student.region) %>% 
      rename(lat.s =lat) %>% 
      rename(lon.s=lon) %>% 
      mutate_geocode( Intern.region)
# Get a simple NZ map for ggplot
nzMap <- borders("nz", colour="grey", fill="grey")
# Extract the data of the students where there was a flow from region to region for the lines
flowdata <- regioninterngeo %>% 
      mutate(moved=ifelse(lon.s==lon,"no","yes")) %>% 
      filter(moved=="yes") %>% 
      # Remove the Multiple region from the Student.region
      filter(Student.region!="Multiple")
# Create a data.frame with the region cities and latitude and longitudes
regioncodes <- regioninterngeo %>% 
      select(Intern.region,lon,lat) %>%  
      rbind(c("Waikato",175.18940, -37.45580),c("Otago" ,170.15476, -45.47907 )) %>% 
      distinct() %>% 
      drop_na() %>% 
      rename(region=Intern.region) %>% 
      mutate_at(vars(lon,lat),funs(as.numeric)) 

# Plot the labels of the cities onto the nzMap using ggplot. We will ignore the student region "Multiple" in this map
g <- ggplot() + 
      nzMap +
      # Adjust the region labels
      geom_text(data = regioncodes, aes(x = (lon+0.5), y = (lat+0.5), label= region)) +
      # Plot the geocoded data with the counts of the interns
      geom_point(data = regioninterngeo,
                 aes(x = lon, y = lat, 
                     color=Intern.region,
                     size=counts,
                     alpha=0.6))  +
      # Add the arrows from the students institutions to the intern cities using the flowdata 
      # and the geom_curve function
      geom_curve(data=flowdata, 
                 aes(x = lon.s, y = lat.s, xend = lon, yend = lat, 
                     color=Intern.region, 
                     frame=Intern.region,
                 size=counts),
                 curvature = 0.5, 
                 lineend ='square',
                 arrow = arrow(length = unit(0.3,"cm"), type = "open")) +
      # Use the ggthemes package theme_map to remove the axes
      ggthemes::theme_map()  +
      # Remove specific aesthetics from the legend
      guides(size=FALSE, alpha=FALSE) + 
      # Set the colour palette
      scale_color_brewer(palette="Set1") +
      # Rename the legend title
      labs(color = "Intern City")  +
      # Reposition the legend, overriding the theme_map defaults
      theme(
              legend.position = "bottom",
              legend.direction = "horizontal",
              legend.justification = "center")
      
# To animate, reinstall the gganimate and animation R packages from GitHub and install ImageMagick to convert to gif. 
# devtools::install_github("dgrtwo/gganimate") 
# devtools::install_github("yihui/animation")
# library(installr)
# install.ImageMagick()
# ani.options(convert=shortPathName("C:\\Program Files\\ImageMagick-7.0.8-Q16\\magick.exe"))
gganimate(g)

# Save as a gif
# gganimate(g,"mapinterns.gif")

The two main institution feeders to internships appear to be Victoria University of Wellington and Auckland University, with the former accounting for half of the internships. The region that is the largest feeder of interns to other regions is Christchurch.

Specific Events Resulting in Internships

One of the challenges with this data is that a student may attend one or many events but may only obtain one internship. A successful internship is potentially a result of a combination of event attendances.

The other challenge is that there are 109 individual events, which are difficult to visualise all at once.

In order to identify potential events resulting in internships, we will need to reshape the data given these challenges.

Event Counts by Student Token

To summarise this one-to-many relationship with one token ID per row, I have summarised the event counts as a wide dataset with one column per event region. This dataframe also has the internship variable as “Successful” or “Unsuccessful”. We can then look at whether total event attendance is a factor.

After creating the new dataframe, check that there are the same number of successful interns in this new dataset as the original intern data.

# Create a new summary dataframe of the event counts by student.
eventcountbystudent  <- combine %>%     
      select(token,internship,registered.for.events,Event.region,Student.region) %>% 
      group_by(token,internship,registered.for.events,Event.region,Student.region) %>% 
      mutate(event.count=n()) %>% 
      distinct()  %>% 
      ungroup()
# Now create a wide dataset with the spread function
eventcountbystudent_wide <- eventcountbystudent %>% 
      spread(Event.region,event.count) %>% 
      rename(DirectApp='<NA>')  %>% 
      # Replace the NAs with 0 for this plot
      mutate_at(vars(Auckland,Wellington,Christchurch,Dunedin,Hamilton,DirectApp), funs(replace(., is.na(.), 0))) %>% 
      # Add a total events variable
      mutate(TotEvents = Auckland+Wellington+Christchurch+Dunedin+Hamilton)
# Check that there are the same number of successful interns as the original intern data (index by name rather than position)
table(eventcountbystudent_wide$internship)["Successful"]==length(unique(intern$token))

## Successful 
##       TRUE
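The long-to-wide reshape can also be sketched in base R with a cross-tabulation, which fills missing combinations with 0 automatically. This is an alternative shown for illustration on made-up data (`toy_regs`), not the approach used in this post.

```r
# Toy registrations: one row per student per event (made-up values)
toy_regs <- data.frame(
  token        = c("t1", "t1", "t1", "t2", "t3", "t3"),
  Event.region = c("Auckland", "Auckland", "Wellington",
                   "Wellington", "Auckland", "Auckland"),
  stringsAsFactors = FALSE
)

# Cross-tabulate tokens against regions: one row per token, one column per region
toy_wide <- as.data.frame.matrix(table(toy_regs$token, toy_regs$Event.region))

# Total events per student, computed before any non-numeric column is added
toy_wide$TotEvents <- rowSums(toy_wide)
toy_wide$token <- rownames(toy_wide)
print(toy_wide)
```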

Total registered events by student token

Of the 213 successful interns, 43 applied directly and did not attend any events. Let’s first take a look at the remainder, who registered for events. The total event registration by internship success is plotted as a violin plot, which is a mirrored density plot. A boxplot is layered on top to show the quantiles. We use registration as an indication of interest in internships, as actual attendance may depend on other external factors.

# Plot a violin plot with a boxplot layer
eventcountbystudent_wide  %>% 
      ggplot() +
      geom_violin(aes(internship,TotEvents),fill="purple") +
      geom_boxplot(aes(internship,TotEvents),alpha=0.3) +
      theme_bw() +
      ggtitle("Distribution of Registered Events Count and Internship") +
      xlab("Internship") +
      ylab("Count of registered events (All Regions)")

Overall, it appears from the boxplot that successful interns registered to attend more events. However, there are some outliers among the unsuccessful students who registered for a very large number of events; the maximum is 47 by one student. The greater width at the bottom of the unsuccessful violin indicates that more unsuccessful students than successful interns registered for only a few events.
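To put numbers next to the violin and box plot, a per-group summary can be computed directly. This is a minimal base R sketch on made-up counts: the data frame `toy_counts` and its values are hypothetical, and only the outlier value of 47 echoes the maximum quoted above.

```r
# Toy registration counts per student (made-up values)
toy_counts <- data.frame(
  internship = c(rep("Successful", 5), rep("Unsuccessful", 7)),
  TotEvents  = c(2, 4, 5, 6, 8, 0, 1, 1, 2, 3, 4, 47),
  stringsAsFactors = FALSE
)

# The median per group is the centre line of each box
group_medians <- tapply(toy_counts$TotEvents, toy_counts$internship, median)

# The maximum per group shows how far the outliers reach
group_maxima <- tapply(toy_counts$TotEvents, toy_counts$internship, max)

print(group_medians)
print(group_maxima)
```

Comparing medians rather than means keeps a single extreme student from dominating the comparison.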

Registration count by student region

Next we will break down and plot the total registered events by the region in which the student studies.

This plot uses a jitter effect to add random variation to disperse the plotted points.

eventcountbystudent_wide    %>%    
      # Replace the NAs with 0 for this plot
      mutate_at(vars(Auckland,Wellington,Christchurch,Dunedin,Hamilton,DirectApp), funs(replace(., is.na(.), 0))) %>% 
      # Add a total events variable
      mutate(TotEvents = Auckland+Wellington+Christchurch+Dunedin+Hamilton) %>% 
      # Rename the Multiple student region to Multiple Region Academy
      mutate_at(vars(Student.region),funs(replace(.,.== "Multiple","Multiple Region Academy"))) %>%  
      ggplot() +
      geom_jitter(aes(internship,TotEvents,colour=internship))+
      xlab("Internship")+
      ylab("Total number student registered events")+
      ggtitle("Number of Student Registered Events by Student Region of Study") +
      theme(axis.text.x=element_blank()) +
      scale_color_manual(values=c("green", "blue"))+
      facet_wrap(~Student.region)

Students from outside the major centres register in very low numbers and are relatively unsuccessful in obtaining internships.

Decision Tree

Instead of trying to visualise all the events, we can use a model to visualise, as a decision tree, which combinations of events successful interns attended in 2017. We will use the rpart R package to create a classification model and the rattle R package, which has a nice function, fancyRpartPlot, to plot the tree below. Here I also used this blog.

This is a very simple model which is not intended as a prediction model with measures of accuracy. However the reshaped data with events classified with 0 or 1, which is also called one hot encoding, could be leveraged to build predictive models.
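The 0/1 encoding mentioned above can be sketched in base R as follows. The data frame `toy_attendance` and the event names in it are made up for illustration, and the cross-tabulation approach is an alternative to the reshape used in this post.

```r
# Toy attendance: one row per student per event (made-up values)
toy_attendance <- data.frame(
  token      = c("t1", "t1", "t2", "t3", "t3"),
  Event.name = c("CV Clinic", "Meet and Greet", "CV Clinic",
                 "Meet and Greet", "Meet and Greet"),
  stringsAsFactors = FALSE
)

# Count attendances per token and event, then collapse counts to 0/1
# so a repeated attendance still yields 1
toy_onehot <- (table(toy_attendance$token, toy_attendance$Event.name) > 0) * 1
toy_onehot <- as.data.frame.matrix(toy_onehot)
print(toy_onehot)
```

Each row is then one student and each column one event, which is the shape a classifier such as rpart expects.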

# Create a new summary dataframe of the event names by student.
eventnamebystudent_wide  <- combine %>%     
      select(token,internship,Student.region,Event.name) %>% 
      mutate(count=1) %>% 
      distinct() %>% 
      spread(Event.name,count)  %>% 
      # Replace the NAs with 0 for this plot
      mutate_if(is.numeric, funs(replace(., is.na(.), 0)))
# Convert the internship values with 0 and 1 for the model
eventnamebystudent_wide<- eventnamebystudent_wide %>% 
      mutate_at(vars(internship),funs(replace(.,.== "Successful",1))) %>% 
      mutate_at(vars(internship),funs(replace(.,.== "Unsuccessful",0))) %>% 
# Rename the NA column, which marks the students who did not attend any events
      rename(NoEvents='<NA>')
eventnamebystudent_wide$Student.region <- as.factor(eventnamebystudent_wide$Student.region )
eventnamebystudent_wide$internship <- as.factor(eventnamebystudent_wide$internship)
# Create a model using rpart, dropping columns 1 and 3 (token and Student.region) from the predictors
model <- rpart::rpart(internship~.,data=eventnamebystudent_wide[,-c(1,3)], cp=0.001 ) # default cp is 0.01, decreasing this value creates a deeper tree
rattle::fancyRpartPlot(model, sub = "Decision tree of events that result in successful (green) 2017 internships",palettes=c("Greys", "Greens"))

From this tree we can see the 43 successful interns who applied directly on the far-right green branch. They make up 0.03 (43 of 1424) of the total pool of applicants.

The remaining branches relate to the students who registered for events, with the 170 who were successful shown in green. Note only the top branches are shown.
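The figures quoted for the tree follow from simple arithmetic on numbers already given in this post, as this short check shows (the variable names are only for illustration).

```r
# Figures quoted in the text
direct_applicants  <- 43    # successful interns who applied directly
successful_interns <- 213   # unique successful interns
total_pool         <- 1424  # total pool of applicants in the tree

# Share of the pool on the direct-application branch
share_direct <- direct_applicants / total_pool

# Successful interns who registered for at least one event
via_events <- successful_interns - direct_applicants

print(round(share_direct, 2))
print(via_events)
```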

Social Impact

Let’s take a look at the percentage breakdown of intern ethnicity and gender of successful interns.

# Plot the intern ethnicity data 
intern_eth <- intern %>% 
      select(Intern.ethnicity) %>% 
      table() %>% 
      prop.table() %>% 
      data.frame()  
intern_eth %>% 
      ggplot( aes(., 100*Freq)) +
      geom_col(fill="purple") +
      coord_flip() +
      labs(title = "Intern Breakdown by Intern Ethnicity\n", x = "\nEthnicity", y = "Percentage\n")+ 
      guides(fill=FALSE)

46% of interns were pakeha; however, a large 23% of ethnicities were undisclosed.

# Plot the intern gender data as a frequency plot
intern_gender <- intern %>% 
      select(Intern.gender) %>% 
      table() %>% 
      prop.table() %>% 
      data.frame(stringsAsFactors = FALSE)  
intern_gender %>% 
      ggplot( aes(., 100*Freq)) +
      geom_col(fill="purple") +
      coord_flip() +
      labs(title = "Intern Breakdown by Intern Gender\n", x = "\nGender", y = "Percentage\n")+ 
      guides(fill=FALSE)

In 2017, 39% of interns were female and 59% were male.
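The percentage breakdowns come from table() followed by prop.table(). A minimal sketch on a made-up vector shows the mechanics; `toy_gender` and its values are hypothetical and do not reflect the real intern data.

```r
# Toy gender values for eight interns (made-up values)
toy_gender <- c("female", "male", "male", "female",
                "male", "male", "undisclosed", "undisclosed")

# Count each category, convert counts to proportions, then to percentages
toy_pct <- 100 * prop.table(table(toy_gender))
print(toy_pct)
```

Because prop.table() divides by the total, the percentages always sum to 100, including any undisclosed category.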

Conclusion

Following this analysis, the preliminary insights into Summer of Tech’s questions are below:

Success factors

For the success factors in obtaining an internship, we looked at variables that students can influence or change, and not intrinsic factors such as demographics.
The institution that the student attends appears to be a factor.
More event registrations appear to increase the overall chances of success, although this may be an indication of the interest and drive of the student.
From the event data, students from outside the major centres have a relatively low chance of success, but could this be related to low awareness, given the low numbers registering?
It would be useful to have the IT skills in the student profile applications and IT skills listed as required by the employers to determine if skills are success factors.

Specific events resulting in internships

There is a very large number of events run relative to the number of successful interns. It appears from a simple model that the top five events that resulted in 2017 internships are SoT&Biz Guide to Offers, Meet & Greet - WELLINGTON, Guide to 2017 Offers, Meet & Greet ProTips - Alumni Panel, and CV Clinics 1:1 feedback.

Social Impact

There is demographic data for Intern ethnicity and Intern gender on the interns, but not currently on the students creating a profile or attending the events.
It could be useful to obtain the gender and ethnicity demographic data for applicants over time to view trends and proportions of those who apply versus those who are successful for the social impact.
Also the numbers for under-represented Intern ethnicities are very small or undisclosed for December 2017.

Missing Data

Regarding the large undisclosed number in the intern ethnicities, is ethnicity an optional field? Scottish/Phillipino is a unique combination! Perhaps make this field a mandatory drop-down and add a “prefer not to say” option.
Additionally for the gender option, perhaps add a “prefer not to say” option.
It could be useful to have an employer ID / token in the intern data to map and plot this with the other data variables.