Recently I have been working on a Natural Language Processing (NLP) client project. The field appears to make extensive use of Python packages, so I took the opportunity to go on an NLP journey in Python, starting with a Jupyter notebook. The Python packages used there included the research tool NLTK, gensim, and the more recent spaCy.
This post is the next step in that journey: producing a pipeline for the NLP tasks of text mining and Named Entity Recognition (NER) using the Python spaCy NLP toolkit from within R. This is made possible by reticulate, the R interface to Python.
This post will cover loading the required packages, initialising the spaCy Python back end, loading and summarising the headline data, and performing Named Entity Recognition with cleanNLP and spaCy. It will not, however, cover advanced topic modelling or training annotation models in spaCy.
Load Packages
library(tidyverse)
# Packages for manipulating data
library(stringr)
library(lubridate)
# Packages for NLP
library(NLP)
# install.packages("openNLPmodels.en",repos = "http://datacube.wu.ac.at/", type = "source")
library(openNLP)
library(cleanNLP)
# cnlp_download_corenlp() # install the coreNLP Java back end 'CoreNLP' <http://stanfordnlp.github.io/CoreNLP/>
# Packages for the Python interface
library(reticulate)
use_virtualenv("r-reticulate")
Python Initialisation
We have loaded the required R packages, including the reticulate package, which provides an R interface to Python modules, classes, and functions.
As per the cleanNLP R package documentation, we will load the spaCy Python NLP backend.
# First set the executable. Note this needs to be set before any initialising
use_python("C:/Users/HOME/Anaconda3/python.exe")
# py_available(initialize = TRUE) # should give TRUE
# Check Python configuration
py_config()
## python: C:/Users/HOME/Anaconda3/python.exe
## libpython: C:/Users/HOME/Anaconda3/python36.dll
## pythonhome: C:\Users\HOME\ANACON~1
## version: 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:\Users\HOME\ANACON~1\lib\site-packages\numpy
## numpy_version: 1.13.3
##
## python versions found:
## C:/Users/HOME/Anaconda3/python.exe
## C:\Users\Home\AppData\Local\Programs\Python\Python36\python.exe
## C:\Users\Home\ANACON~1\python.exe
# Initialise the spaCy backend
cnlp_init_spacy()
Load Data
The "A Million News Headlines" dataset is easy to load with the read_csv function from the readr R package.
It contains news headlines published over a period of 15 years by the reputable Australian news source ABC (Australian Broadcasting Corporation).
The files were first downloaded into a local directory in order to agree to the terms of use.
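A minimal sketch of the load step (the local file path is an assumption; the Kaggle download is typically named abcnews-date-text.csv):
# Read the locally downloaded headlines CSV (path is an assumption)
abc <- read_csv("data/abcnews-date-text.csv")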
Data Summary
Let’s take a look at the ABC headline data.
# Change the date to a date format
abc$publish_date <- as.Date(as.character(abc$publish_date), format = '%Y%m%d')
# Add new columns for the year, month and day using the lubridate R package
abc <- abc %>%
  mutate(year = lubridate::year(publish_date),
         month = lubridate::month(publish_date),
         day = lubridate::day(publish_date))
# Take a look at the first rows in the dataset
head(abc)
## # A tibble: 6 x 5
## publish_date headline_text year month day
## <date> <chr> <dbl> <dbl> <int>
## 1 2003-02-19 aba decides against community broadcasti~ 2003 2 19
## 2 2003-02-19 act fire witnesses must be aware of defa~ 2003 2 19
## 3 2003-02-19 a g calls for infrastructure protection ~ 2003 2 19
## 4 2003-02-19 air nz staff in aust strike for pay rise 2003 2 19
## 5 2003-02-19 air nz strike to affect australian trave~ 2003 2 19
## 6 2003-02-19 ambitious olsson wins triple jump 2003 2 19
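The yearly headline counts shown below can be reproduced with a grouped count; a sketch using dplyr (assuming the abc data frame built above):
# Count the headlines published per year, most recent year first
abc %>%
  group_by(year) %>%
  count() %>%
  arrange(desc(year))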
## # A tibble: 15 x 2
## # Groups: year [15]
## year n
## <dbl> <int>
## 1 2017 44182
## 2 2016 54615
## 3 2015 77941
## 4 2014 82330
## 5 2013 92337
## 6 2012 89109
## 7 2011 77829
## 8 2010 74948
## 9 2009 76454
## 10 2008 80015
## 11 2007 77192
## 12 2006 66912
## 13 2005 73124
## 14 2004 72674
## 15 2003 64003
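Similarly, a sketch of the monthly counts within 2017:
# Count the headlines per month in 2017, most recent month first
abc %>%
  filter(year == 2017) %>%
  group_by(year, month) %>%
  count() %>%
  arrange(desc(month))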
## # A tibble: 12 x 3
## # Groups: year, month [12]
## year month n
## <dbl> <dbl> <int>
## 1 2017 12 3032
## 2 2017 11 3607
## 3 2017 10 3747
## 4 2017 9 3588
## 5 2017 8 3893
## 6 2017 7 3723
## 7 2017 6 3456
## 8 2017 5 3616
## 9 2017 4 3486
## 10 2017 3 4474
## 11 2017 2 3873
## 12 2017 1 3687
Since this is a large dataset, we will use a subset of articles from January 2017.
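One way to build that subset as the tf object annotated in the next step is sketched below (passing a plain character vector of headlines to the annotator is an assumption; cnlp_annotate accepts several input types):
# Subset the headlines to January 2017; tf holds the text to annotate (construction is an assumption)
tf <- abc %>%
  filter(year == 2017, month == 1) %>%
  pull(headline_text)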
Named Entity Recognition using cleanNLP and spaCy
Annotate the text using the cnlp_annotate function from cleanNLP. This function performs the word tokenisation and part-of-speech tagging steps.
# Annotate the string tf by running the annotation engine over the corpus of text
anno <- cnlp_annotate(tf)
# Summarise the tokens by parts of speech
cnlp_get_token(anno, include_root = FALSE) %>%
  group_by(upos) %>%
  summarize(posnum = n()) %>%
  arrange(desc(posnum))
## # A tibble: 15 x 2
## upos posnum
## <chr> <int>
## 1 NOUN 14026
## 2 VERB 5051
## 3 ADJ 3831
## 4 ADP 3491
## 5 PART 716
## 6 NUM 650
## 7 ADV 597
## 8 DET 509
## 9 CCONJ 231
## 10 PRON 223
## 11 PUNCT 146
## 12 PROPN 79
## 13 X 45
## 14 INTJ 17
## 15 SYM 13
# Summarise the count of entities
cnlp_get_entity(anno) %>%
  group_by(entity_type) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 13 x 2
## entity_type count
## <chr> <int>
## 1 DATE 340
## 2 CARDINAL 330
## 3 ORDINAL 133
## 4 NORP 40
## 5 GPE 38
## 6 TIME 28
## 7 MONEY 17
## 8 QUANTITY 11
## 9 PERSON 4
## 10 ORG 2
## 11 EVENT 1
## 12 LOC 1
## 13 PRODUCT 1
# Extract the entities of type GPE, which are geo-political entities such as cities, states/provinces, and countries
cnlp_get_entity(anno) %>%
  filter(entity_type == "GPE") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 6 x 2
## entity count
## <chr> <int>
## 1 india 15
## 2 china 11
## 3 japan 7
## 4 mexico 3
## 5 chicago 1
## 6 london 1
# Extract the entities of type NORP, which are nationalities or religious or political groups
cnlp_get_entity(anno) %>%
  filter(entity_type == "NORP") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 15 x 2
## entity count
## <chr> <int>
## 1 chinese 15
## 2 japanese 4
## 3 russian 4
## 4 american 3
## 5 british 2
## 6 european 2
## 7 mexico 2
## 8 173yo 1
## 9 190117 1
## 10 canadian 1
## 11 christian 1
## 12 israeli 1
## 13 korean 1
## 14 palestinian 1
## 15 republican 1
# Extract the entities of type PERSON, which are people, including fictional characters
cnlp_get_entity(anno) %>%
  filter(entity_type == "PERSON") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 2 x 2
## entity count
## <chr> <int>
## 1 elizabeth 2
## 2 jack 2
# Extract the entities of type ORG, which are companies, agencies, institutions, etc.
cnlp_get_entity(anno) %>%
  filter(entity_type == "ORG") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 2 x 2
## entity count
## <chr> <int>
## 1 7 134 1
## 2 mafia 1
# Extract the entities of type MONEY, which are monetary values, including their unit
cnlp_get_entity(anno) %>%
  filter(entity_type == "MONEY") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 17 x 2
## entity count
## <chr> <int>
## 1 $20m 1
## 2 $493 billion 1
## 3 $6 million 1
## 4 $60 million 1
## 5 $6m 1
## 6 15000 1
## 7 20 cents 1
## 8 3 million dollars 1
## 9 5.8 per cent 1
## 10 50 per cent 1
## 11 500000 1
## 12 70 1
## 13 76 1
## 14 8 mark 1
## 15 more than $860000 1
## 16 more than 4 billion 1
## 17 over $60m 1
The other entity types can be viewed in the spaCy documentation.
See the GitHub NLPexamples repo for my other NLP projects in R and Python.