Recently I have been working on a Natural Language Processing (NLP) client project. The field appears to make extensive use of Python packages, so I took the opportunity to go on an NLP journey in Python, starting with a Jupyter notebook. The Python packages used there included the research tool NLTK, gensim, and the more recent spaCy.
This post is the next step in that journey: producing a pipeline for the NLP tasks of text mining and Named Entity Recognition (NER) using the Python spaCy NLP toolkit from within R. This is made possible by reticulate, the R interface to Python.
This post will cover loading the required packages, initialising the spaCy Python back end, loading and summarising the headline data, and performing Named Entity Recognition with cleanNLP and spaCy. It will not, however, cover advanced topic modelling or training annotation models in spaCy.
Load Packages
library(tidyverse)
# Packages for manipulating data
library(stringr)
library(lubridate)
# Packages for NLP
library(NLP)
# install.packages("openNLPmodels.en",repos = "http://datacube.wu.ac.at/", type = "source")
library(openNLP)
library(cleanNLP)
# cnlp_download_corenlp() # install the coreNLP Java back end 'CoreNLP' <http://stanfordnlp.github.io/CoreNLP/>
# Packages for the Python interface
library(reticulate)
use_virtualenv("r-reticulate")
Python Initialisation
We have loaded the required R packages, including the reticulate package, which provides an R interface to Python modules, classes, and functions.
As per the cleanNLP R package documentation, we will load the spaCy Python NLP backend.
# First set the executable. Note this needs to be set before any initialising
use_python("C:/Users/HOME/Anaconda3/python.exe")
# py_available(initialize = TRUE) # should give TRUE
# Check Python configuration
py_config()
## python: C:/Users/HOME/Anaconda3/python.exe
## libpython: C:/Users/HOME/Anaconda3/python36.dll
## pythonhome: C:\Users\HOME\ANACON~1
## version: 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:\Users\HOME\ANACON~1\lib\site-packages\numpy
## numpy_version: 1.13.3
##
## python versions found:
## C:/Users/HOME/Anaconda3/python.exe
## C:\Users\Home\AppData\Local\Programs\Python\Python36\python.exe
## C:\Users\Home\ANACON~1\python.exe
# Initialise the spaCy backend
cnlp_init_spacy()
Load Data
The "A Million News Headlines" dataset is easy to load with the read_csv function from the readr R package.
It contains news headlines published over a period of 15 years by the reputable Australian news source ABC (Australian Broadcasting Corporation).
The files were first downloaded into a local directory in order to agree to the terms of use.
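A minimal sketch of the load step (the local file path is an assumption; the Kaggle download is typically named abcnews-date-text.csv):
# Read the locally downloaded headlines CSV (path is an assumption)
abc <- read_csv("data/abcnews-date-text.csv")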
Data Summary
Let’s take a look at the ABC headline data.
# Change the date to a date format
abc$publish_date <- as.Date(as.character(abc$publish_date), format = '%Y%m%d')
# Add new columns for the year, month and day using the lubridate R package
abc <- abc %>%
  mutate(year = lubridate::year(publish_date),
         month = lubridate::month(publish_date),
         day = lubridate::day(publish_date))
# Take a look at the first rows in the dataset
head(abc)
## # A tibble: 6 x 5
## publish_date headline_text year month day
## <date> <chr> <dbl> <dbl> <int>
## 1 2003-02-19 aba decides against community broadcasti~ 2003 2 19
## 2 2003-02-19 act fire witnesses must be aware of defa~ 2003 2 19
## 3 2003-02-19 a g calls for infrastructure protection ~ 2003 2 19
## 4 2003-02-19 air nz staff in aust strike for pay rise 2003 2 19
## 5 2003-02-19 air nz strike to affect australian trave~ 2003 2 19
## 6 2003-02-19 ambitious olsson wins triple jump 2003 2 19
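The yearly headline counts shown below can be reproduced with a grouped count; a sketch using dplyr (assuming the abc data frame built above):
# Count the headlines published per year, most recent year first
abc %>%
  group_by(year) %>%
  count() %>%
  arrange(desc(year))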
## # A tibble: 15 x 2
## # Groups: year [15]
## year n
## <dbl> <int>
## 1 2017 44182
## 2 2016 54615
## 3 2015 77941
## 4 2014 82330
## 5 2013 92337
## 6 2012 89109
## 7 2011 77829
## 8 2010 74948
## 9 2009 76454
## 10 2008 80015
## 11 2007 77192
## 12 2006 66912
## 13 2005 73124
## 14 2004 72674
## 15 2003 64003
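Similarly, a sketch of the monthly counts within 2017:
# Count the headlines per month in 2017, most recent month first
abc %>%
  filter(year == 2017) %>%
  group_by(year, month) %>%
  count() %>%
  arrange(desc(month))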
## # A tibble: 12 x 3
## # Groups: year, month [12]
## year month n
## <dbl> <dbl> <int>
## 1 2017 12 3032
## 2 2017 11 3607
## 3 2017 10 3747
## 4 2017 9 3588
## 5 2017 8 3893
## 6 2017 7 3723
## 7 2017 6 3456
## 8 2017 5 3616
## 9 2017 4 3486
## 10 2017 3 4474
## 11 2017 2 3873
## 12 2017 1 3687
Since this is a large dataset, we will use a subset of articles from January 2017.
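One way to build that subset as the tf object annotated in the next step is sketched below (passing a plain character vector of headlines to the annotator is an assumption; cnlp_annotate accepts several input types):
# Subset the headlines to January 2017; tf holds the text to annotate (construction is an assumption)
tf <- abc %>%
  filter(year == 2017, month == 1) %>%
  pull(headline_text)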
Named Entity Recognition using cleanNLP and spaCy
Annotate the text using the cnlp_annotate function from cleanNLP. This function performs the word tokenisation and part-of-speech tagging steps.
# Annotate the string tf by running the annotation engine over the corpus of text
anno <- cnlp_annotate(tf)
# Summarise the tokens by parts of speech
cnlp_get_token(anno, include_root = FALSE) %>%
  group_by(upos) %>%
  summarize(posnum = n()) %>%
  arrange(desc(posnum))
## # A tibble: 15 x 2
## upos posnum
## <chr> <int>
## 1 NOUN 14026
## 2 VERB 5051
## 3 ADJ 3831
## 4 ADP 3491
## 5 PART 716
## 6 NUM 650
## 7 ADV 597
## 8 DET 509
## 9 CCONJ 231
## 10 PRON 223
## 11 PUNCT 146
## 12 PROPN 79
## 13 X 45
## 14 INTJ 17
## 15 SYM 13
# Summarise the count of entities
cnlp_get_entity(anno) %>%
  group_by(entity_type) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 13 x 2
## entity_type count
## <chr> <int>
## 1 DATE 340
## 2 CARDINAL 330
## 3 ORDINAL 133
## 4 NORP 40
## 5 GPE 38
## 6 TIME 28
## 7 MONEY 17
## 8 QUANTITY 11
## 9 PERSON 4
## 10 ORG 2
## 11 EVENT 1
## 12 LOC 1
## 13 PRODUCT 1
# Extract the entities of type GPE, which are geo-political entities such as cities, states/provinces, and countries
cnlp_get_entity(anno) %>%
  filter(entity_type == "GPE") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 6 x 2
## entity count
## <chr> <int>
## 1 india 15
## 2 china 11
## 3 japan 7
## 4 mexico 3
## 5 chicago 1
## 6 london 1
# Extract the entities of type NORP, which are nationalities or religious or political groups
cnlp_get_entity(anno) %>%
  filter(entity_type == "NORP") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 15 x 2
## entity count
## <chr> <int>
## 1 chinese 15
## 2 japanese 4
## 3 russian 4
## 4 american 3
## 5 british 2
## 6 european 2
## 7 mexico 2
## 8 173yo 1
## 9 190117 1
## 10 canadian 1
## 11 christian 1
## 12 israeli 1
## 13 korean 1
## 14 palestinian 1
## 15 republican 1
# Extract the entities of type PERSON, which are people, including fictional characters
cnlp_get_entity(anno) %>%
  filter(entity_type == "PERSON") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 2 x 2
## entity count
## <chr> <int>
## 1 elizabeth 2
## 2 jack 2
# Extract the entities of type ORG, which are companies, agencies, institutions, etc.
cnlp_get_entity(anno) %>%
  filter(entity_type == "ORG") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 2 x 2
## entity count
## <chr> <int>
## 1 7 134 1
## 2 mafia 1
# Extract the entities of type MONEY, which are monetary values, including their unit
cnlp_get_entity(anno) %>%
  filter(entity_type == "MONEY") %>%
  group_by(entity) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## # A tibble: 17 x 2
## entity count
## <chr> <int>
## 1 $20m 1
## 2 $493 billion 1
## 3 $6 million 1
## 4 $60 million 1
## 5 $6m 1
## 6 15000 1
## 7 20 cents 1
## 8 3 million dollars 1
## 9 5.8 per cent 1
## 10 50 per cent 1
## 11 500000 1
## 12 70 1
## 13 76 1
## 14 8 mark 1
## 15 more than $860000 1
## 16 more than 4 billion 1
## 17 over $60m 1
The other entity types can be viewed in the spaCy documentation.
See the GitHub NLPexamples repo for my other NLP projects in R and Python.