Mark Needham

Thoughts on Software Development

Archive for the ‘polyglot’ tag

scikit-learn: Creating a matrix of named entity counts

I’ve been trying to improve my score on Kaggle’s Spooky Author Identification competition, and my latest idea was building a model which used named entities extracted using the polyglot NLP library.

We’ll start by learning how to extract entities from a sentence using polyglot, which isn’t too tricky:

>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]

This sentence contains three entities. We’d like each entity to be a single string rather than a list of tokens, so let’s join each entity’s tokens with an underscore:

>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']

That’s it for the polyglot part of the solution. Now let’s work out how to integrate that with scikit-learn.

I’ve been using scikit-learn’s Pipeline abstraction for the other models I’ve created, so I’d like to take the same approach here. This is an example of a pipeline that builds a matrix of unigram counts and trains a Naive Bayes model on top of it:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])
 
...
# Train and Test the model
...
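To make "a matrix of unigram counts" concrete, here’s a rough plain-Python sketch of the idea: one row per document, one column per term, and each cell holding how often that term appears. The count_matrix and unigrams helpers are hypothetical names for illustration only, and the tokenizer is a simplified stand-in, so this is not how CountVectorizer is actually implemented.

```python
import re
from collections import Counter


def count_matrix(docs, analyzer):
    """Return (vocabulary, rows) where rows[i][j] counts term j in docs[i]."""
    # Run the analyzer once per document and count the terms it returns
    counts = [Counter(analyzer(doc)) for doc in docs]
    # A sorted vocabulary gives every term a stable column index
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


def unigrams(doc):
    # Simplified stand-in for word tokenization: lowercase, words of 2+ characters
    return re.findall(r"\b\w\w+\b", doc.lower())


docs = ["the raven and the bust", "the bust of Pallas"]
vocabulary, rows = count_matrix(docs, unigrams)
print(vocabulary)
for row in rows:
    print(row)
```

Because the analyzer is just a function from a document to a list of terms, swapping in a different function changes what the columns mean without touching the rest of the code.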

I was going to write a class similar to CountVectorizer but after reading its code for a couple of hours I realised that I could just pass in a custom analyzer instead. This is what I ended up with:

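from polyglot.text import Text

# Cache of document -> extracted entities, so each document is only processed once
entities = {}


def analyze(doc):
    # Entity extraction is slow, so only run polyglot on documents we haven't seen
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]


nlp_pipeline = Pipeline([
    # analyze is already a callable, so it can be passed directly as the analyzer
    ('cv', CountVectorizer(analyzer=analyze)),
    ('mnb', MultinomialNB())
])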

I’m caching the results in a dictionary because the entity extraction is quite time-consuming and there’s no point recalculating it every time the function is called with a document it has already seen.
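The same caching idea can also be expressed with functools.lru_cache from the standard library instead of a hand-rolled dictionary. The sketch below is only an illustration: extract_entities is a hypothetical stand-in that treats capitalised words as entities so the example runs without polyglot, and analyze_cached is likewise a made-up name.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive step actually runs


def extract_entities(doc):
    # Hypothetical stand-in for the slow polyglot call:
    # treats every capitalised word as an "entity"
    global calls
    calls += 1
    return [word.strip(".,") for word in doc.split() if word[:1].isupper()]


@lru_cache(maxsize=None)
def analyze_cached(doc):
    # Return a tuple so callers can't accidentally mutate the cached value
    return tuple(extract_entities(doc))


doc = "My name is David Beckham. Hello from London, England"
first = analyze_cached(doc)
second = analyze_cached(doc)  # served from the cache, no second extraction
print(first, calls)
print(analyze_cached.cache_info())
```

One advantage of this approach is that cache_info() reports hits and misses, and maxsize can be set to a number if memory becomes a concern.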

Unfortunately this model didn’t help me improve my best score. It scores a log loss of around 0.5, a bit worse than the 0.45 I’ve achieved using the unigram model 🙁
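For anyone unfamiliar with the metric, here’s a small sketch of how log loss can be computed by hand, assuming the usual multi-class definition: the average negative log of the probability the model gave to the correct label. Lower is better. The labels and probabilities below are made up purely for illustration and are not real predictions from either model.

```python
import math


def log_loss(true_labels, predicted_probs):
    """Mean negative log of the probability assigned to each true label."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0) when a prediction is confidently wrong
        p = min(max(probs[label], 1e-15), 1 - 1e-15)
        total -= math.log(p)
    return total / len(true_labels)


# Made-up predictions for three sentences and three hypothetical authors
true_authors = ["author_a", "author_b", "author_c"]
predictions = [
    {"author_a": 0.7, "author_b": 0.2, "author_c": 0.1},
    {"author_a": 0.3, "author_b": 0.6, "author_c": 0.1},
    {"author_a": 0.2, "author_b": 0.2, "author_c": 0.6},
]
loss = log_loss(true_authors, predictions)
print(round(loss, 3))
```

As a reference point, a model that always predicts one third for each of three classes would score ln(3), roughly 1.1, under this definition.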

Written by Mark Needham

November 29th, 2017 at 11:01 pm

Python: polyglot – ModuleNotFoundError: No module named ‘icu’

I wanted to use the polyglot NLP library that my colleague Will Lyon mentioned in his analysis of Russian Twitter Trolls but had installation problems which I thought I’d share in case anyone else experiences the same issues.

I started by trying to install polyglot:

$ pip install polyglot
 
ImportError: No module named 'icu'

Hmmm, I’m not sure what icu is, but luckily there’s a GitHub issue covering this problem. That led me to Toby Fleming’s blog post, which suggests the following steps:

brew install icu4c
export ICU_VERSION=58
export PYICU_INCLUDES=/usr/local/Cellar/icu4c/58.2/include
export PYICU_LFLAGS=-L/usr/local/Cellar/icu4c/58.2/lib
pip install pyicu

I already had icu4c installed so I just had to make sure that I had the same version of that library as Toby did. I ran the following command to check that:

$ ls -lh /usr/local/Cellar/icu4c/
total 0
drwxr-xr-x  12 markneedham  admin   408B 28 Nov 06:12 58.2

That still wasn’t enough though! I had to install these two libraries as well:

pip install pycld2
pip install morfessor

I was then able to install polyglot, but then had to run the following commands to download the files needed for entity extraction:

polyglot download embeddings2.de
polyglot download ner2.de
polyglot download embeddings2.en
polyglot download ner2.en
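A quick way to check whether the Python side of the installation worked is to ask the interpreter which modules it can find. This is a minimal sketch using only the standard library. The missing_modules helper is a hypothetical name, and the list assumes each package is imported under the name shown (pyicu is imported as icu, as the error message above suggests).

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level import names assumed for the packages installed above
required = ["icu", "pycld2", "morfessor", "polyglot"]
print(missing_modules(required))
```

An empty list means all four modules were found. It doesn’t check that the downloaded embeddings and NER models are present.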

Written by Mark Needham

November 28th, 2017 at 7:52 pm

Posted in Python
