Mark Needham

Thoughts on Software Development

Archive for the ‘NLP’ tag

scikit-learn: Creating a matrix of named entity counts

without comments

I’ve been trying to improve my score on Kaggle’s Spooky Author Identification competition, and my latest idea was building a model which used named entities extracted using the polyglot NLP library.

We’ll start by learning how to extract entities form a sentence using polyglot which isn’t too tricky:

>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]

This sentence contains three entities. We’d like each entity to be a string rather than an array of values so let’s refactor the code to do that:

>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']

That’s it for the polyglot part of the solution. Now let’s work out how to integrate that with scikit-learn.

I’ve been using scikit-learn’s Pipeline abstraction for the other models I’ve created so I’d like to take the same approach here. This is an example of a model that creates a matrix of unigram counts and creates a Naive Bayes model on top of that:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(),
    ('mnb', MultinomialNB())
# Train and Test the model

I was going to write a class similar to CountVectorizer but after reading its code for a couple of hours I realised that I could just pass in a custom analyzer instead. This is what I ended up with:

entities = {}
def analyze(doc):
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(analyzer=lambda doc: analyze(doc))),
    ('mnb', MultinomialNB())

I’m caching the results in a dictionary because the entity extraction is quite time consuming and there’s no point recalculating it each time the function is called.

Unfortunately this model didn’t help me improve my best score. It scores a log loss of around 0.5, a bit worse than the 0.45 I’ve achieved using the unigram model 🙁

Written by Mark Needham

November 29th, 2017 at 11:01 pm

Python: polyglot – ModuleNotFoundError: No module named ‘icu’

without comments

I wanted to use the polyglot NLP library that my colleague Will Lyon mentioned in his analysis of Russian Twitter Trolls but had installation problems which I thought I’d share in case anyone else experiences the same issues.

I started by trying to install polyglot:

$ pip install polyglot
ImportError: No module named 'icu'

Hmmm I’m not sure what icu is but luckily there’s a GitHub issue covering this problem. That led me to Toby Fleming’s blog post that suggests the following steps:

brew install icu4c
export ICU_VERSION=58
export PYICU_INCLUDES=/usr/local/Cellar/icu4c/58.2/include
export PYICU_LFLAGS=-L/usr/local/Cellar/icu4c/58.2/lib
pip install pyicu

I already had icu4c installed so I just had to make sure that I had the same version of that library as Toby did. I ran the following command to check that:

$ ls -lh /usr/local/Cellar/icu4c/
total 0
drwxr-xr-x  12 markneedham  admin   408B 28 Nov 06:12 58.2

That still wasn’t enough though! I had to install these two libraries as well:

pip install pycld2
pip install morfessor

I was then able to install polyglot, but had to then run the following commands to download the files needed for entity extraction:

polyglot download
polyglot download
polyglot download embeddings2.en
polyglot download ner2.en

Written by Mark Needham

November 28th, 2017 at 7:52 pm

Posted in Python

Tagged with , ,

Where are we now? Where do we want to be?

without comments

Listening to Dan North speaking last week I was reminded of one of my favourite NLP[*] techniques for making improvements on projects.

The technique is the TOTE (Test, Operate, Test, Exit) and it is a technique designed to help us get from where we are now to where we want to be via short feedback loops.

On my previous project we had a situation where we needed to build and deploy our application in order to show it to the client in a show case.

The first time we did this we did most of the process manually – it took three hours and even then still didn’t work properly. It was clear that we needed to do something about this. We needed the process to be automated.

Before we did this I mentally worked out what the difference was between where we were now and what the process would look like when we were at our desired state.

A full description of the technique can be found in the NLP Workbook but in summary, these are the steps for using the TOTE technique.

  1. What is our present state?
  2. What is our desired state?
  3. What specific steps and stages do we need to go through to get there?

We then execute an action and then re-compare the current state to the desired state until they are the same. Then we exit.

In our situation:

  1. To deploy our application we need to manually build it then copy the files to the show case machine.
  2. We want this process to happen automatically each time a change is made to the application

For step 3 to make sense more context is needed.

For our application to be ready to use we needed to build it and deploy it to the user’s desktop and build and deploy several other services to an application repository so that our application could stream them onto the desktop.

The small steps for achieving step 3 were:

  • Write a build for building the application and making the artifacts available
  • Write a script to deploy the application to the user’s desktop
  • Edit the build for building the services to make artifacts available
  • Write a script to deploy these services to the repository

This evolved slightly so that we could get Team City to simulate the process for our functional testing and then run a script to deploy it on the show case machine but the idea is the same.

There’s nothing mind blowing about this approach. It’s just a way of helping us to clarify what exactly it is we want to do and providing an easy way of getting there as quickly as possible.

* Sometimes when I mention NLP people get a bit defensive as it has been fed to them previously as a tool kit for solving all problems.

We need to remember that NLP is a set of communication techniques gathered from observing effective communicators and then recorded so that others could learn from this.

When used properly ideas from NLP can help us to clarify what we want to do and improve our ability to communicate with each other.

Written by Mark Needham

September 20th, 2008 at 5:32 pm

Feedback: Positive Reinforcement/Change yourself first

without comments

One of the more interesting concepts used on the NLP course that I did last year was the idea of only giving positive feedback to people.

The thinking behind the theory (which I think comes from Robert Dilts, one of the early thinkers behind NLP) is that people know what they are doing wrong and already beat themselves up about it; therefore there is no point you mentioning it as well.

I was initially sceptical about this approach as it seemed a bit too idealistic for my cynical mind. I found it extremely difficult to start with and didn’t give any feedback to anyone for quite a few sessions. Eventually though something clicked for me and by the end of the 18 day course I feel I did gain a greater respect for and recognition of the talents that other people on the course had. The need to focus only on the positive actually seems to drive the mind to see more in this area than it otherwise would.

Although noone likes it when they are criticised, I think there are some occasions when someone criticising you can prove to be extremely motivational. This basically involves them completely writing you off and you then being determined to prove them wrong. For example at school I was told that I would definitely fail the Pure Maths modules of my A Level Maths course. Completely unimpressed with this verdict I persevered with it for months eventually scoring 85%. Job done.

I think sometimes when giving critical feedback it can say more about you than it does about the person you are giving it to, and this is where it’s vital to step back and think why you are giving the feedback.

I find at least for myself the tendency is to want to point out things people do that annoy me, which in effect is me trying to make the person more like myself. Steve Pavlina suggests that the things we hate the most in other people are the things we actually hate in ourselves. Therefore his suggestion was if you find something someone else does annoying, first look at yourself and try and improve yourself in this area.

I’m not sure if I totally subscribe to why this approach would work but I definitely agree that it is way easier to change yourself than it is to change someone else.

Similar articles:

Written by Mark

February 12th, 2008 at 12:01 am

Posted in Feedback

Tagged with , , ,

Giving effective feedback

with 2 comments

One of the most interesting things I have discovered since starting at ThoughtWorks earlier this month is the emphasis that is placed on giving feedback.

The first lesson we were taught about giving feedback was that it could be one of two types. Either it should Strengthen Confidence or Increase Effectiveness.

In Layman’s term that means that if you want to make a positive comment about somebody’s contribution then you should make reference to something specific that you believe they have done well so that they can continue doing it. Equally if you believe there is an area that they could improve it, a specific example of this behaviour/fault should be noted along with a suggestion for how they can improve.

As a member of Toastmasters since January I was already used to this concept of feedback and there are certainly parallels in the feedback system encouraged at Toastmasters and that used at ThoughtWorks.

Although Toastmasters do not define types of feedback, there is an expectation that evaluators will apply themselves in a certain manner when carrying out their job.

One of the things which is frowned upon is known as ‘whitewashing’. This is where an evaluator would say that a speaker was ‘brilliant’ or give a summary just using complementary adjectives. Although the speaker may well be flattered, it does not really tell them anything or leave room for improvement. The use of the word ‘brilliant’ or ‘superb’ is only the perception of the person using it, and the failure to make use of the word with regards to a specific behaviour or action means that it is rendered meaningless.

Equally when the evaluator believes there is an area that the speaker can improve in they should make a reference to the specific negative behaviour or action so that the speaker can recall their mistake and go about making the improvement. When giving feedback it is very poor practice to attribute your own feelings to the speaker – you are giving them control over something which they do not have control over! For example, if an evaluator were to say: ‘I felt bored listening to your speech, you should make the next speech more interesting’. In this case the evaluator is giving the speaker the power to make them feel bored. It is ridiculous to let someone have that amount of control over you and if we consider that another person listening to the same speech may have felt really engaged, a property of the speech cannot be that it was ‘boring’.

This is very similar to the way that ThoughtWorkers are expected to give feedback, although it is also emphasised that when giving feedback one should speak only for themselves, and not try and speak for a group of people. Doing this would assume that mind reading is possible and as far as I’m aware this feat has yet to be achieved. An example of committing this mistake would be to say something along the lines of: ‘It would be better for us if you could do x’. In this case ‘us’ is not defined and it is unlikely that one person can speak precisely of the feelings of other people.

This concept is very similar to that of Generalisation in the NLP Meta Model, which states the following:

“Generalization is the process by which elements or pieces of a person’s model become detached from their original experience and come to represent the entire category of which the experience is an example.”

This is an area that I am actually working on myself, and I am finding it very difficult to speak only for myself because I’m so used to generalising! Of course there are still times when generalisation is vital, and we would find it very difficult to live our daily lives without generalising on some things. Giving feedback, however, is one area where this ‘technique’ is counter productive.

Written by Mark

September 2nd, 2006 at 3:07 am