Thoughts and Experiments with NER-1

I have been spending quite some time over the past 2-3 weeks reading about and working on Named Entity Recognition, aka NER, which is a sub-problem of information extraction from textual documents and deals with automatically identifying names of people, organizations, locations, mentions of events, dates, times, etc. Comparing early and recent work, developing a few NER models, and evaluating them, I noticed a few issues with the way this is done, and with how that affects general understanding of the problem and its application in the real world. This series of posts holds my reflections. I want this first post to serve as background; in the next post, I will talk about evaluating existing NER models.

First off, here is the general approach:

  • You need access to some annotated training data, where each word in a sentence is tagged with its named-entity category (a minimal reader for such data is sketched below).
  • You take it from there and build predictive models to be used on new data.

For the first part, everyone seems to use standard datasets in research; the second part is where people innovate - new learning methods, new ways to extract patterns from data, new ways to make the approach domain-agnostic, and so on.
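To give a concrete idea of what such annotated data looks like on disk in CONLL-style corpora (one token and its tag per line, blank lines between sentences), here is a minimal, illustrative reader. The exact column layout is an assumption about the particular file, and the function name is mine:

# A minimal sketch of reading CONLL-style annotated data: one token per line
# with its named-entity tag in the last column, sentences separated by blank
# lines. The column layout is an assumption about the file at hand.
def read_conll(path):
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line.startswith('-DOCSTART-'):    # document marker in CONLL-03
                continue
            if not line:                         # blank line = sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split()
            current.append((cols[0], cols[-1]))  # (token, NE tag)
    if current:
        sentences.append(current)
    return sentences

# Each sentence becomes a list of (word, tag) pairs, e.g.
# [('SOCCER', 'O'), ('-', 'O'), ('JAPAN', 'B-LOC'), ...]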

There is a lot of research, but there are also some accessible software packages you can use if you want NER in your own application. Some that I know of:

  • Stanford CoreNLP - research software, but also available under a commercial license
  • spaCy - "industrial-strength NLP", as they say (a minimal usage sketch follows this list)
  • Lots of research implementations shared publicly (and so on)
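To show how little code it takes to call one of these, here is a minimal sketch using spaCy, assuming one of its pretrained English models (e.g., en_core_web_sm) has been downloaded; the example sentence is just made up from words in this post.

# A minimal sketch of off-the-shelf entity tagging with spaCy, assuming a
# pretrained English model such as en_core_web_sm has been installed.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Nadim Ladki reported from Al-Ain, United Arab Emirates.')

for ent in doc.ents:
    # ent.label_ is the predicted entity type (e.g., PERSON, GPE, ORG)
    print(ent.text, ent.label_)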

Apart from these, which let you train your own models, there are also off-the-shelf solutions such as Google Cloud's NLP API and Microsoft's Azure Cognitive Services API, both of which support entity recognition/extraction. Now, going back to the two main parts of NER:

Training Data:
Common sources of training data, based on what is reported in research and what I could see in the tool documentation, are:

  • CONLL-03 dataset (freely available online, used in Stanford NER, and in several research articles)
  • MUC6 and MUC7 (used in Stanford NER, but do not seem to be free)
  • ACE 2002 (used in Stanford NER, but does not seem to be free)
  • CONLL-12 dataset (the full data, with text content, is not free)
  • ONTONOTES dataset (used in spacy.io; free for research use if your institution is a member of the Linguistic Data Consortium, and close to 25K USD per year for commercial use - independent researchers may find it difficult to acquire this)
  • Enron email dataset, tagged with named entities (freely available)
  • WikiNER (free)
Perhaps there are a few I missed; I have listed what I came across while reading. From among these, I started playing with building some NER models, taking the freely accessible datasets and starting from publicly available code implementations. In particular, I used:

  • Anago, which uses bi-directional LSTMs for NER and does not rely on any external resource beyond the training data, and
  • the Python implementation of CRF-Suite to build a feature-based NER (with the CONLL and Enron email datasets); a minimal sketch of this setup follows below.

I will talk more about what I learnt, and how these models worked, in a later post.
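To make the second of these concrete, here is a minimal sketch of the feature-based setup, assuming the sklearn-crfsuite package and the read_conll() helper sketched earlier; the features shown are illustrative word-context features, not necessarily the exact ones I used.

# A minimal sketch of a feature-based NER tagger with sklearn-crfsuite.
# Feature choices here are illustrative, not a definitive recipe.
import sklearn_crfsuite

def word2features(sent, i):
    """Simple word-context features for token i of a sentence."""
    word = sent[i][0]
    feats = {
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'word.isupper': word.isupper(),
        'word.isdigit': word.isdigit(),
    }
    # Neighbouring words as context
    if i > 0:
        feats['-1:word.lower'] = sent[i - 1][0].lower()
    else:
        feats['BOS'] = True
    if i < len(sent) - 1:
        feats['+1:word.lower'] = sent[i + 1][0].lower()
    else:
        feats['EOS'] = True
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [tag for _, tag in sent]

# train_sents: list of sentences, each a list of (word, tag) pairs,
# e.g. as loaded by the read_conll() sketch above.
def train_crf(train_sents):
    X = [sent2features(s) for s in train_sents]
    y = [sent2labels(s) for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
                               max_iterations=100)
    crf.fit(X, y)
    return crf

sklearn-crfsuite expects a list of feature-dict sequences and a parallel list of tag sequences, which is why the two sent2* helpers exist.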

But a few things concerned me along the way. Here is an NER-tagged sentence from the CONLL-03 test set that everyone uses (the very first sentence in the test set):

SOCCER O
- O
JAPAN B-LOC
GET O
LUCKY O
WIN O
, O
CHINA B-PER
IN O
SURPRISE O
DEFEAT O
. O

In this, JAPAN is tagged as a LOC (location), and yet CHINA is tagged as a PER (person).

This is the second sentence (yes, that is all of it):

Nadim B-PER
Ladki I-PER

This is the third sentence:
AL-AIN B-LOC
, O
United B-LOC
Arab I-LOC
Emirates I-LOC
1996-12-06 O

I did not go beyond these. While I can justify the 2nd and 3rd sentences by saying they could be news headlines or datelines and need not be grammatical, I don't understand how China is a person in that context (it is not as if they are talking about a person called China there!). I checked two other publicly shared versions of this dataset, and they are the same - it is not as if some group made an error in extracting it. So, when people write about some 91.08% (or whatever) on the CONLL-03 test set, what does that mean?

Number Game: If you are one of those NLP paper readers, you will be aware of the amount of numeric data you see in results sections. In NER papers, I see a 91.08% on a single test set presented as better than a 91.06% on the same test set, someone else reporting a 91.99% and claiming that is even better, and so on. What does it mean to have an overall F1 of 90%? To understand this, I trained a simple NER model with Python-CRF and a neural model with Anago, both mentioned above. I will talk more in the next post, but the main observations are:

  • When I got 85% overall F1 (which was easy, by the way - just purely using word contexts as features), it just meant that PER, ORG, and MISC were all bad, but OTH (basically, correctly identifying something as not a named entity) was over 99%, and that was the most common category in the dataset.
  • I am surprised by the absolute lack of any such analysis in the NER papers I have read in the recent past. You suggest a super-complex architecture that gets you a 0.2% overall improvement - how is it better or worse? (See the sketch after this list.)
  • Finally, how much of these improvements transfers to real-world data?
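Here is a sketch of the kind of per-class breakdown I mean, assuming flat gold and predicted tag sequences from any of the models above; scikit-learn's classification_report is just one convenient way to print it.

# A minimal sketch of a per-class breakdown: token-level precision, recall
# and F1 for each tag, instead of a single overall number.
from itertools import chain
from sklearn.metrics import classification_report

def report_per_class(gold_sents, pred_sents):
    """gold_sents / pred_sents: lists of per-sentence tag sequences."""
    y_true = list(chain.from_iterable(gold_sents))
    y_pred = list(chain.from_iterable(pred_sents))
    # The 'O' row will usually dwarf every other class in support, which is
    # why a single overall token-level number can look deceptively high.
    # (The official CONLL evaluation scores at the entity level and ignores
    # correct O tokens, but per-class numbers are still worth reporting.)
    print(classification_report(y_true, y_pred, digits=3))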

[To be continued]

Written on June 2, 2018