The problem with off-the-shelf Named Entity Recognizers

A few months back, I wrote two posts about Named Entity Recognition (Part 1, Part 2), the language processing task concerned with identifying “entities” in a given piece of text: person/location/organization names, specific expressions such as time and money, and other pre-defined miscellaneous categories. Months later, I came across two talks by Joel Grus, a research engineer with AllenNLP: one called “Why I don’t like Notebooks”, and a more recent one, given with others from AllenNLP, called “Writing code for NLP Research”. Both are on my favorite recent rant topics, and as a consequence I started taking a look at AllenNLP.

While playing around with the demos on the website, I remembered NER again. In the five months since writing my posts, I have skimmed through perhaps 10-15 NER papers, each with a more complex neural network than the last, each showing incremental improvement. In that time, I also spent some time looking at the licenses of some of the open-source tools and what they mean for use in commercial, proprietary software. This post is a combined reflection on these two threads.

Here are the open-source options I explored which had NER functionality:

  • Stanford CoreNLP
  • Spacy
  • AllenNLP

I also looked at 3 big service providers (their demo versions):

  • Google’s Cloud Natural Language API
  • Microsoft Azure Text Analytics
  • IBM Watson Natural Language Understanding

Since my last excursion with NER, training and testing on standard datasets, ended in both surprise and disillusionment, I decided this time to take two very simple text snippets and see how these tools handle them.

  • “LinkedIn was acquired by Microsoft in the past. Now, Microsoft acquired Github.”
  • “The Globe and Mail is Canada’s newspaper of the record”

In the first sentence, I expected LinkedIn, Microsoft, and Github to be identified as Organizations/Companies. In the second, I expected “The Globe and Mail” to be identified as an Organization and Canada as a Location/Country. With this background, let us see what I found:

For sentence 1:

  • Stanford identified: Microsoft as ORG, Github as PER (Person), and missed LinkedIn.
  • Spacy identified: LinkedIn and Microsoft as ORG; Github was not identified.
  • AllenNLP identified: LinkedIn, Microsoft, and Github as ORG.
  • Google’s NLP identified: Microsoft as ORG, and LinkedIn and Github as “Other” (I am not sure what that means).
  • Microsoft’s Azure identified: LinkedIn and Microsoft, linked them to their Wikipedia pages, and missed Github.
  • Watson identified: Microsoft and LinkedIn as companies.

For sentence 2:

  • Stanford identified: “Globe” as ORG and Canada as country.
  • Spacy identified: “The Globe and Mail” as an ORG and Canada as a Geo-political entity (similar to Location/Country in other tools).
  • AllenNLP identified: “Globe and Mail” (without the “The”) as an ORG and Canada as a Location.
  • Google’s NLP identified: Globe and Mail, Canada, and, strangely enough, record as entities.
  • Azure identified: The Globe and Mail and Canada, and linked them to Wiki pages.
  • IBM Watson identified only “Canada”.
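
For anyone who wants to poke at these examples themselves, the Spacy results above are easy to reproduce locally. Here is a minimal sketch; it assumes Spacy and one of its English models (en_core_web_sm here) are installed, and a different model or version may well produce different entities:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sentences = [
    "LinkedIn was acquired by Microsoft in the past. Now, Microsoft acquired Github.",
    "The Globe and Mail is Canada's newspaper of the record",
]

for text in sentences:
    doc = nlp(text)
    for ent in doc.ents:
        # spacy.explain() expands a label code, e.g. "GPE" -> "Countries, cities, states"
        print(ent.text, ent.label_, spacy.explain(ent.label_))
    print("---")
```

Swapping in a larger model such as en_core_web_lg is a quick way to see how much the output depends on the model rather than the library.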

These are what I considered very simple sentences for NLP tools to process, and even here I got mixed results. That in itself is not a bad thing; we can hope the tools will improve over time as we get better at learning NER systems. The part that irritated me was the amount of money involved in using some of these tools in an application. The last three (Google, Microsoft, IBM) offer NER as a service: you pay per query, by the length of the text, or by some such measure.

Among the first three, Spacy and AllenNLP are under permissive licenses (MIT and Apache 2.0 respectively), which allow their use in proprietary software. Stanford CoreNLP is GPL, which imposes some restrictions on proprietary use. I read about Stanford’s commercial licensing option and was shocked to learn that it costs multiples of $10,000 per year (depending on how many of the CoreNLP components you use). That just made no sense to me. Who can afford paying so much every year except the super rich, super large companies? That is a question I want the answer to. Why should I be asked to pay so much for something that is obviously not doing well even on simple examples? (It does well in general, I don’t deny that. But so do all the other options!) That is a question I wish I knew the answer to.

With the increasing adoption of AI-based tools in the technology industry, NER becomes a necessary step in almost any project that uses NLP. Existing tools not getting such simple sentences completely correct does not sound good to me. I fully understand what it takes to develop and deploy NER systems, I am not demeaning anyone’s effort, and the people who built these tools are far more experienced and knowledgeable than I am. Despite all that, I think research in NER:

  • should focus on evaluating on multiple test sets, ensuring that models are not just specific to CoNLL or OntoNotes and work on sentences beyond those datasets (no, I am not even talking about non-canonical or out-of-domain text; see the sketch after this list for what such an evaluation loop could look like)
  • should be accessible to smaller organizations (for this, Spacy and AllenNLP seem to be pretty decent options)
  • should not discourage proprietary use by releasing under restrictive licenses (of course, that is up to the creators)
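
To make the first point concrete, here is a minimal sketch of a cross-dataset evaluation loop. Everything in it is hypothetical scaffolding: `predict` stands in for whichever tagger you are testing, the datasets are assumed to be already loaded as (tokens, gold_labels) pairs, and it computes simple token-level precision/recall/F1 rather than the stricter span-level CoNLL metric:

```python
from typing import Callable, Dict, List, Tuple

# A "dataset" here is a list of sentences; each sentence is a pair
# (tokens, gold_labels) with one label per token, e.g. "ORG" or "O".
Sentence = Tuple[List[str], List[str]]

def evaluate(predict: Callable[[List[str]], List[str]],
             datasets: Dict[str, List[Sentence]]) -> None:
    """Run one tagger over several test sets and print token-level P/R/F1."""
    for name, sentences in datasets.items():
        tp = fp = fn = 0
        for tokens, gold in sentences:
            pred = predict(tokens)
            for g, p in zip(gold, pred):
                if p != "O" and p == g:
                    tp += 1          # correctly labeled entity token
                else:
                    if p != "O":
                        fp += 1      # predicted an entity token wrongly
                    if g != "O":
                        fn += 1      # missed or mislabeled a gold entity token
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        print(f"{name}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

# Hypothetical usage, with the datasets loaded elsewhere:
# evaluate(my_tagger.predict,
#          {"conll_test": conll, "ontonotes_test": onto,
#           "simple_sentences": simple})
```

An entity-level (exact span) scorer such as the official conlleval script would be stricter, but even a crude loop like this makes any cross-dataset gap visible.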

(Note: This is more of a rant after a day at work with some of these tools than a thorough evaluation.)

Written on November 13, 2018