Keyphrase Extraction from Documents - 2

Several months ago, I started writing on automatic keyphrase extraction, but couldn’t continue. I was building a keyphrase extractor for legal documents at that time. Today, I came across a ArXiv paper (soon to appear in NAACL 2019), which is making me post on the topic again.

What is keyphrase extraction?(KPE)- As an NLP problem, it is primarily about summarizing a given piece of text using a bunch of keywords/phrases. This has a wide range of applications in many real world scenarios involving information extraction/summarization. Standard approaches in NLP involve both supervised and unsupervised methods, and recent research used supervised deep learning methods Meng et.al., 2017, for example, which can do both extractive (get what is in the text) and abstractive (generate what is not present) key phrases. We also have some popular KPE algorithms implemented in Python libraries such as textacy.

With that introduction, let me come to today’s read.

Keyphrase Generation: A Text Summarization Struggle
Erion Çano, Ondřej Bojar
url
To appear in NAACL 2019

Paper Summary: The authors modeled KPE as a task of generating an abstractive summary from title and abstract of scientific articles, using some popular neural architectures for text summarization. They conclude that despite using advanced deep learning models, lot of data and stuff, they don’t end up in getting any better results than existing unsupervised or supervised approaches (Unsupervised approaches are all non-neural as far as I know).

What I like: The paper is generally clear, and I liked how they communicated their negative results successfully.

“This paper carries several contributions, de- spite the fact that no progressive result scores were reached. It is the first work that considers keyphrase generation as an abstractive text sum- marization task. We produced a large dataset of article titles, abstracts, and keywords that can be used for keyword generation, text summarization or similar purposes. Finally, we evaluated the per- formance of different neural network architectures on summarization of article keyword strings, com- paring them with popular unsupervised methods.”

What I did not like:
(This is more generally valid for all KPE papers I read, not a criticism against this one)

Inconsistent measures of evaluation used across different papers on this topic. Some report F, some report F@top-5,10,15. Oddly enough, this paper reported F@5 and 7. They had a justification for that, but it makes comparisons difficult.
What papers on this topic report as best scores always seem like a matter of convenience for me. In the survey paper from 2014, they list a certain best score on this Huluth (2003) dataset that everyone uses. However, most papers don’t use that for comparison, and refer to a much lower score as their benchmark.
This paper in particular, shows Meng et.al. (2017)’s paper as the best performing one. However, if we look at SGRank paper from 2015, they report less than 1% difference on F@5 compared to Meng et.al., and a higher F@10 on the dataset that is common across the board in all these papers. SGRank is unsupervised, relatively simpler to understand, and has a implemented version in textacy - so, it is easier to both use it as well as test it on some new dataset. However, strangely, many papers, including this one, never cite that (just to clarify - I don’t know anyone associated with SGRank or textacy!).

Personally, I think is the best solution for KPE is a combination of:

an unsupervised phrase extraction method. I like the ones implemented in textacy. Tweaking the implementations a little bit gave me relatively consistent results for incorporating into an existing product, in the project I was working on at that time.
some kind of supervised semantic tagging method (to catch overall themes)
getting topical words/phrases from a large topic model (unsupervised).
Clearly, this ad-hoc approach is not too appealing to publish a research paper, but it may possibly be what one needs, if you need it in a software product.

Evaluation is again an issue. We can always opt for Precision/F-score@top-N or some such measure, like in existing research. But, on what data? Comparisons between different approaches can be done using a range of available KPE datasets (some of which were listed in my previous post). However, once an approach is finalized, I think an ideal evaluation would be to just ask human evaluators to rate the output of this approach, sampling a few instances. It may be time consuming, but it isn’t worse than just deploying it in some application and realizing it is doing poorly on real world data!

My wishlist:

A good survey article on state of the art in KPE (approaches, datasets used, evaluation measures used, KPE in industry case studies where possible)
Some concise summary of list of resources for KPE at NLP Progress

Written on April 2, 2019