Keyphrase Extraction from Documents - 1

For the past few weeks, I have been working on automatic keyphrase extraction from documents. This post is the first in (hopefully) a series of posts to note down my observations on the topic.

  • What is keyphrase extraction (KPE)? Primarily, it is about extracting keywords/phrases that convey the main points in a piece of text. A text can mean anything from a full-blown document to an extract of an email or a complaint from a customer. In general, KPE is useful for several purposes, such as making search better, getting quick summaries of short texts (e.g., highlighting parts of reviews on Amazon), and automatically identifying actionable items in emails, customer requests, etc. KPE is also closely related to another term you may hear: document tagging. From what I understood, this refers to automatically tagging documents such that the tags are more topical/thematic in nature and need not exactly appear in the document.

Consider an example:

“Estimating populations for collective dose calculations The collective dose provides an estimate of the effects of facility operations on the public based on an estimate of the population in the area. Geographic information system software, electronic population data resources, and a personal computer were used to develop estimates of population within 80 km radii of two sites”

  • Human-assigned keyphrases for this passage are: [‘collective dose calculations’, ‘facility operations’, ‘public’, ‘geographic information system software’, ‘electronic population data resources’, ‘personal computer’]. The task of KPE is to produce such a list automatically.
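To make this concrete, here is a minimal sketch of the common two-step recipe many KPE systems follow: extract candidate phrases, then score them. The tiny stopword list and plain frequency scoring below are toy assumptions for illustration only; real systems use POS-pattern candidates (e.g., via spaCy) and TF-IDF or graph-based scores.

```python
import re
from collections import Counter

# Toy stopword list for illustration; real systems use a full list (e.g., NLTK's).
STOPWORDS = {"the", "of", "on", "in", "an", "a", "and", "for", "to",
             "were", "used", "within", "two", "provides", "based"}

def candidate_phrases(text, max_len=3):
    """Extract contiguous runs of non-stopword tokens as candidate phrases.

    Punctuation ends a run, so phrases never cross sentence/clause boundaries.
    """
    candidates = []
    for chunk in re.split(r"[.,;:()!?]", text.lower()):
        run = []
        for tok in re.findall(r"[a-z]+", chunk):
            if tok in STOPWORDS:
                if run:
                    candidates.append(" ".join(run[:max_len]))
                    run = []
            else:
                run.append(tok)
        if run:
            candidates.append(" ".join(run[:max_len]))
    return candidates

def top_keyphrases(text, k=5):
    """Score candidates by raw frequency (a stand-in for TF-IDF/graph scores)."""
    counts = Counter(candidate_phrases(text))
    return [phrase for phrase, _ in counts.most_common(k)]
```

Even this naive version recovers phrases like ‘collective dose calculations’, ‘facility operations’, and ‘personal computer’ from the passage above, though it also produces plenty of noise — which is exactly where the better candidate filters and scoring functions earn their keep.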

Clearly, if another human were given the same passage, there would be some differences in the resulting list of phrases. So, in my opinion, this is a difficult problem even for a human, unlike a problem such as POS tagging or Named Entity Recognition.

With that introduction, the question is: how do you approach this problem? It is a fairly standard problem in NLP research, and since there are many practical applications, there are a lot of publicly available datasets, algorithm implementations, etc. Research articles report poor results on standard datasets for this task (F-scores in the 30-40% range, sometimes even less). There is also a lot of discrepancy in the scores from one paper to another, but I will get to that later. When I looked at some publicly available libraries such as Textacy, which has implementations of several KPE algorithms, I felt that the output looked pretty decent. So I wondered why the reported evaluations are so poor. Granted, evaluation is also difficult here … for example, what do we do with partial matches between a phrase given by a human and one given by a machine?
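To see how much the matching criterion matters, here is a sketch of exact-match precision/recall/F1 next to a naive substring-based partial-match variant. The gold phrases are the human keyphrases from the passage above; the predicted phrases are a made-up system output, purely for illustration.

```python
def prf(gold, predicted):
    """Exact-match precision/recall/F1 over sets of phrases."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def partial_prf(gold, predicted):
    """One crude notion of partial credit: one phrase contains the other."""
    hit = lambda a, pool: any(a in b or b in a for b in pool)
    p = sum(hit(q, gold) for q in predicted) / len(predicted)
    r = sum(hit(g, predicted) for g in gold) / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = ["collective dose calculations", "facility operations", "public",
        "geographic information system software",
        "electronic population data resources", "personal computer"]
# Hypothetical system output, for illustration only.
pred = ["collective dose", "facility operations", "population", "personal computer"]
```

On this pair, exact match gives F1 = 0.4 (right in the 30-40% range reported in papers), while the substring-based partial match gives F1 = 0.8. The choice of matching criterion alone can double the score, which goes some way toward explaining the discrepancies between papers.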

My future posts (whenever they arrive) will discuss these issues of generating and evaluating keyphrases, common assumptions made in current approaches, why they may not always work, etc.

Some useful readings:

Some links to publicly available datasets:

Python libraries: Textacy - a Python library with implementations of several KPE algorithms.

Some recently proposed algorithms:

Written on July 31, 2018