Quick Notes - Day 5
All previous posts in this series here
Today’s theme: evaluation
Fill in the BLANC: Human-free quality estimation of document summaries
Authors: Oleg Vasilyev, Vedant Dharnidharka, John Bohannon
Published At: (Possibly not peer-reviewed). ArXiv. url
Evaluation of automatic text summarization is commonly done using measures such as ROUGE, which require reference human summaries. This paper proposes a new approach, BLANC, to evaluate document summarization without any reference summaries. BLANC estimates the quality of a summary by how much it helps an independent pre-trained language model (such as BERT) perform a language understanding task on the document. They show that BLANC scores correlate with human judgements at least as well as ROUGE does.
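To make the idea concrete, here is a rough sketch of how a BLANC-help style score could be computed with a masked LM. This is my own simplification, not the authors' implementation: the masking interval, the one-sentence context window, and the dot filler below are assumptions for illustration.

```python
# Sketch of the BLANC-help idea: mask tokens in each document sentence and compare
# BERT's masked-token accuracy when the summary is prepended versus a filler of
# similar length. The exact masking scheme and filler here are assumptions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_accuracy(prefix, sentence, mask_every=4):
    """Fraction of masked sentence tokens BERT recovers given `prefix` as context."""
    prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
    sent_ids = tokenizer.encode(sentence, add_special_tokens=False)
    input_ids = ([tokenizer.cls_token_id] + prefix_ids + [tokenizer.sep_token_id]
                 + sent_ids + [tokenizer.sep_token_id])
    offset = len(prefix_ids) + 2  # index where the sentence tokens start
    masked_positions = list(range(offset, offset + len(sent_ids), mask_every))
    labels = [input_ids[p] for p in masked_positions]
    for p in masked_positions:
        input_ids[p] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(torch.tensor([input_ids])).logits[0]
    preds = logits[masked_positions].argmax(dim=-1).tolist()
    correct = sum(int(p == l) for p, l in zip(preds, labels))
    return correct / max(len(labels), 1)

def blanc_help(summary, doc_sentences):
    """Average accuracy gain from conditioning on the summary instead of a filler."""
    filler = "." * len(summary)  # crude same-length filler; an assumption, not the paper's exact choice
    gains = [masked_accuracy(summary, s) - masked_accuracy(filler, s) for s in doc_sentences]
    return sum(gains) / len(gains)
```

The per-sentence gains are also what would feed the heat map mentioned below: sentences where the summary lifts reconstruction accuracy are the ones the summary "covers".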
I really like this idea of evaluating summarization without reference summaries. It can be especially useful for building real-world applications, which often live in domains where reference summaries are hard to obtain. Further, it also produces a heat map across the document of where the summary is useful - that seems like a good way to understand what is going on and how to improve a summarization approach.
Evaluating Machines by their Real-World Language Use
Authors: Rowan Zellers, Ari Holtzman, Elizabeth Clark, Lianhui Qin, Ali Farhadi, Yejin Choi
Published At: (Possibly not peer-reviewed). ArXiv. url
This paper shows how large pre-trained models with billions of parameters fail at a real-world task - providing helpful advice to humans. The authors start with an analysis of the recent T5 model and argue that there is a huge gap between how human beings use language and what the evaluation methodologies of NLP tasks actually measure. So, they propose evaluating machines (NLP systems) by how well they tackle complex, open-ended situations involving human language use. They introduce a new challenge dataset, RedditAdvice, and show that "Today's largest model, T5, with 11 billion parameters (Raffel et al., 2019), produces advice that is preferable to human-written advice only 9% of the time – after being finetuned for our task, on a training dataset with 600k pieces of advice." They also show how current NLP approaches fail at real-world NLU.
I like the paper in general - the dataset creation, the experiments (comparisons start from a TF-IDF retrieval baseline, which is in itself a rarity in these days of deep learning baselines), and a long discussion of the problems with current models, how they could potentially get better at this, the ethical considerations (again, not a commonly discussed topic), etc. Overall, I did not do a thorough study or anything, but I felt that it is a good paper!
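Since TF-IDF retrieval comes up as the simplest baseline, here is a quick generic sketch of that kind of baseline: retrieve the advice attached to the most similar previously seen post. The data and setup here are made up by me for illustration; the paper's RedditAdvice retrieval setup is more involved.

```python
# Generic TF-IDF retrieval baseline sketch (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus of (post, advice) pairs.
past_posts = [
    "How do I ask my manager for a raise?",
    "My roommate never does the dishes, what should I do?",
]
past_advice = [
    "Schedule a meeting, bring evidence of your impact, and name a number.",
    "Have a calm conversation and agree on a chore schedule.",
]

vectorizer = TfidfVectorizer(stop_words="english")
post_matrix = vectorizer.fit_transform(past_posts)

def retrieve_advice(new_post):
    """Return the advice from the most TF-IDF-similar past post."""
    query_vec = vectorizer.transform([new_post])
    sims = cosine_similarity(query_vec, post_matrix)[0]
    return past_advice[sims.argmax()]

print(retrieve_advice("What is a good way to ask for a raise at work?"))
```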