Quick Notes - Day 2
All previous posts in this series here
What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models
Author: Allyson Ettinger
Published in: Transactions of the Association for Computational Linguistics, 2020. url
Large pre-trained language models such as BERT have dominated NLP research for the past 2-3 years. Consequently, the question of what properties of language these models actually learn has become a research topic in its own right. This paper addresses that question by adapting tests used in human psycholinguistic experiments to query BERT. Through these experiments, the author identifies a few shortcomings of BERT. Specifically, the paper shows that BERT “struggles with challenging inference and role-based event prediction—and, in particular, it shows clear insensitivity to the contextual impacts of negation.”
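To make the probing setup concrete: these diagnostics are essentially cloze tests, where BERT's predictions for a masked word are compared against human expectations. Here is a minimal sketch of that idea using the HuggingFace transformers library; this is my own illustration, not the paper's code, and the "robin" sentence pair merely echoes the style of negation stimuli such studies use:

```python
from transformers import pipeline

# Fill-mask pipeline returns BERT's top candidates for a masked token.
fill = pipeline("fill-mask", model="bert-base-uncased")

# A negation-sensitive model should complete these two contexts differently.
for sentence in ["A robin is a [MASK].", "A robin is not a [MASK]."]:
    predictions = fill(sentence, top_k=5)
    print(sentence, "->", [p["token_str"] for p in predictions])
```

If BERT produces "bird" for both contexts, that is exactly the kind of insensitivity to negation the paper reports.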
I am not very familiar with the literature on such model-probing experiments, apart from a few papers I came across on the syntactic properties these models seem to learn. So, I found the idea of using tests from psycholinguistic studies to understand deep learning models very interesting. Further, the test sets are small compared to the typical ones you see in NLP papers (understandably so!). Hence, I found the experiments section pretty cool: the thorough discussion of the results, why they matter despite the small test set sizes, what they tell us about BERT, etc. It would have been nice if a couple more such pre-trained models had been explored, though. The best part of the paper is that all code and data are available on GitHub!
Machine Learning–Driven Language Assessment
Authors: Burr Settles, Geoffrey T. LaFlair, and Masato Hagiwara
Published in: Transactions of the Association for Computational Linguistics, 2020. url
This paper is from Duolingo and describes their method for automatically creating language proficiency assessments that are valid, reliable, and secure. Normally, test items for language proficiency tests are manually created and “pilot tested” for validity. Clearly, this is both time-consuming and not very secure (the same set of items is reused, so they can leak). The Duolingo team overcomes this by using NLP and ML so that test items can be automatically created, graded, and analyzed psychometrically. There is one model for vocabulary tests (yes/no question items) and one for passage-based items (c-test, dictation, speaking responses, etc.). After a brief validation experiment, these models were used to build the Duolingo English Test, and the paper analyzes its reliability and security.
The paper starts with an overview of important concepts from language testing and psychometrics, such as Item Response Theory (IRT), test constructs, item formats, computer-adaptive testing, and the CEFR scale for language proficiency. I don't remember seeing this kind of overview in any other NLP paper on language assessment. I also liked the discussion section at the end, which touches on the non-computational, language-testing-specific aspects typically discussed in educational measurement communities. I think this is also the first paper I have seen where the authors have backgrounds spanning ML, NLP, and applied linguistics.
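For readers who, like me, had not seen IRT before: in its simplest form (the Rasch, or one-parameter logistic, model), the probability of answering an item correctly is a logistic function of the gap between the test taker's ability and the item's difficulty. A tiny sketch of that general model, not the paper's specific formulation:

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Rasch (1PL) model: probability that a test taker with
    ability `theta` answers an item of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Higher ability relative to item difficulty -> higher success probability.
print(rasch_prob(theta=1.0, b=0.0))   # ~0.73
print(rasch_prob(theta=-1.0, b=0.0))  # ~0.27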
That’s it for today!