Quick Notes - Day 9
All previous posts in this series here
We need to talk about standard splits
Authors: Kyle Gorman and Steven Bedrick
Published at: ACL 2019 url
It is common in NLP to evaluate system performance on a held-out test set. However, it is much less common to check for statistical significance or to check system stability across multiple train-test splits. This paper suggests using multiple random train-test splits instead of a single standard split, arguing that researchers may be overfitting to the "static" standard splits. The authors performed experiments on a standard POS tagging dataset, replicating (with standard splits) and reproducing (with random splits) several approaches that were SOTA at different points in time, and show that the performance rankings reported on the standard split cannot be replicated with random splits for these taggers.
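To make the idea concrete, here is a minimal sketch of evaluating two systems over several random train-test splits instead of a single fixed one. It assumes scikit-learn and scipy, uses two off-the-shelf classifiers on a toy dataset as stand-ins for the taggers, and applies a Wilcoxon signed-rank test over per-split accuracies; the paper's own protocol (tagging accuracy and its statistical tests) differs in the details.

```python
# Sketch: compare two systems across multiple random train/test splits
# instead of a single standard split. The classifiers and dataset here are
# stand-ins, not the taggers or corpus used in the paper.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
accs_a, accs_b = [], []
for seed in range(20):  # 20 random splits; an arbitrary choice for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=seed)
    accs_a.append(LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te))
    accs_b.append(GaussianNB().fit(X_tr, y_tr).score(X_te, y_te))

# Does system A beat system B consistently, or only on some splits?
stat, p = wilcoxon(accs_a, accs_b)
print(f"mean acc A={np.mean(accs_a):.4f}  B={np.mean(accs_b):.4f}  p={p:.4g}")
```

The point of the loop is exactly the paper's concern: a ranking observed on one split may not hold once the split itself is treated as a source of variance.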
I am not surprised by the results of this paper - I have had the same question about statistical significance and stability for a very long time, and it is good to see some systematic exploration. It would be nice to see such experiments for other NLP tasks as well. The final words of this paper felt very true for all NLP tasks:
It is said that statistical praxis is of greatest import in those areas of science least informed by theory. While linguistic theory and statistical learning theory both have much to contribute to part-of-speech tagging, we still lack a theory of the tagging task rich enough to guide hypothesis formation. In the meantime, we must depend on system comparison, backed by statistical best practices and error analysis, to make forward progress on this task.
We need to talk about random splits
Authors: Anders Søgaard, Sebastian Ebert, Joost Bastings, Katja Filippova
Published at: ArXiv (May not be peer reviewed) url
This paper argues, based on experiments across seven tasks/datasets (compared to one in the first paper), that random splits underestimate the error rate on new samples. It suggests using biased splits to understand which characteristics of text data determine system performance on new samples, and also recommends using multiple test sets rather than relying on a single standard or random split.
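As a rough illustration of what a "biased" split could look like, here is a sketch that holds out the longest sentences as the test set, so the test data differs systematically from the training data along one dimension. Sentence length is just one illustrative choice; the function name, toy data, and details below are my own and not taken from the paper.

```python
# Sketch of a length-biased (heldout) split: the longest sentences become the
# test set, so test data is systematically unlike the training data.
from typing import Sequence, Tuple

def length_biased_split(
    sentences: Sequence[Sequence[str]],
    labels: Sequence,
    test_fraction: float = 0.1,
) -> Tuple[list, list, list, list]:
    """Hold out the longest sentences as the test set."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    n_test = int(len(sentences) * test_fraction)
    train_idx, test_idx = order[:-n_test], order[-n_test:]
    X_train = [sentences[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]
    X_test = [sentences[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

# Hypothetical usage with toy tokenized sentences and dummy labels:
sents = [["a"], ["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"],
         ["x"], ["x", "y"], ["x", "y", "z"], ["x", "y", "z", "w"],
         ["p", "q"], ["p", "q", "r"]]
tags = list(range(len(sents)))
X_tr, y_tr, X_te, y_te = length_biased_split(sents, tags, test_fraction=0.2)
print(len(X_tr), len(X_te))  # 8 training sentences, 2 (longest) test sentences
```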
I read the 2019 paper after seeing this one, and it felt like an interesting continuation of the earlier findings. However, I felt the paper should have been expanded a bit, with more discussion of how to create good biased splits in practice and how to obtain multiple test sets for an arbitrary task (which seems impractical to me).