Quick Notes - Day 11
All previous posts in this series here
theme: evaluation
A Metric Learning Reality Check
Authors: Kevin Musgrave, Serge Belongie, Ser-Nam Lim
Published at: arXiv (may not be peer-reviewed) url
This paper looks at the recent metric learning literature and critiques its claims of improved performance of deep learning approaches over previous methods. Among other things, the authors show that there are flaws in the current practice of running experiments, reporting results, and comparing methods (e.g., not using the same experimental setup, changing the default embedding size, etc.), and hence claims of better performance may not actually hold. They then show that when these issues are fixed and the experiments are replicated, SOTA loss functions perform only marginally better than classic methods.
Though this paper focuses specifically on metric learning, these conclusions may very well be true for any field that has claimed SOTA using deep learning methods (or even earlier, using traditional machine learning). In many ways, this is obvious to anyone who has thought about the problem before. I wonder what the solution is, and how one could compile a list of “best practices” for deep learning experiments and reporting.
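As an illustration of the kind of controlled comparison the paper argues for, here is a minimal sketch (my own, not the paper's code) of an evaluation harness that fixes the embedding size and random seeds across all loss functions and reports a mean with a confidence interval rather than a single best run. The `run_experiment` callable is a placeholder for the actual training/evaluation loop.

```python
import random
import statistics

EMBEDDING_SIZE = 128      # fixed for every method, not tuned per method
SEEDS = [0, 1, 2, 3, 4]   # identical seeds for every loss function


def compare_losses(loss_names, run_experiment):
    """run_experiment(loss_name, seed, embedding_size) -> accuracy on the test split."""
    results = {}
    for name in loss_names:
        scores = [run_experiment(name, s, EMBEDDING_SIZE) for s in SEEDS]
        mean = statistics.mean(scores)
        # report a 95% confidence interval over runs instead of a single best number
        ci = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
        results[name] = (mean, ci)
    return results


# Toy usage with a dummy experiment; in practice this would train a model
# with the given loss under the shared setup and return test accuracy.
dummy = lambda name, seed, dim: random.Random(f"{name}-{seed}").uniform(0.6, 0.7)
print(compare_losses(["contrastive", "triplet", "multi-similarity"], dummy))
```

The point of the sketch is only the protocol: every loss sees the same backbone settings, the same seeds, and the same reporting convention, so differences in the final numbers cannot be attributed to the experimental setup.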
On the State of the Art of Evaluation in Neural Language Models
Authors: Gabor Melis, Chris Dyer, Phil Blunsom
Published at: ICLR 2018 url
There is a lot of work on using deep learning for NLP: many code bases, different architectures, different hyperparameters, etc. This paper re-evaluated some architectures on some tasks and concluded that, contrary to established results (at the time of the paper), standard LSTM architectures, when properly regularized, outperform more recent models.
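For concreteness, here is a minimal PyTorch sketch (my own, not the authors' code, which relied on a large-scale hyperparameter search) of the kind of well-regularized LSTM language model the paper has in mind: dropout applied to the embeddings and LSTM outputs, plus weight tying between the input embedding and the output projection.

```python
import torch
import torch.nn as nn


class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, hidden_size=650, num_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            dropout=dropout, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.decoder = nn.Linear(hidden_size, vocab_size)
        # tie input embedding and output projection weights (one common
        # regularization trick; requires embedding dim == hidden size)
        self.decoder.weight = self.embedding.weight

    def forward(self, tokens, hidden=None):
        x = self.drop(self.embedding(tokens))
        out, hidden = self.lstm(x, hidden)
        return self.decoder(self.drop(out)), hidden


# Toy usage: batch of 4 sequences of length 10 over a vocabulary of 1000 tokens.
model = LSTMLanguageModel(vocab_size=1000)
logits, _ = model(torch.randint(0, 1000, (4, 10)))
print(logits.shape)  # torch.Size([4, 10, 1000])
```

Nothing here is architecturally novel, which is the paper's point: with careful regularization and tuning, this plain recipe is a very strong baseline.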
I found this paper while browsing Twitter threads on the first paper. It is a sort of NLP-focused precedent to the first paper. Here too, the authors established an experimental setup to point out how SOTA claims could be flawed, but concede that they cannot provide any methodological solution to the problem other than establishing carefully calibrated baselines that can be used to benchmark future work. However, I am glad to have come across these papers that question SOTA claims on experimental grounds. These seem like very obvious hypotheses to form for anyone who follows the field, but they are not so easy to test, IMO.