Quick Notes - Day 4

All previous posts in this series here

Today’s theme: low resource scenarios in NLP

When and Why is Unsupervised Neural Machine Translation Useless?
Authors:Yunsu Kim, Miguel Graça, Hermann Ney
Published at: EAMT 2020 url for extended version

Unsupervised MT is (in theory) useful in scenarios where there is not enough parallel data. This paper takes a look at unsupervised neural MT considering various possible scenarios (lot of data, less data, close language pairs, distant language pairs etc) and draws some conclusions on when it is useful and when it is not, by comparing with supervised and semi-supervised learning setups. In all tasks, unsupervised performs worser than semi-supervised and supervised learning, and less than 100K sentence pairs (20K in one case) are sufficient for supervised MT to overcome unsupervised MT. Having lots of monolingual data for unsupervised MT is probably useful when source/target languages are more closer, but not otherwise. They also show that domain mismatch results in bad unsupervised MT performance. So they conclude that the current form of neural unsupervised MT useless in practice.

There are two things I liked in this paper:

This sort of evaluation of Unsupervised NMT is very important. I (and someone I know) was quite excited when I saw the unsupervised NMT papers (which also received good coverage, in my view) in recent past, and dreamt of seeing MT systems for all sorts of under resourced Indian languages (not the major ones, that is). This paper helped me understand why the reality is different.
Many papers I come across neither do any qualitative analysis, nor have some thoughts on general perspective i.e., what are some limitations, how to address these in future etc. This paper had recommendations to make unsupervised NMT useful for future. I look forward to see more such usable methods in future!

Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá
Authors: David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg, Dietrich Klakow
Published at: Practical ML for Developing Countries Workshop @ ICLR 2020

This paper explores whether distant supervision methods and different embedding representations are useful for NER in low resource settings by considering two languages of Africa: Hausa and Yoruba. They conclude that both distant supervision and embedding representations are useful for low resource settings.

What I like about this paper is that it tackles two languages that are not typically seen in NLP papers. I always believe that the real claims to multilingual models depend only on success in such scenarios. However, I am disappointed to see a LSTM model to be a default in this scenario, considering that they don’t seem to cite any prior work on this problem for these languages. I think they should start with simpler models and gradually increase the complexity to evaluate different approaches. Of course, this is not specific to this paper - it seems to be a common problem these days in NLP.

Written on May 3, 2020