Quick Notes - Day 6
All previous posts in this series here
Today’s theme: understanding pre-trained models (again)
Cross-Lingual Ability of Multilingual BERT: An Empirical Study
Authors: Karthikeyan K, Zihan Wang, Stephen Mayhew, Dan Roth
Published at: ICLR 2020 url
M-BERT (Multilingual BERT) has been successful on several tasks in the recent past, even though it is trained without any language-aligned data and without any explicit cross-lingual objective. This paper aims to understand that success using three languages (Spanish, Hindi, Russian) and two tasks (textual entailment, NER). One hypothesis has been that M-BERT works well because of some level of lexical similarity between languages. The authors show that the model is cross-lingual even when there is no lexical overlap, so that is not what drives the performance. They conclude that the model architecture (network depth, total number of model parameters) matters more than lexical similarity.
I am somewhat surprised by the finding that lexical overlap has so little impact. I am not fully convinced that creating "Fake-English" for this purpose is a good idea, but in general, I felt the paper systematically explored the cross-lingual ability of M-BERT.
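To make the "Fake-English" idea concrete: as I understand it, the trick is to shift each character's Unicode code point by a large constant, so the resulting "language" has exactly the same structure as English but shares no vocabulary with it. Here is a minimal sketch of that construction; the offset value, the choice to leave whitespace untouched, and the function name are my assumptions, not the paper's actual preprocessing code.

```python
# Rough sketch: build a "Fake-English" corpus by shifting Unicode code points.
# The OFFSET value and the decision to keep whitespace unchanged are my
# assumptions; the paper's exact preprocessing may differ.

OFFSET = 0x10000  # large constant so shifted characters never collide with real ones


def to_fake_english(text: str, offset: int = OFFSET) -> str:
    """Map every non-space character to a shifted code point.

    The output has the same structure (word boundaries, sentence lengths)
    as the English input, but zero lexical overlap with it.
    """
    return "".join(
        ch if ch.isspace() else chr(ord(ch) + offset)
        for ch in text
    )


if __name__ == "__main__":
    print(to_fake_english("The cat sat on the mat."))
```

The point of such a construction is that any cross-lingual transfer between English and Fake-English cannot be explained by shared word pieces, which is how the paper isolates the contribution of lexical overlap.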
Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction
Authors: Taeuk Kim, Jihun Choi, Daniel Edmiston, Sang-goo Lee
Published at: ICLR 2020 url
Grammar induction is the task of associating syntactic trees with sentences without explicit supervision from gold-standard trees. This paper explores whether pre-trained language models are useful for grammar induction. Then, in an effort to understand what syntactic knowledge these models capture, the authors conclude that the models are better at identifying adverb phrases than other approaches.
I found the idea interesting, but when I browsed through, I did not understand why there is a special focus on adverb phrases - the implications of doing well on adverb phrases (but not so well on other phrase types) are not clearly discussed, which made me wonder why the paper singles out ADVP other than the fact that it had better numbers.
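For readers unfamiliar with this line of work, the general recipe (my rough understanding of this family of methods, not necessarily the paper's exact algorithm) is to compute a "syntactic distance" for each gap between adjacent words from a pre-trained model's representations, and then recursively split the sentence at the largest-distance gap to obtain an unlabeled binary tree. A toy sketch with hand-picked distances:

```python
# Toy sketch of distance-based tree induction: given one "syntactic distance"
# per gap between adjacent words, split recursively at the largest gap.
# The distances below are made up for illustration; in the actual setup they
# would be derived from a pre-trained LM's intermediate representations.


def build_tree(words, gap_distances):
    """Return a nested-list binary tree over `words`.

    `gap_distances` must have length len(words) - 1.
    """
    if len(words) == 1:
        return words[0]
    # Split at the gap with the largest syntactic distance.
    split = max(range(len(gap_distances)), key=lambda i: gap_distances[i])
    left = build_tree(words[: split + 1], gap_distances[:split])
    right = build_tree(words[split + 1 :], gap_distances[split + 1 :])
    return [left, right]


if __name__ == "__main__":
    words = ["the", "cat", "sat", "quite", "happily"]
    # Hypothetical distances for the 4 gaps between adjacent words.
    distances = [0.2, 0.9, 0.6, 0.1]
    print(build_tree(words, distances))
    # -> [['the', 'cat'], ['sat', ['quite', 'happily']]]
```

The induced trees are then scored against treebank trees, which is where per-label results such as the ADVP numbers discussed above come from.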