Universal Dependencies for several languages

Thanks to the recent discussions with @phylostar, I have been reading quite a bit on Universal Dependencies(UD) and on the creation of manual annotations following the UD scheme. While there are elaborately written guidelines on the website, clearly, each time you start working with a new language (especially something not very close to English), you need to figure out how to accomodate certain language specific elements into UD without losing the “universal” part and not losing adequate representation of the language phenomenon. This was the main theme in my last week’s readings about the creation of UD treebanks for various morphologically rich languages. [2] gives a general overview of the idea of UD for the uninitiated researchers. I used UD in Stanford parser before, but never knew it now exists for over 50 languages and several more are getting ready!

Noticing that languages with much less number of speakers have UD treebanks, 4 Indian sub-continent languages (Hindi, Tamil, Sanskrit, Urdu) already have treebanks, and two more (Marathi, Bangla) are under preparation, I got interested in UD Telugu. @phylostar’s regular pings on the topic acted as a booster.

Two papers among what I read this week ([4] for Finnish and [5] for Turkish) primarily described work on automatic conversion from previously existing treebank annotations (following a different dependency representation) into UD format. I originally thought this should be pretty straight forward (not having done such a thing before!). Infact, [5] even made it look so. However, [4] made me more aware of issues one will run into. For example, they talk about excluding UD tags DET and PART in Finnish, primarily because those categories do not occur in the language. Another example is about keeping a head-final structure instead of head-initial structure used by UD for multi-word expressions. I found these details interesting and educational, from the perspective of starting off UD for a new language.

[7] is about UD for Slavic languages. While this does not talk about a UD treebank construction yet, it had an extensive discussion on the kind of morphological and syntactic phenomenon that occur in Slavic languages that may not be straight forward in UD notation. This perhaps served as a goto document for preparing UD treebanks for these languages. These discussions about individual parts of speech were informative and got me thinking on looking for a discussion on this lines for Telugu. I have been reading [3] on and off for understanding Telugu grammar, but it seems to me (for now) very hard to switch between UD and the terminology used in the book. It seems like a grammar that is elaborate, but many parts are not suited for any computational grammar (no offense. I really loved the detailed descriptions of rules for Sandhi, for example and thought they were thorough to write a rule based system).

[6] is the most recent of these, from EACL 2017. They described UD guidelines for Hungarian, and also wrote and evaluated a converter from an existing treebank to UD along with a manual conversion of 1800 sentences in newspaper domain. They also wrote about their experiments with the added value of language specific information in this annotation. I liked the detailed linguistic descriptions (with examples). They were very useful to understand certain language specific phenomenon and how to accomodate them within UD framework. I am not entirely convinced about the “specific” labels outperform UD argument though. Considering those ultra small differences, I got an impression that UD is actually very good at addressing even language specific phenomenon in morphologically rich languages. Overall, this was a useful read, and with a lot of references for further study.

I found the idea of [1] interesting. They describe the construction of a new UD based treebank for Turkish, using a grammar book. The idea behind choosing a grammar book instead of the typical news article corpus is that grammar book examples cover many language phenomena including the rare ones not typically seen in standard use. The treebank created with these sentences can also serve as a good evaluation resource for the parsers and to do some qualitative analysis of the parser coverage. There was a detailed discussion of various syntactic and morphological phenomenon of Turkish and how to cover them while doing UD annotations. For example, Turkish is very agglutinative, and hence one single word may contain multiple words (with multiple POS) and with different relations between words. There were good examples showing these issues. Among all the things I read this week, I found this to be the most informative in terms of guidance for someone starting with a new language.

Overall, learnt a lot of things about treebank annotations. I am learning to do such annotations for Telugu sentences, following UD guidelines and notes from [3]. I should see how it works. I also wondered if there is any study on doing such grammatical annotations for different types of texts (news, fiction, blogs, tweets etc).

References

Cöltekin, Cagrı. “A grammar-book treebank of Turkish.” In Proceedings of the 14th workshop on Treebanks and Linguistic Theories (TLT 14), pp. 35-49. 2015.
De Marneffe, Marie-Catherine, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. “Universal Stanford dependencies: A cross-linguistic typology.” In LREC, vol. 14, pp. 4585-92. 2014.
Krishnamurti, Bhadriraju, and John Peter Lucius Gwynn. A grammar of modern Telugu. Oxford University Press, USA, 1985.
Pyysalo, Sampo, Jenna Kanerva, Anna Missilä, Veronika Laippala, and Filip Ginter. “Universal dependencies for Finnish.” In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania, no. 109, pp. 163-172. Linköping University Electronic Press, 2015.
Sulubacak, Umut, Memduh Gokirmak, Francis M. Tyers, Çagri Çöltekin, Joakim Nivre, and Gülsen Eryigit. “Universal Dependencies for Turkish.” In COLING, pp. 3444-3454. 2016.
Vincze, Veronika, Katalin Ilona Simkó, Zsolt Szántó, and Richárd Farkas. “Universal Dependencies and Morphology for Hungarian–and on the Price of Universality.” Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 356–365
Zeman, Daniel. “Slavic Languages in Universal Dependencies.” Natural Language Processing, Corpus Linguistics, E-learning (proceedings of SLOVKO 2015) (2015): 151-163.

Written on July 22, 2017