Day 2 @ COLING 2020

I am attending COLING 2020 and here are some notes about what I found interesting in the Day 2 (9th December 2020) program. Note that I did not read the papers thoroughly. I just used the titles to choose which talks to attend, either watched the video or followed the slides, and then browsed through the paper if I found it interesting. I skipped some of these initial picks within the first few minutes, once I realized they did not interest me. So, this is what remains after a highly ad-hoc and subjective selection process.

Keynote: “The computational linguistics of conversation modeling” by Amanda Stent
This talk traces computational linguistics research on conversation modeling through the years. I just loved this talk and I think it is a must-watch for all NLP enthusiasts. There is something about every aspect of conversation modeling, from resource creation to real-world use. It would be great to see the slides made available, as the talk has a ton of resources for anyone interested in the topic. What I missed were some thoughts on what is going to happen for non-English conversation modeling in the coming years, what kind of research is needed for that, and what some of the challenges ahead are.

An equally important part of today's program, I felt, was Steven Bird's "Decolonising Speech and Language Technology". The author shares his experiences and suggests ways to work with indigenous communities on speech/language technology problems. There has been some work recently on NLP for various indigenous language families, and there is also a workshop coming up soon focusing on the topic. So, I felt this is a very relevant discussion to have in the NLP community in 2020.

New Methods and Stuff (new models for various NLP tasks and applications)
KeyGames: A Game Theoretic approach to Automatic Keyphrase Extraction
Automatic keyphrase extraction is a well-studied problem in the NLP literature, with a range of approaches from supervised to unsupervised, including several recent neural ones. This paper adds to the existing research by proposing a new approach based on evolutionary game theory concepts. The authors also release a Python-based pipeline that enables comparison across multiple keyphrase extraction methods, and they evaluate on three commonly used datasets for this task. As someone who has read a lot of KPE literature and worked on deploying it in a product setup in the past, I keep getting frustrated with the relatively low performance of any new KPE approach proposed, both by itself and in terms of improvement over existing ones. This makes me repeatedly question whether we are defining the task wrong, evaluating it wrong, or both. This paper added to my frustration. However, I find the pipeline they describe interesting, and it could be a useful tool for KPE researchers.
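To make the comparison-pipeline idea concrete for myself, here is a rough sketch of what such a harness could look like: run one or more extractors over the same documents and score them against gold keyphrases with exact-match F1. This is entirely my own toy version (with a plain TF-IDF extractor as the only method), not the authors' released pipeline.

```python
# A hypothetical harness for comparing keyphrase extraction (KPE) methods.
# This is NOT the KeyGames pipeline -- just a sketch of what running several
# extractors on the same data and scoring with exact-match F1 might look like.
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_keyphrases(docs, top_k=10):
    """Rank candidate 1-3 gram phrases per document by TF-IDF weight."""
    vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names_out()
    results = []
    for row in X:
        weights = row.toarray().ravel()
        top = weights.argsort()[::-1][:top_k]
        results.append([vocab[i] for i in top if weights[i] > 0])
    return results


def exact_match_f1(predicted, gold):
    """Micro-averaged exact-match F1 over all documents."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)
        fp += len(pred - ref)
        fn += len(ref - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


# Toy data, only to show the shape of the comparison.
docs = ["evolutionary game theory for keyphrase extraction"]
gold = [["evolutionary game theory", "keyphrase extraction"]]

for name, extractor in {"tfidf": tfidf_keyphrases}.items():
    print(name, exact_match_f1(extractor(docs), gold))
```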

A Neural Local Coherence Analysis Model for Clarity Text Scoring
As the name indicates, this paper develops a local coherence model (i.e., considering coherence at the level of adjacent sentences instead of at a global level) to model text clarity. It does seem like an interesting problem, relevant both to general text quality/readability work and to specific application scenarios like essay scoring. However, from the poster, it wasn't entirely clear to me how exactly they are quantifying text clarity and how that is validated.

Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation
This paper proposes a new data augmentation method for generating synthetic examples for grammatical error correction (GEC). The authors show that this approach improves performance on public benchmarks for the task. It seems like an interesting new method, but I wonder how it works on languages that follow a different sentence structure than English.
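I can't do justice to the latent-representation editing here, but for context, the simplest flavour of GEC data augmentation is rule-based noising: take clean sentences and corrupt them to produce synthetic (erroneous, correct) pairs. The sketch below illustrates only that baseline idea, not the paper's method; the confusion sets and probabilities are my own made-up choices.

```python
import random

# A toy rule-based noiser that turns clean sentences into synthetic
# (erroneous, corrected) training pairs for GEC. This illustrates the general
# idea of synthetic-data augmentation, NOT the paper's latent-editing approach.
random.seed(0)

# Illustrative confusion sets; a real noiser would use much richer ones.
CONFUSIONS = {"a": "an", "an": "a", "is": "are", "are": "is",
              "their": "there", "there": "their"}


def noise_sentence(sentence, p_drop=0.1, p_confuse=0.3):
    """Randomly drop tokens or swap in confusable words."""
    noisy = []
    for tok in sentence.split():
        if random.random() < p_drop:
            continue                                   # simulate a missing word
        if tok.lower() in CONFUSIONS and random.random() < p_confuse:
            noisy.append(CONFUSIONS[tok.lower()])      # simulate a word-choice error
        else:
            noisy.append(tok)
    return " ".join(noisy)


clean = "There is a reason why their results are better ."
pairs = [(noise_sentence(clean), clean) for _ in range(3)]
for src, tgt in pairs:
    print(f"{src}  ->  {tgt}")
```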

Grammatical error detection in transcriptions of spoken English
This paper describes the process of collecting transcription corrections and grammatical error annotations for a spoken English corpus, and shows their use for grammatical error detection (GED). I had not read about GED for spoken language before, so I found this interesting. It seems like there is some way to go for the performance to match that of written-text GED.

Industry Track:
Best Practices for Data-Efficient Modeling in NLG: How to Train Production-Ready Neural Models with Less Data
This Facebook paper, with literally a dozen authors, shares their experiences with building production-ready neural natural language generation (NLG) models. They talk about many issues, such as data quality, what metrics to use, handling the small amount of annotated data, and how to do data augmentation, and they discuss how they address these while also considering deployment concerns such as scalability, latency, maintenance cost, and quality assurance. It is good to see such talks in a research-focused conference. However, I felt it would have been good to give an overview of how exactly these models are being used, and whether there are non-English languages in their experiments. I couldn't figure this out in the short time I spent on the paper. The GitHub link they provided for the datasets threw a 404 error, which was frustrating.

Towards Building a Robust Industry-scale Question Answering System
This paper from IBM Research presents GAAMA ("Go Ahead Ask Me Anything"), a question answering system built for a natural-questions setting (not a SQuAD-like setup). They say their innovation lies in "attention over attention" and "attention diversity" on top of a BERT model. In addition, they also use data augmentation methods. They show that GAAMA outperforms other industry competitors, that zero-shot transfer is possible, and that GAAMA is efficient during training and inference. To a large extent it reads like a regular research paper; nothing is specific to the industry track other than the fact that it comes from IBM Research. I think some discussion of deployment issues, as in the paper above, would have been interesting.

Evaluation
Taking the Correction Difficulty into Account in Grammatical Error Correction Evaluation
In evaluating grammatical error correction systems, all errors are typically treated equally. In reality, though, I think everyone will agree that some errors are difficult to correct and some are easy (both for machines and for humans). This paper proposes ways of measuring the performance of such systems that take this into account, and shows that the proposed measures are consistent. The authors also seem to have released the relevant code and difficulty-weighted data. I found the idea very interesting. In a project I am working on now, we have been dealing with the same issue, but for a different NLP problem. I hope this kind of research on better evaluation will spread to other NLP tasks/languages as well.
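I haven't checked how the paper actually defines or measures difficulty, so the sketch below is only my guess at the general flavour: weight each gold error by how rarely a pool of reference systems manages to correct it, then compute a difficulty-weighted recall. Treat everything here (the weighting function, the toy numbers) as my own illustration, not the paper's formulation.

```python
# Sketch of difficulty-weighted recall for error correction, where an error's
# weight grows with how few reference systems corrected it. My own illustration
# of the general idea, not necessarily the weighting scheme used in the paper.

def difficulty_weights(corrected_by, n_systems):
    """Weight each gold error by the fraction of reference systems that MISSED it."""
    return {err: 1.0 - len(systems) / n_systems
            for err, systems in corrected_by.items()}


def weighted_recall(system_hits, weights):
    """Recall where each gold error counts proportionally to its difficulty weight."""
    total = sum(weights.values())
    hit = sum(w for err, w in weights.items() if err in system_hits)
    return hit / total if total else 0.0


# Toy example: 4 gold errors, judged against a pool of 5 reference systems.
corrected_by = {"e1": {"s1", "s2", "s3", "s4", "s5"},   # easy: everyone corrected it
                "e2": {"s1", "s2"},
                "e3": {"s1"},
                "e4": set()}                            # hard: nobody corrected it
weights = difficulty_weights(corrected_by, n_systems=5)

print(weighted_recall({"e1", "e2"}, weights))   # mostly easy errors -> 0.25
print(weighted_recall({"e3", "e4"}, weights))   # hard errors -> 0.75
```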

Model Compression, Empirical Analyses

LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
Deep learning models such as BERT are generally large and memory intensive, and there has been some research in the direction of model compression, with recent work focusing on knowledge distillation. This paper argues that knowledge distillation approaches are also expensive, and proposes a hybrid approach to compressing BERT through weight pruning, matrix factorization, and knowledge distillation. The results seem to indicate that the various levels of compression hurt performance a lot on some datasets, while others are not affected that much. I would have loved to see some comment on exactly how small these models are (in terms of model size in MB/GB) and whether this takes us closer to on-device models. There is a passing mention of embedded ML at the top, but considering that Microsoft Research is involved, I would have loved to see more on that part.
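To remind myself where the size savings in such hybrid schemes come from, here is a back-of-the-envelope NumPy sketch of two of the three ingredients, magnitude pruning and SVD-based matrix factorization, applied to a single FFN-sized weight matrix. It is not LadaBERT itself, and the knowledge-distillation part is omitted entirely.

```python
import numpy as np

# Two compression ingredients on a single weight matrix: magnitude pruning and
# low-rank (SVD) factorization. Only an illustration of where the parameter
# savings come from -- not the paper's hybrid scheme.
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))          # a BERT-base FFN-sized matrix

# 1) Magnitude pruning: zero out the smallest 70% of weights.
threshold = np.quantile(np.abs(W), 0.7)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)
print("nonzero fraction after pruning:", np.count_nonzero(W_pruned) / W.size)

# 2) Low-rank factorization: keep only the top-k singular values, so the
#    d1 x d2 matrix is stored as two thin matrices of rank k.
k = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]                          # d1 x k
B = Vt[:k, :]                                 # k x d2
params_before = W.size
params_after = A.size + B.size
print("parameter ratio after rank-%d factorization: %.2f"
      % (k, params_after / params_before))
```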

An Empirical Study of the Downstream Reliability of Pre-Trained Word Embeddings
There are many different kinds of word embeddings, and the results of an experiment may change based on which embeddings were used, what model was used for downstream-task training, whether there is any fine-tuning, etc. This paper performs a study to understand the cross-corpus and cross-domain reliability of using pre-trained word embedding models, with and without fine-tuning. They eventually conclude that static embeddings work best. I wasn't very sure how the fine-tuning was done for the embeddings they used, and I was also curious why they did not look at any contextual embeddings. Otherwise, it seemed like an interesting study.

Bookmarked to check out later: Enabling Interactive Transcription in an Indigenous Community
Mark-Evaluate: Assessing Language Generation using Population Estimation Methods

That’s all for today and virtual conferencing is exhausting!

Written on December 9, 2020