The problem with off-the-shelf Named Entity Recognizers

A few months back, I wrote two posts about Named Entity Recognition (Part 1, Part 2), the language processing task concerned with identifying “entities”, i.e., person/location/organization names, specific expressions such as times and monetary amounts, and other pre-defined categories in a given piece of text. Months later, I came across two talks by Joel Grus, a research engineer at the Allen Institute for AI (the team behind AllenNLP) - one called “Why I don’t like Notebooks”, and a more recent one, given with colleagues from AllenNLP, called “Writing code for NLP Research”. Both touch on my favorite recent rant topics, and as a consequence I started taking a look at AllenNLP.
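
To make “off-the-shelf” concrete, here is a minimal sketch using spaCy as one example of such a recognizer (my illustration, not necessarily the toolkit the posts evaluate; the model name and sentence are just examples):

```python
# A minimal sketch of an off-the-shelf NER pipeline, using spaCy as
# one example of such a recognizer (illustrative only). Assumes the
# small English model is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook visited Berlin in May to meet Angela Merkel.")

# Each predicted entity is a span with a label such as PERSON,
# GPE (geo-political entity), or DATE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```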

Read More

Some thoughts on Tokenization approaches in NLP

I was planning to write more on keyphrase extraction and document tagging in NLP, but got occupied with work. In the meanwhile, I had a few discussions with people about one of the basic pre-processing tasks in NLP - tokenization, i.e., the task of breaking a piece of text into individual tokens, where a token can be a word, a punctuation mark, etc. This step is essential for anything else further down the NLP pipeline in any application (e.g., information extraction, sentiment analysis, machine translation, chatbots, etc.).
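
To see why this is trickier than splitting on whitespace, here is a minimal sketch using NLTK's rule-based Treebank tokenizer (the example sentence is my own):

```python
# A minimal sketch: naive whitespace splitting vs. a rule-based
# tokenizer. Requires: pip install nltk (the Treebank tokenizer
# needs no extra data downloads).
from nltk.tokenize import TreebankWordTokenizer

text = "Mr. O'Neill didn't buy the U.S. startup for $2.5 million."

# Whitespace splitting leaves punctuation glued to words ("million.")
# and cannot separate clitics ("didn't").
print(text.split())

# The Treebank tokenizer splits clitics ("did", "n't") and detaches
# punctuation while keeping abbreviations like "U.S." intact.
print(TreebankWordTokenizer().tokenize(text))
```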

Read More

Keyphrase Extraction from Documents - 1

For the past few weeks, I have been working on automatic keyphrase extraction from documents. This post is the first in (hopefully) a series of posts noting down my observations on the topic.
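
The series will cover proper methods; as a point of contrast, here is a deliberately naive baseline that just ranks a document's n-grams by TF-IDF (scikit-learn is my choice here, and the toy corpus is made up):

```python
# A deliberately naive keyphrase baseline: rank a document's n-grams
# by TF-IDF against a small background corpus. Real methods (the
# subject of this series) add POS filtering, graph-based ranking, etc.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "neural models for named entity recognition in news text",
    "automatic keyphrase extraction from scientific documents",
    "statistical machine translation for low-resource languages",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

doc_id = 1  # pick candidate phrases for the second document
scores = tfidf[doc_id].toarray().ravel()
terms = vectorizer.get_feature_names_out()
for score, term in sorted(zip(scores, terms), reverse=True)[:5]:
    print(f"{term}\t{score:.3f}")
```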

Read More

Readings and Thoughts on deploying NLP/ML models

Over the past few weeks, I spent quite a bit of time reading and thinking about the differences between academic and industry R&D for NLP and Machine Learning systems. This post is a quick summary of my thoughts and readings. The disclaimer: this comes from someone who switched very recently (2 months ago) from years in academia into industry. All views are my own, not my employer’s.

Read More

Thoughts and Experiments with NER-2

In the last post, I wrote about the general idea of Named Entity Recognition and the issues I noticed in the way researchers discuss the problem in contemporary research articles. My point in this post is to show that we don’t need complex models to report state-of-the-art accuracies on standard datasets. I took a freely available dataset (CoNLL-03) and trained three NER models, detailed in the full post.
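
To give a flavor of what a simple, non-neural baseline looks like, here is a hedged sketch of a CRF tagger using sklearn-crfsuite (the feature set and toy data are illustrative, not the post's actual setup or one of its three models):

```python
# A sketch of a simple CRF-based NER baseline (illustrative only; not
# necessarily one of the three models in the post).
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite

def word2features(sent, i):
    """Minimal per-token features; real baselines add shape, affixes, etc."""
    word = sent[i][0]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "prev.lower": sent[i - 1][0].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1][0].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data in (token, BIO-label) form, standing in for CoNLL-03.
train = [
    [("John", "B-PER"), ("lives", "O"), ("in", "O"), ("Berlin", "B-LOC")],
    [("IBM", "B-ORG"), ("hired", "O"), ("Mary", "B-PER")],
]
X = [[word2features(s, i) for i in range(len(s))] for s in train]
y = [[label for _, label in s] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # predicted BIO tags for the first sentence
```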

Read More

Thoughts and Experiments with NER-1

I have been spending quite some time in the past 2-3 weeks reading and working on Named Entity Recognition, aka NER, a sub-problem of information extraction from textual documents that deals with automatically identifying names of people, organizations, and locations, mentions of events, dates, times, etc. Comparing early and recent work, and developing and evaluating a few NER models, I noticed a few issues with the way this is done, and with how it affects the general understanding of the problem and its application in the real world. This series of posts contains my reflections. I want this first post to serve as background; in the next post, I will talk about evaluating existing NER models.

Read More

Talk on Research Methods in Computational Linguistics

I gave a guest lecture today in a research methods class comprising graduate students from Rhetoric and Professional Communication. When Prashant Rajan, who teaches this course, asked me about this a few months back, I asked myself for the first time: “What exactly are the research methods of computational linguistics?” I never had a class dedicated to research methods - we learned the methods of NLP/CL in the classes that taught NLP/CL. I have read about the scientific method and about field-specific methods - for social science research, for psychology research, etc. - and I think many of my friends from those backgrounds had dedicated methods classes. But I don’t remember any CS friends of mine, with or without NLP, talking about a research methods class.

Read More

Teaching Notes - Text Mining in R, for non-programmers

I teach a course, “Language as Data”, introducing text processing and analysis methods to liberal arts majors. It is a new experimental course, first taught in Spring 2017 and now in its second round in Spring 2018. I use R as the programming language, and this series of posts contains my notes and observations about teaching text mining in R to non-programmers.

Read More

Teaching Notes - Machine Translation

This post is a continuation of my previous posts on teaching a 100-level undergraduate course called Language and Computers. As mentioned earlier, it is a very diverse class, and I use this textbook: Language and Computers by Markus Dickinson, Chris Brew and Detmar Meurers. I was quite busy over the past few weeks and almost thought of stopping this series, but decided to see it through to completion. The last topic in my class was “Machine Translation”.

Read More

Teaching Notes - Teaching about 'What is NLP?'

This post is a continuation of my previous posts on teaching a 100-level undergraduate course called Language and Computers. As mentioned earlier, it is a very diverse class, and I use this textbook: Language and Computers by Markus Dickinson, Chris Brew and Detmar Meurers. After finishing four of the eight topics in the textbook, I decided to take a two-week break from its content. One of these weeks was dedicated to introducing the general idea of Natural Language Processing and talking about typical tasks involved, where they are useful, etc. The second week mostly consisted of mid-term group presentations (which I totally enjoyed - more on them in the next post). So, what can we say about NLP in a 100-level course? This post contains my reflections on that.

Read More

Exploratory Factor Analysis in SPSS vs R

I got interested in Exploratory Factor Analysis (EFA) recently, thanks to some of the students with whom I work right now. They come from a background of statistical methods in language testing, where EFA is generally used to validate test items, pondering over questions such as whether the items all assess the same underlying construct or different ones. I saw it as a good method to come up with and validate some hypotheses about data (that, by the way, is not a ground-breaking idea; it has been done before), and as a way to group features based on some notion of a common underlying construct (unlike PCA, whose primary goal is dimensionality reduction), later using Confirmatory Factor Analysis (CFA) to validate the “theory” about feature groups that emerged from EFA.
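
The post itself works in SPSS and R; as a language-neutral illustration of the workflow (fit factors, rotate, inspect loadings), here is a sketch in Python with scikit-learn, on synthetic data:

```python
# A minimal EFA-flavored sketch (the post uses SPSS and R; this
# Python stand-in only illustrates the workflow). Requires
# scikit-learn and numpy; the data is random and purely illustrative.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# 200 "test takers", 6 observed items driven by 2 latent constructs.
latent = rng.normal(size=(200, 2))
loadings_true = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                          [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = latent @ loadings_true.T + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(X)
# Rows ~ factors, columns ~ items: inspect which items load on which
# factor to form hypotheses about feature groups (then confirm via CFA).
print(fa.components_.round(2))
```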

Read More

Teaching Notes - Teaching about search

This post is a continuation of my previous posts (Part 1, Part 2 and Part 3) on teaching a 100-level undergraduate course called Language and Computers. As mentioned earlier, it is a very diverse class, and I use this textbook: Language and Computers by Markus Dickinson, Chris Brew and Detmar Meurers. This post is about Chapter 4 of that book, called Searching. The chapter primarily dealt with the ideas behind searching through unstructured text.
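
One core idea behind searching unstructured text (as I recall the chapter framing it) is the inverted index: map each term to the documents containing it, then intersect those sets for a query. A toy sketch, with made-up documents:

```python
# A toy inverted index: one core idea behind searching unstructured
# text. Documents and query are made up for illustration.
from collections import defaultdict

docs = {
    1: "computers process natural language",
    2: "language tutoring systems help learners",
    3: "search engines index natural language text",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    """Return ids of documents containing every query term (AND query)."""
    results = set(docs)
    for term in query.lower().split():
        results &= index.get(term, set())
    return sorted(results)

print(search("natural language"))  # -> [1, 3]
```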

Read More

Teaching Notes - Teaching about language tutoring systems

This post is a continuation of my previous posts (Part 1, Part 2) on teaching a 100-level undergraduate course called Language and Computers. As mentioned earlier, it is a very diverse class, and I use this textbook: Language and Computers by Markus Dickinson, Chris Brew and Detmar Meurers. This post is about Chapter 3 of that book, called Language Tutoring Systems. The chapter primarily dealt with the idea of Computer-Assisted Language Learning, i.e., designing software to teach a new language and assess its learning.

Read More

Teaching Notes - Teaching about spelling and grammar correction

This post is a continuation of my previous post on teaching a 100-level undergraduate course called Language and Computers. As mentioned earlier, it is a very diverse class, and I use this textbook: Language and Computers by Markus Dickinson, Chris Brew and Detmar Meurers. This post is about Chapter 2 of that book, called Writers’ Aids. The chapter primarily dealt with the intuitions behind spelling and grammar checkers.
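
One classic intuition behind spell checkers is ranking candidate corrections by edit distance; here is a minimal sketch of Levenshtein distance (the lexicon and typo are made up, and the chapter's own treatment is in the book):

```python
# A minimal Levenshtein (edit) distance: a classic intuition behind
# ranking candidate corrections in a spell checker.
def edit_distance(a, b):
    """Number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# Rank dictionary words by distance to a misspelling (toy lexicon).
lexicon = ["grammar", "grandma", "hammer", "glamour"]
typo = "grammer"
print(sorted(lexicon, key=lambda w: edit_distance(typo, w)))
```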

Read More

Teaching Notes - Teaching about encoding language on computers

I teach a 100-level course called Language and Computers to a class of undergrads who are at all stages of their degree programs and come from diverse backgrounds. About half the class is from CS and related disciplines, but there are also students from Journalism and Mass Communication, Physics, Biology, Chemical Engineering, and even Advertising, to name a few. This post (and a few more to come) contains mostly my notes on teaching them (it is my first time with this course) and on what to change for the next iteration, whenever that happens.

Read More

Universal Dependencies for several languages

Thanks to recent discussions with @phylostar, I have been reading quite a bit on Universal Dependencies (UD) and on the creation of manual annotations following the UD scheme. While there are elaborately written guidelines on the website, clearly, each time you start working with a new language (especially one not very close to English), you need to figure out how to accommodate certain language-specific elements into UD without losing the “universal” part or losing adequate representation of the language's phenomena. This was the main theme of my readings last week about the creation of UD treebanks for various morphologically rich languages. [2] gives a general overview of the idea of UD for uninitiated researchers. I had used UD in the Stanford parser before, but never knew it now exists for over 50 languages, with several more getting ready!
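
UD treebanks are distributed in the CoNLL-U format; here is a hedged sketch of reading one with the Python `conllu` package (the sentence and its annotation are my own toy example):

```python
# A sketch of reading UD annotations in CoNLL-U format with the
# 'conllu' package (pip install conllu). The sentence is made up.
from conllu import parse

data = """\
# text = She reads books.
1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_
2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_
3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""

for sentence in parse(data):
    for token in sentence:
        # Universal POS tags and dependency relations are the
        # language-independent core of the UD scheme.
        print(token["form"], token["upos"], token["head"], token["deprel"])
```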

Read More

Reading Notes - Active comprehension (1982 article)

This post contains my quick notes on the following 1982 article:

Singer, Harry and Donlan, Dan. Active comprehension: Problem-solving schema with question generation for comprehension of complex short stories. Reading Research Quarterly, pp. 166–186, 1982.

Read More

First Post

My first post has to be about why I chose to set up another blog when I have already had one for about 11 years. I used to write about all sorts of things over there, but repeated readings about Jekyll tempted me into trying this out. So, I hope to write my technical posts on this blog from now on.

Read More