A 1980s discussion in CL Journal on parsing ill-formed input
Having 5 hours to kill in the airport, and not wanting to browse, I started checking the old issues of Computational Liguistics journal, which was then called “American Journal of Computational Linguistics”. The ones I checked out were either book reviews or short communications (There is was one review of a conference on Computational Linguistics in Medicine, from 1980!).
Then, I found these two single page commentaries, which compelled me to post this.
Primary motivation for this writeup comes from the idea that natural language understanding systems should be able to handle different kinds of inputs coming from different sets of users. They show a small study where they asked students to issue a query in natural language, to get information from a database (early precedent of a search engine interface?) and gave some stats discussing about the ungrammatical aspects of the queries, non-canonical language etc, and conclude that about 1/3rd (32.8%) of these user generated queries had some ungrammatical aspect in it. That is it. That is how, in one page, they make their point and close! :-)
As a counter argument, this came out in 1983:
They argue that while ill-formed input can be expected, humans are adaptable, and we many not need to worry about them giving ill-formed inputs as much. They present an evaluation of a voice driven office automation system where they instructed participants to speak slowly, and in discrete words (one word per second, for example). Their system rejected voice inputs it cannot understand and the users are forced to slowly adapt to the system, and speak in a way it can understand. So, in around 1600 sentences they collected, only 10 were considered as ungrammatical.
So, they conclude: “As a consequence of these observations, we have practically discontinued our efforts to parse ill-formed sentences”
I found the exchange quite fascinating - now, you see a lot of discussion on such ill-formed inputs (learner written data, social media posts etc), and even these voice-input systems, dialog systems etc seem to be keen on adapting to different languages, different accents etc. I wonder how these authors, if they are still active, think about all these recent developments!!!