Teaching Notes - Speech Recognition and Synthesis

This post is a continuation from my previous posts on teaching a 100-level undergraduate course called Language and Computers. As mentioned earlier, it is a very diverse class, and I use this textbook: Language and Computers by Marcus Dickinson, Chris Brew and Detmar Meurers.

The topic for this week is Speech processing. I spent 3-3.5 classes on this, and spoke about speech recognition, speech synthesis, and different application scenarios for both. I then spent some time discussing one specific application - automated scoring of spoken responses as this is something I have been reading about, and working on a little bit.

I spent some time introducing about language specific aspects of speech processing by asking a series of questions. We discussed about the role of pronunciation dictionaries, reasons for pronunciation variation among people of various dialects, and even within one person’s speech. I spent a few minutes using google speech recognition as a demo, since it supports multiple languages and accents.

In the next class, I focused exclusively on Speech Recognition. We discussed general ASR system architecture - different steps in transforming speech to text, and what are the challenges for each step. We briefly discussed google’s ASR on google docs and swiftscribe.ai, and concluded with some discussion on how to evaluate ASR. The next class was on Speech synthesis. I spoke about why is it not easy as it may appear, what are potential issues, how are they handled, and where is such a thing useful. We also saw videos of Stephen Hawking speaking about his speech synthesizer, and a demo of early speech synthesizers of 20th century and more recent ones such as lyrebird and google translate. Finally, we discussed about the evaluation of speech synthesizers.

Finally, I started a discussion on speech scoring, for two reasons:

I used essay scoring as one of the examples while talking about text classification. So, I thought it is a good topic to connect text and speech processing.
Thanks to an enthusiastic student here, I started working with speech data these days, and I find the problem fascinating. So, I briefly talked about SpeechRater system by ETS (linked above) and at what point will there be overlap between processing text and processing speech.

While teaching, I realized the fun of teaching a 100-level class on this topic. We get to appreciate all these technologies much better, without worrying about underlying complex models. We also had lots of interesting discussions and group exercises on interesting, not yet fully solved problems such as - language understanding, ASR when people switch between multiple languages, Scoring of speech, speech to speech translation etc. Our next topic is machine translation. I gave a NACLO exercise - Running on MT in the class to make them think in the direction of the topic. More on this in the coming weeks!

Written on November 11, 2017