Inclusive AI – Voice: Speech Recognition AI Needs Diverse Datasets

Have you ever had a conversation with a chatbot? Or given voice commands to a home assistant or in-car infotainment system? Many AI products use voice to interact with their users. It’s a convenient way to relay information to customers without needing to involve a human, and it can be extremely helpful for people with disabilities. But voice products face a major challenge: understanding human speech, or natural language. Natural language spans countless languages, dialects, tones, pitches, word choices, abbreviations, and other variations. A computer may understand one customer well, but understanding many customers, each with different speech patterns, is far harder. Fortunately, if the AI is trained on speech data that covers these variations, it can learn to understand not just one human, but many. Speech data, then, needs to be collected from and labeled by people who represent the speech patterns of the AI product’s end users. Given that most AI products aim to appeal to a wide variety of end users, there’s a huge need for diverse data annotators like Appen’s global crowd.


Representation in Voice Data

First, let’s explain how voice-powered AI works in simple terms: these models rely on automatic speech recognition (ASR) and natural language processing (NLP). ASR is the process of converting spoken words to text. Computers apply ASR first, then use NLP to interpret that text and generate an appropriate response. Imagine you’re building a product similar to Amazon’s Echo or Google Home. You want your customers to be able to ask questions and obtain information quickly, and to issue commands that the product will act on (such as turning off the lights in their home). Now think about those customers: if your product is going to compete with Amazon or Google, they will come from all over the world and speak a variety of languages. You’ll want voice data that includes people speaking not only every language in your target markets, but also every possible command a customer may give your product. That’s a lot of data!

But here’s why it’s important: AI needs to work for everyone equally, or it isn’t performing ethically. For example, an in-car infotainment system trained primarily on speech data from male speakers will naturally perform worse for female speakers, who tend to have a higher pitch and different speech patterns (and this has in fact been a common complaint about in-car systems). That’s a frustrating and unfair experience for women, one that limits how usable the product is for them and damages brand reputation. It’s critical that speech data capture the full range of voices and commands that end users may produce. As a consequence, the people who provide and annotate speech data must represent the full breadth of human speech across all potential customers.
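
To make the two-stage pipeline concrete, here is a minimal sketch in Python. It assumes the open-source SpeechRecognition package for the ASR step, and the keyword-based intent matching is purely illustrative; a production assistant would use a trained NLP model rather than hand-written rules.

```python
# Minimal ASR -> NLP sketch (assumes: pip install SpeechRecognition).
# The intent rules below are a toy stand-in for a real NLP model.
import speech_recognition as sr


def transcribe(audio_path: str) -> str:
    """Stage 1 (ASR): convert spoken audio to text."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    # Uses Google's free web speech API; other engines can be swapped in.
    return recognizer.recognize_google(audio)


def interpret(text: str) -> str:
    """Stage 2 (NLP): map the transcript to an action."""
    text = text.lower()
    if "light" in text and ("off" in text or "out" in text):
        return "smart_home.lights_off"
    if "weather" in text:
        return "info.weather_report"
    return "fallback.ask_to_rephrase"


if __name__ == "__main__":
    transcript = transcribe("command.wav")  # hypothetical sample file
    print(transcript, "->", interpret(transcript))
```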

Voice Project Highlights

There’s no better way to illustrate the power of representative data than through examples. Here are a few interesting voice projects Appen has been tasked with:

Global Tech Firm Expanding to New Markets

Most speech recognition systems are tailored to adult voices, even popular speech products aimed at children. The problem is that children speak at a higher pitch and with more mispronunciations and irregularities than adults. One multinational tech company ran into this issue while working on an automatic speech recognition system for children. They approached Appen for our expertise in languages and speech recognition, wanting our help capturing audio of children’s speech. Using a network of schools and churches, we recruited hundreds of children (via their interested parents) across several demographics, who then recorded snippets of selected speech. Through this project the client obtained 105 hours of audio, which they used to train their speech recognition models. Those models went on to be widely used in the child education and infotainment space. The project is a fascinating example of the kind of diversity AI can require, in ways many of us might not expect.

Dialpad Improving Speech Recognition Software

Dialpad (formerly TalkIQ) collects telephone audio, transcribing and processing it with speech recognition and natural language processing models. They use this data to identify what companies and service reps are doing well and what they aren’t. Each company Dialpad works with is different from the others, with its own vernacular and target markets. Dialpad therefore needed training datasets unique to each of their clients, and they turned to Appen to build them. With our global crowd of annotators, Dialpad could develop training data covering all of their clients’ target markets. For example, Dialpad used Appen’s geolocation tools to ensure British annotators labeled idiomatic speech from the UK. With our diverse selection of annotators, Dialpad improved the accuracy of their training data from 70% to 88% and could serve clients in a greater variety of locations.
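
As a rough illustration of that kind of locale matching (a hypothetical sketch, not Appen’s actual tooling, whose internals aren’t public), a task router might pair each audio clip with annotators from the matching region:

```python
# Hypothetical sketch of locale-matched task routing: pair each clip with
# annotators from the same region so local idioms are labeled by native
# speakers. All names and fields here are illustrative.
from collections import defaultdict

annotators = [
    {"id": "a1", "locale": "en-GB"},
    {"id": "a2", "locale": "en-US"},
    {"id": "a3", "locale": "en-GB"},
]

clips = [
    {"id": "c1", "locale": "en-GB", "text": "pop it in the boot"},
    {"id": "c2", "locale": "en-US", "text": "put it in the trunk"},
]


def route_by_locale(clips, annotators):
    """Group annotators by locale, then assign each clip to the pool
    matching its region (falling back to all annotators if none match)."""
    pools = defaultdict(list)
    for annotator in annotators:
        pools[annotator["locale"]].append(annotator["id"])
    assignments = {}
    for clip in clips:
        pool = pools.get(clip["locale"]) or [a["id"] for a in annotators]
        assignments[clip["id"]] = pool
    return assignments


print(route_by_locale(clips, annotators))
# {'c1': ['a1', 'a3'], 'c2': ['a2']}
```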

Automotive Software Provider Creating Smarter In-car Infotainment

One of the top complaints of new car owners is that their in-car infotainment system doesn’t recognize their voice commands. In-car infotainment systems are built on speech recognition models and must be trained on speech data that accounts for how all potential customers talk. If I want my in-car system to turn on the air conditioning, I might say “Turn on the air conditioning,” but another customer might say “Turn down the temp in here a couple degrees.” And that’s before accounting for differences in tone, pitch, and language. When an automotive software provider came to Appen for help with this issue, we supported them with in-market, on-demand annotators. These annotators recorded their voices making requests (e.g., “Can you turn down the radio volume?”) in their native languages, and even tested their speech patterns in driving simulations to capture data closer to real-world conditions. As a result of the project, the auto software provider obtained training data covering over 40 languages; this data helped create a smarter in-car system that could understand the variety of speech patterns its customers had.
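
To see why phrasing variety matters, consider a toy intent matcher: both of the climate commands above should resolve to the same climate-control action. The rule-based Python sketch below (purely hypothetical) shows the idea; a production in-car system would use a model trained on thousands of real recorded utterances rather than hand-written patterns.

```python
# Toy sketch: many phrasings, one intent. Hand-written regex rules stand
# in for a trained intent model; a real in-car system learns these
# mappings from diverse recorded commands.
import re

INTENT_PATTERNS = {
    "climate.adjust": [
        r"\bair[- ]?con(ditioning)?\b",
        r"\btemp(erature)?\b",
        r"\b(warmer|cooler|degrees)\b",
    ],
    "media.volume": [
        r"\bvolume\b",
        r"\b(radio|music)\b.*\b(up|down)\b",
    ],
}


def classify(utterance: str) -> str:
    """Return the first intent whose patterns match the utterance."""
    text = utterance.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return intent
    return "unknown"


# Different phrasings, same underlying request:
print(classify("Turn on the air conditioning"))                 # climate.adjust
print(classify("Turn down the temp in here a couple degrees"))  # climate.adjust
print(classify("Can you turn down the radio volume?"))          # media.volume
```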

The Future of Voice in AI

We still have a long path ahead of us in creating equitable speech solutions in AI. The core problem remains a lack of representation in the speech data itself, and it will persist as long as companies don’t draw on diverse populations for data collection and preparation. But with time, as more speech data is created and made available, we should see voice-powered AI improve. It will be interesting to watch how virtual assistants, chatbots, in-car infotainment systems, and other existing products develop, and worth keeping an eye out for exciting new applications of voice AI.
