Transcript – Breaking Barriers: Exploring Voice and Accessibility in the Digital World
EPISODE 4

“RP: What machine learning and these new advances are doing, though, is allowing us to create those voices in less time, for less money, and with less data. That is what is incredible.”

[INTRODUCTION]

[00:00:13] MM: Welcome to Veritone’s Adventures in AI, a worldwide podcast that dives into the many ways technology and artificial intelligence are shaping our future for the better. Veritone disclaims any responsibility for any statements of guests on the podcast. The views expressed in this podcast are those of the interviewee and do not necessarily reflect the views of Veritone or its directors, officers, or employees. I'm your host, Magen Mintchev. I’m here with Dr. Rupal Patel, who is the VP of Voice and Accessibility at Veritone. She's also a professor at Northeastern University in the areas of speech science, audiology, and information science, and she has given two TED Talks and numerous speeches on the subject of voice.

[INTERVIEW]

[00:00:56] MM: Thank you for joining the Adventures in AI Podcast, Rupal.

[00:01:02] RP: I'm so excited to be here, Magen.

[00:01:03] MM: Awesome. Why don't you tell us a bit about yourself and when you first knew you were passionate about voice?

[00:01:11] RP: Yeah. I'm a speech scientist and got my start in my career working with people with speech disabilities. Before that, though, I got really interested in the human brain and did my undergrad in neuroscience. What really got me excited, when I was trying to figure out what I was going to do with my life, was when I started working with people with brain injuries and noticed that their speech was impaired and so on, and yet the brain could rewire itself. My interest in voice really came about as I started working more as a speech-language pathologist and saw what the impact was when someone loses their ability to speak or to construct sentences. That's when I got interested in how we could use different technologies to compensate and help individuals who are suffering from more severe speech and language issues.

[00:02:03] MM: I did mention in the beginning that you have given two TED Talks. Can you talk briefly about those experiences and the subjects you discussed in them?

[00:02:12] RP: Sure. I gave a TED Talk at TEDWomen in 2013. This was just an incredibly life-transforming event in the sense that, prior to this, I was doing research in my lab at Northeastern where we had two streams of research. One stream was on the basic science of speech production: how do people produce speech, from babies through individuals who have lost their ability to create fluent speech because of some neuromotor speech impairment? We did basic, what we called fundamental, research there. Then, in the same lab, we would also work on more interdisciplinary teams with students in computer science or engineering to actually build dedicated communication devices, assistive communication devices that leveraged whatever the person could still do with their voice. We built a number of different kinds of technologies.
Some were for people who were really locked in and had very little control of anything other than maybe their eyes or hands to communicate, so they would type in their messages and the messages would be spoken aloud by a system. Or we would work with little kids who were learning to talk, to see if there was some combination of clarity in how they produced speech that we could get these machines to reproduce. It was these two streams of research coming together, and one of the projects in the lab was something that at that time we called VocaliD. VocaliD was a project where we had learned that people who couldn't speak very clearly could still control some part of their voice called the prosody of their voice. These are changes in the pitch of the voice and the loudness; it is really the melody of speaking. Yet when these individuals had to use a speaking device, they were all given the same voice, and at that time it was really just generic-sounding voices. What we did in the lab was figure out a way to combine the voices of everyday people with the sounds these individuals could still make to create unique synthetic voices. That was what VocaliD was. The first TED Talk was essentially about this technology, where we could combine the voices of different individuals to create unique-sounding voices for people who use an augmentative and alternative communication method. Think of someone like Stephen Hawking, an individual on the autism spectrum, or someone with cerebral palsy who has to use a talking box as their main method of communication. The voice that came out of those talking boxes at that time was pretty generic sounding. We were able to create a unique-sounding voice in the lab for them. So that was what the first TED Talk was all about: how do we do that, and what's the actual science behind it? At the end of my talk, as a last-minute throw-in a week or two before the actual talk, I asked the question: what if you could donate your voice to someone in need? That's what started this whole thing around taking the technology out of the lab and creating a company around it.

[00:05:13] MM: Amazing. I love that. I just got goosebumps. Can you tell us about how the technology works?

[00:05:20] RP: Sure. A customized synthetic voice is essentially a combination of one or more people's voices blended together. If someone can't speak very clearly, we take whatever sound they can still make, and then we find someone in our voice bank of people who have contributed their voice who's a close match to that individual and who's recorded maybe an hour or so of speech. Then we combine those voices together to train a computer model that can speak like them. It's not just creating a synthetic voice clone of another person in this world. It's actually more like combining several people's voices together to create a unique voice for that individual. Then, once it's built, it's basically a text-to-speech application. Now you have a customized text-to-speech voice that can be used on any assistive communication device where text is entered and you want to turn it into speech, and you want that speech to sound like a particular person.
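To make that pipeline concrete, here is a minimal sketch, in Python, of the donor-matching step: each banked voice is summarized by a few prosodic features of the kind Patel mentions (pitch and loudness), and the bank is searched for the donor closest to the recipient's residual vocalizations. The VoiceProfile fields, the normalizing constants, and the nearest-neighbor metric are illustrative assumptions; the episode does not describe VocaliD's actual matching or blending algorithm, and the model-training step that follows matching is omitted here.

```python
# Hypothetical sketch of voice-bank matching; not VocaliD's actual algorithm.
from dataclasses import dataclass
import math

@dataclass
class VoiceProfile:
    speaker_id: str
    mean_pitch_hz: float     # average fundamental frequency (melody)
    pitch_range_hz: float    # how much the pitch varies while speaking
    mean_loudness_db: float  # average intensity

def distance(a: VoiceProfile, b: VoiceProfile) -> float:
    """Euclidean distance over roughly normalized prosodic features."""
    return math.sqrt(
        ((a.mean_pitch_hz - b.mean_pitch_hz) / 50.0) ** 2
        + ((a.pitch_range_hz - b.pitch_range_hz) / 20.0) ** 2
        + ((a.mean_loudness_db - b.mean_loudness_db) / 6.0) ** 2
    )

def best_donor(recipient: VoiceProfile, bank: list[VoiceProfile]) -> VoiceProfile:
    """Pick the banked voice closest to the recipient's residual vocalizations."""
    return min(bank, key=lambda donor: distance(recipient, donor))

# Example: match a recipient against two made-up voice donors.
bank = [
    VoiceProfile("donor_a", 220.0, 60.0, 62.0),
    VoiceProfile("donor_b", 130.0, 40.0, 60.0),
]
recipient = VoiceProfile("recipient", 210.0, 30.0, 58.0)
print(best_donor(recipient, bank).speaker_id)  # -> donor_a
```

In a real system, the matched donor's recordings and the recipient's own vocalizations would then be blended to train the text-to-speech model Patel describes.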
[00:06:19] MM: Okay. Who would be some of the users of this technology?

[00:06:22] RP: Some of the users could include people who have ALS or multiple sclerosis, children on the autism spectrum who are non-verbal, and people who were born with cerebral palsy and can't speak clearly enough, so they have to use a device to talk. There's an entire cottage industry of these kinds of applications and devices; they're called AAC devices. That stands for Augmentative and Alternative Communication devices. It depends on what abilities they still have. Some people need to use an eye tracker to select the messages on their screen. Others can actually type with a finger, maybe not fully type, but peck at the different keyboard items to create a message. Once they finish creating their message, they hit the speak button. If they have a customized voice, it will speak whatever message they've constructed in that voice.

[00:07:12] MM: That's amazing. You mentioned ALS, too. Actually, my grandfather had that. He passed away from it, of course, many years ago. I think it was 1998. It would have been really advantageous for him to have this, although he only had it, at least as far as we knew, for six weeks. He was diagnosed and then we lost him to it, so it was quick. But people really need to know that these types of technologies are out there for people with those types of disabilities.

[00:07:41] RP: Yeah. I mean, it's been possible to recreate someone's voice for several decades now. What machine learning and these new advances are doing, though, is allowing us to create those voices in less time, for less money, and with less data. That is what is incredible. When we started VocaliD in the laboratory, you would need several tens of hours of audio data. Now we need 90 minutes of data to make a high-quality, enterprise-grade voice. You can make a voice with any amount of data. Some companies say five minutes of data. Sure, but it's not really going to sound like that person. The interesting thing here is that these new methodologies are really advancing how quickly we can make the voices, and really at scale, so it makes it affordable for people who have disabilities as well.

[00:08:36] MM: Are there other applications of this technology for people with other disabilities?

[00:08:42] RP: Yeah, absolutely. People with visual impairments who use screen readers, which right now are pretty generic sounding, could benefit from a personalized synthetic voice, especially if they lost their eyesight later in life and are used to hearing their own writing voice when they read aloud content they have written. It's really hard to hear that in a generic voice, so if they had a customized synthetic voice reading aloud their own content, it could help with creativity. We've had lots of requests for that. There are also children with learning disabilities. Not all synthetic voices, or even all human voices, are understood in the same way by people who have different kinds of auditory processing impairments. So for educational products where they're getting information through a computerized voice, it would be great for them to be able to select the voice. Those are some applications that come to mind.
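As a rough illustration of the AAC flow Patel describes above, where the user composes a message by typing, pecking, or eye-gaze selection and then hits the speak button, here is a toy Python sketch. The speak_with_voice function is a hypothetical stand-in for a real speech synthesizer; no actual TTS engine or API is named in the episode.

```python
# Toy model of an AAC compose-and-speak loop; speak_with_voice is a stand-in.
def speak_with_voice(text: str, voice_id: str) -> None:
    # A real device would synthesize audio here using the user's
    # customized synthetic voice; printing just simulates that step.
    print(f"[voice: {voice_id}] {text}")

def compose_and_speak(selections: list[str], voice_id: str) -> None:
    """Chain selected words or symbol labels into a message, then speak it."""
    message = " ".join(selections)
    speak_with_voice(message, voice_id)

# Example: a preliterate user chains symbol labels into a sentence
# and "speaks" it with their own customized voice.
compose_and_speak(["I", "want", "juice"], voice_id="custom_voice_001")
```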
[00:09:39] MM: How does Voice AI assist with daily tasks and routines, helping people maintain more of their independence? I'd like to dive a little bit deeper into that. I know we talked a little bit about it, but is there anything else it can really assist with?

[00:09:56] RP: Well, I'll give you an example. Let's say you are a child with cerebral palsy and you go to a classroom with other kids who have similar types of physical as well as speech-motor difficulties. You have to use a device to talk, so you're either typing or using your eye gaze to select, on your dedicated device, the items you want to communicate about, right? If you're literate, you might actually type out messages. If you're not yet literate, you might use symbols that, when chained together, make a sentence. At the end, you actually have to hit the speak button for that message to be heard aloud by your peers, by your teacher, and so on. Imagine that most of the kids in that classroom are using devices like this, but they're all using the exact same voice. It's so hard for the teacher to identify who's talking if they're not directly looking at the person. But it's also a matter of identity. Some of our very early users were kids in some of these schools, and what we found relates to self-esteem. Your talking device is supposed to be an extension of you, right? It's supposed to be the way you communicate and let the world know your thoughts and ideas. If the voice coming out of it sounds like an ATM, or like another kid in your class, or like a robot, there is no sense of self in it. It actually becomes a vicious circle: there's not as much joy, or independence, in communication when you can't communicate your thoughts as yourself. Yes, words themselves have meaning, but how they're said, and feeling like they are your words, is actually super important for empowerment. I think that's part of what this technology does: it empowers. Speaking devices have been around for several decades now, but what hasn't been around is the ability, at scale, to make a customized voice that goes on those speaking devices. People thought that was just icing on the cake, a nice-to-have. But when you start seeing that communication rates increased dramatically when people had their own customized voice, and that people were more willing to talk around unfamiliar listeners with their customized synthetic voice, you could see that the impact was more than just aesthetic. It was really life-changing and life-altering.

[00:12:28] MM: Yeah. I mean, giving them a voice, but beyond that, they already had a voice before really fine-tuning it, so to speak, and now this is their actual voice, like you said, an identity, a VocaliD for them. That's absolutely incredible.

[00:12:48] RP: It's also the fact that when you have a generic-sounding voice, people don't really think that there's much cognition there, right? A robotic-sounding voice makes it feel a little bit like the user of that voice doesn't have a personality or may not even understand what they're writing. Yet as soon as you have these unique identities, one of the really interesting things that happens in the mind of the listener is: oh, okay, I'm interacting with another individual, right?
It's not just for the speaker that this technology is beneficial; it's also for the listeners, and for the opportunities it allows these individuals, because now people see them more fully. They actually hear them. They actually see them as individuals and not just as a group of people who have to use a talking device.

[00:13:36] MM: That is a great point. I never would have thought of that, so thank you for pointing out that perspective. How does this work with household devices? Is it just all around the same? Is this something that people can take with them, that's always with them? I don't know how it's integrated into that at all.

[00:13:56] RP: Yeah. Many of these are dedicated devices, because every individual who has a neuromotor disorder will have different capabilities. Some may still have limb control and be able to access a keyboard; others will not have that control and might only have control over their eyes. So many of these devices are full-fledged systems, and there's a whole cottage industry of what are called AAC devices, Augmentative and Alternative Communication devices. There are also some apps out there that work on iPhones and iPads. It depends, again, on the person's motor abilities, which call for different input methodologies, as well as the different voices that you can put on them. It's not like everybody uses one device; people use a variety of different devices. Then, with Amazon Alexa, Google Home, and devices of that sort, what we have seen is that it was actually very difficult for those kinds of applications to recognize the synthetic voices. Someone who can't speak would type into their device and have it speak aloud to do, let's say, environmental control, like telling Alexa to shut off the lights using their computer voice. The quality of those voices, partly because they play through not-so-great speakers, often wasn't good enough for accessing things like environmental controls. As our synthetic voices improve, it is becoming easier to access conversational devices, in-home voice assistants, and things of that sort.

[00:15:27] MM: That's wonderful. I can't imagine where this is all going to be even five, ten, fifteen years from now. Speaking of that and future developments, where do you see this improving or going?

[00:15:43] RP: Yeah. That's a fantastic question, Magen. It's interesting, because when I started VocaliD seven years ago, it was unthinkable that you could create a customized voice for under a few thousand dollars. That was the anchor at the time, and it's unbelievable what has happened since. It's a combination of two things. One is the technology advancement, but there's also this whole other class of technologies that have come in and made voice a primary mode of communication for all of us, right? Whereas text was the primary mode in the early 2000s, nowadays many things talk to us, and we talk to many things through an intermediary piece of technology. There's this perfect storm where the synthetic voice technology is improving while everyday individuals are using more and more things that talk, which will have a trickle-down effect, an impact on people who have to use these devices as their everyday mode of communication.
I think one of the big things I see is that, in the past, we have thought about assistive technology, or technologies for people with disabilities, as an afterthought: something we do once everybody else has the technology and we start thinking about who else can benefit from it. What I loved about what we were doing is that we were thinking about the people who needed it most. That then had the ability to spread to the rest of us, who could also benefit from it. As you know, at Veritone we're now creating branded voices for organizations and companies, but it started with the need of an individual. Where do I see it going? Well, the voices are going to get more natural; that's one. But more than that, there's the inclusivity piece, the importance of voice. I don't even think we have to wait five or ten years; I feel like we're already starting to understand that it's actually so important. Voice is not just a mode of communication. It's not a commodity. It's actually part of personality. It's part of the value proposition.

[00:17:49] MM: Necessity, really. The whole reason why this is now a thing, a very amazing thing. I love the example you gave a little bit ago, but I would love to hear more success stories. Can you share one or two of how Voice AI has really helped people?

[00:18:09] RP: Yeah. We work a lot with veterans. There's a high incidence of head and neck cancer in that population, and we have been working with them to bank their voices proactively. Individuals who are susceptible to or may be at risk for head and neck cancer bank their voice ahead of time, or as soon as they know that they may be having a glossectomy, which is a removal of part or all of the tongue, or a laryngectomy, which is a removal of the voice box. Prior to the voice loss, we bank the voice, and then we can recreate it so that they can continue to use it and have as active a life as possible, because they have full control over their hands and can use a device to type in their messages, speak as themselves, and be heard as themselves. What is really exciting to me is that we have been making a huge impact there, partnering with different VA organizations across the country to increase awareness that this service and this technology even exist, because many people don't really think about it. They don't think it's going to be important until the day they lose their voice. Banking your voice as voice insurance is possible now; we should be doing it. That's one success story, I would say. Some of the other success stories are around seeing younger kids using their customized voices, where it just changes the trajectory of how they see themselves as communicators and what they want to go off and do. One particular girl comes to mind. She used a device to talk and also signed, because signing was just easier for her; she didn't have to have a device with her all the time. But many people didn't know how to understand her sign language, so sometimes she had to use her voice, and she was very limited in who her communication partners could be. Then, when she got her own personalized voice, she felt like she could use her device in situations where she didn't know people.
She ended up getting a summer job as a summer camp leader with other kids who had different kinds of needs. It just changed how she saw herself as a communicator. I have so many stories like that, but those are two examples that come to mind.

[00:20:27] MM: I can't imagine the amount of confidence that instilled in her. Very, very cool. How can someone learn more about voice and accessibility?

[00:20:39] RP: Yeah. We still have our VocaliD website; VocaliD is a Veritone company now. You can go to vocalid.ai/voicebank if you want to contribute your voice. On the Veritone website, we also have links to the work that we're doing with brands as well. It really depends on the application people are looking for, because for some applications for people with disabilities, they may just need unique-sounding voices for, let's say, educational materials, but in other cases, if they are actually creating a customized voice for an individual, they may need to approach us in a slightly different way. Either the veritone.com website or the vocalid.ai website will get you there.

[00:21:24] MM: Wonderful. I will definitely be sure to include both of those websites in the show notes of this episode. I feel like I'm on cloud nine hearing about the good that you are doing for people with disabilities, people who don't have a voice. You are giving them a voice. Kudos to you, kudos to your team, and everything. I think it's fantastic.

[00:21:48] RP: Well, I want to call out someone on our team who is doing an incredible job, Regina McCoy. There are a lot of people behind what we're doing. It's been a labor of love and passion, and we're so psyched that we can continue to do this here at Veritone.

[00:22:04] MM: Thank you, Dr. Rupal Patel. It's been amazing speaking with you about this.

[OUTRO]

[00:22:11] MM: This has been another episode of Veritone’s Adventures in AI, a worldwide podcast that dives into the many ways technology and artificial intelligence are shaping our future for the better. Talk with you next time.

[END]

Guests
Rupal Patel
VP, Voice and Accessibility, Veritone