08.28.24

Voice Anonymization in the Public Sector: Protecting the Identities of Speakers

By Markus Toman and Dietmar Schabus 

Summary: 

  • Various methods, such as voice conversion and speech synthesis, are being explored for effective voice anonymization. The challenge lies in balancing anonymization with maintaining essential voice characteristics like dialect and gender, while avoiding overly generic or distorted results.
  • Veritone Labs is developing voice anonymization technologies to protect speaker identities while maintaining the content’s usability, addressing privacy concerns in law enforcement, emergency services, and public health.
  • Veritone Redact is an AI-powered tool designed to automatically redact sensitive information from audio, video, and images by pixelating visuals or muting audio to protect personal data and comply with public records requests.

One of Veritone’s key products for the public sector is Veritone Redact, an AI-powered software that automates the redaction of sensitive information within audio, video and images. A typical scenario is to pixelate, blur or over-paint things like faces, license plates and laptop screens in the video material, and to bleep or mute speech segments in the audio that contain personal information like names, addresses etc. Such redaction is required to keep witnesses safe and to comply with freedom-of-information laws and other public records requests.

This article explores a different kind of redaction: sometimes we still need to hear what is being said in an audio or video recording, but must prevent listeners from identifying the speaker by their voice. To this end, the Veritone Labs team is currently researching technological approaches and building prototypes for voice anonymization.

The importance of voice anonymization

Voice recordings can inadvertently reveal personal information, making individuals vulnerable to identification and potential misuse of their data. In the public sector, this risk is heightened due to the sensitive nature of the information often involved. For example:

  • Law Enforcement: Witnesses and informants may be reluctant to come forward if they fear their voices could be recognized.
  • Emergency Services: Callers reporting crimes or emergencies need assurance that their identities will be protected.
  • Public Health: Patients and healthcare workers discussing sensitive health information require confidentiality.

Voice anonymization ensures that while the content of these recordings can still be used for necessary purposes, the speaker’s identity remains protected.

Before: [audio example]

After: [audio example]

Degree of anonymization

The degree of anonymization in voice recordings involves a delicate tradeoff between obscuring identifiable features and preserving essential characteristics such as dialect, gender, and age. High levels of anonymization can effectively prevent identification, but may inadvertently strip away nuances critical for context and understanding. For instance, retaining dialect can be important for linguistic and cultural insights, while preserving gender and age can be essential in contexts like public health announcements or emergency services, where these attributes might influence the interpretation of the message. 

However, balancing these aspects requires sophisticated algorithms capable of anonymizing voices without rendering them overly generic or unnatural. Striking this balance is crucial for ensuring that voice anonymization serves its protective role without compromising the usability and informational value of the recorded speech.

For example, consider perceived gender identity, as heard below:

Before: [audio example]

After: [audio example]

Methods of voice anonymization

There are several techniques for anonymizing voices, each with its own advantages and challenges. Voice transformation techniques modify the speaker’s voice characteristics, such as pitch, speed, and tone, to render the voice unrecognizable. Simple transformations, like shifting to an extremely low pitch (as a movie villain might use in a telephone call), are unreliable for scenarios like witness protection because they can be reversed to reconstruct the original speech recording.
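The reversibility of such a naive transformation can be illustrated with a toy example: a crude pitch shift implemented as plain resampling is undone almost perfectly by resampling with the inverse factor. The function name and parameters below are ours, chosen purely for illustration.

```python
import numpy as np

def naive_pitch_shift(signal: np.ndarray, factor: float) -> np.ndarray:
    """Crude pitch shift by resampling with linear interpolation.
    factor < 1 lowers the pitch (and stretches the signal), factor > 1 raises it."""
    n_out = int(len(signal) / factor)
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

# A 440 Hz tone, "disguised" by dropping the pitch one octave.
sr = 16000
t = np.arange(sr) / sr
original = np.sin(2 * np.pi * 440 * t)
disguised = naive_pitch_shift(original, 0.5)

# An attacker who guesses the factor simply applies its inverse.
recovered = naive_pitch_shift(disguised, 1 / 0.5)[: len(original)]
print(np.max(np.abs(recovered - original)))  # small: only interpolation error remains
```

The recovered signal deviates from the original only by tiny interpolation artifacts, which is why such transformations offer no real protection.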

Similarly, methods that add noise, distort the voice signal, or overlay other sounds must be assumed to be reversible. More advanced algorithms make reversal extremely difficult, if not impossible, while ensuring that the transformed voice still sounds natural and intelligible. Below we present a prototype that adapts a voice conversion (VC) method for anonymization.

Another approach is speech synthesis, which replaces the original voice with a synthetic voice that retains the content but sounds entirely different. A common pipeline first converts the speech to text using an automatic speech recognition (ASR) system and then regenerates it with a text-to-speech (TTS) system. This can completely remove vocal speaker identity, including dialect, gender, and age. However, the textual content can still contain dialect- or sociolect-specific words. If synchronization to video is required, it is typically lost in an ASR-to-TTS pipeline, although mechanisms to adapt the speaking rate of the output speech accordingly are possible.

The output speech signal of such a system is very strongly decoupled from the input speech, which is a plus in this context, as it renders inversion of the speaker anonymization virtually impossible. However, it can have serious drawbacks.

If a recording contains difficult-to-understand speech (low volume, slurred speech, background noise, multiple speakers overlapping, strong accent, etc.), the ASR component may make a mistake and return incorrectly recognized words. The TTS system would then synthesize these wrong words, in crystal-clear new speech! A system based on voice conversion, which stays closer to the input speech signal, can potentially retain the ambiguity: a human listener could either still recover the “true” meaning (from context and background knowledge), or it at least remains obvious that the passage in question was actually hard to understand and ambiguous.

Prototype system

Our prototype anonymization system is presented in Figure 1. It takes as input an audio recording (for example, extracted from a video) and outputs the corresponding anonymized audio. Additional parameters, not shown, include the configuration of the speaker mapping process (i.e., how to select a target speaker for a given source speaker), the sensitivity of the voice activity detection, a previously known number of speakers, etc. The individual components are described in the following sections.
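The configuration parameters mentioned above might be grouped as in the following sketch. The field names and defaults are ours, chosen for illustration; they do not reflect Veritone’s actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AnonymizationConfig:
    """Illustrative knobs for a voice anonymization pipeline."""
    vad_sensitivity: int = 2             # e.g. 0 (least) to 3 (most aggressive)
    num_speakers: Optional[int] = None   # previously known speaker count, if any
    mapping_strategy: str = "random"     # "random", "pitch_matched", "gender_matched", ...
    target_speakers: list[str] = field(default_factory=list)

config = AnonymizationConfig(num_speakers=2,
                             mapping_strategy="pitch_matched",
                             target_speakers=["spk_a", "spk_b"])
```

Keeping these settings in one explicit structure makes it easy to re-run the pipeline with different anonymization trade-offs.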

Figure 1: Overview of the prototype voice anonymization system.

Voice activity detection

In a first step, voice activity detection (VAD) can be employed to isolate speech parts in a given audio stream. Sometimes this step is performed by the speaker diarization method itself. Otherwise, a common choice is Google’s WebRTC implementation [1], for which Python wrappers are available (https://github.com/wiseman/py-webrtcvad). VAD systems typically allow the sensitivity to be configured depending on the audio source.
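Frame-level VAD decisions (such as the per-frame booleans returned by py-webrtcvad’s `is_speech` for 10/20/30 ms frames) are typically merged into contiguous speech segments before further processing. A minimal sketch of that merging step:

```python
def merge_vad_frames(flags, frame_ms=30):
    """Merge per-frame speech/non-speech flags into (start_ms, end_ms) segments."""
    segments, start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i * frame_ms          # speech segment begins
        elif not is_speech and start is not None:
            segments.append((start, i * frame_ms))  # segment ends
            start = None
    if start is not None:                 # segment runs to end of stream
        segments.append((start, len(flags) * frame_ms))
    return segments

# Speech detected in frames 2-4 and 7-8:
print(merge_vad_frames([0, 0, 1, 1, 1, 0, 0, 1, 1, 0]))
# [(60, 150), (210, 270)]
```

In practice one would also smooth over very short gaps and discard very short segments, which is where the sensitivity configuration comes in.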

Speaker diarization

The next step is to discriminate between different speakers in the audio stream, commonly called speaker diarization. A widely used library for this is pyannote [2]; Nvidia’s NeMo framework also offers powerful speaker diarization methods [3]. The output of this step is a speaker ID associated with each speech segment in the original audio. Speaker diarization systems typically allow injecting a previously known number of speakers into the process to improve accuracy, and expose various hyperparameters such as thresholds for speaker clustering.
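Conceptually, the diarization output is a list of labeled segments. Grouping them by speaker ID, as the later mapping step needs, might look like the following sketch. The segment data is invented for illustration, and this is not pyannote’s actual output API.

```python
from collections import defaultdict

# (start_s, end_s, speaker_id) tuples, as a diarizer might emit them
segments = [(0.0, 2.1, "SPK_0"), (2.3, 4.0, "SPK_1"),
            (4.2, 5.5, "SPK_0"), (5.7, 8.0, "SPK_1")]

by_speaker = defaultdict(list)
for start, end, spk in segments:
    by_speaker[spk].append((start, end))

# Total speech duration per speaker, useful for sanity checks and mapping
total_speech = {spk: round(sum(e - s for s, e in segs), 2)
                for spk, segs in by_speaker.items()}
print(total_speech)  # {'SPK_0': 3.4, 'SPK_1': 4.0}
```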

Speaker mapping

The source speaker IDs predicted in the speaker diarization step now have to be mapped to target speakers, to which the respective audio will be converted. Here, the previously discussed trade-offs regarding the retention of speaker characteristics have to be taken into account. A semi-automatic system can present a list of target speakers to the user. Alternatively, depending on the goals of the anonymization, pitch- or gender-matched voices can be selected automatically.

Matching to a voice that is as neutral as possible is another option, useful for removing certain biases. The output of this step is a mapping of speech segments to target speakers. Configuration of this component mostly involves defining a set of target speakers and mapping rules (in the simplest case, randomly assigning target speakers).
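A pitch-matched mapping rule can be as simple as picking, for each source speaker, the target voice whose average pitch is closest. The speaker names and pitch values below are invented for illustration:

```python
def pitch_matched_mapping(source_pitches, target_pitches):
    """Map each source speaker to the target with the closest mean pitch (Hz)."""
    return {
        src: min(target_pitches, key=lambda tgt: abs(target_pitches[tgt] - f0))
        for src, f0 in source_pitches.items()
    }

sources = {"SPK_0": 118.0, "SPK_1": 205.0}   # mean F0 per diarized speaker
targets = {"tgt_low": 110.0, "tgt_mid": 150.0, "tgt_high": 210.0}
print(pitch_matched_mapping(sources, targets))
# {'SPK_0': 'tgt_low', 'SPK_1': 'tgt_high'}
```

Gender- or neutrality-based rules follow the same pattern, just with a different distance criterion over the target pool.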

Voice conversion

In the final step, individual speech segments are converted to the target speaker. In our experiments, “Voice Conversion With Just Nearest Neighbors” [4] has proven to be a relatively fast, robust and hackable method that also lets us easily add new target speakers without requiring training.

The method employs WavLM [5] to extract feature sequences from the source and target speakers. Feature frames from the source utterance are replaced by the most similar target-speaker frames using k-nearest neighbors. WavLM features from later layers have been shown to represent phonetic content, so they are a good candidate for conducting replacements based on speech content rather than speaker characteristics.

Still, we found that using a large target speaker dataset of mixed speakers results in a reconstruction that is very similar to the source speaker; the variance of the target speaker dataset should therefore be constrained. Finally, the replaced sequence of feature frames is fed into an adapted version of the HiFi-GAN vocoder [6] to reconstruct the converted speech signal.
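The core replacement step can be sketched in plain NumPy: every source feature frame is replaced by the mean of its k most cosine-similar target frames. This mirrors the idea in [4], not its exact implementation, and random vectors stand in for real WavLM features:

```python
import numpy as np

def knn_convert(source, target, k=4):
    """Replace each source frame (row) with the mean of its k most
    cosine-similar target frames. Shapes: (n_src, d) and (n_tgt, d)."""
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    tgt = target / np.linalg.norm(target, axis=1, keepdims=True)
    sims = src @ tgt.T                          # (n_src, n_tgt) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]    # indices of the k best matches
    return target[top_k].mean(axis=1)           # (n_src, d) converted features

rng = np.random.default_rng(0)
source_feats = rng.normal(size=(100, 16))   # stand-ins for WavLM feature frames
target_feats = rng.normal(size=(500, 16))
converted = knn_convert(source_feats, target_feats)
print(converted.shape)  # (100, 16)
```

Because the output frames are built entirely from target-speaker material, the converted sequence carries the target’s voice characteristics while following the source’s phonetic content.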

Implications for the future

The adoption of voice anonymization in the public sector holds significant potential for enhancing privacy and security. Future advancements may include more sophisticated algorithms that can better balance anonymization with clarity, and integration with other data protection measures to provide comprehensive privacy solutions. Moreover, public awareness and understanding of these technologies can foster trust and cooperation between citizens and public institutions.

Voice anonymization represents a critical step forward in protecting individuals’ identities in the public sector. As technology continues to evolve, so too will the methods and effectiveness of voice anonymization, ensuring that privacy concerns are addressed while maintaining the functionality and reliability of public services. 

Learn more about Veritone’s offerings for the public sector and start transforming your operations without compromising personally identifiable information (PII).

Sources:

Original source for video example #1:  https://www.youtube.com/watch?v=hUhq97fMQsk&t=451s

Original source for video example #2:   https://www.youtube.com/watch?t=1366&v=mqaODYJ702s&feature=youtu.be

References:

[1] Google. (2011). WebRTC. https://webrtc.org/ [Online; accessed June 2021].

[2] H. Bredin et al., “pyannote.audio: neural building blocks for speaker diarization,” Nov. 04, 2019, arXiv: arXiv:1911.01255. doi: 10.48550/arXiv.1911.01255.

[3] O. Kuchaiev et al., “NeMo: a toolkit for building AI applications using Neural Modules,” Sep. 13, 2019, arXiv: arXiv:1909.09577. doi: 10.48550/arXiv.1909.09577.

[4] M. Baas, B. van Niekerk, and H. Kamper, “Voice Conversion With Just Nearest Neighbors,” May 30, 2023, arXiv: arXiv:2305.18975. doi: 10.48550/arXiv.2305.18975.

[5] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1505–1518, Oct. 2022, doi: 10.1109/JSTSP.2022.3188113.

[6] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” Oct. 23, 2020, arXiv: arXiv:2010.05646. doi: 10.48550/arXiv.2010.05646.