Research Feature: VocalTube

The VocalTube project, led by graduate student Debasish Ray Mohapatra, starts from the observation that speech is an essential mechanism for communication and for expressing emotion: the words we pronounce and the way we express them define our individuality. Hence, no two people sound alike, even when they speak the same word in the same language. The science community has long been fascinated by two questions: "how do humans produce sound?" and "how could we make machines speak like humans?". We have come a long way, from formant speech synthesizers to powerful state-of-the-art machine learning models. However, we are still far from a physics-based articulatory speech synthesizer that can generate speech sound in real time. Mohapatra's recent paper, presented at the 2019 Interspeech Conference held in Graz, Austria, addresses this research problem. He is currently building a speaker-specific vocal tract model (a 2.5D FDTD vocal tract) using the finite-difference method that can produce static vowel sounds in quasi-real-time.

Speech production is a complex activity, but in functional terms it works like a wind instrument: we blow air through a reed (mouthpiece), which is the source of acoustic energy, and the resulting acoustic waves pass through a resonator to produce sound. As the geometry of the resonator (the duct) changes, the musical sound varies. In speech anatomy, the quasi-periodic vibration of the vocal folds acts as the source, and the upper vocal tract serves as the resonator, or articulator. Because articulation (the geometry of the vocal tract) differs between individuals, each of us sounds different when speaking.
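To make the source-filter analogy concrete, here is a minimal sketch of a classical source-filter (formant) synthesizer in Python: an impulse-train source stands in for glottal pulses and is passed through a cascade of two-pole resonators. The formant frequencies and bandwidths are illustrative values for an /a/-like vowel, not parameters from the VocalTube model.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sample rate (Hz)
f0 = 110.0                      # glottal pitch (Hz)
dur = 0.5                       # duration (s)

# Source: an impulse train approximating glottal pulses.
n = int(fs * dur)
source = np.zeros(n)
source[::int(fs / f0)] = 1.0

# Resonator: a cascade of two-pole resonators at assumed formant
# frequencies/bandwidths for an /a/-like vowel (illustrative only).
formants = [(730, 90), (1090, 110), (2440, 160)]   # (Hz, Hz)
signal = source
for f, bw in formants:
    r = np.exp(-np.pi * bw / fs)                   # pole radius from bandwidth
    theta = 2 * np.pi * f / fs                     # pole angle from frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]       # resonator denominator
    signal = lfilter([1.0], a, signal)

signal /= np.max(np.abs(signal))                   # normalize to [-1, 1]
```

The key difference in VocalTube is that the resonator is not a bank of digital filters: the acoustic wave propagation through the tract geometry is simulated physically.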

But the vocal tract has an intricate, highly irregular geometry. Capturing that irregularity requires a 3D model, which provides better acoustic characteristics but is computationally expensive. Conversely, existing 1D models simulate quickly but oversimplify the realistic vocal tract. The project proposes a novel approach: modelling a 2D vocal tract with 3D characteristics using the finite-difference time-domain (FDTD) numerical method. This new strategy opens a path to a real-time speech synthesizer that does not compromise acoustic quality. The computational model is driven by vocal tract area functions, collected through MRI while the speaker produces vowel and consonant sounds.
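As a rough illustration of how an FDTD scheme can be driven by an area function, the sketch below implements a 1D pressure/volume-velocity leapfrog update for a tube of varying cross-section. It is deliberately simpler than the paper's 2.5D scheme, and the area function, boundary treatment, and glottal source are placeholder assumptions rather than data from the paper.

```python
import numpy as np

c, rho = 350.0, 1.14            # speed of sound (m/s), air density (kg/m^3)

# Hypothetical area function: 44 sections of a 17.5 cm tract (m^2).
# Real models use MRI-derived area functions; this /a/-like shape
# (narrow at the glottis, wide at the lips) is assumed for illustration.
N, L = 44, 0.175
dx = L / N
x = np.linspace(0.0, L, N)
A = 1e-4 * (0.5 + 2.5 * (x / L) ** 2)

dt = 0.5 * dx / c               # time step within the CFL stability limit
steps = 4000

p = np.zeros(N)                 # pressure at cell centers
U = np.zeros(N + 1)             # volume velocity at cell faces (staggered)
A_face = np.empty(N + 1)        # area interpolated onto the faces
A_face[1:-1] = 0.5 * (A[:-1] + A[1:])
A_face[0], A_face[-1] = A[0], A[-1]

out = []
for step in range(steps):
    # Glottal source: a half-rectified volume-velocity pulse train.
    U[0] = 1e-4 * max(0.0, np.sin(2 * np.pi * 110.0 * step * dt)) ** 2

    # Update interior velocities from the pressure gradient.
    U[1:-1] -= (dt * A_face[1:-1] / (rho * dx)) * (p[1:] - p[:-1])

    # Open-end (lip) boundary: approximate radiation by p = 0 outside.
    U[-1] -= (dt * A_face[-1] / (rho * dx)) * (0.0 - p[-1])

    # Update pressures from the divergence of volume velocity.
    p -= (dt * rho * c ** 2 / (A * dx)) * (U[1:] - U[:-1])

    out.append(p[-1])           # record the pressure near the lips
```

The cross-sectional area enters both update equations, so widening or narrowing sections of the tube shifts the resonances, which is exactly how differing articulations yield different vowel qualities.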

The next goal is to understand the nonlinear coupling between the vocal folds and the vocal tract, a necessary step toward a complete articulatory speech synthesizer. This research could also be applied to singing synthesizers and to the aero-acoustic modelling of wind instruments. Check out the paper and source code below for more information.

Paper Link: https://www.isca-speech.org/archive/Interspeech_2019/pdfs/1764.pdf
Source Code: https://github.com/Debasishray19/vocaltube-speech-synthesis/tree/master/version03