Thesis Info

Thesis Title
Realtime and Accurate Musical Control of Expression in Voice Synthesis
Nicolas d'Alessandro
2nd Author
3rd Author
Applied Sciences
Number of Pages
University of Mons
Thesis Supervisor
Thierry Dutoit
Supervisor e-mail
thierry.dutoit AT
Other Supervisor(s)
Language(s) of Thesis
Department / Discipline
Signal Processing
Languages Familiar to Author
English, French
URL where full thesis can be found
Keywords
realtime, voice synthesis, tablet, wacom, handsketch, performance, hci
Abstract: 200-500 words
In the early days of speech synthesis research, understanding voice production attracted the attention of scientists whose goal was to produce intelligible speech. Later, the need for more natural voices led researchers to use prerecorded voice databases containing speech units that are reassembled by a concatenation algorithm. As computing capacity grew, the length of the units increased, going from diphones to non-uniform units in the so-called unit selection framework, which follows a strategy referred to as “take the best, modify the least”. Today, the new challenge in voice synthesis is the production of expressive speech or singing. The mainstream solution to this problem is based on the “there is no data like more data” paradigm: emotion-specific databases are recorded and emotion-specific units are segmented. In this thesis, we propose to revisit the expressive speech synthesis problem from its original voice production grounds. We also assume that the expressivity of a voice synthesis system relies on its interactive properties rather than strictly on the coverage of the recorded database. To reach our goals, we develop the RAMCESS software system, an analysis/resynthesis pipeline which aims at providing interactive, real-time access to the voice production mechanism. More precisely, this system makes it possible to browse a connected speech database and to dynamically modify the values of several glottal source parameters. To achieve these voice transformations, a connected speech database is recorded and the RAMCESS analysis algorithm is applied to it. RAMCESS analysis relies on the estimation of glottal waveforms and vocal tract impulse responses from the prerecorded voice samples. We cascade two promising glottal flow analysis algorithms, ZZT and ARX-LF, as a way of reinforcing the whole analysis process.
Then the RAMCESS synthesis engine computes the convolution of the previously estimated glottal source and vocal tract components within a realtime pitch-synchronous overlap-add architecture. A new model for producing the glottal flow signal is proposed. This model, called SELF, is a modified LF model which covers a larger palette of phonation types and solves some problems encountered in realtime interaction. Variations in the glottal flow behaviour are perceived as modifications of voice quality along several dimensions, such as tenseness or vocal effort. In the RAMCESS synthesis engine, glottal flow parameters are modified through several dimensional mappings in order to give access to the perceptual dimensions of a voice quality control space. The expressive interaction with the voice material is done through a new digital musical instrument called the HandSketch: a tablet-based controller, played vertically, with extra FSR sensors. In this work, we describe how this controller is connected to voice quality dimensions, and we also discuss the long-term practice of this instrument. Compared to the usual prototyping of multimodal interactive systems, and more particularly of digital musical instruments, the work on RAMCESS and HandSketch has been structured quite differently. Indeed, our prototyping process, called the Luthery Model, is inspired by traditional instrument making and is based on embodiment. The Luthery Model also leads us to propose the Analysis-by-Interaction (AbI) paradigm, a methodology for approaching signal analysis problems. The main idea is that if a signal is not observable, it can be imitated with an appropriate digital instrument and a highly skilled practice. The signal can then be studied by analyzing the imitative gestures.
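The synthesis step mentioned in the abstract (convolving a glottal source with a vocal tract impulse response inside a pitch-synchronous overlap-add loop) can be illustrated with a minimal sketch. This is not the RAMCESS implementation: the function names, the raised-cosine pulse, and the single damped-sinusoid "formant" are all hypothetical stand-ins chosen for brevity, and the pulse is neither the LF nor the SELF model.

```python
import numpy as np


def toy_glottal_pulse(period, open_quotient=0.6):
    """Hypothetical stand-in for one period of glottal excitation.

    Assumption: a raised-cosine bump over the open phase, differentiated.
    This is NOT the LF or SELF model described in the thesis.
    """
    n_open = min(period, max(2, int(round(open_quotient * period))))
    pulse = np.zeros(period)
    phase = np.arange(n_open) / n_open
    pulse[:n_open] = 0.5 * (1.0 - np.cos(2.0 * np.pi * phase))
    # Differentiate the flow to obtain a derivative-like excitation.
    return np.diff(pulse, prepend=0.0)


def toy_tract_ir(fs=16000, formant_hz=700.0, bandwidth_hz=100.0, length=256):
    """Hypothetical vocal tract impulse response: one damped sinusoid."""
    t = np.arange(length) / fs
    return np.exp(-np.pi * bandwidth_hz * t) * np.sin(2.0 * np.pi * formant_hz * t)


def psola_synthesize(periods, tract_ir, open_quotient=0.6):
    """Overlap-add one (source convolved with tract) frame per pitch period."""
    out = np.zeros(int(sum(periods)) + len(tract_ir))
    pos = 0
    for period in periods:
        frame = np.convolve(toy_glottal_pulse(period, open_quotient), tract_ir)
        # Frames are longer than one period, so consecutive frames overlap.
        out[pos:pos + len(frame)] += frame
        pos += period  # advance by exactly one pitch period
    return out


fs = 16000
periods = [160] * 50  # 50 periods of 160 samples, i.e. 100 Hz at 16 kHz
signal = psola_synthesize(periods, toy_tract_ir(fs))
```

In this sketch, changing `open_quotient` or the list of `periods` between frames plays the role of the interactive control of glottal source parameters; a real system would map controller gestures to such parameters through the voice quality dimensions the abstract describes.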