A wavelet-based auditory planning space for production of vowel sounds
(Proceedings of Vision, Recognition, Action: Neural Models of Mind and Machine
Boston University, 1997)

Dave Johnson
(Department of Cognitive and Neural Systems, Boston University)

Recent evidence suggests that speakers may utilize an acoustic-like space in the planning of articulator movements for vowel production [J. Perkell, M. Matthies, M. Svirsky, M. Jordan, J. Acoust. Soc. Am. 93, 2948-2961, (1993)]. In earlier work, Guenther and Johnson [J. Acoust. Soc. Am. 97(5), 3402, (1993)] successfully utilized a formant-based planning space in a computational model of vowel production. However, formant-based models leave a number of issues unresolved, and several researchers have suggested that the gross shape of the vowel spectrum may correlate more closely with vowel perception data [e.g., see S. Zahorian and A. Jagharghi, J. Acoust. Soc. Am. 94(4), 1966-1982, (1993)]. Several measures of acoustic spectral shape have been proposed for vowel perception based on psychoacoustic studies. Unfortunately, none of these measures has received support from the physiological literature. However, recent physiological studies have shown that the peripheral auditory system computes the log magnitude spectrum of a steady vowel, and that the primary auditory cortex uses a wavelet transform to encode this log magnitude spectrum [e.g., see K. Wang and S. Shamma, IEEE Trans. Speech and Audio Processing, 3(5), 382-395, (1995)]. Based on these physiological results, the present work proposes a model of vowel production planning based on a wavelet representation of the log magnitude spectrum of the target vowel.

The model employs an orthonormal set of wavelet basis functions obtained from modeling of the visual system [A. Pentland, PAMI, 11(7), 674-693, (1994)] that spans the space of possible Fourier log magnitude spectra. The dimensions of the wavelet auditory planning space correspond to the coefficients in the wavelet expansion of the spectrum, and vowel targets are assumed to be connected regions in this space. The model computes a trajectory in the planning space by linear interpolation between consecutive vowel targets; at each time step, the current position in the planning space is computed and converted into speech samples using standard homomorphic vocoding techniques. Simulations using the wavelet auditory planning space produce vowels and vowel transitions of acceptable quality, and the vowel spectra converge rapidly to their target values. In addition to support from the physiological literature, this model has a number of advantages over formant-based vowel production models, including robustness in noise, better approximations of gross spectral shape, and the ability to represent the log magnitude spectrum with arbitrary accuracy.
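The planning-space scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the model's actual implementation: it substitutes an orthonormal Haar basis for the visual-system-derived wavelets cited in the abstract, uses two synthetic stand-in "vowel" log magnitude spectra, and omits the homomorphic synthesis stage. It shows only the core idea: represent each target spectrum by its wavelet coefficients, interpolate linearly between coefficient vectors, and map each intermediate point back to a log magnitude spectrum.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar wavelet analysis of a length-2^k signal.

    Returns coefficients ordered coarsest-first: [scaling, then
    detail coefficients from coarsest to finest scale].
    """
    a = np.asarray(x, dtype=float)
    details = []
    while len(a) > 1:
        s = (a[0::2] + a[1::2]) / np.sqrt(2)  # scaling (average) part
        d = (a[0::2] - a[1::2]) / np.sqrt(2)  # wavelet (detail) part
        details.append(d)
        a = s
    return np.concatenate([a] + details[::-1])

def haar_inverse(c):
    """Invert haar_transform, recovering the original signal."""
    c = np.asarray(c, dtype=float)
    a = c[:1].copy()
    idx = 1
    while idx < len(c):
        d = c[idx:idx + len(a)]
        idx += len(a)
        s = np.empty(2 * len(a))
        s[0::2] = (a + d) / np.sqrt(2)
        s[1::2] = (a - d) / np.sqrt(2)
        a = s
    return a

# Two hypothetical vowel targets: stand-in log magnitude spectra
# sampled on 16 frequency bins (real targets would come from the
# peripheral auditory model).
freqs = np.linspace(0.0, np.pi, 16)
target_a = -np.abs(np.sin(3 * freqs))
target_b = -np.abs(np.cos(2 * freqs))

# Planning-space coordinates = wavelet expansion coefficients.
c_a = haar_transform(target_a)
c_b = haar_transform(target_b)

# Trajectory: linear interpolation between consecutive targets in
# coefficient space; each point maps back to a spectrum for synthesis.
for t in (0.0, 0.5, 1.0):
    c_t = (1 - t) * c_a + t * c_b
    spectrum_t = haar_inverse(c_t)
```

Because the transform is orthonormal (and hence linear), interpolating in coefficient space is equivalent to interpolating the log magnitude spectra themselves; the practical advantage of the coefficient representation is that truncating fine-scale coefficients yields controllable approximations of gross spectral shape.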

[Partially supported by AASERT (ONR N00014-94-1-0940), NIH 1-R29-DC02952-01, ONR N00014-95-1-0657, ONR N00014-94-1-0597, and MURI (ONR N00014-95-1-0409).]