Also see the VocalTractObj for a speech generation system based on the engine used in the gnuspeech toolkit.
The AuditoryProc object processes raw sound input in a manner consistent with the peripheral auditory pathways, using techniques that are widely used in speech perception research. Processing proceeds in several stages:
- First, a Discrete Fourier Transform (DFT), implemented with the Fast Fourier Transform (FFT) algorithm, transforms the time-varying sound signal into a spectrogram in which different frequencies are represented by different input features. Typically the sound is sampled in steps of about 10 msec, with a window of 25 msec, over a period of e.g., 100 msec, producing a 2D time-frequency representation of the sound.
- Then a Mel-frequency filter bank aggregates power across frequencies using triangle-shaped filters whose spacing reflects the perceptual discriminability of different frequencies in humans.
- The last stage can be either:
  - A set of time-frequency tuned Gabor filters, based on those observed in the auditory pathways, that respond selectively to different trajectories of frequency power over time (e.g., rising vs. falling tones).
  - Or the Mel-cepstrum discrete cosine transform (DCT), which is widely used in speech processing.
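
The stages above can be illustrated with a minimal NumPy sketch. This is not the AuditoryProc implementation: the function names, the 16 kHz sample rate, the Hamming window, the choice of 20 filters, and the common `2595 * log10(1 + f / 700)` Mel formula are all assumptions made for illustration. Only the 25 msec window and 10 msec step come from the description above. The sketch covers the DFT, Mel filter bank, and DCT option; the Gabor filter option is not shown.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, step_ms=10.0):
    """Slice a 1D signal into overlapping windows (one row per time step)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // step
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    return signal[idx]

def power_spectrogram(frames):
    """Taper each frame (Hamming window, an assumption) and take FFT power."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(windowed, axis=1)) ** 2

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, bin_hz, lo_hz, hi_hz):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(lo_hz), hz_to_mel(hi_hz), n_filters + 2))
    bank = np.zeros((n_filters, len(bin_hz)))
    for i in range(n_filters):
        lo, ctr, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - lo) / (ctr - lo)
        falling = (hi - bin_hz) / (hi - ctr)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct2(x):
    """Unnormalized DCT-II along the last axis."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2.0 * n))
    return x @ basis.T

def mel_cepstrum(signal, sample_rate, n_filters=20):
    """Run all three stages; returns (spectrogram, mel energies, cepstrum)."""
    frames = frame_signal(signal, sample_rate)
    spec = power_spectrogram(frames)
    bin_hz = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    bank = mel_filterbank(n_filters, bin_hz, 0.0, sample_rate / 2.0)
    mel = spec @ bank.T
    # Small constant avoids log(0) for silent input.
    ceps = dct2(np.log(mel + 1e-10))
    return spec, mel, ceps

# Example: 100 msec of a 1 kHz tone sampled at 16 kHz (assumed rate).
sample_rate = 16000
t = np.arange(int(0.1 * sample_rate)) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)
spec, mel, ceps = mel_cepstrum(tone, sample_rate)
print(spec.shape, mel.shape, ceps.shape)
```

Each row of `spec` is one 25 msec window, so the array is the 2D time-frequency representation described above; `mel` reduces the frequency axis to the filter bank channels, and `ceps` holds the cepstral coefficients for each time step.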