The proposed neural architecture consists of an analytic lower net, and a synthetic upper net. This paper focuses on the upper net. The lower net performs a 2D multiresolution wavelet decomposition of an initial spectral representation to yield a multichannel representation of local frequency modulations at multiple scales. From this representation, the upper net synthesizes increasingly complex features, resulting in a set of acoustic observables at the top layer with multiscale context dependence. The upper net also provides for invariance under frequency shifts, dilatations in tone intervals and time intervals, by building these transformations into the architecture. Application of this architecture to the recognition of gross and fine phonetic categories from continuous speech of diverse speakers shows that it provides high accuracy and strong generalization from modest amounts of training data.
Proc. ICASSP, pp. II-1109, May, 1998