Abstract:
A technique to improve the recognition accuracy when transcribing speech data that contains data from a wide range of environments. Input data in many situations contains data from a variety of sources in different environments. Such classes include: clean speech, speech corrupted by noise (e. g. , music), non-speech (e. g. , pure music with no speech), telephone speech, and the identity of a speaker. A technique is described whereby the different classes of data are first automatically identified, and then each class is transcribed by a system that is made specifically for it. The invention also describes a segmentation algorithm that is based on making up an acoustic model that characterizes the data in each class, and then using a dynamic programming algorithm (the viterbi algorithm) to automatically identify segments that belong to each class. The acoustic models are made in a certain feature space, and the invention also describes different feature spaces for use with different classes.