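Title: Research on Large-Vocabulary Decoding Strategies and Speaking Rate Robustness in Speech Recognition (语音识别的大词表解码策略与语速鲁棒性研究)
Author: Zhang Dongbin (张东滨)
Degree: Doctoral
Defense Date: 2005
Degree-Granting Institution: Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所)
Place of Conferral: Institute of Acoustics, Chinese Academy of Sciences
Keywords: speech recognition; decoding strategy; adaptive beam pruning; duration model; speaking rate robustness
Alternative Title: Large Vocabulary Decoding and Speaking Rate Robustness of Speech Recognition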
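Chinese Abstract (translated): The ultimate goal of speech recognition research is practical deployment. Small- and medium-vocabulary recognition has already begun to reach practical use and performs well in quiet environments with careful pronunciation, whereas large-vocabulary recognition of natural spoken language remains at the laboratory research stage, owing to the sharp conflict between computational complexity and system resources and to the complexity of spoken pronunciation. Against this background, this thesis studies how to improve the real-time performance of large-vocabulary speech recognition and addresses part of the spoken-language robustness problem. Because spoken-language robustness spans many research areas, the work here focuses on recognizer robustness under varying speaking rate. The main contributions are as follows.
1. Fast decoding algorithms for large-vocabulary recognition. The two main obstacles to practical large-vocabulary recognition are the poor real-time performance caused by computational complexity and the conflict between limited system resources and large memory demands. This part of the work therefore aims to let the decoder use system resources as fully as possible, improve real-time performance, and maximize recognition performance. The thesis introduces adaptive control theory and proposes a fast search algorithm with an adaptive pruning strategy, in which an adaptive regulator dynamically adjusts the beam width according to performance requirements or system resource limits. Without loss of recognition performance, the search space is reduced by 69.8% and recognition time by 52.9%. Building on this, an adaptive beam pruning strategy based on a dynamic expected number of active models is proposed: a curve of the dynamic expected number of active models, estimated from training-set examples, serves as the dynamic reference signal of the adaptive regulator, reducing search space and decoding time by a further 4.5% and 4.4%, respectively, relative to the first method. In addition, the memory available in a practical system is usually fixed, and under the same memory budget the algorithm lowers the error rate by 14.3% relative.
2. Speaking rate robustness. Through experiments and theoretical analysis, the thesis shows that the Gamma distribution fits phone durations better than the normal distribution, but the Gamma distribution is inconvenient for normalization. A piecewise Gaussian distribution is therefore proposed to describe phone durations, and on this basis the average normalized duration deviation is proposed as a speaking rate estimate. To make the method more robust, the average normalized deviation of mid-range phone durations is then used to improve the estimate. This estimator is unaffected by recognition results, and the correlation coefficient between estimated and true speaking rate reaches 0.96. Based on the speaking rate estimate, a dynamic word penalty strategy is proposed for slow utterances and a dynamic frame shift adjustment method for fast utterances, lowering the error rate by 10.1% on slow utterances and by 9.9% on fast utterances. For pauses in running speech, a parallel silence model strategy lowers the system error rate by 9.1%. These strategies can be combined to improve robustness to speaking rate variation and arbitrary pauses in spoken language.
3. Construction of a large-vocabulary decoding scheme that integrates the speaking rate robustness strategies. Based on the above, the thesis proposes an effective decoding scheme: a fast first decoding pass is run, its result is analyzed, and robust speaking rate estimation identifies utterances that are too fast or too slow. These utterances, which affect recognition performance most, are then processed accordingly in a second decoding pass, balancing overall recognition accuracy against real-time performance. Experiments show that the scheme lowers the overall error rate by 13.2% while reducing recognition time by 47.4%.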
English Abstract: The ultimate goal of speech recognition research is real-world application. Small- and medium-vocabulary recognition systems based on model matching are now being deployed on portable platforms such as mobile phones, PDAs, and tablet PCs, and they achieve fair performance on clean, fluent speech. However, the conflict between computational complexity and system resources, together with the uncertainty and variability of spontaneous speech, keeps large-vocabulary speech recognition at the laboratory research stage. This thesis therefore focuses on optimizing the decoding strategy for large-vocabulary speech recognition, aiming to improve real-time performance and robustness under some spontaneous-speech conditions. Because the robustness of spontaneous speech recognition spans many fields, the thesis addresses only robustness to varying speaking rate and pauses. The main contributions of this thesis are:
1. Fast decoding algorithm for large vocabulary continuous speech recognition (LVCSR). This part aims to make the decoder more efficient and to improve the real-time performance of LVCSR. We apply adaptive control theory to speech recognition and present a fast decoding algorithm based on an adaptive pruning mechanism. An adaptive controller adjusts the beam width dynamically according to the required recognition performance or the system resource limits. Compared to the baseline system, this method reduces the search space by 69.8% and computation time by 52.9% without sacrificing recognition rate. Going one step further, we use the dynamic expected active models curve as the reference signal of the adaptive pruning system, which reduces computation time and search space by a further 4.4% and 4.5%, respectively. In addition, under the same memory budget these two adaptive methods reduce the word error rate (WER) by 14.3% relative.
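The adaptive pruning idea in contribution 1 can be illustrated with a minimal sketch. The abstract does not specify the control law, so the proportional update below is an assumption made for illustration, and every name (`prune`, `update_beam`, `decode`, `gain`, `min_beam`, `max_beam`) is hypothetical rather than taken from the thesis. The `reference` list plays the role of the reference signal: a constant gives a fixed target number of active hypotheses, while a per-frame curve would correspond to the dynamic expected active models curve.

```python
import random

def prune(scores, beam):
    """Keep hypotheses whose log-score is within `beam` of the best one."""
    best = max(scores)
    return [s for s in scores if s >= best - beam]

def update_beam(beam, n_active, target, gain=0.5, min_beam=1.0, max_beam=20.0):
    """Assumed proportional rule: narrow the beam when too many hypotheses
    survive, widen it when too few do, and clamp to [min_beam, max_beam]."""
    error = (target - n_active) / target  # relative tracking error
    new_beam = beam * (1.0 + gain * error)
    return min(max_beam, max(min_beam, new_beam))

def decode(frames, reference, beam=10.0):
    """Prune frame by frame; reference[t] is the target active count at frame t."""
    history = []
    for t, scores in enumerate(frames):
        survivors = prune(scores, beam)
        history.append((beam, len(survivors)))
        beam = update_beam(beam, len(survivors), reference[t])
    return history

# Synthetic log-scores stand in for real decoder hypotheses (illustration only).
random.seed(0)
frames = [[random.uniform(-30.0, 0.0) for _ in range(1000)] for _ in range(50)]
reference = [100] * 50  # constant target; a per-frame curve fits the same interface
history = decode(frames, reference)
```

Starting from a wide beam, the controller overshoots once, then settles on a beam width that keeps the survivor count near the target. A real decoder would apply the same feedback to active models or tokens rather than to a flat list of scores.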
2. Speaking rate robustness. The Gamma distribution characterizes phone-level duration best, but it is inconvenient for statistical computation. We therefore use an asymmetric Gaussian distribution to characterize phone duration and propose a robust speaking rate estimation method based on the average standardized duration shift. To avoid the influence of insertion and deletion errors, only the middle part of the average standardized duration shift is used as the speaking rate estimator. This method is largely immune to recognition errors, and the correlation coefficient between the estimated and reference rates reaches 0.96. Given the estimated speaking rate, we propose two compensation mechanisms, one for overly slow and one for overly fast utterances: a dynamic word penalty strategy for slow utterances, which reduces WER by 10.1%, and a dynamic frame shift method for fast utterances, which reduces WER by 9.9%.
3. Construction of an efficient integrated two-stage decoder. Fast decoding is performed in the first stage, and its result is analyzed to find overly fast and overly slow utterances. Only these abnormal utterances are decoded again in the second stage, each with the corresponding compensation mechanism. Experimental results show that the integrated scheme strikes a good balance between real-time performance and recognition rate: compared to the baseline system, it reduces the error rate by 13.2% and computation time by 47.4%.
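The speaking rate estimator and the routing into the second stage can be sketched as follows. This is one possible reading of the abstract, not the thesis implementation. It assumes that the asymmetric Gaussian is a mean with separate left and right standard deviations, that "the middle part" means a trimmed mean over sorted deviations, and that a positive score indicates slow speech. The thresholds, the phone statistics, and all function names are hypothetical.

```python
def normalized_deviation(duration, mean, sigma_left, sigma_right):
    """Deviation of a phone duration from its mean, scaled by the standard
    deviation of the side it falls on (assumed asymmetric Gaussian model)."""
    sigma = sigma_left if duration < mean else sigma_right
    return (duration - mean) / sigma

def rate_score(aligned, stats, keep=0.5):
    """Average the middle `keep` fraction of the sorted deviations, so that
    outliers caused by insertion or deletion errors are discarded."""
    if not aligned:
        raise ValueError("need at least one aligned phone")
    devs = sorted(normalized_deviation(d, *stats[p]) for p, d in aligned)
    drop = int(len(devs) * (1.0 - keep) / 2.0)
    middle = devs[drop:len(devs) - drop] or devs
    return sum(middle) / len(middle)

def second_pass_action(score, slow=1.0, fast=-1.0):
    """Choose the compensation for the second decoding stage (thresholds assumed)."""
    if score >= slow:
        return "dynamic word penalty"
    if score <= fast:
        return "dynamic frame shift"
    return "keep first-pass result"

# Hypothetical per-phone statistics in frames: (mean, sigma_left, sigma_right).
stats = {"a": (10.0, 2.0, 4.0), "t": (5.0, 1.0, 2.0)}

# First-pass alignments as (phone, duration) pairs; the 40-frame "t" mimics
# a misaligned segment that the trimming is meant to ignore.
slow_utt = [("a", 18.0), ("t", 9.0), ("a", 14.0), ("t", 40.0)]
fast_utt = [("a", 8.0), ("t", 4.0), ("a", 6.0), ("t", 3.0)]

for utt in (slow_utt, fast_utt):
    print(second_pass_action(rate_score(utt, stats)))
```

Only utterances routed to one of the two compensation strategies would be decoded again, which is how the two-stage scheme keeps most of the speed gain from the fast first pass.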
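Language: Chinese
Release Date: 2011-05-07
Pages: 94
Content Type: Thesis
Source URL: [http://159.226.59.140/handle/311008/918]
Collection: Institute of Acoustics_IOA Doctoral and Master's Theses_1981-2009 Doctoral and Master's Theses
Recommended Citation
GB/T 7714
Zhang Dongbin. Research on Large-Vocabulary Decoding Strategies and Speaking Rate Robustness in Speech Recognition [D]. Institute of Acoustics, Chinese Academy of Sciences. Institute of Acoustics, Chinese Academy of Sciences. 2005.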