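Title: Research on the Robustness of Telephone Speech Recognition (电话语音识别鲁棒性研究)
Author: Zhang Huayun (张化云)
Degree type: Doctor of Engineering
Defense date: 2003-07-01
Degree-granting institution: Graduate School of the Chinese Academy of Sciences
Place of conferral: Institute of Automation, Chinese Academy of Sciences
Advisor: Xu Bo (徐波)
Keywords: pitch extraction; telephone channel compensation; cascaded linear transform adaptation; out-of-vocabulary (OOV) word rejection
Other title: A Study of Robust Automatic Speech Recognition over telephone network
Degree discipline: Pattern Recognition and Intelligent Systems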
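Abstract (translated from the Chinese): The telephone is the most widespread voice communication tool and the largest potential application area for advanced speech technologies. Speech recognition is a core technology for telephone-based speech applications. Yet recognition systems that perform well under laboratory conditions become very fragile when deployed on real telephone networks, so improving the robustness of telephone speech recognition is the key to its commercialization. This thesis addresses the technical difficulties of Mandarin telephone speech recognition and makes in-depth studies and effective improvements in the following areas.

Chinese is a typical tonal language, and tone information plays an important role in Chinese speech recognition. However, because of the modulation effect of the telephone channel, conventional pitch (fundamental frequency) extraction algorithms make large errors on telephone speech, which directly affects the recognition rate. We adopt an improved unbiased autocorrelation analysis and propose a pitch tracking method that combines autocorrelation strength with a statistical voiced/unvoiced decision, reducing the voiced/unvoiced error rate to 24% of that of the original autocorrelation method. The accurate and reliable pitch feature yields a 6.5% relative reduction in the error rate for isolated-word telephone speech recognition.

Robust front-end features are a prerequisite for high-performance speech recognition. Because the mechanisms of speech production and perception are not yet deeply understood, there is still no speech feature representation that is independent of noise and channel. When the training and test channels of a system differ, the speech features must be compensated. Because the telephone channel contains many uncertain factors, the usual cepstral mean estimation and cepstral filtering methods do not give satisfactory results. We propose a quasi-linear channel analysis model and estimate the channel bias using a statistical speech model and maximum-likelihood estimation. In Mandarin large-vocabulary continuous telephone speech recognition tests, this reduces the character error rate by 20% relative. To address the data sparsity that arises in fast compensation, we introduce phone-dependent prior knowledge of the channel and estimate the channel bias by maximum a posteriori estimation, which reduces the error rate by a further 7% relative. Unlike other compensation methods, the two new algorithms are effective not only on fixed-line telephone channels but also on nonlinear, compressed wireless telephone channels.

Acoustic adaptation targeted at a specific application is an important component of a speech recognition application system. Building on cascaded linear transform adaptation, we propose a new simplified parameterization of the full-matrix linear transform. The new method retains the accuracy advantage of the full-matrix transform while effectively reducing the number of parameters to re-estimate, which makes the estimates more robust. This allows transforms to be estimated on smaller regression classes and improves adaptation accuracy. The new method outperforms the original transform-based adaptation methods in adaptation tests with different amounts of data.

Finally, we discuss rejection mechanisms for background noise and out-of-vocabulary words in the recognition of natural continuous speech, and their implementation in a telephone speech recognition platform. We implement a parallel-search rejection method based on noise models and Mandarin syllable filler models, and use it to perform effective keyword spotting in continuous speech.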
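The pitch-extraction step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the thesis's improved unbiased autocorrelation or its statistical voiced/unvoiced decision, so the code below is only an assumption-laden stand-in: it uses the textbook unbiased autocorrelation (lag k divided by its N - k overlapping samples) and a fixed strength threshold in place of the statistical decision. The function names, the 0.5 voicing threshold, the 0.9 shortest-period factor, and the 60-400 Hz search range are all hypothetical choices made for illustration.

```python
import numpy as np


def unbiased_autocorr(frame):
    """Autocorrelation in which lag k is divided by its N - k overlapping samples."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    n = len(x)
    raw = np.correlate(x, x, mode="full")[n - 1:]  # lags 0 .. n-1
    return raw / (n - np.arange(n))


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0,
                voicing_threshold=0.5):
    """Return an F0 estimate in Hz, or None if the frame is judged unvoiced.

    Hypothetical stand-in: a fixed threshold on the normalized autocorrelation
    strength replaces the statistical voiced/unvoiced decision of the thesis.
    """
    acf = unbiased_autocorr(frame)
    if acf[0] <= 0.0:
        return None  # silent frame, nothing to analyze
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    strength = acf[lag_min:lag_max + 1] / acf[0]
    peak = strength.max()
    if peak < voicing_threshold:
        return None
    # Without the biased estimator's taper, peaks at multiples of the period
    # are equally strong, so prefer the shortest lag close to the maximum.
    idx = int(np.argmax(strength >= 0.9 * peak))
    while idx + 1 < len(strength) and strength[idx + 1] > strength[idx]:
        idx += 1  # climb to the local maximum
    return sample_rate / (lag_min + idx)


# Synthetic 40 ms frames at 8 kHz. The voiced frame contains only harmonics
# 2-5 of 200 Hz, mimicking a band-limited channel that removes the fundamental.
sample_rate = 8000
t = np.arange(320) / sample_rate
voiced = sum(np.sin(2 * np.pi * 200.0 * h * t) for h in range(2, 6))
noise = np.random.default_rng(0).standard_normal(320)

f0_voiced = estimate_f0(voiced, sample_rate)
f0_noise = estimate_f0(noise, sample_rate)
print("voiced frame:", f0_voiced, "noise frame:", f0_noise)
```

The synthetic voiced frame has no energy at 200 Hz itself, yet the autocorrelation still peaks at the 40-sample period, which is why autocorrelation-based trackers remain usable on band-limited telephone speech.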
English abstract: Since the telephone is the only ubiquitous communication terminal in the world today, it is the largest potential application field for speech technology. Automatic speech recognition (ASR) is a core technique for such telephone-based speech applications. However, it has been shown that an ASR system that performs well in the laboratory may become very vulnerable in a real telephony environment, and robustness is a make-or-break issue for commercial ASR systems. In this study, we present our recent progress in improving the performance of Mandarin telephony ASR.

Chinese is a tonal language, and tone information is important for Mandarin ASR. However, the filtering effect of telephone channels increases errors when traditional pitch extraction methods are applied to telephony speech, which hinders high-performance ASR. We adopt an improved unbiased autocorrelation function (ACF) and integrate the ACF intensity with a statistical voiced/unvoiced (V/U) decision in pitch path tracking. This reduces the V/U error to 24% of that of the traditional method. The word error rate (WER) decreases by 6.5% relative in isolated word recognition.

A robust speech feature is a prerequisite for high-performance ASR. However, our limited knowledge of speech production and perception prevents us from obtaining a feature set that is independent of channel conditions. Compensation is therefore essential when a channel mismatch exists between training and testing. Channel compensation can be particularly difficult in applications where nonlinear distortion exists, and simple cepstral mean estimates and cepstral filtering methods are unreliable. To address this problem, a quasi-linear channel model is constructed. Using statistical knowledge of clean speech, we propose a maximum-likelihood channel estimation method, which reduces the character error rate (CER) by 20% relative in large-vocabulary Mandarin telephony ASR.
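As a rough illustration of maximum-likelihood channel estimation, the sketch below assumes the simplest possible case: the channel adds a constant bias vector to every cepstral frame, and clean speech follows a known diagonal-covariance Gaussian mixture. The quasi-linear model of the thesis is not described in the abstract, so this bias-only EM procedure is a swapped-in textbook technique, and the function name, mixture parameters, and synthetic data are all hypothetical.

```python
import numpy as np


def estimate_channel_bias(features, weights, means, variances, n_iter=20):
    """EM estimate of an additive bias b, assuming features = clean + b and
    clean frames drawn from a diagonal-covariance GMM (an assumption here)."""
    bias = np.zeros(features.shape[1])
    precision = 1.0 / variances                                   # (K, D)
    resid = features[:, None, :] - means[None, :, :]              # (T, K, D)
    for _ in range(n_iter):
        # E-step: component posteriors of the bias-compensated frames.
        diff = resid - bias
        log_prob = (np.log(weights)
                    - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                    - 0.5 * np.sum(diff ** 2 * precision, axis=2))  # (T, K)
        log_prob -= log_prob.max(axis=1, keepdims=True)
        post = np.exp(log_prob)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: precision-weighted average of the residuals y_t - mu_k.
        num = np.einsum("tk,tkd,kd->d", post, resid, precision)
        den = np.einsum("tk,kd->d", post, precision)
        bias = num / den
    return bias


# Synthetic check: draw "clean" frames from a 3-component GMM, shift them by a
# known bias, and see whether the estimate recovers it.
rng = np.random.default_rng(0)
weights = np.array([0.3, 0.4, 0.3])
means = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, -1.0]])
variances = np.full((3, 2), 0.25)
true_bias = np.array([0.8, -0.5])

component = rng.choice(3, size=2000, p=weights)
clean = means[component] + rng.standard_normal((2000, 2)) * np.sqrt(variances[component])
observed = clean + true_bias

estimated_bias = estimate_channel_bias(observed, weights, means, variances)
compensated = observed - estimated_bias
print("estimated bias:", estimated_bias)
```

Unlike plain cepstral mean subtraction, which removes the utterance mean regardless of what was said, this estimate weights each frame by how well it matches the clean-speech model, so it depends less on the phonetic content of a short utterance.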
To solve the data sparsity problem that occurs in fast compensation, we extend the previous method by introducing a phone-conditioned prior channel distribution and using Bayesian techniques for estimation, which provides an additional 7% relative CER reduction. Unlike previous methods, the novel algorithm works well for both fixed-line channels and compressed wireless channels.

Acoustic adaptation is an essential part of a state-of-the-art ASR system. Based on cascaded linear transform adaptation, we propose a novel parameterization. It effectively reduces the number of transform parameters while maintaining the high precision of the full matrix, which makes the estimation more robust. Full transforms can then be constructed on smaller regression classes, and higher resolution is achieved. It outperforms the previous cascaded method with varying amounts of data.

Finally, we discuss strategies for noise rejection and out-of-vocabulary (OOV) rejection for continuous natural speech input. We use a syllable-based filler model and
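Language: Chinese
Other identifier: 834
Content type: Dissertation
Source URL: [http://ir.ia.ac.cn/handle/173211/5777]
Collection: Graduates_Doctoral Dissertations
Recommended citation
GB/T 7714
Zhang Huayun. Research on the Robustness of Telephone Speech Recognition (电话语音识别鲁棒性研究) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences. 2003.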