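Title: Robust Speaker Recognition in Telephone Speech Environments (电话语音环境的鲁棒说话人识别)
Author: Zheng Rong (郑榕)
Degree type: Doctor of Engineering
Defense date: 2007-05-27
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisors: Xu Bo (徐波); Zhang Shuwu (张树武)
Keywords: speaker identification; speaker verification; GMM-UBM; SVM; normalization algorithm; quality measurement; speaker segmentation
Alternative title: Research on Robust Speaker Recognition over Telephone
Degree discipline: Pattern Recognition and Intelligent Systems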
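Abstract (translated from the Chinese): Research on speaker recognition in telephone speech environments faces many problems that urgently need solving, including channel robustness, speaker variability, and the accept/reject decision. In recent years, researchers have explored probabilistic statistical models and discriminative training frameworks in depth, which matters greatly for bringing speaker recognition into practical use. To improve the performance and robustness of speaker recognition systems in telephone speech environments, this dissertation studies the use of Gaussian mixture component information, feature processing and output score processing, speaker recognition with quality measure estimation, and speaker segmentation, tracking, and detection in conversational speech. The main contributions are:
1. The Gaussian mixture model framework is studied and two improvements are proposed. First, based on an experimental analysis of outlier frames and confusable frames, a nonlinear post-processing method for frame-level likelihood scores is proposed. The method suppresses score differences for the same speaker across adjacent frames while widening the score gap between different speakers on the same feature vector. Second, in the GMM-UBM speaker verification system, an approximation to the conventional likelihood ratio score is derived in order to exploit detailed information about individual Gaussian components, yielding speaker verification based on a Gaussian-component-information likelihood ratio.
2. In speaker recognition systems for telephone channel applications, mismatch between training and test environments causes recognition performance to degrade sharply. The dissertation proposes compensating for acoustic mismatch through both feature normalization and score normalization. First, segment-based cepstral mean and variance normalization is improved so that all cepstral coefficients are normalized to the same within-segment Gaussian distribution, which improves how well feature parameters match across environmental conditions. Second, because different speakers and different test environments change the distribution of output scores, zero normalization and test normalization are combined to transform the output scores, giving a two-stage score normalization method that makes the speaker-independent decision threshold more robust under mismatched conditions. Finally, the score normalization idea is applied to a speaker identification system based on MFCC and prosodic features, and its effectiveness is analyzed experimentally.
3. To address limitations of the formula for computing similarity scores between speech feature vectors and speaker models, speaker recognition with quality measure estimation is proposed. It targets the problem that the output score treats all feature vectors equally, which limits recognition performance. A Gaussian mixture quality reference model is built for each speaker, the quality measure of the test speech is estimated, and from it a contribution weight for the output score is obtained, so that the score computation fits the data better. In addition, to improve the discriminability of the quality measure and to reduce computation, divergence distance and clustering-based vector pre-quantization are considered, giving the system a higher recognition rate.
4. By analyzing the main characteristics of conversational speech in real environments, speaker segmentation and clustering is combined with speaker recognition to design and implement a speaker detection system for complex speech environments. The system applies audio preprocessing, automatic segmentation and clustering of conversational speech, single-speaker recognition, and two-speaker recognition to segment and recognize speakers in large volumes of real telephone speech. The performance of each module and of the whole system is analyzed on several telephone speech datasets, and the results indicate good prospects for application.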
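The two-stage score normalization in item 2 combines zero normalization (Z-norm) and test normalization (T-norm). The abstract does not give the exact formulation, so the following is only a minimal sketch of the commonly used Z-norm-then-T-norm chain, not the dissertation's method. The function names, the argument layout, and the `eps` variance floor are assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_norm(raw_score, impostor_scores, eps=1e-8):
    """Zero normalization: standardize a score using the scores that the
    same speaker model gives to a set of impostor utterances."""
    mu = mean(impostor_scores)
    sigma = max(pstdev(impostor_scores), eps)  # floor avoids division by zero
    return (raw_score - mu) / sigma


def t_norm(score, cohort_scores, eps=1e-8):
    """Test normalization: standardize a score using the scores that a
    cohort of other speaker models gives to the same test utterance."""
    mu = mean(cohort_scores)
    sigma = max(pstdev(cohort_scores), eps)
    return (score - mu) / sigma


def two_stage_norm(raw_score, target_impostor_scores,
                   cohort_raw_scores, cohort_impostor_scores):
    """Hypothetical two-stage chain: Z-norm the target score and every
    cohort score with its own model's impostor statistics, then T-norm
    the target score against the Z-normed cohort scores."""
    z_target = z_norm(raw_score, target_impostor_scores)
    z_cohort = [
        z_norm(score, impostors)
        for score, impostors in zip(cohort_raw_scores, cohort_impostor_scores)
    ]
    return t_norm(z_target, z_cohort)
```

After both stages the score is expressed relative to impostor behavior for the model and for the test utterance, which is what allows a single speaker-independent threshold to be applied.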
English abstract: Speaker recognition in telephone environments poses several pressing challenges, including channel robustness, speaker variability, and decision-making. In recent years, novel techniques and algorithms have been proposed for speaker recognition based on statistical or discriminative frameworks, which is significant for practical applications. To improve the performance of speaker recognition systems over the telephone, this dissertation focuses on the use of Gaussian component information, feature and score normalization, quality-measure-based score computation, and speaker segmentation and multi-speaker recognition.
1. We investigate Gaussian mixture model based speaker recognition. Based on an error analysis of error-prone and confusable frames, a frame-level nonlinear score normalization is proposed for the speaker identification task. It restrains the likelihood difference between adjacent frames while enlarging the score difference between speakers on the same speech frame. In GMM-UBM based speaker verification, we propose exploiting detailed component-specific Gaussian information in generative likelihood ratio estimation.
2. To address the significant performance deterioration caused by mismatch between training and testing acoustic conditions, two compensation approaches are presented, based on feature normalization and score normalization respectively. First, segment-based cepstral mean and variance normalization is modified to normalize the cepstral coefficients to a similar segmental Gaussian distribution, improving the match across environmental conditions. Second, to cope with score variability among speakers and test utterances, a two-stage score normalization technique is presented that transforms the output scores and makes the speaker-independent decision threshold more robust under adverse conditions. Finally, we study the score normalization method applied to speaker identification based on MFCC and prosodic features, where it improves identification accuracy.
3. A quality measure algorithm using Gaussian mixture densities is presented for the traditional GMM-UBM scoring mechanism. Using GMM-based quality models, the proposed method applies soft estimates of quality as weighting factors in score computation. Estimating quality this way has the advantage of potentially exploiting broad, phonetically specific speaker characteristics through GMM modeling. A Jensen divergence measure for quality estimation and clustering-based vector pre-quantization are incorporated to reduce redundancy in the speech signal and the computational load. Comparison experiments show the effectiveness of the proposed method.
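For readers unfamiliar with the feature normalization in item 2, the following is a minimal sketch of plain segment-based cepstral mean and variance normalization. The abstract does not describe how the dissertation modifies it, so this shows only the unmodified baseline. The function name, the fixed non-overlapping segments, the default `segment_len`, and the `eps` floor are assumptions made for illustration.

```python
from statistics import mean, pstdev


def segmental_cmvn(frames, segment_len=300, eps=1e-8):
    """Normalize each cepstral coefficient to zero mean and unit variance
    within consecutive, non-overlapping segments of `segment_len` frames.

    `frames` is a list of feature vectors (lists of floats), one per frame.
    """
    normalized = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        columns = list(zip(*segment))  # one tuple per cepstral coefficient
        mus = [mean(col) for col in columns]
        sigmas = [max(pstdev(col), eps) for col in columns]  # floor avoids /0
        for frame in segment:
            normalized.append(
                [(x - mu) / sigma for x, mu, sigma in zip(frame, mus, sigmas)]
            )
    return normalized
```

Normalizing per segment rather than per utterance lets the statistics follow slow changes in the channel, at the cost of noisier estimates when segments are short.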
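Language: Chinese
Other identifier: 200418014628083
Content type: Dissertation
Source URL: [http://ir.ia.ac.cn/handle/173211/5977]
Collection: Graduates_Doctoral Dissertations
Recommended citation
GB/T 7714
Zheng Rong. Robust Speaker Recognition in Telephone Speech Environments [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences. 2007.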