
Journal of Software (Ruan Jian Xue Bao) ISSN 1000-9825, CODEN RUXUEW E-mail: jos@iscas.ac.cn
Journal of Software,2020,31(6):1747−1760 [doi: 10.13328/j.cnki.jos.005709] http://www.jos.org.cn
© Institute of Software, Chinese Academy of Sciences. All rights reserved. Tel: +86-10-62562563
Research on End-to-End Sentence-Level Chinese Lip Reading∗
ZHANG Xiao-Bing, GONG Hai-Gang, YANG Fan, DAI Xi-Li
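(School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China)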
Corresponding author: DAI Xi-Li, E-mail: daixili_cs@163.com
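Abstract: In recent years, lip reading has advanced rapidly with the widespread adoption of deep learning. Unlike
traditional methods, deep-learning-based lip reading models usually consist of two parts, both built on neural
networks: feature extraction from images and feature understanding. Based on the characteristics of Chinese lip
reading, the recognition process is divided into two stages: picture-to-pinyin (P2P) recognition and
pinyin-to-Chinese-character (P2CC) recognition. A separate sub-network is designed for each stage; once both
sub-networks have been trained, they are combined and the overall architecture is optimized end to end. Because no
Chinese lip reading dataset is currently available, a semi-automatic method is used to collect CCTVDS, a 20.95 GB
Chinese lip reading dataset that covers 6 months of material from the official CCTV website and contains
14 975 samples. In addition, 269 558 pinyin-to-character samples are collected to pre-train the pinyin-to-character
recognition module. Experiments on CCTVDS show that the proposed ChLipNet achieves a sentence recognition accuracy
of 45.7% and a pinyin sequence recognition accuracy of 58.5%. Moreover, ChLipNet not only speeds up training and
reduces overfitting but also overcomes the ambiguity inherent in Chinese recognition.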
Key words: Chinese lip reading; deep learning; characteristics of the Chinese language; dataset collection and processing; end-to-end model
Chinese Library Classification number: TP18
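Chinese citation format: Zhang XB, Gong HG, Yang F, Dai XL. Research on end-to-end sentence-level Chinese lip reading. Ruan Jian
Xue Bao/Journal of Software, 2020,31(6):1747−1760. http://www.jos.org.cn/1000-9825/5709.htm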
English citation format: Zhang XB, Gong HG, Yang F, Dai XL. Chinese sentence-level lip reading based on end-to-end model. Ruan Jian
Xue Bao/Journal of Software, 2020,31(6):1747−1760 (in Chinese). http://www.jos.org.cn/1000-9825/5709.htm
Chinese Sentence-Level Lip Reading Based on End-to-End Model
ZHANG Xiao-Bing, GONG Hai-Gang, YANG Fan, DAI Xi-Li
(School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China)
Abstract: In recent years, with the wide application of deep learning, lip reading recognition technology has developed
rapidly. Unlike traditional methods, lip reading recognition methods based on deep learning usually use neural
network models for both feature extraction and feature comprehension. According to the characteristics of the Chinese
language, a two-step end-to-end architecture is implemented, in which two deep neural network modules perform
picture-to-pinyin (P2P) and pinyin-to-hanzi (P2CC) recognition, respectively. After the two modules have been trained to
convergence, they are jointly optimized to improve the overall performance. Due to the lack of a Chinese lip reading
dataset, 6 months of daily news broadcasts are collected from China Central Television (CCTV) and semi-automatically
labelled to form CCTVDS, a 20.95 GB dataset with 14 975 samples. In addition, a supplementary dataset with 269 558
samples is collected for the pre-training of P2CC. According to experimental results on CCTVDS, the proposed ChLipNet
achieves 45.7% sentence-level and 58.5% pinyin-level accuracy. In addition, ChLipNet not only accelerates training and
reduces overfitting but also overcomes syntactic ambiguity in the recognition of the Chinese language.
Key words: Chinese lip reading recognition; deep learning; characteristics of Chinese language; data collection and preprocessing;
end-to-end model
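The two-stage decomposition described in the abstract (picture-to-pinyin, then pinyin-to-hanzi) can be illustrated with a toy sketch. This is not ChLipNet and uses no neural network: the homophone table, the bigram scores, the use of toneless pinyin, and the names `pinyin_to_hanzi`, `recognize`, and `fake_p2p` are all hypothetical, chosen only to show why a pinyin sequence is ambiguous and how sentence context can resolve that ambiguity.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical homophone table: each toneless pinyin syllable maps to
# several candidate characters (the intended one is deliberately not first).
CANDIDATES: Dict[str, List[str]] = {
    "ta": ["她", "他", "它"],
    "shi": ["十", "是", "事"],
    "xue": ["雪", "学"],
    "sheng": ["声", "生"],
}

# Hypothetical bigram scores standing in for a learned sequence model.
BIGRAM: Dict[Tuple[str, str], float] = {
    ("他", "是"): 2.0,
    ("是", "学"): 1.5,
    ("学", "生"): 3.0,
    ("雪", "声"): 0.5,
}


def pinyin_to_hanzi(pinyin: Sequence[str]) -> str:
    """Stage 2 (toy): Viterbi search for the best-scoring character sequence."""
    if not pinyin:
        return ""
    # best[c] = (score, path) of the best sequence so far that ends in c
    best = {c: (0.0, c) for c in CANDIDATES[pinyin[0]]}
    for syllable in pinyin[1:]:
        step = {}
        for c in CANDIDATES[syllable]:
            score, path = max(
                ((s + BIGRAM.get((prev, c), 0.0), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[c] = (score, path + c)
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


def recognize(frames, p2p: Callable, p2cc: Callable = pinyin_to_hanzi) -> str:
    """Compose stage 1 (frames -> pinyin) with stage 2 (pinyin -> characters)."""
    return p2cc(p2p(frames))


def fake_p2p(frames) -> List[str]:
    """Stand-in for stage 1; a real model would predict this from lip images."""
    return ["ta", "shi", "xue", "sheng"]


print(recognize(None, fake_p2p))  # prints 他是学生
```

Picking the first candidate for each syllable independently would give 她十雪声, while scoring the whole sequence recovers 他是学生 ("he is a student"). In the sketch the two stages are only chained at inference time; the joint end-to-end optimization mentioned in the abstract is not modelled here.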
∗ Foundation item: National Natural Science Foundation of China (61572113)
Received: 2018-05-10; Revised: 2018-09-04; Accepted: 2018-11-16