Research


Speaker recognition (identification) is a behavioral biometric technology that automatically identifies a person from the characteristics of his or her voice. It has wide applications in information security, forensic identification, human-computer interaction, and so on. Although state-of-the-art speaker verification systems achieve a low EER (equal error rate) on several academic databases, such as the NIST SREs and RSR2015, their performance in real applications is rarely satisfactory due to mismatched content, communication channels, languages, background noise, vocal effort, health conditions, and so on. For higher recognition accuracy and more robust applications, we attack this challenging task with discriminant analysis, local learning, deep neural networks, and joint modeling.
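
As background, the EER mentioned above is the operating point at which a verification system's false-acceptance rate equals its false-rejection rate. The following is a minimal illustrative sketch (not code from our systems) of how it can be estimated from target and non-target trial scores:

```python
# Minimal EER estimation sketch; illustrative only, not our lab's evaluation code.
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Return the EER: the point where false-accept and false-reject rates meet."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]  # sweep the threshold through sorted scores
    # Targets below the threshold are false rejects;
    # non-targets at or above it are false accepts.
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1.0 - labels) / (1.0 - labels).sum()
    idx = np.argmin(np.abs(far - frr))  # closest approach of the two curves
    return (far[idx] + frr[idx]) / 2.0

# Toy usage with synthetic scores: genuine trials score higher on average.
rng = np.random.default_rng(0)
print(f"EER = {compute_eer(rng.normal(2, 1, 1000), rng.normal(0, 1, 1000)):.2%}")
```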

Spoken language (or dialect, or accent) recognition is a branch of speech signal processing whose goal is to automatically identify the language of a speech segment. It can be used in many fields, such as multilingual speech recognition, audio indexing, audio retrieval, information security, and so on. Apart from difficulties similar to those in speaker recognition, some languages (or dialects, or accents) are easily confused, for example, Hindi vs. Indian English, or American English vs. British English. We try to tell these apart by acoustic, phonetic, or prosodic characteristics.

Undoubtedly, automatic speech recognition, speaker recognition, and other speech-based recognition technologies are currently hot topics in the field of audio signal processing. As a matter of fact, apart from speech, audio contains more information than you might think. For example, we can diagnose the running state of a machine from its sound, identify whether a baby is crying, or recognize a scene from its characteristic sounds. We call an audio segment associated with a concept of interest an audio event, and audio event detection is the task of detecting such events in an audio stream. It is a rather challenging task for several reasons. First, events are user-defined, and different audio events vary considerably. Second, it is a detection task, not a classification task: in most cases the timestamps of an audio event are required. Third, overlap between different audio events is very common. Fourth, some audio events are rare; in extreme cases there is only a single example. Finally, we are often short of annotations. So far we lack a unified and appropriate theory (algorithm, or method) to solve it, and we handle it case by case. Given the importance of audio event detection, we believe more and more researchers from academia and industry will turn their attention to this interesting field.
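
To make the detection (as opposed to classification) aspect concrete, here is a small hypothetical post-processing sketch that converts per-frame event posteriors, produced by any classifier, into timestamped events; the 0.5 threshold, 11-frame median smoothing, and 10 ms frame hop are illustrative assumptions, not settings from our systems:

```python
# Hypothetical post-processing: per-frame posteriors -> timestamped events.
import numpy as np
from scipy.ndimage import median_filter

def posteriors_to_events(posteriors, threshold=0.5, hop_seconds=0.01, smooth=11):
    """Binarize posteriors, median-smooth the mask, and emit (onset, offset) times."""
    mask = median_filter((posteriors > threshold).astype(int), size=smooth)
    # Rising/falling edges of the binary mask mark event onsets/offsets.
    edges = np.diff(mask, prepend=0, append=0)
    onsets = np.flatnonzero(edges == 1)
    offsets = np.flatnonzero(edges == -1)
    return [(on * hop_seconds, off * hop_seconds) for on, off in zip(onsets, offsets)]

# Toy usage: a synthetic posterior track active over frames 120-259.
post = np.zeros(500)
post[120:260] = 0.9
print(posteriors_to_events(post))  # roughly [(1.2, 2.6)]
```

Overlapping events (the third difficulty above) are usually handled by running such post-processing independently per event class, one posterior track per class.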

We are also interested in, and do research on, voice activity detection, speech enhancement, audio indexing, duplicate audio detection, continuous speech recognition, and keyword spotting.

People


Principal Investigator

Liang He

Associate Professor
Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Communication Engineering, Civil Aviation University of China, Tianjin, China

M.S., Information & Communication Engineering, Zhejiang University, Hangzhou, China

Ph.D., Electronic Engineering, Tsinghua University, Beijing, China

WORKING EXPERIENCE

2011-2013, Postdoctoral fellow, Electronic Engineering, Tsinghua University, Beijing, China

2013-2018, Assistant Professor, Electronic Engineering, Tsinghua University, Beijing, China

2018-, Associate Professor, Electronic Engineering, Tsinghua University, Beijing, China

RESEARCH INTERESTS

Speaker recognition, speaker diarization, language recognition, audio event detection, voice activity detection, speech enhancement and audio indexing.


Xianghong Chen

Postdoctoral fellow
Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Electronic Science and Technology, Xiamen University, Xiamen, China

Ph.D., Microelectronics and Solid-State Electronics, Institute of Microelectronics, University of Chinese Academy of Sciences, Beijing, China

BIOGRAPHY

I mainly focus on research in speaker recognition and speaker diarization: using a variational autoencoder to disentangle content and speaker information in audio, exploring attention mechanisms for speaker information extraction, and applying the latent class model to speaker diarization.

RESEARCH INTERESTS

Speaker diarization, speaker recognition


Yi Liu

Ph.D. candidate
Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 5111, Beijing

EDUCATION

B.S., Communication Engineering, Wuhan University, Wuhan, China

M.S., Information & Communication Engineering, Tsinghua University, Beijing, China

BIOGRAPHY

I'm a Ph.D. student focusing on speaker verification. I'm now interested in speaker embedding methods based on deep neural networks. My work involves different stages of the neural verification pipeline, from feature extraction to loss function design. I'm also working on an open-source speaker verification toolkit that aims to reproduce most of my work.

RESEARCH INTERESTS

Text-dependent and text-independent speaker verification, speaker diarization.


Can Xu

Research and development engineer
Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Electrical Engineering and Automation, Ludong University, Yantai, China

M.S., Pattern Recognition and Intelligent System, Tianjin University of Science and Technology, Tianjin, China

BIOGRAPHY

My main research work is to apply new technologies to improve the performance of recognition systems. I use acoustic model information to extract speaker and language information, study the application of end-to-end systems to speaker recognition and speech recognition, and explore the latent class analysis architecture for audio diarization. Through these technologies, significant improvements have been achieved in tasks such as speaker/language recognition and audio diarization.

RESEARCH INTERESTS

Audio diarization, speaker and language recognition, end-to-end systems


Tianyu Liang

Research and development engineer
Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., School of Mathematical Sciences, Beijing Normal University, Beijing, China

BIOGRAPHY

My work focuses on both text-dependent and text-independent speaker recognition. I'm now interested in end-to-end systems and other algorithms related to neural networks. I also do research on voice activity detection based on convolutional neural networks and duplicate audio detection based on audio fingerprinting.

RESEARCH INTERESTS

Speaker recognition, voice activity detection, audio indexing, duplicate audio detection


Wenhao Ding


Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Electronic Engineering, Tsinghua University, Beijing, China

BIOGRAPHY

Mr. Ding is interested in both theoretical and applied areas of machine learning. He mainly focuses on deep generative models and representation learning, especially Generative Adversarial Networks and Variational Auto-Encoders. He also follows several tracks of robotics applications, such as robot perception, multi-agent collaboration, and autonomous driving. For his undergraduate dissertation, he worked on speaker verification, which broadened his research field toward speech signal processing. He is now working on acoustic event detection and classification at Tsinghua University.

RESEARCH INTERESTS

Multi-agent collaboration, autonomous driving, speaker verification, audio event detection and classification


Zhixuan Li

M.S. candidate
Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Information Engineering, Southeast University, Nanjing, China

BIOGRAPHY

My work focuses on speaker recognition using phonetic information, or more generally, joint recognition of speech and speaker. This is a classic yet forward-looking research direction, since it may shed new light on how human beings separate and distill speaker traits intermingled with speech content. Deep learning, unsupervised learning, and sequence learning are powerful tools in this research, and I use Kaldi and TensorFlow to implement and validate my ideas.

RESEARCH INTERESTS

Speaker recognition, acoustic model, unsupervised learning


Yaoguang Wang

M.S. candidate
Department of Electronic Engineering
Tsinghua University


Contact

Address: Rohm Building 8101, Beijing

EDUCATION

B.S., Communication Engineering, Harbin Institute of Technology, Harbin, China

BIOGRAPHY

I come from Suzhou, Anhui, in central China. I'm introverted like a southerner and passionate like a northerner. I'm usually quiet and don't talk much, but I'm full of enthusiasm for life and work, and sincere with people. I don't have any special talents; in my free time I like to watch anime and all kinds of ball games, browse WeChat Moments, and occasionally play basketball. My ideal is to work close to home.

RESEARCH INTERESTS

Neural network, deep learning, audio event detection, speech recognition

Publications


Liang He, Xianhong Chen, Can Xu, Yi Liu, Jia Liu and Michael T. Johnson, “Latent class model with application to speaker diarization,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2019, no. 1, p. 12, Jul. 2019.

Xianhong Chen, Liang He, Can Xu and Jia Liu, “Distance-Dependent Metric Learning,” IEEE Signal Processing Letters, Feb. 2019, 26(2), 357-361.

Liang He, Xianhong Chen, Can Xu, and Jia Liu, “Multi-objective Optimization Training of PLDA for Speaker Verification,” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 6026-6030. [code].

Yi Liu, Liang He and Jia Liu, “Large Margin Softmax Loss for Speaker Verification,” INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, Accepted. [code].

Zhixuan Li, Liang He, Jingyang Li, Li Wang and Weiqiang Zhang, “Towards Discriminative Representations and Unbiased Predictions: Class-specific Angular Softmax for Speech Emotion Recognition,” INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, Accepted.

Jingyang Zhang, Wenhao Ding, Jintao Kang and Liang He, “Multi-Scale Time-Frequency Attention for Rare Sound Event Detection,” INTERSPEECH 2019, 20th Annual Conference of the International Speech Communication Association, Accepted.

Can Xu, Xianhong Chen, Liang He and Jia Liu, “Geometric Discriminant Analysis for I-vector Based Speaker Verification,” 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Accepted.

Liang He, Xianhong Chen, Can Xu and Jia Liu, “Subtraction-Positive Similarity Learning,” 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Accepted.

Liang He, Xianhong Chen, Can Xu, Jia Liu and Michael T. Johnson, “Local Pairwise Linear Discriminant Analysis for Speaker Verification,” IEEE Signal Processing Letters, Oct. 2018, 25(10), 1575-1579. [code].

Wenhao Ding and Liang He, “MTGAN: Speaker Verification through Multitasking Triplet Generative Adversarial Networks,” INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association, 2-6 September 2018, Hyderabad, 3633-3637.

Yi Liu, Liang He, Jia Liu and Michael T. Johnson, “Speaker Embedding Extraction with Phonetic Information,” INTERSPEECH 2018, 19th Annual Conference of the International Speech Communication Association, 2-6 September 2018, Hyderabad, 2247-2251. [code].

Liang He, Xianhong Chen, Can Xu and Jia Liu, “Latent Class Model for Single Channel Speaker Diarization,” Odyssey 2018 The Speaker and Language Recognition Workshop, 26-29 June 2018, Les Sables d'Olonne, France, 128-133.

Xianhong Chen, Liang He, Can Xu, Yi Liu, Tianyu Liang and Jia Liu, “VB-HMM Speaker Diarization with Enhanced and Refined Segment Representation,” Odyssey 2018 The Speaker and Language Recognition Workshop, 26-29 June 2018, Les Sables d'Olonne, France, 134-139.

Xukui Yang, Liang He, Dan Qu and Weiqiang Zhang, “Semi-supervised minimum redundancy maximum relevance feature selection for audio classification,” Multimedia Tools and Applications 77(1), 713-739.

Tianyu Liang, Xianhong Chen, Can Xu and Liang He, “Parallel Double Audio Fingerprinting,” 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei City, Taiwan, 2018, pp. 344-348.

Yi Liu, Liang He, Weiwei Liu and Jia Liu, “Exploring a Unified Attention-Based Pooling Framework for Speaker Verification,” 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei City, Taiwan, 2018, pp. 200-204.

Yi Liu, Liang He, Weiqiang Zhang, Jia Liu and Michael T. Johnson, “Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification,” 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 2018, pp. 1467-1472.

Liang He, Xianhong Chen, Can Xu, Tianyu Liang and Jia Liu, “Ivec-PLDA-AHC priors for VB-HMM speaker diarization system,” 2017 IEEE International Workshop on Signal Processing Systems (SiPS), Lorient, 2017, 1-6.

Yao Tian, Liang He, Meng Cai, Weiqiang Zhang and Jia Liu, “Deep neural networks based speaker modeling at different levels of phonetic granularity,” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, 5440-5444.

Junbiao Liu, Xinyu Jin, Fang Dong, Liang He and Hong Liu, “Fading channel modelling using single-hidden layer feedforward neural networks,” Multidimensional Systems and Signal Processing 28(3), 885-903.

Yi Liu, Liang He, Yao Tian, Zhuzi Chen, Jia Liu and Michael T. Johnson, “Comparison of multiple features and modeling methods for text-dependent speaker verification,” 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, 2017, 629-636.

Yao Tian, Meng Cai, Liang He, Weiqiang Zhang and Jia Liu, “Improving deep neural networks based speaker verification using unlabeled data,” INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association, 1863-1867.

Liang He, Yao Tian, Yi Liu, Jiaming Xu, Weiwei Liu, Meng Cai and Jia Liu, “THU-EE system description for NIST LRE 2015,” INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association, 3294-3298.

Yi Liu, Yao Tian, Liang He and Jia Liu, “Investigating various diarization algorithms for speaker in the wild (SITW) speaker recognition challenge,” INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association, 853-857.

Liang He, Yao Tian, Yi Liu, Fang Dong, Weiqiang Zhang and Jia Liu, “A study of variational method for text-independent speaker recognition,” 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, 2016, 1-5.

Xukui Yang, Liang He, Dan Qu, Weiqiang Zhang and Michael T. Johnson, “Semi-supervised feature selection for audio classification based on constraint compensated Laplacian score,” EURASIP Journal on Audio, Speech, and Music Processing.

Xukui Yang, Liang He, Dan Qu and Weiqiang Zhang, “Voice activity detection algorithm based on long-term pitch information,” EURASIP Journal on Audio, Speech, and Music Processing.

Yao Tian, Meng Cai, Liang He and Jia Liu, “Speaker recognition system based on deep neural networks and bottleneck features,” Journal of Tsinghua University 56(11), 1143-1148.

Fang Dong, Junbiao Liu, Liang He, Xiaohui Hu and Hong Liu, “Channel Estimation Based on Extreme Learning Machine for High Speed Environments,” Proceedings of ELM-2015, Vol. 1: Theory, Algorithms and Applications (I) 6, 159-167.

Liang He, Weiqiang Zhang and Mengnan Shi, “Channel Non-negative Tensor Factorization for Speech Enhancement,” Proceedings of the 2016 International Conference on Artificial Intelligence: Technologies and Applications.

Liang He and Jia Liu, “PRISM: A statistical modeling framework for text-independent speaker verification,” 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, 2015, 529-533.

Like Hui, Meng Cai, Cong Guo, Liang He, Weiqiang Zhang and Jia Liu, “Convolutional maxout neural networks for speech separation,” 2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Abu Dhabi, 2015, 24-27.

Yao Tian, Meng Cai, Liang He and Jia Liu, “Investigation of bottleneck features and multilingual deep neural networks for speaker verification,” INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, 1151-1155.

Yao Tian, Liang He and Jia Liu, “Stacked bottleneck features for speaker verification,” 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, 2015, 514-518.

Yi Liu, Yao Tian, Liang He, Jia Liu and Michael T. Johnson, “Simultaneous utilization of spectral magnitude and phase information to extract supervectors for speaker verification anti-spoofing,” INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, 2082-2086.

Weiqiang Zhang, Cong Guo, Qian Zhang, Jian Kang, Liang He, Jia Liu and Michael T. Johnson, “A speech enhancement algorithm based on computational auditory scene analysis,” Journal of Tsinghua University 48(8), 663-669.

Yao Tian, Liang He, Zhiyi Li, Weilan Wu, Weiqiang Zhang and Jia Liu, “Speaker verification using Fisher vector,” The 9th International Symposium on Chinese Spoken Language Processing, Singapore, 2014, 419-422.

Zhiyi Li, Liang He, Weiqiang Zhang and Jia Liu, “Total variability subspace adaptation based speaker recognition,” Acta Automatica Sinica 40(8), 1836-1840.

Yi Liu, Liang He and Jia Liu, “Improved multitaper PNCC feature for robust speaker verification,” The 9th International Symposium on Chinese Spoken Language Processing, Singapore, 2014, 168-172.

Liang He and Jia Liu, “I-matrix for text-independent speaker recognition,” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, 7194-7198.

Weiwei Liu, Weiqiang Zhang, Liang He, Jiaming Xu and Jia Liu, “THUEE system for the Albayzin 2012 language recognition evaluation,” 2013 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, 2013, 109-112.

Liang He and Jia Liu, “Orthogonal subspace combination based on the joint factor analysis for text-independent speaker recognition,” Lecture Notes in Computer Science, vol. 7701, Springer, Berlin, Heidelberg.

Liang He and Jia Liu, “Discriminant local information distance preserving projection for text-independent speaker recognition,” 2012 8th International Symposium on Chinese Spoken Language Processing (ISCSLP 2012), 349-352.

Zhiyi Li, Liang He, Weiqiang Zhang and Jia Liu, “Speaker recognition based on discriminant i-vector local distance preserving projection,” Journal of Tsinghua University 52(5), 598-601.

Liang He, Yi Yang and Jia Liu, “TLS-NAP algorithm for text-independent speaker recognition,” Pattern Recognition and Artificial Intelligence 25(6), 916-921.

Liang He, Yongzhe Shi and Jia Liu, “Eigenchannel space combination method of joint factor analysis,” Acta Automatica Sinica 37(7), 849-856.