Noise-robust voice activity detection (rVAD) - source code, reference VAD for Aurora 2
Description:
An unsupervised segment-based method for robust voice activity detection (rVAD), also known as speech activity detection (SAD), is presented here [1], [2]. The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected using the a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is considered a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends to include voiced and unvoiced sounds as well as likely non-speech parts. Finally, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal to detect voice activity.
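As an illustration of the first pass, the following is a minimal Python sketch of an a posteriori SNR weighted energy difference. It is not the authors' implementation (use the source code below for that); the frame sizes, the noise-floor estimate and the detection threshold are illustrative assumptions.

    import numpy as np

    def snr_weighted_energy_diff(x, fs, frame_len=0.025, frame_shift=0.010):
        """A posteriori SNR weighted energy difference per frame (sketch)."""
        n, s = int(frame_len * fs), int(frame_shift * fs)
        frames = np.lib.stride_tricks.sliding_window_view(x, n)[::s]
        energy = np.sum(frames.astype(float) ** 2, axis=1) + 1e-10
        # Crude noise-energy estimate: mean of the lowest 10% of frame energies.
        noise = np.mean(np.sort(energy)[: max(1, len(energy) // 10)])
        snr_post = energy / noise                  # a posteriori SNR per frame
        diff = np.abs(np.diff(energy, prepend=energy[0]))
        return diff * snr_post                     # weighted energy difference

    if __name__ == "__main__":
        fs = 16000
        t = np.arange(fs) / fs
        x = 0.01 * np.random.randn(fs)             # noise floor
        x[4000:8000] += np.sin(2 * np.pi * 220 * t[4000:8000])  # tonal burst
        d = snr_weighted_energy_diff(x, fs)
        print("high-energy frames:", np.nonzero(d > 10 * np.median(d))[0])

In rVAD itself, segments flagged this way are additionally checked for pitch before being zeroed out as high-energy noise.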
The VAD method has been applied as a preprocessor for speech recognition, speaker identification [3], language identification, age and gender identification [5], human-robot interaction (for social robots) [6], [9], audio archive segmentation, and more. The method performed well in the NIST OpenSAD Challenge [8].
Source code:
Source code in Matlab for rVAD (including rVAD-fast) is available as a zip archive. It is straightforward to use: simply call the function vad.m. Some Matlab functions, and modified versions thereof, from the publicly available VoiceBox toolbox are included with the kind permission of Mike Brookes.
Source code in Python for rVAD-fast is available as a zip archive, as is Python source code for training and testing GMM-UBM and maximum a posteriori (MAP) adaptation based speaker verification.
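A hypothetical usage sketch for the Python release, assuming the archive unpacks to rVAD_fast.py and that the script takes an input wav path and an output label path (one 0/1 label per frame); the file names are placeholders, so consult the archive for the exact interface.

    import subprocess
    import numpy as np

    # Run rVAD-fast on one file (assumed command-line interface).
    subprocess.run(["python", "rVAD_fast.py", "speech.wav", "speech.vad"],
                   check=True)

    labels = np.loadtxt("speech.vad")   # assumed: 1 = speech, 0 = non-speech
    print(f"{labels.mean():.1%} of frames flagged as speech")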
Reference VAD for Aurora 2 database:
The frame-by-frame reference VAD was generated from forced-alignment speech recognition experiments and has been used as a 'ground truth' for evaluating VAD algorithms. Whole-word models were trained on clean speech data for all digits and used to perform forced alignment on the 4004 clean-speech utterances from which all utterances in Test Sets A, B, and C are derived by adding noise. The forced-alignment results, in which '0' and '1' stand for non-speech and speech frames, respectively, are used to set the time boundaries of speech segments and create a frame-based reference VAD. For more details, refer to paper [1]. The generated reference VAD for the test set is available as a zip archive. The forced-alignment-generated reference VAD for the training set of 8440 clean utterances is also available as a zip archive.
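Since both the reference VAD and a system's VAD output are frame-level 0/1 labels, scoring reduces to a frame-wise comparison. A minimal sketch, assuming one label per line in plain-text files (the file names are placeholders):

    import numpy as np

    ref = np.loadtxt("ref_vad.txt").astype(int)  # reference: 1 = speech frame
    hyp = np.loadtxt("sys_vad.txt").astype(int)  # system output under test
    n = min(len(ref), len(hyp))                  # guard against length mismatch
    ref, hyp = ref[:n], hyp[:n]

    miss = np.mean(hyp[ref == 1] == 0)           # speech labelled as non-speech
    false_alarm = np.mean(hyp[ref == 0] == 1)    # non-speech labelled as speech
    print(f"miss rate {miss:.1%}, false-alarm rate {false_alarm:.1%}")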
Other archives: the frame-by-frame results (i.e., VAD outputs) of the advanced front-end VAD for Test Sets A, B, and C are available as a bz archive, as are the results of the variable-frame-rate VAD (shown as 'Proposed' in Table VI of paper [1]) for Test Sets A, B, and C. Forced-alignment labels with timestamps are available as text archives for the training set and for Test Set A [1].
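To use the timestamped labels with frame-based scoring, they can be rasterised onto a frame grid. A sketch under loud assumptions: the three-column 'start end tag' line format, the tag value '1' for speech, and the 10 ms frame shift are all guesses to be adapted to the actual files.

    import numpy as np

    def timestamps_to_frames(path, num_frames, shift=0.010):
        """Convert assumed 'start end tag' lines to frame-level 0/1 labels."""
        labels = np.zeros(num_frames, dtype=int)
        with open(path) as f:
            for line in f:
                start, end, tag = line.split()
                if tag == "1":                    # assumed speech tag
                    a = int(float(start) / shift)
                    b = int(float(end) / shift)
                    labels[a:min(b, num_frames)] = 1
        return labels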
We have presented a systematic comparison of forced-alignment speech recognition and humans for generating reference VAD in [7].
Citation:
[1] Zheng-Hua Tan, Achintya Kr. Sarkar and Najim Dehak, "rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method," Computer Speech and Language, 2019. (Google Scholar)
[2] Z.-H. Tan and B. Lindberg, "Low-complexity variable frame rate analysis for speech recognition and voice activity detection," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 798-807, 2010. (Google Scholar)
Our related work:
[3] O. Plchot, S. Matsoukas, P. Matejka, N. Dehak, J. Ma, S. Cumani, O. Glembek, H. Hermansky, S.H. Mallidi, N. Mesgarani, R. Schwartz, M. Soufifar, Z.-H. Tan, S. Thomas, B. Zhang and X. Zhou, "Developing a Speaker Identification System for the DARPA RATS Project," ICASSP 2013, Vancouver, Canada, May 26-31, 2013. (Google Scholar)
[4] T. Petsatodis, C. Boukis, F. Talantzis, Z.-H. Tan and R. Prasad, "Convex Combination of Multiple Statistical Models with Application to VAD," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 8, pp. 2314-2327, November 2011. (Google Scholar)
[5] S.E. Shepstone, Z.-H. Tan and S.H. Jensen, "Audio-Based Age and Gender Identification to Enhance the Recommendation of TV Content," IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 721-729, 2013.
[6] N.B. Thomsen, Z.-H. Tan, B. Lindberg and S.H. Jensen, "Improving Robustness against Environmental Sounds for Directing Attention of Social Robots," The 2nd Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, Singapore, September 14, 2014.
[7] I. Kraljevski, Z.-H. Tan and M.P. Bissiri, "Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD," Interspeech 2015, Dresden, Germany, September 6-10, 2015.
[8] T. Kinnunen, A. Sholokhov, E. Khoury, D. Thomsen, Md Sahidullah and Z.-H. Tan, "HAPPY Team Entry to NIST OpenSAD Challenge: A Fusion of Short-Term Unsupervised and Segment i-Vector Based Speech Activity Detectors," Interspeech 2016, San Francisco, USA, September 8-12, 2016. PDF
[9] Zheng-Hua Tan, Nicolai Bæk Thomsen, Xiaodong Duan, Evgenios Vlachos, Sven Ewan Shepstone, Morten H. Rasmussen and Jesper Lisby Højvang, "iSocioBot: A Multimodal Interactive Social Robot," International Journal of Social Robotics, vol. 10, no. 1, pp. 5-19, January 2018. (Springer). PDF from Springer Nature Sharing.
Department of Electronic Systems, Aalborg University, Denmark
E-mail: zt@es.aau.dk