Noise-robust voice activity detection (rVAD) - source code, reference VAD for Aurora 2
Description:
An unsupervised segment-based method for robust voice activity detection (rVAD), also known as speech activity detection (SAD), is presented here [1], [2]. The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected using the a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is considered a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends to include voiced and unvoiced sounds as well as likely non-speech parts. Finally, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal to detect voice activity.
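As an illustration of the first pass, the following is a minimal Python sketch of an a posteriori SNR weighted energy difference. It is not the authors' implementation (use the source code below for that); the frame sizes, the noise-floor estimate and the detection threshold are illustrative assumptions.

    import numpy as np

    def snr_weighted_energy_diff(x, fs, frame_len=0.025, frame_shift=0.010):
        """A posteriori SNR weighted energy difference per frame (sketch)."""
        n, s = int(frame_len * fs), int(frame_shift * fs)
        frames = np.lib.stride_tricks.sliding_window_view(x, n)[::s]
        energy = np.sum(frames.astype(float) ** 2, axis=1) + 1e-10
        # Crude noise-energy estimate: mean of the lowest 10% of frame energies.
        noise = np.mean(np.sort(energy)[: max(1, len(energy) // 10)])
        snr_post = energy / noise                  # a posteriori SNR per frame
        diff = np.abs(np.diff(energy, prepend=energy[0]))
        return diff * snr_post                     # weighted energy difference

    if __name__ == "__main__":
        fs = 16000
        t = np.arange(fs) / fs
        x = 0.01 * np.random.randn(fs)             # noise floor
        x[4000:8000] += np.sin(2 * np.pi * 220 * t[4000:8000])  # tonal burst
        d = snr_weighted_energy_diff(x, fs)
        print("high-energy frames:", np.nonzero(d > 10 * np.median(d))[0])

In rVAD itself, segments flagged this way are additionally checked for pitch before being zeroed out as high-energy noise.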
The VAD method has been applied as a preprocessor for speech recognition, speaker identification [3], language identification, age and gender identification [5], human-robot interaction (for social robots) [6], [9], audio archive segmentation, and more. The method performed well in the NIST OpenSAD Challenge [8].
Source code:
Source code in Matlab for rVAD (including rVAD-fast) is available as a zip archive. It is straightforward to use: simply call the function vad.m. Some Matlab functions, and modified versions thereof, from the publicly available VoiceBox toolbox are included with the kind permission of Mike Brookes.
Source code in Python for rVAD-fast is available as a zip archive, as is Python source code for training and testing GMM-UBM and maximum a posteriori (MAP) adaptation based speaker verification.
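A hypothetical usage sketch for the Python release, assuming the archive unpacks to rVAD_fast.py and that the script takes an input wav path and an output label path (one 0/1 label per frame); the file names are placeholders, so consult the archive for the exact interface.

    import subprocess
    import numpy as np

    # Run rVAD-fast on one file (assumed command-line interface).
    subprocess.run(["python", "rVAD_fast.py", "speech.wav", "speech.vad"],
                   check=True)

    labels = np.loadtxt("speech.vad")   # assumed: 1 = speech, 0 = non-speech
    print(f"{labels.mean():.1%} of frames flagged as speech")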
Reference VAD for Aurora 2 database:
The frame-by-frame reference VAD was generated from forced-alignment speech recognition experiments and has been used as a 'ground truth' for evaluating VAD algorithms. Whole-word models were trained on clean speech data for all digits and used to perform forced alignment on the 4004 clean-speech utterances from which all utterances in Test Sets A, B, and C are derived by adding noise. The forced-alignment results, in which '0' and '1' stand for non-speech and speech frames, respectively, are used to set the time boundaries of speech segments and create a frame-based reference VAD. For more details, refer to paper [1]. The generated reference VAD for the test set is available as a zip archive. The forced-alignment-generated reference VAD for the training set of 8440 clean utterances is also available as a zip archive.
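Since both the reference VAD and a system's VAD output are frame-level 0/1 labels, scoring reduces to a frame-wise comparison. A minimal sketch, assuming one label per line in plain-text files (the file names are placeholders):

    import numpy as np

    ref = np.loadtxt("ref_vad.txt").astype(int)  # reference: 1 = speech frame
    hyp = np.loadtxt("sys_vad.txt").astype(int)  # system output under test
    n = min(len(ref), len(hyp))                  # guard against length mismatch
    ref, hyp = ref[:n], hyp[:n]

    miss = np.mean(hyp[ref == 1] == 0)           # speech labelled as non-speech
    false_alarm = np.mean(hyp[ref == 0] == 1)    # non-speech labelled as speech
    print(f"miss rate {miss:.1%}, false-alarm rate {false_alarm:.1%}")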
Other archives: the frame-by-frame results (i.e., VAD outputs) of the advanced front-end VAD for Test Sets A, B, and C are available as a bz archive, as are the results of the variable-frame-rate VAD (shown as 'Proposed' in Table VI of paper [1]) for Test Sets A, B, and C. Forced-alignment labels with timestamps are available as text archives for the training set and for Test Set A [1].
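To use the timestamped labels with frame-based scoring, they can be rasterised onto a frame grid. A sketch under loud assumptions: the three-column 'start end tag' line format, the tag value '1' for speech, and the 10 ms frame shift are all guesses to be adapted to the actual files.

    import numpy as np

    def timestamps_to_frames(path, num_frames, shift=0.010):
        """Convert assumed 'start end tag' lines to frame-level 0/1 labels."""
        labels = np.zeros(num_frames, dtype=int)
        with open(path) as f:
            for line in f:
                start, end, tag = line.split()
                if tag == "1":                    # assumed speech tag
                    a = int(float(start) / shift)
                    b = int(float(end) / shift)
                    labels[a:min(b, num_frames)] = 1
        return labels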
We have presented a systematic comparison of forced-alignment speech recognition and humans for generating reference VAD in [7].
Citation:
[1] Zheng-Hua Tan, Achintya Kr. Sarkar and Najim Dehak, "rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method," Computer Speech and Language, 2019. (Google Scholar)
[2] Z.-H. Tan and B. Lindberg, "Low-complexity variable frame rate analysis for speech recognition and voice activity detection," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 798-807, 2010. (Google Scholar)
Our related work:
[3] O. Plchot, S. Matsoukas, P. Matejka, N. Dehak, J. Ma, S. Cumani, O. Glembek, H. Hermansky, S.H. Mallidi, N. Mesgarani, R. Schwartz, M. Soufifar, Z.-H. Tan, S. Thomas, B. Zhang and X. Zhou, "Developing a Speaker Identification System for the DARPA RATS Project," ICASSP 2013, Vancouver, Canada, May 26-31, 2013. (Google Scholar)
[4] T. Petsatodis, C. Boukis, F. Talantzis, Z.-H. Tan and R. Prasad, "Convex Combination of Multiple Statistical Models with Application to VAD," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 8, pp. 2314-2327, November 2011. (Google Scholar)
[5] S.E. Shepstone, Z.-H. Tan and S.H. Jensen, "Audio-Based Age and Gender Identification to Enhance the Recommendation of TV Content," IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 721-729, 2013.
[6] N.B. Thomsen, Z.-H. Tan, B. Lindberg and S.H. Jensen, "Improving Robustness against Environmental Sounds for Directing Attention of Social Robots," The 2nd Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, Singapore, September 14, 2014.
[7] I. Kraljevski, Z.-H. Tan and M.P. Bissiri, "Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD," Interspeech 2015, Dresden, Germany, September 6-10, 2015.
[8] T. Kinnunen, A. Sholokhov, E. Khoury, D. Thomsen, Md Sahidullah and Z.-H. Tan, "HAPPY Team Entry to NIST OpenSAD Challenge: A Fusion of Short-Term Unsupervised and Segment i-Vector Based Speech Activity Detectors," Interspeech 2016, San Francisco, USA, September 8-12, 2016. PDF
[9] Zheng-Hua Tan, Nicolai Bæk Thomsen, Xiaodong Duan, Evgenios Vlachos, Sven Ewan Shepstone, Morten H. Rasmussen and Jesper Lisby Højvang, "iSocioBot: A Multimodal Interactive Social Robot," International Journal of Social Robotics, vol. 10, no. 1, pp. 5-19, January 2018. (Springer). PDF from Springer Nature Sharing.
Department of Electronic Systems, Aalborg University, Denmark
E-mail: zt@es.aau.dk