What is a good approach for extracting portions of speech from an arbitrary audio file?
EnergyDetector
For Voice Activity Detection (VAD), I have been using the EnergyDetector program from the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, which is based on the ALIZE library.
It works with feature files, not with audio files, so you'll need to extract the energy of the signal first. I usually extract cepstral features (MFCC) together with the log-energy parameter, and I use that parameter for VAD. You can use sfbcep, a utility from the SPro signal processing toolkit, in the following way:
sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm
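This extracts 19 MFCCs, the log-energy coefficient, and the first- and second-order delta coefficients. Counting from zero, the energy coefficient sits at index 19 (after the cepstral coefficients 0 to 18); you specify that index in the EnergyDetector configuration file (featureServerMask 19).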
You will then run EnergyDetector in this way:
EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.
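Once the label file is written, the speech portions still have to be cut out of the original audio. Below is a minimal sketch of that step using only the Python standard library. It assumes the label file (lbl/output.lbl with the configuration in this answer) contains one segment per line in the form "start end label", with times in seconds; that layout is an assumption, so check it against your own output. The function names parse_labels and cut_segments are made up for this illustration.

```python
import wave


def parse_labels(text, wanted="speech"):
    """Return (start, end) pairs in seconds for lines shaped 'start end label'.

    The 'start end label' layout is an assumption about the .lbl file.
    """
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[2] != wanted:
            continue  # skip blank, malformed, or other-label lines
        start, end = float(parts[0]), float(parts[1])
        if end > start:
            segments.append((start, end))
    return segments


def cut_segments(wav_path, segments, out_prefix):
    """Write each (start, end) segment of wav_path to <out_prefix>_NNN.wav."""
    written = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(segments):
            # Clamp the seek position so a label past the end does not raise.
            src.setpos(min(int(start * rate), src.getnframes()))
            frames = src.readframes(int((end - start) * rate))
            out_path = f"{out_prefix}_{i:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # frame count is corrected on close
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

Typical use would be to read lbl/output.lbl into a string, pass it to parse_labels, and hand the resulting list to cut_segments together with input.wav.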
As a reference, I attach my EnergyDetector configuration file:
*** EnergyDetector Config File ***
loadFeatureFileExtension        .prm
minLLK                          -200
maxLLK                          1000
bigEndian                       false
loadFeatureFileFormat           SPRO4
saveFeatureFileFormat           SPRO4
saveFeatureFileSPro3DataKind    FBCEPSTRA
featureServerBufferSize         ALL_FEATURES
featureServerMemAlloc           50000000
featureFilesPath                prm/
mixtureFilesPath                gmm/
lstPath                         lst/
labelOutputFrames               speech
labelSelectedFrames             all
addDefaultLabel                 true
defaultLabel                    all
saveLabelFileExtension          .lbl
labelFilesPath                  lbl/
frameLength                     0.01
segmentalMode                   file
nbTrainIt                       8
varianceFlooring                0.0001
varianceCeiling                 1.5
alpha                           0.25
mixtureDistribCount             3
featureServerMask               19
vectSize                        1
baggedFrameProbabilityInit      0.1
thresholdMode                   weight
CMU Sphinx
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A very recent addition is the GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> The 'vader' element
Other VADs
I have also been using a modified version of the AMR1 Codec that outputs a file with a speech/non-speech classification, but I cannot find its sources online, sorry.
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.
It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.
The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:
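import webrtcvad

vad = webrtcvad.Vad()

# sample must be 16-bit mono PCM audio data, sampled at 8 kHz, 16 kHz, 32 kHz or 48 kHz,
# and 10, 20, or 30 milliseconds long.
sample_rate = 16000
sample = b"\x00\x00" * (sample_rate * 10 // 1000)  # placeholder: 10 ms of silence
print(vad.is_speech(sample, sample_rate))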
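Since the VAD classifies one short frame at a time, turning its answers into speech segments takes a little glue code. The following is a rough sketch of that glue, not the code from example.py: it slices raw 16-bit mono PCM into fixed-size frames and merges runs of voiced frames into (start, end) times in seconds. The helper names split_frames, speech_segments and webrtc_classifier are invented for this example, and there is no smoothing or padding, so isolated misclassified frames will split or create segments.

```python
def split_frames(pcm, sample_rate, frame_ms=30):
    """Yield (start_seconds, frame_bytes) for whole frames of 16-bit mono PCM."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield offset / 2 / sample_rate, pcm[offset:offset + frame_bytes]


def speech_segments(pcm, sample_rate, is_speech, frame_ms=30):
    """Merge consecutive voiced frames into (start, end) pairs in seconds.

    is_speech is any callable mapping one frame (bytes) to True/False.
    """
    segments = []
    start = None
    end = 0.0
    for t, frame in split_frames(pcm, sample_rate, frame_ms):
        if is_speech(frame):
            if start is None:
                start = t  # a new voiced run begins here
            end = t + frame_ms / 1000
        elif start is not None:
            segments.append((start, end))  # the voiced run just ended
            start = None
    if start is not None:
        segments.append((start, end))  # audio ended while still voiced
    return segments


def webrtc_classifier(sample_rate, aggressiveness=2):
    """Return a frame -> bool callable backed by webrtcvad (imported lazily)."""
    import webrtcvad

    vad = webrtcvad.Vad(aggressiveness)  # 0 = least, 3 = most aggressive
    return lambda frame: vad.is_speech(frame, sample_rate)
```

With webrtcvad installed, something like speech_segments(pcm, 16000, webrtc_classifier(16000)) should give the voiced regions of a 16 kHz recording. Keeping the classifier as a plain callable also makes it easy to try a different VAD with the same segment-merging logic.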
pyAudioAnalysis has silence removal functionality.
In this library, silence removal can be as simple as this:
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x, Fs, 0.020, 0.020, smoothWindow=1.0, Weight=0.3, plot=True)
silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670
Internally, silenceRemoval() follows a semi-supervised approach. First, an SVM model is trained to distinguish between high-energy and low-energy short-term frames; for this, the 10% highest-energy frames and the 10% lowest-energy frames are used as training data. Then the SVM is applied (with a probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
Reference Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610
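To give a feel for the idea without the library, here is a deliberately simplified, dependency-free sketch. It is not pyAudioAnalysis's implementation: the SVM and its probabilistic output are replaced by a plain energy threshold placed between the mean of the 10% lowest-energy frames and the mean of the 10% highest-energy frames, and there is no smoothing. The function names and the weight parameter (where the threshold sits between the two means) are assumptions made for this example.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each whole, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_threshold(energies, fraction=0.10, weight=0.5):
    """Threshold between the means of the lowest and highest `fraction` of frames."""
    ranked = sorted(energies)
    k = max(1, int(len(ranked) * fraction))
    low = sum(ranked[:k]) / k
    high = sum(ranked[-k:]) / k
    return (1 - weight) * low + weight * high


def active_segments(samples, sample_rate, frame_sec=0.020, weight=0.5):
    """Return (start, end) pairs in seconds where frame energy exceeds the threshold."""
    frame_len = int(round(sample_rate * frame_sec))
    energies = frame_energies(samples, frame_len)
    if not energies:
        return []
    threshold = energy_threshold(energies, weight=weight)
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i  # first frame of an active run
        elif energy <= threshold and start is not None:
            segments.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:
        segments.append((start * frame_sec, len(energies) * frame_sec))
    return segments
```

Here samples is a sequence of numeric amplitudes (for example, decoded 16-bit PCM values). A fixed threshold like this is far more sensitive to noise than the SVM-based approach described above, so treat it as an illustration of the "learn from the energy extremes" idea rather than a replacement.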