What is a good approach for extracting portions of speech from an arbitrary audio file?


EnergyDetector

For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (formerly LIA_RAL) speaker recognition toolkit, which is based on the ALIZE library.

It works with feature files, not with audio files, so you'll need to extract the energy of the signal first. I usually extract cepstral features (MFCC) with the log-energy parameter and use this parameter for VAD. You can use sfbcep, a utility that is part of the SPro signal processing toolkit, in the following way:

sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm

This extracts 19 MFCCs plus the log-energy coefficient, together with first- and second-order delta coefficients. The energy coefficient is the 19th; you will specify that in the EnergyDetector configuration file.

You will then run EnergyDetector in this way:

EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output 

If you use the configuration file that you find at the end of the answer, you need to put output.prm in prm/, and you'll find the segmentation in lbl/.
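If it helps to see both steps chained together, here is a rough Python sketch of the pipeline (the run_vad helper is my own, and it assumes sfbcep and EnergyDetector are on your PATH and that you keep the directory layout used in the configuration file below):

import os
import subprocess

def run_vad(wav_path, basename="output"):
    for d in ("prm", "lbl"):
        os.makedirs(d, exist_ok=True)

    # 1. Extract 19 MFCCs + log-energy + delta coefficients into prm/<basename>.prm
    subprocess.run(["sfbcep", "-F", "PCM16", "-p", "19", "-e", "-D", "-A",
                    wav_path, os.path.join("prm", basename + ".prm")], check=True)

    # 2. Run the energy-based VAD; the segmentation ends up in lbl/<basename>.lbl
    subprocess.run(["EnergyDetector", "--config", "cfg/EnergyDetector.cfg",
                    "--inputFeatureFilename", basename], check=True)

    return os.path.join("lbl", basename + ".lbl")

run_vad("input.wav")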

As a reference, I attach my EnergyDetector configuration file:

*** EnergyDetector Config File ***
loadFeatureFileExtension        .prm
minLLK                          -200
maxLLK                          1000
bigEndian                       false
loadFeatureFileFormat           SPRO4
saveFeatureFileFormat           SPRO4
saveFeatureFileSPro3DataKind    FBCEPSTRA
featureServerBufferSize         ALL_FEATURES
featureServerMemAlloc           50000000
featureFilesPath                prm/
mixtureFilesPath                gmm/
lstPath                         lst/
labelOutputFrames               speech
labelSelectedFrames             all
addDefaultLabel                 true
defaultLabel                    all
saveLabelFileExtension          .lbl
labelFilesPath                  lbl/
frameLength                     0.01
segmentalMode                   file
nbTrainIt                       8
varianceFlooring                0.0001
varianceCeiling                 1.5
alpha                           0.25
mixtureDistribCount             3
featureServerMask               19
vectSize                        1
baggedFrameProbabilityInit      0.1
thresholdMode                   weight

CMU Sphinx

The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.

A very recent addition is GStreamer support, which means that you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> The 'vader' element.

Other VADs

I have also been using a modified version of the AMR1 codec that outputs a file with speech/non-speech classification, but I cannot find its sources online, sorry.


webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection code.

It comes with a file, example.py, that does exactly what you're looking for: Given a .wav file, it finds each instance of someone speaking and writes it out to a new, separate .wav file.

The webrtcvad API is extremely simple, in case example.py doesn't do quite what you want:

import webrtcvad

vad = webrtcvad.Vad()

# sample must be 16-bit PCM audio data at 8, 16 or 32 kHz,
# and 10, 20, or 30 milliseconds long.
print(vad.is_speech(sample, sample_rate))
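In case you want to adapt it yourself, a rough frame-by-frame sketch could look like this (assuming a mono, 16-bit WAV at one of the supported sample rates; the file name, frame length and aggressiveness value are arbitrary choices):

import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness: 0 (least aggressive) to 3 (most)

with wave.open("input.wav", "rb") as wf:
    assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
    sample_rate = wf.getframerate()             # must be 8000, 16000 or 32000 Hz
    frame_ms = 30                               # 10, 20 or 30 ms per frame
    samples_per_frame = int(sample_rate * frame_ms / 1000)

    t = 0.0
    while True:
        frame = wf.readframes(samples_per_frame)
        if len(frame) < samples_per_frame * 2:  # skip the short final frame
            break
        if vad.is_speech(frame, sample_rate):
            print("speech at %.2f s" % t)
        t += frame_ms / 1000.0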


pyAudioAnalysis has a silence removal functionality.

In this library, silence removal can be as simple as that:

from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

[Fs, x] = aIO.readAudioFile("data/recording1.wav")
segments = aS.silenceRemoval(x,
                             Fs,
                             0.020,
                             0.020,
                             smoothWindow=1.0,
                             Weight=0.3,
                             plot=True)
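To actually extract the detected portions (assuming each entry of segments is a [start, end] pair in seconds, which is how I read the implementation referenced below), you could write each one to its own file, for example with scipy:

from scipy.io import wavfile

# Fs, x and segments come from the snippet above; the output file names are arbitrary.
for i, (start, end) in enumerate(segments):
    wavfile.write("segment_%02d.wav" % i, Fs, x[int(start * Fs):int(end * Fs)])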

silenceRemoval() implementation reference: https://github.com/tyiannak/pyAudioAnalysis/blob/944f1d777bc96717d2793f257c3b36b1acf1713a/pyAudioAnalysis/audioSegmentation.py#L670

Internally, silenceRemoval() follows a semi-supervised approach: first, an SVM model is trained to distinguish between high-energy and low-energy short-term frames. To this end, the 10% highest-energy frames along with the 10% lowest are used. The SVM is then applied (with probabilistic output) to the whole recording, and dynamic thresholding is used to detect the active segments.
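To make that description concrete, here is a purely illustrative toy version of the same idea (this is not the pyAudioAnalysis code; the function name and the crude mean-based threshold are my own simplifications):

import numpy as np
from sklearn.svm import SVC

def toy_semisupervised_vad(frame_energies, percent=0.10):
    e = np.asarray(frame_energies, dtype=float).reshape(-1, 1)
    n = max(1, int(len(e) * percent))
    order = np.argsort(e[:, 0])
    low, high = order[:n], order[-n:]           # pseudo-labels: lowest/highest energy frames

    X = np.vstack([e[low], e[high]])
    y = np.concatenate([np.zeros(n), np.ones(n)])

    clf = SVC(probability=True).fit(X, y)       # SVM with probabilistic output
    p_speech = clf.predict_proba(e)[:, 1]       # probability of being a high-energy frame

    threshold = p_speech.mean()                 # crude stand-in for the dynamic threshold
    return p_speech > threshold                 # boolean mask of active frames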

Reference Paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144610