Mozilla's DeepSpeech: an open-source speech-to-text engine

In this article, we will try out a transcriber built with Mozilla's DeepSpeech.

Let's start by setting it up.

# Create and activate a virtualenv
virtualenv -p python3 $HOME/tmp/deepspeech-venv/
source $HOME/tmp/deepspeech-venv/bin/activate

# Install DeepSpeech
pip3 install deepspeech

# Download pre-trained English model files
curl -LO
curl -LO

Demo - transcribing an audio file

Now let's transcribe an audio file.

# Download example audio files
curl -LO
tar xvf audio-0.7.3.tar.gz

# Transcribe an audio file
deepspeech --model deepspeech-0.7.3-models.pbmm --scorer deepspeech-0.7.3-models.scorer --audio audio/2830-3980-0043.wav

The output should look similar to the verbose output below. The line "experience proves this" is the transcript, which shows that it is working.

Loading model from file deepspeech-0.7.3-models.pbmm
TensorFlow: v1.15.0-24-gceb46aa
DeepSpeech: v0.7.3-0-g8858494
Loaded model in 0.00997s.
Loading scorer from files deepspeech-0.7.3-models.scorer
Loaded scorer in 0.000207s.
Running inference.
experience proves this
Inference took 1.007s for 1.975s audio file.

The last line shows something noteworthy: inference took less time than the audio lasts (1.007 s for 1.975 s of audio), so transcription ran faster than real time.
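To put a number on that, here is a small Python sketch that pulls the two timings out of the last log line and divides them. The function name real_time_factor is my own, and the regular expression assumes the line looks exactly like the one in the output above; a value below 1 means faster than real time.

```python
import re

# The timing line copied from the example output above.
LOG_LINE = "Inference took 1.007s for 1.975s audio file."


def real_time_factor(line):
    """Return inference time divided by audio length for a timing log line."""
    match = re.search(r"Inference took ([\d.]+)s for ([\d.]+)s audio file", line)
    if match is None:
        raise ValueError("not an inference timing line")
    inference_s, audio_s = map(float, match.groups())
    return inference_s / audio_s


print(f"{real_time_factor(LOG_LINE):.2f}")  # prints 0.51
```

So on this machine the model needed roughly half the duration of the clip, which is what makes live streaming plausible.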

Streaming to DeepSpeech

Make sure you have the prerequisites before continuing.

# Prerequisites

# Dependencies
sudo apt-get install libasound2-dev

# Clone and move into working directory
git clone moz-ds-examples
cd moz-ds-examples/nodejs_mic_vad_streaming

# Install npm packages
npm install

# Start the server
node start.js

I had an issue where no output was printed. This is because the example uses the default microphone, which may not exist.

To fix it, run arecord -l to list the installed microphones, then update the microphone variable to the following. Notice that the device property has been added to the options; the value plughw:2,0 means ALSA card 2, device 0.

var microphone = mic({
  rate: "16000",
  channels: "1",
  debug: false,
  fileType: "wav",
  // ALSA capture device (card 2, device 0); use the numbers that arecord -l reports
  device: "plughw:2,0",
});
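If you are unsure how the arecord -l listing maps to a device string, the following Python sketch may help. It is only an illustration: the helper name alsa_device_names and the sample text are my own, the sample is not output from my machine, and the pattern assumes arecord prints in English.

```python
import re

# Matches lines such as:
# "card 2: Device [USB Audio Device], device 0: USB Audio [USB Audio]"
CARD_LINE = re.compile(r"^card (\d+): .*?, device (\d+):", re.MULTILINE)


def alsa_device_names(arecord_output):
    """Return a "plughw:<card>,<device>" name for each capture device listed."""
    return [f"plughw:{card},{dev}" for card, dev in CARD_LINE.findall(arecord_output)]


# Illustrative listing only; card names and numbers differ per machine.
SAMPLE = """\
**** List of CAPTURE Hardware Devices ****
card 2: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""

print(alsa_device_names(SAMPLE))  # prints ['plughw:2,0']
```

On a real machine you could pass in the text returned by subprocess.run(["arecord", "-l"], capture_output=True, text=True).stdout instead of SAMPLE.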

Now start streaming.
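For intuition about what streaming means here, this is a minimal Python sketch of the same idea: audio is fed to the model in small chunks and the transcript is collected at the end. It assumes the DeepSpeech 0.7.x Python API (createStream, feedAudioContent, finishStream) and raw 16-bit mono PCM at 16 kHz. The helper names iter_chunks and stream_transcribe are mine, and the Node.js example adds voice activity detection on top, which this sketch leaves out.

```python
def iter_chunks(pcm, chunk_bytes=640):
    """Yield successive slices of raw PCM bytes.

    640 bytes is 320 samples of 16-bit audio, i.e. 20 ms at 16 kHz.
    """
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def stream_transcribe(model, pcm):
    """Feed PCM bytes to a deepspeech.Model chunk by chunk and return the text."""
    # Imported lazily so iter_chunks can be used without numpy installed.
    import numpy as np

    stream = model.createStream()
    for chunk in iter_chunks(pcm):
        # The binding expects a numpy array of int16 samples.
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
    return stream.finishStream()
```

With a live microphone, the chunks would come from the audio device instead of a bytes object, but the feed-then-finish pattern stays the same.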

Next steps

Mozilla's DeepSpeech has APIs for C, .NET, Java, JavaScript (NodeJS/ElectronJS) and Python, so I'm sure you could integrate it the next time you need speech-to-text. I know I will. You may also want to train your own models, or use it on an IoT device.
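As a starting point for the Python route, here is a minimal sketch of transcribing a WAV file from code rather than from the command line. It assumes the 0.7.x Python package installed earlier (Model, enableExternalScorer, sampleRate, stt) plus numpy; the function names read_pcm16 and transcribe are my own, and the file paths are whatever you downloaded.

```python
import wave


def read_pcm16(path, expected_rate=16000):
    """Return the raw 16-bit mono PCM bytes of a WAV file, checking its format."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono audio")
        if wav.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz audio, got {wav.getframerate()} Hz"
            )
        return wav.readframes(wav.getnframes())


def transcribe(model_path, scorer_path, wav_path):
    """Load a model and scorer, then return the transcript of one WAV file."""
    # Imported lazily so read_pcm16 works without these packages installed.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    model.enableExternalScorer(scorer_path)
    pcm = read_pcm16(wav_path, expected_rate=model.sampleRate())
    return model.stt(np.frombuffer(pcm, dtype=np.int16))
```

Calling transcribe("deepspeech-0.7.3-models.pbmm", "deepspeech-0.7.3-models.scorer", "audio/2830-3980-0043.wav") should, under those assumptions, return the same text as the command-line demo.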