Mozilla's DeepSpeech an open source speech to text engine

2 min read

    In this article, we will be trying a transcriber made using Mozilla's DeepSpeech.


    Let's start with an example.

    # Create and activate a virtualenv
    virtualenv -p python3 $HOME/tmp/deepspeech-venv/
    source $HOME/tmp/deepspeech-venv/bin/activate
    # Install DeepSpeech
    pip3 install deepspeech
    # Download pre-trained English model files
    curl -LO
    curl -LO

    Demo - transcribing an audio file

    Now let's transcribe an audio file.

    # Download example audio files
    curl -LO
    tar xvf audio-0.7.3.tar.gz
    # Transcribe an audio file
    deepspeech --model deepspeech-0.7.3-models.pbmm --scorer deepspeech-0.7.3-models.scorer --audio audio/2830-3980-0043.wav

    The output should look similar to the verbose below. Notice the line "experience proves this" which shows this is working.

    Loading model from file deepspeech-0.7.3-models.pbmm
    TensorFlow: v1.15.0-24-gceb46aa
    DeepSpeech: v0.7.3-0-g8858494
    Loaded model in 0.00997s.
    Loading scorer from files deepspeech-0.7.3-models.scorer
    Loaded scorer in 0.000207s.
    Running inference.
    experience proves this
    Inference took 1.007s for 1.975s audio file.

    The last line shows something note worthy - inference took less time than the audio file length.

    Streaming to DeepSpeech

    Make sure you have the prerequisites before continuing.

    # Prerequisites
    # Dependancies
    sudo apt-get install libasound2-dev
    # Clone and move into working directory
    git clone moz-ds-examples
    cd moz-ds-examples/nodejs_mic_vad_streaming
    # Install npm packages
    npm install
    # Start the server
    node start.js

    I had an issue where no verboise was given. This is because it uses the default mic which may not exist.

    So update microphone variable to the following - notice device property was added to the options. Run arecord -l to see the list of microphones installed.

    var microphone = mic({
      rate: "16000",
      channels: "1",
      debug: false,
      fileType: "wav",
      device: "plughw:2,0",

    Now start streaming.

    Next steps

    Mozilla's DeepSpeech has an API in C,.NET, Java, JavaScript (NodeJS/ElectronJS) and Python so i'm sure you could integrate this the next time you need speech-to-text - I know I will be. You may want to train your own models. You could use this for an IoT device.