In this blog post, I am going to show you how to create an app in Python that converts speech to text from a selected audio input device. To achieve this, I am going to use an open-source speech recognition toolkit called Vosk.
NOTE: Before you follow the steps below, it is recommended that you create a virtual environment. For instructions on creating one, head over to the Venv docs.
To start off, we first need a way to capture audio input. For this, we are going to use the python-sounddevice library. To get started, follow the installation steps here.
Then, import the library by adding the following line at the top of your script:
import sounddevice as sd
Now, we need to find the sample rate of our input device. To do this, we are going to use the query_devices function:
device_info = sd.query_devices(device=None, kind='input')
samplerate = int(device_info['default_samplerate'])
In this step, you can also set the device parameter to the ID of a device of your choice. To list the input devices and their IDs, you can use the following function:
def getInputDevices():
    devicesRaw = sd.query_devices()
    devices = {}
    for x in range(len(devicesRaw)):
        if devicesRaw[x]['max_input_channels'] > 0:
            devices[x] = devicesRaw[x]['name']
    return devices
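To illustrate what getInputDevices returns, here is the same filtering logic run against a hand-written list shaped like the output of sd.query_devices() (the device names below are made up for the example):

```python
# Hypothetical sample of what sd.query_devices() returns:
devicesRaw = [
    {'name': 'Built-in Microphone', 'max_input_channels': 2},
    {'name': 'Built-in Output', 'max_input_channels': 0},
    {'name': 'USB Headset', 'max_input_channels': 1},
]

# Keep only devices that can capture audio, keyed by their index (the ID):
devices = {i: d['name'] for i, d in enumerate(devicesRaw)
           if d['max_input_channels'] > 0}
print(devices)  # {0: 'Built-in Microphone', 2: 'USB Headset'}
```

The index into the device list is the ID you would pass as the device parameter.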
Before we go any further, we need to load the model from our speech recognition toolkit. First, install the Vosk Python library by following these instructions. Then download the Vosk model of your choice from this list, extract it, and place it in your project folder. Finally, import the library and load the model:
import vosk

model = vosk.Model('path/to/model')
Now we can start detecting speech using the Vosk library:
import queue
import sys

q = queue.Queue()  # stores the audio while it is being processed
result = ''  # complete result (more accurate but slower prediction)
partialResult = ''  # partial result (less accurate but faster prediction)

# Callback for the speech-to-text function
def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

# Detects text from speech input
def speechToText():
    global result, partialResult
    rec = vosk.KaldiRecognizer(model, samplerate)
    with sd.InputStream(samplerate=samplerate, blocksize=8000, device=None,
                        dtype='int16', channels=1, callback=callback):
        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                result = rec.Result()[14:-3]
            else:
                partialResult = rec.PartialResult()[17:-3]
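The string slicing above is fragile because it depends on the exact formatting of the recognizer's output. Since Result() and PartialResult() actually return JSON strings (with a "text" key for full results and a "partial" key for partial ones), a more robust sketch uses the standard json module (extractText is a helper name I am introducing here, not part of the Vosk API):

```python
import json

def extractText(result_json):
    """Pull the recognized text out of a Vosk Result()/PartialResult() string."""
    parsed = json.loads(result_json)
    # Full results use the "text" key; partial results use "partial".
    return parsed.get('text', parsed.get('partial', ''))

# Example strings with the JSON shape Vosk emits:
print(extractText('{"text": "hello world"}'))  # hello world
print(extractText('{"partial": "hello wo"}'))  # hello wo
```

With this helper, the branches above become result = extractText(rec.Result()) and partialResult = extractText(rec.PartialResult()), and they keep working even if the JSON whitespace changes.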
Now we can read the result and partialResult variables and use them for our needs.
So what can we do next?
- We can add a GUI which displays the text while it is being predicted.
- We can save the text for future use.
- We can use it to dictate notes by using the PyAutoGUI library to type the predicted captions as keyboard input.
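For the second idea, saving each completed result could be as simple as appending it to a text file. A minimal sketch (the file name and saveResult helper are arbitrary choices for the example):

```python
import os
import tempfile

def saveResult(text, path):
    # Append each recognized phrase on its own line.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')

# Demo with a temporary file:
path = os.path.join(tempfile.gettempdir(), 'captions.txt')
open(path, 'w').close()  # start with an empty file
saveResult('hello world', path)
saveResult('second phrase', path)
print(open(path, encoding='utf-8').read())
```

In the app itself, you would call saveResult whenever rec.AcceptWaveform returns True and a new complete result is available.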