In this blog post, I am going to show you how to create an app in Python that converts speech to text from a selected audio input device. To achieve this, I am going to use an open-source speech recognition toolkit called Vosk.
NOTE: Before you follow the steps below, it is recommended that you create a virtual environment. For instructions on creating one, head over to the Venv docs.
To start off, we first need a way to capture audio input. For this, we are going to use the python-sounddevice library. To get started, follow the installation steps here.
Then, import the library by adding the following line at the top of your script:
import sounddevice as sd
Now, we need to find the sample rate of our input device. To do this, we are going to use the query_devices function:
device_info = sd.query_devices(device=None, kind='input')
samplerate = int(device_info['default_samplerate'])
In this step, you can also set the device parameter to the ID of a device of your choice. To list the input devices and their IDs, you can use the following function:
def getInputDevices():
    devicesRaw = sd.query_devices()
    devices = {}
    for x in range(len(devicesRaw)):
        if devicesRaw[x]['max_input_channels'] > 0:
            devices[x] = devicesRaw[x]['name']
    return devices
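To illustrate what getInputDevices returns, here is the same filtering logic run against a hand-written list shaped like the output of sd.query_devices() (the device names below are made up for the example):

```python
# Hypothetical sample of what sd.query_devices() returns:
devicesRaw = [
    {'name': 'Built-in Microphone', 'max_input_channels': 2},
    {'name': 'Built-in Output', 'max_input_channels': 0},
    {'name': 'USB Headset', 'max_input_channels': 1},
]

# Keep only devices that can capture audio, keyed by their index (the ID):
devices = {i: d['name'] for i, d in enumerate(devicesRaw)
           if d['max_input_channels'] > 0}
print(devices)  # {0: 'Built-in Microphone', 2: 'USB Headset'}
```

The index into the device list is the ID you would pass as the device parameter.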
Before we go any further, we need to load the model from our speech recognition toolkit. First, install the Vosk Python library by following these instructions. Then download the Vosk model of your choice from this list, extract it, and place it in your project folder. Finally, import the library and load the model:
import vosk

model = vosk.Model('path/to/model')
Now we can start detecting speech using the Vosk library:
import queue
import sys

q = queue.Queue()  # stores the audio while it is being processed
result = ''  # complete result (more accurate but slower prediction)
partialResult = ''  # partial result (less accurate but faster prediction)

# Callback for the speech-to-text function
def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

# Detects text from speech input
def speechToText():
    global result, partialResult
    rec = vosk.KaldiRecognizer(model, samplerate)
    with sd.InputStream(samplerate=samplerate, blocksize=8000, device=None,
                        dtype='int16', channels=1, callback=callback):
        while True:
            data = q.get()
            if rec.AcceptWaveform(data):
                result = rec.Result()[14:-3]
            else:
                partialResult = rec.PartialResult()[17:-3]
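The string slicing above is fragile because it depends on the exact formatting of the recognizer's output. Since Result() and PartialResult() actually return JSON strings (with a "text" key for full results and a "partial" key for partial ones), a more robust sketch uses the standard json module (extractText is a helper name I am introducing here, not part of the Vosk API):

```python
import json

def extractText(result_json):
    """Pull the recognized text out of a Vosk Result()/PartialResult() string."""
    parsed = json.loads(result_json)
    # Full results use the "text" key; partial results use "partial".
    return parsed.get('text', parsed.get('partial', ''))

# Example strings with the JSON shape Vosk emits:
print(extractText('{"text": "hello world"}'))  # hello world
print(extractText('{"partial": "hello wo"}'))  # hello wo
```

With this helper, the branches above become result = extractText(rec.Result()) and partialResult = extractText(rec.PartialResult()), and they keep working even if the JSON whitespace changes.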
Now we can read the result and partialResult variables and use them for our needs.
So what can we do next?
- We can add a GUI which displays the text while it is being predicted.
- We can save the text for future use.
- We can use it to dictate notes by using the PyAutoGUI library to type the predicted captions as keyboard input.
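For the second idea, saving each completed result could be as simple as appending it to a text file. A minimal sketch (the file name and saveResult helper are arbitrary choices for the example):

```python
import os
import tempfile

def saveResult(text, path):
    # Append each recognized phrase on its own line.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')

# Demo with a temporary file:
path = os.path.join(tempfile.gettempdir(), 'captions.txt')
open(path, 'w').close()  # start with an empty file
saveResult('hello world', path)
saveResult('second phrase', path)
print(open(path, encoding='utf-8').read())
```

In the app itself, you would call saveResult whenever rec.AcceptWaveform returns True and a new complete result is available.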