Not only can Rhapsode read pages aloud to you via eSpeak NG and its own CSS engine, but now you can speak aloud to it via Voice2JSON! All without trusting or relying upon any internet services, except of course for bog-standard webservers to download your requested information from. Thereby completing my vision for Rhapsode’s reading experience!
This speech recognition can be triggered either using the space key or by calling Rhapsode’s name (okay, by saying “Hey Mycroft”, because I haven’t bothered to train it).
Voice2JSON is exactly what I want from a speech-to-text engine!
Across its 4 backends (CMU PocketSphinx, Dan Povey’s Kaldi, Mozilla DeepSpeech, & Kyoto University’s Julius) it supports 18 human languages! I always like to see more language support, but this is impressive.
I can feed it whatever random phrases I find in link elements, etc. (lightly preprocessed) to use as voice commands, even different commands for every webpage, including unusual words.
It operates entirely on your device, only using the internet initially to download a profile for your language.
And when I implement webforms its slots feature will be invaluable.
The only gotcha is that I needed to also add a JSON parser to Rhapsode’s dependencies.
To operate Voice2JSON you rerun voice2json train-profile every time you edit sentences.ini or any of its referenced files, to update the list of supported voice commands. This prepares a language model to guide the output of voice2json transcribe-stream, whose output you’ll probably pipe into voice2json recognize-intent to determine which intent from sentences.ini it matches.
If you want this voice recognition to be triggered by some wake word, you can run voice2json wait-wake to determine when that keyphrase has been said.
For every page Rhapsode outputs a sentences.ini file & runs voice2json train-profile to compile this mix of INI & Java Speech Grammar Format syntax into an appropriate language model for the backend chosen by the profile.
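To make the per-page sentences.ini idea concrete, here is a minimal sketch in Python. It is not Rhapsode's actual code: the function names and the `Link0`, `Link1`, … intent names are made up for illustration, and the "light preprocessing" shown (lowercasing and dropping punctuation that could be mistaken for template syntax) is only my assumption about what such a step needs to do.

```python
import re

def to_template(label: str) -> str:
    """Lowercase a link label and keep only its words.

    Punctuation such as ( ) [ ] | < > { } is dropped so it cannot be
    misread as template syntax; apostrophes inside words are kept.
    """
    words = re.findall(r"[^\W_]+(?:'[^\W_]+)*", label.lower())
    return " ".join(words)

def build_sentences_ini(labels: list[str]) -> str:
    """Emit one [intent] section per link label, each holding one sentence.

    The LinkN intent names are hypothetical; any unique name would do.
    """
    sections = []
    for index, label in enumerate(labels):
        template = to_template(label)
        if template:  # skip labels with nothing speakable in them
            sections.append(f"[Link{index}]\n{template}\n")
    return "\n".join(sections)

if __name__ == "__main__":
    print(build_sentences_ini(["Home", "About Us", "Read More (PDF)!"]))
```

The resulting text would be written to the profile's sentences.ini before retraining, so each page gets its own small vocabulary of commands.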
Once it’s parsed sentences.ini, Voice2JSON optionally normalizes the sentence casing and lowers any numeric ranges, slot references from external files or programs, & numeric digits (via num2words) before reformatting it into a NetworkX graph with weighted edges. This resulting Nondeterministic Finite Automaton (NFA) is saved & gzip’d to the profile before lowering it further to an OpenFST graph which, with a handful of opengrm commands, is converted into an appropriate language model.
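The "lowering" of numbers can be illustrated with a small sketch. This is not Voice2JSON's code: `spell` is a hand-rolled stand-in for num2words that only covers 0 to 99 (and writes "twenty one" without a hyphen), and I am assuming ranges are written as `A..B` inside a sentence template.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell(n: int) -> str:
    """Spell 0..99 in English; a tiny stand-in for num2words."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only spells 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"

def lower_numbers(template: str) -> str:
    """Turn 'A..B' ranges into (word | word | ...) and lone digits into words."""
    def expand_range(match: re.Match) -> str:
        low, high = int(match.group(1)), int(match.group(2))
        return "(" + " | ".join(spell(n) for n in range(low, high + 1)) + ")"

    # Ranges first, so their endpoints are not spelled out individually.
    template = re.sub(r"\b(\d+)\.\.(\d+)\b", expand_range, template)
    return re.sub(r"\b\d+\b", lambda m: spell(int(m.group())), template)

if __name__ == "__main__":
    print(lower_numbers("go to heading 1..3"))
    print(lower_numbers("open link 21"))
```

After this step every alternative is plain words, which is what a pronunciation dictionary and a word graph need.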
Whilst lowering the NFA to a language model Voice2JSON looks up how to pronounce every unique word in that NFA, consulting Phonetisaurus for any words the profile doesn’t know about. Phonetisaurus in turn evaluates the word over a Hidden Markov n-gram model.
voice2json transcribe-stream pipes 16-bit 16kHz mono WAVs from a specified file or profile-configured record command (defaults to ALSA) to the backend & formats its output sentences with metadata inside JSON Lines objects. To determine when a voice command ends it uses some sophisticated code extracted from the WebRTC implementation (from Google).
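Consuming JSON Lines is pleasantly simple: one JSON object per line. The following Python sketch pulls the transcribed sentence out of each record. I am assuming the sentence lives under a `text` key; the sample records are invented for the example rather than captured from a real run.

```python
import io
import json
from typing import Iterable, Iterator

def read_transcriptions(stream: Iterable[str]) -> Iterator[str]:
    """Yield the transcribed text from each non-empty JSON Lines record."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "text" is the assumed field name for the recognized sentence.
        yield record.get("text", "")

if __name__ == "__main__":
    # Invented sample output, standing in for a real transcription stream.
    sample = io.StringIO(
        '{"text": "open next link", "likelihood": 0.9}\n'
        '{"text": "go back"}\n'
    )
    for sentence in read_transcriptions(sample):
        print(sentence)
```

In practice the stream would be the standard output of the transcription process rather than an in-memory string.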
That 16kHz audio sampling rate is interesting: it’s far below the 44.1kHz sampling rate typical for digital audio. Presumably this reduces the computational load whilst preserving the frequencies (max 8kHz per Nyquist-Shannon) typical of human speech.
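The arithmetic behind that guess is quick to check. A sketch (the function names are mine): the highest representable frequency is half the sampling rate, and the raw data rate scales with sampling rate, sample width, and channel count.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2

def bytes_per_second(sample_rate_hz: int, bits: int = 16, channels: int = 1) -> int:
    """Raw PCM data rate, before any container or compression overhead."""
    return sample_rate_hz * bits // 8 * channels

if __name__ == "__main__":
    speech = bytes_per_second(16_000)               # 16-bit mono at 16kHz
    cd_audio = bytes_per_second(44_100, channels=2)  # 16-bit stereo at 44.1kHz
    print(nyquist_hz(16_000))   # 8000.0
    print(speech)               # 32000
    print(cd_audio)             # 176400
    print(cd_audio / speech)    # 5.5125
```

So the recognizer handles roughly a fifth of the raw data of CD-quality stereo while still covering frequencies up to 8kHz.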
To match this output to the grammar defined in sentences.ini Voice2JSON provides the voice2json recognize-intent command. This reads back in the compressed NetworkX NFA to find the best path, fuzzily or not, via depth-first search which matches each input sentence. Once it has that path it iterates over it to resolve & capture:
The resulting information from each of these passes is gathered & output as JSON Lines.
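For a feel of what matching a sentence against such a graph involves, here is a much-simplified sketch. It uses plain dictionaries rather than NetworkX, ignores edge weights, and only does exact (non-fuzzy) matching, so treat it as an illustration of the depth-first idea rather than Voice2JSON's algorithm. The graph encodes the hypothetical template "open (next | previous) link".

```python
from typing import Optional

# state -> list of (word, next_state) edges; hypothetical example graph
Graph = dict[int, list[tuple[str, int]]]
Path = list[tuple[int, str, int]]

def match_path(graph: Graph, start: int, finals: set[int],
               words: list[str]) -> Optional[Path]:
    """Depth-first search for a path whose edge labels spell out `words`."""
    stack: list[tuple[int, int, Path]] = [(start, 0, [])]
    while stack:
        state, position, path = stack.pop()
        if position == len(words):
            if state in finals:
                return path  # consumed every word and ended in a final state
            continue
        for label, target in graph.get(state, []):
            if label == words[position]:
                stack.append((target, position + 1,
                              path + [(state, label, target)]))
    return None  # no path through the graph matches the sentence

if __name__ == "__main__":
    graph: Graph = {
        0: [("open", 1)],
        1: [("next", 2), ("previous", 2)],
        2: [("link", 3)],
    }
    print(match_path(graph, 0, {3}, "open next link".split()))
    print(match_path(graph, 0, {3}, "open last link".split()))
```

Walking the returned path edge by edge is where per-edge annotations could then be collected into the final result.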
In Rhapsode I apply a further fuzzy match, the same I’ve always used for keyboard input, via Levenshtein Distance.
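For reference, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal Python sketch of the standard dynamic-programming formulation follows; the `closest` helper and the candidate commands are illustrative, not Rhapsode's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only one row of the table."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def closest(query: str, candidates: list[str]) -> str:
    """Pick the candidate command with the smallest edit distance."""
    return min(candidates, key=lambda candidate: levenshtein(query, candidate))

if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(closest("nxt link", ["next link", "previous link", "home"]))
```

Because the same function works on typed and transcribed text alike, keyboard and voice input can share one matching step.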
To trigger Rhapsode to recognize a voice command you can either press a key (spacebar) or, to stick to pure voice control, say a wake word. For this there’s the voice2json wait-wake command.
voice2json wait-wake pipes the same 16-bit 16kHz mono WAV audio as voice2json transcribe-stream into (currently) Mycroft Precise & applies some edge detection to the output probabilities. Mycroft Precise, from the Mycroft open-source voice assistant project, is a TensorFlow neural network converting spectrograms (computed via sonopy or legacy speechpy) into probabilities.
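Edge detection matters because the network emits a probability for every audio frame, and a wake word should fire once rather than on every frame it stays audible. One simple way to do that, sketched below, is to fire only when the probability has stayed above a threshold for a few consecutive frames. The `threshold` and `trigger_frames` parameters are hypothetical, and this is not necessarily how Voice2JSON or Mycroft Precise does it.

```python
from typing import Iterable

def wake_events(probabilities: Iterable[float], threshold: float = 0.5,
                trigger_frames: int = 3) -> list[int]:
    """Return frame indices where a wake word is considered detected.

    An event fires once, at the frame where the probability has been above
    `threshold` for `trigger_frames` frames in a row; it cannot fire again
    until the probability has dropped back to or below the threshold.
    """
    events = []
    run = 0  # length of the current streak of above-threshold frames
    for index, probability in enumerate(probabilities):
        if probability > threshold:
            run += 1
            if run == trigger_frames:
                events.append(index)
        else:
            run = 0
    return events

if __name__ == "__main__":
    frames = [0.1, 0.9, 0.2, 0.8, 0.9, 0.95, 0.9, 0.1, 0.9, 0.9, 0.9]
    print(wake_events(frames))  # [5, 10]
```

The lone spike at frame 1 is ignored, and the long activation around frames 3 to 6 produces a single event.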
Interpreting audio input into voice commands is a non-trivial task, combining the efforts of many projects. Last I checked Voice2JSON used the following projects to tackle various components of this challenge:
And for the raw speech-to-text logic you can choose between:
Rhapsode’s use of Voice2JSON shows two things.
Second, there is zero reason for Siri, Alexa, Cortana, etc. to offload their computation to the cloud. Voice recognition may not be a trivial task, but even modest consumer hardware is more than capable of doing a good job at it.