HARMONI STT

This module transcribes speech from audio files and audio streams. You can run the STT services in the harmoni_full container.

Usage

Local DeepSpeech STT

To set up the local DeepSpeech STT service, first run sh harmoni_detectors/harmoni_stt/get_deepspeech_models.sh from the HARMONI directory. The script places the models in a parallel directory.

The API for Local DeepSpeech STT has:

Request Name: ActionType: REQUEST
Body: None (the STT is already listening from the microphone)
Response:
- response (int): SUCCESS or FAILURE
- message (str): text transcribed from the streaming audio or audio file
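
As a rough illustration of how a client might consume the two response fields above, here is a minimal sketch in plain Python. The `SttResponse` class, the `extract_transcript` helper, and the numeric values of `SUCCESS` and `FAILURE` are assumptions made for this example; HARMONI defines its own constants and message types, which may differ.

```python
from dataclasses import dataclass

# Assumed numeric codes for illustration only; use HARMONI's own constants in practice.
SUCCESS = 0
FAILURE = 1


@dataclass
class SttResponse:
    """Hypothetical stand-in for the two documented response fields."""
    response: int  # SUCCESS or FAILURE
    message: str   # text transcribed from the audio


def extract_transcript(result: SttResponse) -> str:
    """Return the transcribed text, or raise if the service reported a failure."""
    if result.response != SUCCESS:
        raise RuntimeError("STT request failed")
    return result.message.strip()


if __name__ == "__main__":
    print(extract_transcript(SttResponse(SUCCESS, " hello robot ")))
```

Checking the status before reading the message keeps an empty or partial transcript from being mistaken for valid text.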

Launch the local DeepSpeech speech-to-text service with roslaunch harmoni_stt stt_deepspeech_service.launch or roslaunch harmoni_stt stt_service.launch service_to_launch:=deepspeech. The DeepSpeech service publishes a transcription only once the client judges the text to be final, based on the t_wait parameter (the default is 0.5 s).
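
One way a silence-based rule like the t_wait check could work is sketched below. This is not HARMONI's actual code: the `SilenceFinalizer` class and its `update` method are hypothetical names, and the sketch assumes the client receives a running interim transcript and treats it as final once it has stopped changing for t_wait seconds.

```python
from typing import Optional


class SilenceFinalizer:
    """Treat interim text as final once it is unchanged for t_wait seconds."""

    def __init__(self, t_wait: float = 0.5) -> None:
        self.t_wait = t_wait
        self._text = ""
        self._last_change: Optional[float] = None

    def update(self, text: str, now: float) -> Optional[str]:
        """Feed the current interim text and a timestamp in seconds.

        Returns the final text when the silence threshold is reached,
        otherwise None.
        """
        if text != self._text:
            # The transcript is still growing, so restart the silence timer.
            self._text = text
            self._last_change = now
            return None
        if (
            self._text
            and self._last_change is not None
            and now - self._last_change >= self.t_wait
        ):
            final = self._text
            # Reset so the next utterance starts from an empty transcript.
            self._text = ""
            self._last_change = None
            return final
        return None


if __name__ == "__main__":
    finalizer = SilenceFinalizer(t_wait=0.5)
    for text, stamp in [("hello", 0.0), ("hello world", 0.25), ("hello world", 1.0)]:
        print(stamp, finalizer.update(text, stamp))
```

A larger t_wait makes the rule more tolerant of pauses inside a sentence, at the cost of a longer delay before the text is published.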

Google STT

The API for Google STT has:

Request Name: ActionType: REQUEST
Body: None (the STT is already listening from the microphone)
Response:
- response (int): SUCCESS or FAILURE
- message (str): text transcribed from the streaming audio or audio file

You can run the service with the following command: roslaunch harmoni_stt stt_google_service.launch

Parameters

Local DeepSpeech STT

Input parameters for the local STT service:

| Parameter | Definition | Values |
|---|---|---|
| model_file_path | path to the local STT model | str; e.g., “$(find harmoni_models)/stt/deepspeech-0.9.3-models.pbmm” |
| scorer_path | path to the scorer for the DeepSpeech model | str; e.g., “$(find harmoni_models)/stt/deepspeech-0.9.3-models.scorer” |
| lm_alpha | language model weight (alpha) of the DeepSpeech model | float; 0.75 |
| lm_beta | word insertion weight (beta) of the DeepSpeech model | float; 1.85 |
| beam_width | width of the decoder beam | int; 700 |
| t_wait | seconds of silence to wait before stopping transcription | int; 3 |
| subscriber_id | id of the subscriber | str; e.g., “default” |
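
To make the expected types and example values concrete, the parameters above can be mirrored in a small Python structure. This is only a sketch: the `DeepSpeechParams` class and its `validate` method are hypothetical and not part of HARMONI, and the defaults are simply the values listed in the table.

```python
from dataclasses import dataclass


@dataclass
class DeepSpeechParams:
    """Hypothetical container mirroring the documented DeepSpeech parameters."""
    model_file_path: str
    scorer_path: str
    lm_alpha: float = 0.75   # language model weight
    lm_beta: float = 1.85    # word insertion weight
    beam_width: int = 700
    t_wait: float = 3.0      # seconds of silence, as listed in the table
    subscriber_id: str = "default"

    def validate(self) -> None:
        """Raise ValueError for values that cannot be meaningful."""
        if not self.model_file_path or not self.scorer_path:
            raise ValueError("model_file_path and scorer_path must be set")
        if self.beam_width <= 0:
            raise ValueError("beam_width must be a positive integer")
        if self.t_wait <= 0:
            raise ValueError("t_wait must be a positive number of seconds")


if __name__ == "__main__":
    params = DeepSpeechParams(
        model_file_path="deepspeech-0.9.3-models.pbmm",
        scorer_path="deepspeech-0.9.3-models.scorer",
    )
    params.validate()
    print(params)
```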

Google STT

Input parameters for the Google STT service:

| Parameter | Definition | Values |
|---|---|---|
| language_id | language of the audio file | str; “en-US” |
| sample_rate | sample rate of the audio file in Hz (when streaming, it should match the microphone's sample rate) | int; e.g., 48000 or 44100 |
| audio_channel | number of audio channels | int; 1 |
| max_duration | maximum duration of empty streaming (seconds) | int; 30 |
| waiting_time | time of silence to wait after stopping the transcription (seconds) | int; 2 |
| credential_path | path where the private keys are mounted | str; “$(env HOME)/.gcp/private-keys.json/private-keys.json” |
| subscriber_id | id of the subscriber | str; e.g., “default” |
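
For orientation, the audio-related parameters above correspond closely to fields of the Google Cloud Speech-to-Text recognition config. The sketch below builds a plain dictionary using those field names. The `build_recognition_config` helper is hypothetical, the `LINEAR16` encoding is an assumption (the source does not state the encoding), and how HARMONI itself maps these parameters is not shown here.

```python
from typing import Any, Dict


def build_recognition_config(params: Dict[str, Any]) -> Dict[str, Any]:
    """Map the documented parameter names onto Google recognition config fields."""
    return {
        "encoding": "LINEAR16",  # assumption: 16-bit PCM audio from the microphone
        "sample_rate_hertz": int(params["sample_rate"]),
        "language_code": str(params["language_id"]),
        "audio_channel_count": int(params["audio_channel"]),
    }


if __name__ == "__main__":
    config = build_recognition_config(
        {"language_id": "en-US", "sample_rate": 44100, "audio_channel": 1}
    )
    print(config)
```

The Google client libraries usually locate the key file through the GOOGLE_APPLICATION_CREDENTIALS environment variable, so credential_path would typically be exported under that name before the client is created.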

Testing

The local DeepSpeech STT module can be tested using:

rostest harmoni_stt deepspeech.test

The online Google STT module can be tested using:

rostest harmoni_stt google.test

## References
[Documentation](https://harmoni20.readthedocs.io/en/latest/packages/harmoni_stt.html)

https://trac.ffmpeg.org/wiki/Capture/ALSA