HARMONI STT

This is a module for transcribing the speech from audio files and audio streaming. You can run your STT in the harmoni_full container.

Usage

Local DeepSpeech STT

Using the DeepSpeech service: To set up the local STT service, first run sh harmoni_detectors/harmoni_stt/get_deepspeech_models.sh from the HARMONI directory in order to place the models in a parallel directory.

The API for Local DeepSpeech STT has:

  • Request Name: ActionType: REQUEST

  • Body: None (the STT is already listening from the microphone)

  • Response:

    • response (int): SUCCESS, or FAILURE

    • message (str): text transcribed from the streaming audio or audio file

The local DeepSpeech speech-to-text service can be launched with roslaunch harmoni_stt stt_deepspeech_service.launch or roslaunch harmoni_stt stt_service.launch service_to_launch:=deepspeech. Transcriptions are only published by the DeepSpeech service when the client determines the text as final based on the t_wait parameter (the default is 0.5s).

Google STT

The API for Google STT has:

  • Request Name: ActionType: REQUEST

  • Body: None (the STT is already listening from the microphone)

  • Response:

    • response (int): SUCCESS, or FAILURE

    • message (str): text transcribed from the streaming audio or audio file

You can run the service with the following command: roslaunch harmoni_stt stt_google_service.launch

Parameters

Local DeepSpeech STT

Parameters input for the local STT service:

Parameters

Definition

Values

model_file_path

path of the local STT model

str; e.g., “$(find harmoni_models)/stt/deepspeech-0.9.3-models.pbmm”

scorer_path

path of the scorer for the deepspeech model

str; e.g., “$(find harmoni_models)/stt/deepspeech-0.9.3-models.scorer”

lm_alpha

parameters of the deepspeech model

int; 0.75

lm_beta

parameters of the deepspeech model

int; 1.85

beam_width

width of them beam

int; 700

t_wait

seconds to wait of silence before stoping transcribing

int; 3s

subscriber_id

id of the subscriber

e.g., “default”

Google STT

Parameters input for the Google STT service:

Parameters

Definition

Values

language_id

language of the audio file

str; “en-US”

sample_rate

sample rate of the audio file (it should match with the microphone one in case of streaming)

int; e.g., 48000, or 44100

audio_channel

number of audio channels

int; 1

max_duration

maximum duration of empty streaming (seconds)

int; 30

waiting_time

time of silence to wait after stopping the transcription (seconds)

int; 2

credential_path

path where private keys are mounted

str; “$(env HOME)/.gcp/private-keys.json/private-keys.json”

subscriber_id

id of the subscriber

e.g., “default”

Testing

Local DeepSpeech STT module can be tested using

rostest harmoni_stt deepspeech.test

Online Gooogle STT module can be tested using

rostest harmoni_stt google.test

## References
[Documentation](https://harmoni20.readthedocs.io/en/latest/packages/harmoni_stt.html)

https://trac.ffmpeg.org/wiki/Capture/ALSA