Riva

Riva Speech Recognition (ASR) and Text-to-Speech (TTS)

The Riva module provides Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) capabilities, powered by NVIDIA Riva, for robots running OM1.

Overview

OpenMind integrates NVIDIA Riva's speech AI models to offer:

  • ASR (Automatic Speech Recognition): Real-time speech-to-text conversion with automatic punctuation, profanity filtering, and multi-language support

  • TTS (Text-to-Speech): High-quality speech synthesis with customizable voices and languages

  • WebSocket Integration: Efficient streaming communication for low-latency processing

  • Flexible Audio Input: Support for microphone, audio streams, and remote audio sources

ASR Usage

Cloud-Based ASR (OpenMind API)

The ASR endpoint uses WebSockets for low-latency streaming communication with the OpenMind cloud service.

Connection Endpoint

wss://api-asr.openmind.org?api_key=<YOUR_API_KEY>

Basic Example

The following example demonstrates how to interact with the ASR endpoint using plain Python:
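A minimal sketch of such a client is shown here. It assumes the third-party `websockets` package and assumes that audio is sent as raw binary frames; the message framing is not confirmed by this page, and the names `build_asr_url` and `stream_audio` are illustrative only.

```python
import asyncio
from urllib.parse import urlencode

ASR_ENDPOINT = "wss://api-asr.openmind.org"  # from the Connection Endpoint section


def build_asr_url(api_key: str, base: str = ASR_ENDPOINT) -> str:
    """Append the API key as the api_key query parameter."""
    return f"{base}?{urlencode({'api_key': api_key})}"


async def stream_audio(api_key: str, chunks) -> list:
    """Send audio chunks and collect whatever messages the server returns.

    ASSUMPTION: audio is sent as raw binary frames. Check the service's
    actual message format before relying on this.
    """
    import websockets  # third-party: pip install websockets

    replies = []
    async with websockets.connect(build_asr_url(api_key)) as ws:
        for chunk in chunks:
            await ws.send(chunk)
        # Read replies until the server is quiet for 2 seconds or closes.
        try:
            while True:
                replies.append(await asyncio.wait_for(ws.recv(), timeout=2.0))
        except (asyncio.TimeoutError, websockets.ConnectionClosed):
            pass
    return replies
```

To try it, pass an iterable of PCM byte chunks, for example `asyncio.run(stream_audio(api_key, chunks))`.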

Response Format

The endpoint responds with transcriptions in the following JSON format:
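The exact schema is not reproduced on this page, so the sketch below only illustrates defensive parsing of one JSON message. The field name `asr_reply` is an assumption used as a placeholder; substitute the key your responses actually contain.

```python
import json
from typing import Optional

TRANSCRIPT_KEY = "asr_reply"  # ASSUMPTION: placeholder field name, not confirmed here


def extract_transcript(message, key: str = TRANSCRIPT_KEY) -> Optional[str]:
    """Return the transcript string from one JSON message, or None."""
    try:
        payload = json.loads(message)
    except (TypeError, ValueError):
        return None  # not JSON (for example, a binary frame)
    if not isinstance(payload, dict):
        return None
    value = payload.get(key)
    return value if isinstance(value, str) else None
```

Returning `None` instead of raising lets a receive loop skip messages it does not understand.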

Audio Input Configuration

Configure audio capture using PyAudio:
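One possible capture setup is sketched below, assuming 16-bit mono PCM at 16 kHz read from the default input device. `AudioConfig` and `capture_chunks` are illustrative names, not part of OM1 or PyAudio; the PyAudio calls themselves (`PyAudio.open`, `Stream.read`) are the library's standard ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioConfig:
    sample_rate: int = 16000        # Hz
    channels: int = 1               # mono
    frames_per_buffer: int = 1600   # 100 ms of audio at 16 kHz


def capture_chunks(num_chunks: int, config: AudioConfig = AudioConfig()):
    """Yield raw 16-bit PCM chunks from the default microphone."""
    import pyaudio  # third-party: pip install pyaudio

    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=config.channels,
        rate=config.sample_rate,
        input=True,
        frames_per_buffer=config.frames_per_buffer,
    )
    try:
        for _ in range(num_chunks):
            # Tolerate overflows instead of raising when the reader falls behind.
            yield stream.read(config.frames_per_buffer, exception_on_overflow=False)
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()
```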

TTS Usage

Cloud-Based TTS (OpenMind API)

The TTS endpoint generates speech from text using the Riva Text-to-Speech model.

Endpoint

Basic Example

TTS Parameters

Parameter       Type     Description
text            string   Text to convert to speech
voice           string   Voice identifier (e.g., "English-US.Female-1")
language_code   string   Language code (e.g., "en-US", "es-ES")
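The parameters above could be assembled into a request roughly as follows. This is a sketch under assumptions: it assumes the endpoint accepts a JSON body via HTTP POST, it treats `voice` and `language_code` as optional, and it leaves the URL and any authentication headers to the caller because this page does not state them. `build_tts_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_tts_request(url, text, voice=None, language_code=None, headers=None):
    """Build (but do not send) a JSON POST request for a TTS endpoint."""
    payload = {"text": text}
    if voice is not None:
        payload["voice"] = voice
    if language_code is not None:
        payload["language_code"] = language_code
    request_headers = {"Content-Type": "application/json"}
    request_headers.update(headers or {})  # caller supplies auth headers, if any
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=request_headers,
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` once the real endpoint URL and credentials are known.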

Error Handling

Common Issues

  1. WebSocket connection failed

    Solution: Verify that your API key is valid and check your network connectivity

  2. Invalid API key

    Solution: Ensure you're using a valid OpenMind API key

  3. Audio device not found

    Solution: Check that your microphone is connected and permissions are granted
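For transient connection failures, one common approach is to retry with exponential backoff. The helper below is a generic sketch, not part of OM1; `retry_with_backoff` and its parameters are illustrative. An invalid API key or a missing audio device will not be fixed by retrying, so only network-type errors are retried by default.

```python
import time


def retry_with_backoff(operation, attempts=4, base_delay=0.5,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call operation(); on a retryable error wait 0.5 s, 1 s, 2 s, ... and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```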

Performance Optimization

Chunk Size Tuning

Optimize chunk size for your use case:
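As a rough guide, smaller chunks reduce latency but mean more messages per second, while larger chunks do the opposite. The sketch below converts a chunk duration into frames and bytes, assuming uncompressed 16-bit mono PCM; `chunk_size` is an illustrative helper.

```python
def chunk_size(sample_rate, chunk_ms, sample_width=2, channels=1):
    """Return (frames, bytes) for one chunk of uncompressed PCM audio."""
    frames = sample_rate * chunk_ms // 1000
    return frames, frames * sample_width * channels


for ms in (20, 100, 500):
    frames, size = chunk_size(16000, ms)
    print(f"{ms:>3} ms -> {frames} frames, {size} bytes")
```

At 16 kHz this gives 320 frames (640 bytes) for 20 ms, 1600 frames (3200 bytes) for 100 ms, and 8000 frames (16000 bytes) for 500 ms.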

Sample Rate Selection

Choose an appropriate sample rate based on your quality requirements:

  • 16 kHz: Wideband speech quality, lower bandwidth (recommended for ASR)

  • 44.1 kHz: CD quality audio

  • 48 kHz: Professional audio quality
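To make the bandwidth trade-off concrete, the raw data rate can be computed directly. This sketch assumes uncompressed 16-bit mono PCM; `pcm_bytes_per_second` is an illustrative helper.

```python
def pcm_bytes_per_second(sample_rate, sample_width=2, channels=1):
    """Raw data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate * sample_width * channels


for rate in (16000, 44100, 48000):
    print(f"{rate} Hz -> {pcm_bytes_per_second(rate)} bytes/s")
```

Under these assumptions, 16 kHz needs 32000 bytes/s, 44.1 kHz needs 88200 bytes/s, and 48 kHz needs 96000 bytes/s, so 48 kHz costs three times the bandwidth of 16 kHz.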

Security Considerations

API Key Management

Never hardcode API keys in your source code:
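A minimal sketch of reading the key from the environment instead is shown here. The variable name `OM1_API_KEY` is hypothetical; use whatever name your deployment defines.

```python
import os

API_KEY_ENV_VAR = "OM1_API_KEY"  # hypothetical variable name


def load_api_key(env=os.environ, name=API_KEY_ENV_VAR):
    """Return the API key from the environment, failing loudly if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable to your OpenMind API key"
        )
    return key
```

Failing at startup with a clear message is usually easier to diagnose than a rejected connection later.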

Best Practices

  • Store API keys in environment variables

  • Rotate API keys regularly

  • Monitor API usage for suspicious activity

  • Use HTTPS/WSS for all API communications

Troubleshooting

Enable Debug Logging
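One way to turn on verbose output is with Python's standard `logging` module. The sketch assumes the client uses the third-party `websockets` package, which logs under the `websockets` logger name; adjust the names for the libraries you actually use. `enable_debug_logging` is an illustrative helper.

```python
import logging


def enable_debug_logging(logger_names=("websockets",)):
    """Send DEBUG-level records from the root and named loggers to stderr."""
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # basicConfig is a no-op if handlers already exist, so set levels explicitly.
    logging.getLogger().setLevel(logging.DEBUG)
    for name in logger_names:
        logging.getLogger(name).setLevel(logging.DEBUG)
```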

Check Audio Device

List available audio devices:
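A sketch using PyAudio's device enumeration calls (`get_device_count` and `get_device_info_by_index`) follows. The helper names are illustrative; the dictionary keys are the ones PyAudio returns for each device.

```python
def input_devices(device_infos):
    """Keep only devices that can record (at least one input channel)."""
    return [d for d in device_infos if d.get("maxInputChannels", 0) > 0]


def list_audio_devices():
    """Return PyAudio's info dictionary for every audio device on the system."""
    import pyaudio  # third-party: pip install pyaudio

    audio = pyaudio.PyAudio()
    try:
        return [
            audio.get_device_info_by_index(i)
            for i in range(audio.get_device_count())
        ]
    finally:
        audio.terminate()
```

Calling `input_devices(list_audio_devices())` and printing each entry's `index`, `name`, and `defaultSampleRate` shows which microphones are available. An empty result suggests the device is not connected or the process lacks permission to access it.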

OpenMind developed om1_modules to simplify integration with VILA VLM and other services. For more details, visit our GitHub.
