Conversation
Using Cloud Endpoints for Voice Inputs and Text to Speech
This section provides examples of integrating and using multiple cloud-based AI endpoints, such as OpenAI, DeepSeek, and others, for voice input processing, text-to-speech (TTS) synthesis, and emotion detection. Whether you need to convert spoken language into text (automatic speech recognition, ASR) or generate natural-sounding speech from text, these examples will help you interact with different cloud providers seamlessly.
Voice to Text processing with OpenAI
This example uses your default audio input (microphone) and your default audio output (speaker). Please test both your microphone and speaker in your system settings to make sure they are connected and working.
The example will request permission to access your audio devices. Allow the permissions when prompted.
Install OM1
Before starting this tutorial, please install OM1.
Configure OM1 API key
Locate the configuration file at `config/conversation.json5` and update `api_key` with your OM1 API key.
Apply your OM1 API key
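A quick sanity check (a sketch, assuming the `json5` package and that `api_key` sits at the top level of the config) that the key has actually been set:

```python
import json5  # pip install json5

with open("config/conversation.json5") as fp:
    config = json5.load(fp)

# Fail early if the placeholder key was never replaced.
assert config.get("api_key"), "Set api_key in config/conversation.json5 to your OM1 API key"
print("api_key is configured")
```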
Conversation with OM1
Response
Code Explanation
Missing Unitree SDK
- The script attempts to load the Unitree SDK, which is likely used for controlling a Unitree robot (like a quadruped robot dog).
- Since the SDK is not installed, the script warns the user but continues execution (not a fatal error), following the optional-import pattern sketched below.
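A minimal sketch of that optional-import pattern, assuming the SDK ships as a `unitree_sdk2py` package (the actual module name and OM1's import logic may differ):

```python
import logging

# Optional dependency: only needed when controlling a Unitree robot.
try:
    import unitree_sdk2py  # assumed package name for the Unitree SDK
    UNITREE_AVAILABLE = True
except ImportError:
    UNITREE_AVAILABLE = False
    logging.warning("Unitree SDK not found; continuing without robot control.")
```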
Even without the Unitree SDK, the conversation flow works: you will be able to speak to the LLM and it will generate voice outputs.
Hardware Audio Drivers
Audio Input Detection and Selection
- The system detects 7 audio devices (microphones/speakers).
- The default input device is selected: MacBook Pro Microphone (1).
- This suggests speech recognition or voice input functionality; you can check which device your system treats as the default input with the snippet below.
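For example, a quick way to see the default input device (a sketch using the `sounddevice` package, which is an assumption; OM1 may use a different audio backend):

```python
import sounddevice as sd

# Show the device the system would use as the default microphone,
# mirroring the "MacBook Pro Microphone (1)" line in the log above.
print(sd.query_devices(kind="input"))
print(sd.default.device)  # (default input index, default output index)
```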
WebSocket Connection Established
- A WebSocket client thread starts.
- It successfully connects to wss://api-asr.openmind.org, which appears to be an Automatic Speech Recognition (ASR) service (real-time voice-to-text processing).
- Connection established, meaning speech recognition is now active.
Audio Processing Starts
- The script starts processing audio input from the selected microphone (device 1).
- It also registers a callback function to handle messages received via the WebSocket, as sketched below.
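The overall capture-and-stream loop looks roughly like the sketch below. This is illustrative only: the audio format, message framing, and authentication expected by the OpenMind ASR endpoint are assumptions, and the `sounddevice` and `websocket-client` packages stand in for OM1's own audio and WebSocket plumbing.

```python
import threading

import sounddevice as sd   # audio capture (assumed backend for this sketch)
import websocket           # websocket-client package

ASR_URL = "wss://api-asr.openmind.org"  # endpoint seen in the log above


def on_message(ws, message):
    # Assume the service returns transcribed text for each utterance.
    print("Transcript:", message)


def stream_microphone(ws):
    # Capture 16 kHz mono 16-bit PCM from the default microphone and forward
    # each block to the ASR service as a binary WebSocket frame.
    def callback(indata, frames, time, status):
        ws.send(bytes(indata), opcode=websocket.ABNF.OPCODE_BINARY)

    with sd.RawInputStream(samplerate=16000, channels=1, dtype="int16",
                           callback=callback):
        threading.Event().wait()  # stream until the process exits


def on_open(ws):
    threading.Thread(target=stream_microphone, args=(ws,), daemon=True).start()


ws_app = websocket.WebSocketApp(ASR_URL, on_open=on_open, on_message=on_message)
ws_app.run_forever()
```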
OpenAI Client Initialization
- The script initializes an OpenAI-based client for handling conversational AI.
- It connects to https://api.openmind.org/api/core/openai, suggesting an OpenAI-compatible LLM API used for the chatbot responses; a minimal usage sketch follows this list.
- The API key is logged (security risk ⚠️ – API keys should not be exposed in logs).
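A minimal sketch of talking to that endpoint with the official `openai` Python SDK, assuming the endpoint is OpenAI-compatible; the model name and message are placeholders, not values taken from OM1:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.openmind.org/api/core/openai",  # endpoint from the log
    api_key="YOUR_OM1_API_KEY",  # load from config; never hard-code or log it
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "hello hello"}],
)
print(response.choices[0].message.content)
```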
Audio Output Device Detection
- The system detects audio output devices.
- It selects LG FHD as the output device (a monitor or external speaker).
- Audio streaming is successfully opened, meaning the system can play sound; you can verify playback with the short test below.
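A quick standalone playback check (a sketch using `sounddevice` and `numpy`, not part of OM1): play a one-second test tone on the default output device.

```python
import numpy as np
import sounddevice as sd

samplerate = 48000
t = np.linspace(0, 1.0, samplerate, endpoint=False)
tone = 0.2 * np.sin(2 * np.pi * 440 * t)  # quiet 440 Hz sine wave

print(sd.query_devices(kind="output"))  # default output device (e.g. "LG FHD")
sd.play(tone, samplerate)
sd.wait()
```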
Speech Recognition Captured Input
- The system successfully recognized and transcribed speech input (hello hello).
- This confirms that the ASR (Automatic Speech Recognition) system is working.
Enumerating your audio devices
You can enumerate the available audio devices via the test script in `/system_hw_test`:
Audio support can be marginal, especially on Linux (for example Ubuntu 20.04 on the Nvidia Orin). Expect some audio inputs and outputs not to work correctly, or to advertise incorrect hardware capabilities, such as USB microphones that report zero input channels. A typical workaround is to try a different audio card.
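If you prefer not to use the test script, a short sketch with the `sounddevice` package (an assumption; it is not necessarily what OM1 uses internally) lists every device with its channel counts, which makes misreporting hardware, such as a USB microphone showing zero input channels, easy to spot:

```python
import sounddevice as sd

# List every audio device with its input/output channel counts.
for index, device in enumerate(sd.query_devices()):
    print(f"{index}: {device['name']} "
          f"(in={device['max_input_channels']}, out={device['max_output_channels']})")
```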
Testing audio
You can provide test sentences to speak by adding the `MockInput` to the config file:
Then connect to the WebSocket (`wscat -c ws://localhost:8765`) and type the words you want the system to speak. This is useful for debugging audio output issues and related settings such as chunk values.
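If you do not have `wscat` installed, a tiny Python client (a sketch using the `websocket-client` package) does the same thing:

```python
from websocket import create_connection

# Send one test sentence to the mock-input WebSocket; the system should speak it.
ws = create_connection("ws://localhost:8765")
ws.send("This is a test sentence for checking audio output.")
ws.close()
```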