Structure and Architecture
Core architecture and runtime flow
The system is based on a loop that runs at a fixed frequency, set by self.config.hertz. On each tick, the loop collects the most recent data from its various input sources, fuses that data into a prompt, sends the prompt to one or more LLMs, and then dispatches the LLM responses to virtual agents or physical robots.
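A minimal sketch of that loop follows; everything here except self.config.hertz (the class, plugin, and method names) is illustrative rather than OM1's actual API:

```python
import asyncio
import time


class Runtime:
    """Minimal sketch of a fixed-frequency fuse-and-act loop."""

    def __init__(self, config, inputs, fuser, llm, actions):
        self.config = config    # expected to expose `hertz`
        self.inputs = inputs    # sensor/captioning input plugins
        self.fuser = fuser      # merges the latest inputs into one prompt
        self.llm = llm          # language model client (local or cloud)
        self.actions = actions  # HAL-facing action executors

    async def run(self) -> None:
        period = 1.0 / self.config.hertz
        while True:
            tick_start = time.monotonic()

            # 1. Collect the most recent data from every input source.
            latest = [inp.latest() for inp in self.inputs]

            # 2. Fuse the fragments into a single prompt.
            prompt = self.fuser.fuse(latest)

            # 3. Ask the LLM(s) what to do next.
            response = await self.llm.ask(prompt)

            # 4. Dispatch the response to agents or robot hardware.
            for action in self.actions:
                action.execute(response)

            # Sleep out the rest of the tick to hold the frequency.
            elapsed = time.monotonic() - tick_start
            await asyncio.sleep(max(0.0, period - elapsed))
```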
Architecture Overview
The whole system architecture is shown below:
The architecture diagram illustrates OM1's distinct layers and modules. Here's a detailed breakdown:
Sensors Layer
Sensors provide the raw data inputs (see the sketch after this list):
- Vision: Cameras for visual perception.
- Sound: Microphones capturing audio data.
- Battery/System: Monitoring battery and system health.
- Location/GPS: Positioning information.
- LIDAR: Laser-based sensing for 3D mapping and navigation.
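A minimal sketch of how such inputs might be represented in code; the Sensor base class and poll method are assumptions for illustration, not OM1's actual interface:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SensorReading:
    """A timestamped raw reading from one sensor."""
    timestamp: float
    payload: bytes  # camera frame, audio chunk, GPS fix, LIDAR scan, ...


class Sensor(ABC):
    """Hypothetical common interface for the sensors in this layer."""

    @abstractmethod
    def poll(self) -> SensorReading | None:
        """Return the newest reading, or None if nothing new arrived."""


class Camera(Sensor):
    def poll(self) -> SensorReading | None:
        # A real implementation would grab the latest camera frame here.
        ...
```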
AI and Conversational World Captioning Layer
Processes raw sensor data into meaningful natural-language descriptions (see the sketch after this list):
- VLM (Vision Language Model): Converts visual data to natural language descriptions (e.g., human activities, object interactions).
- ASR (Automatic Speech Recognition): Converts audio data into textual representation.
- Platform State: Describes internal system status (e.g., battery percentage, odometry readings).
- Spatial/NAV: Processes location and navigation data.
- 3D env.: Interprets 3D environmental data from sensors like LIDAR.
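A hedged sketch of the captioning idea (the wrapper classes and the describe/transcribe calls on the injected clients are assumptions): each module turns one raw reading into a natural-language string.

```python
class VLMCaptioner:
    """Hypothetical wrapper: camera frame in, descriptive sentence out."""

    def __init__(self, model):
        self.model = model  # some vision-language model client

    def caption(self, frame: bytes) -> str:
        # e.g. "You see a person smiling and waving at you."
        return self.model.describe(frame)


class ASRCaptioner:
    """Hypothetical wrapper: audio chunk in, transcript sentence out."""

    def __init__(self, recognizer):
        self.recognizer = recognizer  # some speech-to-text engine

    def caption(self, audio: bytes) -> str:
        text = self.recognizer.transcribe(audio)
        return f"You heard: {text!r}"
```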
Natural Language Data Bus (NLDB)
A centralized bus that collects and manages the natural language data generated by the various captioning modules, ensuring structured data flow between components. Example outputs (illustrative) include:
- Vision: "You see a person smiling and waving at you."
- Sound: "You heard: 'Hello robot, please come here.'"
- Platform State: "Battery at 73 percent; charging: no."
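One plausible shape for the NLDB is sketched below; the Snippet and bus types are illustrative, not OM1's actual data structures:

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Snippet:
    source: str    # e.g. "vision", "sound", "platform_state"
    text: str      # the natural-language description
    timestamp: float = field(default_factory=time.time)


class NaturalLanguageDataBus:
    """Collects snippets from the captioners; consumers read the newest."""

    def __init__(self, maxlen: int = 256):
        self._buffer: deque = deque(maxlen=maxlen)

    def publish(self, snippet: Snippet) -> None:
        self._buffer.append(snippet)

    def latest(self, since: float = 0.0) -> list[Snippet]:
        return [s for s in self._buffer if s.timestamp >= since]
```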
Data Fuser
This module synthesizes fragmented inputs from the NLDB into coherent narratives or actionable insights, providing context and situational awareness to subsequent decision-making modules. It fuses spatial data (e.g., human location), audio commands, and visual cues into a unified situational assessment.
Example fused data (illustrative): "A person two meters ahead of you is smiling, waving, and just said: 'Hello robot, please come here.'"
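A hedged sketch of the fusion step, reusing the bus type from the NLDB sketch above (the prompt format and window length are invented for illustration):

```python
import time


class DataFuser:
    """Merge the newest NLDB snippets into a single prompt string."""

    def __init__(self, bus: "NaturalLanguageDataBus", window_s: float = 2.0):
        self.bus = bus            # the bus sketched above
        self.window_s = window_s  # how far back to look, in seconds

    def fuse(self) -> str:
        # Keep only recent snippets, in time order, tagged by source.
        cutoff = time.time() - self.window_s
        snippets = sorted(self.bus.latest(since=cutoff),
                          key=lambda s: s.timestamp)
        lines = [f"[{s.source}] {s.text}" for s in snippets]
        return "Current situation:\n" + "\n".join(lines)
```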
Multi AI Planning/Decision Layer
Uses the fused data to make decisions through sophisticated AI models (see the sketch below):
- Fast Action LLM (Local): A local language model that quickly handles immediate or time-critical actions without significant latency.
- Cognition LLM (Cloud): A cloud-based language model for complex reasoning, long-term planning, and high-level cognitive tasks, leveraging greater computational resources.
- Blockchain: Ensures transparency, traceability, and possibly decentralized decision-making or logging for accountability.
Feedback Loop:
- Adjustments based on performance metrics or environmental conditions (e.g., adjusting vision frame rates for efficiency).
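A minimal sketch of the split between the two planners (the routing rule and the ask method on the model clients are assumptions): time-critical prompts go to the local model, everything else to the cloud model.

```python
class DecisionLayer:
    """Route fused prompts to a fast local LLM or a cloud cognition LLM."""

    def __init__(self, fast_llm, cognition_llm):
        self.fast_llm = fast_llm            # small on-device model
        self.cognition_llm = cognition_llm  # larger cloud model

    async def decide(self, prompt: str, time_critical: bool) -> str:
        if time_critical:
            # Reflex path, e.g. "obstacle ahead" -> stop immediately.
            return await self.fast_llm.ask(prompt)
        # Deliberative path: long-horizon planning, dialogue, etc.
        return await self.cognition_llm.ask(prompt)
```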
NLDB (Lower Layer)
An additional natural language bus connecting the decision/planning layer outputs with the hardware control modules.
Hardware Abstraction Layer (HAL)
This layer translates high-level AI decisions into actionable commands for the robot hardware (see the sketch after this list):
- Move: Controls robot movement.
- Sound: Generates auditory signals.
- Speech: Handles synthesized voice outputs.
- Wallet: Integrated digital wallet for economic transactions or identity verification.
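To make the translation step concrete, the HAL can be pictured as a registry mapping high-level action names to hardware drivers; this sketch and its driver names are illustrative:

```python
class HardwareAbstractionLayer:
    """Map high-level action names to concrete hardware drivers."""

    def __init__(self):
        self._drivers = {}  # e.g. {"move": ..., "speech": ...}

    def register(self, name: str, driver) -> None:
        self._drivers[name] = driver

    def execute(self, name: str, **kwargs) -> None:
        driver = self._drivers.get(name)
        if driver is None:
            raise KeyError(f"no driver registered for action {name!r}")
        driver.run(**kwargs)


# Usage sketch (MoveDriver and SpeechDriver are hypothetical):
# hal = HardwareAbstractionLayer()
# hal.register("move", MoveDriver())
# hal.register("speech", SpeechDriver())
# hal.execute("move", direction="forward", speed=0.3)
```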
Overall System Data Flow
Sensor Layer → AI Captioning → NLDB → Data Fuser → AI Decision Layer → HAL → Robot Actions