Project Structure

.
├── config/               # Agent configuration files
├── src/
│   ├── actions/          # Agent outputs/actions/capabilities
│   ├── fuser/            # Input fusion logic
│   ├── inputs/           # Input plugins (e.g. VLM, audio)
│   ├── llm/              # LLM integration
│   ├── providers/        # Shared data and service providers
│   ├── runtime/          # Core runtime system
│   ├── simulators/       # Virtual endpoints such as `WebSim`
│   ├── zenoh_idl/        # Zenoh's Interface Definition Language (IDL)
│   └── run.py            # CLI entry point

The system is based on a loop that runs at a fixed frequency defined by `self.config.hertz`. On each tick, the loop gathers the most recent data from the various input sources, fuses the data into a prompt, sends that prompt to one or more LLMs, and then dispatches the LLM responses to virtual agents or physical robots.
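
To make this concrete, here is a minimal sketch of such a loop, with hypothetical `fuser`, `llm`, and `actions` components (the actual implementation lives in `src/runtime/`):

    # Hedged sketch of the core runtime loop; names are illustrative.
    import asyncio
    import time

    class Runtime:
        def __init__(self, config, fuser, llm, actions):
            self.config = config    # carries the loop rate, e.g. config.hertz
            self.fuser = fuser      # fuses the latest inputs into a prompt
            self.llm = llm          # LLM backend(s)
            self.actions = actions  # virtual agents or physical robots

        async def run(self):
            period = 1.0 / self.config.hertz
            while True:
                start = time.monotonic()
                prompt = self.fuser.fuse()             # latest data -> prompt
                response = await self.llm.ask(prompt)  # query the LLM(s)
                for action in self.actions:
                    action.execute(response)           # dispatch responses
                # sleep out the rest of the tick to hold the fixed rate
                await asyncio.sleep(max(0.0, period - (time.monotonic() - start)))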

Architecture Overview

The overall system architecture is shown below:

The architecture diagram illustrates OM1's distinct layers and modules. Here is a detailed breakdown:

Sensors Layer

Sensors provide raw data inputs:

  • Vision: Cameras for visual perception.
  • Sound: Microphones capturing audio data.
  • Battery/System: Monitoring battery and system health.
  • Location/GPS: Positioning information.
  • LIDAR: Laser-based sensing for 3D mapping and navigation.

AI and Conversational World Captioning Layer

Processes raw sensor data into meaningful descriptions:

  • VLM (Vision Language Model): Converts visual data to natural language descriptions (e.g., human activities, object interactions).
  • ASR (Automatic Speech Recognition): Converts audio data into textual representation.
  • Platform State: Describes internal system status (e.g., battery percentage, odometry readings).
  • Spatial/NAV: Processes location and navigation data.
  • 3D env.: Interprets 3D environmental data from sensors like LIDAR.
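
As a rough illustration, a captioning module might wrap a model client and emit plain-text descriptions. The interface below is an assumption, not the actual plugin API in `src/inputs/`:

    # Hedged sketch of a captioning input plugin; the Caption type and
    # poll() interface are illustrative assumptions.
    import time
    from dataclasses import dataclass

    @dataclass
    class Caption:
        source: str      # e.g. "vision", "sound", "platform_state"
        timestamp: float
        text: str        # natural-language description of the raw data

    class VLMInput:
        def __init__(self, vlm_client):
            self.vlm = vlm_client  # any client with a describe(frame) method

        def poll(self, frame) -> Caption:
            # e.g. "You see a human. He looks happy and is smiling."
            description = self.vlm.describe(frame)
            return Caption(source="vision", timestamp=time.time(), text=description)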

Natural Language Data Bus (NLDB)

A centralized bus that collects and manages the natural language data generated by the various captioning modules, ensuring structured data flow between components. Example outputs include:

Vision: “You see a human. He looks happy and is smiling and pointing to a chair.”
Sound: “You just heard: Bits, run to the chair.”
Odom: 1.3, 2.71, 0.32
Power: 73%
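
A minimal sketch of an NLDB-style buffer, assuming it simply keeps the most recent entry per source (the real bus may be richer):

    # Hedged sketch: keep the latest natural-language entry per source so
    # the fuser can read a consistent snapshot.
    import time

    class NLDB:
        def __init__(self):
            self._latest: dict[str, tuple[float, str]] = {}

        def publish(self, source: str, text: str) -> None:
            self._latest[source] = (time.time(), text)

        def snapshot(self) -> dict[str, tuple[float, str]]:
            return dict(self._latest)

    bus = NLDB()
    bus.publish("vision", "You see a human. He looks happy and is smiling and pointing to a chair.")
    bus.publish("sound", "You just heard: Bits, run to the chair.")
    bus.publish("odom", "1.3, 2.71, 0.32")
    bus.publish("power", "73%")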

Data Fuser

This module synthesizes fragmented inputs from the NLDB into coherent narratives or actionable insights, providing context and situational awareness to downstream decision-making modules. It fuses spatial data (e.g., human location), audio commands, and visual cues into a unified situational assessment.

Example fused data:

137.0270: You see a human, 3.2 meters to your left. He looks happy and is smiling. He is pointing to a chair. You just heard: Bits run to the chair.
139.0050: You see a human, 1.5 meters in front of you. He is showing you a flat hand. You just heard: Bits, stop.
...
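
A hedged sketch of this step, assuming the fuser simply stitches the latest per-source captions into one timestamped narrative (the actual logic in `src/fuser/` may weigh and order inputs differently):

    # Illustrative fusion: concatenate the most recent captions behind a
    # single timestamp; ordering and formatting are assumptions.
    def fuse(snapshot: dict[str, str], now: float) -> str:
        ordered = [snapshot[s] for s in ("vision", "sound") if s in snapshot]
        return f"{now:.4f}: " + " ".join(ordered)

    snapshot = {
        "vision": "You see a human, 3.2 meters to your left. He is pointing to a chair.",
        "sound": "You just heard: Bits, run to the chair.",
    }
    print(fuse(snapshot, 137.0270))
    # -> 137.0270: You see a human, 3.2 meters to your left. He is pointing
    #    to a chair. You just heard: Bits, run to the chair.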

Multi AI Planning/Decision Layer

Uses fused data to make decisions through sophisticated AI models:

  • Fast Action LLM (Local): A local language model that handles immediate or time-critical actions with minimal latency.
  • Cognition LLM (Cloud): A cloud-based language model for complex reasoning, long-term planning, and high-level cognitive tasks, leveraging greater computational resources.
  • Blockchain: Ensures transparency, traceability, and possibly decentralized decision-making or logging for accountability.

Feedback Loop:

  • Adjustments based on performance metrics or environmental conditions (e.g., adjusting vision frame rates for efficiency).
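
A hedged sketch of a routing policy between the two planners; the deadline threshold and function names are assumptions, since the source only states which model handles which class of task:

    # Illustrative routing: stay local under tight deadlines to avoid
    # network latency, otherwise use the larger cloud model.
    def plan(prompt: str, deadline_s: float, fast_llm, cognition_llm) -> str:
        if deadline_s < 0.5:  # threshold is an assumption
            return fast_llm.complete(prompt)   # time-critical action
        return cognition_llm.complete(prompt)  # deep reasoning / planning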

NLDB (Lower Layer)

An additional natural language bus connecting the decision/planning layer outputs with the hardware control modules.

Hardware Abstraction Layer (HAL)

This layer translates high-level AI decisions into actionable commands for robotic hardware:

  • Move: Controls robot movement.
  • Sound: Generates auditory signals.
  • Speech: Handles synthesized voice outputs.
  • Wallet: Integrated digital wallet for economic transactions or identity verification.
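
As a rough illustration, the HAL can be thought of as a dispatch table from decision channels to hardware drivers; the command format below is hypothetical:

    # Hedged sketch of HAL dispatch; the driver classes and the decision
    # dict format are illustrative assumptions.
    class HAL:
        def __init__(self, drivers: dict):
            # e.g. {"move": MotorDriver(), "speech": TTSDriver(), "sound": BeepDriver()}
            self.drivers = drivers

        def execute(self, decision: dict[str, str]) -> None:
            for channel, command in decision.items():
                driver = self.drivers.get(channel)
                if driver is None:
                    continue  # skip channels this platform lacks
                driver.run(command)

    # Usage: hal.execute({"move": "goto chair", "speech": "On my way."})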

Overall System Data Flow

Sensor Layer → AI Captioning → NLDB → Data Fuser → AI Decision Layer → HAL → Robot Actions