Project Structure

.
├── config/               # Agent configuration files
├── src/
│   ├── actions/          # Agent outputs/actions/capabilities
│   ├── fuser/            # Input fusion logic
│   ├── inputs/           # Input plugins (e.g. VLM, audio)
│   ├── llm/              # LLM integration
│   ├── providers/        # Shared data and service providers
│   ├── runtime/          # Core runtime system
│   ├── simulators/       # Virtual endpoints such as `WebSim`
│   ├── zenoh_idl/        # Zenoh's Interface Definition Language (IDL)
│   └── run.py            # CLI entry point

The system is based on a loop that runs at a fixed frequency defined by `self.config.hertz`. On each tick, the loop gathers the most recent data from the various input sources, fuses the data into a prompt, sends that prompt to one or more LLMs, and then dispatches the LLM responses to virtual agents or physical robots.
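
To make this concrete, here is a minimal sketch of such a loop, with hypothetical `fuser`, `llm`, and `actions` components (the actual implementation lives in `src/runtime/`):

    # Hedged sketch of the core runtime loop; names are illustrative.
    import asyncio
    import time

    class Runtime:
        def __init__(self, config, fuser, llm, actions):
            self.config = config    # carries the loop rate, e.g. config.hertz
            self.fuser = fuser      # fuses the latest inputs into a prompt
            self.llm = llm          # LLM backend(s)
            self.actions = actions  # virtual agents or physical robots

        async def run(self):
            period = 1.0 / self.config.hertz
            while True:
                start = time.monotonic()
                prompt = self.fuser.fuse()             # latest data -> prompt
                response = await self.llm.ask(prompt)  # query the LLM(s)
                for action in self.actions:
                    action.execute(response)           # dispatch responses
                # sleep out the rest of the tick to hold the fixed rate
                await asyncio.sleep(max(0.0, period - (time.monotonic() - start)))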

Architecture Overview

The overall system architecture is shown below:

The architecture diagram illustrates OM1's distinct layers and modules. Here is a detailed breakdown:

Sensors Layer

Sensors provide raw data inputs:

  • Vision: Cameras for visual perception.
  • Sound: Microphones capturing audio data.
  • Battery/System: Monitoring battery and system health.
  • Location/GPS: Positioning information.
  • LIDAR: Laser-based sensing for 3D mapping and navigation.

AI and Conversational World Captioning Layer

Processes raw sensor data into meaningful descriptions:

  • VLM (Vision Language Model): Converts visual data to natural language descriptions (e.g., human activities, object interactions).
  • ASR (Automatic Speech Recognition): Converts audio data into textual representation.
  • Platform State: Describes internal system status (e.g., battery percentage, odometry readings).
  • Spatial/NAV: Processes location and navigation data.
  • 3D env.: Interprets 3D environmental data from sensors like LIDAR.
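
As a rough illustration, a captioning module might wrap a model client and emit plain-text descriptions. The interface below is an assumption, not the actual plugin API in `src/inputs/`:

    # Hedged sketch of a captioning input plugin; the Caption type and
    # poll() interface are illustrative assumptions.
    import time
    from dataclasses import dataclass

    @dataclass
    class Caption:
        source: str      # e.g. "vision", "sound", "platform_state"
        timestamp: float
        text: str        # natural-language description of the raw data

    class VLMInput:
        def __init__(self, vlm_client):
            self.vlm = vlm_client  # any client with a describe(frame) method

        def poll(self, frame) -> Caption:
            # e.g. "You see a human. He looks happy and is smiling."
            description = self.vlm.describe(frame)
            return Caption(source="vision", timestamp=time.time(), text=description)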

Natural Language Data Bus (NLDB)

A centralized bus that collects and manages the natural language data generated by the various captioning modules, ensuring structured data flow between components. Example outputs include:

Vision: “You see a human. He looks happy and is smiling and pointing to a chair.”
Sound: “You just heard: Bits, run to the chair.”
Odom: 1.3, 2.71, 0.32
Power: 73%
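
A minimal sketch of an NLDB-style buffer, assuming it simply keeps the most recent entry per source (the real bus may be richer):

    # Hedged sketch: keep the latest natural-language entry per source so
    # the fuser can read a consistent snapshot.
    import time

    class NLDB:
        def __init__(self):
            self._latest: dict[str, tuple[float, str]] = {}

        def publish(self, source: str, text: str) -> None:
            self._latest[source] = (time.time(), text)

        def snapshot(self) -> dict[str, tuple[float, str]]:
            return dict(self._latest)

    bus = NLDB()
    bus.publish("vision", "You see a human. He looks happy and is smiling and pointing to a chair.")
    bus.publish("sound", "You just heard: Bits, run to the chair.")
    bus.publish("odom", "1.3, 2.71, 0.32")
    bus.publish("power", "73%")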

Data Fuser

This module synthesizes fragmented inputs from the NLDB into coherent narratives or actionable insights, providing context and situational awareness to downstream decision-making modules. It fuses spatial data (e.g., human location), audio commands, and visual cues into a unified situational assessment.

Example fused data:

137.0270: You see a human, 3.2 meters to your left. He looks happy and is smiling. He is pointing to a chair. You just heard: Bits run to the chair.
139.0050: You see a human, 1.5 meters in front of you. He is showing you a flat hand. You just heard: Bits, stop.
...
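
A hedged sketch of this step, assuming the fuser simply stitches the latest per-source captions into one timestamped narrative (the actual logic in `src/fuser/` may weigh and order inputs differently):

    # Illustrative fusion: concatenate the most recent captions behind a
    # single timestamp; ordering and formatting are assumptions.
    def fuse(snapshot: dict[str, str], now: float) -> str:
        ordered = [snapshot[s] for s in ("vision", "sound") if s in snapshot]
        return f"{now:.4f}: " + " ".join(ordered)

    snapshot = {
        "vision": "You see a human, 3.2 meters to your left. He is pointing to a chair.",
        "sound": "You just heard: Bits, run to the chair.",
    }
    print(fuse(snapshot, 137.0270))
    # -> 137.0270: You see a human, 3.2 meters to your left. He is pointing
    #    to a chair. You just heard: Bits, run to the chair.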

Multi AI Planning/Decision Layer

Uses fused data to make decisions through sophisticated AI models:

  • Fast Action LLM (Local): A local language model that handles immediate or time-critical actions with minimal latency.
  • Cognition LLM (Cloud): A cloud-based language model for complex reasoning, long-term planning, and high-level cognitive tasks, leveraging greater computational resources.
  • Blockchain: Ensures transparency, traceability, and possibly decentralized decision-making or logging for accountability.

Feedback Loop:

  • Adjustments based on performance metrics or environmental conditions (e.g., adjusting vision frame rates for efficiency).
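
A hedged sketch of a routing policy between the two planners; the deadline threshold and function names are assumptions, since the source only states which model handles which class of task:

    # Illustrative routing: stay local under tight deadlines to avoid
    # network latency, otherwise use the larger cloud model.
    def plan(prompt: str, deadline_s: float, fast_llm, cognition_llm) -> str:
        if deadline_s < 0.5:  # threshold is an assumption
            return fast_llm.complete(prompt)   # time-critical action
        return cognition_llm.complete(prompt)  # deep reasoning / planning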

NLDB (Lower Layer)

An additional natural language bus connecting the decision/planning layer outputs with the hardware control modules.

Hardware Abstraction Layer (HAL)

This layer translates high-level AI decisions into actionable commands for robotic hardware:

  • Move: Controls robot movement.
  • Sound: Generates auditory signals.
  • Speech: Handles synthesized voice outputs.
  • Wallet: Integrated digital wallet for economic transactions or identity verification.
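
As a rough illustration, the HAL can be thought of as a dispatch table from decision channels to hardware drivers; the command format below is hypothetical:

    # Hedged sketch of HAL dispatch; the driver classes and the decision
    # dict format are illustrative assumptions.
    class HAL:
        def __init__(self, drivers: dict):
            # e.g. {"move": MotorDriver(), "speech": TTSDriver(), "sound": BeepDriver()}
            self.drivers = drivers

        def execute(self, decision: dict[str, str]) -> None:
            for channel, command in decision.items():
                driver = self.drivers.get(channel)
                if driver is None:
                    continue  # skip channels this platform lacks
                driver.run(command)

    # Usage: hal.execute({"move": "goto chair", "speech": "On my way."})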

Overall System Data Flow

Sensor Layer → AI Captioning → NLDB → Data Fuser → AI Decision Layer → HAL → Robot Actions