VILA Vision-Language Model API Reference

The VILA VLM API provides real-time vision-language model analysis of video streams. This WebSocket-based endpoint lets you stream video frames with low latency and receive intelligent visual descriptions and analysis in return.

Base URL: wss://api-vila.openmind.org

Authentication: Requires an OpenMind API key passed as a query parameter.

WebSocket Connection

Establish a persistent WebSocket connection for streaming video frames and receiving real-time VLM analysis.

Endpoint: wss://api-vila.openmind.org?api_key=YOUR_API_KEY

Connection Parameters

| Parameter | Type   | Required | Description                              |
|-----------|--------|----------|------------------------------------------|
| api_key   | string | Yes      | Your OpenMind API key for authentication |

Connection Example

import asyncio
import websockets

async def connect_to_vlm():
    async with websockets.connect(
        "wss://api-vila.openmind.org?api_key=om1_live_your_api_key"
    ) as websocket:
        # Send and receive messages
        pass

asyncio.run(connect_to_vlm())

Sending Video Frames

Message Format

Send video frames as JSON messages over the WebSocket connection:
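The original payload example did not survive extraction; the sketch below builds a message with the two documented fields (timestamp, frame). The helper name build_frame_message is illustrative, not part of any SDK:

```python
import base64
import json
import time

def build_frame_message(jpeg_bytes: bytes) -> str:
    """Package one JPEG frame as the JSON message the endpoint expects."""
    return json.dumps({
        "timestamp": time.time(),  # Unix timestamp when the frame was captured
        "frame": base64.b64encode(jpeg_bytes).decode("utf-8"),  # base64 JPEG
    })

# Example: wrap raw JPEG bytes (here a dummy payload) for sending
message = build_frame_message(b"\xff\xd8\xff\xe0")
```

The resulting string can be passed directly to `websocket.send()`.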

Message Fields

| Field     | Type   | Required | Description                                |
|-----------|--------|----------|--------------------------------------------|
| timestamp | float  | Yes      | Unix timestamp when the frame was captured |
| frame     | string | Yes      | Base64-encoded JPEG image data             |

Frame Specifications

  • Format: JPEG (base64-encoded)

  • Recommended Resolution: 640x480 pixels (configurable)

  • Recommended FPS: 30 frames per second (configurable)

  • Quality: JPEG compression quality 70 (default)

Receiving VLM Analysis

Response Format

VLM Analysis Result:
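The response example was lost in extraction; based on the documented vlm_reply field, a result looks roughly like the illustrative payload below (the description text is an example, not a fixed format):

```python
import json

# Illustrative VLM analysis result as received over the WebSocket
raw = '{"vlm_reply": "A person is sitting at a desk, typing on a laptop."}'

result = json.loads(raw)
print(result["vlm_reply"])
```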

Response Fields

| Field     | Type   | Description                                        |
|-----------|--------|----------------------------------------------------|
| vlm_reply | string | Vision-language model analysis of the video frames |

Usage Examples

Python Example with VideoStream

The om1_vlm.VideoStream wrapper simplifies video capture and streaming:
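A sketch of how the wrapper might be wired up, assuming om1_vlm exposes VideoStream with the parameters listed in the table that follows; the import path and the start()/stop() lifecycle are assumptions, not confirmed API:

```python
import asyncio
import websockets
from om1_vlm import VideoStream  # import path assumed

async def main():
    async with websockets.connect(
        "wss://api-vila.openmind.org?api_key=om1_live_your_api_key"
    ) as ws:
        stream = VideoStream(
            frame_callback=ws.send,   # each captured frame is sent on the socket
            fps=30,
            resolution=(640, 480),
            jpeg_quality=70,
            device_index=0,
        )
        stream.start()                # assumed lifecycle method
        try:
            async for message in ws:  # print VLM replies as they arrive
                print(message)
        finally:
            stream.stop()             # assumed lifecycle method

asyncio.run(main())
```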

VideoStream Parameters

| Parameter      | Type            | Default    | Description                                             |
|----------------|-----------------|------------|---------------------------------------------------------|
| frame_callback | Callable        | None       | Callback function to send frames (e.g., websocket.send) |
| fps            | int             | 30         | Frames per second to capture                            |
| resolution     | Tuple[int, int] | (640, 480) | Video resolution (width, height)                        |
| jpeg_quality   | int             | 70         | JPEG compression quality (0-100)                        |
| device_index   | int             | 0          | Camera device index                                     |

Custom Implementation

For custom video streaming without the VideoStream wrapper:
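A minimal sketch of a custom sender, assuming opencv-python and websockets are installed; it captures frames, JPEG-encodes them at quality 70, and sends them in the documented message format:

```python
import asyncio
import base64
import json
import time

import cv2          # opencv-python, assumed installed
import websockets

API_URL = "wss://api-vila.openmind.org?api_key=om1_live_your_api_key"

async def stream_camera(fps: int = 30):
    cap = cv2.VideoCapture(0)                        # default camera device
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
    async with websockets.connect(API_URL) as ws:
        try:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                # JPEG-encode at quality 70, then base64 for transport
                ok, jpeg = cv2.imencode(
                    ".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 70]
                )
                if not ok:
                    continue
                await ws.send(json.dumps({
                    "timestamp": time.time(),
                    "frame": base64.b64encode(jpeg.tobytes()).decode("utf-8"),
                }))
                await asyncio.sleep(1 / fps)         # pace to ~30 FPS
        finally:
            cap.release()                            # release the camera

asyncio.run(stream_camera())
```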

JavaScript/Node.js Example
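A Node.js sketch using the third-party ws package (npm install ws); getJpegFrame is a placeholder for your own capture code, not a real API:

```javascript
// Minimal sketch assuming the "ws" package and a frame source of your own.
const WebSocket = require("ws");

const ws = new WebSocket(
  "wss://api-vila.openmind.org?api_key=om1_live_your_api_key"
);

ws.on("open", () => {
  setInterval(() => {
    const jpegBuffer = getJpegFrame(); // placeholder: returns a Buffer of JPEG bytes
    ws.send(JSON.stringify({
      timestamp: Date.now() / 1000,          // Unix timestamp in seconds
      frame: jpegBuffer.toString("base64"),  // base64-encoded JPEG
    }));
  }, 1000 / 30); // ~30 FPS
});

ws.on("message", (data) => {
  const { vlm_reply } = JSON.parse(data);
  console.log("VLM:", vlm_reply);
});

ws.on("close", () => console.log("Connection closed"));
```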

Best Practices

Video Quality

  • Use recommended resolution of 640x480 for optimal balance of quality and bandwidth

  • Maintain JPEG quality around 70 for efficient compression

  • Ensure good lighting for better visual analysis

  • Keep camera stable for consistent results

Network Optimization

  • Send frames at consistent intervals (30 FPS recommended)

  • Monitor WebSocket connection health

  • Implement reconnection logic for network interruptions

  • Buffer frames locally during temporary connection issues

Performance Tips

  • Don't accumulate frames before sending - stream in real-time

  • Process VLM responses asynchronously

  • Adjust FPS based on network conditions

  • Use appropriate resolution for your use case

Security

  • Never hardcode API keys in client-side code

  • Use environment variables for API key storage

  • Rotate API keys regularly

  • Monitor API key usage for suspicious activity

Error Handling

Connection Issues

  • Verify API key is valid and active

  • Check WebSocket support in your environment

  • Ensure network allows WebSocket connections

  • Test connection with basic example first

Poor Analysis Quality

  • Increase video resolution if bandwidth allows

  • Improve lighting conditions

  • Reduce motion blur by adjusting camera settings

  • Ensure frames are not corrupted during encoding

Cleanup

Always properly close connections and release resources:
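The cleanup snippet did not survive extraction; a sketch of the pattern, assuming the websockets library (the camera handle cap is hypothetical, standing in for whatever capture resource you opened):

```python
import asyncio
import websockets

async def run():
    ws = await websockets.connect(
        "wss://api-vila.openmind.org?api_key=om1_live_your_api_key"
    )
    try:
        ...  # stream frames and read replies here
    finally:
        await ws.close()     # always close the socket
        # cap.release()      # and release any camera you opened (hypothetical)

asyncio.run(run())
```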

OpenMind developed om1_modules to simplify integration with VILA VLM and other services. For more details, visit our GitHub.
