NVIDIA Cosmos 3 - Physical AI Reasoning, World, and Action Models

Overview

What is Cosmos 3?

NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model. Built on a Mixture-of-Transformers (MoT) architecture with two specialized towers, Cosmos 3 unifies capabilities that previous releases handled through separate models and workflows.

The Reasoner tower interprets multimodal observations like images, videos, and text to understand motion, object interactions, and physical context. The Generator tower produces physics-aware video and action outputs conditioned on the reasoner's understanding. This enables a single model to handle reasoning and generation tasks, simplifying development by eliminating orchestration between multiple models and inference pipelines.

Cosmos 3 is open-sourced with model checkpoints, training scripts, deployment tools, and datasets available on Hugging Face and GitHub, making physical AI development more open and reproducible.

Reasoner Tower

Vision-Language Model that interprets multimodal observations

Generator Tower

Diffusion-based process for physics-aware video and action outputs

Mixture-of-Transformers (MoT) Architecture

Key Highlights

What makes Cosmos 3 different

A unified architecture that brings reasoning, world generation, and action generation into a single model for the first time.

Unified Architecture

Combines physical reasoning, world generation, and action generation within a single Mixture-of-Transformers model, eliminating the need to orchestrate multiple separate models.

Reasoner + Generator Towers

A Vision-Language Model Reasoner interprets multimodal observations, while a diffusion-based Generator produces physics-aware video and action outputs.

Multiple Model Sizes

Cosmos 3 Nano (16B parameters) for workstation-grade GPUs like RTX PRO 6000, and Cosmos 3 Super (64B parameters) for datacenter deployment on Hopper and Blackwell GPUs.

Open Source Release

Full open release including model checkpoints, training scripts, deployment tools, and six synthetic data generation datasets for robotics, autonomous driving, and more.

State-of-the-Art Benchmarks

Leading open-source model on PAI-Bench, R-Bench Physics-IQ, RoboLab, and Artificial Analysis leaderboards across reasoning and generation tasks.

NVIDIA NIM Deployment

Available as NVIDIA NIM microservices with optimized inference including BF16, FP8, and NVFP4 quantization for up to 2x inference speedup.

How It Works

Cosmos 3 Architecture

A Mixture-of-Transformers architecture built around two specialized towers working in concert.

Multimodal Input

The Reasoner tower receives images, videos, and text, interpreting observations to understand motion, object interactions, and physical context.

Physical Reasoning

The autoregressive VLM processes the input, serving as the brain that reasons about the physical world before any generation takes place.

Guided Generation

The Generator tower activates both towers for guided generation, producing physics-aware video outputs conditioned on the reasoner's understanding.

Action Output

Generates action sequences for robotics, future observations, and policy learning through forward dynamics, inverse dynamics, and policy generation.

Model Options

Choose the right model size

Two model sizes optimized for different deployment scenarios, from workstation to datacenter.

Nano

Cosmos 3 Nano

16B parameters

Compact version optimized for efficient inference on workstation-grade compute. Designed to run on NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.

Parameters16 Billion

Target GPURTX PRO 6000

Use CaseReal-time inference

FormatBF16, FP8, NVFP4

Super

Cosmos 3 Super

64B parameters

Maximum quality and capability with the highest benchmark scores. Designed for datacenter deployment on NVIDIA Hopper and Blackwell GPUs for large-scale synthetic data generation and advanced workloads.

Parameters64 Billion

Target GPUHopper / Blackwell

Use CaseMax quality generation

FormatBF16, FP8, NVFP4

Supported Modalities

Input and output modalities

Cosmos 3 supports a wide range of modalities through its unified architecture.

Input	Output	Application
Text	Image	Physically-plausible image generation
Text \| Video	Video	World model for rare edge case video data generation
Text \| Image	Video	World model for prediction
Text \| Image	Text	VLM for reasoning
Action \| Video	Video	Action-conditioned world model
Video \| Text	Video \| Action	World action model, video action model, policy model for robot learning

Open Datasets

Physical AI synthetic data

Six synthetic data generation datasets open-sourced on Hugging Face for post-training Cosmos 3 and other models.

Embodied Robot Scenes

Synthetic data for robotic manipulation and embodied AI tasks, covering diverse robot morphologies and interaction scenarios.

Physical Interaction Scenes

Physics-grounded scenes demonstrating object interactions, collisions, and dynamic physical phenomena across varied environments.

Spatial Reasoning

Scenes designed to test and improve spatial understanding, depth perception, and 3D relationship reasoning in physical AI models.

Digital Human Scenes

Human motion, gesture, and interaction datasets for training models that need to understand and generate human behavior.

Autonomous Driving

Driving scenarios covering diverse road conditions, traffic patterns, and edge cases for autonomous vehicle perception and planning.

Warehouse Operations

Warehouse environment data for training models on logistics, inventory management, and automated material handling scenarios.

Benchmark Results

State-of-the-art performance

Cosmos 3 achieves leading results across multiple benchmark suites covering reasoning, generation, and domain-specific tasks.

PAI-Bench

Unified benchmark evaluating physical AI across video understanding and generation, spanning robotics, autonomous vehicles, and physics.

R-Bench Physics-IQ

Benchmark evaluating video-based world models for robotic video generation, assessing task completion and visual quality.

RoboLab

Simulation benchmark for evaluating task-generalist robot policies across diverse manipulation and navigation tasks.

Artificial Analysis

Leading open-source model on Text to Image and Image to Video leaderboards for physical AI generation quality.

VANTAGE-Bench

First public benchmark for evaluating VLMs on real-world fixed-camera footage across warehouses, transportation, and smart spaces.

Traffic Anomaly Reasoning

Official leaderboard for AI City Challenge 2026 Track 3, detecting and reasoning about anomalous events in transportation footage.

FAQ

Frequently asked questions

Quick answers to common questions about Cosmos 3.

What is NVIDIA Cosmos 3?

NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model. It uses a Mixture-of-Transformers architecture with a Reasoner tower (VLM) and a Generator tower (diffusion-based) to handle both reasoning and generation tasks.

What makes Cosmos 3 different from previous Cosmos releases?

Previous Cosmos releases separated world generation, physical understanding, and controlled scene generation into different models and workflows. Cosmos 3 unifies these capabilities with a Mixture-of-Transformers architecture built around two towers, eliminating the need for orchestration between multiple separate models and inference pipelines.

What model sizes are available?

Two sizes are available: Cosmos 3 Nano with 16B parameters optimized for workstation-grade GPUs like the NVIDIA RTX PRO 6000, and Cosmos 3 Super with 64B parameters designed for datacenter deployment on NVIDIA Hopper and Blackwell GPUs.

Is Cosmos 3 open source?

Yes, Cosmos 3 is fully open source. NVIDIA has released model checkpoints on Hugging Face, code and training recipes on GitHub, open datasets on Hugging Face, and deployment tools including NIM microservices for production inference.

What datasets are available with Cosmos 3?

Six synthetic data generation datasets are open-sourced on Hugging Face: Embodied Robot Scenes, Physical Interaction Scenes, Spatial Reasoning, Digital Human Scenes, Autonomous Driving Scenarios, and Warehouse Operations Scenes.

How do I deploy Cosmos 3 in production?

Cosmos 3 models are available as NVIDIA NIM microservices for optimized production deployment. NIM packages the model with optimized inference runtimes, supporting BF16, FP8, and NVFP4 quantization for up to 2x inference speedup. The Reasoner NIM is available now, with the Generator NIM coming soon.

What hardware is required to run Cosmos 3?

Cosmos 3 Nano is designed for workstation-grade GPUs like the NVIDIA RTX PRO 6000. Cosmos 3 Super targets datacenter deployment on NVIDIA Hopper and Blackwell GPUs. The post-training workflows require NVIDIA GPUs with sufficient VRAM for the chosen model size.

Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3