A frontier foundation model for physical AI that unifies physical reasoning, world generation, and action generation within a single open model. Open-source checkpoints, training scripts, deployment tools, and datasets are available.
NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model. Built on a Mixture-of-Transformers (MoT) architecture with two specialized towers, Cosmos 3 unifies capabilities that previous releases handled through separate models and workflows.
The Reasoner tower interprets multimodal observations such as images, videos, and text to understand motion, object interactions, and physical context. The Generator tower produces physics-aware video and action outputs conditioned on the Reasoner's understanding. A single model can therefore handle both reasoning and generation tasks, which simplifies development by removing the orchestration otherwise needed between multiple models and inference pipelines.
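The division of labor between the two towers can be sketched as a small routing rule. This is only an illustration under stated assumptions: the names `Request`, `active_towers`, and `GENERATED_MODALITIES` are hypothetical and are not part of any Cosmos 3 API, and the rule that text-only outputs need just the Reasoner is an assumption inferred from the description above.

```python
from dataclasses import dataclass

# Hypothetical illustration only; not the Cosmos 3 API.
# Assumption: image, video, and action outputs require the Generator tower,
# while text-only outputs can be served by the Reasoner alone.
GENERATED_MODALITIES = frozenset({"image", "video", "action"})


@dataclass(frozen=True)
class Request:
    inputs: frozenset   # e.g. frozenset({"text", "image"})
    outputs: frozenset  # e.g. frozenset({"video"})


def active_towers(request: Request) -> list:
    """Return which towers take part in serving the request."""
    # The Reasoner always runs first to interpret the observations.
    towers = ["reasoner"]
    # Generation is conditioned on the Reasoner, so both towers are active.
    if request.outputs & GENERATED_MODALITIES:
        towers.append("generator")
    return towers


if __name__ == "__main__":
    qa = Request(frozenset({"text", "image"}), frozenset({"text"}))
    gen = Request(frozenset({"text", "image"}), frozenset({"video"}))
    print(active_towers(qa))
    print(active_towers(gen))
```

The point of the sketch is that the caller addresses one model and the split between reasoning and generation happens inside it, rather than in orchestration code between separate services.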
Cosmos 3 is open-sourced with model checkpoints, training scripts, deployment tools, and datasets available on Hugging Face and GitHub, making physical AI development more open and reproducible.
Reasoner: Vision-Language Model that interprets multimodal observations
Generator: Diffusion-based process for physics-aware video and action outputs
Mixture-of-Transformers (MoT) Architecture
A unified architecture that brings reasoning, world generation, and action generation into a single model for the first time.
Combines physical reasoning, world generation, and action generation within a single Mixture-of-Transformers model, eliminating the need to orchestrate multiple separate models.
A Vision-Language Model Reasoner interprets multimodal observations, while a diffusion-based Generator produces physics-aware video and action outputs.
Two sizes are available: Cosmos 3 Nano (16B parameters) for workstation-grade GPUs such as the RTX PRO 6000, and Cosmos 3 Super (64B parameters) for datacenter deployment on Hopper and Blackwell GPUs.
Full open release including model checkpoints, training scripts, deployment tools, and six synthetic data generation datasets for robotics, autonomous driving, and more.
Leading open-source model on the PAI-Bench, R-Bench, Physics-IQ, RoboLab, and Artificial Analysis leaderboards across reasoning and generation tasks.
Available as NVIDIA NIM microservices with optimized inference including BF16, FP8, and NVFP4 quantization for up to 2x inference speedup.
A Mixture-of-Transformers architecture built around two specialized towers working in concert.
The Reasoner tower receives images, videos, and text, interpreting observations to understand motion, object interactions, and physical context.
The autoregressive VLM processes the input, serving as the brain that reasons about the physical world before any generation takes place.
Generation activates both towers: guided by the Reasoner, the Generator tower produces physics-aware video outputs conditioned on the Reasoner's understanding.
The Generator also produces action sequences for robotics, future observations, and policy learning through forward dynamics, inverse dynamics, and policy generation.
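The three action-related roles named above (forward dynamics, inverse dynamics, and policy generation) have standard textbook meanings that a toy example can make concrete. The sketch below uses a 1-D point whose action is a velocity command; it assumes nothing about how Cosmos 3 realizes these roles with learned video and action models, and every function name is hypothetical.

```python
# Toy 1-D illustration of the three roles. All names are hypothetical
# and unrelated to the Cosmos 3 codebase.

def forward_dynamics(position, velocity_cmd, dt):
    """Predict the next state from the current state and an action."""
    return position + velocity_cmd * dt


def inverse_dynamics(position, next_position, dt):
    """Recover the action that explains an observed state transition."""
    return (next_position - position) / dt


def policy(position, goal, max_speed, dt):
    """Choose an action that moves toward the goal within a speed limit."""
    desired = (goal - position) / dt
    return max(-max_speed, min(max_speed, desired))


def rollout(start, goal, steps, max_speed=1.0, dt=0.25):
    """Alternate policy and forward dynamics to produce a trajectory."""
    positions = [start]
    for _ in range(steps):
        action = policy(positions[-1], goal, max_speed, dt)
        positions.append(forward_dynamics(positions[-1], action, dt))
    return positions


if __name__ == "__main__":
    trajectory = rollout(start=0.0, goal=1.0, steps=5)
    print(trajectory)
    # Inverse dynamics recovers the action behind the first transition.
    print(inverse_dynamics(trajectory[0], trajectory[1], dt=0.25))
```

In a world-model setting the "state" would be video observations rather than a single number, but the input and output roles are the same: forward dynamics maps state and action to the next state, inverse dynamics maps two states to an action, and a policy maps a state to an action.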
Two model sizes optimized for different deployment scenarios, from workstation to datacenter.
Cosmos 3 Nano (16B parameters): Compact version optimized for efficient inference on workstation-grade compute. Designed to run on an NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.
Cosmos 3 Super (64B parameters): Maximum quality and capability with the highest benchmark scores. Designed for datacenter deployment on NVIDIA Hopper and Blackwell GPUs for large-scale synthetic data generation and advanced workloads.
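To see why the two sizes target different hardware, it helps to estimate the memory taken by the weights alone at each precision mentioned on this page. The following is a back-of-the-envelope sketch, not a sizing guide: it assumes 2 bytes per parameter for BF16, 1 byte for FP8, and 0.5 bytes for NVFP4, and it ignores activations, caches, and the scale-factor overhead that block-scaled formats carry. The helper name `weight_gigabytes` is made up for this example.

```python
# Rough weight-only memory estimate. Assumptions (not from the release):
# bytes per parameter as listed below, no scale-factor or runtime overhead.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

# Parameter counts as stated on this page.
MODELS = {
    "Cosmos 3 Nano": 16_000_000_000,
    "Cosmos 3 Super": 64_000_000_000,
}


def weight_gigabytes(num_params, precision):
    """Return approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in MODELS.items():
        for precision in BYTES_PER_PARAM:
            gb = weight_gigabytes(params, precision)
            print(f"{name:15s} {precision:6s} ~{gb:.0f} GB of weights")
```

Under these assumptions the 16B model needs roughly 32 GB for BF16 weights, while the 64B model needs roughly 128 GB, which is consistent with the workstation versus datacenter split described above. Actual requirements will be higher once runtime memory is included.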
Cosmos 3 supports a wide range of modalities through its unified architecture.
| Input | Output | Application |
|---|---|---|
| Text | Image | Physically-plausible image generation |
| Text, Video | Video | World model for rare edge case video data generation |
| Text, Image | Video | World model for prediction |
| Text, Image | Text | VLM for reasoning |
| Action, Video | Video | Action-conditioned world model |
| Video, Text | Video, Action | World action model, video action model, policy model for robot learning |
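The table can also be read as a lookup from a set of input modalities and a set of output modalities to an application. The sketch below encodes it that way. The grouping of modalities into input and output sets is an interpretation of the table, especially for rows that list several modalities, and the names `APPLICATIONS` and `lookup_application` are hypothetical.

```python
# Hypothetical encoding of the modality table as a lookup.
# Keys are (input modalities, output modalities); the grouping of cells
# into inputs and outputs is an assumption about how to read each row.
APPLICATIONS = {
    (frozenset({"text"}), frozenset({"image"})):
        "Physically-plausible image generation",
    (frozenset({"text", "video"}), frozenset({"video"})):
        "World model for rare edge case video data generation",
    (frozenset({"text", "image"}), frozenset({"video"})):
        "World model for prediction",
    (frozenset({"text", "image"}), frozenset({"text"})):
        "VLM for reasoning",
    (frozenset({"action", "video"}), frozenset({"video"})):
        "Action-conditioned world model",
    (frozenset({"video", "text"}), frozenset({"video", "action"})):
        "World action model, video action model, "
        "policy model for robot learning",
}


def lookup_application(inputs, outputs):
    """Return the listed application, or None if the pair is not listed."""
    return APPLICATIONS.get((frozenset(inputs), frozenset(outputs)))


if __name__ == "__main__":
    print(lookup_application({"text", "image"}, {"text"}))
    print(lookup_application({"action", "video"}, {"video"}))
    print(lookup_application({"audio"}, {"text"}))
```

Using sets as keys makes the lookup independent of the order in which modalities are named, so `{"video", "text"}` and `{"text", "video"}` resolve to the same row.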
Six synthetic data generation datasets open-sourced on Hugging Face for post-training Cosmos 3 and other models.
Synthetic data for robotic manipulation and embodied AI tasks, covering diverse robot morphologies and interaction scenarios.
Physics-grounded scenes demonstrating object interactions, collisions, and dynamic physical phenomena across varied environments.
Scenes designed to test and improve spatial understanding, depth perception, and 3D relationship reasoning in physical AI models.
Human motion, gesture, and interaction datasets for training models that need to understand and generate human behavior.
Driving scenarios covering diverse road conditions, traffic patterns, and edge cases for autonomous vehicle perception and planning.
Warehouse environment data for training models on logistics, inventory management, and automated material handling scenarios.
Cosmos 3 achieves leading results across multiple benchmark suites covering reasoning, generation, and domain-specific tasks.
Unified benchmark evaluating physical AI across video understanding and generation, spanning robotics, autonomous vehicles, and physics.
Benchmark evaluating video-based world models for robotic video generation, assessing task completion and visual quality.
Simulation benchmark for evaluating task-generalist robot policies across diverse manipulation and navigation tasks.
Leading open-source model on Text to Image and Image to Video leaderboards for physical AI generation quality.
First public benchmark for evaluating VLMs on real-world fixed-camera footage across warehouses, transportation, and smart spaces.
Official leaderboard for AI City Challenge 2026 Track 3, detecting and reasoning about anomalous events in transportation footage.
Download model checkpoints, explore training recipes, and deploy with optimized NIM microservices. The open-source release makes physical AI development more accessible than ever.