SANA-WM: Open-Source World Model Generates One-Minute Video in 720p

NVIDIA has released SANA-WM, a 2.6-billion-parameter world model that generates 60-second video at 720p resolution on a single GPU. The project is fully open-source and offers precise 6-DoF camera control driven by a single input image and a camera path.

What is SANA-WM and how does this NVIDIA world model work?
What are the hardware requirements for running SANA-WM?
How does SANA-WM differ from other AI video generators?
What does the training process for SANA-WM look like?
In what applications will SANA-WM prove useful?
How to install and run SANA-WM locally?

TL;DR: NVIDIA’s SANA-WM is a 2.6-billion parameter world model that generates 60-second 720p video from a single image and camera path. The model was trained on 213,000 public video clips. It runs on a single GPU, and a distilled version supports the NVFP4 format on RTX 5090 cards. The project is fully open-source.

What is SANA-WM and how does this NVIDIA world model work?

SANA-WM is a 2.6-billion-parameter world model designed to generate minute-long video sequences at 720p resolution. It was trained on a dataset of 213,000 publicly available video clips. What sets it apart from other video generators is precise control over camera movement combined with scene consistency across the full 60 seconds of footage. The goal is controlled scene simulation rather than short, dynamic clips.

The model takes a single image and a defined camera path as input, then renders the entire sequence while preserving the environment’s geometry and physics. Standard video generators, by contrast, focus primarily on image transformation without spatial awareness.

SANA-WM’s architecture integrates 6-DoF (six degrees of freedom) control: the virtual camera can translate along three axes and rotate around each of them. Because the system models the scene as three-dimensional space, it can generate realistic perspectives and smooth transitions between frames.
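To make the idea of a camera path concrete, here is a minimal Python sketch of a 6-DoF pose and a naive straight-line path between two poses. The Pose6DoF class, the interpolate_path function, the axis layout, and the use of Euler angles in degrees are all illustrative assumptions; the source does not describe the path format SANA-WM actually expects, nor its frame rate, so the frame count below is arbitrary.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Pose6DoF:
    """One camera pose: three translations plus three rotations (degrees).

    Hypothetical structure for illustration, not SANA-WM's real input format.
    """
    x: float
    y: float
    z: float
    yaw: float
    pitch: float
    roll: float


def interpolate_path(start: Pose6DoF, end: Pose6DoF, num_frames: int) -> list:
    """Linearly interpolate all six components from start to end.

    Angles are interpolated naively, with no wrap-around handling at 360
    degrees, which is good enough for small rotations in a sketch.
    """
    if num_frames < 2:
        raise ValueError("a path needs at least two frames")
    path = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        values = [a + (b - a) * t for a, b in zip(astuple(start), astuple(end))]
        path.append(Pose6DoF(*values))
    return path


# Move 4 units along z while turning 30 degrees, sampled at 5 poses.
start = Pose6DoF(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
end = Pose6DoF(0.0, 0.0, 4.0, 30.0, 0.0, 0.0)
path = interpolate_path(start, end, num_frames=5)
for pose in path:
    print(pose)
```

The middle pose of this example sits at z = 2.0 with a yaw of 15 degrees. A real path would usually come from keyframes or a spline rather than a single straight segment.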

SANA-WM is a 2.6-billion parameter open-source model from NVIDIA that generates 60-second 720p video from a single input image and 6-DoF camera path, trained on 213,000 public video clips with the ability to run on a single GPU.

What are the hardware requirements for running SANA-WM?

SANA-WM has been optimized to run on a single GPU, which is rare for systems generating such long video sequences. The standard version requires a graphics card with enough VRAM to hold 2.6 billion parameters during inference. NVIDIA has also prepared a distilled version that uses the NVFP4 quantization format and targets the latest graphics cards, such as the RTX 5090. NVFP4 quantization reduces memory consumption while maintaining acceptable output quality, so users with a high-end consumer card can run the lighter variant. Below is a summary of the requirements and capabilities of both model variants.

Model Version	Parameters	Format	Required GPU	Video Length
SANA-WM (full)	2.6B	FP16/BF16	GPU with high VRAM	60 seconds
SANA-WM (distilled)	2.6B	NVFP4	RTX 5090	60 seconds

The full version of the model runs on a single GPU, while the distilled version with NVFP4 quantization is adapted for RTX 5090 cards, which lowers the barrier to entry for creators working on consumer hardware. As with training your own LLM from scratch, resource optimization is key here.
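A rough back-of-the-envelope calculation shows why the storage format matters. The sketch below estimates weight storage only. It assumes every one of the 2.6 billion parameters is stored in the given format, and it assumes the publicly described NVFP4 layout of 4-bit values with one 8-bit scale per block of 16 values. Neither detail comes from the source, and activations, latents, and any auxiliary networks would add to the real VRAM footprint.

```python
PARAMS = 2_600_000_000  # parameter count stated in the article


def weight_bytes(num_params, bits_per_value, block_size=None, scale_bits=0):
    """Estimate bytes needed to store the weights alone.

    block_size and scale_bits model block-scaled formats, where each
    block of values shares one extra scale factor.
    """
    total_bits = num_params * bits_per_value
    if block_size:
        num_blocks = -(-num_params // block_size)  # ceiling division
        total_bits += num_blocks * scale_bits
    return total_bits / 8


# FP16 and BF16 both use 16 bits per value.
fp16_bytes = weight_bytes(PARAMS, 16)

# Assumed NVFP4 layout: 4-bit values, one 8-bit scale per 16 values.
nvfp4_bytes = weight_bytes(PARAMS, 4, block_size=16, scale_bits=8)

print(f"FP16/BF16 weights: {fp16_bytes / 1e9:.2f} GB")
print(f"NVFP4 weights:     {nvfp4_bytes / 1e9:.2f} GB")
print(f"Reduction factor:  {fp16_bytes / nvfp4_bytes:.2f}x")
```

Under these assumptions the weights take about 5.2 GB in FP16/BF16 and about 1.46 GB in NVFP4. Actual VRAM use during a 60-second generation will be higher and should be measured rather than inferred from this estimate.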

How does SANA-WM differ from other AI video generators?

SANA-WM differs from typical video generators like Sora or Runway primarily in its approach to space. Standard models generate frame sequences from a text prompt, often treating video as a sequence of two-dimensional images. SANA-WM operates as a world model, meaning it maintains an internal representation of a three-dimensional environment.

The system also offers precise 6-DoF camera control: movement along the X, Y, and Z axes as well as rotation around them. Most competing solutions offer only basic camera direction control, whereas SANA-WM lets you plan an exact flight trajectory through a scene.

Another difference is the length of the generated content: SANA-WM targets minute-long, spatially coherent sequences, whereas models focused on other modalities, such as advanced voice AI or vision models, target different use cases. Finally, the model is fully open-source, in contrast to closed systems from OpenAI or Runway.
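As an illustration of what "movement along the X, Y, and Z axes as well as rotation around them" means numerically, the sketch below packs six numbers into a 4x4 rigid transform, a common way to describe a camera pose in graphics. The function names, the Y-up axis convention, and the yaw-pitch-roll rotation order are assumptions chosen for the example; the source does not say which convention SANA-WM uses.

```python
import math


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def pose_to_matrix(x, y, z, yaw_deg, pitch_deg, roll_deg):
    """Build a 4x4 transform from three translations and three rotations.

    Assumed convention: yaw about Y, pitch about X, roll about Z,
    composed as R = Ry(yaw) * Rx(pitch) * Rz(roll).
    """
    r = matmul3(rot_y(math.radians(yaw_deg)),
                matmul3(rot_x(math.radians(pitch_deg)),
                        rot_z(math.radians(roll_deg))))
    return [r[0] + [x], r[1] + [y], r[2] + [z], [0.0, 0.0, 0.0, 1.0]]


# A camera at (1, 2, 3) turned 90 degrees about the vertical axis.
matrix = pose_to_matrix(1.0, 2.0, 3.0, 90.0, 0.0, 0.0)
for row in matrix:
    print([round(v, 3) for v in row])
```

The upper-left 3x3 block holds the three rotational degrees of freedom and the last column holds the three translational ones, which together give the six degrees of freedom.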

SANA-WM stands out on the market with its open-source code, 60-second 720p video output, 6-DoF camera control, and the fact that it requires only a single GPU to operate, making it accessible to a broader range of researchers.

What does the training process for SANA-WM look like?

SANA-WM’s training dataset consists of 213,000 publicly available video clips, from which the model learns scene geometry, physics, and camera movement without relying on proprietary databases (MarkTechPost, 2026). Because NVIDIA does not use closed private data, researchers can verify and replicate the experiments, which supports reproducibility of the training process. This level of transparency is rare in the video generation industry.

The collection of 213,000 clips provides a solid foundation for learning perspective, though it does not match the volume of private corporate datasets. Nevertheless, the model achieves high generation quality, which is attributed to its spatial representation learning mechanisms. Training effectiveness is evaluated by how consistent the generated imagery remains across the full 60 seconds of footage. The open-source code allows the procedure to be verified independently.

In what applications will SANA-WM prove useful?

SANA-WM excels in scene simulations that require precise camera control and spatial consistency over 60 seconds. Possible applications include film prototyping, architecture, and testing robotic vision systems. Because the model generates 720p footage from a single input image, it gives creators considerable room to experiment. Its 6-DoF camera control lets you plan an exact flight trajectory through a virtual environment and generate coherent minute-long sequences on a single GPU (NVIDIA Threads, 2026).

Below are the main application areas for the model:

Prototyping film scenarios and shot visualization
Architectural virtual walkthroughs from design projects
Generating synthetic training data for autonomous systems
Creating interactive multimedia installations and digital art
Testing vision algorithms under controlled simulated conditions
Video game design as a tool for rapid environment pre-rendering
Film studies and directing education without the need to rent equipment

As reported by MarkTechPost, the model targets controlled scene simulation rather than just creating short, dynamic clips from a text prompt. Engineers can therefore generate long, predictable video sequences for testing image analysis software. Unlike tools focused on text modalities, SANA-WM concentrates on the visual domain.

How to install and run SANA-WM locally?

Installing SANA-WM requires downloading the 2.6-billion-parameter model weights and setting up a runtime environment with a single supported GPU. The project is available in an open repository, so you can configure and run inference yourself. You need to prepare an input image and a camera path file, and basic familiarity with the terminal is assumed. The distilled model version with NVFP4 quantization supports RTX 5090 cards, which significantly reduces hardware requirements compared to the full FP16 version (NVIDIA Threads, 2026).

The following table presents the key steps for launching the model:

Step	Description	Required Resource
1	Clone the code repository	Internet connection
2	Install dependencies and packages	Python environment
3	Download model weights	Sufficient disk space
4	Prepare the input image	Any source photo
5	Define the 6-DoF camera path	Configuration file
6	Run the generation script	GPU with sufficient VRAM
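Before working through the steps in the table, it can help to check the two resources they depend on most: free disk space for the weights and a visible NVIDIA GPU. The sketch below uses only the Python standard library and the nvidia-smi tool that ships with the NVIDIA driver. The helper names are hypothetical, and no minimum disk or VRAM threshold is applied because the source does not state one.

```python
import shutil
import subprocess


def free_disk_gb(path="."):
    """Return free disk space at the given path in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def parse_gpu_line(line):
    """Parse one 'name, 12345 MiB' line of nvidia-smi CSV output."""
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    return name, int(memory.split()[0])


def list_gpus():
    """Return (name, total memory in MiB) per GPU, or [] if none is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            continue  # skip blank lines or values such as "[N/A]"
    return gpus


print(f"Free disk space: {free_disk_gb():.1f} GB")
gpus = list_gpus()
if not gpus:
    print("No NVIDIA GPU detected via nvidia-smi")
for name, mem_mib in gpus:
    print(f"{name}: {mem_mib} MiB total memory")
```

Compare the reported values against whatever requirements the project repository documents before downloading the weights.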

Basic execution comes down to running a script in the terminal. The system reads the camera configuration and renders the sequence accordingly. The resulting video file has a resolution of 720p and a duration of up to 60 seconds. The entire process runs locally, without sending data to an external cloud, so the input image and the generated footage stay on your machine.
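The camera configuration mentioned above could be as simple as a JSON file listing one pose per keyframe. The sketch below writes such a file. Its schema (the "image", "resolution", and "keyframes" fields), the file names, and the helper functions are all hypothetical and serve only to show the kind of information a 6-DoF path carries; check the project repository for the format the generation script really reads.

```python
import json


def build_camera_path(num_keyframes, forward_step, yaw_step_deg,
                      image="input.jpg"):
    """Build a hypothetical camera path config as a plain dictionary.

    Each keyframe moves the camera forward along z and turns it by a
    fixed yaw increment; pitch and roll stay at zero.
    """
    keyframes = []
    for i in range(num_keyframes):
        keyframes.append({
            "index": i,
            "translation": [0.0, 0.0, i * forward_step],
            "rotation_deg": {"yaw": i * yaw_step_deg,
                             "pitch": 0.0, "roll": 0.0},
        })
    return {
        "image": image,               # hypothetical field name
        "resolution": [1280, 720],    # 720p frame size
        "keyframes": keyframes,
    }


def save_camera_path(config, path):
    """Write the config to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)


config = build_camera_path(num_keyframes=3, forward_step=0.5,
                           yaw_step_deg=10.0)
print(json.dumps(config, indent=2))
# save_camera_path(config, "camera_path.json")  # uncomment to write the file
```

Keeping the path in a separate file makes it easy to rerun the same trajectory on different input images or to compare the full and distilled variants on identical camera motion.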

Frequently Asked Questions

How much VRAM does SANA-WM require to generate one minute of video?

The full 2.6-billion-parameter model in FP16 format requires a GPU with high VRAM, while the distilled NVFP4 version runs on RTX 5090 cards. Plan your hardware purchase according to your chosen inference format.

Does SANA-WM generate video from a text prompt?

No. The model accepts only a single input image and a defined 6-DoF camera path; it does not support generating footage from text descriptions. Use other tools if you need to work with prompts.

How long does it take to generate a 60-second video at 720p resolution?

Generation time depends on the performance of the GPU the model runs on. The architecture is optimized for local operation, so test the timing on your own hardware for precise results.
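One simple way to run that timing test is to wrap the generation call in a wall-clock timer and report how many seconds of compute each second of video costs. In the sketch below, time_generation and fake_generate are hypothetical names, and fake_generate merely sleeps as a stand-in for the real generation call, whose interface the source does not describe.

```python
import time


def time_generation(generate, video_seconds=60.0):
    """Time one call to generate() and relate it to the video length."""
    start = time.perf_counter()
    generate()
    elapsed = time.perf_counter() - start
    return {
        "elapsed_s": elapsed,
        # Values above 1.0 mean slower than real time.
        "compute_s_per_video_s": elapsed / video_seconds,
    }


def fake_generate():
    """Placeholder workload; replace with the actual generation call."""
    time.sleep(0.05)


stats = time_generation(fake_generate, video_seconds=60.0)
print(f"Elapsed: {stats['elapsed_s']:.2f} s")
print(f"Compute per second of video: {stats['compute_s_per_video_s']:.4f} s")
```

For a fair measurement, run the generation more than once and discard the first run, since model loading and warm-up typically inflate it.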

Does the model maintain scene physics consistency throughout the full 60 seconds of footage?

Yes. Maintaining geometry and environment consistency for the full minute is the model’s stated design goal, and it was trained on 213,000 public video clips to that end. Define an accurate camera path for the best results.

Summary

NVIDIA’s SANA-WM represents a significant step forward in the development of open world models. The 2.6-billion parameter architecture proves that generating minute-long, spatially coherent 720p video on a single GPU is possible. Full 6-DoF camera control and a transparent training dataset of 213,000 clips give researchers a solid foundation for further work. The distilled NVFP4 version for RTX 5090 cards lowers the barrier to entry.

The tool excels in film prototyping, architecture, and synthetic data generation. Open-source code and the absence of cloud dependency are major advantages. Those interested in installing and training their own solutions may want to check out the guide on training your own LLM from scratch. More information about the SANA-WM model can be found in the official post on MarkTechPost and on the NVIDIA Threads profile.