NVIDIA announced Cosmos on stage at CES 2025: a foundation model, trained on 20M hours of video, that learns physical dynamics from real-world observations. The architecture supports both diffusion and autoregressive approaches, and its key innovation is the ability to predict and simulate physical interactions without explicit physics rules.
What makes this technically interesting is the scale of video processing required: the training pipeline handled 20 million hours of video and extracted 100 million distinct clips through a multi-stage filtering process. Cosmos also differs from previous video-based approaches in that its video tokenizer operates in wavelet space rather than directly on pixels, which allows more efficient compression while preserving temporal dynamics.
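To make the wavelet-space idea concrete, here is a minimal sketch of a one-level Haar transform applied along the time axis of a tiny video. It is a generic textbook illustration, not Cosmos's actual tokenizer: the function names (`haar_forward`, `temporal_haar`, and so on), the flat-list frame layout, and the toy data are all assumptions made for this example. The point it shows is that a static pixel produces zero detail coefficients, so the high band carries only motion, while the transform stays exactly invertible.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One level of the orthonormal Haar transform on an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]  # scaled pairwise sums (low band)
    detail = [(a - b) / SQRT2 for a, b in pairs]  # scaled pairwise differences (high band)
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, recovering the original sequence."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def temporal_haar(frames):
    """Apply haar_forward along time, independently for each pixel.

    frames: list of T frames (T even), each a flat list of P pixel values.
    Returns (low, high), each a list of T/2 frames of P values.
    """
    num_pixels = len(frames[0])
    half = len(frames) // 2
    per_pixel = [haar_forward([f[p] for f in frames]) for p in range(num_pixels)]
    low = [[per_pixel[p][0][t] for p in range(num_pixels)] for t in range(half)]
    high = [[per_pixel[p][1][t] for p in range(num_pixels)] for t in range(half)]
    return low, high


def inverse_temporal_haar(low, high):
    """Rebuild the original frames from the low and high temporal bands."""
    num_pixels = len(low[0])
    per_pixel = [
        haar_inverse([f[p] for f in low], [f[p] for f in high])
        for p in range(num_pixels)
    ]
    num_frames = 2 * len(low)
    return [[per_pixel[p][t] for p in range(num_pixels)] for t in range(num_frames)]


# Toy video (hypothetical data): 4 frames, 2 pixels.
# Pixel 0 is static; pixel 1 brightens steadily over time.
frames = [[5.0, 0.0], [5.0, 1.0], [5.0, 2.0], [5.0, 3.0]]

low, high = temporal_haar(frames)
reconstructed = inverse_temporal_haar(low, high)

# The static pixel has zero detail coefficients; only the changing pixel
# contributes to the high band.
print("low band: ", low)
print("high band:", high)
```

In a learned codec, sparse high-band coefficients like these are cheap to represent, which is the intuition behind transforming the input before compressing it. A production tokenizer would apply such a transform over space as well as time and follow it with a learned encoder, which this sketch omits.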
The technical approach centers on learning physics implicitly through observation rather than explicit simulation, making it particularly relevant for robotics and autonomous systems where traditional physics engines struggle with real-world complexity.