MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling


Character video synthesis aims to produce realistic videos of animatable characters within lifelike scenes. As a fundamental problem in the computer vision and graphics community, 3D-based methods typically require multi-view captures and per-case training, which severely limits their applicability to modeling arbitrary characters quickly. Recent 2D methods break this limitation via pre-trained diffusion models, but they struggle with pose generality and scene interaction. To this end, we propose MIMO, a novel generalizable model that can not only synthesize character videos with controllable attributes (i.e., character, motion, and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. The core idea is to encode the 2D video into compact spatial codes, accounting for the inherent 3D nature of video content. Concretely, we lift the 2D frame pixels into 3D using monocular depth estimators, and decompose the video clip into three spatial components (i.e., main human, underlying scene, and floating occlusion) arranged in hierarchical layers based on 3D depth. These components are further encoded into a canonical identity code, a structured motion code, and a full scene code, which serve as control signals for the synthesis process. This spatial decomposition strategy enables flexible user control, spatial motion expression, and 3D-aware synthesis for scene interactions. Experimental results demonstrate the proposed method's effectiveness and robustness.
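To make the depth-based decomposition concrete, here is a minimal sketch of splitting one frame into the three layers described above. It assumes a depth map from any off-the-shelf monocular depth estimator and a human mask from any segmenter; the depth convention, median-depth reference, and thresholding are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch only: depth-based layer decomposition of a single frame.
# Assumptions (not from the paper's code): `depth` comes from a monocular
# depth estimator, `human_mask` from a human segmenter; larger depth values
# are assumed to mean "closer to the camera".
import numpy as np

def decompose_layers(frame: np.ndarray, depth: np.ndarray,
                     human_mask: np.ndarray):
    """Split one RGB frame into (human, scene, occlusion) layers by depth.

    frame:      (H, W, 3) uint8 image
    depth:      (H, W) float depth map
    human_mask: (H, W) bool mask of the main human
    """
    # Median depth of the main human serves as the reference plane.
    human_depth = np.median(depth[human_mask])

    # Pixels clearly in front of the human (and not the human itself)
    # are treated as floating occlusion; everything else is the scene.
    occlusion_mask = (depth > human_depth) & ~human_mask
    scene_mask = ~human_mask & ~occlusion_mask

    def masked(mask: np.ndarray) -> np.ndarray:
        layer = frame.copy()
        layer[~mask] = 0
        return layer

    return masked(human_mask), masked(scene_mask), masked(occlusion_mask)
```

Applied per frame, this yields the hierarchical layers that are then encoded into the identity, motion, and scene codes.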

Users can supply multiple inputs (e.g., a single image for the character, a pose sequence for the motion, and a single video or image for the scene) to specify the desired attributes individually, or provide a direct driving video instead. The proposed model embeds the target attributes into the latent space to construct target codes, and encodes the driving video into spatial codes via spatial-aware decomposition, enabling intuitive attribute control of the synthesis by freely composing the latent codes in a specific order.
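The following sketch shows what composing the latent codes in a fixed order might look like. The encoder interfaces (`encode_identity`, `encode_motion`, `encode_scene`) are hypothetical stand-ins passed in as arguments, and the tensor shapes are placeholders; the paper does not publish this API.

```python
# Illustrative composition of control codes under assumed encoder interfaces.
import torch

def build_condition(ref_image, pose_seq, scene_video,
                    encode_identity, encode_motion, encode_scene):
    id_code = encode_identity(ref_image)    # (1, C_id) canonical identity code
    motion_code = encode_motion(pose_seq)   # (T, C_mo) structured motion code
    scene_code = encode_scene(scene_video)  # (T, C_sc) full scene code

    # Broadcast the per-clip identity code across time, then fuse the three
    # codes in a fixed order so each attribute occupies a known slot in the
    # conditioning signal fed to the synthesis model.
    id_code = id_code.expand(motion_code.shape[0], -1)
    return torch.cat([id_code, motion_code, scene_code], dim=-1)
```

Swapping any one input (e.g., a new reference image for the character) replaces only its slot in the fused code, which is what makes the attribute control intuitive.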
