Seer: Language Instructed Video Prediction with Latent Diffusion Models
1 Shanghai Qi Zhi Institute | 2 IIIS, Tsinghua University | 3 Shanghai Artificial Intelligence Laboratory | 4 NVIDIA |
Abstract
Imagining future trajectories is key for robots to make sound plans and successfully reach their goals. Text-conditioned video prediction (TVP) is therefore an essential task for facilitating general robot policy learning, i.e., predicting future video frames given a language instruction and reference frames. Grounding task-level goals specified by instructions in high-fidelity video frames is highly challenging, typically requiring large-scale data and computation. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named Seer, built by inflating pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We inflate the denoising U-Net and the language conditioning model with two novel techniques, Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer, to propagate the rich prior knowledge of the pretrained T2I models across frames. With this carefully designed architecture, Seer generates high-fidelity, coherent, and instruction-aligned video frames by fine-tuning only a few layers on a small amount of data. Experimental results on the Something-Something V2 (SSv2) and BridgeData datasets demonstrate superior video prediction performance with around 210 hours of training on 4 RTX 3090 GPUs: decreasing the FVD of the current SOTA model from 290 to 200 on SSv2 and achieving at least 70% preference in human evaluation.
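The Autoregressive Spatial-Temporal Attention mentioned above can be pictured as a causal self-attention layer over the frame axis, added alongside the frozen spatial attention blocks of the inflated 2D U-Net. The PyTorch sketch below is illustrative only: the module name, tensor layout, and single-layer residual design are assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn as nn

class AutoregressiveTemporalAttention(nn.Module):
    """Hypothetical sketch: causal self-attention along the frame axis,
    inserted next to a frozen spatial attention block of the inflated U-Net."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- spatial tokens of each frame's latent
        b, f, t, d = x.shape
        # fold spatial tokens into the batch so attention runs along the frame axis
        h = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        # causal mask: frame i may only attend to frames <= i (autoregressive)
        mask = torch.triu(torch.full((f, f), float("-inf"), device=x.device), diagonal=1)
        q = self.norm(h)
        out, _ = self.attn(q, q, q, attn_mask=mask)
        h = h + out  # residual connection keeps the pretrained spatial features intact
        return h.reshape(b, t, f, d).permute(0, 2, 1, 3)
```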
Method
Seer's pipeline includes an Inflated 3D U-Net for diffusion and a Frame Sequential Text Transformer (FSeq Text Transformer) for text conditioning. During training, all video frames are compressed into latent space with a pre-trained VAE encoder. Conditional latent vectors, sampled from the reference video frames (ref.), are concatenated with noisy latent vectors along the frame axis to form the input latent. During inference, the conditional latent vectors are concatenated with Gaussian noise vectors, text conditioning is injected for each frame by the FSeq Text Transformer (e.g., the global instruction embedding "Moving remote and small remote away from each other." is decomposed into 12 per-frame sub-instructions along the frame axis), and the denoised outputs are decoded back to RGB video frames with the pre-trained VAE decoder.
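To make the inference flow concrete, the sketch below walks through the sampling loop just described. All interfaces (vae, unet3d, fseq_text, scheduler) are hypothetical stand-ins for the VAE, the Inflated 3D U-Net, the FSeq Text Transformer, and a standard diffusion noise scheduler; re-imposing the reference latents at every denoising step is likewise an assumption about how the observed frames are kept fixed.

```python
import torch

@torch.no_grad()
def predict_video(ref_frames, instruction, vae, unet3d, fseq_text, scheduler,
                  total_frames=12):
    """Minimal inference sketch with hypothetical interfaces: encode reference
    frames, denoise the future frames under per-frame text conditions,
    and decode the result back to RGB."""
    # encode reference frames to conditional latents: (B, F_ref, C, h, w)
    cond = vae.encode(ref_frames)
    b, f_ref, c, h, w = cond.shape

    # decompose the global instruction into one sub-instruction embedding per frame
    text_embs = fseq_text(instruction, num_frames=total_frames)  # (B, F, L, D)

    # start the future frames from Gaussian noise, concatenated along the frame axis
    noise = torch.randn(b, total_frames - f_ref, c, h, w, device=cond.device)
    latents = torch.cat([cond, noise], dim=1)

    # iterative denoising with per-frame text conditioning
    for t in scheduler.timesteps:
        eps = unet3d(latents, t, encoder_hidden_states=text_embs)
        latents = scheduler.step(eps, t, latents).prev_sample
        latents[:, :f_ref] = cond  # assumption: observed frames stay fixed during sampling

    return vae.decode(latents)  # RGB video frames: (B, F, 3, H, W)
```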
Results
Text-conditioned Video Prediction/Manipulation (Something-Something V2 Dataset)
Text-conditioned Video Prediction/Manipulation (BridgeData)
Text-conditioned Video Prediction (Epic-Kitchens)
Bibtex