RAIN: Real-time Animation Of Infinite Video Stream

1University of Science and Technology of China, 2Tongyi Lab

RAIN performs real-time animation on consumer-grade devices.

Abstract

Live animation has gained immense popularity for enhancing online engagement, yet achieving high-quality, real-time, and stable animation with diffusion models remains challenging, especially on consumer-grade GPUs. Existing methods struggle to generate long, consistent video streams efficiently, often suffering from high latency and degraded visual quality over extended periods. In this paper, we introduce RAIN, a pipeline solution capable of animating infinite video streams in real time with low latency on a single RTX 4090 GPU. The core idea of RAIN is to efficiently compute frame-token attention across different noise levels and long time intervals while simultaneously denoising a significantly larger number of frame tokens than previous stream-based methods. This design allows RAIN to generate video frames with much lower latency and at higher speed, while maintaining long-range attention over extended video streams, resulting in enhanced continuity and consistency. Consequently, a Stable Diffusion model fine-tuned with RAIN for just a few epochs can produce video streams of unbounded length in real time with low latency, without significant compromise in quality or consistency. Despite these capabilities, RAIN introduces only a few additional 1D attention blocks, imposing minimal overhead. Experiments on benchmark datasets and on super-long video generation demonstrate that RAIN animates characters in real time with better quality, accuracy, and consistency than competing methods, at lower latency. All code and models will be made publicly available.
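To make the "1D attention blocks" mentioned above concrete, the PyTorch sketch below applies multi-head attention along the frame axis independently at each spatial location. This is a generic illustration under our own assumptions (the module name, head count, and residual layout are ours), not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalAttention1D(nn.Module):
    """Minimal sketch of a 1D attention block over the time axis.

    Each spatial location attends across frames independently, so the
    block adds temporal mixing at a small cost on top of a 2D UNet.
    """
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Fold the spatial axes into the batch so attention mixes frames only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed, need_weights=False)
        tokens = tokens + out  # residual connection
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
```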

Framework

RAIN adopts a pipeline-like design for inference on streaming video. The latent buffer holds frames at stepped noise levels: at every step, one denoising pass is applied jointly to a group of frames, the fully denoised frames at the front of the buffer are emitted, and fresh noise is appended at the back. This rolling scheme lets RAIN generate video of unbounded length, as sketched below.
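The following is a minimal sketch of our reading of this stepped-noise buffer; the names, shapes, step counts, and the `denoise_step` callable are assumptions for illustration, not the actual implementation.

```python
import torch

NUM_LEVELS = 4           # denoising steps per frame, e.g. a 4-step LCM schedule
GROUP = 2                # frames emitted (and injected as noise) per tick
C, H, W = 4, 64, 64      # SD latent shape for a 512x512 frame

class StreamBuffer:
    """Rolling latent buffer whose frames sit at stepped noise levels."""
    def __init__(self):
        f = NUM_LEVELS * GROUP
        self.latents = torch.randn(f, C, H, W)   # newest frames are pure noise
        self.levels = torch.arange(f) // GROUP   # 0 = nearly clean ... NUM_LEVELS-1 = pure noise

    def tick(self, denoise_step, pose_features):
        # One joint denoising step over the whole buffer; in RAIN, attention
        # spans frame tokens at *different* noise levels here.
        self.latents = denoise_step(self.latents, self.levels, pose_features)
        finished = self.latents[:GROUP]          # oldest frames are now clean
        # Shift the window and append fresh noise for the newest frames.
        self.latents = torch.cat(
            [self.latents[GROUP:], torch.randn(GROUP, C, H, W)], dim=0)
        return finished                          # decode with the VAE for display
```

Each tick thus emits GROUP frames while keeping NUM_LEVELS x GROUP frames in flight, which is what ties the number of denoising steps to the end-to-end latency.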

To speed up inference, RAIN combines several acceleration techniques. We apply LCM distillation to the UNet and use TAESDV as the VAE decoder. With TensorRT acceleration, RAIN typically runs at 18 fps with ~1.5 s latency on a single RTX 4090 at 512x512 resolution, using DWPose as the feature extractor.
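These reported numbers imply the following rough per-frame budget and pipeline depth (our back-of-the-envelope arithmetic, not figures from the paper):

```python
fps = 18
per_frame_ms = 1000 / fps            # ~55.6 ms of compute budget per frame
latency_s = 1.5
frames_in_flight = latency_s * fps   # ~27 frames traversing the stepped-noise buffer
print(f"{per_frame_ms:.1f} ms/frame, ~{frames_in_flight:.0f} frames in flight")
```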

Wholebody Animation

Test examples from the UBC-Fashion dataset; the model is trained on only 500 video clips from the training set.

Cross Domain Face Morphing

Face morphing examples: the expressions and head poses of a real face are mapped onto anime faces.

Future

RAIN offers one possible path toward real-time AI-rendered animation. We look forward to a future that combines AI with CG for rendering games, live streams, and virtual realities, where the generalization ability of AI can render countless new scenes and objects while enabling more interactive ways to participate in synthesized worlds.