CVPR 2026

Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

Zhengjian Yao1,*, Yongzhi Li2,†, Xinyuan Gao2, Quan Chen2,‡, Peng Jiang2, Yanye Lu1,‡
1Peking University   2Kuaishou Technology
*Work done during an internship at Kuaishou Technology  †Project Leader  ‡Corresponding Authors

Narrative Weaver generates long-range, visually consistent narratives from multi-modal inputs — spanning controllable multi-scene generation, autonomous storytelling, and e-commerce advertising storyboards.

Abstract

We present Narrative Weaver, a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences—a critical limitation for real-world applications such as filmmaking and e-commerce advertising.

Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data.

Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD)—the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.

Method

Narrative Weaver is a hybrid Autoregressive (AR) + Diffusion framework. The AR part — an MLLM acting as a "director" — takes initial visual and textual context, plans future narrative logic in textual form, and condenses historical multi-modal information into a compact set of learnable queries. Fine-grained VAE-encoded features from condition images are fused with these queries and passed to the diffusion decoder. A dynamic Memory Bank anchors each generative step to the initial visual condition and prior frames, mitigating visual drift over long sequences.
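As a rough illustration of the Memory Bank idea described above, the following Python sketch keeps the initial visual condition as a permanent anchor alongside a bounded window of recent frames, and conditions each new step on both. Everything here is an assumption made for illustration: the names `MemoryBank`, `fuse`, and `generate_sequence`, the `max_recent` window size, the list-of-floats features, and the simple averaging that stands in for the MLLM queries, VAE features, and diffusion decoder. It is not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Sequence

Feature = List[float]  # toy stand-in for a query embedding or VAE latent


@dataclass
class MemoryBank:
    """Hypothetical memory: a permanent anchor plus a sliding window of recent frames."""
    anchor: Feature
    max_recent: int = 2
    recent: deque = field(default_factory=deque)

    def __post_init__(self) -> None:
        # Bound the window so memory does not grow with sequence length.
        self.recent = deque(self.recent, maxlen=self.max_recent)

    def context(self) -> List[Feature]:
        # The anchor always comes first, so every step sees the initial condition.
        return [self.anchor, *self.recent]

    def update(self, frame: Feature) -> None:
        # Appending to a full deque evicts the oldest recent frame.
        self.recent.append(frame)


def fuse(queries: Feature, memory: Sequence[Feature]) -> Feature:
    """Toy fusion: element-wise mean of the step queries and all memory entries."""
    rows = [queries, *memory]
    return [sum(column) / len(rows) for column in zip(*rows)]


def generate_sequence(anchor: Feature, step_queries: Sequence[Feature],
                      max_recent: int = 2) -> List[Feature]:
    """Generate one toy 'frame' per planned step, conditioning on the memory bank."""
    bank = MemoryBank(anchor=anchor, max_recent=max_recent)
    frames: List[Feature] = []
    for queries in step_queries:
        frame = fuse(queries, bank.context())  # placeholder for the diffusion decoder
        frames.append(frame)
        bank.update(frame)
    return frames


if __name__ == "__main__":
    for frame in generate_sequence([1.0, 1.0], [[0.0, 0.0]] * 3):
        print(frame)
```

Because the anchor is never evicted, every step is conditioned on the initial visual condition no matter how many frames precede it. How older frames are evicted is a choice this sketch makes arbitrarily.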

Narrative Weaver framework overview

EAVSD Dataset

We release the E-commerce Advertising Video Storyboard Dataset (EAVSD), the first dataset providing (text, image) → (text, {image_i}_{i=1..N}) triplets aligned with multi-scene storyboards. EAVSD contains over 330K high-quality images with rich narrative annotations, curated for e-commerce marketing where strict brand-level visual consistency is a commercial necessity.
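To make the (text, image) → (text, {image_i}) mapping concrete, one sample could be represented roughly as follows. This is a minimal sketch under stated assumptions: the class name `StoryboardSample`, all field names, the use of file paths for images, the example file names, and the requirement that N is at least 1 are hypothetical and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoryboardSample:
    """Hypothetical record for one (text, image) -> (text, {image_i}) sample."""
    # Input side: a textual instruction and one conditioning image.
    prompt_text: str
    condition_image: str  # path to the conditioning image (assumed representation)
    # Output side: narrative text and N storyboard keyframes.
    narrative_text: str
    keyframe_images: List[str]

    def __post_init__(self) -> None:
        # Assumption: a storyboard contains at least one keyframe (N >= 1).
        if not self.keyframe_images:
            raise ValueError("a storyboard needs at least one keyframe")

    @property
    def num_scenes(self) -> int:
        """Number of keyframes N in this storyboard."""
        return len(self.keyframe_images)


if __name__ == "__main__":
    sample = StoryboardSample(
        prompt_text="Promote a reusable water bottle for commuters.",
        condition_image="product.jpg",
        narrative_text="Scene 1: morning rush. Scene 2: refill. Scene 3: desk.",
        keyframe_images=["scene_1.jpg", "scene_2.jpg", "scene_3.jpg"],
    )
    print(sample.num_scenes)
```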

EAVSD dataset overview

Task 1 · Controllable Multi-Scene Generation

Given multi-modal conditions, Narrative Weaver generates multiple keyframes that remain strictly consistent in identity, style, and scene context across the whole sequence.

Multi-scene consistency comparison

Task 2 · Autonomous Storyboard Planning

Narrative Weaver autonomously decomposes a high-level user instruction into a coherent storyboard, generating both the narrative text and visually consistent keyframes end-to-end.

Autonomous storyboard planning results

Task 3 · E-commerce Advertising Storyboards

From a product image, a description, and a marketing goal, Narrative Weaver produces a brand-consistent advertising storyboard, demonstrating real-world potential for AI-driven content creation.

E-commerce advertising storyboard results

BibTeX

@article{yao2026narrativeweaver,
  title   = {Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning},
  author  = {Yao, Zhengjian and Li, Yongzhi and Gao, Xinyuan and Chen, Quan and Jiang, Peng and Lu, Yanye},
  journal = {arXiv preprint arXiv:2603.06688},
  year    = {2026}
}