🎬 RAPO++ Text-to-Video Prompt Optimization

This demo showcases Stage 1 (RAPO): Retrieval-Augmented Prompt Optimization using knowledge graphs.

How it works:

  1. Enter a simple text-to-video prompt
  2. The system retrieves contextually relevant modifiers from a knowledge graph
  3. Your prompt is enhanced with specific actions and atmospheric details
  4. Use the optimized prompt for better T2V generation results!
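As a toy illustration of steps 2-3 above, here is a minimal retrieval-and-append sketch. The graph below is a tiny stand-in for the paper's relation graph, and all data, function names, and parameters are illustrative, not the actual RAPO implementation:

```python
# Toy relation graph: subject phrase -> (action modifiers, atmosphere modifiers)
RELATION_GRAPH = {
    "person walking": (
        ["strolling down a tree-lined avenue", "striding with purpose"],
        ["at golden hour", "in light rain"],
    ),
    "car driving": (
        ["weaving through traffic", "cruising along a coastal road"],
        ["under neon city lights", "at dawn"],
    ),
}

def retrieve_modifiers(prompt: str, places: int = 2, per_place: int = 1):
    """Return up to `places` modifier groups, `per_place` modifiers each."""
    for key, groups in RELATION_GRAPH.items():
        if key in prompt.lower():
            return [group[:per_place] for group in groups[:places]]
    return []  # nothing relevant found in the graph

def optimize_prompt(prompt: str, places: int = 2, per_place: int = 1) -> str:
    """Append retrieved modifiers to the original prompt."""
    mods = [m for group in retrieve_modifiers(prompt, places, per_place)
            for m in group]
    return ", ".join([prompt] + mods) if mods else prompt

print(optimize_prompt("A person walking"))
# -> "A person walking, strolling down a tree-lined avenue, at golden hour"
```

The `places` and `per_place` arguments mirror the two sliders in the demo's Input panel: how many graph neighborhoods to query, and how many modifiers to take from each.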

Example prompts to try:

  • "A person walking"
  • "A car driving"
  • "Someone cooking"
  • "A group of people talking"

Based on the paper: RAPO++ (arXiv:2510.20206)

Input

  • Original Prompt
  • Number of Places to Retrieve (1-5)
  • Modifiers per Place (1-10)

Results

Examples

About RAPO++

RAPO++ is a three-stage framework for text-to-video generation prompt optimization:

  • Stage 1 (RAPO): Retrieval-Augmented Prompt Optimization using relation graphs (demonstrated here)
  • Stage 2 (SSPO): Self-Supervised Prompt Optimization with test-time iterative refinement
  • Stage 3: LLM fine-tuning on collected feedback data

The system is model-agnostic and works with various T2V models (Wan2.1, Open-Sora-Plan, HunyuanVideo, etc.).
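The three stages described above can be pictured as a pipeline: Stages 1 and 2 run at inference time, while Stage 3 happens offline. The sketch below shows only that composition; every function body is a toy placeholder (a fixed enrichment, a length-based score), not the paper's actual method:

```python
def stage1_rapo(prompt: str) -> str:
    """Stage 1: retrieval-augmented enrichment (toy: append a fixed detail)."""
    return prompt + ", with detailed motion and atmosphere"

def toy_score(prompt: str) -> int:
    """Stand-in for video-based feedback; the real system scores generated video."""
    return len(prompt)

def stage2_sspo(prompt: str, rounds: int = 2) -> str:
    """Stage 2: test-time iterative refinement, keeping the best-scoring candidate."""
    best = prompt
    for suffix in [", smooth camera pan", ", high frame rate"][:rounds]:
        candidate = best + suffix
        if toy_score(candidate) > toy_score(best):
            best = candidate
    return best

def optimize(prompt: str) -> str:
    """Compose Stages 1 and 2; Stage 3 would fine-tune the rewriting LLM offline."""
    return stage2_sspo(stage1_rapo(prompt))

print(optimize("A person walking"))
```

In the real framework, the feedback signal comes from evaluating generated videos rather than from a string heuristic, and the Stage 3 fine-tuning data is collected from these refinement loops.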

Papers:

  • RAPO (CVPR 2025): The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
  • RAPO++ (arXiv:2510.20206): Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Project Page: https://whynothaha.github.io/RAPO_plus_github/

GitHub: https://github.com/Vchitect/RAPO