🎬 RAPO++ Text-to-Video Prompt Optimization

This demo showcases Stage 1 (RAPO): Retrieval-Augmented Prompt Optimization using knowledge graphs.

How it works:

  1. Enter a simple text-to-video prompt
  2. The system retrieves contextually relevant modifiers from a knowledge graph
  3. Your prompt is enhanced with specific actions and atmospheric details
  4. Use the optimized prompt for better T2V generation results!
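As a toy illustration of steps 2-3 above, here is a minimal retrieval-and-append sketch. The graph below is a tiny stand-in for the paper's relation graph, and all data, function names, and parameters are illustrative, not the actual RAPO implementation:

```python
# Toy relation graph: subject phrase -> (action modifiers, atmosphere modifiers)
RELATION_GRAPH = {
    "person walking": (
        ["strolling down a tree-lined avenue", "striding with purpose"],
        ["at golden hour", "in light rain"],
    ),
    "car driving": (
        ["weaving through traffic", "cruising along a coastal road"],
        ["under neon city lights", "at dawn"],
    ),
}

def retrieve_modifiers(prompt: str, places: int = 2, per_place: int = 1):
    """Return up to `places` modifier groups, `per_place` modifiers each."""
    for key, groups in RELATION_GRAPH.items():
        if key in prompt.lower():
            return [group[:per_place] for group in groups[:places]]
    return []  # nothing relevant found in the graph

def optimize_prompt(prompt: str, places: int = 2, per_place: int = 1) -> str:
    """Append retrieved modifiers to the original prompt."""
    mods = [m for group in retrieve_modifiers(prompt, places, per_place)
            for m in group]
    return ", ".join([prompt] + mods) if mods else prompt

print(optimize_prompt("A person walking"))
# -> "A person walking, strolling down a tree-lined avenue, at golden hour"
```

The `places` and `per_place` arguments mirror the two sliders in the demo's Input panel: how many graph neighborhoods to query, and how many modifiers to take from each.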

Example prompts to try:

  • "A person walking"
  • "A car driving"
  • "Someone cooking"
  • "A group of people talking"

Based on the paper: RAPO++ (arXiv:2510.20206)

Input

  • Original Prompt
  • Number of Places to Retrieve (1-5)
  • Modifiers per Place (1-10)

Results

Examples

About RAPO++

RAPO++ is a three-stage framework for text-to-video generation prompt optimization:

  • Stage 1 (RAPO): Retrieval-Augmented Prompt Optimization using relation graphs (demonstrated here)
  • Stage 2 (SSPO): Self-Supervised Prompt Optimization with test-time iterative refinement
  • Stage 3: LLM fine-tuning on collected feedback data

The system is model-agnostic and works with various T2V models (Wan2.1, Open-Sora-Plan, HunyuanVideo, etc.).
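The three stages described above can be pictured as a pipeline: Stages 1 and 2 run at inference time, while Stage 3 happens offline. The sketch below shows only that composition; every function body is a toy placeholder (a fixed enrichment, a length-based score), not the paper's actual method:

```python
def stage1_rapo(prompt: str) -> str:
    """Stage 1: retrieval-augmented enrichment (toy: append a fixed detail)."""
    return prompt + ", with detailed motion and atmosphere"

def toy_score(prompt: str) -> int:
    """Stand-in for video-based feedback; the real system scores generated video."""
    return len(prompt)

def stage2_sspo(prompt: str, rounds: int = 2) -> str:
    """Stage 2: test-time iterative refinement, keeping the best-scoring candidate."""
    best = prompt
    for suffix in [", smooth camera pan", ", high frame rate"][:rounds]:
        candidate = best + suffix
        if toy_score(candidate) > toy_score(best):
            best = candidate
    return best

def optimize(prompt: str) -> str:
    """Compose Stages 1 and 2; Stage 3 would fine-tune the rewriting LLM offline."""
    return stage2_sspo(stage1_rapo(prompt))

print(optimize("A person walking"))
```

In the real framework, the feedback signal comes from evaluating generated videos rather than from a string heuristic, and the Stage 3 fine-tuning data is collected from these refinement loops.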

Papers:

  • RAPO (CVPR 2025): The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
  • RAPO++ (arXiv:2510.20206): Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Project Page: https://whynothaha.github.io/RAPO_plus_github/

GitHub: https://github.com/Vchitect/RAPO