---
license: apache-2.0
tags:
- video-generation
- video-editing
- in-context-learning
- pytorch
pipeline_tag: video-to-video
library_name: transformers
authors:
- XiangpengYang
- horizonwind2004
---

Unified Video Editing with Temporal Reasoner

👁️ See → 🧠 Reason → ✏️ Edit

🚀 A Chain-of-Frames editing method that enables temporal reasoning and 4× video-length generalization with just 50k training pairs!

Xiangpeng Yang<sup>1</sup>, Ji Xie<sup>2</sup>, Yiyuan Yang<sup>1</sup>, Yan Huang<sup>1</sup>, Min Xu<sup>1</sup>, Qiang Wu<sup>1</sup>

<sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University

# VideoCoF: Unified Video Editing with Temporal Reasoner

**VideoCoF** is a unified video editing model that bridges the gap between expert models (precise but restricted) and unified in-context models (flexible but spatially inaccurate). By introducing a **"See → Reason → Edit"** Chain-of-Frames paradigm, VideoCoF predicts reasoning tokens before generating the target video tokens, removing the need for user-provided masks while achieving precise instruction-to-region alignment.
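To make the token ordering concrete, here is a minimal, illustrative Python sketch of the "See → Reason → Edit" sequence layout. It is **not** the repository's actual implementation: `TOKENS_PER_FRAME` and `build_sequence` are hypothetical names, and only `source_frames`/`reasoning_frames` mirror the real inference flags shown in the Quick Start below.

```python
# Conceptual sketch of the Chain-of-Frames sequence layout (illustrative only;
# the real logic lives in the official inference.py of the GitHub repo).
import torch

TOKENS_PER_FRAME = 1560  # hypothetical number of latent tokens per frame


def build_sequence(source_latents: torch.Tensor,
                   source_frames: int = 33,
                   reasoning_frames: int = 4) -> torch.Tensor:
    """See: condition on source-video tokens; Reason: predict reasoning tokens
    (where/how to edit); Edit: predict the target video tokens."""
    dim = source_latents.shape[-1]
    # "See": clean source tokens used as the in-context condition.
    source = source_latents
    # "Reason": reasoning tokens, denoised before the target tokens.
    reasoning = torch.randn(reasoning_frames * TOKENS_PER_FRAME, dim)
    # "Edit": target video tokens, generated after the reasoning tokens.
    target = torch.randn(source_frames * TOKENS_PER_FRAME, dim)
    return torch.cat([source, reasoning, target], dim=0)
```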
🎬 A full video demo is available on YouTube (see the project page for the link).
## 🌟 Key Capabilities

![](assets/motivation_v2.gif)

1. **Temporal Reasoning**: Adopts a unique approach where the model first identifies *where* and *how* to edit (reasoning) before predicting the target video tokens.
2. **Data Efficiency**: Achieves SOTA performance with only **50k training pairs** (33 frames each).
3. **Length Extrapolation**: Demonstrates robust multi-shot editing and generalizes to videos **4× longer** than the training samples.
4. **Versatile Editing**: Supports:
   * Object Removal
   * Object Addition
   * Object Swap
   * Local Style Transfer

## 🔧 Quick Start

To use these weights, please refer to the official [GitHub Repository](https://github.com/knightyxp/VideoCoF) for inference code and environment setup.

### Installation

```bash
git clone https://github.com/knightyxp/VideoCoF
cd VideoCoF

# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (choose the version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For Hopper GPUs (e.g., H100/H800) where fast inference is needed:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install other dependencies
pip install -r requirements.txt
```

**Note on Flash Attention:** We recommend **FlashAttention-3** (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. If you are using these GPUs, please follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing a compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).

### Download Models

* **Wan2.1-T2V-14B pretrained weights:**

  ```bash
  git lfs install
  git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
  # Or using huggingface-cli:
  # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
  ```

* **VideoCoF checkpoint** (a quick sanity check for this file is sketched below, after the License section):

  ```bash
  git lfs install
  git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
  # Or using huggingface-cli:
  # hf download XiangpengYang/VideoCoF --local-dir videocof_weight
  ```

### Inference

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 inference.py \
  --video_path assets/two_man.mp4 \
  --prompt "Remove the young man with short black hair wearing a black shirt on the left." \
  --output_dir results/obj_rem \
  --model_name Wan2.1-T2V-14B \
  --seed 0 \
  --num_frames 33 \
  --source_frames 33 \
  --reasoning_frames 4 \
  --repeat_rope \
  --videocof_path videocof_weight/videocof.safetensors
```

For parallel inference:

```bash
sh scripts/parallel_infer.sh
```

## 🙏 Acknowledgments

We thank the authors of related works and the open-source community, including [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1), for their contributions.

## 📜 License

This project is licensed under the [Apache License 2.0](LICENSE).
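As referenced in the Download Models step above, the snippet below is a minimal, unofficial sketch for sanity-checking the downloaded checkpoint before running inference. It uses the standard `safetensors` API and assumes the `videocof_weight/videocof.safetensors` path from the download commands; it is not part of the official repository.

```python
# Optional sanity check: open the VideoCoF checkpoint and list a few tensors
# without loading the whole file into memory. Assumes the download paths above.
from safetensors import safe_open

ckpt_path = "videocof_weight/videocof.safetensors"

with safe_open(ckpt_path, framework="pt", device="cpu") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors in checkpoint")
    # Peek at the first few entries and their shapes
    for name in keys[:5]:
        print(name, tuple(f.get_tensor(name).shape))
```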
## 📮 Contact

For any questions, please feel free to reach out to the author Xiangpeng Yang [@knightyxp](https://github.com/knightyxp), email: knightyxp@gmail.com / Xiangpeng.Yang@student.uts.edu.au

## 📄 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```
❤️ **If you find this project helpful, please consider giving it a like!** ❤️