---
license: apache-2.0
tags:
- video-generation
- video-editing
- in-context-learning
- pytorch
pipeline_tag: video-to-video
library_name: transformers
authors:
- XiangpengYang
- horizonwind2004
---

Unified Video Editing with Temporal Reasoner

👁️ See → 🧠 Reason → ✏️ Edit

🚀 A Chain-of-Frames editing method that enables temporal reasoning and 4× video-length generalization with just 50k training pairs!

Xiangpeng Yang<sup>1</sup>, Ji Xie<sup>2</sup>, Yiyuan Yang<sup>1</sup>, Yan Huang<sup>1</sup>, Min Xu<sup>1</sup>, Qiang Wu<sup>1</sup>

<sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University

# VideoCoF: Unified Video Editing with Temporal Reasoner

**VideoCoF** is a unified video editing model that bridges the gap between expert models (precise but restricted) and unified in-context models (flexible but spatially inaccurate). By introducing a **"See → Reason → Edit"** Chain-of-Frames paradigm, VideoCoF predicts reasoning tokens before generating the target video tokens, removing the need for user-provided masks while achieving precise instruction-to-region alignment.
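To make the token ordering concrete, here is a minimal, illustrative Python sketch of the "See → Reason → Edit" sequence layout. It is **not** the repository's actual implementation: `TOKENS_PER_FRAME` and `build_sequence` are hypothetical names, and only `source_frames`/`reasoning_frames` mirror the real inference flags shown in the Quick Start below.

```python
# Conceptual sketch of the Chain-of-Frames sequence layout (illustrative only;
# the real logic lives in the official inference.py of the GitHub repo).
import torch

TOKENS_PER_FRAME = 1560  # hypothetical number of latent tokens per frame


def build_sequence(source_latents: torch.Tensor,
                   source_frames: int = 33,
                   reasoning_frames: int = 4) -> torch.Tensor:
    """See: condition on source-video tokens; Reason: predict reasoning tokens
    (where/how to edit); Edit: predict the target video tokens."""
    dim = source_latents.shape[-1]
    # "See": clean source tokens used as the in-context condition.
    source = source_latents
    # "Reason": reasoning tokens, denoised before the target tokens.
    reasoning = torch.randn(reasoning_frames * TOKENS_PER_FRAME, dim)
    # "Edit": target video tokens, generated after the reasoning tokens.
    target = torch.randn(source_frames * TOKENS_PER_FRAME, dim)
    return torch.cat([source, reasoning, target], dim=0)
```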
🎬 A full video demo is available on YouTube (see the project page for the link).
## 🌟 Key Capabilities

![](assets/motivation_v2.gif)

1. **Temporal Reasoning**: Adopts a unique approach where the model first identifies *where* and *how* to edit (reasoning) before predicting the target video tokens.
2. **Data Efficiency**: Achieves SOTA performance with only **50k training pairs** (33 frames each).
3. **Length Extrapolation**: Demonstrates robust multi-shot editing and generalizes to videos **4× longer** than the training samples.
4. **Versatile Editing**: Supports:
   * Object Removal
   * Object Addition
   * Object Swap
   * Local Style Transfer

## 🔧 Quick Start

To use these weights, please refer to the official [GitHub Repository](https://github.com/knightyxp/VideoCoF) for inference code and environment setup.

### Installation

```bash
git clone https://github.com/knightyxp/VideoCoF
cd VideoCoF

# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (choose the version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For Hopper GPUs (e.g., H100/H800) where fast inference is needed:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install other dependencies
pip install -r requirements.txt
```

**Note on Flash Attention:** We recommend **FlashAttention-3** (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. If you are using these GPUs, please follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing a compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).

### Download Models

* **Wan2.1-T2V-14B pretrained weights:**

  ```bash
  git lfs install
  git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
  # Or using huggingface-cli:
  # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
  ```

* **VideoCoF checkpoint** (a quick sanity check for this file is sketched below, after the License section):

  ```bash
  git lfs install
  git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
  # Or using huggingface-cli:
  # hf download XiangpengYang/VideoCoF --local-dir videocof_weight
  ```

### Inference

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 inference.py \
  --video_path assets/two_man.mp4 \
  --prompt "Remove the young man with short black hair wearing a black shirt on the left." \
  --output_dir results/obj_rem \
  --model_name Wan2.1-T2V-14B \
  --seed 0 \
  --num_frames 33 \
  --source_frames 33 \
  --reasoning_frames 4 \
  --repeat_rope \
  --videocof_path videocof_weight/videocof.safetensors
```

For parallel inference:

```bash
sh scripts/parallel_infer.sh
```

## 🙏 Acknowledgments

We thank the authors of related works and the open-source community, including [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1), for their contributions.

## 📜 License

This project is licensed under the [Apache License 2.0](LICENSE).
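As referenced in the Download Models step above, the snippet below is a minimal, unofficial sketch for sanity-checking the downloaded checkpoint before running inference. It uses the standard `safetensors` API and assumes the `videocof_weight/videocof.safetensors` path from the download commands; it is not part of the official repository.

```python
# Optional sanity check: open the VideoCoF checkpoint and list a few tensors
# without loading the whole file into memory. Assumes the download paths above.
from safetensors import safe_open

ckpt_path = "videocof_weight/videocof.safetensors"

with safe_open(ckpt_path, framework="pt", device="cpu") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors in checkpoint")
    # Peek at the first few entries and their shapes
    for name in keys[:5]:
        print(name, tuple(f.get_tensor(name).shape))
```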
## 📮 Contact

For any questions, please feel free to reach out to the author Xiangpeng Yang [@knightyxp](https://github.com/knightyxp), email: knightyxp@gmail.com / Xiangpeng.Yang@student.uts.edu.au

## 📄 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```
❤️ **If you find this project helpful, please consider giving it a like!** ❤️