---
license: apache-2.0
tags:
- video-generation
- video-editing
- in-context-learning
- pytorch
pipeline_tag: video-to-video
library_name: transformers
authors:
- XiangpengYang
- horizonwind2004
---

<div align="center">

  <h1 style="margin: 0; font-size: 1.8em;">
    Unified Video Editing with Temporal Reasoner
  </h1>

  <h4 style="margin: 15px 0; color: #2c3e50;">
    ๐Ÿ‘๏ธ See &rarr; ๐Ÿง  Reason &rarr; โœ๏ธ Edit
  </h4>

  <h4 style="margin: 15px 0; color: #2c3e50;">
    🚀 A Chain-of-Frames editing method that enables temporal reasoning and 4&times; video-length generalization with just 50k training pairs!
  </h4>

  <a href="https://huggingface.co/papers/2512.07469"><img src="https://img.shields.io/badge/HuggingFace-Daily_Paper-ffd21e.svg" alt="Daily Paper"></a>
  <a href="https://arxiv.org/abs/2512.07469"><img src="https://img.shields.io/badge/arXiv-2512.07469-b31b1b.svg" alt="arXiv"></a>
  <a href="https://videocof.github.io"><img src="https://img.shields.io/badge/Project-Page-green" alt="Project Page"></a>
  <a href="https://github.com/knightyxp/VideoCoF"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>

</div>

<div align="center">
  <b>
    <a href="https://scholar.google.com/citations?user=reiIeYMAAAAJ">Xiangpeng Yang</a><sup>1</sup>,
    <a href="https://horizonwind2004.github.io/">Ji Xie</a><sup>2</sup>,
    <a href="https://scholar.google.com/citations?user=OvfI_HMAAAAJ">Yiyuan Yang</a><sup>1</sup>,
    <a href="https://scholar.google.com/citations?user=zfeWd6gAAAAJ">Yan Huang</a><sup>1</sup>,
    <a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Min Xu</a><sup>1</sup>,
    <a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Qiang Wu</a><sup>1</sup>
  </b>
  <br>
  <span style="font-size: 1em; color: #555;"><sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University</span>
</div>

<br>

# VideoCoF: Unified Video Editing with Temporal Reasoner


**VideoCoF** is a unified video editing model that bridges the gap between expert models (precise but restricted) and unified in-context models (flexible but spatially inaccurate). By introducing a **"See &rarr; Reason &rarr; Edit"** Chain-of-Frames paradigm, VideoCoF predicts reasoning tokens before generating the target video tokens, removing the need for user-provided masks while achieving precise instruction-to-region alignment.
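
To make the ordering concrete, here is a purely illustrative sketch (not the official API) of how a Chain-of-Frames sequence is laid out: source tokens first ("See"), then a small block of reasoning tokens ("Reason"), then the predicted target tokens ("Edit"). The feature dimension and the one-token-per-frame simplification are assumptions for illustration; the frame counts match the inference flags used below.

```python
import torch

# Illustrative layout only (hypothetical shapes, one token per frame for simplicity).
source_tokens    = torch.randn(33, 1024)  # "See":    33 source frames
reasoning_tokens = torch.randn(4, 1024)   # "Reason": 4 reasoning frames (cf. --reasoning_frames 4)
target_tokens    = torch.randn(33, 1024)  # "Edit":   33 edited frames

# Chain-of-Frames ordering: the model conditions on source + reasoning tokens
# before predicting the target video tokens.
sequence = torch.cat([source_tokens, reasoning_tokens, target_tokens], dim=0)
print(sequence.shape)  # torch.Size([70, 1024])
```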

<div align="center">
  <a href="https://www.youtube.com/watch?v=XrYj0Qmc49w" target="_blank">
    <img src="https://img.youtube.com/vi/XrYj0Qmc49w/maxresdefault.jpg" 
         alt="Video Demo" 
         width="80%" 
         style="max-width:900px; border-radius:10px; box-shadow:0 0 10px rgba(0,0,0,0.15);">
  </a>
  <br>
  <em>Click the image above to watch the full video on YouTube 🎬</em>
</div>

## 🌟 Key Capabilities
![](assets/motivation_v2.gif)

1.  **Temporal Reasoning**: Adopts a unique approach where the model first identifies *where* and *how* to edit (Reasoning) before predicting the target video tokens.
2.  **Data Efficiency**: Achieves SOTA performance with only **50k training pairs** (33 frames each).
3.  **Length Extrapolation**: Demonstrates robust multi-shot editing and can generalize to videos **4&times; longer** than training samples.
4.  **Versatile Editing**: Supports:
    * Object Removal
    * Object Addition
    * Object Swap
    * Local Style Transfer

## 🔧 Quick Start

To use these weights, please refer to the official [GitHub Repository](https://github.com/knightyxp/VideoCoF) for inference code and environment setup.

### Installation

```bash
git clone https://github.com/knightyxp/VideoCoF
cd VideoCoF

# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (Choose version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For Hopper GPUs (e.g., H100/H800) requiring fast inference:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install other dependencies
pip install -r requirements.txt
```
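
After installation, a quick sanity check (a minimal sketch; any recent PyTorch build exposes these attributes) confirms that PyTorch sees your GPU before you download the large checkpoints:

```python
import torch

# Verify the installed PyTorch build and that a CUDA device is visible.
print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA (build):   ", torch.version.cuda)
    print("GPU:            ", torch.cuda.get_device_name(0))
```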

**Note on Flash Attention:**
We recommend using **FlashAttention-3** (currently beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. 
If you are using these GPUs, please follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
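
To confirm which FlashAttention build is importable, a small probe like the one below runs harmlessly even when nothing is installed. The module names checked here are assumptions (FlashAttention-2 typically ships `flash_attn`; the FA3 beta uses a separate interface module), so consult the FlashAttention repository for the exact name your build provides:

```python
import importlib.util

# Probe a couple of likely FlashAttention module names; find_spec returns
# None (rather than raising) when a package is not installed.
for name in ("flash_attn", "flash_attn_interface"):
    status = "installed" if importlib.util.find_spec(name) else "not found"
    print(f"{name}: {status}")
```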

### Download Models

*   **Wan-2.1-T2V-14B Pretrained Weights:**
    
    ```bash
    git lfs install
    git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
    
    # Or using huggingface-cli:
    # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
    ```

*   **VideoCoF Checkpoint:**
    
    ```bash
    git lfs install
    git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight

    # Or using huggingface-cli:
    # hf download XiangpengYang/VideoCoF --local-dir videocof_weight
    ```
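
If you prefer Python over `git clone`, both repositories above can also be fetched with `huggingface_hub` (a sketch using the same local directory names as the commands above):

```python
from huggingface_hub import snapshot_download

# Download both checkpoints into the directories expected by the inference command.
snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-14B", local_dir="Wan2.1-T2V-14B")
snapshot_download(repo_id="XiangpengYang/VideoCoF", local_dir="videocof_weight")
```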

### Inference

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 inference.py \
  --video_path assets/two_man.mp4 \
  --prompt "Remove the young man with short black hair wearing black shirt on the left." \
  --output_dir results/obj_rem \
  --model_name Wan2.1-T2V-14B \
  --seed 0 \
  --num_frames 33 \
  --source_frames 33 \
  --reasoning_frames 4 \
  --repeat_rope \
  --videocof_path videocof_weight/videocof.safetensors
```
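
To run several edits back to back, one option is a small driver script that invokes the same `inference.py` entry point. This is a convenience sketch only: the flag values mirror the command above, and the example path/prompt are placeholders to adapt to your data.

```python
import subprocess

# (video_path, prompt) pairs to edit sequentially; extend as needed.
edits = [
    ("assets/two_man.mp4",
     "Remove the young man with short black hair wearing black shirt on the left."),
]

for video_path, prompt in edits:
    subprocess.run([
        "torchrun", "--nproc_per_node=1", "inference.py",
        "--video_path", video_path,
        "--prompt", prompt,
        "--output_dir", "results/batch",
        "--model_name", "Wan2.1-T2V-14B",
        "--seed", "0",
        "--num_frames", "33",
        "--source_frames", "33",
        "--reasoning_frames", "4",
        "--repeat_rope",
        "--videocof_path", "videocof_weight/videocof.safetensors",
    ], check=True)
```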

For parallel inference:

```bash
sh scripts/parallel_infer.sh
```

## ๐Ÿ™ Acknowledgments

We thank the authors of related works and the open-source projects [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1) for their contributions.

## 📜 License

This project is licensed under the [Apache License 2.0](LICENSE).

## 📮 Contact

For any questions, feel free to reach out to the author Xiangpeng Yang ([@knightyxp](https://github.com/knightyxp)) via email: [email protected] / [email protected]

## 📄 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```

<div align="center">
  โค๏ธ **If you find this project helpful, please consider giving it a like!** โค๏ธ
</div>