Improve MeanVC model card: Update pipeline tag and add sample usage

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +109 -81
README.md CHANGED
@@ -1,81 +1,109 @@
- ---
- license: apache-2.0
- language:
- - en
- - zh
- tags:
- - text-to-speech
- ---
-
- # MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
-
- <div align="center">
-
- [![Paper](https://img.shields.io/badge/arXiv-2510.08392-b31b1b.svg)](https://arxiv.org/pdf/2510.08392)
- [![Github](https://img.shields.io/badge/Github-Page-green)](https://github.com/ASLP-lab/MeanVC)
- [![Demo Page](https://img.shields.io/badge/Demo-Audio%20Samples-green)](https://aslp-lab.github.io/MeanVC/)
-
- </div>
-
- **MeanVC** is a lightweight and streaming zero-shot voice conversion system that enables real-time timbre transfer from any source speaker to any target speaker while preserving linguistic content. The system introduces a diffusion transformer with chunk-wise autoregressive denoising strategy and mean flows for efficient single-step inference.
-
- ![img](figs/model.png)
-
- ## ✨ Key Features
-
- - **🚀 Streaming Inference**: Real-time voice conversion with chunk-wise processing.
- - **⚡ Single-Step Generation**: Direct mapping from start to endpoint via mean flows for fast generation.
- - **🎯 Zero-Shot Capability**: Convert to any unseen target speaker without re-training.
- - **💾 Lightweight**: Significantly fewer parameters than existing methods.
- - **🔊 High Fidelity**: Superior speech quality and speaker similarity.
-
-
- ## 💾 Model Download
-
- Use the following Python script to download the models into a local directory (e.g., ./checkpoints):
-
- ```python
- from huggingface_hub import snapshot_download
-
- # Download all necessary models and components for MeanVC
- snapshot_download(
-     "ASLP-lab/MeanVC",
-     allow_patterns=[
-         "model_200ms.safetensors", # The trained MeanVC model weights
-         "meanvc_200ms.pt", # JIT-compiled model for real-time inference
-         "fastu2++.pt", # JIT-compiled ASR model
-         "vocos.pt" # JIT-compiled Vocos vocoder
-     ],
-     local_dir="./checkpoints", # Specify your target directory
-     local_dir_use_symlinks=False
- )
- ```
-
-
-
- ## 📜 License & Disclaimer
-
- MeanVC is released under the Apache License 2.0. This open-source license allows you to freely use, modify, and distribute the model, as long as you include the appropriate copyright notice and disclaimer.
-
- MeanVC is designed for research and legitimate applications in voice conversion technology. Users must obtain proper consent from individuals whose voices are being converted or used as references. We strongly discourage any malicious use including impersonation, fraud, or creating misleading audio content. Users are solely responsible for ensuring their use cases comply with ethical standards and legal requirements.
-
- ## 📄 Citation
-
- If you find our work helpful, please cite our paper:
-
- ```bibtex
- @article{ma2025meanvc,
- title={MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows},
- author={Ma, Guobin and Yao, Jixun and Ning, Ziqian and Jiang, Yuepeng and Xiong, Lingxin and Xie, Lei and Zhu, Pengcheng},
- journal={arXiv preprint arXiv:2510.08392},
- year={2025}
- }
- ```
-
- ## 📧 Contact
-
- If you are interested in leaving a message to our research team, feel free to email guobin.[email protected]
-
- <p align="center">
- <img src="figs/[email protected]" width="500"/>
- </p>
+ ---
+ language:
+ - en
+ - zh
+ license: apache-2.0
+ pipeline_tag: audio-to-audio
+ ---
+
+ # MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows
+
+ <div align="center">
+
+ [![Paper](https://img.shields.io/badge/arXiv-2510.08392-b31b1b.svg)](https://arxiv.org/pdf/2510.08392)
+ [![Github](https://img.shields.io/badge/Github-Page-green)](https://github.com/ASLP-lab/MeanVC)
+ [![Demo Page](https://img.shields.io/badge/Demo-Audio%20Samples-green)](https://aslp-lab.github.io/MeanVC/)
+
+ </div>
+
+ **MeanVC** is a lightweight and streaming zero-shot voice conversion system that enables real-time timbre transfer from any source speaker to any target speaker while preserving linguistic content. The system introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy and mean flows for efficient single-step inference.
+
+ ![img](https://huggingface.co/ASLP-lab/MeanVC/resolve/main/figs/model.png)
+
+ ## ✨ Key Features
+
+ - **🚀 Streaming Inference**: Real-time voice conversion with chunk-wise processing.
+ - **⚡ Single-Step Generation**: Direct mapping from start to endpoint via mean flows for fast generation.
+ - **🎯 Zero-Shot Capability**: Convert to any unseen target speaker without re-training.
+ - **💾 Lightweight**: Significantly fewer parameters than existing methods.
+ - **🔊 High Fidelity**: Superior speech quality and speaker similarity.
+
+ ## 💻 Sample Usage
+
+ ### 1. Environment Setup
+ First, follow these steps to clone the repository and install the required environment.
+
+ ```bash
+ # Clone the repository and enter the directory
+ git clone https://github.com/ASLP-lab/MeanVC.git
+ cd MeanVC
+
+ # Create and activate a Conda environment
+ conda create -n meanvc python=3.11 -y
+ conda activate meanvc
+
+ # Install dependencies
+ pip install -r requirements.txt
+ ```
+
+ ### 2. Download Pre-trained Models
+ Run the provided script to automatically download all necessary pre-trained models.
+
+ ```bash
+ python download_ckpt.py
+ ```
+
+ This will download the main VC model, vocoder, and ASR model into the `src/ckpt/` directories.
+ The speaker verification model (`wavlm_large_finetune.pth`) must be downloaded manually from Google Drive. Download the file from [this link](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view). Place the downloaded `wavlm_large_finetune.pth` file into the `src/runtime/speaker_verification/ckpt/` directory.
+
+ ### 3. Real-Time Voice Conversion
+ This script captures audio from your microphone and converts it in real-time to the voice of a target speaker.
+
+ ```bash
+ python src/runtime/run_rt.py --target-path "path/to/target_voice.wav"
+ ```
+
+ - `--target-path`: Path to a clean audio file of the target speaker. This voice will be used as the conversion target. An example file is provided at `src/runtime/example/test.wav`.
+
+ When you run the script, you will be prompted to select your audio input (microphone) and output (speaker) devices from a list.
+
+ ### 4. Offline Voice Conversion
+ For batch processing or converting pre-recorded audio files, use the offline conversion script.
+
+ ```bash
+ bash scripts/infer_ref.sh
+ ```
+
+ Before running the script, you need to configure the following paths in `scripts/infer_ref.sh`:
+
+ - `source_path`: Path to the source audio file or directory containing multiple audio files to be converted
+ - `reference_path`: Path to a clean audio file of the target speaker (used as voice reference)
+ - `output_dir`: Directory where converted audio files will be saved (default: `src/outputs`)
+ - `steps`: Number of denoising steps (default: 2)
+
+ ## 📜 License & Disclaimer
+
+ MeanVC is released under the Apache License 2.0. This open-source license allows you to freely use, modify, and distribute the model, as long as you include the appropriate copyright notice and disclaimer.
+
+ MeanVC is designed for research and legitimate applications in voice conversion technology. Users must obtain proper consent from individuals whose voices are being converted or used as references. We strongly discourage any malicious use including impersonation, fraud, or creating misleading audio content. Users are solely responsible for ensuring their use cases comply with ethical standards and legal requirements.
+
+ ## 📄 Citation
+
+ If you find our work helpful, please cite our paper:
+
+ ```bibtex
+ @article{ma2025meanvc,
+ title={MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows},
+ author={Ma, Guobin and Yao, Jixun and Ning, Ziqian and Jiang, Yuepeng and Xiong, Lingxin and Xie, Lei and Zhu, Pengcheng},
+ journal={arXiv preprint arXiv:2510.08392},
+ year={2025}
+ }
+ ```
+
+ ## 📧 Contact
+
+ If you are interested in leaving a message to our research team, feel free to email [email protected]
+
+ <p align="center">
+ <img src="https://huggingface.co/ASLP-lab/MeanVC/resolve/main/figs/[email protected]" width="500"/>
+ </p>
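
For anyone following the "Download Pre-trained Models" step in the new README, a small pre-flight check may help confirm that the manually downloaded speaker verification checkpoint is in place before launching `run_rt.py`. The following is a minimal sketch and not part of the MeanVC repository: the helper name `missing_checkpoints` and the `EXPECTED_FILES` list are hypothetical. Only the one path that the README states explicitly is listed, because the exact file layout under `src/ckpt/` is not given in the text.

```python
from pathlib import Path

# Assumption: paths are relative to the cloned MeanVC repository root.
# Only the manually placed speaker verification checkpoint is listed,
# since the README text states its directory explicitly. Extend this
# list once the layout under src/ckpt/ is known.
EXPECTED_FILES = [
    "src/runtime/speaker_verification/ckpt/wavlm_large_finetune.pth",
]


def missing_checkpoints(repo_root, expected=EXPECTED_FILES):
    """Return the entries of `expected` that are not files under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints(".")
    if missing:
        print("Missing checkpoint files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All listed checkpoint files are present.")
```

Run from the repository root, the script lists any expected file that is absent, so a misplaced `wavlm_large_finetune.pth` shows up before the real-time script is started.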