KugelAudio-0-Open
Open-source text-to-speech for European languages with voice cloning capabilities. A 7B-parameter model powered by an AR + Diffusion architecture.
License: MIT · Python 3.10+ · Hosted API
KugelAudio · KI-Servicezentrum Berlin-Brandenburg · Funded by BMFTR
Motivation
Open-source text-to-speech models for European languages are significantly lagging behind. While English TTS has seen remarkable progress, speakers of German, French, Spanish, Polish, and dozens of other European languages have been underserved by the open-source community.
KugelAudio aims to change this. Building on the excellent foundation laid by the VibeVoice team at Microsoft, we've trained a model specifically focused on European language coverage, using approximately 200,000 hours of highly pre-processed and enhanced speech data from the YODAS2 dataset.
Benchmark Results: Outperforming ElevenLabs
KugelAudio achieves state-of-the-art performance, beating industry leaders including ElevenLabs in rigorous human preference testing. This breakthrough demonstrates that open-source models can now rival - and surpass - the best commercial TTS systems.
Human Preference Benchmark (A/B Testing)
We conducted extensive A/B testing with 339 human evaluations to compare KugelAudio against leading TTS models. Participants listened to a reference voice sample, then compared outputs from two models and selected which sounded more human and closer to the original voice.
German Language Evaluation
The evaluation specifically focused on German language samples with diverse emotional expressions and speaking styles:
- Neutral Speech: Standard conversational tones
- Shouting: High-intensity, elevated volume speech
- Singing: Melodic and rhythmic speech patterns
- Drunken Voice: Slurred and irregular speech characteristics
These diverse test cases demonstrate the model's capability to handle a wide range of speaking styles beyond standard narration.
OpenSkill Ranking Results
| Rank | Model | Score | Record | Win Rate |
|---|---|---|---|---|
| 🥇 1 | KugelAudio | 26 | 71W / 20L / 23T | 78.0% |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |
Based on 339 evaluations, scored with the OpenSkill Bayesian skill-rating system.
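For reference, a ranking of this kind can be reproduced from raw pairwise judgments with the `openskill` Python package. The snippet below is only a minimal sketch of that procedure, not our actual evaluation code: the per-comparison data layout and the choice of the `PlackettLuce` rating model are assumptions, while the model names come from the table above.

```python
# Minimal sketch: turning pairwise A/B judgments into OpenSkill ratings.
# Assumes the `openskill` package (openskill.py, v5+ API); the real
# evaluation pipeline may differ in data format and rating model.
from openskill.models import PlackettLuce

model = PlackettLuce()
ratings = {name: model.rating(name=name) for name in [
    "KugelAudio", "ElevenLabs Multi v2", "ElevenLabs v3",
    "Cartesia", "VibeVoice", "CosyVoice v3",
]}

# Each judgment is (model_a, model_b, outcome) with outcome in {"a", "b", "tie"}.
judgments = [("KugelAudio", "ElevenLabs v3", "a")]  # illustrative; 339 in total

for a, b, outcome in judgments:
    # Lower rank wins; equal ranks encode a tie.
    ranks = {"a": [0, 1], "b": [1, 0], "tie": [0, 0]}[outcome]
    [[ratings[a]], [ratings[b]]] = model.rate([[ratings[a]], [ratings[b]]], ranks=ranks)

# Sort by the conservative ordinal() skill estimate.
for name, r in sorted(ratings.items(), key=lambda kv: kv[1].ordinal(), reverse=True):
    print(f"{name:20s} ordinal={r.ordinal():.1f}")
```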
Audio Samples
Listen to KugelAudio's diverse voice capabilities across different speaking styles and languages:
German Voice Samples
| Sample | Description |
|---|---|
| Whispering | Soft whispering voice |
| Female Narrator | Professional female reader voice |
| Angry Voice | Irritated and frustrated speech |
| Radio Announcer | Professional radio broadcast voice |
All samples are generated with zero-shot voice cloning from reference audio.
Training Details
- Base Model: Microsoft VibeVoice
- Training Data: ~200,000 hours from YODAS2
- Hardware: 8x NVIDIA H100 GPUs
- Training Duration: 5 days
Supported Languages
This model supports the following European languages:
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| English | en | German | de | French | fr |
| Spanish | es | Italian | it | Portuguese | pt |
| Dutch | nl | Polish | pl | Russian | ru |
| Ukrainian | uk | Czech | cs | Romanian | ro |
| Hungarian | hu | Swedish | sv | Danish | da |
| Finnish | fi | Norwegian | no | Greek | el |
| Bulgarian | bg | Slovak | sk | Croatian | hr |
| Serbian | sr | Turkish | tr | | |
Language Coverage Disclaimer: Quality varies significantly by language. Spanish, French, English, and German have the strongest representation in our training data (~200,000 hours from YODAS2). Other languages may have reduced quality, prosody, or vocabulary coverage depending on their availability in the training dataset.
Model Specifications
| Property | Value |
|---|---|
| Parameters | 7B |
| Architecture | AR + Diffusion (Qwen2.5-7B backbone) |
| Base Model | Microsoft VibeVoice |
| Audio Sample Rate | 24kHz |
| Audio Format | Mono, float32 |
| VRAM Required | ~19GB |
| Training Hardware | 8x NVIDIA H100 |
| Training Duration | 5 days |
| Training Data | ~200,000 hours from YODAS2 |
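Given the ~19 GB VRAM requirement, a quick pre-flight check of the available GPU memory can save a failed model load. The snippet below is a generic PyTorch check, not part of the kugelaudio-open API:

```python
import torch

# Rough pre-flight check against the ~19 GB VRAM requirement listed above.
REQUIRED_GIB = 19

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {total_gib:.1f} GiB total memory")
    if total_gib < REQUIRED_GIB:
        print("Warning: less than ~19 GiB VRAM; inference may run out of memory.")
else:
    print("No CUDA device available; CPU inference will be very slow.")
```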
Quick Start
Installation
# Install with pip
pip install kugelaudio-open
# Or with uv (recommended)
uv pip install kugelaudio-open
Basic Usage
from kugelaudio_open import (
KugelAudioForConditionalGenerationInference,
KugelAudioProcessor,
)
import torch
# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KugelAudioForConditionalGenerationInference.from_pretrained(
"kugelaudio/kugelaudio-0-open",
torch_dtype=torch.bfloat16,
).to(device)
model.eval()
processor = KugelAudioProcessor.from_pretrained("kugelaudio/kugelaudio-0-open")
# Generate speech
inputs = processor(text="Hallo Welt! Das ist KugelAudio.", return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
outputs = model.generate(**inputs, cfg_scale=3.0)
# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")
Voice Cloning
# Clone a voice using reference audio
inputs = processor(
text="Hallo, ich spreche jetzt mit deiner Stimme!",
voice_prompt="reference_voice.wav", # 5-30 seconds of clear speech
return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
outputs = model.generate(**inputs, cfg_scale=3.0)
processor.save_audio(outputs.speech_outputs[0], "cloned_output.wav")
Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| cfg_scale | 3.0 | Classifier-free guidance scale (1.0-10.0). Higher = more adherence to text |
| max_new_tokens | 2048 | Maximum number of tokens to generate |
| do_sample | False | Whether to use sampling (vs greedy decoding) |
| temperature | 1.0 | Sampling temperature (if do_sample=True) |
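As an illustration, all of these parameters can be passed directly to `generate()`. The values below are arbitrary examples rather than recommended settings, and `model`, `inputs`, and `processor` are the objects created in the Quick Start section:

```python
# Illustrative generation call combining the parameters from the table above.
# The specific values are examples, not recommendations.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=4.0,        # stronger adherence to the text than the 3.0 default
        max_new_tokens=1024,  # cap on generated tokens (default 2048)
        do_sample=True,       # sample instead of greedy decoding
        temperature=0.8,      # only takes effect when do_sample=True
    )
processor.save_audio(outputs.speech_outputs[0], "output_sampled.wav")
```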
Architecture
KugelAudio uses a hybrid Autoregressive + Diffusion architecture based on Microsoft's VibeVoice:
Text Input → Qwen2.5-7B Backbone → Diffusion Head → Acoustic Decoder → Audio Output
                    ↑
          Voice Prompt (optional)
- Text Encoder: Qwen2.5-7B language model encodes input text
- Diffusion Head: Predicts speech latents using denoising diffusion (20 steps)
- Acoustic Decoder: Hierarchical convolutional decoder converts latents to 24kHz audio
- Semantic Encoder: Extracts speaker characteristics from reference audio (for voice cloning)
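Conceptually, these components compose as in the pseudocode sketch below. The function names are purely illustrative and do not correspond to the model's internal API; the sketch only restates the data flow described above.

```python
# Conceptual data flow only; function names are illustrative, not real internals.
def synthesize(text, voice_prompt=None):
    # 1. The Qwen2.5-7B backbone encodes the input text and autoregressively
    #    conditions each speech segment on previously generated ones.
    hidden_states = backbone_encode(text)

    # 2. Optionally, a semantic encoder extracts speaker characteristics from
    #    the reference audio for zero-shot voice cloning.
    speaker_features = semantic_encode(voice_prompt) if voice_prompt else None

    # 3. The diffusion head denoises speech latents over 20 steps, conditioned
    #    on the backbone states (and speaker features, if present).
    latents = diffusion_head(hidden_states, speaker_features, num_steps=20)

    # 4. A hierarchical convolutional acoustic decoder converts the latents
    #    into a 24 kHz mono waveform.
    return acoustic_decode(latents, sample_rate=24000)
```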
Audio Watermarking
All audio generated by this model is automatically watermarked using Facebook's AudioSeal. The watermark is:
- Imperceptible: No audible difference in audio quality
- Robust: Survives compression, resampling, and editing
- Detectable: Can verify if audio was generated by KugelAudio
Verify Watermark
import soundfile as sf
from kugelaudio_open.watermark import AudioWatermark

# Load the generated audio as a mono float32 waveform (soundfile shown here; any loader works)
audio, sample_rate = sf.read("output.wav", dtype="float32")
watermark = AudioWatermark()
result = watermark.detect(audio, sample_rate=sample_rate)
print(f"Watermark detected: {result.detected}")
print(f"Confidence: {result.confidence:.1%}")
Intended Use
✅ Appropriate Uses
- Accessibility: Text-to-speech for visually impaired users
- Content Creation: Podcasts, videos, audiobooks, e-learning
- Voice Assistants: Chatbots and virtual assistants
- Language Learning: Pronunciation practice and language education
- Creative Projects: With proper consent and attribution
❌ Prohibited Uses
- Creating deepfakes or misleading content
- Impersonating individuals without explicit consent
- Fraud, deception, or scams
- Harassment or abuse
- Any illegal activities
Limitations
- VRAM Requirements: Requires ~19GB VRAM for inference
- Speed: Approximately 1.0x real-time on modern GPUs
- Voice Cloning Quality: Best results with 5-30 seconds of clear reference audio
- Language Quality Variation: Quality may vary across languages based on training data distribution
Hosted API
For production use without managing infrastructure, use our hosted API at kugelaudio.com:
- Ultra-low latency: <100 ms end-to-end
- Global edge deployment
- Zero setup required
- Auto-scaling
from kugelaudio import KugelAudio
client = KugelAudio(api_key="your_api_key")
audio = client.tts.generate(text="Hello from KugelAudio!", model="kugel-1-turbo")
audio.save("output.wav")
Acknowledgments
This model would not have been possible without the contributions of many individuals and organizations:
- Microsoft VibeVoice Team: For the excellent foundation architecture that this model builds upon
- YODAS2 Dataset: For providing the large-scale multilingual speech data
- Qwen Team: For the powerful language model backbone
- Facebook AudioSeal: For the audio watermarking technology
Special Thanks
- Carlos Menke: For his invaluable efforts in gathering the first datasets and extensive work benchmarking the model
- AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing the GPU resources (8x H100) that made training this model possible
Citation
@software{kugelaudio2026,
  title = {KugelAudio: Open-Source Text-to-Speech for European Languages with Voice Cloning},
  author = {Kratzenstein, Kajo and Menke, Carlos},
  year = {2026},
  institution = {Hasso-Plattner-Institut},
  url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
}
License
This model is released under the MIT License.
Author
Kajo Kratzenstein
Email: kajo@kugelaudio.com
Website: kugelaudio.com
Carlos Menke
Funding Notice
This project was funded by the German Federal Ministry of Research, Technology and Space (BMFTR) under funding code 16IS22092, "KI-Servicezentrum Berlin-Brandenburg" (AI Service Center Berlin-Brandenburg).