KugelAudio-0-Open
Open-source text-to-speech for European languages with voice cloning capabilities. A 7B-parameter model powered by an AR + Diffusion architecture.
License: MIT · Python 3.10+ · Hosted API
KugelAudio · KI-Servicezentrum Berlin-Brandenburg · Funded by BMFTR
Motivation
Open-source text-to-speech models for European languages are significantly lagging behind. While English TTS has seen remarkable progress, speakers of German, French, Spanish, Polish, and dozens of other European languages have been underserved by the open-source community.
KugelAudio aims to change this. Building on the excellent foundation laid by the VibeVoice team at Microsoft, we've trained a model specifically focused on European language coverage, using approximately 200,000 hours of highly pre-processed and enhanced speech data from the YODAS2 dataset.
Benchmark Results: Outperforming ElevenLabs
KugelAudio achieves state-of-the-art performance, beating industry leaders including ElevenLabs in rigorous human preference testing. This breakthrough demonstrates that open-source models can now rival - and surpass - the best commercial TTS systems.
Human Preference Benchmark (A/B Testing)
We conducted extensive A/B testing with 339 human evaluations to compare KugelAudio against leading TTS models. Participants listened to a reference voice sample, then compared outputs from two models and selected which sounded more human and closer to the original voice.
German Language Evaluation
The evaluation specifically focused on German language samples with diverse emotional expressions and speaking styles:
- Neutral Speech: Standard conversational tones
- Shouting: High-intensity, elevated volume speech
- Singing: Melodic and rhythmic speech patterns
- Drunken Voice: Slurred and irregular speech characteristics
These diverse test cases demonstrate the model's capability to handle a wide range of speaking styles beyond standard narration.
OpenSkill Ranking Results
| Rank | Model | Score | Record | Win Rate |
|---|---|---|---|---|
| 🥇 1 | KugelAudio | 26 | 71W / 20L / 23T | 78.0% |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |
Based on 339 evaluations, scored with the OpenSkill Bayesian skill-rating system.
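For reference, a ranking of this kind can be reproduced from raw pairwise judgments with the `openskill` Python package. The snippet below is only a minimal sketch of that procedure, not our actual evaluation code: the per-comparison data layout and the choice of the `PlackettLuce` rating model are assumptions, while the model names come from the table above.

```python
# Minimal sketch: turning pairwise A/B judgments into OpenSkill ratings.
# Assumes the `openskill` package (openskill.py, v5+ API); the real
# evaluation pipeline may differ in data format and rating model.
from openskill.models import PlackettLuce

model = PlackettLuce()
ratings = {name: model.rating(name=name) for name in [
    "KugelAudio", "ElevenLabs Multi v2", "ElevenLabs v3",
    "Cartesia", "VibeVoice", "CosyVoice v3",
]}

# Each judgment is (model_a, model_b, outcome) with outcome in {"a", "b", "tie"}.
judgments = [("KugelAudio", "ElevenLabs v3", "a")]  # illustrative; 339 in total

for a, b, outcome in judgments:
    # Lower rank wins; equal ranks encode a tie.
    ranks = {"a": [0, 1], "b": [1, 0], "tie": [0, 0]}[outcome]
    [[ratings[a]], [ratings[b]]] = model.rate([[ratings[a]], [ratings[b]]], ranks=ranks)

# Sort by the conservative ordinal() skill estimate.
for name, r in sorted(ratings.items(), key=lambda kv: kv[1].ordinal(), reverse=True):
    print(f"{name:20s} ordinal={r.ordinal():.1f}")
```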
Audio Samples
Listen to KugelAudio's diverse voice capabilities across different speaking styles and languages:
German Voice Samples
| Sample | Description |
|---|---|
| Whispering | Soft whispering voice |
| Female Narrator | Professional female reader voice |
| Angry Voice | Irritated and frustrated speech |
| Radio Announcer | Professional radio broadcast voice |
All samples are generated with zero-shot voice cloning from reference audio.
Training Details
- Base Model: Microsoft VibeVoice
- Training Data: ~200,000 hours from YODAS2
- Hardware: 8x NVIDIA H100 GPUs
- Training Duration: 5 days
Supported Languages
This model supports the following European languages:
| Language | Code | Language | Code | Language | Code |
|---|---|---|---|---|---|
| English | en | German | de | French | fr |
| Spanish | es | Italian | it | Portuguese | pt |
| Dutch | nl | Polish | pl | Russian | ru |
| Ukrainian | uk | Czech | cs | Romanian | ro |
| Hungarian | hu | Swedish | sv | Danish | da |
| Finnish | fi | Norwegian | no | Greek | el |
| Bulgarian | bg | Slovak | sk | Croatian | hr |
| Serbian | sr | Turkish | tr | | |
Language Coverage Disclaimer: Quality varies significantly by language. Spanish, French, English, and German have the strongest representation in our training data (~200,000 hours from YODAS2). Other languages may have reduced quality, prosody, or vocabulary coverage depending on their availability in the training dataset.
Model Specifications
| Property | Value |
|---|---|
| Parameters | 7B |
| Architecture | AR + Diffusion (Qwen2.5-7B backbone) |
| Base Model | Microsoft VibeVoice |
| Audio Sample Rate | 24kHz |
| Audio Format | Mono, float32 |
| VRAM Required | ~19GB |
| Training Hardware | 8x NVIDIA H100 |
| Training Duration | 5 days |
| Training Data | ~200,000 hours from YODAS2 |
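Given the ~19 GB VRAM requirement, a quick pre-flight check of the available GPU memory can save a failed model load. The snippet below is a generic PyTorch check, not part of the kugelaudio-open API:

```python
import torch

# Rough pre-flight check against the ~19 GB VRAM requirement listed above.
REQUIRED_GIB = 19

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    print(f"GPU: {props.name}, {total_gib:.1f} GiB total memory")
    if total_gib < REQUIRED_GIB:
        print("Warning: less than ~19 GiB VRAM; inference may run out of memory.")
else:
    print("No CUDA device available; CPU inference will be very slow.")
```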
Quick Start
Installation
# Install with pip
pip install kugelaudio-open
# Or with uv (recommended)
uv pip install kugelaudio-open
Basic Usage
from kugelaudio_open import (
KugelAudioForConditionalGenerationInference,
KugelAudioProcessor,
)
import torch
# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KugelAudioForConditionalGenerationInference.from_pretrained(
"kugelaudio/kugelaudio-0-open",
torch_dtype=torch.bfloat16,
).to(device)
model.eval()
processor = KugelAudioProcessor.from_pretrained("kugelaudio/kugelaudio-0-open")
# Generate speech
inputs = processor(text="Hallo Welt! Das ist KugelAudio.", return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
outputs = model.generate(**inputs, cfg_scale=3.0)
# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")
Voice Cloning
# Clone a voice using reference audio
inputs = processor(
text="Hallo, ich spreche jetzt mit deiner Stimme!",
voice_prompt="reference_voice.wav", # 5-30 seconds of clear speech
return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
outputs = model.generate(**inputs, cfg_scale=3.0)
processor.save_audio(outputs.speech_outputs[0], "cloned_output.wav")
Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| cfg_scale | 3.0 | Classifier-free guidance scale (1.0-10.0). Higher = more adherence to text |
| max_new_tokens | 2048 | Maximum number of tokens to generate |
| do_sample | False | Whether to use sampling (vs greedy decoding) |
| temperature | 1.0 | Sampling temperature (if do_sample=True) |
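As an illustration, all of these parameters can be passed directly to `generate()`. The values below are arbitrary examples rather than recommended settings, and `model`, `inputs`, and `processor` are the objects created in the Quick Start section:

```python
# Illustrative generation call combining the parameters from the table above.
# The specific values are examples, not recommendations.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=4.0,        # stronger adherence to the text than the 3.0 default
        max_new_tokens=1024,  # cap on generated tokens (default 2048)
        do_sample=True,       # sample instead of greedy decoding
        temperature=0.8,      # only takes effect when do_sample=True
    )
processor.save_audio(outputs.speech_outputs[0], "output_sampled.wav")
```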
Architecture
KugelAudio uses a hybrid Autoregressive + Diffusion architecture based on Microsoft's VibeVoice:
Text Input → Qwen2.5-7B Backbone → Diffusion Head → Acoustic Decoder → Audio Output
                    ↑
          Voice Prompt (optional)
- Text Encoder: Qwen2.5-7B language model encodes input text
- Diffusion Head: Predicts speech latents using denoising diffusion (20 steps)
- Acoustic Decoder: Hierarchical convolutional decoder converts latents to 24kHz audio
- Semantic Encoder: Extracts speaker characteristics from reference audio (for voice cloning)
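Conceptually, these components compose as in the pseudocode sketch below. The function names are purely illustrative and do not correspond to the model's internal API; the sketch only restates the data flow described above.

```python
# Conceptual data flow only; function names are illustrative, not real internals.
def synthesize(text, voice_prompt=None):
    # 1. The Qwen2.5-7B backbone encodes the input text and autoregressively
    #    conditions each speech segment on previously generated ones.
    hidden_states = backbone_encode(text)

    # 2. Optionally, a semantic encoder extracts speaker characteristics from
    #    the reference audio for zero-shot voice cloning.
    speaker_features = semantic_encode(voice_prompt) if voice_prompt else None

    # 3. The diffusion head denoises speech latents over 20 steps, conditioned
    #    on the backbone states (and speaker features, if present).
    latents = diffusion_head(hidden_states, speaker_features, num_steps=20)

    # 4. A hierarchical convolutional acoustic decoder converts the latents
    #    into a 24 kHz mono waveform.
    return acoustic_decode(latents, sample_rate=24000)
```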
Audio Watermarking
All audio generated by this model is automatically watermarked using Facebook's AudioSeal. The watermark is:
- Imperceptible: No audible difference in audio quality
- Robust: Survives compression, resampling, and editing
- Detectable: Can verify if audio was generated by KugelAudio
Verify Watermark
import soundfile as sf
from kugelaudio_open.watermark import AudioWatermark

# Load the generated audio as a mono float32 waveform (soundfile shown here; any loader works)
audio, sample_rate = sf.read("output.wav", dtype="float32")
watermark = AudioWatermark()
result = watermark.detect(audio, sample_rate=sample_rate)
print(f"Watermark detected: {result.detected}")
print(f"Confidence: {result.confidence:.1%}")
Intended Use
✅ Appropriate Uses
- Accessibility: Text-to-speech for visually impaired users
- Content Creation: Podcasts, videos, audiobooks, e-learning
- Voice Assistants: Chatbots and virtual assistants
- Language Learning: Pronunciation practice and language education
- Creative Projects: With proper consent and attribution
❌ Prohibited Uses
- Creating deepfakes or misleading content
- Impersonating individuals without explicit consent
- Fraud, deception, or scams
- Harassment or abuse
- Any illegal activities
Limitations
- VRAM Requirements: Requires ~19GB VRAM for inference
- Speed: Approximately 1.0x real-time on modern GPUs
- Voice Cloning Quality: Best results with 5-30 seconds of clear reference audio
- Language Quality Variation: Quality may vary across languages based on training data distribution
Hosted API
For production use without managing infrastructure, use our hosted API at kugelaudio.com:
- Ultra-low latency: <100 ms end-to-end
- Global edge deployment
- Zero setup required
- Auto-scaling
from kugelaudio import KugelAudio
client = KugelAudio(api_key="your_api_key")
audio = client.tts.generate(text="Hello from KugelAudio!", model="kugel-1-turbo")
audio.save("output.wav")
Acknowledgments
This model would not have been possible without the contributions of many individuals and organizations:
- Microsoft VibeVoice Team: For the excellent foundation architecture that this model builds upon
- YODAS2 Dataset: For providing the large-scale multilingual speech data
- Qwen Team: For the powerful language model backbone
- Facebook AudioSeal: For the audio watermarking technology
Special Thanks
- Carlos Menke: For his invaluable efforts in gathering the first datasets and extensive work benchmarking the model
- AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing the GPU resources (8x H100) that made training this model possible
Citation
@software{kugelaudio2026,
  title = {KugelAudio: Open-Source Text-to-Speech for European Languages with Voice Cloning},
  author = {Kratzenstein, Kajo and Menke, Carlos},
  year = {2026},
  institution = {Hasso-Plattner-Institut},
  url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
}
License
This model is released under the MIT License.
Author
Kajo Kratzenstein
Email: kajo@kugelaudio.com
Website: kugelaudio.com
Carlos Menke
Funding Notice
This project was funded by the German Federal Ministry of Research, Technology and Space (BMFTR) under funding code 16IS22092, "KI-Servicezentrum Berlin-Brandenburg" (AI Service Center Berlin-Brandenburg).