๐ŸŽ™๏ธ KugelAudio-0-Open

Open-source text-to-speech for European languages with voice cloning capabilities: a 7B-parameter model powered by an AR + Diffusion architecture.

GitHub Source Code KugelAudio Website

KugelAudio · KI-Servicezentrum Berlin-Brandenburg · Funded by BMFTR

License: MIT Python 3.10+ Hosted API



Motivation

Open-source text-to-speech for European languages lags significantly behind its English counterpart: while English TTS has seen remarkable progress, speakers of German, French, Spanish, Polish, and dozens of other European languages remain underserved by the open-source community.

KugelAudio aims to change this. Building on the excellent foundation laid by the VibeVoice team at Microsoft, we've trained a model specifically focused on European language coverage, using approximately 200,000 hours of highly pre-processed and enhanced speech data from the YODAS2 dataset.

๐Ÿ† Benchmark Results: Outperforming ElevenLabs

KugelAudio achieves state-of-the-art performance, beating industry leaders including ElevenLabs in rigorous human preference testing. This result demonstrates that open-source models can now rival, and even surpass, the best commercial TTS systems.

Human Preference Benchmark (A/B Testing)

We conducted extensive A/B testing with 339 human evaluations to compare KugelAudio against leading TTS models. Participants listened to a reference voice sample, then compared outputs from two models and selected which sounded more human and closer to the original voice.

German Language Evaluation

The evaluation specifically focused on German language samples with diverse emotional expressions and speaking styles:

  • Neutral Speech: Standard conversational tones
  • Shouting: High-intensity, elevated volume speech
  • Singing: Melodic and rhythmic speech patterns
  • Drunken Voice: Slurred and irregular speech characteristics

These diverse test cases demonstrate the model's capability to handle a wide range of speaking styles beyond standard narration.

OpenSkill Ranking Results

| Rank | Model | Score | Record | Win Rate |
|------|-------|-------|--------|----------|
| 🥇 1 | KugelAudio | 26 | 71W / 20L / 23T | 78.0% |
| 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
| 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
| 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
| 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
| 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |

*Based on 339 evaluations scored with the OpenSkill Bayesian skill-rating system. Win rates exclude ties.*
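OpenSkill models each system's skill as a Gaussian that is updated after every comparison. As a rough, self-contained illustration of how repeated pairwise A/B outcomes produce a ranking, here is a minimal Elo-style update. This is *not* the OpenSkill algorithm itself, and the outcomes below are made up:

```python
# Minimal Elo-style rating sketch: NOT the actual OpenSkill algorithm,
# just an illustration of how pairwise A/B outcomes yield a ranking.

def expected_score(r_a, r_b):
    """Probability that A beats B under a logistic skill model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, a, b, score_a, k=32.0):
    """score_a: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical outcomes: model X wins most comparisons against model Y.
ratings = {"X": 1000.0, "Y": 1000.0}
for outcome in [1.0, 1.0, 1.0, 0.5, 0.0, 1.0]:
    update(ratings, "X", "Y", outcome)

# The frequent winner ends up with the higher rating.
print(round(ratings["X"], 1), round(ratings["Y"], 1))
```

OpenSkill improves on this by tracking uncertainty per model, which is why a model can rank above another despite a lower raw win rate (as with ElevenLabs Multi v2 vs. v3 above).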

Audio Samples

Listen to KugelAudio's diverse voice capabilities across different speaking styles and languages:

German Voice Samples

| Sample | Description |
|--------|-------------|
| Whispering | Soft whispering voice |
| Female Narrator | Professional female reader voice |
| Angry Voice | Irritated and frustrated speech |
| Radio Announcer | Professional radio broadcast voice |

All samples are generated with zero-shot voice cloning from reference audio.

Training Details

  • Base Model: Microsoft VibeVoice
  • Training Data: ~200,000 hours from YODAS2
  • Hardware: 8x NVIDIA H100 GPUs
  • Training Duration: 5 days

Supported Languages

This model supports the following European languages:

| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|----------|------|------|----------|------|------|----------|------|------|
| English | en | 🇺🇸 | German | de | 🇩🇪 | French | fr | 🇫🇷 |
| Spanish | es | 🇪🇸 | Italian | it | 🇮🇹 | Portuguese | pt | 🇵🇹 |
| Dutch | nl | 🇳🇱 | Polish | pl | 🇵🇱 | Russian | ru | 🇷🇺 |
| Ukrainian | uk | 🇺🇦 | Czech | cs | 🇨🇿 | Romanian | ro | 🇷🇴 |
| Hungarian | hu | 🇭🇺 | Swedish | sv | 🇸🇪 | Danish | da | 🇩🇰 |
| Finnish | fi | 🇫🇮 | Norwegian | no | 🇳🇴 | Greek | el | 🇬🇷 |
| Bulgarian | bg | 🇧🇬 | Slovak | sk | 🇸🇰 | Croatian | hr | 🇭🇷 |
| Serbian | sr | 🇷🇸 | Turkish | tr | 🇹🇷 | | | |

๐Ÿ“Š Language Coverage Disclaimer: Quality varies significantly by language. Spanish, French, English, and German have the strongest representation in our training data (~200,000 hours from YODAS2). Other languages may have reduced quality, prosody, or vocabulary coverage depending on their availability in the training dataset.
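When routing text to the model, it can be useful to validate language codes up front. A small illustrative helper against the list above (this helper is not part of the kugelaudio-open API):

```python
# Illustrative helper (not part of the kugelaudio-open API): validate a
# language code against the set this model card lists as supported.

SUPPORTED = {
    "en", "de", "fr", "es", "it", "pt", "nl", "pl", "ru", "uk", "cs",
    "ro", "hu", "sv", "da", "fi", "no", "el", "bg", "sk", "hr", "sr", "tr",
}

def check_language(code: str) -> str:
    """Normalize a language code and reject unsupported ones."""
    code = code.lower()
    if code not in SUPPORTED:
        raise ValueError(f"unsupported language code: {code!r}")
    return code

print(check_language("DE"))  # "de"
```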

Model Specifications

| Property | Value |
|----------|-------|
| Parameters | 7B |
| Architecture | AR + Diffusion (Qwen2.5-7B backbone) |
| Base Model | Microsoft VibeVoice |
| Audio Sample Rate | 24 kHz |
| Audio Format | Mono, float32 |
| VRAM Required | ~19 GB |
| Training Hardware | 8x NVIDIA H100 |
| Training Duration | 5 days |
| Training Data | ~200,000 hours from YODAS2 |
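The output format (24 kHz, mono, float32) maps cleanly onto a standard WAV file. Below is a stdlib-only sketch of that conversion, assuming `processor.save_audio` does something broadly similar internally (its actual implementation is not shown here):

```python
# Sketch: writing mono 24 kHz float samples to a standard 16-bit WAV file
# using only the standard library. This is an assumption about what a
# save step looks like, not the actual processor.save_audio code.
import math
import struct
import wave

SAMPLE_RATE = 24_000

def save_wav(samples, path):
    """samples: iterable of floats in [-1.0, 1.0]."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 16-bit PCM
        f.setframerate(SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(frames)

# 0.1 s of a 440 Hz sine tone as dummy "model output"
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 10)]
save_wav(tone, "demo.wav")
```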

Quick Start

Installation

```bash
# Install with pip
pip install kugelaudio-open

# Or with uv (recommended)
uv pip install kugelaudio-open
```

Basic Usage

```python
from kugelaudio_open import (
    KugelAudioForConditionalGenerationInference,
    KugelAudioProcessor,
)
import torch

# Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KugelAudioForConditionalGenerationInference.from_pretrained(
    "kugelaudio/kugelaudio-0-open",
    torch_dtype=torch.bfloat16,
).to(device)
model.eval()

processor = KugelAudioProcessor.from_pretrained("kugelaudio/kugelaudio-0-open")

# Generate speech
inputs = processor(text="Hallo Welt! Das ist KugelAudio.", return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)

# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")
```

Voice Cloning

```python
# Clone a voice using reference audio
inputs = processor(
    text="Hallo, ich spreche jetzt mit deiner Stimme!",
    voice_prompt="reference_voice.wav",  # 5-30 seconds of clear speech
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0)

processor.save_audio(outputs.speech_outputs[0], "cloned_output.wav")
```

Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| cfg_scale | 3.0 | Classifier-free guidance scale (1.0-10.0); higher values follow the text more closely |
| max_new_tokens | 2048 | Maximum number of tokens to generate |
| do_sample | False | Use sampling instead of greedy decoding |
| temperature | 1.0 | Sampling temperature (only used when do_sample=True) |
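`cfg_scale` is the standard classifier-free guidance knob: the model produces both a text-conditioned and an unconditioned prediction, and the final output extrapolates from the latter toward the former. A toy numerical sketch of that combination (illustrative only, not the model's internals):

```python
# Classifier-free guidance in one line: extrapolate from the unconditional
# prediction toward the text-conditioned one. Illustrative sketch only.

def apply_cfg(cond, uncond, cfg_scale):
    """Combine conditional/unconditional predictions element-wise."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]

cond = [1.0, 2.0, 3.0]    # prediction given the text
uncond = [0.0, 0.0, 0.0]  # prediction with the text dropped
print(apply_cfg(cond, uncond, 1.0))  # scale 1.0 reproduces cond: [1.0, 2.0, 3.0]
print(apply_cfg(cond, uncond, 3.0))  # scale 3.0 amplifies: [3.0, 6.0, 9.0]
```

Higher scales push the output further in the text-conditioned direction, which is why very large values can over-articulate or distort.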

Architecture

KugelAudio uses a hybrid Autoregressive + Diffusion architecture based on Microsoft's VibeVoice:

```
Text Input → Qwen2.5-7B Backbone → Diffusion Head → Acoustic Decoder → Audio Output
                                         ↑
                              Voice Prompt (optional)
```

  1. Text Encoder: the Qwen2.5-7B language model encodes the input text
  2. Diffusion Head: predicts speech latents via denoising diffusion (20 steps)
  3. Acoustic Decoder: a hierarchical convolutional decoder converts latents to 24 kHz audio
  4. Semantic Encoder: extracts speaker characteristics from reference audio (for voice cloning)
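The 20-step denoising in step 2 can be pictured as repeatedly shrinking the gap between a noisy latent and the model's prediction. A deliberately simplified toy loop (a caricature of diffusion sampling, not the model's actual sampler or noise schedule):

```python
import random

# Toy denoising loop: start from Gaussian noise and repeatedly move toward
# a fixed "clean" latent. A caricature of diffusion sampling, not the
# model's actual sampler or noise schedule.
random.seed(0)

NUM_STEPS = 20
clean = [0.3, -0.7, 0.5]                      # pretend target latent
x = [random.gauss(0.0, 1.0) for _ in clean]   # start from pure noise

for step in range(NUM_STEPS):
    # Each step removes a fixed fraction of the remaining gap to the target.
    x = [xi + 0.3 * (ci - xi) for xi, ci in zip(x, clean)]

error = max(abs(xi - ci) for xi, ci in zip(x, clean))
print(f"residual after {NUM_STEPS} steps: {error:.4f}")
```

After 20 steps the residual has shrunk by a factor of 0.7^20 ≈ 8e-4, which is why a small fixed step count suffices.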

Audio Watermarking

All audio generated by this model is automatically watermarked using Facebook's AudioSeal. The watermark is:

  • Imperceptible: No audible difference in audio quality
  • Robust: Survives compression, resampling, and editing
  • Detectable: Can verify if audio was generated by KugelAudio

Verify Watermark

```python
from kugelaudio_open.watermark import AudioWatermark

watermark = AudioWatermark()
result = watermark.detect(audio, sample_rate=24000)

print(f"Watermark detected: {result.detected}")
print(f"Confidence: {result.confidence:.1%}")
```

Intended Use

โœ… Appropriate Uses

  • Accessibility: Text-to-speech for visually impaired users
  • Content Creation: Podcasts, videos, audiobooks, e-learning
  • Voice Assistants: Chatbots and virtual assistants
  • Language Learning: Pronunciation practice and language education
  • Creative Projects: With proper consent and attribution

โŒ Prohibited Uses

  • Creating deepfakes or misleading content
  • Impersonating individuals without explicit consent
  • Fraud, deception, or scams
  • Harassment or abuse
  • Any illegal activities

Limitations

  • VRAM Requirements: Requires ~19GB VRAM for inference
  • Speed: Approximately 1.0x real-time on modern GPUs
  • Voice Cloning Quality: Best results with 5-30 seconds of clear reference audio
  • Language Quality Variation: Quality may vary across languages based on training data distribution
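Given the 5-30 second recommendation for reference audio, a simple pre-flight duration check is easy to add on the caller's side. An illustrative stdlib-only sketch (the helper name is hypothetical, not part of the package):

```python
import wave

# Illustrative pre-flight check (not part of the kugelaudio-open API):
# verify a reference clip falls in the 5-30 s range recommended for cloning.

def reference_duration_ok(path, min_s=5.0, max_s=30.0):
    with wave.open(path, "rb") as f:
        duration = f.getnframes() / f.getframerate()
    return min_s <= duration <= max_s

# Build a 10-second silent mono WAV to test against.
with wave.open("ref.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24_000)
    f.writeframes(b"\x00\x00" * 24_000 * 10)

print(reference_duration_ok("ref.wav"))  # True: 10 s is inside the window
```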

Hosted API

For production use without managing infrastructure, use our hosted API at kugelaudio.com:

  • โšก Ultra-low latency: <100ms end-to-end
  • ๐ŸŒ Global edge deployment
  • ๐Ÿ”ง Zero setup required
  • ๐Ÿ“ˆ Auto-scaling

```python
from kugelaudio import KugelAudio

client = KugelAudio(api_key="your_api_key")
audio = client.tts.generate(text="Hello from KugelAudio!", model="kugel-1-turbo")
audio.save("output.wav")
```

Acknowledgments

This model would not have been possible without the contributions of many individuals and organizations:

  • Microsoft VibeVoice Team: For the excellent foundation architecture that this model builds upon
  • YODAS2 Dataset: For providing the large-scale multilingual speech data
  • Qwen Team: For the powerful language model backbone
  • Facebook AudioSeal: For the audio watermarking technology

Special Thanks

  • Carlos Menke: For his invaluable efforts in gathering the first datasets and extensive work benchmarking the model
  • AI Service Center Berlin-Brandenburg (KI-Servicezentrum): For providing the GPU resources (8x H100) that made training this model possible

Citation

```bibtex
@software{kugelaudio2026,
  title = {KugelAudio: Open-Source Text-to-Speech for European Languages with Voice Cloning},
  author = {Kratzenstein, Kajo and Menke, Carlos},
  year = {2026},
  institution = {Hasso-Plattner-Institut},
  url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
}
```

License

This model is released under the MIT License.

Author

Kajo Kratzenstein
๐Ÿ“ง kajo@kugelaudio.com
๐ŸŒ kugelaudio.com

Carlos Menke


Funding Notice

Das zugrunde liegende Vorhaben wurde mit Mitteln des Bundesministeriums fรผr Forschung, Technologie und Raumfahrt unter dem Fรถrderkennzeichen ยปKI-Servicezentrum Berlin-Brandenburgยซ 16IS22092 gefรถrdert.

This project was funded by the German Federal Ministry of Research, Technology and Space under the funding code "AI Service Center Berlin-Brandenburg" 16IS22092.
