---
language:
- en
- multilingual
license: gpl-3.0
library_name: pytorch
pipeline_tag: audio-classification
tags:
- phoneme-recognition
- speech-processing
- audio
- pytorch
- multilingual
model-index:
- name: en_libri1000_uj01d
  results:
  - task:
      type: phoneme-classification
    dataset:
      name: LibriSpeech
      type: speech-recognition
    metrics:
    - name: Phoneme Error Rate
      type: phoneme-error-rate
      value: 0.25
    - name: Phoneme Group Error Rate
      type: phoneme-group-error-rate
      value: 0.23
- name: multi_MLS8_uh02
  results:
  - task:
      type: phoneme-classification
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: speech-recognition
    metrics:
    - name: Phoneme Error Rate
      type: phoneme-error-rate
      value: 0.31
    - name: Phoneme Group Error Rate
      type: phoneme-group-error-rate
      value: 0.26
- name: multi_mswc38_ug20
  results:
  - task:
      type: phoneme-classification
    dataset:
      name: MSWC Multilingual Spoken Words Corpus
      type: speech-recognition
    metrics:
    - name: Phoneme Error Rate
      type: phoneme-error-rate
      value: 0.49
    - name: Phoneme Group Error Rate
      type: phoneme-group-error-rate
      value: 0.39
---
# CUPE: Contextless Universal Phoneme Encoder
[Hugging Face Model](https://huggingface.co/Tabahi/CUPE-2i) · [GitHub](https://github.com/tabahi/contexless-phonemes-CUPE) · [Paper (arXiv:2508.15316)](https://arxiv.org/abs/2508.15316) · [License: GPL-3.0](https://www.gnu.org/licenses/gpl-3.0)
> **A PyTorch model for contextless phoneme prediction from speech audio**

CUPE processes 120 ms frames independently, so each frame's embeddings stay acoustically pure, unlike transformer models that mix context across frames.
## Quick Links
- [**Bournemouth Forced Aligner**](https://github.com/tabahi/bournemouth-forced-aligner) - For phoneme/word timestamp alignment
- [**CUPE GitHub**](https://github.com/tabahi/contexless-phonemes-CUPE) - Source code repository
- [**CUPE Hugging Face**](https://huggingface.co/Tabahi/CUPE-2i) - Pre-trained models
---
## Trained Models
> **Three 30.1M-parameter models available**

All models are available in the [**checkpoints directory**](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt).
### Model Performance
| **Model** | **Languages** | **PER** | **GER** | **Description** |
|-----------|---------------|---------|---------|-----------------|
| **English** | English | **0.24** | **0.21** | Best quality for English speech |
| **Multilingual MLS** | 8 European | **0.31** | **0.26** | en, de, fr, es, pt, it, pl, nl |
| **Multilingual MSWC** | 38 languages | **0.49** | **0.39** | Broad language coverage |
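To fetch a single checkpoint programmatically instead of cloning the repo, `hf_hub_download` from `huggingface_hub` works. A minimal sketch, with the filename taken from the checkpoint links in the detailed metrics below:

```python
# Download one CUPE checkpoint from the Hub into the local cache.
# The filename comes from the checkpoint links below; adjust for other models.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="Tabahi/CUPE-2i",
    filename="ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
)
print(ckpt_path)  # local path, usable as cupe_ckpt_path in the setup code below
```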
### Detailed Metrics
**English (new: Oct 2025) ([en_libri1000_ua01c](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/en_libri1000_ua01c_e4_val_GER=0.2186.ckpt)):**
- **PER:** 0.24 (Phoneme Error Rate)
- **GER:** 0.22 (Phoneme Group Error Rate)
- Fixed handling of rhotics and compound phonemes

**English ([en_libri1000_uj01d](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt)):**
- **PER:** 0.25 (Phoneme Error Rate)
- **GER:** 0.23 (Phoneme Group Error Rate)

**Multilingual MLS ([multi_MLS8_uh02](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_MLS8_uh02_e36_val_GER=0.2334.ckpt)):**
- **PER:** 0.31
- **GER:** 0.26

**Multilingual MSWC ([multi_mswc38_ug20](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_mswc38_ug20_e59_val_GER=0.5611.ckpt)):**
- **PER:** 0.49
- **GER:** 0.39
### Dataset Details
**LibriSpeech ASR corpus (SLR12):**
- 960 hours of English speech
- train-100, train-360, and train-500 splits

**Multilingual LibriSpeech (MLS) (SLR94):**
- 800 hours total (100 hours per language)
- 8 languages: `pl`, `pt`, `it`, `es`, `fr`, `nl`, `de`, `en`

**MSWC Multilingual Spoken Words Corpus:**
- 240 hours from 50 languages (max 10 hours/language)
- **Training:** 38 languages (`en`, `de`, `fr`, `ca`, `es`, `fa`, `it`, `ru`, `pl`, `eu`, `cy`, `eo`, `nl`, `pt`, `tt`, `cs`, `tr`, `et`, `ky`, `id`, `sv-SE`, `ar`, `el`, `ro`, `lv`, `sl`, `zh-CN`, `ga-IE`, `ta`, `vi`, `gn`, `or`)
- **Testing:** 6 languages (`lt`, `mt`, `ia`, `sk`, `ka`, `as`)
### Manual Setup Code
For more control, see [run.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/run.py):
```python
import torch
import torchaudio  # for loading real audio files (unused with the dummy tensor below)
from model2i import CUPEEmbeddingsExtractor  # main CUPE feature extractor
import windowing  # provides slice_windows and stich_window_predictions

# Load the model from a local checkpoint
cupe_ckpt_path = "./ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"
extractor = CUPEEmbeddingsExtractor(cupe_ckpt_path, device="cuda")

# Prepare audio (a 10-second silent dummy clip; replace with real audio)
sample_rate = 16000
window_size_ms = 120
stride_ms = 80
max_wav_len = 10 * sample_rate  # 10 seconds
dummy_wav = torch.zeros(1, max_wav_len, dtype=torch.float32, device="cpu")
audio_batch = dummy_wav.unsqueeze(0)  # add batch dimension: [B, 1, samples]

# Slice the audio into overlapping 120 ms windows with an 80 ms stride
windowed_audio = windowing.slice_windows(
    audio_batch.to("cuda"),
    sample_rate,
    window_size_ms,
    stride_ms
)
batch_size, num_windows, window_size = windowed_audio.shape
windows_flat = windowed_audio.reshape(-1, window_size)

# Run the model on each window independently
logits, _ = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)

# Reshape and stitch per-window predictions back into one sequence
frames_per_window = logits.shape[1]
logits = logits.reshape(batch_size, num_windows, frames_per_window, -1)
logits = windowing.stich_window_predictions(
    logits,
    original_audio_length=audio_batch.size(2),
    cnn_output_size=frames_per_window,
    sample_rate=sample_rate,
    window_size_ms=window_size_ms,
    stride_ms=stride_ms
)
print(f"Output shape: {logits.shape}")  # [B, T, 66]
```
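The stitched `logits` can be collapsed into a phoneme token sequence with a simple greedy, CTC-style decode. A minimal sketch continuing from the block above; it is not part of run.py, and it assumes token 66 is the blank index as listed under Token Mapping below (`BLANK_ID` is this sketch's own name):

```python
# Greedy CTC-style collapse of frame logits into a phoneme token sequence.
# Assumes token 66 = blank/noise and token 0 = silence (see Token Mapping below).
BLANK_ID = 66

frame_ids = logits.argmax(dim=-1)[0].tolist()  # best token per frame, first batch item
phonemes, prev = [], None
for tok in frame_ids:
    if tok != prev and tok != BLANK_ID:  # drop repeated frames and blanks
        phonemes.append(tok)
    prev = tok
print(phonemes)  # token IDs; map back to IPA symbols via mapper.py
```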
### Training Setup
- See [mapper.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/mapper.py) for tokenization (66 phonemes + 16 groups)
- Use IPA-based grapheme-to-phoneme tools: [Espeak-ng](https://pypi.org/project/espeakng/)
- Convert words to IPA sequences with [phonemizer](https://pypi.org/project/phonemizer/3.0.1/) (see the sketch after this list)
- Map IPA phonemes to tokens: [IPAPhonemeMapper](https://github.com/tabahi/IPAPhonemeMapper)
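As a rough illustration of the grapheme-to-phoneme step, the sketch below uses `phonemize` from phonemizer with the espeak-ng backend; mapping the resulting IPA strings onto CUPE's token set is then the job of mapper.py / IPAPhonemeMapper. Assumes espeak-ng is installed on the system:

```python
# Rough G2P sketch: text -> IPA string via phonemizer's espeak-ng backend.
from phonemizer import phonemize

ipa = phonemize(
    "hello world",
    language="en-us",
    backend="espeak",
    strip=True,
)
print(ipa)  # IPA output; exact symbols depend on the espeak-ng version
```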
**Token Mapping:**
- Token 0: silence
- Tokens 1-65: IPA phonemes
- Token 66: blank/noise
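Since the stitched output has a uniform frame rate over the clip, a frame index converts to a timestamp by simple proportion, which is what alignment uses. A sketch under that assumption; `frame_to_seconds` is a hypothetical helper, not part of this repo:

```python
# Convert a stitched frame index to a timestamp, assuming the T output
# frames evenly span the original audio (hypothetical helper, for illustration).
def frame_to_seconds(frame_idx: int, num_frames: int, audio_len_samples: int,
                     sample_rate: int = 16000) -> float:
    duration = audio_len_samples / sample_rate  # clip length in seconds
    return frame_idx * duration / num_frames

# e.g. frame 50 of 125 frames over a 10 s clip -> 4.0 s
print(frame_to_seconds(50, 125, 10 * 16000))
```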