---
language:
- en
- multilingual
license: gpl-3.0
library_name: pytorch
pipeline_tag: audio-classification
tags:
- phoneme-recognition
- speech-processing
- audio
- pytorch
- multilingual
model-index:
- name: en_libri1000_uj01d
  results:
  - task:
      type: phoneme-classification
    dataset:
      name: LibriSpeech
      type: speech-recognition
    metrics:
    - name: Phoneme Error Rate
      type: phoneme-error-rate
      value: 0.25
    - name: Phoneme Group Error Rate
      type: phoneme-group-error-rate
      value: 0.23
- name: multi_MLS8_uh02
  results:
  - task:
      type: phoneme-classification
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: speech-recognition
    metrics:
    - name: Phoneme Error Rate
      type: phoneme-error-rate
      value: 0.31
    - name: Phoneme Group Error Rate
      type: phoneme-group-error-rate
      value: 0.26
- name: multi_mswc38_ug20
  results:
  - task:
      type: phoneme-classification
    dataset:
      name: MSWC Multilingual Spoken Words Corpus
      type: speech-recognition
    metrics:
    - name: Phoneme Error Rate
      type: phoneme-error-rate
      value: 0.49
    - name: Phoneme Group Error Rate
      type: phoneme-group-error-rate
      value: 0.39
---

# 🗣️ CUPE: Contextless Universal Phoneme Encoder

[![🤗 Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/Tabahi/CUPE-2i)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/tabahi/contexless-phonemes-CUPE)
[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2508.15316)
[![License: GPLv3](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://www.gnu.org/licenses/gpl-3.0)

> 🚀 **A PyTorch model for contextless phoneme prediction from speech audio**

CUPE processes 120ms frames independently, ensuring each frame's embeddings are acoustically pure, unlike transformer models that mix context across frames.

## 🔗 Quick Links

- 🎯 [**Bournemouth Forced Aligner**](https://github.com/tabahi/bournemouth-forced-aligner) - For phoneme/word timestamp alignment
- 📝 [**CUPE GitHub**](https://github.com/tabahi/contexless-phonemes-CUPE) - Source code repository
- 🤗 [**CUPE Hugging Face**](https://huggingface.co/Tabahi/CUPE-2i) - Pre-trained models

---

## 🎯 Trained Models

> **📊 Three 30.1M parameter models available**

All models are available in the [**checkpoints directory**](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt).

### 📈 Model Performance

| 🏷️ **Model** | 🌍 **Languages** | 📊 **PER** | 📊 **GER** | 📝 **Description** |
|------------|-------------|----------|----------|--------------|
| 🇬🇧 **English** | English | **0.24** | **0.22** | 🏆 Best quality for English speech |
| 🌍 **Multilingual MLS** | 8 European | **0.31** | **0.26** | 🇪🇺 en, de, fr, es, pt, it, pl, nl |
| 🌐 **Multilingual MSWC** | 38 languages | **0.49** | **0.39** | 🗺️ Broad language coverage |
๐Ÿ“‹ Detailed Metrics **๐Ÿ‡ฌ๐Ÿ‡ง English (New: Oct2025) ([en_libri1000_ua01c](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/en_libri1000_ua01c_e4_val_GER=0.2186.ckpt)):** - ๐ŸŽฏ **PER:** 0.24 (Phoneme Error Rate) - ๐ŸŽฏ **GER:** 0.22 (Phoneme Group Error Rate) - Fixed rhotics and compound phonemes **๐Ÿ‡ฌ๐Ÿ‡ง English ([en_libri1000_uj01d](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt)):** - ๐ŸŽฏ **PER:** 0.25 (Phoneme Error Rate) - ๐ŸŽฏ **GER:** 0.23 (Phoneme Group Error Rate) **๐ŸŒ Multilingual MLS ([multi_MLS8_uh02](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_MLS8_uh02_e36_val_GER=0.2334.ckpt)):** - ๐ŸŽฏ **PER:** 0.31 - ๐ŸŽฏ **GER:** 0.26 **๐ŸŒ Multilingual MSWC ([multi_mswc38_ug20](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_mswc38_ug20_e59_val_GER=0.5611.ckpt)):** - ๐ŸŽฏ **PER:** 0.49 - ๐ŸŽฏ **GER:** 0.39
> โš ๏ธ **Note:** CUPE models are designed for contextless phoneme prediction and are not optimal for phoneme classification tasks that require contextual information. CUPE excels at extracting pure, frame-level embeddings that represent the acoustic properties of each phoneme independently of surrounding context. --- ## ๐Ÿ“š Datasets ### ๐ŸŽต Training Data Sources - ๐Ÿ“– **LibriSpeech ASR corpus (SR12):** 960 hours of English speech - ๐ŸŒ **Multilingual LibriSpeech (MLS):** 800 hours across 8 languages - ๐Ÿ—ฃ๏ธ **MSWC Multilingual Spoken Words:** 240 hours from 50 languages
๐Ÿ” Dataset Details **๐Ÿ“– LibriSpeech ASR corpus (SR12):** - โฑ๏ธ 960 hours of English speech - ๐Ÿ“ train-100, train-360, and train-500 splits **๐ŸŒ Multilingual LibriSpeech (MLS) (SLR94):** - โฑ๏ธ 800 hours total (100 hours each) - ๐ŸŒ 8 languages: `pl`, `pt`, `it`, `es`, `fr`, `nl`, `de`, `en` **๐Ÿ—ฃ๏ธ MSWC Multilingual Spoken Words Corpus:** - โฑ๏ธ 240 hours from 50 languages (max 10 hours/language) - ๐ŸŽ“ **Training:** 38 languages (`en`, `de`, `fr`, `ca`, `es`, `fa`, `it`, `ru`, `pl`, `eu`, `cy`, `eo`, `nl`, `pt`, `tt`, `cs`, `tr`, `et`, `ky`, `id`, `sv-SE`, `ar`, `el`, `ro`, `lv`, `sl`, `zh-CN`, `ga-IE`, `ta`, `vi`, `gn`, `or`) - ๐Ÿงช **Testing:** 6 languages (`lt`, `mt`, `ia`, `sk`, `ka`, `as`)
> ๐Ÿ’ก **Need a new language?** Start a [new discussion](https://github.com/tabahi/bournemouth-forced-aligner/discussions) and we'll train it for you! --- ## ๐Ÿš€ Installation ### โšก Quick Start (Bournemouth Forced Aligner) ```bash # ๐Ÿ“ฆ Install the package pip install bournemouth-forced-aligner # ๐Ÿ”ง Install dependencies apt-get install espeak-ng ffmpeg # โ“ Show help balign --help ``` ๐Ÿ“– See complete [**BFA guide**](https://github.com/tabahi/bournemouth-forced-aligner). ### ๐Ÿ› ๏ธ Quick Start (CUPE) ```bash # ๐Ÿ“ฆ Install core dependencies pip install torch torchaudio huggingface_hub ``` --- ## ๐Ÿ’ป Easy Usage with Automatic Download > ๐ŸŽฏ **Zero-setup required** - automatic downloads from Hugging Face Hub ### ๐Ÿฆ‹ Example Output Running with sample audio [๐Ÿฆ‹ butterfly.wav](samples/109867__timkahn__butterfly.wav.wav): ```bash ๐Ÿ”„ Loading CUPE english model... โœ… Model loaded on cpu ๐ŸŽต Processing audio: 1.26s duration ๐Ÿ“Š Processed 75 frames (1200ms total) ๐Ÿ“‹ Results: ๐Ÿ”ค Phoneme predictions shape: (75,) ๐Ÿท๏ธ Group predictions shape: (75,) โ„น๏ธ Model info: {'model_name': 'english', 'sample_rate': 16000, 'frames_per_second': 62.5} ๐Ÿ” First 10 frame predictions: Frame 0: phoneme=66, group=16 Frame 1: phoneme=66, group=16 Frame 2: phoneme=29, group=7 ... ๐Ÿ”ค Phonemes: ['b', 'สŒ', 't', 'h', 'สŒ', 'f', 'l', 'รฆ']... ๐Ÿท๏ธ Groups: ['voiced_stops', 'central_vowels', 'voiceless_stops']... ``` ### ๐Ÿ Python Code ```python import torch import torchaudio from huggingface_hub import hf_hub_download import importlib.util def load_cupe_model(model_name="english", device="auto"): """๐Ÿ”„ Load CUPE model with automatic downloading from Hugging Face Hub""" model_files = { "english": "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt", "multilingual-mls": "multi_MLS8_uh02_e36_val_GER=0.2334.ckpt", "multilingual-mswc": "multi_mswc38_ug20_e59_val_GER=0.5611.ckpt" } if device == "auto": device = "cuda" if torch.cuda.is_available() else "cpu" # ๐Ÿ“ฅ Download files automatically from Hugging Face Hub repo_id = "Tabahi/CUPE-2i" model_file = hf_hub_download(repo_id=repo_id, filename="model2i.py") windowing_file = hf_hub_download(repo_id=repo_id, filename="windowing.py") checkpoint = hf_hub_download(repo_id=repo_id, filename=f"ckpt/{model_files[model_name]}") model_utils_file = hf_hub_download(repo_id=repo_id, filename="model_utils.py") # ๐Ÿ”ง Import modules dynamically _ = import_module_from_file("model_utils", model_utils_file) spec = importlib.util.spec_from_file_location("model2i", model_file) model2i = importlib.util.module_from_spec(spec) spec.loader.exec_module(model2i) spec = importlib.util.spec_from_file_location("windowing", windowing_file) windowing = importlib.util.module_from_spec(spec) spec.loader.exec_module(windowing) # ๐Ÿš€ Initialize model extractor = model2i.CUPEEmbeddingsExtractor(checkpoint, device=device) return extractor, windowing # ๐ŸŽฏ Example usage extractor, windowing = load_cupe_model("english") # ๐ŸŽต Load and process your audio audio, sr = torchaudio.load("your_audio.wav") if sr != 16000: resampler = torchaudio.transforms.Resample(sr, 16000) audio = resampler(audio) # ๐Ÿ“Š Add batch dimension and process audio_batch = audio.unsqueeze(0) windowed_audio = windowing.slice_windows(audio_batch, 16000, 120, 80) batch_size, num_windows, window_size = windowed_audio.shape windows_flat = windowed_audio.reshape(-1, window_size) # ๐Ÿ”ฎ Get predictions logits_phonemes, logits_groups = extractor.predict(windows_flat, return_embeddings=False, groups_only=False) print(f"๐Ÿ”ค 
print(f"🔤 Phoneme logits shape: {logits_phonemes.shape}")  # [num_windows, frames_per_window, 66]
print(f"🏷️ Group logits shape: {logits_groups.shape}")  # [num_windows, frames_per_window, 16]
```
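The per-window logits above overlap in time (120 ms windows, 80 ms stride), so for a clean timeline you would first stitch them with `windowing.stich_window_predictions` as in the manual setup below. As a rough follow-up, here is a minimal greedy-decoding sketch; the `idx_to_phoneme` dict is a hypothetical stand-in for the real index-to-IPA mapping in [mapper.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/mapper.py):

```python
# ⚠️ Minimal greedy-decoding sketch. `idx_to_phoneme` is a hypothetical
# stand-in — load the real 66-class index-to-IPA mapping from mapper.py.
idx_to_phoneme = {29: "t", 66: "(blank)"}  # hypothetical example entries

def greedy_decode(frame_logits, blank_id=66, silence_id=0):
    """Argmax per frame, merge consecutive repeats, drop blank and silence."""
    frame_ids = frame_logits.argmax(dim=-1).flatten().tolist()
    phonemes, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx not in (blank_id, silence_id):
            phonemes.append(idx_to_phoneme.get(idx, f"<{idx}>"))
        prev = idx
    return phonemes

print("🔤 Decoded phonemes:", greedy_decode(logits_phonemes))
```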
๐Ÿ“ Manual Setup Code For more control, see [run.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/run.py): ```python import torch import torchaudio from model2i import CUPEEmbeddingsExtractor # ๐ŸŽฏ Main CUPE model feature extractor import windowing # ๐Ÿ”ง Provides slice_windows, stich_window_predictions # ๐Ÿ“ Load model from local checkpoint cupe_ckpt_path = "./ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt" extractor = CUPEEmbeddingsExtractor(cupe_ckpt_path, device="cuda") # ๐ŸŽต Prepare audio sample_rate = 16000 window_size_ms = 120 stride_ms = 80 max_wav_len = 10 * sample_rate # 10 seconds dummy_wav = torch.zeros(1, max_wav_len, dtype=torch.float32, device="cpu") audio_batch = dummy_wav.unsqueeze(0) # Add batch dimension # ๐ŸชŸ Window the audio windowed_audio = windowing.slice_windows( audio_batch.to("cuda"), sample_rate, window_size_ms, stride_ms ) batch_size, num_windows, window_size = windowed_audio.shape windows_flat = windowed_audio.reshape(-1, window_size) # ๐Ÿ”ฎ Get predictions logits, _ = extractor.predict(windows_flat, return_embeddings=False, groups_only=False) # ๐Ÿ”„ Reshape and stitch window predictions frames_per_window = logits.shape[1] logits = logits.reshape(batch_size, num_windows, frames_per_window, -1) logits = windowing.stich_window_predictions( logits, original_audio_length=audio_batch.size(2), cnn_output_size=frames_per_window, sample_rate=sample_rate, window_size_ms=window_size_ms, stride_ms=stride_ms ) print(f"๐Ÿ“Š Output shape: {logits.shape}") # [B, T, 66] ```
--- ## ๐Ÿ“Š Output Format - ๐Ÿ”ค **Phoneme logits**: `(time_frames, 66)` - 66 IPA phoneme classes - ๐Ÿท๏ธ **Group logits**: `(time_frames, 16)` - 16 phoneme groups - โฑ๏ธ **Time resolution**: ~16ms per frame (~62.5 FPS) - ๐Ÿ—บ๏ธ **Mapping**: See [mapper.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/mapper.py) for phoneme-to-index mapping --- ## โœจ Key Features - ๐Ÿš€ **No manual downloads** - automatic via Hugging Face Hub - ๐ŸŒ **Multiple languages** - English + 37 other languages - โšก **Real-time capable** - faster than real-time on GPU - โฑ๏ธ **Frame-level timing** - 16ms resolution - ๐ŸŽฏ **Contextless** - each frame processed independently --- ## ๐ŸŽจ Custom Dataset for Training
๐Ÿ”ง Training Setup - ๐Ÿ“‹ See [mapper.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/mapper.py) for tokenization (66 phonemes + 16 groups) - ๐Ÿ”ค Use IPA-based grapheme-to-phoneme tools: [Espeak-ng](https://pypi.org/project/espeakng/) - ๐Ÿ“ Convert words to IPA sequences: [phonemizer](https://pypi.org/project/phonemizer/3.0.1/) - ๐Ÿ—บ๏ธ Map IPA phonemes to tokens: [IPAPhonemeMapper](https://github.com/tabahi/IPAPhonemeMapper) **Token Mapping:** - Token 0: ๐Ÿ”‡ Silence - Tokens 1-65: ๐Ÿ”ค IPA phonemes - Token 66: ๐Ÿ“ป Blank/noise
--- ## ๐ŸŽฏ Use Cases - โฐ **Timestamp alignment** (examples coming soon) - ๐Ÿ“Š **Speech analysis** - ๐Ÿ” **Phoneme recognition** - ๐ŸŽต **Audio processing** --- ## ๐Ÿ“Š Visual Results ### ๐Ÿ“ˆ Sample Probabilities Timeline ![Sample output logits plot](plots/where_they_went_timeline.png) ### ๐ŸŒ Multilingual Confusion Plot ![Multilingual Confusion Plot (counts)](plots/uh02_multilingual_MLS8.png) ### ๐Ÿ‡ฌ๐Ÿ‡ง English-only Confusion Plot ![English-only Confusion Plot (probabiltities)](plots/uh03b_confusion_probs_heatmap_libri_dev_en.png) --- ## ๐Ÿ“– Citation ๐Ÿ“„ **Paper**: [CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing](https://arxiv.org/abs/2508.15316) ```bibtex @inproceedings{rehman2025cupe, title = {CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing}, author = {Abdul Rehman and Jian-Jun Zhang and Xiaosong Yang}, booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025)}, year = {2025}, organization = {ICNLSP}, publisher = {International Conference on Natural Language and Speech Processing}, } ``` ---
### ๐ŸŒŸ **Star this repository if you find it helpful!** โญ [![GitHub stars](https://img.shields.io/github/stars/tabahi/contexless-phonemes-CUPE?style=social)](https://github.com/tabahi/contexless-phonemes-CUPE) [![Hugging Face likes](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Like-blue)](https://huggingface.co/Tabahi/CUPE-2i)