---
language: 
  - en
  - multilingual
license: gpl-3.0
library_name: pytorch
pipeline_tag: audio-classification
tags:
  - phoneme-recognition
  - speech-processing
  - audio
  - pytorch
  - multilingual
model-index:
  - name: en_libri1000_uj01d
    results:
      - task:
          type: phoneme-classification
        dataset:
          name: LibriSpeech
          type: speech-recognition
        metrics:
          - name: Phoneme Error Rate
            type: phoneme-error-rate
            value: 0.25
          - name: Phoneme Group Error Rate
            type: phoneme-group-error-rate
            value: 0.23
  - name: multi_MLS8_uh02
    results:
      - task:
          type: phoneme-classification
        dataset:
          name: Multilingual LibriSpeech (MLS)
          type: speech-recognition
        metrics:
          - name: Phoneme Error Rate
            type: phoneme-error-rate
            value: 0.31
          - name: Phoneme Group Error Rate
            type: phoneme-group-error-rate
            value: 0.26
  - name: multi_mswc38_ug20
    results:
      - task:
          type: phoneme-classification
        dataset:
          name: MSWC Multilingual Spoken Words Corpus
          type: speech-recognition
        metrics:
          - name: Phoneme Error Rate
            type: phoneme-error-rate
            value: 0.49
          - name: Phoneme Group Error Rate
            type: phoneme-group-error-rate
            value: 0.39
---
# 🗣️ CUPE: Contextless Universal Phoneme Encoder

[![🤗 Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/Tabahi/CUPE-2i)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/tabahi/contexless-phonemes-CUPE)
[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/abs/2508.15316)
[![License: GPLv3](https://img.shields.io/badge/License-GPLv3-yellow.svg)](https://www.gnu.org/licenses/gpl-3.0)

> 🚀 **A PyTorch model for contextless phoneme prediction from speech audio**

CUPE processes 120ms frames independently, so each frame's embeddings are acoustically pure, unlike transformer models that mix context across frames.

## 🔗 Quick Links

- 🎯 [**Bournemouth Forced Aligner**](https://github.com/tabahi/bournemouth-forced-aligner) - For phoneme/word timestamp alignment
- [**CUPE GitHub**](https://github.com/tabahi/contexless-phonemes-CUPE) - Source code repository
- 🤗 [**CUPE Hugging Face**](https://huggingface.co/Tabahi/CUPE-2i) - Pre-trained models

---

## 🎯 Trained Models

> 📊 **Three 30.1M parameter models available**

All models are available in the [**checkpoints directory**](https://huggingface.co/Tabahi/CUPE-2i/tree/main/ckpt).

### 📈 Model Performance

| 🏷️ **Model** | **Languages** | 📊 **PER** | 📊 **GER** | **Description** |
|------------|-------------|----------|----------|--------------|
| 🇬🇧 **English** | English | **0.24** | **0.21** | 🏆 Best quality for English speech |
| **Multilingual MLS** | 8 European | **0.31** | **0.26** | 🇪🇺 en, de, fr, es, pt, it, pl, nl |
| **Multilingual MSWC** | 38 languages | **0.49** | **0.39** | 🗺️ Broad language coverage |

<details>
<summary>📋 <strong>Detailed Metrics</strong></summary>

**🇬🇧 English (New: Oct2025) ([en_libri1000_ua01c](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/en_libri1000_ua01c_e4_val_GER=0.2186.ckpt)):**
- 🎯 **PER:** 0.24 (Phoneme Error Rate)
- 🎯 **GER:** 0.22 (Phoneme Group Error Rate)
- Fixed rhotics and compound phonemes

**🇬🇧 English ([en_libri1000_uj01d](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt)):**
- 🎯 **PER:** 0.25 (Phoneme Error Rate)
- 🎯 **GER:** 0.23 (Phoneme Group Error Rate)

**Multilingual MLS ([multi_MLS8_uh02](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_MLS8_uh02_e36_val_GER=0.2334.ckpt)):**
- 🎯 **PER:** 0.31
- 🎯 **GER:** 0.26

**Multilingual MSWC ([multi_mswc38_ug20](https://huggingface.co/Tabahi/CUPE-2i/resolve/main/ckpt/multi_mswc38_ug20_e59_val_GER=0.5611.ckpt)):**
- 🎯 **PER:** 0.49
- 🎯 **GER:** 0.39

</details>

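The metrics above are given only by name. As a rough illustration, assuming the conventional definition (edit distance between the predicted and reference phoneme sequences, divided by the reference length), an error rate of this kind could be computed as sketched below. The function names and the toy sequences are made up for illustration, and this is not the evaluation script behind the reported numbers. Under the same assumption, running it over phoneme-group labels instead of phonemes would give a group error rate.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalised by reference length (assumed PER definition)."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy sequences, not model output: one deletion plus one substitution.
reference = ["b", "ʌ", "t", "ə", "f", "l", "aɪ"]
hypothesis = ["b", "ʌ", "t", "f", "l", "æ"]
per = phoneme_error_rate(reference, hypothesis)
print(f"PER = {per:.2f}")  # 2 edits / 7 reference phonemes -> 0.29
```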
> ⚠️ **Note:** CUPE models are designed for contextless phoneme prediction and are not optimal for phoneme classification tasks that require contextual information. CUPE excels at extracting pure, frame-level embeddings that represent the acoustic properties of each phoneme independently of surrounding context.

---

## 📚 Datasets

### 🎵 Training Data Sources

- 📖 **LibriSpeech ASR corpus (SLR12):** 960 hours of English speech
- **Multilingual LibriSpeech (MLS):** 800 hours across 8 languages
- 🗣️ **MSWC Multilingual Spoken Words:** 240 hours from 50 languages

<details>
<summary>🔍 <strong>Dataset Details</strong></summary>

**📖 LibriSpeech ASR corpus (SLR12):**
- ⏱️ 960 hours of English speech
- train-clean-100, train-clean-360, and train-other-500 splits

**Multilingual LibriSpeech (MLS) (SLR94):**
- ⏱️ 800 hours total (100 hours per language)
- 8 languages: `pl`, `pt`, `it`, `es`, `fr`, `nl`, `de`, `en`

**🗣️ MSWC Multilingual Spoken Words Corpus:**
- ⏱️ 240 hours from 50 languages (max 10 hours/language)
- 🎓 **Training:** 38 languages (`en`, `de`, `fr`, `ca`, `es`, `fa`, `it`, `ru`, `pl`, `eu`, `cy`, `eo`, `nl`, `pt`, `tt`, `cs`, `tr`, `et`, `ky`, `id`, `sv-SE`, `ar`, `el`, `ro`, `lv`, `sl`, `zh-CN`, `ga-IE`, `ta`, `vi`, `gn`, `or`)
- 🧪 **Testing:** 6 languages (`lt`, `mt`, `ia`, `sk`, `ka`, `as`)

</details>

> 💡 **Need a new language?** Start a [new discussion](https://github.com/tabahi/bournemouth-forced-aligner/discussions) and we'll train it for you!

---

## 🚀 Installation

### ⚡ Quick Start (Bournemouth Forced Aligner)

```bash
# 📦 Install the package
pip install bournemouth-forced-aligner

# 🔧 Install dependencies
apt-get install espeak-ng ffmpeg

# ❓ Show help
balign --help
```

📖 See the complete [**BFA guide**](https://github.com/tabahi/bournemouth-forced-aligner).

### 🛠️ Quick Start (CUPE)

```bash
# 📦 Install core dependencies
pip install torch torchaudio huggingface_hub
```

---

## 💻 Easy Usage with Automatic Download

> 🎯 **Zero-setup required** - automatic downloads from Hugging Face Hub

### 🦋 Example Output
Running with sample audio [🦋 butterfly.wav](samples/109867__timkahn__butterfly.wav.wav):

```bash
🔄 Loading CUPE english model...
✅ Model loaded on cpu
🎵 Processing audio: 1.26s duration
📊 Processed 75 frames (1200ms total)

📋 Results:
🔤 Phoneme predictions shape: (75,)
🏷️ Group predictions shape: (75,)
ℹ️ Model info: {'model_name': 'english', 'sample_rate': 16000, 'frames_per_second': 62.5}

🔍 First 10 frame predictions:
Frame 0: phoneme=66, group=16
Frame 1: phoneme=66, group=16
Frame 2: phoneme=29, group=7
...

🔤 Phonemes: ['b', 'ʌ', 't', 'h', 'ʌ', 'f', 'l', 'æ']...
🏷️ Groups: ['voiced_stops', 'central_vowels', 'voiceless_stops']...
```

### 🐍 Python Code

```python
import importlib.util
import sys

import torch
import torchaudio
from huggingface_hub import hf_hub_download


def import_module_from_file(module_name, file_path):
    """Import a module from a file path and register it under module_name."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # lets the downloaded files import each other by name
    spec.loader.exec_module(module)
    return module


def load_cupe_model(model_name="english", device="auto"):
    """🔄 Load CUPE model with automatic downloading from Hugging Face Hub"""

    model_files = {
        "english": "en_libri1000_uj01d_e199_val_GER=0.2307.ckpt",
        "multilingual-mls": "multi_MLS8_uh02_e36_val_GER=0.2334.ckpt",
        "multilingual-mswc": "multi_mswc38_ug20_e59_val_GER=0.5611.ckpt"
    }

    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # 📥 Download files automatically from Hugging Face Hub
    repo_id = "Tabahi/CUPE-2i"
    model_file = hf_hub_download(repo_id=repo_id, filename="model2i.py")
    windowing_file = hf_hub_download(repo_id=repo_id, filename="windowing.py")
    checkpoint = hf_hub_download(repo_id=repo_id, filename=f"ckpt/{model_files[model_name]}")
    model_utils_file = hf_hub_download(repo_id=repo_id, filename="model_utils.py")

    # 🔧 Import modules dynamically
    _ = import_module_from_file("model_utils", model_utils_file)
    model2i = import_module_from_file("model2i", model_file)
    windowing = import_module_from_file("windowing", windowing_file)

    # 🚀 Initialize model
    extractor = model2i.CUPEEmbeddingsExtractor(checkpoint, device=device)
    return extractor, windowing


# 🎯 Example usage
extractor, windowing = load_cupe_model("english")

# 🎵 Load and process your audio
audio, sr = torchaudio.load("your_audio.wav")  # shape: [channels, samples]
if audio.shape[0] > 1:
    # Assumption: the model expects a single channel, so mix stereo down to mono
    audio = audio.mean(dim=0, keepdim=True)
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)

# 📊 Add batch dimension and process
audio_batch = audio.unsqueeze(0)  # [1, 1, samples]
windowed_audio = windowing.slice_windows(audio_batch, 16000, 120, 80)
batch_size, num_windows, window_size = windowed_audio.shape
windows_flat = windowed_audio.reshape(-1, window_size)

# 🔮 Get predictions
logits_phonemes, logits_groups = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)

print(f"🔤 Phoneme logits shape: {logits_phonemes.shape}")  # [num_windows, frames_per_window, 66]
print(f"🏷️ Group logits shape: {logits_groups.shape}")     # [num_windows, frames_per_window, 16]
```
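The call `slice_windows(audio_batch, 16000, 120, 80)` above uses 120ms windows with an 80ms stride at 16 kHz. The arithmetic behind those numbers can be sketched in plain Python as follows. This is an illustration only: `window_starts` is a hypothetical helper, and the handling of the final partial window (here assumed to be covered by one extra, padded window) may differ from what `windowing.py` actually does.

```python
def window_starts(num_samples, sample_rate=16000, window_ms=120, stride_ms=80):
    """Return (window_len, stride_len, start_offsets), all in samples."""
    window = sample_rate * window_ms // 1000   # 120 ms -> 1920 samples
    stride = sample_rate * stride_ms // 1000   # 80 ms  -> 1280 samples
    if num_samples <= window:
        return window, stride, [0]
    # Ceiling division: assume a final padded window covers any leftover tail.
    count = 1 + -(-(num_samples - window) // stride)
    return window, stride, [i * stride for i in range(count)]


window, stride, starts = window_starts(10 * 16000)  # 10 seconds of audio
print(window, stride, len(starts))   # 1920 1280 125
print(window - stride)               # 640 samples (40 ms) shared by adjacent windows
```

Because consecutive windows overlap by 40ms under these settings, per-window predictions have to be merged back into one timeline, which is the job of `stich_window_predictions` in the manual setup.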

---

## 🔧 Advanced Usage (Manual Setup)

<details>
<summary><strong>Manual Setup Code</strong></summary>

For more control, see [run.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/run.py):

```python
import torch
import torchaudio
from model2i import CUPEEmbeddingsExtractor  # 🎯 Main CUPE model feature extractor
import windowing  # 🔧 Provides slice_windows, stich_window_predictions

# Load model from local checkpoint (falls back to CPU when CUDA is unavailable)
device = "cuda" if torch.cuda.is_available() else "cpu"
cupe_ckpt_path = "./ckpt/en_libri1000_uj01d_e199_val_GER=0.2307.ckpt"
extractor = CUPEEmbeddingsExtractor(cupe_ckpt_path, device=device)

# 🎵 Prepare audio
sample_rate = 16000
window_size_ms = 120
stride_ms = 80
max_wav_len = 10 * sample_rate  # 10 seconds

dummy_wav = torch.zeros(1, max_wav_len, dtype=torch.float32, device="cpu")
audio_batch = dummy_wav.unsqueeze(0)  # Add batch dimension -> [1, 1, samples]

# 🪟 Window the audio
windowed_audio = windowing.slice_windows(
    audio_batch.to(device),
    sample_rate,
    window_size_ms,
    stride_ms
)
batch_size, num_windows, window_size = windowed_audio.shape
windows_flat = windowed_audio.reshape(-1, window_size)

# 🔮 Get predictions
logits, _ = extractor.predict(windows_flat, return_embeddings=False, groups_only=False)

# 🔄 Reshape and stitch window predictions
frames_per_window = logits.shape[1]
logits = logits.reshape(batch_size, num_windows, frames_per_window, -1)
logits = windowing.stich_window_predictions(
    logits,
    original_audio_length=audio_batch.size(2),
    cnn_output_size=frames_per_window,
    sample_rate=sample_rate,
    window_size_ms=window_size_ms,
    stride_ms=stride_ms
)

print(f"📊 Output shape: {logits.shape}")  # [B, T, 66]
```

</details>

---

## 📊 Output Format

- 🔤 **Phoneme logits**: `(time_frames, 66)` - 66 IPA phoneme classes
- 🏷️ **Group logits**: `(time_frames, 16)` - 16 phoneme groups
- ⏱️ **Time resolution**: ~16ms per frame (~62.5 FPS)
- 🗺️ **Mapping**: See [mapper.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/mapper.py) for phoneme-to-index mapping

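Given the ~16ms frame resolution listed above, per-frame class indices (for example, obtained with `logits.argmax(dim=-1)` on the stitched logits) can be turned into timed segments by merging runs of identical indices. The sketch below is a minimal illustration in plain Python; `frames_to_segments` is a hypothetical helper, the frame indices are made up, and the fixed 16ms step is an approximation taken from this section rather than from the model code.

```python
from itertools import groupby

FRAME_MS = 16  # approximate frame duration stated in this section


def frames_to_segments(frame_ids, frame_ms=FRAME_MS):
    """Merge consecutive identical frame labels into (label, start_ms, end_ms)."""
    segments = []
    index = 0
    for label, run in groupby(frame_ids):
        length = len(list(run))
        segments.append((label, index * frame_ms, (index + length) * frame_ms))
        index += length
    return segments


# Made-up frame indices for illustration.
frame_ids = [66, 66, 29, 29, 29, 7]
segments = frames_to_segments(frame_ids)
print(segments)  # [(66, 0, 32), (29, 32, 80), (7, 80, 96)]
```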
---

## ✨ Key Features

- 🚀 **No manual downloads** - automatic via Hugging Face Hub
- **Multiple languages** - English + 37 other languages
- ⚡ **Real-time capable** - faster than real-time on GPU
- ⏱️ **Frame-level timing** - 16ms resolution
- 🎯 **Contextless** - each frame processed independently

---

## 🎨 Custom Dataset for Training

<details>
<summary>🔧 <strong>Training Setup</strong></summary>

- 📋 See [mapper.py](https://huggingface.co/Tabahi/CUPE-2i/blob/main/mapper.py) for tokenization (66 phonemes + 16 groups)
- 🔤 Use IPA-based grapheme-to-phoneme tools: [Espeak-ng](https://pypi.org/project/espeakng/)
- Convert words to IPA sequences: [phonemizer](https://pypi.org/project/phonemizer/3.0.1/)
- 🗺️ Map IPA phonemes to tokens: [IPAPhonemeMapper](https://github.com/tabahi/IPAPhonemeMapper)

**Token Mapping:**
- Token 0: 🔇 Silence
- Tokens 1-65: 🔤 IPA phonemes
- Token 66: 📻 Blank/noise

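The token layout above can be used to clean up a frame-level label sequence, for instance when preparing or inspecting training targets. The following sketch assumes exactly the ranges listed (0 for silence, 1-65 for phonemes, 66 for blank/noise); the helper names and the example sequence are hypothetical and are not part of `mapper.py`.

```python
from itertools import groupby

SILENCE_TOKEN = 0
BLANK_TOKEN = 66


def token_kind(token_id):
    """Classify a token id according to the mapping listed above."""
    if token_id == SILENCE_TOKEN:
        return "silence"
    if token_id == BLANK_TOKEN:
        return "blank"
    if 1 <= token_id <= 65:
        return "phoneme"
    raise ValueError(f"token id out of range: {token_id}")


def keep_phonemes(frame_ids):
    """Collapse repeated frames first, then drop silence and blank tokens."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [t for t in collapsed if token_kind(t) == "phoneme"]


# Made-up frame labels: the two runs of 12 are separated by a blank,
# so they survive as two distinct phoneme tokens.
frames = [0, 0, 12, 12, 66, 12, 30, 30, 0]
print(keep_phonemes(frames))  # [12, 12, 30]
```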
</details>

---

## 🎯 Use Cases

- ⏰ **Timestamp alignment** (examples coming soon)
- 📊 **Speech analysis**
- 🔍 **Phoneme recognition**
- 🎵 **Audio processing**

---

## 📊 Visual Results

### 📈 Sample Probabilities Timeline
![Sample output logits plot](plots/where_they_went_timeline.png)

### Multilingual Confusion Plot
![Multilingual Confusion Plot (counts)](plots/uh02_multilingual_MLS8.png)

### 🇬🇧 English-only Confusion Plot
![English-only Confusion Plot (probabilities)](plots/uh03b_confusion_probs_heatmap_libri_dev_en.png)

---

## 📖 Citation

📄 **Paper**: [CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing](https://arxiv.org/abs/2508.15316)

```bibtex
@inproceedings{rehman2025cupe,
  title     = {CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing},
  author    = {Abdul Rehman and Jian-Jun Zhang and Xiaosong Yang},
  booktitle = {Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025)},
  year      = {2025},
  organization = {ICNLSP},
  publisher = {International Conference on Natural Language and Speech Processing},
}
```

---

<div align="center">

### 🌟 **Star this repository if you find it helpful!** ⭐

[![GitHub stars](https://img.shields.io/github/stars/tabahi/contexless-phonemes-CUPE?style=social)](https://github.com/tabahi/contexless-phonemes-CUPE)
[![Hugging Face likes](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Like-blue)](https://huggingface.co/Tabahi/CUPE-2i)

</div>