# [sentence-transformers/static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1)

License: [apache-2.0](https://choosealicense.com/licenses/apache-2.0/)

English-only uncased similarity embeddings trained with Matryoshka loss,
which allows for more effective truncation of the embedding vectors. The model
was trained on monolingual datasets from a variety of domains and was designed
specifically for similarity retrieval.

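
Matryoshka-trained embeddings are typically shortened by keeping the leading dimensions and re-normalizing. The following is a minimal sketch of that idea, not this model's own code: the helper names and the 8-dimensional vectors are hypothetical stand-ins for the real 1,024-dimensional embeddings.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale to unit L2 length."""
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings, made up for illustration only.
query = [0.9, -0.4, 0.3, 0.1, 0.05, -0.02, 0.01, 0.03]
document = [0.8, -0.5, 0.2, 0.2, -0.04, 0.03, 0.02, -0.01]

full_score = dot(truncate_and_normalize(query, 8), truncate_and_normalize(document, 8))
short_score = dot(truncate_and_normalize(query, 4), truncate_and_normalize(document, 4))
print(f"all 8 dims: {full_score:.4f}, first 4 dims: {short_score:.4f}")
```

Truncation trades some accuracy for smaller vectors; how much is lost at a given size has to be measured on real data.
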
## Model Stats

Statistics describing the shape and value distribution of the embedding tensor.

| item          | metric     | value  |
| ------------- | ---------- | ------ |
| vocab         | size       | 30,522 |
| embedding     | dimensions | 1,024  |
| vector length | mean       | 555.04 |
| vector length | median     | 573.92 |
| vector length | stddev     | 219.06 |
| values        | mean       | 0.02   |
| values        | median     | 0.01   |
| values        | stddev     | 18.65  |

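
As a rough sketch of how statistics like these could be computed, the snippet below summarizes a tiny made-up table. It assumes that "vector length" means the L2 norm of each row and that the standard deviation is the population form; neither is stated in the source, and the function names are hypothetical.

```python
import math
import statistics

def l2_norm(vector):
    """Euclidean length of one embedding row."""
    return math.sqrt(sum(x * x for x in vector))

def summarize(samples):
    """Mean, median, and population standard deviation of a list of numbers."""
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "stddev": statistics.pstdev(samples),  # assumption: population stddev
    }

def describe(embedding_table):
    """Shape and distribution statistics for a list-of-rows embedding table."""
    lengths = [l2_norm(row) for row in embedding_table]
    values = [x for row in embedding_table for x in row]
    return {
        "vocab size": len(embedding_table),
        "embedding dimensions": len(embedding_table[0]),
        "vector length": summarize(lengths),
        "values": summarize(values),
    }

# Toy 3 x 2 table standing in for the real 30,522 x 1,024 tensor.
toy_table = [[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]]
stats = describe(toy_table)
for name, value in stats.items():
    print(name, value)
```
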
## Mean Pooled Quantization Loss

This test round-trips the vectors through quantization but performs the
mean pooling arithmetic in float32 space. The quantized and unquantized
mean pooled vectors are then compared by cosine similarity to show how
much the meaning of the vector has changed due to quantization.

| Precision | Cosine Similarity |
| --------- | ----------------- |
| fp16      | 1.00000           |
| fp8 e4m3  | 0.99972           |
| fp8 e5m2  | 0.99887           |

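
The round-trip test described in this section can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script that produced the table: it covers only fp16 (via the `struct` module's half-precision `e` format), pools in Python's float64 rather than float32, and uses made-up token vectors.

```python
import math
import struct

def roundtrip_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical token vectors for one input sentence.
token_vectors = [
    [12.337, -0.0421, 7.9913, -25.118],
    [-3.1416, 18.652, 0.0123, 4.4409],
    [0.5551, -9.8765, 31.001, -0.3333],
]

# Quantize each stored vector, then pool both versions at full precision.
quantized_vectors = [[roundtrip_fp16(x) for x in row] for row in token_vectors]
similarity = cosine_similarity(mean_pool(token_vectors), mean_pool(quantized_vectors))
print(f"fp16 mean pooled cosine similarity: {similarity:.5f}")
```

An fp8 variant would need a custom rounding function, since `struct` has no 8-bit float format.
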
## Quantization Loss Per Vector

While the embedding vectors will ultimately be mean pooled together, it's
still useful to look at the per-vector loss in the embedding table to see
which quantization strategies retain the most vector meaning.

- **Cosine Similarity**: measures how well the *direction* of embedding vectors
  is preserved after quantization, independent of scale. This is especially
  relevant when embeddings are used for similarity search or retrieval.
- **MSE (Mean Squared Error)**: emphasizes large errors by squaring the
  differences. Useful for detecting whether any values are badly distorted.
- **MAE (Mean Absolute Error)**: the average absolute difference between
  original and quantized values. Easier to interpret, less sensitive to outliers.

| Precision | Metric            | Value   |
| --------- | ----------------- | ------- |
| fp16      | cosine similarity | 1.00000 |
| fp8 e4m3  | cosine similarity | 0.99965 |
| fp8 e5m2  | cosine similarity | 0.99861 |
| fp16      | MSE               | 0.00001 |
| fp8 e4m3  | MSE               | 0.24369 |
| fp8 e5m2  | MSE               | 0.96497 |
| fp16      | MAE               | 0.00244 |
| fp8 e4m3  | MAE               | 0.31206 |
| fp8 e5m2  | MAE               | 0.62205 |

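
To make the difference between the three metrics concrete, here is a small sketch on hand-picked vectors. The vectors and function names are hypothetical; they are chosen so that two distortions share the same MAE but differ in MSE, and so that a pure rescaling leaves cosine similarity unchanged.

```python
import math

def cosine_similarity(original, quantized):
    dot = sum(o * q for o, q in zip(original, quantized))
    norm_o = math.sqrt(sum(o * o for o in original))
    norm_q = math.sqrt(sum(q * q for q in quantized))
    return dot / (norm_o * norm_q)

def mean_squared_error(original, quantized):
    return sum((o - q) ** 2 for o, q in zip(original, quantized)) / len(original)

def mean_absolute_error(original, quantized):
    return sum(abs(o - q) for o, q in zip(original, quantized)) / len(original)

original = [1.0, -2.0, 3.0, -4.0]
scaled = [2.0, -4.0, 6.0, -8.0]   # same direction, twice the length
spread = [1.5, -1.5, 3.5, -3.5]   # every value off by 0.5
spiky = [1.0, -2.0, 3.0, -2.0]    # one value off by 2.0

# Cosine similarity ignores scale, while MSE and MAE do not.
print(cosine_similarity(original, scaled), mean_squared_error(original, scaled))

# Same MAE for both distortions, but MSE is larger for the single big error.
for name, candidate in (("spread", spread), ("spiky", spiky)):
    print(name, mean_absolute_error(original, candidate), mean_squared_error(original, candidate))
```
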
## Tokenizer Examples

**Input:** This is an example of encoding<br/>
**Tokens:** `[CLS]` `this` `is` `an` `example` `of` `encoding` `[SEP]`

**Input:** The quick brown fox jumps over the lazy dog.<br/>
**Tokens:** `[CLS]` `the` `quick` `brown` `fox` `jumps` `over` `the` `lazy` `dog` `.` `[SEP]`

**Input:** Curaçao, naïve fiancé, jalapeño, déjà vu.<br/>
**Tokens:** `[CLS]` `cu` `##rac` `##ao` `,` `naive` `fiance` `,` `ja` `##la` `##pen` `##o` `,` `de` `##ja` `vu` `.` `[SEP]`

**Input:** Привет, как дела?<br/>
**Tokens:** `[CLS]` `п` `##р` `##и` `##в` `##е` `##т` `,` `к` `##а` `##к` `д` `##е` `##л` `##а` `?` `[SEP]`

**Input:** Бързата кафява лисица прескача мързеливото куче.<br/>
**Tokens:** `[CLS]` `б` `##ъ` `##р` `##з` `##а` `##т` `##а` `к` `##а` `##ф` `##я` `##в` `##а` `л` `##и` `##с` `##и` `##ц` `##а` `п` `##р` `##е` `##с` `##ка` `##ч` `##а` `м` `##ъ` `##р` `##з` `##е` `##л` `##и` `##в` `##о` `##т` `##о` `к` `##у` `##ч` `##е` `.` `[SEP]`

**Input:** Γρήγορη καφέ αλεπού πηδάει πάνω από τον τεμπέλη σκύλο.<br/>
**Tokens:** `[CLS]` `γ` `##ρ` `##η` `##γ` `##ο` `##ρ` `##η` `κ` `##α` `##φ` `##ε` `α` `##λ` `##ε` `##π` `##ου` `π` `##η` `##δ` `##α` `##ε` `##ι` `π` `##α` `##ν` `##ω` `α` `##π` `##ο` `τ` `##ο` `##ν` `τ` `##ε` `##μ` `##π` `##ε` `##λ` `##η` `σ` `##κ` `##υ` `##λ` `##ο` `.` `[SEP]`

**Input:** اللغة العربية جميلة وغنية بالتاريخ.<br/>
**Tokens:** `[CLS]` `ا` `##ل` `##ل` `##غ` `##ة` `ا` `##ل` `##ع` `##ر` `##ب` `##ي` `##ة` `ج` `##م` `##ي` `##ل` `##ة` `و` `##غ` `##ن` `##ي` `##ة` `ب` `##ا` `##ل` `##ت` `##ا` `##ر` `##ي` `##خ` `.` `[SEP]`

**Input:** مرحبا بالعالم!<br/>
**Tokens:** `[CLS]` `م` `##ر` `##ح` `##ب` `##ا` `ب` `##ا` `##ل` `##ع` `##ا` `##ل` `##م` `!` `[SEP]`

**Input:** Simplified: 快速的棕色狐狸跳过懒狗。<br/>
**Tokens:** `[CLS]` `simplified` `:` `[UNK]` `[UNK]` `的` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `。` `[SEP]`

**Input:** Traditional: 快速的棕色狐狸跳過懶狗。<br/>
**Tokens:** `[CLS]` `traditional` `:` `[UNK]` `[UNK]` `的` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `。` `[SEP]`

**Input:** 素早い茶色の狐が怠け者の犬を飛び越える。<br/>
**Tokens:** `[CLS]` `[UNK]` `[UNK]` `い` `[UNK]` `[UNK]` `の` `[UNK]` `か` `[UNK]` `け` `[UNK]` `の` `犬` `を` `[UNK]` `ひ` `[UNK]` `え` `##る` `。` `[SEP]`

**Input:** コンピュータープログラミング<br/>
**Tokens:** `[CLS]` `コ` `##ン` `##ヒ` `##ュ` `##ー` `##タ` `##ー` `##フ` `##ロ` `##ク` `##ラ` `##ミ` `##ン` `##ク` `[SEP]`

**Input:** 빠른 갈색 여우가 게으른 개를 뛰어넘습니다.<br/>
**Tokens:** `[CLS]` `[UNK]` `ᄀ` `##ᅡ` `##ᆯ` `##ᄉ` `##ᅢ` `##ᆨ` `ᄋ` `##ᅧ` `##ᄋ` `##ᅮ` `##ᄀ` `##ᅡ` `ᄀ` `##ᅦ` `##ᄋ` `##ᅳ` `##ᄅ` `##ᅳ` `##ᆫ` `ᄀ` `##ᅢ` `##ᄅ` `##ᅳ` `##ᆯ` `[UNK]` `.` `[SEP]`

**Input:** तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।<br/>
**Tokens:** `[CLS]` `त` `##ज` `भ` `##र` `##ी` `ल` `##ो` `##म` `##ड` `##ी` `आ` `##ल` `##स` `##ी` `क` `##त` `##त` `क` `[UNK]` `क` `##द` `##त` `##ी` `ह` `।` `[SEP]`

**Input:** দ্রুত বাদামী শিয়াল অলস কুকুরের উপর দিয়ে লাফ দেয়।<br/>
**Tokens:** `[CLS]` `দ` `##র` `##ত` `ব` `##া` `##দ` `##া` `##ম` `##ী` `শ` `##ি` `##য` `##া` `##ল` `অ` `##ল` `##স` `ক` `##ক` `##র` `##ে` `##র` `উ` `##প` `##র` `দ` `##ি` `##য` `##ে` `[UNK]` `দ` `##ে` `##য` `।` `[SEP]`

**Input:** வேகமான பழுப்பு நரி சோம்பேறி நாயின் மேல் குதிக்கிறது.<br/>
**Tokens:** `[CLS]` `வ` `##ே` `##க` `##ம` `##ா` `##ன` `[UNK]` `ந` `##ர` `##ி` `[UNK]` `ந` `##ா` `##ய` `##ி` `##ன` `ம` `##ே` `##ல` `[UNK]` `.` `[SEP]`

**Input:** สุนัขจิ้งจอกสีน้ำตาลกระโดดข้ามสุนัขขี้เกียจ.<br/>
**Tokens:** `[CLS]` `[UNK]` `.` `[SEP]`

**Input:** ብሩክ ቡናማ ቀበሮ ሰነፍ ውሻን ተዘልሏል።<br/>
**Tokens:** `[CLS]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[UNK]` `[SEP]`

**Input:** Hello 世界 مرحبا 🌍<br/>
**Tokens:** `[CLS]` `hello` `世` `[UNK]` `م` `##ر` `##ح` `##ب` `##ا` `[UNK]` `[SEP]`

**Input:** 123, αβγ, абв, العربية, 中文, हिन्दी.<br/>
**Tokens:** `[CLS]` `123` `,` `α` `##β` `##γ` `,` `а` `##б` `##в` `,` `ا` `##ل` `##ع` `##ر` `##ب` `##ي` `##ة` `,` `中` `文` `,` `ह` `##ि` `##न` `##द` `##ी` `.` `[SEP]`
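
The outputs above look like BERT-style uncased WordPiece tokenization (lowercasing, stripped accents, `##` continuation pieces, `[UNK]` for unmatched words). That is an inference from the examples, not something the source states. Under that assumption, the sketch below shows the two core steps on a toy vocabulary; `toy_vocab` and both function names are hypothetical, and the real tokenizer also splits punctuation and CJK characters, which is omitted here.

```python
import unicodedata

def normalize_uncased(text):
    """Lowercase, decompose (NFD), and drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny made-up vocabulary; the real one has 30,522 entries.
toy_vocab = {"ja", "##la", "##pen", "##o", "naive"}

for word in ("jalapeño", "Naïve", "fiancé"):
    print(word, wordpiece(normalize_uncased(word), toy_vocab))
```

Dropping combining marks after decomposition would also explain why `ピ` appears as `ヒ` and why Hangul syllables appear as individual jamo in the examples above.
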