rogkesavan committed
Commit d5e8c7a · verified · 1 Parent(s): e785c97

Update README.md

Files changed (1)
  1. README.md +95 -182
README.md CHANGED
@@ -13,10 +13,10 @@ base_model:
13
  - MiniMaxAI/MiniMax-M2
14
  ---
15
 
16
- # THRIFT — Targeted Reduction for Inference and Fine-Tuning
17
 
18
  ![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)
19
 
 
20
 
21
  A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
22
 
@@ -181,184 +181,97 @@ This model is derived from MiniMax-M2 and distributed under the MIT License [htt
181
 
182
  Model conversion and HF Transformers code by @Qubitum at ModelCloud.
183
 
184
- Positive references to related work:
185
-
186
- * Cerebras — [https://arxiv.org/abs/2510.13999](https://arxiv.org/abs/2510.13999)
187
- * Alibaba Cloud Computing — [https://arxiv.org/html/2511.01354v1](https://arxiv.org/html/2511.01354v1)
188
- * QLoRA — [https://arxiv.org/abs/2307.02973](https://arxiv.org/abs/2307.02973)
-
- # THRIFT — Targeted Reduction for Inference and Fine-Tuning
189
-
190
- A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
191
-
192
- ## TLDR
193
-
194
- We, the over-caffeinated researchers at VibeStud.io, wanted to create a 50% pruned version of the SOTA MiniMax M2 that is best suited for local/air-gapped coding. With this version we achieved \~25%. A 50% pruned version is under development, while a not-so-sucky team of ours is working on a 50% pruned version of Kimi K2 Thinking. Check back later, cheers\!
195
-
196
- ## Why it’s useful
197
-
198
- * **Lower latency:** Snappier responses for interactive apps and chatbots.
199
- * **Smaller memory footprint:** Runs on cheaper GPUs or with fewer resources per replica.
200
- * **Higher throughput:** Serve more concurrent users at the same cost.
201
- * **Deployment-friendly:** Drop-in replacement for the base model in most inference stacks (see the loading sketch below).
202
- * **Adaptable:** Supports light fine-tuning to match your domain and style guidelines (see the fine-tuning sketch after the "When to use it" list).
203
-
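The "drop-in replacement" point above can be made concrete with a minimal loading sketch. It assumes the checkpoint loads through the standard Hugging Face `transformers` auto classes under the repo id `VibeStudio/MiniMax-M2-THRIFT` (taken from this card's image URL); the `trust_remote_code` flag and BF16 dtype are assumptions, not confirmed settings.

```python
# Minimal loading sketch (assumptions: standard transformers auto classes work
# for this checkpoint; trust_remote_code may be needed for MiniMax-M2 derivatives).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VibeStudio/MiniMax-M2-THRIFT"  # repo id assumed from this model card

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 comparison baseline
    device_map="auto",           # spread layers across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The same checkpoint should slot into vLLM or similar serving stacks wherever the base MiniMax-M2 is already supported.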
204
- ## Intended use
205
-
206
- * General chat and coding assistance
207
- * Enterprise assistants with strict latency/VRAM budgets
208
- * Batch or realtime serving in cloud and on-prem environments
209
- * Edge or cost-sensitive deployments where efficiency matters
210
-
211
- ## When to use it
212
-
213
- * You’re constrained by GPU memory or need shorter response times
214
- * You want to increase QPS without scaling infrastructure
215
- * You need a model that is “good enough” for most tasks at a better cost profile
216
-
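On the fine-tuning side, the "Adaptable" bullet above refers to light adapter-style tuning rather than full retraining. Below is a hypothetical LoRA sketch using the `peft` library; the `target_modules` names are assumptions (common attention projection names) and must be checked against the actual MiniMax-M2 module names before use.

```python
# Hypothetical light fine-tuning sketch with LoRA adapters via peft.
# target_modules are assumed names and may not match MiniMax-M2's architecture.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "VibeStudio/MiniMax-M2-THRIFT", trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights train; the base stays frozen
# Train with your usual Trainer/SFT loop, then: model.save_pretrained("thrift-adapter/")
```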
217
- ---
218
-
219
- # Model Comparison Report
220
-
221
- **Models Under Evaluation**
222
-
223
- | Model | Type |
224
- | :---- | :---- |
225
- | ModelCloud/MiniMax-M2-BF16 | Base Model |
226
- | VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |
227
-
228
- **Evaluation Date: November 7, 2025**
229
-
230
- ## 📊 Results Comparison
231
-
232
- ### 1\) Multiple Choice Q\&A (lm-eval)
233
-
234
- **Overall MMLU Performance**
235
-
236
- | Model | MMLU Overall | Humanities | STEM | Social Sciences | Other |
237
- | :---- | ----: | ----: | ----: | ----: | ----: |
238
- | MiniMax-M2-BF16 | 83.16% | 77.45% | 80.91% | 90.02% | 87.29% |
239
- | MiniMax-M2-THRIFT | 77.72% | 70.14% | 77.61% | 86.84% | 80.27% |
240
- | **Δ (Difference)** | **\-5.44%** | **\-7.31%** | **\-3.30%** | **\-3.18%** | **\-7.02%** |
241
-
242
- **Individual Task Performance**
243
-
244
- | Task | BF16 (Base) | THRIFT-BF16 | Difference |
245
- | :---- | ----: | ----: | ----: |
246
- | arc\_challenge | 73.21% | 61.01% | \-12.20% ⬇️ |
247
- | arc\_easy | 88.30% | 83.08% | \-5.22% ⬇️ |
248
- | boolq | 87.95% | 84.95% | \-3.00% ⬇️ |
249
- | hellaswag | 83.00% | 77.09% | \-5.91% ⬇️ |
250
- | mmlu | 83.16% | 77.72% | \-5.44% ⬇️ |
251
- | openbookqa | 48.60% | 43.00% | \-5.60% ⬇️ |
252
- | rte | 75.45% | 80.14% | **\+4.69% ⬆️** |
253
- | winogrande | 76.48% | 74.90% | \-1.58% ⬇️ |
254
-
255
- **Average Accuracy Drop: \-4.28%**
256
-
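The multiple-choice numbers above were produced with lm-eval. A sketch of re-running the same suite through the lm-evaluation-harness Python API follows; the task names, metric keys, and `simple_evaluate` arguments are assumptions against a recent harness release and may differ from the exact setup used for this table.

```python
# Sketch: reproduce the multiple-choice suite with lm-evaluation-harness
# (pip install lm-eval). Task names and metric keys are assumptions and may
# not match the harness version used for the reported numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=VibeStudio/MiniMax-M2-THRIFT,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "arc_easy", "boolq",
           "hellaswag", "openbookqa", "rte", "winogrande"],
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```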
257
- ### 2\) Code Generation (EvalPlus)
258
-
259
- **MBPP Results**
260
-
261
- | Model | MBPP (base) | MBPP+ (extended) |
262
- | :---- | ----: | ----: |
263
- | MiniMax-M2-BF16 | 73.8% | 64.0% |
264
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
265
-
266
- **HumanEval Results**
267
-
268
- | Model | HumanEval (base) | HumanEval+ (extended) |
269
- | :---- | ----: | ----: |
270
- | MiniMax-M2-BF16 | Complete | Complete |
271
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
272
-
273
- ### 3\) Math Benchmarks
274
-
275
- **GSM8K Results**
276
-
277
- | Model | Accuracy | Problems |
278
- | :---- | ----: | ----: |
279
- | MiniMax-M2-BF16 | 92.72% | 1,319 |
280
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 1,319 |
281
-
282
- **MATH-500 Results**
283
-
284
- | Model | Overall | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
285
- | :---- | ----: | ----: | ----: | ----: | ----: | ----: |
286
- | MiniMax-M2-BF16 | 87.2% | 90.7% | 95.56% | 82.86% | 85.16% | 85.82% |
287
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 | 🔄 | 🔄 | 🔄 | 🔄 |
288
-
289
- ### 4\) LiveCodeBench (Live Coding Problems)
290
-
291
- | Model | pass@1 | Problems | Status |
292
- | :---- | ----: | ----: | :---- |
293
- | **MiniMax-M2-BF16** | **35.71%** | 182 | ✅ Complete |
294
- | **MiniMax-M2-THRIFT** | 🔄 Coming Soon | 182 | ⏳ Not Started Yet |
295
-
296
- ---
297
-
298
- ## 📈 Analysis (Preliminary)
299
-
300
- ### Key Findings
301
-
302
- **MMLU Performance Drop**
303
-
304
- * THRIFT-BF16 shows **\-5.44%** overall MMLU drop
305
- * Largest drop: **arc\_challenge (-12.20%)**
306
- * Smallest drop: **winogrande (-1.58%)**
307
- * **RTE improved by \+4.69%** 🎉
308
-
309
- **Subject-Specific Performance**
310
-
311
- * Best preservation: **Social Sciences (-3.18%)**
312
- * Most degraded: **Other (-7.02%)**
313
- * STEM: **Moderate drop (-3.30%)**
314
-
315
- **Compression Trade-off**
316
-
317
- * THRIFT-BF16 (compressed) vs BF16 (base)
318
- * Average accuracy loss: **\~4–5%**
319
- * Expected for compressed/quantized models
320
-
321
- **MMLU Category Breakdown**
322
-
323
- | Category | BF16 (Base) | THRIFT-BF16 | Difference | Status |
324
- | :---- | ----: | ----: | ----: | :---- |
325
- | High School Government | 97.93% | 94.82% | \-3.11% | ✅ Still Excellent |
326
- | High School Psychology | 95.41% | 93.58% | \-1.83% | ✅ Well Preserved |
327
- | Marketing | 95.73% | 91.88% | \-3.85% | ✅ Good |
328
- | Professional Medicine | 92.28% | 79.78% | \-12.50% | ⚠️ Notable Drop |
329
- | Clinical Knowledge | 92.83% | 85.66% | \-7.17% | ⚠️ Moderate Drop |
330
-
331
- ---
332
-
333
- ## Benchmarks
334
-
335
- Coming soon.
336
-
337
- ## Research paper
338
-
339
- Coming soon.
340
-
341
- ---
342
-
343
- ## License
344
-
345
- This model is derived from MiniMax-M2 and distributed under the MIT License [http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE](http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE)
346
-
347
- ---
348
-
349
- ## Credits
350
-
351
- Model conversion and HF Transformers code by @Qubitum at ModelCloud.
352
-
353
- A special thanks to Cerebras for their contributions and innovations.
354
-
355
- Positive references to related work:
356
- * Alibaba Cloud Computing — [https://arxiv.org/html/2511.01354v1](https://arxiv.org/html/2511.01354v1)
357
- * Cerebras — [https://arxiv.org/abs/2510.13999](https://arxiv.org/abs/2510.13999)
358
- * QLoRA — [https://arxiv.org/abs/2307.02973](https://arxiv.org/abs/2307.02973)
359
- * SparseGPT ([https://arxiv.org/abs/2301.00774](https://arxiv.org/abs/2301.00774))
360
- * Wanda ([https://arxiv.org/abs/2306.11695](https://arxiv.org/abs/2306.11695))
361
- * LLM-Pruner ([https://arxiv.org/abs/2305.11627](https://arxiv.org/abs/2305.11627))
362
- * Sheared-LLaMA ([https://arxiv.org/abs/2310.06694](https://arxiv.org/abs/2310.06694))
363
- * Wanda++ (2025) ([https://arxiv.org/abs/2503.04992](https://arxiv.org/abs/2503.04992))
364
- * Týr-the-Pruner ([https://arxiv.org/abs/2503.09657](https://arxiv.org/abs/2503.09657))
 
13
  - MiniMaxAI/MiniMax-M2
14
  ---
15
 
 
16
 
17
  ![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)
18
 
19
+ # THRIFT — Targeted Reduction for Inference and Fine-Tuning
20
 
21
  A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
22
 
 
181
 
182
  Model conversion and HF Transformers code by @Qubitum at ModelCloud.
183
 
184
+ ## References (BibTeX)
185
+
186
+ ```bibtex
187
+ @article{cai2025thinking,
188
+ title = {Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series},
189
+ author = {Cai, Wenrui and Wang, Chengyu and Yan, Junbing and Huang, Jun and Fang, Xiangzhong},
190
+ journal = {arXiv preprint arXiv:2511.01354},
191
+ year = {2025},
192
+ eprinttype = {arXiv},
193
+ eprint = {2511.01354},
194
+ primaryclass = {cs.CL},
195
+ institution = {Shanghai Jiao Tong University and Alibaba Cloud Computing},
196
+ note = {License: arXiv.org perpetual non-exclusive license}
197
+ }
198
+
199
+ @misc{lasby-reap,
200
+ title = {{REAP the Experts: Why Pruning Prevails for One-Shot MoE compression}},
201
+ author = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
202
+ year = {2025},
203
+ publisher = {arXiv},
204
+ note = {arXiv:2510.13999v1 [cs]},
205
+ url = {https://arxiv.org/abs/2510.13999v1},
206
+ }
207
+
208
+ @article{yang2025wanda++,
209
+ title = {Wanda++: Pruning Large Language Models via Regional Gradients},
210
+ author = {Yang, Yifan and Zhen, Kai and Ganesh, Bhavana and Galstyan, Aram and Huybrechts, Goeric and M{\"u}ller, Markus and K{\"u}bler, Jonas M. and Swaminathan, Rupak Vignesh and Mouchtaris, Athanasios and Bodapati, Sravan Babu and Susanj, Nathan and Zhang, Zheng and FitzGerald, Jack and Kumar, Abhishek},
211
+ journal = {arXiv preprint arXiv:2503.04992},
212
+ year = {2025},
213
+ eprinttype = {arXiv},
214
+ eprint = {2503.04992},
215
+ primaryclass = {cs.CL}
216
+ }
217
+
218
+ @article{li2025tyr,
219
+ title = {Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization},
220
+ author = {Li, G. and Xu, Yixing and Li, Zeping and Liu, Ji and Yin, Xuanwu and Li, Dong and Barsoum, Emad},
221
+ journal = {arXiv preprint arXiv:2503.09657},
222
+ year = {2025},
223
+ eprinttype = {arXiv},
224
+ eprint = {2503.09657},
225
+ primaryclass = {cs.CL}
226
+ }
227
+
228
+ @article{xia2023sheared,
229
+ title = {Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning},
230
+ author = {Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi},
231
+ journal = {arXiv preprint arXiv:2310.06694},
232
+ year = {2023},
233
+ eprinttype = {arXiv},
234
+ eprint = {2310.06694},
235
+ primaryclass = {cs.CL}
236
+ }
237
+
238
+ @article{ma2023llmpruner,
239
+ title = {LLM-Pruner: On the Structural Pruning of Large Language Models},
240
+ author = {Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
241
+ journal = {arXiv preprint arXiv:2305.11627},
242
+ year = {2023},
243
+ eprinttype = {arXiv},
244
+ eprint = {2305.11627},
245
+ primaryclass = {cs.CL}
246
+ }
247
+
248
+ @article{sun2023wanda,
249
+ title = {A Simple and Effective Pruning Approach for Large Language Models},
250
+ author = {Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J. Zico},
251
+ journal = {arXiv preprint arXiv:2306.11695},
252
+ year = {2023},
253
+ eprinttype = {arXiv},
254
+ eprint = {2306.11695},
255
+ primaryclass = {cs.CL}
256
+ }
257
+
258
+ @article{frantar2023sparsegpt,
259
+ title = {SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot},
260
+ author = {Frantar, Elias and Alistarh, Dan},
261
+ journal = {arXiv preprint arXiv:2301.00774},
262
+ year = {2023},
263
+ eprinttype = {arXiv},
264
+ eprint = {2301.00774},
265
+ primaryclass = {cs.CL}
266
+ }
267
+
268
+ @article{dettmers2023qlora,
269
+ title = {QLoRA: Efficient Finetuning of Quantized LLMs},
270
+ author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
271
+ journal = {arXiv preprint arXiv:2305.14314},
272
+ year = {2023},
273
+ eprinttype = {arXiv},
274
+ eprint = {2305.14314},
275
+ primaryclass = {cs.CL}
276
+ }
277
+ ```