rogkesavan committed
Commit d5e8c7a · verified · 1 Parent(s): e785c97

Update README.md

Files changed (1)
  1. README.md +95 -182
README.md CHANGED
@@ -13,10 +13,10 @@ base_model:
13
  - MiniMaxAI/MiniMax-M2
14
  ---
15
 
16
- # THRIFT — Targeted Reduction for Inference and Fine-Tuning
17
 
18
  ![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)
19
 
 
20
 
21
  A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
22
 
@@ -181,184 +181,97 @@ This model is derived from MiniMax-M2 and distributed under the MIT License [htt
181
 
182
  Model conversion and HF Transformers code by @Qubitum at ModelCloud.
183
 
184
- Positive references to related work:
185
-
186
- * Cerebras — [https://arxiv.org/abs/2510.13999](https://arxiv.org/abs/2510.13999)
187
- * Alibaba Cloud Computing — [https://arxiv.org/html/2511.01354v1](https://arxiv.org/html/2511.01354v1)
188
- * QLoRA — [https://arxiv.org/abs/2307.02973](https://arxiv.org/abs/2307.02973)
-
- # THRIFT — Targeted Reduction for Inference and Fine-Tuning
189
-
190
- A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
191
-
192
- ## TLDR
193
-
194
- We, the over-caffeinated researchers at VibeStud.io, wanted to create a 50% pruned version of the SOTA MiniMax M2 that is best suited for local/air-gapped coding. With this version we achieved \~25%. A 50% pruned version is under development, while a not-so-sucky team of ours is working on a 50% pruned version of Kimi K2 Thinking. Check back later, cheers\!
195
-
196
- ## Why it’s useful
197
-
198
- * **Lower latency:** Snappier responses for interactive apps and chatbots.
199
- * **Smaller memory footprint:** Runs on cheaper GPUs or with fewer resources per replica.
200
- * **Higher throughput:** Serve more concurrent users at the same cost.
201
- * **Deployment-friendly:** Drop-in replacement for the base model in most inference stacks (see the loading sketch below).
202
- * **Adaptable:** Supports light fine-tuning to match your domain and style guidelines (see the fine-tuning sketch after the "When to use it" list).
203
-
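The "drop-in replacement" point above can be made concrete with a minimal loading sketch. It assumes the checkpoint loads through the standard Hugging Face `transformers` auto classes under the repo id `VibeStudio/MiniMax-M2-THRIFT` (taken from this card's image URL); the `trust_remote_code` flag and BF16 dtype are assumptions, not confirmed settings.

```python
# Minimal loading sketch (assumptions: standard transformers auto classes work
# for this checkpoint; trust_remote_code may be needed for MiniMax-M2 derivatives).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VibeStudio/MiniMax-M2-THRIFT"  # repo id assumed from this model card

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 comparison baseline
    device_map="auto",           # spread layers across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The same checkpoint should slot into vLLM or similar serving stacks wherever the base MiniMax-M2 is already supported.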
204
- ## Intended use
205
-
206
- * General chat and coding assistance
207
- * Enterprise assistants with strict latency/VRAM budgets
208
- * Batch or realtime serving in cloud and on-prem environments
209
- * Edge or cost-sensitive deployments where efficiency matters
210
-
211
- ## When to use it
212
-
213
- * You’re constrained by GPU memory or need shorter response times
214
- * You want to increase QPS without scaling infrastructure
215
- * You need a model that is “good enough” for most tasks at a better cost profile
216
-
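On the fine-tuning side, the "Adaptable" bullet above refers to light adapter-style tuning rather than full retraining. Below is a hypothetical LoRA sketch using the `peft` library; the `target_modules` names are assumptions (common attention projection names) and must be checked against the actual MiniMax-M2 module names before use.

```python
# Hypothetical light fine-tuning sketch with LoRA adapters via peft.
# target_modules are assumed names and may not match MiniMax-M2's architecture.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "VibeStudio/MiniMax-M2-THRIFT", trust_remote_code=True
)

lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights train; the base stays frozen
# Train with your usual Trainer/SFT loop, then: model.save_pretrained("thrift-adapter/")
```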
217
- ---
218
-
219
- # Model Comparison Report
220
-
221
- **Models Under Evaluation**
222
-
223
- | Model | Type |
224
- | :---- | :---- |
225
- | ModelCloud/MiniMax-M2-BF16 | Base Model |
226
- | VibeStudio/MiniMax-M2-THRIFT | Compressed/Optimized |
227
-
228
- **Evaluation Date: November 7, 2025**
229
-
230
- ## 📊 Results Comparison
231
-
232
- ### 1\) Multiple Choice Q\&A (lm-eval)
233
-
234
- **Overall MMLU Performance**
235
-
236
- | Model | MMLU Overall | Humanities | STEM | Social Sciences | Other |
237
- | :---- | ----: | ----: | ----: | ----: | ----: |
238
- | MiniMax-M2-BF16 | 83.16% | 77.45% | 80.91% | 90.02% | 87.29% |
239
- | MiniMax-M2-THRIFT | 77.72% | 70.14% | 77.61% | 86.84% | 80.27% |
240
- | **Δ (Difference)** | **\-5.44%** | **\-7.31%** | **\-3.30%** | **\-3.18%** | **\-7.02%** |
241
-
242
- **Individual Task Performance**
243
-
244
- | Task | BF16 (Base) | THRIFT-BF16 | Difference |
245
- | :---- | ----: | ----: | ----: |
246
- | arc\_challenge | 73.21% | 61.01% | \-12.20% ⬇️ |
247
- | arc\_easy | 88.30% | 83.08% | \-5.22% ⬇️ |
248
- | boolq | 87.95% | 84.95% | \-3.00% ⬇️ |
249
- | hellaswag | 83.00% | 77.09% | \-5.91% ⬇️ |
250
- | mmlu | 83.16% | 77.72% | \-5.44% ⬇️ |
251
- | openbookqa | 48.60% | 43.00% | \-5.60% ⬇️ |
252
- | rte | 75.45% | 80.14% | **\+4.69% ⬆️** |
253
- | winogrande | 76.48% | 74.90% | \-1.58% ⬇️ |
254
-
255
- **Average Accuracy Drop: \-4.28%**
256
-
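The multiple-choice numbers above were produced with lm-eval. A sketch of re-running the same suite through the lm-evaluation-harness Python API follows; the task names, metric keys, and `simple_evaluate` arguments are assumptions against a recent harness release and may differ from the exact setup used for this table.

```python
# Sketch: reproduce the multiple-choice suite with lm-evaluation-harness
# (pip install lm-eval). Task names and metric keys are assumptions and may
# not match the harness version used for the reported numbers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=VibeStudio/MiniMax-M2-THRIFT,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu", "arc_challenge", "arc_easy", "boolq",
           "hellaswag", "openbookqa", "rte", "winogrande"],
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```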
257
- ### 2\) Code Generation (EvalPlus)
258
-
259
- **MBPP Results**
260
-
261
- | Model | MBPP (base) | MBPP+ (extended) |
262
- | :---- | ----: | ----: |
263
- | MiniMax-M2-BF16 | 73.8% | 64.0% |
264
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
265
-
266
- **HumanEval Results**
267
-
268
- | Model | HumanEval (base) | HumanEval+ (extended) |
269
- | :---- | ----: | ----: |
270
- | MiniMax-M2-BF16 | Complete | Complete |
271
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 Coming Soon |
272
-
273
- ### 3\) Math Benchmarks
274
-
275
- **GSM8K Results**
276
-
277
- | Model | Accuracy | Problems |
278
- | :---- | ----: | ----: |
279
- | MiniMax-M2-BF16 | 92.72% | 1,319 |
280
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 1,319 |
281
-
282
- **MATH-500 Results**
283
-
284
- | Model | Overall | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
285
- | :---- | ----: | ----: | ----: | ----: | ----: | ----: |
286
- | MiniMax-M2-BF16 | 87.2% | 90.7% | 95.56% | 82.86% | 85.16% | 85.82% |
287
- | MiniMax-M2-THRIFT | 🔄 Coming Soon | 🔄 | 🔄 | 🔄 | 🔄 | 🔄 |
288
-
289
- ### 4\) LiveCodeBench (Live Coding Problems)
290
-
291
- | Model | pass@1 | Problems | Status |
292
- | :---- | ----: | ----: | :---- |
293
- | **MiniMax-M2-BF16** | **35.71%** | 182 | ✅ Complete |
294
- | **MiniMax-M2-THRIFT** | 🔄 Coming Soon | 182 | ⏳ Not Started Yet |
295
-
296
- ---
297
-
298
- ## 📈 Analysis (Preliminary)
299
-
300
- ### Key Findings
301
-
302
- **MMLU Performance Drop**
303
-
304
- * THRIFT-BF16 shows **\-5.44%** overall MMLU drop
305
- * Largest drop: **arc\_challenge (-12.20%)**
306
- * Smallest drop: **winogrande (-1.58%)**
307
- * **RTE improved by \+4.69%** 🎉
308
-
309
- **Subject-Specific Performance**
310
-
311
- * Best preservation: **Social Sciences (-3.18%)**
312
- * Most degraded: **Other (-7.02%)**
313
- * STEM: **Moderate drop (-3.30%)**
314
-
315
- **Compression Trade-off**
316
-
317
- * THRIFT-BF16 (compressed) vs BF16 (base)
318
- * Average accuracy loss: **\~4–5%**
319
- * Expected for compressed/quantized models
320
-
321
- **MMLU Category Breakdown**
322
-
323
- | Category | BF16 (Base) | THRIFT-BF16 | Difference | Status |
324
- | :---- | ----: | ----: | ----: | :---- |
325
- | High School Government | 97.93% | 94.82% | \-3.11% | ✅ Still Excellent |
326
- | High School Psychology | 95.41% | 93.58% | \-1.83% | ✅ Well Preserved |
327
- | Marketing | 95.73% | 91.88% | \-3.85% | ✅ Good |
328
- | Professional Medicine | 92.28% | 79.78% | \-12.50% | ⚠️ Notable Drop |
329
- | Clinical Knowledge | 92.83% | 85.66% | \-7.17% | ⚠️ Moderate Drop |
330
-
331
- ---
332
-
333
- ## Benchmarks
334
-
335
- Coming soon.
336
-
337
- ## Research paper
338
-
339
- Coming soon.
340
-
341
- ---
342
-
343
- ## License
344
-
345
- This model is derived from MiniMax-M2 and distributed under the MIT License [http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE](http://github.com/MiniMax-AI/MiniMax-M2/blob/main/LICENSE)
346
-
347
- ---
348
-
349
- ## Credits
350
-
351
- Model conversion and HF Transformers code by @Qubitum at ModelCloud.
352
-
353
- A special thanks to Cerebras for their contributions and innovations.
354
-
355
- Positive references to related work:
356
- * Alibaba Cloud Computing — [https://arxiv.org/html/2511.01354v1](https://arxiv.org/html/2511.01354v1)
357
- * Cerebras — [https://arxiv.org/abs/2510.13999](https://arxiv.org/abs/2510.13999)
358
- * QLoRA — [https://arxiv.org/abs/2307.02973](https://arxiv.org/abs/2307.02973)
359
- * SparseGPT ([https://arxiv.org/abs/2301.00774](https://arxiv.org/abs/2301.00774))
360
- * Wanda ([https://arxiv.org/abs/2306.11695](https://arxiv.org/abs/2306.11695))
361
- * LLM-Pruner ([https://arxiv.org/abs/2305.11627](https://arxiv.org/abs/2305.11627))
362
- * Sheared-LLaMA ([https://arxiv.org/abs/2310.06694](https://arxiv.org/abs/2310.06694))
363
- * Wanda++ (2025) ([https://arxiv.org/abs/2503.04992](https://arxiv.org/abs/2503.04992))
364
- * Týr-the-Pruner ([https://arxiv.org/abs/2503.09657](https://arxiv.org/abs/2503.09657))
 
13
  - MiniMaxAI/MiniMax-M2
14
  ---
15
 
 
16
 
17
  ![Screenshot](https://huggingface.co/VibeStudio/MiniMax-M2-THRIFT/resolve/main/vibe_processed_by_imagy.png)
18
 
19
+ # THRIFT — Targeted Reduction for Inference and Fine-Tuning
20
 
21
  A performance-optimized variant of the base model that delivers faster responses and lower memory usage while preserving quality for everyday tasks, developed by VibeStud.io.
22
 
 
181
 
182
  Model conversion and HF Transformers code by @Qubitum at ModelCloud.
183
 
184
+ ## References (BibTeX)
185
+
186
+ ```bibtex
187
+ @article{cai2025thinking,
188
+ title = {Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series},
189
+ author = {Cai, Wenrui and Wang, Chengyu and Yan, Junbing and Huang, Jun and Fang, Xiangzhong},
190
+ journal = {arXiv preprint arXiv:2511.01354},
191
+ year = {2025},
192
+ eprinttype = {arXiv},
193
+ eprint = {2511.01354},
194
+ primaryclass = {cs.CL},
195
+ institution = {Shanghai Jiao Tong University and Alibaba Cloud Computing},
196
+ note = {License: arXiv.org perpetual non-exclusive license}
197
+ }
198
+
199
+ @misc{lasby-reap,
200
+ title = {{REAP the Experts: Why Pruning Prevails for One-Shot MoE compression}},
201
+ author = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
202
+ year = {2025},
203
+ publisher = {arXiv},
204
+ note = {arXiv:2510.13999v1 [cs]},
205
+ url = {https://arxiv.org/abs/2510.13999v1},
206
+ }
207
+
208
+ @article{yang2025wanda++,
209
+ title = {Wanda++: Pruning Large Language Models via Regional Gradients},
210
+ author = {Yang, Yifan and Zhen, Kai and Ganesh, Bhavana and Galstyan, Aram and Huybrechts, Goeric and M{\"u}ller, Markus and K{\"u}bler, Jonas M. and Swaminathan, Rupak Vignesh and Mouchtaris, Athanasios and Bodapati, Sravan Babu and Susanj, Nathan and Zhang, Zheng and FitzGerald, Jack and Kumar, Abhishek},
211
+ journal = {arXiv preprint arXiv:2503.04992},
212
+ year = {2025},
213
+ eprinttype = {arXiv},
214
+ eprint = {2503.04992},
215
+ primaryclass = {cs.CL}
216
+ }
217
+
218
+ @article{li2025tyr,
219
+ title = {Týr-the-Pruner: Structural Pruning LLMs via Global Sparsity Distribution Optimization},
220
+ author = {Li, G. and Xu, Yixing and Li, Zeping and Liu, Ji and Yin, Xuanwu and Li, Dong and Barsoum, Emad},
221
+ journal = {arXiv preprint arXiv:2503.09657},
222
+ year = {2025},
223
+ eprinttype = {arXiv},
224
+ eprint = {2503.09657},
225
+ primaryclass = {cs.CL}
226
+ }
227
+
228
+ @article{xia2023sheared,
229
+ title = {Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning},
230
+ author = {Xia, Mengzhou and Gao, Tianyu and Zeng, Zhiyuan and Chen, Danqi},
231
+ journal = {arXiv preprint arXiv:2310.06694},
232
+ year = {2023},
233
+ eprinttype = {arXiv},
234
+ eprint = {2310.06694},
235
+ primaryclass = {cs.CL}
236
+ }
237
+
238
+ @article{ma2023llmpruner,
239
+ title = {LLM-Pruner: On the Structural Pruning of Large Language Models},
240
+ author = {Ma, Xinyin and Fang, Gongfan and Wang, Xinchao},
241
+ journal = {arXiv preprint arXiv:2305.11627},
242
+ year = {2023},
243
+ eprinttype = {arXiv},
244
+ eprint = {2305.11627},
245
+ primaryclass = {cs.CL}
246
+ }
247
+
248
+ @article{sun2023wanda,
249
+ title = {A Simple and Effective Pruning Approach for Large Language Models},
250
+ author = {Sun, Mingjie and Liu, Zhuang and Bair, Anna and Kolter, J. Zico},
251
+ journal = {arXiv preprint arXiv:2306.11695},
252
+ year = {2023},
253
+ eprinttype = {arXiv},
254
+ eprint = {2306.11695},
255
+ primaryclass = {cs.CL}
256
+ }
257
+
258
+ @article{frantar2023sparsegpt,
259
+ title = {SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot},
260
+ author = {Frantar, Elias and Alistarh, Dan},
261
+ journal = {arXiv preprint arXiv:2301.00774},
262
+ year = {2023},
263
+ eprinttype = {arXiv},
264
+ eprint = {2301.00774},
265
+ primaryclass = {cs.CL}
266
+ }
267
+
268
+ @article{dettmers2023qlora,
269
+ title = {QLoRA: Efficient Finetuning of Quantized LLMs},
270
+ author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
271
+ journal = {arXiv preprint arXiv:2305.14314},
272
+ year = {2023},
273
+ eprinttype = {arXiv},
274
+ eprint = {2305.14314},
275
+ primaryclass = {cs.CL}
276
+ }
277
+ ```