# Kimi-Linear-48B-Instruct-GGUF

*Kimi Linear: An Expressive, Efficient Attention Architecture*
I am currently looking for open positions! 🤗 If you find this model useful or are looking for a talented AI/LLM Engineer, please reach out to me on LinkedIn: Aaryan Kapoor.
**Experimental Build Required 🚧** This model utilizes the Kimi Delta Attention (KDA) architecture, which is not yet supported in the main branch of `llama.cpp`. To run this GGUF, you must compile `llama.cpp` from PR #17592. Attempting to run it on a standard build will result in errors.
## Description
This repository contains experimental GGUF format model files for Moonshot AI's Kimi Linear 48B.
Kimi Linear is a hybrid linear attention architecture designed to outperform traditional full attention methods in long-context and scaling regimes. It uses Kimi Delta Attention (KDA) and a hybrid architecture (3:1 KDA-to-MLA ratio) to reduce memory usage and boost throughput by up to 6x on long sequences.
## Performance & Architecture

This model is currently quantized to Q2_K (and other levels) to fit on consumer hardware while the architecture's correctness is being verified. Despite the aggressive quantization, initial tests show that logic and reasoning capabilities remain intact.
| Feature | Kimi Linear Specification |
|---|---|
| Architecture | Hybrid Linear Attention (MoE + MLA + KDA) |
| Context Length | 1M Tokens (Supported by architecture) |
| Params | 48B Total / 3B Activated |
| Throughput | ~6.3x faster TPOT compared to MLA at 1M context |
| MMLU-Pro | 51.0 (4k context) |
| RULER | 84.3 (128k context, Pareto-optimal) |
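To pull the quantized file locally, the Hugging Face CLI works; a minimal sketch, assuming the `Kimi-Linear-48B-Instruct.Q2_K.gguf` filename used in the commands below (check the repo's file list for the exact names):

```bash
# Download a single quant from this repo into the current directory.
# The filename is taken from the run commands below; verify it against the repo's file list.
huggingface-cli download AaryanK/Kimi-Linear-48B-A3B-Instruct-GGUF \
  Kimi-Linear-48B-Instruct.Q2_K.gguf \
  --local-dir .
```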
## How to Run (llama.cpp)
Prerequisite: You must clone and build the specific PR branch:
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/17592/head:pr-17592
git checkout pr-17592
make -j
```
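Note: recent `llama.cpp` revisions have replaced the Makefile build with CMake, so `make -j` may refuse to run on this checkout. If it does, the equivalent CMake build (a sketch; add `-DGGML_CUDA=ON` only if you want CUDA support) is:

```bash
# Configure and build with CMake instead of make.
# For NVIDIA GPUs, append -DGGML_CUDA=ON to the configure step.
cmake -B build
cmake --build build --config Release -j
```

With the CMake build, the binaries land in `build/bin/`, so the commands below become `./build/bin/llama-cli` and `./build/bin/llama-server`.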
### 1. CLI Inference (Interactive Chat)
```bash
# -n 2048    : generation limit (adjust as needed)
# -c 8192    : context window (the model supports up to 1M)
# --temp 0.8 : recommended temperature
# -ngl 99    : offload all layers to the GPU
./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -n 2048 \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -ngl 99 \
  -p "<|im_start|>user\nHello, who are you?<|im_end|>\n<|im_start|>assistant\n" \
  -cnv
```
Note: the current GGUF implementation mitigates the "state collapse" issues encountered during early development.
### 2. Server Mode (API)
Running a persistent server is recommended for a model of this size, since it avoids repeated load times.
```bash
./llama-server -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  --alias kimi
```
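Once the server is up, you can query `llama-server`'s OpenAI-compatible chat endpoint. A minimal request against the settings above (localhost, port 8080, the `kimi` alias) looks roughly like this:

```bash
# Send a chat completion request to the local server via the OpenAI-compatible API.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi",
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ],
    "temperature": 0.8,
    "top_p": 0.9
  }'
```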
## Hardware Requirements

- Full GPU Offloading (`-ngl 99`):
  - Q4_K_M: requires ~28 GB VRAM (e.g., A100, A6000, or Mac Studio M2/M3 Max).
  - Q2_K: requires ~16-18 GB VRAM (fits on an RTX 3090 / 4090).
- Split Offloading:
  - If you have less VRAM (e.g., 12 GB), use `-ngl` with a lower number (e.g., `-ngl 20`) to split layers between the GPU and CPU RAM; see the sketch after this list.
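A minimal sketch of a split-offload invocation; `-ngl 20` is only an illustrative value, so tune it until the model fits in your VRAM:

```bash
# Keep roughly 20 layers on the GPU and the rest in system RAM (illustrative value).
./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -c 8192 \
  -ngl 20 \
  -cnv
```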
## Default Settings

- temperature: `0.8`
- top-p: `0.9`
- repeat-penalty: `1.05` (optional, if repetition occurs)
## CLI Example

```bash
./llama-cli -m Kimi-Linear-48B-Instruct.Q2_K.gguf \
  -c 8192 \
  --temp 0.8 \
  --top-p 0.9 \
  -p "<|im_start|>user\nWrite a Python script to calculate Fibonacci numbers.<|im_end|>\n<|im_start|>assistant\n" \
  -cnv
```
Base model: moonshotai/Kimi-Linear-48B-A3B-Instruct