# Llama-3.2-1B - ADPQ 4-bit Quantized
This work is part of a master's thesis. The library used for quantization is available as the `auto-adpq` package:
```bash
pip install auto-adpq
```
## Model Description
This is a compressed version of meta-llama/Llama-3.2-1B created using 4-bit quantization.
The model was quantized to reduce VRAM usage and increase inference speed while preserving most of the original model's performance.
## Quantization Details
- Original Model: meta-llama/Llama-3.2-1B
- Quantization Method: ADPQ (Adaptive Quantization with data-free calibration)
- Precision: 4-bit
- Simulated: Yes (see the sketch below)
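Simulated (fake) quantization means the weights go through a quantize/dequantize round trip but are stored back in bfloat16, so the checkpoint loads like a regular Transformers model. The snippet below is a minimal, generic illustration of group-wise symmetric 4-bit round-trip quantization; it is not the ADPQ algorithm itself, and the function name and group size are arbitrary examples.

```python
import torch

def fake_quantize_symmetric(w: torch.Tensor, group_size: int = 128, q_bit: int = 4) -> torch.Tensor:
    """Quantize-then-dequantize a weight tensor in groups (illustrative, not ADPQ).

    Assumes w.numel() is divisible by group_size.
    """
    orig_shape = w.shape
    grouped = w.reshape(-1, group_size).float()                 # one scale per group
    qmax = 2 ** (q_bit - 1) - 1                                 # 7 for signed 4-bit
    scale = grouped.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(grouped / scale), -qmax, qmax)  # symmetric integer grid
    return (q * scale).reshape(orig_shape).to(w.dtype)          # stored back in the original dtype

# Example: quantization error is introduced, but the tensor stays bfloat16.
w = torch.randn(256, 256, dtype=torch.bfloat16)
w_q = fake_quantize_symmetric(w)
print(w.dtype, w_q.dtype, (w - w_q).abs().mean().item())
```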
## How to Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tfloow/Llama-3.2-1B-adpq-4bit-sim-16workers"

# Load the quantized checkpoint like any other Transformers model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Send the inputs to the same device as the model instead of hard-coding "cuda"
inputs = tokenizer("Hello, world!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```
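To check how much memory the loaded model actually occupies, you can use the built-in `get_memory_footprint()` helper. Because this checkpoint is simulated (weights stored in bfloat16 rather than packed 4-bit), the footprint should be roughly the same as the unquantized base model.

```python
# Rough memory check; for this simulated checkpoint the weights are stored in bfloat16
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")
```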
## Performance
| Model | PPL (lower is better) |
|---|---|
| unsloth/Meta-Llama-3.1-8B | 4.8693 |
| unsloth/Meta-Llama-3.1-8B-bnb-4bit | 5.0733 |
| Tfloow/Meta-Llama-3.1-8B-weights-adpq-4bit-sim | 5.3671 |

| Model | PPL (lower is better) |
|---|---|
| unsloth/Meta-Llama-3.2-1B | 6.5546 |
| unsloth/Meta-Llama-3.2-1B-bnb-4bit | 6.9971 |
| unsloth/Meta-Llama-3.2-1B-adpq | 7.5700 |
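The table reports perplexity, but the evaluation corpus and window settings are not given here. The following is a minimal sketch of a standard sliding-window perplexity evaluation, assuming WikiText-2 (raw, test split), a 2048-token context, and a stride of 512; these are illustrative choices and may not match the setup used for the numbers above.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tfloow/Llama-3.2-1B-adpq-4bit-sim-16workers"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumed evaluation data: WikiText-2 raw test split, concatenated into one long string
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_length, stride = 2048, 512  # assumed context length and stride
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                 # only score tokens not covered by the previous window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100          # mask the overlapping prefix out of the loss

    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)

    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"Perplexity: {ppl.item():.4f}")
```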
## How was the model quantized?
```python
import torch
from transformers import AutoModelForCausalLM
from auto_adpq import Auto_AdpQ, AutoAdpQConfig

model_name = "meta-llama/Llama-3.2-1B"

# Setup Auto-AdpQ configuration
group_size = 128  # example value; the group size used for the published checkpoint is not stated in the original snippet
adpq_config = AutoAdpQConfig(
    group_size=group_size,
    n_iters=30,  # Seems quite slow otherwise
    alpha=0.09,
    device="cpu",
    q_bit=4,
    data_packing=False,
    symmetrical_quantization=True,
)

user = "Tfloow"
adpq_model_name = f"{user}/{model_name.split('/')[-1]}-adpq-4bit-sim-16workers"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
print(model.dtype)

# Virtual (simulated) quantization, parallelized over 16 worker threads
quantized = Auto_AdpQ.apply_quantization(model, adpq_config, multi_threaded=16)

model.push_to_hub(adpq_model_name)
```
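The `-16workers` suffix in the repository name presumably reflects the `multi_threaded=16` setting used during quantization; it does not affect how the model is loaded or used.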