---
title: Zen VL Training
emoji: 🧘
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: apache-2.0
hardware: a10g-large
---

# 🧘 Zen VL Training Space

Train zen-vl vision-language models on the combined ADP and xLAM datasets using Hugging Face Pro GPUs.

## Features

- **Multi-Size Support**: Train 4B, 8B, or 30B parameter models
- **GPU Options**: A10G (24GB), A100-Large (40GB), A100 (80GB)
- **Combined Datasets**: Agent Data Protocol (ADP) + xLAM Function Calling
- **Auto-Upload**: Trained models automatically uploaded to HuggingFace Hub
- **Real-time Monitoring**: Live training logs and progress tracking

## Datasets

### Agent Data Protocol (ADP)
- **Source**: [neulab/agent-data-collection](https://huggingface.co/datasets/neulab/agent-data-collection)
- **Size**: ~220k agent trajectories (8.4GB)
- **Citation**: arXiv:2510.24702

### xLAM Function Calling 60k
- **Source**: [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)
- **Size**: 60k function calling examples (101MB)
- **Citation**: Salesforce Research

## Training Configuration

### 4B Model (A10G - 24GB)
- Batch size: 1
- Gradient accumulation: 8
- Max samples: 30,000
- Estimated time: 6-8 hours

### 8B Model (A100-Large - 40GB)
- Batch size: 2
- Gradient accumulation: 8
- Max samples: 50,000
- Estimated time: 10-12 hours

### 30B Model (A100 - 80GB)
- Batch size: 4
- Gradient accumulation: 8
- Max samples: 100,000
- Estimated time: 20-24 hours
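The three presets above can be written down as data, which makes the effective batch size (batch size × gradient accumulation) and the resulting number of optimizer steps explicit. This is a minimal sketch, not the Space's actual code: the names `TrainingPreset` and `PRESETS` are hypothetical, and it assumes a single GPU and one pass over the capped sample count.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPreset:
    """One row of the Training Configuration section (hypothetical helper)."""

    gpu: str
    batch_size: int
    grad_accum: int
    max_samples: int

    @property
    def effective_batch_size(self) -> int:
        # Samples consumed per optimizer step on a single GPU (assumption).
        return self.batch_size * self.grad_accum

    @property
    def optimizer_steps_per_epoch(self) -> int:
        # One pass over max_samples, rounding the last partial step up.
        return math.ceil(self.max_samples / self.effective_batch_size)


# Values copied from the presets listed above.
PRESETS = {
    "4b": TrainingPreset(gpu="a10g", batch_size=1, grad_accum=8, max_samples=30_000),
    "8b": TrainingPreset(gpu="a100-large", batch_size=2, grad_accum=8, max_samples=50_000),
    "30b": TrainingPreset(gpu="a100", batch_size=4, grad_accum=8, max_samples=100_000),
}

if __name__ == "__main__":
    for size, preset in PRESETS.items():
        print(size, preset.gpu, preset.effective_batch_size, preset.optimizer_steps_per_epoch)
```

Under these assumptions the effective batch sizes are 8, 16, and 32, which gives 3,750 optimizer steps for the 4B preset and 3,125 each for the 8B and 30B presets.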

## Usage

1. Select model size (4b, 8b, or 30b)
2. Choose GPU type (a10g, a100-large, or a100)
3. Click "Start Training"
4. Monitor progress in real-time
5. The trained model is uploaded automatically to `zenlm/zen-vl-{size}-agent`

## Requirements

- HuggingFace Pro account (for GPU access)
- `HF_TOKEN` environment variable set
- Write access to zenlm organization
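
Because the upload step depends on `HF_TOKEN`, it can help to fail fast before a multi-hour run starts. The following is a small sketch of such a check; `require_hf_token` is a hypothetical helper, not part of this Space, and it only verifies that the variable is present, not that the token has write access to the zenlm organization.

```python
import os
from typing import Mapping, Optional


def require_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HF_TOKEN value, or raise if it is missing or blank.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    source = os.environ if env is None else env
    token = source.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a secret in the Space settings "
            "before starting training."
        )
    return token
```

Calling `require_hf_token()` at startup surfaces a missing token immediately instead of at upload time.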

## Output Models

Trained models will be uploaded to:
- `zenlm/zen-vl-4b-agent`
- `zenlm/zen-vl-8b-agent`
- `zenlm/zen-vl-30b-agent`
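
The repository names follow the `zenlm/zen-vl-{size}-agent` pattern, so the target can be derived from the selected model size. A minimal sketch, assuming a hypothetical helper named `output_repo_id` that rejects sizes outside the three listed above:

```python
VALID_SIZES = ("4b", "8b", "30b")


def output_repo_id(size: str) -> str:
    """Map a model size such as "8b" to its Hub repository ID."""
    normalized = size.strip().lower()
    if normalized not in VALID_SIZES:
        raise ValueError(f"Unknown model size {size!r}; expected one of {VALID_SIZES}")
    return f"zenlm/zen-vl-{normalized}-agent"
```

For example, `output_repo_id("8B")` returns `zenlm/zen-vl-8b-agent`.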

## Technical Details

- **Base Architecture**: Qwen3-VL
- **Training Method**: Supervised Fine-Tuning (SFT)
- **Data Mixture**: 80% ADP, 20% xLAM
- **Precision**: bfloat16
- **Framework**: Transformers + Accelerate
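
To make the 80/20 data mixture concrete, the sketch below splits a sample budget between the two datasets. It is an illustration only: the README does not say whether the mixture is applied as fixed counts or as per-sample sampling probabilities, and `MIXTURE_PERCENT` and `mixture_counts` are hypothetical names. Integer percentages are used so the arithmetic is exact.

```python
from typing import Dict

# Percent of the sample budget drawn from each dataset (from the list above).
MIXTURE_PERCENT = {"adp": 80, "xlam": 20}


def mixture_counts(max_samples: int, percents: Dict[str, int] = MIXTURE_PERCENT) -> Dict[str, int]:
    """Split max_samples across datasets by integer percentage.

    Any remainder from integer division goes to the last dataset so the
    counts always sum to max_samples.
    """
    if sum(percents.values()) != 100:
        raise ValueError("Mixture percentages must sum to 100")
    names = list(percents)
    counts = {name: max_samples * percents[name] // 100 for name in names[:-1]}
    counts[names[-1]] = max_samples - sum(counts.values())
    return counts


if __name__ == "__main__":
    for budget in (30_000, 50_000, 100_000):
        print(budget, mixture_counts(budget))
```

With the 4B preset's cap of 30,000 samples this yields 24,000 ADP and 6,000 xLAM examples; at the 30B cap of 100,000 it yields 80,000 and 20,000, which stays within the stated sizes of both datasets.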

## License

Apache 2.0 - See [LICENSE](https://github.com/zenlm/zen-vl/blob/main/LICENSE)

## Citation

```bibtex
@software{zen-vl-2025,
  title={Zen VL: Vision-Language Models with Function Calling},
  author={Zen AI Team},
  year={2025},
  url={https://github.com/zenlm/zen-vl}
}
```

## Links

- **Website**: https://zenlm.org
- **GitHub**: https://github.com/zenlm/zen-vl
- **Models**: https://huggingface.co/zenlm
- **Paper**: Coming soon