# Data Pipeline

> **The Problem:** HuggingFace `datasets` doesn't natively support NIfTI/BIDS neuroimaging formats.
>
> **The Solution:** `neuroimaging-go-brrrr` extends `datasets` with a `Nifti()` feature type.

---

## What is neuroimaging-go-brrrr?

```text
┌────────────────────────────────────────────────────────────────────────────┐
│             neuroimaging-go-brrrr EXTENDS HUGGINGFACE DATASETS             │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  pip install datasets         pip install neuroimaging-go-brrrr            │
│  ────────────────────         ─────────────────────────────────            │
│  Standard HuggingFace         EXTENDS datasets with:                       │
│  • Images, text, audio        • Nifti() feature type for .nii.gz           │
│  • Parquet/Arrow storage      • BIDS directory parsing                     │
│  • Hub integration            • Upload utilities (BIDS→Hub)                │
│                               • Validation utilities                       │
│                               • Bug workarounds for upstream issues        │
│                                                                            │
│  When you install neuroimaging-go-brrrr, you get:                          │
│  • A patched datasets library with Nifti() support (pinned git commit)     │
│  • bids_hub module for upload/validation                                   │
│  • All upstream bug workarounds in one place                               │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```

**Key insight:** `neuroimaging-go-brrrr` pins to a specific commit of `datasets` that includes `Nifti()` support:

```toml
# From neuroimaging-go-brrrr/pyproject.toml
[tool.uv.sources]
datasets = { git = "https://github.com/huggingface/datasets.git", rev = "004a5bf4..." }
```

---

## The Two Pipelines

### Pipeline 1: UPLOAD (How Data Gets to HuggingFace)

```text
┌─────────────────┐      ┌──────────────────────┐      ┌─────────────────────┐
│  Local BIDS     │      │  neuroimaging-go-    │      │  HuggingFace Hub    │
│  Directory      │ ──►  │  brrrr (bids_hub)    │ ──►  │  hugging-science/   │
│  (Zenodo)       │      │                      │      │  isles24-stroke     │
└─────────────────┘      │  • build_isles24_    │      └─────────────────────┘
                         │    file_table()      │
                         │  • Nifti() features  │
                         │  • push_to_hub()     │
                         └──────────────────────┘
```

### Pipeline 2: CONSUMPTION (How This Demo Loads Data)

**THE CORRECT PATTERN:**

```python
from datasets import load_dataset

# neuroimaging-go-brrrr provides the patched datasets with Nifti() support
ds = load_dataset("hugging-science/isles24-stroke", split="train")

# Access data - Nifti() returns nibabel.Nifti1Image objects
example = ds[0]
dwi = example["dwi"]                  # nibabel.Nifti1Image (NOT a numpy array)
adc = example["adc"]                  # nibabel.Nifti1Image
lesion_mask = example["lesion_mask"]  # nibabel.Nifti1Image

# To get a numpy array: dwi.get_fdata()
# To save to a file:    dwi.to_filename("dwi.nii.gz")
```

This is the **intended consumption pattern**. It should just work because:

1. `neuroimaging-go-brrrr` provides the patched `datasets` with `Nifti()` support
2. The dataset was uploaded with `Nifti()` features
3. `Nifti(decode=True)` returns nibabel images with affine/header preserved

---

## Current State: REFACTOR NEEDED

**Problem:** stroke-deepisles-demo currently has a hand-rolled workaround in `data/adapter.py` that bypasses `datasets.load_dataset()`. This workaround uses `HfFileSystem` + `pyarrow` directly to download individual parquet files.

**Why this is wrong:**

1. It duplicates bug workarounds that should live in `neuroimaging-go-brrrr`
2. It doesn't use the `Nifti()` feature type properly
3. It is harder to maintain: fixes need to happen in multiple places

**The fix:**

1. Delete the custom `HuggingFaceDataset` adapter in `data/adapter.py`
2. Use the standard `datasets.load_dataset()` consumption pattern (sketched below)
3. If there are bugs, fix them in `neuroimaging-go-brrrr`, not locally
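As a concrete target for the fix, here is a minimal sketch of what the refactored loader could look like, assuming the standard consumption pattern above. The `load_isles24` helper name is hypothetical; the dataset ID and column names come from the examples in this document.

```python
from datasets import load_dataset  # patched build supplied by neuroimaging-go-brrrr


def load_isles24(split: str = "train"):
    """Hypothetical replacement for the HfFileSystem + pyarrow adapter."""
    # One call replaces the manual parquet-download logic in data/adapter.py
    return load_dataset("hugging-science/isles24-stroke", split=split)


if __name__ == "__main__":
    ds = load_isles24()
    example = ds[0]
    dwi = example["dwi"]          # nibabel.Nifti1Image, not a numpy array
    print(dwi.shape, dwi.affine)  # voxel grid + spatial orientation
    volume = dwi.get_fdata()      # materialize voxels as numpy only when needed
```

If anything in this path breaks, that is a bug to fix in `neuroimaging-go-brrrr`, not here.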
---

## Dependency Relationship

```text
stroke-deepisles-demo (this repo)
│
└── neuroimaging-go-brrrr @ v0.2.1
    │
    ├── datasets @ git commit 004a5bf4... (patched with Nifti())
    │   └── huggingface-hub
    │
    └── bids_hub module (upload + validation utilities)
```

**The consumption should flow through the standard pattern:**

```text
stroke-deepisles-demo
│
│  from datasets import load_dataset
│  ds = load_dataset("hugging-science/isles24-stroke")
▼
neuroimaging-go-brrrr (provides patched datasets)
│
│  Nifti() feature type handles lazy loading
▼
HuggingFace Hub (isles24-stroke dataset)
```

---

## Dataset Info

| Property   | Value                                                   |
|------------|---------------------------------------------------------|
| Dataset ID | `hugging-science/isles24-stroke`                        |
| Subjects   | 149                                                     |
| Modalities | DWI, ADC, Lesion Mask, NCCT, CTA, CTP, Perfusion Maps   |
| Source     | [Zenodo 17652035](https://zenodo.org/records/17652035)  |

---

## What bids_hub Provides

```text
┌────────────────────────────────────────────────────────────────────────────┐
│                      neuroimaging-go-brrrr (bids_hub)                      │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  FOR UPLOADING:                       FOR CONSUMING:                       │
│  ──────────────                       ──────────────                       │
│  build_isles24_file_table()           Patched datasets with Nifti()        │
│  get_isles24_features()               └── Use standard load_dataset()      │
│  push_dataset_to_hub()                                                     │
│  validate_isles24_download()          ISLES24_EXPECTED_COUNTS              │
│                                       └── Can use for sanity checking      │
│  We DON'T use these in this demo.                                          │
│  Dataset already uploaded.                                                 │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
```

---

## Related Documentation

- [neuroimaging-go-brrrr](https://github.com/The-Obstacle-Is-The-Way/neuroimaging-go-brrrr)
- [isles24-stroke dataset card](https://huggingface.co/datasets/hugging-science/isles24-stroke)

---

## TODO: Refactor Data Loading

The current hand-rolled adapter in `data/adapter.py` should be replaced with standard `datasets.load_dataset()` consumption. This refactor should:

1. Remove the `HuggingFaceDataset` class from `data/adapter.py`
2. Update `data/loader.py` to use `datasets.load_dataset()`
3. Remove the pre-computed constants in `data/constants.py` (no longer needed)
4. Test that `Nifti()` lazy loading works correctly (see the test sketch below)
5. If bugs are found, report/fix them in `neuroimaging-go-brrrr`
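For step 4, a rough sketch of what the lazy-loading check could look like, assuming pytest. The test names are hypothetical; the subject count comes from the dataset info table and the column names from the consumption example above.

```python
import nibabel as nib
import pytest
from datasets import load_dataset


@pytest.fixture(scope="module")
def isles24():
    # Module-scoped so the dataset is resolved once for all tests
    return load_dataset("hugging-science/isles24-stroke", split="train")


def test_subject_count(isles24):
    # 149 subjects, per the dataset info table
    assert len(isles24) == 149


def test_nifti_decoding(isles24):
    # Nifti(decode=True) should yield nibabel images with affine/header
    # preserved, not raw bytes or bare numpy arrays
    example = isles24[0]
    for key in ("dwi", "adc", "lesion_mask"):
        img = example[key]
        assert isinstance(img, nib.Nifti1Image)
        assert img.affine.shape == (4, 4)
```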