---
language:
- eng
- efi
tags:
- translation
- nllb
- nllb-200
- english-efik
license: apache-2.0
datasets:
- Davlan/ibom-mt-en-efi
base_model: facebook/nllb-200-distilled-600M
library_name: transformers
pipeline_tag: translation
model-index:
- name: nllb-200-distilled-600M-ft-efi-en
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Ibom-MT (en-efi)
      type: Davlan/ibom-mt-en-efi
    metrics:
    - name: BLEU
      type: bleu
      value: 38.6
    - name: chrF
      type: chrf
      value: 54.5
---
# Efik -> English (NLLB-200 Distilled)

Fine-tuned NLLB-200 model for translating Efik -> English. Because Efik is not directly supported in NLLB-200, the Igbo language code `ibo_Latn` is used as a close proxy during both training and inference.
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "luel/nllb-200-distilled-600M-ft-efi-en"

# Efik has no NLLB language code, so the Igbo code ibo_Latn is used as the source language.
# token=True is only needed if the checkpoint is private or gated.
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True, src_lang="ibo_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, token=True)

input_example = "Ami nko nko."
inputs = tokenizer(input_example, return_tensors="pt")

# Force English as the target language for generation.
generated_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=30,
)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
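The same model can also be called through the `translation` pipeline. This is a minimal sketch, again assuming the `ibo_Latn` proxy code for Efik:

```python
from transformers import pipeline

# src_lang uses the Igbo code as the Efik proxy; tgt_lang is English.
translator = pipeline(
    "translation",
    model="luel/nllb-200-distilled-600M-ft-efi-en",
    src_lang="ibo_Latn",
    tgt_lang="eng_Latn",
    max_length=30,
)
print(translator("Ami nko nko.")[0]["translation_text"])
```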
## Training details (summary)
| Item | Value |
|---|---|
| Base model | facebook/nllb-200-distilled-600M |
| Dataset | Davlan/ibom-mt-en-efi |
| Script | lafand-mt |
| Epochs | 8 |
| Effective batch size | 32 (16 × 2 grad-accum) |
| Learning rate | 3e-5 |
| Mixed precision | bf16 |
| Early stopping | Patience = 3, min_delta (BLEU) = 0.001 |
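Fine-tuning was run with the lafand-mt script; the snippet below is only a sketch of an equivalent `Seq2SeqTrainer` setup that mirrors the hyperparameters in the table. The dataset field and split names (`translation`, `efi`, `en`, `validation`) are assumptions, not taken from this repository; check the dataset card before running.

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_id = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(base_id, src_lang="ibo_Latn", tgt_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(base_id)

# Assumed MAFAND-style layout: a "translation" dict with "efi" and "en" keys.
raw = load_dataset("Davlan/ibom-mt-en-efi")

def preprocess(batch):
    sources = [ex["efi"] for ex in batch["translation"]]
    targets = [ex["en"] for ex in batch["translation"]]
    return tokenizer(sources, text_target=targets, max_length=128, truncation=True)

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    score = bleu.compute(predictions=decoded_preds, references=[[l] for l in decoded_labels])
    return {"bleu": score["score"]}

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-distilled-600M-ft-efi-en",
    num_train_epochs=8,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32
    learning_rate=3e-5,
    bf16=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
    greater_is_better=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],  # split name assumed
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics,
    # Stop if BLEU does not improve by at least 0.001 for 3 consecutive evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.001)],
)
trainer.train()
```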
## Evaluation
| Metric | efi->en |
|---|---|
| BLEU | 38.6 |
| chrF | 54.5 |
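Both metrics can be recomputed from decoded model outputs with the `sacrebleu` package; a minimal sketch is shown below (the strings are placeholders, not actual test data from Ibom-MT):

```python
import sacrebleu

# Placeholder outputs and references; replace with the model's decoded
# translations and the English references from the Ibom-MT test split.
hypotheses = ["Predicted English translation."]
references = [["Reference English translation."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```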
## Limitations

- Using the Igbo language code (`ibo_Latn`) as a stand-in for Efik may introduce lexical differences and tokenization mismatches.
- The model has not been extensively evaluated for bias, toxicity, or gender neutrality.