sozkz-nllb-1b-kk-gec-v1
NLLB-200-1.3B fine-tuned for Kazakh grammatical error correction (GEC). Fixes spelling, morphology, punctuation and word-order errors in Kazakh Cyrillic text.
Training
| Stage | Dataset | Pairs | Epochs | LR |
|---|---|---|---|---|
| 1 — Pretrain | sozkz-corpus-pretrain-gec-mix-v1 | 1.77M | 2 | 2e-5 |
| 2 — Fine-tune | sozkz-corpus-synthetic-kk-gec-v1 | 19K | 3 | 5e-6 |
Hardware: 1× H100 SXM 80GB, bf16.
Evaluation
Evaluated on the canonical 200-example Kazakh GEC test set (stukenov/sozkz-corpus-gec-benchmark-kk-v1, split test) with beam size 5 and kaz_Cyrl → kaz_Cyrl as the source and target language codes.
| Model | Exact Match | CER ↓ | Word Prec | Word Rec | Word F0.5 ↑ | Identity |
|---|---|---|---|---|---|---|
| sozkz-fix-mt5-50m-kk-gec-v1 (baseline) | 62.0% | 0.0802 | 0.494 | 0.661 | 0.520 | 100% |
| sozkz-nllb-1b-kk-pretrain-v1 | 43.5% | 0.2643 | 0.206 | 0.543 | 0.235 | 61.5% |
| sozkz-nllb-1b-kk-gec-v1 (this model) | 44.0% | 0.2447 | 0.233 | 0.550 | 0.264 | 61.5% |
Note: The NLLB model occasionally switches to Russian or English on sentences containing foreign proper nouns, which lowers the identity preservation score. The mT5 baseline (saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1) remains the recommended model for production use.
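The Word F0.5 column can be cross-checked against the precision and recall columns with the standard F-beta formula. The sketch below also shows one common way to compute exact match and CER (Levenshtein distance divided by reference length). The card does not state how word-level precision and recall were extracted or how CER was normalised, so the helper names and the CER normalisation here are assumptions, not the evaluation script that produced the table.

```python
from typing import List


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision above recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def exact_match(hypotheses: List[str], references: List[str]) -> float:
    """Fraction of outputs that equal their reference string exactly."""
    hits = sum(h == r for h, r in zip(hypotheses, references))
    return hits / len(references)


# Precision/recall pairs copied from the evaluation table above.
for name, p, r in [("mt5 baseline", 0.494, 0.661),
                   ("nllb pretrain", 0.206, 0.543),
                   ("nllb gec", 0.233, 0.550)]:
    print(f"{name}: F0.5 = {f_beta(p, r):.3f}")
```

From the rounded table values this gives 0.520 and 0.235 for the first two rows and about 0.263 for this model, against the reported 0.264; the small gap is what rounding of precision and recall to three decimals would produce.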
Inference
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

def correct(text: str) -> str:
    # Force decoding to start with the Kazakh Cyrillic language token.
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    inputs = tokenizer(text, return_tensors="pt", max_length=256,
                       truncation=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, forced_bos_token_id=forced_bos,
                             num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(correct("Қазақ тіліне мемлекеттик мәртебе бериліп, онын міндетті түрде қолданылу аясы белгіленуі мүмкін емес."))
# → Қазақ тіліне мемлекеттік мәртебе беріліп, оның міндетті түрде қолданылу аясы белгіленуі мүмкін емес.
Sample Corrections
| Input (erroneous) | Output (corrected) |
|---|---|
| Қазақ тіліне мемлекеттик мәртебе бериліп, онын міндетті түрде қолданылу аясы белгіленуі мүмкін емес. | Қазақ тіліне мемлекеттік мәртебе беріліп, оның міндетті түрде қолданылу аясы белгіленуі мүмкін емес. |
| Онын лирикасы адамды ерлик жасауға шакырады. | Оның лирикасы адамды ерлік жасауға шакырады. |
| Бала өмірге келген соң алты ай бойы тек емізу керек керек. | Бала өмірге келген соң алты ай бойы тек емізу керек. |
Limitations
- Output can occasionally switch to Russian or English on sentences containing foreign proper nouns
- Trained on synthetic GEC data — real-world error coverage may differ
- Optimized for Kazakh Cyrillic; Latin-script Kazakh not supported
- Identity preservation is 61.5% on the 200-example test set (0/15 on the 100-example custom benchmark): the model sometimes modifies grammatically correct sentences
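One way to soften the language-switching and over-editing limitations is to accept a correction only when it stays close to the input, since GEC edits are usually small. The following is a minimal sketch of such a guard using only the standard library. The function name `guarded_correction` and the 0.8 threshold are hypothetical choices for illustration, not part of this model or its evaluation, and the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher


def guarded_correction(source: str, candidate: str,
                       min_similarity: float = 0.8) -> str:
    """Return the candidate only if it stays close to the source text.

    A large drop in character-level similarity suggests the model rewrote
    the sentence or switched language, so the original is kept instead.
    The 0.8 default is an illustrative assumption, not a tuned value.
    """
    similarity = SequenceMatcher(None, source, candidate).ratio()
    return candidate if similarity >= min_similarity else source


# A small spelling fix is accepted.
src = "Онын лирикасы адамды ерлик жасауға шакырады."
fixed = "Оның лирикасы адамды ерлік жасауға шакырады."
print(guarded_correction(src, fixed) == fixed)

# An output in another language is rejected and the input is returned.
print(guarded_correction("Бала өмірге келді.", "The child was born.") == "Бала өмірге келді.")
```

In practice the guard would wrap the model call, for example `guarded_correction(text, correct(text))` with the `correct` function from the Inference section. It trades some recall for safety: a legitimate but heavy rewrite would also be rejected.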
Related
- Pretrain checkpoint: stukenov/sozkz-nllb-1b-kk-pretrain-v1
- Recommended baseline: saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1
- GEC benchmark: stukenov/sozkz-corpus-gec-benchmark-kk-v1
Benchmark Results
Evaluated on a 100-example custom GEC test set (pure model inference, no pre- or post-processing pipeline).
| Category | Score |
|---|---|
| Spelling | 0/30 (0%) |
| Grammar | 1/20 (5%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 1/100 (1%) |
Leaderboard (100-example custom benchmark)
| Model | Total | Spelling/30 | Grammar/20 | Punctuation/15 | Mixed/20 | Identity/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
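The Total column is consistent with summing the per-category correct counts over the 100 examples (30 + 20 + 15 + 20 + 15). A small sketch that recomputes it for two rows copied from the leaderboard; the dictionary keys and the function name are illustrative and not taken from any released evaluation script.

```python
from typing import Dict

# Category sizes of the 100-example custom benchmark, read off the table headers.
CATEGORY_SIZES: Dict[str, int] = {
    "spelling": 30, "grammar": 20, "punctuation": 15, "mixed": 20, "identity": 15,
}


def total_percent(correct: Dict[str, int]) -> float:
    """Overall score: correct examples across all categories, as a percentage."""
    for name, count in correct.items():
        if not 0 <= count <= CATEGORY_SIZES[name]:
            raise ValueError(f"{name}: {count} is outside 0..{CATEGORY_SIZES[name]}")
    return 100.0 * sum(correct.values()) / sum(CATEGORY_SIZES.values())


# Rows copied from the leaderboard above.
llama_600m = {"spelling": 15, "grammar": 12, "punctuation": 3, "mixed": 2, "identity": 15}
this_model = {"spelling": 0, "grammar": 1, "punctuation": 0, "mixed": 0, "identity": 0}

print(total_percent(llama_600m))  # 47.0
print(total_percent(this_model))  # 1.0
```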