sozkz-nllb-1b-kk-gec-v1

NLLB-200-1.3B fine-tuned for Kazakh grammatical error correction (GEC). Fixes spelling, morphology, punctuation and word-order errors in Kazakh Cyrillic text.

Training

| Stage | Dataset | Pairs | Epochs | LR |
|---|---|---|---|---|
| 1 (pretrain) | sozkz-corpus-pretrain-gec-mix-v1 | 1.77M | 2 | 2e-5 |
| 2 (fine-tune) | sozkz-corpus-synthetic-kk-gec-v1 | 19K | 3 | 5e-6 |

Hardware: 1× H100 SXM 80GB, bf16.

Evaluation

Evaluated on the canonical 200-example Kazakh GEC test set (stukenov/sozkz-corpus-gec-benchmark-kk-v1, split test) using beam search with 5 beams and kaz_Cyrl as both source and target language.

| Model | Exact Match | CER ↓ | Word Prec | Word Rec | Word F0.5 ↑ | Identity |
|---|---|---|---|---|---|---|
| sozkz-fix-mt5-50m-kk-gec-v1 (baseline) | 62.0% | 0.0802 | 0.494 | 0.661 | 0.520 | 100% |
| sozkz-nllb-1b-kk-pretrain-v1 | 43.5% | 0.2643 | 0.206 | 0.543 | 0.235 | 61.5% |
| sozkz-nllb-1b-kk-gec-v1 (this model) | 44.0% | 0.2447 | 0.233 | 0.550 | 0.264 | 61.5% |
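
The card does not state exactly how these metrics were computed. The sketch below assumes the usual definitions: exact match as the share of outputs identical to the reference, CER as character-level edit distance divided by reference length, and F0.5 as the weighted harmonic mean of precision and recall that favours precision. The function names are illustrative and are not taken from the evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate relative to the reference length (assumed definition)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def exact_match(hypotheses: list[str], references: list[str]) -> float:
    """Share of hypotheses that equal their reference after trimming whitespace."""
    hits = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return hits / len(references)


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta=0.5 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# One substituted character in an 11-character word.
print(cer("мемлекеттик", "мемлекеттік"))
# F0.5 recomputed from the baseline row's precision and recall.
print(round(f_beta(0.494, 0.661), 3))
```

Recomputing F0.5 from the rounded precision and recall in the table gives 0.520 and 0.235 for the first two rows and 0.263 for this model, which is consistent with the reported 0.264 once rounding of the inputs is taken into account.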

Note: The NLLB model occasionally switches to Russian or English on sentences containing foreign proper nouns, which lowers the identity preservation score. The mT5 baseline (saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1) remains the recommended model for production use.

Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

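def correct(text: str) -> str:
    """Return the model's corrected version of a Kazakh Cyrillic sentence."""
    # NLLB selects the output language through the first generated token.
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    # Inputs longer than 256 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", max_length=256,
                       truncation=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, forced_bos_token_id=forced_bos,
                             num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)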

print(correct("Қазақ тіліне мемлекеттик мәртебе бериліп, онын міндетті түрде қолданылу аясы белгіленуі мүмкін емес."))
# → Қазақ тіліне мемлекеттік мәртебе беріліп, оның міндетті түрде қолданылу аясы белгіленуі мүмкін емес.

Sample Corrections

| Input (erroneous) | Output (corrected) |
|---|---|
| Қазақ тіліне мемлекеттик мәртебе бериліп, онын міндетті түрде қолданылу аясы белгіленуі мүмкін емес. | Қазақ тіліне мемлекеттік мәртебе беріліп, оның міндетті түрде қолданылу аясы белгіленуі мүмкін емес. |
| Онын лирикасы адамды ерлик жасауға шакырады. | Оның лирикасы адамды ерлік жасауға шакырады. |
| Бала өмірге келген соң алты ай бойы тек емізу керек керек. | Бала өмірге келген соң алты ай бойы тек емізу керек. |

Limitations

  • Output can occasionally switch to Russian or English on sentences containing foreign proper nouns
  • Trained on synthetic GEC data, so real-world error coverage may differ
  • Optimized for Kazakh Cyrillic; Latin-script Kazakh not supported
  • Identity preservation is 61.5%: the model sometimes modifies grammatically correct sentences
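
Because the model can switch language or rewrite sentences that were already correct, a caller may want to reject outputs that drift too far from the input. The sketch below is one possible guard and is not part of the released model: `guard_correction` is a hypothetical helper, and the 0.8 similarity threshold is an arbitrary assumption that would need tuning on real data.

```python
from difflib import SequenceMatcher


def guard_correction(original: str, corrected: str, min_ratio: float = 0.8) -> str:
    """Keep the model output only if it stays close to the input.

    GEC edits are normally small, so an output with low character-level
    similarity to the input (for example, a switch to another language)
    is discarded and the original text is returned instead.
    """
    ratio = SequenceMatcher(None, original, corrected).ratio()
    return corrected if ratio >= min_ratio else original


# A small spelling fix passes the guard.
print(guard_correction("Онын лирикасы адамды ерлик жасауға шакырады.",
                       "Оның лирикасы адамды ерлік жасауға шакырады."))
# A language switch is rejected and the input is returned unchanged.
print(guard_correction("Қазақ тілі", "Kazakh language"))
```

A guard of this kind trades recall for safety: it cannot restore a missed correction, and it will also reject legitimate rewrites that change a large share of the sentence.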

Benchmark Results

Evaluated on a 100-example custom GEC test set (pure model inference, with no pre- or post-processing pipeline).

| Category | Score |
|---|---|
| Spelling | 0/30 (0%) |
| Grammar | 1/20 (5%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 1/100 (1%) |
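
The total is the sum of the per-category counts over the 100 examples. A minimal sketch of that aggregation, with the category names given in English and the numbers copied from the table above:

```python
# (correct, examples) per category, as reported for this model.
category_scores = {
    "Spelling": (0, 30),
    "Grammar": (1, 20),
    "Punctuation": (0, 15),
    "Mixed": (0, 20),
    "Identity preservation": (0, 15),
}

correct = sum(hits for hits, _ in category_scores.values())
total = sum(count for _, count in category_scores.values())

print(f"Total {correct}/{total} ({correct / total:.0%})")
```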

Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punct/15 | Mixed/20 | Ident/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1/15 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |