sozkz-nllb-1b-kk-gec-v1

NLLB-200-1.3B (facebook/nllb-200-1.3B) fine-tuned for Kazakh grammatical error correction (GEC). It fixes spelling, morphology, punctuation, and word-order errors in Kazakh Cyrillic text.

Training

Stage	Dataset	Pairs	Epochs	LR
1 — Pretrain	sozkz-corpus-pretrain-gec-mix-v1	1.77M	2	2e-5
2 — Fine-tune	sozkz-corpus-synthetic-kk-gec-v1	19K	3	5e-6

Hardware: 1× H100 SXM 80GB, bf16.

Evaluation

Evaluated on the canonical 200-example Kazakh GEC test set (stukenov/sozkz-corpus-gec-benchmark-kk-v1, test split) with beam search (num_beams=5), source and target language both set to kaz_Cyrl.

Model	Exact Match	CER ↓	Word Precision	Word Recall	Word F0.5 ↑	Identity Preservation
sozkz-fix-mt5-50m-kk-gec-v1 (baseline)	62.0%	0.0802	0.494	0.661	0.520	100%
sozkz-nllb-1b-kk-pretrain-v1	43.5%	0.2643	0.206	0.543	0.235	61.5%
sozkz-nllb-1b-kk-gec-v1 (this model)	44.0%	0.2447	0.233	0.550	0.264	61.5%

Note: The NLLB model occasionally switches to Russian or English on sentences containing foreign proper nouns, which lowers the identity preservation score. The mT5 baseline (saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1) remains the recommended model for production use.
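The card does not state how the metrics are computed. As a minimal sketch, assuming CER is character-level Levenshtein distance divided by the reference length and Word F0.5 is the standard F-beta score with beta = 0.5, the columns could be reproduced roughly as follows. The function names are hypothetical and are not taken from the author's evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate (assumed definition: edits / reference length)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def exact_match(hypothesis: str, reference: str) -> bool:
    """True when the output equals the reference, ignoring outer whitespace."""
    return hypothesis.strip() == reference.strip()


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta = 0.5 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Baseline row from the table: word precision 0.494, word recall 0.661.
print(f"{f_beta(0.494, 0.661):.3f}")  # prints 0.520
```

With the baseline's reported word precision and recall, this formula gives 0.520, which matches the Word F0.5 value in the table.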

Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

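def correct(text: str) -> str:
    """Correct one Kazakh text; inputs longer than 256 tokens are truncated."""
    # NLLB expects the target-language token as the first generated token.
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    inputs = tokenizer(text, return_tensors="pt", max_length=256,
                       truncation=True).to(model.device)
    with torch.no_grad():
        # Beam search with 5 beams, matching the evaluation setting.
        out = model.generate(**inputs, forced_bos_token_id=forced_bos,
                             num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)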

print(correct("Қазақ тіліне мемлекеттик мәртебе бериліп, онын міндетті түрде қолданылу аясы белгіленуі мүмкін емес."))
# → Қазақ тіліне мемлекеттік мәртебе беріліп, оның міндетті түрде қолданылу аясы белгіленуі мүмкін емес.
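
Because the snippet above truncates inputs at 256 tokens, longer passages would need to be split before correction. A minimal sketch, assuming a simple punctuation-based splitter is good enough (it will mis-split abbreviations and initials). `split_sentences` and `correct_long_text` are hypothetical helpers, and `correct_fn` stands in for the `correct` function above.

```python
import re
from typing import Callable, List

# Split on whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; returns non-empty sentence strings."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def correct_long_text(text: str, correct_fn: Callable[[str], str]) -> str:
    """Apply correct_fn sentence by sentence and rejoin with single spaces."""
    return " ".join(correct_fn(s) for s in split_sentences(text))


# Usage with the model: correct_long_text(long_text, correct)
```

Correcting one sentence at a time also keeps each input close to the sentence-level pairs shown in the samples, although original paragraph breaks and spacing are not preserved by this sketch.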

Sample Corrections

Input (erroneous)	Output (corrected)
Қазақ тіліне мемлекеттик мәртебе бериліп, онын міндетті түрде қолданылу аясы белгіленуі мүмкін емес.	Қазақ тіліне мемлекеттік мәртебе беріліп, оның міндетті түрде қолданылу аясы белгіленуі мүмкін емес.
Онын лирикасы адамды ерлик жасауға шакырады.	Оның лирикасы адамды ерлік жасауға шакырады.
Бала өмірге келген соң алты ай бойы тек емізу керек керек.	Бала өмірге келген соң алты ай бойы тек емізу керек.

Limitations

Output can occasionally switch to Russian or English on sentences containing foreign proper nouns
Trained on synthetic GEC data, so real-world error coverage may differ
Optimized for Kazakh Cyrillic; Latin-script Kazakh is not supported
Identity preservation is 61.5%: the model sometimes modifies grammatically correct sentences
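
One way to limit the damage from language switching is to reject outputs that drift too far from the input, since genuine corrections usually change only a few characters. The following is a sketch of such a guard and is not part of the author's pipeline. The helper name `guarded_correction` and the 0.8 threshold are assumptions that would need tuning on real data.

```python
from difflib import SequenceMatcher


def guarded_correction(source: str, candidate: str,
                       min_similarity: float = 0.8) -> str:
    """Return candidate only if it stays close to source, else keep source.

    The similarity is difflib's ratio in [0, 1]; the threshold is an
    illustrative assumption, not a value taken from the model card.
    """
    similarity = SequenceMatcher(None, source, candidate).ratio()
    return candidate if similarity >= min_similarity else source


src = "Онын лирикасы адамды ерлик жасауға шакырады."
good = "Оның лирикасы адамды ерлік жасауға шакырады."
drifted = "This is an English sentence."

print(guarded_correction(src, good) == good)    # small edit is accepted
print(guarded_correction(src, drifted) == src)  # drifted output is rejected
```

A guard like this catches wholesale rewrites only. It does not detect small unwanted edits to sentences that were already correct, so it would not by itself raise the identity preservation score to 100%.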

Pretrain checkpoint: stukenov/sozkz-nllb-1b-kk-pretrain-v1
Recommended baseline: saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1
GEC benchmark: stukenov/sozkz-corpus-gec-benchmark-kk-v1

Benchmark Results

Evaluated on a 100-example custom GEC test set (pure model inference, no pre- or post-processing pipeline).

Category	Score
Spelling	0/30 (0%)
Grammar	1/20 (5%)
Punctuation	0/15 (0%)
Mixed	0/20 (0%)
Identity preservation	0/15 (0%)
Total	1/100 (1%)
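
The Total row appears to be the plain sum of the per-category counts, so each category contributes in proportion to its number of examples. A small sketch of that arithmetic, with the counts copied from the table above and the category names written in English:

```python
# (correct, examples) per category, copied from the table above.
scores = {
    "Spelling": (0, 30),
    "Grammar": (1, 20),
    "Punctuation": (0, 15),
    "Mixed": (0, 20),
    "Identity preservation": (0, 15),
}

correct_count = sum(c for c, _ in scores.values())
example_count = sum(n for _, n in scores.values())

print(f"Total: {correct_count}/{example_count} "
      f"({100 * correct_count / example_count:.0f}%)")  # Total: 1/100 (1%)
```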

Leaderboard (100-example custom benchmark)

Model	Total	Spelling/30	Grammar/20	Punctuation/15	Mixed/20	Identity/15
sozkz-core-llama-600m-kk-gec-v1	47%	15	12	3	2	15/15
sozkz-fix-qwen-500m-kk-gec-v3	38%	0	16	9	0	13/15
sozkz-core-llama-300m-kk-gec-v4	37%	9	6	4	3	15/15
sozkz-fix-qwen-500m-kk-gec-v1	35%	0	12	8	0	15/15
sozkz-fix-qwen-500m-kk-gec-v2	30%	0	11	7	0	12/15
sozkz-core-llama-1b-kk-gec-v1	16%	2	6	1	0	7/15
sozkz-fix-qwen-500m-kk-gec-v4	5%	0	1	4	0	0/15
sozkz-fix-mt5b-kk-gec-run13-v1	5%	0	2	0	0	3/15
sozkz-nllb-1b-kk-gec-v1	1%	0	1	0	0	0/15
sozkz-nllb-1b-kk-pretrain-v1	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v3	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v1/v2a/v2b	0–1%	0	0	0	0	0–1
sozkz-fix-mt5-50m-kk-gec-v1	0%	0	0	0	0	0/15
