This repository provides a controlled-access Kazakh-Russian mixed-speech dataset prepared for academic research on automatic speech recognition (ASR).

The dataset focuses on speech where Kazakh is the dominant language, while Russian words or short Russian phrases may appear naturally inside the utterance.

The main goal of the dataset is to support reproducible experiments for:

Kazakh speech recognition;
Kazakh-Russian mixed-speech transcription;
code-switching ASR evaluation;
comparison of Whisper-based and XLS-R-based ASR systems;
low-resource and mixed-language speech recognition research.

Access Notice

This dataset is shared as a controlled-access research resource, not as a fully open public corpus.

Some custom mixed-speech materials were prepared from publicly available online speech sources. Public availability of the original sources does not automatically make derived audio fragments suitable for unrestricted redistribution. For this reason, access is restricted, and requests may be considered for academic research, verification, or evaluation purposes.

Users requesting access should agree that the dataset will not be used for:

commercial redistribution;
re-publication of raw audio materials;
speaker identification;
biometric profiling;
demographic profiling;
surveillance or tracking of individuals;
high-stakes decision-making systems.

Dataset Scope

This dataset is intended for speech-to-text / ASR only.

It does not include speaker identity labels, demographic labels, personal profile labels, or biometric annotations.

The dataset was prepared to evaluate how ASR models handle informal Kazakh-dominant speech with Russian insertions, filler words, repetitions, conversational pronunciation, and variable recording conditions.

Dataset Composition

The Hugging Face dataset repository focuses on the custom Kazakh-Russian mixed-speech subset used in the KRASR experiments.

Split / subset	Source label	Files / rows	Duration
Train mixed subset	`mixed`	5,974	21.069 h
Validation mixed subset	`mixed`	578	2.023 h
Test mixed subset	`mixed`	588	2.149 h
Total mixed subset	`mixed`	7,140	25.241 h
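
The totals in the table can be checked with a few lines of arithmetic. The sketch below copies the row counts and durations from the table and also derives the mean utterance length per split, a figure this card does not state directly. The variable names are illustrative.

```python
import math

# (rows, hours) per split, copied from the composition table.
splits = {
    "train": (5974, 21.069),
    "validation": (578, 2.023),
    "test": (588, 2.149),
}

total_rows = sum(rows for rows, _ in splits.values())
total_hours = sum(hours for _, hours in splits.values())

# The table reports 7,140 rows and 25.241 h for the whole mixed subset.
assert total_rows == 7140
assert math.isclose(total_hours, 25.241, abs_tol=1e-6)

# Mean utterance length in seconds (derived here, not stated in the table).
mean_seconds = {
    name: hours * 3600 / rows for name, (rows, hours) in splits.items()
}
for name, seconds in mean_seconds.items():
    print(f"{name}: {seconds:.1f} s per utterance on average")
```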

In the full thesis experiments, additional Kazakh and Russian support data from external corpora were used internally for training and comparison. Those external support resources are not redistributed as part of this dataset.

Relation to Thesis Manifests

The internal thesis experiments used four main manifest groups:

Manifest	Purpose	Sources	Rows	Duration
`train_all.jsonl`	Model training	mixed, KSC, Golos	23,726	47.071 h
`val_all.jsonl`	Checkpoint monitoring	mixed, KSC, Golos	1,269	3.024 h
`test_mixed.jsonl`	Main mixed-speech evaluation	mixed	588	2.149 h
`test_pure.jsonl`	Pure-language evaluation	KSC, Golos	685	1.001 h

The Hugging Face dataset page hosts only the controlled-access KRASR mixed-speech subset (source label `mixed`). The Kazakh and Russian support corpora used in the full internal setup (KSC and Golos in the table above) are available separately and are not part of this repository.

Data Fields

Typical dataset entries contain:

Field	Description
`audio`	Audio file or audio object used for ASR input
`text`	Normalized reference transcription
`source`	Source label, usually `mixed` for the custom mixed-speech subset
`duration`	Audio duration in seconds, if provided

Example entry:

{
  "audio": "path/to/audio.wav",
  "text": "normalized reference transcription",
  "source": "mixed",
  "duration": 8.42
}

Field names may depend on the exact split or export format. Users should inspect `dataset["train"][0].keys()` after loading.
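
Because field names can vary, a small guard run before training or evaluation can catch a mismatched export early. The helper below is a sketch: `REQUIRED_FIELDS`, `OPTIONAL_FIELDS`, and `check_entry` are illustrative names, and treating `audio` and `text` as required while `source` and `duration` are optional is an assumption based on the field table above.

```python
REQUIRED_FIELDS = {"audio", "text"}       # assumed to be always present
OPTIONAL_FIELDS = {"source", "duration"}  # assumed to depend on the export


def check_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems found in one entry."""
    problems = []
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        problems.append(f"missing required fields: {sorted(missing)}")
    text = entry.get("text")
    if "text" in entry and (not isinstance(text, str) or not text.strip()):
        problems.append("empty or non-string transcript")
    duration = entry.get("duration")
    if duration is not None and duration <= 0:
        problems.append(f"non-positive duration: {duration}")
    unknown = set(entry) - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if unknown:
        problems.append(f"unexpected fields: {sorted(unknown)}")
    return problems


example = {
    "audio": "path/to/audio.wav",
    "text": "normalized reference transcription",
    "source": "mixed",
    "duration": 8.42,
}
print(check_entry(example))  # an empty list means no problems were found
```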

Loading the Dataset

After access is approved, the dataset can be loaded with the Hugging Face datasets library.

pip install -U datasets soundfile librosa

from datasets import load_dataset, Audio

dataset = load_dataset("KRASR/kazakh-russian-asr-dataset")

print(dataset)
print(dataset["train"][0].keys())

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

sample = dataset["train"][0]
audio = sample["audio"]
text = sample["text"]

print(audio["array"].shape)
print(audio["sampling_rate"])
print(text)
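
When the `duration` field is absent, it can be recovered from the decoded audio as the number of samples divided by the sampling rate. A minimal sketch, assuming the decoded sample exposes a one-dimensional `array` and a `sampling_rate` as in the snippet above; `duration_seconds` is an illustrative helper, not part of the `datasets` API.

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of a mono clip: number of samples divided by the sampling rate."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


# With a decoded sample this would be called as:
#   duration_seconds(len(audio["array"]), audio["sampling_rate"])
# Stand-in values: 134,720 samples at 16 kHz.
print(duration_seconds(134_720, 16_000))  # 8.42
```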

Preparing a JSONL Manifest

For ASR evaluation, users may convert the dataset into a simple JSONL manifest.

import json
from pathlib import Path
from datasets import load_dataset, Audio

dataset = load_dataset("KRASR/kazakh-russian-asr-dataset", split="train")
# Only paths and metadata are written here, so skip audio decoding entirely.
dataset = dataset.cast_column("audio", Audio(decode=False))

output_path = Path("train_manifest.jsonl")

with output_path.open("w", encoding="utf-8") as f:
    for row in dataset:
        item = {
            # May be a bare file name or None if the audio is embedded in the dataset files.
            "audio": row["audio"]["path"],
            "text": row["text"],
            # Fall back to defaults when a field is missing from this export.
            "source": row.get("source", "mixed"),
            "duration": row.get("duration", None),
        }
        # ensure_ascii=False keeps Cyrillic text readable in the manifest.
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

print(f"Saved manifest to {output_path}")
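
A manifest in this format can be read back to confirm row counts and total hours per source before running an evaluation. The function below is a sketch that assumes the `source` and `duration` keys used above; `summarize_manifest` is an illustrative name.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_manifest(path: Path) -> dict[str, tuple[int, float]]:
    """Return {source: (row_count, total_hours)} for a JSONL manifest."""
    counts = defaultdict(int)
    seconds = defaultdict(float)
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            item = json.loads(line)
            source = item.get("source") or "unknown"
            counts[source] += 1
            # Rows without a duration contribute to the count only.
            seconds[source] += item.get("duration") or 0.0
    return {s: (counts[s], seconds[s] / 3600) for s in counts}
```

Calling `summarize_manifest(Path("train_manifest.jsonl"))` on the full training split should give figures close to the composition table, provided every row carries a `duration` value.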

Preprocessing

The dataset was prepared using a consistent ASR preprocessing workflow.

Audio preprocessing included:

segmentation into short speech utterances;
filtering of unusable or corrupted segments;
conversion to mono audio;
resampling to 16 kHz for training and evaluation;
validation of audio paths, durations, and non-empty transcripts.
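
The mono, 16 kHz, and non-empty checks listed above can be approximated with the standard library alone. This is a sketch under two assumptions: the files are uncompressed PCM WAV (the only format Python's `wave` module reads), and `check_wav` is an illustrative helper rather than the script used to build the dataset.

```python
import wave
from pathlib import Path


def check_wav(path: Path, expected_rate: int = 16_000) -> list[str]:
    """Return problems found in one PCM WAV file (empty list means OK)."""
    if not path.is_file():
        return [f"missing file: {path}"]
    try:
        with wave.open(str(path), "rb") as wav:
            channels = wav.getnchannels()
            rate = wav.getframerate()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        return [f"unreadable WAV: {exc}"]
    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    if frames == 0:
        problems.append("file contains no audio frames")
    return problems
```

For other containers such as FLAC or MP3, the same three checks would need a library like `soundfile` instead of `wave`.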

Text normalization included:

lowercasing;
whitespace normalization;
punctuation cleanup;
removal of formatting noise;
preservation of Kazakh Cyrillic letters;
preservation of Russian words in mixed utterances;
preservation of filler words, repetitions, Russian insertions, and meaningful informal expressions when clearly present in the audio.

Kazakh-specific letters were preserved, including characters such as:

ә ғ қ ң ө ұ ү һ і
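
The normalization steps listed above can be sketched as a single function. This is not the exact normalizer used for the dataset: `normalize_text` is an illustrative name, and replacing every punctuation mark (including hyphens and apostrophes) with a space is an assumption. Kazakh-specific letters survive because they count as word characters for Python's Unicode-aware `\w`.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)  # one canonical form per letter
    text = text.lower()                        # also lowercases Ә, Ғ, Қ, Ң, ...
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation becomes a space
    text = text.replace("_", " ")              # "_" counts as \w, so drop it here
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


# Made-up utterance with a Russian insertion, for illustration only.
print(normalize_text("  Сәлем,   ҚАЛАЙСЫҢ?  Нормально! "))
```

Applying the same function to both references and model predictions keeps WER and CER comparable across systems.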

Recommended Evaluation Use

For fair ASR evaluation:

apply the same text normalization to references and predictions;
report both WER and CER;
inspect substitutions, deletions, and insertions;
check hypothesis-to-reference length ratio;
manually inspect possible hallucination-like outputs for generative models;
evaluate mixed-speech and pure-language behavior separately when possible.
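
Substitution, deletion, and insertion counts and the hypothesis-to-reference length ratio can be computed without extra dependencies. The sketch below uses a standard word-level Levenshtein alignment; `edit_ops` and `length_ratio` are illustrative names, and when several minimal alignments exist the split between operation types depends on the backtrace order chosen here.

```python
def edit_ops(ref: list[str], hyp: list[str]) -> dict[str, int]:
    """Count substitutions, deletions, and insertions in one minimal alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to label each edit.
    ops = {"sub": 0, "del": 0, "ins": 0}
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            ops["sub"] += mismatch  # a matching word costs nothing
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops["del"] += 1
            i -= 1
        else:
            ops["ins"] += 1
            j -= 1
    return ops


def length_ratio(ref: list[str], hyp: list[str]) -> float:
    """Hypothesis-to-reference length ratio in words."""
    return len(hyp) / max(len(ref), 1)


# Made-up pair for illustration only.
ref = "мен бүгін жұмысқа бардым".split()
hyp = "мен бүгін на работу бардым".split()
ops = edit_ops(ref, hyp)
print(ops, sum(ops.values()) / len(ref), length_ratio(ref, hyp))
```

A length ratio well above 1 combined with a high insertion count is one possible signal that a generative model produced extra text and that the output deserves manual inspection.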

The KRASR thesis experiments used the dataset to evaluate:

Whisper-Small baseline;
Whisper-Small LoRA;
Whisper-Small full fine-tuning;
XLS-R 1B CTC;
Whisper Large-v3 LoRA.

Example Evaluation Setup

# Requires: pip install evaluate jiwer (both metrics depend on jiwer).
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

predictions = [
    "model transcription one",
    "model transcription two",
]

references = [
    "reference transcription one",
    "reference transcription two",
]

wer = wer_metric.compute(predictions=predictions, references=references)
cer = cer_metric.compute(predictions=predictions, references=references)

print(f"WER: {wer:.4f}")
print(f"CER: {cer:.4f}")

Intended Use

This dataset is intended for:

academic ASR research;
Kazakh speech recognition experiments;
Kazakh-Russian mixed-speech transcription;
code-switching ASR evaluation;
benchmarking ASR models on mixed-language speech;
reproducing KRASR thesis-style experiments;
testing speech-to-text systems under realistic mixed-speech conditions.

Out-of-Scope Use

This dataset is not intended for:

speaker identification;
speaker verification;
biometric recognition;
demographic classification;
emotion or personality inference;
surveillance or tracking of individuals;
commercial redistribution of audio;
building high-stakes decision-making systems;
re-publishing raw audio fragments or transcripts outside the agreed access conditions.

Limitations

The dataset has several limitations:

it may not represent all Kazakh-Russian speaking styles;
the speech is domain-specific and partly conversational;
some recordings may contain background noise or informal pronunciation;
transcription quality depends on the clarity of the source audio;
exact timestamp-level provenance may not be available for every fragment;
the dataset should not be treated as a complete public benchmark for all Kazakh-Russian ASR scenarios.

The dataset is best understood as a task-specific research resource for evaluating Kazakh-dominant mixed speech with Russian insertions.

Ethical Considerations

The dataset was prepared for speech recognition research only.

Speaker identity, personal profiles, demographic labels, and biometric attributes were not used as training targets. When speaker context was considered during split design, it was used only to reduce overlap between training and final test data.

Because some source materials came from publicly available online speech, the dataset is managed as a controlled-access resource. This helps reduce the risk of unrestricted redistribution of derived audio fragments while still allowing research verification and evaluation.

Users should review model outputs carefully before using them in formal contexts, because ASR systems may produce omissions, substitutions, or extra generated text.

Related KRASR Resources

KRASR/kazakh-russian-asr-whisper-small-lora
KRASR/kazakh-russian-asr-whisper-small-full-ft
KRASR/kazakh-russian-asr-whisper-large-v3-lora
KRASR/kazakh-russian-asr-xls-r-1b-ctc
KRASR/kazakh-russian-speech-to-text-module

Citation

There is no formal publication for this dataset yet.

If you use this dataset, its description, or its preparation workflow in academic work, please cite or mention the KRASR Hugging Face repository and the related thesis project:

@misc{krasr_kazakh_russian_asr_dataset_2026,
  title        = {Kazakh-Russian ASR Dataset},
  author       = {Mukhambet, Madiyar and Makhmud, Danial},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/KRASR/kazakh-russian-asr-dataset}},
  note         = {Controlled-access dataset for Kazakh and Kazakh-Russian mixed-speech ASR}
}

Related thesis project:

Madiyar Mukhambet and Danial Makhmud.
Development of a Software Module for Automatic Kazakh Speech-to-Text Conversion Based on Fine-Tuned Whisper-Small Model.
Astana IT University, 2026.

Contact

Access requests may be considered for academic research, verification, or evaluation purposes.

When requesting access, please briefly describe:

your affiliation or research context;
intended use of the dataset;
whether you need training, validation, or evaluation access;
confirmation that the dataset will not be used for speaker identification, biometric profiling, or redistribution.
