Omnilingual ASR CTC 300M v2, FLEURS Persian + Thomcles Continuation

This repository contains the step-5000 checkpoint obtained by continuing the training of Peacockery/omni-ctc-300m-v2-fleurs-fa-ir on the Thomcles/Persian-Farsi-Speech dataset.

The checkpoint is a fairseq2 / Omnilingual ASR checkpoint. It is not packaged as a Transformers model.

Files

  • checkpoint-step-5000.pt: final continued model checkpoint.
  • fairseq2_card.yaml: local fairseq2 asset card for the checkpoint.
  • training-config.yaml: continuation training configuration.
  • benchmarks/fleurs-test-step5000-thomcles-summary.md: FLEURS fa_ir test benchmark after Thomcles continuation.
  • benchmarks/fleurs-test-before-thomcles-summary.md: FLEURS fa_ir test benchmark before Thomcles continuation.
  • dev-scores/: Thomcles dev WER scores saved during continuation training.
  • data/thomcles-language_distribution_0.tsv: prepared Thomcles training-hour summary.
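
Because the checkpoint is not a Transformers model, the files have to be fetched individually. The sketch below shows one way to do that with the huggingface_hub client. The repository id, the `CORE_FILES` list, the `download_core_files` helper, and the `omni-ctc-fa` target directory are assumptions made for illustration; loading the checkpoint into fairseq2 afterwards is not covered here.

```python
# Assumed repository id for this model card; adjust if the repo lives elsewhere.
REPO_ID = "Peacockery/omni-ctc-300m-v2-fleurs-fa-ir-thomcles-continue"

# File names taken from the Files list above.
CORE_FILES = [
    "checkpoint-step-5000.pt",
    "fairseq2_card.yaml",
    "training-config.yaml",
]


def download_core_files(local_dir: str = "omni-ctc-fa") -> dict:
    """Download the checkpoint, asset card, and training config.

    Returns a mapping from file name to the local path of the download.
    """
    # Imported lazily so the constants above can be used without the package.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in CORE_FILES
    }


# Example (requires network access and `pip install huggingface_hub`):
# paths = download_core_files()
# print(paths["checkpoint-step-5000.pt"])
```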

Results

FLEURS fa_ir test, 871 samples:

| Checkpoint | WER | CER |
|---|---|---|
| FLEURS + Thomcles step 5000 | 18.02% | 5.11% |
| FLEURS-only step 5000 | 18.55% | 5.28% |
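
To put the two FLEURS rows in perspective, the following minimal sketch computes the absolute and relative error-rate change between the FLEURS-only checkpoint and the continued checkpoint. The numbers are copied from the results above; the `relative_reduction` helper is a hypothetical name introduced for this example.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative error-rate reduction in percent."""
    return (before - after) / before * 100.0


# FLEURS fa_ir test scores in percent, copied from the results above.
fleurs_only = {"WER": 18.55, "CER": 5.28}
continued = {"WER": 18.02, "CER": 5.11}

for metric in ("WER", "CER"):
    absolute = fleurs_only[metric] - continued[metric]
    relative = relative_reduction(fleurs_only[metric], continued[metric])
    print(f"{metric}: {absolute:.2f} points absolute, {relative:.2f}% relative")
```

By this calculation the continuation lowers FLEURS WER by 0.53 points (about 2.9% relative) and CER by 0.17 points (about 3.2% relative).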

Thomcles dev validation:

| Step | WER |
|---|---|
| 500 | 31.41% |
| 1000 | 28.91% |
| 1500 | 26.54% |
| 2000 | 26.11% |
| 2500 | 25.64% |
| 3000 | 24.69% |
| 3500 | 24.21% |
| 4000 | 23.95% |
| 4500 | 23.73% |
| 5000 | 23.62% |
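
The dev scores above can be turned into per-interval gains to see how quickly the continuation converges. This is a small illustrative sketch using the values from the table; the `stepwise_deltas` helper is a hypothetical name, not part of the training code.

```python
# (step, dev WER in percent), copied from the Thomcles dev validation scores.
dev_wer = [
    (500, 31.41), (1000, 28.91), (1500, 26.54), (2000, 26.11), (2500, 25.64),
    (3000, 24.69), (3500, 24.21), (4000, 23.95), (4500, 23.73), (5000, 23.62),
]


def stepwise_deltas(scores):
    """Return (step, WER drop since the previous evaluation) pairs."""
    return [
        (step, round(prev_wer - wer, 2))
        for (_, prev_wer), (step, wer) in zip(scores, scores[1:])
    ]


for step, drop in stepwise_deltas(dev_wer):
    print(f"step {step}: -{drop:.2f} WER points")

total_drop = dev_wer[0][1] - dev_wer[-1][1]
print(f"total drop from step 500 to step 5000: {total_drop:.2f} points")
```

The drop per 500 steps falls from 2.50 points at step 1000 to 0.11 points at step 5000, for a total of 7.79 points, which suggests the dev WER was close to flattening by the final checkpoint.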

Training Notes

  • Starting checkpoint: Peacockery/omni-ctc-300m-v2-fleurs-fa-ir
  • Continuation dataset: Thomcles/Persian-Farsi-Speech
  • Prepared Thomcles data: 108,306 train rows, 1,095 dev rows, 417.46 total hours
  • Tokenizer: omniASR_tokenizer_written_v2
  • Continuation steps: 5000
  • Optimizer learning rate: 1e-5
  • Gradient accumulation: 8 batches
  • Precision: bfloat16

See training-config.yaml for the exact trainer settings.

