RapBank — Final Dataset
Processed and filtered subset of the RapBank corpus, ready for model training.
Data Source
Raw audio and transcripts originate from:
./work/corpus/rap/raw/
Audio files are source-separated vocal clips, segmented by voice activity
detection (VAD) and stored as .flac under vocal_cut/{video_id}/. Transcripts
were produced by Whisper and cross-validated against Parakeet TDT; clips
longer than 30 s were removed.
The full preprocessing pipeline is documented in ./work/corpus/rap/raw/README.md.
Dataset Format
Each split (train/, valid/, test/) contains two manifest files:
utt2fpath.tsv
Maps utterance ID to its audio file path (relative to ./work/corpus/rap/raw/):
{utt_id}\t{relfpath}
Example:
TTEiUDYwk84_1_29390 vocal_cut/TTEiUDYwk84/TTEiUDYwk84_1_29390.flac
utt2asrtext.tsv
Maps utterance ID to its ASR transcript (Whisper, cross-validated against Parakeet):
{utt_id}\t{transcript}
Example:
TTEiUDYwk84_1_29390 Who want it, nigga? We the violentest...
The utt_id is the filename stem of the audio clip and is consistent across
all utt2*.tsv files within a split.
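Because both manifests share the same utt_id key, a split can be loaded by reading each file into a dictionary and joining on the ID. The following is a minimal sketch, not part of the repository: the helper names read_manifest and load_split are hypothetical, and it assumes each line holds exactly one tab between the ID and the value (transcripts are split on the first tab only, so quotes and punctuation pass through untouched).

```python
from pathlib import Path


def read_manifest(path):
    """Read a two-column, tab-separated manifest into a dict keyed by utt_id."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            utt_id, sep, value = line.partition("\t")
            if not sep:
                raise ValueError(f"{path}:{lineno}: expected a tab separator")
            table[utt_id] = value
    return table


def load_split(split_dir, audio_root):
    """Join utt2fpath.tsv and utt2asrtext.tsv of one split on utt_id.

    audio_root is the directory the relative paths are resolved against
    (./work/corpus/rap/raw/ according to this README).
    """
    split_dir = Path(split_dir)
    fpaths = read_manifest(split_dir / "utt2fpath.tsv")
    texts = read_manifest(split_dir / "utt2asrtext.tsv")

    # IDs present in only one of the two manifests indicate a broken split.
    mismatched = fpaths.keys() ^ texts.keys()
    if mismatched:
        raise ValueError(f"{len(mismatched)} utt_id(s) missing from one manifest")

    return [
        {
            "utt_id": utt_id,
            "audio_path": Path(audio_root) / relfpath,
            "text": texts[utt_id],
        }
        for utt_id, relfpath in fpaths.items()
    ]
```

For example, load_split("work/corpus/rap/dataset_final/train", "work/corpus/rap/raw") would return one record per utterance, in the order the IDs appear in utt2fpath.tsv.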
Planned: Arrow Files
Arrow files (.arrow, the Apache Arrow format used by Hugging Face Datasets)
will be generated from the utt2*.tsv manifests in a future preparation step.
Each shard will bundle the raw audio waveform alongside all metadata columns.
Current Contents
dataset_final/
├── train/
│   ├── utt2fpath.tsv
│   └── utt2asrtext.tsv
├── valid/
│   ├── utt2fpath.tsv
│   └── utt2asrtext.tsv
└── test/
    ├── utt2fpath.tsv
    └── utt2asrtext.tsv
No .arrow files have been generated yet.
Generating the Splits
python egs/taste_s/rslm/scripts/data/prepare_data.py \
    --src_tsv work/corpus/rap/raw/final_valid_ids.tsv \
    --tgt_dir work/corpus/rap/dataset_final \
    --valid_size 1000 \
    --test_size 1000
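After the command finishes, it can be worth confirming that the three splits do not share utterances and that their sizes look as expected. The sketch below is an independent sanity check, not part of prepare_data.py; the function names are hypothetical, and it only assumes the directory layout and manifest format described above. Whether --valid_size and --test_size count utterances is also an assumption, so the sizes are returned for inspection rather than asserted.

```python
from itertools import combinations
from pathlib import Path


def read_ids(tsv_path):
    """Collect the utt_id column (text before the first tab) of a manifest."""
    ids = set()
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                ids.add(line.split("\t", 1)[0])
    return ids


def check_splits(dataset_dir, splits=("train", "valid", "test")):
    """Raise if any two splits share an utt_id; return the size of each split."""
    dataset_dir = Path(dataset_dir)
    ids = {s: read_ids(dataset_dir / s / "utt2fpath.tsv") for s in splits}

    # Compare every pair of splits once.
    for a, b in combinations(splits, 2):
        overlap = ids[a] & ids[b]
        if overlap:
            raise ValueError(f"{a} and {b} share {len(overlap)} utt_id(s)")

    return {s: len(v) for s, v in ids.items()}
```

Calling check_splits("work/corpus/rap/dataset_final") returns a dictionary mapping each split name to its number of unique utterance IDs.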