RapBank — Final Dataset

Processed and filtered subset of the RapBank corpus, ready for model training.


Data Source

Raw audio and transcripts originate from:

./work/corpus/rap/raw/

Audio files are source-separated vocal clips, segmented by voice activity detection (VAD) and stored as .flac under vocal_cut/{video_id}/. Transcripts were produced by Whisper and cross-validated against Parakeet TDT; clips longer than 30 s were removed. The full preprocessing pipeline is documented in ./work/corpus/rap/raw/README.md.


Dataset Format

Each split (train/, valid/, test/) contains two manifest files:

utt2fpath.tsv

Maps utterance ID to its audio file path (relative to ./work/corpus/rap/raw/):

{utt_id}\t{relfpath}

Example:

TTEiUDYwk84_1_29390    vocal_cut/TTEiUDYwk84/TTEiUDYwk84_1_29390.flac

utt2asrtext.tsv

Maps utterance ID to its ASR transcript (Whisper, cross-validated against Parakeet):

{utt_id}\t{transcript}

Example:

TTEiUDYwk84_1_29390    Who want it, nigga? We the violentest...

The utt_id is the filename stem of the audio clip and is consistent across all utt2*.tsv files within a split.
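
Because both manifests share the same utt_id keys, a split can be loaded by reading each file into a dictionary and joining on the key. The following is a minimal sketch, not part of the repository: the helper names read_manifest and load_split are hypothetical, and it assumes each line contains exactly one utt_id, a tab, and the value.

```python
from pathlib import Path


def read_manifest(path):
    """Read a two-column utt2*.tsv file into a dict keyed by utt_id."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            # Split on the first tab only, so the value is kept whole.
            utt_id, value = line.split("\t", 1)
            table[utt_id] = value
    return table


def load_split(split_dir, audio_root):
    """Join utt2fpath.tsv and utt2asrtext.tsv of one split on utt_id.

    audio_root is the directory that the relative paths refer to
    (./work/corpus/rap/raw/ in this dataset).
    """
    split_dir = Path(split_dir)
    paths = read_manifest(split_dir / "utt2fpath.tsv")
    texts = read_manifest(split_dir / "utt2asrtext.tsv")
    if paths.keys() != texts.keys():
        raise ValueError("utt_id sets differ between the two manifests")
    return [
        {"utt_id": utt_id, "audio_path": Path(audio_root) / rel, "text": texts[utt_id]}
        for utt_id, rel in paths.items()
    ]
```

For example, load_split("work/corpus/rap/dataset_final/train", "work/corpus/rap/raw") would return one record per utterance, each with a resolved audio path and its transcript.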


Planned: Arrow Files

.arrow (Apache Arrow / HuggingFace Datasets format) files will be generated from the utt2*.tsv manifests in a future preparation step. Each shard will bundle the raw audio waveform alongside all metadata columns.


Current Contents

dataset_final/
├── train/
│   ├── utt2fpath.tsv
│   └── utt2asrtext.tsv
├── valid/
│   ├── utt2fpath.tsv
│   └── utt2asrtext.tsv
└── test/
    ├── utt2fpath.tsv
    └── utt2asrtext.tsv

No .arrow files have been generated yet.


Generating the Splits

python egs/taste_s/rslm/scripts/data/prepare_data.py \
    --src_tsv work/corpus/rap/raw/final_valid_ids.tsv \
    --tgt_dir work/corpus/rap/dataset_final \
    --valid_size 1000 \
    --test_size  1000
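
prepare_data.py is not reproduced in this card, so its actual splitting logic is not known here. As an illustration only, the sketch below shows one way fixed-size valid and test sets could be carved out of a list of utterance IDs. It assumes a seeded random shuffle; the function split_ids and its seed parameter are hypothetical and do not come from the script.

```python
import random


def split_ids(utt_ids, valid_size, test_size, seed=0):
    """Partition utterance IDs into train/valid/test with fixed held-out sizes.

    Assumption: a seeded random shuffle. The real script may split differently.
    """
    ids = sorted(utt_ids)  # sort first so the result does not depend on input order
    if valid_size + test_size > len(ids):
        raise ValueError("valid_size + test_size exceeds the number of utterances")
    random.Random(seed).shuffle(ids)
    return {
        "valid": ids[:valid_size],
        "test": ids[valid_size:valid_size + test_size],
        "train": ids[valid_size + test_size:],
    }
```

Note that shuffling at the utterance level can place clips from the same video_id in different splits; whether the actual script groups by video is not stated in this card.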