arxiv:2508.08680

TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

Published on Aug 12, 2025 · Submitted by Armel Randy Zebaze on Aug 13, 2025

Abstract

TopXGen uses LLMs to generate high-quality, topic-diverse target-side texts for LRLs, which can be backtranslated to improve translation performance in ICL and fine-tuning.

LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help, but the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, most frequently through backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good-quality, relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high-quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good-quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.
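
The generate-then-backtranslate idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `build_synthetic_corpus`, the `ParallelPair` type, the prompt wording, and the `generate` and `backtranslate` callables are all hypothetical names introduced here. In practice the two callables would wrap an LLM and an MT system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ParallelPair:
    source: str  # high-resource side, obtained by backtranslation
    target: str  # low-resource side, obtained by LLM generation
    topic: str   # topic that seeded the generation


def build_synthetic_corpus(
    topics: List[str],
    target_lang: str,
    source_lang: str,
    generate: Callable[[str], str],
    backtranslate: Callable[[str, str, str], str],
    texts_per_topic: int = 2,
) -> List[ParallelPair]:
    """Generate target-side texts per topic, then backtranslate each one.

    `generate(prompt)` and `backtranslate(text, from_lang, to_lang)` are
    hypothetical hooks; the prompt below is an assumed wording.
    """
    pairs: List[ParallelPair] = []
    for topic in topics:
        for _ in range(texts_per_topic):
            prompt = (
                f"Write a short paragraph in {target_lang} "
                f"about the topic: {topic}."
            )
            target_text = generate(prompt).strip()
            if not target_text:
                continue  # skip empty generations
            source_text = backtranslate(
                target_text, target_lang, source_lang
            ).strip()
            if source_text:
                pairs.append(ParallelPair(source_text, target_text, topic))
    return pairs
```

Iterating over an explicit topic list is one simple way to steer the generated texts toward topical diversity; the resulting pairs could then serve as in-context examples or as fine-tuning data.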

Community

Paper submitter

We introduce TopXGen, a pipeline for generating high-quality, topic-diverse synthetic data for low-resource languages using LLMs. While LLMs often struggle to correctly translate into LRLs, their multilingual capabilities allow them to produce decent, natural-sounding text in these languages, which can then be back-translated into a high-resource language to create parallel datasets. Unlike traditional back-translation, TopXGen does not require large existing corpora in the target language. We demonstrate that TopXGen improves MT performance in both supervised fine-tuning and in-context learning settings.

Code: https://github.com/ArmelRandy/topxgen


