Qwen2.5-0.5B-grpo-summarization-quality-meteor-rouge

The model was trained in two stages: first with a length-only reward to learn output-length control, then with quality-only rewards (no explicit length penalty). This checkpoint uses METEOR + ROUGE as the stage-2 quality signal.
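
The two-stage reward schedule described above can be sketched roughly as follows. This is an illustration only, not the training code: `length_reward`, `unigram_f1`, and `stage_reward` are hypothetical names, the linear length penalty is an assumed form, whitespace splitting stands in for real tokenization, and a ROUGE-1-style unigram F1 stands in for the actual METEOR + ROUGE signal.

```python
from collections import Counter

TARGET_LEN = 50  # target length from the card; counting whitespace words is an assumption


def length_reward(summary: str, target: int = TARGET_LEN) -> float:
    """Hypothetical stage-1 reward: 0 at the target length, more negative further away."""
    n = len(summary.split())  # whitespace split stands in for real tokenization
    return -abs(n - target) / target


def unigram_f1(summary: str, reference: str) -> float:
    """ROUGE-1-style unigram F1, a simplified stand-in for METEOR + ROUGE."""
    cand = Counter(summary.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def stage_reward(stage: int, summary: str, reference: str) -> float:
    """Stage 1 scores length only; stage 2 scores quality only (no length term)."""
    if stage == 1:
        return length_reward(summary)
    return unigram_f1(summary, reference)
```

Because stage 2 carries no length term, any length control in the final checkpoint has to persist from stage 1 rather than being enforced by the reward.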

G-Eval Scores (each 0–1; Composite max 4.0)

Faithfulness	Coverage	Conciseness	Clarity	Composite	Pass Rate
0.853	0.489	0.692	0.762	2.796	38.3%

Evaluated on 200 examples from mlabonne/smoltldr test split · judge: gpt-5-mini-2025-08-07 via DeepEval G-Eval (5 rounds averaged, each metric 0–1)
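
The reported Composite values are consistent with a plain sum of the four metrics. The sketch below shows that arithmetic together with per-metric averaging over repeated judge rounds. The helper names `average_rounds` and `composite` are hypothetical and are not part of DeepEval.

```python
from statistics import mean

METRICS = ("faithfulness", "coverage", "conciseness", "clarity")


def average_rounds(rounds: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over repeated judge rounds (the card uses 5)."""
    return {m: mean(r[m] for r in rounds) for m in METRICS}


def composite(scores: dict[str, float]) -> float:
    """Sum of the four 0-1 metrics, so the maximum is 4.0."""
    return sum(scores[m] for m in METRICS)


# Reported averages for this checkpoint, copied from the table above.
reported = {"faithfulness": 0.853, "coverage": 0.489, "conciseness": 0.692, "clarity": 0.762}
print(round(composite(reported), 3))  # 2.796
```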

Eval Rollouts

Full per-example outputs, G-Eval scores, significance tests, and summary tables are in the dataset repo: YuvrajSingh9886/reddit-posts-summarization-grpo

All Quality-Only Runs (fine-tuned from the length-penalty checkpoint)

Run	Faithfulness	Coverage	Conciseness	Clarity	Composite	Pass Rate
`grpo-summarization-quality-bleu-rouge` ⭐	0.865	0.329	0.839	0.784	2.817	18.2%
`grpo-summarization-quality-meteor-rouge`	0.853	0.489	0.692	0.762	2.796	38.3%
`grpo-summarization-quality-rouge`	0.818	0.338	0.841	0.779	2.777	19.6%
`grpo-summarization-quality-meteor-bleu`	0.933	0.716	0.322	0.763	2.734	26.1%
`grpo-summarization-quality-meteor`	0.883	0.619	0.444	0.751	2.697	30.5%
`grpo-summarization-quality-bleu`	0.722	0.439	0.575	0.678	2.414	32.1%

Length-Penalty Included Runs (alternative strategy)

Run	Faithfulness	Coverage	Conciseness	Clarity	Composite	Pass Rate
`grpo-summarization-length-quality-meteor-rouge` ⭐	0.832	0.511	0.659	0.767	2.769	44.3%
`grpo-summarization-length-quality-bleu-rouge`	0.810	0.502	0.650	0.770	2.732	39.1%
`grpo-summarization-length-quality-meteor-bleu`	0.792	0.468	0.648	0.756	2.664	38.3%
`grpo-summarization-length-quality-rouge`	0.725	0.415	0.637	0.778	2.555	32.4%
`grpo-summarization-length-quality-meteor`	0.721	0.427	0.625	0.711	2.484	—
`grpo-summarization-length-only`	0.678	0.407	0.592	0.739	2.416	30.7%
`grpo-summarization-length-quality-bleu`	0.680	0.399	0.577	0.744	2.400	26.9%
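
No single run leads on every metric, so the ranking depends on which column matters most. As a small illustration, the snippet below finds the leader per column among the quality-only runs. The scores are copied from that table, run names are shortened by dropping the `grpo-summarization-` prefix, and `leaders` is a hypothetical helper.

```python
# Rows copied from the quality-only table in this card:
# (run, faithfulness, coverage, conciseness, clarity, composite)
RUNS = [
    ("quality-bleu-rouge", 0.865, 0.329, 0.839, 0.784, 2.817),
    ("quality-meteor-rouge", 0.853, 0.489, 0.692, 0.762, 2.796),
    ("quality-rouge", 0.818, 0.338, 0.841, 0.779, 2.777),
    ("quality-meteor-bleu", 0.933, 0.716, 0.322, 0.763, 2.734),
    ("quality-meteor", 0.883, 0.619, 0.444, 0.751, 2.697),
    ("quality-bleu", 0.722, 0.439, 0.575, 0.678, 2.414),
]
COLUMNS = ("faithfulness", "coverage", "conciseness", "clarity", "composite")


def leaders(runs):
    """Return the best-scoring run name for each column."""
    return {col: max(runs, key=lambda r: r[i + 1])[0] for i, col in enumerate(COLUMNS)}


for col, run in leaders(RUNS).items():
    print(f"{col:12s} {run}")
```

In these numbers, the run with the highest Coverage also has the lowest Conciseness, which is the trade-off the Composite column averages over.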

Usage (MLX)

from mlx_lm import load, generate

# Load the MLX weights and tokenizer from the Hugging Face Hub.
model, tokenizer = load("YuvrajSingh9886/Qwen2.5-0.5B-grpo-summarization-quality-meteor-rouge")

# Build the chat messages: the system prompt sets the 50-word target, the user turn carries the post.
messages = [
    {"role": "system", "content": "You are an assistant who is an expert at summarization task. The user gives you a post and you are required to summarize it, keeping the key points and main ideas intact, in EXACTLY 50 words"},
    {"role": "user",   "content": "<paste Reddit post here>"},
]

# Render the chat template to a prompt string, then generate and print the summary.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=False))
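
Since the system prompt asks for exactly 50 words, it can be useful to check how far a generated summary lands from that target. A minimal sketch, assuming whitespace-delimited words are an acceptable count; `word_count_gap` is a hypothetical helper and is not part of `mlx_lm`.

```python
def word_count_gap(summary: str, target: int = 50) -> int:
    """Signed difference between the whitespace word count and the target."""
    return len(summary.split()) - target


sample = "word " * 47  # stand-in for a generated summary
print(word_count_gap(sample))  # -3
```

A negative value means the summary is shorter than the target, a positive value means it is longer.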

Training Details

Setting	Value
Base model	Qwen/Qwen2.5-0.5B-Instruct
Algorithm	GRPO (via smolcluster)
Dataset	`mlabonne/smoltldr` (train split, Reddit summarization)
Stage 1 reward	Length penalty (deviation from 50-token target)
Stage 2 reward	METEOR + ROUGE
Hardware	Apple Silicon Mac mini cluster
Framework	MLX
Weights format	MLX safetensors (bf16)
Eval examples	200 (test split)
Judge	`gpt-5-mini-2025-08-07` via DeepEval G-Eval
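
For readers unfamiliar with GRPO, its core idea is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization in isolation. It is not taken from smolcluster; the function name, the `eps` term, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each completion's reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four summaries sampled from the same post.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 2) for a in advantages])
```

Completions that beat their group's mean get a positive advantage and are reinforced; those below the mean are pushed down, with no separate value model needed.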
