arxiv:2511.07397

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

Published on Jun 23 · Submitted by Vidya Srinivas on Jun 29

Abstract

Conversational infill enables small real-time models to maintain responsiveness while integrating delayed reasoning outputs, bridging the gap between latency and capability in voice agents.

Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller, real-time models meet the latency bar but cannot match foundation models on complex tasks, leaving current voice agents to trade away either responsiveness or capability. We introduce conversational infill, where a small talker model both immediately generates contextually grounded responses to hide the latency of an external reasoner model and fluently integrates streamed reasoner knowledge into its responses during inference. We curate a 290,571-example synthetic dataset spanning six domains and demonstrate that this task is learnable across seven widely used small language models ranging from 135M to 1.7B parameters. Our system implementation, ConvFill, sustains millisecond-level time-to-first-response while closing the accuracy gap to within 6.3% of the corresponding frontier reasoner performance. In a live user study (n=18) with talker deployments running on an Apple M2 SoC, participants rank ConvFill on par with frontier models overall, prefer it for retrieval-heavy tasks, and rate it significantly more responsive. These results show that conversational infill unlocks a new point on the latency-capability Pareto frontier, offering a practical path toward voice agents that are both responsive and highly capable. Code, models, and datasets are available at https://github.com/vysri/conversational-infill.
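The talker/reasoner interplay described in the abstract can be pictured with a small concurrency sketch. This is not the ConvFill implementation: `reasoner`, `talker_step`, and `converse` are hypothetical stand-ins, the "models" are rule-based stubs, and the delays and canned chunks are made up for illustration. It only shows the control flow, in which the talker speaks immediately and folds in reasoner output as it streams in.

```python
import asyncio
from typing import List, Optional


async def reasoner(query: str, out: "asyncio.Queue[Optional[str]]",
                   delay_s: float = 0.03) -> None:
    """Hypothetical slow reasoner: streams canned knowledge chunks with a delay."""
    for chunk in ("fact A", "fact B"):
        await asyncio.sleep(delay_s)   # stands in for reasoning / retrieval time
        await out.put(chunk)
    await out.put(None)                # sentinel: the reasoner is done


def talker_step(query: str, fresh: List[str]) -> str:
    """Hypothetical talker stub: filler when nothing is new, else integrate it."""
    if not fresh:
        return "Let me check on that."
    return "Here is what I found: " + "; ".join(fresh)


async def converse(query: str, step_s: float = 0.01) -> List[str]:
    """Speak right away, then weave in reasoner chunks as they arrive."""
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    task = asyncio.create_task(reasoner(query, queue))
    spoken: List[str] = []
    finished = False
    while not finished:
        fresh: List[str] = []
        while not queue.empty():       # non-blocking drain of streamed chunks
            item = queue.get_nowait()
            if item is None:
                finished = True
            else:
                fresh.append(item)
        if fresh or not finished:
            spoken.append(talker_step(query, fresh))
        if not finished:
            await asyncio.sleep(step_s)  # stands in for the time spent speaking
    await task
    return spoken


if __name__ == "__main__":
    for utterance in asyncio.run(converse("What's the weather in Seattle?")):
        print(utterance)
```

The first utterance is always produced before any reasoner output exists, which is the latency-hiding half of the idea; how a trained talker phrases its filler and merges knowledge fluently is exactly what the sketch leaves out.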

Community


👋 Authors here, thanks for checking out our paper!

Summary: Conversational AI systems rely on lightweight models that can run in real time, but slower frontier LLMs are more capable (accuracy, tool use, etc.). We use model collaboration to approach frontier-level performance in a responsive conversational system. A tiny on-device Talker starts replying in milliseconds and weaves in knowledge from a frontier LLM Reasoner as it becomes available. This gives 7–19× faster time-to-first-response and runs on a laptop (Apple M2, 16 GB).

The 7 Talker models and the 290k-sample training dataset are in the linked collection.

We've also got a repo set up here 💻 github.com/vysri/conversational-infill. It has the runnable demo (same as the video) so you can load the released models and talk to the full system yourself, plus the training framework if you want to use the dataset to fine-tune other SLMs into your own Talkers.

Happy to answer any questions, we'll be around in the comments!

The tension between reasoning latency and conversational responsiveness is the primary bottleneck for voice agents. Using a "talker" model to mask the latency of a heavier reasoner is a practical engineering move, but the real challenge is the seamless integration of the stream. I'm interested in how this handles mid-sentence corrections when the reasoner updates its state. If the talker is too far ahead, you get those awkward "wait, actually" moments that break the illusion of intelligence. This approach is a step toward a more deployable agentic voice system, provided the handoff is handled at the token level rather than the sentence level.
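To make the token-level versus sentence-level distinction in the comment above concrete, here is a toy calculation. It is only an illustration under stated assumptions: `first_integration_index` is a hypothetical helper, whitespace-split words stand in for model tokens, and nothing here reflects how ConvFill actually schedules its handoff.

```python
from typing import List

SENTENCE_END = (".", "?", "!")


def first_integration_index(tokens: List[str], arrival_index: int,
                            granularity: str) -> int:
    """Index of the first token that can reflect knowledge which arrived
    just before tokens[arrival_index] was due to be emitted."""
    if granularity == "token":
        # The talker can condition on new knowledge at the very next token.
        return arrival_index
    if granularity == "sentence":
        # The talker must finish the current sentence before switching.
        for i in range(arrival_index, len(tokens)):
            if i == 0 or tokens[i - 1].endswith(SENTENCE_END):
                return i
        return len(tokens)
    raise ValueError(f"unknown granularity: {granularity}")


# Filler utterance the talker is midway through when the reasoner responds.
filler = "Sure , let me check that for you . One moment .".split()
arrival = 3  # knowledge arrives just before the token "me"

token_level = first_integration_index(filler, arrival, "token")
sentence_level = first_integration_index(filler, arrival, "sentence")
print(token_level, sentence_level, sentence_level - token_level)
```

In this toy case the sentence-level policy emits six more filler tokens before it can use the new knowledge, which is the kind of lag that can leave the talker "too far ahead".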


