arxiv:2511.07397

Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents

Published on Jun 23

Submitted by Vidya Srinivas on Jun 29

Authors: Vidya Srinivas, Zachary Englhardt, Maximus Powers

Abstract

Conversational infill enables small real-time models to maintain responsiveness while integrating delayed reasoning outputs, bridging the gap between latency and capability in voice agents.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller, real-time models meet the latency bar but cannot match foundation models on complex tasks, leaving current voice agents to trade away either responsiveness or capability. We introduce conversational infill, where a small talker model both immediately generates contextually grounded responses to hide the latency of an external reasoner model and fluently integrates streamed reasoner knowledge into its responses during inference. We curate a 290,571-example synthetic dataset spanning six domains and demonstrate that this task is learnable across seven widely used small language models ranging from 135M to 1.7B parameters. Our system implementation, ConvFill, sustains millisecond-level time-to-first-response while closing the accuracy gap to within 6.3% of the corresponding frontier reasoner performance. In a live user study (n=18) with talker deployments running on an Apple M2 SoC, participants rank ConvFill on par with frontier models overall, prefer it for retrieval-heavy tasks, and rate it significantly more responsive. These results show that conversational infill unlocks a new point on the latency-capability Pareto frontier, offering a practical path toward voice agents that are both responsive and highly capable. Code, models, and datasets are available at https://github.com/vysri/conversational-infill.
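The talker/reasoner split described in the abstract can be illustrated with a toy sketch. This is not the ConvFill implementation: the function names (`reasoner`, `talker`, `converse`, `run_demo`), the canned strings, and the 50 ms delay are all assumptions made for illustration, and real models are replaced with stubs. It only shows the control flow of replying immediately and folding in reasoner output as it streams in.

```python
import asyncio
import time
from typing import AsyncIterator, List, Optional, Tuple


async def reasoner(query: str, delay_s: float = 0.05) -> AsyncIterator[str]:
    """Hypothetical slow reasoner stub: streams knowledge chunks after a delay."""
    for fact in ["the flight departs at 9:40.", "it leaves from Gate B7."]:
        await asyncio.sleep(delay_s)  # stands in for reasoning, retrieval, tool use
        yield fact


async def talker(query: str, knowledge: "asyncio.Queue[Optional[str]]") -> AsyncIterator[str]:
    """Hypothetical talker stub: speaks at once, then integrates streamed knowledge."""
    # Immediate, contextually grounded reply that hides the reasoner's latency.
    yield "Sure, let me check that for you."
    while True:
        fact = await knowledge.get()
        if fact is None:  # sentinel: the reasoner has finished
            break
        yield f"I found that {fact}"


async def converse(query: str) -> List[Tuple[float, str]]:
    """Run both stubs concurrently and record when each utterance is produced."""
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()

    async def pump() -> None:
        # Forward reasoner chunks to the talker as they arrive.
        async for fact in reasoner(query):
            await queue.put(fact)
        await queue.put(None)

    pump_task = asyncio.create_task(pump())
    start = time.perf_counter()
    transcript: List[Tuple[float, str]] = []
    async for utterance in talker(query, queue):
        transcript.append((time.perf_counter() - start, utterance))
    await pump_task
    return transcript


def run_demo() -> List[Tuple[float, str]]:
    return asyncio.run(converse("When does my flight leave?"))


if __name__ == "__main__":
    for elapsed, utterance in run_demo():
        print(f"{elapsed * 1000:7.1f} ms  {utterance}")
```

In this sketch the first utterance is emitted before any reasoner chunk exists, so time-to-first-response does not depend on the reasoner's delay. In the actual system a trained small language model would play the talker role rather than a template.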

Community

vysri

Paper author Paper submitter 1 day ago

👋 Authors here, thanks for checking out our paper!

Summary: Conversational AI systems rely on lightweight models that can run in real time, but slower frontier LLMs are more capable (accuracy, tool use, etc.). We use model collaboration to approach frontier-level performance in a responsive conversational system. A tiny on-device Talker starts replying in milliseconds and weaves in knowledge from a frontier LLM Reasoner as it becomes available. This gives 7–19× faster time-to-first-response and runs on a laptop (Apple M2, 16 GB).

The 7 Talker models and 290k-sample training dataset are in the collection linked.

We've also got a repo set up here 💻 github.com/vysri/conversational-infill. It has the runnable demo (same as the video) so you can load the released models and talk to the full system yourself, plus the training framework if you want to use the dataset to fine-tune other SLMs into your own Talkers.

Happy to answer any questions, we'll be around in the comments!

O96a

1 day ago

The tension between reasoning latency and conversational responsiveness is the primary bottleneck for voice agents. Using a "talker" model to mask the latency of a heavier reasoner is a practical engineering move, but the real challenge is the seamless integration of the stream. I'm interested in how this handles mid-sentence corrections when the reasoner updates its state. If the talker is too far ahead, you get those awkward "wait, actually" moments that break the illusion of intelligence. This approach is a step toward a more deployable agentic voice system, provided the handoff is handled at the token level rather than the sentence level.
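The token-level versus sentence-level distinction raised in this comment can be made concrete with a small sketch. It does not describe how ConvFill performs the handoff, which this page does not specify. The function `integration_index`, the two policy names, and the whitespace tokenization are hypothetical and serve only to show how far the talker runs ahead under each policy.

```python
from typing import List

SENTENCE_END = {".", "?", "!"}


def integration_index(tokens: List[str], arrival: int, policy: str) -> int:
    """Return the token index at which newly arrived knowledge can first be used.

    tokens  : the talker's planned output, one token per list entry
    arrival : index of the token about to be emitted when knowledge arrives
    policy  : "token" (use it right away) or "sentence" (wait for a boundary)
    """
    if policy == "token":
        return arrival
    if policy == "sentence":
        # Already at a sentence boundary: no waiting needed.
        if arrival == 0 or tokens[arrival - 1] in SENTENCE_END:
            return arrival
        # Otherwise finish the current sentence first.
        for i in range(arrival, len(tokens)):
            if tokens[i] in SENTENCE_END:
                return i + 1
        return len(tokens)
    raise ValueError(f"unknown policy: {policy}")


tokens = "Let me look into that for you . One moment .".split()
token_level = integration_index(tokens, 2, "token")
sentence_level = integration_index(tokens, 2, "sentence")
lag_in_tokens = sentence_level - token_level
```

With knowledge arriving at token index 2, the sentence-level policy keeps emitting stale tokens until the sentence closes, while the token-level policy can switch at once. That gap is the window in which a "wait, actually" correction could become necessary.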