Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents
Abstract
Conversational infill enables small real-time models to maintain responsiveness while integrating delayed reasoning outputs, bridging the gap between latency and capability in voice agents.
Voice agents face a fundamental tension: the reasoning, retrieval, and tool use that make foundation models capable are iterative and slow, while conversational interaction demands responses on a millisecond timescale. Smaller, real-time models meet the latency bar but cannot match foundation models on complex tasks, leaving current voice agents to trade away either responsiveness or capability. We introduce conversational infill, where a small talker model both immediately generates contextually grounded responses to hide the latency of an external reasoner model and fluently integrates streamed reasoner knowledge into its responses during inference. We curate a 290,571-example synthetic dataset spanning six domains and demonstrate that this task is learnable across seven widely used small language models ranging from 135M to 1.7B parameters. Our system implementation, ConvFill, sustains millisecond-level time-to-first-response while closing the accuracy gap to within 6.3% of the corresponding frontier reasoner performance. In a live user study (n=18) with talker deployments running on an Apple M2 SoC, participants rank ConvFill on par with frontier models overall, prefer it for retrieval-heavy tasks, and rate it significantly more responsive. These results show that conversational infill unlocks a new point on the latency-capability Pareto frontier, offering a practical path toward voice agents that are both responsive and highly capable. Code, models, and datasets are available at https://github.com/vysri/conversational-infill.
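The control flow described in the abstract can be illustrated with a small toy. This is only a sketch under stated assumptions, not the ConvFill implementation: `reasoner`, `talker`, `converse`, and `FILLER` are hypothetical stand-ins, the "reasoner" simply sleeps to simulate latency, and the "talker" is a string template rather than a language model.

```python
import asyncio
from typing import List, Optional

# Hypothetical filler line; a real talker would generate this from context.
FILLER = "Good question, let me check on that."


async def reasoner(query: str, delay_s: float = 0.05) -> str:
    """Stand-in for a slow external reasoner (assumption: it only sleeps)."""
    await asyncio.sleep(delay_s)
    return f"[reasoner knowledge about: {query}]"


def talker(query: str, knowledge: Optional[str]) -> str:
    """Stand-in for the small talker: filler first, then knowledge-grounded text."""
    if knowledge is None:
        return FILLER
    return f"Here is what I found. {knowledge}"


async def converse(query: str) -> List[str]:
    """Speak immediately, then fold in the reasoner output once it arrives."""
    pending = asyncio.create_task(reasoner(query))  # start the slow call
    spoken = [talker(query, None)]                  # respond without waiting
    knowledge = await pending                       # reasoner result arrives
    spoken.append(talker(query, knowledge))         # integrate the knowledge
    return spoken


if __name__ == "__main__":
    for chunk in asyncio.run(converse("store hours")):
        print(chunk)
```

The point of the toy is the ordering: the first chunk is produced before the reasoner returns, so time-to-first-response depends only on the talker.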
Community
👋 Authors here, thanks for checking out our paper!
Summary: Conversational AI systems rely on lightweight models that can run in real time, but slower frontier LLMs are more capable (accuracy, tool use, etc.). We use model collaboration to approach frontier-level performance in a responsive conversational system. A tiny on-device Talker starts replying in milliseconds and weaves in knowledge from a frontier LLM Reasoner as it becomes available. This gives 7–19× faster time-to-first-response and runs on a laptop (Apple M2, 16 GB).
The 7 Talker models and the 290k-sample training dataset are in the linked collection.
We've also got a repo set up here 💻 github.com/vysri/conversational-infill. It has the runnable demo (same as the video) so you can load the released models and talk to the full system yourself, plus the training framework if you want to use the dataset to fine-tune other SLMs into your own Talkers.
Happy to answer any questions, we'll be around in the comments!
The tension between reasoning latency and conversational responsiveness is the primary bottleneck for voice agents. Using a "talker" model to mask the latency of a heavier reasoner is a practical engineering move, but the real challenge is the seamless integration of the stream. I'm interested in how this handles mid-sentence corrections when the reasoner updates its state. If the talker is too far ahead, you get those awkward "wait, actually" moments that break the illusion of intelligence. This approach is a step toward a more deployable agentic voice system, provided the handoff is handled at the token level rather than the sentence level.
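To make the token-level versus sentence-level distinction in this comment concrete, here is a minimal sketch. It is an illustration only and not taken from the paper: `handoff_index` and `SENTENCE_END` are hypothetical names, "tokens" are whitespace-split words, and the assumption is that reasoner knowledge arrives after `arrival` tokens have already been spoken.

```python
from typing import List

# Assumption: a token ending in one of these closes a sentence.
SENTENCE_END = (".", "?", "!")


def handoff_index(tokens: List[str], arrival: int, level: str) -> int:
    """Index of the first token that can use knowledge arriving after
    `arrival` tokens have been spoken."""
    if level == "token":
        return arrival  # the very next token can use the knowledge
    if level != "sentence":
        raise ValueError(f"unknown level: {level}")
    i = arrival
    # Advance until the previously spoken token closes a sentence.
    while 0 < i < len(tokens) and not tokens[i - 1].endswith(SENTENCE_END):
        i += 1
    return i


tokens = "Sure, let me check that for you. One moment please.".split()
token_level = handoff_index(tokens, 3, "token")
sentence_level = handoff_index(tokens, 3, "sentence")
print(token_level, sentence_level, sentence_level - token_level)
```

In this example the sentence-level handoff has to speak four more tokens before the knowledge can be used, which is the kind of lag the comment is concerned about.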
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Learning When to Think While Listening in Large Audio-Language Models (2026)
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue (2026)
- Adaptive Turn-Taking for Real-time Multi-Party Voice Agents (2026)
- Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models (2026)
- Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs (2026)
- OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations (2026)
- Context-Driven Incremental Compression for Multi-Turn Dialogue Generation (2026)
Get this paper in your agent:
hf papers read 2511.07397
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 14
vysri/SmolLM360M-IT-ConvFill
Datasets citing this paper 1
zenglhardt/convfill-dataset