We made inference-time compute scaling more efficient and used it to make LLMs more accurate.

Efficiency: Parallel Reasoning Chain Pruning

By evaluating the semantic similarity between reasoning chains at various stages and pruning similar paths early in the decoding process, we can achieve the same accuracy as sampling 50 approaches while only decoding 10 to completion.

[Figure: efficiency gains from parallel reasoning chain pruning]

Accuracy: Inference-Time Hallucination Detection via Self-Verification

By having reasoning LLMs extend their thinking with self-verification statements after reaching an answer, we can implement a majority-vote mechanism to detect hallucinations, enabling confidence-based compute allocation for improved accuracy on various benchmarks.

[Figure: accuracy gains from inference-time hallucination detection via self-verification]
Efficiency

Parallel Reasoning Chain Pruning

Our research demonstrates that parallel reasoning chain pruning can achieve the same accuracy as sampling 50 approaches while pruning 80% of the reasoning chains after only 300 tokens decoded.
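To see why this translates into large savings, take an illustrative (assumed) average chain length of 10,000 tokens: decoding 50 full chains costs 500,000 tokens, while decoding 50 chains to 300 tokens and only 10 survivors to completion costs roughly 50 × 300 + 10 × 9,700 = 112,000 tokens, a reduction of close to 80% for the same accuracy.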

As reasoning LLMs grow more popular for production coding and mathematics - domains with strong verifiers - we believe that decoding many reasoning chains in parallel per prompt will become common practice for scaling inference-time compute and improving performance.

However, these reasoning chains can run for tens of thousands of tokens and take up valuable bandwidth during inference (on both GPUs and custom ASICs like Sohu). Instead of decoding reasoning chains that we can predict will be redundant, we can prune them early in the decoding process via the methodology sketched below.
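To make the pruning step concrete, here is a minimal Python sketch of the early-deduplication idea. The embedding model, similarity threshold, and chain budget are illustrative assumptions, not the exact configuration from our experiments.

# Sketch of parallel reasoning chain pruning: embed partially decoded
# chains (e.g. the first ~300 tokens of each) and greedily keep only
# semantically distinct ones, pruning near-duplicates before they are
# decoded to completion. Model name and thresholds are assumptions.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder

def prune_chains(partial_chains: list[str], keep: int = 10,
                 sim_threshold: float = 0.9) -> list[int]:
    """Return indices of the chains worth decoding to completion."""
    emb = embedder.encode(partial_chains, normalize_embeddings=True)
    kept: list[int] = []
    for i in range(len(partial_chains)):
        # Keep chain i only if it is not too similar to any kept chain.
        # Embeddings are unit-normalized, so a dot product is cosine
        # similarity.
        if all(float(emb[i] @ emb[j]) < sim_threshold for j in kept):
            kept.append(i)
        if len(kept) == keep:
            break
    return kept

# Usage: sample 50 chains for ~300 tokens each, then decode only the
# survivors (at most 10 here) to completion:
#   survivors = prune_chains(partial_chains, keep=10)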

[Figure: detailed visualization of adaptive computation allocation for parallel reasoning chain pruning]
Accuracy

Inference-Time Hallucination Detection via Self-Verification

We discovered that by allowing reasoning models to self-correct their answers on hallucination benchmarks, and by analyzing the diversity of their reasoning as a heuristic for model confidence, we can detect model hallucinations at a higher rate and offer refusals instead of confidently incorrect answers.

This yields a confidence-based compute allocation mechanism: a model that knows when it's wrong instead of outputting something misleading.
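As a rough illustration, the sketch below implements the verification vote with an OpenAI-style chat client; the client, model name, prompt wording, and agreement threshold are all placeholder assumptions rather than our exact setup.

# Sketch of inference-time hallucination detection via self-verification:
# re-verify a proposed answer several times at nonzero temperature and
# refuse when the verdicts disagree too much. All names below are
# illustrative assumptions.
def verify_or_refuse(client, question: str, answer: str,
                     n_votes: int = 5, min_agreement: float = 0.6) -> str:
    verdicts = []
    for _ in range(n_votes):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            temperature=1.0,      # diversity across verification chains
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {question}\nProposed answer: {answer}\n"
                    "Re-derive the answer independently, then reply with "
                    "exactly CORRECT or INCORRECT on the last line."
                ),
            }],
        )
        last = resp.choices[0].message.content.strip().splitlines()[-1].upper()
        # "INCORRECT" contains "CORRECT", so check the negative case first.
        verdicts.append("INCORRECT" not in last and "CORRECT" in last)
    if sum(verdicts) / n_votes < min_agreement:
        return "I'm not confident enough to answer this reliably."
    return answer

Low agreement across diverse verification chains serves as a proxy for hallucination and triggers a refusal; high agreement passes the original answer through with no further compute spent.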

We prove this out on the SimpleQA benchmark, where our method shows a marked improvement in providing refusals instead of confidently incorrect answers.

Real-World Impact

Tangible Applications

Our test-time compute research has the potential to address critical challenges in AI deployment.

We're out of GPUs

As AI capabilities expand, we're facing unprecedented demand for compute. Our research could offer optimizations that help alleviate this bottleneck.

💰 80% fewer tokens generated for the same performance through strategic resource allocation
🌱 Significantly lower energy consumption and carbon footprint
🛡️ Enhanced reliability through reduced hallucinations

"...we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.)..."

— Sam Altman

Industry Applications

Our research enables breakthrough capabilities for leading AI innovators:

Etched → Sohu

Sohu, the world's first specialized chip (ASIC) for transformers, could leverage our parallel reasoning chain pruning to maximize throughput. By pruning, Sohu can cut redundant reasoning chains early on and fill the remaining bandwidth with more user requests, without sacrificing quality.

Cognition (Devin)

Devin, the AI software engineer, requires high factual accuracy and efficient resource usage when working across large codebases. Our refusal research could enable more reliable code generation, making autonomous coding systems like Devin more powerful.

Mercor

For AI-driven development platforms like Mercor, our refusal research could enhance the accuracy of technical solutions while maintaining responsiveness, and our pruning research could enable more efficient inference-time scaling for their matching agents.

Thank You

This research was made possible through the Cognition X Mercor X Etched Hackathon. We want to thank @Coreweave, through @Northflank, for providing access to 8 x H100s and their team for troubleshooting with us. Thank you Mercor for the office space and for organizing the event. Thanks to @Anthropic for the credits to use Claude Code and Sonnet, and to @Cognition for Devin access. Very special thanks to @Etched for extremely interesting late-night conversations and guidance. And finally, thank you to all the other participants for a great time hacking in FiDi!

About Our Team

We like AI. We love trying to make it better.

Vijay Kumaravel

Vijay Kumaravel ยท GitHub

Cofounder @JargonLearn & @Empathy, Researcher/Junior @USC

David Bai

David Bai ยท GitHub

Cofounder @JargonLearn & @Empathy, Researcher/Sophomore @USC

Balaji Kumaravel

Balaji Kumaravel ยท GitHub

Founding Engineer @Adapt API, ex-Quantitative Trading Engineer

Get in Touch

Interested in learning more about our research or exploring collaboration opportunities? Reach out to us at our LinkedIn profiles!