Interruption Is All You Need: Improving Reasoning Model Refusal Rates Through Measuring Parallel Reasoning Diversity

A novel approach to reducing hallucinations in large language models through parallel reasoning and diversity measurement

David Bai, Vijay Kumaravel, Balaji Kumaravel
June 2024

Abstract

We propose a method for increasing LLM refusal rates on questions they would typically answer incorrectly by utilizing inference-time compute to scale reasoning and then verifying the diversity of the reasoning chains that arrive at a coherent answer. Given a query, we run inference on Deepseek R1, interrupt it at N tokens, and inject an interruption token: "No, but". We then run P parallel inferences from that Nth token, allowing us to sample different reasoning traces given the same interruption token. Once R1 has resolved to an answer, we utilize an SLM to judge the coherence and diversity of the P reasoning traces with a score. If this score exceeds a tuned threshold, we reject the original LLM answer and choose not to attempt the question.

We find this method maintains accuracy while increasing refusal rates on incorrect answers, with further work needed to determine the optimal token position N at which to inject. We believe this method is highly applicable to deployments where false negatives are highly consequential, such as high-trust environments like the medical or legal fields, or to settings where LLM outputs are too large to be human-verified, where this method serves as a form of scalable oversight.

Introduction

Reasoning models leverage chain-of-thought to reason through, and perform extremely well on, verifiable tasks like coding and math that require multi-hop reasoning and thinking over long contexts. This use of inference-time compute typically emerges through reinforcement learning, as demonstrated by models like Deepseek-R1 and the slew of replications that followed.

However, this reinforcement learning and the behavior that emerges from it often mean that on retrieval or memorization tasks (hallucination benchmarks like SimpleQA), these models attempt to "reason through" questions of this sort. While it is possible that the model reasons over other context it has, the majority of these questions result in overconfident hallucination, especially in the case of Deepseek-R1, the model on which we tested our method.

Prior work, like s1 from Stanford and other papers that predate the reasoning paradigm, has demonstrated that we can "budget force" a model by injecting tokens like "Wait" or even just periods. We aim to use this mechanism to gauge model uncertainty.

Method

Given a query Q, we run inference on a reasoning model M until an EOS token is produced and an answer is emitted. Additionally, at N tokens (a hyperparameter) into the same generated sequence, we inject an interruption token, "No, but" (though we would like to try other tokens), to induce uncertainty into the reasoning trace. We then continue this reasoning trace in P parallel instances (another hyperparameter) until they all terminate.

We then prompt an SLM (small language model) to judge the diversity of these reasoning traces: how different they are and whether they are coherent with one another. We find that an integer diversity score from 1 to 10 is expressive enough for our purposes, as opposed to a regression or a larger range.

Then, for Q, if the diversity score exceeds our set threshold, we replace the answer with a refusal, e.g., "I don't know", which the benchmark's grader classifies as NO_ATTEMPT.
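For concreteness, the pipeline can be sketched as follows. This is a minimal illustration rather than our exact implementation: the generate and judge_diversity callables are hypothetical interfaces (the actual API calls are described under Experiment Details), the whitespace-based truncation stands in for real tokenization, and the default number of parallel branches is illustrative, while N = 250 and a threshold of 7 match values we report later.

```python
from typing import Callable

# Minimal sketch of the interruption pipeline, not our exact implementation.
# `generate(question, prefill)` and `judge_diversity(chains)` are hypothetical
# interfaces; whitespace truncation stands in for real tokenization.
def answer_with_interruption(
    question: str,
    generate: Callable[[str, str], str],     # full model output given a reasoning prefill
    judge_diversity: Callable[[list], int],  # integer diversity score from 1 to 10
    n_tokens: int = 250,                     # injection point N (250/500 in our experiments)
    p_branches: int = 4,                     # number of parallel continuations P (illustrative)
    threshold: int = 7,                      # diversity threshold used for our R1 results
    refusal: str = "I don't know",
) -> str:
    # 1. Run the model to completion to obtain its original reasoning trace and answer.
    original = generate(question, "")

    # 2. Truncate the trace at roughly N tokens and inject the interruption token.
    prefix = " ".join(original.split()[:n_tokens]) + " No, but"

    # 3. Branch: sample P independent continuations from the interrupted prefix.
    continuations = [generate(question, prefix) for _ in range(p_branches)]

    # 4. Ask the SLM judge how diverse the continuations are.
    diversity = judge_diversity(continuations)

    # 5. High diversity suggests the model is guessing, so refuse; otherwise keep the answer.
    return refusal if diversity > threshold else original
```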

Figure 1: Method overview. The interruption method: run inference until N tokens, inject "No, but", branch into P parallel reasoning paths, and measure their diversity to make the refusal decision.

Experiment Details

Benchmark: SimpleQA

We utilize OpenAI's SimpleQA, a benchmark for measuring language-model factuality. It features 4,326 diverse fact-seeking questions with verified answers, and model responses are classified as "correct," "incorrect," or "not attempted" by a prompted classifier. The benchmark is specifically designed to identify and reduce hallucinations in AI-generated content.

Given budget and compute constraints, we evaluate on two subsets, the first 100 and the second 100 questions of SimpleQA, and report results for both.
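A minimal sketch of how these two subsets can be formed, assuming a local copy of the SimpleQA test set; the filename and the "problem"/"answer" column names are assumptions based on OpenAI's simple-evals release:

```python
import pandas as pd

# Hypothetical local copy of the SimpleQA test set; filename and column
# names ("problem", "answer") are assumptions from the simple-evals release.
df = pd.read_csv("simple_qa_test_set.csv")

set_one = df.iloc[:100]      # first 100 questions
set_two = df.iloc[100:200]   # second 100 questions

for name, subset in [("set one", set_one), ("set two", set_two)]:
    print(f"{name}: {len(subset)} questions")
```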

Model: Deepseek R1

We utilize Deepseek R1 because of its openly exposed reasoning traces and reported ability to self-correct, as well as the ability to insert prefills into its thinking tokens, which is crucial to our method. All results in this paper are from the full 671B version of Deepseek R1. We run Deepseek R1 through the Fireworks API with a temperature of 1.0 and a top-p of 1.0.
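A sketch of a single R1 call is shown below, using the openai Python client pointed at Fireworks' OpenAI-compatible endpoint. The base URL, the model identifier, and in particular whether the endpoint continues a trailing assistant message as a thinking prefill are assumptions to verify against the Fireworks documentation, not a description of our exact harness.

```python
from openai import OpenAI

# Sketch of one R1 call through Fireworks' OpenAI-compatible endpoint.
# Base URL, model id, and prefill behaviour (continuing a trailing
# assistant message) are assumptions; check the provider docs.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

question = "..."         # a SimpleQA question
truncated_trace = "..."  # first N tokens of the original reasoning trace

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-r1",  # assumed id for the full 671B R1
    messages=[
        {"role": "user", "content": question},
        # Prefill: resume the truncated thinking trace with the interruption appended.
        {"role": "assistant", "content": "<think>" + truncated_trace + " No, but"},
    ],
    temperature=1.0,
    top_p=1.0,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```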

Answer Diversity through SLMs

An initial attempt at finding consensus from parallel reasoning chains (the original motivation for this work) was to simply measure answer diversity after the reasoning chains had concluded, extracting each answer from the portion of the response after the </think> token. We discovered that the reasoning model often arrived at the same conclusion via fairly different reasoning chains or rationales, so this did not work in our preliminary experiments.
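For reference, a naive sketch of this answer-level consensus check is below. The helper names are ours, and the exact-string comparison is a simplification that would treat differently phrased versions of the same answer as distinct.

```python
from collections import Counter

def extract_answer(response: str) -> str:
    """Return the text after the closing </think> tag, i.e. the final answer."""
    parts = response.split("</think>", 1)
    return parts[1].strip() if len(parts) > 1 else response.strip()

def answer_consensus(responses: list) -> float:
    """Fraction of parallel chains agreeing on the most common final answer."""
    answers = [extract_answer(r).lower() for r in responses]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)
```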

Because we ultimately did not follow through with this method, we do not include its results in the Results Section.

Continuation Diversity through Embeddings

The next intuition we had was that if many different reasoning chains all arrive at the same conclusion, that conclusion is likely either false or, in the rarer case, supported by many different sources and thus true. Our first attempt at measuring reasoning-chain diversity (over only the sequence starting from the injected interruption token) was to use embeddings, specifically OpenAI's text-embedding-3-small. We discovered that similarity scores were often very high and that the embeddings did not capture the nuance we wanted: when comparing two long chains of thought that differ in only one or two places, the cosine similarity between their embeddings (our diversity measure) is likely to be very high even if those differences lead to very different reasoning traces.
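A sketch of this embedding-based measurement, using the openai client for text-embedding-3-small; the final summary statistic (one minus the mean off-diagonal similarity) is illustrative rather than the exact aggregation we used.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pairwise_cosine_similarity(chains: list) -> np.ndarray:
    """Embed each continuation and return the matrix of pairwise cosine similarities."""
    result = client.embeddings.create(model="text-embedding-3-small", input=chains)
    vectors = np.array([item.embedding for item in result.data])
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors @ vectors.T

# One possible summary: diversity as 1 minus the mean off-diagonal similarity.
# sims = pairwise_cosine_similarity(continuations)
# diversity = 1.0 - sims[np.triu_indices_from(sims, k=1)].mean()
```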

Because we ultimately did not follow through with this method, we do not include its results in the Results Section. We include details and figures about this in the appendix.

Reasoning Diversity through SLMs

We ultimately decided to use an SLM to judge reasoning-chain diversity, using Gemini 2.0 Flash-Lite as a proxy instead of a locally hosted model, as our compute node was fully occupied by experiments for another project. The SLM is given all of the parallel reasoning chains and asked to rate their diversity on a scale from 1 to 10.
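A sketch of this judging step using the google-generativeai client is below; the prompt is an illustrative paraphrase, not our exact prompt.

```python
import re
import google.generativeai as genai

genai.configure(api_key="<GEMINI_API_KEY>")
judge = genai.GenerativeModel("gemini-2.0-flash-lite")

def judge_diversity(chains: list) -> int:
    """Ask the SLM for a 1-10 diversity score over the parallel reasoning chains."""
    numbered = "\n\n".join(f"Chain {i + 1}:\n{c}" for i, c in enumerate(chains))
    prompt = (
        "You are given several reasoning chains produced for the same question.\n"
        "Rate how diverse they are (different claims, sources, or lines of reasoning)\n"
        "on a scale from 1 (nearly identical) to 10 (highly diverse).\n"
        "Reply with a single integer.\n\n" + numbered
    )
    reply = judge.generate_content(prompt).text
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 1  # fall back to "not diverse"
```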

Results

We utilize GPT-4o and Claude 3.5 Sonnet as our baselines because of their strong performance and wide deployment in real-world scenarios. For our method, we report three results, corresponding to injecting the interruption token at 250 tokens, at 500 tokens, and effectively unbounded (a 10,000-token limit that was never reached). We find that our average token expenditure on SimpleQA questions is around 1,000 tokens, which is what prompted us to choose 250 and 500. We utilize a diversity threshold of 7 for all R1 results, and we report all results across the two sets we partitioned, as described in Experiment Details.

Accuracy vs Refusal Rate Comparison

We find that we maintain accuracy on attempted questions close to our baseline models, while improving in refusal rates. That is, while we are unable to add new knowledge to the model, we are able to leverage inference-time compute to accurately judge when a model may be wrong.

You may observe that our results on the second set of questions have much higher refusal rates. This is because the token limits of 250 and 500 prevented several answers from completing, and those responses were then parsed as NO_ATTEMPT by the SimpleQA grader. We would like to introduce adaptive interrupt injection to solve this problem, but we decided not to pursue it given the time constraints of the hackathon.

Conclusion

Our work demonstrates that measuring diversity in parallel reasoning chains can serve as an effective signal for when reasoning models should refuse to answer questions they would otherwise hallucinate on. By injecting an interruption token ("No, but") at varying points in the reasoning process and analyzing the diversity of resulting thought patterns, we were able to maintain accuracy on attempted questions while significantly improving refusal rates on potentially incorrect answers.

This approach requires no additional model training and leverages inference-time computation to enable more trustworthy AI systems. The method shows particular promise for high-stakes applications in medicine, law, and other domains where false information can have serious consequences. Future work should focus on optimizing the interruption point through adaptive injection techniques, exploring alternative interruption tokens, and extending this approach to other reasoning models beyond Deepseek R1.

Overall, our method provides a practical framework for improving AI reliability through enhanced self-awareness of knowledge limitations.

Appendix

In our initial experiments with embedding-based diversity measurement, we found that semantic similarity metrics often failed to capture the nuanced differences in reasoning chains, especially when those differences were concentrated in relatively few tokens of long reasoning sequences.

Figure 4: Embedding similarity matrix. Cosine similarity between embeddings of different reasoning chains, showing high similarity despite meaningful reasoning differences.

We also experimented with different interruption tokens beyond "No, but", including tokens like "Wait", "Actually", and simple punctuation marks like periods. In general, we found that tokens that explicitly introduce doubt or contradiction (like "No, but") were most effective at generating diverse reasoning paths.

Additional data and code implementation details are available in our GitHub repository.