TL;DR
Reasoning models like OpenAI o1 and DeepSeek-R1 spend thousands of tokens thinking through simple problems. Seven recent papers demonstrate fixes: Sketch-of-Thought cuts token usage by up to 84% with cognitive-inspired shorthand, the shortest sampled reasoning chains turn out to be up to 34.5% more accurate than the longest, and budget-aware prompting cuts token usage by up to 69% at a slight accuracy cost. If you’re paying per-token for inference, this research directly affects your bill.
The $50 Problem I Created in One Afternoon
I ran a batch of 12,000 math-adjacent classification tasks through DeepSeek-R1 last month. The logic was simple: each item needed maybe three reasoning steps. But the model averaged 1,400 tokens of “thinking” per request. By the end I’d burned through $50 in API credits on a job that should have cost $8. The model was overthinking every single input, generating paragraphs of intermediate reasoning for problems it could have solved in two sentences.
That experience sent me down a rabbit hole of recent papers on efficient reasoning. The field has exploded in the past year, with researchers attacking the problem from at least three directions: making models think in shorthand, teaching them to stop early, and routing easy problems away from expensive models entirely. What follows is a synthesis of the seven papers I found most useful, with concrete numbers and a code example you can apply today.
What LLM Overthinking Looks Like
The “Stop Overthinking” survey by Sui et al. (accepted at TMLR 2025) named the problem and mapped the solution space. Their framing: Chain-of-Thought (CoT) prompting improved accuracy by forcing models to show their work, but it also introduced a pathological side effect. Models generate verbose, redundant reasoning steps even when the answer is obvious.
The waste goes beyond compute costs. Zhou et al.’s “When More Thinking Hurts” (April 2026) showed that extra reasoning actively degrades accuracy in some cases. Models abandon correct intermediate answers when given more tokens to think. The model reaches the right answer at token 400, then talks itself out of it by token 1,200. They call this a “negative flip,” the reasoning equivalent of second-guessing a correct gut reaction.
The survey categorizes solutions into three buckets:
- Model-based: train or fine-tune a model to reason more concisely from the start
- Output-based: dynamically shorten reasoning during inference (early stopping, token budgets)
- Input-based: modify the prompt to elicit shorter reasoning (cognitive sketches, difficulty routing)
Each bucket has trade-offs. Model-based approaches need training compute upfront but pay off at scale. Output-based methods work on any model but require a stopping heuristic. Input-based techniques are the easiest to deploy, since you can start today by changing your prompts, but they rely on the model cooperating with the instructions. (Related: sparse attention and KV cache compression also reduce inference cost, but at the architecture level.)
The Papers: What Works and by How Much
Sketch-of-Thought: Think in Shorthand, Not Paragraphs
Aytes, Baek, and Hwang introduced Sketch-of-Thought (SoT) at EMNLP 2025 with a simple premise: humans don’t reason in full sentences. A mathematician working through a proof scribbles notation, not prose. An experienced programmer debugging code thinks in terms like “null check → async race → state mismatch” — not “First, I should consider whether the variable might be null, and then I need to examine the possibility of an asynchronous race condition…”
SoT implements this by prompting models to reason using three shorthand approaches, selected dynamically by a lightweight routing model:
- Conceptual Chaining links abstract concepts with arrows rather than explaining each step
- Chunked Symbolism uses mathematical notation and symbolic abbreviations
- Expert Lexicons employs domain-specific shorthand that a specialist would use
The routing model picks the right approach based on the input: math problems get Chunked Symbolism, multi-hop reasoning gets Conceptual Chaining, domain-specific questions get Expert Lexicons.
Results across 18 reasoning datasets: up to 84% token reduction with minimal accuracy loss. On mathematical and multi-hop reasoning tasks, SoT actually improved accuracy while producing shorter outputs. The model was both faster and more accurate, because the compressed format forced it to focus on the logical structure instead of padding with natural language filler.
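As a rough illustration of the routing step, the sketch below picks one of the three paradigms and builds a system prompt for it. The keyword heuristic, the `DOMAIN_TERMS` list, and the prompt wording are all assumptions made for illustration; the paper uses a trained lightweight routing model and its own prompt templates.

```python
import re

# Hypothetical one-line instructions per paradigm; not the paper's templates.
PARADIGM_PROMPTS = {
    "chunked_symbolism": "Reason with equations and symbols only, one step per line.",
    "conceptual_chaining": "Reason as a chain of key concepts joined by arrows.",
    "expert_lexicons": "Reason in terse domain shorthand, as a specialist would.",
}

# Hypothetical domain keywords that stand in for a trained router.
DOMAIN_TERMS = {"diagnosis", "dosage", "statute", "voltage", "plaintiff"}


def pick_paradigm(question: str) -> str:
    """Keyword stand-in for the routing model described above."""
    text = question.lower()
    # Digits or arithmetic operators suggest a math problem.
    if re.search(r"\d", text) or re.search(r"[+*/=^]", text):
        return "chunked_symbolism"
    if any(term in text for term in DOMAIN_TERMS):
        return "expert_lexicons"
    # Default: multi-hop or commonsense questions.
    return "conceptual_chaining"


def build_sketch_prompt(question: str) -> list[dict]:
    """Return chat messages that ask for shorthand reasoning."""
    paradigm = pick_paradigm(question)
    system = PARADIGM_PROMPTS[paradigm] + " Give the final answer on the last line."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list can be passed as the `messages` argument of a chat-completion call. In practice the heuristic would be replaced by a small classifier, as the authors do.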
Don’t Overthink it: Shorter Chains Are More Accurate
Hassid et al. published their findings in May 2025 (revised February 2026) with a counterintuitive result: when you sample multiple reasoning chains for the same problem, the shorter ones are significantly more likely to be correct. Specifically, the shortest chain was up to 34.5% more accurate than the longest chain sampled for the same question.
This overturns the assumption that longer reasoning = deeper thinking. In many cases, longer chains indicate the model is lost: exploring dead-end branches, reconsidering solved subproblems, or generating filler to reach a perceived target length.
Their practical method, short-m@k, launches k reasoning chains in parallel, halts generation as soon as the first m finish, and takes a majority vote over those m answers:
- short-1@k matches standard majority voting performance while using up to 40% fewer thinking tokens
- short-3@k consistently beats majority voting across all compute budgets, with up to 33% wall-time reduction
The method requires zero model changes. It’s pure inference-time optimization. You can apply it today with any reasoning model by spawning parallel completions and taking the first one that finishes.
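The idea can be sketched client-side with `asyncio`. This is a minimal sketch, not the authors' code: `sample_chain` is a hypothetical coroutine that runs one reasoning chain against your provider and returns its final answer, and the tie-breaking rule (the earliest-finishing answer wins) is an assumption of this example.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


async def short_m_at_k(
    sample_chain: Callable[[], Awaitable[str]],
    k: int,
    m: int,
) -> str:
    """Start k samples, keep the first m to finish, majority-vote their answers.

    `sample_chain` is a hypothetical coroutine that runs one reasoning
    chain and returns its final answer as a string.
    """
    if not 1 <= m <= k:
        raise ValueError("need 1 <= m <= k")
    tasks = [asyncio.ensure_future(sample_chain()) for _ in range(k)]
    answers: list[str] = []
    try:
        for finished in asyncio.as_completed(tasks):
            answers.append(await finished)
            if len(answers) >= m:
                break
    finally:
        # Cancel chains that are still thinking, then let the cancellations settle.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    # Counter orders equal counts by first appearance, so ties go to the
    # answer that finished first.
    return Counter(answers).most_common(1)[0][0]
```

Because shorter chains finish sooner, "first to finish" is a proxy for "shortest". Whether cancelling a request actually stops billing depends on the provider, so check that before relying on the token savings.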
When More Thinking Hurts: The Diminishing Returns Curve
Zhou et al.’s April 2026 paper ran controlled experiments on test-time compute scaling. Marginal returns diminish as you increase the token budget, and past a threshold, accuracy actually drops.
Three results stood out, in order of practical importance:
- Optimal thinking length varies by problem difficulty. A uniform token budget wastes compute on easy problems and still underspends on hard ones. A 200-token budget might be perfect for “What is 17 × 23?” but far too small for a multi-step algebra proof.
- Models abandon correct answers. Given unlimited tokens, a model will sometimes reach the correct answer midway through reasoning, then continue thinking and eventually switch to a wrong answer. The extra tokens add noise, not information.
- Stopping at moderate budgets saves compute with little accuracy loss. For most problem distributions, there’s a sweet spot where 60-70% of the maximum budget captures 95%+ of the accuracy, and the remaining 30-40% of compute buys almost nothing.
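One way to find that sweet spot for your own workload is to measure accuracy at a few budgets and pick the smallest one that keeps most of the peak. The helper below is a sketch under that assumption; the function name, the 95% default, and the example curve are all made up for illustration and are not taken from the paper.

```python
def smallest_sufficient_budget(
    curve: dict[int, float], keep_fraction: float = 0.95
) -> int:
    """Return the smallest token budget whose accuracy is at least
    keep_fraction of the best accuracy on the curve."""
    if not curve:
        raise ValueError("curve must contain at least one (budget, accuracy) pair")
    target = keep_fraction * max(curve.values())
    return min(budget for budget, acc in curve.items() if acc >= target)


# Made-up numbers for illustration only; measure your own curve on a
# held-out sample by rerunning it at several budgets.
example_curve = {200: 0.61, 400: 0.74, 600: 0.80, 800: 0.82, 1000: 0.81}
print(smallest_sufficient_budget(example_curve))  # prints 600
```

In this invented curve the peak is at 800 tokens, accuracy dips slightly at 1,000, and 600 tokens already clears 95% of the peak.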
Token-Budget-Aware Reasoning: Tell the Model How Much to Think
Han et al.’s Token-Budget-Aware framework (published December 2024, code on GitHub as TALE) acts on the same insight that Zhou et al. later measured: the right amount of reasoning depends on problem difficulty. The idea is to add an explicit token budget to the prompt, adjusted per problem based on an estimated difficulty score.
For easy problems, the prompt says “solve this in under 100 tokens.” For harder ones, “use up to 500 tokens.” The model cooperates with the budget constraint surprisingly well — it compresses its reasoning to fit, dropping redundant steps rather than cutting corners on logic.
The trade-off is a slight accuracy loss on some tasks. But the cost reduction is substantial, and the authors show it’s worth it for production deployments where you’re processing thousands of requests. The GitHub repository includes example implementations you can adapt.
Here’s a simplified version of how you’d implement budget-aware prompting in Python:
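from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Rough per-difficulty caps on reasoning tokens; tune these for your workload.
BUDGETS = {"easy": 100, "medium": 300, "hard": 800}

HARD_KEYWORDS = ("prove", "derive", "explain why")


def classify_difficulty(question: str) -> str:
    """Crude heuristic: keyword match for hard, short questions for easy."""
    # Check hard keywords first so a short "Prove that ...?" is not labeled easy.
    if any(kw in question.lower() for kw in HARD_KEYWORDS):
        return "hard"
    if len(question.split()) < 20 and "?" in question:
        return "easy"
    return "medium"


def reason_with_budget(question: str, model: str = "o1-mini") -> str:
    """Ask the model to answer within a difficulty-dependent token budget."""
    budget = BUDGETS[classify_difficulty(question)]
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                f"Answer this question. Use at most {budget} reasoning tokens. "
                f"Be concise and skip obvious steps.\n\n{question}"
            ),
        }],
        # Hard cap on reasoning plus visible output; the extra 200 tokens
        # leave room for the final answer.
        max_completion_tokens=budget + 200,
    )
    return response.choices[0].message.content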
The difficulty classifier here is crude (in production you’d use a lightweight model or embedding-based classifier), but the pattern works. Tell the model how much thinking is appropriate, and it compresses accordingly.
ReasonMaxxer: Targeted Corrections Match Full RL
The most recent paper in this roundup, “Rethinking RL for LLM Reasoning” (May 7, 2026), analyzed token-level behavior across multiple model families and RL algorithms. They found that RL’s beneficial impact is concentrated at high-entropy decision points, moments where the base model is genuinely uncertain which direction to take. Outside those moments, RL has negligible effect. The model already knows how to reason; it only needs a nudge at a handful of forks.
ReasonMaxxer exploits this by applying contrastive loss only at those decision points, using a few hundred base-model rollouts and no online generation. It matches or exceeds full RL performance across three model families, six scales, and six math reasoning benchmarks — while requiring only tens of problems and minutes of single-GPU training. (For a deeper look at how smaller models can match larger ones on reasoning through code-based CoT, see our coverage of THINC.)
If you’re fine-tuning a reasoning model, these results suggest you may not need a full RL pipeline. Identify the decision points where your model hesitates (high entropy in the next-token distribution), apply targeted corrections there, and skip the rest.
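To make "high entropy in the next-token distribution" concrete, the sketch below computes Shannon entropy per position and returns the positions above a threshold. It only illustrates how such decision points could be located. It is not the paper's training procedure, and the threshold and the assumption that you have per-position probabilities (for example, from top-k logprobs renormalized to sum to 1) are mine.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decision_points(distributions: list[list[float]], threshold: float) -> list[int]:
    """Indices of positions whose entropy exceeds `threshold`."""
    return [
        i for i, probs in enumerate(distributions)
        if token_entropy(probs) > threshold
    ]


# Toy distributions: the model is confident at positions 0 and 2,
# and torn between three continuations at position 1.
toy = [[0.97, 0.02, 0.01], [0.40, 0.35, 0.25], [0.99, 0.01]]
print(decision_points(toy, threshold=0.5))  # prints [1]
```

A fixed threshold is the simplest choice. A percentile over the whole sequence would adapt better to models with different overall confidence levels.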
Thinking with Reasoning Skills: Reuse What You’ve Already Figured Out
Zhao et al.’s “TRS: Thinking with Reasoning Skills” (2026) approaches the problem from a different angle. Every new request currently starts reasoning from scratch, but a human programmer encountering a new bug doesn’t re-derive debugging principles from first principles. They apply patterns they’ve seen before — “this looks like a race condition” or “check the null coalescing” — and go straight to the relevant fix.
TRS stores reusable reasoning skills extracted from previous successful problem-solving sessions. At inference time, the system retrieves relevant skills and applies them as shortcuts, skipping the exploration that the model already did on similar problems.
The approach reduces token usage while improving accuracy on coding and mathematical reasoning tasks. The per-request cost drops because the model doesn’t re-explore dead-end branches it already mapped out in previous sessions.
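The retrieve-then-apply loop might look something like the sketch below. Everything here is hypothetical: the `Skill` structure, the keyword-overlap scoring, and the prompt wording are stand-ins chosen to keep the example self-contained, and the paper's own extraction and retrieval mechanisms are likely more sophisticated.

```python
import re
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    keywords: frozenset[str]
    hint: str  # short reusable reasoning pattern


def retrieve_skills(question: str, library: list[Skill], top_n: int = 2) -> list[Skill]:
    """Rank skills by keyword overlap with the question; drop non-matches."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    scored = [(len(skill.keywords & words), skill) for skill in library]
    matches = [pair for pair in scored if pair[0] > 0]
    # sort is stable, so equally scored skills keep their library order
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in matches[:top_n]]


def prompt_with_skills(question: str, library: list[Skill]) -> str:
    """Prepend retrieved hints so the model can skip re-deriving them."""
    hints = "\n".join(f"- {s.hint}" for s in retrieve_skills(question, library))
    prefix = f"Known patterns that may apply:\n{hints}\n\n" if hints else ""
    return prefix + question
```

An embedding index would be the natural replacement for keyword overlap once the skill library grows beyond a handful of entries.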
Comparison: Which Method Fits Your Use Case
| Method | Token Reduction | Accuracy Impact | Requires Training | Best For |
|---|---|---|---|---|
| Sketch-of-Thought | Up to 84% | Neutral to +2.5% | No (prompting) | Math, multi-hop reasoning |
| short-m@k | Up to 40% | Neutral to +4% | No (inference) | Any reasoning model |
| Token Budget | Up to 69% | Slight loss (1-3%) | No (prompting) | High-volume batch jobs |
| ReasonMaxxer | N/A (training efficiency) | Matches full RL | Yes (minutes, 1 GPU) | Fine-tuning reasoning models |
| TRS | Significant (not quantified) | Positive | Yes (skill extraction) | Repeated problem patterns |
| Early Stopping | 30-40% | Neutral | No (inference) | Cost-sensitive production |
The fastest wins come from prompting-based methods (Sketch-of-Thought and token budgets) because they need zero infrastructure changes. Changing your prompt template is enough to see savings on the next batch run. short-m@k is almost as easy if your provider supports parallel completions (most do). ReasonMaxxer and TRS require training but deliver more durable improvements.
For production deployments, I’d stack methods: use a difficulty classifier to route easy problems to a cheaper model entirely (skip reasoning), apply token budgets on medium problems, and reserve full reasoning for genuinely hard inputs. That’s roughly what the Route-and-Reason framework from Tsinghua suggests, and the compounding savings are large.
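The stacking described above reduces to a small routing table. The sketch below assumes you already have a difficulty label for each request; the model names are placeholders and the budgets are arbitrary examples, not recommendations from any of the papers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    model: str
    reasoning_budget: Optional[int]  # None means no explicit cap


# Placeholder model names and budgets; substitute your own.
PLANS = {
    "easy": Plan(model="cheap-non-reasoning-model", reasoning_budget=0),
    "medium": Plan(model="reasoning-model", reasoning_budget=300),
    "hard": Plan(model="reasoning-model", reasoning_budget=None),
}


def plan_request(difficulty: str) -> Plan:
    """Map a difficulty label to a model and reasoning budget."""
    # Unknown labels fall back to full reasoning, the safe but costly default.
    return PLANS.get(difficulty, PLANS["hard"])
```

Falling back to full reasoning on an unknown label trades cost for safety. If misrouted hard problems are cheap to retry, the opposite default may be the better choice.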
What This Means for Your Inference Bill
The papers converge on a practical rule of thumb: most LLM reasoning tasks use 2-5x more tokens than necessary. A model that generates 1,000 thinking tokens for a task that needs 200 is wasting 80% of your compute budget on noise.
If you’re running reasoning models in production (o1 for code review, DeepSeek-R1 or V4 Pro for math pipelines, Claude with extended thinking for complex analysis), the reductions reported above suggest that applying even one of these techniques can cut your per-request cost by 30-50%. Stacking them can push savings past 70%.
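The stacking figure is easy to sanity-check with arithmetic, under the simplifying assumption that each technique removes a fixed fraction of whatever the previous ones left over. Real techniques overlap (two methods may trim the same redundant tokens), so treat the result as an optimistic estimate.

```python
def stacked_savings(reductions: list[float]) -> float:
    """Combined fractional saving when each reduction applies to what the
    previous ones left over (assumes the effects are independent)."""
    remaining = 1.0
    for r in reductions:
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reduction must be a fraction between 0 and 1")
        remaining *= 1.0 - r
    return 1.0 - remaining


print(f"{stacked_savings([0.30, 0.30]):.0%}")  # prints 51%
print(f"{stacked_savings([0.50, 0.50]):.0%}")  # prints 75%
```

Under that assumption, two techniques at 30-50% each land between 51% and 75% combined.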
These techniques sit alongside other inference optimizations (attention sparsity, cache compression, speculative decoding) covered in our LLM inference efficiency guide. The bigger insight is about the nature of LLM reasoning itself. These models aren’t thinking harder when they generate longer chains. They’re often doing the equivalent of a student filling up an exam page with repetitive restatements because they think length demonstrates effort. The actual reasoning, the decision points where the model commits to a direction, is sparse and concentrated. The rest is padding.
FAQ
How do I reduce LLM reasoning token costs?
Three immediate options that need zero training: add explicit token budgets to your prompts (e.g., “solve this in under 200 tokens”), use Sketch-of-Thought prompting to have the model reason in shorthand rather than prose, or run parallel completions with short-m@k and take the first to finish. All three work with existing API providers and require only prompt changes or client-side logic.
What is LLM overthinking and how do I fix it?
Overthinking is when a model generates far more reasoning tokens than a problem requires, spending 1,000+ tokens on a question solvable in 100. It wastes money and can actually reduce accuracy, since models sometimes abandon correct answers during extended reasoning. Fix it by constraining reasoning length, routing easy problems to cheaper models, or stopping inference early when the model’s token entropy drops (indicating it has reached a conclusion).
What is Sketch-of-Thought for efficient LLM reasoning?
Sketch-of-Thought (Aytes et al., EMNLP 2025) prompts models to reason using shorthand (symbols, notation, domain abbreviations) instead of full sentences. A lightweight routing model selects one of three approaches (Conceptual Chaining, Chunked Symbolism, Expert Lexicons) based on the question type. It cuts token usage by up to 84% with neutral-to-positive accuracy impact across 18 datasets.
Is shorter LLM reasoning always better?
Not always, but more often than you’d expect. Hassid et al. showed that the shortest reasoning chain sampled for a question is up to 34.5% more accurate than the longest one. The exceptions are genuinely complex multi-step problems where each reasoning step builds on the previous one. Truncating too early there loses necessary intermediate computations. The practical move is difficulty-adaptive: short reasoning for easy problems, full reasoning for hard ones.
How does batch prompting reduce reasoning tokens?
Batch prompting groups multiple questions into a single prompt. The model amortizes its reasoning setup across all items instead of repeating it per question. Srivastava et al. measured a 76% token reduction (from 2,950 to 710 tokens per query on average) because the model generalizes from earlier examples in the batch and suppresses the hedging/metacognitive loops it would run on isolated questions.
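A batched prompt needs two pieces: a builder that numbers the questions and a parser that maps the reply back to them. The sketch below is one possible shape. The prompt wording and the `<number>: <answer>` reply format are assumptions of this example, not the format used by Srivastava et al.

```python
import re
from typing import Optional


def build_batch_prompt(questions: list[str]) -> str:
    """Number the questions and ask for one short answer line per item."""
    if not questions:
        raise ValueError("questions must not be empty")
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return (
        f"Answer each of the {len(questions)} questions below. "
        "Reply with one line per question in the form '<number>: <answer>'.\n\n"
        f"{numbered}"
    )


def parse_batch_reply(reply: str, n: int) -> list[Optional[str]]:
    """Map '<number>: <answer>' lines back to question order.

    Questions the model skipped, or lines it malformed, come back as None.
    """
    answers: list[Optional[str]] = [None] * n
    for line in reply.splitlines():
        match = re.match(r"\s*(\d+)\s*:\s*(.+)", line)
        if match and 1 <= int(match.group(1)) <= n:
            answers[int(match.group(1)) - 1] = match.group(2).strip()
    return answers
```

Returning `None` for missing items lets the caller retry just those questions individually instead of rerunning the whole batch.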
Sources
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models — the foundational survey categorizing approaches into model-based, output-based, and input-based efficiency (Sui et al., TMLR 2025)
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching — 84% token reduction via cognitive shorthand paradigms (Aytes et al., EMNLP 2025)
- Don’t Overthink it: Preferring Shorter Thinking Chains for Improved LLM Reasoning — shorter chains are 34.5% more accurate, short-m@k method (Hassid et al., 2025-2026)
- When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling — diminishing returns and negative flips in extended reasoning (Zhou et al., April 2026)
- Token-Budget-Aware LLM Reasoning — dynamic token budget allocation by difficulty (Han et al., 2024)
- Rethinking RL for LLM Reasoning: It’s Sparse Policy Selection, Not Capability Learning — ReasonMaxxer matches full RL with minutes of training (May 2026)
- Thinking with Reasoning Skills: Fewer Tokens, More Accuracy — reusable reasoning shortcuts for repeated problem patterns (Zhao et al., 2026)
- Batch Prompting Suppresses Overthinking Reasoning Under Constraint — 76% token reduction via batched queries (Srivastava et al.)
Bottom Line
The LLM reasoning tax is real and measurable. Every paper in this roundup attacks the same underlying issue: models generate far more reasoning tokens than the problem requires. The fix is making reasoning proportional to difficulty.
My own approach after reading these papers: I pipe easy classification tasks through a non-reasoning model (GPT-4o-mini or Haiku), apply token budgets on medium tasks, and reserve full extended thinking for the handful of genuinely hard problems. That combination cut my monthly inference spend by roughly 60% without any measurable accuracy loss on the tasks I track. The research backs this up — and the methods keep getting better.