TL;DR

Researchers at Korea University trained a 4-billion-parameter model to solve competition-level math problems by writing and executing code instead of reasoning in natural language. Their framework, THINC, scored 78.1% across five elite benchmarks — beating Qwen3-235B-A22B-Thinking (75.2%), a model with roughly 60x more parameters. The trick: code does the reasoning, the Python interpreter verifies every step, and natural language shows up only for a brief planning sentence at the start. 99.2% of the model’s answers come directly from interpreter output, leaving almost no room for the hallucinated arithmetic that plagues chain-of-thought reasoning.

When I Saw the Benchmark Table, I Checked It Twice

I spend a lot of time reading papers about LLM reasoning, enough to be skeptical when a title promises a small model beating a large one. Usually the benchmark is cherry-picked, the comparison is unfair, or the margin vanishes under scrutiny. So when THINC’s paper showed a 4B model outperforming Qwen3-235B on four out of five competition-level math benchmarks, I went straight to the methodology section before reading anything else.

The claim checks out, and the reason it works is surprisingly clean: instead of letting the model reason in English and occasionally call a code interpreter, you make code the entire reasoning medium. The natural language reasoning step, the one where most models hallucinate calculations, gets reduced to a single planning sentence. Everything else is Python.

I’ve been thinking about this result for a few days now, and it changes how I think about where reasoning capability actually lives in these models. The bottleneck was the reasoning medium, not the model’s problem-solving ability. (This connects to a pattern I’ve noticed across recent model reviews — how you use the model can matter more than how big it is.)

The Problem THINC Solves

Most “tool-integrated reasoning” (TIR) systems follow the same pattern: the model writes natural language reasoning, calls a code interpreter to verify something, reads the output, then continues reasoning in natural language. Systems like ASTER, ReTool, and ToRA all work this way. The model thinks in English and uses code as a calculator.

This creates three specific failure modes:

  1. Code verifies instead of derives. The model does the actual reasoning in natural language, then writes code to check its work. If the NL reasoning is wrong, the verification code often just implements the same mistake.

  2. Unverified arithmetic slips through. Between code blocks, the model performs mental math in natural language. Numbers get rounded, carried incorrectly, or fabricated. The interpreter never sees these intermediate calculations.

  3. The model second-guesses the interpreter. After getting a code output, TIR models sometimes override it with their own natural language reasoning, literally ignoring verified computation in favor of vibes.

THINC’s fix is structural: don’t let the model reason in natural language at all (past the initial plan). Every derivation, every intermediate value, every calculation runs through the Python interpreter. The model’s job is to write code that solves the problem, and the interpreter’s job is to produce the answer.

How THINC Works

The framework has three stages: trajectory distillation, supervised fine-tuning (SFT), and reinforcement learning (RL). Each stage is straightforward on its own. The contribution is how they’re combined to force code-centric behavior.

Stage 1: Distilling Code-Centric Trajectories

The team used Qwen3.5-27B as a teacher model, prompting it with 3-shot examples to produce code-centric solutions for math problems from Skywork-OR1 and OpenMathReasoning. They filtered aggressively (a code sketch of the filter follows the list):

  • Only correct answers kept
  • All code blocks must execute without errors
  • At least three code blocks per trajectory
  • Less than 50% of tokens spent on natural language planning
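
To make the criteria concrete, here is roughly what that filter looks like as code. This is my own sketch, not the paper's pipeline, and the trajectory fields (final_answer, code_blocks, nl_tokens, total_tokens) are assumed names.

# Sketch of the distillation filter described above. The data layout and field
# names are assumptions, not the paper's actual format.
def keep_trajectory(traj, reference_answer):
    # 1. Only correct final answers survive.
    if traj["final_answer"] != reference_answer:
        return False
    # 2. Every code block must have executed without an error.
    if any(block["error"] is not None for block in traj["code_blocks"]):
        return False
    # 3. At least three code blocks, so code genuinely carries the reasoning.
    if len(traj["code_blocks"]) < 3:
        return False
    # 4. Natural-language planning must stay under half of the total tokens.
    return traj["nl_tokens"] / max(traj["total_tokens"], 1) < 0.5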

This yielded 12,200 trajectories where code genuinely carries the reasoning. A THINC trajectory looks like this:

# Planning thought (NL): "This is a combinatorics problem.
# I'll enumerate valid (a,b) pairs where a+b+ab ≤ 100."

# Code block 1: brute force enumeration
results = set()
for a in range(1, 100):
    for b in range(a+1, 100):
        val = a + b + a * b
        if val <= 100:
            results.add(val)
print(f"Count: {len(results)}")
# Output: Count: 70

# Code block 2: verify with algebraic reformulation
results_v2 = set()
for a in range(1, 100):
    for b in range(a+1, 100):
        # a + b + ab = (a+1)(b+1) - 1
        val = (a + 1) * (b + 1) - 1
        if val <= 100:
            results_v2.add(val)
print(f"Verification: {len(results_v2)}")
# Output: Verification: 70

Compare that to a standard TIR trajectory, where the model would write two paragraphs of natural language reasoning between those code blocks. Paragraphs where it might miscalculate or introduce unverified assumptions.

Stage 2: Supervised Fine-Tuning

The 12.2K trajectories become the training data for fine-tuning Qwen3-1.7B and Qwen3-4B-Thinking-2507. Standard setup: learning rate 7×10⁻⁶ with cosine schedule, batch size 16, three epochs, 32K context length. The SFT stage teaches the model the format (how to produce code-centric trajectories) but doesn’t make it good at math yet.
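
For a sense of scale, those hyperparameters translate into something like the configuration below. The choice of Hugging Face TrainingArguments is my assumption (the paper does not name its training stack), and the per-device batch split is illustrative.

# Hypothetical SFT configuration mirroring the reported hyperparameters.
# The framework choice and the per-device batch split are assumptions.
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="thinc-4b-sft",
    learning_rate=7e-6,               # reported learning rate
    lr_scheduler_type="cosine",       # cosine schedule
    per_device_train_batch_size=2,    # 8 GPUs x 2 = batch size 16 (assumed split)
    num_train_epochs=3,               # three epochs
    bf16=True,
)
# The reported 32K-token context length would be enforced at tokenization time,
# not in these arguments.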

After SFT alone, THINC-4B-SFT scored 48.1% on the benchmark suite. That’s worse than the teacher model (64.7%) and worse than the tool-prompted baseline (62.9%). SFT establishes the format; the gains come from RL.

Stage 3: Reinforcement Learning With GRPO

The RL stage teaches the model to solve hard problems in code. The team used Group Relative Policy Optimization (GRPO) on DAPO-Math-17k, with verifiable rewards. (GRPO comes from the DeepSeekMath line of work — a simpler alternative to PPO that drops the critic model.) The reward signal is simply whether the code produces the correct final answer.
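
The core of GRPO fits in a few lines: sample a group of solutions for one problem, score each with the binary reward, and normalize rewards within the group rather than training a critic. A minimal sketch of that mechanism, not THINC's actual training code:

# Group-relative advantages with a binary verifiable reward (illustrative only).
import statistics

def verifiable_reward(final_answer: str, ground_truth: str) -> float:
    # Reward is 1.0 only when the interpreter-produced answer matches the reference.
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # Each rollout is scored against its own group; no learned critic is needed.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against all-identical rewards
    return [(r - mean) / std for r in rewards]

# Example: 8 rollouts for one problem, 3 of which reached the correct answer.
rewards = [verifiable_reward(a, "70") for a in ["70", "68", "70", "71", "70", "64", "129", "72"]]
advantages = group_relative_advantages(rewards)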

Training runs in three curriculum stages:

Stage | Steps | Context Length | Max Tool Calls | Data
1     | 280   | 16K tokens     | 20             | Full problem set
2     | 120   | 16K tokens     | 20             | Filtered (100%-solved problems removed)
3     | 400+  | 32K tokens     | 40             | Filtered

The curriculum gradually increases difficulty: easy problems get removed, context grows, and the model gets more tool calls to work with. RL added 29.9 percentage points at 4B scale, the single biggest jump in the pipeline.

All training ran on a single node with 8× NVIDIA H200 GPUs. The compute is modest by 2026 standards.

The Numbers

  • 78.1%: THINC-4B average accuracy
  • 75.2%: Qwen3-235B-A22B average accuracy
  • 99.2%: THINC answers grounded in interpreter output
  • 349: lines of code per THINC solution

The benchmark suite covers AIME 2024, AIME 2025, AIME 2026, HMMT 2025, and BeyondAIME. All competition-level math. Here are the full results (avg@16, average accuracy over 16 samples per problem):

Model             | Params | AIME 24 | AIME 25 | AIME 26 | HMMT 25 | BeyondAIME | Avg
THINC-4B          | 4B     | 88.3%   | 85.8%   | 86.0%   | 74.0%   | 56.1%      | 78.1%
Qwen3-235B-A22B   | 235B   | 90.6%   | 80.6%   | 82.1%   | 68.8%   | 54.1%      | 75.2%
ASTER-4B          | 4B     | 78.8%   | 84.6%   | 78.8%   | 73.1%   | 54.0%      | 73.8%
Qwen3-4B-Thinking | 4B     | 79.2%   | 73.1%   | 76.7%   | 50.2%   | 45.8%      | 65.0%
THINC-1.7B        | 1.7B   | 59.0%   | 50.2%   | 42.9%   | 39.0%   | 22.7%      | 42.8%
Qwen3-1.7B        | 1.7B   | 47.3%   | 35.0%   | 36.2%   | 22.5%   | 19.8%      | 32.2%

THINC-4B beats the 235B model on four of five benchmarks. The one exception is AIME 2024, the most saturated benchmark in the set, and the gap is narrow (88.3% vs 90.6%).

At 1.7B parameters, THINC still pulls its weight: it lifts the base Qwen3-1.7B from 32.2% to 42.8%, a 10.6 percentage point gain in a model small enough to run on a laptop.
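
A note on the metric: avg@16 is not pass@16. Each problem gets 16 sampled solutions, every solution is scored, and the per-problem accuracies are averaged across the benchmark. In code, something like:

# How avg@16 is computed (illustrative helper, not the paper's evaluation code).
def avg_at_k(correct_per_problem: list[list[bool]]) -> float:
    # correct_per_problem[i] holds k booleans, one per sampled solution of problem i
    per_problem = [sum(c) / len(c) for c in correct_per_problem]
    return sum(per_problem) / len(per_problem)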

Efficiency: Fewer Calls, Shorter Responses

The efficiency numbers are just as striking:

Metric                     | THINC-4B     | ASTER-4B
Tool calls per problem     | 6.1          | 11.1
Response length            | 13.5K tokens | 15.4K tokens
Lines of code per solution | 349          | 102

More lines of code, fewer tool calls, shorter overall response. THINC writes denser code blocks that do more work per call, instead of the short-snippet-plus-long-NL pattern of interleaved systems.

Why Code Reasoning Beats English Reasoning

The 99.2% interpreter-grounded answer rate is the most telling metric in the paper. In comparison, ReTool grounds 88.4% of answers in interpreter output, and rStar2 manages 74.3%. The gap means that in roughly 1 out of every 4 rStar2 solutions, the model’s final answer comes from its own natural language reasoning instead of verified computation.

Three properties of code make it a better reasoning medium for math.

Every intermediate value gets verified. When THINC-4B computes (a+1)*(b+1) - 1, the Python interpreter runs the actual multiplication. There’s no room for the model to quietly write “which gives us 143” when the real answer is 131. Chain-of-thought reasoning doesn’t have this check. The model generates both the computation and the result, and nobody verifies the arithmetic.

Errors are also explicit and recoverable. A wrong calculation in natural language looks like correct text. The model and the reader both pass over it. A wrong calculation in code throws a ValueError or produces an obviously incorrect output, and the model can catch and fix it in the next code block. The paper measures this: when THINC-4B encounters 5 consecutive code execution errors, it still recovers and produces a correct final answer 33.3% of the time. ASTER manages 18.5%. rStar2 recovers 0%.
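
Mechanically, the recovery loop is simple: run the block, and if it raises, hand the traceback back to the model as context for the next block. A sketch of that idea (my illustration, not THINC's implementation):

# Execute one code block and surface any error so the model can react to it.
import traceback

def run_block(code: str, namespace: dict) -> str:
    # Returns the traceback text on failure, or an empty string on success;
    # either way the result gets appended to the model's context.
    try:
        exec(code, namespace)
        return ""
    except Exception:
        return traceback.format_exc()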

Code also forces decomposition. Writing code for a complex problem requires breaking it into functions, loops, and intermediate variables. That structural decomposition is exactly what good mathematical reasoning needs, and it happens automatically when the reasoning medium is code. Natural language can paper over gaps with phrases like “by similar reasoning” or “which gives us.” Code can’t.

Out-of-Distribution: GPQA-Diamond

The paper also tests THINC on GPQA-Diamond, a science QA benchmark that’s outside the training distribution (math competition problems). THINC-4B scored 66.48% avg@16, edging out the Qwen3-4B base model at 66.32% and beating ASTER-4B’s 63.42%.

The gains are much smaller here than in math, but the fact that a math-trained code-reasoning model doesn’t lose performance on science questions is encouraging. Code-centric reasoning generalizes at least somewhat. You can use Python to verify physics calculations, chemistry stoichiometry, and statistical claims the same way you’d verify competition math.

What THINC Can’t Do Yet

The paper is upfront about three limitations.

Everything was tested at 1.7B and 4B parameters because of compute constraints. Would the code-reasoning advantage persist at 70B or 400B? Larger models already have better internal arithmetic (the latest frontier models rarely make arithmetic mistakes at all), so the gap might shrink. Or code-centric reasoning might compound with scale and produce even bigger gains. The paper doesn’t have the data to say.

The training data and evaluation are also all competition math. Problems that don’t reduce to computation (literary analysis, ethical reasoning, creative writing) won’t benefit from code-centric reasoning. The GPQA-Diamond results hint at cross-domain transfer, but a 0.16 percentage point gain isn’t much to build on.

Then there’s the interpreter dependency. THINC requires a code interpreter at inference time. That’s fine for cloud deployment (vLLM, together.ai, any managed endpoint with sandboxed execution), but it rules out pure-text inference and makes edge deployment harder. Every code block needs a round-trip to a Python runtime.

I’d add one more: the training data is distilled from a 27B teacher model. THINC-4B’s 78.1% accuracy exists in the context of a teacher that was already strong at math. Whether the method works with a weaker teacher, or with entirely self-generated trajectories, is an open question. The paper’s results show what’s possible with good distillation; the floor of the technique is less clear.

What This Means for Practitioners

If you’re building reasoning pipelines today (RAG systems that verify claims, coding agents that validate their own output, math tutoring tools), THINC suggests a concrete change: stop treating code as a verification step and start treating it as the primary reasoning channel.

I’ve started experimenting with this in my own Claude Code workflows. When I need an agent to reason about data (calculate statistics, verify numerical claims, check consistency across a dataset), I now prompt it to produce a Python script first and derive the conclusion from the script’s output, rather than asking it to “think step by step” in natural language and then optionally write code.
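
The shape of what I ask for looks like the script below: derive the claim, don't assert it. The numbers are invented purely to show the pattern.

# "Write the script, read the conclusion from its output" pattern.
# The dataset here is hypothetical, for illustration only.
import statistics

quarterly_revenue = [12.4, 13.1, 11.8, 14.0]   # made-up values, in $M

growth = (quarterly_revenue[-1] - quarterly_revenue[0]) / quarterly_revenue[0] * 100
print(f"First-to-last growth: {growth:.1f}%")
print(f"Mean: {statistics.mean(quarterly_revenue):.2f}, stdev: {statistics.pstdev(quarterly_revenue):.2f}")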

The results are anecdotal and not measured, but the error rate on numerical claims has dropped noticeably. The model still makes mistakes in the code (wrong loop bounds, off-by-one errors), but those mistakes throw exceptions or produce visibly wrong output. They don’t hide inside plausible-sounding sentences.

For researchers, the THINC paper opens a question about reasoning scaling laws. We’ve been measuring how reasoning quality scales with parameter count, but THINC shows that the medium of reasoning might matter more than raw size. A 4B model reasoning in code beats a 235B model reasoning in English. A different axis to optimize along, and a much cheaper one.

FAQ

Can large language models reason with code instead of natural language?

Yes, and the THINC paper provides strong evidence that code-based reasoning produces more accurate results on mathematical problems. THINC-4B achieves 78.1% accuracy on competition math by conducting all reasoning through Python code blocks, with 99.2% of its answers derived from interpreter output rather than natural language generation.

Is code reasoning better than chain-of-thought reasoning for math?

For computation-heavy problems, code reasoning outperforms chain-of-thought by a wide margin. THINC-4B (4B parameters) beat Qwen3-235B-A22B-Thinking (235B parameters) on four of five competition math benchmarks. The key advantage: every intermediate calculation is verified by the Python interpreter, eliminating the hallucinated arithmetic that chain-of-thought is prone to.

How does THINC compare to other tool-integrated reasoning approaches?

THINC differs from standard TIR systems like ASTER, ReTool, and ToRA by making code the primary reasoning medium rather than an auxiliary verification tool. Where ASTER uses 11.1 tool calls with 102 lines of code per problem, THINC uses 6.1 calls with 349 lines. Denser code blocks that do more computation per call, producing both higher accuracy and shorter overall responses.

Can small LLMs outperform larger ones with the right training approach?

THINC demonstrates that a 4B parameter model can outperform a 235B model when trained to reason in code rather than natural language. The advantage comes from the medium of reasoning (verified code vs unverified text), not from raw model size. The 1.7B variant also shows large gains over its base model (42.8% vs 32.2%), though it can’t match the larger models overall.

Does code-based reasoning work outside of math?

Early signs are mixed. THINC-4B tested on GPQA-Diamond (science QA, outside the training distribution) scored 66.48% — slightly above the base model’s 66.32% and above ASTER-4B’s 63.42%. The code-reasoning capability transfers to problems involving calculation and quantitative reasoning, but the gains are much smaller than in pure math. Problems that don’t reduce to computation likely won’t benefit.

Bottom Line

THINC’s result is clean and the mechanism is clear. Code-centric reasoning works because it eliminates the unverified gap between “thinking” and “computing.” Every step runs through an interpreter, every intermediate value is real, and errors surface as exceptions instead of hiding in plausible-sounding sentences.

The 4B-beats-235B headline grabs attention, but the metric I keep coming back to is the 99.2% answer grounding rate. That number means the model almost never makes up its final answer. It reads it from verified code output. If I were building a system that needed reliable numerical reasoning, I’d build around that number.