What Claude Code's Extended Thinking Actually Reveals Inside Its Reasoning Output — AI article on gikiewicz.com

Anthropic’s Claude Code has become a focal point for developers who want to understand how large language models reason through complex programming tasks. The tool’s “Extended Thinking” mode exposes intermediate reasoning steps that normally remain hidden. Users see raw chain-of-thought text appear in real time. This visibility raises practical questions about what the output actually contains and how it should inform engineering decisions.

TL;DR: Extended Thinking is a documented Claude feature that surfaces intermediate reasoning tokens before the model produces its final answer. According to Anthropic’s model overview, Claude models are large language models developed as a family of systems with distinct capability tiers. The thinking output is structured text the model generates while working through problems.

What Is Extended Thinking in Claude Code?

Extended Thinking is a capability documented in Anthropic’s Claude API that causes the model to produce visible intermediate reasoning before delivering a final response. According to the official model documentation, Claude is described as a family of large language models developed by Anthropic, and the thinking feature applies across the model lineup. When enabled, the model allocates additional computational effort to work through problems step by step.

The feature generates what Anthropic calls “thinking tokens.” These are textual representations of the model’s internal problem-solving process. Developers integrating Claude through the API can configure this behavior explicitly. The output appears as a distinct section separate from the final answer the model returns.

Claude Code surfaces this reasoning directly in the terminal interface. Users watch the model narrate its approach to code generation, debugging, or architectural decisions. The thinking text streams as the model works. This differs from standard interactions where only the completed response is visible.

Anthropic’s documentation positions Extended Thinking as a trade-off between capability depth and computational cost. The model spends more tokens reasoning internally, which consumes more of the context window. This directly affects how developers plan their API usage and budget.

How Does the Thinking Output Differ From the Final Response?

The thinking output and the final response serve fundamentally different purposes in Claude’s architecture. Thinking text represents the model’s working process. The final response represents the polished deliverable the model produces after completing its reasoning chain.

According to Anthropic’s model selection guide, choosing the optimal Claude model involves balancing three key considerations: capabilities, speed, and cost. Extended Thinking adds a fourth dimension to this balance. The thinking tokens consume context window capacity and increase latency, but they give the model space to decompose problems before committing to an answer.

In Claude Code specifically, the thinking section often contains the model examining the repository structure, identifying relevant files, and weighing multiple approaches. The final response then presents the chosen solution. Developers reading both sections get visibility into alternatives the model considered and rejected. This transparency can reveal whether the model understood the task constraints correctly.

The two sections also differ in format. Thinking text tends to be more stream-of-consciousness, with the model asking itself questions and noting observations. The final output follows a more structured format appropriate to the task, whether that is a code diff, a function implementation, or an architectural recommendation.

Which Claude Models Support Extended Thinking?

Anthropic’s model overview documents Claude as a family of large language models, and the Extended Thinking capability applies across the current model tiers. The documentation presents available models and compares their performance characteristics. Developers need to consult the current model list to determine which specific versions support the feature at any given time.

The model selection documentation emphasizes that different models serve different use cases. When choosing a model, developers balance capabilities against speed and cost according to their specific requirements. Extended Thinking interacts with each model tier differently because higher-capability models generally produce more detailed reasoning chains.

Claude Code users interact with these models through the CLI tool, which handles the API communication. The thinking output format remains consistent across models that support the feature. What changes is the depth and quality of reasoning each model tier produces.

Developers should verify current model availability through Anthropic’s documentation rather than relying on third-party summaries. The model lineup evolves, and new versions may introduce changes to how Extended Thinking behaves or which models support it.

What Does the Thinking Text Actually Contain?

The thinking text in Claude Code typically contains several categories of content that reveal how the model approaches programming tasks. Based on user observations and Anthropic’s documentation of Claude’s capabilities, the output includes genuine problem-solving narration.

The model frequently examines the project context within its thinking. This includes referencing specific files, considering import dependencies, and evaluating how proposed changes interact with existing code. The reasoning often shows the model checking whether a particular approach aligns with patterns already present in the repository.

Thinking sections also reveal the model’s uncertainty management. The text shows Claude identifying ambiguities in the request and making assumptions to proceed. Sometimes the model explicitly notes where it lacks information and decides on a reasonable default. This gives developers insight into whether their prompt was sufficiently specific.

The output contains alternative consideration as well. The model sometimes weighs different implementation strategies, noting trade-offs between approaches before selecting one. Developers can observe whether the model considered an approach they would have preferred, which helps refine future prompts.

Finally, the thinking text includes self-correction. The model sometimes begins down one path, identifies a problem, and adjusts course within the thinking section before producing the final answer. This internal debate is invisible in standard mode but becomes visible with Extended Thinking enabled.

Why Does the Thinking Output Matter for Developers?

The thinking output matters because it transforms the developer-model interaction from a black-box transaction into an observable process. According to Anthropic’s model selection guidance, developers already balance capabilities, speed, and cost when choosing models. Extended Thinking adds interpretability to that equation.

For debugging, the thinking section reveals whether the model misread the prompt, missed a file, or made an incorrect assumption. Instead of receiving a wrong answer and guessing why, developers can trace the model’s reasoning to find where it diverged from expectations. This diagnostic capability reduces iteration cycles.

The output also serves as a prompt engineering tool. When developers see the model asking itself questions during thinking, those questions highlight missing context. If the model wonders whether a function should be synchronous or asynchronous, the developer knows to specify that in future prompts. The thinking text becomes feedback on prompt quality.

For code review purposes, the reasoning section helps developers evaluate whether the model’s solution is sound. Seeing the alternatives the model considered and rejected provides context for the final decision. Reviewers can assess whether the model’s reasoning aligns with the team’s engineering principles.

The transparency also builds appropriate trust. Developers can verify the model is reasoning correctly rather than producing confident-sounding but poorly grounded output. This matters especially for complex architectural decisions where the stakes are higher.

How Is Extended Thinking Different From Standard Chain-of-Thought?

Standard chain-of-thought prompting asks a model to reason step by step within the response itself. Extended Thinking is a distinct architectural feature that separates the reasoning process from the final output. The thinking happens in a dedicated section that the API treats differently from the response content.

Anthropic’s documentation describes Claude as a family of state-of-the-art large language models. The Extended Thinking feature represents a specific capability within this family that goes beyond simple chain-of-thought techniques. The model allocates dedicated computational resources to the thinking phase.

In practice, this means the thinking section can be longer and more detailed than what would fit naturally in a response. The model is not constrained by the need to present reasoning and answers in a single coherent flow. Instead, it uses the thinking section as a scratchpad and then synthesizes conclusions into the final response.

Claude Code users see this separation clearly. The thinking text streams first, often containing exploratory analysis and dead ends. Then the final response appears, clean and focused. This two-phase approach produces different output quality compared to models that must reason and answer simultaneously.

How Does Extended Thinking Handle Complex Code Refactoring?

Extended thinking processes code refactoring through a structured analysis pipeline where Claude breaks down large changes into sequential reasoning steps. According to Claude API documentation, the Claude Opus 4 and Claude Sonnet 4 models support extended thinking with configurable token budgets ranging from 1,024 to over 32,000 tokens per reasoning block. The output reveals how the model evaluates dependencies before modifying code. Each reasoning trace shows explicit consideration of import paths, variable scopes, and function signatures.

The thinking blocks expose intermediate hypotheses. Developers can read how Claude tests assumptions about type compatibility and runtime behavior. This transparency helps programmers catch reasoning errors before the model writes production code. The traces also show when Claude deliberately pauses to verify edge cases.

During refactoring tasks, extended thinking output typically follows a pattern: scope assessment, risk identification, step planning, and execution. The reasoning text documents why specific functions were selected for modification and why others were left untouched. This creates an auditable trail that standard output does not provide.

What Token Budgets Are Available for Extended Thinking?

Claude’s API exposes extended thinking through a thinking parameter where developers allocate a specific token budget for reasoning before the model generates its final response. The documentation specifies minimum thresholds of 1,024 tokens and supports scaling up to tens of thousands depending on the model selected. Claude Opus 4 supports the largest thinking budgets, while Claude Sonnet 4 offers a balance of reasoning depth and response speed.

Developers configure this through the budget_tokens field in API requests. The model consumes tokens from this budget during its internal reasoning process, and any unused portion does not carry over. The thinking output is returned separately from the final response text, allowing applications to display or hide it as needed. This separation is architecturally significant.

Token cost is a real consideration. Reasoning tokens are billed at the same rate as input and output tokens, meaning a 32,000-token thinking block on Claude Opus 4 adds measurable cost to each request. Teams must evaluate whether deeper reasoning improves output quality enough to justify the expense.

Can Developers Inspect or Debug the Thinking Output?

Yes. The Claude API returns thinking content in a dedicated thinking content block within the response object, completely separate from the text response. This design lets developers inspect, log, and analyze reasoning traces programmatically. The documentation shows that thinking blocks include a type field set to "thinking" and contain the raw reasoning text as a string value.

This inspectability has practical debugging applications. When Claude Code produces unexpected output, developers can read the thinking trace to understand where the reasoning diverged from expectations. The traces reveal whether the model misread a file, made an incorrect assumption about project structure, or skipped a necessary verification step.

Teams building tools on Claude’s API can surface these traces in user interfaces. Claude Code itself displays condensed versions of thinking blocks in the terminal, giving developers real-time visibility into the model’s problem-solving approach. This stands in contrast to closed reasoning systems.

How Does Extended Thinking Compare Across Claude Model Tiers?

Anthropic’s model lineup offers extended thinking at different performance and price points. According to the models overview documentation, Claude Opus 4 delivers the deepest reasoning capabilities and supports the largest thinking budgets, priced at $15 per million input tokens and $75 per million output tokens. Claude Sonnet 4 provides a mid-tier option at $3 per million input tokens and $15 per million output tokens while still supporting extended thinking.

The reasoning quality differs between tiers. Opus 4 models demonstrate stronger performance on complex multi-step reasoning tasks, making them better suited for intricate codebase analysis and architectural decisions. Sonnet 4 models handle moderately complex reasoning efficiently, offering faster response times at lower cost. The thinking output from each tier reflects these capability differences.

Developers choose based on task complexity. Routine code generation may only need Sonnet 4’s thinking budget, while large-scale refactoring or security analysis benefits from Opus 4’s deeper reasoning. The API documentation explicitly recommends matching model selection to task requirements.

What Are the Cost Implications of Extended Thinking in Production?

Extended thinking tokens are billed at standard model rates, meaning reasoning adds direct cost to every API call that uses it. A single request with a 10,000-token thinking block on Claude Opus 4 costs $0.15 in reasoning tokens alone, before any output is generated. At scale, this compounds significantly across thousands of requests.

The documentation provides transparent pricing for all tiers. Claude Sonnet 4 thinking tokens cost $3 per million for input and $15 per million for output, making extended reasoning roughly five times cheaper than Opus 4. Teams running Claude Code in continuous integration pipelines must account for these costs when configuring thinking budgets.

Cost optimization strategies emerge from the thinking output itself. By analyzing reasoning traces, developers can identify tasks where smaller thinking budgets produce equivalent results. Some operations require deep reasoning; others do not. The visibility into thinking blocks enables data-driven budget tuning.

Frequently Asked Questions

Does extended thinking improve code generation accuracy?

Extended thinking gives Claude space to reason through complex problems before committing to an answer. According to Anthropic’s model documentation, Claude Opus 4 with extended thinking achieves higher performance on coding benchmarks compared to standard inference, particularly on multi-step refactoring tasks where intermediate reasoning helps the model avoid cascading errors.

How much latency does extended thinking add to API responses?

Latency scales with the allocated thinking budget. A 1,024-token thinking block adds minimal delay, while a 32,000-token block on Claude Opus 4 can add several seconds of processing time before the final response begins streaming. The documentation notes that thinking tokens are generated sequentially before output, creating a measurable gap between request and first response token.

Is the thinking output cached between API calls?

The Claude API supports prompt caching for input tokens, which can reduce costs when the same context is sent repeatedly. Thinking tokens generated during a request are not automatically cached for subsequent calls. Developers must re-enable extended thinking on each request where reasoning is needed, and the model generates fresh reasoning traces every time.

Can extended thinking be combined with tool use in Claude Code?

Yes. Claude’s API allows extended thinking to operate alongside tool use, meaning Claude Code can reason through a problem, decide which tools to invoke, and execute file operations within a single turn. The documentation confirms that thinking blocks precede tool use decisions, giving the model space to plan its tool interactions before acting.

Summary

Extended thinking output in Claude Code provides visibility into the model’s reasoning process through dedicated thinking blocks that are separate from final responses. Key takeaways:

  • Token budgets are configurable from 1,024 to 32,000+ tokens, with costs billed at standard model rates ($15/$75 per million tokens for Opus 4, $3/$15 for Sonnet 4)
  • Thinking blocks are fully inspectable through the API, returned as typed content blocks that developers can log, analyze, and display
  • Model tier affects reasoning depth, with Claude Opus 4 supporting larger budgets and stronger multi-step reasoning than Sonnet 4
  • Debugging benefits are concrete — reasoning traces reveal where assumptions break and why specific code decisions are made
  • Tool use integrates with thinking, allowing Claude Code to plan file operations and command execution during the reasoning phase

For developers building on Claude’s API, extended thinking transforms the model from a black box into an observable reasoning system. The thinking output is not decorative — it is a diagnostic tool that improves code quality, debugging efficiency, and cost management. Read the Claude API model documentation and the model selection guide to configure thinking budgets for your specific use case.