Reasoning-channel models eat your max_tokens budget

You switch to a reasoning-class model, your narratives start truncating mid-sentence, and your answer-quality metric doesn't move. Everything looks fine until it isn't.

The mechanism

This isn't a Bedrock quirk. It's how reasoning models bill their output everywhere. The token ceiling you set covers total output, and for a reasoning model total output includes the reasoning channel. The model thinks against the same budget your answer draws from. Spend enough of it thinking and the answer truncates, with a stop reason that flags exactly that.

The field name is the only thing that moves between providers. Bedrock Converse and the Anthropic API call it max_tokens. OpenAI renamed it max_completion_tokens for the o-series, precisely because reasoning tokens now count against it. Gemini calls it maxOutputTokens and adds a separate thinkingBudget knob, but thinking still spends from the output side and still truncates the answer if you starve it. Same invariant, three spellings.

We hit it on Bedrock with gpt-oss, so that's the code below, but the failure is identical wherever you are. gpt-oss emits its analysis channel before the final answer. Claude with thinking spends a budget before it writes a word you'll keep. o1 and o3 burn reasoning tokens you never see. In every case, code that sized the ceiling to expected narrative length was sizing for the answer alone. The model spends a chunk of that budget thinking, hits the cap mid-prose, and the call comes back with stopReason: max_tokens (or finish_reason: length, same thing) on a sentence that stops mid-clause.

The text it hands you is real prose. It reads fine in a spot check. Your eval, if it scores coherence on what's present rather than completeness against what should be there, won't flag it. The regression shows up downstream, which is the worst place to find it.

Why this bit us harder than a cosmetic chop

Our truncated output was a monthly report, and we cache it. A mid-sentence narrative doesn't just look bad in one response. It gets written to the cache and served for thirty days. The chop becomes the artifact. Nothing regenerates it, because nothing upstream thinks anything failed. The stopReason was sitting right there in the response and the code read the text instead of the flag.

For a cached artifact, a chopped answer is strictly worse than no answer. No answer triggers the fallback and a regeneration. A chopped answer poisons the slot until the next cycle. So for us the correct outcome on truncation is nothing. Discard it, fail the call, let the chain produce something whole or produce nothing and retry. Never cache a partial.

The fix, in two parts

The invariant, which is always correct. Treat stopReason == max_tokens as a hard failure. Discard the text, including the empty-answer case where reasoning consumed the entire budget and left no answer block at all, emit a metric, and advance the fallback chain. This is the non-negotiable part, and it's the part that actually saves you. The dial below only changes how often it fires.

The dial, which is local tuning. Give the model headroom so the invariant fires rarely. We multiply the caller's expected size by a safety factor before it reaches the model. Ours is 3x. That number is ours. It's tuned to our reports, our prompts, and gpt-oss at our fixed reasoning-effort setting, and you should not lift it. The generalizable claim is "reserve room for the reasoning channel on top of your answer budget," not "the number is three." The principled version couples the reserve to reasoning effort, since effort is what drives reasoning-token consumption, and we wire effort as an additional model field anyway. A flat multiplier is fine when your effort setting is fixed, which ours is. The moment effort becomes dynamic, a constant stops being safe.

One cost note, because it's easy to misread. Raising the cap is free. You bill generated tokens, not the ceiling, so a completion that stops on its own never touches the headroom and costs what it did before. What isn't free is the reasoning itself. Those are real output tokens you now pay for on every call, visible nowhere near the max_tokens line. That's the actual cost of the migration, and it's worth naming so nobody reads "3x max_tokens, same cost" as "reasoning is free."

Demo (gpt-oss)

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/bedrockruntime"
    "github.com/aws/aws-sdk-go-v2/service/bedrockruntime/types"
    "github.com/aws/smithy-go/document"
)

func summarize(ctx context.Context, c *bedrockruntime.Client, maxTokens int32) {
    out, err := c.Converse(ctx, &bedrockruntime.ConverseInput{
        ModelId: aws.String("openai.gpt-oss-120b-1:0"),
        Messages: []types.Message{{
            Role: types.ConversationRoleUser,
            Content: []types.ContentBlock{
                &types.ContentBlockMemberText{Value: "Write a five-sentence summary of the French Revolution."},
            },
        }},
        InferenceConfig: &types.InferenceConfiguration{MaxTokens: aws.Int32(maxTokens)},
        AdditionalModelRequestFields: document.NewLazyDocument(map[string]any{
            "reasoning_effort": "medium",
        }),
    })
    if err != nil {
        log.Fatal(err)
    }

    if out.StopReason == types.StopReasonMaxTokens {
        fmt.Printf("[max_tokens=%d] TRUNCATED, discarding\n\n", maxTokens)
        return // never ship or cache this
    }
    msg := out.Output.(*types.ConverseOutputMemberMessage).Value
    text := msg.Content[0].(*types.ContentBlockMemberText).Value
    fmt.Printf("[max_tokens=%d] clean:\n%s\n\n", maxTokens, text)
}

func main() {
    cfg, _ := config.LoadDefaultConfig(context.TODO())
    c := bedrockruntime.NewFromConfig(cfg)
    summarize(context.TODO(), c, 200) // reasoning eats the budget, chop
    summarize(context.TODO(), c, 600) // headroom, clean termination
}

Run it and the 200-token call chops mid-sentence with stopReason: max_tokens while the 600-token call terminates clean. The completion that fit was billed identically in both. Note the 200 is gpt-oss-specific. Claude thinking rejects a budget that small outright, since the minimum thinking budget is 1024 and max_tokens has to clear it, so the starve-it version of this demo only works on a model that lets you starve it.

What travels and what's ours

The invariant travels to every reasoning-class deployment, on Bedrock or off it: reserve budget for the reasoning channel, and treat a max-tokens stop as a failure rather than a short answer. The multiplier, the token numbers, the cache semantics, and the fix locations are ours. Lift the lesson, not the constants.