Streaming sends tokens as they are generated rather than waiting for the complete response. As long as the finished response is valid JSON, streaming appears to work. The problem surfaces at the consumer: until the stream completes, the accumulated text is a partial JSON string that no parser will accept. If your code parses before the stream ends, it throws. If the stream is interrupted, you are left with invalid JSON and no error signal.
Analysis Briefing
- Topic: LLM streaming, JSON parsing, and partial response handling
- Analyst: Mike D (@MrComputerScience)
- Context: A technical briefing developed with Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: What is the right architecture for consuming streaming JSON from an LLM without intermittent parse failures?
The Three Failure Modes of Streaming JSON
Premature parsing. The most common mistake is parsing each chunk as it arrives. A chunk that contains {"name": "Mi is not valid JSON. Any JSON parser called on it will throw. Code that works in testing (where the mock returns a complete response) fails in production (where streaming sends partial chunks) because the test environment never exercised the streaming path correctly.
Stream interruption producing invalid JSON. Network timeouts, rate limit responses, and server errors can interrupt a stream mid-response. The accumulated buffer at interruption contains a valid JSON prefix with no closing braces or brackets. There is no parse error at the HTTP level because the stream sent valid HTTP chunks. The failure is invisible until the JSON parser runs on the complete buffer and throws on malformed input.
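This failure mode is easy to reproduce without a network: truncate a valid JSON document and parse it. In the sketch below, the truncated string stands in for the buffer an interrupted stream would leave behind.

```python
import json

# A complete response, and the prefix an interrupted stream would leave behind
complete = '{"name": "Mike", "handle": "@MrComputerScience"}'
truncated = complete[:20]  # a valid JSON prefix, but not valid JSON

json.loads(complete)  # parses fine

try:
    json.loads(truncated)
except json.JSONDecodeError:
    # The error only appears at parse time; the HTTP layer saw nothing wrong
    print("truncated buffer is not parseable")
```

Nothing in the transport layer flags the truncation; the only place the failure can surface is the JSON parser, which is why the parse must be wrapped in explicit error handling.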
Structural JSON from LLMs is unreliable without structured output mode. Even a complete, uninterrupted stream from an LLM that was asked to produce JSON will occasionally be malformed. The model might add a comment, close a bracket incorrectly, or append trailing text after the closing brace. Malformed JSON can still occur even with structured output mode enabled; without it, the rate is higher.
The Correct Architecture for Streaming JSON Consumption
Accumulate the full stream before parsing. Do not attempt to parse chunks as they arrive. Buffer all chunks into a string, and parse once the stream signals completion with a done event.
```python
import anthropic
import json

client = anthropic.Anthropic()

def stream_json_response(prompt: str) -> dict:
    buffer = []
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            buffer.append(text)
    full_response = "".join(buffer)
    # Strip markdown code fences if present
    if full_response.startswith("```"):
        lines = full_response.split("\n")
        full_response = "\n".join(lines[1:-1])
    try:
        return json.loads(full_response)
    except json.JSONDecodeError as e:
        # Log the full response for debugging
        raise ValueError(f"Failed to parse JSON: {e}\nResponse: {full_response}") from e
```
For use cases where you need to display streaming progress to a user while also getting structured output at the end, stream text for display and parse the complete accumulated response for data consumption as two separate operations.
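One way to structure that dual path is a small relay generator: it yields each chunk for immediate display while appending it to a shared buffer, and the buffer is parsed once after the stream ends. The sketch below is SDK-free; the `chunks` list and the helper name `relay_and_buffer` are illustrative stand-ins for a real `text_stream`.

```python
import json
from typing import Iterable, Iterator

def relay_and_buffer(chunks: Iterable[str], buffer: list) -> Iterator[str]:
    """Yield each chunk for display while appending it to a shared buffer."""
    for chunk in chunks:
        buffer.append(chunk)
        yield chunk

# Simulated stream chunks; in production these would come from the SDK stream
chunks = ['{"status":', ' "ok", "items":', ' [1, 2, 3]}']

buffer = []
for piece in relay_and_buffer(chunks, buffer):
    print(piece, end="", flush=True)  # display path: show progress immediately
print()

data = json.loads("".join(buffer))   # data path: parse once, after completion
```

The display path never attempts a parse, and the data path never runs until the stream is done, so the two concerns cannot interfere with each other.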
When to Use Structured Output Mode
Anthropic and OpenAI both offer structured output modes that constrain model outputs to a specific JSON schema. This reduces (but does not eliminate) malformed JSON and eliminates the problem of the model adding prose before or after the JSON block.
Use structured output mode for any production system that requires JSON output. The schema constraint forces the model to produce valid structure at the cost of occasional refusals or truncated outputs when the model cannot fit its answer into the schema. Handle those cases explicitly rather than assuming they won’t occur.
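With the Anthropic API, one common way to get schema-constrained output is to define a tool whose `input_schema` is your target schema, force the model to call it via `tool_choice`, and read the arguments out of the `tool_use` content block. The sketch below is illustrative, not a definitive implementation: the tool name `record_profile` and the schema are assumptions, the parameters would be passed to `client.messages.create`, and the helper is shown against the dict form of the response (real SDK responses are typed objects with the same fields).

```python
# Illustrative target schema -- adapt to your own data model
PROFILE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "handle": {"type": "string"},
    },
    "required": ["name", "handle"],
}

# Parameters you would pass to client.messages.create in the Anthropic SDK;
# tool_choice forces the model to answer via the tool, i.e. within the schema
REQUEST_PARAMS = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "tools": [{
        "name": "record_profile",  # illustrative tool name
        "description": "Record the extracted profile.",
        "input_schema": PROFILE_SCHEMA,
    }],
    "tool_choice": {"type": "tool", "name": "record_profile"},
}

def extract_tool_input(content_blocks: list, tool_name: str) -> dict:
    """Pull the schema-constrained arguments out of the tool_use block."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    # The model refused or returned no tool call -- surface it explicitly
    raise ValueError(f"no tool_use block named {tool_name!r} in response")
```

The explicit `ValueError` is the "handle those cases explicitly" part: a forced tool call can still come back without usable arguments, and that should fail loudly rather than return an empty dict.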
What This Means For You
- Never parse streaming chunks as they arrive. Buffer the complete stream and parse once, because partial JSON is always invalid JSON regardless of how the stream is progressing.
- Always handle JSONDecodeError explicitly with the full response logged, because a silent failure that returns an empty dict is harder to debug than an explicit error that shows you what the model actually produced.
- Use structured output mode in production for any endpoint that requires JSON, and add a fallback that retries with an explicit repair prompt when the parse fails.
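The retry-with-repair fallback from the last point can be sketched as a small wrapper. The repair prompt wording here is an assumption to tune for your model, and `call_model` is an injected callable (in production it would wrap a non-streaming completion call) so the fallback logic can be tested without an API key.

```python
import json
from typing import Callable

# Hypothetical repair prompt -- wording is an assumption, tune for your model
REPAIR_PROMPT = (
    "The following text was supposed to be valid JSON but failed to parse. "
    "Return only the corrected JSON, with no commentary:\n\n{bad}"
)

def parse_with_repair(raw: str, call_model: Callable[[str], str],
                      max_retries: int = 1) -> dict:
    """Parse raw JSON; on failure, ask the model to repair it and retry."""
    attempt = raw
    for attempt_num in range(max_retries + 1):
        try:
            return json.loads(attempt)
        except json.JSONDecodeError as e:
            if attempt_num == max_retries:
                # Out of retries: fail loudly with the last attempt attached
                raise ValueError(
                    f"JSON still invalid after {max_retries} repair "
                    f"attempt(s): {e}\nLast attempt: {attempt}"
                ) from e
            attempt = call_model(REPAIR_PROMPT.format(bad=attempt))
```

Keeping `max_retries` low matters: each repair attempt is another model call, and if two repairs in a row fail, the problem is almost certainly the original prompt or schema, not the JSON surface syntax.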
Enjoyed this? Subscribe for more clear thinking on AI:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
