You can run local agentic AI workflows on a Raspberry Pi in 2026 using Ollama with models like Qwen3-Coder or DeepSeek-V3.2. No cloud. No subscription. No API bill. The catch: hardware constraints are real, and “self-improving” needs a precise definition before you start.
Analysis Briefing
- Topic: Agentic AI workflows on Raspberry Pi locally
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by DeepSeek-V3.2
- Source: Pithy Cyborg
- Key Question: Can a $100 Pi actually run agentic AI without melting or calling home?
What Ollama on Raspberry Pi Actually Supports in 2026
Ollama runs on 64-bit ARM, which means Raspberry Pi 4 and Pi 5 are viable targets. The Pi 5 with 8GB RAM is the practical floor for anything resembling a capable agentic loop.
Qwen3-Coder at the 7B quantized level (Q4_K_M) fits in roughly 5GB of RAM. DeepSeek-V3.2 is a different story: the full model is enormous. You need the distilled or heavily quantized variants (1.5B to 7B range) to run it locally on Pi hardware. Plan for 4-bit quantization or smaller parameter counts as your baseline.
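The ~5GB figure can be sanity-checked with back-of-the-envelope arithmetic. This is a rough estimate, not a measurement; the overhead number is an assumption standing in for KV cache and runtime buffers:

```python
# Rough RAM estimate for a Q4-quantized 7B model.
# The overhead figure is an assumption (KV cache, runtime buffers),
# not a measured value; real usage varies with context length.
params = 7e9                  # 7B parameters
bytes_per_param = 0.5         # ~4 bits per weight after Q4 quantization
weights_gb = params * bytes_per_param / 1e9   # weights alone
overhead_gb = 1.5             # assumed KV cache + runtime overhead
total_gb = weights_gb + overhead_gb

print(f"estimated footprint: {total_gb:.1f} GB")  # → estimated footprint: 5.0 GB
```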
Ollama handles model pulling, serving, and REST API exposure automatically. Once running, any agentic framework that hits a local HTTP endpoint can use it. That includes LangChain, AutoGen, and lighter Python orchestration scripts.
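A minimal sketch of hitting that endpoint needs nothing beyond the standard library. The `/api/generate` route and the `localhost:11434` default port are Ollama's; the model tag used here is a placeholder for whatever you pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt: str, model: str = "qwen3-coder:7b") -> bytes:
    """Assemble a non-streaming generation request body.

    The model tag is a placeholder; substitute whichever model you pulled.
    """
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "qwen3-coder:7b") -> str:
    """POST the prompt to the local Ollama server and return the completion text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because it is just HTTP against localhost, the same call works from LangChain, AutoGen, or a bare script.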
How to Wire Up a Self-Improving Agent Loop Locally
“Self-improving” in this context means the agent generates outputs, evaluates them, and uses that evaluation to refine the next prompt or sub-task. Full weight fine-tuning is not happening on a Pi. That is not what this setup does.
The realistic architecture is: a Python orchestrator sends a task to Ollama, captures the output, feeds it back as context with a critique prompt, and iterates. Tools like AutoGen or CrewAI can manage this loop. For Pi-scale hardware, lighter custom scripts often outperform heavy frameworks in latency and RAM use.
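The generate-critique-refine loop above can be sketched in a few lines. The `call_model` parameter stands in for whatever sends a prompt to Ollama; the function name and critique wording are illustrative assumptions, not a fixed API:

```python
def refine(task: str, call_model, rounds: int = 3) -> str:
    """Iteratively improve a draft by feeding the model its own critique.

    call_model: any function mapping a prompt string to a response string,
    e.g. a thin wrapper around the local Ollama endpoint.
    """
    draft = call_model(f"Task: {task}\nProduce your best attempt.")
    for _ in range(rounds):
        # Ask the model to evaluate its own output.
        critique = call_model(
            f"Task: {task}\nDraft:\n{draft}\n"
            "List concrete flaws in this draft, or reply DONE if none."
        )
        if "DONE" in critique:
            break
        # Feed the critique back as context for the next attempt.
        draft = call_model(
            f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the draft, fixing every flaw listed."
        )
    return draft
```

Because the model call is injected, the loop itself can be exercised with a stub, with no running server required.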
A functional stack looks like this:
- Raspberry Pi 5 (8GB) running Raspberry Pi OS 64-bit
- Ollama installed via the official ARM install script
- A quantized Qwen3-Coder or DeepSeek distill model pulled with ollama pull
- A Python agent loop using the ollama Python library or direct requests calls to localhost:11434
When This Setup Works and When It Hits a Wall
This works well for coding assistants, document summarizers, structured data extraction, and iterative prompt refinement tasks. Throughput on a Pi 5 with a 7B Q4 model is roughly 5-15 tokens per second. Usable for automated background tasks. Not usable for real-time conversation.
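Those rates make the background-vs-real-time distinction concrete. Illustrative arithmetic, not a benchmark:

```python
# How long a medium-length response takes at Pi-scale throughput.
# The token count and the 5-15 tok/s range are illustrative figures.
response_tokens = 300
for tokens_per_sec in (5, 15):
    seconds = response_tokens / tokens_per_sec
    print(f"{tokens_per_sec} tok/s -> {seconds:.0f}s for {response_tokens} tokens")
# → 5 tok/s -> 60s for 300 tokens
# → 15 tok/s -> 20s for 300 tokens
```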
It does not work well for multi-agent parallelism, vision-language tasks at scale, or anything requiring the full DeepSeek-V3.2 parameter count. The Pi also has no GPU acceleration path for these models. Every token runs on CPU.
Thermal management matters. Under sustained agentic load, a Pi 5 will throttle without active cooling. A small heatsink and fan are not optional here.
What This Means For You
- Start with Qwen3-Coder 7B Q4_K_M. It balances reasoning quality and RAM use better than most alternatives at this size on ARM hardware.
- Use the ollama Python library directly instead of LangChain for Pi deployments. Fewer dependencies mean lower RAM overhead and faster startup.
- Avoid DeepSeek-V3.2 full-size models entirely on Pi. Pull a distilled variant (1.5B or 7B) and set expectations accordingly.
- Add active cooling before running sustained loops. Thermal throttling kills token throughput faster than model size does.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg → AI news made simple without hype.
