Cactus Compute Drops a 26M Parameter AI Model That Runs at 6,000 Tokens/Second on Your Laptop
TL;DR
Cactus Compute released Needle on May 12 — a 26 million parameter model distilled from Gemini 3.1 that runs on consumer hardware at 6,000 tokens/second. MIT licensed, open source, and built specifically for tool calling. It is the clearest signal yet that capable AI agents are moving off the cloud and onto the device.
- 26M parameters — smaller than most chatbot models by 1,000×
- 6,000 tokens/second prefill speed on a consumer laptop
- $0 per-token cost when running locally — changes agent economics entirely
Needle is a 26 million parameter model released by Cactus Compute on May 12, with weights on Hugging Face and code on GitHub under an MIT licence. It was distilled from Gemini 3.1 using a novel architecture the team calls a Simple Attention Network — no MLP or feed-forward layers, just attention and gating. The result runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on a consumer Mac or PC. That is not a typo.
What it is built for. Needle is not a general-purpose chat model. It was designed specifically for single-shot function calling — the mechanism by which AI agents invoke tools, trigger automations, and take actions. Most on-device AI research targets language generation. Needle targets the action layer: the part of an agent stack that decides what to do next and calls the right function to do it.
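To make "single-shot function calling" concrete, here is a minimal sketch of the general shape of the task: the model receives a set of tool schemas plus a user request, and emits exactly one JSON function call. The tool names, schemas, and prompt/response format below are illustrative assumptions, not Needle's actual interface.

```python
import json

# Hypothetical tool definitions an agent might expose (illustrative only).
TOOLS = [
    {
        "name": "set_reminder",
        "description": "Schedule a reminder",
        "parameters": {"time": "ISO 8601 timestamp", "text": "reminder text"},
    },
    {
        "name": "search_tabs",
        "description": "Search open browser tabs",
        "parameters": {"query": "search string"},
    },
]

def build_prompt(user_request: str) -> str:
    """Pack the tool schemas and the request into one prompt string."""
    return json.dumps({"tools": TOOLS, "request": user_request})

def parse_call(model_output: str) -> tuple[str, dict]:
    """A single-shot model emits one function call; extract name + arguments."""
    call = json.loads(model_output)
    return call["name"], call["arguments"]

# Simulated model output for "remind me at 9am to stand up":
output = '{"name": "set_reminder", "arguments": {"time": "2026-05-13T09:00:00", "text": "stand up"}}'
name, args = parse_call(output)
print(name, args["text"])  # set_reminder stand up
```

The routing layer never generates prose — it only selects a function and fills its arguments, which is why a 26M parameter model can plausibly handle it.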
Why the size matters. Cloud-hosted models charge by the token. At volume, API costs become significant — especially for automations that run hundreds or thousands of times per day. A 26M parameter model that handles tool routing locally at effectively zero cost per call changes the economics of agent-heavy products. For a Chrome extension, a mobile app, or a background automation workflow, the difference between $0.003/call and $0.000/call is the difference between a viable product and an unviable one at scale.
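The unit-economics claim is easy to check with back-of-envelope arithmetic using the article's $0.003/call figure; the call volume is an illustrative assumption.

```python
# Cloud vs. local cost for a high-frequency automation (illustrative volume).
calls_per_day = 1_000          # assumed automation volume
cloud_cost_per_call = 0.003    # USD, per the article's example
monthly_cloud = calls_per_day * 30 * cloud_cost_per_call
monthly_local = 0.0            # local inference has no per-token charge
print(f"cloud: ${monthly_cloud:.2f}/mo, local: ${monthly_local:.2f}/mo")
# cloud: $90.00/mo, local: $0.00/mo
```

At 1,000 calls/day the cloud bill is $90/month per user; at extension-store scale, that multiplies across every install.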
How to run it. Cactus published a one-command setup: `git clone https://github.com/cactus-compute/needle.git && cd needle && source ./setup needle playground` — which opens a local web UI at `http://127.0.0.1:7860` for testing and finetuning on your own tool definitions. Finetuning on a consumer GPU is supported out of the box.
The broader Cactus platform. Needle is one component of the Cactus AI engine, a mobile and edge inference framework targeting smartphones, laptops, and wearables. The Needle release is framed as a research run for Simple Attention Networks — the team explicitly expects the architecture to improve with scale.
Why It Matters
The cloud dependency assumption is cracking. Most AI agent architectures today assume the model lives in the cloud: you make an API call, pay per token, absorb the latency. Needle challenges that assumption for a specific but important task — tool calling and function execution. If the routing layer of an agent stack can run locally at near-zero latency and zero marginal cost, the architecture of AI-powered products changes. Mobile apps, browser extensions, and edge automations become viable that were not before. This is early — Needle is a research preview, and 26M parameters is limited in scope. But the direction is clear: the model layer is following the same trajectory as every prior computing paradigm, from mainframe to PC to mobile. AI agents on the device are a question of when, not if.
Who's Affected
- Builders of Chrome extensions and browser tools — a local tool-calling model eliminates per-call API cost for high-frequency automations
- Mobile app developers adding AI agent capabilities — on-device inference at this speed is viable for production use cases that cloud latency would make awkward
- Automation builders on Make.com, Zapier, n8n — local model orchestration could eventually replace cloud API calls for routing logic in self-hosted agent workflows
- AI tool product teams tracking cost-per-query — the $0 marginal cost model changes unit economics for agent-heavy products at scale
What To Do Now
1. If you are building a Chrome extension or browser automation tool, pull the Needle repo and test it against your tool definitions this week. The setup takes under 10 minutes and the speed numbers are worth seeing firsthand.
2. If you are paying meaningful OpenAI or Anthropic API costs for routing/orchestration logic (not generation), Needle is worth evaluating as a local replacement for that specific step. Generation still warrants a frontier model; routing may not.
3. Watch the Cactus platform roadmap. Needle is the function-calling layer; the broader Cactus engine targets mobile and wearable deployment. If the architecture proves out, this becomes relevant to mobile app builders much sooner than most expect.
4. MIT licence means commercial use is unrestricted. You can ship Needle inside a commercial product today without a licensing conversation.
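The routing/generation split in step 2 can be sketched as a hybrid handler: a small local model picks the tool, and only generation-heavy requests reach a paid cloud model. `local_route` and `cloud_generate` are hypothetical stand-ins, not real APIs.

```python
def local_route(request: str) -> str:
    """Stand-in for an on-device routing model (free, fast, local)."""
    return "summarize_doc" if "summarize" in request else "search_tabs"

def cloud_generate(request: str) -> str:
    """Stand-in for a frontier-model API call (paid, higher latency)."""
    return f"[frontier model output for: {request}]"

def handle(request: str) -> str:
    tool = local_route(request)          # zero marginal cost per call
    if tool == "summarize_doc":
        return cloud_generate(request)   # pay only when generation is needed
    return f"ran {tool} locally"

print(handle("search my tabs for invoices"))  # ran search_tabs locally
```

The point of the design: the per-token bill scales with generation requests only, not with every routing decision the agent makes.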