Cactus Compute Drops a 26M Parameter AI Model That Runs at 6,000 Tokens/Second on Your Laptop
TL;DR
Cactus Compute released Needle on May 12 — a 26 million parameter model distilled from Gemini 3.1 that runs on consumer hardware at 6,000 tokens/second. MIT licensed, open source, and built specifically for tool calling. It is the clearest signal yet that capable AI agents are moving off the cloud and onto the device.
- 26M parameters — smaller than most chatbot models by 1,000×
- 6,000 tokens/second prefill speed on a consumer laptop
- $0 per-token cost when running locally — changes agent economics entirely
Needle is a 26 million parameter model released by Cactus Compute on May 12, with weights on Hugging Face and code on GitHub under an MIT licence. It was distilled from Gemini 3.1 using a novel architecture the team calls a Simple Attention Network — no MLP or feed-forward layers, just attention and gating. The result runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on a consumer Mac or PC. That is not a typo.
What it is built for. Needle is not a general-purpose chat model. It was designed specifically for single-shot function calling — the mechanism by which AI agents invoke tools, trigger automations, and take actions. Most on-device AI research targets language generation. Needle targets the action layer: the part of an agent stack that decides what to do next and calls the right function to do it.
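To make "single-shot function calling" concrete, here is a minimal sketch of the general shape of the task: the model receives a set of tool schemas plus a user request, and emits exactly one JSON function call. The tool names, schemas, and prompt/response format below are illustrative assumptions, not Needle's actual interface.

```python
import json

# Hypothetical tool definitions an agent might expose (illustrative only).
TOOLS = [
    {
        "name": "set_reminder",
        "description": "Schedule a reminder",
        "parameters": {"time": "ISO 8601 timestamp", "text": "reminder text"},
    },
    {
        "name": "search_tabs",
        "description": "Search open browser tabs",
        "parameters": {"query": "search string"},
    },
]

def build_prompt(user_request: str) -> str:
    """Pack the tool schemas and the request into one prompt string."""
    return json.dumps({"tools": TOOLS, "request": user_request})

def parse_call(model_output: str) -> tuple[str, dict]:
    """A single-shot model emits one function call; extract name + arguments."""
    call = json.loads(model_output)
    return call["name"], call["arguments"]

# Simulated model output for "remind me at 9am to stand up":
output = '{"name": "set_reminder", "arguments": {"time": "2026-05-13T09:00:00", "text": "stand up"}}'
name, args = parse_call(output)
print(name, args["text"])  # set_reminder stand up
```

The routing layer never generates prose — it only selects a function and fills its arguments, which is why a 26M parameter model can plausibly handle it.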
Why the size matters. Cloud-hosted models charge by the token. At volume, API costs become significant — especially for automations that run hundreds or thousands of times per day. A 26M parameter model that handles tool routing locally at effectively zero cost per call changes the economics of agent-heavy products. For a Chrome extension, a mobile app, or a background automation workflow, the difference between $0.003/call and $0.000/call is the difference between a viable product and an unviable one at scale.
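The unit-economics claim is easy to check with back-of-envelope arithmetic using the article's $0.003/call figure; the call volume is an illustrative assumption.

```python
# Cloud vs. local cost for a high-frequency automation (illustrative volume).
calls_per_day = 1_000          # assumed automation volume
cloud_cost_per_call = 0.003    # USD, per the article's example
monthly_cloud = calls_per_day * 30 * cloud_cost_per_call
monthly_local = 0.0            # local inference has no per-token charge
print(f"cloud: ${monthly_cloud:.2f}/mo, local: ${monthly_local:.2f}/mo")
# cloud: $90.00/mo, local: $0.00/mo
```

At 1,000 calls/day the cloud bill is $90/month per user; at extension-store scale, that multiplies across every install.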
How to run it. Cactus published a one-command setup: `git clone https://github.com/cactus-compute/needle.git && cd needle && source ./setup needle playground` — which opens a local web UI at `http://127.0.0.1:7860` for testing and finetuning on your own tool definitions. Finetuning on a consumer GPU is supported out of the box.
The broader Cactus platform. Needle is one component of the Cactus AI engine, a mobile and edge inference framework targeting smartphones, laptops, and wearables. The Needle release is framed as a research run for Simple Attention Networks — the team explicitly expects the architecture to improve with scale.
Why It Matters
The cloud dependency assumption is cracking. Most AI agent architectures today assume the model lives in the cloud: you make an API call, pay per token, absorb the latency. Needle challenges that assumption for a specific but important task — tool calling and function execution. If the routing layer of an agent stack can run locally at near-zero latency and zero marginal cost, the architecture of AI-powered products changes. Mobile apps, browser extensions, and edge automations become viable that were not before. This is early — Needle is a research preview, and 26M parameters is limited in scope. But the direction is clear: the model layer is following the same trajectory as every prior computing paradigm, from mainframe to PC to mobile. AI agents on the device are a question of when, not if.
Who's Affected
- Builders of Chrome extensions and browser tools — a local tool-calling model eliminates per-call API cost for high-frequency automations
- Mobile app developers adding AI agent capabilities — on-device inference at this speed is viable for production use cases that cloud latency would make awkward
- Automation builders on Make.com, Zapier, n8n — local model orchestration could eventually replace cloud API calls for routing logic in self-hosted agent workflows
- AI tool product teams tracking cost-per-query — the $0 marginal cost model changes unit economics for agent-heavy products at scale
What To Do Now
1. If you are building a Chrome extension or browser automation tool, pull the Needle repo and test it against your tool definitions this week. The setup takes under 10 minutes and the speed numbers are worth seeing firsthand.
2. If you are paying meaningful OpenAI or Anthropic API costs for routing/orchestration logic (not generation), Needle is worth evaluating as a local replacement for that specific step. Generation still warrants a frontier model; routing may not.
3. Watch the Cactus platform roadmap. Needle is the function-calling layer; the broader Cactus engine targets mobile and wearable deployment. If the architecture proves out, this becomes relevant to mobile app builders much sooner than most expect.
4. MIT licence means commercial use is unrestricted. You can ship Needle inside a commercial product today without a licensing conversation.
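The routing/generation split in step 2 can be sketched as a hybrid handler: a small local model picks the tool, and only generation-heavy requests reach a paid cloud model. `local_route` and `cloud_generate` are hypothetical stand-ins, not real APIs.

```python
def local_route(request: str) -> str:
    """Stand-in for an on-device routing model (free, fast, local)."""
    return "summarize_doc" if "summarize" in request else "search_tabs"

def cloud_generate(request: str) -> str:
    """Stand-in for a frontier-model API call (paid, higher latency)."""
    return f"[frontier model output for: {request}]"

def handle(request: str) -> str:
    tool = local_route(request)          # zero marginal cost per call
    if tool == "summarize_doc":
        return cloud_generate(request)   # pay only when generation is needed
    return f"ran {tool} locally"

print(handle("search my tabs for invoices"))  # ran search_tabs locally
```

The point of the design: the per-token bill scales with generation requests only, not with every routing decision the agent makes.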