How to Setup a Local Coding Agent on macOS - Kyle Howells

How to Setup a Local Coding Agent on macOS

Running Gemma 4 26B-A4B and Qwen3.6 35B-A3B locally with llama.cpp, MTP speculative decoding, multimodal support, and Pi as a coding agent.

Qwen
coding agent
llama.cpp
Pi
Gemma
macOS
llm

Posted: 18 hours ago
Last Modified: 18 hours ago

By Kyle Howells
11 min read

My internet had failed a few times recently, leaving me stranded without a coding agent, so when I saw the "Gemma 4 now runs 2x faster with MTP" (Multi-Token Prediction) update for Gemma 4, I decided to have a go at getting it running.

I wanted a local coding agent setup that:

was fast enough to actually use on my Mac

worked through an OpenAI-compatible API (so I could use it in other tools)

and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made.

And I did! The video is in real time and shows the agent responding at a perfectly usable speed.

After a bit of testing, the final setup I ended up with is:

llama.cpp built with Metal on macOS

Gemma 4 26B-A4B in GGUF format

A Q8 MTP draft model for speculative decoding

The Gemma 4 multimodal projector

Pi as the terminal coding agent

This was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7.

The main model is: gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf.

Link on Hugging Face: models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

That file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB.

The benchmark prompt was:

Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

Each benchmark generated about 128 tokens.

Baseline: llama.cpp + Metal

First I ran the main model directly through llama.cpp with Metal acceleration:

repos/llama.cpp/build/bin/llama-cli \
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
-ngl 999 \
-fa on \
-c 4096 \
-n 128

| Setup | Prompt tok/s | Generation tok/s |
|---|---|---|
| Gemma 4 26B-A4B Q4, llama.cpp Metal | 298.0 | 58.2 |

58 tokens/second is usable rather than fast. For coding-agent work you want generation to be as fast as possible, especially when the agent is making many tool calls.

Adding the MTP Draft Model

Gemma 4 now has the MTP draft model available:

MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

This can be loaded by llama.cpp as a speculative draft model:

repos/llama.cpp/build/bin/llama-cli \
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf \
--model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-ngl 999 \
-fa on \
-c 4096 \
-n 128

The first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth's guide on How to Run MTP Models includes this note:

"We found --spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system."

After sweeping --spec-draft-n-max, the best result was 72.2 tokens/second with 3 draft tokens.

| Setup | Prompt tok/s | Generation tok/s | Speedup |
|---|---|---|---|
| Main model only | 298.0 | 58.2 | 1.00x |
| Main model + Q8 MTP draft | 295.6 | 72.2 | 1.24x |

The useful part is that prompt processing stayed basically the same, while generation improved by about 24%.
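
To make the arithmetic behind the speedup column explicit, here is a minimal sketch that divides one tokens/second figure by another. The `speedup` helper is a name made up for this example; the numbers are the ones from the table above.

```shell
#!/bin/sh
# Ratio of a new rate to a baseline rate, both in tokens/second.
# Usage: speedup BASELINE NEW
# "speedup" is a hypothetical helper name, not part of llama.cpp.
speedup() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%.2fx\n", n / b }'
}

# Generation: main model only vs. main model + Q8 MTP draft.
speedup 58.2 72.2

# Prompt processing: same comparison, using the prompt tok/s column.
speedup 298.0 295.6
```

The first call gives the 1.24x generation speedup, and the second shows that prompt processing is within about 1% of the baseline.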

I tested --spec-draft-n-max values from 1 to 6, recording prompt tok/s and generation tok/s for each.

On my M1 Max machine, 3 was the fastest, with 2 close enough that either would be fine. Values above that got slower.
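
A sweep like this is easy to script. The sketch below only prints the six llama-cli commands (a dry run) so they can be checked before running. It reuses the paths and flags from the MTP command earlier in this post; `sweep_commands` and the two variables are names invented for this example.

```shell
#!/bin/sh
# Dry run: print one llama-cli command per --spec-draft-n-max value (1 to 6).
# Paths and flags are copied from the MTP command shown earlier in the post.
# LLAMA_CLI, MODEL_DIR and sweep_commands are hypothetical names.
LLAMA_CLI="repos/llama.cpp/build/bin/llama-cli"
MODEL_DIR="models/unsloth-gemma-4-26B-A4B-it-GGUF"

sweep_commands() {
    for n in 1 2 3 4 5 6; do
        echo "$LLAMA_CLI" \
            -m "$MODEL_DIR/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf" \
            --model-draft "$MODEL_DIR/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf" \
            --spec-type draft-mtp \
            --spec-draft-n-max "$n" \
            -ngl 999 -fa on -c 4096 -n 128
    done
}

sweep_commands
```

To actually run the sweep, pipe the output to `sh` or drop the `echo`, then compare the generation tok/s that llama.cpp reports for each value. The benchmark prompt is not passed here, matching the commands shown above, so supply it the same way you did for the single runs.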

I also tested MLX models through mlx-lm, to find out which is the faster way to run the model on a Mac: llama.cpp or MLX.

| Runtime | Model | Generation tok/s |
|---|---|---|
| llama.cpp Metal + MTP | Unsloth GGUF Q4 + Q8 MTP | 72.2 |
| llama.cpp Metal | Unsloth GGUF Q4 | 58.2 |
| MLX-LM | Unsloth UD MLX 4-bit | 45.8 |
| MLX-LM | mlx-community 4-bit | 43.9 |
| MLX-LM | mlx-community OptiQ 4-bit | 38.1 |

I thought MLX (being optimised for the Mac) would be fastest.

However, for this specific setup, llama.cpp was faster than MLX: 58.2 tok/s against 45.8 tok/s for the fastest MLX run, about 27% quicker. llama.cpp with MTP was clearly the best option at 72.2 tok/s, about 58% faster than that same MLX run.

I guess all the effort and tweaking that has gone into llama.cpp over time means it is quite well optimised for macOS despite being cross-platform.

I also tried Gemma 4 MTP through gemma-4-swift-mlx, but the 26B 4-bit MLX checkpoints I tested did not match the loader's expected weight keys. Since I already had the earlier MLX results, I moved on rather than download new models and tweak things to match.

Adding Image Support

For Pi, I also wanted to be able to attach screenshots. The local model entry I originally set up for it declared the model as text-only:
