Aurornis > The benchmark prompt was: > Write a compact Python
function that parses a unified diff and returns the
changed file paths. Then explain two edge cases. > Each
benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good
benchmark results. MTP speedup depends on how often the
predicted tokens are accepted. In my experience, the very
early output has a higher acceptance rate, so short tests
can give false-positive speedups.
llama.cpp includes a tool specifically for benchmarking
that will sweep the arguments for you so you don't have to
restart the server and send it prompts:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
EDIT: Also, the section about downloading the models should
have mentioned that llama.cpp has a "-hf" argument that
will download the models for you. I appreciate the author
sharing their experience, but for beginners this might not
be the best guide to use.
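To make the acceptance-rate point concrete, here is a rough sketch of how a short run can overstate the speedup. It uses the textbook speculative-decoding estimate (each drafted token accepted independently with probability alpha) and ignores the cost of drafting, so it gives an upper bound. The acceptance rates, the draft length, and the 128-token "early" window are made-up illustration values, not measurements from the article or from llama.cpp.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass when k tokens are drafted
    and each is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def blended_speedup(n_tokens: int, early_tokens: int,
                    alpha_early: float, alpha_late: float, k: int) -> float:
    """Ideal speedup (draft cost ignored) over a run whose first
    early_tokens tokens enjoy a higher acceptance rate."""
    early = min(n_tokens, early_tokens)
    late = n_tokens - early
    # Number of verification passes needed in each phase.
    passes = (early / expected_tokens_per_step(alpha_early, k)
              + late / expected_tokens_per_step(alpha_late, k))
    # Without speculation, every token costs one pass.
    return n_tokens / passes


if __name__ == "__main__":
    # Hypothetical numbers: 90% acceptance early, 60% afterwards, 3 drafted tokens.
    for n in (128, 2048):
        print(n, round(blended_speedup(n, 128, 0.9, 0.6, 3), 2))
```

Under these assumed numbers the 128-token run reports a noticeably larger speedup than the 2048-token run, purely because the short run never leaves the high-acceptance phase.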
|
> liuliu Realistically, you need to experiment with any user
prompt + a good amount of system prompt (at least
1000 tokens, though somewhere around 3000 tokens is
probably more realistic). llama.cpp includes tools for
that; what you want is a prefill before token
generation so that you measure it properly.
Increasingly, measuring token generation speed at
longer context (32k or 64k) is important too.
|
> willXare At 128 tokens, you're benchmarking the overture, not
the opera.
|
> reactordev This is akin to saying "it runs on my machine" without
actually examining the problem. Sad. You're absolutely
right that 128 tokens is nothing, it's a little more
than a hello response.
|
ig0r0 I wrote a similar post some time ago; I just used ollama
and opencode:
https://blog.kulman.sk/running-local-llm-coding-server/
|
> takethebus this is the way, given anyone could swap for oh my pi
/ pi / etc
|
> > mark_l_watson yes, whether for home experiments or at work, it
is good practice (good hygiene) to be able to swap
out both agentic harnesses and models. It is
important to have a good strategy for exporting
skills, etc.
|
> sleepybrett actually useful and the ollama gui could probably even
simplify this more.
|
c-hendricks Not sure you really need huggingface-cli to download
anything if you're just using llama.cpp. You can pass `-hf
...` and it will download the models for you. Set
`LLAMA_CACHE` to change where the downloads go:
LLAMA_CACHE="models" ./llama-server \
-hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
...
|
> dofm Yes. -hfd for the draft model.
|
> > c-hendricks Nice, was wondering if there was a flag for the
draft as well. Not knocking huggingface-cli, just
find it's much easier for people to try out this
stuff when they can just:
mise use --global github:ggml-org/llama.cpp
LLAMA_CACHE="models" llama-server \
-hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
--host 0.0.0.0 \
--port 11434 \
...
|
> > > dofm is also pretty useful if you're doing this
just to try agentic coding and you're not
processing images/voice. Stops it downloading
the multimodal projector.
|
vladgur I have used omlx.ai with great success both to download
multiple mlx models (including gemma and qwen) suited for
my hardware AND to automagically launch both open-source
and closed-source (claude code, codex) harnesses using
these models. All from a web or desktop UI.
You would not need to follow a blog post with omlx IMHO
> dofm FWIW I have not, on a 64GB M1 Max, seen any advantage
from oMLX specifically or MLX generally over GGUF with
llama.cpp.
The Gemma 4 MLX builds I have found so far have been
slower at the same quantisation and much slower with
MTP.
The built-in web UI for llama.cpp is really quite good
once you have chosen your model. Otherwise I quite
like LM Studio for tinkering.
One thing I would say is that both Gemma-4 and Qwen 3.6
simply do not need a large chunk of the typical
opencode system prompt. Better off without it.
|
> Dotnaught In case anyone is looking for a sandbox to go with
oMLX and Pi:
https://github.com/Dotnaught/pi-sandbox
|
> > dofm This is useful. I'm still tinkering with Multipass
VMs because I need the whole VM environment anyway
and I'm on Sequoia. But I'd be interested if you
did anything like that with Apple's container CLI
instead; sooner or later I will have to upgrade to
Tahoe because I want to play with the container
CLI (and apfel).
|
> > zmmmmm it looks handy but ... `sbx policy set-default open`
just so the single pi sandbox can talk to
localhost? ... this gives me some grave doubts
about the rest of it being set up well.
|
> fridder It truly is the SOTA for local inference on mac. Even
when there are regressions the dev(s) are insanely
responsive. It is the most impressive opensource
project I've seen in a while
|
> > benbojangles oMLX needs to incorporate macOS native Shortcuts:
macOS can almost instantly extract text from PDFs,
and do a bunch of other things, using its ANE
(Apple Neural Engine), keeping unified RAM for LLM
use. The two together would be awesome
|
jumploops I've been quite impressed with DeepSeek v4 Flash running
via antirez's ds4[0].
It feels like a GPT-4 class model in terms of "stored
knowledge" but is better at long-horizon tool calling than
any of the GPT-4 class models.
Running on a 128GB MBP M4 Max, I'm getting ~24 t/s on
generation and ~200 t/s on prefill. I was expecting it to
feel slow, and it certainly does when e.g. generating
code, but it's surprisingly useful as a "machine
orchestrator" for simple tasks.
For non-agentic use cases, it's a decent enough model to
converse with, and has the benefit of being entirely
self-contained/private.
[0] https://github.com/antirez/ds4
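As a back-of-the-envelope illustration of what ~24 t/s generation and ~200 t/s prefill mean in practice, the sketch below splits a request into prefill time and generation time. The two rates come from the comment above; the prompt and output sizes are hypothetical, and the model is deliberately naive: it assumes constant rates and no prompt caching, whereas real rates tend to drop as context grows.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float = 200.0, decode_tps: float = 24.0):
    """Return (prefill_s, decode_s, total_s) for one request,
    assuming constant prefill and generation rates."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s


if __name__ == "__main__":
    # Hypothetical request sizes: a short chat turn, an agent-style turn
    # with a long system prompt, and a long-context turn.
    for prompt, output in ((200, 480), (3000, 480), (32000, 480)):
        prefill_s, decode_s, total_s = request_seconds(prompt, output)
        print(f"prompt={prompt:>6} output={output}: "
              f"prefill {prefill_s:.1f}s + decode {decode_s:.1f}s = {total_s:.1f}s")
```

With a long prompt, the wait before the first token can be as large as or larger than the generation time, which is why a benchmark that only reports generation t/s on a tiny prompt says little about agentic use.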
|
dofm Useful stuff in here that I wish I'd seen a few days ago
:-)
I am not convinced that the MTP setup for the QAT model
adds very much in terms of speed on my M1 Max, but it is
definitely worth experimenting with.
Fiddling about with local models has done so much for my
conceptual understanding of what is going on.
FWIW and YMMV but I also found the Gemma 4 MTP head was
occasionally breaking markup in Opencode, causing the
thinking to display untidily and ultimately in some cases
missing the stop token. So I've stopped using MTP there
for now.
Recent Qwen 3.6 models have developer role support so it
will occasionally surprise you with a structured multiple
choice questionnaire.
|
> mft_ I found a marginal downside to Qwen3.6-35B-A3B-MTP vs.
the non-MTP equivalent on an M1 Max. I'll maybe
experiment with settings further though.
|
> > freehorse And the upsides of using draft models for MoE
models with such a low number of active parameters
(as here or as in the article) are quite small
compared to dense models, where you can get
enormous speedups. I would prefer running the
dense 27b models with speculative decoding instead.
|
> > > dofm That is what I have learned, yes. Not tested
the dense Qwen yet. IIRC the 31B Gemma was
slow enough that I doubt MTP will help me
much.
|
> > dofm Yeah. I think it might speed up time to first
token but I am not sure how much that matters.
I do enjoy their different personalities when they
are tackling "explain this" type puzzles, though.
Gemma writes so well - like a concise code blogger.
It makes you understand that the thing we hate
about AI slop writing is specifically the cheesy,
marketingese, sycophantic ChatGPT tone. It's a
choice to sound that way.
Qwen writes more tersely by default, like much
English-language documentation in Chinese open
source projects. A couple of lines, code example,
fact, code example, line of blurb.
I use this prompt every now and then with a new
model. It's obviously a classic SQL puzzle but
I've asked new web developers this in the past
(prompted by discovering that a client's
subcontractor didn't understand it and was
therefore unable to migrate some code from relying
on dodgy pre-MySQL 5.x behaviours):
- I have a MySQL 5 table like this: [id, label,
category, score]. It contains a list of items in
different categories (text names like cat1, cat2,
cat3) with a numerical score. Is there a way I can
write a SQL query to find the item in each
category that has the highest score, without using
a subquery? No two entries in any category share a
score. -
I enjoy seeing what it deduces from the subtext.
Without "thinking" mode on, they always initially
fail and you need to prompt them to find the
answer. With thinking mode, they both produce
really nice explanations.
For me, as an old freelancer who is pretty cynical
about vibe coding or "agentic engineering", what I
really want is an AI tool that can help me start
to solve problems and help me find the right
terminology or generate some boilerplate I can
tinker with. Both of these models do fine at the
kind of "starter" writing that I want when I am
trying to untangle an idea.
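For anyone who wants to try the puzzle, one well-known subquery-free answer is a self-join that keeps only rows with no higher-scoring row in the same category. The sketch below is an illustration, not necessarily the answer the comment has in mind: it runs the query in Python against an in-memory SQLite database as a stand-in for MySQL 5, and the table name `items` and the sample rows are invented for the example.

```python
import sqlite3

# Self-join: pair every row "a" with any row "b" in the same category that
# has a higher score. Rows with no such partner (b.id IS NULL) are the
# per-category maxima. No subquery is involved.
TOP_PER_CATEGORY = """
    SELECT a.category, a.label, a.score
    FROM items AS a
    LEFT JOIN items AS b
           ON b.category = a.category AND b.score > a.score
    WHERE b.id IS NULL
    ORDER BY a.category
"""


def top_item_per_category(rows):
    """rows: iterable of (id, label, category, score) tuples."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT, "
            "category TEXT, score INTEGER)"
        )
        conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)
        return conn.execute(TOP_PER_CATEGORY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        (1, "a", "cat1", 3), (2, "b", "cat1", 7),
        (3, "c", "cat2", 5),
        (4, "d", "cat3", 1), (5, "e", "cat3", 9),
    ]
    for row in top_item_per_category(sample):
        print(row)
```

The "no two entries share a score" condition in the prompt matters here: with ties, this join would return every tied row for a category.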
|
> mark_l_watson when I started using QAT recently, I stopped trying to
improve my configuration after that. I will try tuning
my local environment again in a few months, but with
QAT things are good enough for now.
|
reddit_clone >64 GBThats the rub.
I have an M4 with 48G. I wonder if it is worth testing
this out.My past attempts (with Ollama and various LLMs)
were too slow to use.
|
> hkchad I have a M5 MAX with 128, local models are toys
compared to hosted ones. I've spent a lot of time and
money trying to make it work even 1/2 as well.
|
> > dofm It all depends on what you want to do, I guess.
If you're seeking the kind of hands-off claude
experience, obviously not. They are slow.
If you want to learn how these things work, train
them locally, tinker, play with the code, grasp the
fundamentals, or just out of sheer
bloody-mindedness and principle refuse to tether
the functioning of your application to a cloud
API...
|
> dofm Some of these models will be a bit of a squeeze at
Q4_0 I suspect; almost certainly they will be using
CPU. Probably the 31B Gemma will be too much. Maybe
not the Gemma-4 26B QAT.
But if you just want to play around rather than code,
you really might find the Gemma 4 12B model worth
mucking about with just so you've gone through the
steps. Especially if you want to muck about with image
analysis or audio transcription.
If you're writing PHP I think you could even find it
good enough. I've been modestly surprised. You can do
that basic fiddling with the Edge AI Gallery app,
which can enable thinking and has a customisable
system prompt and some agent support.
You could also try the 14B Deepseek R1.
Honestly even if it is not good enough, if you are
anything like me, I think you'll find that going
through this process is really quite educational - it
has made a lot of things more concrete for me in a way
that I have found reassuring and valuable.
|
> codazoda I'm running an M3 on an Air with just 16GB. I can
still get useful results without an internet
connection in "chat mode". It's a different experience
than using Claude, for sure, but it's workable. I
typically use the Qwen variants these days.
|
> > mark_l_watson This might be useful when 'coding in chat mode': I
have a few scripts that I run in a project
directory that take a prompt from me and create
a single long one-shot prompt that I can paste
into a chat window, asking that any generated
code be inside markdown code blocks for easier
copy/pasting. Also, pardon the plug, but you can
read my new tiny book free online that documents
my experiences using agentic coding on my 16G Mac
and my 32G Mac:
https://leanpub.com/read/local-coding-agents
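A minimal sketch of what such a one-shot prompt builder could look like follows. This is a guess at the general idea, not the commenter's actual scripts: the function names, the file-suffix filter, and the wording of the instruction line are all hypothetical.

```python
from pathlib import Path

FENCE = "`" * 3  # markdown code fence, built here to keep this example readable


def read_project(project_dir, suffixes=(".py", ".md")):
    """Collect {relative_path: text} for files with the given suffixes."""
    root = Path(project_dir)
    return {
        p.relative_to(root).as_posix(): p.read_text(encoding="utf-8", errors="replace")
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix in suffixes
    }


def build_one_shot_prompt(task, files):
    """Join the task, a formatting instruction, and every file into one prompt."""
    parts = [
        task.strip(),
        "",
        "Put any generated code inside markdown code blocks.",
        "",
    ]
    for name, text in files.items():
        # Each file is labelled and wrapped in its own fenced block.
        parts += [f"File: {name}", FENCE, text.rstrip(), FENCE, ""]
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_one_shot_prompt("Add input validation to main.py",
                                   read_project("."))
    print(prompt)
```

On a small-memory machine it is worth keeping the suffix filter tight, since every file included adds to the prefill the model has to process.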
|
> > > codazoda Looks cool, I'll check out the book. Your
download links (PDF and EPUB) are down for
me. > NoSuchKey: The specified key does not
exist...
|
> contingencies M4 24GB here. You'll be fine, if you're anything like
me minor latency is acceptable to obtain (a) privacy
(b) reliability (c) CI/CD/guardrails (d) network
independence (e) future-proofing vs. AIaaS.
https://omlx.ai/ gives you intelligent local hardware
based model download recommendations. That said it
probably depends heavily on your workload, process and
polish expectations. See also
https://news.ycombinator.com/item?id=48089091
|
> > spike021 what are you using on yours? I've got a M4 Pro
24GB also. tried the open source gpt one. it's
alright but I found it can get stuck at times.
maybe just my config in LM Studio.
|
mark_l_watson Nice writeup, thanks.
I run something very similar, except that instead of
directly using pi as the agentic harness I use
little-coder, which wraps pi with reasonable defaults for
running local models. Even though my local setup is a bit
slow, it is a thrill to do real work completely locally.
|
jmkni FYI you can open Claude code in the terminal, point it at
this article and just tell it to "do it", if you're
feeling extra lazy
|
> echelon This is the way.
I'm not Googling much of anything anymore. 9/10 times
the information is awful, it's hard to parse out of
whatever other spam it's surrounded by. Meanwhile,
Claude will just do the thing one-shot or with a tiny
bit of refinement.
The gateway to knowledge and getting stuff done is the
LLM.
Google Search is a dinosaur.
It feels like we're living a century into the future.
Not even smartphones were this cool.
|
> > kingofthehill98 Yeah, if the future is "Claude, think for me" I'm
happy to stay at the good old present.
|
> > > echelon https://en.wikipedia.org/wiki/Is_Google_Making_Us_Stupid%3F
https://newsletter.pessimistsarchive.org/p/when-educators-mo...
New decade, same old argument.
It's not > "Claude, think for me"
It's > "Claude, be my subordinate and get this
done for me"
Instead of complaining on the sidelines, I'm
getting a shit ton of work done.
|
> > > > ultrarunner For what it's worth, even this reply reads
like LLM output. It's not "quote
describing the scenario", it's "some other
linked-in-coded plot twist". If you're the
average of the people you spend the most
time around, and you spend the most time
around a chatbot, do you start to absorb
its speech patterns and logic structures?
Yeah, good ol' present for me too then,
thanks.
|
> > > > wwweston As one famous agent said: "I say your
civilization because as soon as we started
thinking for you it really became our
civilization which is of course what this
is all about."
An argument can be as old as the search
engine and hold real value. There are ways
in which unreflective search engine use
has misled and mistrained people.
There's always been an argument to be had
about how we manage and offload attention,
what we gain and what we lose when
resistance is reduced. It's part of the
reflection that's been necessary in order
to make progress on solid ground, and it
is more necessary with non-deterministic
tech.
The phrase "tactical tornadoes" may be
older than web search and describes people
who also got a lot done.
Models can be incredibly helpful boosters
and situationally effective
subordinates... and also patchy as a real
engineering IC or org.
|
> > > > sdevonoes > I'm getting a shit ton of work done.
It's weird when people are proud of doing
a ton of work. I'm the opposite: I'm proud
that I'm doing minimal stuff without LLMs.
|
> > > > this_user > Instead of complaining on the sidelines,
I'm getting a shit ton of work done.
Nah, you are just producing a bunch of
slop and hoping that nobody notices.
|
> > tobyhinloopen Claude "respond in a friendly way that I agree
with this comment"
|
anigbrowl This video is realtime. And shows the agent responding at
a perfectly usable speed.
Alas, this video appears not to have been linked to the
text that describes it. Perhaps I should ask an AI to
generate an artistic rendering of the author's
description.
|
smetannik I wonder why something like LM Studio didn't work for the
author?
|
> b3ing That's what I was wondering, lm studio and draw things
are easy to use apps that handle much of the cruft for
you
|
reenorap My biggest pet peeve with all these articles on local AI
is the only thing they talk about is tokens per second. No
one mentions the quality of the answers. No one. I don't
mind waiting a little longer if the quality is better.
Quickly serving me slop doesn't make it more useful. Are
people really only looking at tokens per second?
|
> frollogaston The model already has its own quality benchmarks
elsewhere. The article is just about running the model
on X hardware, so the remaining question is then how
fast it is. Or does the output quality somehow depend
on the hardware too?
|
> ozim A local model as such will give you "autocomplete on
steroids", but it is not going to run away and
implement a cross-project feature like a frontier
model in, let's say, Cursor.
So there is no value in testing quality of answers,
but there is value in testing token speed.
You just have to have correct expectations.
|
> akman That's fair. There are even many dimensions to define
'quality' which include use case (coding? writing?
multimedia?) and prompt. I suppose if you ask testers
to provide benchmarks with their analysis, that might
hamper their desire to share.
|
hanifbbz Here's a visual post for using LM Studio and VS Code (and
Pi):
https://blog.alexewerlof.com/p/local-llms-for-agentic-coding
One way or another local AI is the future. I actually
find weaker models more interesting because it keeps me
sharp (at the cost of velocity of course).
|
bicepjai I assumed lmstudio is the obvious choice after ollama. Is
there a reason lmstudio is not used widely?
|
> dofm LM Studio is fine. Gorgeous actually. I've found it
really helpful for understanding parameters, settings,
general figuring out.
But there is an incentive not to use it if you want to
write an article that uses only open-source tools,
because it isn't.
|
> stingraycharles Yeah I've also been using it on macOS, my experience
is that it works better with the metal API and has
better performance.
|
everlier You can also install Harbor and then it's:
harbor up omlx opencode
|
cdolan Is there a link to the video? It did not render when I
went to the page. Curious about the real-time feel of this
|
> dewey That's the direct link:
https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...
|
> > c-hendricks Note this is cut to just before the model
responds, so not a great way for people to judge
the real-time feel of this.
|
rectang Does anybody run a local agent on a Mac using an outboard
GPU?
|
> benbojangles I run a second Mac for local llm use and access it
remotely using ssh from the first mac
|
attogram 8b max on a std 16gb macbook. Anything more and your mac
is toast
|
> benbojangles 70b on my M1 max 64gb
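For a rough sense of why 8B and 70B land where they do, the arithmetic can be sketched as below. Neither comment states a quantisation, so this assumes something like llama.cpp's Q4_0, which works out to about 4.5 bits per weight (32 four-bit weights plus a 16-bit scale per block). It counts weights only: KV cache, activations, and whatever the OS and other apps need come on top.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the model weights in decimal GB.
    Weights only; KV cache and runtime overhead are not included."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Model sizes mentioned in this thread; 4.5 bits/weight is an assumption.
    for size in (8, 26, 31, 70):
        print(f"{size}B at ~4.5 bits/weight: about {weight_gb(size):.1f} GB of weights")
```

By this estimate an 8B model needs roughly 4.5 GB for weights and a 70B model roughly 39 GB, before any context is loaded.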
|
metadaemon Has anyone compared a setup like this to just using LM
Studio?
|
> CharlesW Yes, I can confirm that LM Studio works great for
this.
|
namnnumbr oMLX (https://github.com/jundot/omlx) makes running the
mlx inference server quite easy for those interested in
UI-based hosting. oMLX also supports mtp or dflash
drafting.
|
> w10-1 Agreed (not sure what you mean by UI-based
hosting).
oMLX does the caching I need to fit models that are
near gross memory, and it handles most of the work in
finding usable models. After cobbling together
various solutions over months, I now just use oMLX,
often from Xcode. I can tell the difference between
Gemma-4 (local/free) and Claude (paid) only on the
largest tasks.
|
LoganDark I poured a couple days into custom Burn inference for
Qwen3-Coder-Next only to find it doesn't come with a
speculative decoder, so on my M4 Max I can't push it much
further than 120t/s. That's still kinda slow, though still
faster than llama.cpp's 70.9t/s and MLX's 80.6t/s with the
same model. Claude Fable 5 is recommending I use the Qwen3
MTP -- I worry that will compromise the quality somewhat,
but might give it a try to see if I can get more usable
speeds.
|
sleepybrett or you can just load up ollama, have it load a local model
and point claude or opencode at it...is this article old?
It's not. I'm not sure why he went through all the bother
of llama.cpp
|
> malkosta That was exactly my question. Then I finished
reading the post. The reason is pretty clear, and
written in the post: it is faster than ollama+mlx.
|