/architect: Reduce Fable tokens by 80%, Fable orchestrates/reviews, Codex builds

by DanMcInerney | 104 points | 41 comments | 2026-06-12 15:33:22 Central

Comments

Denvercoder9
DESIGN.md:
> Each rule below is enforced mechanically by the skill,
not left to vibes.
> R1. Repo docs are the memory; not in HANDOFF.md = didn't
happen

SKILL.md:
> Not in docs/HANDOFF.md = didn't happen. Refuse to judge
results that exist only in conversation or builder chat
output.

"Mechanical enforcement" just means "prompting the LLM a
bit extra" these days? It (still) amazes me how much
effort and tokens we expend on what could and should be a
two line script...
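
For illustration, the kind of small script the comment has in mind might look like the sketch below. It is not taken from the project: the function name, the idea of a `result_id` string, and the rule that an entry "exists" if the ID appears verbatim in the file are all assumptions.

```python
from pathlib import Path


def recorded_in_handoff(result_id: str, handoff: Path) -> bool:
    """Return True only if result_id appears verbatim in the handoff file.

    Assumption: each result is logged under some stable ID string.
    A missing file counts as "didn't happen".
    """
    return handoff.is_file() and result_id in handoff.read_text(encoding="utf-8")
```

A harness could call this before handing results to the reviewer model and exit non-zero when it returns False, so the rule is checked by code rather than by a prompt.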

  > everforward
Agents are in a wacky state, which makes projects like
this fall into a weird spot. E.g. I vaguely expect my
agent to do two disparate things: manage dependency
injection for tools, prompt modifications, etc, but
also be the sort of "brain trust" that controls the
flow of execution (can we stop now, do we keep going,
etc).

This project is meant to be the latter, but there's not
a clean way to integrate that into Claude Code or Codex
because they expect to do both.

Pi can do it, but then your users can't use their Claude
subscriptions, so you have to kludgily try to do the
same thing via LLM prompts.

    > > nostrebored
But why does your agent control doneness? It seems
to me the most odd part to delegate. All LLMs are
terrible at it. Most LLM tasks can be expressed as
a DAG or DAG of DAGs. Why delegate that to a
random point in context instead of enforcing the
flow?

      > > > everforward
Most LLM tasks can be expressed as a DAG, but
the odds of it succeeding go way, way up if
you drop the acyclic requirement (e.g. a "run
tests, if they fail, fix it and loop back to
running the tests" stage).

And it gets delegated to context because it's
easier to have another session and tell it to
double check and critique the first LLM than it
is to write a deterministic test for every prompt.
Like if I want a new form that sends a REST
request on submit, I can have two LLMs duking
it out in 5 minutes. If I have to write
Selenium tests then I might as well just write
the feature. Or I can have an LLM write the
tests, but that's more or less the same as
letting a second LLM judge the first.
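
The "run tests, fix, loop back" stage discussed in this subthread can be sketched as a bounded loop in which the harness, not the model, decides doneness. This is only a sketch under assumptions: `run_tests` and `attempt_fix` are hypothetical callables standing in for a real test runner and a builder-agent call, and the cap of three rounds is arbitrary.

```python
from typing import Callable


def run_until_green(
    run_tests: Callable[[], bool],
    attempt_fix: Callable[[], None],
    max_rounds: int = 3,
) -> bool:
    """Loop 'run tests -> fix -> run tests' with a hard cap on fix rounds."""
    for _ in range(max_rounds):
        if run_tests():
            return True   # doneness is decided here, not in model context
        attempt_fix()     # hypothetical hook: hand the failure to a builder agent
    return run_tests()    # final check after the last fix attempt
```

The cycle is explicit and capped, so the flow stays enforced even though the graph is no longer acyclic.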

mpalmer
Reduce Fable tokens by 80%, simply by not using it!

> I am fairly convinced this is the shape serious agent
work keeps converging toward.

"this" being "plan with expensive model, implement with
cheap model".

Anyone who follows HN would be hard-pressed to disagree;
this architecture is re-invented twice monthly.

https://www.facebook.com/groups/vibecodinglife/posts/1946207...
https://github.com/openai/codex/discussions/10628
https://build5nines.com/stop-burning-premium-requests-how-to...

> Not because it is aesthetically pleasing. Because
every other shape eventually runs into the same boring
failures: context rot, self-grading, goalpost drift, and
merge chaos.

Actual failure isn't boring. But struggling through a
generated software project that celebrates its own genius
and doesn't have a single self-critical or genuinely
reflective thing to say...at least watching paint dry I
might get giddy off the fumes.

I'm not interested in critiquing the project itself,
either, you'll just run that through a model, too.

  > seaal
> https://www.facebook.com/groups/vibecodinglife/posts/1946207...

wow linking a facebook groups post might actually be
worse than x, is there an xcancel alternative for
facebook?

  > DanMcInerney
I don't disagree with any of this. It is generated
software, and it's not a novel idea. I didn't mean for
it to come off like that. It's just solving an itch
that I couldn't find a solution to and I'm getting a
lot of personal utility out of it. I do have a lot of
experience with agentic memory, multi-agent systems,
and harnesses, and wasn't super impressed by the
workflow of Fable calling Opus subagents, so I figured
I'd apply best practices to what already exists to
make it a teensy bit better and easier to use.

    > > mpalmer
Cheers. Absent explanation, I do think it's
reasonable to assume that you stood by the
wording/claims of the README when you posted it,
but I appreciate the patch you made to the docs.

FWIW, re: best practices, your install script
potentially runs `rm -rf` on the user's global
skills whose names shadow your project's.
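
As an illustration of the concern about deleting shadowed skills, here is a sketch of a non-destructive install step. It is not the project's installer: `install_skill`, the directory layout, and the `.bak-<timestamp>` naming are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def install_skill(src: Path, skills_root: Path) -> Path:
    """Copy src into skills_root, backing up any same-named skill first."""
    dest = skills_root / src.name
    if dest.exists():
        # Move the existing skill aside instead of deleting it.
        backup = dest.with_name(f"{dest.name}.bak-{time.time_ns()}")
        shutil.move(str(dest), str(backup))
    shutil.copytree(src, dest)
    return dest
```

An existing skill with the same name ends up next to the new one under a backup name, so nothing the user already had is lost.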

Retr0id
> freezes the gates

LLM-written readmes love to use inscrutable jargon that
means nothing outside of the context window that birthed
it.

  > nostrebored
LLMs are obsessed with "gates". Freezing the gates
here is intuitive to me at this point - don't let
validation drift.

    > > Retr0id
"drift" is another one!

Teknomadix
US Govt reduces Fable Tokens by 100%.

rockwotj
I actually just started doing this by having Fable
roleplay as Jeff Dean and use Codex as Sanjay driving the
implementation, with the two going back and forth. Works
really well and it's cool to see AI pair program.
felixgallo
Fable will do this itself, by spawning Opus/Sonnet
subagents to do easy work.

  > RazerWazer
GPT 5.5 xhigh is better than Opus and Sonnet.

    > > timcobb
Not in my subjective experience sadly

    > > sosodev
I don't know why you're getting downvoted. It's
true. Averaged across a wide variety of benchmarks
Fable is the only Anthropic model that performs
better than GPT 5.5 xhigh.

      > > > Eridrus
The problem is that there are a bunch of
benchmarks, the model providers often don't
even use the same benchmarks, a bunch of them
have known problems, and it's expensive to do
your own benchmarks.

I am a GPT 5.x booster since to me it just
feels smarter, and I generally felt like the
benchmarks backed me up, but it's not every
benchmark, so sadly we're mostly arguing about
vibes.

SWEBench-Pro was a big one, though apparently
Claude was reading solutions out of the .git
folder it wasn't meant to have access to among
other problems.

        > > > > smoe
I find it fascinating that every time this
kind of discussion comes up, people talk
about night and day experiences between
Claude and Codex, in both directions. I'm
really wondering what people are doing to
get such different outcomes.

I'm currently working on two
projects/clients, one using Claude, one
using Codex. I have a strong preference for
the latter, but not because I think it is
much more intelligent or writes much better
code. It is simply because I find the way of
interacting with it more pleasant: more
literal, mechanical, makes fewer assumptions
and/or double-checks, and is less proactive
in my experience. At least until some
updates over the last few weeks.

          > > > > > Eridrus
I think I like Codex for the same
reason tbh. I think it's just general
misanthropy or autism or something
lol. Most people seem to prefer
Claude.

For me, I think Codex was visibly
smarter than Claude until 4.8 came
out, it would regularly do better
debugging and IMO write better code.
4.8 I think is close.

I think Claude is widely regarded to
have a big lead in front-end, which I
do not work on.

Claude's Ultrathink is pretty cool,
though it eats up tokens like nothing
else obviously.

          > > > > > AlphaSite
It probably means they're close enough
that there's no observable difference.
Or each is better at different things.

  > apsurd
/advisor has been a really good experience for me,
especially with having only a Pro plan.

I exclusively use sonnet and advisor is basically "hey
opus chime in on my approach". been working great as
far as i can tell.

phpp
@DanMcInerney Thank you for sharing this! Using a larger
model for planning and a cheaper, smaller model for
execution is a smart way to save tokens and seems like the
way to go in general.

I wanted to see what would happen if Claude delegated work
to pi with a model like Deepseek, so I forked your repo
and tried it out. It's working really well so far.
https://github.com/pcomans/architect-loop-pi

corvad
Who's gonna tell them...

cohix
I do exactly this with awman workflows:
https://github.com/prettysmartdev/awman/blob/main/docs/05-wo...
You can use any agent and/or model for each step and
share context between them.

diavelguru
Yes, I'm using Fable to inspect the code and generate
plan and architectural docs, then using Gemini to
implement, then having Fable review and find bugs. Saving
lots of usage.

colechristensen
Last night I switched back to Codex for a minute having
burned through my tokens for the week with Fable and oh
boy I had a terrible experience. Running in circles over
simple problems (which I ended up solving myself, like a
peasant) and running "terraform apply" several times
despite several instructions all over the place to never
do that. The performance difference was stark.

  > malshe
I had a similar experience. So far Fable has been a
game changer, at least for the work I used it for.
Having said that, I think its writing is definitely
worse than GPT 5.5. Ethan Mollick also observed the
same. He called it more "Claudy." It generates worse
academic prose than other frontier models.

    > > colechristensen
I think the Claude Code harness made up a
significant part of the improvements co-released
with Fable; the nested agent capabilities seem to
be much better even with Opus (which I guess we're
stuck with for a while).

  > nsingh2
Could you provide some details, if possible, like what
model & thinking effort, what kinds of tasks? I used
to swap between Claude Code and Codex often, and these
days use Codex more because of the usage limits.
Wondering if I should go to Claude for a month, I get
a strange FOMO when I read vague comments like this.

The one major difference I noticed is that the GPT
models are more analytical (e.g. better at
mathematical analysis, code review) vs Claude models
tend to write more straightforward code. Besides that
I don't really see any significant differences.

There are a few gotchas with swapping, like being
careful with AGENTS.md/CLAUDE.md naming (Claude Code
only recognizes CLAUDE.md, and I think Codex only works
with AGENTS.md), and updating skill files to match the
tool.

    > > colechristensen
I just symlink AGENTS.md and CLAUDE.md.

I was using gpt-5.5 high. Writing terraform code
for GCP, debugging app launch and Dockerfile
issues, that sort of thing. It was going in loops
hallucinating features of GCP, looking things up
in strange ways, running terraform apply after
being explicitly told in the last interaction not
to, and overall not solving problems. These were
very straightforward tasks and it couldn't be
trusted for five minutes. It's the difference in
what I would trust an early senior engineer to do
vs what I would trust an unreliable high school
intern to do.
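
The symlink approach mentioned at the top of this comment can be sketched as follows. The comment does not say which file is the real one, so treating AGENTS.md as the source and CLAUDE.md as the link is an assumption, as is the helper name `link_agent_docs`. The sketch also assumes a POSIX system where creating symlinks needs no special privileges.

```python
import os
from pathlib import Path


def link_agent_docs(repo: Path, source: str = "AGENTS.md", alias: str = "CLAUDE.md") -> Path:
    """Make alias a relative symlink to source inside repo."""
    link = repo / alias
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists; not overwriting")
    # Relative target, so the link keeps working if the repo is moved.
    os.symlink(source, link)
    return link
```

Both tools then read the same instructions, and edits only have to be made in one file.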

hmokiguess
I guess that didn't age well

aetherspawn
Fool me once. Fool me twice. Fool me thirty three times
and here we are trying lucky number 34.

DanMcInerney
ANNNNNND it's gone. Guys, I found a way to reduce Fable
token usage 100%. You can find it here:
github.com/USGov/idiotic-overreach.

avaer
Reducing token usage is this year's "one weird trick". It
doesn't make sense on the face of it.

Even if one discovered something that millions
(billions?) of dollars of AI compute and the best
statisticians in the world were not able to find via
exhaustive research, domain search and training... what
do you think are the chances this won't be folded into
the next update of every model, making the rigmarole moot?

Extraordinary claims require extraordinary evidence and
technology-shattering innovations in AI are not known to
come from a markdown file.

  > apsurd
incentives aren't aligned

analogpixel
I know how to reduce Fable tokens by 100%:
https://www.anthropic.com/news/fable-mythos-access

  > testfrequency
I ran this and seem to have good results with a 100%
reduction also:
curl -fsSL https://chatgpt.com/codex/install.sh | sh

Uptrenda
Reduce fable token usage even more by not using it. What a
clever idea, op! Wow.