Tiberium It seems to really be a nice step-up and is getting quite
close to the frontier. I wish they'd start focusing on the
reasoning efficiency now, though. I have a simple
(relatively) test task to evaluate LLMs: writing a simple
math evaluator library in Nim (it's about 400-600 lines
total max), and GLM 5.2 (xhigh which maps to max effort)
spent over 15 minutes (!) reasoning, spending about 45k
tokens, before it finally wrote the first file. I know it's
hard to improve on that, but now that their models are
good enough at raw intelligence, I think this should
become a higher priority task. Currently on
https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh
spends 16k tokens total on average, GPT 5.5 high is 10k,
Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is
extremely reasoning efficient. Of course if you convert
those values to actual request cost, GLM 5.2 will probably
beat GPT 5.5/Opus 4.8, but speed matters for a lot of
people, I think.
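To make the token-to-cost conversion mentioned above concrete, here is a minimal sketch. Only the average output-token counts come from the figures quoted in this comment; the per-million-token prices are made-up placeholders (not real price lists), and the function names are hypothetical.

```python
# Average output tokens per task, as quoted in the comment above.
AVG_OUTPUT_TOKENS = {
    "GPT 5.5 xhigh": 16_000,
    "GPT 5.5 high": 10_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}

# HYPOTHETICAL prices in USD per million output tokens.
# These are placeholders for illustration only; substitute real prices.
PLACEHOLDER_PRICE_PER_MILLION_USD = {
    "GPT 5.5 xhigh": 30.0,
    "GLM 5.2": 4.0,
}


def output_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` output tokens at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


def cost_table(tokens_by_model: dict, price_by_model: dict) -> dict:
    """Per-task output cost for every model that has a known price."""
    return {
        model: output_cost_usd(tokens, price_by_model[model])
        for model, tokens in tokens_by_model.items()
        if model in price_by_model
    }


if __name__ == "__main__":
    for model, cost in cost_table(
        AVG_OUTPUT_TOKENS, PLACEHOLDER_PRICE_PER_MILLION_USD
    ).items():
        print(f"{model}: ${cost:.3f} per task (placeholder price)")
```

With real prices plugged in, the same arithmetic shows whether a model that spends more tokens still ends up cheaper per request; it says nothing about wall-clock speed.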
|
> benjiro29 GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The
thinking chain is so similar, and so is the amount of
token usage on the output. If you want reasonable token
usage, you need to run GLM 5.2 at High. There is
little drop in quality from Max to High (for most
tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2
Max is really something you only need for complex
tasks. In essence, GLM 5.2 is Opus 4.8's little
brother, at a way, WAY cheaper price. There has been
really no training on Opus models going on, really,
none I tell you! /sarcasm
|
> > matheusmoreira > GLM 5.2 Max = Opus 4.8 Max in thinking
behavior
This is insane! I can't wait until
technology progresses to the point we can run
these things on consumer hardware!
|
> > > chartpath Are there any indications that this will be
possible? Consumer hardware will continue
getting better but I can't see 512GB RAM in a
MacBook Pro any time soon. I'm hoping linear
attention techniques plus MoE will make
breakthroughs in size/compression and
throughput.
|
> > > > nijave Well, we're probably not going to be
running frontier models anytime soon, but
I think the general assumption is smaller
models will continue to improve until
they're sufficiently good that frontier models
aren't needed. There's potentially also
augmentation through tools, harnesses and
RAG to help boost how well they work
without tons of parameters.
|
> > > > carter2099 > but I can't see 512GB RAM in a MacBook
Pro any time soon
Could totally see this
being a comment from a forum in like 1994,
but swap out GB for MB and MacBook Pro for
whatever the popular consumer PC was at
the time.
|
> > > > > r-w Yeah but the price of RAM wasn't
increasing at that point.
|
> > > > deadbabe There will be a 1024GB unified memory
MacBook Pro.
|
> > > > matheusmoreira Certainly not any time soon, but I have
faith it'll happen one day.
|
> > > > majormajor In the last ten years laptop memory
footprints have, what, doubled at the low
end? Smallest MacBook Pro in 2016 was 8GB,
smallest is 16GB today? Max I think has
gone up 8x meanwhile, 16 to 128? I wonder
if there's a bit of a chicken-and-egg
issue where there wasn't much that
demanded 10x the RAM, so there wasn't much
pressure to develop more or increase
production to support it at consumer
prices. There's wayyyyyyy more demand for
memory generally now, so assuming it's not
a demand bubble that pops rapidly, I'd
expect the new normal to end up at a much
higher baseline. 512GB would be 4x greater
than today's max, so even with the
relatively slow last 10 years development
pace, give it five years max?
|
> > > > > regularfry The problem is that the situation in
the RAM market might just... not go
away. It's locked in for the next
couple of years unless the AI market
goes pop. Which it might! But if it
doesn't, there's no particular reason
to think that the incentives for
cornering the market, as OpenAI has,
would go away. We might see that new
normal in five years or so. We will
see a new normal sooner than that if
there's a run on AI because of the
sudden availability of DDR fab
capacity, but also we'll probably see
the level of local models freeze at
whatever state they've got to at that
point. But an equally likely outcome
is that any new DDR capacity that
comes online is just immediately
absorbed by frontier AI, and consumer
devices stay at "just good enough" for
a decade.
|
> > > > > mikestorrent The new Macbook Neo is 8GB. I think
that if we are lucky, the huge RAM
demand right now means new factory
buildouts which eventually means more
supply and prices go back down, and
capacity begins to go up. This level
of demand was just not anticipated by
anyone.
|
> > > muyuu You need 8 x 96GB Blackwell or equivalent, so
around US$150k, which is
Small/Medium-Enterprise territory already, but
who knows when it will hit "reasonable" home
consumer territory. I think there's hope future
generations of unified memory machines may get
this sort of memory availability when new fabs
open in the next couple of years and then
ramp up production for a few years afterwards.
That makes ~2030s credible at this point,
but nobody can really predict the market that
far ahead
|
> > > > matheusmoreira > I think there's hope future generations
of unified memory machines may get this
sort of memory availability
I hope you're
right. This is a very exciting idea. The
weights are out there. The demand is
astronomical. The manufacturers just need
to make it happen.
|
> > > > sterlind there are cheaper ways to do it. not like,
consumer-cheap, but I'm setting up a rig
for 80% cheaper than that. I'm a tad
worried about triggering a run on the
particular hardware I'm buying though so
I'll leave it vague here, but hit me up on
Discord if you're curious.
|
> > > > > sankalpmukim Hey, very intrigued about how it can
be done for cheaper. Sent a friend
request to sterlind on Discord,
interested if you do a write up
|
> > > > > muyuu But at what kind of speed? We're
aiming at some speed that would negate
the point of even using an off-site
provider.
|
> > > harshit119 This is quite evident for personal AI, but for
general intelligence, given current scaling laws
and how models keep getting better with a larger
number of parameters, the path certainly does
not converge.
Personal AI today is more deprived of context
than of token quality. Having an on-system
knowledge base paired with Gemma works well to
a large extent.
|
> > FooBarWidget With such ridiculously long thinking traces I'm
surprised max outperforms high. After all,
performance falls off a hill after a certain
amount of context, and long thinking traces can
fill that up really quickly.
|
> > maxdo looking at the score this is rather a gemini 3.5
flash competitor, yes, for cheaper, but distance
to opus and fable is as big as their price diff.
|
> > vitalyan123 distillation of thinking models is not
particularly effective - both "Open"AI and
Misanthropic don't show you the real chain of
thought, only its severely downscaled version.
both do everything in their power to combat such
outrageous copyright infringement, so the bulk of
unethically scraped data the Chinese have is from
several generations ago.
|
> > > nyrikki It is quite likely that the intermediate
tokens don't have 'semantic import' [0]. There
are methods like Habitual Reasoning
Distillation or Inverted Reasoning Traces [1]
that can help. While there are reasons to hide
the intermediate tokens from an IP protection
standpoint, there is also a need to hide more
effective and efficient generation that
doesn't fit the R1 claims of an aha moment,
which has been debunked but is a consumer
expectation. While hidden intermediate tokens
do increase the difficulty, that is not a
barrier in itself, especially as they are
billed, which gives information about their
length.
[0] https://arxiv.org/abs/2504.09762v4
[1] https://arxiv.org/abs/2603.07267
|
> > > kmeisthax Chinese distillation attacks are about as
unethical as Robin Hood stealing from the rich
to give to the poor. The real unethical
scraping was done by Anthropic to train
Claude. To be clear, if Anthropic was using
totally licensed data, I'd be sympathetic to
these claims. But if you're going to pirate
the world's creativity you'd better be willing
to gimme dat shit for free [0].
[0] As said by Hungry Santa.
|
> > > duskdozer >such outrageous copyright infringement
Sarcasm, considering the source of
their own training data?
|
> > > > margalabargala Considering they called the company
"Misanthropic", sarcasm is a safe bet.
|
> > > > > duskdozer Somehow, I completely overlooked that.
|
> > > > orphea Narrator: it was sarcasm, indeed.
|
> > > > baron3dl IP for me, not thee.
|
> > > Bolwin For Claude models at least, you can tell them to
just manually think in the output and it works
fine. I do it regularly because for creative
writing and summarization, they seem to
believe they don't need to think at all, and
get way worse results.
|
> > > > carterschonwald this helps so much. i do it too. with some
of the newer frontier models it's unclear
if you can even turn it off in the first
party chat apps. haven't compared api
semantics yet.
|
> > > overfeed FYI: model outputs are not protected by
copyright.
|
> > > mannanj The companies that did copyright infringement
and unethically scraped data think that
copyright infringement and unethically
scraping data is wrong and needs to be
stopped. Though only in particular situations,
like when it's done to them and not when they
do it. Cause they have the power and are
morally right and know better than you. And if
you question this at all, well you're a threat
to American values and a supporter of the
Chinese and leading to the break down of
Democracy. This isn't a type of reasoning
argument or manipulation tactic used by the
rich throughout history to trick the naive and
gullible masses or anything like that. Trust
me, I'm rich and I'm morally right. /sarcasm
|
> > > > brookst It's been amazing to see the arc of tech
people going from "evil Disney, copyright
is an abomination, information wants to be
free" to "OMG copyright is inviolable and
AI is taking money out of Plato's
descendants' pockets!"
|
> > > > > solid_fuel > taking money out of Plato's
descendants' pockets
Yeah, remind me -
is it Plato's descendants that people
are concerned about here, or is it
every single author who had any work
in Anna's Archive, any work published
online, any work published on github,
etc? I think that people are probably
upset about the harm to living people
who had their work stolen by Meta and
other LLM companies - regardless of
license, terms of use, or any other
attempted protection.
|
> > > > > > brookst Sure, that's the motte / bailey.
Easy to point to living, starving
writers who suffer grievous harm,
in defense of perpetual copyright.
Disney and others use literally
this exact argument year after
year. I'm not even disagreeing. I'm
just saying the shift in attitude
about copyright in the tech space
has been sudden, dramatic, and
really funny. Remember "you
wouldn't steal a car"? Today's
anti-AI tech contingent are
enthusiastically embracing that
false equivalence that we all
laughed at 20 years ago.
|
> > > > > > toraway Having a static, immovable belief
system about something like
copyright that is unaffected by
seismic shifts in the real world
also doesn't seem very logical. If
like, Disney did a 180 overnight
and bought rights from Google to
scan every writer's saved work in
Docs with some flimsy legal
argument then a person saying
"wait doesn't copyright actually
protect that" would make sense.
Even if you were previously upset
about them suing schools for using
80-year-old art.
|
> > > > > > brookst Sure. So you're saying MPAA was
right and you've come
around? Creative works have always
been accretive. There has never
been a creative work made out of
whole cloth, with no debt to any
previous work. The fact your
opinions about creative works
change based on who's profiting
does not change that.
|
> > > BoorishBears Reasoning models can be coaxed to reason like
they do in dedicated reasoning blocks, outside
of those blocks: in normal parts of the
response. But Anthropic at least has openly
admitted they try to detect that and interfere.
> > > ComputerGuru Supposedly there are "jailbreaks" that expose
considerably more of the thinking traces.
|
> > > > woctordho Simple trick: Use an agentic tool like Pi
or OpenCode that allows you to switch
models. First do some chats with DeepSeek
or GLM, which show full thinking traces,
then switch to Claude or GPT and it's more
likely to show full thinking traces.
|
> > > mirekrusin I don't understand why there isn't a public
dataset for reasoning that can be improved by
humans/llms like Wikipedia (ie with auto
judging contributions etc).
|
> > > > woctordho There is already a lot of effort to
collect agent traces including reasonings,
e.g. see the recent discussion:
https://old.reddit.com/r/LocalLLaMA/comments/1u795pb/donate_...
We've been developing DataClaw for this:
https://github.com/peteromallet/dataclaw
|
> > > > > mirekrusin Did I get it wrong, or does the first link
have a dataset with only 30 entries?
|
> > > > logicchains For reasoning a manually-curated dataset
is too small; you need to be able to
automatically generate vast volumes of
synthetic reasoning data with provably
correct answers. That's presumably why
Claude and GPT are so good at using Lean
(the theorem prover), because they get fed
a bunch of synthetic, verifiably correct
training data.
|
> > > > > mirekrusin Wikipedia is a lot of data as well but
we manage to do it, no?
|
> > > orbital-decay You can trivially leak the CoT of any current
model, it's not a problem.
>outrageous copyright infringement
>unethically scraped data
Hahahahaha
|
> alexjplant > It seems to really be a nice step-up and is getting
quite close to the frontier.
IMHO it's already
surpassed them. I vastly prefer my personal GLM and
OpenCode setup to the Claude Code and Opus one that I
have to use at work. The former makes way fewer
StackOverflow brogrammer-tier mistakes and is
considerably better at following instructions. The
harness UX is also vastly superior as it doesn't
ignore, randomly change, or incorrectly report
settings. Maybe it's the harness and I'd have even
greater success with OpenCode and Anthropic, but I
think it safe to say that Anthropic's moat is
evaporating.
|
> > carter2099 You would be surprised at how much of an impact
the harness has. I switched to Pi and Chinese open
source models, and models that _I know_ are less
capable than Sonnet outperform my Sonnet + Claude
Code stack at work.
|
> vorticalbox This is a problem I find with Opus: it will spend so
long thinking, then go "but wait, what if", to the point
where I stop it and simply tell it to "start writing
code, you can work it out as you go along". Seems writer's
block also affects LLMs.
|
> > robertkarl https://arxiv.org/abs/2606.00206
In this paper they
nerf an LLM's ability to emit waffling thinking
tokens like "wait", "but", "alternatively", and
the models (they're old, small models in the
paper) terminate reasoning faster and perform
better. I bet Anthropic is tuning this on their
backend.
|
> > > addandsubtract Didn't they originally introduce those tokens
to make the models smarter by second guessing
their "thoughts"?
|
> > > meatmanek This is super cool. Do you know if any of the
inference backends (llama.cpp, vllm, etc)
support this technique?
|
> > > > iaw vLLM supports "banning" certain tokens but
I don't know if it can dynamically reduce
them. To my knowledge you can also "ban"
with llama.cpp but it is passed in the API
call rather than to the server at
initialization.
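For anyone wondering what "banning" a token means mechanically, here is a minimal, library-free sketch of the idea: the banned tokens' logits are forced to negative infinity before sampling, so they can never be picked. This is an illustration only, not the vLLM or llama.cpp API; the function names and the toy logits are hypothetical.

```python
import math


def ban_tokens(logits: dict, banned: set) -> dict:
    """Return a copy of `logits` with every banned token set to -inf."""
    return {
        token: (-math.inf if token in banned else score)
        for token, score in logits.items()
    }


def softmax(logits: dict) -> dict:
    """Turn logits into probabilities; -inf logits get probability 0.

    Raises ValueError if every token is banned.
    """
    finite = [score for score in logits.values() if score != -math.inf]
    if not finite:
        raise ValueError("all tokens are banned")
    peak = max(finite)  # subtract the max for numerical stability
    exps = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def greedy_pick(logits: dict) -> str:
    """Pick the single highest-scoring token."""
    return max(logits, key=logits.get)


if __name__ == "__main__":
    # Toy next-token scores (made up for the example).
    step_logits = {"wait": 3.0, "but": 2.8, "so": 2.5}
    masked = ban_tokens(step_logits, {"wait", "but"})
    print(greedy_pick(step_logits), "->", greedy_pick(masked))
```

A hard ban like this is the blunt version; the paper discussed above reportedly reduces such tokens rather than removing them outright, which would correspond to subtracting a finite penalty instead of using -inf.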
|
> > > orbital-decay I imagine Anthropic would rather train a small
control model instead of resorting to sampling
hacks
|
> > giancarlostoro I usually have Claude build a plan first, then I
put it into an XML file it updates with phases,
usually we talk about some of those tasks, and
then once it's good and I like it, I have Claude
implement the plan. Another thing I tell Claude to
do is to not guess, but look at documentation, it
messes up a lot less, might use some tokens
reading docs, but at least it has a higher success
rate code wise.
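A plan file along the lines described above could be sketched like this. The element and attribute names (`plan`, `phase`, `task`, `status`) are invented for illustration, not a format Claude or any tool requires.

```python
import xml.etree.ElementTree as ET


def build_plan(title: str, phases: list) -> ET.Element:
    """Build a <plan> element from (phase_name, [task, ...]) pairs.

    The schema here is hypothetical: one <phase> per step, each with
    a status attribute and a list of <task> children.
    """
    plan = ET.Element("plan", {"title": title})
    for number, (name, tasks) in enumerate(phases, start=1):
        phase = ET.SubElement(
            plan, "phase", {"id": str(number), "name": name, "status": "pending"}
        )
        for task in tasks:
            ET.SubElement(phase, "task").text = task
    return plan


def mark_phase_done(plan: ET.Element, phase_id: int) -> bool:
    """Flip one phase to status="done"; return False if it does not exist."""
    for phase in plan.findall("phase"):
        if phase.get("id") == str(phase_id):
            phase.set("status", "done")
            return True
    return False


if __name__ == "__main__":
    plan = build_plan(
        "Example feature",
        [
            ("Read the docs", ["Look up the API instead of guessing"]),
            ("Implement", ["Write the code", "Add tests"]),
        ],
    )
    mark_phase_done(plan, 1)
    print(ET.tostring(plan, encoding="unicode"))
```

The explicit open and close tags are the main draw: each phase has an unambiguous start and end that the model (or a script) can update in place.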
|
> > > > giancarlostoro Apparently because of how Claude is
trained, even the system level prompts go
through as XML, it works better with XML
"prompting" so I figured I could have it
write plans in XML. I need to update my
ticketing tool to output XML maybe by
default.
https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic...
|
> > > > > saltsucker Comments later in thread say markdown
works just as fine and that it's more
important to organize your plan into
sections. Also just think about it, why
would a model trained on the world's
corpus of text (that isn't formatted in
XML) perform better with XML? It would
be a better study if that post tested
markdown, org, XML, JSON, etc. 10
times to see if there is a difference.
|
> > > > > > swingboy Anthropic's best practices still
include the use of XML:
https://platform.claude.com/docs/en/build-with-claude/prompt...
|
> > > > > > adastra22 A year or so ago XML worked more
reliably for long-lived prompt
instructions. Now it is cargo
culting.
|
> > > > > > orbital-decay XML consistently performed better
than markdown and JSON in all
evals I've ever seen on any model,
except for a couple very specific
ones.
|
> > > > aesthesia One reason to use XML-like formatting is
that it makes the beginning and end of
sections explicit. This is less of an
issue when the model is generating text
but can still be helpful when using
templated prompts.
|
> > > > root-parent XML stands for Xtra ML....
|
> > > > > noworriesnate I'd like to switch to a sales
career--can you give me any pointers?
|
> > mikeocool Seriously. Whenever I read the thinking output I
get mad and turn down effort to medium or low. Just
output the code and we'll work through it! I feel
similarly about having codex review claude's
plans. I don't think I've ever seen it catch a
major issue. It just points out things that would
have inevitably been addressed during
implementation anyway.
|
> > > SubiculumCode A lot of times this is how humans work. Just
start 'putting words on paper', 'think by
doing', etc. sometimes it's more efficient to
see why something won't work after writing a
bit of it, and sometimes you get lucky and it
works right off the bat
|
> > epolanski Fable was 20 times worse on that. It's clear it was
the vibe coding model, as it, like no other model
before, fully turned you into its assistant
instead of the other way around.
|
> > > RyanHamilton Could it be possible, these firms are
optimizing for two things: a) Better
performance. b) Gathering data from you to
further improve performance later. I've also
found the huge amount of planning rather than
iteration frustrating. I've felt like I'm
teaching a junior!
|
> > > > epolanski I think they simply optimize around E2E
benchmarks, none of those benchmarks is
designed as multi turn assistance to the
user, but going from a prompt straight to
the final solution.
|
> > > > > celrod Exactly. How can "we" develop and
encourage benchmarks for multi-turn
user assistance?
That is what I want.
I feel like the models and harnesses
push much too hard against this
workflow -- that they push you towards
letting go and vibe coding, with only
your discipline (and desire for a
quality and maintainable product)
holding it back.
|
> > > > happyPersonR more thinking == more tokens === more
money LOLL
|
> > > > > overfeed Is there a cost benchmark out there? I
wonder how frontier models are doing
over time for cost per problem solved.
|
> > > > > drob518 I think they are optimizing for
one-shot performance because that will
drive usage. They can't afford to look
bad in the benchmarks. And if that
means consuming an order of magnitude
more tokens, well, that's good for
business, too.
|
> > drob518 Qwen is notorious for this, too. It'll sometimes
spin in a long loop of "But wait..." paragraphs.
|
> > thinkingtoilet I've been having success with Opus but you REALLY
have to tame it. Long prompts that list what files
to look at, relationships between entities, etc...
I went from regularly hitting my daily limit to
almost never hitting it. Oh, and also I was being
lazy with small changes and stopping that helped a
lot too. As you said, it gets in these loops where
it's just churning and if you don't stop it it can
go on for way too long.
|
> h14h Hopefully the recent work Moonshot did with Kimi K2.7
Code trickles in to the other open-model labs. Per AA,
while K2.7 Code is roughly on par w/ K2.6 in terms of
intelligence, it uses half the output tokens to get
there.
|
> > h14h I've been doing some testing with GLM 5.2 on
Fireworks and it looks like the "High" reasoning
level uses fewer tokens than even K2.7 Code by a
considerable margin (roughly half). Don't have any
evals indicating how it compares on upper-bound
quality, but for a well-defined task it seems like
GLM 5.2 on "High" is remarkably token efficient.
Looking forward to seeing where it lands on the AA
index.
|
> bertili This is GLM 5.2 Max. GLM 5.2 High uses less than
half the tokens [1].
[1] https://z.ai/blog/glm-5.2
|
> > Tiberium Yes, but the Artificial Analysis result is also
from GLM 5.2 (max), not high.
|
> > > andai They have this with a lot of models, measuring
only the max setting, while the one you'd
actually want to use for most tasks is much
lower.
|
> > > > epolanski For the brief period we had Fable, I
never had to use it above medium. Low
nailed the overwhelming majority of
mundane tasks on its own, medium was good
for more complex stuff.
|
> cmrdporcupine > Of course if you convert those values to actual
request cost, GLM 5.2 will probably beat GPT 5.5/Opus
4.8, but speed matters for a lot of people, I
think.
GLM 5.2 ends up being far more expensive than I
thought it would be when I tried it on OpenRouter. I
ground through $5 USD worth of tokens quite
quickly. And this was high, not max.
|
> > guelo Using these open models really makes you realize
how subsidized Anthropic and OpenAI's subscription
plans are.
|
> > > nijave Absolutely. You can also run codeburn or
ccusage and they'll scan the session files and
tell you how much you burnt in API token
pricing equivalent.
|
> esafak I agree. I've noticed that it is quite smart but it
has a tendency to doubt itself and overthink. I
monitor its internal dialogue and prod it when it does
this. They need to optimize the chain of thought early
stopping.
|
> abgruszecki Agreed that models should get better at working with
rare programming languages like Nim! Using them tends
to confuse agents a lot in general. We're working on a
paper right now where we compare how token-efficient
models are when trying to implement the exact same
program in different programming languages, and that's
one of the trends we're seeing.
|
> robmccoll That's interesting. I gave nearly the same task to
Gemma4 31b as a test yesterday. Write a symbolic math
engine in Typescript that can perform evaluation and
simple expression reductions over +-/*(). It performed
the task correctly with minimal reasoning - much fewer
reasoning tokens than output tokens.
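For readers who haven't seen this kind of test task, here is a rough sketch of the evaluation half of it: a tiny recursive-descent evaluator over + - * / and parentheses. It is written in Python rather than TypeScript or Nim, skips the symbolic reduction part entirely, and is not anyone's actual benchmark solution.

```python
import re

# A number, or one of the single-character operators/parentheses.
TOKEN_RE = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")


def tokenize(text: str) -> list:
    """Split an expression into number and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(text: str) -> float:
    """Evaluate an arithmetic expression with the usual precedence."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():  # expr := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term := factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():  # factor := '-' factor | '(' expr ')' | number
        token = advance()
        if token == "-":
            return -factor()
        if token == "(":
            value = expr()
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        if token is None or token in ("+", "*", "/", ")"):
            raise ValueError("expected a number")
        return float(token)

    result = expr()
    if peek() is not None:
        raise ValueError("unexpected trailing input")
    return result


if __name__ == "__main__":
    print(evaluate("2 + 3 * (4 - 1)"))
```

The whole thing fits in well under a hundred lines, which is part of why long reasoning traces on this kind of task stand out.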
|
> > gbingles Tbh, so what? I googled "symbolic math engine in
Typescript that can perform evaluation and simple
expression reductions over +-/*()" and got what
looks to be viable answers without using any AI
model at all. Reciting well established things
from memory isn't terribly interesting. Show it a
novel codebase and have it implement something
within it.
|
> > > SubiculumCode TBH, while your point is a fair one, your
attitude is off-putting and needlessly
condescending.
|
> > > drob518 So, a natural question would be why a model
would ever get it wrong?
|
> xyzsparetimexyz Reminiscent of
https://en.wikipedia.org/wiki/Portia_(spider)
|
> rdsubhas As per stats in other comments, it is frontier, not
close to frontier.
|
> HWR_14 I thought you could not compare tokens across models
because their cost and speed was so different between
models.
|
> nurumaik You asked for maximum effort, you got maximum effort
|
kristopolous I have a script that ranks these based on codingindex from
Artificial Analysis. All it does is pull a JSON from their
main table page and parse it for the fields I care about
(coding). There used to be a mailing list associated with
it but eh ... there wasn't much interest. I use the script
every day though.
Current partial output:
score  age  size   name
47.1   58   large  Kimi K2.6
47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
47.5   70   -      Muse Spark
47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
47.8   205  -      Claude Opus 4.5 (Reasoning)
48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
48.6   55   -      GPT-5.5 (Non-reasoning)
48.7   188  -      GPT-5.2 (xhigh)
50.1   29   -      Qwen3.7 Max
50.7   1    large  GLM-5.2 (max)
50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
51.5   92   -      GPT-5.4 mini (xhigh)
52.1   55   -      GPT-5.5 (low)
52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
53.1   132  -      GPT-5.3 Codex (xhigh)
53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
55.5   118  -      Gemini 3.1 Pro Preview
56.2   55   -      GPT-5.5 (medium)
56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
57.2   104  -      GPT-5.4 (xhigh)
58.5   55   -      GPT-5.5 (high)
59.1   55   -      GPT-5.5 (xhigh)
62     8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
To see everything, run it like so:
$ curl day50.dev/art-analysis.sh | bash
The repo: https://github.com/day50-dev/aa-eval-email
Some key takeaways:
* open models are on about a 4-7 month lag right now,
depending on how you want to measure it
* if this keeps up, you might see an open-weights model
doing Claude Fable 5 level work before the new year.
If people sign up for the free mailing list (that just
does this) I'll go and put it back on ... emails when
new model evals drop - it was pretty useful.
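The "pull a JSON, keep the fields you care about, sort" step can be sketched roughly as follows. This is not the actual script (that one is a shell script in the linked repo); the JSON field names `coding_index`, `age_days`, and `size` are assumptions made for the example, and the three sample rows reuse values from the table above.

```python
import json

# Sample input shaped like a list of model records.
# The field names are HYPOTHETICAL; the real payload may differ.
SAMPLE_JSON = """
[
  {"name": "GLM-5.2 (max)", "coding_index": 50.7, "age_days": 1, "size": "large"},
  {"name": "Qwen3.7 Max", "coding_index": 50.1, "age_days": 29, "size": null},
  {"name": "GPT-5.5 (xhigh)", "coding_index": 59.1, "age_days": 55, "size": null}
]
"""


def rank_models(raw_json: str, score_field: str = "coding_index") -> list:
    """Parse records and sort them by score, lowest first (as in the table)."""
    records = json.loads(raw_json)
    scored = [r for r in records if r.get(score_field) is not None]
    return sorted(scored, key=lambda r: r[score_field])


def format_rows(records: list, score_field: str = "coding_index") -> str:
    """Render the ranked records as a plain-text table."""
    lines = ["score  age  size   name"]
    for record in records:
        size = record.get("size") or "-"
        lines.append(
            f"{record[score_field]:<6} {record['age_days']:<4} "
            f"{size:<6} {record['name']}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_rows(rank_models(SAMPLE_JSON)))
```

Fetching the live data is left out on purpose, since the real endpoint and its schema are not described here.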
|
> papersail
score  age  size   name
62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1   55   -      GPT-5.5 (xhigh)
58.5   55   -      GPT-5.5 (high)
57.2   104  -      GPT-5.4 (xhigh)
56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2   55   -      GPT-5.5 (medium)
55.5   118  -      Gemini 3.1 Pro Preview
53.1   132  -      GPT-5.3 Codex (xhigh)
53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1   55   -      GPT-5.5 (low)
51.5   92   -      GPT-5.4 mini (xhigh)
50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7   1    large  GLM-5.2 (max)
50.1   29   -      Qwen3.7 Max
48.7   188  -      GPT-5.2 (xhigh)
48.6   55   -      GPT-5.5 (Non-reasoning)
48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8   205  -      Claude Opus 4.5 (Reasoning)
|
> > tcp_handshaker Short comments...
- GPT 5.5 is consistently the best, an opinion that
gets me constant downvotes here from the Anthropic
Marketeer strike force...
- China is going to eat the US's lunch on AI.
- What have European universities and companies been
doing? It's as if, in a parallel past/future, Nikola
Tesla and Edison had created flying Cyberpunk
machines while European researchers were getting
together to request EU funds for an investigation on
how to breed faster horses.
- If Zuckerberg could be fired, after spending
a total of $235 billion on AI and having
NOTHING to show for...should he be fired?
|
> > > Certhas None of these models come from universities,
European or otherwise. Mistral is clearly
currently not competing for Frontier Model.
Whether this is due to a lack of VC Funds or a
lack of technical ability or the former
arising from the latter would be interesting
to know. The top models are from startups.
Among the FAANG only Google managed to get a
Frontier model, and they literally invented
the architecture and have more money than they
can possibly spend to throw at the problem.
Facebook shows that even ungodly amounts of
money don't get you there though. So why did no
EU-based startups succeed while two US
startups succeeded? I agree that that's a very
important question the EU should ask. The
Internet revolution was driven by US
companies, and now AI will be as well, with
Chinese Open Weights mixed in. The EU
consistently can not turn its considerable
economic output into fast moving tech firms.
|
> > > > Quarrel Mistral have moved to actually trying to
make money, and been relatively
successful; at least if we lived in a
normal world. They've got a heap of
contractors working to help industry adopt
LLMs. It is just classic consulting work,
and they'd look like a really great
company if we weren't comparing them to
literal $2T+ companies losing money
hand-over-fist...
|
> > > > sschueller Apertus was built by universities in
Switzerland. Although not frontier, it is
fully open [1].
[1] https://apertvs.ai/pages/about/
|
> > > > kristopolous I'm actually more curious about IBM. Their
Granite series appears to be nowhere close
to competitive. They had Watson, remember,
it won on Jeopardy like 15 years ago?
They've been at this for a long time. Maybe
it's good at something else?
|
> > > > > tekchip IBM doesn't do technology they do
contracts. Any "technology" is
marketing stunts. They hire a bunch of
"fellows" outside contractors to make
a thing they can be first at or
whatever, do the stunt, then get a
bunch of 5-10 year contracts with
customers off the stunt. They then
fuck it up for that length of time but
still get paid due to those contracts.
After that space of time the folks
they've burned have moved on, rinse
repeat. Pretty easy to look back at
the timeline of "firsts" they have and
see the pattern.
|
> > > > > > JSR_FDED Don't forget the marketing for the
new $1B "initiative" (fill in:
mobile, cloud, blockchain,
AI,...) Upon closer inspection the
$1B is (a) over 10 years, (b)
mostly internal cross-billing
between departments.
|
> > > > > > drob518 Yes, but the key point is that
nobody got fired for buying it
from IBM.
|
> > > > > > tanseydavid "HAL, I want you to train a
frontier-level large language
model for me."
"I'm sorry Dave, I can't do that."
|
> > > > > root-parent Agree that IBM has no excuse,
especially given how long they have been
trying to do AI. Although Watson was a
completely different technology. They
had to start from scratch, but don't
seem to have management
smart enough to stop doing it in
house. They could have just acquired a
startup that could build a frontier
model. Which is also very ironic, since
their whole business for the last 15
years has been buying companies a la
CA Associates... Their previous Watson
branding and the collapse of Watson
expectations cost them one CEO, but
the current CEO was part of the same
team. They just don't learn....
|
> > > > > vunderba I view Watson in the same light as
Deep Blue, one-offs that brought more
prestige and potential share value to
IBM than necessarily "moving the
needle" in the respective technology.
|
> > > > > greenavocado Granite is OK for speech to text (ASR)
|
> > > marcus_cemes To be honest, living in Switzerland and
speaking with peers, we're just exhausted by
the constant AI hype. For a lot of us, the
fact that Europe isn't frantically trying to
scrape the entire internet and every book in
existence for the next massive model isn't a
bad thing. The big players are doing their
thing, like with the nuclear arms race. We
regulate a lot, too much a lot of the time,
but sometimes that trickles down to other
places too. A lot was done right, imo. ETH
Zurich and EPFL universities recently put out
an open model called Apertus (was on the HN
front page a few months back), it's not a
frontier model, but they built it properly
regarding copyright and data transparency. It
might look a bit slow or old-fashioned, but
focusing on doing things ethically and legally
feels like a much better path than just
joining the race to scrape everything.
|
> > > > dr_dshiv Sir, I would suggest that if Europe fails
to be economically competitive, the
downstream implications on European
society will produce much worse outcomes
than (for instance) data
transparency... Doing things with ethical
intentions does not necessarily produce
outcomes that are beneficial for society
at large.
|
> > > > > marcus_cemes I'm inclined to agree with you, but
you could make the same argument for
exploiting natural resources and the
environment. I don't think it's being
done right at the moment, and it does
not seem to be benefiting people as
much as certain companies.
|
> > > > > muvlon Well, is this mad dash for AI
producing "outcomes that are
beneficial for society at large" yet?
So far it looks like it's mostly
producing a ton of negative
externalities and wealth transfer to
corrupt elites. Also, no, abandoning
ethics is not an option, what a
ridiculous suggestion.
|
> > > > > > dr_dshiv Data transparency and copyright
does not constitute "ethics."
|
> > > > _zoltan_ also living in Switzerland and I disagree.
Hard. it's horrible that Europe is so
backwards in AI. too much regulation and
nothing to show for it. we should be way
faster. there is no money. the culture in
both Europe and Switzerland is that you
don't fail, while in the US it's perfectly
fine to be on your 4th startup because the
first 3 failed. it's not that it LOOKS slow
and old fashioned, it IS slow and old
fashioned. it's horrible.
|
> > > > tsss If these models ever reach the point where
they are as good a programmer as a human
is (and thus can self-improve completely
independently), then there won't be an
independent Switzerland much longer. AI
race is a race for first place.
>like with the nuclear arms race
MacArthur was about
to nuke the Chinese in the Korean war.
China knows that nuclear weapons, AI and
robotics are a matter of survival and not
a nice-to-have.
|
> > > wunderlotus > - If Zuckerberg could be fired, after
spending a total of $235 billion on AI and
having NOTHING to show for...should he be
fired?
Yes, if the premise were true, but it's
not.
https://opper.ai/ai-roundtable/questions/bbf5a4e9-204
|
> > > > tcp_handshaker Interesting... but this shows how dumb
these AIs are. They misunderstood
"nothing to show for" as literally nothing
to show for. Yes, not factually nothing, but
he has effectively not much that is
competitive to show for it, so it's literally
true. And had they been given this
clarification, they would have suddenly
said: "Oh yes of course, you are
absolutely right, you are correct on
challenging me on that...."
|
> > > ricardobayes Well Europe is famously a laggard when it
comes to new tech - in parts of Switzerland,
two horses were required to be mounted in front
to carry cars up until 1925. UK required a
person to walk in front of a car and wave a
red flag.
|
> > > kristopolous They did Muse Spark ... it's not garbage. Also
what are they building it for? I'd think it's
to serve ads better or something like that.
Maybe Muse Spark fits facebook's needs
perfectly...
|
> > > > jansan Mo Bitar said something like "Meta's LLM
is the one you use if you accidentally
hit the wrong button in WhatsApp. Its user
base is fat-finger phone users."
|
> > > > > tcp_handshaker As comparison the WHOLE NASA budget is
24 billion. Meta burned 10x that on
AI...
|