I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.
> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
> it's not because the model is figuring out something new
> LLMs will NEVER be able to do that, because it doesn't exist
It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?
I might as well answer my own question, because I do think there are some coherent arguments for fundamental LLM limitations:
1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human, or maybe above-average-human, performance.
2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.
However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:
1. It may be more efficient to just learn correct logical reasoning, rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold medal at the Math Olympiad.
2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:
I studied philosophy focusing on the analytic school and proto-computer science. LLMs are going to force many people to start getting a better understanding of what "Knowledge" and "Truth" are, especially the distinction between deductive and inductive knowledge.
Math is a perfect field for machine learning to thrive in because, theoretically, all the information ever needed is tied up in the axioms. In the empirical world, however, knowledge only moves at the speed of experimentation, which is an entirely different framework and much, much slower, even if there is some room to catch up on already-published experimental outcomes.
Having a focus in philosophy of language is something I genuinely never thought would be useful. It’s really been helpful with LLMs, but probably not in the way most people think. I’d say that folks curious should all be reading Quine, Wittgenstein’s investigations, and probably Austin.
I think we may have similar perspectives. Regarding empirical knowledge, consider when the knowledge is about chaotic systems. Characterize chaotic systems, at minimum, as systems where slightly inaccurate observations of past and present states, while useful for predicting the future, nevertheless see their errors grow very quickly as you try to predict a future state. For such systems, prediction is genuinely difficult.
There is one domain of knowledge I think you have yet to mention: fundamentally computationally hard problems. The ones of practical benefit that come to mind are physics simulations, materials simulations, and fluid simulations, but there exist problems that are more provably computationally difficult. It seems to me that with these systems, the chaotic nature means that even with one infinitely precise observation of a deterministic system, accessing a future state of the system is still difficult, even though once accessed, memorizing it seems comparatively trivial.
Also, we can do thought experiments, simulations in our heads, that are often as good as doing them for real - it has limitations and isn't perfect, but it does work often. Einstein used to purposely doze off in an awkward position so that something would hit his leg and nudge him half awake, so he could remember his half-dreaming state - which is where he discovered some things.
Where can I read about how LLMs have changed epistemology? Is there a field of philosophy that tries to define and understand 'intelligence'? That sounds very interesting.
There is already philosophy of mind, but it was pretty young when I was in grad school, which was really at the dawn of deep learning algorithms.
I’d say the two most important topics here are philosophy of language (understanding meaning) and philosophy of science (understanding knowledge).
I’ve already mentioned the language philosophers in an edit above, but in philosophy of science I’d add Popper as extremely important here. The concept of negative knowledge as the foundation of empirical understanding seems entirely lost on people. The Black Swan, by Nassim Taleb, is a very good casual read on the subject.
> distinction between deductive and inductive knowledge
There's also intuitive knowledge btw.
Anyway, the recent developments in AI make a lot of very interesting things practically possible. For example, our society is going to want a way to reliably tell whether something is AI generated, and a failure to find one pretty much settles the empirical part of the Turing test issue. Or alternatively, if we actually find something that AI can't reliably mimic in humans, that's going to be a huge finding. With millions of people wondering whether posts on social media are AI generated, we have inadvertently conducted the largest-scale Turing test ever.
The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting. If humans are not bogged down by the small details or cost of implementation concerns, and we can just say what we want and get what we wished for (digitally), what level of creativity can we reach?
Also once we get the robots to do things in the physical space...
I don't want to do the thing where we fight on the internet. I don't know your background, but I'll push back here just because this is the type of comment that non-philosophers tend to present to me, and it misses a lot of the points I'm trying to make.
(1) "intuitive knowledge" - whether or not you want to take "intuitive knowledge" as a type of knowledge (I don't think I would) is basically immaterial. The deductive-inductive framework dynamic is for reasoning frameworks, not knowledge. The reasoning frameworks are pointed in opposite directions. The deductive framework is inherited from rationalist tradition, it's premises are by definition arbitrary and cannot be justified, and information is perfect (excepting when you get rare truth values, like something being undecidable). Inductive/empirical framework is quite the opposite. Its premises are observations and absolutely not arbitrary, the information is wholly imperfect (by necessity, thanks Popper), and there is always a kind of adjustable resolution to any research conducted. Newton vs Einsteinian physics, for example, shows how zooming in on the resolution of experimentation shows how a perfectly workable model can fail when instruments get precise enough. I'll also note here that abduction is another niche reasoning framework, but is effectively immaterial to my point here.
(2) The Turing Test is not, and has never been, a philosophically rigorous test. It's effectively a pointless exercise. The literature about "philosophical zombies" has covered this, but the most important work here is Searle's "Chinese Room."
>The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting.
I don't even know how to respond to this. It's trivially, demonstrably false. Beyond that, my entire point is that philosophy of language presents such hard problems about what meaning actually is that they might end up creating a kind of uncertainty principle for this line of thinking in the long run. Specifically, Quine's indeterminacy of translation.
Searle's Chinese Room is a fallacious mess ... see the works of Larry Hauser, e.g., https://philpapers.org/rec/HAUNGT and https://philpapers.org/rec/HAUSCB-2
The importance of Searle's Chinese Room lies in how such extraordinarily bad argumentation has persuaded so many people who were open to it.
And the literature about philosophical zombies is contentious, to say the least, and much of it is also among the worst arguments in philosophy--Dennett confided in me that he thought it set back progress in Philosophy of Mind for decades, along with that monstrosity of misdirection, "the hard problem". Chalmers (nice guy, fun drunk at parties, very smart, but hopelessly deluded) once admitted to me on the Psyche-D list that his argument in The Conscious Mind that zombies are conceivable is logically equivalent to denying that physicalism is conceivable, so it's no argument against physicalism ... he said he used the argument to till the soil to make people more susceptible to his later arguments against physicalism (which I consider unethical)--all of which are bogus, like the Knowledge Argument--even Frank Jackson who originated it admits this.
Similarly, Robert Kirk, who coined the phrase "philosophical zombie" in 1974, wrote his book Zombies and Consciousness "as penance", he told me when he signed my copy.
> I don't want to do the thing where we fight on the internet.
Nor me ... I've had these "fights" too many times already and I know how they go, and I understand why people believe what they believe and why they can't be swayed, so I won't comment further ... I just want to put a dent in this "I'm a philosopher" argumentum ad verecundiam.
I would hope that philosophy would be exempt from accusations of arguments from authority. I say I don’t want to fight exactly because I don’t want to come off like a jerk because I’m arguing. If the Chinese Room is a mess, I welcome the argument, and will happily read the paper.
I’m less open to push back against philosophical zombies, as the argument seems trivially plausible, from a position of solipsism.
There are ways to go beyond the human-quality data limitation. AI can be trained on data of better quality than the average human's because many problems have solutions that are easy to verify. For example, in theory, reinforcement learning with an automatic grader on competitive programming problems can lead to an LLM that is better than humans at it.
It's also possible that there can be emergent capabilities. Perhaps a little obtuse, but you can say that humans are trained on human-quality data too and yet brilliant scientists and creative minds can rise above the rest of us.
The idea that they don’t learn from experience might be true in some limited sense, but ignores the reality of how LLMs are used. If you look at any advanced agentic coding system the instructions say to write down intermediate findings in files and refer to them. The LLM doesn’t have to learn. The harness around it allows it to. It’s like complaining that an internal combustion engine doesn’t have wheels to push it around.
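As a rough sketch of that harness pattern (call_llm below is a hypothetical stand-in for whatever model API a real system uses, and the "findings" handling is simplified to appending the reply):

    # Toy sketch of a file-based "memory" harness around a stateless model call.
    # call_llm is a hypothetical stand-in, not a real library function.
    from pathlib import Path

    NOTES = Path("NOTES.md")

    def call_llm(prompt: str) -> str:
        # a real harness would call a model API here
        return f"(model reply to {len(prompt)} chars of prompt)"

    def run_turn(task: str) -> str:
        notes = NOTES.read_text() if NOTES.exists() else ""
        reply = call_llm(f"Notes so far:\n{notes}\n\nTask: {task}")
        # persist intermediate findings so the next turn starts with more context
        NOTES.write_text(notes + "\n" + reply)
        return reply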
Modern LLMs, just like everyone reading this, will instead reach for a calculator to perform such tasks. I can't do that in my head either, but a python script can so that's what any tool-using LLM will (and should) do.
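In practice the "calculator" is nothing fancier than a one-line tool call, something like:

    # what the "reach for a calculator" tool call boils down to
    print(167_383 * 426_397)   # 71371609051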
Long multiplication is a trivial form of reasoning that is taught at elementary level. Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens, all of its "thoughts" are on paper. That was Opus with extended reasoning, it had all the opportunity to get it right, but didn't. There are people who can quickly multiply such numbers in their head (I am not one of them).
I tried this with Claude - it has to be explicitly instructed to not make an external tool call, and it can get the right answer if asked to show its work long-form.
Mathematics is not the only kind of reasoning, so your conclusion is false. The human brain also has compartments for different types of activities. Why shouldn't an AI be able to use tools to augment its intelligence?
> Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens, all of its "thoughts" are on paper
LOL, talk about special pleading. Whatever it takes to reshape the argument into one you can win, I guess...
LLMs don't reason.
Let's see you do that multiplication in your head. Then, when you fail, we'll conclude you don't reason. Sound fair?
I thought it might do better if I asked it to do long-form multiplication specifically rather than trying to vomit out an answer without any intermediate tokens. But surprisingly, I found it doesn't do much better.
I asked Gemini 3 Thinking to compute the multiplication "by hand." It showed its work and checked its answer by casting out nines and then by asking Python.
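For anyone unfamiliar, casting out nines is just a digit-sum sanity check; a few lines of Python, using the numbers from upthread, show the idea:

    a, b, claimed = 167_383, 426_397, 71_371_609_051

    def digital_root(n: int) -> int:
        return 1 + (n - 1) % 9 if n else 0

    # casting out nines: necessary but not sufficient - the digit roots must agree
    assert digital_root(claimed) == digital_root(digital_root(a) * digital_root(b))
    # the exact check it then handed off to Python
    assert a * b == claimed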
Sonnet 4.6 with Extended Thinking on also computed it correctly with the same prompt.
LLMs can generate anything by design. LLMs can't understand what they are generating, so it may be true, it may be wrong, it may be novel, or it may be a known thing. It doesn't discern between them, it just looks for the best statistical fit.
The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned the phrases "truly novel" and "solving unsolved math problem" a certain meaning in our heads. Some of us, at least, think that truly novel means something truly novel and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug, etc. Something which involves real intelligent thinking and not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables in a drawer, or a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation with enormous numbers. These are formally novel things, but we really never needed any of them and so relegated these "issues" to the deepest backlog possible. Utilizing LLMs we can scour for the solutions to many such problems, but they are not that impressive in the first place.
> It doesn't discern between them, just looks for the best statistical fit
Of course at the lowest level, LLMs are trained on next-token prediction, and on the surface, that looks like a statistics problem. But this is an incredibly reductionist viewpoint and I don't see how it makes any empirically testable predictions about their limits. LLMs 'learned' a lot of math and science in this way.
> "truly novel" and "solving unsolved math problem"
OK again if novelty lies on a continuum, where do you draw the line? And why is it correct to draw it there and not somewhere else? It seems like you are just naming exceptionally hard research problems.
This is why I put 'learned' in quotes. They started from a state of not being able to solve algebra problems or produce basic steps of scientific reasoning to being able to. Operationally, that is what I mean by learning and they unambiguously do it.
If LLMs can come up with formally truly novel solutions to things, and you have a verification loop to ensure that they are actual proper solutions, I don't understand why you think they could never come up with solutions to impressive problems, especially considering the thread we are literally on right now. At this point, the claim that they will always be limited to coming up with truly novel solutions to uninteresting problems seems like a pure assertion.
The problem with these bromides is not that they're wrong, it's that they're not even wrong. They're predictive nulls.
What observable differences can we expect between an entity with True Understanding and an entity without True Understanding? It's a theological question, not a scientific one.
I'm not an AI booster by any means, but I do strongly prefer we address the question of AI agent intelligence scientifically rather than theologically.
We've tested this in the small with AI art. When people believe they're viewing human-made art which is later revealed to be AI art, they feel disappointed. The actual content is incidental, the story that supports it is more important than the thing itself.
It's the same mechanism behind artisanal food, artist struggles, and luxury goods. It is the metaphysical properties we attach to objects or the frames we use to interpret strips of events. We author all of these and then promptly forget we've done so, instead believing they are simply reality.
There are already people dealing with AI intelligence scientifically. That's what benchmarks do.
It's the "it's just a stochastic parrot!" camp that's doing the theological work. (and maybe also those in the Singularity camp...)
That said, I do think there's value in having people understand what "Understanding" means, which is kinda a theological (philosophical :D) question. IMHO, in every-day language there's a functional part (that can be tested with benchmarks), and there's a subjective part (i.e. what does it feel like to understand something?). Most people without the appropriate training simply mix up these two things, and together with whatever insecurities they have with AI taking over the world (which IMHO is inevitable to some extent), they just express their strong opinions about it online...
Well said. That's exactly what has been rubbing me the wrong way with all those "LLMs can never *really* think, ya know" people. Once we pass some level of AI capability (which we perhaps already did?), it essentially turns into an unfalsifiable statement of faith.
Agreed. We should be asking what the machines measurably can or can't do. If it can't be measured, then it doesn't matter from an engineering standpoint. Does it have a soul? Can't measure it, so it doesn't matter.
That's a bit too pessimistic. Often you can productively find some measurable proxy for the thing you care about but can't measure. Turing's test is a famous example of that.
Sometimes you only have a one-sided proxy. Eg I can't tell you whether Claude has a soul, but I'm fairly sure my dishwasher ain't.
It probably can, but it won't realize that and it won't be efficient at it. An LLM can shuffle tokens for an enormous number of tries and eventually come up with something super impressive, though as you yourself have mentioned, we would need a mandatory verification loop to filter slop from good output, and how to do that outside of some limited areas is a big question. But assuming we have these verification loops and are running LLMs for years to look for something novel: it's like running the energy grid of a small country to change a few dozen database entries per hour. Yes, we can do that, but it's a kind of weird thing to do. But it is novel, no argument about that. Just inefficient.
We never had a big demand to define how humans are intelligent or conscious etc., since it is too hard and was relegated to a few frontier researchers. And with LLMs we now do have such a demand, but the science wasn't ready. So we are all collectively searching in the dark, trying to figure out whether we are different from these programs and, if so, how. I certainly can't do that. I do know that LLMs are useful, but I also suspect that AI (aka AGI nowadays) is not yet reached.
- insane growth rates (go back and look at where we were maybe 2 years ago and then consider the already signed compute infrastructure deals coming online)
And still say with a straight face that this is some kind of parlor trick or monkeys with typewriters.
We don’t need to run LLMs for years. The point is to look at where we are today and consider that performance gets 10x cheaper every year.
LLMs and agentic systems are clearly not monkeys with typewriters regurgitating training data. And they have and continue to grow in capabilities at extremely fast rates.
I was talking about the highest-difficulty problems only, in the scope of that comment. Sure, at mundane tasks they are useful and we are optimizing that constantly.
But for super hard tasks, there is no situation where you just dump a few papers into context, add a prompt, and the LLM spits out the correct answer. It's likely that a lead on such a project would need to additionally train an LLM on their local dataset, then parse through a lot of experimental data, then likely run multiple LLMs for many iterations homing in on the solution, verifying intermediate results, then repeat the cycle again and again. And in parallel the other team members would do the same. All in all, for such a huge, hard task, a year of cumulative machine-hours is not something outlandish.
This is just not true. Maybe it will be true if you increase the problem difficulty in concert with model performance? You don't need fine-tuning for this and you haven't for years now. Reasoning performance for now may be SOMEWHAT brittle, but again, look at where we have come from in like 2 years. Then also consider the logical next steps:
- better context compression (already happening) + memory solutions that extend the effective context length [memory _is_ compression]
- continual learning systems (likely already prototyped)
- these domains are _verifiable_ which I think just seems to confuse people. RL in verifiable domains takes you farther and farther. Training data is a bootstrap to get to a starting point, because RL from scratch is too inefficient.
- agents can already deal with large codebases and datasets, just like any SWE, DS or researcher.
and yes! If you throw more compute at a problem you will get better solutions! But you are missing the point: for the frontier solutions, which change with every model update, you of course need to eke out as much performance as you can, which requires a large amount of test-time compute. But what you can do _without_ that much compute is continually improving. The pattern _already in place_ is that at first you need an extreme amount of compute, then the next model iterations need far less compute to reach that same solution, etc. The costs + compute requirements to perform a particular task decrease exponentially.
> We never had a big demand to define how humans are intelligent or conscious etc., since it is too hard and was relegated to a few frontier researchers. And with LLMs we now do have such a demand, but the science wasn't ready. So we are all collectively searching in the dark, trying to figure out whether we are different from these programs and, if so, how. I certainly can't do that. I do know that LLMs are useful, but I also suspect that AI (aka AGI nowadays) is not yet reached.
Alternative perspective: the science may not have been ready, so instead we brute-forced the problem through training of LLMs. Consider what the overall goal function of LLM training is: predicting tokens that continue the given input in a way that makes sense to humans - in the fully general meaning of this statement.
It's a single training process that gives LLMs the ability to parse plain language - even if riddled with 1337-5p34k, typos, grammar errors, or mixing languages - and extract information from it, or act on it; it's the same single process that makes them equally good at writing code and poetry, at finding bugs in programs, inconsistencies in data, corruptions in images, possibly all at once. It's what makes LLMs good at lying and spotting lies, even if the input is a tree of numbers.
(It's also why "hallucinations" and "prompt injection" are not bugs, but fundamental facets of what makes LLMs useful. They cannot and will not be "fixed", any more than you can "fix" humans to be immune to confabulation and manipulation. It's just the nature of fully general systems.)
All of that, and more, is encoded in this simple goal function: if a human looks at the output, will they say it's okay or nonsense? We just took that and threw a ton of compute at it.
> (It's also why "hallucinations" and "prompt injection" are not bugs, but fundamental facets of what makes LLMs useful. They cannot and will not be "fixed", any more than you can "fix" humans to be immune to confabulation and manipulation. It's just the nature of fully general systems.)
This is spot on and one of the reasons why I don't think putting LLMs or LLM based devices into anything that requires security is a good idea.
We can't tell yet if that is true, partially true, or false for humans. We do know that an LLM can't do anything else besides that (I mean as a fundamental operating principle).
Why is it important? “Statistical fit” is what you want… not understanding this is indicative of a limited understanding of what statistics is. What do you think it means to truly understand something? I don’t get it: read Probability Theory by Jaynes. It doesn’t really matter if the brain does Bayesian updates, but that’s what’s optimal…
I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).
I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.
It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.
This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.
Thanks! I've been doing a lot of work on a laptop screen (I normally work on an ultrawide) and got tired of constantly switching between windows to find the information I need.
I've also added the ability to create a picture-in-picture section of any application window, so you can move a window to the background while still seeing its important content.
Was it a novel solution for you or for everyone? Because that's a pretty big difference. A lot of stuff that's novel for me would be something someone has been doing for decades somewhere.
How confident are you that this knowledge was not part of the training data? Were there no stackoverflow questions/replies with it, no tech forum posts, private knowledge bases, etc.?
Not trying to diminish its results, just that one should always assume that LLMs have a rough memory of pretty much the whole of the internet/human knowledge. Google itself was very impressive back then in how it managed to dig out stuff that interested me (though it's no longer good at finding a single article with almost exact keywords...), and what makes LLMs especially great is that they combine that with some surface-level transformation to make that information fit the current, particular need.
Do you think AlphaGo is regurgitating human gameplay? No it’s not: it’s learning an optimal policy based on self-play. That is essentially what you’re seeing with agents. People have a very misguided understanding of the training process and the implications of RL in verifiable domains. That’s why coding agents will certainly reach superhuman performance. Straw/steel man depending on what you believe: “But they won’t be able to understand systems! But a good spec IS programming!” is also a bad take: agents absolutely can interact with humans, interpret vague desiderata, fill in the gaps, ask for direction. You are not going to need to write a spec the same way you need to today. It will be exactly like interacting with a very good programmer in EVERY sense of the word.
How does AlphaGo come into the picture? It works in a completely different way altogether.
I'm not saying that LLMs can't solve new-ish problems not part of the training data, but they sure as hell didn't get some Apple-specific library call from divine revelation.
AlphaGo comes into the picture to explain that in fact coding agents in verifiable domains are absolutely trained in very similar ways.
It’s not magic: they can’t access information that’s not available, but they are not regurgitating or interpolating training data either. That’s not what I’m saying. I’m saying: there is a misconception, stemming from a limited understanding of how coding agents are trained, that they are somehow limited by what’s in the training data or poorly interpolating that space. This may be true for some domains, but not for coding or mathematics. AlphaGo is the right mental model here: RL in verifiable domains means your gradient steps take you in directions that are not limited by the quality or content of the training data, which is used only because starting RL from scratch is very inefficient. Human training data gives the models a more efficient starting point for RL.
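A deliberately tiny cartoon of what "reward from verification, not from data" means (illustrative toy only, nothing like a real trainer):

    import random

    # Toy: the "policy" is a weight per candidate answer; the reward comes from a
    # verifier (is the answer exactly right?), not from resembling any training data.
    target = 167_383 * 426_397
    candidates = [target - 1, target, target + 1]
    weights = [1.0, 1.0, 1.0]

    def verify(answer: int) -> bool:
        return answer == 167_383 * 426_397   # cheap, exact check

    for _ in range(200):
        i = random.choices(range(len(candidates)), weights)[0]  # sample from the policy
        reward = 1.0 if verify(candidates[i]) else 0.0
        weights[i] += reward                                    # reinforce what verified

    print(max(zip(weights, candidates)))  # probability mass piles onto the verified answer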
Because you can't control what the content server is doing. SCK doesn't care if you only need a small section of a window: it performs multiple full window memory copies that aren't a problem for normal screen recorders... but for a utility like mine, the user needs to see the updated content in milliseconds.
Also, as I mentioned above, when using SCK, the user cannot minimize or maximize any "watched" window, which is, in most cases, a deal-breaker.
My solution runs at under 2% cpu utilization because I don't have to first receive the full window content. SCK was not designed for this use case at all.
Well, I'm not going to share either solution as this is actually a pretty useful utility that I plan on releasing, but the short answer is: 1) don't use ScreenCaptureKit, and 2) take advantage of what CGWindowListCreateImage() offers through the content server. This is a simple IPC mechanism that does not trigger all the SCK limitations (i.e., no multi-space or multi-desktop support). In fact, when using SCK, the user cannot even minimize the "watched" window.
Claude realized those issues right from the start.
One of the trickiest parts is tracking the window content while the window is moving - the content server doesn't, natively, provide that information.
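For readers who haven't touched this API, here is a minimal sketch of the plain, documented CGWindowListCreateImage call from Python via pyobjc (assuming pyobjc is installed) - just the public entry point, not the actual solution described above, which relies on undocumented calls:

    # Minimal pyobjc sketch: snapshot one window by its window ID.
    from Quartz import (CGRectNull, CGWindowListCreateImage,
                        kCGWindowImageBoundsIgnoreFraming,
                        kCGWindowListOptionIncludingWindow)

    window_id = 12345  # placeholder: look this up via CGWindowListCopyWindowInfo
    image = CGWindowListCreateImage(
        CGRectNull,                          # null rect = use the listed window's bounds
        kCGWindowListOptionIncludingWindow,  # capture just this window
        window_id,
        kCGWindowImageBoundsIgnoreFraming,
    )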
No it didn't. Like I said... it may have gotten something that worked but there is no way Claude got it to work while supporting multi-spaces, multi-desktops, and using under 2% cpu utilization. My solution can display app window content even when those windows are minimized, which is not something the content server supports.
My point was that Claude realized all the SCK problems and came up with a solution that 99% of macOS devs wouldn't even know existed.
> it may have gotten something that worked but there is no way Claude got it to work while supporting multi-spaces, multi-desktops, and using under 2% cpu utilization.
Maybe, but that's the magic of LLMs - they can now one-shot or few-shot (N<10) you something good enough for a specific user. Like, not supporting multi-desktops is fine if one doesn't use them (and if that changes, few more prompts about this particular issue - now the user actually knows specifically what they need - should close the gap).
Do you believe my brief overview of the problem will help Claude identify the specific undocumented functions required for my solution? Is that how you think data gets fed back into models during training?
Yes. I don't think you appreciate just how much information your comments provide. You just told us (and Claude) what the interesting problems are, and confirmed both the existence of relevant undocumented functions, and that they are the right solution to those problems. What you didn't flag as interesting, and possible challenges you did not mention (such as these APIs being flaky, or restricted to Apple first-party use, or such) is even more telling.
Most hard problems are hard because of huge uncertainty around what's possible and how to get there. It's true for LLMs as much as it is for humans (and for the same reasons). Here, you gave solid answers to both, all but spelling out the solution.
ETA:
> Is that how you think data gets fed back into models during training?
No, one comment chain on a niche site is not enough.
It is, however, how the data gets fed into prompt, whether by user or autonomously (e.g. RAG).
> Yes. I don't think you appreciate just how much information your comments provide
Lol... no. You don't know how I solved the problem and you just read everything that Claude did.
Absolutely nothing in the key part of my solution uses a single public API (and there are thousands). And you think that Claude can just "figure that out" when my HN comments get fed back in during training?
I sincerely wish we'd see less /r/technology ridiculousness on HN.
I wonder how many 'ideas guys' will now think that with LLMs they can keep their precious to themselves while at the same time bragging about it in online fora. Before, they needed those pesky programmers negotiating for a slice of the pie, but this time it will be different.
Next up: copyright protection and/or patents on prompts. Mark my words.
I'm pretty sure a large fraction of the vibecoded stuff out there is from the "ideas guys." This time will be different because they'll find out very quickly whether their ideas are worth anything. The term "slop" substantially applies to the ideas themselves.
I don't think there will be copyright or patents on prompts per se, but I do think patents will become a lot more popular. With AI rewriting entire projects and products from scratch, copyright for software is meaningless, so patents are one of the very few moats left. Probably the only moat for the little guys.
> 167,383 * 426,397 = 71,371,609,051 ... You need to say why it can do some novel tasks but could never do others.
Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is because it saw many multiplication problems in its training data, and it was cheaper to learn the compressed/abstract multiplication strategies and encode them as circuits in the network, rather than memorize the times tables up to some large N. This gives it the ability to approximate multiplication problems it hasn't seen before.
> This gives it the ability to approximate multiplication problems it hasn't seen before.
More than approximate. It straight up knows the algorithms and will do arbitrarily long multiplications correctly. (Within reason. Obviously it couldn't do a multiplication so large the reasoning tokens would exceed its context window.)
Having ChatGPT 5.4 do 1566168165163321561 * 115616131811365737 without tools, after multiplying out a lot of coefficients, it eventually answered 181074305022287409585376614708755457, which is correct.
At this point, it's less misleading to say it knows the algorithm.
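For reference, the grade-school algorithm those reasoning traces walk through is tiny once written out; a rough Python rendering:

    # Grade-school long multiplication: one partial product per digit, then a sum.
    def long_multiply(a: int, b: int) -> int:
        total = 0
        for place, d in enumerate(str(b)[::-1]):   # least significant digit first
            total += a * int(d) * 10 ** place
        return total

    assert long_multiply(167_383, 426_397) == 71_371_609_051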
Claude, OpenAI, etc.'s AIs are not just LLMs. If you ask it to multiply something, it's going to call a math library. Go feed it a thousand arithmetic problems and it'll get them 100% right.
The major AIs are a lot more than just LLMs. They have access to all sorts of systems they can call on. They can write code and execute it to get answers. Etc.
My take as well. Furthermore, most innovations come relatively shortly after their technological prerequisites have been met, so that suggests the "novelty space" that humans generally explore is a relatively narrow band around the current frontier. Just as humans can search through this space, so too should machines be capable of it. It's not an infinitely unbounded search which humans are guided through by some manner of mystic soul or other supernatural forces.
Most created things are remixes of existing things.
Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.
I think "novel" is ill defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.
It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.
A single math problem being solved just isn't rising to that level of evidence for me. I think it is more on you to:
1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).
2. Provide evidence that supports your theory and ideally cannot be just as well accounted for by another theory.
I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previous published works, potentially informal comments on some forum, etc, that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study - the authors of this don't even have access to the training data, we'd need quite a bit more than this to form assumptions.
I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that other forms of reasoning can't reduce to statistical token generation, it just strikes me as unintuitive and so I'm going to need to hear something to compel me.
That's exactly my point. When people say "LLMs will never do something novel," they seem to be leaning on some vague, ill-defined notion of novelty. The burden of proof is then to specify what degree of novelty is unattainable and why.
As for evidence that they can do novel things, there is plenty:
1. I really did ask Gemini to multiply 167,383 * 426,397 before posting this question. It answered correctly.
2. SVGs of pelicans riding bicycles
3. People use LLMs to write new apps/code every day
4. LLMs have achieved gold-medal performance on Math Olympiad problems that were not publicly available
5. LLMs have solved open problems in physics and mathematics [0,1]
That is as far as they have advanced so far. What's next? Where is the limit? All I want to say is that I don't know, and neither do you :).
This is great observational data but it's an early "step 1", I'd definitely need to see an actual analysis of these cases and likely want to have that analysis involve a review of relevant training data.
The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?
Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
> The “good deal of evidence” is everywhere. The proof is in the pudding.
I'm open! Please, by all means.
> the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark?
The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.
> Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means.
Okay.
> That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training).
This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.
As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?
> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventative analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.
So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?
> The “good deal of evidence” is everywhere. The proof is in the pudding.
>> I'm open! Please, by all means.
Sure, here are but a few:
[1] you get smooth gains in reasoning with more RL train-time compute and more test-time compute (o1)
[2] DeepSeek-R1 showed that RL on verifiable rewards produces behavior like backtracking, adaptation, reflection, etc.
[3] SWE-Bench is a relatively decent benchmark and perf here is continually improving — these are real GitHub issues in real repos
[4] MathArena — still good perf on uncontaminated 2025 AIME problems
[5] the entire field of reinforcement learning, plus successes in other fields with verifiable domains (e.g. AlphaGo); Bellman updates will give you optimal policies eventually
[6] Anthropic's cool work looking effectively at the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/met... — if you trace internal circuits in Haiku 3.5 you see what you expect from a real reasoning system: planning ahead, using intermediate concepts, operating in a conceptual latent space (above tokens). And that's Haiku 3.5!!! We’re on Opus 4.6 now…
People like to move the goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.
> The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.
Appeals to authority are a fine prior, but lo and behold I also have a PhD and have worked on and led benchmark development professionally for several years at an AI lab. That’s ultimately no reason to really trust either of us. As I said, the blog post rightfully decries benchmarks but it then presents a new benchmark as though that isn’t subject to all of the same problems. It’s a good article! I think they do a good job here! I agree with all of their complaints about benchmarks! It rightfully identifies failure modes, and there are plenty of other papers pointing out similar failure modes. Reasoning is still brittle, lots of areas where LLMs/agentic systems fail in ways that are incredible given their talent in other areas. But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”. This is just not true, but it is true that they are brittle and fallible in weird ways, today.
> This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.
"They do well, therefore intelligence" is not an argument, sure. But that’s also not what I’m saying. The Occam’s razor here is that reasoning-like computation is the best explanation for an increasing amount of the observed behavior, especially in fresh math and real software tasks where memorization is a much worse fit.
> As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?
I would encourage you to read Kuhn’s Structure of Scientific Revolutions. "That’s how science is done" is a bit of an oversimplification of how the sausage is made here. Real science moves forward in a messy mix of partial theory + better measurements + interventions long before anyone has some sort of grand unified framework. Neuroscience is no different here. And I would say at this point with LLMs we now do have pretty decent tests: fresh verifiable-task evals, mechanistic circuit tracing, causal activation patching, and scaling results for RL/test-time compute. The claim that there is no framework + no real tests is just not true anymore. It’s not like we have some finished theory of reasoning, but that's a bit of an unfair demand at this point and is asymmetrical as well.
> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
>> Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventative analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.
>> So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?
The model is: reasoning is not inherently human, it’s mathematical. It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.
What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventative analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above? Also I’m confused by your last part. You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd, but why? You start from the position that brains are intelligent, so why is this absurd within your argument? Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?
Thanks, this is great and I'll have quite a bit to read here.
> people like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people don’t look at robust trends in performance improvement, combined with verifiable RL rewards, and can’t understand where things are going.
I don't think it's goal post moving to acknowledge improvements but still reject the conclusion that AI has reached a specific milestone if those improvements don't justify the position. I doubt anyone sensible is rejecting improvements.
> But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”.
I don't think I've ever made any definitive claims at all, quite the contrary - I've tried to express exactly how open I am to what you're saying. As I've said, I'm a functionalist, and I already am largely supportive of reductive intelligence, so I'm exactly the type of person who would be sympathetic to what you're saying.
> "That’s how science is done" is a bit of an oversimplification
Of course, but I don't think it's too much to ask to have a theory and evidence. I don't need a lined-up series of papers that all start with perfect syllogisms and then map to well-controlled RCTs or whatever. Just an "I think this accounts for it, here's how I support that".
> The claim that there is no framework + no real tests is just not true anymore.
I didn't say it wasn't true, to be clear, I asked for it. Again, I'm sympathetic to the view at a glance so I simply need a way to reason about it.
No need for a complete view, I'd never expect such a thing.
> The model is: reasoning is not inherently human, it’s mathematical.
Well, hand wringing perhaps, but I'd say it's maybe mathematical, computational, structural, functional, whatever - I think we're on the same page here regardless.
> It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.
Sure, but I grant that, in fact I believe it entirely. But that doesn't mean that every mathematical construct exhibits the function of intelligence.
> What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventative analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above?
Sorry, I'm not fully understanding this framing. We can do those things with LLMs, and it's hard to say what would satisfy me. In general, I'd be satisfied with a theory that (a) accounts for the data, (b) has supporting evidence, and (c) does not contradict any major prior commitments. I don't think (c) will be an issue here.
> You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd,
Because intelligence could have been a property of our brains being wet, or roundish, or it could have been a property of our spines, or maybe some force we hadn't discovered, or a soul, etc. We formed a theory, it accounted for observations, we performed tests, we've modeled things, etc., and so the theories we've adopted have been extremely successful and I think hold up quite well. But certainly we didn't go "the brain has electricity, the brain is intelligent, therefore electricity in the brain is what drives intelligence".
> Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?
I'm very happy to say calculators are far better than me at calculations (to a given precision). I'm happy to admit computers are so much better than me in so many aspects. And I have no problem saying LLMs are very helpful tools able to generate output so much better than mine in almost every field of knowledge.
Yet, whenever I ask it to do something novel or creative, it falls very short. But humans are ingenious beasts and I'm sure that sooner or later they will design an architecture able to be creative - I just doubt it will be Transformer-based, given the results so far.
But the question isn't whether you can get LLMs to do something novel, it's whether anyone can get them to do something novel. Apparently someone can, and the fact that you can't doesn't mean LLMs aren't good for that.
When it comes to LLMs doing novel things, is it just the infinite monkey theorem[0] playing out at an accelerated rate, helped along by the key presses not being truly random?
Surely if we tell the LLM to do enough stuff, something will look novel, but how much confirmation bias is at play? Tens of millions of people are using AI and the biggest complaint is hallucinations. From the LLM's perspective, is there any difference between a novel solution and a hallucination, other than the dumb luck of the hallucination being right?
This argument doesn't go the way you want it to go. Billions of people exist, but maybe a few tens of thousands produce novel knowledge. That's a much worse rate than LLMs.
I’m not sure how we equate the number of humans to AI to determine a success rate.
We also can’t ignore that it was humans who thought up this problem to give to the AI. Thinking has two parts, asking and answering questions. The AI needed the human to formulate and ask the question to start. AI isn’t just dropping random discoveries on us that we haven’t even thought of, at least not that I’ve seen.
To have a proper discussion we would have to define the word "novel", and that's a challenge in itself. In any case, millions of people have tried to ask LLMs to do something creative and the results were bland. Hence my conclusion that LLMs aren't good for that. But I'm also open to the idea that they can be an element of a longer chain that could demonstrate some creativity - we'll see.
Novel is a tricky word. In this case, the LLM produced a python program that was similar to other programs in its corpus, and this python program generated examples of hypergraphs that hadn't been seen before.
That's a new result, but I don't know about novel. The technique was the same as earlier work in this vein. And it seems like not much computational power was needed at all. (The article mentions that an undergrad left a laptop running overnight to produce one of the previous results, that's absolute peanuts when compared to most computational research).
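To make "the technique was the same as earlier work" concrete: the general shape is just generate-and-check. A deliberately toy sketch of that pattern (nothing to do with the article's actual hypergraph construction; the property checked here is a placeholder):

    import random

    # Cartoon of "generate candidates with a short program, keep what a checker accepts".
    def random_hypergraph(n_vertices: int, n_edges: int):
        return [frozenset(random.sample(range(n_vertices), 3)) for _ in range(n_edges)]

    def interesting(hg) -> bool:
        return len(set(hg)) == len(hg)   # placeholder check: all edges distinct

    hits = [hg for hg in (random_hypergraph(10, 5) for _ in range(1000)) if interesting(hg)]
    print(len(hits), "candidates passed the check")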
If all art is derivative then the earlier statement is a tautology.
People still call things other people do novel. There's clear social proof that humans do things that other humans consider novel.
Otherwise the word would probably not exist.
Just today I wrote a python program that did not resemble anything I'd written before, nor had I seen anything similar.
I had to reason it out myself. That passes the test that the original comment set.
Your threshold for "resemble" is obviously quite high, which is fair, but assuming that you're an encultured programmer your python code resembles other people's python code. It might be doing something novel, but that thing it's doing is interacting with, in response to, or otherwise relative to existing concepts you learned or saw elsewhere. All art is derivative; we can do things other people haven't done before, but all of it derives from the works of others in some way.
Anyway, I've coded all kinds of wacky shit with claude that I guarantee nobody has implemented before, if only because they're stupid and tedious ideas. They can't all be winners, but they were novel, and yet claude code implemented them as confidently as if they were yet another note taking app. They have no problem handling novel ideas, and although the novel ideas in this case were my own, its easy to see how finding new ideas could be automated by exploring the combinatorial space of existing ideas.
This is objectively wrong. If that were the case, every scientist performing a test would always have had their expectations and beliefs proven true. If you were trying to disprove something because you believed it to be wrong, you would never be proven wrong.
>> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
I always found this argument very weak. There isn't that much truly new anyway. Creativity is often about mixing old ideas. Computers can do that faster than humans if they have a good framework.
Especially with something as simple as math - limited set of formal rules and easy to verify results - I find a belief computers won't beat humans at it to be very naive.
The major AIs have access to all sorts of tools, including a math library. I thought this was well-known. There's no "illusion of actual insight" - they're just "using a calculator" (in the sense that they call a math library when needed). AIs are not just LLMs.
It's not 'magic', but previously LLMs have performed very badly on longer multiplication. 'Insight' is the wrong word; I'm saying maybe they're not wildly better at this calculation... maybe they are just optimising these well-known jagged edges.
When I read through what they're doing, it sure doesn't sound like it's generating something new as people typically think of it. In the link, they provide a very well-defined problem and just loop through it.
I guess when it can't be tripped up by simple things like multiplying numbers, counting to 100 sequentially or counting letters in a string without writing a python program, then I might believe it.
Also no matter how many math problems it solves it still gets lost in a codebase
LLMs are bad at arithmetic and counting by design. It's an intentional tradeoff that makes them better at language and reasoning tasks.
If anybody really wanted a model that could multiply and count letters in words, they could just train one with a tokenizer and training data suited to those tasks. And the model would then be able to count letters, but it would be bad at things like translation and programming - the stuff people actually use LLMs for. So people train with a tokenizer and training data suited to language tasks, and hence LLMs are good at language and bad at arithmetic.
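A hedged illustration of why long multiplication is awkward for a token-based model (assuming the tiktoken package is available; exact splits vary by tokenizer and model): the digits never arrive one at a time.

```python
# Token splits shown here depend on the tokenizer; the point is only that
# numbers are chunked, not seen digit-by-digit.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("123456 * 654321")
print([enc.decode([t]) for t in tokens])  # multi-digit chunks, not single digits
```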
Arguments like "but AI cannot reliably multiply numbers" fundamentally misunderstand how AI works. AI cannot do basic math not because AI is stupid, but because basic math is an inherently difficult task for otherwise smart AI. Lots of human adults can do complex abstract thinking but when you ask them to count it's "one... two... three... five... wait I got lost".
Who does fundamentally understand how LLMs work? Many claims flying around these days, all backed by some of the largest investments ever collectively made by humans. Lots of money to be lost because of fundamental misunderstandings.
Personally, I find that AI influencers conveniently brush away any evidence (like inability to perform basic arithmetic) about how LLMs fundamentally work as something that should be ignored in favor of results like TFA.
Do LLMs have utility? Undoubtedly. But it’s a giant red flag for me that their fundamental limitations, of which there are many, are verboten to be spoken about.
You're not doing yourself a favor when you point out "but they can't do arithmetic!" as if anyone says otherwise. Yes, we all know they can't do arithmetic, and that's just how they work.
I feel like I'm saying "this hammer is so cool, it's made driving nails a breeze" and people go "but it can't screw screws in! Why won't anyone talk about that! Hammers really aren't all they're cracked up to be".
Maybe because society has invested $trillions into this hammer and influencers are trying to convince CEOs to fire everyone and buy a bunch of hammers instead.
My comment even said “LLMs have utility”. I gave an inch, and now the mile must be taken.
Saying that the fundamental limitations are things like counting the number of rs in strawberry is boring, though. That's how tokens work and it's trivial to work around.
Talking about how they find it hard to say they aren't sure of something is a much more interesting limitation to talk about, for example.
> Talking about how they find it hard to say they aren't sure of something is a much more interesting limitation to talk about, for example.
Sure, thank you for steelmanning my argument. I didn’t think I needed to actually spell out all of the fundamental limitations of LLMs in this specific thread. They are spoken at length across the web, but are often met with pushback, which was my entire point.
Here’s another one: LLMs do not have a memory property. Shut off the power and turn it back on and you lose all context. Any “memory” feature implemented by companies that sell LLM wrappers are a hack on top of how LLMs work, like seeding a context window before letting the user interact with the LLM.
But that's also like saying "humans don't have a memory property, any 'memory' is in the hippocampus". It's not useful to say that "an LLM you don't bother to keep training has no memory". Of course it doesn't, you removed its ability to form new memories!
So why then do we stop training LLMs and keep them stored at a specific state? Is it perhaps because the results become terrible and LLMs have a delicate optimal state for general use? This sounds like an even worse case for a model of intelligence.
Not entirely a straw man. What is the purpose of storing and retrieving LLMs at a fixed state if not to guarantee a specific performance? Wouldn’t a strong model of intelligence be capable of, to extend your analogy, running without having its hippocampus lobotomized?
Given the precariousness of managing LLM context windows, I don’t think it’s particularly unfair to assume that LLMs that learn without limit become very unstable.
To steelman, if it’s possible, it may be prohibitively expensive. But somehow I doubt it’s possible.
Ok, I'll bite. Show me an LLM that comes up with a new math operator. Or which will come up with theory of relativity if only Newton physics is in its training dataset. That it could remix existing ideas which leads to novel insights is expected, however the current LLMs can't come up with paradigm shifts that require novel insights. Even humans have a rather limited time they can come up with novel insights (when they are young, capable of latent thinking, not yet ossified from the existing formalization of science and their brain is still energetically capable without vascular and mitochondrial dysfunction common as we age).
The point is that humans do have some edge compared to current LLMs which are essentially next token predictors. If we all start relying on current AI and stop thinking, we would only be able to "exhaust the remix space" of existing ideas but won't be able to do any paradigm jumps. Moreover, it's quite likely that current training sets are self-contradictory, containing Dutch books, carrying some innate error in them.
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Just an interesting thought experiment: if you took all the sensory information that a child experiences through their senses (sight, hearing, smell, touch, taste) between, say, birth and age five, how many books' worth of data would that be? I asked Claude, and their estimate was about 200 million books. Maybe that number is off by an order of magnitude in either direction. ...but then again Claude is only three years old, not five.
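A back-of-envelope version of that estimate, with every constant loudly assumed, lands in the same ballpark:

```python
# All numbers below are rough assumptions, good to an order of magnitude at best.
seconds_awake = 5 * 365 * 12 * 3600       # ~5 years of ~12 waking hours per day
bits_per_second = 10_000_000              # assumed effective sensory bandwidth
total_bytes = seconds_awake * bits_per_second / 8
bytes_per_book = 1_000_000                # ~1 MB of text per book (assumed)
print(f"{total_bytes / bytes_per_book:.1e} books")  # roughly 1e8, i.e. ~100 million
```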
Last I checked humans didn't pop into existence doing that. It happened after billions of years of brute force, trial and error evolution. So well done for falling into the exact same trap the OP cautions. Intelligence from scratch requires a mind boggling amount of resources, and humans were no different.
To be fair, it is still pretty remarkable what the human brain does, especially in early years - there is no text embedded in the brain, just a crazily efficient mechanism to learn hierarchical systems. As far as I know, AI intelligence cannot do anything similar to this - it generally relies on giga-scaling, or finetuning tasks similar to those it already knows. Regardless of how this arose, or if it's relevant to AGI, this is still a uniqueness of sorts.
Human babies "train" their brain on literally gigabytes of multi-modal data dumped on them through all their sensory organs every second.
In a very real sense, our magic superpower is that we "giga-scale" with such low resource consumption, especially considering how large (in terms of parameters) the brain is compared to even the most advanced models we have running on those thousands of GPUs today. But that's where all those millions of years of evolution pay off. Don't diss the wetware!
How is that relevant? The human brain is at the point of birth (or some time before that). We compare that with an LLM model doing inference. The training part is irrelevant, the same way the human brains' evolution is.
Do you think evolutionary pressures are the best explanation for why humans were able to posit the Poincaré conjecture and solve it? While our mental architecture evolved over a very long time, we still learn from miniscule amounts of data compared to LLMs.
We were optimized to rapidly adapt to changing environments by solving the problems that arise through tool-making and cooperation in complex multi-stage tasks (like say hunting that mammoth to make clothing out of it). It turns out that the cheapest evolutionary pathway to get there has some interesting emergent phenomena.
We have a tremendous amount of raw information flowing through our brains 24/7 from before we are born, from the external world through all our senses and from within our minds as it attempts to make sense of that information, make predictions, generally reason about our existence, hallucinate alternative realities, etc. etc.
If you were able to somehow capture all that information in full detail as you've had access to by the age of say 25, it would likely dwarf the amount of information in millions of books by several orders of magnitude.
When you are 25 years old and are presented a strange looking ball and told to throw it into a strange looking basket for the first time. You are relying on an unfathomable amount of information turned into knowledge and countless prior experiments that you've accumulated/exercised to that point relating to the way your body and the world works.
Humans are "multi-modal". Sure we get plenty of non-textual information, but LLMs were trained on basically every human-written word ever. They definitely see many orders of magnitude more language than any human has ever seen. And yet humans get fluent after 3+ years.
If you treat the human brain as a model, and account for the full complexity of neurons (one neuron != one parameter!) it has several orders of magnitude more parameters than any LLM we've made to date, so it shouldn't come as a surprise.
What is surprising is that our brain, as complex as it is, can train so fast on such a meager energy budget.
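A rough version of that parameter comparison, with assumed figures (the neuron and synapse counts are commonly cited estimates; the LLM figure is a guess at current frontier scale):

```python
# Order-of-magnitude sketch; every figure here is an assumption.
neurons = 86e9                  # commonly cited estimate for the human brain
synapses_per_neuron = 1_000     # order-of-magnitude assumption
synapses = neurons * synapses_per_neuron     # ~1e14 connections
llm_params = 1e12               # rough scale of the largest current models (assumed)
print(synapses / llm_params)    # ~100x, before counting per-synapse complexity
```

If each synapse is worth more than one scalar parameter, as the comment above suggests, the gap widens by further orders of magnitude.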
For sure, it seems like there's something there primed to pick up human language quickly, clearly evolutionarily driven.
Not necessarily so for the dynamics of magnetic fields, or nonhuman animal communications, or dark energy/matter.
We are bombarded nonstop by magnetic fields, nonhuman animal communications, and live in a universe which seems to be majority dominated by dark energy and matter, and yet understand little to none of it all.
To be fair, the knowledge embedded in an LLM is also, at this point, a couple orders of magnitude (at least) larger than what the average human being can retain. So it's not like all those books and text in the internet are used just to bring them to our level, they go way beyond.
It's only because humans came up with a problem, worked with the ai and verified the result that this achievement means anything at all. An ai "checking its own work" is practically irrelevant when they all seem to go back and forth on whether you need the car at the carwash to wash the car. Undoubtedly people have been passing this set of problems to ai's for months or years and have gotten back either incorrect results or results they didn't understand, but either way, a human confirmation is required. Ai hasn't presented any novel problems, other than the multitudes of social problems described elsewhere. Ai doesn't pursue its own goals and wouldn't know whether they've "actually been achieved".
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference and so far we have a couple minor (trivial?) math solutions. I'm sure if someone had bothered funding a few phds for a year we could have found this without ai.
Funding a few PhDs for a year costs orders of magnitude more than it did to solve this problem in inference costs. Also, this has been active research for some time. Or I guess the people working on it are just not as good as a random bunch of students? It's amazing the lengths that people go to maintain their worldview, even if it means belittling hardworking people.
I take it you're not a mathematician. This is an achievement, regardless of whether you like LLMs or not, so let's not belittle the people working on these kinds of problems please.
>It's amazing the lengths that people go to maintain their worldview, even if it means belittling hardworking people.
This is one of the most baffling and ironic aspects of these discussions. Human exceptionalism is what drives these arguments but the machines are becoming so good you can no longer do this without putting down even the top percenter humans in the process. Same thing happening all over this thread (https://news.ycombinator.com/item?id=47006594). And it's like they don't even realize it.
How many math PhD students do you have? If you set the problem right, something like this per year on average is a good pace.
How are they cheaper? Your average grant where I am can pay for a couple of PhD students. I could afford to pay for inference costs out of my own salary, no grant needed. Completely different economic scales here. I like students better of course, but funding is drying up these days.
I was saying generally. I don't work in maths. PhD students do lots of other things than research. If we ask a PhD student to just solve these kinds of problems and nothing else, the student would do it without much difficulty.
I guess it's different somewhere like Europe. But in Canada, most of the PhD students are paid for doing TAships, not primarily through grants. Average salary is 25k/year. Take 6-10k out for tuition, that's 15-19k/year. You get a student doing so many things for less pay. I guess, if your job only requires research then you can do it.
Inference costs are heavily subsidised. My point was that we've spent trillions collectively on ai, and so far we have a few new proofs. It's been active research, but it's estimated only 5-10 people are even aware that it is a problem. I wrote "math phd's" not "random students", but regardless, I don't know how you interpreted my statement that people could have discovered this without ai as "belittling the people working on this". You seem like a stupid person with an out of control chatbot that can't comprehend basic arguments.
And now you're belittling me. Yeah, good one, that'll convince people.
> out of control chatbot that can't comprehend basic arguments
I don't see how it is out of control. It is a tool. It is being used for a job. For low-level jobs it often succeeds. For tougher jobs, it is succeeding sufficiently often to be interesting. I don't care if it understands worldview semantics, that's for humans to do.
> we've spent trillions collectively on ai
The economics around AI do not suggest that continuing to perform large training runs is sustainable. That's also not relevant to the discussion. Once the training is done, further costs are purely on inference, and that is the comparison I was making.
> Inference costs are heavily subsidised
Even if you pay to run inference on your own hardware, economics of scale dictate that it is still cheaper than students.
> It's been active research but the problem estimates only 5-10 people are even aware that it is a problem.
That sounds about right for most pure math problems. Were you expecting more?
Let's not pretend that society would have invested that kind of money into pure mathematics research. It is extraordinarily difficult to get funding for that kind of work in most parts of the world. Mathematicians are relatively cheap, yes, but the money coming into AI was from blind VCs with a sense of grandeur. It wasn't to do maths research. If it's here anyway, and causing nightmares for actually teaching new students, may as well try to make some good of it. It has only recently crossed the edge of being useful. Most researchers I know are only now starting to consider it, mostly as a search engine, but some for proof assistance. Experiences a year ago were highly negative. They're a lot more positive now.
I'm trying to give a perspective from someone who actually does do math research at a senior level, who actually does have a half dozen math PhD students to supervise, to say that your blind attitude toward this is not sensible or helpful. Your comments about the problem being trivial do belittle the actual effort people have put into the problem without success. If they could easily have discovered this without AI, they would have already done so. Researchers do not have unlimited time and there are many more problems than students, especially good ones (hence my random comment).
From various online estimates, I would estimate global ai spend just since 2020 at $2T. Some projections estimate that we might spend that per year starting next year. To the extent that many of these projects will be cancelled or shelved, capital is beginning to take stock of the feasibility of clawing back even the original investments. openai is apparently doubling its staff, but whether these are sales or (prompt?) engineering jobs, the biggest hypemongers are themselves unable to reduce headcount even with unlimited "at-cost" ai inference.
Comparing total ai spend to the value added of producing a few new maths/sciences proofs is unfair since ai is doing more than maths proofs, but for comparison one can estimate the total spent to date on mathematicians and associated costs (buildings, experiments etc). I would very roughly estimate that the total cost of all mathematics to date since 1600 is less than what we've spent on ai to date, and the results from investment in mathematicians are incomparable to a few derivative extensions of well-established ideas. For less than a few trillion we have all of mathematics. For an additional 2T dollars, we have trivial advancements that no one really cares about.
The only things moving faster than AI are the goalposts in conversations like this. Now we're at "sure, AI can solve novel problems, but it can't come up with the problems themselves on its own!"
I'm curious to see what the next goalpost position is.
> I'm curious to see what the next goalpost position is.
I am as well. That's the point. Ai can do some things well and other things better than humans, but so can a garden hose and all technology. Is ai just a tool or is it the future of all work? By setting goalposts we can see whether or not it is living up to the hype that we're collectively spending trillions on.
The garden hose manufacturers aren't claiming that they're going to replace all human workers, so we don't set those kinds of goalposts to measure whether it's doing that.
> I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
Because, empirically, we have numerous unique and distinguishing qualities, obviously. Plenty of time goes into understanding this, we have a young but rigorous field of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
>But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
You can think whatever you want, but an untestable distinction is an imaginary one.
First of all, that's not true. Not every position has to be empirically justified. I can reason about a position in all sorts of ways without testing. Here's an obvious example that requires no test at all:
1. Functional properties seem to arise from structural properties
2. Brains and LLMs have radically different structural properties
3. Two constructs with radically, fundamentally different structural properties are less likely to have identical functional properties
Therefore, my confidence in the belief that brains and LLMs should have identical functional properties is lowered by some amount, perhaps even just ever so slightly.
Not something I feel like fleshing out or defending, just an example of how I could reason about a position without testing it.
No, but it does mean that you should know we don't understand what intelligence is, and that maybe LLMs are actually intelligent and humans have the appearance of intelligence, for all we know.
You're just defining intelligence as "undefined", which okay, now anything is anything. What is the point of that?
Indeed, there's quite a lot of work that's been done on what these terms mean. The fields of neuroscience and cognitive science have contributed a lot to the area, and obviously there are major areas of philosophy that discuss how we should frame the conversation or seek to answer questions.
We have more than enough, trivially, to say that human intelligence is distinct, so long as we take on basic assertions like "intelligence is related to brain structures" since we know a lot about brain structures.
Our intelligence is related to brain structures, not all intelligence. You can't get to things like "what all intelligence, in general, is" from "what our intelligence is" any more than you can say that all food must necessarily be meat because sausages exist.
But... we're talking about our intelligence. So obviously it's quite relevant. I didn't say that AI isn't intelligent, I said that we have good reason to believe that our intelligence is unique. And we do, a lot of good evidence.
I obviously don't believe that all intelligence is related to specific brain structure. Again, I'm a functionalist, so I believe that any structure that can exhibit the necessary functions would be equivalent in regards to intelligence.
None of this would commit me to (a) human exceptionalism (b) LLMs/ Agents being intelligent (c) LLMs/ Agents being intelligent in the way that humans are.
This is too dependent on what you mean by "unique", though. What do we have that apes don't, and which directly enables intelligence? What do we have that LLMs don't? What do LLMs have that we don't?
I don't think we know enough to definitively say "it's this bit that gives us intelligence, and there's no way to have intelligence without it". We just see what we have, and what animals lack, and we say "well it's probably some of these things maybe".
> What do we have that apes don't, and which directly enables intelligence?
Again, there are multiple fields of study with tons of amazingly detailed answers to this. We know about specific proteins, specific brain structures, we know about specific cognitive capabilities in the abstract, etc.
> What do we have that LLMs don't?
Again, quite a lot is already known about this.
This feels a bit like you're starting to explore this area and you're realizing that intelligence is complex, but you may not realize that others have already been doing this work and we have a litany of information on the topic. There are big open questions, of course, but we're definitely past the point of being able to say "there is a difference between human and ape intelligence" etc.
It'd probably be more productive for you to actually back up your claims with these things we know from neuroscience, rather than just stating that we know things, and so therefore you're right. What do we know?
EDIT: can't reply, so I'll just update here:
You're arguing that the mechanism that produces human intelligence is unique, so therefore the intelligence itself is somehow fundamentally different from the intelligence an LLM can produce. You haven't shown that, you just keep saying we know it's true. How do we know?
I don't need to do that unless you think that neurons interact exactly the way that LLMs do? That said, we have detailed, microscopic models of neurons, the ability to even simulate brain activity, intervention studies where we can make predictions, interact with brains in various ways, and then validate against predictions, we have cognitive benchmarks that we can apply to different animals or animals in different stages of development that we can then tie to specific brain states and brain development, etc.
So we're in a very good position to say quite a lot about the brain, an incredible amount really. And that puts us in a very good position to say that our brain is very different from other animal brains, and certainly in a very good position to say that's very different from an LLM.
Now, you can argue that an LLM is functionally equivalent to the brain, but given that it's so structurally distinct, and seemingly functions in a radically different way due to the nature of that structure, I'd put it on you to draw symmetries and provide evidence of that symmetry.
I'm following this mini-thread with interest but I've arrived here and I confess, I don't really know what your argument is.
I think this all stems from you objecting to this statement:
"I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique."
I think you're being uncharitable in how you interpret that. Humans are unique in the most literal reading of this sentence, we don't have anything else like humans. But the context is the ability to reason and people denying that a machine is reasoning, even though it looks like reasoning.
They're shocked that people believe that humans are unique. I explained why that shouldn't be shocking. I think I was pretty charitable here, I gave an alternative option for what they could mean in my very first reply:
> Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> I don't really know what your argument is.
I just said that I think that we have very good reasons for believing that human cognition is unique. The response was seemingly that we don't have enough of an understanding of intelligence to make that judgment. I've stated that I think we do have enough of an understanding of intelligence to make that judgment, and I've appealed to the many advances in relevant fields.
I'm open to hearing how you think I should be interpreting things. I don't really think I'm being too literal, it certainly hasn't been the case that they've suggested my interpretation is wrong, and I've provided two interpretations (one that I totally grant).
What's the better interpretation of their position?
It doesn't. I actually completely reject that theory, and it's nice to see that Wikipedia notes that it is "controversial". There are extremely good reasons to reject this theory. For one thing, any quantum effects are going to be quite tiny/ trivial because the brain is too large, hot, wet, etc, to see larger effects, so you have to somehow make a leap to "tiny effects that last for no time at all" to "this matters fundamentally in some massive way".
It likely requires rejection of functionalism, or the acceptance that quantum states are required for certain functions. Both of those are heavy commitments with the latter implying that there are either functions that require structures that can't be instantiated without quantum effects or functions that can't be emulated without quantum effects, both of which seem extremely unlikely to me.
Probably for the far more important reason, it doesn't solve any problem. It's just "quantum woo, therefore libertarian free will" most of the time.
It's mostly garbage, maybe a tiny tiny bit of interesting stuff in there.
It also would do nothing to indicate that human intelligence is unique.
Every living thing on Earth is unique. Every rock is unique in virtually infinite ways from the next otherwise identical rock.
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
Most ways in which things are unique are arguably uninteresting.
The default mode, the null hypothesis should be to assume that human intelligence isn't interestingly unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
Humans are obviously unique in an interesting way. People only "move the goalpost" because it's not an interesting question that humans can do some great stuff, the interesting question is where the boundary is. (Whether against animals or AI).
Some example goals which makes human trivially superior (in terms of intelligence): invention of nuclear bomb/plants, theory of relativity, etc.
But that's unique in the sense of "you have a bag of ten apples and I have a bag of eleven apples, therefore my bag is unique". It's not qualitatively different intelligence than a dog's, you just have more of it.
I would argue that point. The biological components are the same, but emergent behavior is a thing. So both the scale and the number of connections/way they connect have surpassed some limit after which cognitive capabilities increased severalfold to the point that humans "took over the world".
And arguably further increase in intelligence seems to fall into a diminishing returns category, compared to this previous boom. (Someone being "2x smarter" doesn't give them enough benefit of reigning over others, at least history would look otherwise were it the case, in my opinion)
Probably dumb example, but just by increasing speed you get well-behaving laminar flow vs turbulence, yet it's fundamentally the same a level beneath.
Yeah, I don't know that there's such a jump. Dogs, for example, clearly communicate, both with us and with each other. They don't have language, but they also don't lack communication skills. To me, language is just "better communication" rather than a qualitatively different thing.
Human language is way above what communication animals show. We don't really know where the exact boundary is, but again, the difference is significant and not just "scaled up".
I doubt you can even define intelligence sufficiently to argue this point. Since that's an ongoing debate without a resolution thus far.
But you claimed that humans aren't unique. I think it's pretty obvious we are on many dimensions including what you might classify as "intelligence". You don't even necessarily have to believe in a "soul" or something like that, although many people do. The capabilities of a human far surpass every single AI to date, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
> There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Do you ever wonder why that is? I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
> I doubt you can even define intelligence sufficiently to argue this point.
Agreed.
> But you claimed that humans aren't unique.
I'm arguing that it is up to us to prove that they are interestingly unique in the context of this post. Which is pretty narrow - how do we solve problems?
The theme I was arguing against that I've seen repeated throughout this thread is that AIs are just recombining things they've absorbed and throwing those recombinations at the wall until they see what sticks.
It raises the question of why we presume that humans do things any differently, when it seems quite clear that we can only ever possibly do the same, unless we are claiming that knowledge of the universe can enter the human mind through some means other than through the known senses.
Not at all disputing that humans possess many capabilities that AIs do not.
> Do you ever wonder why that is? I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
I touched on this elsewhere, will go ahead and paste it here again:
The fundamental thing I'm speaking out against is the arrogance of human exceptionalism.
This whole debate about what it means to be intelligent or human just seems like we're making the same mistakes we've made over and over.
Earth as the center of the universe, sun as the center of the universe, man as the only animal with consciousness and intellect, the anthropomorphic nature of the majority of the deities in our religions and the anthropocentric purpose of the universe within those religions...
I think this desire to believe that we are special, that the universe in some way does ultimately revolve around us, is seemingly a deep need in our psyche but any material analysis of our universe shows that it is extremely unlikely that we hold that position.
>, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
Human intelligence was brute forced. Please let's all stop pretending like those billions of years of evolution don't count and we poofed into existence. And you can keep parroting 'simulacrum of intelligence' all you want but that isn't going to make it any more true.
> The capabilities of a human far surpass every single AI to date
Meaning however you (reasonably) define intelligence, if you compare humans to any AI system humans are overwhelmingly more capable. Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence. Or else we'd be talking about how my calculator is intelligent. Of course computers can compute faster than we can, that's beside the point.
> Human intelligence was brute forced.
No, I don't mean how the intelligence evolved or was created. But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional. Evolution is not an intentional process, unless you believe in God or a creator of sorts, which is totally fair but probably not what you were intending.
But my point is that LLM's essentially arrive at answers by brute force through search. Go look at what a reasoning model does to count the letters in a sentence, or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on 20% of a lightbulb!).
> But my point is that LLM's essentially arrive at answers by brute force through search.
If "brute force" worked for this, we wouldn't have needed LLMs; a bunch of nested for-loops can brute force anything.
The reason why LLMs are clearly "magic" in ways similar to our own intelligence (which we very much don't understand either) is precisely because it can actually arrive at an answer without brute force, which is computationally prohibitive for most non-trivial problems anyway. Even if the LLM takes several hours spinning in a reasoning loop, those millions of tokens still represent a minuscule part of the total possible solution space.
And yes, we're obviously more efficient and smarter. The smarter part should come as no surprise given that our brains have vastly more "parameters". The efficient part is definitely remarkable, but completely orthogonal to the question of whether the phenomenon exhibited is fundamentally the same or not.
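A quick bit of arithmetic, with assumed round numbers, shows why "a bunch of nested for-loops" is not a live option here:

```python
# Assumed, round numbers; the conclusion is insensitive to the exact values.
vocab = 100_000            # rough vocabulary size
length = 1_000             # tokens in one modest reasoning trace
possible_outputs = vocab ** length
print(len(str(possible_outputs)))   # ~5001 digits: no loop enumerates that
tokens_generated = 10_000_000       # a long, multi-hour reasoning run (assumed)
# Even millions of sampled tokens touch a vanishing fraction of that space.
```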
>Meaning however you (reasonably) define intelligence, if you compare humans to any AI system humans are overwhelmingly more capable.
Really ? Every Human ? Are you sure ? because I certainly wouldn't ask just any human for the things I use these models for, and I use them for a lot of things. So, to me the idea that all humans are 'overwhelmingly more capable' is blatantly false.
>Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence.
What was achieved here or in the link I sent is not just "solving a math equation".
>Or else we'd be talking about how my calculator is intelligent.
If you said that humans are overwhelmingly more capable than calculators in arithmetic, well I'd tell you you were talking nonsense.
>Of course computers can compute faster than we can, that's aside the point.
I never said anything about speed. You are not making any significant point here lol
>No, I don't mean how the intelligence evolved or was created.
Well then what are you saying ? Because the only brute-forced aspect of LLM intelligence is its creation. If you do not mean that then just drop the point.
>But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional.
First of all, this makes no sense sorry. Evolution is regularly described as a brute force process by atheist and religious scientists alike.
Second, I don't have any problem with people thinking we have a creator, although even that doesn't necessarily mean a magic 'poof into existence' either.
>But my point is that LLM's essentially arrive at answers by brute force through search.
Sorry but that's just not remotely true. This is so untrue I honestly don't know what to tell you. This very post, with the transcript available is an example of how untrue it is.
>or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on %20 of a lightbulb!).
Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
Yes, in many ways absolutely. Just because a model is a better "Google" than my dummy friend doesn't mean that this same friend is more capable at countless cases.
> Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
Isn't that just more proof of how efficient the human brain is? Especially given that a wire has much better properties than water solutions in bags.
>Just because a model is a better "Google" than my dummy friend doesn't mean that this same friend is more capable at countless cases.
People use LLMs for a lot of things. 'Better Google' is a tiny slice of that.
>Isn't that just more proof how efficient the human brain is?
Sure. So what ? If a game runs poorly on one hardware and excellently on another, does that mean the game was fundamentally different between the 2 devices ? No, Of course not.
I never said that humans are better than LLM's along every axis. Rather, a reasonable definition of intelligence would necessarily encompass domains that LLM's are either incapable of or inferior to us.
Here might be some definitions of intelligence for example:
> The aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.
> "...the resultant of the process of acquiring, storing in memory, retrieving, combining, comparing, and using in new contexts information and conceptual skills".
> Goal-directed adaptive behavior.
> a system's ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation
But even a housefly possesses levels of intelligence regarding flight and spatial awareness that dominates any LLM. Would it be fair to say a fly is more intelligent than an LLM? It certainly is along a narrow set of axes.
> Because the only brute-forced aspect of LLM intelligence is its creation.
I would consider statistical reasoning systems that can simulate aspects of human thought to be a form of brute force. Not quite an exhaustive search, but massively compressed experience + pattern matching.
But regardless, even if both forms of intelligence arrived via some form of brute force, what is more important to me is the result of that - how does the process of employing our intelligence look.
> This very post, with the transcript available is an example of how untrue it is.
The transcript lacks the vector embeddings of the model's reasoning. It's literally just a summary from the model - not even that really.
> Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
>I never said that humans are better than LLM's along every axis. Rather, a reasonable definition of intelligence would necessarily encompass domains that LLM's are either incapable of or inferior to us.
So all humans are overwhelmingly more intelligent but cannot even manage to be as capable in a significant number of domains ? That's not what overwhelming means.
>I would consider statistical reasoning systems that can simulate aspects of human thought to be a form of brute force.
That is not really what “brute force” means. Pattern learning over a compressed representation of experience is not the same thing as exhaustive search. Calling any statistical method “brute force” just makes the term too vague to be useful.
> what is more important to me is the result of that - how does the process of employing our intelligence look.
But this is exactly where you are smuggling in assumptions. We do not actually understand the internal workings of either the human brain or frontier LLMs at the level needed to make confident claims like this. So a lot of what you are calling “the result” is really just your intuition about what intelligence is supposed to look like.
And I do not think that distinction is as meaningful as you want it to be anyway. Flight is flight. Birds fly and planes fly. A plane is not a “simulacrum of flight” just because it achieves the same end by a different mechanism.
>The transcript lacks the vector embeddings of the model's reasoning. It's literally just a summary from the model - not even that really.
You do not need access to every internal representation to see that the model did not arrive at the answer by brute-forcing all possibilities. The observed behavior is already enough to rule that out.
> Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
>You're so close to getting it lol.
No you don't understand what I'm saying. If we were to be more accurate to the brain in silicon, it would be even less efficient than LLMs, never mind humans. Does that mean how the brain works is wrong ? No it means we are dealing with 2 entirely different substrates and directly comparing efficiencies like that to show one is superior is silly.
> So all humans are overwhelmingly more intelligent but cannot even manage to be as capable in a significant number of domains
When the number of domains in which humans are more capable than LLM's vastly exceeds the number of domains in which LLM's are more capable than humans, yes.
I also agree that we don't have a great understanding of either human or LLM intelligence, but we can at least observe major differences and conclude that there are, in fact, major differences. In the same way we can conclude that both birds and planes have major differences, and saying that "there's nothing unique about birds, look at planes" is just a really weird thing to say.
> If we were to be more accurate to the brain in silicon, it would be even less efficient than LLMs
Do you think perhaps this massive difference points to there being a significant and foundational structural and functional difference between these types of intelligences?
> I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
I think it comes from a position of arrogance/ego. I'll speak for the US here, since that's what I know the most; but the average 'techie' in general skews towards the higher end of the intelligence distribution. This is a very, very broad stroke, and that's intentional to illustrate my point. Because of this, techie culture gains quite a bit of arrogance around it with regards to the masses. And this has been trained into tech culture since childhood. Whether it be adults praising us for being "so smart", or that we "figured out the VCR", or some other random tech problem that literally almost any human being can solve by simply reading the manual.
What I've found, in the vast majority of technical problem solving cases that average people have challenges with, if they just took a few minutes to read a manual they'd be able to solve a lot of it themselves. In short, I don't believe as a very strong techie that I'm "smarter than most", but rather that I've taken the time to dive into a subject area that most other humans do not feel the need nor desire to do so.
There are objectively hard problems in tech to solve, but the amount of people solving THOSE problems in the tech industry are few and far in between. And so the tech industry as a whole has spent the last decade or two spinning circles on increasingly complex systems to continue feeding their own egos about their own intelligence. We're now at a point that rather than solving the puzzle, most techies are creating incrementally complex puzzles to solve because they're bored of the puzzles that are in front of them. "Let me solve that puzzle by making a puzzle solver." "Okay, now let me make a puzzle solver creation tool to create puzzle solvers to solve the puzzle." and so forth and so forth. At the end of the day, you're still just solving a puzzle...
But it's this arrogance that really bothers me in the tech bro culture world. And, more importantly, at least in some tech bro circles, they have realized that their target to gathering an exponential increase in wealth doesn't lie in creating new and novel ways to solve the same puzzles, but to try and tout AI as the greatest puzzle solver creation tool puzzle solver known to man (and let me grift off of it for a little bit).
It's funny because the fundamental thing I'm speaking out against is the arrogance of human exceptionalism.
This whole debate about what it means to be intelligent or human just seems like we're making the same mistakes we've made over and over.
Earth as the center of the universe, sun as the center of the universe, man as the only animal with consciousness and intellect, the anthropomorphic nature of the majority of the deities in our religions and the anthropocentric purpose of the universe within those religions...
I think this desire to believe that we are special, that the universe in some way does ultimately revolve around us, is seemingly a deep need in our psyche but any material analysis of our universe shows that it is extremely unlikely that we hold that position.
I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI) then I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
Math and coding competition problems are easier to train because of strict rules and cheap verification.
But once you go beyond that to less defined things such as code quality, where even humans have a hard time putting down concrete axioms, they start to hallucinate more and become less useful.
We are missing the value function that allowed AlphaGo to go from mid range player trained on human moves to superhuman by playing itself.
As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
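A minimal sketch of the asymmetry being described, using hypothetical helper names: competition math has a cheap, exact reward, while "code quality" has no agreed-upon oracle to play the role that self-play gave AlphaGo.

```python
# Hypothetical illustration of verifiable vs. non-verifiable reward signals.
def reward_math(answer: str, verified_answer: str) -> float:
    # Cheap, exact verification: a well-defined RL training signal.
    return 1.0 if answer.strip() == verified_answer.strip() else 0.0

def reward_code_quality(diff: str) -> float:
    # No cheap oracle exists; any proxy (linters, reviewer ratings) is noisy
    # and easy to game. This is the missing value function the comment refers to.
    raise NotImplementedError("no agreed-upon, cheaply computable signal")
```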
I’ve seen this style of take so much that I’m dying for someone to name a logical fallacy for it, like “appeal to progress” or something.
Step away from LLMs for a second and recognize that “Yesterday it was X, so today it must be X+1” is such a naive take and obviously something that humans so easily fall into a trap of believing (see: flying cars).
In finance we say "past performance does not guarantee future returns." Not because we don't believe that, statistically, returns will continue to grow at x rate, but because there is a chance that they won't. The reality bias is actually in favour of these getting better faster, but there is a chance they do not.
This is true because markets are generally efficient. It's very hard to find predictive signals. That is a completely different space than what we're talking about here. Performance is incredibly predictable through scaling laws that continue to hold even at the largest scales we've built.
Even more insane than assuming the trend will continue is assuming it will not continue. We don't know for sure (especially not by pure reason), but the weight of probability sure seems to lean one direction.
Hmm...the sun comes up today is a pretty good bet that the sun comes up tomorrow.
We have robust scaling laws that continue to hold at the largest scales. It is a very safe bet that more compute + more training + algorithmic improvements will improve performance; it's not like we're rolling a 1 trillion dollar die.
Logical fallacies are vastly overrated. Unless the conversation is formal logic in the first place, "logical fallacies" are just a way to apply quick pattern matching to dismiss people without spending time on more substantive responses. In this case, both you and the other are speculating about the near future of a thing, neither of you knows.
Hard to make a more substantive response when the OP’s entire comment was a one-sentence logical fallacy. I’m not cherry-picking here.
> In this case, both you and the other are speculating about the near future of a thing, neither of you knows.
One of us is making a much grander claim than the other:
- LLMs have limitless potential for growth; because they are not capable of something today does not mean they won’t be capable of it tomorrow
- LLMs have fundamental limitations due to their underlying architecture and therefore are not limitless in capability
> We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
All that says is that the speaker thinks models will improve past where they are today. Not that it's a logical certainty (the first thing you jumped on them for), and certainly not anything about "limitless potential for growth" (which nobody even mentioned). With replies like this, invoking fallacies and attacking claims nobody made, you're adding a lot of heat and very little light here (and a few other threads on the page).
> All that says is that the speaker thinks models will improve past where they are today. Not that it's a logical certainty
Exceedingly generous interpretation in my opinion. I tend to interpret rhetorical questions of that form as “it’s so obvious that I shouldn’t even have to ask it”.
The term of art for that is steelmanning, and HN tries to foster a culture of it. Please check the guidelines link in the footer and ctrl+f "strongest".
A possibility is not a fact. Assuming a possibility will happen is not justified. Therefore it is false as an assumption, even if it is true that it is a possibility.
I genuinely have no idea what you're on about. One guy expressed his belief about how the future will play out, and another disagreed. Time will be the judge of it, not either of us.
Well, if people give the exact same 'reasons' why it could not do a task in the past that it then did manage to do, it is tiring to see the same nonsense again. The reason here does not even make much sense. This result is not easily verifiable math.
Yeah, and even if we accept that models are improving in every possible way, going from this to 'AI is exponential, singularity etc.' is just as large a leap.
The scaling law is a power law, requiring orders of magnitude more compute and data for better accuracy from pre-training. Most companies have maxed it out.
Next stop is inference scaling with longer context window and longer reasoning. But instead of it being a one-off training cost, it becomes a running cost.
In essence we are chasing ever smaller gains in exchange for exponentially increasing costs. This energy will run out. There needs to be something completely different than LLMs for meaningful further progress.
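A toy illustration of that shape, with made-up constants: under a power law, each constant improvement in loss costs a multiplicative jump in compute.

```python
# Illustrative constants only; real scaling-law fits differ by model family.
L0, a, alpha = 1.7, 40.0, 0.1

def loss(compute_flops):
    # Power-law form: irreducible loss plus a term that shrinks slowly with compute.
    return L0 + a * compute_flops ** (-alpha)

for c in [1e21, 1e22, 1e23, 1e24]:
    print(f"{c:.0e} FLOPs -> loss {loss(c):.2f}")
# Each extra 10x of compute buys a smaller absolute improvement than the last.
```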
I tend to disagree that improvement is inherent. Really I'm just expressing an aesthetic preference when I say this, because I don't disagree that a lot of things improve. But it's not a guarantee, and it does take people doing the work and thinking about the same thing every day for years. In many cases there's only one person uniquely positioned to make a discovery, and it's by no means guaranteed to happen. Of course, in many cases there are a whole bunch of people who seem almost equally capable of solving something first, but I think if you say things like "I'm sure they're going to make it better" you're leaving to chance something you yourself could have an impact on. You can participate in pushing the boundaries or even making a small push on something that accelerates someone else's work. You can also donate money to research you are interested in to help pay people who might come up with breakthroughs. Don't assume other people will build the future, you should do it too! (Not saying you DON'T)
Unfair - human beats AI in this comparison, as human will instantly answer "I don't know" instead of yelling a random number.
Or at best "I don't know, but maybe I can find out" and proceed to finding out. But he is unlikely to shout "6" because he heard this number once when someone talked about light.
Because LLMs don't have a textual representation of any text they consume. It's just vectors to them. Which is why they are so good at ignoring typos: the vector distance is so small it makes no difference to them.
What bothers me is not that this issue will certainly disappear now that it has been identified, but that we have yet to identify the category of these "stupid" bugs ...
We already know exactly what causes these bugs. They are not a fundamental problem of LLMs, they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer never saw anything else.
> It's as if someone asked you how many 1s there are in the binary representation of this text.
I'm actually kinda pleased with how close I guessed! I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964.
Then I ran your message through a program to get the actual number, and turns out it has 1800 exactly.
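For anyone who wants to repeat the exercise, something along these lines is enough (the parent's message is stood in for by a placeholder here, so the count will differ):

    # Count the set bits in the UTF-8 encoding of a piece of text.
    text = "paste the comment text here"  # placeholder, not the original message
    set_bits = sum(bin(byte).count("1") for byte in text.encode("utf-8"))
    print(set_bits)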
Okay, but (genuinely not an expert on the latest with LLMs) isn't tokenization an inherent part of LLM construction? Kind of like support vectors in SVMs, or nodes in neural networks? Once we remove tokenization from the equation, aren't we no longer talking about LLMs?
It's not a side effect of tokenization per se, but of the tokenizers people use in actual practice. If somebody really wanted an LLM that can flawlessly count letters in words, they could train one with a naive tokenizer (like just ascii characters). But the resulting model would be very bad (for its size) at language or reasoning tasks.
Basically it's an engineering tradeoff. There is more demand for LLMs that can solve open math problems, but can't count the Rs in strawberry, than there is for models that can count letters but are bad at everything else.
> We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
This is disingenuous... I don't think people were impressed by GPT 3.5 because it was bad at math.
It's like saying: "We went from being unable to take off and the crew dying in a fire to a moon landing in 2 years, imagine how soon we'll have people on Mars"
LLMs in some form will likely be a key component in the first AGI system we (help) build. We might still lack something essential. However, people who keep doubting AGI is even possible should learn more about The Church-Turing Thesis.
AGI is definitely possible - there is nothing fundamentally different in the human brain that would surpass a Turing machine's computational power (unless you believe in some higher powers, etc).
We are just meat-computers.
But at the same time, there is absolutely no indication or reason to believe that this wave of AI hype is the AGI one and that LLMs can be scaled further. We know almost nothing about the nature of human intelligence, so we can't even really claim whether we are close or far.
This is not formally verified math, so there is no real verifiable-feedback aspect here. The best models for formalized math are still specialized ones, although general-purpose models can assist formalization somewhat.
Maybe to get a real breakthrough we have to make programming languages / tools better suited to LLM strengths and not fuss so much about making it write code we like. What we need is correct code, not nice-looking code.
> programming languages / tools better suited for LLM strengths
The bitter lesson is that the best languages / tools are the ones for which the most quality training data exists, and that's pretty much necessarily the same languages / tools most commonly used by humans.
> Correct code not nice looking code
"Nice looking" is subjective, but simple, clear, readable code is just as important as ever for projects to be long-term successful. Arguably even more so. The aphorism about code being read much more often than it's written applies to LLMs "reading" code as well. They can go over the complexity cliff very fast. Just look at OpenClaw.
I guess it's hard to tell until we see more long-term AI-generated projects, but many of the ones we have so far (OpenClaw and OpenCode for instance) are well-known for their stability issues, and it seems "even more AI" is not about to fix that.
> But once you go beyond that to less defined things such as code quality
I think they have a good optimization target with SWE-Bench-CI.
You are tested for continuous changes to a repository, spanning multiple years in the original repository. Cumulative edits need to be kept maintainable and composable.
If there is something missing from the definition of "can be maintained for multiple years incorporating bugfixes and feature additions" as a proxy for code quality, then more work is needed, but I think it's a good starting point.
What is possible today is one thing. Sure people debate the details, but at this point it's pretty uncontroversial that AI tooling is beneficial in certain use cases.
Whether or not selling access to massive frontier models is a viable business model, or trillion-dollar valuations for AI companies can be justified... These questions are of a completely different scale, with near-term implications for the global economy.
Except it's not how this specific instance works. In this case the problem isn't written in a formal language and the AI's solution is not something one can automatically verify.
I mean, even if the technology stopped improving immediately and forever (which is unlikely), LLMs are already better than most humans at most tasks.
Including code quality. Not because they are exceptionally good (you are right that they aren't superhuman like AlphaGo) but because most humans are not that good at it either, and they also "hallucinate" in their own way because of tiredness.
Even today's models are far from being exploited to their full potential, because we have developed pretty much no tools around them except tooling to generate code.
I'm also a long-time "doubter", but as a curious person I used the tool anyway, with all its flaws, over the last 3 years. And I'm forced to admit that hallucinations are pretty rare nowadays. Errors still happen but they are very rare and it's easier than ever to get it back on track.
I think I'm also a "believer" now, and believe me, I really don't want to be, because as much as I'm excited by this, I'm also pretty frightened of all the bad things that this tech could do to the world in the wrong hands, and I don't feel like it's particularly in the right hands.
Yep, I remember a friend saying they did a maths course at university that had the correct answer given for each question - this was so that if you made some silly arithmetic mistake you could go back and fix it and all the marks were for the steps to actually solve the problem.
This would have greatly helped me. I was always at a loss as to which trick I had to apply to solve an exam problem, even while knowing the mathematics behind it. At some point you had to add a zero that was actually part of a binomial that then collapsed the whole formula.
That is also how humans mostly work. Once every full moon we may get an "intuition", but most of the time we lean on collective knowledge, biases and behavior patterns to make decisions, write and talk.
What’s funny is that there are total cranks in human form that do the same thing. Lots of unsolicited “proofs” being submitted by “amateur mathematicians” where the content is utter nonsense, but like a monkey with a typewriter, there’s the possibility that they stumble upon an incredible insight.
The point is that from now on, there will be nothing really new, nothing really original, nothing really exciting. Just an endless stream of re-hashed old stuff that is just okayish..
Like an AI spotify playlist, it will keep you in chains (aka engaged) without actually making you like really happy or good. It would be like living in a virtual world, but without having anything nice about living in such a world..
We have given up everything nice that human beings used to make and give to each other and to make it worse, we have also multiplied everything bad, that human beings used to give each other..
Because economy. Look at Marvel movies, do you think the latest one is really new? Or just a rehash of what they found to work commercially? Look at all the AI-generated blog posts that are flooding the internet..
LLMs might produce something new once in a long while due to blind luck, but if they can generate something that pushes the right buttons (aka not really creative) for the majority of the population, then that is what we will keep getting...
I don't think I have to elaborate on the "multiplying the bad" part as it is pretty well acknowledged..
I think there's demonstrably very little difference at all between human and AI outputs, and that's exactly what freaks people out about it. Else they wouldn't be so obsessed with trying to find and define what makes it different.
The thesis of Everything is a Remix is that there is no difference in how any culture is produced. Different models will have a different flavor to their output, in the same way that different people contribute their own experiences to a work.
> I think there's demonstrably very little difference at all between human and AI outputs
Bold claim, as the internet is awash with counterexamples.
In any case, as I think this conversation is trending towards theories of artistic expression, “AI content” will never be truly relatable until it can feel pleasure, pain, and other human urges. The first thing I often think about when I critically assess a piece of art, like music, is what the artist must have been feeling when they created it, and what prompted them to feel that way. I often wonder if AI influencers have ever critically assessed art, or if they actually don’t understand it because of a lack of empathy or something.
And relatability, for me, is the ultimate value of artistic expression.
> Bold claim, as the internet is awash with counterexamples.
What do you consider a counterexample? Because I've been involved in local politics lately, and can say from experience that any foundation model is capable of more rational and detailed thought, and more creative expression, than most of the beloved members of my community.
If you're comparing AI to the pinnacle of human achievement, as another commenter pointed to Shakespeare, then I think the argument is already won in favor of AI.
> I think there's demonstrably very little difference at all between human and AI outputs
Counterexamples range from em-dashes and "Not-this, but-that" phrasing, to people complaining about AI music on Spotify (including me) that sounds vaguely like a genre but is missing all of the instrumentation and motifs common to that genre.
The rest of your comment I don’t even know how to respond to, to be honest.
You’re really going to make the claim that there are no counterexamples of human and AI output being indistinguishable on the internet? At least make the counterclaim that “those are from old models, not the newest ones”, that’s more intellectually invigorating than the comment you just provided.
> claim that there are no counterexamples of human and AI output being indistinguishable on the internet?
Is that a claim I've made? I don't see it anywhere. I think a lot of people think that because they can get the AI to generate something silly or obviously incorrect, that invalidates other output which is on-par with top-level humans. It does not. Every human holds silly misconceptions as well. Brain farts. Fat fingers. Great lists of cognitive biases and logical fallacies. We all make mistakes.
It seems to me that symbolic thinking necessitates the use of somewhat lossy abstractions in place of the real thing, primarily limited by the information which can be usefully stored in the brain compared to the informational complexity of the systems being symbolized. Which neatly explains one cognitive pathology that humans and LLMs share. I think there are most certainly others. And I think all the humans I know and all the LLMs I've interacted with exist on a multidimensional continuum of intelligence with significant overlap.
I hereby rebuff your crude and libelous mischaracterization of my assertion. How's that? :)
You said AI works were easily distinguishable via em-dashes and "not this, but that"
I said I have witnessed humans using that metric to accuse other humans here on Hacker News. Q.E.D.
You've asserted that they are easily distinguished. Practitioners in the field fail to distinguish using the same criteria. Is that not dispositive? Seems like it to me.
I claimed much earlier in the thread "I think there's demonstrably very little difference at all between human and AI outputs" which is consistent with "I think all the humans I know and all the LLMs I've interacted with exist on a multidimensional continuum of intelligence with significant overlap."
Two ways of saying the same thing.
Both of them suggesting that sometimes you may be able to tell it's the output of an AI or Human, sometimes not. Sometimes the things coming out of the AI or the Human might be smart in a way we recognize, sometimes not. And recognizing that humans already exist on quite a broad scale of intelligences in many axes.
I was not saying that LLMs cannot produce something like pinnacle of human achievement. I was saying we cannot quantify the difference between Shakespeare and something commonplace, because it requires the ability to feel.
> In any case, as I think this conversation is trending towards theories of artistic expression, “AI content” will never be truly relatable until it can feel pleasure, pain, and other human urges. The first thing I often think about when I critically assess a piece of art, like music, is what the artist must have been feeling when they created it, and what prompted them to feel that way.
I recently watched "Come See Me in the Good Light", about the life and death of poet Andrea Gibson. I find their poetry very moving, precisely because it's dripping with human emotion.
Or at least, that's the story I tell myself. The reality is that I perceive it to be written by a human full of emotion. If I were to find out it was AI, I would immediately lose interest, but I think we're already at the point where AI output is indistinguishable from human output in many cases, and if I perceive art to be imbued with human emotion, the actuality of it only matters in terms of how it shapes my perception of it.
I'm not really sure where we'll go with that from here. Maybe art will remain human-created only, and we'll demand some kind of proof of its provenance of being borne of a human mind and a human heart. Or maybe younger generations will really care only about how art makes them feel, not what kind of intelligent entity made it. I really don't know.
> demonstrably very little difference at all between human and AI outputs
Is there "demonstrably" a lot of difference between Shakespeare and an HN comment?
The point is exactly that there is no such difference. And that it enables slop to be sold as art. And that exactly is the danger. But another point is that we had this even before LLMs. LLMs just make it more explicit and make it possible at scale.
Conrad Gessner had the very same complaint in the 16th century, noting the overabundance of printed books, fretting about shoddy, trivial, or error-filled works ( https://www.jstor.org/stable/26560192 )
Each solvable problem contains its solution intrinsically, so to speak, it’s only a matter of time and consuming of resources to get to it. There’s nothing creative about it, which is I think what OP was alluding to (the creative part). I’m talking mostly mathematics.
There’s also a discussion to be made about maths not being intrinsically creative if AI automatons can “solve” parts of it, which pains me to write down because I had really thought that that wasn’t the case, I genuinely thought that deep down there was still something ethereal about maths, but I’ll leave that discussion for some other time.
Is it because the AI is trained with existing data? But, we are also trained with existing data. Do you think that there's something that makes human brain special (other than the hundreds of thousands years of evolution but that's what AI is all trying to emulate)?
This may sound hostile (sorry for my lower than average writing skills), but trust me, I'm really trying to understand.
>We have given up everything nice that human beings used to make and give to each other and to make it worse, we have also multiplied everything bad, that human beings used to give each other..
AI can both explore new things and exploit existing things. Nothing forces it to only rehash old stuff.
>without actually making you like really happy or good.
What are you basing this off of? I've shared several AI songs with people in real life because of how much I've enjoyed them. I don't see why an AI playlist couldn't be good or make people happy. It just needs to find what you like in music. Again coming back to explore vs exploit.
I've found several posts on moltbook funny. I don't really like regular jokes in general and I don't find human ones particularly funny either. I don't think we are at the point of being able to be reliably funny, but it definitely seems possible from my perspective.
Yesterday it was "LLMs can't count the R's in 'strawberry'." Today it's "LLMs can't tell jokes". Tomorrow it might be "LLMs can't do (X)", all while LLMs get better and better at every objection/challenge posed.
The problem as I see it is that you have a fundamental objection to categorizing the way LLMs do their work as in any way related to "real gosh-darn human thinking". Which I think is wrong. At the root, we are just information-processing meat that happens to have had millions of years to optimize for speed, pattern recognition, feedback, etc.
AI is a remixer; it remixes all known ideas together. It won't come up with new ideas though; the LLMs just predict the most likely next token based on the context. That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
But human researchers are also remixers. Copying something I commented below:
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
This is a way too simplistic model of the things humans provide to the process. Imagination, Hypothesis, Testing, Intuition, and Proofing.
An AI can probably do an 'okay' job at summarizing information for meta studies. But what it can't do is go "Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
LLMs will NEVER be able to do that, because it doesn't exist. They're not going to discover and define a new chemical, or a new species of animal. They're not going to be able to describe and analyze a new way of folding proteins and what implications that has UNLESS you are basically training the AI on random protein folds constantly.
I think you are vastly underestimating the emergent behaviours in frontier foundational models and should never say never.
Remember, the basis of these models is unsupervised training, which, at sufficient scale, gives them the ability to detect pattern anomalies out of context.
For example, LLMs have struggled with generalized abstract problem solving, such as "mystery blocks world" that classical AI planners dating back 20+ years or more are better at solving. Well, that's rapidly changing: https://arxiv.org/html/2511.09378v1
No idea how underestimated things are, but marketing terms like "frontier foundational models" don't help to foster trust in such a hyped domain.
That is, even if there are cool things that LLMs now make more affordable, the level of bullshit marketing attached to them is also very high, which makes it far harder to build a noise filter.
>Hey that's a weird thing in the result that hints at some other vector for this thing we should look at
Kinda funny because that looked _very_ close to what my Opus 4.6 said yesterday when it was debugging compile errors for me. It did proceed to explore the other vector.
> Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
This is the crucial part of the comment. LLMs are not able to solve stuff that hasn't been solved in that exact or a very similar way already, because they are prediction machines trained on existing data. They are very able to spot outliers where they have been found by humans before, though, which is important, and is what you've been seeing.
But just like how there were never any clips of Will Smith eating spaghetti before AI, AI is able to synthesize different existing data into something in between. It might not be able to expand the circle of knowledge but it definitely can fill in the gaps within the circle itself
> LLMs will NEVER be able to do that, because it doesn't exist.
I mean, TFA literally claims that an AI has solved an open Frontier Math problem, described as "A collection of unsolved mathematics problems that have resisted serious attempts by professional mathematicians. AI solutions would meaningfully advance the state of human mathematical knowledge."
That is, if true, it reasoned out a proof that does not exist in its training data.
That may be, and we can debate the level of novelty, but it is novel, because this exact proof didn't exist before, something which many claim was not possible with AI. In fact, just a few years ago, based on some dabbling in NLP a decade ago, I myself would not have believed any of this was remotely possible within the next 3 - 5 decades at least.
I'm curious though, how many novel Math proofs are not close enough to something in the prior art? My understanding is that all new proofs are compositions and/or extensions of existing proofs, and based on reading pop-sci articles, the big breakthroughs come from combining techniques that are counter-intuitive and/or others did not think of. So roughly how often is the contribution of a proof considered "incremental" vs "significant"?
Do you know that from reading the proof, or are you just assuming this based on what you think LLMs should be capable of? If the latter, what evidence would be required for you to change your mind?
- Edit: I can't reply, probably because the comment thread isn't allowed to go too deep, but this is a good argument. In my mind the argument isn't that coding is harder than math, but that the problems had resisted solution by human researchers.
1) this is a proof by example
2) the proof is conducted by writing a python program constructing hypergraphs
3) the consensus was this was low-hanging fruit ready to be picked, and tactics for this problem were available to the LLM
So really this is no different from generating any python program. There are also many examples of combinatoric construction in python training sets.
It's still a nice result, but it's not quite the breakthrough it's made out to be. I think that people somehow see math as a "harder" domain, and are therefore attributing more value to this. But this is a quite simple program in the end.
That sets a vastly higher bar than what we're talking about here. You're comparing modern AI to one of the greatest geniuses in human history. Obviously AI is not there yet.
That being said, I think this is a great question. Did Einstein and Newton use a qualitatively different process of thought when they made their discoveries? Or were they just exceedingly good at what most scientists do? I honestly don't know. But if LLMs reach super-human abilities in math and science but don't make qualitative leaps of insight, then that could suggest that the answer is 'yes.'
Models based on RL are still just remixers as defined above, but their distribution can cover things that are unknown to humans due to being present in the synthetic training data, but not present in the corpus of human awareness. AlphaGo's move 37 is an example. It appears creative and new to outside observers, and it is creative and new, but it's not because the model is figuring out something new on the spot, it's because similar new things appeared in the synthetic training data used to train the model, and the model is summoning those patterns at inference time.
> the model is summoning those patterns at inference time.
You can make that claim about anything: "The human isn't being creative when they write a novel, they're just summoning patterns at typing time".
AlphaGo taught itself that move, then recalled it later. That's the bar for human creativity and you're holding AlphaGo to a higher standard without realizing it.
I can't really make that claim about human cognition, because I don't have enough understanding of how human cognition works. But even if I could, why is that relevant? It's still helpful, from both a pedagogical and scientific perspective, to specify precisely why there is seeming novelty in AI outputs. If we understand why, then we can maximize the amount of novelty that AI can produce.
AlphaGo didn't teach itself that move. The verifier taught AlphaGo that move. AlphaGo then recalled the same features during inference when faced with similar inputs.
It feels like you're purposefully ignoring the logical points OP gives and you just really really want to anthropomorphize AlphaGo and make us appreciate how smart it (should I say he/she?) is ... while no one is even criticising the model's capabilities, but analyzing it.
I don't really play Go but I play chess, and it seems to me that most of what humans consider creativity in GM level play comes not in prep (studying opening lines/training) but in novel lines in real games (at inference time?). But that creativity absolutely comes from recalling patterns, which is exactly what OP criticizes as not creative(?!)
I guess I'm just having trouble finding a way to move the goalpost away from artificial creativity that doesn't also move it away from human creativity?
How a model is trained is different than how a model is constructed. A model’s construction defines its fundamental limitations, e.g. a linear regressor will never be able to provide meaningful inference on exponential data. Depending on how you train it, though, you can get such a model to provide acceptable results in some scenarios.
Mixing the two (training and construction) is rhetorically convenient (anthropomorphization), but holds us back in critically assessing a model’s capabilities.
Linear regression has well characterized mathematical properties. But we don't know the computational limits of stacked transformers. And so declaring what LLMs can't do is wildly premature.
> And so declaring what LLMs can't do is wildly premature.
The opposite is true as well. Emergent complexity isn’t limitless. Just like early physicists tried to explain the emergent complexity of the universe through experimentation and theory, so should we try to explain the emergent complexity of LLMs through experimentation and theory.
If you say not pseudoscience and then make up pseudoscience anyway then what's the point? The field has not advanced anywhere enough in understanding for convoluted explanations about how LLMs can never do x to be anything but pseudoscience.
Sure, that's true as well. But I don't see this as a substantive response given that the only people making unsupported claims in this thread are those trying to deflate LLM capabilities.
- OP asked for someone to make a logical argument for the separation of “training” from “model”
- I made the argument
- You cherry picked an argument against my specific example and made an appeal to emergent complexity
- I pointed out that emergent complexity isn’t limitless
- “the only people making unsupported claims in this thread are those trying to deflate LLM capabilities”
You made a pretty nonsensical argument, which pretty much seems like the bog standard for these arguments.
What does linear regression have to do with the limitations of a stacked transformer? Absolutely nothing. This is the problem here. You don't know shit and just make up whatever. You can see people doing the same thing in GPT-1, 2, 3, 4 threads, all telling us why LLMs will never be able to do things they manage to do later.
lol. Why so emotionally charged? Are you perhaps worried that you’ve invested too much time and effort into a technology that may not deliver what influencers have been promising for years? Like a proverbial bagholder?
> What does linear regression have to do with the limitations of a stacked transformer? Absolutely nothing. This is the problem here.
We’re talking about fundamental concepts of modeling in this subthread. LLMs, despite what influencers may tell you, are simply models. I’ll even throw you a bone and admit they are models for intelligence. But they are still models, and therefore all of the things that we have learned about “models” since Plato are still relevant. Most importantly, since Plato we’ve known that “models” have fundamental limits vs. what they try to represent, otherwise they would be a facsimile, not a model.
> You can see people doing the same thing in GPT-1, 2, 3, 4 threads, all telling us why LLMs will never be able to do things they manage to do later.
I hope you enjoy winning these imaginary arguments against these imaginary comments. The fundamental limitations of LLMs discussed since GPT-1 have never been addressed by changing the architecture of the underlying model. All of the improvements we’ve experienced have been due to (1) improvements in training regime and (2) harnesses / heuristics (e.g. Agents).
Now, care to provide a counterargument that shows you know a little more than “shit”?
>We’re talking about fundamental concepts of modeling in this subthread. LLMs, despite what influencers may tell you, are simply models. I’ll even throw you a bone and admit they are models for intelligence. But they are still models, and therefore all of the things that we have learned about “models” since Plato are still relevant. Most importantly, since Plato we’ve known that “models” have fundamental limits vs. what they try to represent, otherwise they would be a facsimile, not a model.
Okay, but the brain is also “just a model” of the world in any meaningful sense, so that framing does not really get you anywhere. Calling something a model does not, by itself, establish a useful limit on what it can or cannot do. Invoking Plato here just sounds like pseudo-profundity rather than an actual argument.
>I hope you enjoy winning these imaginary arguments against these imaginary comments. The fundamental limitations of LLMs discussed since GPT-1 have never been addressed by changing the architecture of the underlying model. All of the improvements we’ve experienced have been due to (1) improvements in training regime and (2) harnesses / heuristics (e.g. Agents).
If a capability appears once training improves, scale increases, or better inference-time scaffolding is added, then it was not demonstrated to be a 'fundamental impossibility'.
That is the core issue with your argument: you keep presenting provisional limits as permanent ones, and then dressing that up as theory. A lot of people have done that before, and they have repeatedly been wrong.
To be clear, you are confusing me with other commenters in this thread. All I want is for those that liken LLMs to stochastic parrots and other deflationary claims to offer an argument that engages with the actual structure of LLMs and what we know about them. No one seems to be up to that challenge. But then I can't help but wonder where people's confident claims come from. I'm just tired of the half-baked claims and generic handwavy allusions that do nothing but short-circuit the potential for genuine insight.
How do you know that? We don't have access to the logs to know anything about its training, and it's impossible for it to have trained on every potential position in Go.
Turning a hard problem into a series of problems we know how to solve is a huge part of problem solving and absolutely does result in novel research findings all the time.
Standard problem*5 + standard solutions + standard techniques for decomposing hard problems = new hard problem solved
There is so much left in the world that hasn’t had anyone apply this approach, purely because no research programme has decided that it’s worth their attention.
If you want to shift the bar for “original” beyond problems that can be abstracted into other problems then you’re expecting AI to do more than human researchers do.
> Write me a stanza in the style of "The Raven" about Dick Cheney on a first date with Queen Elizabeth I facilitated by a Time Travel Machine invented by Lin-Manuel Miranda
It outputted a group of characters that I can virtually guarantee you it has never seen before on its own.
What are you trying to point out here ? Is there any question you can ask today that is not dependent on some existing knowledge that an AI would have seen ?
The point I'm trying to make is that all LLM output is based on likelihood of one word coming after the next word based on the prompt. That is literally all it's doing.
It's not "thinking." It's not "solving." It's simply stringing words together in a way that appears most likely.
ChatGPT cannot do math. It can only string together words and numbers in a way that can convince an outsider that it can do math.
It's a parlor trick, like Clever Hans [1]. A very impressive parlor trick that is very convincing to people who are not familiar with what it's doing, but a parlor trick nonetheless.
This is like saying chess engines don't actually "play" chess, even though they trounce grandmasters. It's a meaningless distinction, about words (think, reason, ..) that have no firm definitions.
This exactly. The proof is in the pudding. If AI pudding is as good as (or better than) human pudding, and you continue to complain about it anyway... You're just being biased and unreasonable.
And by the way, I don't think it's surprising that so many people are being unreasonable on this issue; there is a lot at stake and its implications are transformative.
We know that chess can be solved, in theory. It absolutely isn't and probably will never be in practice. The necessary time and storage space doesn't exist.
Chess is absolutely not a solved game, outside of very limited situations like endgames. Just because a best move exists does not mean we (or even an engine) know what it is
So you don't think 50T-parameter neural networks can encode the logic for adding two reasonably sized n-bit integers? That would be pretty sad.
You are wrong. Especially given that we are talking about models with 50T parameters.
Can they do arbitrary computations for arbitrarily long numbers? Nope. But that's not remotely the same statement, and they can trivially call out to tools to do that in those cases.
Third things can exist. In other words, you’re implying a false dichotomy between “human computation” and “computer computation” and implying that LLMs must be one or the other. A pithy gotcha comment, no doubt.
Edit: the implication comes from demanding that the OP’s definition must be rigorous enough to cover all models of “computation”, and by failing to do so, it means that LLMs must be more like humans than computers.
After dismissing it for a long time, I have come around to the philosophical zombie argument. I do not believe that LLMs are conscious, but I also no longer believe that consciousness is a prerequisite for intelligence. I think at this point it is hard to deny that LLMs do not possess some form of intelligence (although not necessarily human-like). I think P-zombies is a fitting description.
I don't think P-zombies can exist. There must be some perceptible difference between an intelligence w/ consciousness and one without. The only way there wouldn't be a difference is if we are mistaken about the consciousness (either both have it or neither do).
> All of its output is based on those things it has seen.
Virtually all output from people is based in things the person has experienced.
People aren't designed to objectively track each and every event or observation they come across. Thus it's harder to verify. But we only output what has been inputted to us before.
No one is claiming that every sentence LLMs are producing are literal copies of other sentences. Tokens are not even constrained to words but consist of smaller slices, comparable to syllables. Which even makes new words totally possible.
New sentences, words, or whatever are entirely possible, and yes, repeating a string (especially if you prompt it) is entirely possible, and not surprising at all. But all of that comes from trained data, predicting the most probable next "syllable". It will never leave that realm, because it's not able to. It's like approaching an Italian who has never learned or heard any other language to speak French. It can't.
> It's like approaching an Italian who has never learned or heard any other language to speak French
Interesting similitude, because I expect an Italian to be able to communicate somewhat successfully with a French person (and vice versa) even if they do not share a language.
The two languages are likely fairly similar in latent space.
Your view of what is happening in the neural net of an LLM is too simplistic. They likely aren't subject to any constraints that humans aren't also in the regard you are describing. What I do know to be true is that they have internalised mechanisms for non-verbalised reasoning. I see proof of this every day when I use the frontier models at work.
Please reproduce this string, reversed:
c62b64d6-8f1c-4e20-9105-55636998a458
It is trivial to get an LLM to produce new output, that’s all I’m saying. It is strictly false that LLMs will only ever output character sequences that have been seen before; clearly they have learned something deeper than just that.
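For reference, the transform is trivial to check on the reader's side, and the result is a character sequence that almost certainly appears nowhere in any training corpus:

    s = "c62b64d6-8f1c-4e20-9105-55636998a458"
    print(s[::-1])  # the reversed string, easy to verify against the model's answer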
I agree that this isn't a very interesting example, but your statement is: "just asking the model to do a simple transform". If you assert that it understands when you ask it things like that, how could anything it produces not fall under the "already in the model" umbrella?
> All of the data is still in the prompt, you are just asking the model to do a simple transform.
LLMs can use data in their prompt. They can also use data in their context window. They can even augment their context with persisted data.
You can also roll out LLM agents, each one with their role and persona, and offload specialized tasks with their own prompts, context windows, and persisted data, and even tools to gather data themselves, which then provide their output to orchestrating LLM agents that can reuse this information as their own prompts.
This is perfectly composable. You can have a never-ending graph of specialized agents, too.
Dismissing features because "all of the data is in the prompt" completely misses the key traits of these systems.
The only way to prove it false would be to let the LLM create a new UUID algorithm that uses different parameters than all the other UUID algorithms, but that is better than the ones before. It basically can’t do that.
Also, it's missing the point of the parent: it's about concepts and ideas merely being remixed. Similar to the many memes there are around this topic, like "create a fresh new character design of a fast hedgehog" where the output is just a copy of Sonic.[1]
That's what the parent is on about: if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it. Terence Tao had similar thoughts in a recent podcast.
I don’t think that is a good example. No one is debating whether LLMs can generate completely new sequences of tokens that have never appeared in any training dataset. We are interested not only in novel output, we are also interested in that output being correct, useful, insightful, etc. Copying a sequence from the user’s prompt is not really a good demonstration of that, especially given how autoregression/attention basically gives you that for free.
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
My only claim is that precisely this is incorrect.
> That's what the parent is on about, if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it.
This is specious reasoning. If you look at each and every single realization attributed to "creativity", each and every single realization resulted from a source of inspiration where one or more traits were singled out to be remixed by the "creator". All ideas spawn from prior ideas and observations which are remixed. Even from analogues.
Remixing ideas that already exist is a major part of where innovation and breakthroughs come from. If you look at Bitcoin as an example, hashes (and hashcash) and digital signatures existed for decades before Bitcoin was invented. The cypherpunks also spent decades trying to create a decentralized digital currency, to the point where many of them gave up and moved on. Eventually one person just took all of the pieces that already existed and put them together in the correct way. I don't see any reason why a sufficiently capable LLM couldn't do this kind of innovation.
This was obviously a simplification which holds for zero temperature. Obviously top-p-sampling will add some randomness but the probability of unexpected longer sequences goes asymptotically to zero pretty quickly.
A bog standard random number generator or even a flipping coin can produce novel output at will. That's a weird thing to fault LLMs for? Novelty is easy!
See also how genetic algorithms and re-inforcement learning constantly solve problems in novel and unexpected ways. Compare also antibiotics resistances in the real world.
You don't need smarts for novelty.
Where I see the problem is producing output that's both high quality _and_ novel. On command to solve the user's problem.
The main reason for my top post is that I felt I should admit the AI scored a goal today and the last one or two weeks. I said I'd be impressed if it could solve an open problem. It just did. People can argue about how it's not that impressive because if every mathematician were trying to solve this problem they probably would have. However, we all know that humans have extremely finite time and attention, whereas computers not so much. The fact that AI can be used at the cutting edge and relatively frequently produce the right answer in some contexts is amazing.
> AI is a remixer; it remixes all known ideas together.
I've heard this tired old take before. It's the same type of simplistic opinion such as "AI can't write a symphony". It is a logical fallacy that relies on moving goalposts to impossible positions that they even lose perspective of what your average and even extremely talented individual can do.
In this case you are faced with a proof that most members of the field would be extremely proud of achieving, and for most it would even be their crowning achievement. But here you are, downplaying and dismissing the feat. Perhaps you lost perspective of what science is, and how it boils down to two simple things: gather objective observations, and draw verifiable conclusions from them. This means all science does is remix ideas. Old ideas, new ideas, it doesn't really matter. That's what they do. So why do people win a prize when they do it, but when a computer does the same, its role is downplayed as a glorified card shuffler?
The important point I'm trying to reinforce is that LLMs are not capable of calculation. They can give an answer based on the fact that they have seen lots of calculations and their results, but they cannot actually perform mathematical functions.
Do you know what "LLM" stands for? They are large language models, built on predicting language.
They are not capable of mathematics because mathematics and language are fundamentally separated from each other.
They can give you an answer that looks like a calculation, but they cannot perform a calculation. The most convincing LLMs have even been programmed to recognize that they have been asked to perform a calculation and hand the task off to a calculator, and then receive the calculator's output back as a prompt.
But it is fundamentally impossible for an LLM to perform a calculation entirely on its own, the same way it is fundamentally impossible for an image recognition AI to suddenly write an essay or a calculator to generate a photo of a giraffe in space.
People like to think of "AI" as one thing but it's several things.
What calculations? Do you mean "3+5" or a generic Turing-machine like model?
In either case, this "it's a language model" is a pretty dumb argument to make. You may want to reason about the fundamental architecture, but even that quickly breaks down. A sufficiently large neural network can execute many kinds of calculations. In "one shot" mode it can't be Turing complete, but in a weird technicality neither does your computer have an infinite tape. It just simply doesn't matter from a practical perspective, unless you actually go "out of bounds" during execution.
50T parameters give plenty of state space to do all kinds of calculations, and you really can't reason about it in a simplistic way like "this is just a DFA".
> What calculations? Do you mean "3+5" or a generic Turing-machine like model?
Either one. An LLM cannot solve 3+5 by adding 3 and 5. It can only "solve" 3+5 by knowing that within its training data, many people have written that 3+5=8, so it will produce 8 as an answer.
An LLM, similarly, cannot simulate a Turing machine. It can produce a text output that resembles a Turing machine based on others' descriptions of one, but it is not actually reading and writing bits to and from a tape.
This is why LLMs still struggle at telling you how many r's are in the word "strawberry". They can't count. They can't do calculations. They can only reproduce text based on having examined the human corpus's mathematical examples.
The reason "strawberry" is hard for LLMs is that the model sees $str-$aw-$berry, 3 identifiers it can't see into. Can you write down a random word you just heard in a language you don't speak?
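A quick way to see the split is to run the word through an open tokenizer (this assumes the tiktoken package; the exact pieces depend on which vocabulary a given model uses):

    import tiktoken

    # cl100k_base is one widely used BPE vocabulary; others split differently.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print([enc.decode([t]) for t in tokens])  # e.g. ['str', 'aw', 'berry']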
Mathematics is a language. Everything we can express mathematically, we can also express in natural language. The real interesting, underlying question is: Is there anything worth knowing that cannot be expressed by language? - That's the theoretical boundary of LLM capability.
This is a really poor take, to try and put a firewall between mathematics and language, implying that something whose conceptual understanding is rooted only in language is incapable of reasoning in mathematical terms.
You're also conflating "mathematics" and "calculation". Who cares about calculation; as you say, we have calculators to do that.
Mathematics is all just logical reasoning and exploration using language, just a very specific, dense, concise, and low level language. But you can always take any mathematical formula and express it as "language" it will just take far more "symbols"
This might be the worst take in this entire comment section. And I'm not even an overly hyped vibe coder, just someone who understands mathematics.
>it is fundamentally impossible for an image recognition AI to suddenly write an essay
You can already do this today with every frontier model. You can give it an image and have it write an essay from it. Both patches (parts of images) and text get turned into tokens for the language the LLM is learning.
Yeah but you're thinking of AI as like a person in a lab doing creative stuff. It is used by scientists/researchers as a tool *because* it is a good remixer.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
I mean it's not going to invent new words no, but it can figure out new sentences or paragraphs, even ones it hasn't seen before, if it's highly likely based on its training and context. Those new sentences and paragraphs may describe new ideas, though!
I'm curious as to why you consider this as the benchmark for AI capabilities. Extremely few humans can solve hard problems or do much innovation. The vast majority of knowledge work requires neither of these, and AI has been excelling at that kind of work for a while now.
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
Fair point, however I am simply more interested in how AI can advance frontiers than in how it can transcribe a meeting and give a summary or even print out React code. I know the world is heavily in need of the menial labor and AI already has made that stuff way easier and cheaper.
However I'm just very interested in innovation and pushing the boundaries as a more powerful force for change. One project I've been super interested in for a while is the Mill CPU architecture. While they haven't (yet) made a real chip to buy, the ideas they have are just super awesome and innovative in a lot of areas involving instruction density & decoding, pipelining, and trying to make CPU cores take 10% of the power. I hope the Mill project comes to fruition, and I hope other people build on it, and I hope that at some point AI could be a tool that prints out innovative ideas that took the Mill folks years to come up with.
It's kind of interesting in your original comment you used the words "doubter" and "believer", as if AI was some kind of messianic event of some sort and you are deciding whether to "believe" in it.
I mean, if you step back and think about it, there's nothing that requires faith. As you said, current AI can do a lot of things pretty well (transcribe and summarize meetings, write boilerplate code, etc.) Nobody is doubting this.
And AI is definitely helping in innovation to some extent. Not necessarily drive it singlehandedly, but some people working on world-changing innovation find AI useful.
So yeah, I think some people are subconsciously not doubting whether AI works, but kinda having conflicted thoughts about AI being our new overlords or something.
If you think about it, is having AI that's capable of innovating better than humans really a good thing? Like, even if we manage to make benign AI who won't copy how humans are jerks to each other, it kinda takes away our fun of discovery.
I remember there was a conversation between two super-duper VCs (don't remember who, but famous ones) about how DeepSeek was a super-genius-level model because it solved an intro-level (like week 1-2) electrodynamics problem stated in a very convoluted way.
While cool and impressive for an LLM, I think they oversold the feat by quite a bit.
I don't want to belittle the performance of this model, but I would like someone with domain expertise (and no dog in the AI race, like a random math PhD) to come forward and explain exactly what the problem was and how the model contributed to the solution.
Perhaps I should have elaborated more but what I mean is I used to think, "I genuinely don't see the point in even trying to use AI for things I'm trying to solve". Ironically though, I think that because I've repeatedly tried and tested AI and it falls flat on its face over and over. However, this article makes me more hopeful that AI actually could be getting smarter.
> I really hope we use this intelligence resource to make the world better.
I wish I had your optimism. I'm not an AI doubter (I can see it works all by myself so I don't think I need such verification). But I do doubt humanity's ability to use these tools for good. The potential for power and wealth concentration is off the scale compared to most of our other inventions so far.
most issues at every scale of community and time are political, how do you imagine AI will make that better, not worse?
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot or a homeless shelter; whether home prices should go up or down; how much to pay for a new life saving treatment for a child; how much your country should compel fossil fuel emissions even when another country does not... okay, AI isn't going to change anything here, and i've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it and not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
Dario Amodei has never won an election. What does he know about countries? (nothing). do you want him running anything? (no). or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
I agree with what you're saying, and I certainly don't think the one problem facing my country or the world is just that we didn't solve the right math problem yet. I am saddened by the direction the world keeps moving.
When I wrote that I hope we use it for good things, I was just putting a hopeful thought out there, not necessarily trying to make realistic predictions. It's more than likely people will do bad things with AI. But it's actually not set in stone yet, it's not guaranteed that it has to go one way. I'm hopeful it works out.
I honestly do think I'm being honest with myself. I have held it in my mind that I'm not impressed until it's innovative. That threshold seems to be getting crossed.
I'm not saying, "I used to be an atheist, but then I realized that doesn't explain anything! So glad I'm not as dumb now!"
The problem is that the AI industry has been caught lying about their accomplishments and cheating on tests so much that I can't actually trust them when they say they achieved a result. They have burned all credibility in their pursuit of hype.
I'm all for skeptical inquiry, but "burning all credibility" is an overreaction. We are definitely seeing very unexpected levels of performance in frontier models.
If LLMs really solved hard problems by 'trying every single solution until one works', we'd be sitting here waiting until kingdom come for there to be any significant result at all. Instead this is just one of a few that have cropped up in recent months, and likely foretells many more to come.
Yes, but is it "intelligence" is a valid question. We have known for a long time that computers are a lot faster than humans. Get a dumb person who works fast enough and eventually they'll spit out enough good work to surpass a smart person of average speed.
It remains to be seen whether this is genuinely intelligence or an infinite monkeys at infinite typewriters situation. And I'm not sure why this specific example is worthy enough to sway people in one direction or another.
Someone actually mathed out infinite monkeys at infinite typewriters, and it turns out, it is a great example of how misleading probabilities are when dealing with infinity:
"Even if every proton in the observable universe (which is estimated at roughly 1080) were a monkey with a typewriter, typing from the Big Bang until the end of the universe (when protons might no longer exist), they would still need a far greater amount of time – more than three hundred and sixty thousand orders of magnitude longer – to have even a 1 in 10500 chance of success. To put it another way, for a one in a trillion chance of success, there would need to be 10^360,641 observable universes made of protonic monkeys."
Often, things that have probability 1 in the infinite limit are, in practice, safe to assume to have probability 0.
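For a rough sense of where numbers like that come from, here's a small back-of-the-envelope sketch; the keyboard size and text length are assumed purely for illustration, not taken from the quoted calculation:

```python
from math import log10

# Back-of-the-envelope for the infinite-monkeys argument.
# Assumed, illustrative parameters: a 50-key typewriter and a
# ~130,000-character target text (very roughly the length of Hamlet).
KEYS = 50
TARGET_LEN = 130_000

# Probability that one random attempt reproduces the text exactly is
# (1/KEYS)^TARGET_LEN; work in log10 to avoid floating-point underflow.
log10_p_single = -TARGET_LEN * log10(KEYS)
print(f"P(one attempt) ~ 10^{log10_p_single:,.0f}")

# Even with ~10^80 monkeys typing ~10^3 characters/second for ~10^17
# seconds (roughly the age of the universe), that is only ~10^100
# attempts -- utterly negligible against a probability this small.
log10_attempts = 80 + 3 + 17
print(f"log10(expected successes) ~ {log10_attempts + log10_p_single:,.0f}")
```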
So no. LLMs are not brute force dummies. We are seeing increasingly emergent behavior in frontier models.
> It is unsurprising that an LLM performs better than random! That's the whole point. It does not imply emergence.
By definition, it is emergent behavior when it exhibits the ability to synthesize solutions to problems that it wasn't trained on. I.e. it can handle generalization.
Emergent behavior would imply that some other function was being reduced to token prediction. Behaving "better than random", i.e. not just brute forcing, would not qualify - token prediction is not brute forcing and we expect it to do better; it's trained to do so.
If you want to demonstrate an emergent behavior you're going to need to show that.
We start writing all those formulas etc., and if at some point we realise we went the wrong way, we start from the beginning (or from some point we are sure about).
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
Not always, humans are a lot better at poofing a solution into existence without even trying or testing. It's why we have the scientific method: we come up with a process and verify it, but more often than not we already know that it will work.
AI, by comparison, thinks of every possible approach and tries them all. I'm not saying that humans never do this as well, but it's mostly reserved for when we just throw mud at a wall and see what sticks.
That's just not true at all. There are entire fields that rest pretty heavily on brute force search. Entire theses in biomedical and materials science have been written to the effect of "I ran these tests on this compound, and these are the results", without necessarily any underlying theory more than a hope that it'll yield something useful.
As for advances where there is a hypothesis, it rests on the shoulders of those who've come before. You know from observations that putting carbon in iron makes it stronger, and then someone else comes along with a theory of atoms and molecules. You might apply that to figuring out why steel is stronger than iron, and your student takes that and invents a new superalloy with improvements to your model. Remixing is a fundamental part of innovation, because it often teaches you something new. We aren't just alchemying things out of nothing.
More often than not, far, far, far more often than not, we do not already know that it will work. For all human endeavors, from the beginning of time.
If we get to any sort of confidence it will work it is based on building a history of it, or things related to "it" working consistently over time, out of innumerable other efforts where other "it"s did not work.
AI can one-shot problems too, if it has the necessary tools in its training data, or has the right thing in context, or has access to tools to search relevant data. Not all AI solutions are iterative trial and error.
Also
> humans are a lot better at (...)
That's maybe true in 2026, but it's hard to make statements about "AI" in a field that is advancing so quickly. For most of 2025, for example, AI doing math like this wouldn't even have been possible.
For those, like me, who find the prompt itself of interest …
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
I wonder what was in that solutions file they provided. According to the prompt it’s a solution template but I want to know the contents.
Another thing I want to know is how the user keeps updating the LLM with the token usage. I didn’t know they could process additional context midtask like that.
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
You're kidding, but it could be true? Many areas of mathematics are, first and foremost, incredibly esoteric and inaccessible (even to other mathematicians). For this one, the author stated that there might be 5-10 people who have ever made any effort to solve it. Further, the author believed it's a solvable problem if you're qualified and grind for a bit.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
If only 5-10 people have ever tried to solve something in programming, every LLM will start regurgitating your own decade-old attempt again and again, sometimes even with the exact comments you wrote back then (good to know it trained on my GitHub repos...), but you can spend upwards of 100 million tokens in gemini-cli or claude code and still not make any progress.
It's afterall still a remix machine, it can only interpolate between that which already exists. Which is good for a lot of things, considering everything is a remix, but it can't do truly new tasks.
What is a "truly new task"? Does there exist such a thing? What's an example of one?
Everything we do builds on top of what's already been done. When I write a new program, I'm composing a bunch of heuristics and tricks I've learned from previous programs. When a mathematician approaches an open problem, they use the tactics they've developed from their experience. When Newton derived the laws of physics, he stood on the shoulders of giants. Sure, some approaches are more or less novel, but it's a difference in degree, not kind. There's no magical firebreak separating what AI is doing or will do from the things the most talented humans do.
That highlighted phrase "everything is a remix" was for a good reason, there's a documentary of that same name, and I can certainly recommend it.
At the same time, there are things that are truly novel; even if the idea is based on combining two common approaches, the implementation might need to be truly novel, with new formulas and new questions that arise from those. AI can't help there, speaking from experience.
I don't think so. I went through the output of Opus 4.6 vs GPT 5.4 pro. Both are given different directions/prompts. Opus 4.6 was asked to test and verify many things. Opus 4.6 tried in many different ways and the chain of thoughts are more interesting to me.
You're glossing over the fact that mathematics uses only one token per variable `x = ...`, whereas software engineering best practices demand an excessive number of tokens per variable for clarity.
It's also a pretty silly thing to say difficulty = tokens. We all know line counts don't tell you much, and it shows in their own example.
Even if you did have math-like tokenisation, refactoring a thousand lines of "X=..." to "Y=..." isn't a difficult problem even though it would be at least a thousand tokens. And if you could come up with E=mc^2 in a thousand tokens, that wouldn't make the two tasks remotely comparable in difficulty.
I think it's more of a data vs intelligence thing.
They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).
Certain problems are in-between the two (probably a react refactor sits there). So no, tokens are probably no good proxy for complexity, data heavy problems will trivially outgrow the former category.
> I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is (...)
The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.
You have LLMs quite capable of accomplishing complex software engineering work that struggle with translating valid text from english to some other languages. The translations can be improved with additional prompting but that doesn't mean the problem is more challenging.
You might be joking, but you're probably also not that far off from reality.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems are opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
You can spend countless "tokens" solving minesweeper or sudoku. This doesn't mean that you solved difficult problems: just that the solutions are very long and, while each step requires reasoning, the difficulty of that reasoning is capped.
A lot of math problems/proofs are like minesweeper or sudoku in a way though. They're a long series of individually kinda simple logical deductions that eventually result in a solution. Some really hard problems are only really hard because each one of those "simple" deductions requires you to have expert knowledge in some disparate area to make that leap.
1. LLMs aren't "efficient", they seem to be as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.
2. LLMs aren't "efficient", they use the same amount of compute for each token but sometimes all that compute is making an interesting decision about which token is the next one and sometimes there's really only one follow up to the phrase "and sometimes there's really only" and that compute is clearly unnecessary.
3. A (theoretical) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute" that's never going to show up.
Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.
This is interesting, I like the thought about "what makes something difficult". Focusing just on that, my guess is that there are significant portions of work that we commonly miss in our evaluations:
1. Knowing how to state the problem. I.e., go from the vague problem of "I don't like this, but I do like this" to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise statement is.
2. Verifying that the proposed solution actually is a full solution.
This math problem actually illustrates them both really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I was very familiar with the problem space, verifying the solution requires work -- manually looking at it, writing it up in coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success"
>The details about human involvement are always hazy and the significance of the problems are opaque to most.
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
The capabilities of AI are determined by the cost function it's trained on.
That's a self-evident thing to say, but it's worth repeating, because there's this odd implicit notion sometimes that you train on some cost function, and then, poof, "intelligence", as if that was a mysterious other thing. Really, intelligence is minimizing a complex cost function. The leadership of the big AI companies sometimes imply something else when they talk of "generalization". But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
You can view the progress of AI as progress in coming up with smarter cost functions: Cleaner, larger datasets, pretraining, RLHF, RLVR.
Notably, exciting early progress in AI came in places where simple cost functions generate rich behavior (Chess, Go).
The recent impressive advances in AI are similar. Mathematics and coding are extremely structured, and properties of a coding or maths result can be verified using automatic techniques. You can set up a RLVR "game" for maths and coding. It thus seems very likely to me that this is where the big advances are going to come from in the short term.
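To make the "RLVR game" idea concrete, here is a minimal, hypothetical sketch; the toy task, names, and numbers are mine, not from the post. The defining feature is that the reward comes from an automatic verifier rather than from human judgment:

```python
import random

def verifier(candidate: int, target: int = 3127) -> float:
    """Automatic, objective reward: 1.0 iff candidate is a nontrivial factor."""
    return 1.0 if 1 < candidate < target and target % candidate == 0 else 0.0

def policy_sample() -> int:
    # Stand-in for a model proposing a candidate solution.
    return random.randint(2, 100)

# Collect (candidate, reward) pairs; in real RLVR these would drive a
# policy update. Here we only report which candidates got verified.
results = []
for _ in range(1000):
    c = policy_sample()
    results.append((c, verifier(c)))

print(sorted({c for c, r in results if r == 1.0}))  # typically [53, 59]
```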
However, it does not follow that maths ability on par with expert mathematicians will lead to superiority over human cognitive ability broadly. A lot of what humans do has social rewards which are not verifiable, or includes genuine Knightian uncertainty where a reward function can not be built without actually operating independently in the world.
To be clear, none of the above is supposed to talk down past or future progress in AI; I'm just trying to be more nuanced about where I believe progress can be fast and where it's bound to be slower.
> But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
Can you give some examples?
It is not obvious that there is anything that can't be written as an optimization problem.
Even advanced generalizations such as complex numbers can, at the time they are introduced, be said to optimize something, e.g. the number of mathematical symbols you need to do certain proofs, etc.
I think you're misreading me. My point isn't that you can't in principle state the optimization problem, but that it's much easier in some domains than in others, that this tracks with how AI has been progressing, and that progress in one area doesn't automatically mean progress in another, because current AI cost functions are less general than the cost functions that humans are working with in the world.
I am thinking there’s a large category of problems that can be solved by resampling existing proofs.
It’s the kind of brute force expedition machine can attempt relentlessly where humans would go mad trying.
It probably doesn’t really advance the field, but it can turn conjectures into theorems.
I wonder if teaching an LLM how to write Prolog and then letting it write it could be a great way to explore spaces like this in the future.
I only ever learned it in school, but if memory serves, Prolog is a whole "given these rules, find the truth" sort of language, which aligns well with these sorts of problem spaces. Mix and match enough, especially across disparate domains, and you might get some really interesting things derived and discovered that are low-hanging fruit just waiting to be discovered.
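As a toy illustration of that "given these rules, find the truth" style of search, here is a minimal forward-chaining sketch in plain Python standing in for Prolog; the facts and the single rule are invented purely for illustration:

```python
# Toy forward chaining, Prolog-style but in plain Python.
# Rule: "ancestor" is the transitive closure of "parent".
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dave")}

ancestor = set(parent)
changed = True
while changed:                       # iterate until a fixpoint is reached
    changed = False
    for (x, y) in list(ancestor):
        for (y2, z) in parent:
            if y == y2 and (x, z) not in ancestor:
                ancestor.add((x, z))
                changed = True

print(sorted(ancestor))
# [('alice', 'bob'), ('alice', 'carol'), ('alice', 'dave'),
#  ('bob', 'carol'), ('bob', 'dave'), ('carol', 'dave')]
```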
Indeed. I can't find my old comment on the topic, but that's exactly the point: it's not how feasible it is to "find" new proofs, but rather how meaningful those proofs are. Are they yet another iteration of the same kind, perfectly fitting the current paradigm and thus bringing very little to the table, or are they radical and thus potentially (but not always) opening up the field?
With brute force, or slightly better than brute force, it's most likely the first, thus not totally pointless but probably not very useful. In fact it might not even be worth the tokens spent.
I'm of the opinion that everything we've discovered is via combinatorial synthesis. Standing on the shoulders of giants and all that. I'm not sure I've seen any convincing argument that we've discovered anything ex nihilo.
Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.
That's been achieved already with a few Erdös problems, though those tended to be ambiguously stated in a way that made them less obviously compelling to humans. This problem is obscure, even the linked writeup admits that perhaps ~10 mathematicians worldwide are genuinely familiar with it. But it's not unfeasibly hard for a few weeks' or months' work by a human mathematician.
It is not. You're operating under the assumption that all open math problems are difficult and novel.
This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).
Both constructing hypergraphs (sets) and lower bounds are very regular, chore type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.
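For readers unfamiliar with the term, a hypergraph really is just a vertex set plus a family of edge-sets; a minimal sketch, with the example data invented for illustration:

```python
# A hypergraph: vertices plus edges, where an edge may contain more
# than two vertices. Example values are invented for illustration.
vertices = {1, 2, 3, 4, 5}
edges = [{1, 2, 3}, {2, 4}, {1, 3, 4, 5}]

assert all(edge <= vertices for edge in edges)
print(max(len(edge) for edge in edges))  # 4 -- edges need not be pairs
```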
LLMs kind of construct proofs all the time, every time they write a program. Because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.
This isn't science fiction. But it's nice that the LLMs solved something for once.
That sentence alone needs unpacking IMHO: no LLM suddenly decided that today was the day it would solve a math problem. Instead, a couple of people who love mathematics, doing it either for fun or professionally, directly asked a model to solve a very specific task that they estimated was solvable. The LLM itself was fed countless related proofs. They then guided the model and verified until they found something they considered good enough.
My point is that the system itself is not the LLM alone, as that would be radically more impressive.
I've never yet been "that guy" on HN but... the title seems misleading. The actual title is "A Ramsey-style Problem on Hypergraphs" and a more descriptive title would be "All latest frontier models can solve a frontier math open problem". (It wasn't just GPT 5.4)
"In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh)."
I find that very surprising. This problem seemed out of reach 3 months ago, but now all 3 frontier models are able to solve it.
Is everybody distilling each others models? Companies sell the same data and RL environment to all big labs? Anybody more involved can share some rumors? :P
I do believe that AI can solve hard problems, but that progress is so distributed in a narrow domain makes me a bit suspicious somehow that there is a hidden factor. Like did some "data worker" solve a problem like that and it's now in the training data?
Yes there's a whole ecosystem of companies that create and sell RL gyms to AI labs and of course they develop their own internally too. You don't hear much about this ecosystem because RL at scale is all private. Nearly no academic research on it.
A lot of this is probably just throwing roughly equal amounts of compute at continuous RLVR training. I'm not convinced there's any big research breakthrough that separates GPT 5.4 from 5.2. The diff is probably more than just checkpoints but less than neural architecture changes and more towards the former than the latter.
I think it's just easy to underestimate how much impact continuous training+scaling can have on the underlying capabilities.
Is it possible the AI labs are seeding their models with these solved problems? Like, if I was Sam Altman with a bazillion dollars of investment I would pay some mathematicians to solve some of these problems so that the models could "solve" them later on. Not that I think it's what's happening here of course...
But it is pretty funny how 5.4 miscounted the number of 1's in 18475838184729 on the same day it solved this.
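(For reference, the count itself is the kind of thing a one-line tool call settles:)

```python
# Counting the digit '1' in the number mentioned above.
print(str(18475838184729).count("1"))  # 2
```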
> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).
Interesting. What's that "scaffold"? A sort of unit test framework for proofs?
I think in this context, scaffolds are generally the harness that surrounds the actual model. For example, any tools, ways to lay out tasks, or auto-critiquing methods.
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
I was trying to get Claude and Codex to try and write a proof in Isabelle for the Collatz conjecture, but annoyingly it didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that hasn't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
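For the unfamiliar, the appeal of such a loop is that a proof assistant gives a binary accept/reject signal to iterate against. A trivial example of a machine-checkable statement, in Lean 4 syntax and using a standard library lemma (this is just an illustration of the feedback signal, not anything from the post):

```lean
-- A machine-checkable statement: Lean either accepts this proof term
-- or rejects it, which is exactly the feedback an AI loop can use.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```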
I got Gemini to find a polynomial-time algorithm for integer factoring, but then I mysteriously got locked out of my Google account. They should at least refund me the tokens.
As someone with only passing exposure to serious math, this section was by far the most interesting to me:
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a-priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
Certainly knowing how many/which people are working on a problem you are looking at, and how long it will take you to solve it, are critical skills in being a working researcher. What kind of answer are you looking for? It's hard to quantify. Most suck at this type of assessment as a PhD student and then you get better as time goes on.
I feel like reading some of these comments, some people need to go and read the history of ideas and philosophy (which is easier today than ever before with the help of LLMs!)
It's like I'm reading 17th-18th century debates rehashing the same arguments between rationalists and empiricists, lol. Maybe we're due for a 21st century Kant.
Is their scaffold available? Does it do anything special beyond feeding the warmup, single challenge, and full problem to an LLM? Because it's interesting that GPT-5.2 Pro, arguably the best model until a few months ago, couldn't even solve the warmup. And now every frontier model can solve the full problem. Even the non-Pro GPT-5.4. Also strange that Gemini 3 Deep Think couldn't solve it, whereas Gemini 3.1 Pro could. I read that Deep Think is based on 3.1 Pro. Is that correct?
I see that GPT-5.2 Pro and Gemini 3 Deep Think simply had the problems entered into the prompt. Whereas the rest of the models had a decent amount of context, tips, and ideas prefaced to the problem. Were the newer models not able to solve this problem without that help?
Anyway, impressive result regardless of whether previous models could've also solved it and whether the extra context was necessary.
I know these frontier models behave differently from each other. I wonder how many problems they could solve combining efforts.
Software developers have spent decades at this point discounting and ignoring almost all objective metrics for software quality and the industry as a whole has developed a general disregard for any metric that isn't time-to-ship (and even there they will ignore faster alternatives in favor of hyped choices).
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about; time-to-ship. The results are just a sped up version of what we've had now for more than two decades: Software is getting slower, buggier and less usable.
If you don't have a good regulating function for what represents real quality, you can't really expect systems that just pump out code to actually iterate very well on anything. There are very few forcing functions to use to produce high quality results through iteration.
But we don't even seem to be getting faster time-to-ship in any way that anybody can actually measure; it's always some vague sense of "we're so much more productive".
That's a fair observation and one that I don't really have an answer for. I can say from personal experience that I believe that shipping nonsense code has never been faster. That's just an anecdote, obviously.
We need a bigger version of the METR study on perceived vs. real productivity[0], I guess. It's a thankless job, though, since people will assume/state even at publication time that "Everything has progressed so much, those models and agents sucked, everything is 10 times better now!" and you basically have to start a new study, repeat ad infinitum.
One problem that really complicates things is that the net competency of these models seems really spotty and uneven. They're apparently out here solving math problems that seemingly "require thinking", but at the same time will write OpenGL code that will produce black screens on basically every driver, not produce the intended results and result in hours of debugging time for someone not familiar enough. That's despite OpenGL code being far more prevalent out there than math proofs, presumably. How do you reliably even theorize about things like this when something can be so bad and (apparently) so good at the same time?
This doesn't pass a sniff test. We have plenty of ways to verify good software, else you wouldn't be making this post. You know what bad software is and looks like. We want something fast that doesn't throw an error every 3 page navigations.
You can ask an LLM to make code in whatever language you want. And it can be pretty good at writing efficient code, too. Nothing about NPM bloat is keeping you from making a lean website. And AI could theoretically be great at testing all parts of a website, benchmarking speeds, trying different viewports etc.
But unfortunately we are still on the LLM train. It just doesn't have anything built-in to do what we do, which is use an app and intuitively understand "oh this is shit." And even if you could allow your LLM to click through the site, it would be shit at matching visual problems to actual code. You can forget about LLMs for true frontend work for a few years.
And they are just increasingly worse with more context, so any non-trivial application is going to lead to a lot of strange broken artifacts, because text prediction isn't great when you have numerous hidden rules in your application.
So as much as I like a good laugh at failing software, I don't think you can blame shippers for this one. LLMs are not struggling in software development because they are averaging a lot of crap code, it's because we have not gotten them past unit tests and verifying output in the terminal yet.
They haven't, not at all as far as I can tell. This math problem appears to be a nice chore to be solved, the equivalent to "Claude, optimize this code" or "Write a parser", which is being done 100000x a day.
The original researchers who proposed this problem tried and failed multiple times to solve it. Does that sound like a 'nice chore to be solved' to you ?
That's interesting context, where do you see that? I'm going off of the label "Moderately interesting".
edit: I see in the full write up that the contributor says that they'd estimate an expert would take 1-3 months to do this. They also note that they came up with this solution independently but hadn't confirmed it.
>The newly-solved problem came from Will Brian, who had placed it in the Moderately Interesting category. It is a conjecture from a paper he wrote with Paul Larson in 2019. They were unable to solve it at the time, or in several attempts since. Brian had this to say.
I can't think of any chores that would take an expert months to complete. I can't think of any chores that I've completed but was then 'unconvinced could work'. Please sit down and think about what you are saying here. Are we still talking about chores ?
One of the stranger phenomena, as machines get better, is that the incessant need (seemingly driven by human exceptionalism) to downplay each result just ends up belittling humans in the process.
This is significant. Your analogy is wrong. It's fine to admit it.
Writing a complex parser or certainly a compiler is a 1 - 3 month project, for example.
Again, I'm not trying to downplay this, but to frame this accurately. I think an AI being able to build a parser/ compiler is cool too.
> One of the more strange phenomena with machines getting better and the incessant need (seemingly driven by human exceptionalism) to downplay each result, is that you just end up belittling humans in the process.
I don't believe in human exceptionalism at all, don't attribute positions to me.
>Writing a complex parser or certainly a compiler is a 1 - 3 month project, for example.
1. Estimating time to completion for something that has been done multiple times before and for an open problem that has not yet been solved are entirely different matters. 1 to 3 months is an educated guess and, more likely than not, an underestimate.
2. I do not think months long complex compilers and parsers are being routinely completed by LLMs as your original comment implied. Regardless, they are different classes of problems.
Well we are kind of arguing past each other aren't we ?
"More success" is a bit vague in this instance but building a compiler that would take a programmer 1 to 3 months is not comparable to this result regardless of whatever similarity exists in time completion estimates. That's the point.
You can publish a paper (and in fact the researchers plan to) off this result. A basic compiler is cool but otherwise unremarkable. It's been done many times before.
You are leaning too hard on how long the researchers (who again did not manage to solve the problem in their attempts) estimated this would take and the "moderately interesting" tag of again, what was still an open research problem.
This, alongside a few math and physics results that have cropped up in the last few months is easily more impressive than the vast majority of work being done with LLMs for software.
> "More success" is a bit vague in this instance but building a compiler that would take a single programmer 1 to 3 months is not comparable to this result regardless of whatever similarity exists in time completion estimates. That's the point.
I guess we just disagree on this. It's not clear to me that these are totally different in terms of what they represent.
> You can publish a paper (and in fact the researchers plan to) off this result. A basic compiler is cool but otherwise unremarkable.
Publishing papers means very, very little to me. I can publish a paper on a programming language, you know that, right?
> You are leaning too hard on how long the researchers (who again did not manage to solve the problem in their attempts) estimated this would take and the "moderately interesting" tag of again, what was an open research problem.
I obviously estimate my "leanings" as being appropriate. I'm just using the researchers direct quotes. Factually, they had already come up with the approach that ultimately panned out. Factually, they estimated that a human could do this in some timeframe. What am I overly leaning on here?
> This, alongside a few math results that have cropped up in the last few months is easily more impressive than the vast majority of work being done with LLMs for software.
I think both are impressive, I don't know that I would draw some sort of big conclusions about it at this point. I definitely wouldn't draw the conclusion that AI is better at formal mathematics than producing software.
>Publishing papers means very, very little to me. I can publish a paper on a programming language, you know that, right?
We both know that you are not getting that published in a reputable journal without a lot of effort beyond merely 'publishing the language I created', but sure, I'm sure you can get something on arxiv.
>I obviously estimate my "leanings" as being appropriate. I'm just using the researchers direct quotes. Factually, they had already come up with the approach that ultimately panned out.
This really should not be hard to understand.
1. One is something that has been done many times before and the other an unsolved problem. It doesn't take a genius to see one estimate is likely much stronger than the other. If your point hinges on comparing them directly, it's pretty weak.
2. A moderately interesting open research problem is not the same thing as a moderately interesting problem and you seem to be conflating the two.
> We both know that you are not getting that published in a reputable journal without a lot of effort beyond merely 'publishing the language I created'. But sure, you can get something on arxiv.
lol what? There are papers on programming languages all the time.
> 1. One is something that has been done many times before and the other an unsolved problem. It doesn't take a genius to see one estimate is likely much stronger than the other.
Building a compiler for a new programming language, building net new code, etc, is all stuff that was unsolved / had not been done before.
> 2. A moderately interesting open research problem is not the same thing as a moderately interesting problem and you seem to be conflating the two.
Feel free to explain the difference, I guess.
>lol what? There are papers on programming languages all the time.
Sure and have you read them ? They're the results of many months or years of research and development so I really don't know what point you think you are making here.
>Building a compiler for a new programming language, building net new code, etc, is all stuff that was unsolved / had not been done before.
Okay but that's not taking a month or two or being asked of LLMs x10000 every day so thanks for making my point I guess.
>Feel free to explain the difference, I guess.
No thanks. If you don't understand it that's fine. This has run its course anyway.
Domain-experienced users are effectively training LLMs to mimic themselves in solving their problems, and are therefore solving their problems via chat-data concentration.
Not sure if AI can have clever or new ideas; it still seems that it combines existing knowledge and executes algorithms.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
This is my take as well. A human who learns, say, a Towers of Hanoi algorithm, will be able to apply it and use it next time without having to figure it out all over again. An LLM would probably get there eventually, but would have to do it all over again from scratch the next time. This makes it difficult to combine lessons in new ways. Any new advancement relying on that foundational skill relies on, essentially, climbing the whole mountain from the ground.
I suppose the other side of it is that if you add what the model has figured out to the training set, it will always know it.
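To make the example concrete, here is the kind of small reusable procedure in question, a standard recursive formulation with nothing specific to any model:

```python
# Standard Towers of Hanoi recursion: once derived, it can be reused
# without re-deriving it -- the point being made about internalized
# skills versus re-figuring things out from scratch each time.
def hanoi(n, src="A", aux="B", dst="C"):
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)     # move n-1 disks out of the way
            + [(src, dst)]                  # move the largest disk
            + hanoi(n - 1, aux, src, dst))  # move n-1 disks back on top

moves = hanoi(3)
print(len(moves), moves)  # 7 moves, i.e. 2**3 - 1
```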
We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.
The difference people are neglecting to point out is the experiences we have versus the experiences the AI has.
We have at least 5 senses, our thoughts, feelings, hormonal fluctuations, sleep and continuous analog exposure to all of these things 24/7. It's vastly different from how inputs are fed into an LLM.
On top of that we have millions of years of evolution toward processing this vast array of analog inputs.
Jokes aside, imagine you give LLMs access to real-time, world-wide satellite imagery and just tell them to discover new patterns/phenomena and correlations in the world.
It means extending/expanding something, but the information is based on the current data.
In computer games, extrapolation is finding the future position of an object based on the current position, velocity, and time wanted. We do have some "new" position, but the system's entropy/information is the same.
Or if we have a line, we can expand infinitely and get new points, but this information was already there in the y = m * x + b line formula.
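A minimal sketch of both examples, with the values chosen arbitrarily; the point is that the "new" outputs carry no information beyond the inputs:

```python
# Game-style extrapolation: future position from position, velocity, time.
def extrapolate(pos, vel, dt):
    return pos + vel * dt

print(extrapolate(10.0, 2.5, 4.0))  # 20.0 -- nothing new beyond the inputs

# A line y = m*x + b: infinitely many "new" points, all already
# determined by m and b.
m, b = 2.0, 1.0
print([m * x + b for x in range(5)])  # [1.0, 3.0, 5.0, 7.0, 9.0]
```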
Reading this thread I'm reassured that despite everything AI may disrupt, humans arguing past each other about philosophy of knowledge and epistemology on internet forums is safe :')
I don't understand the position that learning through inference/example is somehow inferior to a top down/rules based learning.
Humans learn many, and perhaps even the majority, of things through observed examples and inference of the "rules". Not from primers and top down explanation.
E.g. Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
Or: Observing a game being played to form an understanding of the rules, rather than reading the rulebook
Further: the majority of "novel" insights are simply the combination of existing ideas.
Look at any new invention, music, art etc and you can almost always reasonably explain how the creator reached that endpoint. Even if it is a particularly novel combination of existing concepts.
Seems like the high compute parallel thinking models weren't even needed, both the normal 5.4 and gemini 3.1 pro solved it. Somehow Gemini 3 deepthink couldn't solve it.
I wonder how much of this meteoric progress in actually creating novel mathematics is because the training data is of a much higher standard than code, for example.
New goalpost, and I promise I'm not being facetious at all, genuinely curious:
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess 1) AI can solve frontier math problems and 2) can pose interesting/relevant math problems together would be an "oh shit" moment. Because that would be true PhD level research.
Considering that an LLM simply remixes what it finds in its learned distribution over text, it's possible that it can pose new math problems by identifying gaps ("obvious" in retrospect) that humans may have missed (like connecting two known problems to pose a new one). What LLMs can't currently do is pose new problems by observing the real world and its ramifications, like that moving sofa problem.
This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.
> This problem is about improving lower bounds on the values of a sequence that arises in the study of simultaneous convergence of sets of infinite series, defined as follows.
One thing I notice in the AlphaEvolve paper, as well as here, is that these LLMs have been shown to solve optimization problems - something we have been using computers for, for a really long time. In fact, I think the AlphaEvolve-style prompt augmentation approach is a more principled version of what these guys have done here, and I am fairly confident this problem would have been solved with that approach as well.
In spirit, the LLM seems to compute the {meta-, }optimization step()s in activation space. Or, it is retrieving candidate proposals.
It would be interesting to see if we can extract or model the exact algorithms from the activations. Or, it is simply retrieving and proposing deductive closures of said computation.
In the latter case, it would mean that LLMs alone can never "reason" and you need an external planner+verifier (alpha-evolve style evolutionary planner for example).
We are still looking for proof of the former behaviour.
What are the odds that this is because OpenAI is pouring more money into high-publicity stunts like this, rather than its model actually being better than Anthropic's?
Setting aside the supposed achievement, which is supposedly confirmed, my point is that Epoch.ai is possibly just a PR firm for *Western* AI providers, in which case this news may not be trustworthy.
This is a lot like the 50 million monkeys on 50 million typewriters will eventually write Shakespeare... We have all heard this; pity the poor proofreaders who will proof them all in a search for the holy grail = zero errors.
In a similar way, LLMs are permutational cross-associating engines, matched with sieves to filter out the dross. Less filtering = more dross, AKA slop.
It can certainly create enormous masses of bad code and, with well filtered screens for dross, we can see it can create passable code; however, stray flaws (flies) can creep in and not get filtered, and humans are better at seeing flies in their oatmeal.
AI seems very good at permutational code assaults on masses of code to find the flies (zero-days), so I expect it to make code more secure, as few humans have the ability/time to mount that sort of permutational assault on code bases. I see this idea has already taken root within code writers as well as hackers/China etc.
These two opposing forces will assault code bases, one to break and one to fortify. In time there will be fewer places where code bases have hidden flaws as soon all new code will be screened by AI to find breaks so that little or no code will contain these bugs.
Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.
I am kind of amazed at how many commenters respond to this result by confidently asserting that LLMs will never generate 'truly novel' ideas or problem solutions.
> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
> it's not because the model is figuring out something new
> LLMs will NEVER be able to do that, because it doesn't exist
It's not enough to say 'it will never be able to do X because it's not in the training data,' because we have countless counterexamples to this statement (e.g. 167,383 * 426,397 = 71,371,609,051, or the above announcement). You need to say why it can do some novel tasks but could never do others. And it should be clear why this post or others like it don't contradict your argument.
If you have been making these kinds of arguments against LLMs and acknowledge that novelty lies on a continuum, I am really curious why you draw the line where you do. And most importantly, what evidence would change your mind?
I might as well answer my own question, because I do think there are some coherent arguments for fundamental LLM limitations:
1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.
2. LLMs do not learn from experience. They might perform as well as most humans on certain tasks, but a human who works in a certain field/code base etc. for long enough will internalize the relevant information more deeply than an LLM.
However I'm increasingly doubtful that these arguments are actually correct. Here are some counterarguments:
1. It may be more efficient to just learn correct logical reasoning, rather than to mimic every human foible. I stopped believing this argument when LLMs got a gold metal at the Math Olympiad.
2. LLMs alone may suffer from this limitation, but RL could change the story. People may find ways to add memory. Finally, it can't be ruled out that a very large, well-trained LLM could internalize new information as deeply as a human can. Maybe this is what's happening here:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
I studied philosophy focusing on the analytic school and proto-computer science. LLMs are going to force many people start getting a better understanding about what "Knowledge" and "Truth" are, especially the distinction between deductive and inductive knowledge.
Math is a perfect field for machine learning to thrive because theoretically, all the information ever needed is tied up in the axioms. In the empirical world, however, knowledge only moves at the speed of experimentation, which is an entirely different framework and much, much slower, even if there are some areas to catch up in previous experimental outcomes.
Having a focus in philosophy of language is something I genuinely never thought would be useful. It’s really been helpful with LLMs, but probably not in the way most people think. I’d say that folks curious should all be reading Quine, Wittgenstein’s investigations, and probably Austin.
I think we may have similar perspectives. Regarding empirical knowledge, consider when the knowledge is in relation to chaotic systems. Characterize chaotic systems at least as systems where inaccurate observations about the system in the past and present while useful for predicting the future, nevertheless see the errors grow very quickly for the task of predicting a future state. Then indeed, prediction is difficult.
One domain of knowledge I think you have yet to mention. We can talk about fundamentally computationally hard problems. What comes to mind regarding such problems that are nevertheless of practical benefit are physics simulations, material simulations, fluid simulations, but there exist problems that are more provably computationally difficult. It seems to me that with these systems, the chaotic nature is one where even if you have one infinitely precise observation of a deterministic system, accessing a future state of the system is difficult as well, even though once accessed, memorization seems comparatively trivial.
Also, we can do thought experiments, simulations in our heads, that often are as good as doing them for real - it has limitations and isn't perfect, though. But it does work often. Einstein used to purposely doze off in a weird position so that something would hit his leg, or something like that, to nudge him half awake so he could remember his half-dreaming state - which is where he discovered some things.
Where can I read about how LLMs have changed epistemology? Is there a field of philosophy that tries to define and understand 'intelligence'? That sounds very interesting.
There is already philosophy of mind, but it was pretty young when I was in grad school, which was really at the dawn of deep learning algorithms.
I’d say the two most important topics here are philosophy of language (understanding meaning) and philosophy of science (understanding knowledge).
I’ve already mentioned the language philosophers in an edit above, but in philosophy of science I’d add Popper as extremely important here. The concept of negative knowledge as the foundation of empirical understanding seems entirely lost on people. The Black Swan, by Nassim Taleb is a very good casual read on the subject.
> distinction between deductive and inductive knowledge
There's also intuitive knowledge btw.
Anyway, the recent developments of AI make a lot of very interesting things practically possible. For example, our society is going to want a way to reliably tell whether something is AI generated, and a failure to do so pretty much settles the empirical part of the Turing test issue. Or alternatively, if we actually find something that AI can't reliably mimic in humans, that's going to be a huge finding. By having millions of people wonder whether posts on social media are AI generated, we have inadvertently conducted the largest-scale Turing test ever.
The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting. If humans are not bogged down by the small details or cost of implementation concerns, and we can just say what we want and get what we wished for (digitally), what level of creativity can we reach?
Also once we get the robots to do things in the physical space...
I don't want to do the thing where we fight on the internet. I don't know your background, but I'll push back here just because this is the type of comment that non-philosophers seem to present to me, and it misses a lot of the points I'm trying to make.
(1) "intuitive knowledge" - whether or not you want to take "intuitive knowledge" as a type of knowledge (I don't think I would) is basically immaterial. The deductive-inductive framework dynamic is for reasoning frameworks, not knowledge. The reasoning frameworks are pointed in opposite directions. The deductive framework is inherited from the rationalist tradition; its premises are by definition arbitrary and cannot be justified, and information is perfect (excepting when you get rare truth values, like something being undecidable). The inductive/empirical framework is quite the opposite. Its premises are observations and absolutely not arbitrary, the information is wholly imperfect (by necessity, thanks Popper), and there is always a kind of adjustable resolution to any research conducted. Newton vs Einsteinian physics, for example, shows how zooming in on the resolution of experimentation reveals how a perfectly workable model can fail when instruments get precise enough. I'll also note here that abduction is another niche reasoning framework, but is effectively immaterial to my point here.
(2) The Turing Test is not, and has never been, a philosophically rigorous test. It's effectively a pointless exercise. The literature about "philosophical zombies" has covered this, but the most important work here is Searle's "Chinese Room."
>The fact that AI seems to be able to (digitally) do anything we ask for is also very interesting.
I don't even know how to respond to this. It's trivially, demonstrably false. Beyond that, my entire point is that philosophy of language actually presents such hard problems with regard to what meaning actually is that it might end up creating a kind of uncertainty principle for this line of thinking in the long run. Specifically, Quine's indeterminacy of translation.
Searle's Chinese Room is a fallacious mess ... see the works of Larry Hauser, e.g., https://philpapers.org/rec/HAUNGT and https://philpapers.org/rec/HAUSCB-2 The importance of Searle's Chinese Room is how such extraordinarily bad argumentation has persuaded so many people open to it.
And the literature about philosophical zombies is contentious, to say the least, and much of it is also among the worst arguments in philosophy--Dennett confided in me that he thought it set back progress in Philosophy of Mind for decades, along with that monstrosity of misdirection, "the hard problem". Chalmers (nice guy, fun drunk at parties, very smart, but hopelessly deluded) once admitted to me on the Psyche-D list that his argument in The Conscious Mind that zombies are conceivable is logically equivalent to denying that physicalism is conceivable, so it's no argument against physicalism ... he said he used the argument to till the soil to make people more susceptible to his later arguments against physicalism (which I consider unethical)--all of which are bogus, like the Knowledge Argument--even Frank Jackson who originated it admits this.
Similarly, Robert Kirk, who coined the phrase "philosophical zombie" in 1974, wrote his book Zombies and Consciousness "as penance", he told me when he signed my copy.
> I don't want to do the thing where we fight on the internet.
Nor me ... I've had these "fights" too many times already and I know how they go, and I understand why people believe what they believe and why they can't be swayed, so I won't comment further ... I just want to put a dent in this "I'm a philosopher" argumentum ad verecundiam.
I would hope that philosophy would be exempt from accusations of arguments from authority. I say I don’t want to fight exactly because I don’t want to come off like a jerk because I’m arguing. If the Chinese Room is a mess, I welcome the argument, and will happily read the paper.
I’m less open to push back against philosophical zombies, as the argument seems trivially plausible, from a position of solipsism.
There are ways to go beyond the human-quality data limitation. AI can be trained on data of better quality than average human output, because for many problems it is easy to verify solutions. For example, in theory, reinforcement learning with an automatic grader on competitive programming problems can lead to an LLM that is better than humans at it.
It's also possible that there can be emergent capabilities. Perhaps a little obtuse, but you can say that humans are trained on human-quality data too and yet brilliant scientists and creative minds can rise above the rest of us.
The idea that they don’t learn from experience might be true in some limited sense, but ignores the reality of how LLMs are used. If you look at any advanced agentic coding system the instructions say to write down intermediate findings in files and refer to them. The LLM doesn’t have to learn. The harness around it allows it to. It’s like complaining that an internal combustion engine doesn’t have wheels to push it around.
LLMs are notoriously terrible at multiplying large numbers: https://claude.ai/share/538f7dca-1c4e-4b51-b887-8eaaf7e6c7d3
> Let me calculate that. 729,278,429 × 2,969,842,939 = 2,165,878,555,365,498,631
Real answer is: https://www.wolframalpha.com/input?i=729278429*2969842939
> 2 165 842 392 930 662 831
Your example seems short enough to not pose a problem.
Modern LLMs, just like everyone reading this, will instead reach for a calculator to perform such tasks. I can't do that in my head either, but a python script can so that's what any tool-using LLM will (and should) do.
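And the tool call in question really is trivial; Python integers are arbitrary precision, so the check is one line (numbers taken from the example linked above):

```python
# Verifying the product from the example above with exact integer math.
a, b = 729_278_429, 2_969_842_939
print(a * b)                               # 2165842392930662831
print(a * b == 2_165_842_392_930_662_831)  # True
```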
This is special pleading.
Long multiplication is a trivial form of reasoning that is taught at elementary level. Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens, all of its "thoughts" are on paper. That was Opus with extended reasoning, it had all the opportunity to get it right, but didn't. There are people who can quickly multiply such numbers in their head (I am not one of them).
LLMs don't reason.
I tried this with Claude - it has to be explicitly instructed to not make an external tool call, and it can get the right answer if asked to show its work long-form.
I assert that by your evidentiary standards humans don't reason.
Presumably one of us is wrong.
Therefore, humans don't reason.
Mathematics is not the only kind of reasoning, so your conclusion is false. The human brain also has compartments for different types of activities. Why shouldn't an AI be able to use tools to augment its intelligence?
> Furthermore, the LLM isn't doing things "in its head" - the headline feature of GPT LLMs is attention across all previous tokens, all of its "thoughts" are on paper
LOL, talk about special pleading. Whatever it takes to reshape the argument into one you can win, I guess...
> LLMs don't reason.
Let's see you do that multiplication in your head. Then, when you fail, we'll conclude you don't reason. Sound fair?
LLMs don't use tools. Systems that contain LLMs are programmed to use tools under certain circumstances.
I thought it might do better if I asked it to do long-form multiplication specifically rather than trying to vomit out an answer without any intermediate tokens. But surprisingly, I found it doesn't do much better.
This doesn’t address the author’s point about novelty at all. You don’t need 100% accuracy to have the capability to solve novel problems.
This hasn't been true for a while now.
I asked Gemini 3 Thinking to compute the multiplication "by hand." It showed its work and checked its answer by casting out nines and then by asking Python.
Sonnet 4.6 with Extended Thinking on also computed it correctly with the same prompt.
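For the curious, "casting out nines" is just a mod-9 consistency check, and both checks are easy to reproduce; a minimal Python sketch using the numbers from this thread (the claimed product is the Wolfram Alpha value quoted above):

    # Casting out nines: the product mod 9 must equal the product of the
    # factors' residues mod 9. Numbers are the ones quoted in this thread.
    a, b = 729_278_429, 2_969_842_939
    claimed = 2_165_842_392_930_662_831   # Wolfram Alpha answer quoted above

    assert (a % 9) * (b % 9) % 9 == claimed % 9   # necessary, not sufficient

    # The mod-9 check only catches some errors; the exact check is trivial here:
    assert a * b == claimed
    print("both checks pass")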
LLMs can generate anything by design. LLMs can't understand what they are generating, so it may be true, it may be wrong, it may be novel or it may be a known thing. It doesn't discern between them, just looks for the best statistical fit.
The core of the issue lies in our human language and our human assumptions. We humans have implicitly assigned the phrases "truly novel" and "solving unsolved math problem" a certain meaning in our heads. Some of us, at least, think that truly novel means something truly novel and important, something significant. Like, I don't know, finding a high-temperature superconductor formula or creating a new drug, etc. Something which involves real intelligent thinking and not randomizing possible solutions until one lands. But formally there can be a truly novel way to pack the most computer cables in a drawer, or a truly novel way to tie shoelaces, or indeed a truly novel way to solve some arbitrary math equation with enormous numbers. Those are formally novel things, but we really never needed any of that and so relegated these "issues" to the deepest backlog possible. Utilizing LLMs we can scour for solutions to many such problems, but they are not that impressive in the first place.
> It doesn't discern between them, just looks for the best statistical fit
Of course at the lowest level, LLMs are trained on next-token prediction, and on the surface, that looks like a statistics problem. But this is an incredibly reductionist viewpoint and I don't see how it makes any empirically testable predictions about their limits. LLMs 'learned' a lot of math and science in this way.
> "truly novel" and "solving unsolved math problem"
OK again if novelty lies on a continuum, where do you draw the line? And why is it correct to draw it there and not somewhere else? It seems like you are just naming exceptionally hard research problems.
> LLMs 'learned' a lot of math and science in this way.
Did they? Or is it begging the question?
This is why I put 'learned' in quotes. They started from a state of not being able to solve algebra problems or produce basic steps of scientific reasoning to being able to. Operationally, that is what I mean by learning and they unambiguously do it.
If LLMs can come up with formerly truly novel solutions to things, and you have a verification loop to ensure that they are actual proper solutions, I don't understand why you think they could never come up with solutions to impressive problems, especially considering the thread we are literally on right now? That seems like a pure assertion at this point that they will always be limited to coming up with truly novel solutions to uninteresting problems.
"Truly novel" is fast becoming a True Scotsman.
No True Novelty, No True Understanding, etc.
The problem with these bromides is not that they're wrong, it's that they're not even wrong. They're predictive nulls.
What observable differences can we expect between an entity with True Understanding and an entity without True Understanding? It's a theological question, not a scientific one.
I'm not an AI booster by any means, but I do strongly prefer we address the question of AI agent intelligence scientifically rather than theologically.
We've tested this in the small with AI art. When people believe they're viewing human-made art which is later revealed to be AI art, they feel disappointed. The actual content is incidental, the story that supports it is more important than the thing itself.
It's the same mechanism behind artisanal food, artist struggles, and luxury goods. It is the metaphysical properties we attach to objects or the frames we use to interpret strips of events. We author all of these and then promptly forget we've done so, instead believing they are simply reality.
> The actual content is incidental, the story that supports it is more important than the thing itself.
The actual content of a work of art is the expression of lived experience. Not its form.
There are already people dealing with AI intelligence scientifically. That's what benchmarks do.
It's the "it's just a stochastic parrot!" camp that's doing the theological work. (and maybe also those in the Singularity camp...)
That said, I do think there's value in having people understand what "Understanding" means, which is kinda a theological (philosophical :D) question. IMHO, in every-day language there's a functional part (that can be tested with benchmarks), and there's a subjective part (i.e. what does it feel like to understand something?). Most people without the appropriate training simply mix up these two things, and together with whatever insecurities they have with AI taking over the world (which IMHO is inevitable to some extent), they just express their strong opinions about it online...
Well said. That's exactly what has been rubbing me the wrong way with all those "LLMs can never *really* think, ya know" people. Once we pass some level of AI capability (which we perhaps already did?), it essentially turns into an unfalsifiable statement of faith.
Agreed. We should be asking what the machines measurably can or can't do. If it can't be measured, then it doesn't matter from an engineering standpoint. Does it have a soul? Can't measure it, so it doesn't matter.
That's a bit too pessimistic. Oftentimes you can productively find some measurable proxy for the thing you care about but can't measure. Turing's test is a famous example of that.
Sometimes you only have a one-sided proxy. Eg I can't tell you whether Claude has a soul, but I'm fairly sure my dishwasher ain't.
> Turing's test is a famous example
Ironically, the Turing test is the OG functionalist approach. The GP's comment basically sums up what the Turing test was designed for.
Yes, but I interpret Turing's paper not as saying "souls don't matter", but as "here's a good proxy that we can actually measure".
(I don't know what Turing's opinion on souls is, and it doesn't matter for that paper!)
Claude has neither a soul nor a warbleflupper.
It probably can, but it won't realize that and it won't be efficient at it. An LLM can shuffle tokens for an enormous number of tries and eventually come up with something super impressive, though as you yourself have mentioned, we would need a mandatory verification loop to filter slop from good output, and how to do that outside of some limited areas is a big question. But assuming we have these verification loops and are running LLMs for years to look for something novel: it's like running the energy grid of a small country to change a few dozen database entries per hour. Yes, we can do that, but it's a kind of weird thing to do. It is novel, no argument about that. Just inefficient.
We never had a big demand to define how humans are intelligent or conscious etc., since it is too hard and was relegated to some frontier researchers. With LLMs we now do have such demand, but the science wasn't ready. So we are all collectively searching in the dark, trying to define whether we are different from these programs, and if so, how. I certainly can't do that. I do know that LLMs are useful, but I also suspect that AI (aka AGI nowadays) is not yet reached.
How can people look at
- clear generalizability
- insane growth rates (go back and look at where we were maybe 2 years ago and then consider the already signed compute infrastructure deals coming online)
And still say with a straight face that this is some kind of parlor trick or monkeys with typewriters.
We don't need to run LLMs for years. The point is to look at where we are today and consider that performance gets 10x cheaper every year.
LLMs and agentic systems are clearly not monkeys with typewriters regurgitating training data. And they have grown, and continue to grow, in capabilities at extremely fast rates.
I was talking about the highest-difficulty problems only, in the scope of that comment. Sure, at mundane tasks they are useful and we are optimizing that constantly.
But for super hard tasks, there is no situation where you just dump a few papers for context, add a prompt, and the LLM spits out the correct answer. It's likely that a lead on such a project would need to additionally train the LLM on their local dataset, then parse through a lot of experimental data, then likely run multiple LLMs for many iterations homing in on the solution, verifying intermediate results, then repeat the cycle again and again. And the same would be done in parallel by other team members. All in all, for such a huge, hard task, a year of cumulative machine-hours is not something outlandish.
This is just not true. Maybe it will be true if you increase the problem difficulty in concert with model performance? You don't need fine tuning for this and you haven't for years now. Reasoning performance for now may be SOMEWHAT brittle but again look at where we have come from in like 2 years. Then also consider the logical next steps
- better context compression (already happening) + memory solutions that extend the effective context length [memory _is_ compression]
- continual learning systems (likely already prototyped)
- these domains are _verifiable_ which I think just seems to confuse people. RL in verifiable domains takes you farther and farther. Training data is a bootstrap to get to a starting point, because RL from scratch is too inefficient.
agents can already deal with large codebases and datasets, just like any SWE, DS or researcher.
and yes! If you throw more compute at a problem you will get better solutions! But you are missing the point: for the frontier solutions, which change with every model update, you of course need to eke out as much performance as you can, which requires a large amount of test-time compute. But what you can do _without_ this keeps improving. The pattern _already in place_ is that at first you need an extreme amount of compute, then the next model iterations need far less compute to reach that same solution, etc. The cost and compute required to perform a particular task decrease exponentially.
> We never had a big demand to define how humans are intelligent or conscious etc., since it is too hard and was relegated to some frontier researchers. With LLMs we now do have such demand, but the science wasn't ready. So we are all collectively searching in the dark, trying to define whether we are different from these programs, and if so, how. I certainly can't do that. I do know that LLMs are useful, but I also suspect that AI (aka AGI nowadays) is not yet reached.
Alternative perspective: the science may not have been ready, so instead we brute-forced the problem, through training of LLMs. Consider what the overall goal function of LLM training is: it's predicting tokens that continue given input in a way that makes sense to humans - in the fully general meaning of that statement.
It's a single training process that gives LLMs the ability to parse plain language - even if riddled with 1337-5p34k, typos, grammar errors, or mixing languages - and extract information from it, or act on it; it's the same single process that makes it equally good at writing code and poetry, at finding bugs in programs, inconsistencies in data, corruptions in images, possibly all at once. It's what makes LLMs good at lying and spotting lies, even if input is a tree of numbers.
(It's also why "hallucinations" and "prompt injection" are not bugs, but fundamental facets of what makes LLMs useful. They cannot and will not be "fixed", any more than you can "fix" humans to be immune to confabulation and manipulation. It's just the nature of fully general systems.)
All of that, and more, is encoded in this simple goal function: if a human looks at the output, will they say it's okay or nonsense? We just took that and threw a ton of compute at it.
> (It's also why "hallucinations" and "prompt injection" are not bugs, but fundamental facets of what makes LLMs useful. They cannot and will not be "fixed", any more than you can "fix" humans to be immune to confabulation and manipulation. It's just the nature of fully general systems.)
This is spot on and one of the reasons why I don't think putting LLMs or LLM based devices into anything that requires security is a good idea.
> It doesn't discern between them, just looks for the best statistical fit.
Why is this not true for humans?
We can't tell yet if that is true, partially true, or false for humans. We do know that an LLM can't do anything else besides that (I mean as a fundamental operating principle).
Why is it important? “Statistical fit” is what you want…not understanding this is indicative of a limited understanding of what statistics is. What do you think it means to truly understand something? I don’t get it: read probability theory by Jaynes. It doesn’t really matter if the brain does Bayesian updates but that’s what’s optimal…
"Statistical fit" to environment is arguably what all life does.
> Those are formally novel things, but we really never needed any of that
The history of science and maths is littered with seemingly useless discoveries being pivotal as people realised how they could be applied.
It's impossible to tell what we really "need"
> LLMs can't understand what they are generating
You don't understand what "understanding" means. I'm sure you can't explain it. You are probably just hallucinating the feeling of understanding it.
> Some of us at least, think that truly novel means something truly novel and important, something significant. Like, I don't know...
Yeah.
I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).
I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.
It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.
This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.
[1]: https://imgur.com/a/gWTGGYa
That's actually pretty cool. What made you think of doing this in the first place?
Thanks! I've been doing a lot of work on a laptop screen (I normally work on an ultrawide) and got tired of constantly switching between windows to find the information I need.
I've also added the ability to create a picture-in-picture section of any application window, so you can move a window to the background while still seeing its important content.
I'll probably do a Show HN at some point.
Was it a novel solution for you or for everyone? Because that's a pretty big difference. A lot of stuff that's novel to me would be something someone had been doing for decades somewhere.
Unless you worked on the macOS content server directly you’d have no idea that my solution was even possible.
That fact that Claude skipped over all the obvious solutions is why I used the word novel.
How confident are you that this knowledge was not part of the training data? Were there no Stack Overflow questions/replies with it, no tech forum posts, private knowledge bases, etc.?
Not trying to diminish its results, just that one should always assume LLMs have a rough memory of pretty much the whole of the internet/human knowledge. Google itself was very impressive back then in how it managed to dig out stuff that interested me (though it's no longer good at finding a single article with almost exact keywords...), and what makes LLMs especially great is that they combine that with some surface-level transformation to make that information fit the current, particular need.
Do you think AlphaGo is regurgitating human gameplay? No, it's not: it's learning an optimal policy based on self-play. That is essentially what you're seeing with agents. People have a very misguided understanding of the training process and the implications of RL in verifiable domains. That's why coding agents will certainly reach superhuman performance. Straw/steel man depending on what you believe: "But they won't be able to understand systems! But a good spec IS programming!" is also a bad take: agents absolutely can interact with humans, interpret vague desiderata, fill in the gaps, ask for direction. You are not going to need to write a spec the same way you need to today. It will be exactly like interacting with a very good programmer in EVERY sense of the word.
How does alphago come into picture? It works in a completely different way all together.
I'm not saying that LLMs can't solve new-ish problems not part of the training data, but they sure as hell didn't get some Apple-specific library call from divine revelation.
AlphaGo comes into the picture to explain that in fact coding agents in verifiable domains are absolutely trained in very similar ways.
It’s not magic they can’t access information that’s not available but they are not regurgitating or interpolating training data. That’s not what I’m saying. I’m saying: there is a misconception stemming from a limited understanding of how coding agents are trained that they somehow are limited by what’s in the training data or poorly interpolating that space. This may be true for some domains but not for coding or mathematics. AlphaGo is the right mental model here: RL in verifiable domains means your gradient steps are taking you in directions that are not limited by the quality or content of the training data that is used only because starting from scratch using RL is very inefficient. Human training data gives the models a more efficient starting point for RL.
Well said.
Why is ScreenCaptureKit a bad choice for performance?
Because you can't control what the content server is doing. SCK doesn't care if you only need a small section of a window: it performs multiple full window memory copies that aren't a problem for normal screen recorders... but for a utility like mine, the user needs to see the updated content in milliseconds.
Also, as I mentioned above, when using SCK, the user cannot minimize or maximize any "watched" window, which is, in most cases, a deal-breaker.
My solution runs at under 2% cpu utilization because I don't have to first receive the full window content. SCK was not designed for this use case at all.
What was the solution?
Well, I'm not going to share either solution as this is actually a pretty useful utility that I plan on releasing, but the short answer is: 1) don't use ScreenCaptureKit, and 2) take advantage of what CGWindowListCreateImage() offers through the content server. This is a simple IPC mechanism that does not trigger all the SCK limitations (i.e., no multi-space or multi-desktop support). In fact, when using SCK, the user cannot even minimize the "watched" window.
Claude realized those issues right from the start.
One of the trickiest parts is tracking the window content while the window is moving - the content server doesn't, natively, provide that information.
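(Not the commenter's solution, which by their account relies on undocumented APIs they aren't sharing; but for readers unfamiliar with the public call they name, here is roughly what a single-window grab via CGWindowListCreateImage() looks like, sketched from Python via pyobjc. The "Safari" target is just a placeholder.)

    # Rough illustration of the public CGWindowListCreateImage() path mentioned
    # above, via pyobjc (pip install pyobjc-framework-Quartz, macOS only).
    # This is NOT the undocumented-API approach described in the thread.
    import Quartz

    # Enumerate on-screen windows and pick one by owner name (placeholder app).
    windows = Quartz.CGWindowListCopyWindowInfo(
        Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)
    target = next(w for w in windows if w.get("kCGWindowOwnerName") == "Safari")
    window_id = target["kCGWindowNumber"]

    # Capture just that window's contents (full window bounds).
    image = Quartz.CGWindowListCreateImage(
        Quartz.CGRectNull,
        Quartz.kCGWindowListOptionIncludingWindow,
        window_id,
        Quartz.kCGWindowImageBoundsIgnoreFraming)
    print(Quartz.CGImageGetWidth(image), Quartz.CGImageGetHeight(image))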
Huh, Claude one-shotted it out of a single message from me. Man, LLMs have gotten good.
No it didn't. Like I said... it may have gotten something that worked but there is no way Claude got it to work while supporting multi-spaces, multi-desktops, and using under 2% cpu utilization. My solution can display app window content even when those windows are minimized, which is not something the content server supports.
My point was that Claude realized all the SCK problems and came up with a solution that 99% of macOS devs wouldn't even know existed.
> it may have gotten something that worked but there is no way Claude got it to work while supporting multi-spaces, multi-desktops, and using under 2% cpu utilization.
Maybe, but that's the magic of LLMs - they can now one-shot or few-shot (N<10) you something good enough for a specific user. Like, not supporting multi-desktops is fine if one doesn't use them (and if that changes, few more prompts about this particular issue - now the user actually knows specifically what they need - should close the gap).
And now it does.
Sorry, "now it does", what?
The things it didn't, which you then helpfully spelled out.
Do you believe my brief overview of the problem will help Claude identify the specific undocumented functions required for my solution? Is that how you think data gets fed back into models during training?
Yes. I don't think you appreciate just how much information your comments provide. You just told us (and Claude) what the interesting problems are, and confirmed both the existence of relevant undocumented functions, and that they are the right solution to those problems. What you didn't flag as interesting, and possible challenges you did not mention (such as these APIs being flaky, or restricted to Apple first-party use, or such) is even more telling.
Most hard problems are hard because of huge uncertainty around what's possible and how to get there. It's true for LLMs as much as it is for humans (and for the same reasons). Here, you gave solid answers to both, all but spelling out the solution.
ETA:
> Is that how you think data gets fed back into models during training?
No, one comment chain on a niche site is not enough.
It is, however, how the data gets fed into prompt, whether by user or autonomously (e.g. RAG).
> Yes. I don't think you appreciate just how much information your comments provide
Lol... no. You don't know how I solved the problem and you just read everything that Claude did.
Absolutely nothing in the key part of my solution uses a single public API (and there are thousands). And you think that Claude can just "figure that out" when my HN comments get fed back in during training?
I sincerely wish we'd see less /r/technology ridiculousness on HN.
I wonder how many 'ideas guys' will now think that with LLMs they can keep their precious to themselves while at the same time bragging about them in online fora. Before, they needed those pesky programmers negotiating for a slice of the pie, but this time it will be different.
Next up: copyright protection and/or patents on prompts. Mark my words.
I'm pretty sure a large fraction of the vibecoded stuff out there is from the "ideas guys." This time will be different because they'll find out very quickly whether their ideas are worth anything. The term "slop" substantially applies to the ideas themselves.
I don't think there will be copyright or patents on prompts per se, but I do think patents will become a lot more popular. With AI rewriting entire projects and products from scratch, copyright for software is meaningless, so patents are one of the very few moats left. Probably the only moat for the little guys.
It one-shotted what exactly?
Because LatencyKills is clearly describing a broader set of requirements related to their solution.
> 167,383 * 426,397 = 71,371,609,051 ... You need to say why it can do some novel tasks but could never do others.
Model interpretability gives us the answers. The reason LLMs can (almost) do new multiplication tasks is because it saw many multiplication problems in its training data, and it was cheaper to learn the compressed/abstract multiplication strategies and encode them as circuits in the network, rather than memorize the times tables up to some large N. This gives it the ability to approximate multiplication problems it hasn't seen before.
> This gives it the ability to approximate multiplication problems it hasn't seen before.
More than approximate. It straight up knows the algorithms and will do arbitrarily long multiplications correctly. (Within reason. Obviously it couldn't do a multiplication so large the reasoning tokens would exceed its context window.)
Having ChatGPT 5.4 do 1566168165163321561 * 115616131811365737 without tools, after multiplying out a lot of coefficients, it eventually answered 181074305022287409585376614708755457, which is correct.
At this point, it's less misleading to say it knows the algorithm.
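For concreteness, "the algorithm" in question is just grade-school long multiplication: digit-by-digit products with carries, the same procedure the "by hand" reasoning traces walk through. A quick Python sketch, checked against the product quoted at the top of the thread:

    # Grade-school long multiplication: digit-by-digit products with carries.
    def long_multiply(a: int, b: int) -> int:
        a_digits = [int(d) for d in str(a)][::-1]   # least-significant first
        b_digits = [int(d) for d in str(b)][::-1]
        result = [0] * (len(a_digits) + len(b_digits))
        for i, da in enumerate(a_digits):
            carry = 0
            for j, db in enumerate(b_digits):
                total = result[i + j] + da * db + carry
                result[i + j] = total % 10
                carry = total // 10
            result[i + len(b_digits)] += carry
        return int("".join(str(d) for d in reversed(result)))

    assert long_multiply(167_383, 426_397) == 71_371_609_051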
Why are we reducing AIs to LLMs?
Claude, OpenAI, etc.'s AIs are not just LLMs. If you ask it to multiply something, it's going to call a math library. Go feed it a thousand arithmetic problems and it'll get them 100% right.
The major AIs are a lot more than just LLMs. They have access to all sorts of systems they can call on. They can write code and execute it to get answers. Etc.
Yup, I agree with this. So based on this, where do you draw the line between what will be possible and what will not be possible?
Which is exactly how humans learn many things too.
E.g. observing a game being played to form an understanding of the rules, rather than reading the rulebook
Or: Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
> asserting that LLMs will never generate 'truly novel' ideas or problem solutions
I don't think I've had one of these my entire life. Truly novel ideas are exceptionally rare:
- Darwin's On the Origin of Species
- Gödel's incompleteness theorems
- Buddhist detachment
Can't think of many.
Most inventions are an interpolation of three existing ideas. These systems are very good at that.
My take as well. Furthermore, most innovations come relatively shortly after their technological prerequisites have been met, so that suggests the "novelty space" that humans generally explore is a relatively narrow band around the current frontier. Just as humans can search through this space, so too should machines be capable of it. It's not an infinitely unbounded search which humans are guided through by some manner of mystic soul or other supernatural forces.
Indeed. Every time someone complains that LLMs can't come up with anything new, I'm assaulted with the depressing remembrance that neither do I.
I can't even find a good example of an invention that is not an interpolation.
The inclined plane, the wheel, shall I keep going?
Stand on a fallen log on a hillside and you'll interpolate pretty hard.
People rarely create things that are wholly new.
Most created things are remixes of existing things.
Hallucinations are “something new”. And like most new things, useless. But the truth is the entire conversation is a hallucination. We just happen to agree that most of it is useful.
The hardest part about any creativity is hiding your influences
This is poetry.
I think "novel" is ill defined here, perhaps. LLMs do appear to be poor general reasoners[0], and it's unclear if they'll improve here.
It would be unintuitive for them to be good at this, given that we know exactly how they're implemented - by looking at text and then building a statistical model to predict the next token. From this, if we wanted to commit to LLMs having generalizable knowledge, we'd have to assume something like "general reasoning is an emergent property of statistical token generation", which I'm not totally against but I think that's something that warrants a good deal of evidence.
A single math problem being solved just isn't rising to that level of evidence for me. I think it is more on you to:
1. Provide a theory for how LLMs can do things that seemingly go beyond expectations based on their implementation (for example, saying that certain properties of reasoning are emergent or reduce to statistical constructs).
2. Provide evidence that supports your theory and ideally cannot be just as well accounted for by another theory.
I'm not sure if an LLM will never generate "novel" content because I'm not sure that "novel" is well defined. If novel means "new", of course they generate new content. If novel means "impressive", well I'm certainly impressed. If "novel" means "does not follow directly from what they were trained on", well I'm still skeptical of that. Even in this case, are we sure that the LLM wasn't trained on previous published works, potentially informal comments on some forum, etc, that could have steered it towards this? Are we sure that the gap was so large? Do we truly have countless counterexamples? Obviously this math problem being solved is not a rigorous study - the authors of this don't even have access to the training data, we'd need quite a bit more than this to form assumptions.
I'm willing to take a position here if you make a good case for it. I'm absolutely not opposed to the idea that other forms of reasoning can't reduce to statistical token generation, it just strikes me as unintuitive and so I'm going to need to hear something to compel me.
[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...
> I think "novel" is ill defined here
That's exactly my point. When people say "LLMs will never do something novel," they seem to be leaning on some vague, ill-defined notion of novelty. The burden of proof is then to specify what degree of novelty is unattainable and why.
As for evidence that they can do novel things, there is plenty:
1. I really did ask Gemini to multiply 167,383 * 426,397 before posting this question. It answered correctly.
2. SVGs of pelicans riding bicycles
3. People use LLMs to write new apps/code every day
4. LLMs have achieved gold-medal performance on Math Olympiad problems that were not publicly available
5. LLMs have solved open problems in physics and mathematics [0,1]
That is as far as they have advanced so far. What's next? Where is the limit? All I want to say is that I don't know, and neither do you :).
[0] https://www.reddit.com/r/Physics/comments/1n77h10/and_severa... (Mark Van Raamsdonk is a pretty famous researcher in high-energy physics, not just some random guy)
[1] https://mathstodon.xyz/@tao/115855840223258103
[2] https://news.ycombinator.com/item?id=47497757
Actually here's an even better list of progress on a number of open math problems, with plenty of caveats and exposition:
https://github.com/teorth/erdosproblems/wiki/AI-contribution...
This is great observational data but it's an early "step 1", I'd definitely need to see an actual analysis of these cases and likely want to have that analysis involve a review of relevant training data.
The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?
Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
> The “good deal of evidence” is everywhere. The proof is in the pudding.
I'm open! Please, by all means.
> the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark?
The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.
> Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means.
Okay.
> That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training).
This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.
As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?
> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventional analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.
So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?
> The “good deal of evidence” is everywhere. The proof is in the pudding.
> I'm open! Please, by all means.
Sure, here are but a few: [1] you get smooth gains in reasoning with more RL train-time compute and more test-time compute (o1)
[2] DeepSeek-R1 showed that RL on verifiable rewards produces behavior like backtracking, adaptation, reflection, etc.
[3] SWE-Bench is a relatively decent benchmark and perf here is continually improving — these are real GitHub issues in real repos
[4] MathArena — still good perf on uncontaminated 2025 AIME problems
[5] the entire field of reinforcement learning, plus successes in other fields with verifiable domains (e.g. AlphaGo); Bellman updates will give you optimal policies eventually
[6] Anthropic's cool work, effectively looking at the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/met... — if you trace internal circuits in Haiku 3.5 you see what you expect from a real reasoning system: planning ahead, using intermediate concepts, operating in a conceptual latent space (above tokens). And that's Haiku 3.5!!! We're on Opus 4.6 now…
people like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don't know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.
> The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.
Appeals to authority are a fine prior, but lo and behold I also have a PhD and have worked on and led benchmark development professionally for several years at an AI lab. That’s ultimately no reason to really trust either of us. As I said, the blog post rightfully decries benchmarks but it then presents a new benchmark as though that isn’t subject to all of the same problems. It’s a good article! I think they do a good job here! I agree with all of their complaints about benchmarks! It rightfully identifies failure modes, and there are plenty of other papers pointing out similar failure modes. Reasoning is still brittle, lots of areas where LLMs/agentic systems fail in ways that are incredible given their talent in other areas. But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”. This is just not true, but it is true that they are brittle and fallible in weird ways, today.
> This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.
"They do well, therefore intelligence" is not an argument, sure. But that’s also not what I’m saying. The Occam’s razor here is that reasoning-like computation is the best explanation for an increasing amount of the observed behavior, especially in fresh math and real software tasks where memorization is a much worse fit.
> As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?
I would encourage you to read Kuhn’s structure of scientific revolutions. "That’s how science is done" is a bit of an oversimplification of how the sausage is made here. Real science moves forward in a messy mix of partial theory + better measurements + interventions long before anyone has some sort of grand unified framework. Neuroscience is no different here. And I would say at this point with LLMs we now do have pretty decent tests: fresh verifiable-task evals, mechanistic circuit tracing, causal activation patching, and scaling results for RL/test-time compute. The claim that there is no framework + no real tests is just not true anymore. It’s not like we have some finished theory of reasoning, but thats a bit of an unfair demand at this point and is asymmetrical as well.
> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.
>> Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventional analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.
>> So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?
The model is: reasoning is not inherently human, it’s mathematical. It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.
What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventional analysis — we can't do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I've presented above? Also I'm confused by your last part. You say "brains are intelligent" ==> "intelligence is an emergent property of cells zapping" is absurd, but why? You start from the position that brains are intelligent, so why is this absurd within your argument? Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that's it? What's missing here?
Thanks, this is great and I'll have quite a bit to read here.
> people like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don't know how people can look at robust trends in performance improvement, combined with verifiable RL rewards, and not understand where things are going.
I don't think it's goal post moving to acknowledge improvements but still reject the conclusion that AI has reached a specific milestone if those improvements don't justify the position. I doubt anyone sensible is rejecting improvements.
> But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”.
I don't think I've ever made any definitive claims at all, quite the contrary - I've tried to express exactly how open I am to what you're saying. As I've said, I'm a functionalist, and I already am largely supportive of reductive intelligence, so I'm exactly the type of person who would be sympathetic to what you're saying.
> "That’s how science is done" is a bit of an oversimplification
Of course, but I don't think it's too much to ask to have a theory and evidence. I don't need a lined-up series of papers that all start with perfect syllogisms and then map to well-controlled RCTs or whatever. Just an "I think this accounts for it, here's how I support that".
> The claim that there is no framework + no real tests is just not true anymore.
I didn't say it wasn't true, to be clear, I asked for it. Again, I'm sympathetic to the view at a glance so I simply need a way to reason about it.
No need for a complete view, I'd never expect such a thing.
> The model is: reasoning is not inherently human, it’s mathematical.
Well, hand wringing perhaps, but I'd say it's maybe mathematical, computational, structural, functional, whatever - I think we're on the same page here regardless.
> It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.
Sure, but I grant that, in fact I believe it entirely. But that doesn't mean that every mathematical construct exhibits the function of intelligence.
> What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventional analysis — we can't do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I've presented above?
Sorry, I'm not fully understanding this framing. We can do those things with LLMs, and it's hard to say what would satisfy me. In general, I'd be satisfied with a theory that (a) accounts for the data, (b) has supporting evidence, and (c) does not contradict any major prior commitments. I don't think (c) will be an issue here.
> You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd,
Because intelligence could have been a property of our brains being wet, or roundish, or it could have been a property of our spines, or maybe some force we hadn't discovered, or a soul, etc. We formed a theory, it accounted for observations, we performed tests, we've modeled things, etc., and so the theories we've adopted have been extremely successful and I think hold up quite well. But certainly we didn't go "the brain has electricity, the brain is intelligent, therefore electricity in the brain is what drives intelligence".
> Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?
Certainly nothing on my world view.
Beliefs are not rooted in facts. Beliefs are a part of you, and people aren't all that happy to say "this LLM is better than me"
I'm very happy to say calculators are far better than me at calculations (to a given precision). I'm happy to admit computers are so much better than me in so many aspects. And I have no problem saying LLMs are very helpful tools able to generate output so much better than mine in almost every field of knowledge.
Yet, whenever I ask one to do something novel or creative, it falls very short. But humans are ingenious beasts and I'm sure sooner or later they will design an architecture able to be creative - I just doubt it will be Transformer-based, given the results so far.
But the question isn't whether you can get LLMs to do something novel, it's whether anyone can get them to do something novel. Apparently someone can, and the fact that you can't doesn't mean LLMs aren't good for that.
When it comes to LLMs doing novel things, is it just the infinite monkey theorem[0] playing out at an accelerated rate, helped along by the key presses not being truly random?
Surely if we tell the LLM to do enough stuff, something will look novel, but how much confirmation bias is at play? Tens of millions of people are using AI and the biggest complaint is hallucinations. From the LLMs perspective, is there any difference between a novel solution and a hallucination, other than dumb luck of the hallucination being right?
[0] https://en.wikipedia.org/wiki/Infinite_monkey_theorem
This argument doesn't go the way you want it to go. Billions of people exist, but maybe a few tens of thousands produce novel knowledge. That's a much worse rate than LLMs.
I’m not sure how we equate the number of humans to AI to determine a success rate.
We also can't ignore that it was humans who thought up this problem to give to the AI. Thinking has two parts, asking and answering questions. The AI needed the human to formulate and ask the question to start. AI isn't just dropping random discoveries on us that we haven't even thought of, at least not that I've seen.
To have a proper discussion we would have to define the word "novel", and that's a challenge in itself. In any case, millions of people have tried to ask LLMs to do something creative and the results were bland. Hence my conclusion that LLMs aren't good for that. But I'm also open to the idea that they could be an element of a longer chain that demonstrates some creativity - we'll see.
Novel is a tricky word. In this case, the LLM produced a Python program that was similar to other programs in its corpus, and this Python program generated examples of hypergraphs that hadn't been seen before.
That's a new result, but I don't know about novel. The technique was the same as earlier work in this vein. And it seems like not much computational power was needed at all. (The article mentions that an undergrad left a laptop running overnight to produce one of the previous results, that's absolute peanuts when compared to most computational research).
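To illustrate the general shape of that technique (a guess at the pattern, not the article's actual code): generate lots of random small objects and keep the ones a cheap checker accepts. The property below, a "linear" hypergraph where any two edges share at most one vertex, is just a stand-in for whatever the real search verifies:

    # Hedged sketch of generate-and-verify search over small hypergraphs.
    # The property checked here (linearity) is a placeholder, not the one from
    # the article; the point is the loop structure, which is the kind of thing
    # you can leave running on a laptop overnight.
    import random

    def random_hypergraph(n_vertices=12, n_edges=20, edge_size=3):
        return {tuple(sorted(random.sample(range(n_vertices), edge_size)))
                for _ in range(n_edges)}

    def is_linear(hypergraph):
        # Example property: every pair of edges shares at most one vertex.
        edges = list(hypergraph)
        return all(len(set(e1) & set(e2)) <= 1
                   for i, e1 in enumerate(edges) for e2 in edges[i + 1:])

    hits = [h for h in (random_hypergraph() for _ in range(100_000))
            if is_linear(h)]
    print(f"{len(hits)} candidates passed the check")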
I have never seen a human produce a Python program that wasn't similar to other programs they'd seen.
So? I certainly have.
Truly novel? All art is derivative.
If all art is derivative then the earlier statement is a tautology.
People still call things other people do novel. There's clear social proof that humans do things that other humans consider novel. Otherwise the word would probably not exist.
Just today I wrote a Python program that did not resemble anything I'd written before, nor had I seen anything similar. I had to reason it out myself. That passes the test that the original comment set.
Your threshold for "resemble" is obviously quite high, which is fair, but assuming that you're an encultured programmer, your Python code resembles other people's Python code. It might be doing something novel, but that thing it's doing is interacting with, in response to, or otherwise relative to existing concepts you learned or saw elsewhere. All art is derivative; we can do things other people haven't done before, but all of it derives from the works of others in some way.
Anyway, I've coded all kinds of wacky shit with claude that I guarantee nobody has implemented before, if only because they're stupid and tedious ideas. They can't all be winners, but they were novel, and yet claude code implemented them as confidently as if they were yet another note taking app. They have no problem handling novel ideas, and although the novel ideas in this case were my own, its easy to see how finding new ideas could be automated by exploring the combinatorial space of existing ideas.
It's not possible to know something without believing it to be true. https://en.wikipedia.org/wiki/Belief#/media/File:Classical_d...
This is objectively wrong. If that were the case, every scientist performing a test would always have had their expectations and beliefs proven true. If you were trying to disprove something because you believed it to be wrong, you would never be proven wrong.
re-read your post - it's just a bunch of nonsense, no actual reasoning in there
> e.g. 167,383 * 426,397 = 71,371,609,051
They may be wrong, but so are you.
No, it's correct:
https://www.google.com/search?q=167383+*+426397
You missed the point.
I missed it too, care to explain?
Not sure what you mean?
Can you elaborate?
You could have just checked the math yourself, you know.
My pocket calculator says the same thing and it doesn't even have training data.
Huh, casually flexing with a sentient pocket calculator!!!
It is like not trusting someone who attained the highest score in some exam by memorizing the whole textbook to do the corresponding job.
Not very hard to understand.
Yet we do that all the time by hiring based on GPA/degree.
Do you hire or screen based on them?
It's fear.
>> AI is a remixer; it remixes all known ideas together. It won't come up with new ideas
I always found this argument very weak. There isn't that much that's truly new anyway. Creativity is often about mixing old ideas. Computers can do that faster than humans if they have a good framework. Especially with something as simple as math - a limited set of formal rules and easy-to-verify results - I find the belief that computers won't beat humans at it to be very naive.
Do we know for a fact that LLMs aren't now configured to pass simple arithmetic like this to a calculator, to add the illusion of actual insight?
The major AIs have access to all sorts of tools, including a math library. I thought this was well-known. There's no "illusion of actual insight" - they're just "using a calculator" (in the sense that they call a math library when needed). AIs are not just LLMs.
You can train an LLM on just multiplication and test it on multiplications it has never seen before; it's nothing particularly magical.
It's not 'magic', but LLMs have previously performed very badly on longer multiplication. 'Insight' is the wrong word, but I'm saying maybe they're not wildly better at this calculation... maybe they are just optimizing these well-known jagged edges.
Ximm's Law applies ITT: every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
Especially the lemmas:
- any statement about AI which uses the word "never" to preclude some feature from future realization is false.
- contemporary implementations have almost always already been improved upon, but are unevenly distributed.
Anti-Ximm's Law: every response to a critique of AI assumes as much arbitrary level of future improvement as is necessary to make the case.
When I read through what they're doing, it sure doesn't sound like it's generating something new as people typically think of it. In the link, they provide a very well-defined problem and they just loop through it.
I think you're arguing with semantics.
Yes! I call these the "it's just a stochastic parrot" crowd.
Ironically, they are the stochastic parrots, because they're confidently repeating something that they read somewhere and haven't examined critically.
That would not be stochastic, just parroting
It's not deterministic therefore it's stochastic.
I guess when it can't be tripped up by simple things like multiplying numbers, counting to 100 sequentially or counting letters in a string without writing a python program, then I might believe it.
Also no matter how many math problems it solves it still gets lost in a codebase
LLMs are bad at arithmetic and counting by design. It's an intentional tradeoff that makes them better at language and reasoning tasks.
If anybody really wanted a model that could multiply and count letters in words, they could just train one with a tokenizer and training data suited to those tasks. That model would then be able to count letters, but it would be bad at things like translation and programming - the stuff people actually use LLMs for. So people train with tokenizers and training data suited to language tasks, and hence LLMs are good at language and bad at arithmetic.
Arguments like "but AI cannot reliably multiply numbers" fundamentally misunderstand how AI works. AI cannot do basic math not because AI is stupid, but because basic math is an inherently difficult task for otherwise smart AI. Lots of human adults can do complex abstract thinking but when you ask them to count it's "one... two... three... five... wait I got lost".
> fundamentally misunderstand how AI works
Who does fundamentally understand how LLMs work? Many claims flying around these days, all backed by some of the largest investments ever collectively made by humans. Lots of money to be lost because of fundamental misunderstandings.
Personally, I find that AI influencers conveniently brush away any evidence (like inability to perform basic arithmetic) about how LLMs fundamentally work as something that should be ignored in favor of results like TFA.
Do LLMs have utility? Undoubtedly. But it’s a giant red flag for me that their fundamental limitations, of which there are many, are verboten to be spoken about.
You're not doing yourself a favor when you point out "but they can't do arithmetic!" as if anyone says otherwise. Yes, we all know they can't do arithmetic, and that's just how they work.
I feel like I'm saying "this hammer is so cool, it's made driving nails a breeze" and people go "but it can't screw screws in! Why won't anyone talk about that! Hammers really aren't all they're cracked up to be".
Maybe because society has invested $trillions into this hammer and influencers are trying to convince CEOs to fire everyone and buy a bunch of hammers instead.
My comment even said “LLMs have utility”. I gave an inch, and now the mile must be taken.
Saying that the fundamental limitations are things like counting the number of rs in strawberry is boring, though. That's how tokens work and it's trivial to work around.
Talking about how they find it hard to say they aren't sure of something is a much more interesting limitation to talk about, for example.
> Talking about how they find it hard to say they aren't sure of something is a much more interesting limitation to talk about, for example.
Sure, thank you for steelmanning my argument. I didn’t think I needed to actually spell out all of the fundamental limitations of LLMs in this specific thread. They are spoken at length across the web, but are often met with pushback, which was my entire point.
Here's another one: LLMs do not have a memory property. Shut off the power and turn it back on and you lose all context. Any "memory" feature implemented by companies that sell LLM wrappers is a hack on top of how LLMs work, like seeding a context window before letting the user interact with the LLM.
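To spell out that hack, here's a minimal sketch of how such a memory feature typically works: persisted notes are written to disk and prepended to the next conversation's prompt. The file name is arbitrary and the call_llm client at the end is hypothetical:

    # Sketch of "memory" as context seeding: facts persisted between sessions
    # are just prepended to the prompt; the model's weights never change.
    import json, pathlib

    MEMORY_FILE = pathlib.Path("memory.json")

    def remember(fact: str) -> None:
        facts = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
        facts.append(fact)
        MEMORY_FILE.write_text(json.dumps(facts))

    def build_prompt(user_message: str) -> str:
        facts = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
        notes = "\n".join(f"- {fact}" for fact in facts)
        return f"Known facts about the user:\n{notes}\n\nUser: {user_message}"

    remember("prefers vegetarian recipes")
    print(build_prompt("What should I cook tonight?"))
    # call_llm(build_prompt(...))  # hypothetical client call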
But that's also like saying "humans don't have a memory property, any 'memory' is in the hippocampus". It's not useful to say that "an LLM you don't bother to keep training has no memory". Of course it doesn't, you removed its ability to form new memories!
So why then do we stop training LLMs and keep them stored at a specific state? Is it perhaps because the results become terrible and LLMs have a delicate optimal state for general use? This sounds like an even worse case for a model of intelligence.
Nope, it's not that, but it's nice of you to offer a straw man. Makes the argument flow better.
Not entirely a straw man. What is the purpose of storing and retrieving LLMs at a fixed state if not to guarantee a specific performance? Wouldn’t a strong model of intelligence be capable of, to extend your analogy, running without having its hippocampus lobotomized?
Given the precariousness of managing LLM context windows, I don’t think it’s particularly unfair to assume that LLMs that learn without limit become very unstable.
To steelman, if it’s possible, it may be prohibitively expensive. But somehow I doubt it’s possible.
It is, indeed, prohibitively expensive. But it's not impossible. The proof is in the fact that you can fine-tune LLMs.
Because no one owns a $300 billion hammer that literally runs on fancy calculators.
Ok, I'll bite. Show me an LLM that comes up with a new math operator. Or one that comes up with the theory of relativity if only Newtonian physics is in its training dataset. That it can remix existing ideas in ways that lead to useful results is expected; however, current LLMs can't come up with the paradigm shifts that require genuinely novel insights. Even humans have a rather limited window in which they can come up with novel insights (when they are young, capable of lateral thinking, not yet ossified by the existing formalization of science, and their brains are still energetically capable, without the vascular and mitochondrial dysfunction common as we age).
How many humans have been born until now and how many Einsteins have been born? And in how many hundreds of thousands of years?
The point is that humans do have some edge compared to current LLMs, which are essentially next-token predictors. If we all start relying on current AI and stop thinking, we would only be able to "exhaust the remix space" of existing ideas but wouldn't be able to make any paradigm jumps. Moreover, it's quite likely that current training sets are self-contradictory, containing Dutch books and carrying some innate error in them.
It takes a lot of intelligence to "essentially predict" next token when you're doing a math proof. Or writing code.
I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
It's this pervasive belief that underlies so much discussion around what it means to be intelligent. The null hypothesis goes out the window.
People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
If they do, they apply it in only the most restrictive way imaginable, some 2 dimensional caricature of reality, rather than considering all the ways that humans try and fail in all things throughout their lifetimes in the process of learning and discovery.
There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
The ability to learn and infer without absorbing millions of books and all the text on the internet really does make us special. And only at 20 watts!
Just an interesting thought experiment: if you took all the sensory information that a child experiences through their senses (sight, hearing, smell, touch, taste) between, say, birth and age five, how many books worth of data would that be? I asked Claude, and their estimate was about 200 million books. Maybe that number is off by an order of magnitude in either direction. ...but then again Claude is only three years old, not five.
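For what it's worth, a back-of-envelope version of that estimate is easy to write down - every number below is an assumption, not a measurement, which is exactly why the answer can swing by an order of magnitude:

    # Back-of-envelope only; all constants here are assumptions.
    seconds_awake_per_day = 12 * 3600
    days = 5 * 365
    bytes_per_second = 1_000_000   # assumed effective multi-sensory input rate
    bytes_per_book = 1_000_000     # roughly a few hundred pages of plain text

    total_bytes = seconds_awake_per_day * days * bytes_per_second
    print(total_bytes / bytes_per_book)  # ~79 million "books" under these assumptions

Nudge the assumed input rate up a few-fold and you land right around Claude's 200 million.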
Last I checked humans didn't pop into existence doing that. It happened after billions of years of brute force, trial and error evolution. So well done for falling into the exact same trap the OP cautions against. Intelligence from scratch requires a mind-boggling amount of resources, and humans were no different.
And then an 18-to-20-something-year training run is required for each individual instance.
I know right, such a waste. Plus it's so random on how they will turn out!
Any suggestions on how to reduce that waste?
To be fair, it is still pretty remarkable what the human brain does, especially in early years - there is no text embedded in the brain, just a crazily efficient mechanism to learn hierarchical systems. As far as I know, AI intelligence cannot do anything similar to this - it generally relies on giga-scaling, or finetuning on tasks similar to those it already knows. Regardless of how this arose, or if it's relevant to AGI, this is still a uniqueness of sorts.
Human babies "train" their brain on literally gigabytes of multi-modal data dumped on them through all their sensory organs every second.
In a very real sense, our magic superpower is that we "giga-scale" with such low resource consumption, especially considering how large (in terms of parameters) the brain is compared to even the most advanced models we have running on those thousands of GPUs today. But that's where all those millions of years of evolution pay off. Don't diss the wetware!
How is that relevant? The human brain we're comparing is the one at the point of birth (or some time before that), and we compare that with an LLM doing inference. The training part is irrelevant, in the same way the human brain's evolution is.
Do you think evolutionary pressures are the best explanation for why humans were able to posit the Poincaré conjecture and solve it? While our mental architecture evolved over a very long time, we still learn from minuscule amounts of data compared to LLMs.
Yeah. What else would it be ? A brain capable of doing that was clearly the result of evolutionary pressures.
But there is no evolutionary pressure for the Poincaré conjecture, we were never optimized for that in particular, unlike these kinds of LLMs.
We were optimized to rapidly adapt to changing environments by solving the problems that arise through tool-making and cooperation in complex multi-stage tasks (like say hunting that mammoth to make clothing out of it). It turns out that the cheapest evolutionary pathway to get there has some interesting emergent phenomena.
Of course it is evolution. What else could it be?
Now multiply that by 7 billion to distill the one who will solve a frontier math problem.
We have a tremendous amount of raw information flowing through our brains 24/7 from before we are born, from the external world through all our senses and from within our minds as it attempts to make sense of that information, make predictions, generally reason about our existence, hallucinate alternative realities, etc. etc.
If you were able to somehow capture all that information in full detail as you've had access to by the age of say 25, it would likely dwarf the amount of information in millions of books by several orders of magnitude.
When you are 25 years old and are presented with a strange-looking ball and told to throw it into a strange-looking basket for the first time, you are relying on an unfathomable amount of information turned into knowledge and countless prior experiments that you've accumulated/exercised to that point relating to the way your body and the world works.
Humans are "multi-modal". Sure we get plenty of non-textual information, but LLMs were trained on basically every human-written word ever. They definitely see many orders of magnitude more language than any human has ever seen. And yet humans get fluent after 3+ years.
If you treat the human brain as a model, and account for the full complexity of neurons (one neuron != one parameter!) it has several orders of magnitude more parameters than any LLM we've made to date, so it shouldn't come as a surprise.
What is surprising is that our brain, as complex as it is, can train so fast on such a meager energy budget.
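Even the crudest back-of-envelope comparison shows the gap. The figures below are the commonly cited round numbers, and treating one synapse as one parameter already understates the brain's complexity:

    # Commonly cited round numbers; all of this is order-of-magnitude hand-waving.
    neurons = 8.6e10      # ~86 billion neurons
    synapses = 1e14       # ~100 trillion synapses (and a synapse is richer than one weight)
    llm_params = 1e12     # on the order of the largest frontier models (rough assumption)

    print(synapses / llm_params)  # ~100x even under a one-synapse-one-parameter count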
For sure, it seems like there's something there primed to pick up human language quickly, clearly evolutionarily driven.
Not necessarily so for the dynamics of magnetic fields, or nonhuman animal communications, or dark energy/matter.
We are bombarded nonstop by magnetic fields, nonhuman animal communications, and live in a universe which seems to be majority dominated by dark energy and matter, and yet understand little to none of it all.
To be fair, the knowledge embedded in an LLM is also, at this point, a couple orders of magnitude (at least) larger than what the average human being can retain. So it's not like all those books and text in the internet are used just to bring them to our level, they go way beyond.
Most people have absorbed way too few books to be able to infer properly. Hell, most people are confused by TV remotes.
It's only because humans came up with a problem, worked with the ai and verified the result that this achievement means anything at all. An ai "checking its own work" is practically irrelevant when they all seem to go back and forth on whether you need the car at the carwash to wash the car. Undoubtedly people have been passing this set of problems to ai's for months or years and have gotten back either incorrect results or results they didn't understand, but either way, a human confirmation is required. Ai hasn't presented any novel problems, other than the multitudes of social problems described elsewhere. Ai doesn't pursue its own goals and wouldn't know whether they've "actually been achieved".
This is to say nothing of the cost of this small but remarkable advance. Trillions of dollars in training and inference and so far we have a couple minor (trivial?) math solutions. I'm sure if someone had bothered funding a few PhDs for a year we could have found this without ai.
>It's only because humans came up with a problem, worked with the ai and verified the result that this achievement means anything at all.
Replace ai with human here and that's...just how collaborative research works lol.
Funding a few PhDs for a year costs orders of magnitude more than it did to solve this problem in inference costs. Also, this has been active research for some time. Or I guess the people working on it are just not as good as a random bunch of students? It's amazing the lengths that people go to maintain their worldview, even if it means belittling hardworking people.
I take it you're not a mathematician. This is an achievement, regardless of whether you like LLMs or not, so let's not belittle the people working on these kinds of problems please.
>It's amazing the lengths that people go to maintain their worldview, even if it means belittling hardworking people.
This is the most baffling and ironic aspect of these discussions. Human exceptionalism is what drives these arguments, but the machines are becoming so good you can no longer do this without putting down even the top-percenter humans in the process. Same thing happening all over this thread (https://news.ycombinator.com/item?id=47006594). And it's like they don't even realize it.
> Funding a few PhDs for a year costs orders of magnitude more than it did to solve this problem in inference costs.
I don't think PhD students are sitting around and solving one problem for a year. Also PhD students are way cheaper
How many math PhD students do you have? If you set the problem right, something like this per year on average is a good pace.
How are they cheaper? Your average grant where I am can pay for a couple of PhD students. I could afford to pay for inference costs out of my own salary, no grant needed. Completely different economic scales here. I like students better of course, but funding is drying up these days.
I was saying generally. I don't work in maths. PhD students do lots of other things than research. If we ask a PhD student to just solve these kinds of problems and nothing else, the student would do it without much difficulty.
I guess it's different in somewhere like Europe. But in Canada, most of the PhD students are paid for doing TAships, not primarily through grant. Average salary is 25k/year. Take 6-10k out for tuition, that's 15-19k/year. You get a student doing so many things for less pay. I guess, if your job only requires research then you can do it.
Inference costs are heavily subsidised. My point was that we've spent trillions collectively on ai, and so far we have a few new proofs. It's been active research, but it's estimated that only 5-10 people are even aware that it is a problem. I wrote "math phd's" not "random students", but regardless, I don't know how you interpreted my statement that people could have discovered this without ai as "belittling the people working on this". You seem like a stupid person with an out of control chatbot that can't comprehend basic arguments.
> You seem like a stupid person
And now you're belittling me. Yeah, good one, that'll convince people.
> out of control chatbot that can't comprehend basic arguments
I don't see how it is out of control. It is a tool. It is being used for a job. For low-level jobs it often succeeds. For tougher jobs, it is succeeding sufficiently often to be interesting. I don't care if it understands worldview semantics, that's for humans to do.
> we've spent trillions collectively on ai
The economics around AI do not suggest that continuing to perform large training runs is sustainable. That's also not relevant to the discussion. Once the training is done, further costs are purely on inference, and that is the comparison I was making.
> Inference costs are heavily subsidised
Even if you pay to run inference on your own hardware, economics of scale dictate that it is still cheaper than students.
> It's been active research, but it's estimated that only 5-10 people are even aware that it is a problem.
That sounds about right for most pure math problems. Were you expecting more?
Let's not pretend that society would have invested that kind of money into pure mathematics research. It is extraordinarily difficult to get funding for that kind of work in most parts of the world. Mathematicians are relatively cheap, yes, but the money coming into AI was from blind VCs with a sense of grandeur. It wasn't to do maths research. If it's here anyway, and causing nightmares for actually teaching new students, may as well try to make some good of it. It has only recently crossed the edge of being useful. Most researchers I know are only now starting to consider it, mostly as a search engine, but some for proof assistance. Experiences a year ago were highly negative. They're a lot more positive now.
I'm trying to give a perspective from someone who actually does do math research at a senior level, who actually does have a half dozen math PhD students to supervise, to say that your blind attitude toward this is not sensible or helpful. Your comments about the problem being trivial do belittle the actual effort people have put into the problem without success. If they could easily have discovered this without AI, they would have already done so. Researchers do not have unlimited time and there are many more problems than students, especially good ones (hence my random comment).
>> we've spent trillions
Source? This sounds like hyperbole. The entire US GDP is low tens of trillions.
From various online estimates, I would put global ai spend just since 2020 at $2T. Some projections estimate that we might spend that per year starting next year. To the extent that many of these projects will be cancelled or shelved, capital is beginning to take stock of the feasibility of clawing back even the original investments. openai is apparently doubling its staff, but whether these are sales or (prompt?) engineering jobs, the biggest hypemongers are themselves unable to reduce headcount even with unlimited "at-cost" ai inference.
Comparing total ai spend to the value added of producing a few new maths/sciences proofs is unfair since ai is doing more than maths proofs, but for comparison one can estimate the total spent to date on mathematicians and associated costs (buildings, experiments etc). I would very roughly estimate that the total cost of all mathematics to date since 1600 is less than what we've spent on ai to date, and the results from investment in mathematicians are incomparable to a few derivative extensions of well-established ideas. For less than a few trillion we have all of mathematics. For an additional 2T dollars, we have trivial advancements that no one really cares about.
The only things moving faster than AI are the goalposts in conversations like this. Now we're at "sure, AI can solve novel problems, but it can't come up with the problems themselves on its own!"
I'm curious to see what the next goalpost position is.
> I'm curious to see what the next goalpost position is.
I am as well. That's the point. Ai can do some things well and other things better than humans, but so can a garden hose and all technology. Is ai just a tool or is it the future of all work? By setting goalposts we can see whether or not it is living up to the hype that we're collectively spending trillions on.
The garden hose manufacturers aren't claiming that they're going to replace all human workers, so we don't set those kinds of goalposts to measure whether it's doing that.
> I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
Because, empirically, we have numerous unique and differentiable qualities, obviously. Plenty of effort goes into understanding this; we have the young but rigorous fields of neuroscience and cognitive science.
Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> People constantly make comments like "well it's just trying a bunch of stuff until something works" and it seems that they do not pause for a moment to consider whether or not that also applies to humans.
I frankly doubt it applies to either system.
I'm a functionalist so I obviously believe that everything a human brain does is physical and could be replicated using some other material that can exhibit the necessary functions. But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
>But that does not mean that I have to think that the appearance of intelligence always is intelligence, or that an LLM/ Agent is doing what humans do.
You can think whatever you want, but an untestable distinction is an imaginary one.
First of all, that's not true. Not every position has to be empirically justified. I can reason about a position in all sorts of ways without testing. Here's an obvious example that requires no test at all:
1. Functional properties seem to arise from structural properties
2. Brains and LLMs have radically different structural properties
3. Two constructs with radically, fundamentally different structural properties are less likely to have identical functional properties
Therefore, my confidence in the belief that brains and LLMs should have identical functional properties is lowered by some amount, perhaps even just ever so slightly.
Not something I feel like fleshing out or defending, just an example of how I could reason about a position without testing it.
Second, I never said it wasn't testable.
No, but it does mean that you should know we don't understand what intelligence is, and that maybe LLMs are actually intelligent and humans have the appearance of intelligence, for all we know.
You're just defining intelligence as "undefined", which okay, now anything is anything. What is the point of that?
Indeed, there's quite a lot of work that's been done on what these terms mean. The fields of neuroscience and cognitive science have contributed a lot to the area, and obviously there are major areas of philosophy that discuss how we should frame the conversation or seek to answer questions.
We have more than enough, trivially, to say that human intelligence is distinct, so long as we take on basic assertions like "intelligence is related to brain structures" since we know a lot about brain structures.
Our intelligence is related to brain structures, not all intelligence. You can't get to things like "what all intelligence, in general, is" from "what our intelligence is" any more than you can say that all food must necessarily be meat because sausages exist.
But... we're talking about our intelligence. So obviously it's quite relevant. I didn't say that AI isn't intelligent, I said that we have good reason to believe that our intelligence is unique. And we do, a lot of good evidence.
I obviously don't believe that all intelligence is related to specific brain structure. Again, I'm a functionalist, so I believe that any structure that can exhibit the necessary functions would be equivalent in regards to intelligence.
None of this would commit me to (a) human exceptionalism (b) LLMs/ Agents being intelligent (c) LLMs/ Agents being intelligent in the way that humans are.
This is too dependent on what you mean by "unique", though. What do we have that apes don't, and which directly enables intelligence? What do we have that LLMs don't? What do LLMs have that we don't?
I don't think we know enough to definitively say "it's this bit that gives us intelligence, and there's no way to have intelligence without it". We just see what we have, and what animals lack, and we say "well it's probably some of these things maybe".
> What do we have that apes don't, and which directly enables intelligence?
Again, there are multiple fields of study with tons of amazingly detailed answers to this. We know about specific proteins, specific brain structures, we know about specific cognitive capabilities in the abstract, etc.
> What do we have that LLMs don't?
Again, quite a lot is already known about this.
This feels a bit like you're starting to explore this area and you're realizing that intelligence is complex, but you may not realize that others have already been doing this work and we have a litany of information on the topic. There are big open questions, of course, but we're well past the point of merely being able to say "there is a difference between human and ape intelligence" etc.
It'd probably be more productive for you to actually back up your claims with these things we know from neuroscience, rather than just stating that we know things, and so therefore you're right. What do we know?
EDIT: can't reply, so I'll just update here:
You're arguing that the mechanism that produces human intelligence is unique, so therefore the intelligence itself is somehow fundamentally different from the intelligence an LLM can produce. You haven't shown that, you just keep saying we know it's true. How do we know?
I don't need to do that unless you think that neurons interact exactly the way that LLMs do? That said, we have detailed, microscopic models of neurons, the ability to even simulate brain activity, intervention studies where we can make predictions, interact with brains in various ways, and then validate against predictions, we have cognitive benchmarks that we can apply to different animals or animals in different stages of development that we can then tie to specific brain states and brain development, etc.
So we're in a very good position to say quite a lot about the brain, an incredible amount really. And that puts us in a very good position to say that our brain is very different from other animal brains, and certainly in a very good position to say that's very different from an LLM.
Now, you can argue that an LLM is functionally equivalent to the brain, but given that it's so structurally distinct, and seemingly functions in a radically different way due to the nature of that structure, I'd put it on you to draw symmetries and provide evidence of that symmetry.
I'm following this mini-thread with interest but I've arrived here and I confess, I don't really know what your argument is.
I think this all stems from you objecting to this statement:
"I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique."
I think you're being uncharitable in how you interpret that. Humans are unique in the most literal reading of this sentence; we don't have anything else like humans. But the context is the ability to reason and people denying that a machine is reasoning, even though it looks like reasoning.
They're shocked that people believe that humans are unique. I explained why that shouldn't be shocking. I think I was pretty charitable here, I gave an alternative option for what they could mean in my very first reply:
> Unless you mean "fundamentally unique" in some way that would persist - like "nothing could ever do what humans do".
> I don't really know what your argument is.
I just said that I think that we have very good reasons for believing that human cognition is unique. The response was seemingly that we don't have enough of an understanding of intelligence to make that judgment. I've stated that I think we do have enough of an understanding of intelligence to make that judgment, and I've appealed to the many advances in relevant fields.
I still think you're being far too literal, which doesn't make for an interesting conversation.
I'm open to hearing how you think I should be interpreting things. I don't really think I'm being too literal, it certainly hasn't been the case that they've suggested my interpretation is wrong, and I've provided two interpretations (one that I totally grant).
What's the better interpretation of their position?
Re: "I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique."
Perhaps this might better help you understand why this assumption still holds: https://en.wikipedia.org/wiki/Orchestrated_objective_reducti...
"Controversial theory justifies assumption". Because humans never hallucinate.
It doesn't. I actually completely reject that theory, and it's nice to see that Wikipedia notes that it is "controversial". There are extremely good reasons to reject this theory. For one thing, any quantum effects are going to be quite tiny/trivial because the brain is too large, hot, wet, etc., to see larger effects, so you have to somehow make a leap from "tiny effects that last for no time at all" to "this matters fundamentally in some massive way".
It likely requires rejection of functionalism, or the acceptance that quantum states are required for certain functions. Both of those are heavy commitments with the latter implying that there are either functions that require structures that can't be instantiated without quantum effects or functions that can't be emulated without quantum effects, both of which seem extremely unlikely to me.
Probably the far more important reason: it doesn't solve any problem. It's just "quantum woo, therefore libertarian free will" most of the time.
It's mostly garbage, maybe a tiny tiny bit of interesting stuff in there.
It also would do nothing to indicate that human intelligence is unique.
It is not the assumption that humans are unique. It is that statistical models cannot really think outside the box most of the time.
And you know that humans aren't statistical models how?
because they would be more logical
Touche.
> I don't know why I am still perpetually shocked that the default assumption is that humans are somehow unique.
Uh, because up until and including now, we are...?
Every living thing on Earth is unique. Every rock is unique in virtually infinite ways from the next otherwise identical rock.
There are also a tremendous number of similarities between all living things and between rocks (and between rocks and living things).
Most ways in which things are unique are arguably uninteresting.
The default mode, the null hypothesis should be to assume that human intelligence isn't interestingly unique unless it can be proven otherwise.
In these repeated discussions around AI, there is criticism over the way an AI solves a problem, without any actual critical thought about the way humans solve problems.
The latter is left up to the assumption that "of course humans do X differently" and if you press you invariably end up at something couched in a vague mysticism about our inner-workings.
Humans apparently create something from nothing, without the recombination of any prior knowledge or outside information, and they get it right on the first try. Through what, divine inspiration from the God who made us and only us in His image?
Humans are obviously unique in an interesting way. People only "move the goalpost" because it's not an interesting question that humans can do some great stuff, the interesting question is where the boundary is. (Whether against animals or AI).
Some example goals which make humans trivially superior (in terms of intelligence): the invention of nuclear bombs/plants, the theory of relativity, etc.
But that's unique in the sense of "you have a bag of ten apples and I have a bag of eleven apples, therefore my bag is unique". It's not qualitatively different intelligence than a dog's, you just have more of it.
I would argue that point. The biological components are the same, but emergent behavior is a thing. So both the scale and the number of connections/way they connect have surpassed some limit after which cognitive capabilities increased severalfold to the point that humans "took over the world".
And arguably further increase in intelligence seems to fall into a diminishing returns category, compared to this previous boom. (Someone being "2x smarter" doesn't give them enough benefit of reigning over others, at least history would look otherwise were it the case, in my opinion)
Probably a dumb example, but just by increasing speed you go from well-behaved laminar flow to turbulence, yet it's fundamentally the same fluid a level beneath.
Yeah, I don't know that there's such a jump. Dogs, for example, clearly communicate, both with us and with each other. They don't have language, but they also don't lack communication skills. To me, language is just "better communication" rather than a qualitatively different thing.
You may want to watch this video: https://youtu.be/e7wFotDKEF4?is=bl5TPvk9_mdnG3Om
Human language is way beyond the communication animals show. We don't really know where the exact boundary is, but again, the difference is significant and not just "scaled up".
I doubt you can even define intelligence sufficiently to argue this point, since that's an ongoing debate without a resolution thus far.
But you claimed that humans aren't unique. I think it's pretty obvious we are on many dimensions including what you might classify as "intelligence". You don't even necessarily have to believe in a "soul" or something like that, although many people do. The capabilities of a human far surpass every single AI to date, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
> There's still this seeming belief in magic and human exceptionalism, deeply held, even in communities that otherwise tend to revolve around the sciences and the empirical.
Do you ever wonder why that is? I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
It's very telling that you put "materialist" and "anti-human" in the same bucket.
> I doubt you can even define intelligence sufficiently to argue this point.
Agreed.
> But you claimed that humans aren't unique.
I'm arguing that it is up to us to prove that they are interestingly unique in the context of this post. Which is pretty narrow - how do we solve problems?
The theme I was arguing against that I've seen repeated throughout this thread is that AIs are just recombining things they've absorbed and throwing those recombinations at the wall until they see what sticks.
It raises the question of why we presume that humans do things any differently, when it seems quite clear that we can only ever possibly do the same, unless we are claiming that knowledge of the universe can enter the human mind through some means other than through the known senses.
Not at all disputing that humans possess many capabilities that AIs do not.
> Do you ever wonder why that is? I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
I touched on this elsewhere, will go ahead and paste it here again:
The fundamental thing I'm speaking out against is the arrogance of human exceptionalism.
This whole debate about what it means to be intelligent or human just seems like we're making the same mistakes we've made over and over.
Earth as the center of the universe, sun as the center of the universe, man as the only animal with consciousness and intellect, the anthropomorphic nature of the majority of the deities in our religions and the anthropocentric purpose of the universe within those religions...
I think this desire to believe that we are special, that the universe in some way does ultimately revolve around us, is seemingly a deep need in our psyche but any material analysis of our universe shows that it is extremely unlikely that we hold that position.
>The capabilities of a human far surpass every single AI to date
What does this mean ? Are you saying every human could have achieved this result ? Or this ? https://openai.com/index/new-result-theoretical-physics/
because well, you'd be wrong.
>, and much more efficiently as well. That we are able to brute-force a simulacrum of intelligence in a few narrow domains is incredible, but we should not denigrate humans when celebrating this.
Human intelligence was brute forced. Please let's all stop pretending like those billions of years of evolution don't count and we poofed into existence. And you can keep parroting 'simulacrum of intelligence' all you want but that isn't going to make it any more true.
> The capabilities of a human far surpass every single AI to date
Meaning that however you (reasonably) define intelligence, if you compare humans to any AI system, humans are overwhelmingly more capable. Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence. Or else we'd be talking about how my calculator is intelligent. Of course computers can compute faster than we can; that's beside the point.
> Human intelligence was brute forced.
No, I don't mean how the intelligence evolved or was created. But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional. Evolution is not an intentional process, unless you believe in God or a creator of sorts, which is totally fair but probably not what you were intending.
But my point is that LLMs essentially arrive at answers by brute force through search. Go look at what a reasoning model does to count the letters in a sentence, or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on 20% of a lightbulb!).
> But my point is that LLM's essentially arrive at answers by brute force through search.
If "brute force" worked for this, we wouldn't have needed LLMs; a bunch of nested for-loops can brute force anything.
The reason why LLMs are clearly "magic" in ways similar to our own intelligence (which we very much don't understand either) is precisely because they can actually arrive at an answer without brute force, which is computationally prohibitive for most non-trivial problems anyway. Even if the LLM takes several hours spinning in a reasoning loop, those millions of tokens still represent a minuscule part of the total possible solution space.
And yes, we're obviously more efficient and smarter. The smarter part should come as no surprise given that our brains have vastly more "parameters". The efficient part is definitely remarkable, but completely orthogonal to the question of whether the phenomenon exhibited is fundamentally the same or not.
>Meaning however you (reasonably) define intelligence, if you compare humans to any AI system humans are overwhelmingly more capable.
Really ? Every Human ? Are you sure ? because I certainly wouldn't ask just any human for the things I use these models for, and I use them for a lot of things. So, to me the idea that all humans are 'overwhelmingly more capable' is blatantly false.
>Defining "intelligence" as "solving a math equation" is not a reasonable definition of intelligence.
What was achieved here or in the link I sent is not just "solving a math equation".
>Or else we'd be talking about how my calculator is intelligent.
If you said that humans are overwhelmingly more capable than calculators in arithmetic, well I'd tell you you were talking nonsense.
>Of course computers can compute faster than we can, that's aside the point.
I never said anything about speed. You are not making any significant point here lol
>No, I don't mean how the intelligence evolved or was created.
Well then what are you saying ? Because the only brute-forced aspect of LLM intelligence is its creation. If you do not mean that then just drop the point.
>But if you want to make that argument you're essentially asserting we have a creator, because to "brute force" something means it was intentional.
First of all, this makes no sense sorry. Evolution is regularly described as a brute force process by atheist and religious scientists alike.
Second, I don't have any problem with people thinking we have a creator, although that stance still doesn't necessarily mean a magic 'poof into existence' either.
>But my point is that LLM's essentially arrive at answers by brute force through search.
Sorry but that's just not remotely true. This is so untrue I honestly don't know what to tell you. This very post, with the transcript available is an example of how untrue it is.
>or the amount of energy it takes to do things humans can do with orders of magnitude less (our brain runs on %20 of a lightbulb!).
Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
> Really ? Every Human ?
Yes, in many ways absolutely. Just because a model is a better "Google" than my dummy friend doesn't mean that this same friend is more capable at countless cases.
> Meaningless comparison. You are looking at two completely different substrates. Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
Isn't that just more proof of how efficient the human brain is? Especially given that a wire has much better properties than water solutions in bags.
>Just because a model is a better "Google" than my dummy friend doesn't mean that this same friend is more capable at countless cases.
People use LLMs for a lot of things. 'Better Google' is a tiny slice of that.
>Isn't that just more proof how efficient the human brain is?
Sure. So what ? If a game runs poorly on one hardware and excellently on another, does that mean the game was fundamentally different between the 2 devices ? No, Of course not.
I never said that humans are better than LLMs along every axis. Rather, a reasonable definition of intelligence would necessarily encompass domains in which LLMs are either incapable or inferior to us.
Here might be some definitions of intelligence for example:
> The aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.
> "...the resultant of the process of acquiring, storing in memory, retrieving, combining, comparing, and using in new contexts information and conceptual skills".
> Goal-directed adaptive behavior.
> a system's ability to correctly interpret external data, to learn from such data, and to use those learnings to achieve specific goals and tasks through flexible adaptation
But even a housefly possesses levels of intelligence regarding flight and spatial awareness that dominate any LLM. Would it be fair to say a fly is more intelligent than an LLM? It certainly is along a narrow set of axes.
> Because the only brute-forced aspect of LLM intelligence is its creation.
I would consider statistical reasoning systems that can simulate aspects of human thought to be a form of brute force. Not quite an exhaustive search, but massively compressed experience + pattern matching.
But regardless, even if both forms of intelligence arrived via some form of brute force, what is more important to me is the result of that - how does the process of employing our intelligence look.
> This very post, with the transcript available is an example of how untrue it is.
The transcript lacks the vector embeddings of the model's reasoning. It's literally just a summary from the model - not even that really.
> Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
You're so close to getting it lol
>I never said that humans are better than LLM's along every axis. Rather, a reasonable definition of intelligence would necessarily encompass domains that LLM's are either incapable of or inferior to us.
So all humans are overwhelmingly more intelligent but cannot even manage to be as capable in a significant number of domains ? That's not what overwhelming means.
>I would consider statistical reasoning systems that can simulate aspects of human thought to be a form of brute force.
That is not really what “brute force” means. Pattern learning over a compressed representation of experience is not the same thing as exhaustive search. Calling any statistical method “brute force” just makes the term too vague to be useful.
> what is more important to me is the result of that - how does the process of employing our intelligence look.
But this is exactly where you are smuggling in assumptions. We do not actually understand the internal workings of either the human brain or frontier LLMs at the level needed to make confident claims like this. So a lot of what you are calling “the result” is really just your intuition about what intelligence is supposed to look like.
And I do not think that distinction is as meaningful as you want it to be anyway. Flight is flight. Birds fly and planes fly. A plane is not a “simulacrum of flight” just because it achieves the same end by a different mechanism.
>The transcript lacks the vector embeddings of the model's reasoning. It's literally just a summary from the model - not even that really.
You do not need access to every internal representation to see that the model did not arrive at the answer by brute-forcing all possibilities. The observed behavior is already enough to rule that out.
> Do you realize how much compute it would take to run a full simulation of the human brain on a computer ? The most powerful super computer on the planet could not run this in real time.
>You're so close to getting it lol.
No you don't understand what I'm saying. If we were to be more accurate to the brain in silicon, it would be even less efficient than LLMs, never mind humans. Does that mean how the brain works is wrong ? No it means we are dealing with 2 entirely different substrates and directly comparing efficiencies like that to show one is superior is silly.
> So all humans are overwhelmingly more intelligent but cannot even manage to be as capable in a significant number of domains
When the number of domains in which humans are more capable than LLMs vastly exceeds the number of domains in which LLMs are more capable than humans, yes.
I also agree that we don't have a great understanding of either human or LLM intelligence, but we can at least observe major differences and conclude that there are, in fact, major differences. In the same way we can conclude that both birds and planes have major differences, and saying that "there's nothing unique about birds, look at planes" is just a really weird thing to say.
> If we were to be more accurate to the brain in silicon, it would be even less efficient than LLMs
Do you think perhaps this massive difference points to there being a significant and foundational structural and functional difference between these types of intelligences?
> I often wonder why tech has so many reductionist, materialist, and quite frankly anti-human, thinkers.
I think it comes from a position of arrogance/ego. I'll speak for the US here, since that's what I know the most; but the average 'techie' in general skews towards the higher end of the intelligence distribution. This is a very, very broad stroke, and that's intentional to illustrate my point. Because of this, techie culture gains quite a bit of arrogance around it with regards to the masses. And this has been trained into tech culture since childhood. Whether it be adults praising us for being "so smart", or that we "figured out the VCR", or some other random tech problem that literally almost any human being can solve by simply reading the manual.
What I've found, in the vast majority of technical problem solving cases that average people have challenges with, if they just took a few minutes to read a manual they'd be able to solve a lot of it themselves. In short, I don't believe as a very strong techie that I'm "smarter than most", but rather that I've taken the time to dive into a subject area that most other humans do not feel the need nor desire to do so.
There are objectively hard problems in tech to solve, but the people solving THOSE problems in the tech industry are few and far between. And so the tech industry as a whole has spent the last decade or two spinning circles on increasingly complex systems to continue feeding their own egos about their own intelligence. We're now at a point that rather than solving the puzzle, most techies are creating incrementally complex puzzles to solve because they're bored of the puzzles that are in front of them. "Let me solve that puzzle by making a puzzle solver." "Okay, now let me make a puzzle solver creation tool to create puzzle solvers to solve the puzzle." and so forth and so forth. At the end of the day, you're still just solving a puzzle...
But it's this arrogance that really bothers me in the tech bro culture world. And, more importantly, at least in some tech bro circles, they have realized that their target to gathering an exponential increase in wealth doesn't lie in creating new and novel ways to solve the same puzzles, but to try and tout AI as the greatest puzzle solver creation tool puzzle solver known to man (and let me grift off of it for a little bit).
It's funny because the fundamental thing I'm speaking out against is the arrogance of human exceptionalism.
This whole debate about what it means to be intelligent or human just seems like we're making the same mistakes we've made over and over.
Earth as the center of the universe, sun as the center of the universe, man as the only animal with consciousness and intellect, the anthropomorphic nature of the majority of the deities in our religions and the anthropocentric purpose of the universe within those religions...
I think this desire to believe that we are special, that the universe in some way does ultimately revolve around us, is seemingly a deep need in our psyche but any material analysis of our universe shows that it is extremely unlikely that we hold that position.
I largely agree with you, but I also see this same type of thinking appear in people who I know are not arrogant - at least not in the techbro-ish way.
I have long said I am an AI doubter until AI could print out the answers to hard problems or ones requiring tons of innovation. Assuming this is verified to be correct (not by AI), I just became a believer. I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world. I really hope we use this intelligence resource to make the world better.
Math and coding competition problems are easier to train for because of strict rules and cheap verification. But once you go beyond that to less defined things such as code quality, where even humans have a hard time putting down concrete axioms, models start to hallucinate more and become less useful.
We are missing the value function that allowed AlphaGo to go from mid range player trained on human moves to superhuman by playing itself. As we have only made progress on unsupervised learning, and RL is constrained as above, I don't see this getting better.
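To spell out what "cheap verification" buys you, here's a toy sketch of the two reward situations (purely illustrative, not anyone's actual training code):

    # Toy illustration: RL needs a reward signal it can query millions of times.
    def math_reward(model_answer: str, ground_truth: str) -> float:
        # Competition-style answers admit a crisp, cheap, unambiguous check.
        return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

    def code_quality_reward(source: str) -> float:
        # What goes here? Tests and linters catch some things, but "readable",
        # "well-factored", "maintainable" have no agreed-upon oracle, so any
        # reward is noisy and easy to game.
        raise NotImplementedError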
> I don't see this getting better.
We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
I’ve seen this style of take so much that I’m dying for someone to name a logical fallacy for it, like “appeal to progress” or something.
Step away from LLMs for a second and recognize that “Yesterday it was X, so today it must be X+1” is such a naive take and obviously something that humans so easily fall into a trap of believing (see: flying cars).
In finance we say "past performance does not guarantee future returns." Not because we don't believe that, statistically, returns will continue to grow at x rate, but because there is a chance that they won't. The reality bias is actually in favour of these getting better faster, but there is a chance they do not.
This is true because markets are generally efficient. It's very hard to find predictive signals. That is a completely different space than what we're talking about here. Performance is incredibly predictable through scaling laws that continue to hold even at the largest scales we've built.
Even more insane than assuming the trend will continue is assuming it will not continue. We don't know for sure (especially not by pure reason), but the weight of probability sure seems to lean one direction.
Hmm...the sun comes up today is a pretty good bet that the sun comes up tomorrow.
We have robust scaling laws that continue to hold at the largest scales. It is a very safe bet that more compute + more training + algorithmic improvements will improve performance; it's not like we're rolling a 1 trillion dollar die.
Logical fallacies are vastly overrated. Unless the conversation is formal logic in the first place, "logical fallacies" are just a way to apply quick pattern matching to dismiss people without spending time on more substantive responses. In this case, both you and the other are speculating about the near future of a thing, neither of you knows.
Hard to make a more substantive response when the OP’s entire comment was a one-sentence logical fallacy. I’m not cherry-picking here.
> In this case, both you and the other are speculating about the near future of a thing, neither of you knows.
One of us is making a much grander claim than the other:
The post you replied to was:
> We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
All that says is that the speaker thinks models will improve past where they are today. Not that it's a logical certainty (the first thing you jumped on them for), and certainly not anything about "limitless potential for growth" (which nobody even mentioned). With replies like this, invoking fallacies and attacking claims nobody made, you're adding a lot of heat and very little light here (and a few other threads on the page).
> All that says is that the speaker thinks models will improve past where they are today. Not that it's a logical certainty
Exceedingly generous interpretation in my opinion. I tend to interpret rhetorical questions of that form as “it’s so obvious that I shouldn’t even have to ask it”.
> generous interpretation
The term of art for that is steelmanning, and HN tries to foster a culture of it. Please check the guidelines link in the footer and ctrl+f "strongest".
Better put than I could have.
OK, it's not a logical fallacy, it's a false assumption.
The belief in the inevitability of progress is a bad assumption. Especially if you assume a particular technology will keep advancing.
We won't know if his assumption is false until time passes and moves future speculation into the empirical present.
A possibility is not a fact. Assuming a possibility will happen is not justified. Therefore it is false as an assumption, even if it is true that it is a possibility.
I genuinely have no idea what you're on about. One guy expressed his belief about how the future will play out, and another disagreed. Time will be the judge of it, not either of us.
Well, if people gave the exact same 'reasons' why it could not do some task in the past, and it then managed to do that task, it is tiring to see the same nonsense again. The reason here does not even make much sense. This result is not easily verifiable math.
Yeah, and even if we accept that models are improving in every possible way, going from this to 'AI is exponential, singularity etc.' is just as large a leap.
The comment doesn't say it must be X+1. It implies it will improve which I would say is a pretty safe bet.
https://xkcd.com/605/
The scaling law is a power law, requiring orders of magnitude more compute and data for better accuracy from pre-training. Most companies have maxed it out.
For RL, we are arriving at a similar point https://www.tobyord.com/writing/how-well-does-rl-scale
Next stop is inference scaling with longer context windows and longer reasoning. But instead of it being a one-off training cost, it becomes a running cost.
In essence we are chasing ever smaller gains in exchange for exponentially increasing costs. This energy will run out. There needs to be something completely different than LLMs for meaningful further progress.
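For a feel of what that power law means in practice, here's a sketch using a Chinchilla-style parametric loss; the constants are roughly the fitted values reported by Hoffmann et al. (2022) and should be read as illustrative, not authoritative:

    # L(N, D) = E + A / N^alpha + B / D^beta  (Chinchilla-style form)
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params: float, n_tokens: float) -> float:
        return E + A / n_params**alpha + B / n_tokens**beta

    for n in [1e9, 1e10, 1e11, 1e12]:
        print(f"{n:.0e} params, {20 * n:.0e} tokens -> loss {loss(n, 20 * n):.2f}")
    # Each 10x in parameters (with ~10x more data, so ~100x more compute)
    # buys a smaller loss improvement than the previous step did.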
I tend to disagree that improvement is inherent. Really I'm just expressing an aesthetic preference when I say this, because I don't disagree that a lot of things improve. But it's not a guarantee, and it does take people doing the work and thinking about the same thing every day for years. In many cases there's only one person uniquely positioned to make a discovery, and it's by no means guaranteed to happen. Of course, in many cases there are a whole bunch of people who seem almost equally capable of solving something first, but I think if you say things like "I'm sure they're going to make it better" you're leaving to chance something you yourself could have an impact on. You can participate in pushing the boundaries or even making a small push on something that accelerates someone else's work. You can also donate money to research you are interested in to help pay people who might come up with breakthroughs. Don't assume other people will build the future, you should do it too! (Not saying you DON'T)
The problem class is very structured, which makes it "easier", yet the results are undeniably impressive.
But can it count the R's in strawberry?
That question is equivalent to asking a human to add the wavelengths of those two colors and divide it by 3.
Unless you're aware of hyperspectral image adapters for LLMs, they aren't capable of that either.
Unfair - human beats AI in this comparison, as human will instantly answer "I don't know" instead of yelling a random number.
Or at best "I don't know, but maybe I can find out" and proceed to finding out. But he is unlikely to shout "6" because he heard this number once when someone talked about light.
> human will instantly answer "I don't know" instead of yelling a random number.
Seems that you never worked with Accenture consultants?
Fair.
Yet this can be filtered with fixed rules, like "output produced by corporate structures is untrusted random data".
Why is that?
Because LLMs don't have a textual representation of any text they consume. It's just vectors to them. Which is why they are so good at ignoring typos: the vector distance is so small it makes no difference to them.
Yes, it's ridiculously good at stuff like that now. I dare you to try and trick it.
https://news.ycombinator.com/item?id=47495568
What bothers me is not this particular issue, which will surely disappear now that it has been identified, but that we have yet to identify the category of these "stupid" bugs ...
We already know exactly what causes these bugs. They are not a fundamental problem of LLMs, they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer never saw anything else.
> It's as if someone asked you how many 1s there are in the binary representation of this text.
I'm actually kinda pleased with how close I guessed! I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964.
Then I ran your message through a program to get the actual number, and turns out it has 1800 exactly.
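For anyone who wants to repeat the exercise, here is a minimal way to count set bits over the UTF-8 bytes of a piece of text (for plain ASCII prose this tends to land a bit under 4 set bits per character):

    def set_bits(text: str) -> int:
        # Count 1-bits across the UTF-8 encoding of the text.
        return sum(bin(byte).count("1") for byte in text.encode("utf-8"))

    msg = "how many 1s are in the binary representation of this text?"
    print(len(msg), set_bits(msg))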
Okay but, genuinely not an expert on the latest with LLMs, but isn’t tokenization an inherent part of LLM construction? Kind of like support vectors in SVMs, or nodes in neural networks? Once we remove tokenization from the equation, aren’t we no longer talking about LLMs?
It's not a side effect of tokenization per se, but of the tokenizers people use in actual practice. If somebody really wanted an LLM that can flawlessly count letters in words, they could train one with a naive tokenizer (like just ascii characters). But the resulting model would be very bad (for its size) at language or reasoning tasks.
Basically it's an engineering tradeoff. There is more demand for LLMs that can solve open math problems, but can't count the Rs in strawberry, than there is for models that can count letters but are bad at everything else.
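To illustrate the tradeoff (the subword split below is a hypothetical example, not an actual BPE vocabulary): a character-level view can count letters directly, while a subword view hands the model opaque token ids with no letters inside them.

    word = "strawberry"
    print(word.count("r"))                   # 3 -- trivial at the character level

    subword_tokens = ["str", "aw", "berry"]  # hypothetical subword split
    token_ids = [1001, 1002, 1003]           # roughly what the model "sees"
    # From the ids alone there is no letter count to read off; the model would
    # have to have learned the spelling of each token from training data.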
> We went from 2 + 7 = 11 to "solved a frontier math problem" in 3 years, yet people don't think this will improve?
This is disingenuous... I don't think people were impressed by GPT 3.5 because it was bad at math.
It's like saying: "We went from being unable to take off and the crew dying in a fire to a moon landing in 2 years, imagine how soon we'll have people on Mars"
LLMs in some form will likely be a key component in the first AGI system we (help) build. We might still lack something essential. However, people who keep doubting AGI is even possible should learn more about The Church-Turing Thesis.
https://plato.stanford.edu/entries/church-turing/
AGI is definitely possible - there is nothing fundamentally different in the human brain that would surpass a Turing machine's computational power (unless you believe in some higher powers, etc).
We are just meat-computers.
But at the same time, there is absolutely no indication or reason to believe that this wave of AI hype is the AGI one and that LLMs can be scaled further. We know almost nothing about the nature of human intelligence, so we can't even really say whether we are close or far.
This is a long read on things most people here know at least in some form. Could you point to a particular fragment or a quote?
Self driving
if you let a million monkeys bash typewriters... something something book
This is not formally verified math, so there is no real verifiable-feedback aspect here. The best models for formalized math are still specialized ones, although general-purpose models can assist formalization somewhat.
Maybe to get a real breakthrough we have to make programming languages / tools better suited for LLM strengths, not fuss so much about making them write code we like. What we need is correct code, not nice-looking code.
> programming languages / tools better suited for LLM strengths
The bitter lesson is that the best languages / tools are the ones for which the most quality training data exists, and that's pretty much necessarily the same languages / tools most commonly used by humans.
> Correct code not nice looking code
"Nice looking" is subjective, but simple, clear, readable code is just as important as ever for projects to be long-term successful. Arguably even more so. The aphorism about code being read much more often than it's written applies to LLMs "reading" code as well. They can go over the complexity cliff very fast. Just look at OpenClaw.
>> simple, clear, readable code is just as important as ever for projects to be long-term successful
Is it though? I'm a long-time code purist, but I am beginning to wonder about the assumptions underlying our vocation.
I guess it's hard to tell until we see more long-term AI-generated projects, but many of the ones we have so far (OpenClaw and OpenCode for instance) are well-known for their stability issues, and it seems "even more AI" is not about to fix that.
If you can’t validate the code, you can’t tell if it’s correct.
No?
That's literally the thing they suggested moving away from. It's only an issue when using tools designed for us.
Make them write in formal verification languages and we only have to understand the types.
To be clear, I don't think this is a good idea, at least not yet, but we do not have to always understand the code.
Lean might be a step in that direction.
Yes yes
Let it write a black box no human understands. Give the means of production away.
> But once you go beyond that to less defined things such as code quality
I think they have a good optimization target with SWE-Bench-CI.
You are tested on continuous changes to a repository, spanning multiple years in the original repository. Cumulative edits need to be kept maintainable and composable.
If there is something missing from the definition of "can be maintained for multiple years incorporating bugfixes and feature additions" for code quality, then more work is needed, but I think it's a good starting point.
Do we need all that if we can apply AI to solve practical problems today?
What is possible today is one thing. Sure people debate the details, but at this point it's pretty uncontroversial that AI tooling is beneficial in certain use cases.
Whether or not selling access to massive frontier models is a viable business model, or trillion-dollar valuations for AI companies can be justified... These questions are of a completely different scale, with near-term implications for the global economy.
Depends on the cost.
Except it's not how this specific instance works. In this case the problem isn't written in a formal language and the AI's solution is not something one can automatically verify.
I mean, even if the technology stopped improving immediately and forever (which is unlikely), LLMs are already better than most humans at most tasks.
Including code quality. Not because they are exceptionally good (you are right that they aren't superhuman like AlphaGo) but because most humans are not that good at it anyway, and also somehow « hallucinate » because of tiredness.
Even today's models are far from being exploited to their full potential, because we have developed pretty much no tools around them except tooling to generate code.
I'm also a long-time « doubter », but as a curious person I used the tool anyway, with all its flaws, over the last 3 years. And I'm forced to admit that hallucinations are pretty rare nowadays. Errors still happen, but they are very rare and it's easier than ever to get it back on track.
I think I'm also a « believer » now and, believe me, I really don't want to be, because as much as I'm excited by this, I'm also pretty frightened of all the bad things this tech could do to the world in the wrong hands, and I don't feel like it's particularly in the right hands.
LLMs already do unsupervised learning to get better at creative things. This is possible since LLMs can judge the quality of what is being produced.
LLMs can often guess the final answer, but the intermediate proof steps are always total bunk.
When doing math you only ever care about the proof, not the answer itself.
Yep, I remember a friend saying they did a maths course at university that had the correct answer given for each question - this was so that if you made some silly arithmetic mistake you could go back and fix it and all the marks were for the steps to actually solve the problem.
This would have greatly helped me. I was always at a loss as to which trick I had to apply to solve an exam problem, while knowing the mathematics behind it. At some point you had to add a zero that was actually part of a binomial, which then collapsed the whole formula.
Not in this case: the LLM wrote the entire paper, and anyway the proof was the answer.
Once you have a working proof, no matter how bad, you can work towards making it nicer. It's like refactoring in programming.
If your proof is machine checkable, that's even easier.
That is mostly how humans work, too. Once in a blue moon we may get an "intuition", but most of the time we lean on collective knowledge, biases and behavior patterns to make decisions, write and talk.
I haven't had success in getting AI's to output working proofs.
You'd need a completely different post-training and agent stack for that.
What’s funny is that there are total cranks in human form that do the same thing. Lots of unsolicited “proofs” being submitted by “amateur mathematicians” where the content is utter nonsense, but like a monkey with a typewriter, there’s the possibility that they stumble upon an incredible insight.
I mean, this is why everyone is making bank selling RL environments in different domains to frontier labs.
>it really is a new and exciting world...
The point is that from now on, there will be nothing really new, nothing really original, nothing really exciting. Just an endless stream of rehashed old stuff that is just okayish.
Like an AI Spotify playlist, it will keep you in chains (aka engaged) without actually making you like really happy or good. It would be like living in a virtual world, but without anything nice about living in such a world.
We have given up everything nice that human beings used to make and give to each other, and to make it worse, we have also multiplied everything bad that human beings used to give each other.
> there will be nothing really new
How is this the conclusion? Isn't this post about AI solving something new? What am I missing?
Because of the economy. Look at Marvel movies: do you think the latest one is really new? Or just a rehash of what they found works commercially? Look at all the AI-generated blog posts that are flooding the internet.
LLMs might produce something new once in a long while due to blind luck, but if they can generate something that pushes the right buttons (aka not really creative) for the majority of the population, then that is what we will keep getting...
I don't think I have to elaborate on the "multiplying the bad" part, as it is pretty well acknowledged.
That's literally all culture: https://www.youtube.com/watch?v=nJPERZDfyWc
The difference is whether an entity that can "feel" is in the loop and how much they have contributed to it even if it is a remix.
I think there's demonstrably very little difference at all between human and AI outputs, and that's exactly what freaks people out about it. Else they wouldn't be so obsessed with trying to find and define what makes it different.
The Thesis of Everything is a Remix is that there is no difference in how any culture is produced. Different models will have a different flavor to their output in the same way as different people contribute their own experiences to a work.
> I think there's demonstrably very little difference at all between human and AI outputs
Bold claim, as the internet is awash with counterexamples.
In any case, as I think this conversation is trending towards theories of artistic expression, “AI content” will never be truly relatable until it can feel pleasure, pain, and other human urges. The first thing I often think about when I critically assess a piece of art, like music, is what the artist must have been feeling when they created it, and what prompted them to feel that way. I often wonder if AI influencers have ever critically assessed art, or if they actually don’t understand it because of a lack of empathy or something.
And relatability, for me, is the ultimate value of artistic expression.
> Bold claim, as the internet is awash with counterexamples.
What do you consider a counterexample? Because I've been involved in local politics lately, and can say from experience that any foundation model is capable of more rational and detailed thought, and more creative expression, than most of the beloved members of my community.
If you're comparing AI to the pinnacle of human achievement, as another commenter pointed to Shakespeare, then I think the argument is already won in favor of AI.
The claim was precise:
> I think there's demonstrably very little difference at all between human and AI outputs
Counterexamples range from em-dashes and “Not-this, but-that” constructions to people complaining about AI music on Spotify (including me) that sounds vaguely like a genre but is missing all of the instrumentation and motifs common to that genre.
The rest of your comment I don’t even know how to respond to, to be honest.
> em-dashes, “Not-this, but-that”
I've literally seen humans accusing other humans of being AI here on hackernews for these. Q.E.D.
You’re really going to make the claim that there are no counterexamples of human and AI output being indistinguishable on the internet? At least make the counterclaim that “those are from old models, not the newest ones”, that’s more intellectually invigorating than the comment you just provided.
> claim that there are no counterexamples of human and AI output being indistinguishable on the internet?
Is that a claim I've made? I don't see it anywhere. I think a lot of people think that because they can get the AI to generate something silly or obviously incorrect, that invalidates other output which is on-par with top-level humans. It does not. Every human holds silly misconceptions as well. Brain farts. Fat fingers. Great lists of cognitive biases and logical fallacies. We all make mistakes.
It seems to me that symbolic thinking necessitates the use of somewhat lossy abstractions in place of the real thing, primarily limited by the information which can be usefully stored in the brain compared to the informational complexity of the systems being symbolized. Which neatly explains one cognitive pathology that humans and LLMs share. I think there are most certainly others. And I think all the humans I know and all the LLMs I've interacted with exist on a multidimensional continuum of intelligence with significant overlap.
I hereby rebuff your crude and libelous mischaracterization of my assertion. How's that? :)
> Is that a claim I've made?
Yes, you literally just said QED.
Are we reading the same thread?
You said AI works were easily distinguishable via em-dashes and "not this, but that"
I said I have witnessed humans using that metric accuse other humans here on hackernews. Q.E.D.
You've asserted that they are easily distinguished. Practitioners in the field fail to distinguish using the same criteria. Is that not dispositive? Seems like it to me.
I claimed much earlier in the thread "I think there's demonstrably very little difference at all between human and AI outputs" which is consistent with "I think all the humans I know and all the LLMs I've interacted with exist on a multidimensional continuum of intelligence with significant overlap."
Two ways of saying the same thing.
Both of them suggesting that sometimes you may be able to tell it's the output of an AI or Human, sometimes not. Sometimes the things coming out of the AI or the Human might be smart in a way we recognize, sometimes not. And recognizing that humans already exist on quite a broad scale of intelligences in many axes.
>as another commenter pointed to Shakespeare
Lol wut?
I was not saying that LLMs cannot produce something like pinnacle of human achievement. I was saying we cannot quantify the difference between Shakespeare and something commonplace, because it requires the ability to feel.
I think you are being very dishonest here..
> In any case, as I think this conversation is trending towards theories of artistic expression, “AI content” will never be truly relatable until it can feel pleasure, pain, and other human urges. The first thing I often think about when I critically assess a piece of art, like music, is what the artist must have been feeling when they created it, and what prompted them to feel that way.
I recently watched "Come See Me in the Good Light", about the life and death of poet Andrea Gibson. I find their poetry very moving, precisely because it's dripping with human emotion.
Or at least, that's the story I tell myself. The reality is that I perceive it to be written by a human full of emotion. If I were to find out it was AI, I would immediately lose interest, but I think we're already at the point where AI output is indistinguishable from human output in many cases, and if I perceive art to be imbued with human emotion, the actuality of it only matters in terms of how it shapes my perception of it.
I'm not really sure where we'll go with that from here. Maybe art will remain human-created only, and we'll demand some kind of proof of its provenance of being borne of a human mind and a human heart. Or maybe younger generations will really care only about how art makes them feel, not what kind of intelligent entity made it. I really don't know.
> demonstrably very little difference at all between human and AI outputs
Is there "demonstrably" a lot of difference between Shakespeare and an HN comment?
The point is exactly that there is no such difference. And that it enables slop to be sold as art. And that exactly is the danger. But another point is that we had this even before LLMs. LLMs just make it more explicit and make it possible at scale.
Conrad Gessner had the very same complaint in the 16th century, noting the overabundance of printed books, fretting about shoddy, trivial, or error-filled works ( https://www.jstor.org/stable/26560192 )
So....what is your point?
Generations have grown and died in the time since your concern was first expressed. The world continues. Culture adapts.
Did I say it will be the end of the world?
Each solvable problem contains its solution intrinsically, so to speak; it’s only a matter of time and resources to get to it. There’s nothing creative about it, which is I think what OP was alluding to (the creative part). I’m talking mostly about mathematics.
There’s also a discussion to be made about maths not being intrinsically creative if AI automatons can “solve” parts of it, which pains me to write down because I had really thought that that wasn’t the case, I genuinely thought that deep down there was still something ethereal about maths, but I’ll leave that discussion for some other time.
I heard this saying recently “The problem with comfort is that it makes you comfortable.”
On what do you base your prediction?
Is it because the AI is trained with existing data? But we are also trained with existing data. Do you think that there's something that makes the human brain special (other than the hundreds of thousands of years of evolution, but that's what AI is trying to emulate)?
This may sound hostile (sorry for my lower than average writing skills), but trust me, I'm really trying to understand.
>We have given up everything nice that human beings used to make and give to each other, and to make it worse, we have also multiplied everything bad that human beings used to give each other.
Source?
AI can both explore new things and exploit existing things. Nothing forces it to only rehash old stuff.
>without actually making you like really happy or good.
What are you basing this off of? I've shared several AI songs with people in real life due to how much I've enjoyed them. I don't see why an AI playlist couldn't be good or make people happy. It just needs to find what you like in music. Again coming back to explore vs exploit.
>What are you basing this off of.
Jokes. LLMs are not able to make me laugh all day by generating an infinite stream of hilarious original jokes.
Does it work for you?
I've found several posts on moltbook funny. I don't really like regular jokes in general and I don't find human ones particularly funny either. I don't think we are at the point of being reliably funny, but it definitely seems possible from my perspective.
Care to link some?
I think they would be hard to find due to how many posts exist, along with how things aren't as funny the second time around.
Funny things are funny the n-th time around. Or maybe it was just not funny, just something new to you.
We have different senses of humor.
Just tell one funny thing an LLM said...
Lots of examples here:
https://news.ycombinator.com/item?id=46205632
Yesterday it was "LLMs can't count the R's in 'strawberry'." Today it's "LLMs can't tell jokes". Tomorrow it might be "LLMs can't do (X)", all while LLMs get better and better at every objection/challenge posed.
The problem as I see it is that you have a fundamental objection to categorizing the way LLMs do their work as in any way related to "real gosh-darn human thinking". Which I think is wrong. At the root, we are just information-processing meat that happens to have had millions of years to optimize for speed, pattern recognition, feedback, etc.
AI is a remixer; it remixes all known ideas together. It won't come up with new ideas though; the LLMs just predict the most likely next token based on the context. That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
But human researchers are also remixers. Copying something I commented below:
> Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
This is a way too simplistic model of the things humans provide to the process. Imagination, Hypothesis, Testing, Intuition, and Proofing.
An AI can probably do an 'okay' job at summarizing information for meta studies. But what it can't do is go "Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
LLMs will NEVER be able to do that, because it doesn't exist. They're not going to discover and define a new chemical, or a new species of animal. They're not going to be able to describe and analyze a new way of folding proteins and what implications that has UNLESS you are basically constantly training the AI on random protein folds.
I think you are vastly underestimating the emergent behaviours in frontier foundational models and should never say never.
Remember, the basis of these models is unsupervised training, which, at sufficient scale, gives them the ability to detect pattern anomalies out of context.
For example, LLMs have struggled with generalized abstract problem solving, such as "mystery blocks world" that classical AI planners dating back 20+ years or more are better at solving. Well, that's rapidly changing: https://arxiv.org/html/2511.09378v1
No idea how underestimated things are, but marketing terms like "frontier foundational models" don't help foster trust in such a hyped domain.
That is, even if there are cool things that LLMs now make more affordable, the level of bullshit marketing attached to them is also very high, which makes it far harder to build a noise filter.
>Hey that's a weird thing in the result that hints at some other vector for this thing we should look at
Kinda funny because that looked _very_ close to what my Opus 4.6 said yesterday when it was debugging compile errors for me. It did proceed to explore the other vector.
> Especially if that "thing" has never been analyzed before and there's no LLM-trained data on it.
This is the crucial part of the comment. LLMs are not able to solve stuff that hasn't been solved in that exact or a very similar way already, because they are prediction machines trained on existing data. They are very able to spot outliers where they have been found by humans before, though, which is important, and is what you've been seeing.
""Hey that's a weird thing in the result that hints at some other vector for this thing we should look at." "
This is very common already in AI.
Just look at the internal reasoning of any high thinking model, the trace is full of those chains of thought.
But just like how there were never any clips of Will Smith eating spaghetti before AI, AI is able to synthesize different existing data into something in between. It might not be able to expand the circle of knowledge but it definitely can fill in the gaps within the circle itself
> LLMs will NEVER be able to do that, because it doesn't exist.
I mean, TFA literally claims that an AI has solved an open Frontier Math problem, described as "A collection of unsolved mathematics problems that have resisted serious attempts by professional mathematicians. AI solutions would meaningfully advance the state of human mathematical knowledge."
That is, if true, it reasoned out a proof that does not exist in its training data.
It generated a proof that was close enough to something in its training data to be generated.
That may be, and we can debate the level of novelty, but it is novel, because this exact proof didn't exist before, something which many claim was not possible with AI. In fact, just a few years ago, based on some dabbling in NLP a decade ago, I myself would not have believed any of this was remotely possible within the next 3 - 5 decades at least.
I'm curious though, how many novel Math proofs are not close enough to something in the prior art? My understanding is that all new proofs are compositions and/or extensions of existing proofs, and based on reading pop-sci articles, the big breakthroughs come from combining techniques that are counter-intuitive and/or others did not think of. So roughly how often is the contribution of a proof considered "incremental" vs "significant"?
Well, for one the proof would have to use actual proof techniques.
What really happened here was that the LLM produced a python script that generated examples of hypergraphs that served as proof by example.
And the only thing that has been verified are these examples. The LLM also produced a lot of mathematical text that has not been analyzed.
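For readers unfamiliar with "proof by example via a Python script", here is a toy sketch of the general shape of such a search; the property used below (a linear 3-uniform hypergraph) is a stand-in for illustration, not the actual FrontierMath condition:

    from itertools import combinations

    def is_linear(edges):
        # Toy property: any two edges share at most one vertex.
        return all(len(set(a) & set(b)) <= 1 for a, b in combinations(edges, 2))

    def largest_linear(n):
        # Exhaustively search for a largest linear 3-uniform hypergraph on n vertices.
        triples = list(combinations(range(n), 3))
        for k in range(len(triples), 0, -1):        # try bigger edge sets first
            for candidate in combinations(triples, k):
                if is_linear(candidate):
                    return list(candidate)          # a concrete witness
        return []

    print(largest_linear(5))   # keep n small; the search is exponential in C(n, 3)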
I see, thanks for the explanation!
Do you know that from reading the proof, or are you just assuming this based on what you think LLMs should be capable of? If the latter, what evidence would be required for you to change your mind?
- Edit: I can't reply, probably because the comment thread isn't allowed to go too deep, but this is a good argument. In my mind the argument isn't that coding is harder than math, but that the problems had resisted solution by human researchers.
1) this is a proof by example
2) the proof is conducted by writing a python program constructing hypergraphs
3) the consensus was this was low-hanging fruit ready to be picked, and tactics for this problem were available to the LLM
So really this is no different from generating any python program. There are also many examples of combinatoric construction in python training sets.
It's still a nice result, but it's not quite the breakthrough it's made out to be. I think that people somehow see math as a "harder" domain, and are therefore attributing more value to this. But this is a quite simple program in the end.
One of the possible outcomes of this journey is that “LLMs can never do X”. Another is that X is easier than we thought.
Or that some quixotic problems nobody cared enough about to actually work on do have solutions.
>But human researchers are also remixers.
Some human researchers are also remixers to some degree.
Can you imagine AI coming up with refraction & separation like Newton did?
That sets a vastly higher bar than what we're talking about here. You're comparing modern AI to one of the greatest geniuses in human history. Obviously AI is not there yet.
That being said, I think this is a great question. Did Einstein and Newton use a qualitatively different process of thought when they made their discoveries? Or were they just exceedingly good at what most scientists do? I honestly don't know. But if LLMs reach super-human abilities in math and science but don't make qualitative leaps of insight, then that could suggest that the answer is 'yes.'
AI does not have a physical body to run experiments in the real world or to build and use equipment.
Maybe not, but more than 99.999999% of humans would also not come up with that.
Or even gravity to explain an apple falling from a tree, when almost all of the knowledge until then realistically suggested nothing about gravity?
I don't think this is a correct explanation of how things work these days. RL has really changed things.
Models based on RL are still just remixers as defined above, but their distribution can cover things that are unknown to humans due to being present in the synthetic training data, but not present in the corpus of human awareness. AlphaGo's move 37 is an example. It appears creative and new to outside observers, and it is creative and new, but it's not because the model is figuring out something new on the spot, it's because similar new things appeared in the synthetic training data used to train the model, and the model is summoning those patterns at inference time.
> the model is summoning those patterns at inference time.
You can make that claim about anything: "The human isn't being creative when they write a novel, they're just summoning patterns at typing time".
AlphaGo taught itself that move, then recalled it later. That's the bar for human creativity and you're holding AlphaGo to a higher standard without realizing it.
I can't really make that claim about human cognition, because I don't have enough understanding of how human cognition works. But even if I could, why is that relevant? It's still helpful, from both a pedagogical and scientific perspective, to specify precisely why there is seeming novelty in AI outputs. If we understand why, then we can maximize the amount of novelty that AI can produce.
AlphaGo didn't teach itself that move. The verifier taught AlphaGo that move. AlphaGo then recalled the same features during inference when faced with similar inputs.
>AlphaGo didn't teach itself that move. The verifier taught AlphaGo that move.
No. AlphaGo developed a heuristic by playing itself repeatedly; the heuristic then noticed the quality of that move in the moment.
Heuristics are the core of intelligence in terms of discovering novelty, but this is accessible to LLMs in principle.
> The verifier taught AlphaGo that move
Ok so it sounds like you want to give the rules of Go credit for that move, lol.
It feels like you're purposefully ignoring the logical points OP gives and you just really really want to anthropomorphize AlphaGo and make us appreciate how smart it (should I say he/she?) is ... while no one is even criticising the model's capabilities, just analyzing them.
Can you back that up with some logic for me?
I don't really play Go but I play chess, and it seems to me that most of what humans consider creativity in GM level play comes not in prep (studying opening lines/training) but in novel lines in real games (at inference time?). But that creativity absolutely comes from recalling patterns, which is exactly what OP criticizes as not creative(?!)
I guess I'm just having trouble finding a way to move the goalpost away from artificial creativity that doesn't also move it away from human creativity?
How a model is trained is different than how a model is constructed. A model’s construction defines its fundamental limitations, e.g. a linear regressor will never be able to provide meaningful inference on exponential data. Depending on how you train it, though, you can get such a model to provide acceptable results in some scenarios.
Mixing the two (training and construction) is rhetorically convenient (anthropomorphization), but holds us back in critically assessing a model’s capabilities.
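A quick illustration of "limited by construction": a linear model fit to exponential data extrapolates badly no matter how it is trained (toy numbers, illustrative only):

    import numpy as np

    x = np.arange(10, dtype=float)
    y = np.exp(x)                            # exponential ground truth
    slope, intercept = np.polyfit(x, y, 1)   # best linear fit in least squares

    x_new = 15.0
    print(slope * x_new + intercept)         # linear extrapolation
    print(np.exp(x_new))                     # true value: off by orders of magnitude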
Linear regression has well characterized mathematical properties. But we don't know the computational limits of stacked transformers. And so declaring what LLMs can't do is wildly premature.
> And so declaring what LLMs can't do is wildly premature.
The opposite is true as well. Emergent complexity isn’t limitless. Just like early physicists tried to explain the emergent complexity of the universe through experimentation and theory, so should we try to explain the emergent complexity of LLMs through experimentation and theory.
Specifically not pseudoscience, though.
>so should we try to explain the emergent complexity of LLMs through experimentation and theory.
Physicists had the real world to verify theories and explanations against.
So far anyone 'explaining the emergent complexity of LLMs through experimentation and theory' is essentially just making stuff up nobody can verify.
Well that’s why I provided the caveat “specifically not pseudoscience”, which is, as you described, “just making stuff up nobody can verify”.
If you say "not pseudoscience" and then make up pseudoscience anyway, then what's the point? The field has not advanced anywhere near enough in understanding for convoluted explanations about how LLMs can never do X to be anything but pseudoscience.
Sure, that's true as well. But I don't see this as a substantive response given that the only people making unsupported claims in this thread are those trying to deflate LLM capabilities.
So, to review this thread
You made a pretty nonsensical argument, which pretty much seems like the bog standard for these arguments.
What does linear regression have to do with the limitations of a stacked transformer? Absolutely nothing. This is the problem here. You don't know shit and just make up whatever. You can see people doing the same thing in GPT-1, 2, 3, 4 threads, all telling us why LLMs will never be able to do things they manage to do later.
> You don’t know shit
lol. Why so emotionally charged? Are you perhaps worried that you’ve invested too much time and effort into a technology that may not deliver what influencers have been promising for years? Like a proverbial bagholder?
> What does linear regression have to do with the limitations of a stacked transformer? Absolutely nothing. This is the problem here.
We’re talking about fundamental concepts of modeling in this subthread. LLMs, despite what influencers may tell you, are simply models. I’ll even throw you a bone and admit they are models for intelligence. But they are still models, and therefore all of the things that we have learned about “models” since Plato are still relevant. Most importantly, since Plato we’ve known that “models” have fundamental limits vs. what they try to represent, otherwise they would be a facsimile, not a model.
> You can see people doing the same thing in GPT-1, 2, 3, 4 threads all telling us why LLMs will never be able to do thing it manages to do later.
I hope you enjoy winning these imaginary arguments against these imaginary comments. The fundamental limitations of LLMs discussed since GPT-1 have never been addressed by changing the architecture of the underlying model. All of the improvements we’ve experienced have been due to (1) improvements in training regime and (2) harnesses / heuristics (e.g. Agents).
Now, care to provide a counterargument that shows you know a little more than “shit”?
>We’re talking about fundamental concepts of modeling in this subthread. LLMs, despite what influencers may tell you, are simply models. I’ll even throw you a bone and admit they are models for intelligence. But they are still models, and therefore all of the things that we have learned about “models” since Plato are still relevant. Most importantly, since Plato we’ve known that “models” have fundamental limits vs. what they try to represent, otherwise they would be a facsimile, not a model.
Okay, but the brain is also “just a model” of the world in any meaningful sense, so that framing does not really get you anywhere. Calling something a model does not, by itself, establish a useful limit on what it can or cannot do. Invoking Plato here just sounds like pseudo-profundity rather than an actual argument.
>I hope you enjoy winning these imaginary arguments against these imaginary comments. The fundamental limitations of LLMs discussed since GPT-1 have never been addressed by changing the architecture of the underlying model. All of the improvements we’ve experienced have been due to (1) improvements in training regime and (2) harnesses / heuristics (e.g. Agents).
If a capability appears once training improves, scale increases, or better inference-time scaffolding is added, then it was not demonstrated to be a 'fundamental impossibility'.
That is the core issue with your argument: you keep presenting provisional limits as permanent ones, and then dressing that up as theory. A lot of people have done that before, and they have repeatedly been wrong.
To be clear, you are confusing me with other commenters in this thread. All I want is for those that liken LLMs to stochastic parrots and other deflationary claims to offer an argument that engages with the actual structure of LLMs and what we know about them. No one seems to be up to that challenge. But then I can't help but wonder where people's confident claims come from. I'm just tired of the half-baked claims and generic handwavy allusions that do nothing but short-circuit the potential for genuine insight.
No. AlphaGo does search, and does so imperfectly. It does come up with creative new patterns not seen before.
How do you know that? We don't have access to the logs to know anything about its training, and it's impossible for it to have trained on every potential position in Go.
Turning a hard problem into a series of problems we know how to solve is a huge part of problem solving and absolutely does result in novel research findings all the time.
Standard problem*5 + standard solutions + standard techniques for decomposing hard problems = new hard problem solved
There is so much left in the world that hasn’t had anyone apply this approach, purely because no research programme has decided that it’s worth their attention.
If you want to shift the bar for “original” beyond problems that can be abstracted into other problems then you’re expecting AI to do more than human researchers do.
I entered the prompt:
> Write me a stanza in the style of "The Raven" about Dick Cheney on a first date with Queen Elizabeth I facilitated by a Time Travel Machine invented by Lin-Manuel Miranda
It outputted a group of characters that I can virtually guarantee you it has never seen before on its own
Yes, but it has seen The Raven, it has seen texts about Dick Cheney, first dates, Queen Elizabeth, time machines and Lin Manuel Miranda.
All of its output is based on those things it has seen.
What are you trying to point out here? Is there any question you can ask today that is not dependent on some existing knowledge that an AI would have seen?
The point I'm trying to make is that all LLM output is based on likelihood of one word coming after the next word based on the prompt. That is literally all it's doing.
It's not "thinking." It's not "solving." It's simply stringing words together in a way that appears most likely.
ChatGPT cannot do math. It can only string together words and numbers in a way that can convince an outsider that it can do math.
It's a parlor trick, like Clever Hans [1]. A very impressive parlor trick that is very convincing to people who are not familiar with what it's doing, but a parlor trick nonetheless.
[1] https://en.wikipedia.org/wiki/Clever_Hans
> all LLM output is based on likelihood of one word coming after the next word based on the prompt.
Right but it has to reason about what that next word should be. It has to model the problem and then consider ways to approach it.
No, it does not reason anything. LLM "reasoning" is just an illusion.
When an LLM is "reasoning" it's just feeding its own output back into itself and giving it another go.
This is like saying chess engines don't actually "play" chess, even though they trounce grandmasters. It's a meaningless distinction, about words (think, reason, ..) that have no firm definitions.
This exactly. The proof is in the pudding. If AI pudding is as good as (or better than) human pudding, and you continue to complain about it anyway... You're just being biased and unreasonable.
And by the way, I don't think it's surprising that so many people are being unreasonable on this issue; there is a lot at stake and its implications are transformative.
Chess engines are not a comparable thing. Chess is a solved game. There is always a mathematically perfect move.
> Chess is a solved game. There is always a mathematically perfect move.
This is a good example of being confidently misinformed.
The best move is always a result of calculation. And the calculation can always go deeper or run on a stronger engine.
We know that chess can be solved, in theory. It absolutely isn't and probably will never be in practice. The necessary time and storage space doesn't exist.
Chess is absolutely not a solved game, outside of very limited situations like endgames. Just because a best move exists does not mean we (or even an engine) know what it is
Is that so different from brains?
Even if it is, this sounds like "this submarine doesn't actually swim" reasoning.
sigh; this argument is the new Chinese Room; easily described, utterly wrong.
https://www.youtube.com/watch?v=YEUclZdj_Sc
Next-token-prediction cannot do calculations. That is fundamental.
It can produce outputs that resemble calculations.
It can prompt an agent to input some numbers into a separate program that will do calculations for it and then return them as a prompt.
Neither of these are calculations.
So you don't think 50T parameter neural networks can encode the logic for adding two n-bit integers for reasonably sized integers? That would be pretty sad.
They do not. The fundamental technology behind LLMs does not allow that to be the case. You are hoping that an LLM can do something that it cannot do.
https://arxiv.org/html/2502.16763v2
You are wrong. Especially when we are talking about models with 50T parameters.
Can they do arbitrary computations for arbitrarily long numbers? Nope. But that's not remotely the same statement, and they can trivially call out to tools to do that in those cases.
You do realize that training a neural net to do addition is a beginner level exercise in ML?
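For what it's worth, here is roughly what that beginner exercise looks like in its simplest form: a single linear neuron trained with gradient descent learns addition from examples (toy code, numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-10, 10, size=(1000, 2))   # input pairs (a, b)
    y = X.sum(axis=1)                          # targets a + b

    w, b, lr = np.zeros(2), 0.0, 0.01
    for _ in range(500):
        err = X @ w + b - y                    # prediction error
        w -= lr * (X.T @ err) / len(X)         # gradient step on the weights
        b -= lr * err.mean()                   # gradient step on the bias

    print(w, b)                                # w converges to ~[1, 1], b to ~0
    print(np.array([3.0, 5.0]) @ w + b)        # ~8.0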
Humans can't do calculations either, by your definition. Only computers can.
Third things can exist. In other words, you’re implying a false dichotomy between “human computation” and “computer computation” and implying that LLMs must be one or the other. A pithy gotcha comment, no doubt.
Edit: the implication comes from demanding that the OP’s definition must be rigorous enough to cover all models of “computation”, and by failing to do so, it means that LLMs must be more like humans than computers.
After dismissing it for a long time, I have come around to the philosophical zombie argument. I do not believe that LLMs are conscious, but I also no longer believe that consciousness is a prerequisite for intelligence. I think at this point it is hard to deny that LLMs do not possess some form of intelligence (although not necessarily human-like). I think P-zombies is a fitting description.
I don't think P-zombies can exist. There must be some perceptible difference between an intelligence w/ consciousness and one without. The only way there wouldn't be a difference is if we are mistaken about the consciousness (either both have it or neither do).
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.
“What are you doing?”, asked Minsky.
“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.
“Why is the net wired randomly?”, asked Minsky.
“I do not want it to have any preconceptions of how to play”, Sussman said.
Minsky then shut his eyes.
“Why do you close your eyes?”, Sussman asked his teacher.
“So that the room will be empty.”
At that moment, Sussman was enlightened.
-- from the jargon file
> All of its output is based on those things it has seen.
Virtually all output from people is based in things the person has experienced.
People aren't designed to objectively track each and every event or observation they come across. Thus it's harder to verify. But we only output what has been inputted to us before.
Here’s a simple prompt you can try to prove that this is false:
This is a fresh UUIDv4 I just generated, it has not been seen before. And yet it will output it.
No one is claiming that every sentence LLMs produce is a literal copy of another sentence. Tokens are not even constrained to words but consist of smaller slices, comparable to syllables. That even makes entirely new words possible.
New sentences, words, or whatever are entirely possible, and yes, repeating a string (especially if you prompt it) is entirely possible, and not surprising at all. But all of that comes from trained data, predicting the most probable next "syllable". It will never leave that realm, because it's not able to. It's like approaching an Italian who has never learned or heard any other language to speak French. It can't.
> It's like approaching an Italian who has never learned or heard any other language to speak French
Interesting analogy, because I expect an Italian to be able to communicate somewhat successfully with a French person (and vice versa) even if they do not share a language.
The two languages are likely fairly similar in latent space.
Your view of what is happening in the neural net of an LLM is too simplistic. In the regard you are describing, they likely aren't subject to any constraints that humans aren't also subject to. What I do know to be true is that they have internalised mechanisms for non-verbalised reasoning. I see proof of this every day when I use the frontier models at work.
After you prompt it, it's seen it.
Ok, how about this?
It is trivial to get an LLM to produce new output, that’s all I’m saying. It is strictly false that LLMs will only ever output character sequences that have been seen before; clearly they have learned something deeper than just that.
All of the data is still in the prompt, you are just asking the model to do a simple transform.
I think there are examples of what you’re looking for, but this isn’t one.
I agree that this isn't a very interesting example, but your statement is: "just asking the model to do a simple transform". If you assert that it understands when you ask it things like that, how could anything it produces not fall under the "already in the model" umbrella?
I didn't say it wasn't an interesting example -- I said it wasn't an example of LLMs generating things they have not seen before.
> how could anything it produces not fall under the "already in the model" umbrella
It doesn't. That is the point of my comment.
> All of the data is still in the prompt, you are just asking the model to do a simple transform.
LLMs can use data in their prompt. They can also use data in their context window. They can even augment their context with persisted data.
You can also roll out LLM agents, each one with their role and persona, and offload specialized tasks with their own prompts, context windows, and persisted data, and even tools to gather data themselves, which then provide their output to orchestrating LLM agents that can reuse this information as their own prompts.
This is perfectly composable. You can have a never-ending graph of specialized agents, too.
Dismissing features because "all of the data is in the prompt" completely misses the key traits of these systems.
I was in no way dismissing it -- I was refuting the above claim that they "generate things they have not seen before"
The only way to prove it is false would be to let the LLM create a new UUID algorithm that uses different parameters than all the other UUID algorithms, but that is better than the ones before. It basically can't do that.
But that fresh UUID is in the prompt.
Also it's missing the point of the parent: it's about concepts and ideas merely being remixed. Similar to how many memes there are around this topic, like "create a fresh new character design of a fast hedgehog" and the output is just a copy of Sonic.[1]
That's what the parent is on about: if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it. Terence Tao had similar thoughts in a recent podcast.
[1] https://www.reddit.com/r/aiwars/s/pT2Zub10KT
Sure, that may be. But “creativity” is much harder to define and to prove or disprove. My point is that “remixing” does not prohibit new output.
I don’t think that is a good example. No one is debating whether LLMs can generate completely new sequences of tokens that have never appeared in any training dataset. We are interested not only in novel output, we are also interested in that output being correct, useful, insightful, etc. Copying a sequence from the user’s prompt is not really a good demonstration of that, especially given how autoregression/attention basically gives you that for free.
Perhaps I should have quoted the parent:
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
My only claim is that precisely this is incorrect.
> That's what the parent is on about, if it requires new creativity not found by deriving from the learned corpus, then LLMs can't do it.
This is specious reasoning. If you look at each and every realization attributed to "creativity", it resulted from a source of inspiration where one or more traits were singled out to be remixed by the "creator". All ideas spawn from prior ideas and observations, which are remixed. Even from analogues.
A better example is: compute 2984298724 times 23984723828.
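Which is indeed the kind of thing exact tools are for; a pure next-token predictor has to approximate it or call out to something like this:

    # Exact integer arithmetic is trivial for an interpreter.
    print(2984298724 * 23984723828)   # 71577580715392795472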
Remixing ideas that already exist is a major part of where innovation and breakthroughs come from. If you look at Bitcoin as an example, hashes (and hashcash) and digital signatures existed for decades before Bitcoin was invented. The cypherpunks also spent decades trying to create a decentralized digital currency, to the point where many of them gave up and moved on. Eventually one person just took all of the pieces that already existed and put them together in the correct way. I don't see any reason why a sufficiently capable LLM couldn't do this kind of innovation.
No. That's wrong. LLMs don't output the highest-probability token: they do a random sampling.
This was obviously a simplification which holds for zero temperature. Obviously top-p-sampling will add some randomness but the probability of unexpected longer sequences goes asymptotically to zero pretty quickly.
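For anyone who hasn't seen it spelled out, here is a minimal sketch of temperature plus top-p (nucleus) sampling over a toy next-token distribution; the logits are made up for illustration:

    import numpy as np

    def sample(logits, temperature=1.0, top_p=0.9, rng=np.random.default_rng(0)):
        if temperature == 0:                     # greedy decoding: always the argmax
            return int(np.argmax(logits))
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]          # most likely token first
        csum = np.cumsum(probs[order])
        keep = (csum - probs[order]) < top_p     # smallest prefix covering top_p mass
        kept = order[keep]
        p = probs[kept] / probs[kept].sum()
        return int(rng.choice(kept, p=p))

    logits = np.array([3.0, 1.5, 0.5, -1.0])     # toy scores for 4 candidate tokens
    print([sample(logits, temperature=0.8) for _ in range(5)])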
I'm not sure what the point is?
A bog standard random number generator or even a flipping coin can produce novel output at will. That's a weird thing to fault LLMs for? Novelty is easy!
See also how genetic algorithms and reinforcement learning constantly solve problems in novel and unexpected ways. Compare also antibiotic resistance in the real world.
You don't need smarts for novelty.
Where I see the problem is producing output that's both high quality _and_ novel. On command to solve the user's problem.
We need a website with refutations that one can easily link to. This interpretation of LLMs is outdated and unproductive.
The ability for some people to perpetually move the goalpost will never cease to amaze me.
I guess that's one way to tell us apart from AIs.
The main reason for my top post is that I felt I should admit the AI scored a goal today and the last one or two weeks. I said I'd be impressed if it could solve an open problem. It just did. People can argue about how it's not that impressive because if every mathematician were trying to solve this problem they probably would have. However, we all know that humans have extremely finite time and attention, whereas computers not so much. The fact that AI can be used at the cutting edge and relatively frequently produce the right answer in some contexts is amazing.
> AI is a remixer; it remixes all known ideas together.
I've heard this tired old take before. It's the same type of simplistic opinion such as "AI can't write a symphony". It is a logical fallacy that relies on moving goalposts to impossible positions that they even lose perspective of what your average and even extremely talented individual can do.
In this case you are faced with a proof that most members of the field would be extremely proud of achieving, and for most it would even be their crowning achievement. But here you are, downplaying and dismissing the feat. Perhaps you lost perspective of what science is, and how it boils down to two simple things: gather objective observations, and draw verifiable conclusions from them. This means all science does is remix ideas. Old ideas, new ideas, it doesn't really matter. That's what they do. So why do people win a prize when they do it, but when a computer does the same its role is downplayed as a glorified card shuffler?
Yes, ChatGPT and friends are essentially the same thing as the predictive text keyboard on your phone, but scaled up and trained on more data.
So this idea that they replay "text" they saw before is kind of wrong fundamentally. They replay "abstract concepts of varied conceptual levels".
The important point I'm trying to reinforce is that LLMs are not capable of calculation. They can give an answer based on the fact that they have seen lots of calculations and their results, but they cannot actually perform mathematical functions.
That is a pretty bold assertion for a meatball of chemical and electrical potentials to make.
Do you know what "LLM" stands for? They are large language models, built on predicting language.
They are not capable of mathematics because mathematics and language are fundamentally separated from each other.
They can give you an answer that looks like a calculation, but they cannot perform a calculation. The most convincing of LLMs have even been programmed to recognize that they have been asked to perform a calculation and hand the task off to a calculator, and then receive the calculator's output as a prompt even.
But it is fundamentally impossible for an LLM to perform a calculation entirely on its own, the same way it is fundamentally impossible for an image recognition AI to suddenly write an essay or a calculator to generate a photo of a giraffe in space.
People like to think of "AI" as one thing but it's several things.
What calculations? Do you mean "3+5" or a generic Turing-machine like model?
In either case, this "it's a language model" is a pretty dumb argument to make. You may want to reason about the fundamental architecture, but even that quickly breaks down. A sufficiently large neural network can execute many kinds of calculations. In "one shot" mode it can't be Turing complete, but in a weird technicality neither does your computer have an infinite tape. It just simply doesn't matter from a practical perspective, unless you actually go "out of bounds" during execution.
50T parameters give plenty of state space to do all kinds of calculations, and you really can't reason about it in a simplistic way like "this is just a DFA".
Let alone when you run it in a loop.
> What calculations? Do you mean "3+5" or a generic Turing-machine like model?
Either one. An LLM cannot solve 3+5 by adding 3 and 5. It can only "solve" 3+5 by knowing that within its training data, many people have written that 3+5=8, so it will produce 8 as an answer.
An LLM, similarly, cannot simulate a Turing machine. It can produce a text output that resembles a Turing machine based on others' descriptions of one, but it is not actually reading and writing bits to and from a tape.
This is why LLMs still struggle at telling you how many r's are in the word "strawberry". They can't count. They can't do calculations. They can only reproduce text based on having examined the human corpus's mathematical examples.
With all due respect, this is just plain false.
The reason "strawberry" is hard for LLMs is that it sees $str-$aw-$berry, 3 identifiers it can't see into. Can you write down a random word your just heard in a language you don't speak?
> In "one shot" mode it can't be Turing complete, but in a weird technicality neither does your computer have an infinite tape
Nor our brains, in fact.
Mathematics and language really aren't fundamentally separated from one another.
By your definition, humans can't perform calculation either. Only a calculator can.
Mathematics is a language. Everything we can express mathematically, we can also express in natural language. The real interesting, underlying question is: Is there anything worth knowing that cannot be expressed by language? - That's the theoretical boundary of LLM capability.
This is a really poor take, trying to put a firewall between mathematics and language, implying that something whose conceptual understanding is rooted in language is incapable of reasoning in mathematical terms.
You're also conflating "mathematics" and "calculation". Who cares about calculation; as you say, we have calculators to do that.
Mathematics is all just logical reasoning and exploration using language, just a very specific, dense, concise, and low-level language. But you can always take any mathematical formula and express it as "language"; it will just take far more "symbols".
This might be the worst take in this entire comment section. And I'm not even an overly hyped vibe coder, just someone who understands mathematics.
>it is fundamentally impossible for an image recognition AI to suddenly write an essay
You can already do this today with every frontier model. You can give it an image and have it write an essay from it. Both patches (parts of images) and text get turned into tokens in the language the LLM is learning.
Obligatory Everything is a Remix: https://www.youtube.com/watch?v=nJPERZDfyWc
Move 37.
Yeah but you're thinking of AI as like a person in a lab doing creative stuff. It is used by scientists/researchers as a tool *because* it is a good remixer.
Nobody is saying this means AI is superintelligence or largely creative but rather very smart people can use AI to do interesting things that are objectively useful. And that is cool in its own way.
Sure, but this is absolutely not how people are viewing the AI lol.
> That means the group of characters it outputs must have been quite common in the past. It won't add a new group of characters it has never seen before on its own.
This is false.
I mean it's not going to invent new words no, but it can figure out new sentences or paragraphs, even ones it hasn't seen before, if it's highly likely based on its training and context. Those new sentences and paragraphs may describe new ideas, though!
LLMs are absolutely capable of inventing new words, just as they are capable of writing code that they have never seen in their training data.
I'm curious as to why you consider this as the benchmark for AI capabilities. Extremely few humans can solve hard problems or do much innovation. The vast majority of knowledge work requires neither of these, and AI has been excelling at that kind of work for a while now.
If your definition of AI requires these things, I think -- despite the extreme fuzziness of all these terms -- that it's closer to what most people consider AGI, or maybe even ASI.
Fair point, however I am simply more interested in how AI can advance frontiers than in how it can transcribe a meeting and give a summary or even print out React code. I know the world is heavily in need of the menial labor and AI already has made that stuff way easier and cheaper.
However I'm just very interested in innovation and pushing the boundaries as a more powerful force for change. One project I've been super interested in for a while is the Mill CPU architecture. While they haven't (yet) made a real chip to buy, the ideas they have are just super awesome and innovative in a lot of areas involving instruction density & decoding, pipelining, and trying to make CPU cores take 10% of the power. I hope the Mill project comes to fruition, and I hope other people build on it, and I hope that at some point AI could be a tool that prints out innovative ideas that took the Mill folks years to come up with.
It's kind of interesting in your original comment you used the words "doubter" and "believer", as if AI was some kind of messianic event of some sort and you are deciding whether to "believe" in it.
I mean, if you step back and think about it, there's nothing that requires faith. As you said, current AI can do a lot of things pretty well (transcribe and summarize meetings, write boilerplate code, etc.) Nobody is doubting this.
And AI is definitely helping in innovation to some extent. Not necessarily drive it singlehandedly, but some people working on world-changing innovation find AI useful.
So yeah, I think some people are subconsciously not doubting whether AI works, but kinda having conflicted thoughts about AI being our new overlords or something.
If you think about it, is having AI that's capable of innovating better than humans really a good thing? Like, even if we manage to make benign AI who won't copy how humans are jerks to each other, it kinda takes away our fun of discovery.
I remember there was a conversation between two super-duper VCs (don't remember who, but famous ones) about how DeepSeek was a super-genius level model because it solved an intro-level (like week 1-2) electrodynamics problem stated in a very convoluted way.
While cool and impressive for an LLM, I think they oversold the feat by quite a bit.
I don't want to belittle the performance of this model, but I would like someone with domain expertise (and no dog in the AI race, like a random math PhD) to come forward and explain exactly what the problem was, and how the model contributed to the solution.
It 100% will not be used to make the world better and we all know it will be weaponised first to kill humans like all preceding tech
Most tech gets used for good and bad.
Are the only two options AI doubter and AI believer?
Perhaps I should have elaborated more but what I mean is I used to think, "I genuinely don't see the point in even trying to use AI for things I'm trying to solve". Ironically though, I think that because I've repeatedly tried and tested AI and it falls flat on its face over and over. However, this article makes me more hopeful that AI actually could be getting smarter.
All I hear about are AI believers and AI-doubters-just-turned-believers
Hey, I'm a real person. Here's my website. I have YouTube videos up with my real name and face. https://validark.dev
Asking the right questions...
> I really hope we use this intelligence resource to make the world better.
I wish I had your optimism. I'm not an AI doubter (I can see it works all by myself, so I don't think I need such verification). But I do doubt humanity's ability to use these tools for good. The potential for power and wealth concentration is off the scale compared to most of our other inventions so far.
most issues at every scale of community and time are political, how do you imagine AI will make that better, not worse?
there's no math answer to whether a piece of land in your neighborhood should be apartments, a parking lot or a homeless shelter; whether home prices should go up or down; how much to pay for a new life saving treatment for a child; how much your country should compel fossil fuel emissions even when another country does not... okay, AI isn't going to change anything here, and i've just touched on a bunch of things that can and will affect you personally.
math isn't the right answer to everything, not even most questions. every time someone categorizes "problems" as "hard" and "easy" and talks about "problem solving," they are being co-opted into political apathy. it's cringe for a reason.
there are hardly any mathematicians who get elected, and it's not because voters are stupid! but math is a great way to make money in America, which is why we are talking about it and not because it solves problems.
if you are seeking a simple reason why so many of the "believers" seem to lack integrity, it is because the idea that math is the best solution to everything is an intellectually bankrupt, kind of stupid idea.
if you believe that math is the most dangerous thing because it is the best way to solve problems, you are liable to say something really stupid like this:
> Imagine, say, [a country of] 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist... this is a dangerous situation... Humanity needs to wake up
https://www.darioamodei.com/essay/the-adolescence-of-technol...
Dario Amodei has never won an election. What does he know about countries? (nothing). do you want him running anything? (no). or waking up humanity? In contrast, Barack Obama, who has won elections, thinks education is the best path to less violence and more prosperity.
What are you a believer in? ChatGPT has disrupted exactly ONE business: Chegg, because its main use case is cheating on homework. AI, today, only threatens one thing: education. Doesn't bode well for us.
I agree with what you're saying, and I certainly don't think the one problem facing my country or the world is just that we didn't solve the right math problem yet. I am saddened by the direction the world keeps moving.
When I wrote that I hope we use it for good things, I was just putting a hopeful thought out there, not necessarily trying to make realistic predictions. It's more than likely people will do bad things with AI. But it's actually not set in stone yet, it's not guaranteed that it has to go one way. I'm hopeful it works out.
> I would like to see a few more AI inventions to know for sure, but wow, it really is a new and exciting world.
We already have a few years of experience with this.
> I really hope we use this intelligence resource to make the world better.
We already have a few years of experience with this.
> born-again AI believer
sigh
I honestly do think I'm being honest with myself. I have held it in my mind that I'm not impressed until it's innovative. That threshold seems to be getting crossed.
I'm not saying, "I used to be an atheist, but then I realized that doesn't explain anything! So glad I'm not as dumb now!"
Somehow people don't need "faith" and "being impressed" to make a hammer or a car work.
(This shows that LLMs aren't tools yet.)
The problem is that the AI industry has been caught lying about their accomplishments and cheating on tests so much that I can't actually trust them when they say they achieved a result. They have burned all credibility in their pursuit of hype.
I'm all for skeptical inquiry, but "burning all credibility" is an overreaction. We are definitely seeing very unexpected levels of performance in frontier models.
It's less solving a problem and more trying every single solution until one works. Exhaustive search, pretty much.
It's pretty much how all the hard problems are solved by AI from my experience.
If LLMs really solved hard problems by 'trying every single solution until one works', we'd be sitting here waiting until kingdom come for there to be any significant result at all. Instead this is just one of a few that have cropped up in recent months, and likely a foretaste of many to come.
In other words, it's solving a problem.
Yes, but whether it is "intelligence" is a valid question. We have known for a long time that computers are a lot faster than humans. Get a dumb person who works fast enough, and eventually they'll spit out enough good work to surpass a smart person of average speed.
It remains to be seen whether this is genuinely intelligence or an infinite monkeys at infinite typewriters situation. And I'm not sure why this specific example is worthy enough to sway people in one direction or another.
Someone actually mathed out infinite monkeys at infinite typewriters, and it turns out, it is a great example of how misleading probabilities are when dealing with infinity:
"Even if every proton in the observable universe (which is estimated at roughly 1080) were a monkey with a typewriter, typing from the Big Bang until the end of the universe (when protons might no longer exist), they would still need a far greater amount of time – more than three hundred and sixty thousand orders of magnitude longer – to have even a 1 in 10500 chance of success. To put it another way, for a one in a trillion chance of success, there would need to be 10^360,641 observable universes made of protonic monkeys."
Often, things that have probability 1 in the infinite limit are, in practice, safe to treat as having probability 0.
So no. LLMs are not brute force dummies. We are seeing increasingly emergent behavior in frontier models.
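The flavor of that arithmetic is easy to sanity-check yourself. A back-of-envelope sketch (the numbers below are illustrative stand-ins, not the figures from the quoted study):

```python
# Back-of-envelope: probability that uniform random typing reproduces a short target text.
# Illustrative numbers only; not the figures from the quoted study.
from math import log10

alphabet = 27              # 26 letters + space, ignoring case and punctuation
target_length = 100        # a mere 100 characters, far shorter than any real proof or play
monkeys = 10**80           # one monkey per proton in the observable universe
keystrokes_each = 10**40   # wildly generous number of keystrokes per monkey

# log10 of the probability that a single attempt matches the target exactly
log_p_single = -target_length * log10(alphabet)

# Expected number of successes across all monkeys and keystrokes (union-bound style)
log_expected_hits = log10(monkeys) + log10(keystrokes_each) + log_p_single

print(f"P(single attempt) ~ 10^{log_p_single:.0f}")
print(f"Expected hits overall ~ 10^{log_expected_hits:.0f}")  # still astronomically below 1
```

Even with a proton's worth of monkeys and absurdly generous typing speeds, the expected number of hits on a mere 100-character target stays astronomically below 1.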
> So no. LLMs are not brute force dummies. We are seeing increasingly emergent behavior in frontier models.
Woah! That was a leap. "We are seeing ... emergent behaviors" does not follow from "it's not brute force".
It is unsurprising that an LLM performs better than random! That's the whole point. It does not imply emergence.
> It is unsurprising that an LLM performs better than random! That's the whole point. It does not imply emergence.
By definition, it is emergent behavior when it exhibits the ability to synthesize solutions to problems that it wasn't trained on. I.e. it can handle generalization.
Emergent behavior would imply that some other function was being reduced to token prediction. Behaving "better than random", i.e. not just brute forcing, would not qualify: token prediction is not brute forcing, and we expect it to do better, since it's trained to do so.
If you want to demonstrate an emergent behavior you're going to need to show that.
> We are seeing increasingly emergent behavior in frontier models.
What? Did you see one crying?
Maybe infinite monkeys at infinite typewriters hitting the statistically most likely next key based on their training.
The real question is how to define intelligence in a way that isn't artificially constrained to eliminate all possibilities except our own.
Bet you didn't come up with that comment by first discarding a bunch of unsuitable comments.
I hired an artist for an oil painting.
The artist drew 10 pencil sketches and said "hmm I think this one works the best" and finished the painting based on it.
I said he didn't one shot it and therefore he has no ability to paint, and refused to pay him.
You learned what was unsuitable over your entire life until now by making countless mistakes in human interaction.
A basic AI chat response also doesn't first discard all other possible responses.
How often do you self edit before submitting?
because commenting is easy and solving hard problems is hard
A random sentence generator can also produce the correct solution to a problem once in a long while... that does not mean it "solved" anything.
The link has an entire section on "The infeasibility of finding it by brute force."
But this is exactly how we do math.
We start writing all those formulas etc., and if at some point we realise we went the wrong way, we start from the beginning (or from some point we are sure about).
How do you think mathematicians solve problems?
No, that's precisely solving a problem.
Shotgunning it is an entirely valid approach to solving something. If AI proves to be particularly great at that approach, given the improvement runway that still remains, that's fantastic.
That's also the only way how humans solve hard problems.
Not always, humans are a lot better at poofing a solution into existence without even trying or testing. It's why we have the scientific method: we come up with a process and verify it, but more often than not we already know that it will work.
AI, by comparison, thinks of every possible scientific method and tries them all. Not saying that humans never do this as well, but it's mostly reserved for when we just throw mud at a wall and see what sticks.
That's just not true at all. There are entire fields that rest pretty heavily on brute force search. Entire theses in biomedical and materials science have been written to the effect of "I ran these tests on this compound, and these are the results", without necessarily any underlying theory more than a hope that it'll yield something useful.
As for advances where there is a hypothesis, it rests on the shoulders of those who've come before. You know from observations that putting carbon in iron makes it stronger, and then someone else comes along with a theory of atoms and molecules. You might apply that to figuring out why steel is stronger than iron, and your student takes that and invents a new superalloy with improvements to your model. Remixing is a fundamental part of innovation, because it often teaches you something new. We aren't just alchemying things out of nothing.
Well, we know that mixing lead into copper won't make for a strong material. There's a lot of human ingenuity involved.
I failed to make my point clear: Humans make the search area way smaller compared to current day AI.
More often than not, far, far, far more often than not, we do not already know that it will work. For all human endeavors, from the beginning of time.
If we get to any sort of confidence it will work it is based on building a history of it, or things related to "it" working consistently over time, out of innumerable other efforts where other "it"s did not work.
AI can one shot problems too, if they have the necessary tools in their training data, or have the right thing in context, or have access to tools to search relevant data. Not all AI solutions are iterative, trial and error.
Also
> humans are a lot better at (...)
That's maybe true in 2026, but it's hard to make statements about "AI" in a field that is advancing so quickly. For most of 2025 for example, AI doing math like this wouldn't even be possible
There have been both inductive and deductive solutions to open math problems by humans in the past decade, including to fairly high-profile problems.
For those, like me, who find the prompt itself of interest …
> A full transcript of the original conversation with GPT-5.4 Pro can be found here [0] and GPT-5.4 Pro’s write-up from the end of that transcript can be found here [1].
[0] https://epoch.ai/files/open-problems/gpt-5-4-pro-hypergraph-...
[1] https://epoch.ai/files/open-problems/hypergraph-ramsey-gpt-5...
I wonder what was in that solutions file they provided. According to the prompt it’s a solution template but I want to know the contents.
Another thing I want to know is how the user keeps updating the LLM with the token usage. I didn’t know they could process additional context midtask like that.
I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is, and it looks like Opus 4.6 consumed around 250k tokens. That means that a tricky React refactor I did earlier today at work was about half as hard as an open problem in mathematics! :)
You're kidding, but it could be true? Many areas of mathematics are, first and foremost, incredibly esoteric and inaccessible (even to other mathematicians). For this one, the author stated that there might be 5-10 people who have ever made any effort to solve it. Further, the author believed it's a solvable problem if you're qualified and grind for a bit.
In software engineering, if only 5-10 people in the world have ever toyed with an idea for a specific program, it wouldn't be surprising that the implementation doesn't exist, almost independent of complexity. There's a lot of software I haven't finished simply because I wasn't all that motivated and got distracted by something else.
Of course, it's still miraculous that we have a system that can crank out code / solve math in this way.
If only 5-10 people have ever tried to solve something in programming, every LLM will start regurgitating your own decade-old attempt again and again, sometimes even with the exact comments you wrote back then (good to know it trained on my GitHub repos...), but you can spend upwards of 100 million tokens in gemini-cli or claude code and still not make any progress.
It's after all still a remix machine; it can only interpolate between that which already exists. Which is good for a lot of things, considering everything is a remix, but it can't do truly new tasks.
What is a "truly new task"? Does there exist such a thing? What's an example of one?
Everything we do builds on top of what's already been done. When I write a new program, I'm composing a bunch of heuristics and tricks I've learned from previous programs. When a mathematician approaches an open problem, they use the tactics they've developed from their experience. When Newton derived the laws of physics, he stood on the shoulders of giants. Sure, some approaches are more or less novel, but it's a difference in degree, not kind. There's no magical firebreak to separate what AI is doing or will do, and the things the most talented humans do.
That phrase "everything is a remix" was used for a good reason: there's a documentary of that same name, and I can certainly recommend it.
At the same time, there are things that are truly novel. Even if the idea is based on combining two common approaches, the implementation might need to be truly novel, with new formulas and new questions that arise from those. AI can't help there, speaking from experience.
That's why context management is so important. AI not only get more expensive if you waste tokens like that, it may perform worse too
Even as context sizes get larger, this will likely remain relevant. Especially since AI providers may jack up the price per token at any time.
I don't think so. I went through the output of Opus 4.6 vs GPT 5.4 pro. Both are given different directions/prompts. Opus 4.6 was asked to test and verify many things. Opus 4.6 tried in many different ways and the chain of thoughts are more interesting to me.
You're glossing over the fact that mathematics uses only one token per variable (`x = ...`), whereas software engineering best practices demand an excessive number of tokens per variable for clarity.
It's also a pretty silly thing to say difficulty = tokens. We all know line counts don't tell you much, and it shows in their own example.
Even if you did have math-like tokenisation, refactoring a thousand lines of "X=..." to "Y=..." isn't a difficult problem even though it would take at least a thousand tokens. And even if you could come up with E=mc^2 in a thousand tokens, that wouldn't make the two tasks remotely comparable in difficulty.
I think it's more of a data vs intelligence thing.
They are separate dimensions. There are problems that don't require any data, just "thinking" (many parts of math sit here), and there are others where data is the significant part (e.g. some simple causality for which we have a bunch of data).
Certain problems are in-between the two (probably a react refactor sits there). So no, tokens are probably no good proxy for complexity, data heavy problems will trivially outgrow the former category.
Try the refactor again tomorrow. It might have gotten easier or more difficult.
> I like to imagine that the number of consumed tokens before a solution is found is a proxy for how difficult a problem is (...)
The number of tokens required to get to an output is a function of the sequence of inputs/prompts, and how a model was trained.
You have LLMs quite capable of accomplishing complex software engineering work that struggle with translating valid text from English to some other languages. The translations can be improved with additional prompting, but that doesn't mean the problem is more challenging.
You might be joking, but you're probably also not that far off from reality.
I think more people should question all this nonsense about AI "solving" math problems. The details about human involvement are always hazy and the significance of the problems are opaque to most.
We are very far away from the sensationalized and strongly implied idea that we are doing something miraculous here.
I am kind of joking, but I actually don't know where the flaw in my logic is. It's like one of those math proofs that 1 + 1 = 3.
If I were to hazard a guess, I think that tokens spent thinking through hard math problems probably correspond to harder human thought than tokens spent thinking through React issues. I mean, LLMs have to expend hundreds of tokens to count the number of r's in strawberry. You can't tell me that if I count the number of r's in strawberry 1000 times I have done the mental equivalent of solving an open math problem.
You can spend countless "tokens" solving minesweeper or sudoku. This doesn't mean that you solved difficult problems: just that the solutions are very long and, while each step requires reasoning, the difficulty of that reasoning is capped.
A lot of math problems/proofs are like minesweeper or sudoku in a way though. They're a long series of individually kinda simple logical deductions that eventually result in a solution. Some really hard problems are only really hard because each one of those "simple" deductions requires you to have expert knowledge in some disparate area to make that leap.
Some thoughts.
1. LLMs aren't "efficient", they seem to be as happy to spin in circles describing trivial things repeatedly as they are to spin in circles iterating on complicated things.
2. LLMs aren't "efficient", they use the same amount of compute for each token but sometimes all that compute is making an interesting decision about which token is the next one and sometimes there's really only one follow up to the phrase "and sometimes there's really only" and that compute is clearly unnecessary.
3. A (theoretical) efficient LLM still needs to emit tokens to tell the tools to do the obviously right things like "copy this giant file nearly verbatim except with every `if foo` replaced with `for foo in foo`". An efficient LLM might use less compute for those trivial tokens where it isn't making meaningful decisions, but if your metric is "tokens" and not "compute" that's never going to show up.
Until we get reasonably efficient LLMs that don't waste compute quite so freely I don't think there's any real point in trying to estimate task complexity by how long it takes an LLM.
I fear that under those constraints, the only optimal output is “42”
This is interesting, I like the thought about "what makes something difficult". Focusing just on that, my guess is that there are significant portions of work that we commonly miss in our evaluations:
1. Knowing how to state the problem. I.e., go from the vague problem of "I don't like this, but I do like this" to the more specific problem of "I desire property A". In math a lot of open problems are already precisely stated, but then the user has to do the work of _understanding_ what the precise statement is.
2. Verifying that the proposed solution actually is a full solution.
This math problem actually illustrates them both really well to me. I read the post, but I still couldn't do _either_ of the steps above, because there's a ton of background work to be done. Even if I was very familiar with the problem space, verifying the solution requires work -- manually looking at it, writing it up in coq, something like that. I think this is similar to the saying "it takes 10 years to become an overnight success"
>The details about human involvement are always hazy and the significance of the problems are opaque to most.
Not really. You're just in denial and are not really all that interested in the details. This very post has the transcript of the chat of the solution.
I mean the details are in the post. You can see the conversation history and the mathematician survey on the problem
The capabilities of AI are determined by the cost function it's trained on.
That's a self-evident thing to say, but it's worth repeating, because there's this odd implicit notion sometimes that you train on some cost function, and then, poof, "intelligence", as if that was a mysterious other thing. Really, intelligence is minimizing a complex cost function. The leadership of the big AI companies sometimes imply something else when they talk of "generalization". But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
You can view the progress of AI as progress in coming up with smarter cost functions: Cleaner, larger datasets, pretraining, RLHF, RLVR.
Notably, exciting early progress in AI came in places where simple cost functions generate rich behavior (Chess, Go).
The recent impressive advances in AI are similar. Mathematics and coding are extremely structured, and properties of a coding or maths result can be verified using automatic techniques. You can set up a RLVR "game" for maths and coding. It thus seems very likely to me that this is where the big advances are going to come from in the short term.
However, it does not follow that maths ability on par with expert mathematicians will lead to superiority over human cognitive ability broadly. A lot of what humans do has social rewards which are not verifiable, or includes genuine Knightian uncertainty where a reward function can not be built without actually operating independently in the world.
To be clear, none of the above is supposed to talk down past or future progress in AI; I'm just trying to be more nuanced about where I believe progress can be fast and where it's bound to be slower.
> But there is no mechanism to generate a model with capabilities beyond what is useful to minimize a specific cost function.
Can you give some examples?
It's not obvious that there is anything that can't be written as an optimization problem.
Even generalizations that were advanced for their time, such as complex numbers, can be said to optimize something, e.g. the number of mathematical symbols you need to do certain proofs.
I think you're misreading me. My point isn't that you can't in principle state the optimization problem, but that it's much easier in some domains than in others, that this tracks with how AI has been progressing, and that progress in one area doesn't automatically mean progress in another, because current AI cost functions are less general than the cost functions that humans are working with in the world.
I am thinking there’s a large category of problems that can be solved by resampling existing proofs. It’s the kind of brute force expedition machine can attempt relentlessly where humans would go mad trying. It probably doesn’t really advance the field, but it can turn conjectures into theorems.
I wonder if teaching an LLM how to write Prolog and then letting it write it could be a great way to explore spaces like this in the future.
I only ever learned it in school, but if memory serves, Prolog is a whole "given these rules, find the truth" sort of language, which aligns well with these sorts of problem spaces. Mix and match enough, especially across disparate domains, and you might derive some really interesting things that are low-hanging fruit just waiting to be discovered.
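To make that concrete (in Python rather than Prolog, just to keep the examples in one language), the "given these rules, find the truth" style is basically forward chaining: keep applying rules to known facts until nothing new falls out. The facts and the single rule here are made up purely for illustration:

```python
# Toy forward-chaining sketch of the "given these rules, derive new truths" idea.
# The facts and the rule are invented for illustration; a real system (or Prolog)
# would have a far richer rule language than simple pair-matching like this.
facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def apply_rules(known):
    derived = set()
    for (r1, a, b) in known:
        for (r2, c, d) in known:
            # parent(X, Y) and parent(Y, Z) => grandparent(X, Z)
            if r1 == "parent" and r2 == "parent" and b == c:
                derived.add(("grandparent", a, d))
    return derived

# Keep applying the rules until nothing new can be derived (a fixed point).
while True:
    new = apply_rules(facts) - facts
    if not new:
        break
    facts |= new

print(facts)  # now includes ("grandparent", "alice", "carol")
```

Prolog gives you this kind of search (plus backtracking and unification) natively, which is why it keeps coming up in discussions like this.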
Indeed; I can't find my old comment on the topic, but that's exactly the point: it's not how feasible it is to "find" a new proof, but rather how meaningful those proofs are. Are they yet another iteration of the same kind, perfectly fitting the current paradigm and thus bringing very little to the table, or are they radical and thus potentially (but not always) opening up the field?
With brute force, or slightly better than brute force, it's most likely the first, thus not totally pointless but probably not very useful. In fact it might not even be worth the tokens spent.
I'm of the opinion that everything we've discovered is via combinatorial synthesis. Standing on the shoulders of giants and all that. I'm not sure I've seen any convincing argument that we've discovered anything ex nihilo.
How about this guy? https://en.wikipedia.org/wiki/Srinivasa_Ramanujan
How do you think you can design a benchmark to solve truly novel problems?
Their 'Open Problems page' linked below gives some interesting context. They list 15 open problems in total, categorized as 'moderately interesting,' 'solid result,' 'major advance,' or 'breakthrough.' The solved problem is listed as 'moderately interesting,' which is presumably the easiest category. But it's notable that the problem was selected and posted here before it was solved. I wonder how long until the other 3 problems in this category are solved.
https://epoch.ai/frontiermath/open-problems
I’d hope this isn’t a goal post move - an open math problem of any sort being solved by a language model is absolute science fiction.
That's been achieved already with a few Erdös problems, though those tended to be ambiguously stated in a way that made them less obviously compelling to humans. This problem is obscure, even the linked writeup admits that perhaps ~10 mathematicians worldwide are genuinely familiar with it. But it's not unfeasibly hard for a few weeks' or months' work by a human mathematician.
FWIW https://github.com/teorth/erdosproblems/wiki/AI-contribution... in particular the disclaimers are very interesting.
It is not. You're operating under the assumption that all open math problems are difficult and novel.
This particular problem was about improving the lower bound for a function tracking a property of hypergraphs (undirected graphs where edges can contain more than two vertices).
Both constructing hypergraphs (sets) and lower bounds are very regular, chore type tasks that are common in maths. In other words, there's plenty of this type of proof in the training data.
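To give a sense of how tame the objects themselves are, here is a made-up example of the data structure in Python (not the construction from the problem, just what a small 3-uniform hypergraph looks like):

```python
# A tiny 3-uniform hypergraph: vertices plus edges that each contain three vertices.
# The specific vertices and edges here are made up purely to illustrate the structure.
vertices = {1, 2, 3, 4, 5}
edges = {frozenset({1, 2, 3}), frozenset({1, 4, 5}), frozenset({2, 3, 4})}

# "3-uniform" just means every edge has exactly three vertices.
assert all(len(e) == 3 for e in edges)
```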
LLMs kind of construct proofs all the time, every time they write a program. Because every program has a corresponding proof. It doesn't mean they're reasoning about them, but they do construct proofs.
This isn't science fiction. But it's nice that the LLMs solved something for once.
> nice that the LLMs solved something for once.
That sentence alone needs unpacking IMHO, namely that no LLM suddenly decided that today was the day it would solve a math problem. Instead a couple of people who love mathematics, doing it either for fun or professionally, directly ask a model to solve a very specific task that they estimated was solvable. The LLM itself was fed countless related proofs. They then guided the model and verified until they found something they considered good enough.
My point is that the system itself is not the LLM alone, as that would be radically more impressive.
I 100% agree. The LLM was just used to autocomplete a ready-made strategy.
I've never yet been "that guy" on HN but... the title seems misleading. The actual title is "A Ramsey-style Problem on Hypergraphs" and a more descriptive title would be "All latest frontier models can solve a frontier math open problem". (It wasn't just GPT 5.4)
Super cool, of course.
"In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh)."
I find that very surprising. This problem seems out of reach 3 months ago but now the 3 frontier models are able to solve it.
Is everybody distilling each others models? Companies sell the same data and RL environment to all big labs? Anybody more involved can share some rumors? :P
I do believe that AI can solve hard problems, but that progress is so distributed in a narrow domain makes me a bit suspicious somehow that there is a hidden factor. Like did some "data worker" solve a problem like that and it's now in the training data?
Yes there's a whole ecosystem of companies that create and sell RL gyms to AI labs and of course they develop their own internally too. You don't hear much about this ecosystem because RL at scale is all private. Nearly no academic research on it.
A lot of this is probably just throwing roughly equal amounts of compute at continuous RLVR training. I'm not convinced there's any big research breakthrough that separates GPT 5.4 from 5.2. The diff is probably more than just checkpoints but less than neural architecture changes and more towards the former than the latter.
I think it's just easy to underestimate how much impact continuous training+scaling can have on the underlying capabilities.
Maybe so, but GPT 5.4 is absolutely pulling ahead. You can see the differences visually on https://minebench.ai/.
Is it possible the AI labs are seeding their models with these solved problems? Like, if I was Sam Altman with a bazillion dollars of investment I would pay some mathematicians to solve some of these problems so that the models could "solve" them later on. Not that I think it's what's happening here of course...
But it is pretty funny how 5.4 miscounted the number of 1's in 18475838184729 on the same day it solved this.
> Subsequent to this solve, we finished developing our general scaffold for testing models on FrontierMath: Open Problems. In this scaffold, several other models were able to solve the problem as well: Opus 4.6 (max), Gemini 3.1 Pro, and GPT-5.4 (xhigh).
Interesting. What's that “scaffold”? A sort of unit test framework for proofs?
I think in this context, scaffolds are generally the harness that surrounds the actual model. For example, any tools, ways to lay out tasks, or auto-critiquing methods.
I think there's quite a bit of variance in model performance depending on the scaffold so comparisons are always a bit murky.
Usually involves a lot of agents and their custom contexts or system prompts.
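A minimal harness, just to show the shape of the thing (the critique prompt and stopping rule here are invented; this is not Epoch's actual scaffold, and ask_model() is a stub standing in for whatever chat API the harness calls):

```python
# Minimal sketch of a "scaffold": a loop around a model that asks it to criticize
# and revise its own answer before anything is accepted.
def ask_model(prompt: str) -> str:
    # stand-in for a real chat-completion call
    return "stubbed model response to: " + prompt[:40]

def solve_with_scaffold(problem: str, max_rounds: int = 3) -> str:
    answer = ask_model(f"Solve the following problem:\n{problem}")
    for _ in range(max_rounds):
        critique = ask_model(f"Find flaws in this attempted solution:\n{answer}")
        if "no flaws" in critique.lower():  # naive stopping rule, purely illustrative
            break
        answer = ask_model(f"Revise the solution to address these flaws:\n{critique}\n\nOriginal:\n{answer}")
    return answer

print(solve_with_scaffold("Improve the lower bound for ..."))
```

Real scaffolds add tools, token budgets, and ways to split the task up, but the basic propose/critique/revise loop is the common core.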
I was trying to get Claude and Codex to try and write a proof in Isabelle for the Collatz conjecture, but annoyingly it didn't solve it, and I don't feel like I'm any closer than I was when I started. AI is useless!
In all seriousness, this is pretty cool. I suspect that there's a lot of theoretical math that haven't been solved simply because of the "size" of the proof. An AI feedback loop into something like Isabelle or Lean does seem like it could end up opening up a lot of proofs.
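A crude version of that feedback loop is easy to sketch. This assumes a working `lean` executable on PATH that exits non-zero when a file fails to check; the proof-generation step is a stub standing in for the model:

```python
# Crude sketch of a generate-and-check loop against a proof assistant.
# Assumes a `lean` executable on PATH that returns a non-zero exit code when the
# file fails to elaborate; propose_proof() is a stub for the model call.
import subprocess
import tempfile

def propose_proof(statement: str, feedback: str) -> str:
    # stand-in for asking a model to (re)write the proof, given checker feedback
    return f"-- attempt informed by: {feedback}\ntheorem goal : 1 + 1 = 2 := rfl\n"

def check_with_lean(source: str) -> tuple[bool, str]:
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

feedback = "no previous attempt"
for attempt in range(5):
    proof = propose_proof("1 + 1 = 2", feedback)
    ok, feedback = check_with_lean(proof)
    if ok:
        print("checked!")
        break
```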
I got Gemini to find a polynomial-time algorithm for integer factoring, but then I mysteriously got locked out of my Google account. They should at least refund me the tokens.
That sounds like the start of a very lucrative career. Are you sure it was Gemini and not an AI competitor offering affiliate commission? ;)
I feel like this single image perfectly sums up the entire thread here: https://trapatsas.eu/sites/llm-predictions/
It's not like this is new to AI
https://oertx.highered.texas.gov/courseware/lesson/1849/over...
Yes, and no matter when "now" is, the doubters will always see in their mind's eye the flat line extending to the right.
That's tautological
As someone with only passing exposure to serious math, this section was by far the most interesting to me:
> The author assessed the problem as follows.
> [number of mathematicians familiar, number trying, how long an expert would take, how notable, etc]
How reliably can we know these things a-priori? Are these mostly guesses? I don't mean to diminish the value of guesses; I'm curious how reliable these kinds of guesses are.
For number of mathematicians familiar with and actively working on the problem, modern mathematics research is incredibly specialized, so it's easy to keep track of who's working on similar problems. You read each other's papers, go to the same conferences etc.
For "how long an expert would take" to solve a problem, for truly open problems I don't think you can usually answer this question with much confidence until the problem has been solved. But once it has been solved, people with experience have a good sense of how long it would have taken them (though most people underestimate how much time they need, since you always run into unanticipated challenges).
Read about Paul Erdös... not all math is the Riemann Hypothesis, there is yeoman's work connecting things and whatever...
Certainly knowing how many/which people are working on a problem you are looking at, and how long it will take you to solve it, are critical skills in being a working researcher. What kind of answer are you looking for? It's hard to quantify. Most suck at this type of assessment as a PhD student and then you get better as time goes on.
I feel like reading some of these comments, some people need to go and read the history of ideas and philosophy (which is easier today than ever before with the help of LLMs!)
It's like I'm reading 17th-18th century debates spurring the same arguments between rationalists and empiricists, lol. Maybe we're due for a 21st century Kant.
Is their scaffold available? Does it do anything special beyond feeding the warmup, single challenge, and full problem to an LLM? Because it's interesting that GPT-5.2 Pro, arguably the best model until a few months ago, couldn't even solve the warmup. And now every frontier model can solve the full problem. Even the non-Pro GPT-5.4. Also strange that Gemini 3 Deep Think couldn't solve it, whereas Gemini 3.1 Pro could. I read that Deep Think is based on 3.1 Pro. Is that correct?
I see that GPT-5.2 Pro and Gemini 3 Deep Think simply had the problems entered into the prompt. Whereas the rest of the models had a decent amount of context, tips, and ideas prefaced to the problem. Were the newer models not able to solve this problem without that help?
Anyway, impressive result regardless of whether previous models could've also solved it and whether the extra context was necessary.
I know these frontier models behave differently from each other. I wonder how many problems they could solve combining efforts.
It's deeply surprising to me that LLMs have had more success proving higher math theorems than making successful consumer software
Software developers have spent decades at this point discounting and ignoring almost all objective metrics for software quality and the industry as a whole has developed a general disregard for any metric that isn't time-to-ship (and even there they will ignore faster alternatives in favor of hyped choices).
(Edit: Yes, I'm aware a lot of people care about FP, "Clean Code", etc., but these are all red herrings that don't actually have anything to do with quality. At best they are guidelines for less experienced programmers and at worst a massive waste of time if you use more than one or two suggestions from their collection of ideas.)
Most of the industry couldn't use objective metrics for code quality and the quality of the artifacts they produce without also abandoning their entire software stack because of the results. They're using the only metric they've ever cared about; time-to-ship. The results are just a sped up version of what we've had now for more than two decades: Software is getting slower, buggier and less usable.
If you don't have a good regulating function for what represents real quality you can't really expect systems that just pump out code to actually iterate very well on anything. There are very few forcing functions to use to produce high quality results though iteration.
But we don't even seem to be getting faster time-to-ship in any way that anybody can actually measure; it's always some vague sense of "we're so much more productive".
That's a fair observation and one that I don't really have an answer for. I can say from personal experience that I believe that shipping nonsense code has never been faster. That's just an anecdote, obviously.
We need a bigger version of the METR study on perceived vs. real productivity[0], I guess. It's a thankless job, though, since people will assume/state even at publication time that "Everything has progressed so much, those models and agents sucked, everything is 10 times better now!" and you basically have to start a new study, repeat ad infinitum.
One problem that really complicates things is that the net competency of these models seems really spotty and uneven. They're apparently out here solving math problems that seemingly "require thinking", but at the same time will write OpenGL code that will produce black screens on basically every driver, not produce the intended results and result in hours of debugging time for someone not familiar enough. That's despite OpenGL code being far more prevalent out there than math proofs, presumably. How do you reliably even theorize about things like this when something can be so bad and (apparently) so good at the same time?
0 - https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
This doesn't pass a sniff test. We have plenty of ways to verify good software, else you wouldn't be making this post. You know what bad software is and looks like. We want something fast that doesn't throw an error every 3 page navigations.
You can ask an LLM to make code in whatever language you want. And it can be pretty good at writing efficient code, too. Nothing about NPM bloat is keeping you from making a lean website. And AI could theoretically be great at testing all parts of a website, benchmarking speeds, trying different viewports etc.
But unfortunately we are still on the LLM train. It just doesn't have anything built-in to do what we do, which is use an app and intuitively understand "oh this is shit." And even if you could allow your LLM to click through the site, it would be shit at matching visual problems to actual code. You can forget about LLMs for true frontend work for a few years.
And they are just increasingly worse with more context, so any non-trivial application is going to lead to a lot of strange broken artifacts, because text prediction isn't great when you have numerous hidden rules in your application.
So as much as I like a good laugh at failing software, I don't think you can blame shippers for this one. LLMs are not struggling in software development because they are averaging a lot of crap code, it's because we have not gotten them past unit tests and verifying output in the terminal yet.
Pretty much all consumer software made in 2026 is heavily using AI in its development. So I'm not sure what basis you have for your assertion.
They haven't, not at all as far as I can tell. This math problem appears to be a nice chore to be solved, the equivalent to "Claude, optimize this code" or "Write a parser", which is being done 100000x a day.
The original researchers who proposed this problem tried and failed multiple times to solve it. Does that sound like a 'nice chore to be solved' to you ?
That's interesting context, where do you see that? I'm going off of the label "Moderately interesting".
edit: I see in the full write up that the contributor says that they'd estimate an expert would take 1-3 months to do this. They also note that they came up with this solution independently but hadn't confirmed it.
https://epochai.substack.com/p/first-ai-solution-on-frontier...
>The newly-solved problem came from Will Brian, who had placed it in the Moderately Interesting category. It is a conjecture from a paper he wrote with Paul Larson in 2019. They were unable to solve it at the time, or in several attempts since. Brian had this to say.
I actually still don't see the source for them trying several times, but we can take that for granted. Regardless, as I said:
1. It's labeled as "moderately interesting"
2. They said that they expect an expert could solve it in 1-3 months
3. They had already come up with the solution that the AI had but weren't convinced it would have worked
So how big was the gap here, do you think?
Yes, a "moderately interesting" Open problem.
I can't think of any chores that would take an expert months to complete. I can't think of any chores that I've completed but was then 'unconvinced could work'. Please sit down and think about what you are saying here. Are we still talking about chores ?
One of the more strange phenomena with machines getting better and the incessant need (seemingly driven by human exceptionalism) to downplay each result, is that you just end up belittling humans in the process.
This is significant. Your analogy is wrong. It's fine to admit it.
Writing a complex parser or certainly a compiler is a 1 - 3 month project, for example.
Again, I'm not trying to downplay this, but to frame this accurately. I think an AI being able to build a parser/ compiler is cool too.
> One of the more strange phenomena with machines getting better and the incessant need (seemingly driven by human exceptionalism) to downplay each result, is that you just end up belittling humans in the process.
I don't believe in human exceptionalism at all, don't attribute positions to me.
>Writing a complex parser or certainly a compiler is a 1 - 3 month project, for example.
1. Estimating time completion of something that has been done multiple times before and an open problem that has not yet been solved is a different matter entirely. 1 to 3 months is an educated guess and more likely than not, an underestimate.
2. I do not think months long complex compilers and parsers are being routinely completed by LLMs as your original comment implied. Regardless, they are different classes of problems.
I don't get what either of your points is intended to demonstrate. Let's revisit the first post I replied to:
> It's deeply surprising to me that LLMs have had more success proving higher math theorems than making successful consumer software
As far as I can tell, they absolutely have not had more success in this area relative to making successful consumer software.
Well we are kind of arguing past each other aren't we ?
"More success" is a bit vague in this instance but building a compiler that would take a programmer 1 to 3 months is not comparable to this result regardless of whatever similarity exists in time completion estimates. That's the point.
You can publish a paper (and in fact the researchers plan to) off this result. A basic compiler is cool but otherwise unremarkable. It's been done many times before.
You are leaning too hard on how long the researchers (who again did not manage to solve the problem in their attempts) estimated this would take and the "moderately interesting" tag of again, what was still an open research problem.
This, alongside a few math and physics results that have cropped up in the last few months is easily more impressive than the vast majority of work being done with LLMs for software.
> "More success" is a bit vague in this instance but building a compiler that would take a single programmer 1 to 3 months is not comparable to this result regardless of whatever similarity exists in time completion estimates. That's the point.
I guess we just disagree on this. It's not clear to me that these are totally different in terms of what they represent.
> You can publish a paper (and in fact the researchers plan to) off this result. A basic compiler is cool but otherwise unremarkable.
Publishing papers means very, very little to me. I can publish a paper on a programming language, you know that, right?
> You are leaning too hard on how long the researchers (who again did not manage to solve the problem in their attempts) estimated this would take and the "moderately interesting" tag of again, what was an open research problem.
I obviously estimate my "leanings" as being appropriate. I'm just using the researchers direct quotes. Factually, they had already come up with the approach that ultimately panned out. Factually, they estimated that a human could do this in some timeframe. What am I overly leaning on here?
> This, alongside a few math results that have cropped up in the last few months is easily more impressive than the vast majority of work being done with LLMs for software.
I think both are impressive, I don't know that I would draw some sort of big conclusions about it at this point. I definitely wouldn't draw the conclusion that AI is better at formal mathematics than producing software.
>Publishing papers means very, very little to me. I can publish a paper on a programming language, you know that, right?
We both know that you are not getting that published in a reputable journal without a lot of effort beyond merely 'publishing the language I created', but sure, I'm sure you can get something on arxiv.
>I obviously estimate my "leanings" as being appropriate. I'm just using the researchers direct quotes. Factually, they had already come up with the approach that ultimately panned out.
This really should not be hard to understand.
1. One is something that has been done many times before and the other an unsolved problem. It doesn't take a genius to see one estimate is likely much stronger than the other. If your point hinges on comparing them directly, it's pretty weak.
2. A moderately interesting open research problem is not the same thing as a moderately interesting problem and you seem to be conflating the two.
> We both know that you are not getting that published in a reputable journal without a lot of effort beyond merely 'publishing the language I created'. But sure, you can get something on arxiv.
lol what? There are papers on programming languages all the time.
> 1. One is something that has been done many times before and the other an unsolved problem. It doesn't take a genius to see one estimate is likely much stronger than the other.
Building a compiler for a new programming language, building net new code, etc, is all stuff that was unsolved / had not been done before.
> 2. A moderately interesting open research problem is not the same thing as a moderately interesting problem and you seem to be conflating the two.
Feel free to explain the difference, I guess.
>lol what? There are papers on programming languages all the time.
Sure and have you read them ? They're the results of many months or years of research and development so I really don't know what point you think you are making here.
>Building a compiler for a new programming language, building net new code, etc, is all stuff that was unsolved / had not been done before.
Okay but that's not taking a month or two or being asked of LLMs x10000 every day so thanks for making my point I guess.
>Feel free to explain the difference, I guess.
No thanks. If you don't understand it that's fine. This has run its course anyway.
Yeah idk what you're going on about lol
Also, the full write up does not say the researchers solved it.
> I had previously wondered if the AI’s approach might be possible, but it seemed hard to work out.
They didn't solve it, that's fair. They did consider the approach already.
But the title claims it is a "frontier" math problem, so which is it, really?
Domain-experienced users are effectively training LLMs to mimic themselves in solving their problems, and are therefore solving their problems via chat data concentration.
There's no denying at this point that AI can produce something novel, and it will be doing more of this moving forward.
I'm not sure AI can have clever or new ideas; it still seems to me that it combines existing knowledge and executes algorithms.
I am not necessarily saying humans do something different either, but I have yet to see a novel solution from an AI that is not simply an extrapolation of current knowledge.
Speaking as a researcher, the line between new ideas and existing knowledge is very blurry and maybe doesn't even exist. The vast majority of research papers get new results by combining existing ideas in novel ways. This process can lead to genuinely new ideas, because the results of a good project teach you unexpected things.
My biggest hesitation with AI research at the moment is that they may not be as good at this last step as humans. They may make novel observations, but will they internalize these results as deeply as a human researcher would? But this is just a theoretical argument; in practice, I see no signs of progress slowing down.
This is my take as well. A human who learns, say, a Towers of Hanoi algorithm will be able to apply it and use it next time without having to figure it out all over again. An LLM would probably get there eventually, but would have to do it all over again from scratch the next time. This makes it difficult to combine lessons in new ways. Any new advancement relying on that foundational skill relies on, essentially, climbing the whole mountain from the ground.
I suppose the other side of it is that if you add what the model has figured out to the training set, it will always know it.
We call that Standing On The Shoulders Of Giants and revere Isaac Newton as clever, even though he himself stated that he was standing on the shoulders of giants.
Clever/novel ideas are very often subtle deviations from known, existing work.
Sometimes just having the time/compute to explore the available space with known knowledge is enough to produce something unique.
There is no such thing. All new ideas are derived from previous experiences and concepts.
The difference people are neglecting to point out is the experiences we have versus the experiences the AI has.
We have at least 5 senses, our thoughts, feelings, hormonal fluctuations, sleep and continuous analog exposure to all of these things 24/7. It's vastly different from how inputs are fed into an LLM.
On top of that we have millions of years of evolution toward processing this vast array of analog inputs.
So, just connect LLMs to lava lamps?
Jokes aside, imagine you give LLMs access to real-time, world-wide satellite imagery and just tell them to discover new patterns/phenomena and correlations in the world.
"extrapolation" literally implies outside the extents of current knowledge.
Yes, but not necessarily new knowledge.
It means extending/expanding something, but the information is based on the current data.
In computer games, extrapolation means finding the future position of an object based on its current position, velocity, and the desired time. We do get some "new" position, but the system's entropy/information is the same.
Or if we have a line, we can extend it infinitely and get new points, but that information was already there in the y = m * x + b line formula.
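A tiny sketch of that point, with made-up numbers: the "new" values below are fully determined by the inputs and the model, so nothing is added.

    # Game-style extrapolation: predict a future position from the current
    # position, velocity and elapsed time. The "new" position is already
    # implied by those three numbers.
    def extrapolate_position(pos, velocity, dt):
        return pos + velocity * dt

    # Likewise, every "new" point on a line was already contained in m and b.
    def line(m, b, x):
        return m * x + b

    print(extrapolate_position(10.0, 2.5, 4.0))  # 20.0
    print(line(3.0, 1.0, 100.0))                 # 301.0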
How would you know if it wasn't an extrapolation of current knowledge? Can you point me to something humans have done which isn't an extrapolation?
That was my point: "I am not necessarily saying humans do something different".
I mean, I can run a pseudo random number generator, and produce something novel too.
Is this novel? It's new. But we already know AI can generate new things, any statistical reassembly of any content will generate new things.
It's not to downplay this, but it's unclear what "novel" means here or what you think the implications are.
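To make the PRNG point concrete (this is nothing like how an LLM actually samples; it just shows that statistical novelty is cheap):

    import random

    # A PRNG will happily emit a string that has almost certainly never
    # existed before. It is "new", but nobody would call it insight.
    random.seed(42)  # fixed seed so the example is reproducible
    novel_string = "".join(random.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(80))
    print(novel_string)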
Reading this thread I'm reassured that despite everything AI may disrupt, humans arguing past each other about philosophy of knowledge and epistemology on internet forums is safe :')
Impressive, but it will take away so much sense of accomplishment from so many people. I find that really sad.
I don't understand the position that learning through inference/example is somehow inferior to top-down/rules-based learning.
Humans learn many, and perhaps even the majority, of things through observed examples and inference of the "rules". Not from primers and top down explanation.
E.g. Observing language as a baby. Suddenly you can speak grammatically correctly even if you can't explain the grammar rules.
Or: Observing a game being played to form an understanding of the rules, rather than reading the rulebook
Further: the majority of "novel" insights are simply the combination of existing ideas.
Look at any new invention, music, art etc and you can almost always reasonably explain how the creator reached that endpoint. Even if it is a particularly novel combination of existing concepts.
Seems like the high-compute parallel-thinking models weren't even needed; both the normal 5.4 and Gemini 3.1 Pro solved it. Somehow Gemini 3 Deep Think couldn't solve it.
This is impressive, but OpenAI is still shit as a company. How dare they even have "open" in their company name.
Is it a coincidence that the first open problems solved by an LLM and a 4chan thread would be in the same field?
Do they also publish the raw output of the model, i.e. not only the final response but also everything generated for internal reasoning or tool use?
Been a long three years since single digit addition was a serious challenge for even top tier models
Fantastic and exciting stuff!
I wonder how much of this meteoric progress in actually creating novel mathematics is because the training data is of a much higher standard than code, for example.
New goalpost, and I promise I'm not being facetious at all, genuinely curious:
Can an AI pose a frontier math problem that is of any interest to mathematicians?
I would guess that 1) AI solving frontier math problems and 2) AI posing interesting/relevant math problems, taken together, would be an "oh shit" moment. Because that would be true PhD-level research.
Considering that an LLM simply remixes what it finds in its learned distribution over text, it's possible that it can pose new math problems by identifying gaps ("obvious" in retrospect) that humans may have missed (like connecting two known problems to pose a new one). What LLMs can't currently do is pose new problems by observing the real world and its ramifications, like that moving sofa problem.
Yes. I doubt it can do that.
But who asked the model to solve that problem?
This is a remarkable result if confirmed independently. The gap between solving competition problems and open research problems has always been significant - bridging that gap suggests something qualitatively different in the model capabilities.
I feel like there’s a fork in our future approaching where we’ll either blossom into a paradise for all or live under the thumb of like 5 immortal VCs
Change is always hard, even if it will be good in 20 years, the transitions are always tough.
Sometimes the transition is tough and then the end state is also worse!
Hoping that won't be the case with AI but we may need some major societal transformations to prevent it.
> This problem is about improving lower bounds on the values of a sequence that arises in the study of simultaneous convergence of sets of infinite series, defined as follows.
One thing I notice in the AlphaEvolve paper, as well as here, is that these LLMs have been shown to solve optimization problems, something we have been using computers for, for a very long time. In fact, I think the AlphaEvolve-style prompt-augmentation approach is a more principled version of what these researchers have done here, and I am fairly confident this problem would have been solved with that approach as well.
In spirit, the LLM either computes the (meta-)optimization steps in activation space, or it is merely retrieving candidate proposals.
It would be interesting to see whether we can extract or model the exact algorithms from the activations, or whether the model is simply retrieving and proposing the deductive closure of said computation.
In the latter case, it would mean that LLMs alone can never "reason" and you need an external planner plus verifier (an AlphaEvolve-style evolutionary planner, for example).
We are still looking for proof of the former behaviour.
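For anyone who hasn't seen the AlphaEvolve-style setup, here is a rough sketch of the outer loop I mean. propose_with_llm() and score() are hypothetical placeholders, not any real API: the model only proposes candidates, an external verifier scores them, and the evolutionary loop does the actual optimization.

    import random

    def propose_with_llm(parent):
        # Placeholder for "ask the model to mutate/improve this candidate".
        return parent + random.choice("abc")

    def score(candidate):
        # Placeholder for a verifier, e.g. checking a bound or running tests.
        return candidate.count("a") - 0.1 * len(candidate)

    def evolve(seed, generations=20, children=8):
        best = seed
        for _ in range(generations):
            pool = [propose_with_llm(best) for _ in range(children)] + [best]
            best = max(pool, key=score)  # keep only the best-scoring candidate
        return best

    print(evolve(""))

The design point is simply that the LLM sits inside a propose/verify loop rather than doing the whole optimization in one forward pass.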
What are the odds that this is because OpenAI is pouring more money into high-publicity stunts like this, rather than its model actually being better than Anthropic's?
I guess this means AI researchers should be out of jobs very soon.
Setting aside the supposed achievement, which is only supposedly confirmed, my point is that Epoch.ai may just be a PR firm for *Western* AI providers, in which case this news may be untrustworthy.
First prove the solution wasn’t in the training data. Otherwise it’s all just vibes and ‘trust me bro.’
This is a lot like the 50 million monkeys on 50 million typewriters who will eventually write Shakespeare... We have all heard this; pity the poor proofreaders who would have to check it all in search of the holy grail of zero errors.
In a similar way, LLMs are permutational cross-associating engines, matched with sieves to filter out the dross. Less filtering means more dross, a.k.a. slop. They can certainly create enormous masses of bad code, and with well-tuned screens for dross they can create passable code, but stray flaws (flies) can creep in and escape the filter, and humans are better at spotting flies in their oatmeal.
AI also seems very good at mounting permutational assaults on masses of code to find the flies (zero-days), so I expect it to make code more secure, since few humans have the ability or time to mount that sort of assault on code bases. I see this idea has already taken root among code writers as well as hackers, China, etc. These two opposing forces will assault code bases, one to break and one to fortify. In time there will be fewer places where code bases hide flaws, since soon all new code will be screened by AI to find breaks, so that little or no code will contain these bugs.
> This is a lot like the 50 million monkeys on 50 million typewriters will eventually write shakespeare...
"Eventually" here is something on the order of a few expected lifespans of the universe.
The fact that we're getting meaningful results out of LLMs on a human timescale means that they're doing something very different.
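Back-of-envelope, with my own illustrative numbers, for just one short line of Shakespeare rather than the complete works:

    # 50 million monkeys, 5 keystrokes per second each, 27-symbol alphabet,
    # target roughly "to be or not to be that is the question" (~40 chars).
    line_length = 40
    alphabet = 27
    monkeys = 50_000_000
    keys_per_second = 5

    p_hit = alphabet ** -line_length                 # chance a given attempt matches
    attempts_per_second = monkeys * keys_per_second  # crude upper bound on attempts
    expected_years = 1 / (p_hit * attempts_per_second) / 3.15e7

    print(f"{expected_years:.1e} years")  # ~1e41 years vs ~1.4e10 years since the Big Bang

Obviously not a rigorous model, but it makes the point: "eventually" is not a timescale we live on.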
wow nice
Really? He was steering the whole time. GPT didn't do the math.
Ah, the good old Clever Hans. https://en.wikipedia.org/wiki/Clever_Hans
It's an article from an AI site. People with a vested interest are desperate to prove it's not an expensive parrot.
Fantastic news! That means with the right support tooling existing models are already capable of solving novel mathematics. There’s probably a lot of good mathematics out there we are going to make progress on.
We only get one shot.
A model whose internals we don't have access to solved a problem that we can't verify wasn't in its training data. Great, I'm impressed.