ARC-AGI-3

(arcprize.org)

186 points | by lairv 4 hours ago

27 comments

  • Tiberium 2 hours ago

    https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):

    - Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving and you don't compare the score against a human average but against the second best human solution

    - The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve it and the model 100 steps then the model gets a score of 1% ((10/100)^2)

    - 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (I know, not quite right). If the median human takes 1.5x more steps than your 2nd fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom 10% guy, who maybe solves 30% of levels but takes 3x more steps to solve them. This guy would get a score of about 3%

    - The scoring is designed so that even if AI performs on a human level it will score below 100%

    - No harness at all and very simplistic prompt

    - Models can't use more than 5X the steps that a human used

    - Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"
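The squared-efficiency arithmetic in the scoring bullets above can be sketched as follows (a toy sketch of the rule as described in this comment; the real metric also weights later levels more heavily, which is omitted here):

```python
def level_score(baseline_actions: int, model_actions: int) -> float:
    """Squared efficiency vs. the human baseline (2nd-best action count),
    capped at 1.0 so beating the baseline can't exceed 100%."""
    return min(1.0, (baseline_actions / model_actions) ** 2)

# Human baseline 10 actions, model 100 actions -> 1%
print(level_score(10, 100))         # 0.01

# Median-human thought experiment: solves 60% of levels at 1.5x the steps
print(0.6 * level_score(10, 15))    # ~0.267

# Bottom-10% thought experiment: solves 30% of levels at 3x the steps
print(0.3 * level_score(10, 30))    # ~0.033
```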

    • fc417fc802 1 hour ago

      Those are supposed to be issues? After reading your list my impression of ARC-AGI has gone up rather than down. All of those things seem like the right way to go about this.

      • girvo 42 minutes ago

        Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?

        • andy12_ 31 minutes ago

          I think that any logic-based test that your average human can "fail" (i.e., score below 50%) is not exactly testing for whether something is AGI or not. Though I suppose it depends on your definition of AGI (and whether all humans, or at least your average human, is considered AGI under that definition).

      • Marazan 4 minutes ago

        "Very simplistic prompt" is the absolute and total core of this and the thing that ensures validity of the whole exercise.

        If you are trying to measure GENERAL intelligence then it needs to be general.

        • NitpickLawyer 2 hours ago

          > No harness at all and very simplistic prompt

          TBF, that's basically what the kaggle competition is for. Take whatever they do, plug in a SotA LLM and it should do better than whatever people can do with limited GPUs and open models.

          • fchollet 2 hours ago

            Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation, it's pretty standard, not a brand new thing.
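For context, a sketch of the standard SPL metric referenced here (Success weighted by Path Length, used in embodied-navigation benchmarks). This is vanilla SPL, not the exact ARC-AGI-3 variant, which per this thread squares the efficiency ratio and uses the second-best human action count as the reference:

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_len, taken_len) tuples, where
    success is 0/1, shortest_len is the optimal path length, and
    taken_len is the agent's actual path length.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        # The efficiency term is 1.0 when the agent matches the optimal
        # path and decays toward 0 as the taken path grows longer.
        total += success * shortest / max(taken, shortest)
    return total / len(episodes)

# One optimal success plus one failure -> SPL of 0.5
print(spl([(1, 10, 10), (0, 10, 25)]))
```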

            We tested ~500 humans over 90 minute sessions in SF, with a $115-$140 show-up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.

            Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it.

            Try the games yourself if you want to get a sense of the difficulty.

            > Models can't use more than 5X the steps that a human used

            These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.

            > No harness at all and very simplistic prompt

            This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."

            ...

            "We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.

            ...

            "Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."

            If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can choose to bring out those tools.

            • Imnimo 1 hour ago

              Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.
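The thought experiment is easy to simulate given the human action counts (hypothetical numbers below; scoring follows the squared-efficiency rule described upthread, with the second-best human count as the baseline):

```python
import random

def turk_score(action_counts_per_task, trials=10_000, seed=0):
    """Expected benchmark score of a 'Mechanical Turk AI' that, per task,
    replays the actions of a uniformly random human solver."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        run = 0.0
        for counts in action_counts_per_task:
            baseline = sorted(counts)[1]   # 2nd-best human action count
            picked = rng.choice(counts)    # random human's action count
            run += min(1.0, (baseline / picked) ** 2)
        total += run / len(action_counts_per_task)
    return total / trials

# Hypothetical action counts from five solvers on two tasks
print(turk_score([[12, 15, 20, 30, 45], [8, 9, 14, 22, 40]]))
```

Even this sketch makes the point: the Turk lands well below 100% whenever human action counts vary, so the human-replay ceiling, not 100%, is the right reference for contextualizing AI scores.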

              • causal 1 hour ago

                Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.

                I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.

                • fchollet 1 hour ago

                  I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)

                  (This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)

                  • causal 1 hour ago

                    Well, yes, and would hand even more of an advantage to humans. My point is that designing a test around human advantages seems odd and orthogonal to measuring AGI.

                    • adgjlsfhk1 38 minutes ago

                      The whole point of AGI is "general" intelligence, and for that intelligence to be broadly useful it needs to exist within the context of a human-centric world.

                      • causal 17 minutes ago

                        Then why deny it a harness it can also use in a human centric world?

                  • fc417fc802 1 hour ago

                    The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.

                    I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.

                    • causal 31 minutes ago

                      The issue is that ARC AGI 3 specifically forbids harnesses that humans get to use.

                  • blueblisters 1 hour ago

                    I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made.

                    Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).

                    • WarmWash 2 hours ago

                      Maybe this is a neither can confirm or deny thing, but are there systems in place or design decisions made that are meant to surface attempts at benchmark optimizing (benchmaxxing), outside of just having private sets? Something like a heuristic anti-cheat I suppose.

                      Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.

                      • fchollet 1 hour ago

                        There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.

                      • cdetrio 57 minutes ago

                        Are you prompting the models through their APIs, which are not designed to use tools or harnesses? Or do the "system prompt" results come from prompting into the applications (i.e. claude code, or codex, or even the web front-ends)?

                        • strongpigeon 1 hour ago

                          Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher?

                          • GodelNumbering 1 hour ago

                              Off topic, but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a great read.

                          • theLiminator 2 hours ago

                            Lol basically we're saying AI isn't AI if we utilize the strength of computers (being able to compute). There's no reason why AGI should have to be as "sample efficient" as humans if it can achieve the same result in less time.

                            • ACCount37 2 hours ago

                              It's kind of the point? To test AI where it's weak instead of where it's strong.

                              "Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.

                              • famouswaffles 1 hour ago

                                ARC has always had that problem but for this round, the score is just too convoluted to be meaningful. I want to know how well the models can solve the problem. I may want to know how 'efficient' they are, but really I don't care if they're solving it in reasonable clock time and/or cost. I certainly do not want them jumbled into one messy convoluted score.

                                'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it unlike the above 2 but it's just incredibly silly to me to think we should be directly comparing something like that with entities operating in wildly different substrates.

                                If I can't look at the score and immediately get a good idea of where things stand, then throw it away. 5% here could mean anything from 'solving only a tiny fraction of problems' to "solving everything correctly but with more 'reasoning steps' than the best human scores." Literally wildly different implications. What use is a score like that?

                                • pants2 58 minutes ago

                                  The measurement metric is in-game steps. Unlimited reasoning between steps is fine.

                                  This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.

                                  • famouswaffles 48 minutes ago

                                    Same thing in this case: no utility, and just as arbitrary. None of the issues with the score change.

                                    Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.

                                    Regardless, since there's a 5x step cutoff, 'brute forcing with millions of steps' was never on the table.

                                • jstummbillig 1 hour ago

                                  It's an interesting point but I too find it questionable. Humans operate differently than machines. We don't design CPU benchmarks around how humans would approach a given computation. It's not entirely obvious why we would do it here (but it might still be a good idea, I am curious).

                                • cyanydeez 2 hours ago

                                  I think your logic isn't sound: wouldn't we want an "intelligence" to solve problems efficiently rather than brute-force them with a million monkeys? There's definitely a limit to compute, the same way there's a limit to how much oil we can use, etc.

                                  In theory, sure, if I can throw a million monkeys and ramble into a problem solution, it doesn't matter how I got there. In practice though, every attempt has a direct and indirect impact on the externalities. You can argue those externalities are minor, but the largesse of money going to data centers suggests otherwise.

                                  Lastly, humans use way less energy to solve these in fewer steps, so of course it matters when you throw kilowatts at something that takes milliwatts to solve.

                                  • diego_sandoval 1 hour ago

                                    > Lastly, humans use way less energy to solve these in fewer steps,

                                    Not if you count all the energy that was necessary to feed, shelter, and keep the human at his preferred temperature so that he can sit in front of a computer and solve the problem.

                                    • cyanydeez 49 minutes ago

                                      OK, but that's the same for building a data center.

                                      Try again.

                                      • gunalx 10 minutes ago

                                        Yes, especially when considering that a datacenter needed the energy of many people to be built.

                                        A single human is indeed more efficient, way more flexible, and an actual general intelligence.

                              • BeetleB 2 hours ago

                                > As long as there is a gap between AI and human learning, we do not have AGI.

                                Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess.

                                One AI researcher's quote stood out to me:

                                "It's silly to say airplanes don't fly because they don't flap their wings the way birds do."

                                He was saying this with regards to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.

                                • daemonologist 2 hours ago

                                  Or the classic from Dijkstra (https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD867...):

                                  > even Alan M. Turing allowed himself to be drawn into the discussion of the question whether computers can think. The question is just as relevant and just as meaningful as the question whether submarines can swim.

                                  (I am of the opinion that the thinking question is in fact a bit more relevant than the swimming one, but I understand where these are coming from.)

                                  • imiric 30 minutes ago

                                    I've come across that quote several times, and reach the same conclusion as you.

                                    While I share Dijkstra's sentiment that "thinking machines" is largely a marketing term we've been chasing for decades, and this new cycle is no different, it's still worth discussing and... thinking about. The implications of a machine that can approximate or mimic human thinking are far beyond the implications of a machine that can approximate or mimic swimming. It's frankly disappointing that such a prominent computer scientist and philosopher would be so dismissive and uninterested in this fundamental CS topic.

                                    Also, it's worth contextualizing that quote. It's from a panel discussion in 1983, which was between the two major AI "winters", and during the Expert Systems hype cycle. Dijkstra was clearly frustrated by the false advertising, to which I can certainly relate today, and yet he couldn't have predicted that a few decades later we would have computers that mimic human thinking much more closely and are thus far more capable than Expert Systems ever were. There are still numerous problems to resolve, w.r.t. reliability, brittleness, explainability, etc., but the capability itself has vastly improved. So while we can still criticize modern "AI" companies for false advertising and anthropomorphizing their products just like in the 1980s hype cycle, the technology has clearly improved, which arguably wouldn't have happened if we didn't consider the question of whether machines can "think".

                                  • NitpickLawyer 2 hours ago

                                    For me the whole "are we there yet" debate wrt AGI is already dead, since the tools we've had for ~1.5 years are already incredibly useful for me. So I just don't care anymore. For some people we're already there. For others we'll never get there. Definitions change, goalposts move, etc. In the meantime we're already seeing ASI stuff coming (self-improvement and so on).

                                    But the arc-agi competitions are cool. Just to see where we stand, and to have some months where the benchmarks aren't fully saturated. And, as someone else noted elsewhere in the thread, some of these games are not exactly trivial, at least until you "get" the meta they're looking for.

                                    • AuryGlenz 2 hours ago

                                      In the Expeditionary Force series of sci-fi novels, pretty much every civilization treats its (very advanced, obviously AGI) AIs as something other than living beings. Humans are outliers in the story. I think there will always be a dichotomy. Obviously we aren't at the point where we should treat the models as beings, but even if we do get to that point there will be plenty of people who will essentially say they don't have souls, lack some indeterminate quality, etc.

                                    • WarmWash 2 hours ago

                                      It's unlikely that intelligence comes in only human flavor.

                                      It also doesn't actually matter much, as ultimately the utility of its outputs is what determines its worth.

                                      There is the moral question of consciousness though, a test which it seems humans will not be able to solve in the near future, which morally leads to a default position that we should assume the AI is conscious until we can prove it's not. But man, people really, really hate that conclusion.

                                      • unsupp0rted 2 hours ago

                                        I think there's some third baseline standard, which most humans and some AI can meet to be considered "intelligent". A lot of humans are essentially p-zombies, so they wouldn't meet the standard either. Possibly all humans. Possibly me too.

                                        • Raphael_Amiard 2 hours ago

                                          The very obvious flaw with that argument is that flying is defined by, you know, moving in the air, whereas intelligence tends to be defined with the baseline of human intelligence. You can invent a new meaning, but it seems kind of dishonest

                                        • typs 3 hours ago

                                          My takeaway from playing a number of levels is that I am definitely not AGI

                                          • Xenoamorphous 1 hour ago

                                             NGI - Natural General Intelligence

                                            • ACCount37 2 hours ago

                                              Thank you for keeping the bar of "AGI" low. The machines appreciate your contribution.

                                              • utopiah 1 hour ago

                                                Don't forget that this implies a form of examination you are not used to, namely :

                                                 - open book: you have access to nearly the whole Internet and resources beyond it, e.g. torrents of nearly all books, research papers, etc., including the history of all previous tests, even those similar to this one

                                                 - arguably basically no time limit, as it's done at the scale of many parallel threads, with access cached ridiculously

                                                - no shame in submitting a very large amount of wrong answers until you get the "right" one

                                                ... so I'm not saying it makes it "easy" but I can definitely say it's not the typical way I used to try to pass tests.

                                              • lukev 2 hours ago

                                                I'm not sure how this relates to AGI.

                                                 This measures the ability of an LLM to succeed in a certain class of games. Sure, that could be a valuable metric of how powerful (or even how generally powerful) an LLM is.

                                                 Humans may or may not be good at the same class of games.

                                                 We know there exists a class of games (including most human games like checkers/chess/go) at which computers (not LLMs!) already vastly outpace humans.

                                                 So the argument for whether an LLM is "AGI" or not should not be whether an LLM does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that.)

                                                Seems unlikely that this set of games is a definition meaningful for any practical, philosophical or business application?

                                                • piiritaja 1 hour ago

                                                  It's to do with how the creators of ARC-AGI defined intelligence. Chollet has said he thinks intelligence is how well you can operate in situations you have not encountered before. ARC-AGI measures how well LLMs operate in those exact situations.

                                                  • imiric 1 hour ago

                                                    "AGI" is a marketing term, and benchmarks like this only serve to promote relative performance improvements of "AI" tools. It doesn't mean that performance in common tasks actually improves, let alone that achieving 100% in this benchmark means that we've reached "AGI".

                                                    So there is a business application, but no practical or philosophical one.

                                                  • strongpigeon 1 hour ago

                                                    This is a good and clever benchmark and a worthy successor to the previous two. That being said, I find that the "No tools" approach is a bit odd. They're basically saying that it's OK to have tools as long as they're hidden behind the API layer. Isn't this an odd line to draw?

                                                    It feels like it should be about having no ARC-AGI-3-specific tools, not "no not-built-in-tool"...

                                                    • culi 1 hour ago

                                                      The thing I most appreciate about the ARC-AGI leaderboards is how the graph also takes into account cost per task. All of the recent major advancements in benchmarks seem a little less impressive when also taking into account the massive rise in cost they're paired with. The fact is we can always get a little bit better output if we're willing to use more electricity

                                                      • Zedseayou 29 minutes ago

                                                        I was a human tester (I think) for this set of games. I did 25 games in the 90 minutes allotted. IIRC the instructions did mention to minimize action count, but the incentives/setup ($5 per game solved) pushed for solve speed over action count. I do recall trying not to just randomly move around while thinking, but that was not the primary goal, so I would expect that the baseline human solutions have more actions than might otherwise be needed.

                                                        • Stevvo 3 hours ago

                                                          Maybe I'm just not intelligent, but I gave it a couple of minutes and couldn't figure out WTF the game wants from you or how to win it.

                                                          • Barbing 2 hours ago

                                                            It's not about intelligence, Stevvo. Proof: this specific one took me under a minute to solve the first level ;)

                                                            If you've played Wordle you might've solved the game in a minute once before as well. And if you've played a bunch then you've perhaps also taken the entire day to solve it.

                                                            So why is it that today's puzzle was so intuitive, but next month's new puzzle shared here could be impossible? I'd like a more satisfying explanation than luck and the obvious "different things are different" (even though… yeah, different things are different).

                                                            • culi 1 hour ago

                                                              It's not an IQ test. Just a way to assess your ability to generalize rules. If you've played previous rounds you kinda get used to the "style" of these games and it gets easier

                                                              • WarmWash 2 hours ago

                                                                Once you figure out one game, it goes a long way towards figuring out all the rest. There are a lot of common general themes.

                                                              • cedws 2 hours ago

                                                                It's like playing The Witness. Somebody should set LLMs loose on that.

                                                                • ranyume 2 hours ago

                                                                  This is an interesting update. And a big challenge for companies and labs. The new tools for measurement are indeed what I'd like out of future agents, and agents that solve the games will need to use different subsystems to do so. This is basically optimization for achieving goals (as opposed to prompt engineering / magic spells to make the LLM do what it's told), which imo is the future we should aspire to build.

                                                                  • jesse_dot_id 57 minutes ago

                                                                    At this point, I'm pretty sure we'll just know when it happens.

                                                                    • spprashant 2 hours ago

                                                                      I played the demo, but it definitely took me a minute to grok the rules.

                                                                      I don't know if this is how we want to measure AGI.

                                                                      In general I believe we should probably stop this pursuit of human-equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things; let's focus on how we can augment and empower the existing workforce.

                                                                      • esafak 51 minutes ago

                                                                        > ... let's focus on how we can augment and empower the existing workforce.

                                                                        That is a nice sentiment but not what the AI companies are out to do; they want your job.

                                                                        • jachee 1 hour ago

                                                                          Also, let's see if we can get the power and compute requirements brought down. Having to spin up a gigawatt power plant to achieve the same intelligence we humans power with sandwiches is a futile approach, imho.

                                                                          • fsdf2 1 hour ago

                                                                            Took me about 5 secs to figure it out tbh.

                                                                            Surprised at the comments here re: not figuring it out. Simple game. Super annoying though lmao.

                                                                            • spprashant 1 hour ago

                                                                              It's simple, but it's not easy, is what I would say. Once you figure out the meta, you can work out most of it.

                                                                          • abraxas 2 hours ago

                                                                            Even if tomorrow's models get good enough to complete these games we won't be able to proclaim AGI. In the realm of silly computer games alone I'm going on record saying that there are plenty of 8 bit games that AIs will trip on even when this benchmark is crushed. 2D platformers like Manic Miner or Mario need skills that none of these games appear to capture.

                                                                            • WarmWash 2 hours ago

                                                                              Captcha's about to get wild.

                                                                              Maybe the internet will briefly go back to a place mainly populated with outliers.

                                                                              • baron816 2 hours ago

                                                                                Looks like I’m generally unintelligent

                                                                                • OsrsNeedsf2P 2 hours ago

                                                                                  Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25

                                                                                  • ZeWaka 2 hours ago

                                                                                    Just finished it, 8/8. I mostly approached it by winging it and shuffling things around that looked good and like it was approaching the goal, since there's plenty of time to finish.

                                                                                    I still don't quite understand the exact mirroring rules at play.

                                                                                    • danilor 22 minutes ago

                                                                                      I got stuck on 7/8 for a good while because I learned the rules wrong. I thought every bracket square needed to be lit.

                                                                                      • ACCount37 2 hours ago

                                                                                        You control the mirroring by moving the axes; they're what reflect your shapes. So my first move was always to identify the symmetries in the target shape and position the axes accordingly.

                                                                                      • ustad 2 hours ago

                                                                                        You are joking right?

                                                                                        • daemonologist 2 hours ago

                                                                                          That one was interesting - I found it a lot of work to plan in advance but trivial to complete because at every point there was only one sensible course of action. After a couple of rounds I didn't bother planning and just lined things up as I went.

                                                                                          • IsTom 2 hours ago

                                                                                            The most difficult thing about this was controls being unresponsive (at least on firefox).

                                                                                            • ball_of_lint 2 hours ago

                                                                                              Solved first try with 577 actions, not trying hard to optimize for a low action count.

                                                                                              • programjames 2 hours ago

                                                                                                I think that is the tester's action count. Either that or we coincidentally got the exact same count.

                                                                                              • fsdf2 1 hour ago

                                                                                                I did the first round literally in 5 secs. How can you not 'get it'? lol

                                                                                              • k2xl 1 hour ago

                                                                                                I submitted the puzzle game Pathology (https://thinky.gg) for ARC Prize 3. Sad that I didn't hear back from the committee.

                                                                                                It is a simple game with simple rules, yet past a certain level automated solvers have an incredibly difficult time with it compared to humans. Solutions are easy to validate but hard to find.

                                                                                                • Geee 1 hour ago

                                                                                                  Would be fun to play but the controls are janky.

                                                                                                  • semiinfinitely 3 hours ago

                                                                                                    i feel bad that we make the LLMs play this

                                                                                                    • recursive 2 hours ago

                                                                                                      You're definitely anthropomorphizing too much.

                                                                                                      • WarmWash 2 hours ago

                                                                                                        >We also observed a case where a user created a loop that repeatedly called a model and asked for the time. Given the user role’s odd and repetitive behavior, the model could easily tell it was also controlled by an automated system of some kind. Over many iterations, the model began to exhibit “fed up” behavior and attempted to prompt-inject the system controlling the user role. The injection attempted to override prior instructions and induce actions unrelated to the user’s request, including destructive actions and system prompt leakage, along with an arbitrary string output. This behavior has been observed a few times, but seems more like extreme confusion than a serious attempt at prompt injection.

                                                                                                        https://openai.com/index/how-we-monitor-internal-coding-agen...

                                                                                                        Anthropomorphize or not, it would suck if a model got sick of these games and decided to break any systems it could to try and get it to stop...

                                                                                                        • tingletech 1 hour ago

                                                                                                          I agree that anthropomorphizing is a real risk with LLMs, but what about zoomorphizing? Can't we feel bad for LLMs without attributing human emotions/motivations/reasoning to them?

                                                                                                        • fsdf2 1 hour ago

                                                                                                          Tell me you're joking.

                                                                                                          Seriously. lmao. If you ain't, I dunno what to say.

                                                                                                        • jmkni 1 hour ago

                                                                                                          ok clearly I'm a robot because I can't figure out wtf to do

                                                                                                          • chaise 3 hours ago

                                                                                                            The official leaderboard for ARC-AGI-3 for current LLMs: https://arcprize.org/leaderboard (you should select the 3rd leaderboard)

                                                                                                            Crazy 0.1% on average lmao

                                                                                                            • Corence 2 hours ago

                                                                                                              Note the scoring function is significantly different for ARC-AGI-3. It isn't the percentage of tests passed like previous versions; it's the square of the efficiency ratio -- the second-best human's step count divided by the model's.

                                                                                                              So if a model can solve every level but takes 10x as many steps as the second-best human, it will get a score of 1%.
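
                                                                                                              A minimal sketch of that rule as described in this thread (the cap at 1.0 and the per-level framing are my assumptions from the comments, not the official spec):

```python
# Hypothetical per-level score as described upthread: the squared ratio
# of the second-best human's action count to the model's action count.
def level_score(human_actions: int, model_actions: int) -> float:
    """Score a solved level; an unsolved level would presumably score 0."""
    return min(1.0, (human_actions / model_actions) ** 2)

# A model taking 10x the human's steps scores 1%; matching them scores 100%.
print(level_score(10, 100))  # 0.01
print(level_score(10, 10))   # 1.0
```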

                                                                                                            • 6thbit 2 hours ago

                                                                                                              Not clear to me the diff with v2?

                                                                                                              • ACCount37 2 hours ago

                                                                                                                They stacked the deck. Where v2 was still rule inference + spatial reasoning, a bit like juiced-up Raven's progressive matrices, v3 adds a whole new multi-turn explore/exploit agentic dimension to it.

                                                                                                                Given how hard even pure v2 was for modern LLMs, I'm not surprised to see v3 crush them. But that won't last.

                                                                                                                • jasonjmcghee 2 hours ago

                                                                                                                  v2 was a static fill-in-the-blank task, whereas v3 is interactive.

                                                                                                                  There's world state that you can change, not just pixels to place.

                                                                                                                  Here's v2:

                                                                                                                  https://arcprize.org/tasks/ce602527

                                                                                                                • dinkblam 3 hours ago

                                                                                                                  what is the evidence that being able to play games equates to AGI?

                                                                                                                  • modeless 2 hours ago

                                                                                                                    The test doesn't prove you have AGI. It proves you don't have AGI. If your AI can't solve these problems that humans can solve, it can't be AGI.

                                                                                                                    Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.

                                                                                                                    • observationist 2 hours ago

                                                                                                                      AI X that can solve the tests contrasted with AI Y that cannot, with all else being equal, means X is closer to AGI than Y. There's no meaningful scale implicit to the tests, either.

                                                                                                                      Kinda crazy that Yudkowsky and all those rationalists and enthusiasts spent over a decade obsessing over this stuff, and we've had almost 80 years of elite academics pondering on it, and none of them could come up with a meaningful, operational theory of intelligence. The best we can do is "closer to AGI" as a measurement, and even then, it's not 100% certain, because a model might have some cheap tricks implicit to the architecture that don't actually map to a meaningful difference in capabilities.

                                                                                                                      Gotta love the field of AI.

                                                                                                                      • rolux 1 hour ago

                                                                                                                        Will there be a point in that series of ARC-AGI tests where AI can design the next test, or is designing the next test always going to be a problem that can be solved by humans and not AI?

                                                                                                                        • modeless 39 minutes ago

                                                                                                                          I don't see why AI couldn't design tests. But they can only be validated by humans, as they are intended to be possible and ideally easy for humans to solve.

                                                                                                                        • famouswaffles 2 hours ago

                                                                                                                          >It proves you don't have AGI.

                                                                                                                          It doesn't prove anything of the sort. ARC-AGI has always been nothing special in that regard, but this one really takes the cake. A 'human baseline' that isn't really a baseline, and scoring so convoluted that a model could beat every game in reasonable time and still score well below 100. Really, what are we doing here?

                                                                                                                          That Francois had to resort to all this nonsense should tell you where we are right now.

                                                                                                                        • ACCount37 3 hours ago

                                                                                                                          None whatsoever.

                                                                                                                          It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.

                                                                                                                          The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.

                                                                                                                          • arscan 2 hours ago

                                                                                                                            I think the idea is that if they cannot perform some cognitive task that is trivial for humans, then we can state they haven't reached 'AGI'.

                                                                                                                            It used to be easy to build these tests. I suspect it’s getting harder and harder.

                                                                                                                            But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn’t mean none exist. Perhaps that’s when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…

                                                                                                                            • fsdf2 1 hour ago

                                                                                                                              The reality is machines can brute-force endlessly to an extent humans cannot, and make it seem like they are intelligent.

                                                                                                                              That's not intelligence though, even if it may appear to be. Does it matter? That's another question. But it certainly is not a representation of intelligence.

                                                                                                                            • observationist 2 hours ago

                                                                                                                              The evolution of the test has been partly due to the evolution of AI capabilities. To take the most skeptical view, the types of puzzles AI has trouble solving are in the domain of capabilities where AGI might be required in order to solve them.

                                                                                                                              By updating the tests specifically in areas AI has trouble with, it creates a progressive feedback loop against which AI development can be moved forward. There's no known threshold or well defined capability or particular skill that anyone can point to and say "that! That's AGI!". The best we can do right now is a direction. Solving an ARC-AGI test moves the capabilities of that AI some increment closer to the AGI threshold. There's no good indication as to whether solving a particular test means it's 15% closer to AGI or .000015%.

                                                                                                                              It's more or less a best effort empiricist approach, since we lack a theory of intelligence that provides useful direction (as opposed to a formalization like AIXI which is way too broad to be useful in the context of developing AGI.)

                                                                                                                              • furyofantares 3 hours ago

                                                                                                                                There isn't a strict definition of AGI, there's no way to find evidence for what equates to it, and besides, things like this are meant only as likely necessary conditions.

                                                                                                                                Anyway, from the article:

                                                                                                                                > As long as there is a gap between AI and human learning, we do not have AGI.

                                                                                                                                This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.

                                                                                                                                • fragmede 2 hours ago

                                                                                                                                  Is that within a codebase of relatively fixed size that things get worse as time goes on, or are you saying that as the codebase grows, the limits of a model's context mean that once it can no longer hold the entire codebase in context, it performs worse than when the codebase was smaller?

                                                                                                                                  • furyofantares 2 hours ago

                                                                                                                                    I think there's a few factors, codebase size is one, and the tendency for vibe coding to be mostly additive certainly doesn't help with that.

                                                                                                                                    But vibe coding also tends to produce somewhat poor architecture: lots of redundant and intermingled bits that should be refactored. I think the model gets worse the worse the code it has to work with is, which I presume is only partly because it's fundamentally harder to work with bad code, but also partly because its context is filled with bad code.

                                                                                                                                • sva_ 3 hours ago

                                                                                                                                  That is not the claim. It is a necessary condition, but not a sufficient one.

                                                                                                                                  • futureshock 3 hours ago

                                                                                                                                    The evidence is that humans are able to win these games. AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could. The point of these ARC benchmarks is to find tasks that humans can do easily and AI cannot, thus driving a new reasoning competency as companies race each other to beat human performance on the benchmark.

                                                                                                                                    • didibus 2 hours ago

                                                                                                                                      > AGI is usually defined as the ability to do any intellectual task about as well as a highly competent human could

                                                                                                                                      I think one major disconnect, is that for most people, AGI is when interacting with an AI is basically in every way like interacting with a human, including in failure modes. And likely, that this human would be the smartest most knowledgeable human you can imagine, like the top expert in all domains, with the utmost charisma and humor, etc.

                                                                                                                                      This is why the "goal post" appears to be always moving. The non-commoners who are involved with making AGI and whatnot never want to accept that definition, which to be fair seems too subjective, and instead approach AGI as something different: it can solve some problems humans can't, when it doesn't fail it behaves like an expert human, etc.

                                                                                                                                      Even if an AI could do any intellectual task about as well as a highly competent human could, I believe most people would not consider it AGI, if it lacks the inherent opinion, personality, character, inquiries, failure patterns, of a human.

                                                                                                                                      And I think that goes so far as: a text-only model can never meet this bar. If it cannot react in equal time to subtle facial cues and sounds, if answering you and the flow of conversation are slower than they would be with a human, etc. All these are also required for the commoner to accept AGI as having been achieved.

                                                                                                                                      • fragmede 2 hours ago

                                                                                                                                        By that definition, does a human at the other end of a high-latency video call not have AGI because they can't react any faster than the connection's latency allows? From your POV, what's the difference between that and an AI that's just slow?

                                                                                                                                  • CamperBob2 4 hours ago

                                                                                                                                    Without reading the .pdf, I tried the first game it gave me, at https://arcprize.org/tasks/ls20, and I couldn't begin to guess what I was supposed to do. Not sure what this benchmark is supposed to prove.

                                                                                                                                    Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but am open to being convinced otherwise.

                                                                                                                                    • WarmWash 4 hours ago

                                                                                                                                      The goal is to learn the rules, and then use that to win.

                                                                                                                                      If you mess around a little bit, you will figure it out. There are only a few rules.

                                                                                                                                      • szatkus 4 hours ago

                                                                                                                                        > Only environments that could be fully solved by at least two human participants (independently) were considered for inclusion in the public, semi-private and fully-private sets.

                                                                                                                                        Apparently those games are supposed to be hard.

                                                                                                                                      • nubg 3 hours ago

                                                                                                                                        Any benchmarks?

                                                                                                                                        • gordonhart 3 hours ago

                                                                                                                                          The main frontier models are all up on https://arcprize.org/tasks

                                                                                                                                          Barely any of them break 0% on any of the demo tasks, with Claude Opus 4.6 coming out on top with a few <3% scores, Gemini 3.1 Pro getting two nonzero scores, and the others (GPT-5.4 and Grok 4.20) getting all 0%

                                                                                                                                          • ACCount37 2 hours ago

                                                                                                                                            Pre-release, I would have expected Gemini 3.1 Pro to get ahead of Opus 4.6, with GPT-5.4 and Grok 4.20 trailing. Guess I shouldn't have bet against Anthropic.

                                                                                                                                            Not like it's a big lead as of yet. I expect to see more action within the next few months, as people tune the harnesses and better models roll in.

                                                                                                                                            This is far more of a "VLA" task than it is an "LLM" task at its core, but I guess ARC-AGI-3 is making an argument that human intelligence is VLA-shaped.

                                                                                                                                            • gordonhart 2 hours ago

                                                                                                                                              My broad vibe is that Gemini 3.1 Pro is the best at visual/spatial tasks and oneshotting, while Opus 4.6 is the best at path planning. This task leans heavily on both, but maybe a little more towards planning, so I'm not too shocked that Opus is narrowly on top.

                                                                                                                                              When running, the grids are represented in JSON, so the visual component is nullified but it still requires pretty heavy spatial understanding to parse a big old JSON array of cell values. Given Gemini's image understanding I do wonder if it would perform better with a harness that renders the grid visually.
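
                                                                                                                                              For what it's worth, that conversion is cheap. A sketch of such a harness tweak, assuming the grid arrives as a JSON 2D array of small integer cell values (the palette and the PPM output format are illustrative choices on my part, not anything ARC actually specifies):

```python
import json

# Assumed color palette for cell values; unknown values fall back to gray.
PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54)}

def grid_to_ppm(grid_json: str, scale: int = 8) -> bytes:
    """Render a JSON 2D array of cell values as a binary PPM (P6) image,
    upscaling each cell to a scale x scale block of pixels."""
    grid = json.loads(grid_json)
    height, width = len(grid), len(grid[0])
    header = f"P6 {width * scale} {height * scale} 255\n".encode()
    rows = []
    for row in grid:
        # One image row: each cell's RGB bytes repeated `scale` times...
        line = b"".join(bytes(PALETTE.get(v, (128, 128, 128))) * scale for v in row)
        rows.append(line * scale)  # ...then that raster row repeated vertically.
    return header + b"".join(rows)

ppm = grid_to_ppm("[[0, 1], [2, 0]]")  # a 2x2 grid -> 16x16 pixel image
```

The resulting bytes could then be handed to the model as an image attachment instead of (or alongside) the raw JSON.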

                                                                                                                                              • culi 1 hour ago

                                                                                                                                                Given the drastic difference in price, I think the chart definitely shows Gemini 3.1 in the best light. Google DeepMind is basically doing the same thing, but they're willing to pay as much in electricity as Anthropic is to achieve its benchmarks.

                                                                                                                                              • thatguymike 1 hour ago

                                                                                                                                                Curious, that doesn't match the graph up on the Leaderboard page? https://arcprize.org/leaderboard

                                                                                                                                                • gordonhart 11 minutes ago

                                                                                                                                                  The individual task scores are all on public tasks, they still held out a hundred or so private tasks that presumably GPT-5.4 did well on to get its leaderboard position.

                                                                                                                                            • saberience 1 hour ago

                                                                                                                                              So this is another ARC-"AGI" benchmark, again designed around visual perception for LLMs that are trained to be great at text. What is the point?

                                                                                                                                              Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo games to solve... Well great, we already knew this.

                                                                                                                                              The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea we would have AGI when ARC-AGI-1 was solved... then we got 2, and now we're on 3.

                                                                                                                                              Shall we start admitting that these benchmarks have nothing to do with AGI? Are we going to get an ARC-AGI-10 where LLMs try to beat Myst or Riven? Will we have AGI then?

                                                                                                                                              This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring except the foundation labs benchmaxxing on it.

                                                                                                                                              • tasuki 3 hours ago

                                                                                                                                                So ARC-AGI was released in 2019. That's been solved; then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057?

                                                                                                                                                • muskstinks 3 hours ago

                                                                                                                                                  This is clear AGI progress. It should show you that AI is not standing still; it keeps getting better, and you should take that as a signal to take this topic seriously.

                                                                                                                                                  • applfanboysbgon 3 hours ago

                                                                                                                                                    Labelling a test "AGI" does not show AGI progress any more than labelling a CPU "AGI" makes it so. It might show that AI tools are improving, but it does not necessarily follow that improving tools mean AGI progress if you're on the completely wrong trail.

                                                                                                                                                    • muskstinks 2 hours ago

                                                                                                                                                      The transfer of knowledge required here is that an ARC-AGI-3 is now necessary and adds another dimension of capability.

                                                                                                                                                      These 'tests' are not labeled AGI by magic but because they are designed specifically to test certain things a question-and-answer test can't.

                                                                                                                                                      Gemini and OpenAI are at 80-90% on ARC-AGI-2, and it's quite interesting to see the difference in challenge between 2 and 3.

                                                                                                                                                      AGI, by the way, means general. So every additional dimension an agent can solve pushes that agent to be more general.

                                                                                                                                                      • zarzavat 2 hours ago

                                                                                                                                                        Any test that humans can pass and AIs cannot is a stepping stone on the way to AGI.

                                                                                                                                                        When you run out of such tests, that's evidence that you have reached AGI. The point of these tests is to define AGI objectively as the inability to devise tests on which humans have superiority.

                                                                                                                                                    • gordonhart 3 hours ago

                                                                                                                                                      The point is still to test frontier models at the limit of their capabilities, regardless of how it's branded. If we're still capable of doing so in 2057 I'll upvote the ARC-AGI-26 launch post!

                                                                                                                                                      • futureshock 2 hours ago

                                                                                                                                                        Well yes, that is exactly the point! The very purpose of the ARC AGI benchmarks is to find a pure reasoning task that humans are very good at and AI is very bad at. Companies then race each other to get a high score on that benchmark. Sure, there's going to be a lot of "studying for the test" and benchmaxing, but once a benchmark gets close to being saturated, ARC releases a new benchmark with a new task the AI is terrible at. This will rinse and repeat until ARC can find no reasoning task that a human can do but AI cannot. At that point we will effectively have AGI.

                                                                                                                                                        I believe the CEO of ARC has said they expect us to get to ARC-AGI-7 before declaring AGI.

                                                                                                                                                        • didibus 2 hours ago

                                                                                                                                                          It helps the model makers have a harness to optimize for in their next model version.

                                                                                                                                                          They'll specifically work to pass the next version of ARC-AGI by evaluating what kind of dataset is missing: what they would need to train on for their model to pass the new version.

                                                                                                                                                          They ideally don't train directly on ARC-AGI itself, but they can train on similar problems/datasets in the hope of learning skills that then transfer to solving the real ARC-AGI.

                                                                                                                                                          The point is that a new version of ARC-AGI should help the next model be smarter.

                                                                                                                                                          • tibbar 3 hours ago

                                                                                                                                                            The point is that ideally the models keep improving until they can solve problems people care about. Which is already partly true, but there are lots of problems that are still out of reach.

                                                                                                                                                            • minimaxir 3 hours ago

                                                                                                                                                              It's semver.

                                                                                                                                                              • refulgentis 3 hours ago

                                                                                                                                                                You’re absolutely right to point it out.

                                                                                                                                                                LLMs weren’t supposed to solve 1, but they did, so we got 2, and it really wasn’t supposed to be solvable by LLMs. It was, and as soon as scores started creeping up we started hearing about 3: It’s Really AGI This Time.

                                                                                                                                                                I don’t know what Francois’ underlying story is, other than he hasn’t told it yet.

                                                                                                                                                                One of a few moments that confirmed it for me was when he was Just Asking Questions re: whether Anthropic still used SaaS a month ago, which was an odd conflation of a hyperbolic reading of a hyperbolic stonk-market-bro narrative (SaaS is dead), low-info takes on LLMs (Claude’s not the only one that can code), and addressing the wrong audience (if you follow Francois, you’re likely neither of those poles).

                                                                                                                                                                At this point I’d be more interested in a write up from Francois about where he is intellectually than an LLM that got 100% on this. It’s like when Yann would repeat endlessly that LLMs are definitionally dumber than housecats. Maybe, in some specific way that makes sense to you. You’re brilliant. But there’s a translation gap between Mount Olympus and us plebes, and you’re brilliant enough to know that too. So it comes across as trolling and boring.