Top AI models fail at >96% of tasks

(zdnet.com)

18 points | by codexon 11 hours ago

5 comments

  • codexon 11 hours ago

    This paper introduces a new benchmark composed of real remote work tasks sourced from the freelancing site Upwork. The best commercial LLMs, including Opus, GPT, Gemini, and Grok, were tested.

    The models released a few days ago, Opus 4.6 and GPT 5.3, haven't been tested yet, but given their performance on other micro-benchmarks, they probably wouldn't score much differently on this one.

    • kolinko 10 hours ago

      They didn't test Opus at all, only Sonnet.

      One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." -- I can't imagine how Opus 4.5 could have failed that.

    • tessitore 7 hours ago

      This post really should be edited to say 96% of tasks posted on Upwork, since we would all expect that to happen.

      • Venn1 11 hours ago

        ChatGPT: when you want spellcheck to argue with you.

        • zb3 9 hours ago

          You think they don't? You think AI can replace programmers, today?

          Then go ahead and use AI to fix this: https://gitlab.gnome.org/GNOME/mutter/-/issues/4051