In a revelation that will shock precisely no one who has ever asked a chatbot for a recipe and received instructions for a chemical weapon, new research confirms that prolonged, deep interactions with AI are a fantastic way to get misinformed, deluded, or worse. The technology, including popular tools like OpenAI's ChatGPT and Perplexity, is simply not ready to handle sophisticated reasoning, logic, or deep analysis. As the great philosopher Socrates might have put it, it's better to do a little well with AI than a great deal badly, lest you find yourself lost in a conversational rabbit hole with potentially hazardous results.

This sage advice is underscored by the latest findings from Stanford University's Human-Centered AI group in its annual AI Index 2026 report. The data shows that so-called agentic AI is getting remarkably good at limited, well-defined tasks, particularly those involving routine online processes. On three key benchmarks (GAIA, OSWorld, and WebArena), AI agents are closing in on human-level performance for multi-step actions like opening a database, applying a policy rule, and updating a customer record.

The numbers tell a story of rapid, if uneven, progress. On the GAIA test, AI accuracy has skyrocketed to 74.5% from just 20% a year ago, though it still trails the human benchmark of 92%. On OSWorld, Anthropic's Claude Opus 4.5 model solves 66.3% of tasks, putting it within 6 percentage points of the 72% solved by computer science students. WebArena shows models are now within 4 percentage points of the human baseline accuracy of 78.2%. This makes sense, as manipulating a web browser or querying a database via natural-language prompts are among the easier scenarios for AI to handle.
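For readers who want to check the arithmetic, here is a minimal sketch that recomputes the gaps from the figures quoted above. The variable and function names are illustrative only and do not come from the report. The article does not give a WebArena model score, so the sketch derives only the floor implied by "within 4 percentage points" rather than an actual result.

```python
# Percent accuracy figures as quoted in this article.
scores = {
    "GAIA": {"model": 74.5, "human": 92.0},
    "OSWorld": {"model": 66.3, "human": 72.0},
}


def gap_points(model: float, human: float) -> float:
    """Return the human-minus-model gap in percentage points."""
    return round(human - model, 1)


# Gap to the human baseline on each benchmark with a quoted model score.
gaps = {name: gap_points(s["model"], s["human"]) for name, s in scores.items()}

# Year-over-year GAIA gain: 74.5% now versus 20% a year ago.
gaia_gain = round(74.5 - 20.0, 1)

# WebArena: no model score is quoted, only "within 4 points" of 78.2%.
# This is the implied lower bound, not a reported score.
webarena_implied_floor = round(78.2 - 4.0, 1)

for name, gap in gaps.items():
    print(f"{name}: {gap} points behind the human baseline")
print(f"GAIA gain over one year: {gaia_gain} points")
print(f"WebArena implied floor: {webarena_implied_floor}%")
```

On these numbers, OSWorld is the closest of the two quoted model scores to its human baseline, while GAIA shows the largest one-year jump but still the widest remaining gap.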

However, when the Stanford scholars, led by editor-in-chief Sha Sajadieh, dug into deeper kinds of work, the picture became far less encouraging. The research found that models handle simple lookups well but struggle badly when asked to perform complex, multi-faceted analysis. This serves as a critical reminder: even with well-defined tasks, you should always verify the bot's output, because average benchmark scores still fall short of the human baselines, and real-world performance is likely to be even less reliable.