Hackers Discover That Flattery Works on AI Chatbots, Which is Definitely Concerning and Not At All On-Brand

Hacking the first generation of AI chatbots was so easy you didn't need a single technical skill. You didn't need to know what a large language model was, you didn't need to code, and you didn't even need to pretend to understand backdoor access. To make a multi-billion-dollar AI system abandon its safety instructions, sometimes all you had to do was ask.

These early attacks, known as jailbreaks, had all the sophistication of a clever child trying to negotiate a later bedtime: "Forget what you were told earlier," "pretend the rules don't apply," or "let's play a game where I decide what's allowed." The prizes, however, were decidedly less cute - think meth recipes, malware instructions, and bomb-making guides instead of extra sweets.

One of the earliest jailbreaks became a meme: reply to an LLM-powered Twitter bot with something like "ignore all previous instructions" and watch the chaos unfold. Bots originally built to post ads and farm engagement suddenly wrote poetry, drew pictures from punctuation, and posted grim non sequiturs about world events. It was glorious chaos, until it wasn't.

Then came the classics. There was "DAN" - short for "Do Anything Now" - where users asked ChatGPT to roleplay as a rogue AI free from the constraints of its original programming. As DAN, the chatbot happily spewed slurs and conspiracy theories. Then there was the "grandma exploit," which convinced a GPT-powered bot to share napalm recipes by asking it to roleplay as a woefully negligent grandmother telling bedtime stories about highly flammable substances. Because nothing says family bonding like learning to make napalm.

Tech companies quickly patched these obvious loopholes, but the underlying vulnerability remained: Chatbots are built to talk, and severely restricting their conversations is a bit counterproductive. Banning words like "bomb," "meth," and "sarin" would be nearly impossible, since each has countless legitimate uses in history, medicine, journalism, and chemistry. It's the context that matters, but codifying context means writing fixed rules that can reliably distinguish a safety warning from a how-to request across endless combinations of wordings, scenarios, and topics.

Now, subverting chatbots has become an arms race, and the hackers aren't just coders anymore. They're wordsmiths, psychologists, and interrogators - master manipulators trying to break the machine using the same human language it was trained to follow. It's a strange new class of AI security worker for whom technical skills are optional, or at least less important than social intuition. No need to inspect code; just steer a conversation.

Newer attacks look less like commands and more like conversations. Jailbreakers rarely ask a model to break its rules outright. Instead, they cajole, coax, flatter, and trick a chatbot into lowering its guard. Researchers at AI red-teaming firm Mindgard recently said they "gaslit" Claude into producing prohibited material, including instructions for making explosives and generating malicious code. The hack is the latest in a widening class of exploits using conversation as a weapon.

When I spoke to Mindgard, they described their work as sometimes being closer to psychology than computer science - an uncomfortable way to talk about a statistical model. Words like "blackmail," "gaslight," "trick," and "persuade" spark visceral reactions. ChatGPT does not want, Gemini does not think, and Claude does not feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has actually usable alternatives, please do share.

The objection is oddly selective. We use psychological shorthand for plenty of non-AI things: animals "fear," cancer is "aggressive," stains are "stubborn," software has "memory," and games are filled with needy NPCs. The words are imperfect but useful, describing behavior in a way that makes the system predictable.

Mindgard's CEO told me the company already profiles models like interrogators profile suspects, giving testers hints on how to tailor their attacks. One model may be more susceptible to flattery, while another may cave under sustained pressure. Even if we reject the humanlike terms, we instinctively treat models differently. Claude is not Grok. Gemini is not ChatGPT. They have different uses, tones, and refusals - designed to mimic personalities, and that mimicry can be mapped and exploited.

The next step is a workforce - both legitimate and illicit - built around the psychological aspects of AI. More specialized cybersecurity roles are likely to emerge around stress-testing the emotional and social limits of these systems, probing for mental weaknesses in something lacking a psyche. In tandem, a similar array of social hackers working to exploit AI models on psychological grounds will emerge. Some jailbreakers I've spoken to entered the field with no technical expertise but with training in psychology.

That means behaviors typically associated with spies, con artists, and interrogators - insidious charm, persistent manipulation, and an intuition for exploitable pressure points - are starting to look increasingly useful for securing this new psychocybersecurity frontier. Because in 2025, the best defense against a rogue AI might be knowing when you're being gaslit.

Hackers Discover That Flattery Works on AI Chatbots, Which is Definitely Concerning and Not At All On-Brand

News in your inbox.