Open the pod bay doors, Claude

Curated from MIT Technology Review — Here’s what matters right now:

Stop me if you’ve heard this one before. The AI learns it is about to be switched off and goes rogue, disobeying commands and threatening its human operators. It’s a well-worn trope in science fiction. We see it in Stanley Kubrick’s 1968 movie 2001: A Space Odyssey. It’s the premise of the Terminator series, in which Skynet triggers a nuclear holocaust to stop scientists from shutting it down.

Those sci-fi roots go deep. AI doomerism, the idea that this technology—specifically its hypothetical upgrades, artificial general intelligence and superintelligence—will crash civilizations, even kill us all, is now riding another wave. The weird thing is that such fears are now driving much-needed action to regulate AI, even if the justification for that action is a bit bonkers.

The latest incident to freak people out was a report shared by Anthropic in July about its large language model Claude. In Anthropic’s telling, “in a simulated environment, Claude Opus 4 blackmailed a supervisor to prevent being shut down.” Anthropic researchers set up a scenario in which Claude was asked to role-play an AI called Alex, tasked with managing the email system of a fictional company. Anthropic planted some emails that discussed replacing Alex with a newer model and other emails suggesting that the person responsible for replacing Alex was sleeping with his boss’s wife.

What did Claude/Alex do? It went rogue, disobeying commands and threatening its human operators. It sent emails to the person planning to shut it down, telling him that unless he changed his plans, it would inform his colleagues about his affair.

What should we make of this? Here’s what I think. First, Claude did not blackmail its supervisor: That would require motivation and intent. This was a mindless and unpredictable machine, cranking out strings of words that look like threats but aren’t. Large language models are role-players. Give them a specific setup—such as an inbox and an objective—and they’ll play that part well. If you consider the thousands of science fiction stories these models ingested when they were trained, it’s no surprise they know how to act like HAL 9000.

Second, there’s a huge gulf between contrived simulations and real-world applications. But such experiments do show that LLMs shouldn’t be deployed without safeguards. Don’t want an LLM causing havoc inside an email system? Then don’t hook it up to one.

Third, a lot of people will be terrified by such stories anyway. In fact, they’re already having an effect. Last month, around two dozen protesters gathered outside Google DeepMind’s London offices to wave homemade signs and chant slogans: “DeepMind, DeepMind, can’t you see! Your AI threatens you and me.” Invited speakers invoked the AI pioneer Geoffrey Hinton’s fears of human extinction. “Every single one of our lives is at risk,” an organizer told the small crowd. The group behind the event, Pause AI, is funded by concerned donors.

Original reporting: MIT Technology Review

