MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what's coming next. You can read more from the series here.

It's been a big year for video generation. In the last nine months, OpenAI made Sora public, Google DeepMind launched Veo 3, and the video startup Runway launched Gen-4. All can produce video clips that are (almost) impossible to distinguish from actual filmed footage or CGI animation. This year also saw Netflix debut an AI visual effect in its show The Eternaut, the first time video generation has been used to make mass-market TV.

Sure, the clips you see in demo reels are cherry-picked to showcase a company's models at the top of their game. But with the technology in the hands of more users than ever before—Sora and Veo 3 are available in the ChatGPT and Gemini apps for paying subscribers—even the most casual filmmaker can now knock out something remarkable.

The downside is that creators are competing with AI slop, and social media feeds are filling up with faked news footage. Video generation also uses up a huge amount of energy, many times more than text or image generation.

With AI-generated videos everywhere, let's take a moment to talk about the tech that makes them work.

How do you generate a video?

Let's assume you're a casual user. There is now a range of high-end tools that allow pro video makers to insert video generation models into their workflows. But most people will use this technology in an app or via a website. You know the drill: "Hey, Gemini, make me a video of a unicorn eating spaghetti. Now make its horn take off like a rocket."

What you get back will be hit or miss, and you'll typically need to ask the model to take another pass or 10 before you get more or less what you wanted. So what's going on under the hood? Why is it hit or miss—and why does it take so much energy?

The latest wave of video generation models are what's known as latent diffusion transformers. Yes, that's quite a mouthful. Let's unpack each part in turn, starting with diffusion.

What's a diffusion model?

Imagine taking an image and adding a random spattering of pixels to it. Take that pixel-spattered image and spatter it again, and then again. Do that enough times and you will have turned the initial image into a random mess of pixels, like static on an old TV set.

A diffusion model is a neural network trained to reverse that process, turning random static into images. During training, it gets shown millions of images in various stages of pixelation. It learns how those images change each time new pixels are thrown at them and, thus, how to undo those changes.

The upshot is that when you ask a diffusion model to generate an image, it will start off with a random mess of pixels and, step by step, turn that mess into an image that is more or less similar to images in its training set. But you don't want just any image—you want the image described in your prompt.
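To make the diffusion idea above a little more concrete, here is a minimal toy sketch in Python of the two halves of the process: the forward step that keeps adding noise to an image until it is pure static, and the reverse loop a generator runs at sampling time. Everything specific here is an assumption made for illustration, not something from the article: the 8x8 "image", the linear noise schedule, and the denoise_step stand-in, which takes the place of the trained neural network a real diffusion model would use.

```python
# Toy sketch of diffusion: forward noising plus a reverse sampling loop.
# Assumptions: numpy only, an 8x8 toy "image", a linear noise schedule, and a
# placeholder denoiser instead of a real trained network.
import numpy as np

rng = np.random.default_rng(0)
T = 50  # number of noising / denoising steps

def add_noise(image, t):
    """Forward process: blend the image with random noise; at t = T it is pure static."""
    alpha = 1.0 - t / T  # fraction of the original image that survives at step t
    noise = rng.standard_normal(image.shape)
    return alpha * image + (1.0 - alpha) * noise

def denoise_step(noisy, t):
    """Placeholder for the trained network, which would predict the noise added at
    step t so it can be subtracted. Here we just nudge values toward their mean so
    the loop runs end to end; a real model's prediction comes from training."""
    predicted_noise = noisy - noisy.mean()
    return noisy - (1.0 / T) * predicted_noise

# Training intuition: the model sees images at every stage of pixelation.
clean = rng.random((8, 8))  # toy stand-in for a training image
noisy_versions = [add_noise(clean, t) for t in range(T + 1)]

# Generation: start from pure static and walk the steps in reverse.
sample = rng.standard_normal((8, 8))
for t in reversed(range(T)):
    sample = denoise_step(sample, t)

print("final sample stats:", round(float(sample.mean()), 3), round(float(sample.std()), 3))
```

In a real diffusion model, the placeholder above is a large network trained on millions of images at every noise level, and the reverse loop runs it once per step, which is part of why generation takes so many passes and so much compute.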