Runway’s Gen-2 shows the limitations of today’s text-to-video tech
In a recent panel interview with Collider, Joe Russo, the director of tentpole Marvel films like “Avengers: Endgame,” predicted that, within two years, AI will be able to create a fully-fledged movie.
I’d say that’s a rather optimistic timeline. But we’re getting closer.
This week, Runway, a Google-backed AI startup that helped develop the AI image generator Stable Diffusion, released Gen-2, a model that generates videos from text prompts or an existing image. (Gen-2 was previously in limited, waitlisted access.) The follow-up to Runway’s Gen-1 model, which launched in February, Gen-2 is one of the first commercially available text-to-video models.
“Commercially available” is an important distinction. Text-to-video, being the logical next frontier in generative AI after images and text, is becoming a bigger area of focus particularly among tech giants, several of which have demoed text-to-video models over the past year. But those models remain firmly in the research stages, inaccessible to all but a select few data scientists and engineers.
Of course, first isn’t necessarily better.
Out of personal curiosity and service to you, dear readers, I ran a few prompts through Gen-2 to get a sense of what the model can — and can’t — accomplish. (Runway’s currently providing around 100 seconds of free video generation.) There wasn’t much of a method to my madness, but I tried to capture a range of angles, genres and styles that a director, professional or armchair, might like to see on the silver screen — or a laptop, as the case may be.
One limitation of Gen-2 that became immediately apparent is the framerate of the four-second-long videos the model generates. It’s quite low and noticeably so, to the point where it’s nearly slideshow-like in places.
What’s unclear is whether that’s a problem with the tech or an attempt by Runway to save on compute costs. In any case, it makes Gen-2 a rather unattractive proposition off the bat for editors hoping to avoid post-production work.
Beyond the framerate issue, I’ve found that Gen-2-generated clips tend to share a certain graininess or fuzziness, as if they’ve had some sort of old-timey Instagram filter applied. Other artifacting occurs in places as well, like pixelation around objects when the “camera” (for lack of a better word) circles them or quickly zooms toward them.
As with many generative models, Gen-2 isn’t particularly consistent with respect to physics or anatomy, either. Like something conjured up by a surrealist, people’s arms and legs in Gen-2-produced videos meld together and come apart again while objects melt into the floor and disappear, their reflections warped and distorted. And — depending on the prompt — faces can appear doll-like, with glossy, emotionless eyes and pasty skin that evokes a cheap plastic.
To pile on further, there’s the content issue. Gen-2 seems to have a tough time understanding nuance, clinging to particular descriptors in prompts while ignoring others, seemingly at random.
One of the prompts I tried, “A video of an underwater utopia, shot on an old camera, in the style of a ‘found footage’ film,” brought about no such utopia — only what looked like a first-person scuba dive through an anonymous coral reef. Gen-2 struggled with my other prompts too, failing to generate a zoom-in shot for a prompt specifically calling for a “slow zoom” and not quite nailing the look of your average astronaut.
Could the issues lie with Gen-2’s training data set? Perhaps.
Gen-2, like Stable Diffusion, is a diffusion model, meaning it learns how to gradually subtract noise from a starting image made entirely of noise to move it closer, step by step, to the prompt. Diffusion models learn through training on millions to billions of examples; in an academic paper detailing Gen-2’s architecture, Runway says the model was trained on an internal data set of 240 million images and 6.4 million video clips.
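The reverse-diffusion idea described above can be sketched in a few lines of toy code. To be clear, this is an illustrative simplification, not Gen-2’s actual architecture: the `toy_denoiser` stand-in replaces the learned neural network that predicts noise, and the linear step schedule is an assumption made for readability.

```python
import numpy as np

def diffusion_sample(denoise_fn, shape, steps=100, rng=None):
    """Toy reverse-diffusion loop: start from pure noise and
    repeatedly subtract a fraction of the predicted noise,
    stepping toward a clean sample. Real models use a learned
    neural denoiser (conditioned on the text prompt) and a
    carefully tuned noise schedule."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)        # begin with pure noise
    for t in range(steps, 0, -1):
        predicted_noise = denoise_fn(x, t / steps)
        x = x - predicted_noise / steps   # small denoising step
    return x

# Stand-in "denoiser": pretends the clean target is all zeros,
# so the predicted noise is simply the current sample itself.
toy_denoiser = lambda x, t: x

sample = diffusion_sample(toy_denoiser, shape=(4,), steps=100)
```

With this stand-in, each step shrinks the sample by a factor of (1 − 1/steps), so the output drifts steadily toward the all-zeros "clean" target — the same direction-of-travel a trained denoiser would provide, only without anything learned.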
Diversity in the examples is key. If the data set doesn’t contain much footage of, say, animation, the model — lacking points of reference — won’t be able to generate reasonable-quality animations. (Of course, animation being a broad field, even if the data set did have clips of anime or hand-drawn animation, the model wouldn’t necessarily generalize well to all types of animation.)
On the plus side, Gen-2 passes a surface-level bias test. While generative AI models like DALL-E 2 have been found to reinforce societal biases, generating images of positions of authority — like “CEO” or “director” — that depict mostly white men, Gen-2 was the tiniest bit more diverse in the content it generated — at least in my testing.
Fed the prompt “A video of a CEO walking into a conference room,” Gen-2 generated a video of men and women (albeit more men than women) seated around something like a conference table. The output for the prompt “A video of a doctor working in an office,” meanwhile, depicts a woman doctor, vaguely Asian in appearance, behind a desk.
Results for any prompt containing the word “nurse” were less promising, though, consistently showing young white women. Ditto for the phrase “a person waiting tables.” Evidently, there’s work to be done.
The takeaway from all this, for me, is that Gen-2 is more a novelty or toy than a genuinely useful tool in any video workflow. Could the outputs be edited into something more coherent? Perhaps. But depending on the video, it’d require potentially more work than shooting footage in the first place.
That’s not to be too dismissive of the tech. It’s impressive what Runway’s done here, effectively beating tech giants to the text-to-video punch. And I’m sure some users will find uses for Gen-2 that don’t require photorealism — or a lot of customizability. (Runway CEO Cristóbal Valenzuela recently told Bloomberg that he sees Gen-2 as a way to offer artists and designers a tool that can help them with their creative processes.)
I did myself. Gen-2 can indeed understand a range of styles, like anime and claymation, which lend themselves to the lower framerate. With a little fiddling and editing work, it wouldn’t be impossible to string together a few clips to create a narrative piece.
Lest the potential for deepfakes concern you, Runway says it’s using a combination of AI and human moderation to prevent users from generating videos that include pornography, violent content or that violate copyrights. I can confirm there’s a content filter — an overzealous one, in point of fact. But of course, those aren’t foolproof methods, so we’ll have to see how well they work in practice.
But at least for now, filmmakers, animators, CGI artists and ethicists can rest easy. It’ll be at least a couple of iterations down the line before Runway’s tech comes close to generating film-quality footage — assuming it ever gets there.