Meta Introduces In-House Text-to-Video Generation Model

Meta Platforms last Thursday published research with details on the company’s new in-house artificial intelligence model that generates video footage from text input.

According to Meta’s research, the Make-A-Video model circumvents the shortage of paired text-video training data by combining existing text-to-image model frameworks with unsupervised learning on unpaired video data, Andrew Kim, research associate at ARK Invest, wrote in an October 3 newsletter. Combining the two approaches enabled Meta’s text-to-image model to learn realistic motion, allowing Make-A-Video to produce short videos without audio.

Meta’s website illustrates different ways for users to interact with Make-A-Video: generating variations on user-supplied videos, specifying video styles (“surreal/realistic/stylized”), and inferring motion from users’ static images or image pairs, Kim said.

“While the commercialization of synthetic video is at an early stage, the pace of AI research has advanced significantly this year,” Kim wrote. “Make-A-Video seems to have many use cases, from digital video art to the facilitation of digital ad synthesis. That said, Meta has yet to release Make-A-Video to the public who will battle test it.”

This year’s releases of DALL·E 2, Midjourney, and Stable Diffusion represent meaningful progress in text-to-image modeling. However, research on text-to-video generation has lagged, perhaps because large-scale datasets pairing video with descriptive text are scarce, Kim said.

OpenAI’s DALL·E 2, the AI system that creates realistic images and art from a description in natural language, wowed the industry this summer with its ability to generate creative images from text prompts. In the last few months, DALL·E 2 has passed some important milestones: commercial availability, pricing tiers, and more restrictive content moderation.

William Summerlin, ARK Invest analyst, wrote in a September newsletter that, in ARK’s view, Stability AI has debuted the most provocative model, Stable Diffusion. Its image-generation model seems superior to DALL·E 2 in certain domains, particularly face generation, Summerlin said.

Investors looking to gain exposure to the lucrative AI industry should consider the ARK Autonomous Technology & Robotics ETF (ARKQ). Companies within ARKQ are focused on and are expected to substantially benefit from the development of new products or services, technological improvements, and advancements in scientific research related to, among other things, energy, automation and manufacturing, materials, artificial intelligence, and transportation.

For more news, information, and strategy, visit the Disruptive Technology Channel.
