This year has seen AI image generators go from abstract to photorealistic. In fact, AI-generated content has gone from barely usable to one of the most common forms of content on social media. Individuals and big brands alike have quickly adopted the technology across entertainment and advertising. The same can’t be said of generative video – yet. With its latest generative AI video model, Stability AI aims to change that. Having built a stable (pun intended) foundation in AI art with Stable Diffusion, the diffusion models research firm now sets its sights on text-to-video and image-to-video models with Stable Video Diffusion.
How to use Stable Video Diffusion – Text-to-video models SVD & SVD-XT
On November 21st, Stability AI announced Stable Video Diffusion, its “first foundation model for generative video based on the image model Stable Diffusion.”
Already showing results that compete with rival AI video generators Runway and Pika Labs, “this state-of-the-art generative AI video model represents a significant step” for generative artificial intelligence. The AI model research firm proudly states that its diverse open-source portfolio, spanning “across modalities including image, language, audio, 3D, and code… is a testament to Stability AI’s dedication to amplifying human intelligence.”
Leading closed text-to-video platforms such as Runway and Pika Labs have had a head start of several months, but the two new AI models from Stability AI are “capable of generating 14 and 25 frames” per rendered file, “at customizable frame rates between 3 and 30 frames per second. At the time of release in their foundational form, through external evaluation, we have found these models surpass the leading closed models in user preference studies.”
The main difference between the SVD and SVD-XT models is clip length: SVD generates 14-frame clips, while SVD-XT extends this to 25 frames. The longer generations of SVD-XT come at a greater computational cost. A sketch of how this plays out in practice follows below.
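For readers who want a concrete picture, here is a minimal sketch of generating a clip with the Hugging Face diffusers library. The model repo IDs and parameter values are assumptions for illustration, not details confirmed by Stability AI’s announcement.

# Minimal sketch using diffusers' StableVideoDiffusionPipeline.
# Repo IDs and parameter values below are assumptions, not from the article.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# The base SVD checkpoint targets 14-frame clips; the "-xt" checkpoint targets 25 frames.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # assumed repo id
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# SVD is conditioned on a single input image that drives the generated clip.
image = load_image("input.png").resize((1024, 576))

frames = pipe(
    image,
    num_frames=25,        # 14 for the base SVD checkpoint
    decode_chunk_size=8,  # lower this to reduce VRAM use
).frames[0]

# Export at a frame rate within the 3-30 fps range described in the announcement.
export_to_video(frames, "generated.mp4", fps=7)

The choice of num_frames is where the SVD/SVD-XT trade-off shows up: requesting the longer 25-frame clip roughly doubles the work the model has to do per generation.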
Video – The final frontier
Each mode of digital media (text, audio, image, and video) comes with a unique set of challenges to achieve the level of fidelity required for real-world commercial applications. Video is predictably the final frontier of the four: it poses the greatest number of challenges and, as a result, will be the last form to be perfected.
Researchers developing this model explored three stages of training for its video LDM (Latent Diffusion Model) architecture: “text-to-image pretraining, video pretraining, and high-quality video fine-tuning.”
Further technical details can be found in the official research paper.
Where to use Stable Video Diffusion
Stable Video Diffusion is currently in research preview, so there is no hosted tool to try just yet. You can, however, sign up for the waitlist for the “new upcoming web experience”.
Will the Stability AI video generator be open source?
Yes! The new AI video generator will be open-source. In fact, the code is already available to clone from GitHub, and those who wish to run the model locally can find the model weights on Hugging Face.
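As a rough illustration, the weights can be fetched from Hugging Face programmatically with the huggingface_hub library. The repo ID below is an assumption; check Stability AI’s Hugging Face page for the exact model names.

# Minimal sketch for downloading the model weights with huggingface_hub.
# The repo id is an assumption, not confirmed by the article.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="stabilityai/stable-video-diffusion-img2vid-xt",
    allow_patterns=["*.safetensors", "*.json"],  # skip files you don't need
)
print("Weights downloaded to:", local_dir)

Once the weights are on disk, they can be loaded by the official GitHub code or by a compatible pipeline such as the diffusers example shown earlier.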