
Imagen: Text-to-Image Diffusion Models
With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
Imagen Video
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models.
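To make the cascade structure concrete, the following is a minimal sketch that tracks only the tensor shape (frames, height, width) as a video passes from a base model through interleaved temporal and spatial super-resolution stages. No diffusion model is run. The names `VideoShape`, `temporal_sr`, `spatial_sr`, and `run_cascade`, as well as the base resolution and every upsampling factor, are illustrative assumptions and are not taken from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class VideoShape:
    """Shape of a video clip; stands in for the actual pixel tensor."""
    frames: int
    height: int
    width: int


def temporal_sr(shape: VideoShape, factor: int) -> VideoShape:
    """Stand-in for a temporal super-resolution stage: more frames, same size."""
    return VideoShape(shape.frames * factor, shape.height, shape.width)


def spatial_sr(shape: VideoShape, factor: int) -> VideoShape:
    """Stand-in for a spatial super-resolution stage: same frames, larger size."""
    return VideoShape(shape.frames, shape.height * factor, shape.width * factor)


Stage = Tuple[Callable[[VideoShape, int], VideoShape], int]


def run_cascade(base: VideoShape, stages: List[Stage]) -> List[VideoShape]:
    """Apply each stage to the previous stage's output and record every shape."""
    trace = [base]
    for stage, factor in stages:
        trace.append(stage(trace[-1], factor))
    return trace


# Hypothetical configuration: the numbers are chosen for illustration only.
BASE = VideoShape(frames=8, height=32, width=32)
STAGES: List[Stage] = [
    (temporal_sr, 2),
    (spatial_sr, 2),
    (temporal_sr, 2),
    (spatial_sr, 4),
]

if __name__ == "__main__":
    for step, shape in enumerate(run_cascade(BASE, STAGES)):
        print(step, shape)
```

In this sketch each stage only sees the output of the stage before it, which is the property that lets a cascade keep the base model small and defer resolution to later, specialized models.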
Imagen Editor & EditBench
We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits stay faithful to the text prompts because object detectors propose the inpainting masks during training.
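The idea of detector-proposed masks can be sketched as follows: a bounding box, as an object detector might return it, is turned into a binary inpainting mask, and the masked pixels are hidden from the model. This is only an illustration under stated assumptions. No real detector is called, the box is hard-coded, and the box format and the helper names `box_to_mask` and `apply_mask` are hypothetical rather than part of Imagen Editor.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]
Grid = List[List[int]]


def box_to_mask(box: Box, height: int, width: int) -> Grid:
    """Return a height x width grid with 1 inside the box and 0 elsewhere."""
    x0, y0, x1, y1 = box
    # Clamp the box to the image bounds so out-of-range boxes stay valid.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
        for y in range(height)
    ]


def apply_mask(image: Grid, mask: Grid, fill: int = 0) -> Grid:
    """Replace masked pixels with `fill`, leaving the rest of the image intact."""
    return [
        [fill if m else pixel for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    # A 3x4 single-channel toy image and a hard-coded "detected" box.
    image = [[9] * 4 for _ in range(3)]
    detected_box: Box = (1, 1, 3, 2)
    mask = box_to_mask(detected_box, height=3, width=4)
    for row in apply_mask(image, mask):
        print(row)
```

Masking a whole detected object, rather than a random patch, forces the model to rely on the text prompt to fill the region, which is the intuition behind the faithfulness claim above.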
cascade of video diffusion models. By extending the text-to-image diffusion models of Imagen (Saharia et al., 2022b) to the time domain, and training jointly on video and images, we obtained a model capable of generating high-fidelity videos with good temporal consistency while maintaining the strong features of the original image system, such ...