Meet MultiDiffusion: A Unified AI Framework That Enables Versatile And Controllable Image Generation Using A Pre-Trained Text-to-Image Diffusion Model
Diffusion models are now considered the state of the art in text-to-image generation; they have emerged as a “disruptive technology” with previously unseen ability to create high-quality, diverse pictures from text prompts. Yet even though this advance has significant potential to transform how digital content is created, giving users intuitive control over the generated material remains a challenge for text-to-image models.
Presently, there are two ways to control diffusion models: (i) train a model from scratch, or fine-tune an existing diffusion model, for the task at hand, or (ii) reuse a pre-trained model as-is and add controlled generation capabilities on top of it. The first strategy frequently requires considerable computation and a lengthy development period, even in the fine-tuning scenario, because of the ever-increasing scale of models and training data. The second strategy has so far produced specialized methods tailored to particular tasks. This study instead introduces MultiDiffusion, a new, unified framework that vastly improves the adaptability of a pre-trained (reference) diffusion model to controlled image generation.
The fundamental goal of MultiDiffusion is to define a new generation process composed of several reference diffusion generation processes bound together by a shared set of parameters or constraints. More specifically, the reference diffusion model is applied to different regions of the target image, predicting a denoising sampling step for each. MultiDiffusion then performs a global denoising sampling step that reconciles all of these separate steps via a least-squares optimal solution. Consider, for instance, the challenge of generating an image with an arbitrary aspect ratio using a reference diffusion model trained on square images (see Figure 2 below).
At each phase of the denoising process, MultiDiffusion merges the denoising directions that the reference model provides for all the square crops. It tries to follow all of them as closely as possible, under the constraint that neighboring crops share common pixels. Although each crop may pull the denoising in a different direction, the framework fuses them into a single global denoising step, producing high-quality, seamless images while encouraging each crop to remain a plausible sample of the reference model.
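To make the fusion step concrete, here is a minimal Python sketch (not the authors' released code) of one MultiDiffusion denoising step on a wide latent, assuming a reference model that only denoises square crops; `denoise_step` is a toy placeholder for that model. Under per-pixel agreement constraints, the least-squares reconciliation reduces to averaging the overlapping crop predictions pixel by pixel.

```python
import numpy as np

def denoise_step(crop: np.ndarray, t: int) -> np.ndarray:
    """Toy placeholder for the reference model's denoising update on one square crop."""
    return crop * 0.98  # a real model would predict the next (less noisy) latent here

def multidiffusion_step(latent: np.ndarray, t: int,
                        crop_size: int = 64, stride: int = 32) -> np.ndarray:
    """One global MultiDiffusion step: fuse per-crop denoising directions.

    With per-pixel equality constraints between overlapping crops, the
    least-squares reconciliation reduces to a pixel-wise average of the
    crop predictions.
    """
    h, w = latent.shape[-2:]
    fused = np.zeros_like(latent)
    counts = np.zeros_like(latent)
    for y in range(0, h - crop_size + 1, stride):
        for x in range(0, w - crop_size + 1, stride):
            crop = latent[..., y:y + crop_size, x:x + crop_size]
            pred = denoise_step(crop, t)  # the reference model only ever sees a square crop
            fused[..., y:y + crop_size, x:x + crop_size] += pred
            counts[..., y:y + crop_size, x:x + crop_size] += 1.0
    return fused / np.maximum(counts, 1.0)  # average over however many crops cover each pixel

# Example: a 64x192 "panorama" latent denoised with a model trained on 64x64 crops
# (stride chosen so the crops tile the full width).
latent = np.random.randn(4, 64, 192)
for t in reversed(range(50)):
    latent = multidiffusion_step(latent, t)
```

In a real pipeline, the placeholder would be replaced by the sampler step of a pre-trained square text-to-image diffusion model running over its full noise schedule; the fusion logic itself stays the same.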
Using MultiDiffusion, a pre-trained reference text-to-image model can be applied to a variety of tasks, such as generating images at a target resolution or aspect ratio or generating images from rough, region-based text prompts, as shown in Figure 1. Notably, the framework can solve both tasks concurrently within a shared generation process. Comparing against relevant baselines, the researchers found that their method achieves state-of-the-art controlled generation quality, even relative to approaches trained specifically for these tasks, and it does so without adding computational overhead. The complete codebase will soon be released on their GitHub page, and more demos are available on their project page.
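For the region-based task mentioned above, a hedged sketch of the same fusion idea: each text prompt is paired with a binary region mask, and the per-prompt denoising predictions are fused with a mask-weighted average (again the closed-form least-squares solution). `denoise_with_prompt` is a hypothetical stand-in for a text-conditioned sampler step, not an API from the paper's codebase.

```python
import numpy as np

def denoise_with_prompt(latent: np.ndarray, prompt: str, t: int) -> np.ndarray:
    """Hypothetical stand-in for one text-conditioned denoising step of the reference model."""
    return latent * 0.98  # a real model would condition on `prompt` here

def region_multidiffusion_step(latent: np.ndarray, prompts: list[str],
                               masks: list[np.ndarray], t: int) -> np.ndarray:
    """Fuse denoising steps from several prompts, each restricted to its region mask."""
    fused = np.zeros_like(latent)
    weight = np.zeros_like(latent)
    for prompt, mask in zip(prompts, masks):
        pred = denoise_with_prompt(latent, prompt, t)
        fused += mask * pred      # each prompt only "votes" inside its own region
        weight += mask
    return fused / np.maximum(weight, 1e-8)  # mask-weighted least-squares average

# Example: left half follows one prompt, right half another (masks broadcast over channels).
latent = np.random.randn(4, 64, 64)
left = np.zeros((1, 64, 64)); left[..., :32] = 1.0
right = 1.0 - left
latent = region_multidiffusion_step(latent, ["a forest", "a beach"], [left, right], t=49)
```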
Check out the Paper, Github, and Project Page. All credit for this research goes to the researchers on this project.