Researchers From Stanford Introduce Locally Conditioned Diffusion: A Method For Compositional Text-To-Image Generation Using Diffusion Models

3D scene modeling has traditionally been a time-consuming process reserved for domain experts. Although a sizable collection of 3D assets is available in the public domain, it is uncommon to find a 3D scene that matches a user's requirements. As a result, 3D designers often spend hours or even days modeling individual 3D objects and assembling them into a scene. Making 3D creation straightforward while preserving control over its components (e.g., the size and position of individual objects) would help close the gap between experienced 3D designers and the general public.

The accessibility of 3D scene modeling has recently improved thanks to work on 3D generative models. Promising results for 3D object synthesis have been obtained with 3D-aware generative adversarial networks (GANs), a first step toward composing generated objects into scenes. GANs, however, are specialized to a single object category, which restricts the diversity of outputs and makes scene-level text-to-3D generation difficult. In contrast, text-to-3D generation using diffusion models lets users prompt the creation of 3D objects from a wide range of categories.

Current research imposes global conditioning from a single text prompt on rendered views of a differentiable scene representation, using robust 2D image diffusion priors trained on internet-scale data. These techniques can produce excellent object-centric generations, but they struggle to produce scenes with several distinct elements. Global conditioning further restricts controllability, since user input is limited to a single text prompt and there is no way to influence the layout of the generated scene. Researchers from Stanford address this with locally conditioned diffusion, a technique for compositional text-to-image generation using diffusion models.

Their technique builds coherent 3D scenes with control over the size and placement of individual objects, taking text prompts and 3D bounding boxes as input. It applies conditional diffusion steps selectively to specific regions of the image, using an input segmentation mask and matching per-region text prompts, producing outputs that follow the user-specified composition. By incorporating the technique into a text-to-3D generation pipeline based on score distillation sampling, they can also create compositional text-to-3D scenes.
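The core idea of locally conditioned diffusion, as described above, is that each pixel region takes the noise prediction of the model conditioned on the prompt assigned to that region by the segmentation mask. Below is a minimal numpy sketch of that masked compositing step; the noise predictors (`eps_sky`, `eps_tree`) are stand-in stubs for a real text-conditioned diffusion model, and the function names are illustrative, not from the paper's code.

```python
import numpy as np

def locally_conditioned_eps(x_t, t, masks, eps_fns):
    """Composite per-prompt noise predictions: each pixel takes the
    prediction of the model conditioned on the prompt assigned to it
    by its segmentation mask. Masks are assumed to partition the image."""
    eps = np.zeros_like(x_t)
    for mask, eps_fn in zip(masks, eps_fns):
        eps += mask * eps_fn(x_t, t)
    return eps

# Toy stand-ins for a text-conditioned model's noise prediction
# (a real model would take a text embedding; these return constants).
def eps_sky(x_t, t):
    return np.full_like(x_t, 0.1)

def eps_tree(x_t, t):
    return np.full_like(x_t, -0.2)

H = W = 4
x_t = np.random.randn(H, W)
mask_top = np.zeros((H, W)); mask_top[:2] = 1.0  # region for the "sky" prompt
mask_bot = 1.0 - mask_top                        # region for the "tree" prompt

eps = locally_conditioned_eps(x_t, t=0,
                              masks=[mask_top, mask_bot],
                              eps_fns=[eps_sky, eps_tree])
```

A standard DDPM/DDIM update would then use `eps` in place of the single-prompt noise prediction, so the usual sampler is unchanged; only the conditioning is spatially localized.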

Specifically, they make the following contributions:

• They present locally conditioned diffusion, a technique that gives 2D diffusion models more compositional flexibility. 

• They propose key camera pose sampling strategies, crucial for compositional 3D generation.

• They introduce a method for compositional 3D synthesis by adding locally conditioned diffusion to a score distillation sampling-based 3D generation pipeline.
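Score distillation sampling, the backbone of the pipeline mentioned in the last contribution, optimizes a differentiable scene representation by noising its rendered views and pushing them toward the diffusion model's denoising direction. Here is a minimal numpy sketch under simplifying assumptions: the "renderer" and noise predictor are stubs, and the weighting `w(t) = 1 - alpha_bar` is one common choice, not necessarily the paper's.

```python
import numpy as np

def sds_grad(render, params, eps_pred, t, alpha_bar, rng):
    """Score distillation sampling gradient (sketch).
    Noise a rendered view and return the per-pixel gradient
    w(t) * (eps_pred(x_t, t) - eps), which would be backpropagated
    through the differentiable renderer to the scene parameters."""
    x = render(params)                   # differentiable render of the scene
    eps = rng.standard_normal(x.shape)   # sampled Gaussian noise
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    w = 1.0 - alpha_bar                  # a common timestep weighting
    return w * (eps_pred(x_t, t) - eps)

# Stub usage: params *are* the image, and the predictor returns zeros.
render = lambda p: p
params = np.ones((4, 4))
eps_pred = lambda x_t, t: np.zeros_like(x_t)
g = sds_grad(render, params, eps_pred, t=500, alpha_bar=0.5,
             rng=np.random.default_rng(0))
```

In the compositional pipeline, the noise prediction plugged into this gradient would be the locally conditioned composite over the per-region prompts rather than a single-prompt prediction.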

Check out the Paper and Project. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 26k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

The post Researchers From Stanford Introduce Locally Conditioned Diffusion: A Method For Compositional Text-To-Image Generation Using Diffusion Models appeared first on MarkTechPost.
