Researchers from UC Santa Cruz and Samsung Introduce ESC: A Zero-Shot Object Navigation Agent that Leverages Commonsense in LLMs like ChatGPT for Navigation Decisions
Object navigation (ObjNav) requires a physical agent to find and reach a specified target object in a previously unseen environment. Because an agent must reach an object before it can interact with it, ObjNav underpins many other navigation-based embodied tasks.
Successful navigation depends on two skills: semantic scene understanding (identifying the rooms and objects in the environment) and commonsense inference (reasoning about where the goal object is likely to be). Current zero-shot object navigation approaches, however, often lack commonsense reasoning. They either rely on simple heuristics for exploration or require training on other goal-oriented navigation tasks and environments.
Recent research has shown that large pre-trained models excel at zero-shot learning and problem-solving. Inspired by these results, researchers from the University of California, Santa Cruz, and Samsung Research proposed a zero-shot object navigation framework called Exploration with Soft Commonsense constraints (ESC). The framework uses pre-trained models to adapt automatically to unfamiliar environments and object types.
The team first uses GLIP, a vision-and-language grounding model, as a prompt-based method for open-world object grounding and scene understanding: given the agent’s current view, it infers which objects and rooms are visible. Because GLIP is pre-trained extensively on image-text pairs, it generalizes readily to novel objects with minimal prompting. A pre-trained language model then performs commonsense reasoning, taking the room and object information as context to infer the associations between rooms and objects.
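As a rough illustration of the prompt-based step described above, the sketch below builds two text prompts: one listing candidate categories for a grounding model, and one asking a language model which rooms are likely to contain the goal. The function names, the period-separated category format, and the question wording are all assumptions made for illustration; they are not taken from the ESC paper or from any model's official API.

```python
from typing import Iterable, List


def build_grounding_prompt(categories: Iterable[str]) -> str:
    """Join category names into one period-separated text prompt.

    Assumption: the grounding model accepts a caption-like string in which
    each candidate category is a short phrase followed by a period.
    """
    cleaned: List[str] = [c.strip() for c in categories if c.strip()]
    return " . ".join(cleaned) + " ."


def build_commonsense_query(goal: str, rooms: Iterable[str]) -> str:
    """Build a hypothetical question asking an LLM to rate rooms for a goal object."""
    options = ", ".join(r.strip() for r in rooms)
    return (
        f"Which of the following rooms is most likely to contain a {goal}? "
        f"Options: {options}. Rate each option from 0 to 1."
    )


# Example prompts for a made-up episode whose goal object is a toilet.
grounding_prompt = build_grounding_prompt(["sofa", "bed", "toilet"])
llm_query = build_commonsense_query("toilet", ["bathroom", "kitchen", "bedroom"])
```

Keeping both prompts as plain strings means new object or room categories can be added by editing a list, which matches the open-world, no-retraining setting the article describes.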
However, a gap remains in translating the commonsense knowledge inferred by LLMs into actionable steps, and the relationships between objects and rooms are often uncertain. To address both problems, ESC models “soft” commonsense constraints with Probabilistic Soft Logic (PSL), a declarative templating language that defines a class of Markov random fields through first-order logical rules. These soft constraints are combined with frontier-based exploration (FBE), a traditional exploration strategy, to decide which frontier to explore next. Whereas prior approaches instill commonsense implicitly through neural network training, ESC expresses the knowledge as soft logic predicates in a continuous value space and assigns the resulting values to each frontier, which makes exploration more efficient.
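The idea of scoring frontiers with continuous commonsense values can be sketched in a few lines. This is a simplified stand-in, not the authors' method: it replaces PSL inference with a plain weighted sum, and the `Frontier` class, the prior dictionaries, and the weights are hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Frontier:
    """A boundary between explored and unexplored space (hypothetical structure)."""
    position: Tuple[float, float]
    nearby_objects: List[str] = field(default_factory=list)
    nearby_rooms: List[str] = field(default_factory=list)


def frontier_score(
    frontier: Frontier,
    agent_pos: Tuple[float, float],
    obj_prior: Dict[str, float],
    room_prior: Dict[str, float],
    w_obj: float = 1.0,
    w_room: float = 1.0,
    w_dist: float = 0.1,
) -> float:
    """Combine soft commonsense values in [0, 1] with a distance penalty.

    obj_prior / room_prior map an observed object or room to how likely the
    goal is to be found near it (for example, values suggested by an LLM).
    """
    obj_term = max((obj_prior.get(o, 0.0) for o in frontier.nearby_objects), default=0.0)
    room_term = max((room_prior.get(r, 0.0) for r in frontier.nearby_rooms), default=0.0)
    distance = math.dist(agent_pos, frontier.position)
    return w_obj * obj_term + w_room * room_term - w_dist * distance


def select_frontier(
    frontiers: List[Frontier],
    agent_pos: Tuple[float, float],
    obj_prior: Dict[str, float],
    room_prior: Dict[str, float],
) -> Frontier:
    """Pick the frontier with the highest combined score."""
    return max(
        frontiers,
        key=lambda f: frontier_score(f, agent_pos, obj_prior, room_prior),
    )


# Made-up example: the goal is a TV, so frontiers near a sofa or living room score higher.
obj_prior = {"sofa": 0.9, "toilet": 0.05}
room_prior = {"living room": 0.8, "bathroom": 0.1}
living = Frontier((4.0, 0.0), ["sofa"], ["living room"])
bath = Frontier((1.0, 0.0), ["toilet"], ["bathroom"])
best = select_frontier([living, bath], (0.0, 0.0), obj_prior, room_prior)
```

In this toy setup the living-room frontier wins even though it is farther away, because its commonsense terms outweigh the distance penalty. With no semantic evidence at all, the score reduces to the distance term, so the agent falls back to visiting the nearest frontier.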
To evaluate the system, the researchers use three object-goal navigation benchmarks (MP3D, HM3D, and RoboTHOR) that vary in home size, architectural style, texture features, and object types. In a similar setup, ESC outperforms CoW by around 285% on MP3D and by about 35% on RoboTHOR in both Success weighted by Path Length (SPL) and success rate (SR). Compared with ZSON, which requires training on the HM3D dataset, it achieves 196% better relative SPL on MP3D and 85% better relative SPL on HM3D. On the MP3D dataset, the proposed zero-shot approach achieves the highest SPL, even compared with state-of-the-art supervised methods.
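For readers unfamiliar with the metrics, the sketch below computes SPL as it is commonly defined in embodied-navigation evaluation: each successful episode contributes the ratio of the shortest path length to the path length actually traveled (capped at 1), failed episodes contribute 0, and the result is averaged over all episodes. It also shows how a relative improvement is calculated. The article does not spell out these formulas, so treat this as an assumption about the standard definitions; all episode values are made up and are not results from the paper.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Assumes every shortest path length is positive.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if success:
            # Efficiency is 1.0 for an optimal path and shrinks as the agent detours.
            total += shortest / max(taken, shortest)
    return total / len(successes)


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Made-up episodes: one optimal success, one success with a 2x detour, one failure.
example_spl = spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 15.0])
example_gain = relative_improvement(3.0, 1.0)
```

Under this definition, a relative improvement of around 285% means the new score is roughly 3.85 times the baseline, not 2.85 times.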
Check out the Paper and Project Page. All credit for this research goes to the researchers on this project.
The post Researchers from UC Santa Cruz and Samsung Introduce ESC: A Zero-Shot Object Navigation Agent that Leverages Commonsense in LLMs like ChatGPT for Navigation Decisions appeared first on MarkTechPost.