Someday, you might want your home robot to carry the dirty laundry downstairs and put it in the washing machine in the far left corner of the basement. The robot must combine your instructions with its visual observations to determine the steps to take to complete this task.
For an AI agent, this is easier said than done. Current approaches often use multiple hand-crafted machine learning models to tackle different parts of the task, which requires a great deal of human effort and expertise to build. These methods, which use visual representations to directly make navigation decisions, demand huge amounts of visual data for training, which are often hard to come by.
To overcome these challenges, researchers from MIT and the MIT-IBM Watson AI Lab devised a navigation method that converts visual representations into pieces of language, which are then fed into one large language model that achieves all parts of the multistep navigation task.
Instead of encoding visual features from images of a robot’s surroundings, which is computationally intensive, their method creates text captions that describe the robot’s point of view. A large language model uses the captions to predict the actions the robot should take to fulfill the user’s language-based instructions.
Because their method uses purely language-based representations, they can use a large language model to efficiently generate a large amount of synthetic training data.
Although this approach does not perform better than techniques that use visual features, it works well in situations where there is not enough visual data for training. The researchers found that combining their language-based inputs with visual signals led to better navigation performance.
“By using language as the perceptual representation, ours is a very straightforward approach. Since all the inputs can be encoded as language, we can generate a human-understandable trajectory,” says Bowen Pan, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this approach.
Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Philip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
Visual problem solving with language
Because large language models are such powerful machine learning models, researchers have sought to incorporate them into the complex task known as vision-and-language navigation, Pan says.
But such models take text-based inputs and can’t process visual data from a robot’s camera. So, the team needed to find a way to use language instead.
Their technique uses a simple captioning model to obtain text descriptions of the robot’s visual observations. These captions are combined with language-based instructions and fed into a large language model, which decides what navigation step the robot should take next.
The large language model outputs a caption of the scene the robot should see after completing that step. This is used to update the trajectory history so the robot can keep track of where it has been.
The model repeats these processes to create a path that guides the robot to its destination.
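In outline, this loop could be sketched as below. The function names (`caption_image`, `query_llm`), the action vocabulary, and the prompt layout are illustrative stand-ins, not the paper’s actual interfaces; a real system would call a captioning model and a large language model in their place.

```python
# Minimal sketch of the caption -> LLM -> action loop described above.
# `caption_image` and `query_llm` are placeholder stubs.

def caption_image(observation):
    # Stand-in for a vision-to-text captioning model.
    return f"You see: {observation}"

def query_llm(prompt):
    # Stand-in for the LLM, which picks the next action and predicts
    # the caption of the scene expected after taking that action.
    current = prompt.splitlines()[-1]  # last line is the current observation caption
    if "washing machine" in current:
        return ("stop", "You have arrived at the washing machine.")
    return ("move_forward", "A hallway leading toward the basement stairs.")

def navigate(instruction, observations, max_steps=10):
    history = []  # language-only trajectory history
    for obs in observations[:max_steps]:
        obs_caption = caption_image(obs)
        prompt = "\n".join([f"Instruction: {instruction}", *history, obs_caption])
        action, predicted_caption = query_llm(prompt)
        # The predicted next-scene caption updates the trajectory history.
        history.append(f"{obs_caption} -> {action} (expect: {predicted_caption})")
        if action == "stop":
            break
    return history

path = navigate("Put the laundry in the washing machine.",
                ["a staircase going down", "a basement with a washing machine"])
```

Because every step of `path` is plain text, the whole trajectory can be read and debugged by a human, which is the property the researchers highlight.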
To streamline the process, the researchers designed templates so observation information is presented to the model in a standard form: as a series of choices the robot can make based on its surroundings.
For instance, a caption might say, “30 degrees to your left is a door with a potted plant next to it; behind you is a small office with a desk and computer,” etc. The model then chooses whether the robot should move toward the door or the office.
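A fixed-format template of this kind might look roughly as follows; the exact wording, fields, and helper function here are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a fixed-format navigation prompt template.

TEMPLATE = """Instruction: {instruction}
Trajectory so far:
{history}
Current observation, as choices:
{options}
Select the option to move toward:"""

def format_options(candidates):
    # Each candidate is (heading_degrees, description); negative degrees
    # mean left of the robot. A numbered list keeps the input in a
    # standard form the model can select from.
    lines = []
    for i, (deg, desc) in enumerate(candidates, start=1):
        side = "left" if deg < 0 else "right"
        lines.append(f"{i}. {abs(deg)} degrees to your {side}: {desc}")
    return "\n".join(lines)

prompt = TEMPLATE.format(
    instruction="Go to the small office.",
    history="(start)",
    options=format_options([(-30, "a door next to a potted plant"),
                            (45, "a small office with a desk and computer")]),
)
```

Presenting observations as an enumerated list of choices means the language model only has to emit an option number, rather than free-form motion commands.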
“One of the biggest challenges was figuring out how to encode this kind of information into language in a way that makes the agent understand what the task is and how it should respond,” Pan says.
Advantages of language
When they tested this approach, they found that while it could not outperform vision-based techniques, it did offer several advantages.
First, because text requires fewer computational resources to synthesize than complex image data, their method can be used to rapidly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real-world, visual trajectories.
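This kind of data augmentation could look, in outline, like the sketch below. The `vary_trajectory` stub stands in for prompting a large language model to produce plausible variants of a real, language-described trajectory; the seed trajectory, room names, and 1,000-to-1 ratio are illustrative, not the paper’s actual data.

```python
import random

def vary_trajectory(seed, rng):
    # Stand-in for prompting an LLM to rewrite a seed trajectory:
    # swap in a different room and landmark to get a new, plausible path.
    rooms = ["kitchen", "basement", "hallway", "office"]
    objects = ["a potted plant", "a desk", "a washing machine", "a sofa"]
    return [step.format(room=rng.choice(rooms), obj=rng.choice(objects))
            for step in seed]

seed_trajectories = [
    ["Start in the {room}.", "Walk past {obj}.", "Stop at the door."],
]  # a real run would start from language descriptions of real trajectories

rng = random.Random(0)
synthetic = [vary_trajectory(seed, rng)
             for seed in seed_trajectories
             for _ in range(1000)]
```

Because each synthetic trajectory is just text, generating thousands of them is cheap compared with rendering or collecting the equivalent visual data.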
The technique can also bridge the gap that can prevent an agent trained in a simulated environment from performing well in the real world. This gap often occurs because computer-generated images can appear quite different from real-world scenes due to elements like lighting or color. But language that describes a synthetic image versus a real one would be much harder to tell apart, Pan says.
Also, because the representations their model uses are written in natural language, they are easy for a human to understand.
“If the agent fails to achieve its goal, we can more easily determine where it failed and why. Maybe the history information is not clear enough, or the observation ignores some important details,” Pan says.
In addition, their method could be applied more easily to a variety of tasks and environments because it uses only one type of input. As long as data can be encoded as language, they can use the same model without making any changes.
But a drawback is that their method naturally loses some information that is captured by vision-based models, such as depth information.
However, the researchers were surprised to find that combining language-based representations with vision-based methods improved an agent’s ability to navigate.
“Maybe this means that language can capture some higher-level information that cannot be captured with purely visual features,” he says.
This is one area the researchers want to continue exploring. They also want to develop a navigation-oriented captioner that could boost the method’s performance. In addition, they want to probe the ability of large language models to exhibit spatial awareness and see how this could aid language-based navigation.
This research is funded in part by the MIT-IBM Watson AI Lab.