Introducing Google DeepMind's Scalable Instructable Multiworld Agent (SIMA)
SIMA is a Scalable Instructable Multiworld Agent that can follow natural-language instructions to carry out tasks in a variety of video game settings.
Video games are a key proving ground for artificial intelligence (AI) systems. Like the real world, games are rich learning environments with responsive, real-time settings and ever-changing goals.
Today, Google DeepMind is announcing a new milestone: shifting focus from individual games towards a general, instructable game-playing AI agent.
Delve Deeper:
In a new technical report, Google DeepMind introduces SIMA, short for Scalable Instructable Multiworld Agent, a generalist AI agent for 3D virtual settings. This research marks the first time an agent has demonstrated it can understand a broad range of gaming worlds and follow natural-language instructions to carry out tasks within them, as a human might.
This work isn't about achieving high game scores. Learning to play even one video game is a technical feat for an AI system, but learning to follow instructions in a variety of game settings could unlock more helpful AI agents for any environment. Our research shows how we can translate the capabilities of advanced AI models into useful, real-world actions through a language interface. We hope that SIMA and other agent research can use video games as sandboxes to better understand how AI systems may become more helpful.
Learning from video games
We collaborated with eight game studios to train and test SIMA on nine different video games.
To expose SIMA to many environments, we’ve built a number of partnerships with game developers for our research. We collaborated with eight game studios to train and test SIMA on nine different video games, such as No Man’s Sky by Hello Games and Teardown by Tuxedo Labs. Each game in SIMA’s portfolio opens up a new interactive world, including a range of skills to learn, from simple navigation and menu use, to mining resources, flying a spaceship, or crafting a helmet.
We also used four research environments, including a new environment we built with Unity called the Construction Lab, where agents need to build sculptures from building blocks, testing their object manipulation and intuitive understanding of the physical world.
By learning from different gaming worlds, SIMA captures how language ties in with game-play behavior. Our first approach was to record pairs of human players across the games in our portfolio, with one player watching and instructing the other. We also had players play freely, then rewatch what they did and record instructions that would have led to their game actions.
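To make the shape of this training data concrete, here is a minimal sketch in Python of how such instruction-labeled gameplay clips might be represented. The Step and Trajectory types and their fields are illustrative assumptions, not SIMA's actual data format.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the Step/Trajectory field names and types are
# assumptions, not SIMA's actual data format.

@dataclass
class Step:
    screen_pixels: bytes           # the frame the player saw
    keyboard_keys: list[str]       # keys held during this step
    mouse_delta: tuple[int, int]   # relative mouse movement

@dataclass
class Trajectory:
    game: str          # e.g. "No Man's Sky"
    instruction: str   # the natural-language instruction for this clip
    steps: list[Step] = field(default_factory=list)

# Setup A: one player instructs while the other plays, so the instruction
# is recorded up front. Setup B: a player plays freely, then rewatches the
# footage and writes the instruction afterwards. Both yield the same shape:
example = Trajectory(
    game="Teardown",
    instruction="knock down the wall",
    steps=[Step(screen_pixels=b"...", keyboard_keys=["w"], mouse_delta=(4, 0))],
)
```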
Breakdown:
SIMA comprises pre-trained vision models and a main model that includes a memory and outputs keyboard and mouse actions.
SIMA: a versatile AI agent
SIMA is an AI agent that can perceive and understand a variety of environments, then take actions to achieve an instructed goal. It comprises a model designed for precise image-language mapping and a video model that predicts what will happen next on-screen. These models have been fine-tuned on training data specific to the 3D settings in the SIMA portfolio.
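As a rough sketch of how these components could fit together in a perception-to-action loop, assuming hypothetical names for the encoder, predictor, and policy (SIMA's real interfaces are not described at this level of detail):

```python
# A minimal sketch of the agent loop described above. The class and
# component names are illustrative assumptions; SIMA's actual models
# and interfaces are not public.

class SIMAAgent:
    def __init__(self, image_language_encoder, video_predictor, policy):
        self.encoder = image_language_encoder  # precise image-language mapping
        self.predictor = video_predictor       # anticipates what happens next on-screen
        self.policy = policy                   # main model with memory over recent steps
        self.memory = []

    def act(self, frame, instruction):
        """Map one screen frame plus a language instruction to the same
        keyboard-and-mouse actions a human player would use."""
        features = self.encoder(frame, instruction)
        anticipated = self.predictor(frame)
        action = self.policy(features, anticipated, self.memory)
        self.memory.append((features, action))
        return action

# Toy usage with stand-in components:
agent = SIMAAgent(
    image_language_encoder=lambda frame, text: (frame, text),
    video_predictor=lambda frame: frame,
    policy=lambda feats, anticipated, memory: {"keys": ["w"], "mouse": (0, 0)},
)
print(agent.act(frame="<pixels>", instruction="open the map"))
```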
The current version of SIMA is evaluated across 600 basic skills, spanning navigation (e.g. "turn left"), object interaction ("climb the ladder"), and menu use ("open the map"). SIMA is trained to perform simple tasks that can be completed within about 10 seconds.
SIMA was evaluated across 600 basic skills, spanning navigation, object interaction, and menu use.
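To illustrate, a skill-based evaluation over those categories might be organized like the sketch below; the skill lists, frame rate, and success judge are invented placeholders rather than DeepMind's actual harness.

```python
# Illustrative only: the skill lists, frame rate, and success judge are
# placeholders, not DeepMind's published evaluation harness.

SKILLS = {
    "navigation": ["turn left", "go forward"],
    "object interaction": ["climb the ladder", "chop down the tree"],
    "menu use": ["open the map", "open the inventory"],
}

TASK_BUDGET_SECONDS = 10  # SIMA's tasks complete within about 10 seconds

def evaluate(agent, env, judge):
    """Run each skill as a short instruction-following episode and
    report a per-category success rate."""
    results = {}
    for category, instructions in SKILLS.items():
        successes = 0
        for instruction in instructions:
            frame = env.reset()
            for _ in range(TASK_BUDGET_SECONDS * 30):  # assume ~30 frames/second
                frame = env.step(agent.act(frame, instruction))
                if judge(env, instruction):  # did the instruction succeed?
                    successes += 1
                    break
        results[category] = successes / len(instructions)
    return results
```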
Future agents will be able to tackle tasks that require high-level strategic planning and multiple sub-tasks to complete, such as “Find resources and build a camp”. This is an important goal for AI in general, because while large language models have given rise to powerful systems that can capture knowledge about the world and generate plans, they currently lack the ability to take actions on our behalf.
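As a sketch of that direction, a planner built on a language model could decompose a high-level goal into the short, instructable sub-tasks an agent like SIMA already handles. Everything below, including the planner function and its hard-coded decomposition, is a hypothetical placeholder.

```python
# Hypothetical sketch of the future direction described above: a language
# model proposes sub-tasks, and an instructable agent executes each one.

def plan_with_llm(goal: str) -> list[str]:
    # A real system would query a large language model here; this
    # hard-coded decomposition is a stand-in for illustration.
    return ["find resources", "gather wood", "build a camp"]

def pursue(agent, env, goal: str):
    frame = env.reset()
    for subtask in plan_with_llm(goal):
        # Each sub-task is grounded as a short instruction-following
        # episode, the kind of skill SIMA already performs.
        frame = env.step(agent.act(frame, subtask))

# pursue(agent, env, "Find resources and build a camp")
```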