Back in February, when Meta CEO Mark Zuckerberg announced that the company was working on a range of new AI initiatives, he noted that, among those projects, Meta was developing new experiences with text and images, as well as with video and ‘multi-modal’ elements.
So what does ‘multi-modal’ mean in this context?
Today, Meta has outlined how its multi-modal AI could work, with the launch of ImageBind, a model that enables AI systems to relate information across multiple types of input, for more accurate and responsive recommendations.
As explained by Meta:
“When humans absorb information from the world, we innately use multiple senses, such as seeing a busy street and hearing the sounds of car engines. Today, we’re introducing an approach that brings machines one step closer to humans’ ability to learn simultaneously, holistically, and directly from many different forms of information – without the need for explicit supervision. ImageBind is the first AI model capable of binding information from six modalities.”
ImageBind essentially enables the system to learn associations not just between text, images and video, but also audio, depth (via 3D sensors), thermal readings, and motion data from inertial sensors (IMU). Combined, these inputs provide richer spatial and contextual cues, which then enable the system to produce more accurate representations and associations, taking AI experiences a step closer to emulating human responses.
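At a technical level, this kind of ‘binding’ comes down to mapping every input type into one shared embedding space, so that an audio clip, a thermal reading and an image can all be compared directly. The sketch below is a simplified, hypothetical illustration of that idea in PyTorch – the encoder sizes, dummy data and loss setup are assumptions for demonstration, not Meta’s actual architecture.

```python
# Hypothetical sketch of the "binding" idea: each modality gets its own
# encoder, but all encoders map into one shared embedding space, so
# similarity can be measured across modalities.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128  # size of the shared embedding space (assumed)

class ModalityEncoder(nn.Module):
    """Toy stand-in for a modality-specific encoder (image, audio, thermal...)."""
    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, EMBED_DIM),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalise so cosine similarity is a simple dot product
        return F.normalize(self.net(x), dim=-1)

# One encoder per modality; image embeddings act as the common anchor.
image_encoder = ModalityEncoder(input_dim=512)    # e.g. image features
audio_encoder = ModalityEncoder(input_dim=256)    # e.g. audio spectrogram features
thermal_encoder = ModalityEncoder(input_dim=64)   # shown only to illustrate per-modality encoders

def contrastive_loss(anchor: torch.Tensor, other: torch.Tensor, temperature: float = 0.07):
    """InfoNCE-style loss: paired (image, other-modality) samples should be
    closer to each other than to the rest of the batch."""
    logits = anchor @ other.t() / temperature
    targets = torch.arange(anchor.size(0))
    return F.cross_entropy(logits, targets)

# Dummy "paired" data: in practice these would be naturally co-occurring
# observations (an image and the audio recorded alongside it, etc.).
images = torch.randn(32, 512)
audio = torch.randn(32, 256)

img_emb = image_encoder(images)
aud_emb = audio_encoder(audio)
loss = contrastive_loss(img_emb, aud_emb)
loss.backward()  # in a real setup this would run inside a training loop

# Once aligned, any two modalities can be compared directly via the shared space.
similarity = aud_emb @ img_emb.t()  # cross-modal similarity matrix
print(similarity.shape)  # torch.Size([32, 32])
```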
“For example, using ImageBind, Meta’s Make-A-Scene could create images from audio, such as creating an image based on the sounds of a rain forest or a bustling market. Other future possibilities include more accurate ways to recognize, connect, and moderate content, and to boost creative design, such as generating richer media more seamlessly and creating wider multimodal search functions.”
The potential use cases are significant, and if Meta’s systems can establish more accurate alignment between these varied inputs, that could advance the current slate of AI tools, which are largely text- and image-based, into a whole new realm of interactivity.
That could also facilitate the creation of more accurate VR worlds, a key element in Meta’s advance towards the metaverse. In Horizon Worlds, for example, people can create their own VR spaces, but at this stage, technical limitations mean that most Horizon experiences are still very basic – like walking into a video game from the 80s.
But if Meta can provide more tools that enable anybody to create whatever they want in VR, simply by speaking it into existence, that could open up a whole new realm of possibility, and could quickly make its VR experience a more attractive, engaging option for many users.
We’re not there yet, but advances like this move towards the next stage of metaverse development, and point to exactly why Meta is so high on the potential of its more immersive experiences.
Meta also notes that ImageBind could be used in more immediate ways to advance in-app processes.
“Imagine that someone could take a video recording of an ocean sunset and instantly add the perfect audio clip to enhance it, while an image of a brindle Shih Tzu could yield essays or depth models of similar dogs. Or when a model like Make-A-Video produces a video of a carnival, ImageBind can suggest background noise to accompany it, creating an immersive experience.”
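To give a rough sense of how that audio suggestion step might work in practice, the hypothetical snippet below ranks a small library of sound clips by how close their embeddings sit to a video’s embedding in a shared space. The suggest_audio helper, clip names and embeddings are all placeholders, not a real Meta API.

```python
# Hypothetical cross-modal retrieval: pick the audio clips whose embeddings
# are nearest to a generated video's embedding in the shared space.
import torch
import torch.nn.functional as F

def suggest_audio(video_embedding: torch.Tensor,
                  audio_library: dict[str, torch.Tensor],
                  top_k: int = 3) -> list[str]:
    """Rank candidate audio clips by cosine similarity to a video embedding."""
    names = list(audio_library.keys())
    stacked = torch.stack([audio_library[n] for n in names])        # (N, D)
    sims = F.cosine_similarity(stacked, video_embedding.unsqueeze(0), dim=-1)
    best = sims.topk(min(top_k, len(names))).indices.tolist()
    return [names[i] for i in best]

# Dummy embeddings standing in for the outputs of trained modality encoders.
video_emb = torch.randn(128)
library = {name: torch.randn(128) for name in
           ["carnival_crowd", "rainforest", "ocean_waves", "street_market"]}

print(suggest_audio(video_emb, library))  # e.g. ['carnival_crowd', ...]
```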
These are early applications of the technology, and ImageBind could end up being one of the more significant advances in Meta’s AI development.
We’ll now wait and see how Meta looks to apply it, and whether that leads to new AR and VR experiences in its apps.
You can read more about ImageBind and how it works here.