
Meta introduces ImageBind, a multimodal AI model that combines six data types

Meta AI just introduced ImageBind, a new open-source AI model that can bind information from six different modalities: text, audio, visual, thermal, motion, and depth. By recognizing the relationships between these modalities, the model helps advance AI by enabling machines to better analyze many different forms of information together.

According to Mark Zuckerberg’s Facebook post, “ImageBind is a new AI model that combines different senses, just like humans do.”

ImageBind brings machines closer to humans’ ability to learn simultaneously and holistically from many different forms of information without requiring explicit supervision (the process of organizing and labeling raw data).

The model learns a single common representation space, not only for text, image/video and audio, but also for sensors that record depth (3D), heat (infrared radiation) and inertial measurement units (IMUs) that capture movement and position.
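As a rough illustration of what a shared representation space means in practice, the sketch below encodes two different modalities into vectors of the same dimensionality and compares them with cosine similarity. The encoder classes and the embedding dimension are hypothetical placeholders, not ImageBind’s actual API.

```python
# Conceptual sketch of a shared multimodal embedding space.
# The encoders here are stand-in stubs, not ImageBind's real components.
import numpy as np

EMBED_DIM = 1024  # assumed size of the joint embedding space

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Project a vector onto the unit sphere so dot products become cosine similarity."""
    return v / np.linalg.norm(v)

class DummyEncoder:
    """Placeholder for a modality-specific encoder (image, audio, depth, ...).

    Each modality has its own network, but all of them map into the
    same EMBED_DIM-dimensional space.
    """
    def __init__(self, seed: int):
        self.rng = np.random.default_rng(seed)

    def __call__(self, raw_input) -> np.ndarray:
        # A real encoder would run a network over the raw signal;
        # here we just return a deterministic pseudo-embedding.
        return l2_normalize(self.rng.normal(size=EMBED_DIM))

image_encoder = DummyEncoder(seed=0)
audio_encoder = DummyEncoder(seed=1)

img_emb = image_encoder("photo_of_a_dog.jpg")
aud_emb = audio_encoder("dog_barking.wav")

# Because both vectors live in one space, a single dot product
# measures how well the sound matches the picture.
similarity = float(img_emb @ aud_emb)
print(f"image-audio similarity: {similarity:.3f}")
```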


ImageBind thus provides machines with a holistic understanding that relates objects in a photo to their sound, their 3D shape, their warmth or coldness and their movement.

A Meta AI research paper shows that ImageBind can outperform models trained individually for a single modality, but for researchers it primarily helps advance AI by enabling machines to analyze many types of information simultaneously.

Meta cites the example of Make-A-Scene on its blog, which could use ImageBind to create images from audio, such as an image based on the sounds of a rainforest or a busy market. Other possibilities for the future include more accurate methods of recognizing, linking and moderating content, and boosting creative design, e.g. easier generation of richer media and richer multimodal search functions.

Encouraging the development of multimodal AI models

According to Meta, this latest model is part of its effort to create multimodal AI systems that learn from all sorts of data types around them. It also paves the way for researchers to develop new holistic systems, for example by combining 3D sensors and IMUs to design or experience immersive virtual worlds.

ImageBind shows that it is possible to create a common embedding space across multiple modalities without having to train on data from every pairwise combination of modalities; image-paired data alone is enough to bind them together.

ImageBind’s multimodal capabilities could allow researchers to use one modality as an input query and retrieve outputs in other modalities. Additionally, existing AI models can be upgraded to support input from any of the six modalities, enabling audio-based search, cross-modal search, cross-modal arithmetic, and cross-modal generation.
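A minimal sketch of how such retrieval and arithmetic could look once embeddings from different modalities share one space; the embeddings below are random stand-ins rather than real ImageBind outputs, and the gallery names are invented for illustration.

```python
# Sketch: cross-modal retrieval and embedding arithmetic in a shared space.
# Embeddings are random stand-ins; real ones would come from modality encoders.
import numpy as np

rng = np.random.default_rng(42)
EMBED_DIM = 1024

def normalize(v):
    return v / np.linalg.norm(v)

# A small "gallery" of image embeddings, indexed by name.
gallery = {name: normalize(rng.normal(size=EMBED_DIM))
           for name in ["beach", "forest", "city_street", "kitchen"]}

def retrieve(query_emb, gallery):
    """Return gallery items ranked by cosine similarity to the query.

    The query can come from any modality (audio, text, depth, ...),
    which is what makes the search cross-modal.
    """
    scores = {name: float(query_emb @ emb) for name, emb in gallery.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Pretend this is the embedding of a wave-crashing sound clip.
audio_query = normalize(rng.normal(size=EMBED_DIM))
print(retrieve(audio_query, gallery))

# Cross-modal arithmetic: add two embeddings to compose their semantics,
# e.g. an image of a street plus the sound of rain ~ "rainy street".
street_plus_rain = normalize(gallery["city_street"] + audio_query)
print(retrieve(street_plus_rain, gallery))
```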


By aligning the embeddings of six modalities in a common space, ImageBind enables cross-modal retrieval of content types never seen together, embedding arithmetic that naturally composes the semantics of different modalities, and audio-to-image generation by feeding audio embeddings to a DALL-E 2 decoder pre-trained to work with CLIP text embeddings.
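In rough terms, the generation trick works because the audio embeddings are aligned with the text embedding space the decoder already understands, so an audio embedding can be dropped in where a text embedding is expected. The sketch below assumes a hypothetical decoder interface and placeholder encoder; it is not the actual DALL-E 2 or ImageBind code.

```python
# Sketch: swapping an audio embedding in place of a text embedding
# for a decoder that was pre-trained to condition on CLIP-style text embeddings.
# `PretrainedDecoder` and `encode_audio` are hypothetical placeholders.
import numpy as np

EMBED_DIM = 1024

class PretrainedDecoder:
    """Stand-in for a diffusion decoder conditioned on an embedding vector."""
    def generate(self, conditioning: np.ndarray) -> np.ndarray:
        # A real decoder would run a diffusion process; here we return
        # a dummy 64x64 RGB array just to show the data flow.
        rng = np.random.default_rng(int(abs(conditioning[0]) * 1e6))
        return rng.random((64, 64, 3))

def encode_audio(path: str) -> np.ndarray:
    """Placeholder for an audio encoder that maps sound into the shared space."""
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    v = rng.normal(size=EMBED_DIM)
    return v / np.linalg.norm(v)

decoder = PretrainedDecoder()

# Because the audio embedding lives in the same space as the text embeddings
# the decoder was trained on, it can be passed in directly.
rain_embedding = encode_audio("rainforest_ambience.wav")
image = decoder.generate(rain_embedding)
print(image.shape)  # (64, 64, 3)
```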

Model performance

Image-aligned self-supervised learning shows that the model’s performance can actually improve with very few training examples. According to the team, “the model has emergent capabilities, or scaling behaviors: abilities that were not present in smaller models but appear in larger versions. This may include recognizing which audio matches a particular image, or predicting the depth of a scene from a photo.”

Research has also shown that ImageBind’s scaling behavior improves with the strength of the image encoder: ImageBind’s ability to align modalities increases with the strength and size of the vision model. This suggests that larger vision models benefit non-visual tasks such as audio classification and that the benefits of training such models extend beyond computer vision tasks.

Researchers compared ImageBind’s audio and depth encoders to previous work on zero-shot retrieval and on audio and depth classification tasks, and found that ImageBind outperformed the prior specialist models.
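Zero-shot classification in this setting typically means embedding the candidate class names as text and picking the class whose text embedding is closest to the audio (or depth) embedding, with no task-specific classifier trained. The following sketch uses random placeholder embeddings and invented class names just to show the mechanics.

```python
# Sketch: zero-shot audio classification via a shared text/audio embedding space.
# All embeddings are random placeholders standing in for real encoder outputs.
import numpy as np

rng = np.random.default_rng(7)
EMBED_DIM = 1024

def normalize(v):
    return v / np.linalg.norm(v)

# Text embeddings for the candidate class names; no audio-specific
# classifier is trained, which is what makes the setup "zero-shot".
class_names = ["dog barking", "car engine", "rain", "applause"]
text_embs = {c: normalize(rng.normal(size=EMBED_DIM)) for c in class_names}

def classify(audio_emb):
    """Pick the class whose text embedding is most similar to the audio clip."""
    scores = {c: float(audio_emb @ e) for c, e in text_embs.items()}
    return max(scores, key=scores.get), scores

audio_clip_emb = normalize(rng.normal(size=EMBED_DIM))
label, scores = classify(audio_clip_emb)
print(label, scores[label])
```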


For researchers, the introduction of new modalities such as touch, speech, smell, and fMRI brain signals will enable the creation of richer, human-centric AI models. The team hopes the research community will explore ImageBind and their accompanying paper to find new ways to evaluate vision models and to develop new applications.

References: Meta AI Blog

For more information, see the demo and research article: “IMAGEBIND: One Embedding Space To Bind Them All”