
An AI that comes closer to human perception

At a June press conference in Paris, Meta's chief AI scientist Yann LeCun announced a new breed of AI that learns by comparing abstract representations of images rather than comparing the pixels themselves. Named I-JEPA (Image-based Joint-Embedding Predictive Architecture), this AI shows excellent performance on multiple computer vision tasks, and the representations it learns can be reused for many different applications. For this purpose, the researchers trained a vision transformer model with 632 million parameters on 16 A100 GPUs in under 72 hours. With only twelve labeled examples per class, the I-JEPA model managed to classify images from the ImageNet reference database about as well as artificial intelligences trained on millions of labeled examples. The scientific work was presented at the CVPR 2023 international conference in Vancouver at the end of June and is available online.
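To give a concrete idea of this kind of low-shot evaluation, here is a minimal sketch in Python. It assumes a frozen pretrained encoder is available through a hypothetical embed_images function and fits a simple linear classifier on the frozen embeddings of a handful of labeled examples; it illustrates the general protocol, not Meta's actual evaluation code.

```python
from sklearn.linear_model import LogisticRegression

def low_shot_probe(embed_images, train_imgs, train_labels, test_imgs, test_labels):
    """Fit a linear classifier on frozen embeddings (e.g. ~12 labels per class)."""
    z_train = embed_images(train_imgs)   # shape (n_train, d); the encoder stays frozen
    z_test = embed_images(test_imgs)     # shape (n_test, d)
    clf = LogisticRegression(max_iter=1000).fit(z_train, train_labels)
    return clf.score(z_test, test_labels)  # top-1 accuracy on the test set
```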

Build machines that learn even faster

This AI is a foretaste of Yann LeCun's vision of new statistical learning architectures aimed at creating machines capable of learning internal models of how the world works. The goal? To learn much faster, to plan complex tasks and to adapt easily to unfamiliar situations. In short, to get a little closer to human intelligence…

In practice, the implemented model aims to predict the appearance of one part of an input (e.g. an image or a piece of text) from the appearance of other parts of the same input. By predicting representations at a high level of abstraction instead of predicting pixel values, the hope is to learn useful representations while avoiding the limitations of generative approaches, which underpin the large language models (such as ChatGPT) that have caused so much excitement lately.
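To make the idea concrete, here is a minimal sketch, in PyTorch, of a JEPA-style training step: a context encoder sees the visible part of the input, a target encoder (kept as a slowly updated copy) encodes the hidden part, and a predictor tries to match the target's representation rather than its pixels. The names, shapes and use of an exponential moving average are illustrative assumptions, not Meta's actual code.

```python
import torch
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, optimizer,
              x_context, x_target, ema=0.996):
    """One JEPA-style update: predict representations, not pixels (illustrative)."""
    # Encode the visible (context) part and the hidden (target) part.
    z_context = context_encoder(x_context)        # (batch, d)
    with torch.no_grad():
        z_target = target_encoder(x_target)       # (batch, d), no gradients here

    # Predict the target representation from the context representation.
    z_pred = predictor(z_context)                 # (batch, d)
    loss = F.mse_loss(z_pred, z_target)           # loss measured in embedding space

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Slowly move the target encoder toward the context encoder (EMA update).
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1.0 - ema)
    return loss.item()
```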

Effective, but (already) outdated

What is the difference between the two? Generative architectures learn by removing or distorting parts of the model's input, for example by deleting part of a photo or hiding certain words in a passage of text, and then trying to predict the corrupted or missing pixels or words. Although very effective, these methods have a major flaw: the model tries to fill in all the missing information, even when the world is inherently unpredictable. Generative methods can therefore be prone to errors that a human would never make, because they focus on irrelevant details instead of capturing predictable high-level concepts. Related to this limitation is the tendency of large language models to "hallucinate", that is, to state things that are false. It is also notoriously difficult for generative models to render human hands accurately: in the images they produce, characters often have extra or deformed fingers, a useful clue for detecting a synthetic image.
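For contrast, a generative masked-reconstruction objective computes its loss directly on pixels, which forces the model to guess every missing detail. The sketch below uses hypothetical encoder and decoder modules and is only meant to illustrate that difference, not any particular system.

```python
import torch
import torch.nn.functional as F

def masked_pixel_loss(encoder, decoder, images, mask):
    """Generative-style objective: reconstruct the hidden pixels themselves."""
    # mask: boolean tensor, True where pixels are hidden from the encoder.
    visible = images * (~mask)            # zero out the hidden pixels
    latent = encoder(visible)             # encode only what is visible
    reconstruction = decoder(latent)      # predict a full image, pixel by pixel
    # The loss is measured on the hidden pixels, in pixel space.
    return F.mse_loss(reconstruction[mask], images[mask])
```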

Eliminate what seems useless

I-JEPA instead predicts missing information in an abstract representation that is closer to the general understanding people have of an image. Compared with generative methods that predict in pixel space, I-JEPA uses abstract prediction targets from which unnecessary pixel-level detail can be eliminated, which allows the model to learn more semantic features (as humans do). The newly developed algorithms learn high-level representations of parts of objects without discarding their local positional information in the image.
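One way to picture this is block-wise masking over a grid of image patches: the model gets one large "context" block and must predict the representations of a few smaller target blocks, whose positions are kept so that spatial information is not lost. The sketch below samples such blocks; the grid size and block fractions are illustrative choices, not the exact values used in the paper.

```python
import random

def sample_blocks(grid_h=14, grid_w=14, n_targets=4,
                  target_frac=0.15, context_frac=0.85):
    """Sample rectangular patch blocks: several small targets and one large context
    (illustrative fractions; patch coordinates are kept, preserving spatial layout)."""
    def block(frac):
        area = int(frac * grid_h * grid_w)
        h = max(1, min(grid_h, int(area ** 0.5)))
        w = max(1, min(grid_w, area // h))
        top = random.randint(0, grid_h - h)
        left = random.randint(0, grid_w - w)
        return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

    targets = [block(target_frac) for _ in range(n_targets)]
    # Remove target patches from the context block, so the predictor cannot
    # simply copy what the context encoder has already seen.
    context = block(context_frac) - set().union(*targets)
    return context, targets
```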

At a more technical level, "the idea is to create a representation of the model's outputs, rather than simply decoding a hidden representation learned from the input data," explains Aurélien Rodriguez, a researcher at Meta Research in Paris. "This breakthrough is also due in large part to all the innovations developed in our research laboratories around the world over the past few years," adds Yann LeCun.

Beyond images, this type of method can be extended to paired text-image data and to video in order to perform automatic video-understanding tasks, whose automation is still quite rudimentary. Will these novel models take precedence over generative models? In any case, Meta is playing on more than one front: the company has just announced the release of a new language model called Llama 2. Unlike GPT-4, it is openly available and can be found on the Azure (Microsoft) and AWS (Amazon) platforms, as well as via Hugging Face.

Philippe Pajot

Opening picture: example visualization of the I-JEPA predictor's outputs. The green boxes represent the reconstructions, whose quality varies from sample to sample (Source: Meta).