
What do we know about Ferret, the multimodal LLM introduced by Apple?

We learned last July that Apple was discreetly taking its first steps into generative AI with an "Apple GPT" chatbot intended only for internal use by its employees. Ferret, an LLM designed specifically for its smartphones, was likewise not promoted by the company; however, this open-source multimodal language model, developed by Apple's AI experts in collaboration with researchers at Columbia University, was presented in a research paper on arXiv.

Apple seems to have lagged behind the other GAFAMs in its adoption of generative AI. Tim Cook, its CEO, said that GenAI had enormous potential, but that, in his view, certain problems still needed to be solved.

The company had therefore decided to ban its employees from using not only ChatGPT but also other AI tools, including the programming assistant GitHub Copilot, for security reasons. However, it had no intention of giving up the revenue the technology could bring, and one of its teams worked on developing LLMs under the leadership of John Giannandrea, its senior vice president of machine learning and AI strategy, hired in 2018 after leading Google's search and artificial intelligence divisions for eight years. Apple employees who received special permission were able to access Apple GPT, but could not use it to develop customer-facing features.

Ferret, a multimodal open source LLM

Ferret, trained on 8 Nvidia A100 GPUs, stands out from existing multimodal models by performing strongly on referring and grounding tasks. Rather than analyzing an image only as a whole, it can focus on specific regions submitted by a user, whether they relate to objects or text and whether they are specified as a point, a box or another free-form shape, and it incorporates those regions into its queries.
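To make the idea of region-referenced queries more concrete, here is a minimal sketch of how such inputs could be represented. The `RegionReference` and `build_query` names, the `<region0>` placeholder syntax and the coordinate format are illustrative assumptions, not Ferret's actual prompt format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class RegionReference:
    """One referenced region in a user query: a point, a box, or a free-form shape."""
    kind: str                                                 # "point", "box", or "free_form"
    point: Optional[Tuple[float, float]] = None               # (x, y), normalized to [0, 1]
    box: Optional[Tuple[float, float, float, float]] = None   # (x1, y1, x2, y2)
    polygon: Optional[List[Tuple[float, float]]] = None       # free-form outline

def build_query(text: str, regions: List[RegionReference]) -> str:
    """Replace placeholders (<region0>, <region1>, ...) with coarse coordinates,
    so the language model sees where each referenced region lies."""
    for i, r in enumerate(regions):
        if r.kind == "point":
            coords = f"[{r.point[0]:.2f}, {r.point[1]:.2f}]"
        elif r.kind == "box":
            coords = "[" + ", ".join(f"{c:.2f}" for c in r.box) + "]"
        else:
            coords = "[free-form region]"
        text = text.replace(f"<region{i}>", coords)
    return text

# Example: ask about the object under a clicked point.
query = build_query("What is the object at <region0>?",
                    [RegionReference(kind="point", point=(0.42, 0.61))])
print(query)  # -> "What is the object at [0.42, 0.61]?"
```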

[Figure: overview of the Ferret architecture]

As shown in the figure above, Ferret consists of an image encoder that extracts image embeddings, a proposed spatial-aware visual sampler that extracts continuous regional features, and an MLLM that jointly models image, text and region features.
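The sketch below illustrates that pipeline in PyTorch, under stated assumptions: `FerretLikeModel`, the mask-weighted pooling used as a stand-in for the spatial-aware visual sampler, and the tensor shapes are all illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FerretLikeModel(nn.Module):
    """Minimal sketch of the described pipeline: an image encoder produces patch
    embeddings, a region sampler pools continuous features for a referenced region,
    and a language model consumes image, region and text tokens jointly."""
    def __init__(self, image_encoder: nn.Module, language_model: nn.Module, dim: int = 1024):
        super().__init__()
        self.image_encoder = image_encoder     # assumed to return (B, N, dim) patch features
        self.language_model = language_model   # assumed decoder-only LLM over embeddings
        self.region_proj = nn.Linear(dim, dim) # projects pooled region features into LLM space

    def sample_region(self, patch_feats: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
        """Average patch features falling inside the region mask (a simple stand-in
        for the paper's spatial-aware visual sampler)."""
        weights = region_mask / region_mask.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.einsum("bn,bnd->bd", weights, patch_feats)

    def forward(self, image, region_mask, text_embeds):
        patch_feats = self.image_encoder(image)                     # (B, N, dim)
        region_feat = self.sample_region(patch_feats, region_mask)  # (B, dim)
        region_tok = self.region_proj(region_feat).unsqueeze(1)     # (B, 1, dim)
        # Concatenate image, region and text tokens; the LLM models them jointly.
        inputs = torch.cat([patch_feats, region_tok, text_embeds], dim=1)
        return self.language_model(inputs)
```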

GRIT training

To train their model, the researchers created GRIT (Ground-and-Refer Instruction-Tuning), an instruction-tuning dataset for multimodal referring and grounding containing 1.1 million samples. GRIT covers various levels of spatial knowledge, including objects, relationships, region descriptions and complex reasoning. The dataset is drawn primarily from existing vision-language tasks that were converted into an instruction-following format using carefully designed templates.

Additionally, 34,000 instruction-tuning conversations were collected via ChatGPT/GPT-4 to facilitate the training of an open-vocabulary, instruction-following referring-and-grounding model. Finally, deliberate mining of spatial negative data was performed to strengthen the robustness of the model.
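As an illustration of how such instruction-tuning data could be derived from existing annotations, here is a small hypothetical sketch; the template strings, function names and the negative-sample wording are assumptions and do not reproduce the paper's actual templates.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical templates for turning a box-annotated caption into an
# instruction-following sample (illustrative only).
REFER_TEMPLATES = [
    "What is in the region {box}?",
    "Describe the area {box} of the image.",
]
GROUND_TEMPLATES = [
    "Where is the {label} in the image? Answer with a bounding box.",
]

def to_instruction_samples(label: str, caption: str,
                           box: Tuple[float, float, float, float]) -> List[Dict[str, str]]:
    """Convert one (label, caption, box) annotation into a referring sample
    (region -> description) and a grounding sample (phrase -> box)."""
    box_str = "[" + ", ".join(f"{c:.2f}" for c in box) + "]"
    return [
        {"instruction": random.choice(REFER_TEMPLATES).format(box=box_str),
         "response": caption},
        {"instruction": random.choice(GROUND_TEMPLATES).format(label=label),
         "response": f"The {label} is at {box_str}."},
    ]

def negative_sample(label: str, absent_label: str) -> Dict[str, str]:
    """Spatial negative: ask about an object that is not in the image,
    so the model learns to decline instead of hallucinating a box."""
    return {"instruction": f"Where is the {absent_label} in the image?",
            "response": f"There is no {absent_label} in the image."}

samples = to_instruction_samples("traffic light", "a red traffic light on a pole",
                                 (0.10, 0.05, 0.22, 0.30))
samples.append(negative_sample("traffic light", "stop sign"))
```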

Model performance

The researchers compared Ferret's performance to that of GPT-4V, the multimodal LLM with which OpenAI extended GPT-4 to image understanding.

They found that GPT-4V has limited ability when it comes to referring to small regions of an image, whereas Ferret excels in accuracy on this task. Although GPT-4V has extensive common-sense knowledge, Ferret stands out for its ability to provide precise bounding boxes, especially in applications that require high precision on smaller regions.

Ferret also excels at grounding, as tests with CAPTCHAs show: it recognized traffic lights accurately even in complex scenarios. Overall, GPT-4V excels at answering general questions, while Ferret shines in situations that require spatial precision and a detailed understanding of specific areas of an image.

The researchers point out that Ferret, like most MLLMs, can produce harmful or counterfactual responses. They plan to improve the model so that it can also generate segmentation masks in addition to bounding boxes.

Ferret provides a solid foundation for Apple's future advances in conversational AI, including visual search and other applications that require sophisticated contextual understanding.

Article references: “Ferret: Relate and Ground Anything, Anywhere, at Any Granularity.” arXiv:2310.07704v1

Authors: Haoxuan You (1), Haotian Zhang (2), Zhe Gan (2), Xianzhi Du (2), Bowen Zhang (2), Zirui Wang (2), Liangliang Cao (2), Shih-Fu Chang (1), Yinfei Yang (2).

Affiliations: (1) Columbia University, (2) Apple AI/ML