TL;DR: We propose BLIP-2, a scalable multimodal pre-training method that enables any Large Language Model (LLM) to ingest and understand images. It unlocks zero-shot image-to-text generation and powers the world’s first open-sourced multimodal chatbot prototype.
OpenAI just released GPT-4, a powerful new multimodal AI model whose eye-catching capability is accepting image inputs to generate text. However, this capability is not new: it was already demonstrated by our BLIP-2 models and prototype, released on 30 January 2023. Our novel BLIP-2 method enabled us to build the world’s first open-sourced multimodal chatbot prototype. Below we discuss the differences between our BLIP-2 models and OpenAI’s GPT-4.
BLIP-2 vs. GPT-4
- Generic vs. Specific: BLIP-2 is a novel and generic vision-language pre-training methodology that can enable any family of LLMs to understand images and unlock zero-shot image-to-text generation capabilities. GPT-4 is a specific pre-trained model, and its technical novelty is unclear (not disclosed).
- Open-source vs. Closed-source (API-only): The code and models of BLIP-2 are open-sourced in the LAVIS library (https://github.com/salesforce/LAVIS) and also integrated into HuggingFace Transformers (https://huggingface.co/docs/transformers/main/model_doc/blip-2). GPT-4 is a closed-source model offered through a paid API service (text-only API as of now).
- Fast vs. Slow: BLIP-2 runs much faster than GPT-4. BLIP-2’s inference time is around 1 second per image on a single GPU. According to GPT-4’s launch livestream, its multimodal inference took nearly 40 seconds to process one image.
- Unsupervised learning vs. (presumably) Supervised learning: BLIP-2 is trained on large amounts of noisy image-text pairs automatically crawled from the Internet. Although GPT-4’s learning paradigm has not been disclosed, it is reasonable to infer from ChatGPT that GPT-4 may rely on large human-annotated datasets.
BLIP-2 is a scalable multimodal pre-training method that enables any LLM to understand images while keeping its parameters entirely frozen. It is significantly more compute-efficient than existing multimodal pre-training methods. Why? BLIP-2 effectively Bootstraps Language-Image Pre-training with frozen image encoders and frozen LLMs. For example, transforming an existing 11B-parameter LLM into a state-of-the-art multimodal foundation model requires training fewer than 2% of the parameters (only 188M trainable parameters).
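To make this concrete, here is a minimal PyTorch-style sketch of what "keeping the image encoder and LLM frozen" looks like in practice. The function and module names below are placeholders for illustration, not the actual LAVIS implementation:

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradients so the module stays fixed during pre-training."""
    for p in module.parameters():
        p.requires_grad = False

def build_blip2_like_model(image_encoder: nn.Module,
                           q_former: nn.Module,
                           llm: nn.Module) -> nn.ModuleDict:
    # Only the Q-Former (plus a small projection) receives gradient updates.
    freeze(image_encoder)
    freeze(llm)
    return nn.ModuleDict({
        "image_encoder": image_encoder,  # frozen, e.g. a ViT
        "q_former": q_former,            # trainable (~188M parameters in BLIP-2)
        "llm": llm,                      # frozen, e.g. OPT or FlanT5
    })

def trainable_fraction(model: nn.Module) -> float:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total  # well under 2% at BLIP-2 scale
```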
BLIP-2 is the first to unlock the capability of zero-shot instructed image-to-text generation. Given an input image, BLIP-2 can generate various natural language responses according to the user’s instruction. The following figure shows some examples from BLIP-2.
Example outputs from BLIP-2
How does BLIP-2 work? Let’s take a deeper look at our pre-training method.
How BLIP-2 works
For LLMs to understand visual content, the key is to bridge the vision-language modality gap. Since LLMs have not seen any images during their natural language pre-training, it is challenging to bridge the modality gap, especially when the LLMs remain frozen. To this end, we propose a Querying Transformer (Q-Former) pre-trained with a new two-stage pre-training strategy. As shown in the following figure, after pre-training, the Q-Former can effectively act as a bridge between a frozen image encoder and a frozen LLM, thus closing the modality gap.
Overview of BLIP-2 two-stage pre-training strategy
The first stage is vision-and-language representation learning. In this stage, we connect the Q-Former to a frozen image encoder and pre-train with image-text pairs. The Q-Former learns to extract the image features that are most relevant to the corresponding text. We adapt the pre-training objectives from BLIP (https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) for vision-and-language representation learning.
Overview of Q-Former and the first stage of vision-language representation learning in BLIP-2
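As a rough illustration of the representation-learning idea, the sketch below shows a simplified image-text contrastive loss over the Q-Former's query outputs. It is a sketch only: the actual first stage jointly optimizes additional objectives (image-text matching and image-grounded text generation), and the tensor shapes and names here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def itc_loss(query_feats: torch.Tensor,   # (B, num_queries, D) Q-Former query outputs
             text_feats: torch.Tensor,    # (B, D) pooled text embeddings
             temperature: float = 0.07) -> torch.Tensor:
    query_feats = F.normalize(query_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Similarity between each text and every query of every image;
    # keep the best-matching query per image-text pair.
    sim = torch.einsum("bqd,cd->bcq", query_feats, text_feats).max(dim=-1).values
    sim = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    # Symmetric InfoNCE loss: image-to-text and text-to-image directions.
    return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
```

Intuitively, the contrastive objective pulls each image's query outputs toward its paired text and pushes them away from unpaired texts, which is what forces the queries to extract text-relevant visual features.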
The second stage is vision-to-language generative learning. In this stage, we connect the output of Q-Former to a frozen LLM. We pre-train the Q-Former such that its output features can be interpreted by the LLM to generate the corresponding text. We experiment with both decoder-based LLMs (e.g. OPT) and encoder-decoder-based LLMs (e.g. FlanT5).
Overview of the second stage of vision-to-language generative learning in BLIP-2
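Conceptually, this stage boils down to projecting the Q-Former's output queries into the frozen LLM's embedding space and prepending them to the text embeddings as soft visual prompts, with a standard language-modeling loss on the text. The sketch below illustrates that idea only; the class and argument names are hypothetical, not the LAVIS code:

```python
import torch
import torch.nn as nn

class VisionToLLMBridge(nn.Module):
    """Projects Q-Former outputs into the frozen LLM's input embedding space."""

    def __init__(self, qformer_dim: int, llm_embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_embed_dim)

    def forward(self,
                query_output: torch.Tensor,   # (B, num_queries, qformer_dim)
                text_embeds: torch.Tensor     # (B, T, llm_embed_dim)
                ) -> torch.Tensor:
        visual_prompt = self.proj(query_output)      # (B, num_queries, llm_embed_dim)
        # The frozen LLM consumes [visual prompt; text embeddings] and is
        # trained to generate the corresponding text conditioned on the prompt.
        return torch.cat([visual_prompt, text_embeds], dim=1)
```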
During inference, we simply append the text instruction after the Q-Former’s output as input to the LLM. We have experimented with various image encoders and LLMs and arrived at a promising observation: a stronger image encoder and a stronger LLM both lead to better performance with BLIP-2. This indicates that BLIP-2 is a generic vision-language pre-training method that can efficiently harvest the rapid advances in the vision and natural language communities. BLIP-2 is an important step towards building a multimodal conversational AI agent.
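For readers who want to try this themselves, the HuggingFace Transformers integration linked above exposes BLIP-2 directly. The snippet below is a typical usage sketch; the checkpoint name and example image URL are just one possible choice:

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The text instruction is simply appended after the visual prompt, as described above.
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```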
Community attention and efforts after BLIP-2 was released and open-sourced!
BLIP-2 has been extensively discussed and actively used by the AI community.
Check out these projects and resources that use BLIP & BLIP-2 for various tasks!
- BLIP-2 + ChatGPT: https://github.com/Vision-CAIR/ChatCaptioner
- BLIP + ChatGPT: https://github.com/microsoft/visual-chatgpt
- ImageSEO: https://wordlift.io/blog/en/image-seo-using-ai/
- BLIP + DreamBooth: https://github.com/KaliYuga-ai/DreamBooth_With_Dataset_Captioning/blob/main/DreamBooth_With_Dataset_Captioning.ipynb
- BLIP-2 on Huggingface: https://huggingface.co/blog/blip-2
- BLIP Blog (Previously released model): https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/
The Bottom Line
We’ve proposed BLIP-2, a novel, scalable multimodal pre-training method that transforms any LLM into a multimodal foundation model. Powered by the family of BLIP-2 pre-trained models, we’ve developed and released the world’s first open-sourced multimodal chatbot prototype.
There is still a lot of room to improve BLIP-2. Will BLIP-2 improve with supervised finetuning? How will BLIP-2 be useful for image generation? We look forward to further improving it with community feedback and exploring new use cases. Stay tuned for more exciting research!
Explore more
- Read more details in our research paper: https://arxiv.org/abs/2301.12597
- Code: https://github.com/salesforce/LAVIS/tree/main/projects/blip2
- More about LAVIS – a one-stop vision-language library: https://blog.salesforceairesearch.com/lavis-language-vision-library/
- Contact: Junnan Li at junnan.li@salesforce.com
- Follow us on Twitter: @SFResearch @Salesforce
- Visit our main website to learn more about all of the exciting projects Salesforce AI is working on: https://www.salesforceairesearch.com