MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

We are excited to open-source 🍃MINT-1T, the first trillion token multimodal interleaved dataset and a valuable resource for the community to study and build large multimodal models.

This work was done in collaboration with many other great co-authors: Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Yejin Choi, and Ludwig Schmidt.

Background

Multimodal interleaved documents are sequences of images interspersed with text. This structure allows us to train large multimodal models that can reason across image and text modalities. Some of the most capable multimodal models, such as MM1, Chameleon, and Idefics2, have shown that training on interleaved data is important for attaining the best performance.
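
To make the interleaved format concrete, here is a minimal sketch of how such a document can be represented; the class and field names are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class InterleavedDocument:
    """An interleaved document: an ordered sequence of text and image segments."""
    segments: list = field(default_factory=list)

    def add_text(self, text: str) -> None:
        # Text spans are stored in reading order alongside images.
        self.segments.append({"type": "text", "content": text})

    def add_image(self, url: str, alt: str = "") -> None:
        # Images are kept exactly where they appeared in the source page.
        self.segments.append({"type": "image", "url": url, "alt": alt})

# A toy document with an image interleaved between two text spans.
doc = InterleavedDocument()
doc.add_text("Figure 1 shows the model architecture.")
doc.add_image("https://example.com/fig1.png", alt="Architecture diagram")
doc.add_text("The vision encoder processes the image before fusion with text.")
```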

Building MINT-1T

Our key principles behind curating 🍃MINT-1T are scale and diversity. While previous open-source datasets such as OBELICS and MMC4 contain at most 115 billion tokens, we collect 1 trillion tokens for 🍃MINT-1T, allowing practitioners to train much larger multimodal models. To improve the diversity of 🍃MINT-1T, we go beyond HTML documents and include web-scale PDFs and ArXiv papers. We find that these additional sources improve domain coverage, particularly for scientific documents.
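
As a rough sketch of how documents from the three sources might be pooled and their token contributions tallied, consider the following; `iter_documents` and the whitespace-based token count are placeholders (reusing the `InterleavedDocument` sketch above), not the actual curation pipeline.

```python
from collections import Counter

def iter_documents(source: str):
    """Hypothetical loader that streams parsed interleaved documents for a
    source ("html", "pdf", or "arxiv"); the real pipeline involves web-crawl
    parsing, PDF extraction, and ArXiv processing."""
    return iter(())  # placeholder: yields nothing in this sketch

def count_text_tokens(doc) -> int:
    # Stand-in tokenizer: whitespace-split the text segments only.
    return sum(len(seg["content"].split())
               for seg in doc.segments if seg["type"] == "text")

token_totals = Counter()
for source in ("html", "pdf", "arxiv"):
    for doc in iter_documents(source):
        token_totals[source] += count_text_tokens(doc)

# token_totals tracks each source's contribution toward the 1T-token total.
```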

Model Experiments

We validate 🍃MINT-1T by pre-training XGen-MM multimodal models (sampling 50% of tokens from HTML documents and the rest from PDFs/ArXiv). We evaluate our models on captioning and visual question answering benchmarks. We find that 🍃MINT-1T outperforms the previous leading multimodal interleaved dataset, OBELICS!
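
A minimal sketch of such a sampling mixture is below; it draws whole documents rather than tokens (the experiment above describes a token-level 50% split), and the pool names are hypothetical.

```python
import random

def sample_batch(html_docs, pdf_arxiv_docs, batch_size, html_fraction=0.5, seed=0):
    """Draw a batch where each slot comes from the HTML pool with
    probability `html_fraction` and from the PDF/ArXiv pool otherwise."""
    rng = random.Random(seed)
    return [
        rng.choice(html_docs if rng.random() < html_fraction else pdf_arxiv_docs)
        for _ in range(batch_size)
    ]

# Toy pools standing in for the real document collections.
html_docs = ["html_doc_1", "html_doc_2", "html_doc_3"]
pdf_arxiv_docs = ["pdf_doc_1", "pdf_doc_2", "arxiv_doc_1"]
print(sample_batch(html_docs, pdf_arxiv_docs, batch_size=4))
```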

Future Work

We are already training our new iteration of XGen-MM models on 🍃MINT-1T and we are excited to continue to share some of the best open-source datasets and models with the community. Stay tuned for more!

Explore More

Salesforce AI invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.

Acknowledgments

A big thank you to the infrastructure team, Srinath Reddy Meadusani and Lavanya Karanam, for their tremendous work and Paul Josel for helping us with figure design.
