How can a computer discern the meaning of a sentence? By “discern its meaning,” I’m referring to grasping the words within the sentence, their context, nuances, and relationships with other words. When we think of artificial intelligence (AI) in the realm of large language models (LLMs), we think of text generation. But how does the computer truly understand your query when you ask “Will AI ever be able to understand my cat’s meows and translate them into English?” This falls within the domain of natural language understanding (NLU), and more specifically, text embedding. From retrieving documents to classifying the sentiment of a product review, text embedding is the unsung hero powering our modern-day AI systems.
At Salesforce AI Research, we understand the significance of a high-performing text-embedding model and the pivotal role it plays in representing textual data well and capturing semantic relationships and nuances effectively. This is why we released SFR-Embedding, the top-performing text-embedding model on the MTEB benchmark. Before we dive into what makes it stand out, let’s first see how text-embedding models work.
How do text-embedding models work?
Text-embedding models (also known as vector encoders) are like translators for computers—they take words, phrases, or even entire paragraphs of text and convert them into a format that computers can understand and work with more easily.
Consider the query above: “Will AI ever be able to understand my cat’s meows and translate them into English?” To a computer this is unstructured data, meaning it has no pre-defined organization or format. A text-embedding model takes this data (each word or phrase) and turns it into a set of numbers (a vector) that captures its meaning and context. Vectors are a format computers can easily interpret and manipulate: they turn the data into a structured form that is straightforward to analyze. For example, instead of working with the word “cat” directly, the model represents it as a series of numbers that sit close to the numbers representing related concepts, such as what a cat looks like, how it behaves, and the words that tend to appear around it. By comparing these sets of numbers, a computer can measure how similar or different words are in meaning.
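To make this concrete, here is a minimal sketch in plain Python. The four-dimensional vectors below are invented purely for illustration (real embedding models produce vectors with hundreds or thousands of dimensions), and cosine similarity is one common way to compare embeddings:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    # Values near 1.0 mean the vectors point in nearly the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; the numbers are hand-made for this
# example, not the output of a real model.
embeddings = {
    "cat":    [0.90, 0.80, 0.10, 0.20],
    "kitten": [0.85, 0.75, 0.15, 0.25],
    "car":    [0.10, 0.20, 0.90, 0.80],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # close to 1.0
print(cosine_similarity(embeddings["cat"], embeddings["car"]))     # much lower
```

Because related words get nearby vectors, “cat” and “kitten” score high while “cat” and “car” do not, which is exactly the kind of comparison described above.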
Text-embedding models significantly enhance retrieval performance, which measures how well a system can find relevant information based on a user’s query. The improvement comes from vectors enabling more accurate matching of queries with relevant documents: by comparing the vector of the query with the vectors of the documents in the database, a system can quickly identify the most relevant information, making searches faster and more efficient. In practice, such a system first analyzes the meaning of your query, then finds similar documents that may address it, and finally generates an answer for you.
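The retrieval step can be sketched with toy vectors. The document vectors below are hand-made stand-ins for what a real text-embedding model would produce; the system simply ranks documents by their cosine similarity to the query vector:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical pre-computed document vectors; a real system would obtain
# these from a text-embedding model.
doc_vectors = {
    "Cats communicate through vocalizations like meowing.": [0.9, 0.7, 0.1],
    "How to change a car tire in five steps.":              [0.1, 0.2, 0.9],
    "A guide to understanding feline body language.":       [0.8, 0.6, 0.2],
}

# Pretend embedding of the cat-meow query from the introduction.
query_vector = [0.85, 0.65, 0.15]

# Rank documents from most to least similar to the query.
ranked = sorted(doc_vectors,
                key=lambda d: cosine(query_vector, doc_vectors[d]),
                reverse=True)
```

The two cat-related documents rank ahead of the tire-change article, even though none of them repeats the query’s exact words: the match happens in vector space, not by keyword overlap.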
At Salesforce’s AI Research team, we recognize the vital role that text embedding plays in our artificial intelligence endeavors. CRM systems frequently handle vast amounts of data, encompassing customer emails, feedback, notes, and more. It is essential for us to quickly comprehend this influx of information and seamlessly connect individuals with the data they require. This understanding drives our AI team’s development of the cutting-edge text-embedding model known as SFR-Embedding.
Introducing Salesforce Research’s SFR-Embedding: The top performing text-embedding model
SFR-Embedding is a groundbreaking advancement in text-embedding models that builds upon the solid groundwork laid by its predecessors, E5-mistral-7b-instruct and Mistral-7B-v0.1. This innovative model has swiftly risen to the top, boasting an impressive average score of 67.6 across 56 datasets in the prestigious MTEB benchmark.
What makes SFR-Embedding stand out is its performance in tasks like finding specific information (retrieval) and grouping related items together (clustering). Compared to its predecessor E5-mistral-7b-instruct, its retrieval score jumps from 56.9 to an impressive 59.0, and its clustering score improves by +1.4. The model is trained on a mix of tasks, including finding information, organizing data, and categorizing content, which makes it versatile enough to handle a wide range of challenges. It also uses some clever training techniques, such as grouping similar tasks together during training and focusing on the most informative bits of information. The result is a model that is not only accurate but also smart and adaptable, making it a top choice for all sorts of projects.
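The post mentions grouping similar tasks during training and focusing on the most informative examples. A common way to train retrieval embeddings is a contrastive (InfoNCE-style) loss over a positive passage and hard negatives; the sketch below illustrates that general idea in plain Python and is not SFR-Embedding’s actual training code:

```python
import math

def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """Contrastive (InfoNCE-style) loss for one training example.

    Pulls the query embedding toward its positive passage and pushes it
    away from negatives. Hard negatives, i.e. passages that look similar
    to the query but are not correct answers, give the strongest signal.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # Similarity of the query to the positive, then to each negative.
    scores = [cos(query_vec, positive_vec)]
    scores += [cos(query_vec, n) for n in negative_vecs]
    # Softmax over the scores; the loss is low when the positive wins.
    exps = [math.exp(s / temperature) for s in scores]
    return -math.log(exps[0] / sum(exps))
```

In real training, the vectors come from the embedding model being trained, and gradients of this loss update its weights. Grouping examples from the same task into a batch, as the post describes, keeps the in-batch negatives meaningful rather than trivially easy.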
A Crucial Step for Generative AI Responses
As we mentioned before, text embedding is all about turning unstructured data into numerical values called vectors that preserve its meaning and nuances. Our SFR-Embedding model performs best in retrieval tasks, which means it is effective at finding and selecting relevant information from a large dataset or collection of documents. Retrieval is a crucial component of a Retrieval Augmented Generation (RAG) workflow: once the relevant passages are retrieved, they serve as context or knowledge for a generative model, such as a language model, to generate a response or output.
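A minimal RAG sketch, assuming hand-made toy vectors in place of a real embedding model and a template string in place of a real generative model, looks like this:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Step 1 -- Retrieve: find the passage whose (toy, hand-made) vector is
# closest to the query vector. A real system would embed the text with an
# embedding model instead of using these invented numbers.
passages = {
    "Cats often meow to communicate with humans.": [0.9, 0.7, 0.1],
    "Tire pressure should be checked monthly.":    [0.1, 0.2, 0.9],
}
query_vector = [0.85, 0.65, 0.15]
best_passage = max(passages, key=lambda p: cosine(query_vector, passages[p]))

# Step 2 -- Generate: hand the retrieved passage to a generative model as
# context. The f-string below is a stand-in for a real language-model call.
prompt = (f"Context: {best_passage}\n"
          f"Question: Will AI ever understand my cat's meows?\n"
          f"Answer:")
```

The quality of the final answer depends heavily on Step 1: if the retriever surfaces the wrong passage, the generator has the wrong context, which is why strong retrieval performance matters so much in a RAG workflow.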
Conclusion
The SFR-Embedding model represents a significant advancement in text-embedding technology, delivering top-tier performance across various tasks and domains. Built upon robust foundations and leveraging innovative techniques like multi-task training and optimized hard negatives, this model excels in discerning relevant information with precision. Whether you’re tackling retrieval tasks, clustering, or classification challenges, our model stands ready to meet your needs with both efficiency and accuracy.
Explore more
Salesforce AI invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.
- SFR Embedding Technical Blog
- Hugging Face
- Salesforce AI Research Website
- Follow us on X (Previously Twitter): @SFResearch, @Salesforce
Acknowledgments
SFR-Embedding was a collaboration within the Salesforce AI Research team: Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, Semih Yavuz.