
2024 in AI Research: Building Blocks for the Agentic Era Ahead

From new open source models to evaluation frameworks, our AI Research team has been moving the needle in AI. Take a look at some of our 2024 highlights.

As we enter this new year, we reflect on last year’s extraordinary – and yet systematic – progress that laid crucial foundations for this emerging era of agentic AI. Our Salesforce AI Research team delivered significant contributions across model architectures, evaluation frameworks, applied systems, and more.

Below, we’ve curated a handful of the research milestones that helped shape the landscape of AI research in 2024. From the xLAM family’s advances in model scaling to pioneering work in asynchronous tool use, these developments help bridge the gap between academic innovation and enterprise-grade deployment. Each project connects to our broader evolution toward more capable, reliable AI systems – particularly those that can serve as effective agents in complex environments. 

As we build upon these foundations, we’re grateful to be part of a vibrant research community that continues to expand the horizons of what’s possible in AI.

xLAM Family: The Future of Scalable AI

In 2024, we released the xLAM family of models, ranging from the efficient 1B-parameter "tiny giant" to the powerful 8x22B variant. The family demonstrated how models at different scales can be optimized for specific use cases, proving that effective AI isn't always about bigger models.
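
If you want to experiment, the smaller checkpoints load with the standard Hugging Face transformers API. Here is a minimal sketch, assuming the checkpoint ID and tool-calling format from the public model card; verify both against the repository before relying on them:

```python
# Minimal sketch: loading a small xLAM checkpoint for function calling.
# The checkpoint ID and tool schema are assumptions from the public
# model card; check the repository for the exact prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/xLAM-1b-fc-r"  # the 1B "tiny giant" (assumed ID)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A tool the model can choose to call, in the standard JSON-schema style.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather in San Francisco?"}]

inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```

The same pattern scales to the larger variants; only the checkpoint ID and the hardware requirements change.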

Moirai-MoE: Efficiency Meets Performance

The Moirai Mixture-of-Experts (MoE) model demonstrated remarkable efficiency in time series forecasting, delivering superior performance with 65x fewer parameters and a 17% improvement in accuracy. This breakthrough exemplifies the potential of more efficient AI architectures.
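
To make the efficiency argument concrete, the sketch below shows the generic idea behind sparse mixture-of-experts layers: a lightweight gate routes each input to its top-k experts, so only a fraction of the total parameters is active per token. This is a textbook PyTorch illustration, not Moirai-MoE's actual architecture:

```python
# Generic sketch of sparse top-k mixture-of-experts routing in PyTorch.
# Illustrates the idea behind MoE efficiency; not the Moirai-MoE code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model). Each input activates only its top-k experts.
        scores = F.softmax(self.gate(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return out
```

Because only k experts run per input, parameter count can grow without a proportional increase in compute, which is the lever behind headline numbers like 65x fewer parameters.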

LLM Benchmark for CRM: How to Fast-Track Your AI Use Cases

We added agentic benchmarking to our LLM Benchmark for CRM dashboard, giving you a reliable guide for selecting the right LLM for your business based on trust, accuracy, and cost.

SFR-RAG: Building Contextually Faithful RAG Applications

SFR-RAG is our 9B-parameter LLM specialized for Retrieval Augmented Generation (RAG) use cases, outperforming larger baselines on key RAG benchmarks. It is optimized to fully leverage retrieved context, pushing the limits of contextual understanding and faithful response generation.
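
The recipe such a model is optimized for looks roughly like this: retrieve relevant passages, pack them into the prompt, and instruct the model to answer only from that context. A minimal sketch, with a placeholder retriever and an assumed checkpoint ID (substitute any instruction-tuned model you have access to):

```python
# Minimal RAG sketch: retrieve passages, pack them into the prompt, and
# ask the model to answer strictly from that context. The retriever and
# the checkpoint ID are illustrative placeholders, not SFR-RAG's exact API.
from transformers import pipeline

def retrieve(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    # Placeholder lexical retriever: rank passages by word overlap.
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))
    return ranked[:top_k]

corpus = [
    "Salesforce was founded in 1999 in San Francisco.",
    "Quarterly revenue grew 11% year over year.",
]
query = "When was Salesforce founded?"
context = "\n".join(retrieve(query, corpus))

generator = pipeline("text-generation", model="Salesforce/SFR-RAG")  # assumed ID
prompt = (
    "Answer using only the context below. If the answer is not in the "
    f"context, say so.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
)
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```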

xGen-MM (BLIP): A Family of Open Large Multimodal Models

This year we introduced xGen-MM, our first family of in-house open-source multimodal LLMs. These models enable AI assistants and agents to read multimodal content such as images and videos and to generate an answer or an action plan. All of the models are trained on 1 billion images and 100 billion text tokens, and they achieve state-of-the-art accuracy on five benchmark datasets compared with models of similar size.

SFR-Judge: Accelerating Your Model Evaluation and Fine-tuning

SFR-Judge is our latest family of generative judge models (8B, 12B, and 70B parameters), specializing in evaluating LLM outputs with natural language explanations. SFR-Judge models outperform powerful proprietary and open-source judge models on a wide range of evaluation benchmarks, while providing actionable feedback for downstream model improvement.
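
In practice, a generative judge takes an instruction plus candidate responses and returns an explanation followed by a verdict. A minimal pairwise-comparison sketch, with an assumed checkpoint ID and an illustrative prompt (the model card documents the exact evaluation template):

```python
# Sketch of LLM-as-judge pairwise evaluation. The checkpoint ID and
# prompt wording are assumptions for illustration only.
from transformers import pipeline

judge = pipeline("text-generation", model="Salesforce/SFR-Judge-8B")  # assumed ID

prompt = """You are an impartial judge. Compare the two responses to the
instruction, explain your reasoning, then give a verdict of A or B.

Instruction: Explain what a REST API is in one sentence.
Response A: A REST API is a web interface that exposes resources over
HTTP using standard verbs like GET and POST.
Response B: REST APIs are things computers use.

Explanation and verdict:"""
print(judge(prompt, max_new_tokens=200)[0]["generated_text"])
```

The natural-language explanation is what makes the output usable as feedback for downstream fine-tuning, rather than just a bare score.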

SFR-Embedding: Enhance Text Retrieval with Transfer Learning

SFR-Embedding is an advancement in text-embedding models that builds on the groundwork laid by its predecessors, E5-mistral-7b-instruct and Mistral-7B-v0.1. The model rose to the top of the MTEB benchmark with an impressive average score of 67.6 across 56 datasets, standing out in particular on retrieval tasks (finding specific information) and clustering tasks (grouping related items).
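
Retrieval with an embedding model reduces to encoding queries and documents into vectors and ranking by cosine similarity. A minimal sketch using sentence-transformers; the checkpoint ID follows the public release, but treat it as an assumption and verify it against the model card:

```python
# Sketch: embed a query and candidate documents, then rank documents by
# cosine similarity. The checkpoint ID is assumed from the public release.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")  # assumed ID

query = "How do I reset a customer's password?"
docs = [
    "To reset a password, open Setup and select the user record.",
    "Quarterly revenue grew 11% year over year.",
]
q_emb = model.encode(query, normalize_embeddings=True)
d_emb = model.encode(docs, normalize_embeddings=True)
print(util.cos_sim(q_emb, d_emb))  # higher score = more relevant document
```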

CRMArena: Understanding the Capacity of LLM Agents to Perform CRM Tasks

Introducing CRMArena – a work-oriented benchmark for LLM agents to prove their mettle in real-world business scenarios! CRMArena features nine distinct tasks within a complex business environment filled with rich and realistic data, all validated by domain experts. We’ve found that current LLM agents aren’t cut out for the hustle of corporate life!

Asynchronous Tool Usage for Real-Time Agents

Our specially fine-tuned model fluidly multitasks in real time: it takes new requests while processing others and switches context naturally when interrupted. Think classical OS-inspired event-driven architecture meets modern LLMs.
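
The OS analogy translates naturally into an event loop: long-running tool calls execute as background tasks while the agent stays responsive to new input. Below is a generic asyncio sketch of that control flow, not the fine-tuned model's actual runtime:

```python
# Generic event-driven agent loop: tool calls run concurrently as tasks,
# so new requests are accepted while earlier work is still in flight.
import asyncio

async def slow_tool(request: str, seconds: float = 1.0) -> str:
    await asyncio.sleep(seconds)  # stands in for a real tool call
    return f"done: {request}"

async def agent_loop(requests: asyncio.Queue) -> None:
    pending: set[asyncio.Task] = set()
    get_request = asyncio.create_task(requests.get())
    while True:
        done, _ = await asyncio.wait(
            pending | {get_request}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            if task is get_request:
                req = task.result()
                if req is None:                     # shutdown signal
                    await asyncio.gather(*pending)  # drain remaining work
                    return
                print(f"accepted: {req}")           # respond immediately
                pending.add(asyncio.create_task(slow_tool(req)))
                get_request = asyncio.create_task(requests.get())
            else:
                pending.discard(task)
                print(task.result())                # surface tool output

async def main() -> None:
    q: asyncio.Queue = asyncio.Queue()
    loop_task = asyncio.create_task(agent_loop(q))
    await q.put("check calendar")
    await q.put("draft email")  # arrives while the first is in flight
    await asyncio.sleep(2.0)
    await q.put(None)
    await loop_task

asyncio.run(main())
```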

MINT-1T: Breaking the Trillion-Token Barrier

The release of MINT-1T marked a milestone in multimodal data scaling, becoming the first trillion-token interleaved dataset. This breakthrough significantly advanced open-source machine learning capabilities, setting new standards for data scale and quality.

PROVE: Revolutionizing VLM Evaluation

The PROVE benchmark emerged as a game-changer in our approach to evaluating visual language models (VLMs). By introducing programmatic validation across 10,500 grounded question-answer pairs, PROVE directly addresses one of the most persistent challenges in VLM development: the “plausible but wrong” problem. This innovative framework moves beyond traditional evaluation methods by providing concrete, measurable metrics for assessing hallucination rates in visual language models.
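
The core trick is that each question-answer pair comes with a check that can be executed against a structured representation of the image, so "plausible but wrong" answers fail mechanically instead of slipping past a fuzzy text match. A toy illustration of that idea (the scene format here is invented, not PROVE's actual schema):

```python
# Toy version of programmatic answer verification: check a model's claim
# against a structured scene representation instead of a reference string.
scene = {"objects": [{"name": "dog", "color": "brown"},
                     {"name": "ball", "color": "red"}]}

def verify_color(scene: dict, obj: str, claimed: str) -> bool:
    """True only if the scene actually contains `obj` with color `claimed`."""
    return any(o["name"] == obj and o["color"] == claimed
               for o in scene["objects"])

# Q: "What color is the ball?"
print(verify_color(scene, "ball", "red"))   # True: grounded answer
print(verify_color(scene, "ball", "blue"))  # False: plausible but wrong
```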

Gift-Eval: Setting New Standards

Gift-Eval introduced a rigorous, standardized methodology for evaluating general time series forecasting models, and it has become instrumental in benchmarking forecasting capabilities across diverse tasks and domains.

CodeTree: A Milestone in Code Generation

CodeTree represents a significant leap forward in automated programming. By achieving remarkable scores of 95.1% on HumanEval and 98.7% on MBPP, this unified framework demonstrated the power of combining tree-based exploration with LLM guidance. These results set new standards for code generation accuracy and reliability.
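
The general strategy is easy to state: treat candidate programs as nodes in a tree, score each node by executing it against tests, and let an LLM propose refinements of the most promising candidates. The sketch below illustrates that loop with a hypothetical llm_propose hook; it is not the CodeTree implementation:

```python
# Generic sketch of tree-based code search guided by test feedback.
# `llm_propose` is a hypothetical hook for an LLM call.
import heapq

def run_tests(code: str, tests: list[tuple]) -> float:
    """Fraction of tests passed; candidates are assumed to define solve()."""
    env: dict = {}
    try:
        exec(code, env)
        passed = sum(1 for args, want in tests if env["solve"](*args) == want)
        return passed / len(tests)
    except Exception:
        return 0.0

def llm_propose(parent_code: str, score: float, n: int = 3) -> list[str]:
    # Hypothetical: ask an LLM for n refinements of parent_code,
    # conditioned on how well it scored.
    raise NotImplementedError

def tree_search(seed: str, tests: list[tuple], budget: int = 20) -> str:
    frontier = [(-run_tests(seed, tests), 0, seed)]  # max-heap via negation
    best_score, best_code, tick = -frontier[0][0], seed, 0
    while frontier and tick < budget and best_score < 1.0:
        neg_score, _, code = heapq.heappop(frontier)
        for child in llm_propose(code, -neg_score):  # expand best node
            tick += 1
            score = run_tests(child, tests)
            if score > best_score:
                best_score, best_code = score, child
            heapq.heappush(frontier, (-score, tick, child))
    return best_code
```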

INDICT: Towards Better Code Generation by Both Security and Helpfulness

INDICT coordinates an autonomous agent system built around two critic models: one focused on the safety and the other on the helpfulness of the outputs produced by the "actor" code-generation LLM.
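
The control flow is a simple actor-critic loop: after each generation, both critics review the code, and their combined feedback conditions the next attempt. A sketch of that loop with hypothetical model-call hooks, illustrating the structure rather than INDICT's exact prompts:

```python
# Dual-critic refinement loop: a safety critic and a helpfulness critic
# each review the actor's code; their feedback drives the next revision.
# All three model calls are hypothetical hooks for LLM requests.
def actor_generate(task: str, feedback: list[str]) -> str:
    raise NotImplementedError  # code-generation LLM call

def safety_critique(code: str) -> str:
    raise NotImplementedError  # critic focused on vulnerabilities

def helpfulness_critique(code: str, task: str) -> str:
    raise NotImplementedError  # critic focused on solving the task

def dual_critic_loop(task: str, rounds: int = 3) -> str:
    feedback: list[str] = []
    code = ""
    for _ in range(rounds):
        code = actor_generate(task, feedback)
        feedback = [safety_critique(code), helpfulness_critique(code, task)]
    return code
```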

Shared Imagination of AI

One of the year’s most fascinating discoveries revealed that different LLMs share remarkably similar “hallucinations,” agreeing on imaginary facts with 86% accuracy. This research opens up intriguing questions about the nature of AI creativity and the underlying patterns in model behavior, suggesting deeper implications for our understanding of artificial intelligence.

Looking Ahead

These innovations of 2024 have laid crucial groundwork for the future of AI research. From evaluation frameworks to efficient architectures, each breakthrough contributes to a more robust, efficient, and capable AI ecosystem. As we move into 2025, these achievements will undoubtedly influence the next generation of AI research and development.

Note: All projects mentioned are publicly available through their respective repositories and papers, with code and datasets accessible to the research community.
