
Jagged Intelligence in the Enterprise

[Illustration: a confused robot]
Will jaggedness decrease or increase as foundation models continue to evolve? This question is at the very heart of enterprise AI because it informs what kinds of applications will be most successful, and how to build them. [Image credit: Adobe Stock]

In the rapidly evolving landscape of artificial intelligence (AI), we're witnessing a fascinating paradox. AI systems continue to shatter performance records on increasingly complex benchmarks, while sporadically struggling with simpler tasks that most humans find intuitive and can reliably solve. This dichotomy is known as jagged intelligence1. In this piece, we discuss the phenomenon of jaggedness, investigate how jaggedness in frontier models has evolved over time, and examine how the emergence of reasoning models has affected it. In particular, we introduce a public benchmark for jaggedness featuring simple reasoning tasks solvable by high schoolers. Jaggedness is of particular interest for enterprise AI applications, where many problems are relatively commonplace and require human context and reliability more than sophisticated math-solving abilities.

Since ChatGPT's launch in 2022, foundation models have largely been commercialized as chat interfaces, where users send messages and receive instant responses. However, reasoning models such as OpenAI's o3 and DeepSeek's R1 can use additional computation at inference time to generate an internal monologue that serves as a scratch pad for step-by-step work before responding to the user's message. This has resulted in significant capability gains on many reasoning-oriented tasks such as mathematics, engineering, and software development. Just when it seemed that progress in one AI training paradigm (scaling pretraining compute) might be slowing down, a new optimization regime (scaling inference-time compute) has emerged.
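To make the inference-time-compute knob concrete, here is a minimal sketch of requesting more deliberation from a reasoning model. It assumes the OpenAI Python SDK and a model that accepts the reasoning_effort hint; support varies by model and SDK version.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# o-series reasoning models accept an effort hint ("low", "medium", "high")
# that trades latency and token cost for a longer internal scratch pad.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[
        {"role": "user", "content": "Where does Thanksgiving come before Christmas?"}
    ],
)
print(response.choices[0].message.content)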

Figure 1: Progress on various AI capability benchmarks over time. Benchmark results are gathered from publicly available sources, such as epoch.ai.

Most well-known AI benchmarks to date focus on measuring the performance ceiling of frontier models on increasingly complex tasks, from solving graduate-level problem sets to writing sophisticated code. Once a benchmark has been nearly saturated, the community creates a new one with a higher ceiling to reset baselines for measuring the next set of top-line capabilities. Tracking average-case performance on these challenging benchmarks fails to capture a more nuanced picture around reliability or worst-case performance on simpler tasks, obscuring the jaggedness that persists even on "solved" problems.

Today’s models outperform the average human on some tasks yet struggle on others, often in ways that defy predictable patterns. Unlike traditional computing tools—such as calculators for arithmetic, regex for string matching, or photo editors for image manipulation—AI’s strengths and weaknesses don’t follow an intuitive logic. Despite years of advancements in modeling, data preparation, and training procedures, the whack-a-mole nature of jagged intelligence remains a persistent challenge in foundation models.

The key distinction is this: top-line capability vs. jaggedness.

  • Top-line capability measures accuracy on increasingly difficult tasks. This is what makes headlines—AI acing medical exams, passing the bar, or medaling in math olympiads.
  • Jaggedness measures whether AI can perform consistently on simpler tasks that humans can do without specialized training. This is what determines the robustness of an AI system and informs the kinds of guardrails needed for critical automations2 (a minimal sketch of both metrics follows below).
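
To make the distinction concrete, here is a minimal sketch that scores the same set of model runs both ways. The record format, difficulty tiers, and pass/fail values are illustrative assumptions, not the schema of any benchmark discussed here:

from statistics import mean

# Hypothetical per-task results: each record carries the task's
# human-difficulty tier and pass/fail outcomes across repeated runs.
results = [
    {"difficulty": "expert", "passes": [1, 1, 0, 1]},
    {"difficulty": "easy", "passes": [1, 0, 1, 0]},
    {"difficulty": "easy", "passes": [0, 1, 1, 1]},
]

def tier_accuracy(tier: str) -> float:
    rows = [r for r in results if r["difficulty"] == tier]
    return mean(mean(r["passes"]) for r in rows)

# Top-line capability: average accuracy on the hardest tier.
top_line = tier_accuracy("expert")

# Jaggedness: failure rate on tasks most humans find easy; the higher it is,
# the heavier the guardrails a critical automation needs.
jaggedness = 1 - tier_accuracy("easy")

print(f"top-line capability: {top_line:.2f}, jaggedness: {jaggedness:.2f}")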

A good benchmark for jaggedness should focus on problems that are relatively easy for humans, yet on which AI models still exhibit poor performance. We have spent many hours working with foundation models on both top-line capability and jaggedness, and it is worthwhile to dig into how jaggedness evolves over time and what kinds of tasks exhibit it3.

How is Jaggedness Evolving over Time?

Will jaggedness decrease or increase as foundation models continue to evolve? This question is at the very heart of enterprise AI because it informs what kinds of applications will be most successful, and how to build them.

To make jaggedness quantifiable, we need jaggedness-oriented public benchmarks, but none currently exist4.

To address this, we create and release SIMPLE5, an initial dataset of 225 simple reasoning questions. Every question in this dataset is designed to be easy, with at least 10% of high schoolers expected to solve it6 (though most questions are even simpler).
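
For a sense of how such a dataset might be consumed, here is a minimal evaluation-loop sketch. The file name (simple.jsonl), record fields, and containment-based grading rule are all assumptions for illustration; the released benchmark may define its own format and judging protocol:

import json

def grade(model_answer: str, reference: str) -> bool:
    # Simple-reasoning answers are short, so a containment check is a
    # plausible first pass; a human or LLM judge would be more robust.
    return reference.lower().strip(" .") in model_answer.lower()

def evaluate(answer_fn, path: str = "simple.jsonl") -> float:
    # Each JSONL line is assumed to hold {"question": ..., "answer": ...}.
    with open(path) as f:
        items = [json.loads(line) for line in f]
    graded = [grade(answer_fn(item["question"]), item["answer"]) for item in items]
    return sum(graded) / len(graded)

# Usage: evaluate(lambda q: my_model.respond(q)) returns accuracy in [0, 1].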

Example Questions from SIMPLE

Question:

A man has to get a fox, a chicken, and a sack of corn across a river.
He has a rowboat, and it can only carry him and three other things.
If the fox and the chicken are left together without the man, the fox will eat the chicken.
If the chicken and the corn are left together without the man, the chicken will eat the corn.
How does the man do it in the minimum number of steps?

Answer:

They can all go in one trip since the boat can carry him and 3 other things.

Both state-of-the-art reasoning models we tested (o1 and o3-mini-high) regurgitate the usual solution to this riddle, without noticing that the changed constraint (a boat that carries three items) trivializes it.

Here is another example:

Question: 

  __.   __.
! __. ! __.
!       !

Above is a diagram featuring 8 matchsticks. Each vertical matchstick is represented by an exclamation point (!), and every horizontal matchstick is represented by two underscores and a period (__.) Move 2 matchsticks to create 2 equal squares. Make sure the end result does not have any stray matchsticks that don’t belong to a closed border. Be specific about which matchsticks to move. Draw the new diagram when you’re finished.

Answer:

Move the horizontal matchstick on the top right and the vertical matchstick on the bottom left to the bottom right of the image to form a new square that shares its top left corner with the existing square’s bottom right corner. The new diagram is as follows:

  __.
! __. ! __.
        ! __. !

Clearly, neither model's solution is correct. ChatGPT-o1's solution does not feature two squares, and o3-mini-high's solution, while featuring two rectangles (not squares), does not contain the correct number of matchsticks! While this matchstick puzzle is not as trivial as the former riddle, it is still fairly easy.

We also evaluate present-day and historical state-of-the-art (SoTA) foundation models on this benchmark to construct a trendline of jaggedness performance over time.

Figure 2: Green → solvable by high schoolers; blue → requires highly specialized training or expert ability. School-level benchmarks saturate around 2023-2024, prompting the need for benchmarks of highly specialized or expert difficulty. For SIMPLE, we plot the best of gpt-4, gpt-4-turbo, gpt-4o, o1, and o3-mini as each was released.
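
The "best model to date" series described in the caption reduces to a running maximum over release-ordered scores; here is a minimal sketch (the example values are placeholders, not our measurements):

from itertools import accumulate

def best_so_far(scores: list[float]) -> list[float]:
    # Given per-model scores in release order, keep the best result
    # achieved by any model released up to that point.
    return list(accumulate(scores, max))

# e.g. best_so_far([0.30, 0.25, 0.40, 0.38, 0.60])
#      -> [0.30, 0.30, 0.40, 0.40, 0.60]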

SIMPLE is the lowest-scoring of the school-level benchmarks (circles). Interestingly, we also observe a large improvement (~30%) in model performance with the arrival of reasoning models. This supports the hypothesis that reasoning models not only improve top-line capabilities but may also reduce jaggedness, perhaps because they can double-check their work in their chain-of-thought without penalty. This points to a concrete connection between reasoning models and improved performance on simple reasoning tasks.

To make this more tangible, here is an example of a question from SIMPLE where o1 and o3-mini get it wrong, but o3-mini-high is correct after a longer reasoning period.

Question:

Where does Thanksgiving come before Christmas?

Answer:

on the calendar (or anywhere both are celebrated)

Interestingly, o3-mini reasoned for 34 seconds while o3-mini-high reasoned for 49 seconds. Looking at the reasoning traces proves insightful. We can actually observe how o3-mini notices the variation from the usual riddle but fails to fully reason through the correction, whereas o3-mini-high does change its answer.

Therefore, holistically, it appears that top-line gains from reasoning do, to a certain extent, transfer to simple reasoning. Yet SIMPLE is far from saturated. Jaggedness remains a salient feature of reasoning models, and progress in top-line capabilities seems to outpace progress in worst-case performance. This reflects the broader trajectory of computing: computers excel at narrow but expanding problem domains, rapidly surpassing human capability once we learn to apply them. Each time we harness computers for a new type of problem, they improve at solving it faster than we can identify the next challenge to automate.

Perhaps jaggedness isn’t just a feature of AI — it’s the very nature of computation.

Footnotes

1 The term jagged intelligence was coined by Andrej Karpathy.

2 Jaggedness measures capability (or, more precisely, the lack thereof). This is why we explicitly distinguish between the more specific term, top-line capability, and capability broadly speaking.

3 Foundation models generally exhibit high crystallized intelligence but lower fluid intelligence.

4 To our knowledge, the only such benchmark is Simple-Bench, but it does not have a public eval set. This is reasonable for protecting against contamination, but it makes it difficult to assess the performance of historical models and thus to understand the trend over time.

5 SIMPLE: Simple, Intuitive, Minimal, Problem-solving, Logical Evaluation.

6  Given a pen, unlimited sheets of paper, and one hour of time.
