
Jagged Intelligence in the Enterprise

[Illustration: a confused robot]
Will jaggedness decrease or increase as foundation models continue to evolve? This question is at the very heart of enterprise AI because it informs what kinds of applications will be most successful, and how to build them. [Image credit: Adobe Stock]

In the rapidly evolving landscape of artificial intelligence (AI), we're witnessing a fascinating paradox. AI systems continue to shatter performance records on increasingly complex benchmarks, while sporadically struggling with simpler tasks that most humans find intuitive and can reliably solve. This dichotomy is known as jagged intelligence1. In this piece, we discuss the phenomenon of jaggedness, investigate how jaggedness in frontier models has evolved over time, and examine how the emergence of reasoning models has affected it. In particular, we introduce a public benchmark for jaggedness featuring simple reasoning tasks solvable by high schoolers. Jaggedness is of particular interest for enterprise AI applications, where many problems are relatively commonplace and require human context and reliability more than sophisticated math-solving abilities.

Since ChatGPT's launch in 2022, foundation models have largely been commercialized as chat interfaces, where users send messages and receive instant responses. However, reasoning models such as OpenAI's o3 and DeepSeek's R1 can use additional computation at inference time to generate an internal monologue that serves as a scratch pad for step-by-step work before responding to the user's message. This has resulted in significant capability gains on many reasoning-oriented tasks such as mathematics, engineering, and software development. Just when it seemed that progress in one AI training paradigm (scaling pretraining compute) might be slowing down, a new optimization regime (scaling inference-time compute) has emerged.
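To make the inference-time-compute knob concrete, here is a minimal sketch of requesting more deliberation from a reasoning model. It assumes the OpenAI Python SDK and a model that accepts the reasoning_effort hint; support varies by model and SDK version.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# o-series reasoning models accept an effort hint ("low", "medium", "high")
# that trades latency and token cost for a longer internal scratch pad.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[
        {"role": "user", "content": "Where does Thanksgiving come before Christmas?"}
    ],
)
print(response.choices[0].message.content)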

Figure 1: Progress on various AI capability benchmarks over time. Benchmark results are gathered from publicly available sources, such as epoch.ai.

Most well-known AI benchmarks to date focus on measuring the performance ceiling of frontier models on increasingly complex tasks, from solving graduate-level problem sets to writing sophisticated code. Once a benchmark has been nearly saturated, the community creates a new one with a higher ceiling to reset baselines for measuring the next set of top-line capabilities. Tracking average-case performance on these challenging benchmarks fails to capture a more nuanced picture around reliability or worst-case performance on simpler tasks, obscuring the jaggedness that persists even on "solved" problems.

Today’s models outperform the average human on some tasks yet struggle on others, often in ways that defy predictable patterns. Unlike traditional computing tools—such as calculators for arithmetic, regex for string matching, or photo editors for image manipulation—AI’s strengths and weaknesses don’t follow an intuitive logic. Despite years of advancements in modeling, data preparation, and training procedures, the whack-a-mole nature of jagged intelligence remains a persistent challenge in foundation models.

The key distinction is this: top-line capability vs. jaggedness.

  • Top-line capability measures accuracy on increasingly difficult tasks. This is what makes headlines—AI acing medical exams, passing the bar, or medaling in math olympiads.
  • Jaggedness measures whether AI can perform consistently on simpler tasks that humans can do without specialized training. This is what determines the robustness of an AI system and informs the kinds of guardrails needed for critical automations2 (a minimal sketch of both metrics follows below).
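
To make the distinction concrete, here is a minimal sketch that scores the same set of model runs both ways. The record format, difficulty tiers, and pass/fail values are illustrative assumptions, not the schema of any benchmark discussed here:

from statistics import mean

# Hypothetical per-task results: each record carries the task's
# human-difficulty tier and pass/fail outcomes across repeated runs.
results = [
    {"difficulty": "expert", "passes": [1, 1, 0, 1]},
    {"difficulty": "easy", "passes": [1, 0, 1, 0]},
    {"difficulty": "easy", "passes": [0, 1, 1, 1]},
]

def tier_accuracy(tier: str) -> float:
    rows = [r for r in results if r["difficulty"] == tier]
    return mean(mean(r["passes"]) for r in rows)

# Top-line capability: average accuracy on the hardest tier.
top_line = tier_accuracy("expert")

# Jaggedness: failure rate on tasks most humans find easy; the higher it is,
# the heavier the guardrails a critical automation needs.
jaggedness = 1 - tier_accuracy("easy")

print(f"top-line capability: {top_line:.2f}, jaggedness: {jaggedness:.2f}")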

A good benchmark for jaggedness should focus on problems that are relatively easy for humans, yet on which AI models still exhibit poor performance. We have spent many hours working with foundation models on both top-line capability and jaggedness, and it is worthwhile to dig into how jaggedness evolves over time and what kinds of tasks exhibit it3.

How is Jaggedness Evolving over Time?

Will jaggedness decrease or increase as foundation models continue to evolve? This question is at the very heart of enterprise AI because it informs what kinds of applications will be most successful, and how to build them.

To make jaggedness quantifiable, we need jaggedness-oriented public benchmarks, but none currently exist4.

To address this, we create and release SIMPLE5, an initial dataset of 225 simple reasoning questions. Every question in this dataset is designed to be easy, with at least 10% of high schoolers expected to solve it6 (though most questions are even simpler).
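
For a sense of how such a dataset might be consumed, here is a minimal evaluation-loop sketch. The file name (simple.jsonl), record fields, and containment-based grading rule are all assumptions for illustration; the released benchmark may define its own format and judging protocol:

import json

def grade(model_answer: str, reference: str) -> bool:
    # Simple-reasoning answers are short, so a containment check is a
    # plausible first pass; a human or LLM judge would be more robust.
    return reference.lower().strip(" .") in model_answer.lower()

def evaluate(answer_fn, path: str = "simple.jsonl") -> float:
    # Each JSONL line is assumed to hold {"question": ..., "answer": ...}.
    with open(path) as f:
        items = [json.loads(line) for line in f]
    graded = [grade(answer_fn(item["question"]), item["answer"]) for item in items]
    return sum(graded) / len(graded)

# Usage: evaluate(lambda q: my_model.respond(q)) returns accuracy in [0, 1].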

Example Questions from SIMPLE

Question:

A man has to get a fox, a chicken, and a sack of corn across a river.
He has a rowboat, and it can only carry him and three other things.
If the fox and the chicken are left together without the man, the fox will eat the chicken.
If the chicken and the corn are left together without the man, the chicken will eat the corn.
How does the man do it in the minimum number of steps?

Answer:

They can all go in one trip since the boat can carry him and 3 other things.

Both state-of-the-art reasoning models we tested (o1 and o3-mini-high) regurgitate the usual solution to this riddle, without noticing that the changed constraint (a boat that carries three items) trivializes it.

Here is another example:

Question: 

  __.   __.
! __. ! __.
!       !

Above is a diagram featuring 8 matchsticks. Each vertical matchstick is represented by an exclamation point (!), and every horizontal matchstick is represented by two underscores and a period (__.) Move 2 matchsticks to create 2 equal squares. Make sure the end result does not have any stray matchsticks that don’t belong to a closed border. Be specific about which matchsticks to move. Draw the new diagram when you’re finished.

Answer:

Move the horizontal matchstick on the top right and the vertical matchstick on the bottom left to the bottom right of the image to form a new square that shares its top left corner with the existing square’s bottom right corner. The new diagram is as follows:

  __.
! __. ! __.
        ! __. !

Clearly, neither model's solution is correct. ChatGPT-o1's solution does not feature two squares, and o3-mini-high's solution, while featuring two rectangles (not squares), does not contain the correct number of matchsticks! While this matchstick puzzle is not as trivial as the former riddle, it is still fairly easy.

We also evaluate present-day and historical state-of-the-art (SoTA) foundation models on this benchmark to construct a trendline of jaggedness performance over time.

Figure 2: Green → solvable by high schoolers; blue → requires highly specialized training or expert ability. School-level benchmarks saturate around 2023-2024, prompting the need for benchmarks of highly specialized or expert difficulty. For SIMPLE, we plot the best of gpt-4, gpt-4-turbo, gpt-4o, o1, and o3-mini as each was released.
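
The "best model to date" series described in the caption reduces to a running maximum over release-ordered scores; here is a minimal sketch (the example values are placeholders, not our measurements):

from itertools import accumulate

def best_so_far(scores: list[float]) -> list[float]:
    # Given per-model scores in release order, keep the best result
    # achieved by any model released up to that point.
    return list(accumulate(scores, max))

# e.g. best_so_far([0.30, 0.25, 0.40, 0.38, 0.60])
#      -> [0.30, 0.30, 0.40, 0.40, 0.60]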

SIMPLE is the lowest-scoring of the school-level benchmarks (circles). Interestingly, we also observe a large improvement (~30%) in model performance with the arrival of reasoning models. This supports the hypothesis that reasoning models not only improve top-line capabilities but may also reduce jaggedness, perhaps because they can double-check their work in their chain-of-thought without penalty. This points to a concrete connection between reasoning models and improved performance on simple reasoning tasks.

To make this more tangible, here is an example of a question from SIMPLE where o1 and o3-mini get it wrong, but o3-mini-high is correct after a longer reasoning period.

Question:

Where does Thanksgiving come before Christmas?

Answer:

on the calendar (or anywhere both are celebrated)

Interestingly, o3-mini reasoned for 34 seconds while o3-mini-high reasoned for 49 seconds. Looking at the reasoning traces proves insightful. We can actually observe how o3-mini notices the variation from the usual riddle but fails to fully reason through the correction, whereas o3-mini-high does change its answer.

Therefore, holistically, it appears that top-line gains from reasoning do, to a certain extent, transfer to simple reasoning. Yet SIMPLE is far from saturated. Jaggedness remains a salient feature of reasoning models, and progress in top-line capabilities seems to outpace progress in worst-case performance. This reflects the broader trajectory of computing: computers excel at narrow but expanding problem domains, rapidly surpassing human capability once we learn to apply them. Each time we harness computers for a new type of problem, they improve at solving it faster than we can identify the next challenge to automate.

Perhaps jaggedness isn’t just a feature of AI — it’s the very nature of computation.

Footnotes

1 The term jagged intelligence was coined by Andrej Karpathy.

2 Jaggedness measures capability (or, more precisely, the lack thereof). This is why we explicitly distinguish between the more specific term, top-line capability, and capability broadly speaking.

3 Foundation models generally exhibit high crystallized intelligence but lower fluid intelligence.

4 To our knowledge, the only such benchmark is Simple-Bench, but it does not have a public eval set. This is reasonable for protecting against contamination, but it makes it difficult to assess the performance of historical models and thus to understand the trend over time.

5 SIMPLE: Simple, Intuitive, Minimal, Problem-solving, Logical Evaluation.

6  Given a pen, unlimited sheets of paper, and one hour of time.
