Why AI Progress Is Slowing Despite Better Models

Why This Sounds Paradoxical

For five years (2019–2024), AI labs followed a simple formula: train bigger models on more data using more compute, and performance increases reliably.

GPT-2 → GPT-3 → GPT-3.5 → GPT-4 each brought dramatic leaps. Then something shifted in 2024. Despite released models becoming more useful and accessible, the performance improvements plateaued.

The paradox: Models are "better" (faster, cheaper, integrated everywhere), yet core capabilities are barely improving. This looks like progress meeting a ceiling.


How Normal Thinking About AI Progress Works

Intuitive expectation: AI scaling laws = Moore's Law for AI.

Just as transistor counts doubled roughly every 2 years for decades, AI performance should double every time we 10× compute or data. Extrapolate linearly: AGI in 5 years, superintelligence inevitable.

This view powered $100B+ investments and hype cycles. Reality is messier.


How AI Progress Actually Works (And Why It's Slowing)

The Scaling Law That Got Us Here

Between 2019 and 2023, researchers documented empirical scaling laws:

  • Bigger model (more parameters) → better performance
  • More training data → better performance
  • More compute (GPUs, FLOPs) → better performance

These relationships held across multiple independent labs and architectures, suggesting they might be fundamental. This wasn't magic: it was predictable power-law returns, where each 10× increase in scale bought a roughly constant relative improvement. That predictability enabled confident long-term planning and investment.
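Empirical scaling laws of this kind are usually written as power laws in compute, data, or parameters. The sketch below is a toy illustration of that shape, not a fitted model: the constants A and ALPHA are made up for the example. It shows why a power law is predictable yet still implies shrinking absolute gains from each additional 10× of compute.

```python
# Toy power-law scaling curve: loss = A * compute ** (-ALPHA).
# A and ALPHA are hypothetical constants chosen for illustration only.
A = 10.0
ALPHA = 0.05


def loss(compute_flops: float) -> float:
    """Predicted loss for a training compute budget under the toy power law."""
    return A * compute_flops ** (-ALPHA)


def gain_from_10x(compute_flops: float) -> float:
    """Absolute loss reduction obtained by multiplying compute by 10."""
    return loss(compute_flops) - loss(10 * compute_flops)


for c in (1e20, 1e21, 1e22, 1e23, 1e24):
    print(f"compute={c:.0e}  loss={loss(c):.3f}  gain from next 10x={gain_from_10x(c):.3f}")
```

Under these assumptions every 10× step cuts the loss by the same ratio, so the absolute improvement gets smaller each time even though the curve never stops being predictable.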

Why It's Slowing Now (2024–2025)

Three concrete bottlenecks have emerged:

1. Data Scarcity

  • High-quality training data (especially in English) is exhausted.
  • LLMs have already consumed most internet text, academic papers, books.
  • Remaining data is either lower-quality, duplicated, or behind paywalls.
  • Synthetic data helps but introduces subtle quality degradation.
  • Without new high-quality data sources, throwing more compute at training yields marginal gains.

Evidence:

  • GPT-3.5 → GPT-4: ~15–25% benchmark improvement
  • GPT-4 → GPT-4o: only 3–7% improvement on same benchmarks
  • Claude 3 → Claude 3.5: modest incremental gains, not transformative leaps

2. Benchmark Saturation (Approaching Human Performance)

  • Models now score 88–90% on MMLU (knowledge quizzes), at or near the estimated human-expert level (~90%).
  • Further improvements require increasing effort for diminishing gains.
  • The "easy" gains from raw scaling have been exhausted.
  • What's left requires solving hard problems: reasoning, generalization, novel domains.
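A quick way to see what saturation does to headline numbers is to compare the points gained with the share of remaining errors removed. The helper below is a hypothetical illustration; the score pairs are examples, and only the 88% to 90% pair reflects the MMLU range quoted above.

```python
def remaining_error_removed(old_acc: float, new_acc: float) -> float:
    """Fraction of the remaining errors eliminated when accuracy rises from old_acc to new_acc.

    Both accuracies are fractions between 0 and 1, and a perfect score of 1.0
    is assumed to be the ceiling.
    """
    if not 0.0 <= old_acc < new_acc <= 1.0:
        raise ValueError("expected 0 <= old_acc < new_acc <= 1")
    return (new_acc - old_acc) / (1.0 - old_acc)


# Illustrative score pairs, not measured results.
for old, new in [(0.50, 0.70), (0.70, 0.88), (0.88, 0.90)]:
    share = remaining_error_removed(old, new)
    print(f"{old:.0%} -> {new:.0%}: +{(new - old) * 100:.0f} points, "
          f"{share:.0%} of remaining errors removed")
```

Near the ceiling there are simply few points left to win, so even a meaningful reduction in errors shows up as a small benchmark delta.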

3. Hardware & Infrastructure Bottlenecks

  • GPU utilization in large AI training is only 30–40%, not the theoretical 80–90%.
  • Reason: data movement between chips, memory, and storage systems is slow.
  • Compute is fast; data I/O is slow. GPUs sit idle waiting for data ("GPU starvation").
  • Your $500M GPU cluster is underutilized because data can't feed it fast enough.

Result: Throwing more hardware at the problem doesn't scale linearly anymore. The bottleneck shifted from compute to data engineering.
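The arithmetic behind "GPU starvation" can be sketched with a simple step-time model. The version below assumes compute and data loading do not overlap at all, which is a simplification, and the timings are hypothetical numbers chosen to land in the 30–40% utilization range mentioned above.

```python
def gpu_utilization(compute_s: float, data_wait_s: float) -> float:
    """Fraction of a training step the GPU spends computing rather than waiting on data."""
    return compute_s / (compute_s + data_wait_s)


def effective_speedup(compute_speedup: float, compute_s: float, data_wait_s: float) -> float:
    """End-to-end step speedup when only the compute portion gets faster (Amdahl-style)."""
    old_step = compute_s + data_wait_s
    new_step = compute_s / compute_speedup + data_wait_s
    return old_step / new_step


# Hypothetical step: 0.35 s of compute and 0.65 s stalled on data (35% utilization).
compute_s, data_wait_s = 0.35, 0.65
print(f"utilization: {gpu_utilization(compute_s, data_wait_s):.0%}")
for k in (2, 4, 10):
    speedup = effective_speedup(k, compute_s, data_wait_s)
    print(f"{k}x faster GPUs -> {speedup:.2f}x faster steps")
```

In this toy model the data stall caps the achievable speedup at 1 / 0.65, or about 1.5×, no matter how fast the GPUs get. Real pipelines overlap I/O with compute, so actual numbers will differ, but the shape of the limit is the same.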


What This Actually Means (Real Implications)

1. The Scaling Law Paradigm Is Over (For Now)

The previous five-year playbook—"throw more scale at the problem"—is hitting diminishing returns. Simply training GPT-5 with 10× the compute of GPT-4 won't yield 10× better reasoning or 10× better understanding.

Labs are pivoting to new scaling frontiers:

  • Test-time scaling ("reasoning models")

    • Instead of scaling training, scale the inference time: give models more compute/steps to "think" about problems before answering.
    • OpenAI's o1 model demonstrates this: it reasons through problems step-by-step, trading slower responses for better accuracy.
    • This may be the next 5-year scaling axis, but it's still unproven at scale.
  • Algorithmic innovations

    • New attention mechanisms, training techniques, or architectures that learn more efficiently from less data.
    • Harder to discover than "use 10× more GPUs," but potentially more impactful.
  • Synthetic & curated data

    • Using LLMs to generate training data (risky but necessary).
    • Hand-curating high-value training datasets instead of scraping internet scale.
    • Reducing reliance on "raw" scale, increasing reliance on data quality.
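To make the test-time scaling idea concrete, here is a minimal sketch of one simple way to spend more inference compute: sample several answers and take a majority vote. This is a stand-in chosen for illustration, not a description of how o1 or any specific reasoning model works. It assumes each sample is independently correct with probability p_single, which real model samples are not.

```python
from math import comb


def majority_vote_accuracy(p_single: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes each sample is correct with probability p_single, independently
    of the others, and that n_samples is odd so ties cannot occur.
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_single**k * (1 - p_single) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


# Hypothetical per-sample accuracy of 60%; more samples means more inference compute.
for n in (1, 5, 25, 101):
    print(f"samples={n:>3}  accuracy={majority_vote_accuracy(0.6, n):.3f}")
```

Under these idealized assumptions accuracy climbs as samples are added, but each extra sample costs a full inference pass, which is the speed-for-accuracy trade described above.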

2. The Timeline for AGI Just Shifted

Overly optimistic forecasts (AGI by 2027–2030 with continued exponential scaling) are now being recalibrated. Without major algorithmic breakthroughs, progress may decelerate significantly.

This doesn't mean AGI is impossible—just that the path forward is less clear and slower than linear extrapolation suggested.

3. Compute Itself Becomes Less of a Differentiator

As scaling laws slow, the value of having the largest GPU cluster decreases. Smaller, smarter teams with better algorithms and curated data may outcompete brute-force scale.

This is already happening:

  • Mistral (scrappy EU startup) created competitive models with a fraction of the GPU hardware available to the largest labs.
  • Open-source models (LLaMA 2, Qwen) are now "good enough" for many tasks.
  • Inference-efficient models (smaller, quantized) run on laptops.

The AI race may shift from "who has the most GPUs?" to "who has the best algorithms and data?"
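The claim that smaller, quantized models run on laptops comes down to simple arithmetic on weight storage. The sketch below counts weights only and ignores activations, the KV cache, and runtime overhead, so real memory use is higher. The model sizes are examples, not references to specific releases.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Example parameter counts at common precisions: 16-bit, 8-bit, and 4-bit.
for n_params in (7e9, 13e9, 70e9):
    sizes = ", ".join(
        f"{bits}-bit: {weight_memory_gb(n_params, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{n_params / 1e9:.0f}B params -> {sizes}")
```

A 7B-parameter model drops from about 14 GB at 16-bit to about 3.5 GB at 4-bit, which is the difference between needing a datacenter GPU and fitting in ordinary laptop memory.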


What This Advancement Is Actually Good At

Current state (2025):

  • LLMs are excellent at pattern recognition, generation, and low-level reasoning.
  • Remarkable at coding, content generation, summarization, translation, Q&A.
  • Increasingly capable at arithmetic and multi-step logic (reasoning models).
  • Useful across thousands of verticals (healthcare, finance, sales, education, etc.).

What LLMs are not good at:

  • True reasoning and novelty (solving problems unlike anything seen in training)
  • Reliable factuality (they hallucinate confidently)
  • Understanding causation (they confuse correlation with causation)
  • Commonsense reasoning (they fail at tasks toddlers solve easily)
  • Long-horizon planning and adaptation

The implication: Current LLMs are powerful tools for automating cognitive tasks within domains they've been trained on. But they're hitting a generalization ceiling. Moving beyond requires solving problems brute-force scale can't solve.


Real Problems This Change Could Tackle

In the Short Term (2025–2027):

  • More efficient training methods (reducing cost, enabling smaller teams to compete)
  • Better data curation and synthetic data generation
  • Improved reasoning and planning in reasoning-class models
  • Specialized models for specific domains (medicine, law, science) with targeted training

In the Medium Term (2027–2030):

  • Potential breakthrough in test-time scaling or new architectures
  • AI systems that can learn and adapt from smaller amounts of data
  • Multi-modal models that integrate vision, audio, text more robustly
  • AI that reliably explains its reasoning (interpretability breakthroughs)

What likely won't change:

  • Fundamental limitations of Transformers (may persist unless new architectures emerge)
  • The need for high-quality data (only gets more critical)
  • The fact that AI operates on patterns, not true understanding

Common Myths

Myth 1: "AI progress has stopped."

False. LLMs continue improving; benchmarks show consistent (if slower) gains. What's stopped is the exponential trajectory. Improvement is now linear-to-polynomial, not exponential.

Myth 2: "The plateau means we hit some fundamental limit; AGI is impossible."

False. The plateau suggests brute-force scaling hit a wall, not that intelligence itself is capped. Breakthroughs in algorithms or data may reignite exponential progress.

Myth 3: "More data always helps, so we just need to find more data."

Partly false. Data helps, but the quality-to-quantity tradeoff is real. Synthetic data and fine-tuning on small curated datasets often beat random scraping.
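As a rough illustration of what "curated" means in practice, the sketch below applies two very crude filters to a scraped corpus: exact-duplicate removal after normalization and a minimum length. Both the function names and the min_words threshold are assumptions made for this example. Production pipelines use much richer signals such as near-duplicate detection and learned quality classifiers.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def curate(docs: list[str], min_words: int = 5) -> list[str]:
    """Drop very short documents and exact duplicates (after normalization), keeping order."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude quality proxy: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept


# Tiny made-up corpus: one duplicate, one low-value fragment.
raw = [
    "Scaling laws relate loss to compute, data, and parameters.",
    "scaling laws   relate loss to compute, data, and parameters.",
    "click here",
    "Curated datasets can outperform much larger scraped ones.",
]
print(curate(raw))
```

Half of this toy corpus disappears after filtering, which is the point: raw token counts overstate how much useful training signal a scrape contains.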

Myth 4: "Scaling laws were laws of nature; their slowdown is a surprise."

False. Scaling laws were empirical observations, not laws of physics. All empirical patterns eventually hit constraints. In hindsight, the slowdown was foreseeable.


Why Trending Now?

November 2024 inflection: Prominent researchers, investors, and executives (Sutskever, Andreessen, Nadella) publicly acknowledged the slowdown, pivoting narratives toward test-time scaling and algorithmic innovation.

2025 reality: The industry is reorganizing around a new playbook:

  • More emphasis on data quality and curation
  • Infrastructure focus on I/O and data movement (not just compute)
  • Exploration of new scaling axes (test-time, ensemble methods, etc.)
  • Broader adoption of smaller, specialized models

The peak-hype period (2023–early 2024) is over. The real-world utility period is beginning.


Are These Shifts a Threat?

To current AI companies: Partly. Companies with a scale advantage (OpenAI, Google, Anthropic) are hedging by exploring new frontiers. Companies whose only edge is scale face headwinds.

To AI hype/investment: Absolutely. Unrealistic 5-year timelines for AGI are now dead. Funding may shift toward companies with proven value (not just moonshots).

To AI's long-term potential: Not really. The slowdown is a redirection, not a terminal diagnosis. If anything, it makes AI safer (slower, more time to understand implications) and more sustainable.


Future Outlook

2025–2027: Era of Efficiency & Specialization

  • Smaller models (7B–13B parameters) become dominant for deployment.
  • Fine-tuning and retrieval-augmented generation (RAG) become primary techniques.
  • Test-time scaling experiments guide the field.
  • Compute-efficient architectures gain focus.

2027–2030: New Scaling Frontiers Emerge (Or Plateau Persists)

  • If algorithmic breakthroughs occur: exponential progress resumes on new axes.
  • If not: AI plateaus at current capability level; focus shifts to integration and applications.

Long-term (2030+): Either new breakthroughs reignite progress, or AI reaches a capability plateau below AGI. Most experts now assign higher probability to the latter.


Conclusion

AI progress is slowing because the three pillars of past success—compute scaling, data availability, and benchmark improvements—have hit hard constraints. Data is scarce, benchmarks are saturated near human performance, and hardware I/O has become the bottleneck. This doesn't mean AI stops improving, but the exponential trajectory flattens. Labs are pivoting toward test-time scaling, algorithmic innovation, and data quality, signaling a shift from "more scale" to "smarter scale." The era of brute-force AI advancement is ending; the era of understanding what works and why is beginning.
