Predicting the Future Is the Benchmark That Matters

How we're using LLMs to predict forward earnings changes

Apr 16, 2026

This week we launched the Intelligent Earnings Benchmark (IEB) at Intelligent Alpha. The IEB measures how well frontier AI models predict the forward trajectory of company earnings. One of the fundamental jobs of human analysts is to understand the trajectory of earnings. Usually if you get the earnings right, you get the stock right. To the extent AI models can get earnings right at scale, they should be able to get stocks right too.

It’s the first time that we’re sharing in a structured way some of the many tests and experiments we run at Intelligent Alpha to understand how frontier AI models think and act as investors.

The nature of alpha is always changing, and research is the process by which you try to keep up. If you’ve been reading The Deload for a while, you’ve probably experienced the journey of Intelligent Alpha along with me. When I started experimenting with LLMs to invest in 2023, it was out of curiosity about the capabilities of the LLMs. It was more of a research project than an attempt to build a company.

Since then, we’ve done something few other companies using frontier AI models in the investing domain have done, which is register as an investment advisor to actually manage investor money. Being an RIA brings restrictions on how freely we can talk about what we’re doing, but talking about the IEB publicly brings us back to the roots of Intelligent Alpha.

Elon Musk@elonmusk

The ability to predict the future is the best measure of intelligence

X Freeze @XFreeze

Grok 4 ranks #1 on the latest FutureX benchmark for real-world predictions surpassing GPT-5 Pro

8:09 AM · Sep 5, 2025

The Intelligent Earnings Benchmark

The Intelligent Earnings Benchmark (IEB) tests frontier AI models on the core prediction that active investors make: What direction are earnings expectations headed?

Every quarter, we run a universe of large-cap US stocks ($10B+ market cap) through a standardized process where the models predict the direction of forward consensus estimates over the next 60 days. Note that we’re not predicting earnings per se. The actual outcome of earnings isn’t what moves stocks. Changes in future expectations are. We believe predicting how next-quarter consensus estimates change is a more valuable test than merely predicting a quarter’s earnings.

We’ve locked in the Q2 earnings prediction cohort of 715 stocks with predictions across eight models:

GPT 5.4
Claude 4.6 Opus
Gemini 3.1 Pro
Grok 4.20 Reasoning
GLM 5.1
Qwen 3.5
MiniMax M2.7
DeepSeek R1

Each of the models is asked to predict:

Direction: For both revenue and EPS estimates for the next quarter. Example: April 2026 will begin the Q126 earnings reporting period, but the models are predicting the direction of Q226 estimates.
Revision %: The model’s estimate of the change.
Magnitude: Ranges of small/medium/large change that correspond to the revision percentage estimate.

Along with those predictions, models are asked to rate their confidence from 0 to 100 and to offer a rationale, a counter-thesis/risk assessment, and the key signals most important to the prediction.
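
As a rough illustration, one prediction record could be shaped like the sketch below. The class name, field names, and the "flat" direction value are assumptions made for the example, not the benchmark's actual schema, and the small/medium/large cutoffs are left out because they aren't specified here.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical label sets; the post does not list the exact allowed values.
Direction = Literal["up", "flat", "down"]
Magnitude = Literal["small", "medium", "large"]


@dataclass
class EstimatePrediction:
    """One model's call on one metric (revenue or EPS) for one ticker."""

    ticker: str
    model: str
    metric: Literal["revenue", "eps"]
    direction: Direction
    revision_pct: float  # model's estimate of the change, in percent
    magnitude: Magnitude
    confidence: int  # 0-100
    rationale: str = ""
    counter_thesis: str = ""
    key_signals: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject confidence values outside the 0-100 range.
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")
        # Direction and the sign of the revision estimate should agree.
        if self.direction == "up" and self.revision_pct <= 0:
            raise ValueError("an 'up' call needs a positive revision_pct")
        if self.direction == "down" and self.revision_pct >= 0:
            raise ValueError("a 'down' call needs a negative revision_pct")


# Usage with a made-up ticker and model name:
example = EstimatePrediction(
    ticker="XYZ",
    model="model-a",
    metric="eps",
    direction="up",
    revision_pct=1.5,
    magnitude="small",
    confidence=70,
    key_signals=["hypothetical signal"],
)
```

Validating at construction time keeps contradictory calls, such as an "up" direction paired with a negative revision, out of the scored set.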

To make predictions, models receive:

A dataset of two years of historical financial information including earnings, income statement, and balance sheet.
Current consensus estimates for the prediction period.
The most recent earnings transcript.
A cache of current economic data from FRED.
Web search via Exa where the models can make up to 10 searches.

At the end of the prediction period, which will be in early June for this first public benchmark, the models will be scored on the accuracy of their predictions. Public scores will be available on the Benchmark section of our website. Select partners of Intelligent Alpha may be able to access the full dataset of model predictions.
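
To make the scoring step concrete, here is a minimal sketch of how directional accuracy could be computed once the ending consensus estimates are known. The function names and the 0.5% band used to label a change as "flat" are hypothetical choices for the example; the actual IEB scoring rules may differ.

```python
from typing import List, Tuple

# Hypothetical threshold: changes within +/-0.5% count as "flat".
FLAT_BAND_PCT = 0.5


def realized_direction(
    start_estimate: float,
    end_estimate: float,
    flat_band_pct: float = FLAT_BAND_PCT,
) -> str:
    """Label the change in a consensus estimate as up, flat, or down."""
    if start_estimate == 0:
        raise ValueError("start_estimate must be non-zero")
    change_pct = (end_estimate - start_estimate) / abs(start_estimate) * 100.0
    if change_pct > flat_band_pct:
        return "up"
    if change_pct < -flat_band_pct:
        return "down"
    return "flat"


def directional_accuracy(calls: List[Tuple[str, float, float]]) -> float:
    """Share of calls whose predicted direction matched the realized one.

    Each call is (predicted_direction, start_estimate, end_estimate).
    """
    if not calls:
        raise ValueError("no calls to score")
    hits = sum(
        1 for predicted, start, end in calls
        if predicted == realized_direction(start, end)
    )
    return hits / len(calls)


# Made-up numbers: three of the four calls match the realized direction.
sample_calls = [
    ("up", 2.00, 2.10),
    ("up", 1.00, 0.90),
    ("flat", 1.00, 1.002),
    ("down", 1.00, 0.90),
]
sample_accuracy = directional_accuracy(sample_calls)
```

Dividing by the absolute value of the starting estimate keeps the direction label correct when the starting EPS estimate is negative.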

Early Insights

The first predictions run is complete and locked in for measurement.

The consensus of the eight models is that 70% of revenue estimates and 61% of earnings estimates for CYQ2 are going up between now and the end of the quarter. That’s bullish relative to history, but more in line with the last quarter or two.
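
How the eight calls get combined into a single consensus isn't spelled out above. A simple majority vote per stock is one plausible reading, sketched here with hypothetical function names and a hypothetical "split" label for ties.

```python
from collections import Counter
from typing import Dict, List


def majority_direction(votes: List[str]) -> str:
    """Return the most common direction, or "split" if the top two tie."""
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "split"
    return ranked[0][0]


def share_called_up(votes_by_ticker: Dict[str, List[str]]) -> float:
    """Fraction of tickers whose majority call is "up"."""
    if not votes_by_ticker:
        raise ValueError("need at least one ticker")
    calls = [majority_direction(v) for v in votes_by_ticker.values()]
    return calls.count("up") / len(calls)


# Made-up votes from eight models on four placeholder tickers.
sample_votes = {
    "AAA": ["up"] * 8,
    "BBB": ["up"] * 5 + ["down"] * 3,
    "CCC": ["down"] * 6 + ["flat"] * 2,
    "DDD": ["up"] * 4 + ["down"] * 4,
}
sample_share = share_called_up(sample_votes)
```

With an even number of models, ties are possible, so the sketch treats a tie as no consensus instead of forcing a direction.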

Going back to 2020 on a quarterly basis, revenue/EPS estimates increased roughly 55% of the time from the beginning of a new reporting period to the end of that period, were flat 16% of the time, and were down 29% of the time. This makes sense because management teams have an incentive to keep expectations in check so that they can beat them. We should generally expect to see an upward bias in the data.

As a naive baseline, if a human or a model just predicted that estimates would go up for every stock, they should achieve a 55% accuracy rate on average. That should be the minimum hurdle for adding value in earnings predictions. Beyond that, the accuracy of magnitude predictions will be the ultimate test of the models in this benchmark.
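
The baseline arithmetic is easy to check. This sketch takes the historical up/flat/down mix quoted above as given and computes what a constant call would score, plus a model's edge over it. The function names and the 60% model accuracy in the usage example are illustrative assumptions, not results.

```python
from typing import Dict

# Historical mix of estimate changes since 2020, as quoted in the post.
HISTORICAL_MIX: Dict[str, float] = {"up": 0.55, "flat": 0.16, "down": 0.29}


def constant_call_accuracy(
    call: str, mix: Dict[str, float] = HISTORICAL_MIX
) -> float:
    """Expected accuracy of predicting the same direction for every stock."""
    return mix[call]


def edge_over_baseline(
    model_accuracy: float,
    baseline_call: str = "up",
    mix: Dict[str, float] = HISTORICAL_MIX,
) -> float:
    """Accuracy in excess of the constant-call baseline (can be negative)."""
    return model_accuracy - constant_call_accuracy(baseline_call, mix)


# Always calling "up" is the strongest constant call under this mix.
best_constant_call = max(HISTORICAL_MIX, key=HISTORICAL_MIX.get)

# Illustrative only: a model that is right 60% of the time on direction.
illustrative_edge = edge_over_baseline(0.60)
```

Under this framing, a model scoring below 55% on direction is doing worse than a rule that requires no analysis at all.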

As for specific stocks, the consensus of the models is that VRT is the highest-conviction upside call. All eight models believe forward expectations are too low, and it carries the highest average consensus score. EL is the name where the models most widely share the concern that revenue and EPS expectations will come down.

My prediction based on years of watching the models: GPT or Grok is likely to be the top model in this first iteration of the IEB.

We plan on occasionally sharing more of the aggregate expectations from the models via our email list and on X. Subscribe for updates.

A True Measure

Elon is right. The truest measure of intelligence is the ability to predict the future.

Most AI benchmarks don’t test that. They test the ability of a model to solve a puzzle or retrieve information or some other rote task. We believe that makes most benchmarks inherently flawed because they’re solvable. Whether it’s GPT 6 or Mythos or some other model, eventually puzzles and retrieval tasks won’t measure anything because every model will be able to conquer them.

Markets are a different animal: a complex adaptive system where the answer keeps changing. The truest test of superintelligence would be when the earnings benchmark is solved because that would mean markets are solved. By the time AI solves markets, it can probably solve a lot more too.

See Intelligent Alpha’s Important Disclosures Page here.

Additionally, note our benchmark disclosures: Intelligent Alpha’s Intelligent Earnings Benchmark (IEB) is an analytical tool designed to evaluate and communicate the comparative performance of AI models on earnings prediction tasks for US listed large-cap companies defined as market capitalization over $10 billion at the time of testing. This benchmark is published for general information and educational purposes only. It does not constitute investment advice, a recommendation to buy or sell any security, or an offer or solicitation with respect to any investment product or service. The Benchmark compares AI model-generated earnings direction predictions against consensus earnings prediction changes across a defined universe of US listed large-cap companies. Results do not represent the performance of any investment portfolio, fund, or client account managed by Intelligent Alpha, and earnings prediction accuracy should not be construed as an indicator of investment returns. The effectiveness of AI models in predicting earnings is limited by access to accurate historical data, tool usage, prompt structure, consistency of harnesses used to control the environment, and other factors. Past benchmark performance is not indicative of future predictive accuracy. This benchmark and all related content do not create an investment advisory, client or fiduciary relationship. Intelligent Alpha’s advisory services are provided solely pursuant to a written investment advisory agreement. No person should rely on this benchmark as a substitute for individualized investment advice.
