Read enough LLM-in-finance papers and a pattern becomes visible. The abstract claims strong results. The experiment runs on S&P 500 or NASDAQ-100. The metric is directional accuracy. The conclusion is that LLMs show promise for market prediction. The limitations section is short.

A paper accepted at IEEE CAI 2026 breaks from that template. Zhang and Zhang — writing from Lumos Alpha, an actual fund — ran the same literature through a different set of questions. Not "does it work on the benchmark?" but "would this survive deployment?" The answer, section by section, is complicated in useful ways.

// Paper
A Review of Large Language Models for Stock Price Forecasting from a Hedge-Fund Perspective
Zhang, O. & Zhang, Z.  ·  Lumos Alpha  ·  arXiv:2605.05211  ·  Accepted IEEE CAI 2026  ·  arxiv.org/abs/2605.05211

Surveys LLM applications in stock price forecasting — sentiment extraction, earnings call analysis, price tokenisation, multi-agent trading systems — with critical assessment from a hedge fund perspective. Surfaces structural problems in academic evaluation methodology that practitioners encounter in deployment.

// Five use cases, different verdicts

The paper covers five main application areas. Each gets a different verdict once you apply fund-level criteria rather than benchmark accuracy.

Fig 1 — LLM use cases in finance: academic vs deployment readiness
Use case | Academic maturity | Deployment readiness
Sentiment from news | High | Prompt-fragile
Earnings call NLP | Medium | Promising
Price tokenisation | Early | Unclear value
Cross-asset relationships | Medium | Mixed
Multi-agent systems | Very early | Untested live
Author synthesis from Zhang & Zhang 2026 · arXiv:2605.05211

Sentiment extraction is the most researched application, and also the one with the most deployment problems. The core issue is fragility to prompt design. Studies in the review show that the same news item produces markedly different sentiment classifications depending on how the question is framed, and that the framing used in academic papers is often tuned to look good on held-out test sets rather than chosen to reflect production conditions.
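One rough way to put a number on that fragility is to classify the same headlines under several prompt framings and count how often the labels agree. The sketch below assumes the labels have already been collected; the function name `agreement_rate`, the prompt names, and the example labels are all hypothetical and do not come from the paper.

```python
def agreement_rate(labels_by_prompt):
    """Fraction of items that receive the same label under every prompt framing.

    labels_by_prompt maps a prompt name to a list of labels, one per news item,
    with items in the same order for every prompt.
    """
    runs = list(labels_by_prompt.values())
    if not runs or not runs[0]:
        raise ValueError("need at least one prompt with at least one label")
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("every prompt must label the same number of items")
    # zip(*runs) groups the labels for item 0, item 1, ... across prompts.
    unanimous = sum(1 for item_labels in zip(*runs) if len(set(item_labels)) == 1)
    return unanimous / n_items


# Hypothetical labels for four headlines under three framings (illustration only).
example_labels = {
    "direct_question": ["pos", "neg", "neg", "neutral"],
    "analyst_persona": ["pos", "neg", "neutral", "neutral"],
    "trader_persona": ["pos", "pos", "neg", "neutral"],
}
example_rate = agreement_rate(example_labels)
print(f"unanimous on {example_rate:.0%} of items")
```

A low agreement rate on production-style headlines would suggest that a reported accuracy figure depends heavily on the framing that happened to be chosen.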

Earnings call analysis looks more durable. When LLMs are used to track tone shifts over time — whether management language is becoming more hedged compared to prior quarters — rather than assign a one-off sentiment label, the signal appears more consistent. The text is long, specific, and changes at known intervals. It's a better match for what current LLMs are actually good at.
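The difference between a one-off label and a tracked shift can be sketched in a few lines. The example assumes some upstream model has already assigned each call a hedging score between 0 and 1; the scores, the threshold, and the function name `tone_shifts` are hypothetical, and this is not the method of any study in the review.

```python
def tone_shifts(scores, threshold=0.1):
    """Flag quarters whose hedging score moved by at least `threshold`
    relative to the previous quarter.

    scores is a chronologically ordered list of (quarter, score) pairs.
    Returns a list of (quarter, change_vs_prior_quarter) pairs.
    """
    flagged = []
    # Pair each quarter with the one immediately before it.
    for (_, previous), (quarter, current) in zip(scores, scores[1:]):
        delta = current - previous
        if abs(delta) >= threshold:
            flagged.append((quarter, round(delta, 4)))
    return flagged


# Hypothetical hedging scores for one company (illustration only).
example_scores = [
    ("2023Q1", 0.20),
    ("2023Q2", 0.22),
    ("2023Q3", 0.35),
    ("2023Q4", 0.33),
]
example_flags = tone_shifts(example_scores)
print(example_flags)
```

The signal here is the change relative to the company's own history, not the absolute level, which is why the same score can be unremarkable for one management team and notable for another.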

Price tokenisation — feeding raw price series directly into an LLM by encoding them as text tokens — is the application the paper is most sceptical of. The results exist, but the mechanism is unclear. Why would a model pretrained on language have any advantage over a model designed for numerical sequence modelling? The paper raises this without resolving it.

// The data leakage problem is worse than it looks

Most people who think about data leakage in ML finance think about look-ahead bias — using future information in training. That's real and the paper addresses it. But there's a second leakage problem specific to LLMs that is harder to detect and harder to fix.

LLMs are pretrained on internet-scale text corpora with cutoff dates that are often imprecise and rarely disclosed at the level of granularity you'd need for rigorous financial backtesting. A model tested on S&P 500 price movements from 2018–2023 may have seen news articles, analyst reports, and social media from that period during pretraining. That text does not directly encode prices, but it does encode which companies performed well, which sectors were rising, and which macro events were coming. The model may be pattern-matching on learned associations that amount to a form of look-ahead bias, one that is invisible in the standard experimental setup.

The paper flags this as a systematic problem with no clean solution in the current literature. You can hold out test periods and use proper splits — and you should — but you cannot easily know what a pretrained LLM has absorbed about those periods from its training data. The inflation in reported backtested performance may be substantial and is very difficult to quantify.
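One partial mitigation, offered here as an assumption rather than something the paper prescribes, is to evaluate only on text published after the model's stated cutoff plus a safety buffer. The sketch below shows that filter; the function name `filter_post_cutoff`, the 180-day buffer, and the example dates are all hypothetical. It does not solve the problem, because the stated cutoff may itself be wrong.

```python
from datetime import date, timedelta


def filter_post_cutoff(samples, stated_cutoff, buffer_days=180):
    """Keep only samples dated after the stated pretraining cutoff plus a buffer.

    samples is a list of (publication_date, text) pairs.
    The buffer is a guess that allows for imprecise cutoff disclosures.
    """
    earliest_allowed = stated_cutoff + timedelta(days=buffer_days)
    return [(day, text) for day, text in samples if day > earliest_allowed]


# Hypothetical cutoff and headlines (illustration only).
example_cutoff = date(2023, 4, 30)
example_samples = [
    (date(2022, 11, 1), "headline from inside the pretraining window"),
    (date(2023, 8, 15), "headline after the cutoff but inside the buffer"),
    (date(2024, 2, 1), "headline safely after cutoff plus buffer"),
]
example_clean = filter_post_cutoff(example_samples, example_cutoff)
print(len(example_clean), "of", len(example_samples), "samples retained")
```

The cost is obvious: the cleaner the test window, the shorter it is, which is part of why the inflation in published backtests is hard to measure.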

Fig 2 — Sentiment signal decay: predictive power vs days since news release
[Schematic: signal strength (low to high) against days since news release (0 to 20 days), with separate curves for negative and positive news. Actionable window roughly 0–2 days. Based on findings in Zhang & Zhang 2026 · not empirical data]

The signal decay finding is one of the paper's more actionable points. The signal from negative news decays more slowly than the signal from positive news: the market takes longer to fully price in bad information, so sentiment from negative articles stays predictive for a slightly longer window. Small-cap stocks show stronger and more persistent effects than large-caps, consistent with the information asymmetry story. But in both cases the window is short. Most of the predictive value sits in the first two days after release, so academic papers testing over monthly holding periods are mostly measuring noise.

// The metrics problem

Directional accuracy — did the model correctly predict whether the price went up or down — is what academic papers report. It's not what funds care about.

Funds care about Sharpe ratio. Max drawdown. Turnover. Capacity — whether the strategy can absorb enough capital to be worth running without the trades moving the market. Correlation with existing book positions. None of these appear in most published LLM-finance papers, and the paper documents this gap systematically. A model that's right 56% of the time on direction can still have a negative Sharpe after realistic transaction costs, particularly in small-cap names where the bid-ask spread alone can consume a significant fraction of the theoretical edge.
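The arithmetic behind that last claim is easy to check. The sketch below uses made-up numbers: a 56% hit rate, a symmetric 0.5% move per trade, and a 0.10% round-trip cost. None of these figures come from the paper, and the Sharpe calculation assumes a zero risk-free rate and one trade per trading day.

```python
from statistics import fmean, stdev


def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of per-period returns (risk-free rate assumed zero)."""
    return fmean(returns) / stdev(returns) * periods_per_year ** 0.5


# Hypothetical inputs (illustration only).
HIT_RATE = 0.56          # fraction of trades with the direction right
MOVE = 0.005             # absolute return per trade, win or lose (0.5%)
ROUND_TRIP_COST = 0.001  # spread plus fees per trade (0.10%)
N_TRADES = 100

wins = round(HIT_RATE * N_TRADES)
gross = [MOVE] * wins + [-MOVE] * (N_TRADES - wins)
net = [r - ROUND_TRIP_COST for r in gross]

print(f"gross mean per trade: {fmean(gross):.4%}")
print(f"net mean per trade:   {fmean(net):.4%}")
print(f"gross Sharpe: {sharpe(gross):.2f}   net Sharpe: {sharpe(net):.2f}")
```

With these inputs the gross edge is 0.56 × 0.5% − 0.44 × 0.5% = 0.06% per trade, which is smaller than the 0.10% cost, so the net mean and the net Sharpe both turn negative even though the directional accuracy is unchanged.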

Fig 3 — Metric usage: academic literature vs hedge fund deployment criteria
Academic papers: directional accuracy (always) · RMSE / MAE (often) · Sharpe (rare) · turnover (not reported) · capacity (not reported)
Fund deployment: Sharpe after costs (primary) · max drawdown · turnover × transaction costs · strategy capacity · correlation with existing book
Source: Zhang & Zhang 2026 · author synthesis

// What's actually worth building on

The paper's conclusion isn't that LLMs don't work in finance. It's that the places where they work are narrower than the literature suggests, and the places where they look good in papers often don't survive contact with production constraints.

The two applications that look most defensible are earnings call tone tracking — reading the same company's management language across multiple quarters for systematic shifts — and preprocessing pipelines that use LLMs for entity extraction, document classification, and information routing rather than prediction. These tasks are well-matched to what current models are actually good at. They don't require the model to have genuine market knowledge, just to be a capable reader of financial text.

Multi-agent trading systems — where several LLM agents coordinate to make trading decisions — are interesting architecturally but almost entirely untested under real market conditions. The controlled-environment results are good. The gap between controlled environments and live markets is where most quantitative strategies go wrong, and there's no evidence yet that LLM-based systems are an exception.

Worth reading. The honest assessment of a growing literature is useful, and the people doing the reviewing are close enough to deployment to know what questions matter.

