LSTM, Transformer, PatchTST, TimeMixer, N-BEATS. Different architectures, different training regimes, different benchmarks. One thing they all share: you feed in a history, you get a forecast out. One pass. The model doesn't revise, doesn't ask questions, doesn't notice when something looks off in the input.

Nobody calls this the "single-pass assumption" because nobody thinks of it as an assumption. It's just how forecasting works. You put data in, predictions come out.

A paper out of USTC in February 2026 looked at that and asked a question that only seems obvious once someone else has asked it: is that actually how good forecasters work?

// Paper
Cast-R1: Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting
Tao, Cheng, Jiang, Gao, Zhang, Liu  ·  USTC  ·  arXiv:2602.13802  ·  Feb 14, 2026  ·  arxiv.org/abs/2602.13802

Reformulates time series forecasting as a sequential decision-making problem. An agent maintains a running memory state, calls tools from a modular forecasting toolkit, and revises predictions iteratively. Trained via supervised fine-tuning on expert trajectories, then multi-turn RL.

// The observation at the centre of it

The paper's starting point is empirical rather than theoretical. When you watch a quant analyst actually produce a forecast — not use a model, but do the thinking — it doesn't look like inference. They examine the data, note anomalies, check for structural breaks, pick a method conditionally, run it, look at residuals, adjust. When new information surfaces, they update. The whole thing is sequential and evidence-accumulating.

None of the single-pass models listed above does any of that. They learn a fixed function from inputs to outputs during training, then apply it at inference with no ability to interrogate the input or revise based on intermediate results.
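To make "fixed function" concrete, here is a minimal sketch in Python. It is a toy drift model, not any of the architectures named above, and `fit_drift_model` is a name made up for illustration. Whatever is learned is frozen at fit time; at inference the same mapping is applied to whatever history arrives.

```python
from typing import Callable, List

Forecaster = Callable[[List[float], int], List[float]]


def fit_drift_model(train: List[float]) -> Forecaster:
    """Toy single-pass forecaster (hypothetical, for illustration only)."""
    # The only thing "learned": the average step size over the training series.
    drift = (train[-1] - train[0]) / (len(train) - 1)

    def forecast(history: List[float], horizon: int) -> List[float]:
        # One pass: extrapolate from the last value with the frozen drift.
        # Nothing here inspects `history` for anomalies or revises the output.
        return [history[-1] + drift * h for h in range(1, horizon + 1)]

    return forecast


model = fit_drift_model([1.0, 2.0, 3.0, 4.0, 5.0])
clean = model([10.0, 11.0, 12.0], 3)
broken = model([10.0, 11.0, 50.0], 2)  # level break in the last observation
```

The second call extrapolates from the jump to 50.0 with the same drift it learned in training. Nothing in the mapping can notice the break.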

The claim isn't subtle: "high-quality forecasting involves a series of interdependent decisions rather than a one-shot model inference." Cast-R1's bet is that if you train the decision process itself — not just the prediction — you get something that generalises better across conditions.

// What they built

The architecture has three parts that fit together tightly. A memory state that persists across decision steps — accumulating assessments of data quality, structural characterisation, intermediate model outputs — so nothing gets thrown away between steps. A modular toolkit the agent calls into: data quality check, statistical profiling (trend, seasonality, ACF), structural analysis, event summarisation, residual diagnostics, model invocation. The agent decides the order. And a two-stage training regime — SFT on expert forecasting trajectories first, then multi-turn RL where the agent optimises the full decision sequence against downstream accuracy.
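The three parts can be sketched in a few dozen lines of Python. This is a toy under loud assumptions: the tool names, the `Memory` fields, and `hand_written_policy` are all invented here, the tools are deliberately crude, and the policy is hard-coded where Cast-R1 learns it via SFT and RL. It shows only the shape of the loop: a policy reads memory, picks a tool, and the tool writes back to memory.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Memory:
    """Running state that persists across decision steps (hypothetical layout)."""
    series: List[float]
    horizon: int
    notes: Dict[str, Any] = field(default_factory=dict)  # accumulated tool outputs
    trace: List[str] = field(default_factory=list)       # order of tool calls
    forecast: Optional[List[float]] = None


def data_quality(mem: Memory) -> None:
    # Crude anomaly flag: points more than 3 standard deviations from the mean.
    mu, sd = mean(mem.series), pstdev(mem.series)
    mem.notes["outliers"] = [
        i for i, x in enumerate(mem.series) if sd > 0 and abs(x - mu) > 3 * sd
    ]


def profile(mem: Memory) -> None:
    # Crude trend estimate: the average first difference.
    diffs = [b - a for a, b in zip(mem.series, mem.series[1:])]
    mem.notes["trend"] = mean(diffs) if diffs else 0.0


def invoke_model(mem: Memory) -> None:
    # Stand-in "model": extrapolate the last value along the profiled trend.
    trend = mem.notes.get("trend", 0.0)
    last = mem.series[-1]
    mem.forecast = [last + trend * h for h in range(1, mem.horizon + 1)]


TOOLKIT: Dict[str, Callable[[Memory], None]] = {
    "data_quality": data_quality,
    "profile": profile,
    "invoke_model": invoke_model,
}


def hand_written_policy(mem: Memory) -> Optional[str]:
    """Stand-in for the learned policy: name the next tool, or None to stop."""
    if "outliers" not in mem.notes:
        return "data_quality"
    if "trend" not in mem.notes:
        return "profile"
    if mem.forecast is None:
        return "invoke_model"
    return None


def run_agent(series: List[float], horizon: int,
              policy: Callable[[Memory], Optional[str]] = hand_written_policy,
              max_steps: int = 10) -> Memory:
    mem = Memory(series=list(series), horizon=horizon)
    for _ in range(max_steps):
        name = policy(mem)
        if name is None:
            break
        TOOLKIT[name](mem)      # the tool writes its result into memory
        mem.trace.append(name)  # the call order is kept, not discarded
    return mem


mem = run_agent([1.0, 2.0, 3.0, 4.0], horizon=2)
```

Swapping `hand_written_policy` for a trained model that maps memory to the next tool name is the part the paper is actually about; everything else here is scaffolding.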

Fig 1: Forecasting paradigm comparison

Current paradigm: historical observations (T–N … T–1, fixed lookback) → model (single forward pass; no revision, no context check) → forecast (T+1 … T+H; take it or leave it). Fixed mapping. Regime change is invisible to it.

Cast-R1: memory state (accumulates across steps; nothing discarded between decisions), with steps: data quality (flag anomalies) → structural (trend, regime, ACF) → model select (conditional on above) → residual check (revise if needed) → revised forecast (evidence-grounded, auditable). The decision process is what's learned, not just the prediction.

The qualitatively interesting result isn't the accuracy number. It's that the agent, through RL, discovered orderings of tool calls that the researchers didn't specify. It learned to run residual diagnostics before invoking a second model. It learned when to skip the structural analysis module entirely because the data quality check had already flagged the series as unreliable. The policy it learned is interpretable, which is more than can be said for most ML systems applied to financial data.

// Why finance specifically is the harder case

The paper tests on standard benchmarks — ETTh1, ETTm1, Weather. These are not financial time series. They don't have fat tails, adversarial participants, regime switches, or the particular problem of knowing that your model's signals will eventually affect the thing you're trying to predict.

Non-stationarity matters more in financial series than in most other domains. A macro fund running an LSTM on yield curve dynamics is implicitly assuming the data-generating process is stable enough that one look at the history tells you what you need to know. It rarely is, and the periods when it isn't are exactly the ones where getting it wrong is expensive. The sequential, evidence-accumulating structure of Cast-R1 is at least architecturally capable of detecting that the current data doesn't look like the training distribution and adjusting accordingly. A single-pass model has no such mechanism.
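As a rough illustration of what "detecting that the current data doesn't look like the training distribution" could mean as a tool, here's a minimal Python sketch. It's a crude mean-shift z-score, not the paper's method. `shift_score` and the threshold of 4.0 are assumptions, and the test ignores autocorrelation and fat tails, both of which matter for real financial series.

```python
from statistics import mean, pstdev
from typing import List


def shift_score(reference: List[float], recent: List[float]) -> float:
    """How far the recent window's mean sits from the reference mean,
    in units of the standard error implied by the reference spread."""
    sd = pstdev(reference)
    gap = abs(mean(recent) - mean(reference))
    if sd == 0:
        # Degenerate reference: any gap at all counts as an unbounded shift.
        return 0.0 if gap == 0 else float("inf")
    standard_error = sd / len(recent) ** 0.5
    return gap / standard_error


def looks_out_of_distribution(reference: List[float], recent: List[float],
                              threshold: float = 4.0) -> bool:
    # The threshold is an illustrative assumption, not a calibrated value.
    return shift_score(reference, recent) > threshold


reference = [0.0, 1.0] * 50      # stand-in for the training-period history
same_regime = [0.0, 1.0] * 5     # recent window that matches the reference
new_regime = [3.0, 4.0] * 5      # recent window after a level shift
```

In an agentic loop, a check like this would write its verdict to memory so that later steps (model selection, revision) can condition on it. A single-pass model has nowhere to put that verdict.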

// Three things the paper doesn't address

The toolkit is fixed at training time. The agent learns which tools to call but can't acquire new ones at inference. In practice, you'll often want to pull in a data source that didn't exist when the system was trained — a new alternative data feed, a macro regime indicator, an event calendar. Dynamic tool registration isn't solved here, and it's the thing you'd need before deploying this in anything other than a closed environment.
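Mechanically, registering a tool at runtime is trivial, as the hypothetical Python sketch below shows. `ToolRegistry` and its methods are invented for illustration and are not taken from the paper. The unsolved part is what the sketch can't show: a policy trained by RL over a fixed toolkit has no learned reason to call a tool it never saw in training.

```python
from typing import Callable, Dict, List

ToolFn = Callable[[List[float]], object]


class ToolRegistry:
    """Hypothetical runtime tool registry (illustration only)."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}
        self._docs: Dict[str, str] = {}

    def register(self, name: str, fn: ToolFn, doc: str) -> None:
        # Refuse silent overwrites so a logged tool name stays unambiguous.
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = fn
        self._docs[name] = doc

    def describe(self) -> str:
        # Text a policy could be shown so it knows which tools exist.
        return "\n".join(f"{n}: {d}" for n, d in sorted(self._docs.items()))

    def call(self, name: str, series: List[float]) -> object:
        return self._tools[name](series)


registry = ToolRegistry()
registry.register("last_value", lambda s: s[-1],
                  "Return the most recent observation.")
# Added later, after "training": the registry accepts it without complaint.
registry.register("range", lambda s: max(s) - min(s),
                  "Return max minus min of the series.")
```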

The paper does not discuss latency. Multi-step agentic loops are slower than single-pass inference by construction: every decision step adds at least one more model or tool call. For weekly macro forecasting this probably doesn't matter much. For intraday signals or anything with short prediction horizons, it might disqualify the architecture before you get to accuracy.
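The arithmetic is simple enough to sketch in Python. The numbers below are placeholders chosen for illustration, not measurements from the paper or from any real system; plug in your own.

```python
def agent_latency_ms(steps: int, policy_ms: float, tool_ms: float) -> float:
    """Sequential loop: each step pays one policy call plus one tool call."""
    return steps * (policy_ms + tool_ms)


def fits_budget(latency_ms: float, budget_ms: float) -> bool:
    return latency_ms <= budget_ms


# Placeholder assumptions: 6 steps, 800 ms per policy call, 50 ms per tool call.
loop_ms = agent_latency_ms(steps=6, policy_ms=800.0, tool_ms=50.0)

weekly_ok = fits_budget(loop_ms, budget_ms=60_000.0)  # a one-minute budget
intraday_ok = fits_budget(loop_ms, budget_ms=100.0)   # a 100 ms budget
```

Under these made-up numbers the loop takes about five seconds: irrelevant for a weekly forecast, disqualifying for anything that has to respond in milliseconds.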

The audit trail is accidental, not designed. Because the agent makes sequential, logged decisions, you get something that looks like an explanation — which tool was called, what it returned, how the forecast changed. In a compliance context, this is genuinely valuable. But the paper doesn't frame it that way, which means the explainability property isn't validated or stress-tested. It's a byproduct. Whether it holds up when a regulator actually reads it is a different question.
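If you did want to design the audit trail rather than inherit it, the record per decision step might look something like this Python sketch. The `StepRecord` fields are an assumption about what a reviewer would want to see, not a format from the paper, and the example records are invented.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class StepRecord:
    """One logged decision step (hypothetical schema)."""
    step: int
    tool: str
    result_summary: str
    forecast_before: Optional[List[float]]
    forecast_after: Optional[List[float]]


def to_jsonl(records: List[StepRecord]) -> str:
    # One JSON object per line, with sorted keys so diffs between runs are stable.
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def steps_that_changed_forecast(records: List[StepRecord]) -> List[int]:
    # The question a reviewer asks first: which steps moved the number?
    return [r.step for r in records if r.forecast_before != r.forecast_after]


# Invented example trail, for illustration only.
records = [
    StepRecord(1, "data_quality", "no anomalies flagged", None, None),
    StepRecord(2, "invoke_model", "initial forecast produced", None, [5.0, 6.0]),
    StepRecord(3, "residual_check", "residuals biased; forecast revised",
               [5.0, 6.0], [5.5, 6.5]),
]
```

Validating a trail like this means checking that the logged reasons actually drove the changes, which is exactly the stress test the paper doesn't run.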

// The thing worth taking seriously

Thirty years of quant ML research has optimised model architectures — better attention, better positional encoding, better loss functions. Cast-R1 is making a different bet: that the bottleneck isn't the model, it's the process around the model. The way evidence is gathered, the order decisions are made, the mechanism for revision when intermediate outputs look wrong.

If that's right, the next 2% of benchmark improvement from a better transformer is worth less than it looks. And building something that reasons about how to forecast — not just what to forecast — is worth more than the current literature implies. That's a genuinely uncomfortable conclusion for a field that has spent decades on architecture search.

The Cast-R1 paper doesn't prove this at scale, and certainly not in finance. But it makes the argument clearly enough that it's hard to dismiss.

