Modeling Irregular Time Series: From Neural ODEs to Graph Networks and Foundation Models
A comparative review of 15 models spanning continuous-time dynamics, attention mechanisms, graph neural networks, foundation models, and traditional baselines
Summary
Irregular time series — where observations arrive at uneven intervals with missing values across variables — are the norm in healthcare, climate science, and IoT. This review maps 15 models across five families, tracing how the field has progressively attacked three core challenges: (1) representing continuous dynamics without discretizing time, (2) scaling to high-dimensional multivariate data with complex inter-variable dependencies, and (3) transferring knowledge across domains without task-specific retraining.
The story begins with continuous-time ODE models (2019–2022) that replaced fixed-step RNNs with differential equations, then shifts to attention/Transformer architectures (2021–2023) that learned temporal interpolation without sequential integration. Graph neural networks (2022–2024) added explicit relational structure between variables. Most recently, foundation models (2024–2025) brought large-scale pre-training to time series, promising zero-shot generalization. Throughout, traditional baselines like GRU-D remain surprisingly competitive, anchoring what 'good enough' looks like.
Researcher Notes
The central tension in this field is expressivity vs. computational cost. Latent ODE (2019) showed that Neural ODEs could elegantly handle irregular sampling by defining continuous latent trajectories — but numerical ODE solvers are slow. Every subsequent model can be understood as an attempt to preserve that continuous-time expressivity while reducing cost: GRU-ODE-Bayes kept the ODE but added familiar GRU gating; CRU used linear SDEs for closed-form solutions; Neural Flows eliminated the solver entirely by parameterizing solutions directly. Meanwhile, mTAN showed you could skip ODEs altogether and use learned attention kernels for temporal interpolation at 10–100× the speed.
The graph models represent a genuinely different inductive bias, not just an incremental improvement. RAINDROP, T-PatchGNN, and GraFITi all argue that irregular multivariate time series have relational structure that flat sequence models miss. When sensor A is observed but sensor B is not, the relationship between A and B matters for imputation and prediction. This is especially compelling in clinical settings where lab tests and vital signs have known physiological correlations.
Foundation models are the newest entrant but face a unique challenge with irregular data. MOMENT and Chronos were designed for regularly sampled time series and need adaptation for irregular inputs. MIRA (Microsoft, 2025) is the first foundation model to tackle irregular medical time series head-on, using Neural ODE-based extrapolation and continuous-time positional encoding. Whether domain-specific foundation models like MIRA will outperform general-purpose ones like Chronos remains an open question — but the architectural innovations (CT-RoPE, frequency-specific MoE) suggest the irregular time series community's decade of work on continuous-time representations is finally feeding back into the foundation model paradigm.
GRU-D (2016/2018) remains the baseline that refuses to die. Its trainable decay mechanism is simple and interpretable, and many papers still struggle to beat it convincingly on standard clinical benchmarks. This is a healthy reminder that architectural complexity must earn its keep against well-tuned simple models.
Continuous-Time Dynamics: Learning to Evolve Between Observations
The irregular time series problem has a natural mathematical framing: if observations arrive at arbitrary times, why not model the latent state as a continuous trajectory? This is exactly what Neural ODE-based models do.
Latent ODE (Rubanova et al., NeurIPS 2019) was the breakthrough. It places a Neural ODE inside a variational autoencoder: an ODE-RNN encoder reads observations at their actual timestamps to produce a latent distribution, then a Neural ODE decoder evolves the latent state forward continuously. The trajectory is defined by a neural network parameterizing dz/dt, with numerical solvers (e.g., Runge-Kutta) performing integration. This was the first model to show that continuous latent dynamics could handle irregular sampling natively, without imputation or discretization.
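The solver loop this paragraph describes can be sketched in a few lines. Here a toy tanh layer stands in for the learned dynamics network; `neural_dzdt`, `evolve_latent`, and all parameters are illustrative, not from the paper:

```python
import numpy as np

def neural_dzdt(z, t, W, b):
    # A tiny stand-in for the learned dynamics f_theta(z, t):
    # one tanh layer; Latent ODE uses a full neural network here.
    return np.tanh(W @ z + b)

def rk4_step(f, z, t, dt, *args):
    # One classical Runge-Kutta step -- the kind of numerical solver
    # used to integrate the latent trajectory.
    k1 = f(z, t, *args)
    k2 = f(z + 0.5 * dt * k1, t + 0.5 * dt, *args)
    k3 = f(z + 0.5 * dt * k2, t + 0.5 * dt, *args)
    k4 = f(z + dt * k3, t + dt, *args)
    return z + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def evolve_latent(z0, timestamps, W, b, steps_per_interval=10):
    # Evolve the latent state to each (irregular) observation time.
    # Cost grows with the number of solver steps -- the bottleneck
    # that later models attack.
    states = [z0]
    z, t = z0, timestamps[0]
    for t_next in timestamps[1:]:
        dt = (t_next - t) / steps_per_interval
        for _ in range(steps_per_interval):
            z = rk4_step(neural_dzdt, z, t, dt, W, b)
            t += dt
        states.append(z)
    return np.stack(states)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 4)) * 0.1, rng.normal(size=4) * 0.1
z0 = rng.normal(size=4)
traj = evolve_latent(z0, [0.0, 0.3, 1.1, 1.2], W, b)  # irregular timestamps
# traj has shape (4, 4): one latent state per observation time
```

Note how irregular gaps cost nothing special: the solver simply integrates over whatever interval separates consecutive observations.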
Published concurrently, GRU-ODE-Bayes (De Brouwer et al., NeurIPS 2019) took a different architectural path. Instead of a VAE framework, it defined a continuous-time GRU where the hidden state evolves according to GRU-style ODEs between observations, and a separate Bayesian update network incorporates each new observation. This single-pass filtering design is more natural for online/streaming scenarios (healthcare monitoring, real-time tracking) where you update beliefs incrementally rather than encoding the entire sequence first.
Both models share a critical bottleneck: numerical ODE solvers are slow. Each forward pass requires multiple integration steps, and backpropagation through the solver (via adjoint methods) adds further cost. Two subsequent models attacked this directly.
CRU (Schirmer et al., ICML 2022) restricts the latent dynamics to linear stochastic differential equations, which admit closed-form solutions via the continuous-discrete Kalman filter. The payoff is exact integration (no solver), principled uncertainty quantification (the Kalman filter naturally propagates variance), and faster training. The trade-off is reduced expressivity in the dynamics — though the authors show this matters less than expected on standard benchmarks.
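The closed-form transition can be sketched for the simplest case, a diagonal linear SDE. The helper names and the full-state observation model are simplifying assumptions for illustration; CRU's actual parameterization is richer:

```python
import numpy as np

def closed_form_transition(mu, P, a, q, dt):
    # Diagonal linear SDE dz = diag(a) z dt + noise: the mean and covariance
    # at t + dt are available analytically, so no ODE solver is needed.
    F = np.diag(np.exp(a * dt))                           # exact transition
    Qd = np.diag(q * (np.exp(2 * a * dt) - 1) / (2 * a))  # integrated noise
    return F @ mu, F @ P @ F.T + Qd

def kalman_update(mu, P, y, r):
    # Continuous-discrete Kalman filter: exact predict between observations,
    # standard Gaussian update when one arrives. Observes the full state
    # (H = I) for simplicity.
    n = len(mu)
    S = P + r * np.eye(n)
    K = P @ np.linalg.inv(S)
    return mu + K @ (y - mu), (np.eye(n) - K) @ P

mu, P = np.zeros(2), np.eye(2)
a, q = np.array([-0.5, -1.0]), np.array([0.1, 0.1])
# Irregular gaps are handled exactly: just plug the actual dt in.
mu, P = closed_form_transition(mu, P, a, q, dt=0.7)
mu, P = kalman_update(mu, P, y=np.array([1.0, -1.0]), r=0.2)
# P now carries propagated and updated uncertainty -- the 'principled
# uncertainty quantification' the paragraph refers to.
```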
Neural Flows (Biloš et al., NeurIPS 2021) take the opposite approach: keep nonlinear dynamics but eliminate the solver by directly parameterizing the solution map z(t) rather than the derivative dz/dt. Given an initial condition z(t₀), the network outputs z(t) in a single forward pass. The key insight is defining mathematical constraints (identity at t₀, invertibility) that guarantee the output is a valid ODE solution. This gives O(1) cost per evaluation versus O(N) for numerical integration, while retaining the expressivity of nonlinear dynamics.
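A toy illustration of the idea, assuming one common construction in which a `tanh(t)` factor enforces the identity constraint at t₀ = 0. The real model also enforces invertibility, which this sketch omits:

```python
import numpy as np

def flow(z0, t, W, b):
    # Direct parameterization of the solution map z(t): the tanh(t) factor
    # vanishes at t = 0, guaranteeing z(0) = z0 exactly, and z(t) for any
    # t costs a single forward pass -- no solver loop.
    return z0 + np.tanh(t) * np.tanh(W @ z0 + b)

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 3)) * 0.1, np.zeros(3)
z0 = rng.normal(size=3)

assert np.allclose(flow(z0, 0.0, W, b), z0)  # identity at t0 by construction
z_at = {t: flow(z0, t, W, b) for t in [0.2, 1.5, 7.0]}  # O(1) per query time
```

Contrast with the solver-based models: querying the state at three arbitrary times here is three independent forward passes, not an integration from t₀ to each target.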
Latent ODE
Yulia Rubanova, Ricky T. Q. Chen, David Duvenaud — University of Toronto
NeurIPS 2019
Latent ODEs for Irregularly-Sampled Time Series
Places a Neural ODE inside a VAE — an ODE-RNN encoder reads irregular observations, and a Neural ODE decoder evolves the latent state continuously. Defines latent trajectories via learned dz/dt, integrated numerically.
Key Innovation
First model to demonstrate that continuous-time latent dynamics could handle irregular sampling natively without imputation
Limitations
- Expensive numerical ODE integration at every forward pass
- Adjoint backpropagation adds further computational overhead
- VAE framework requires encoding the entire sequence before decoding
GRU-ODE-Bayes
Edward De Brouwer, Jaak Simm, Adam Arany, Yves Moreau — KU Leuven
NeurIPS 2019
GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series
Continuous-time GRU where hidden state evolves via GRU-style ODEs between observations, with a Bayesian update network that incorporates each new observation. Single-pass filtering model — no encoder-decoder split.
Key Innovation
Retains familiar GRU gating in continuous time with explicit Bayesian observation updates, making it natural for online/streaming filtering
Limitations
- Still requires numerical ODE solver between observations
- Bayesian update network adds complexity
- Less expressive dynamics than fully nonlinear Neural ODE
CRU (Continuous Recurrent Units)
Mona Schirmer, Mazin Eltayeb, Stefan Lessmann, Maja Rudolph — Humboldt University of Berlin, Bosch Center for AI
ICML 2022
Modeling Irregular Time Series with Continuous Recurrent Units
Models latent dynamics as a linear SDE with closed-form solutions via the continuous-discrete Kalman filter. No numerical ODE solver needed — transitions are computed analytically.
Key Innovation
Closed-form state transitions via Kalman filter give exact integration, principled uncertainty propagation, and faster training than nonlinear ODE models
Limitations
- Linear dynamics trade expressivity for tractability
- May underfit highly nonlinear systems
- Kalman filter assumes Gaussian noise
Neural Flows
Marin Biloš, Johanna Sommer, Syama Sundar Rangapuram, Tim Januschowski, Stephan Günnemann — TU Munich, Amazon Research
NeurIPS 2021
Neural Flows: Efficient Alternative to Neural ODEs
Instead of parameterizing dz/dt and numerically solving, directly parameterizes the solution map z(t) in a single forward pass. Defines mathematical constraints ensuring the output is a valid ODE solution.
Key Innovation
Eliminates the ODE solver entirely — O(1) cost per evaluation vs. O(N) for numerical integration, while retaining nonlinear expressivity
Limitations
- Mathematical constraints on the flow limit architectural flexibility
- Less interpretable dynamics (no explicit derivative)
- Newer and less battle-tested than Latent ODE
Comparison
| Model | Year | Dynamics | Integration Cost | Observation Handling |
|---|---|---|---|---|
| Latent ODE | 2019 | Nonlinear Neural ODE | O(N) numerical solver | VAE encoder-decoder |
| GRU-ODE-Bayes | 2019 | Continuous-time GRU ODE | O(N) numerical solver | Bayesian update network |
| Neural Flows | 2021 | Nonlinear direct solution | O(1) single forward pass | Pluggable (GRU Flow, etc.) |
| CRU | 2022 | Linear SDE | O(1) closed-form Kalman | Kalman filter update |
Attention and Transformers: Learning Temporal Kernels Without ODEs
A parallel research thread asked: do we actually need differential equations to handle irregular time? Attention mechanisms offer an alternative — they can naturally weight observations by learned relevance, regardless of when those observations occurred.
mTAN (Shukla & Marlin, ICLR 2021) was the pivotal model here. It learns a continuous-time embedding of timestamps using parametric functions (sines/cosines at learnable frequencies), then uses multi-head attention where queries are desired reference time points, keys are actual observation times, and values are the measurements. This produces smooth interpolations of the irregular data into a fixed-length representation that can be fed into any downstream model. The key result: mTAN runs 10–100× faster than Neural ODE baselines while achieving comparable or better performance, because attention is parallelizable while ODE solvers are inherently sequential.
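The mechanism can be sketched with fixed sinusoidal frequencies standing in for mTAN's learned ones (a single head, no learned projections):

```python
import numpy as np

def time_embed(t, freqs):
    # Continuous-time embedding: sines/cosines at fixed frequencies
    # (mTAN learns these), so any real-valued timestamp maps to a vector.
    t = np.atleast_1d(t)[:, None]
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)], axis=-1)

def attend(ref_times, obs_times, obs_values, freqs):
    # Queries = reference times, keys = observation times,
    # values = measurements.
    Q, K = time_embed(ref_times, freqs), time_embed(obs_times, freqs)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over observations
    return w @ obs_values                # interpolation onto the reference grid

freqs = np.array([0.5, 1.0, 2.0, 4.0])
obs_t = np.array([0.1, 0.9, 2.3, 2.4])  # irregular observation times
obs_v = np.array([1.0, 0.0, -1.0, -0.9])
ref_t = np.linspace(0, 2.5, 6)          # fixed-length reference grid
interp = attend(ref_t, obs_t, obs_v, freqs)
# interp has shape (6,): one interpolated value per reference time
```

Because the softmax weights form a convex combination, each interpolated value stays within the range of the observed measurements, and all reference times are computed in parallel, which is where the speedup over sequential ODE solving comes from.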
STraTS (Tipirneni & Reddy, ACM TKDD 2022) made a radical representational choice. Instead of treating a multivariate time series as a matrix (features × time steps) — which forces you to deal with missing cells — STraTS represents it as a set of observation triplets (time, variable-id, value). Each triplet becomes one token, processed by a standard Transformer encoder. This means the model never sees 'empty' cells; sparsity is handled by construction. Two technical innovations make this work: Continuous Value Embedding (CVE), a small neural network that maps scalar values (timestamps and measurements) to dense vectors without discretization, and a self-supervised forecasting pretext task for pre-training on unlabeled clinical data.
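The triplet representation is easy to see in code. This toy converter (a hypothetical helper, not the paper's code) shows how empty cells vanish by construction:

```python
import numpy as np

def to_triplets(grid, times, variables):
    # Convert a (features x time) matrix with NaN gaps into STraTS-style
    # (time, variable-id, value) triplets: an empty cell simply produces
    # no token, so the model never sees missingness.
    return [(times[j], variables[i], grid[i, j])
            for i in range(grid.shape[0])
            for j in range(grid.shape[1])
            if not np.isnan(grid[i, j])]

times = [0.0, 1.5, 4.2]
variables = ["heart_rate", "lactate"]
grid = np.array([[72.0, np.nan, 88.0],    # heart_rate at t=0.0 and t=4.2
                 [np.nan, 2.1, np.nan]])  # lactate only at t=1.5
tokens = to_triplets(grid, times, variables)
# -> [(0.0, 'heart_rate', 72.0), (4.2, 'heart_rate', 88.0),
#     (1.5, 'lactate', 2.1)]
```

Each triplet then becomes one Transformer token, with CVE embedding the two continuous scalars (timestamp and value) into dense vectors.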
ContiFormer (Chen et al., NeurIPS 2023) unified the ODE and Transformer threads. It defines attention in continuous time: latent states evolve via ODEs between observations, and the attention mechanism operates over these continuous trajectories rather than discrete tokens. Both queries and keys are functions of time, not fixed vectors. The authors prove that several existing architectures (including designs resembling mTAN) are special cases of ContiFormer under specific function hypotheses. A reparameterization trick enables parallel computation of continuous-time attention scores, avoiding the sequential bottleneck that plagues standard Neural ODE approaches.
The evolution here is clear: mTAN showed attention could replace ODEs for temporal modeling; STraTS showed you could eliminate the time-series matrix representation entirely; ContiFormer showed you could have both ODE dynamics and attention in a single principled framework.
mTAN (Multi-Time Attention Network)
Satya Narayan Shukla, Benjamin M. Marlin — University of Massachusetts Amherst
ICLR 2021
Multi-Time Attention Networks for Irregularly Sampled Time Series
Learns continuous-time embeddings of timestamps and uses multi-head attention to interpolate irregular observations into fixed-length representations. Queries are reference times, keys are observation times, values are measurements.
Key Innovation
Replaces ODE-based temporal modeling with learned attention kernels — 10–100× faster than Neural ODE baselines
Limitations
- Interpolation-first approach may lose fine-grained temporal patterns
- No explicit modeling of continuous dynamics between observations
- Attention complexity scales quadratically with observation count
STraTS (Self-supervised Transformer for Time-Series)
Sindhu Tipirneni, Chandan K. Reddy — Virginia Tech
ACM TKDD 2022
Self-Supervised Transformer for Sparse and Irregularly Sampled Multivariate Clinical Time-Series
Represents irregular multivariate time series as a set of (time, variable-id, value) triplets, each embedded via Continuous Value Embedding (CVE). A Transformer encoder processes the set directly — no imputation or grid alignment needed.
Key Innovation
Triplet representation eliminates empty cells by construction; CVE embeds continuous scalars without discretization; self-supervised pretraining addresses label scarcity
Limitations
- Sequence length grows with number of observations (no patching/compression)
- Designed for clinical data — less tested on other domains
- No continuous dynamics modeling between observations
ContiFormer
Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, Dongsheng Li — Microsoft Research
NeurIPS 2023
ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling
Unifies Neural ODEs and Transformers by defining attention in continuous time. Latent states evolve via ODEs between observations; attention queries and keys are continuous functions of time. Subsumes several existing architectures as special cases.
Key Innovation
Continuous-time attention mechanism with reparameterization trick for parallel computation — combines ODE expressivity with Transformer scalability
Limitations
- More complex to implement and tune than pure attention or pure ODE models
- Reparameterization approximations may introduce errors
- Newer model with less community adoption
Comparison
| Model | Year | Time Handling | Attention Role | ODE Component | Pre-training |
|---|---|---|---|---|---|
| mTAN | 2021 | Learned continuous-time embedding | Interpolation to reference grid | None (compatible with ODE-RNN downstream) | None |
| STraTS | 2022 | CVE of timestamps | Direct processing of observation triplets | None | Self-supervised forecasting |
| ContiFormer | 2023 | ODE-governed continuous trajectories | Continuous-time attention over ODE states | Core component | None |
Graph Neural Networks: Adding Relational Structure to Irregular Observations
Sequence models (whether ODE-based or attention-based) process irregular time series as flat sequences of observations. But multivariate time series have relational structure: sensors are physically connected, lab tests share physiological pathways, and IoT devices form spatial networks. Graph neural networks encode this structure explicitly.
RAINDROP (Zhang et al., ICLR 2022) was the first major GNN approach for irregular multivariate time series. It represents each sample as a separate sensor graph where nodes are variables and edges encode learned inter-sensor dependencies. The key mechanism: when an observation arrives at one sensor (a 'raindrop'), it sends messages to neighboring sensors, creating 'ripples'. This means the graph dynamically reflects which sensors have observations at any given time — sparsity becomes a structural property of the graph rather than a problem to solve. Temporal self-attention captures within-sensor dynamics, while the GNN message-passing captures cross-sensor dependencies.
GraFITi (Yalavarthi et al., AAAI 2024) made a bold reformulation. Instead of the usual 'inputs → hidden states → predictions' pipeline, GraFITi casts forecasting as edge weight prediction on a bipartite graph. One set of nodes represents observed time-variable pairs; the other represents target time-variable pairs. A GNN adapted from Graph Attention Networks learns embeddings and predicts edge weights that yield forecasted values. This sidesteps ODE solvers and imputation entirely — sparsity is a natural graph property — and runs up to 5× faster than ODE-based methods with 17% better accuracy.
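A sketch of the bipartite framing, under the simplifying assumption that each target connects only to same-variable observations (the paper learns richer connectivity with attention):

```python
def bipartite_edges(observed, targets):
    # GraFITi's framing, simplified: one node per observed (time, variable)
    # pair, one node per target pair; forecasting becomes predicting the
    # weight (value) of each target edge. Unobserved cells contribute no
    # node, so sparsity needs no special handling.
    obs_nodes = [(t, v) for (t, v, _) in observed]
    tgt_nodes = list(targets)
    # Connect each target to the observation history of the same variable
    # (one simple connectivity choice for illustration).
    edges = [(i, j) for i, (_, v_o) in enumerate(obs_nodes)
                    for j, (_, v_t) in enumerate(tgt_nodes) if v_o == v_t]
    return obs_nodes, tgt_nodes, edges

observed = [(0.1, "hr", 72.0), (0.9, "hr", 75.0), (0.5, "bp", 118.0)]
targets = [(2.0, "hr"), (2.0, "bp")]
obs_nodes, tgt_nodes, edges = bipartite_edges(observed, targets)
# -> edges [(0, 0), (1, 0), (2, 1)]: each target linked to its
#    same-variable history
```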
T-PatchGNN (Zhang et al., ICML 2024) bridged the patching paradigm (popularized by PatchTST for regular time series) with graph neural networks. It divides each univariate irregular series into 'transformable patches' — segments covering a uniform time horizon but containing a variable number of observations. A Transformer handles intra-series (temporal) modeling within patches, while time-adaptive GNNs model inter-series (cross-variable) correlations through learned time-varying graph structures. The patching avoids the sequence-length explosion of processing every observation individually.
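A toy version of the patching step (the helper and patch policy are illustrative; the paper's patches also carry masks and learned embeddings):

```python
def make_patches(times, values, horizon):
    # 'Transformable patches': each patch spans a uniform time horizon but
    # may contain any number of observations, so irregular series avoid the
    # sequence-length explosion of one token per observation.
    n_patches = int(max(times) // horizon) + 1
    patches = [[] for _ in range(n_patches)]
    for t, v in zip(times, values):
        patches[int(t // horizon)].append((t, v))
    return patches

times = [0.2, 0.3, 2.7, 5.1, 5.2, 5.9]   # irregular timestamps
values = [1.0, 1.1, 0.4, -0.2, -0.3, 0.0]
patches = make_patches(times, values, horizon=2.0)
# -> 3 patches covering [0,2), [2,4), [4,6) with 2, 1, and 3 observations
```

Downstream, the Transformer attends within each patch while the time-adaptive GNN connects patches across variables.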
These three models share a conviction: irregular multivariate time series are inherently graph-structured, and making this structure explicit improves both accuracy and interpretability. They differ in what the graph represents (sensor networks vs. observation-target bipartite graphs vs. patch-level adaptive graphs) and how the graph is constructed (learned vs. task-derived).
RAINDROP
Xiang Zhang, Marko Zeman, Theodoros Tsiligkaridis, Marinka Zitnik — Harvard University, MIT Lincoln Lab
ICLR 2022
Graph-Guided Network for Irregularly Sampled Multivariate Time Series
Represents each sample as a sensor graph with learned inter-sensor edges. When an observation arrives ('raindrop'), it sends messages to neighboring sensors ('ripples'). Temporal self-attention handles within-sensor dynamics.
Key Innovation
Dynamic graph structure that adapts to which sensors are observed at each time, making sparsity a structural property rather than a problem
Limitations
- Classification-focused — less tested for forecasting
- Graph learning adds computational overhead for many sensors
- Sample-level graphs don't share structure across the dataset
T-PatchGNN
Weijia Zhang, Chenlong Yin, Hao Liu, Xiaofang Zhou, Hui Xiong — HKUST, Rutgers University
ICML 2024
Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach
Divides irregular series into 'transformable patches' (uniform time horizon, variable observation count). Transformer for intra-series temporal modeling, time-adaptive GNN for inter-series cross-variable correlations.
Key Innovation
Transformable patching avoids sequence-length explosion while preserving irregular structure; time-adaptive graph captures evolving variable relationships
Limitations
- Patch boundaries may split meaningful temporal patterns
- Dual architecture (Transformer + GNN) increases model complexity
- Patch size is a hyperparameter that affects performance
GraFITi
Vijaya Krishna Yalavarthi, Kiran Madhusudhanan, Randolf Scholz, Nourhan Ahmed, Johannes Burchert, Shayan Jawed, Stefan Born, Lars Schmidt-Thieme — University of Hildesheim
AAAI 2024
GraFITi: Graphs for Forecasting Irregularly Sampled Time Series
Reformulates forecasting as edge weight prediction on a bipartite graph — observed (time, variable) pairs on one side, target pairs on the other. A GNN adapted from GAT learns embeddings and predicts edge weights as forecasted values.
Key Innovation
Casts prediction as graph edge weight estimation — sidesteps ODE solvers and imputation entirely; up to 5× faster than ODE-based methods
Limitations
- Bipartite graph grows with number of observation-target pairs
- Edge weight prediction is an unusual framing that may be harder to interpret
- Less explored for classification tasks
Comparison
| Model | Year | Graph Structure | Temporal Component | Primary Task | Handles Sparsity Via |
|---|---|---|---|---|---|
| RAINDROP | 2022 | Sensor-level (nodes = variables) | Temporal self-attention | Classification | Dynamic graph adapts to observed sensors |
| GraFITi | 2024 | Bipartite (observed ↔ target pairs) | GAT message passing | Forecasting | Sparsity is natural graph property |
| T-PatchGNN | 2024 | Time-adaptive inter-series graph | Transformer on patches | Forecasting | Transformable patching |
Foundation Models: Pre-trained Generalists Meet Irregular Data
The foundation model wave has reached time series, promising zero-shot generalization across domains. But irregular sampling poses a unique challenge: most foundation models assume regular grids.
MOMENT (Goswami et al., ICML 2024) is a family of open-source, pre-trained Transformer models for general-purpose time series analysis. Pre-trained on the 'Time Series Pile' — a large, diverse collection of public time series data — MOMENT handles five tasks: short/long-horizon forecasting, classification, anomaly detection, and imputation. It addresses time series-specific challenges (varying lengths, multi-channel inputs, distribution shifts) through systematic design choices. However, it was primarily designed for regularly sampled data and requires adaptation for irregular inputs.
Chronos (Ansari et al., Amazon, TMLR 2024) made the provocative choice to treat time series as language. It tokenizes real-valued observations into a fixed discrete vocabulary via scaling and quantization, then trains T5-family transformers (20M–710M parameters) with standard cross-entropy loss. Training data is augmented with synthetic time series from Gaussian processes. The result: strong zero-shot probabilistic forecasting on 42 benchmarks. Like MOMENT, Chronos assumes regular sampling but its language-model approach means irregular data can potentially be handled through special tokens or positional encodings.
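The scaling-and-quantization step can be sketched as follows; the bin layout and scale choice here are simplifications of the paper's mean-scaling scheme:

```python
import numpy as np

def tokenize(series, n_bins=100, lo=-3.0, hi=3.0):
    # Chronos-style preprocessing, simplified: scale by the mean absolute
    # value, then quantize into a fixed vocabulary of bin ids that a
    # language model can predict with cross-entropy loss.
    scale = np.mean(np.abs(series)) or 1.0
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]  # n_bins-1 interior edges
    return np.digitize(series / scale, edges), scale

def detokenize(tokens, scale, n_bins=100, lo=-3.0, hi=3.0):
    # Map each token back to its bin center, then undo the scaling.
    width = (hi - lo) / n_bins
    return (lo + (tokens + 0.5) * width) * scale

series = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
tokens, scale = tokenize(series)
recon = detokenize(tokens, scale)
# Quantization is lossy, but reconstruction error stays within
# half a bin width (in scaled units).
```

The information loss from quantization, noted in the Limitations below, is visible here: `recon` approximates but never exactly recovers `series`.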
MIRA (Li et al., Microsoft, 2025) is the first foundation model built specifically for irregular medical time series. Three architectural innovations address this: Continuous-Time Rotary Positional Encoding (CT-RoPE) handles variable time intervals (extending the standard RoPE used in LLMs to continuous time); a frequency-specific mixture-of-experts layer routes computation across latent frequency regimes; and a Continuous Dynamics Extrapolation Block based on Neural ODEs enables forecasting at arbitrary timestamps. Pre-trained on 454+ billion medical time points, MIRA achieves 7–10% error reductions over baselines.
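The CT-RoPE idea can be illustrated by replacing standard RoPE's integer position index with a real-valued timestamp. This is a sketch under that assumption; MIRA's exact formulation may differ:

```python
import numpy as np

def ct_rope(x, t, base=10000.0):
    # Rotary positional encoding with a real-valued timestamp t: each
    # dimension pair is rotated by angle t * theta_i, where standard RoPE
    # would use an integer position instead.
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) / (d // 2))
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(2)
q = rng.normal(size=8)
# Rotation preserves norms, and query-key dot products depend only on the
# elapsed time between the two timestamps -- the property that makes RoPE
# attractive for variable intervals.
assert np.isclose(np.linalg.norm(ct_rope(q, 3.71)), np.linalg.norm(q))
```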
The progression from MOMENT/Chronos to MIRA mirrors the broader field's evolution: general-purpose architectures work well for regular data, but irregular time series ultimately demand the continuous-time machinery that the specialized model community has been developing since 2019. MIRA's CT-RoPE and Neural ODE blocks are direct descendants of the ideas in Latent ODE and mTAN.
MOMENT
Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, Artur Dubrawski — Carnegie Mellon University, University of Pennsylvania
ICML 2024
MOMENT: A Family of Open Time-series Foundation Models
Open-source pre-trained Transformer family for general-purpose time series. Pre-trained on the 'Time Series Pile' across five tasks: short- and long-horizon forecasting, classification, anomaly detection, and imputation.
Key Innovation
Multi-task, multi-domain pre-training on a curated 'Time Series Pile' — first open-source general-purpose time series foundation model
Limitations
- Designed for regular sampling — needs adaptation for irregular data
- Transformer architecture may struggle with very long sequences
- Pre-training data diversity may not cover all domains
Chronos
Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, et al. — Amazon
TMLR 2024
Chronos: Learning the Language of Time Series
Tokenizes time series values into a discrete vocabulary, then trains T5-family transformers (20M–710M params) with cross-entropy loss. Augmented with synthetic Gaussian process data for robust zero-shot probabilistic forecasting.
Key Innovation
Language-model paradigm for time series — tokenization + cross-entropy loss enables direct reuse of LLM training infrastructure
Limitations
- Tokenization quantizes continuous values — information loss
- Assumes regular sampling
- Forecasting-only — not multi-task like MOMENT
MIRA
Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, et al. — Microsoft Research, University of Manchester, Peking University, Tsinghua University
arXiv preprint 2025
MIRA: Medical Time Series Foundation Model for Real-World Health Data
First foundation model for irregular medical time series. Uses CT-RoPE for variable time intervals, frequency-specific MoE for multi-scale dynamics, and Neural ODE-based extrapolation for arbitrary-timestamp forecasting. Pre-trained on 454B+ medical time points.
Key Innovation
Brings continuous-time representations (CT-RoPE, Neural ODE extrapolation) into the foundation model paradigm — directly builds on a decade of irregular time series research
Limitations
- Domain-specific (medical) — transfer to other domains untested
- Large model may be impractical for edge deployment
- Pre-print status — not yet peer-reviewed at a major venue
Comparison
| Model | Year | Pre-training Scale | Irregular Sampling | Domain | Architecture |
|---|---|---|---|---|---|
| MOMENT | 2024 | Time Series Pile (diverse) | Requires adaptation | General (5 tasks) | Transformer |
| Chronos | 2024 | Public + synthetic GP data | Requires adaptation | General (forecasting) | T5 with tokenization |
| MIRA | 2025 | 454B+ medical time points | Native (CT-RoPE + Neural ODE) | Medical | Transformer + MoE + Neural ODE |
Traditional Baselines: The Lower Bound That Keeps Punching Up
Every new model must answer: does it beat these simple approaches, and by how much?
GRU-D (Che et al., Scientific Reports 2018) extends the standard GRU with trainable decay mechanisms for both input values and hidden states. When a variable hasn't been observed for a while, its last value decays toward the empirical mean at a learned rate, and the hidden state similarly decays. Binary masking indicators and time-since-last-observation intervals are fed directly into the recurrence. Despite its simplicity, GRU-D remains one of the most competitive baselines for clinical time series — many architecturally complex models struggle to beat it convincingly on standard benchmarks like PhysioNet and MIMIC.
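The input-decay mechanism in code (a sketch with hand-picked weights; GRU-D learns `w` and `b` jointly with the recurrence, and applies an analogous decay to the hidden state):

```python
import numpy as np

def decayed_input(x_last, x_mean, delta, w, b):
    # GRU-D input decay: the longer a variable has gone unobserved (delta),
    # the more its last value is pulled toward the empirical mean, at a
    # learned per-variable rate (w, b).
    gamma = np.exp(-np.maximum(0.0, w * delta + b))
    return gamma * x_last + (1.0 - gamma) * x_mean

x_last = np.array([120.0, 80.0])   # last observed systolic BP, heart rate
x_mean = np.array([110.0, 72.0])   # training-set means
w, b = np.array([0.5, 0.5]), np.array([0.0, 0.0])

fresh = decayed_input(x_last, x_mean, delta=np.array([0.0, 0.0]), w=w, b=b)
stale = decayed_input(x_last, x_mean, delta=np.array([10.0, 10.0]), w=w, b=b)
# fresh equals the last observed values; stale has decayed almost to the means
```

The interpretability the paragraph mentions is visible here: each variable's learned decay rate directly states how quickly its last observation stops being trustworthy.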
TCN (Bai et al., 2018) established that temporal convolutions could rival or beat recurrent networks for sequence modeling. Using causal dilated convolutions with residual connections, TCNs achieve exponentially large receptive fields while being fully parallelizable (unlike RNNs). However, TCN does not natively handle missing data — it assumes a complete, regularly sampled input. When used as a baseline for irregular time series, observations must first be placed on a regular grid (e.g., via forward-fill or mean imputation).
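The exponential receptive-field growth is a one-line computation, assuming the standard layout of two causal convolutions per residual block:

```python
def receptive_field(kernel_size, dilations, convs_per_block=2):
    # Receptive field of stacked causal dilated convolutions: each conv
    # with dilation d widens the field by (kernel_size - 1) * d, so
    # doubling dilations (1, 2, 4, ...) grows it exponentially with depth.
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

# A typical TCN: kernel 3, dilations doubling over 6 residual blocks.
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8, 16, 32])
# -> 253 timesteps of context from only 12 convolution layers
```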
Mean-impute + Transformer is the simplest possible combination: fill missing values with per-feature training means, then apply a standard Transformer encoder. It decouples imputation from modeling and tests a key question: can a sufficiently powerful sequence model compensate for naive preprocessing? This baseline isolates the contribution of sophisticated missing-data handling — if a complex model can't beat mean-impute + Transformer, its architectural innovations aren't earning their keep.
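The whole baseline's preprocessing fits in a few lines (with column means standing in for training-set means):

```python
import numpy as np

def mean_impute(X):
    # The naive baseline: fill each feature's missing entries with its
    # mean, then hand the dense matrix to any standard sequence model.
    means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), means, X), means

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
filled, means = mean_impute(X)
# means == [2.0, 6.0]; every NaN replaced by its column mean
```

Everything this throws away (when the value was observed, how stale it is, whether missingness itself is informative) is exactly what the specialized models above try to exploit.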
These baselines serve different purposes. GRU-D is the strong specialized baseline that directly models missingness. TCN is the architecture baseline testing whether recurrence or attention is needed at all. Mean-impute + Transformer is the ablation baseline isolating the value of sophisticated imputation. Together, they establish the lower bound that any new model must clear.
GRU-D
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, Yan Liu — USC, NYU, MIT
Scientific Reports 2018
Recurrent Neural Networks for Multivariate Time Series with Missing Values
Extends GRU with trainable decay on inputs (last observed values decay toward empirical mean) and hidden states (memory decays when observations are absent). Incorporates binary masks and time intervals directly.
Key Innovation
Trainable decay mechanism that captures informative missingness patterns — simple, interpretable, and stubbornly competitive
Limitations
- No continuous-time dynamics between observations
- Decay toward mean is a strong assumption
- Sequential processing (not parallelizable)
TCN (Temporal Convolutional Network)
Shaojie Bai, J. Zico Kolter, Vladlen Koltun — Carnegie Mellon University, Intel Labs
arXiv 2018
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Causal dilated convolutions with residual connections achieve exponentially large receptive fields while being fully parallelizable. Demonstrated that convolutions can rival or beat LSTMs/GRUs across diverse sequence tasks.
Key Innovation
Showed that a simple convolutional architecture can match recurrent networks on temporal modeling — fast, parallelizable, and easy to train
Limitations
- Does not handle missing data natively
- Assumes regular sampling
- Fixed receptive field determined by architecture depth and dilation
Mean-Impute + Transformer
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin — Google Brain, Google Research, University of Toronto
NeurIPS 2017 (Transformer); composite baseline
Composite baseline using 'Attention Is All You Need' (Vaswani et al., NeurIPS 2017)
Fill missing values with per-feature training means, then apply a standard Transformer encoder. Tests whether a powerful sequence model can compensate for naive imputation — isolating the value of sophisticated missing-data handling.
Key Innovation
Serves as the critical ablation baseline — if a complex model can't beat this, its architectural innovations for handling missingness aren't justified
Limitations
- Mean imputation destroys temporal correlation information
- No mechanism for informative missingness
- Transformer may overfit on small clinical datasets
Comparison
| Model | Year | Architecture | Handles Missing Data | Parallelizable | Role as Baseline |
|---|---|---|---|---|---|
| GRU-D | 2018 | RNN (GRU + decay) | Yes (trainable decay) | No | Strong specialized baseline for missingness |
| TCN | 2018 | CNN (causal dilated convolutions) | No (requires pre-imputation) | Yes | Architecture baseline — are RNNs/Transformers needed? |
| Mean-impute + Transformer | 2017+ | Transformer encoder | Naive (mean fill) | Yes | Ablation baseline — is sophisticated imputation needed? |
Connecting the Threads: How These 15 Models Relate
Several cross-cutting themes emerge when viewing these models together:
1. The Solver Elimination Arc. The ODE family shows a clear progression: Latent ODE (2019) required numerical solvers → GRU-ODE-Bayes (2019) kept solvers but added GRU gating → Neural Flows (2021) eliminated the solver via direct solution parameterization → CRU (2022) used closed-form linear SDE solutions. Meanwhile, mTAN (2021) bypassed the entire ODE paradigm with attention-based interpolation. This arc has now come full circle with MIRA (2025), which reintroduces Neural ODE blocks into a foundation model but uses them selectively alongside Transformer attention.
2. From Flat Sequences to Structured Representations. Early models (Latent ODE, GRU-D, mTAN) process time series as flat sequences of observations. STraTS broke from this by representing data as unordered sets of triplets. RAINDROP, T-PatchGNN, and GraFITi went further by imposing graph structure — encoding that variables have relationships, not just co-occurrence. This structural inductive bias is especially valuable for multivariate clinical data where sensor correlations are known.
3. The Pre-training Question. Only four models incorporate pre-training: STraTS (self-supervised on clinical data), MOMENT (multi-task on diverse time series), Chronos (language-model training), and MIRA (medical-specific pre-training). The specialized models (Latent ODE through GraFITi) are all trained from scratch per-task. Whether foundation model pre-training can substitute for architectural innovations in handling irregularity is the field's central open question.
4. The Speed-Expressivity Frontier. Models arrange along a frontier: GRU-D and TCN are fast but limited; Latent ODE and ContiFormer are expressive but slow; mTAN, Neural Flows, and GraFITi achieve good accuracy at moderate cost. Foundation models (MOMENT, Chronos, MIRA) trade training-time cost for inference-time generalization. Practitioners should pick based on their constraint: if you have abundant task-specific data, a well-tuned specialized model (even GRU-D) may suffice; if you need zero-shot transfer, a foundation model is worth the investment.