Modeling Irregular Time Series: From Neural ODEs to Graph Networks and Foundation Models
A comparative review of 15 models spanning continuous-time dynamics, attention mechanisms, graph neural networks, foundation models, and traditional baselines
Summary
Irregular time series — where observations arrive at uneven intervals with missing values across variables — are the norm in healthcare, climate science, and IoT. This review maps 15 models across five families, tracing how the field has progressively attacked three core challenges: (1) representing continuous dynamics without discretizing time, (2) scaling to high-dimensional multivariate data with complex inter-variable dependencies, and (3) transferring knowledge across domains without task-specific retraining.
The story begins with continuous-time ODE models (2019–2022) that replaced fixed-step RNNs with differential equations, then shifts to attention/Transformer architectures (2021–2023) that learned temporal interpolation without sequential integration. Graph neural networks (2022–2024) added explicit relational structure between variables. Most recently, foundation models (2024–2025) brought large-scale pre-training to time series, promising zero-shot generalization. Throughout, traditional baselines like GRU-D remain surprisingly competitive, anchoring what 'good enough' looks like.
Researcher Notes
The central tension in this field is expressivity vs. computational cost. Latent ODE (2019) showed that Neural ODEs could elegantly handle irregular sampling by defining continuous latent trajectories — but numerical ODE solvers are slow. Every subsequent model can be understood as an attempt to preserve that continuous-time expressivity while reducing cost: GRU-ODE-Bayes kept the ODE but added familiar GRU gating; CRU used linear SDEs for closed-form solutions; Neural Flows eliminated the solver entirely by parameterizing solutions directly. Meanwhile, mTAN showed you could skip ODEs altogether and use learned attention kernels for temporal interpolation at 10–100× the speed.
The graph models represent a genuinely different inductive bias, not just an incremental improvement. RAINDROP, T-PatchGNN, and GraFITi all argue that irregular multivariate time series have relational structure that flat sequence models miss. When sensor A is observed but sensor B is not, the relationship between A and B matters for imputation and prediction. This is especially compelling in clinical settings where lab tests and vital signs have known physiological correlations.
Foundation models are the newest entrant but face a unique challenge with irregular data. MOMENT and Chronos were designed for regularly sampled time series and need adaptation for irregular inputs. MIRA (Microsoft, 2025) is the first foundation model to tackle irregular medical time series head-on, using Neural ODE-based extrapolation and continuous-time positional encoding. Whether domain-specific foundation models like MIRA will outperform general-purpose ones like Chronos remains an open question — but the architectural innovations (CT-RoPE, frequency-specific MoE) suggest the irregular time series community's decade of work on continuous-time representations is finally feeding back into the foundation model paradigm.
GRU-D (2016/2018) remains the baseline that refuses to die. Its trainable decay mechanism is simple and interpretable, and many papers still struggle to beat it convincingly on standard clinical benchmarks. This is a healthy reminder that architectural complexity must earn its keep against well-tuned simple models.
Continuous-Time Dynamics: Learning to Evolve Between Observations
The irregular time series problem has a natural mathematical framing: if observations arrive at arbitrary times, why not model the latent state as a continuous trajectory? This is exactly what Neural ODE-based models do.
Latent ODE (Rubanova et al., NeurIPS 2019) was the breakthrough. It places a Neural ODE inside a variational autoencoder: an ODE-RNN encoder reads observations at their actual timestamps to produce a latent distribution, then a Neural ODE decoder evolves the latent state forward continuously. The trajectory is defined by a neural network parameterizing dz/dt, with numerical solvers (e.g., Runge-Kutta) performing integration. This was the first model to show that continuous latent dynamics could handle irregular sampling natively, without imputation or discretization.
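The solver loop this paragraph describes can be sketched in a few lines. Here a toy tanh layer stands in for the learned dynamics network; `neural_dzdt`, `evolve_latent`, and all parameters are illustrative, not from the paper:

```python
import numpy as np

def neural_dzdt(z, t, W, b):
    # A tiny stand-in for the learned dynamics f_theta(z, t):
    # one tanh layer; Latent ODE uses a full neural network here.
    return np.tanh(W @ z + b)

def rk4_step(f, z, t, dt, *args):
    # One classical Runge-Kutta step -- the kind of numerical solver
    # used to integrate the latent trajectory.
    k1 = f(z, t, *args)
    k2 = f(z + 0.5 * dt * k1, t + 0.5 * dt, *args)
    k3 = f(z + 0.5 * dt * k2, t + 0.5 * dt, *args)
    k4 = f(z + dt * k3, t + dt, *args)
    return z + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def evolve_latent(z0, timestamps, W, b, steps_per_interval=10):
    # Evolve the latent state to each (irregular) observation time.
    # Cost grows with the number of solver steps -- the bottleneck
    # that later models attack.
    states = [z0]
    z, t = z0, timestamps[0]
    for t_next in timestamps[1:]:
        dt = (t_next - t) / steps_per_interval
        for _ in range(steps_per_interval):
            z = rk4_step(neural_dzdt, z, t, dt, W, b)
            t += dt
        states.append(z)
    return np.stack(states)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 4)) * 0.1, rng.normal(size=4) * 0.1
z0 = rng.normal(size=4)
traj = evolve_latent(z0, [0.0, 0.3, 1.1, 1.2], W, b)  # irregular timestamps
# traj has shape (4, 4): one latent state per observation time
```

Note how irregular gaps cost nothing special: the solver simply integrates over whatever interval separates consecutive observations.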
Published concurrently, GRU-ODE-Bayes (De Brouwer et al., NeurIPS 2019) took a different architectural path. Instead of a VAE framework, it defined a continuous-time GRU where the hidden state evolves according to GRU-style ODEs between observations, and a separate Bayesian update network incorporates each new observation. This single-pass filtering design is more natural for online/streaming scenarios (healthcare monitoring, real-time tracking) where you update beliefs incrementally rather than encoding the entire sequence first.
Both models share a critical bottleneck: numerical ODE solvers are slow. Each forward pass requires multiple integration steps, and backpropagation through the solver (via adjoint methods) adds further cost. Two subsequent models attacked this directly.
CRU (Schirmer et al., ICML 2022) restricts the latent dynamics to linear stochastic differential equations, which admit closed-form solutions via the continuous-discrete Kalman filter. The payoff is exact integration (no solver), principled uncertainty quantification (the Kalman filter naturally propagates variance), and faster training. The trade-off is reduced expressivity in the dynamics — though the authors show this matters less than expected on standard benchmarks.
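The closed-form transition can be sketched for the simplest case, a diagonal linear SDE. The helper names and the full-state observation model are simplifying assumptions for illustration; CRU's actual parameterization is richer:

```python
import numpy as np

def closed_form_transition(mu, P, a, q, dt):
    # Diagonal linear SDE dz = diag(a) z dt + noise: the mean and covariance
    # at t + dt are available analytically, so no ODE solver is needed.
    F = np.diag(np.exp(a * dt))                           # exact transition
    Qd = np.diag(q * (np.exp(2 * a * dt) - 1) / (2 * a))  # integrated noise
    return F @ mu, F @ P @ F.T + Qd

def kalman_update(mu, P, y, r):
    # Continuous-discrete Kalman filter: exact predict between observations,
    # standard Gaussian update when one arrives. Observes the full state
    # (H = I) for simplicity.
    n = len(mu)
    S = P + r * np.eye(n)
    K = P @ np.linalg.inv(S)
    return mu + K @ (y - mu), (np.eye(n) - K) @ P

mu, P = np.zeros(2), np.eye(2)
a, q = np.array([-0.5, -1.0]), np.array([0.1, 0.1])
# Irregular gaps are handled exactly: just plug the actual dt in.
mu, P = closed_form_transition(mu, P, a, q, dt=0.7)
mu, P = kalman_update(mu, P, y=np.array([1.0, -1.0]), r=0.2)
# P now carries propagated and updated uncertainty -- the 'principled
# uncertainty quantification' the paragraph refers to.
```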
Neural Flows (Biloš et al., NeurIPS 2021) take the opposite approach: keep nonlinear dynamics but eliminate the solver by directly parameterizing the solution map z(t) rather than the derivative dz/dt. Given an initial condition z(t₀), the network outputs z(t) in a single forward pass. The key insight is defining mathematical constraints (identity at t₀, invertibility) that guarantee the output is a valid ODE solution. This gives O(1) cost per evaluation versus O(N) for numerical integration, while retaining the expressivity of nonlinear dynamics.
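A toy illustration of the idea, assuming one common construction in which a `tanh(t)` factor enforces the identity constraint at t₀ = 0. The real model also enforces invertibility, which this sketch omits:

```python
import numpy as np

def flow(z0, t, W, b):
    # Direct parameterization of the solution map z(t): the tanh(t) factor
    # vanishes at t = 0, guaranteeing z(0) = z0 exactly, and z(t) for any
    # t costs a single forward pass -- no solver loop.
    return z0 + np.tanh(t) * np.tanh(W @ z0 + b)

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 3)) * 0.1, np.zeros(3)
z0 = rng.normal(size=3)

assert np.allclose(flow(z0, 0.0, W, b), z0)  # identity at t0 by construction
z_at = {t: flow(z0, t, W, b) for t in [0.2, 1.5, 7.0]}  # O(1) per query time
```

Contrast with the solver-based models: querying the state at three arbitrary times here is three independent forward passes, not an integration from t₀ to each target.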
Latent ODE
Yulia Rubanova, Ricky T. Q. Chen, David Duvenaud — University of Toronto
NeurIPS 2019
Latent ODEs for Irregularly-Sampled Time Series
Places a Neural ODE inside a VAE — an ODE-RNN encoder reads irregular observations, and a Neural ODE decoder evolves the latent state continuously. Defines latent trajectories via learned dz/dt, integrated numerically.
Key Innovation
First model to demonstrate that continuous-time latent dynamics could handle irregular sampling natively without imputation
Limitations
- Expensive numerical ODE integration at every forward pass
- Adjoint backpropagation adds further computational overhead
- VAE framework requires encoding the entire sequence before decoding
GRU-ODE-Bayes
Edward De Brouwer, Jaak Simm, Adam Arany, Yves Moreau — KU Leuven
NeurIPS 2019
GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series
Continuous-time GRU where hidden state evolves via GRU-style ODEs between observations, with a Bayesian update network that incorporates each new observation. Single-pass filtering model — no encoder-decoder split.
Key Innovation
Retains familiar GRU gating in continuous time with explicit Bayesian observation updates, making it natural for online/streaming filtering
Limitations
- Still requires numerical ODE solver between observations
- Bayesian update network adds complexity
- Less expressive dynamics than fully nonlinear Neural ODE
CRU (Continuous Recurrent Units)
Mona Schirmer, Mazin Eltayeb, Stefan Lessmann, Maja Rudolph — Humboldt University of Berlin, Bosch Center for AI
ICML 2022
Modeling Irregular Time Series with Continuous Recurrent Units
Models latent dynamics as a linear SDE with closed-form solutions via the continuous-discrete Kalman filter. No numerical ODE solver needed — transitions are computed analytically.
Key Innovation
Closed-form state transitions via Kalman filter give exact integration, principled uncertainty propagation, and faster training than nonlinear ODE models
Limitations
- Linear dynamics trade expressivity for tractability
- May underfit highly nonlinear systems
- Kalman filter assumes Gaussian noise
Neural Flows
Marin Biloš, Johanna Sommer, Syama Sundar Rangapuram, Tim Januschowski, Stephan Günnemann — TU Munich, Amazon Research
NeurIPS 2021
Neural Flows: Efficient Alternative to Neural ODEs
Instead of parameterizing dz/dt and numerically solving, directly parameterizes the solution map z(t) in a single forward pass. Defines mathematical constraints ensuring the output is a valid ODE solution.
Key Innovation
Eliminates the ODE solver entirely — O(1) cost per evaluation vs. O(N) for numerical integration, while retaining nonlinear expressivity
Limitations
- Mathematical constraints on the flow limit architectural flexibility
- Less interpretable dynamics (no explicit derivative)
- Newer and less battle-tested than Latent ODE
Comparison
| Model | Year | Dynamics | Integration Cost | Observation Handling |
|---|---|---|---|---|
| Latent ODE | 2019 | Nonlinear Neural ODE | O(N) numerical solver | VAE encoder-decoder |
| GRU-ODE-Bayes | 2019 | Continuous-time GRU ODE | O(N) numerical solver | Bayesian update network |
| Neural Flows | 2021 | Nonlinear direct solution | O(1) single forward pass | Pluggable (GRU Flow, etc.) |
| CRU | 2022 | Linear SDE | O(1) closed-form Kalman | Kalman filter update |
Attention and Transformers: Learning Temporal Kernels Without ODEs
A parallel research thread asked: do we actually need differential equations to handle irregular time? Attention mechanisms offer an alternative — they can naturally weight observations by learned relevance, regardless of when those observations occurred.
mTAN (Shukla & Marlin, ICLR 2021) was the pivotal model here. It learns a continuous-time embedding of timestamps using parametric functions (sines/cosines at learnable frequencies), then uses multi-head attention where queries are desired reference time points, keys are actual observation times, and values are the measurements. This produces smooth interpolations of the irregular data into a fixed-length representation that can be fed into any downstream model. The key result: mTAN runs 10–100× faster than Neural ODE baselines while achieving comparable or better performance, because attention is parallelizable while ODE solvers are inherently sequential.
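The mechanism can be sketched with fixed sinusoidal frequencies standing in for mTAN's learned ones (a single head, no learned projections):

```python
import numpy as np

def time_embed(t, freqs):
    # Continuous-time embedding: sines/cosines at fixed frequencies
    # (mTAN learns these), so any real-valued timestamp maps to a vector.
    t = np.atleast_1d(t)[:, None]
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)], axis=-1)

def attend(ref_times, obs_times, obs_values, freqs):
    # Queries = reference times, keys = observation times,
    # values = measurements.
    Q, K = time_embed(ref_times, freqs), time_embed(obs_times, freqs)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over observations
    return w @ obs_values                # interpolation onto the reference grid

freqs = np.array([0.5, 1.0, 2.0, 4.0])
obs_t = np.array([0.1, 0.9, 2.3, 2.4])  # irregular observation times
obs_v = np.array([1.0, 0.0, -1.0, -0.9])
ref_t = np.linspace(0, 2.5, 6)          # fixed-length reference grid
interp = attend(ref_t, obs_t, obs_v, freqs)
# interp has shape (6,): one interpolated value per reference time
```

Because the softmax weights form a convex combination, each interpolated value stays within the range of the observed measurements, and all reference times are computed in parallel, which is where the speedup over sequential ODE solving comes from.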
STraTS (Tipirneni & Reddy, ACM TKDD 2022) made a radical representational choice. Instead of treating a multivariate time series as a matrix (features × time steps) — which forces you to deal with missing cells — STraTS represents it as a set of observation triplets (time, variable-id, value). Each triplet becomes one token, processed by a standard Transformer encoder. This means the model never sees 'empty' cells; sparsity is handled by construction. Two technical innovations make this work: Continuous Value Embedding (CVE), a small neural network that maps scalar values (timestamps and measurements) to dense vectors without discretization, and a self-supervised forecasting pretext task for pre-training on unlabeled clinical data.
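The triplet representation is easy to see in code. This toy converter (a hypothetical helper, not the paper's code) shows how empty cells vanish by construction:

```python
import numpy as np

def to_triplets(grid, times, variables):
    # Convert a (features x time) matrix with NaN gaps into STraTS-style
    # (time, variable-id, value) triplets: an empty cell simply produces
    # no token, so the model never sees missingness.
    return [(times[j], variables[i], grid[i, j])
            for i in range(grid.shape[0])
            for j in range(grid.shape[1])
            if not np.isnan(grid[i, j])]

times = [0.0, 1.5, 4.2]
variables = ["heart_rate", "lactate"]
grid = np.array([[72.0, np.nan, 88.0],    # heart_rate at t=0.0 and t=4.2
                 [np.nan, 2.1, np.nan]])  # lactate only at t=1.5
tokens = to_triplets(grid, times, variables)
# -> [(0.0, 'heart_rate', 72.0), (4.2, 'heart_rate', 88.0),
#     (1.5, 'lactate', 2.1)]
```

Each triplet then becomes one Transformer token, with CVE embedding the two continuous scalars (timestamp and value) into dense vectors.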
ContiFormer (Chen et al., NeurIPS 2023) unified the ODE and Transformer threads. It defines attention in continuous time: latent states evolve via ODEs between observations, and the attention mechanism operates over these continuous trajectories rather than discrete tokens. Both queries and keys are functions of time, not fixed vectors. The authors prove that several existing architectures (including designs resembling mTAN) are special cases of ContiFormer under specific function hypotheses. A reparameterization trick enables parallel computation of continuous-time attention scores, avoiding the sequential bottleneck that plagues standard Neural ODE approaches.
The evolution here is clear: mTAN showed attention could replace ODEs for temporal modeling; STraTS showed you could eliminate the time-series matrix representation entirely; ContiFormer showed you could have both ODE dynamics and attention in a single principled framework.
mTAN (Multi-Time Attention Network)
Satya Narayan Shukla, Benjamin M. Marlin — University of Massachusetts Amherst
ICLR 2021
Multi-Time Attention Networks for Irregularly Sampled Time Series
Learns continuous-time embeddings of timestamps and uses multi-head attention to interpolate irregular observations into fixed-length representations. Queries are reference times, keys are observation times, values are measurements.
Key Innovation
Replaces ODE-based temporal modeling with learned attention kernels — 10–100× faster than Neural ODE baselines
Limitations
- Interpolation-first approach may lose fine-grained temporal patterns
- No explicit modeling of continuous dynamics between observations
- Attention complexity scales quadratically with observation count
STraTS (Self-supervised Transformer for Time-Series)
Sindhu Tipirneni, Chandan K. Reddy — Virginia Tech
ACM TKDD 2022
Self-Supervised Transformer for Sparse and Irregularly Sampled Multivariate Clinical Time-Series
Represents irregular multivariate time series as a set of (time, variable-id, value) triplets, each embedded via Continuous Value Embedding (CVE). A Transformer encoder processes the set directly — no imputation or grid alignment needed.
Key Innovation
Triplet representation eliminates empty cells by construction; CVE embeds continuous scalars without discretization; self-supervised pretraining addresses label scarcity
Limitations
- Sequence length grows with number of observations (no patching/compression)
- Designed for clinical data — less tested on other domains
- No continuous dynamics modeling between observations
ContiFormer
Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, Dongsheng Li — Microsoft Research
NeurIPS 2023
ContiFormer: Continuous-Time Transformer for Irregular Time Series Modeling
Unifies Neural ODEs and Transformers by defining attention in continuous time. Latent states evolve via ODEs between observations; attention queries and keys are continuous functions of time. Subsumes several existing architectures as special cases.
Key Innovation
Continuous-time attention mechanism with reparameterization trick for parallel computation — combines ODE expressivity with Transformer scalability
Limitations
- More complex to implement and tune than pure attention or pure ODE models
- Reparameterization approximations may introduce errors
- Newer model with less community adoption
Comparison
| Model | Year | Time Handling | Attention Role | ODE Component | Pre-training |
|---|---|---|---|---|---|
| mTAN | 2021 | Learned continuous-time embedding | Interpolation to reference grid | None (compatible with ODE-RNN downstream) | None |
| STraTS | 2022 | CVE of timestamps | Direct processing of observation triplets | None | Self-supervised forecasting |
| ContiFormer | 2023 | ODE-governed continuous trajectories | Continuous-time attention over ODE states | Core component | None |
Graph Neural Networks: Adding Relational Structure to Irregular Observations
Sequence models (whether ODE-based or attention-based) process irregular time series as flat sequences of observations. But multivariate time series have relational structure: sensors are physically connected, lab tests share physiological pathways, and IoT devices form spatial networks. Graph neural networks encode this structure explicitly.
RAINDROP (Zhang et al., ICLR 2022) was the first major GNN approach for irregular multivariate time series. It represents each sample as a separate sensor graph where nodes are variables and edges encode learned inter-sensor dependencies. The key mechanism: when an observation arrives at one sensor (a 'raindrop'), it sends messages to neighboring sensors, creating 'ripples'. This means the graph dynamically reflects which sensors have observations at any given time — sparsity becomes a structural property of the graph rather than a problem to solve. Temporal self-attention captures within-sensor dynamics, while the GNN message-passing captures cross-sensor dependencies.
GraFITi (Yalavarthi et al., AAAI 2024) made a bold reformulation. Instead of the usual 'inputs → hidden states → predictions' pipeline, GraFITi casts forecasting as edge weight prediction on a bipartite graph. One set of nodes represents observed time-variable pairs; the other represents target time-variable pairs. A GNN adapted from Graph Attention Networks learns embeddings and predicts edge weights that yield forecasted values. This sidesteps ODE solvers and imputation entirely — sparsity is a natural graph property — and runs up to 5× faster than ODE-based methods with 17% better accuracy.
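A sketch of the bipartite framing, under the simplifying assumption that each target connects only to same-variable observations (the paper learns richer connectivity with attention):

```python
def bipartite_edges(observed, targets):
    # GraFITi's framing, simplified: one node per observed (time, variable)
    # pair, one node per target pair; forecasting becomes predicting the
    # weight (value) of each target edge. Unobserved cells contribute no
    # node, so sparsity needs no special handling.
    obs_nodes = [(t, v) for (t, v, _) in observed]
    tgt_nodes = list(targets)
    # Connect each target to the observation history of the same variable
    # (one simple connectivity choice for illustration).
    edges = [(i, j) for i, (_, v_o) in enumerate(obs_nodes)
                    for j, (_, v_t) in enumerate(tgt_nodes) if v_o == v_t]
    return obs_nodes, tgt_nodes, edges

observed = [(0.1, "hr", 72.0), (0.9, "hr", 75.0), (0.5, "bp", 118.0)]
targets = [(2.0, "hr"), (2.0, "bp")]
obs_nodes, tgt_nodes, edges = bipartite_edges(observed, targets)
# -> edges [(0, 0), (1, 0), (2, 1)]: each target linked to its
#    same-variable history
```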
T-PatchGNN (Zhang et al., ICML 2024) bridged the patching paradigm (popularized by PatchTST for regular time series) with graph neural networks. It divides each univariate irregular series into 'transformable patches' — segments covering a uniform time horizon but containing a variable number of observations. A Transformer handles intra-series (temporal) modeling within patches, while time-adaptive GNNs model inter-series (cross-variable) correlations through learned time-varying graph structures. The patching avoids the sequence-length explosion of processing every observation individually.
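A toy version of the patching step (the helper and patch policy are illustrative; the paper's patches also carry masks and learned embeddings):

```python
def make_patches(times, values, horizon):
    # 'Transformable patches': each patch spans a uniform time horizon but
    # may contain any number of observations, so irregular series avoid the
    # sequence-length explosion of one token per observation.
    n_patches = int(max(times) // horizon) + 1
    patches = [[] for _ in range(n_patches)]
    for t, v in zip(times, values):
        patches[int(t // horizon)].append((t, v))
    return patches

times = [0.2, 0.3, 2.7, 5.1, 5.2, 5.9]   # irregular timestamps
values = [1.0, 1.1, 0.4, -0.2, -0.3, 0.0]
patches = make_patches(times, values, horizon=2.0)
# -> 3 patches covering [0,2), [2,4), [4,6) with 2, 1, and 3 observations
```

Downstream, the Transformer attends within each patch while the time-adaptive GNN connects patches across variables.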
These three models share a conviction: irregular multivariate time series are inherently graph-structured, and making this structure explicit improves both accuracy and interpretability. They differ in what the graph represents (sensor networks vs. observation-target bipartite graphs vs. patch-level adaptive graphs) and how the graph is constructed (learned vs. task-derived).
RAINDROP
Xiang Zhang, Marko Zeman, Theodoros Tsiligkaridis, Marinka Zitnik — Harvard University, MIT Lincoln Lab
ICLR 2022
Graph-Guided Network for Irregularly Sampled Multivariate Time Series
Represents each sample as a sensor graph with learned inter-sensor edges. When an observation arrives ('raindrop'), it sends messages to neighboring sensors ('ripples'). Temporal self-attention handles within-sensor dynamics.
Key Innovation
Dynamic graph structure that adapts to which sensors are observed at each time, making sparsity a structural property rather than a problem
Limitations
- Classification-focused — less tested for forecasting
- Graph learning adds computational overhead for many sensors
- Sample-level graphs don't share structure across the dataset
T-PatchGNN
Weijia Zhang, Chenlong Yin, Hao Liu, Xiaofang Zhou, Hui Xiong — HKUST, Rutgers University
ICML 2024
Irregular Multivariate Time Series Forecasting: A Transformable Patching Graph Neural Networks Approach
Divides irregular series into 'transformable patches' (uniform time horizon, variable observation count). Transformer for intra-series temporal modeling, time-adaptive GNN for inter-series cross-variable correlations.
Key Innovation
Transformable patching avoids sequence-length explosion while preserving irregular structure; time-adaptive graph captures evolving variable relationships
Limitations
- Patch boundaries may split meaningful temporal patterns
- Dual architecture (Transformer + GNN) increases model complexity
- Patch size is a hyperparameter that affects performance
GraFITi
Vijaya Krishna Yalavarthi, Kiran Madhusudhanan, Randolf Scholz, Nourhan Ahmed, Johannes Burchert, Shayan Jawed, Stefan Born, Lars Schmidt-Thieme — University of Hildesheim
AAAI 2024
GraFITi: Graphs for Forecasting Irregularly Sampled Time Series
Reformulates forecasting as edge weight prediction on a bipartite graph — observed (time, variable) pairs on one side, target pairs on the other. A GNN adapted from GAT learns embeddings and predicts edge weights as forecasted values.
Key Innovation
Casts prediction as graph edge weight estimation — sidesteps ODE solvers and imputation entirely; up to 5× faster than ODE-based methods
Limitations
- Bipartite graph grows with number of observation-target pairs
- Edge weight prediction is an unusual framing that may be harder to interpret
- Less explored for classification tasks
Comparison
| Model | Year | Graph Structure | Temporal Component | Primary Task | Handles Sparsity Via |
|---|---|---|---|---|---|
| RAINDROP | 2022 | Sensor-level (nodes = variables) | Temporal self-attention | Classification | Dynamic graph adapts to observed sensors |
| GraFITi | 2024 | Bipartite (observed ↔ target pairs) | GAT message passing | Forecasting | Sparsity is natural graph property |
| T-PatchGNN | 2024 | Time-adaptive inter-series graph | Transformer on patches | Forecasting | Transformable patching |
Foundation Models: Pre-trained Generalists Meet Irregular Data
The foundation model wave has reached time series, promising zero-shot generalization across domains. But irregular sampling poses a unique challenge: most foundation models assume regular grids.
MOMENT (Goswami et al., ICML 2024) is a family of open-source, pre-trained Transformer models for general-purpose time series analysis. Pre-trained on the 'Time Series Pile' — a large, diverse collection of public time series data — MOMENT handles five tasks: short/long-horizon forecasting, classification, anomaly detection, and imputation. It addresses time series-specific challenges (varying lengths, multi-channel inputs, distribution shifts) through systematic design choices. However, it was primarily designed for regularly sampled data and requires adaptation for irregular inputs.
Chronos (Ansari et al., Amazon, TMLR 2024) made the provocative choice to treat time series as language. It tokenizes real-valued observations into a fixed discrete vocabulary via scaling and quantization, then trains T5-family transformers (20M–710M parameters) with standard cross-entropy loss. Training data is augmented with synthetic time series from Gaussian processes. The result: strong zero-shot probabilistic forecasting on 42 benchmarks. Like MOMENT, Chronos assumes regular sampling but its language-model approach means irregular data can potentially be handled through special tokens or positional encodings.
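The scaling-and-quantization step can be sketched as follows; the bin layout and scale choice here are simplifications of the paper's mean-scaling scheme:

```python
import numpy as np

def tokenize(series, n_bins=100, lo=-3.0, hi=3.0):
    # Chronos-style preprocessing, simplified: scale by the mean absolute
    # value, then quantize into a fixed vocabulary of bin ids that a
    # language model can predict with cross-entropy loss.
    scale = np.mean(np.abs(series)) or 1.0
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]  # n_bins-1 interior edges
    return np.digitize(series / scale, edges), scale

def detokenize(tokens, scale, n_bins=100, lo=-3.0, hi=3.0):
    # Map each token back to its bin center, then undo the scaling.
    width = (hi - lo) / n_bins
    return (lo + (tokens + 0.5) * width) * scale

series = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
tokens, scale = tokenize(series)
recon = detokenize(tokens, scale)
# Quantization is lossy, but reconstruction error stays within
# half a bin width (in scaled units).
```

The information loss from quantization, noted in the Limitations below, is visible here: `recon` approximates but never exactly recovers `series`.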
MIRA (Li et al., Microsoft, 2025) is the first foundation model built specifically for irregular medical time series. Three architectural innovations address this: Continuous-Time Rotary Positional Encoding (CT-RoPE) handles variable time intervals (extending the standard RoPE used in LLMs to continuous time); a frequency-specific mixture-of-experts layer routes computation across latent frequency regimes; and a Continuous Dynamics Extrapolation Block based on Neural ODEs enables forecasting at arbitrary timestamps. Pre-trained on 454+ billion medical time points, MIRA achieves 7–10% error reductions over baselines.
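The CT-RoPE idea can be illustrated by replacing standard RoPE's integer position index with a real-valued timestamp. This is a sketch under that assumption; MIRA's exact formulation may differ:

```python
import numpy as np

def ct_rope(x, t, base=10000.0):
    # Rotary positional encoding with a real-valued timestamp t: each
    # dimension pair is rotated by angle t * theta_i, where standard RoPE
    # would use an integer position instead.
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) / (d // 2))
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(2)
q = rng.normal(size=8)
# Rotation preserves norms, and query-key dot products depend only on the
# elapsed time between the two timestamps -- the property that makes RoPE
# attractive for variable intervals.
assert np.isclose(np.linalg.norm(ct_rope(q, 3.71)), np.linalg.norm(q))
```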
The progression from MOMENT/Chronos to MIRA mirrors the broader field's evolution: general-purpose architectures work well for regular data, but irregular time series ultimately demand the continuous-time machinery that the specialized model community has been developing since 2019. MIRA's CT-RoPE and Neural ODE blocks are direct descendants of the ideas in Latent ODE and mTAN.
MOMENT
Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, Artur Dubrawski — Carnegie Mellon University, University of Pennsylvania
ICML 2024
MOMENT: A Family of Open Time-series Foundation Models
Open-source pre-trained Transformer family for general-purpose time series. Pre-trained on the 'Time Series Pile' across five tasks: short- and long-horizon forecasting, classification, anomaly detection, and imputation.
Key Innovation
Multi-task, multi-domain pre-training on a curated 'Time Series Pile' — first open-source general-purpose time series foundation model
Limitations
- Designed for regular sampling — needs adaptation for irregular data
- Transformer architecture may struggle with very long sequences
- Pre-training data diversity may not cover all domains
Chronos
Abdul Fatir Ansari, Lorenzo Stella, Ali Caner Turkmen, et al. — Amazon
TMLR 2024
Chronos: Learning the Language of Time Series
Tokenizes time series values into a discrete vocabulary, then trains T5-family transformers (20M–710M params) with cross-entropy loss. Augmented with synthetic Gaussian process data for robust zero-shot probabilistic forecasting.
Key Innovation
Language-model paradigm for time series — tokenization + cross-entropy loss enables direct reuse of LLM training infrastructure
Limitations
- Tokenization quantizes continuous values — information loss
- Assumes regular sampling
- Forecasting-only — not multi-task like MOMENT
MIRA
Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, et al. — Microsoft Research, University of Manchester, Peking University, Tsinghua University
arXiv preprint 2025
MIRA: Medical Time Series Foundation Model for Real-World Health Data
First foundation model for irregular medical time series. Uses CT-RoPE for variable time intervals, frequency-specific MoE for multi-scale dynamics, and Neural ODE-based extrapolation for arbitrary-timestamp forecasting. Pre-trained on 454B+ medical time points.
Key Innovation
Brings continuous-time representations (CT-RoPE, Neural ODE extrapolation) into the foundation model paradigm — directly builds on a decade of irregular time series research
Limitations
- Domain-specific (medical) — transfer to other domains untested
- Large model may be impractical for edge deployment
- Pre-print status — not yet peer-reviewed at a major venue
Comparison
| Model | Year | Pre-training Scale | Irregular Sampling | Domain | Architecture |
|---|---|---|---|---|---|
| MOMENT | 2024 | Time Series Pile (diverse) | Requires adaptation | General (5 tasks) | Transformer |
| Chronos | 2024 | Public + synthetic GP data | Requires adaptation | General (forecasting) | T5 with tokenization |
| MIRA | 2025 | 454B+ medical time points | Native (CT-RoPE + Neural ODE) | Medical | Transformer + MoE + Neural ODE |
Traditional Baselines: The Lower Bound That Keeps Punching Up
Every new model must answer: does it beat these simple approaches, and by how much?
GRU-D (Che et al., Scientific Reports 2018) extends the standard GRU with trainable decay mechanisms for both input values and hidden states. When a variable hasn't been observed for a while, its last value decays toward the empirical mean at a learned rate, and the hidden state similarly decays. Binary masking indicators and time-since-last-observation intervals are fed directly into the recurrence. Despite its simplicity, GRU-D remains one of the most competitive baselines for clinical time series — many architecturally complex models struggle to beat it convincingly on standard benchmarks like PhysioNet and MIMIC.
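The input-decay mechanism in code (a sketch with hand-picked weights; GRU-D learns `w` and `b` jointly with the recurrence, and applies an analogous decay to the hidden state):

```python
import numpy as np

def decayed_input(x_last, x_mean, delta, w, b):
    # GRU-D input decay: the longer a variable has gone unobserved (delta),
    # the more its last value is pulled toward the empirical mean, at a
    # learned per-variable rate (w, b).
    gamma = np.exp(-np.maximum(0.0, w * delta + b))
    return gamma * x_last + (1.0 - gamma) * x_mean

x_last = np.array([120.0, 80.0])   # last observed systolic BP, heart rate
x_mean = np.array([110.0, 72.0])   # training-set means
w, b = np.array([0.5, 0.5]), np.array([0.0, 0.0])

fresh = decayed_input(x_last, x_mean, delta=np.array([0.0, 0.0]), w=w, b=b)
stale = decayed_input(x_last, x_mean, delta=np.array([10.0, 10.0]), w=w, b=b)
# fresh equals the last observed values; stale has decayed almost to the means
```

The interpretability the paragraph mentions is visible here: each variable's learned decay rate directly states how quickly its last observation stops being trustworthy.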
TCN (Bai et al., 2018) established that temporal convolutions could rival or beat recurrent networks for sequence modeling. Using causal dilated convolutions with residual connections, TCNs achieve exponentially large receptive fields while being fully parallelizable (unlike RNNs). However, TCN does not natively handle missing data — it assumes a complete, regularly sampled input. When used as a baseline for irregular time series, observations must first be placed on a regular grid (e.g., via forward-fill or mean imputation).
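The exponential receptive-field growth is a one-line computation, assuming the standard layout of two causal convolutions per residual block:

```python
def receptive_field(kernel_size, dilations, convs_per_block=2):
    # Receptive field of stacked causal dilated convolutions: each conv
    # with dilation d widens the field by (kernel_size - 1) * d, so
    # doubling dilations (1, 2, 4, ...) grows it exponentially with depth.
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

# A typical TCN: kernel 3, dilations doubling over 6 residual blocks.
rf = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8, 16, 32])
# -> 253 timesteps of context from only 12 convolution layers
```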
Mean-impute + Transformer is the simplest possible combination: fill missing values with per-feature training means, then apply a standard Transformer encoder. It decouples imputation from modeling and tests a key question: can a sufficiently powerful sequence model compensate for naive preprocessing? This baseline isolates the contribution of sophisticated missing-data handling — if a complex model can't beat mean-impute + Transformer, its architectural innovations aren't earning their keep.
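The whole baseline's preprocessing fits in a few lines (with column means standing in for training-set means):

```python
import numpy as np

def mean_impute(X):
    # The naive baseline: fill each feature's missing entries with its
    # mean, then hand the dense matrix to any standard sequence model.
    means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), means, X), means

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
filled, means = mean_impute(X)
# means == [2.0, 6.0]; every NaN replaced by its column mean
```

Everything this throws away (when the value was observed, how stale it is, whether missingness itself is informative) is exactly what the specialized models above try to exploit.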
These baselines serve different purposes. GRU-D is the strong specialized baseline that directly models missingness. TCN is the architecture baseline testing whether recurrence or attention is needed at all. Mean-impute + Transformer is the ablation baseline isolating the value of sophisticated imputation. Together, they establish the lower bound that any new model must clear.
GRU-D
Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, Yan Liu — USC, NYU, MIT
Scientific Reports 2018
Recurrent Neural Networks for Multivariate Time Series with Missing Values
Extends GRU with trainable decay on inputs (last observed values decay toward empirical mean) and hidden states (memory decays when observations are absent). Incorporates binary masks and time intervals directly.
Key Innovation
Trainable decay mechanism that captures informative missingness patterns — simple, interpretable, and stubbornly competitive
Limitations
- No continuous-time dynamics between observations
- Decay toward mean is a strong assumption
- Sequential processing (not parallelizable)
TCN (Temporal Convolutional Network)
Shaojie Bai, J. Zico Kolter, Vladlen Koltun — Carnegie Mellon University, Intel Labs
arXiv 2018
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Causal dilated convolutions with residual connections achieve exponentially large receptive fields while being fully parallelizable. Demonstrated that convolutions can rival or beat LSTMs/GRUs across diverse sequence tasks.
Key Innovation
Showed that a simple convolutional architecture can match recurrent networks on temporal modeling — fast, parallelizable, and easy to train
Limitations
- Does not handle missing data natively
- Assumes regular sampling
- Fixed receptive field determined by architecture depth and dilation
Mean-Impute + Transformer
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin — Google Brain, Google Research, University of Toronto
NeurIPS 2017 (Transformer); composite baseline
Composite baseline using 'Attention Is All You Need' (Vaswani et al., NeurIPS 2017)
Fill missing values with per-feature training means, then apply a standard Transformer encoder. Tests whether a powerful sequence model can compensate for naive imputation — isolating the value of sophisticated missing-data handling.
Key Innovation
Serves as the critical ablation baseline — if a complex model can't beat this, its architectural innovations for handling missingness aren't justified
Limitations
- Mean imputation destroys temporal correlation information
- No mechanism for informative missingness
- Transformer may overfit on small clinical datasets
Comparison
| Model | Year | Architecture | Handles Missing Data | Parallelizable | Role as Baseline |
|---|---|---|---|---|---|
| GRU-D | 2018 | RNN (GRU + decay) | Yes (trainable decay) | No | Strong specialized baseline for missingness |
| TCN | 2018 | CNN (causal dilated convolutions) | No (requires pre-imputation) | Yes | Architecture baseline — are RNNs/Transformers needed? |
| Mean-impute + Transformer | 2017+ | Transformer encoder | Naive (mean fill) | Yes | Ablation baseline — is sophisticated imputation needed? |
Connecting the Threads: How These 15 Models Relate
Several cross-cutting themes emerge when viewing these models together:
1. The Solver Elimination Arc. The ODE family shows a clear progression: Latent ODE (2019) required numerical solvers → GRU-ODE-Bayes (2019) kept solvers but added GRU gating → Neural Flows (2021) eliminated the solver via direct solution parameterization → CRU (2022) used closed-form linear SDE solutions. Meanwhile, mTAN (2021) bypassed the entire ODE paradigm with attention-based interpolation. This arc has now come full circle with MIRA (2025), which reintroduces Neural ODE blocks into a foundation model but uses them selectively alongside Transformer attention.
2. From Flat Sequences to Structured Representations. Early models (Latent ODE, GRU-D, mTAN) process time series as flat sequences of observations. STraTS broke from this by representing data as unordered sets of triplets. RAINDROP, T-PatchGNN, and GraFITi went further by imposing graph structure — encoding that variables have relationships, not just co-occurrence. This structural inductive bias is especially valuable for multivariate clinical data where sensor correlations are known.
3. The Pre-training Question. Only four models incorporate pre-training: STraTS (self-supervised on clinical data), MOMENT (multi-task on diverse time series), Chronos (language-model training), and MIRA (medical-specific pre-training). The specialized models (Latent ODE through GraFITi) are all trained from scratch per-task. Whether foundation model pre-training can substitute for architectural innovations in handling irregularity is the field's central open question.
4. The Speed-Expressivity Frontier. Models arrange along a frontier: GRU-D and TCN are fast but limited; Latent ODE and ContiFormer are expressive but slow; mTAN, Neural Flows, and GraFITi achieve good accuracy at moderate cost. Foundation models (MOMENT, Chronos, MIRA) trade training-time cost for inference-time generalization. Practitioners should pick based on their constraint: if you have abundant task-specific data, a well-tuned specialized model (even GRU-D) may suffice; if you need zero-shot transfer, a foundation model is worth the investment.