Nonparametric filtering, estimation and classification using neural jump ODEs

  • Jakob Heiss ORCID logo , Florian Krach ORCID logo EMAIL logo , Thorsten Schmidt ORCID logo and Félix B. Tambe-Ndonfack ORCID logo
Published/Copyright: September 5, 2025

Abstract

Neural Jump ODEs model the conditional expectation between observations by neural ODEs and jump at the arrival of new observations. They have demonstrated effectiveness for fully data-driven online forecasting in settings with irregular and partial observations, operating under weak regularity assumptions. This work extends the framework to input-output systems, enabling direct applications in online filtering and classification. We establish theoretical convergence guarantees for this approach, providing a robust solution to $L^2$-optimal filtering. Empirical experiments highlight the model's superior performance over classical parametric methods, particularly in scenarios with complex underlying distributions. These results emphasize the approach's potential in time-sensitive domains such as finance and health monitoring, where real-time accuracy is crucial.

MSC 2020: 60G35; 62M20; 68T07

Award Identifier / Grant number: 499552394

Funding statement: The funding for Félix Ndonfack by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project ID 499552394 – SFB 1597 Small Data – and support from the FDMAI, Freiburg, are gratefully acknowledged.

A Signatures

We give a brief overview of the signature transform and its universal approximation results, following [35, Section 3.2]. We start by defining paths of bounded variation.

Definition 4

Let $J$ be a closed interval in $\mathbb{R}$ and $d \geq 1$. Let $X \colon J \to \mathbb{R}^d$ be a path on $J$. The variation of $X$ on the interval $J$ is defined by

$$\|X\|_{\mathrm{var},J} = \sup_{P(J)} \sum_{t_j \in P(J)} |X_{t_j} - X_{t_{j-1}}|_2,$$

where the supremum is taken over all finite partitions $P(J)$ of $J$.

Definition 5

We denote the set of $\mathbb{R}^d$-valued paths of bounded variation on $J$ by $BV(J, \mathbb{R}^d)$ and endow it with the norm

$$\|X\|_{BV} := |X_0|_2 + \|X\|_{\mathrm{var},J}.$$

For continuous paths of bounded variation, we can define the signature transform.

Definition 6

Let $J$ denote a closed interval in $\mathbb{R}$. Let $X \colon J \to \mathbb{R}^d$ be a continuous path with finite variation. The signature of $X$ is defined as

$$S(X) = (1, X_J^1, X_J^2, \ldots),$$

where, for each $m \geq 1$,

$$X_J^m = \int_{\substack{u_1 < \cdots < u_m \\ u_1, \ldots, u_m \in J}} dX_{u_1} \otimes \cdots \otimes dX_{u_m} \in (\mathbb{R}^d)^{\otimes m}$$

is a collection of iterated integrals. The map from a path to its signature is called the signature transform.

A good introduction to the signature transform with its properties and examples can be found in [6, 33, 15]. In practice, we are not able to use the full (infinite) signature, but instead use a truncated version.
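To make the iterated integrals concrete, the following numpy sketch (our own illustration, not code from the paper) approximates the second signature level of the linear path $X_t = t a$ on $[0,1]$, for which level $m$ of the signature is known to equal $a^{\otimes m}/m!$.

```python
import numpy as np

# For the linear path X_t = t * a on [0, 1], signature level m equals
# a^{(x)m} / m!.  We approximate the level-2 iterated integral by a
# left-point Riemann sum over n equal increments.
a = np.array([1.0, -2.0])
n = 2000
dX = a / n                    # constant increment of the path
X2 = np.zeros((2, 2))         # approximates the level-2 term X_J^2
X_run = np.zeros(2)           # path value accumulated so far (X_{u_1})
for _ in range(n):
    X2 += np.outer(X_run, dX)  # integrate dX_{u_1} (x) dX_{u_2} over u_1 < u_2
    X_run += dX

# Compare with the closed form a (x) a / 2 for the linear path:
assert np.allclose(X2, np.outer(a, a) / 2, atol=1e-2)
```

The left-point sum converges at rate $O(1/n)$ here, which is why the tolerance can be taken small already for moderate $n$.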

Definition 7

Let $J$ denote a compact interval in $\mathbb{R}$. Let $X \colon J \to \mathbb{R}^d$ be a continuous path with finite variation. The truncated signature of $X$ of order $m$ is defined as

$$\pi_m(X) = (1, X_J^1, X_J^2, \ldots, X_J^m),$$

i.e., the first $m + 1$ terms (levels) of the signature of $X$.

Note that the size of the truncated signature depends on the dimension of $X$ as well as the chosen truncation level. Specifically, for a path of dimension $d$, the dimension of the truncated signature of order $m$ is given by

$$\begin{cases} m + 1 & \text{if } d = 1, \\[4pt] \dfrac{d^{m+1} - 1}{d - 1} & \text{if } d > 1. \end{cases}$$

When using the truncated signature as input to a model, this results in a trade-off between accurately describing the path and model complexity.
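The dimension formula above is just the geometric sum $\sum_{k=0}^m d^k$, which a few lines of Python make tangible (our own illustration):

```python
def truncated_sig_dim(d: int, m: int) -> int:
    """Number of coefficients in the order-m truncated signature of a
    d-dimensional path, including the constant first level."""
    if d == 1:
        return m + 1
    return (d ** (m + 1) - 1) // (d - 1)

# The count grows geometrically in the truncation level m:
print(truncated_sig_dim(1, 3))  # 4
print(truncated_sig_dim(2, 3))  # 15  (= 1 + 2 + 4 + 8)
print(truncated_sig_dim(3, 3))  # 40  (= 1 + 3 + 9 + 27)
```

The rapid growth in $d$ and $m$ is exactly the model-complexity side of the trade-off.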

A well-known result in stochastic analysis states that continuous functions can be uniformly approximated using truncated signatures, which is made precise in the following theorem. For references to the literature and an idea of the proof, see [33, Theorem 1]. This classical result was extended in [35, Proposition 3.8] to additionally incorporate an input $c$ from a compact set $C \subseteq \mathbb{R}^m$, which can be stated as follows.

Proposition 1

Consider $\mathcal{P}$ as a compact subset of $BV_0^c([0,1], H)$ consisting of paths that are not tree-like equivalent, and let $C \subseteq \mathbb{R}^m$ for some $m \in \mathbb{N}$ be compact. We take the Cartesian product $BV_0^c([0,1], H) \times \mathbb{R}^m$ with the product norm defined as the sum of the individual norms (variation norm and 1-norm). Suppose $f \colon \mathcal{P} \times C \to \mathbb{R}$ is continuous. Then, for any $\varepsilon > 0$, there exist an $M > 0$ and a continuous function $\tilde{f}$ such that

$$\sup_{(x, c) \in \mathcal{P} \times C} |f(x, c) - \tilde{f}(\pi_M(x), c)| < \varepsilon.$$

To apply this result, we need a tractable description of certain compact subsets of $BV_0^c([0,1], \mathbb{R}^d)$ that include suitable paths for our considerations. Since $BV_0^c([0,1], \mathbb{R}^d)$ is not finite-dimensional, not every closed and bounded subset is compact. In [4, Example 4], the following set of functions is proven to be relatively compact. As already observed in [35, Remark 3.11], this also holds for $\mathbb{R}^d$-valued paths.

Proposition 2

For any $N \in \mathbb{N}$, the set $A_N \subseteq BV_0^c([0,1], \mathbb{R})$, consisting of all piecewise linear, bounded and continuous functions expressible as

$$f(t) = (a_1 t)\, \mathbf{1}_{[s_0, s_1]}(t) + \sum_{i=2}^{N} (a_i t + b_i)\, \mathbf{1}_{(s_{i-1}, s_i]}(t),$$

is relatively compact. Here,

$$a_i, b_i \in [-N, N], \quad b_1 = 0, \quad a_i s_i + b_i = a_{i+1} s_i + b_{i+1} \text{ for all } 1 \leq i < N, \quad \text{and} \quad 0 = s_0 < s_1 < \cdots < s_N = 1.$$

B Kalman filter

If the observation and signal distributions in a filtering system are Gaussian and independent, then the Kalman filter [29] recovers the optimal solution, i.e., the true conditional expectation. This is the case in Example 1, where the normally distributed drift should be filtered from observations of a Brownian motion with drift, and in the filtering example with two Brownian motions in Section 6.4.

We first recall the Kalman filter in Section B.1 and then show that applying it in Example 1 leads to the same result as the direct computation of the conditional expectation given there. The example of Section 6.4 works along the same lines.

B.1 Definition of the Kalman filter

The Kalman filter (without control input) assumes an underlying system of a discrete-time, unobserved state process $x$ and an observation process $z$ given as

(B.1) $\quad x_k = F_k x_{k-1} + w_k \quad \text{and} \quad z_k = H_k x_k + v_k,$

where $F_k$ is the state transition matrix, $w_k \sim N(0, Q_k)$ is the process noise, $H_k$ is the observation matrix and $v_k \sim N(0, R_k)$ is the observation noise. The initial state $x_0$ and all noise terms $(w_k)_k$ and $(v_k)_k$ are assumed to be mutually independent.

The Kalman filter, which is optimal in this setting, initializes the prediction of $x$ as $\hat{x}_{0|0} = E[x_0]$ and its covariance as $P_{0|0} = \operatorname{Cov}(x_0)$. It then alternates between two steps, the prediction and the update step, which can be summarized as

(Predict) $\quad \hat{x}_{k|k-1} = F_k \hat{x}_{k-1|k-1}, \quad P_{k|k-1} = F_k P_{k-1|k-1} F_k^\top + Q_k,$
(Update) $\quad \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \tilde{y}_k, \quad P_{k|k} = (I - K_k H_k) P_{k|k-1},$
where

$$\tilde{y}_k = z_k - H_k \hat{x}_{k|k-1}, \quad S_k = H_k P_{k|k-1} H_k^\top + R_k, \quad K_k = P_{k|k-1} H_k^\top S_k^{-1}.$$
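The predict/update recursion above fits in a few lines of numpy; the following is an illustrative implementation (not taken from the paper's code base):

```python
import numpy as np

def kalman_step(x_hat, P, z, F, Q, H, R):
    """One predict/update cycle of the Kalman filter for system (B.1)."""
    # Predict
    x_pred = F @ x_hat
    P_pred = F @ P @ F.T + Q
    # Update
    y = z - H @ x_pred                    # innovation
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_hat)) - K @ H) @ P_pred
    return x_new, P_new

# Toy check: constant scalar state with prior N(0, 1) and one observation
# z = 2 with unit noise; the posterior is the average of prior mean and z.
x, P = kalman_step(np.zeros(1), np.eye(1), np.array([2.0]),
                   np.eye(1), np.zeros((1, 1)), np.eye(1), np.eye(1))
# x[0] = 1.0, P[0, 0] = 0.5
```

In production code one would use the Joseph form of the covariance update for numerical stability, but the textbook form above matches the equations as stated.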

B.2 Kalman filter applied to Example 1

There are different ways to set up the Kalman filter for the filtering task of Example 1; see Remark 10. Here, we use the setup resembling the computation of the conditional expectation as carried out in the example. In particular, we have the constant signal process $x_k = \mu$, which we want to predict; hence $F_k = 1$, $w_k = 0$, $Q_k = 0$, and we use only one step of the observation process, where we assume that all observations are made concurrently with

$$z_1 = (X_{t_1} - X_0, \ldots, X_{t_n} - X_0)^\top = (\mu t_k + \sigma W_{t_k})_{1 \leq k \leq n};$$

hence $H_1 = (t_1, \ldots, t_n)^\top$ and $v_1 = \sigma (W_{t_1}, \ldots, W_{t_n})^\top \sim N(0, R)$, with $R_{i,j} = \sigma^2 \min(t_i, t_j)$. The initial values are $\hat{x}_{0|0} = a$ and $P_{0|0} = b^2$, and since the state process is constant, we have $\hat{x}_{1|0} = \hat{x}_{0|0}$ and $P_{1|0} = P_{0|0}$. Moreover,

$$\tilde{y}_1 = (X_{t_1} - X_0 - t_1 a, \ldots, X_{t_n} - X_0 - t_n a)^\top,$$

and $S_1 = b^2 (t_i t_j)_{i,j} + R$. A direct calculation shows that $S_1 = \operatorname{Cov}(z_1)$, which implies that $S_1 = \tilde{\Sigma}_{11}$ holds for $\tilde{\Sigma}_{11}$ defined in Example 1. Moreover, note that $P_{1|0} H_1^\top = b^2 (t_1, \ldots, t_n) = \operatorname{Cov}(\mu, z_1) = \tilde{\Sigma}_{2,1}$; hence $K_1 = \tilde{\Sigma}_{2,1} \tilde{\Sigma}_{11}^{-1}$. Therefore, the update step leads to the a posteriori prediction

$$\hat{x}_{1|1} = a + \tilde{\Sigma}_{2,1} \tilde{\Sigma}_{11}^{-1} (X_{t_1} - X_0 - t_1 a, \ldots, X_{t_n} - X_0 - t_n a)^\top,$$

which coincides with the prediction $\hat{\mu}$ computed in Example 1. Moreover, it is easy to verify that the a posteriori covariance satisfies $P_{1|1} = \hat{\Sigma}$.

Remark 10

As mentioned before, the Kalman filter could also be applied differently, leading to the same result. In particular, and probably more naturally, one could use the increments $z_1 = (X_{t_1} - X_{t_0}, \ldots, X_{t_n} - X_{t_{n-1}})$ as observations. Using them as concurrent observations, one only has to adapt $H_1$, $v_1$ and $R$ in the above computations. Checking that this coincides with the given solution in Example 1 might be a bit tedious. However, it is easy to adjust the computation in Example 1 to also use the increments, making it easy to verify that this coincides with the Kalman filter. Conditioning on the same information, whether via the increments or the path values $X_{t_k}$ directly, yields identical conditional distributions; therefore, the resulting estimator $\hat{\mu}$ must also be the same. Hence, all four mentioned ways lead to the same result. Since the increments are mutually independent, using them also allows for a multi-step (recursive) application of the Kalman filter (while this does not work with the correlated path values $X_{t_k}$). It is important to note that the increments can be processed one at a time or several at a time, and in arbitrary order, always leading to the same resulting estimator $\hat{\mu}$, as outlined, e.g., in [44].
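This equivalence is easy to check numerically. Below is a small numpy sketch, with made-up observation times and path values (all numbers for illustration only), comparing the one-shot batch update of Section B.2 with recursive scalar Kalman updates on the independent increments:

```python
import numpy as np

# Hypothetical observation times and observed path values, with prior
# mu ~ N(a, b^2) for a = 0, b = 1, and volatility sigma = 1.
t = np.array([0.2, 0.5, 1.0])
x = np.array([0.3, 0.4, 1.1])          # observed X_{t_k} - X_0
a, b2, sigma2 = 0.0, 1.0, 1.0

# One-shot batch update with concurrent observations (Section B.2).
H = t.reshape(-1, 1)
S = b2 * (H @ H.T) + sigma2 * np.minimum.outer(t, t)
K = b2 * H.T @ np.linalg.inv(S)
mu_batch = a + (K @ (x - t * a)).item()

# Recursive Kalman updates on the independent increments.
m, P = a, b2
prev_t, prev_x = 0.0, 0.0
for tk, xk in zip(t, x):
    dt, dz = tk - prev_t, xk - prev_x  # increment observation
    Sk = dt**2 * P + sigma2 * dt       # innovation variance
    Kk = P * dt / Sk                   # Kalman gain
    m += Kk * (dz - dt * m)
    P *= 1 - Kk * dt
    prev_t, prev_x = tk, xk

assert abs(mu_batch - m) < 1e-10       # both give the same estimate of mu
```

The recursive variant works precisely because the increments are independent; running the same recursion on the correlated path values would be incorrect.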

C Experimental details

Our experiments are based on the implementation used by [35], which is available at https://github.com/FlorianKrach/PD-NJODE. Therefore, we refer the reader to its appendix for any details that are not provided here.

C.1 A practical note on self-imputation

In real-world settings, data is usually incomplete. In experiments, we have seen that self-imputation is an effective way to deal with missing values, outperforming imputing the last observation or 0s, or only using the signature as input without directly passing the current observation to the model. Using an input-output setting where some input variables are not output variables leads to the problem that self-imputation cannot be used, since the model does not predict an estimate for all input variables. On the other hand, using all input variables also as output variables might lead to worsened performance if the added input variables overshadow the actual target variables in the loss function. There are different approaches to overcome this issue.

  • One can first train a separate model for predicting all input variables, and then use this model for imputation of missing values while training the second model with the actual target variables. This is probably the best possible imputation, but also the most expensive to train.

  • An alternative is to jointly train both networks and, to do so efficiently, use a different readout network $\tilde{g}_{\tilde{\theta}}$ for each of the two models while sharing the other neural networks $f_\theta$ and $\rho_\theta$. Hence, two different losses are used to train the two models. Variants where one of the optimizers only has access to the parameters of its respective readout network can be used.

  • Another approach is to use only one model predicting all input and target variables at once and weighting the different variables in the loss function. Then a reasonable self-imputation and good results for the target prediction can be achieved by changing the weighting throughout the training, from putting all weight on the input variables at the beginning of the training to putting all weight on the target variables later on.

We note that if a large number of training paths is available, training a different model for each output variable will usually lead to an improvement of the results, since no overshadowing of the different variables in the loss can happen. If this is of importance, then it might be beneficial to use the first approach or to use the second approach with a different readout network for each target variable.

On the other hand, if we can computationally afford to train a very large architecture with many hidden neurons for many epochs with a small learning rate, but only have access to a limited number of training paths, then having one model that predicts everything could even be better in terms of generalization to new paths. If we suspect that there are hidden features valuable for predicting both the outputs and the inputs, one can even benefit from multi-task learning [5, 23, 1, 22]. Training one model that predicts both can help to learn more reliable features in the hidden state, rather than overfitting the features to only be useful for predicting the outputs. In practice, it can still be beneficial to down-weight the loss for predicting the inputs, to mitigate numerical problems related to overshadowing the loss for the outputs.
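The weight-shifting idea in the third approach can be sketched as a simple annealing schedule; the linear form and the choice of shift point below are hypothetical, purely for illustration:

```python
def loss_weights(epoch: int, n_epochs: int, shift_frac: float = 0.5):
    """Hypothetical linear annealing: all loss weight on the input
    variables at the start (enabling self-imputation), shifted entirely
    to the target variables after shift_frac of the epochs."""
    w_in = max(0.0, 1.0 - epoch / (shift_frac * n_epochs))
    return w_in, 1.0 - w_in

# combined loss: total_loss = w_in * loss_inputs + w_target * loss_targets
w_in, w_target = loss_weights(epoch=0, n_epochs=200)    # (1.0, 0.0)
w_in, w_target = loss_weights(epoch=150, n_epochs=200)  # (0.0, 1.0)
```

Any monotone schedule that starts input-heavy and ends target-heavy serves the same purpose; the linear ramp is just the simplest choice.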

C.2 Differences between the implementation and the theoretical description of the NJODE

Since we use the same implementation of the PD-NJODE, all differences between the implementation and the theoretical description listed in [35, Appendix D.1.1] also apply here.

C.3 Details for synthetic datasets

Below, we list the standard settings for all synthetic datasets. Any deviations or additions are listed in the respective subsections of the specific datasets.

Dataset

We use the Euler scheme to sample paths from the given stochastic processes on the interval [ 0 , 1 ] , i.e., with T = 1 and a discretization time grid with step size 0.01. At each time point, we observe the process with probability p = 0.1 . We sample between 20,000 and 100,000 paths of which 80 % are used as training set and the remaining 20 % as validation set. Additionally, a test set with 4,000 to 5,000 paths is generated.
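The sampling procedure can be sketched as follows; this is our own illustration (function names and the example coefficients are not from the paper's repository):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_paths(n_paths, drift, vol, T=1.0, dt=0.01, p_obs=0.1, x0=1.0):
    """Euler scheme on [0, T] with step size dt, plus an i.i.d.
    Bernoulli(p_obs) observation mask at every grid point."""
    n_steps = int(round(T / dt))
    X = np.full((n_paths, n_steps + 1), float(x0))
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), n_paths)
        X[:, k + 1] = X[:, k] + drift(X[:, k]) * dt + vol(X[:, k]) * dW
    mask = rng.random((n_paths, n_steps + 1)) < p_obs
    return X, mask

# e.g. a geometric Brownian motion dX = 2 X dt + 0.3 X dW (coefficients
# chosen for illustration, not the paper's parameters):
X, mask = sample_paths(1000, lambda x: 2.0 * x, lambda x: 0.3 * x)
```

The mask plays the role of the observation times: the model only sees `X[i, k]` where `mask[i, k]` is true, which yields irregular and path-dependent observation patterns.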

Architecture

We use the PD-NJODE with the following architecture. The latent dimension is $d_H \in \{100, 200\}$, and all three neural networks share the same structure of one hidden layer with 100 nodes and a ReLU or tanh activation function. The signature is used up to truncation level 3, the encoder is recurrent, and both the encoder and decoder use a residual connection.

Training

We use the Adam optimizer with the standard choices $\beta = (0.9, 0.999)$, weight decay 0.0005 and learning rate 0.001. Moreover, a dropout rate of 0.1 is used for every layer, and training is performed with a mini-batch size of 200 for 200 epochs. The PD-NJODE models are trained with the loss function (3.2).

C.3.1 Details for scaled Brownian motion with uncertain drift

Dataset

For the combined training and validation set, 20,000 paths are sampled, and the test set has 5,000 independent paths.

Architecture

We use $d_H = 100$ and the tanh activation function.

C.3.2 Details for geometric Brownian motion with uncertain parameters

Dataset

For the combined training and validation set, 100,000 paths are sampled, and the test set has 5,000 independent paths.

Architecture

We use $d_H = 100$ and the ReLU activation function for the first experiments. In the convergence study, we use $d_H = 200$.

C.3.3 Details for CIR process with uncertain parameters

Dataset

For the combined training and validation set, 100,000 paths are sampled, and the test set has 4,000 independent paths.

Architecture

We use $d_H = 200$ and the ReLU activation function for the first experiments. For Experiment 2, the empirical performance is slightly better when the encoder and decoder do not use residual connections (validation loss of 0.446 compared to 0.450); therefore, we report these results.

C.3.4 Details for Brownian motion filtering

Dataset

For the combined training and validation set, 40,000 paths are sampled, and the test set has 4,000 independent paths.

Architecture

We use $d_H = 100$ and tanh activation functions.

C.3.5 Details for classifying a Brownian motion

Dataset

For the combined training and validation set, 40,000 paths are sampled, and the test set has 4,000 independent paths.

Architecture

We use $d_H = 200$ and ReLU activation functions.

C.3.6 Details for loss comparison on Black–Scholes

Dataset

The geometric Brownian motion model described in [24, Appendix F.1] is used with the same parameters. For the combined training and validation set, 20,000 paths are sampled; no test set is used.

Architecture

We use $d_H = 100$ and the ReLU activation function.

D Inductive bias

We have proven in Theorem 2 that the IO NJODE is asymptotically unbiased. In Sections 5 and 6, we studied settings where we can simulate arbitrarily many training paths. In such settings, we can rely on Theorem 2 if we sample sufficiently many training paths. However, in settings where we observe a limited number of real-world training paths and cannot simulate further ones, the inductive bias becomes more important. The inductive bias of NJODEs has been discussed in [1, Appendix B] and in [22], and this discussion also applies to the IO NJODE. These insights on the inductive bias can be helpful for understanding when the IO NJODE will generalize well from a few observed training paths to new, unseen test paths.

Acknowledgements

The authors thank Josef Teichmann for helpful inputs and discussions. Moreover, we thank the reviewers for their insightful comments that helped to improve the paper.

References

[1] W. Andersson, J. Heiss, F. Krach and J. Teichmann, Extending path-dependent NJ-ODEs to noisy observations and a dependent observation framework, Trans. Mach. Learn. Res. (2024), https://openreview.net/forum?id=0T2OTVCCC1.

[2] E. Archer, I. M. Park, L. Buesing, J. Cunningham and L. Paninski, Black box variational inference for state space models, preprint (2015), https://arxiv.org/abs/1511.07367.

[3] I. Azizi, J. Bodik, J. Heiss and B. Yu, Clear: Calibrated learning for epistemic and aleatoric risk, preprint (2025), https://arxiv.org/abs/2507.08150.

[4] D. Bugajewski and J. Gulgowski, On the characterization of compactness in the space of functions of bounded variation in the sense of Jordan, J. Math. Anal. Appl. 484 (2020), no. 2, Article ID 123752, doi:10.1016/j.jmaa.2019.123752.

[5] R. Caruana, Multitask learning, Mach. Learn. 28 (1997), 41–75, doi:10.1023/A:1007379606734.

[6] I. Chevyrev and A. Kormilitzin, A primer on the signature method in machine learning, preprint (2016), https://arxiv.org/abs/1603.03788.

[7] S. N. Cohen and R. J. Elliott, Stochastic Calculus and Applications, Probab. Appl., Springer, Cham, 2015, doi:10.1007/978-1-4939-2867-5.

[8] A. Corenflos, J. Thornton, G. Deligiannidis and A. Doucet, Differentiable particle filtering via entropy-regularized optimal transport, International Conference on Machine Learning, PMLR, Oxford (2021), 2100–2111.

[9] J. C. Cox, J. E. Ingersoll, Jr. and S. A. Ross, A theory of the term structure of interest rates, Econometrica 53 (1985), no. 2, 385–407, doi:10.2307/1911242.

[10] S. Csörgő, K. Tandori and V. Totik, On the strong law of large numbers for pairwise independent random variables, Acta Math. Hungar. 42 (1983), no. 3–4, 319–330, doi:10.1007/BF01956779.

[11] C. Cuchiero, F. Primavera and S. Svaluto-Ferro, Universal approximation theorems for continuous functions of càdlàg paths and Lévy-type signature models, Finance Stoch. 29 (2025), no. 2, 289–342, doi:10.1007/s00780-025-00557-5.

[12] P. M. Djuric, J. H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. F. Bugallo and J. Miguez, Particle filtering, IEEE Signal Process. Mag. 20 (2003), no. 5, 19–38, doi:10.1109/MSP.2003.1236770.

[13] D. Duffie and D. Lando, Term structures of credit spreads with incomplete accounting information, Econometrica 69 (2001), no. 3, 633–664, doi:10.1111/1468-0262.00208.

[14] M. L. Eaton, Multivariate Statistics, Lecture Notes Monogr. Ser. 53, Institute of Mathematical Statistics, Beachwood, 2007.

[15] A. Fermanian, Embedding and learning with signatures, Comput. Statist. Data Anal. 157 (2021), Article ID 107148, doi:10.1016/j.csda.2020.107148.

[16] C. Fontana and T. Schmidt, General dynamic term structures under default risk, Stochastic Process. Appl. 128 (2018), no. 10, 3353–3386, doi:10.1016/j.spa.2017.11.003.

[17] R. Frey and T. Schmidt, Pricing corporate securities under noisy asset information, Math. Finance 19 (2009), no. 3, 403–421, doi:10.1111/j.1467-9965.2009.00374.x.

[18] R. Frey and T. Schmidt, Pricing and hedging of credit derivatives via the innovations approach to nonlinear filtering, Finance Stoch. 16 (2012), no. 1, 105–133, doi:10.1007/s00780-011-0153-0.

[19] F. Gehmlich and T. Schmidt, Dynamic defaultable term structure modeling beyond the intensity paradigm, Math. Finance 28 (2018), no. 1, 211–239, doi:10.1111/mafi.12138.

[20] M. Hackenberg, P. Harms, M. Pfaffenlehner, A. Pechmann, J. Kirschner, T. Schmidt and H. Binder, Deep dynamic modeling with just two time points: Can we still allow for individual trajectories?, Biom. J. 64 (2022), no. 8, 1426–1445, doi:10.1002/bimj.202000366.

[21] M. Hackenberg, M. Pfaffenlehner, M. Behrens, A. Pechmann, J. Kirschner and H. Binder, Investigating a domain adaptation approach for integrating different measurement instruments in a longitudinal clinical registry, Biom. J. 67 (2025), no. 1, Article ID e70023, doi:10.1002/bimj.70023.

[22] J. Heiss, Inductive bias of neural networks and selected applications, Doctoral thesis, ETH Zurich, Zurich, 2024.

[23] J. Heiss, J. Teichmann and H. Wutte, How infinitely wide neural networks can benefit from multi-task learning – an exact macroscopic characterization, preprint (2022), https://arxiv.org/abs/2112.15577.

[24] C. Herrera, F. Krach and J. Teichmann, Neural jump ordinary differential equations: Consistent continuous-time prediction and filtering, International Conference on Learning Representations, ICLR, Vienna (2021), https://openreview.net/forum?id=JFKR3WqwyXR.

[25] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Netw. 4 (1991), no. 2, 251–257, doi:10.1016/0893-6080(91)90009-T.

[26] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2 (1989), no. 5, 359–366, doi:10.1016/0893-6080(89)90020-8.

[27] M. I. Jordan, Serial Order: A Parallel Distributed Processing Approach, Adv. Psychology 121, Elsevier, Amsterdam, 1997, doi:10.1016/S0166-4115(97)80111-2.

[28] O. Kallenberg, Foundations of Modern Probability, 3rd ed., Probab. Theory Stoch. Model. 99, Springer, Cham, 2021, doi:10.1007/978-3-030-61871-1.

[29] R. E. Kalman, A new approach to linear filtering and prediction problems, Trans. ASME Ser. D. J. Basic Engrg. 82 (1960), no. 1, 35–45, doi:10.1115/1.3662552.

[30] R. L. Karandikar and B. V. Rao, Introduction to Stochastic Calculus, Indian Stat. Inst. Ser., Springer, Singapore, 2018, doi:10.1007/978-981-10-8318-1.

[31] M. Karl, M. Soelch, J. Bayer and P. Van der Smagt, Deep variational Bayes filters: Unsupervised learning of state space models from raw data, preprint (2016), https://arxiv.org/abs/1605.06432.

[32] K. Khaled and M. Samia, Estimation of the parameters of the stochastic differential equations Black–Scholes model share price of gold, J. Math. Statist. 6 (2010), no. 4, 421–424, doi:10.3844/jmssp.2010.421.424.

[33] F. J. Király and H. Oberhauser, Kernels for sequentially ordered data, J. Mach. Learn. Res. 20 (2019), Paper No. 31.

[34] F. Krach, Neural jump ordinary differential equations, Doctoral thesis, ETH Zurich, Zurich, 2025.

[35] F. Krach, M. Nübel and J. Teichmann, Optimal estimation of generic dynamics by path-dependent neural jump ODEs, preprint (2022), https://arxiv.org/abs/2206.14284.

[36] F. Krach and J. Teichmann, Learning chaotic systems and long-term predictions with neural jump ODEs, preprint (2024), https://arxiv.org/abs/2407.18808.

[37] R. G. Krishnan, U. Shalit and D. Sontag, Deep Kalman filters, preprint (2015), https://arxiv.org/abs/1511.05121.

[38] R. G. Krishnan, U. Shalit and D. Sontag, Structured inference networks for nonlinear state space models, Proceedings of the AAAI Conference on Artificial Intelligence, ACM, New York (2017), 2101–2109, doi:10.1609/aaai.v31i1.10779.

[39] J. Lai, J. Domke and D. Sheldon, Variational marginal particle filters, International Conference on Artificial Intelligence and Statistics, PMLR, Valencia (2022), 875–895.

[40] T. A. Le, M. Igl, T. Rainforth, T. Jin and F. Wood, Auto-encoding sequential Monte Carlo, preprint (2017), https://arxiv.org/abs/1705.10306.

[41] C. J. Maddison, J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet and Y. Teh, Filtering variational objectives, Advances in Neural Information Processing Systems, NIPS, Long Beach (2017), 6573–6583.

[42] R. C. Merton, On the pricing of corporate debt: The risk structure of interest rates, J. Finance 29 (1974), no. 2, 449–470, doi:10.1111/j.1540-6261.1974.tb03058.x.

[43] C. Naesseth, S. Linderman, R. Ranganath and D. Blei, Variational sequential Monte Carlo, International Conference on Artificial Intelligence and Statistics, PMLR, Playa Blanca (2018), 968–977.

[44] Ralff, Kalman filtering: Processing all measurements together vs processing them sequentially (version: 2021-03-11), Mathematics Stack Exchange (2021), https://math.stackexchange.com/q/4058151.

[45] G. Revach, N. Shlezinger, X. Ni, A. López Escoriza, R. J. G. van Sloun and Y. C. Eldar, KalmanNet: Neural network aided Kalman filtering for partially known dynamics, IEEE Trans. Signal Process. 70 (2022), 1532–1547, doi:10.1109/TSP.2022.3158588.

[46] D. E. Rumelhart, G. E. Hinton and R. J. Williams, Learning internal representations by error propagation, Technical report, University of California San Diego, La Jolla, 1985, doi:10.21236/ADA164453.

[47] T. Schmidt and A. Novikov, A structural model with unobserved default boundary, Appl. Math. Finance 15 (2008), no. 1–2, 183–203, doi:10.1080/13504860701718281.

Received: 2025-01-07
Accepted: 2025-08-12
Published Online: 2025-09-05

© 2025 Walter de Gruyter GmbH, Berlin/Boston
