
Robustness of online identification-based policy iteration to noisy data

Bowen Song and Andrea Iannelli

Published/Copyright: May 28, 2025

Abstract

This article investigates the core mechanisms of indirect data-driven control for unknown systems, focusing on the application of policy iteration (PI) within the context of the linear quadratic regulator (LQR) optimal control problem. Specifically, we consider a setting where data is collected sequentially from a linear system subject to exogenous process noise, and is then used to refine estimates of the optimal control policy. We integrate recursive least squares (RLS) for online model estimation within a certainty-equivalent framework, and employ PI to iteratively update the control policy. In this work, we first investigate the convergence behavior of RLS under two different models of adversarial noise, namely point-wise and energy-bounded noise, and then provide a closed-loop analysis of the combined model identification and control design process. This iterative scheme is formulated as an algorithmic dynamical system consisting of the feedback interconnection between two algorithms expressed as discrete-time systems. This system-theoretic viewpoint on indirect data-driven control allows us to establish convergence guarantees to the optimal controller in the face of uncertainty caused by noisy data. Simulations illustrate the theoretical results.
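To make the interplay between the two components concrete, the following minimal Python sketch runs a certainty-equivalent loop of the kind described above on a hypothetical second-order system: a recursive least-squares update refines the estimate of [A B] at every step, and a policy-iteration step is run on the current estimate to update the gain. The system matrices, noise and excitation levels, and the stabilization safeguard are illustrative choices, not the exact algorithm or tuning analyzed in the article.

```python
import numpy as np

# Minimal sketch of online recursive least squares (RLS) identification in
# feedback with certainty-equivalent policy iteration (PI) for the LQR.
# The system, noise levels, and safeguards below are hypothetical and only
# illustrate the interconnection discussed in the article.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])      # true (unknown) dynamics
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
n, m = 2, 1

theta_hat = np.zeros((n, n + m))            # estimate of theta = [A B]
H = 1e-2 * np.eye(n + m)                    # regularized information matrix
K = np.zeros((m, n))                        # initial policy (stabilizing here, since A is Schur)
P = Q.copy()
x = np.array([1.0, 0.0])

def pi_step(A_h, B_h, P, K):
    """One certainty-equivalent PI step on the current model estimate."""
    Acl = A_h + B_h @ K
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        return P, K                          # skip the update if evaluation is ill-posed
    for _ in range(200):                     # policy evaluation (Lyapunov fixed point)
        P = Q + K.T @ R @ K + Acl.T @ P @ Acl
    K = -np.linalg.solve(R + B_h.T @ P @ B_h, B_h.T @ P @ A_h)   # policy improvement
    return P, K

for t in range(300):
    e = 0.1 * rng.standard_normal(m)         # exploration signal e_t
    u = K @ x + e
    w = 0.01 * rng.uniform(-1, 1, n)         # bounded process noise w_t
    x_next = A @ x + B @ u + w
    d = np.concatenate([x, u])               # regressor d_t = [x_t; u_t]
    H += np.outer(d, d)                      # RLS update of the model estimate
    theta_hat += np.outer(x_next - theta_hat @ d, d) @ np.linalg.inv(H)
    P, K = pi_step(theta_hat[:, :n], theta_hat[:, n:], P, K)
    x = x_next

print("parameter error:", np.linalg.norm(theta_hat - np.hstack([A, B])))
```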

Zusammenfassung

This work investigates the core mechanisms of indirect data-driven control of unknown systems, focusing on the application of the policy iteration (PI) method in the context of the linear quadratic regulator (LQR) optimal control problem. In particular, we consider a scenario in which data is collected sequentially from a linear system subject to exogenous process noise and is used to successively improve the estimate of the optimal control policy. We integrate recursive least squares (RLS) for online model estimation within a certainty-equivalence framework and employ PI to iteratively update the controller. We first analyze the convergence behavior of RLS under two different models of adversarial disturbances, namely point-wise and energy-bounded noise, and then carry out a closed-loop analysis of the combined model identification and control design process. This iterative scheme is formulated as an algorithmic dynamical system consisting of the feedback interconnection of two algorithms represented as discrete-time systems. This system-theoretic viewpoint on indirect data-driven control makes it possible to establish convergence guarantees to the optimal controller despite the uncertainty caused by noisy data. Simulations illustrate the theoretical results.


Corresponding author: Bowen Song, Institute for Systems Theory and Automatic Control, University of Stuttgart, Stuttgart, Germany, E-mail: 

About the authors

Bowen Song

Bowen Song is a Ph.D. student at the Institute for Systems Theory and Automatic Control, University of Stuttgart (Germany). He received his B.Eng. in Mechatronics from Tongji University (Shanghai, China) and his M.Sc. in Electrical Engineering and Information Technology from the Technical University of Munich (Germany). He is currently pursuing his Ph.D. in Control Theory and Learning. His research interests include Data-driven Control and Reinforcement Learning.

Andrea Iannelli

Andrea Iannelli is an Assistant Professor in the Institute for Systems Theory and Automatic Control at the University of Stuttgart (Germany). He completed his B.Sc. and M.Sc. degrees in Aerospace Engineering at the University of Pisa (Italy) and received his PhD from the University of Bristol (United Kingdom) on robust control and dynamical systems theory. He was a postdoctoral researcher in the Automatic Control Laboratory at ETH Zurich (Switzerland). His main research interests are at the intersection of control theory, optimization, and learning, with a particular focus on robust and adaptive optimization-based control, uncertainty quantification, and sequential decision-making problems. He serves the community as Associated Editor for the International Journal of Robust and Nonlinear Control and as IPC member of international conferences in the areas of control, optimization, and learning.

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: None declared.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: International Max Planck Research School for Intelligent Systems (IMPRS-IS); German Research Foundation (DFG) - EXC 2075 – 390740016.

  7. Data availability: Not applicable.

Appendix A: Technical proofs

A.1 Proof of Theorem 3

Proof of Theorem 3.

From (19), we have:

(43) $|\Delta\theta_t| \le a\,|\Delta\theta_0|\,|H_t^{-1}| + \sum_{k=1}^{t} |w_k|\,|d_k|\,|H_t^{-1}| \le a\,|\Delta\theta_0|\,|H_t^{-1}| + \bar{d}\,\sum_{k=1}^{t} |w_k|\,|H_t^{-1}|.$

With the definition of local persistency, we have:

$\lambda_{\min}(H_t) \ge a + \left\lfloor \tfrac{t - N_d}{M_d} \right\rfloor M_d\,\alpha_d \ge a + \tfrac{t}{M_d + N_d}\,\alpha_d.$

Then we have:

(44) $|H_t^{-1}| \le \frac{n_x + n_u}{a + \frac{t}{M_d + N_d}\,\alpha_d} \le \frac{(n_x + n_u)(M_d + N_d)}{\min(a, \alpha_d)\, t}, \quad \forall t \in \mathbb{Z}_{++}.$

Substituting (44) into (43), we obtain:

(45) $|\hat{\theta}_t - \theta| \le \beta\big(|\hat{\theta}_0 - \theta|, t\big) + c\,\frac{\sum_{k=1}^{t} |w_k|}{t}, \quad \forall t \in \mathbb{Z}_{++}.$

Based on the bound defined in (2), we obtain:

(46) $\frac{\sum_{k=1}^{t} |w_k|}{t} \le \frac{t\,\sup_{t}\sqrt{w_t^{\top} w_t}}{t} \le \|w\|_{\infty}.$

Substituting (46) into (45), we conclude the proof of Theorem 3.□
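As a numerical illustration of this behavior, the following minimal sketch uses a hypothetical toy setup: i.i.d. regressors instead of closed-loop data, and noise deliberately aligned with the regressor to mimic the adversarial worst case. The scaled inverse $|H_t^{-1}|\,t$ stays bounded, as in (44), while the estimation error does not vanish but settles at a level proportional to the point-wise noise bound, as predicted by (45)–(46).

```python
import numpy as np

# Sketch of Theorem 3's behaviour on a hypothetical toy problem: persistently
# exciting regressors, point-wise bounded noise chosen adversarially (aligned
# with the regressor), and regularized recursive least squares.
rng = np.random.default_rng(1)
theta = np.array([[0.9, 0.2, 0.0], [0.0, 0.8, 1.0]])   # true parameter [A B]
a, w_bar = 0.1, 0.05                                    # regularization weight, noise bound

H = a * np.eye(3)
theta_hat = np.zeros_like(theta)
for t in range(1, 5001):
    d = rng.uniform(-1, 1, 3)                    # exciting regressor d_t
    w = w_bar * d[:2] / np.linalg.norm(d)        # adversarial noise with |w_t| <= w_bar
    y = theta @ d + w
    H += np.outer(d, d)
    theta_hat += np.outer(y - theta_hat @ d, d) @ np.linalg.inv(H)
    if t in (10, 100, 1000, 5000):
        # |H_t^{-1}| * t stays bounded (cf. (44)); the error stops decreasing
        # and remains proportional to w_bar (cf. (45)-(46)).
        print(t, np.linalg.norm(np.linalg.inv(H)) * t,
              np.linalg.norm(theta_hat - theta))
```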

A.2 Proof of Corollary 2

Proof of Corollary 2.

For the proof of Corollary 2, we use the Cauchy–Schwarz inequality,

(47) $\frac{\sum_{k=1}^{t} |w_k|}{t} \le \sqrt{\frac{\sum_{k=1}^{t} w_k^{\top} w_k}{t}} \le \sqrt{\frac{\sum_{k=1}^{\infty} w_k^{\top} w_k}{t}} \le \frac{\|w\|_{2}}{\sqrt{t}}.$

Substituting (47) into (45), we conclude the proof.□
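As a small worked instance of (47) with hypothetical numbers, take a scalar noise sequence with $w_1 = 3$, $w_2 = 1$ and $w_k = 0$ for $k \ge 3$, so that $\|w\|_2 = \sqrt{10}$. Then

$\tfrac{1}{2}(3+1) = 2 \;\le\; \sqrt{\tfrac{9+1}{2}} = \sqrt{5} = \tfrac{\|w\|_2}{\sqrt{2}}, \qquad \tfrac{1}{4}(3+1) = 1 \;\le\; \sqrt{\tfrac{10}{4}} = \tfrac{\|w\|_2}{\sqrt{4}},$

which shows how the averaged noise term in (45) decays like $1/\sqrt{t}$ once the finite noise energy has been expended.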

A.3 Proof of Theorem 4

Proof of Theorem 4.

Based on Assumptions 1 and 2, (33b) follows directly. We now turn to (33a). Assumption 3 guarantees that the standard PI procedure is well defined. Following the same steps as in [24, Appendix D6],

(48) $\hat{P}_{t+1} = \mathcal{L}^{-1}_{A,B,\hat{P}_t}\big(\Gamma(\hat{P}_t)\big) + \varepsilon\big(\Delta A_t, \Delta B_t\big),$

where

(49) $\varepsilon\big(\Delta A_t, \Delta B_t\big) \coloneqq -\mathcal{L}^{-1}_{A,B,\hat{P}_t}\big(\Gamma(\hat{P}_t)\big) + \mathcal{L}^{-1}_{\hat{A}_t,\hat{B}_t,\hat{P}_t}\big(Q + \hat{\alpha}_t^{\top}\hat{\beta}_t^{-1} R\,\hat{\beta}_t^{-1}\hat{\alpha}_t\big),$

and $\Gamma(\hat{P}_t)$ is defined in (10). Using the same arguments as in [24], we can prove that:

(50) $\big|\varepsilon\big(\Delta A_t, \Delta B_t\big)\big| \le \bar{C}\,|\Delta\theta_t|,$

where $\bar{C}$ is a polynomial function of $(A, B, Q, R)$. For the detailed computation steps and the derivation of $\bar{C}$, we refer to [24, Appendix D6]. Then we can prove:

(51) $|\hat{P}_t - P^*| \le c\,|\hat{P}_{t-1} - P^*| + \bar{C}\,|\Delta\theta_t| \le c^{t}\,|\hat{P}_0 - P^*| + \bar{C}\big(1 + c + \cdots + c^{t-1}\big)\,\overline{\Delta\theta} \le c^{t}\,|\hat{P}_0 - P^*| + \frac{\bar{C}}{1 - c}\,\overline{\Delta\theta}.$

Then we conclude the proof of (33a).□
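The geometric-series step in (51) can be checked with a minimal numerical sketch (hypothetical constants $c$ and $\bar{C}$, and a constant error level standing in for $\overline{\Delta\theta}$): the iterate contracts geometrically and stalls at the offset $\bar{C}\,\overline{\Delta\theta}/(1-c)$.

```python
# Minimal check of the recursion bounded in (51), with hypothetical constants.
c, C_bar, dtheta = 0.6, 2.0, 0.01
p = 5.0                                   # stands in for |P_hat_0 - P*|
for _ in range(30):
    p = c * p + C_bar * dtheta            # contraction plus model-error term
print(p, C_bar * dtheta / (1 - c))        # both are ~0.05: the offset in (51)
```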

A.4 Proof of Theorem 5

Proof of Theorem 5.

In this proof, the robustness of PI algorithms plays a central role, as outlined in our previous work [20, Theorem 7]. For clarity and completeness, we recall this theorem here:

Theorem 6

(Robustness of PI [20]). Given $\sigma$ and $\delta_1$ defined in Theorem 2, there always exist constants $\bar{a}_p(\delta_1, \sigma) > 0$ and $\bar{b}_p(\delta_1, \sigma) > 0$ such that, if $\|a\|_\infty \le \bar{a}_p$, $\|b\|_\infty \le \bar{b}_p$ and $\hat{P}_0 \in B_{\delta_1}(P^*)$, where the sequences $\{a_t\}$ and $\{b_t\}$ are defined as

(52) $a_t \coloneqq |\Delta A_t|, \quad b_t \coloneqq |\Delta B_t|,$
with $\Delta A_t \coloneqq \hat{A}_t - A$, $\Delta B_t \coloneqq \hat{B}_t - B$, then
  1. $\hat{K}_t$ is stabilizing, $\forall t \in \mathbb{Z}_+$;

  2. the following holds:

    (53) $|\hat{P}_t - P^*| \le \beta_p\big(|\hat{P}_0 - P^*|, t\big) + \gamma_1(\|a\|_\infty) + \gamma_2(\|b\|_\infty) \le \delta_1, \quad \forall t \in \mathbb{Z}_+,$
    where $\beta_p(x, t) \coloneqq \sigma^t x$; $\gamma_1(x) \coloneqq \frac{\bar{p}_a}{1-\sigma}\,x$; $\gamma_2(x) \coloneqq \frac{\bar{p}_b}{1-\sigma}\,x$ with constants $\bar{p}_a, \bar{p}_b > 0$;
  3. if $\lim_{t\to\infty} |\Delta A_t| = 0$ and $\lim_{t\to\infty} |\Delta B_t| = 0$, then $\lim_{t\to\infty} |\hat{P}_t - P^*| = 0$.

To proceed with the proof, we first verify the conditions under which Theorem 6 holds.

From Theorem 6, if $\|a\|_\infty \le \bar{a}_p$, $\|b\|_\infty \le \bar{b}_p$ and $\hat{P}_0 = P^*$, so that the conditions of Theorem 6 hold, then $|\hat{P}_t - P^*| \le \delta_1$ for all $t \in \mathbb{Z}_+$. Moreover, this guarantees that $\lim_{t\to\infty} |\hat{P}_t - P^*| \le \delta_1$. Now, consider fixed matrices $\tilde{A}$ and $\tilde{B}$ satisfying $|\tilde{A} - A| \le \bar{a}_p$ and $|\tilde{B} - B| \le \bar{b}_p$. Given the initial condition $\hat{P}_0 = P^*$, we conclude that $\lim_{t\to\infty} |\hat{P}_t - P^*| = |P^*_{(\tilde{A},\tilde{B})} - P^*| \le \delta_1$, where $P^*_{(\tilde{A},\tilde{B})}$ is the optimal solution to (6) corresponding to $(\tilde{A}, \tilde{B}, Q, R)$. Thus, for any $\tilde{A}$ and $\tilde{B}$ satisfying $|\tilde{A} - A| \le \bar{a}_p$ and $|\tilde{B} - B| \le \bar{b}_p$, we have $P^*_{(\tilde{A},\tilde{B})} \in B_{\delta_1}(P^*)$.

When the maximum estimation error satisfies $\overline{\Delta\theta}(\hat{\theta}_0, \bar{D}) \le \min\{\bar{a}_p, \bar{b}_p\}$, we have

(54) $\gamma_1(\|a\|_\infty) + \gamma_2(\|b\|_\infty) = \frac{\bar{p}_a\,\|a\|_\infty + \bar{p}_b\,\|b\|_\infty}{1-\sigma} \le \frac{(\bar{p}_a + \bar{p}_b)\,\overline{\Delta\theta}}{1-\sigma}.$

Together with (24), when Assumption 4 holds and the initial policy $\hat{K}_0$ is selected as the solution to (6) using $(\hat{A}_0, \hat{B}_0, Q, R)$, the conditions required by Theorem 6 are satisfied. Substituting (54) into (53), we conclude (35a). We now turn to proving (35b). If the data sequence $\{d_t\}$ is bounded, then we can directly use Theorem 3 to prove (35b).

For matrices, $|\cdot|_2$ denotes the induced 2-norm. Based on Theorem 6, $\hat{K}_t$ is stabilizing for all $t \in \mathbb{Z}_+$. Then we can define:

(55) $\bar{K}_{\mathrm{cl}} \coloneqq \sup_{|\hat{A}-A|\le \bar{a}_p,\, |\hat{B}-B|\le \bar{b}_p,\, P \in B_{\delta_1}(P^*)} \big|\hat{A} - \hat{B}\big(\hat{B}^{\top} P \hat{B} + R\big)^{-1}\hat{B}^{\top} P \hat{A}\big|_2$

and we have $\bar{K}_{\mathrm{cl}} \in [0, 1)$. The additional excitation term $e_t$ satisfies $|e_t| \le \bar{e}$, $\forall t \in \mathbb{Z}_+$, cf. (27). Additionally, we have

(56) $|\hat{K}_t| = \big|\big(R + \hat{B}_t^{\top}\hat{P}_t\hat{B}_t\big)^{-1}\hat{B}_t^{\top}\hat{P}_t\hat{A}_t\big| \le |R^{-1}|\,\big(|B| + \overline{\Delta\theta}\big)\big(|P^*| + \delta_1\big)\big(|A| + \overline{\Delta\theta}\big) \eqqcolon \bar{K}.$

Then we can introduce the following lemma, which shows the boundedness of x t :

Lemma 1

(Boundedness of the state $x_t$). Given the system (1) with noise satisfying (2) and with the control input $u_t = \hat{K}_t x_t + e_t$, where $\hat{K}_t$ is the stabilizing gain from ORLS + PI and $e_t$ satisfies (27), the state of system (1) remains bounded:

(57) $|x_t| \le \max\left\{\frac{|B|\,\bar{e} + \|w\|_\infty}{1 - \bar{K}_{\mathrm{cl}}},\; |x_0|\right\} \eqqcolon \bar{x}, \quad \forall t \in \mathbb{Z}_+,$
where $\bar{K}_{\mathrm{cl}}$ is defined in (55) and $\bar{e}$ is defined in (27).

Proof of Lemma 1.

For the case $|x_t| \le \frac{|B|\,\bar{e} + \|w\|_\infty}{1 - \bar{K}_{\mathrm{cl}}}$,

$|x_{t+1}| \le |A + B\hat{K}_t|_2\,|x_t| + |B|\,\bar{e} + \|w\|_\infty \le \bar{K}_{\mathrm{cl}}\,\frac{|B|\,\bar{e} + \|w\|_\infty}{1 - \bar{K}_{\mathrm{cl}}} + |B|\,\bar{e} + \|w\|_\infty = \frac{|B|\,\bar{e} + \|w\|_\infty}{1 - \bar{K}_{\mathrm{cl}}}.$

In the complementary case $|x_t| > \frac{|B|\,\bar{e} + \|w\|_\infty}{1 - \bar{K}_{\mathrm{cl}}}$, we have $|x_{t+1}| \le \bar{K}_{\mathrm{cl}}\,|x_t| + |B|\,\bar{e} + \|w\|_\infty < |x_t|$, so the state norm does not increase. Together with the upper bound $|x_0| \le \bar{x}$ at initialization, we conclude the proof by induction.□
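A minimal simulation sketch (hypothetical system, gain, and bounds) illustrates the invariance argument behind (57): whenever the closed loop is a contraction in the induced 2-norm, the state norm never exceeds the level $\bar{x}$.

```python
import numpy as np

# Numerical illustration of Lemma 1 with hypothetical values: a fixed
# stabilizing gain (standing in for K_hat_t), bounded excitation and noise.
rng = np.random.default_rng(2)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.3, -0.4]])
e_bar, w_bar = 0.2, 0.05

K_cl = np.linalg.norm(A + B @ K, 2)        # contraction factor, cf. (55)
assert K_cl < 1.0
x = np.array([1.0, -1.0])
x_bar = max((np.linalg.norm(B, 2) * e_bar + w_bar) / (1 - K_cl), np.linalg.norm(x))

worst = 0.0
for t in range(2000):
    e = e_bar * rng.uniform(-1, 1, 1)                  # |e_t| <= e_bar
    w = w_bar * rng.uniform(-1, 1, 2) / np.sqrt(2)     # |w_t| <= w_bar
    x = A @ x + B @ (K @ x + e) + w
    worst = max(worst, np.linalg.norm(x))
print(worst, "<=", x_bar)                  # the bound (57) is never violated
```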

Further, we can derive a bound on the data $d_t$:

Lemma 2

(Boundedness of the data $d_t$). Given the system (1) with noise satisfying (2) and with the control input $u_t = \hat{K}_t x_t + e_t$, where $\hat{K}_t$ is the stabilizing gain from ORLS + PI and $e_t$ satisfies (27), the data $d_t$ employed for RLS is bounded:

(58) $|d_t| = \left|\begin{bmatrix} x_t \\ u_t \end{bmatrix}\right| \le \left|\begin{bmatrix} I \\ \hat{K}_t \end{bmatrix}\right| |x_t| + \left|\begin{bmatrix} 0 \\ e_t \end{bmatrix}\right| \le (1 + \bar{K})\,\bar{x} + \bar{e} \eqqcolon \bar{D},$
where $\bar{K}$ is defined in (56) and $\bar{x}$ is defined in Lemma 1.

Using the upper bound $\bar{D}$ on the data sequence $\{d_t\}$ together with Theorem 3, we conclude (35b).□

References

[1] Z.-S. Hou and Z. Wang, “From model-based control to data-driven control: survey, classification and perspective,” Inf. Sci., vol. 235, pp. 3–35, 2013. https://doi.org/10.1016/j.ins.2012.07.014.

[2] A. Khaki-Sedigh, An Introduction to Data-Driven Control Systems, New Jersey, U.S., Wiley, 2023. https://doi.org/10.1002/9781394196432.

[3] D. Soudbakhsh, et al., “Data-driven control: theory and applications,” in 2023 American Control Conference (ACC), 2023. https://doi.org/10.23919/ACC55779.2023.10156081.

[4] F. Dörfler, “Data-driven control: part two of two: hot take: why not go with models?” IEEE Contr. Syst. Mag., vol. 43, no. 6, pp. 27–31, 2023. https://doi.org/10.1109/mcs.2023.3310302.

[5] T. Faulwasser, R. Ou, G. Pan, P. Schmitz, and K. Worthmann, “Behavioral theory for stochastic systems? A data-driven journey from Willems to Wiener and back again,” Annu. Rev. Control, vol. 55, pp. 92–117, 2023. https://doi.org/10.1016/j.arcontrol.2023.03.005.

[6] J. Berberich and F. Allgöwer, “An overview of systems-theoretic guarantees in data-driven model predictive control,” Annu. Rev. Control Robot. Auton. Syst., vol. 8, 2024. https://doi.org/10.1146/annurev-control-030323-024328.

[7] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Found. Comput. Math., vol. 20, no. 4, pp. 633–679, 2020. https://doi.org/10.1007/s10208-019-09426-y.

[8] M. Ferizbegovic, J. Umenberger, H. Hjalmarsson, and T. B. Schön, “Learning robust LQ-controllers using application oriented exploration,” IEEE Control Syst. Lett., vol. 4, no. 1, pp. 19–24, 2020. https://doi.org/10.1109/lcsys.2019.2921512.

[9] N. Chatzikiriakos, R. Strässer, F. Allgöwer, and A. Iannelli, “End-to-end guarantees for indirect data-driven control of bilinear systems with finite stochastic data,” arXiv preprint arXiv:2409.18010, 2024.

[10] A. Tsiamis, I. M. Ziemann, M. Morari, N. Matni, and G. J. Pappas, “Learning to control linear systems can be hard,” in Proceedings of Thirty Fifth Conference on Learning Theory, 2022.

[11] L. Ljung, System Identification: Theory for the User. Prentice Hall Information and System Sciences Series, New Jersey, U.S., Prentice Hall PTR, 1999.

[12] K. J. Åström and B. Wittenmark, Adaptive Control. Dover Books on Electrical Engineering, New York, U.S., Dover Publications, 2008.

[13] A. M. Annaswamy, “Adaptive control and intersections with reinforcement learning,” Annu. Rev. Control Robot. Auton. Syst., vol. 6, pp. 65–93, 2023. https://doi.org/10.1146/annurev-control-062922-090153.

[14] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control, New Jersey, U.S., John Wiley & Sons, 2012. https://doi.org/10.1002/9781118122631.

[15] D. Bertsekas, Abstract Dynamic Programming, 3rd ed. Nashua, U.S.A., Athena Scientific, 2022.

[16] Y. Park, R. A. Rossi, Z. Wen, G. Wu, and H. Zhao, “Structured policy iteration for linear quadratic regulator,” in Proceedings of the 37th International Conference on Machine Learning, 2020.

[17] D. Lee, “Convergence of dynamic programming on the semidefinite cone for discrete-time infinite-horizon LQR,” IEEE Trans. Autom. Control, vol. 67, no. 10, pp. 5661–5668, 2022. https://doi.org/10.1109/tac.2022.3181752.

[18] B. Pang, T. Bian, and Z.-P. Jiang, “Robust policy iteration for continuous-time linear quadratic regulation,” IEEE Trans. Autom. Control, vol. 67, no. 1, pp. 504–511, 2022. https://doi.org/10.1109/tac.2021.3085510.

[19] D. Bertsekas, “Newton’s method for reinforcement learning and model predictive control,” Results Control Optim., vol. 7, 2022, Art. no. 100121. https://doi.org/10.1016/j.rico.2022.100121.

[20] B. Song, C. Wu, and A. Iannelli, “Convergence and robustness of value and policy iteration for the linear quadratic regulator,” arXiv preprint arXiv:2411.04548, 2024.

[21] T. Y. Chun, J. Y. Lee, J. B. Park, and Y. H. Choi, “Stability and monotone convergence of generalised policy iteration for discrete-time linear quadratic regulations,” Int. J. Control, vol. 89, no. 3, pp. 437–450, 2016. https://doi.org/10.1080/00207179.2015.1079737.

[22] F. A. Yaghmaie, F. Gustafsson, and L. Ljung, “Linear quadratic control using model-free reinforcement learning,” IEEE Trans. Autom. Control, vol. 68, no. 2, pp. 737–752, 2023. https://doi.org/10.1109/tac.2022.3145632.

[23] L. Sforni, G. Carnevale, I. Notarnicola, and G. Notarstefano, “On-policy data-driven linear quadratic regulator via combined policy iteration and recursive least squares,” in 2023 62nd IEEE Conference on Decision and Control (CDC), 2023. https://doi.org/10.1109/CDC49753.2023.10383604.

[24] B. Song and A. Iannelli, “The role of identification in data-driven policy iteration: a system theoretic study,” Int. J. Robust Nonlinear Control, 2024. https://doi.org/10.1002/rnc.7475.

[25] A. Cassel, A. Cohen, and T. Koren, “Logarithmic regret for learning linear quadratic regulators efficiently,” in Proceedings of the 37th International Conference on Machine Learning, Volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 1328–1337.

[26] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, Volume 19 of Proceedings of Machine Learning Research, Budapest, Hungary, PMLR, 2011, pp. 1–26.

[27] M. Simchowitz and D. Foster, “Naive exploration is optimal for online LQR,” in Proceedings of the 37th International Conference on Machine Learning, Volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 8937–8948.

[28] M. Borghesi, A. Bosso, and G. Notarstefano, “On-policy data-driven linear quadratic regulator via model reference adaptive reinforcement learning,” in 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 32–37. https://doi.org/10.1109/CDC49753.2023.10383516.

[29] F. Dörfler, Z. He, G. Belgioioso, S. Bolognani, J. Lygeros, and M. Muehlebach, “Toward a systems theory of algorithms,” IEEE Control Syst. Lett., vol. 8, pp. 1198–1210, 2024. https://doi.org/10.1109/lcsys.2024.3406943.

[30] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proceedings of the 35th International Conference on Machine Learning, Volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1467–1476.

[31] S. Takakura and K. Sato, “Structured output feedback control for linear quadratic regulator using policy gradient method,” IEEE Trans. Autom. Control, vol. 69, no. 1, pp. 363–370, 2024. https://doi.org/10.1109/tac.2023.3264176.

[32] G. Carnevale, N. Mimmo, and G. Notarstefano, “Data-driven LQR with finite-time experiments via extremum-seeking policy iteration,” arXiv preprint arXiv:2412.02758, 2024. https://doi.org/10.1109/CDC56724.2024.10885851.

[33] F. Zhao, F. Dörfler, A. Chiuso, and K. You, “Data-enabled policy optimization for direct adaptive learning of the LQR,” arXiv preprint arXiv:2401.14871, 2024. https://doi.org/10.1109/TAC.2025.3569597.

[34] A. L. Bruce, A. Goel, and D. S. Bernstein, “Convergence and consistency of recursive least squares with variable-rate forgetting,” Automatica, vol. 119, 2020, Art. no. 109052. https://doi.org/10.1016/j.automatica.2020.109052.

[35] S. Tu and B. Recht, “The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint,” in Proceedings of the Thirty-Second Conference on Learning Theory, 2019.

[36] Y. Xie, J. Berberich, and F. Allgöwer, “Data-driven min-max MPC for linear systems,” in 2024 American Control Conference (ACC), 2024. https://doi.org/10.23919/ACC60939.2024.10644295.

[37] J. Venkatasubramanian, J. Köhler, M. Cannon, and F. Allgöwer, “Towards targeted exploration for non-stochastic disturbances,” in 20th IFAC Symposium on System Identification SYSID, 2024. https://doi.org/10.1016/j.ifacol.2024.08.588.

[38] G. Hewer, “An iterative technique for the computation of the steady state gains for the discrete optimal regulator,” IEEE Trans. Autom. Control, vol. 16, no. 4, pp. 382–384, 1971. https://doi.org/10.1109/tac.1971.1099755.

[39] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, Denmark, Technical University of Denmark, 2008, Version 20081110.

[40] E. D. Sontag, Input to State Stability: Basic Concepts and Results, Berlin, Heidelberg, Springer, 2008, pp. 163–220. https://doi.org/10.1007/978-3-540-77653-6_3.

[41] Z.-P. Jiang and Y. Wang, “Input-to-state stability for discrete-time nonlinear systems,” Automatica, vol. 37, no. 6, pp. 857–869, 2001. https://doi.org/10.1016/s0005-1098(01)00028-0.

[42] A. Iannelli and R. S. Smith, “A multiobjective LQR synthesis approach to dual control for uncertain plants,” IEEE Control Syst. Lett., vol. 4, no. 4, pp. 952–957, 2020. https://doi.org/10.1109/lcsys.2020.2997085.

Received: 2024-11-15
Accepted: 2025-04-01
Published Online: 2025-05-28
Published in Print: 2025-06-26

© 2025 Walter de Gruyter GmbH, Berlin/Boston
