A comparative study on multi-task uncertainty quantification in semantic segmentation and monocular depth estimation
Abstract
Deep neural networks excel in perception tasks such as semantic segmentation and monocular depth estimation, making them indispensable in safety-critical applications like autonomous driving and industrial inspection. However, they often suffer from overconfidence and poor explainability, especially for out-of-domain data. While uncertainty quantification has emerged as a promising solution to these challenges, multi-task settings have yet to be explored. To shed light on this, we evaluate Monte Carlo Dropout, Deep Sub-Ensembles, and Deep Ensembles for joint semantic segmentation and monocular depth estimation. We find that Deep Ensembles stand out as the preferred choice, particularly in out-of-domain scenarios, and show that multi-task learning can benefit uncertainty quality compared with solving both tasks separately. Additionally, we highlight the impact of employing different uncertainty thresholds to classify pixels as certain or uncertain, with the median uncertainty emerging as a robust default.
Zusammenfassung
Tiefe neuronale Netze zeichnen sich durch ihre exzellenten Fähigkeiten in Wahrnehmungsaufgaben wie der semantischen Segmentierung und monokularen Tiefenschätzung aus, was sie für sicherheitskritische Anwendungen wie autonomes Fahren und industrielle Inspektion unverzichtbar macht. Allerdings leiden sie oft unter Selbstüberschätzung und schlechter Erklärbarkeit, insbesondere bei domänenfremden Daten. Während sich die Quantifizierung von Unsicherheiten als vielversprechende Lösung für diese Herausforderungen herausgestellt hat, müssen Multi-task-Szenarien noch erforscht werden. Um diese Lücke in der aktuellen Literatur zu füllen, evaluieren wir Monte Carlo Dropout, Deep Sub-Ensembles und Deep Ensembles für die gemeinsame semantische Segmentierung und monokulare Tiefenschätzung. Dabei zeigt sich, dass Deep Ensembles vor allem außerhalb des Domänenspektrums zu bevorzugen sind und dass Multi-task-Lernen im Hinblick auf die Qualität der Unsicherheit im Vergleich zur getrennten Lösung beider Aufgaben potenziell vorteilhaft ist. Darüber hinaus zeigen wir die Auswirkungen der Verwendung verschiedener Unsicherheitsschwellenwerte zur Klassifizierung von Pixeln als sicher oder unsicher, wobei sich der Median als robuster Standard herausstellt.
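The two ingredients described in the abstract — ensemble-based uncertainty for both tasks and median-based thresholding of per-pixel uncertainty — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the array shapes, function names, and the choice of predictive entropy (segmentation) and ensemble standard deviation (depth) as uncertainty measures are assumptions for the sketch.

```python
import numpy as np

def ensemble_uncertainties(seg_probs, depth_preds):
    """Hypothetical sketch of per-pixel multi-task uncertainty.

    seg_probs:   (M, C, H, W) softmax outputs of M ensemble members over C classes
    depth_preds: (M, H, W)    depth predictions of the same M members
    """
    # Segmentation: predictive entropy of the mean softmax across members.
    mean_probs = seg_probs.mean(axis=0)                              # (C, H, W)
    seg_unc = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=0)  # (H, W)
    # Depth: per-pixel standard deviation across ensemble members.
    depth_unc = depth_preds.std(axis=0)                              # (H, W)
    return mean_probs.argmax(axis=0), depth_preds.mean(axis=0), seg_unc, depth_unc

def certainty_mask(uncertainty, threshold=None):
    """Classify pixels as certain (True) or uncertain (False).

    Following the abstract's finding, the median of the uncertainty map
    is used as a robust default threshold when none is given.
    """
    if threshold is None:
        threshold = np.median(uncertainty)
    return uncertainty <= threshold
```

For Monte Carlo Dropout the M "members" would instead be stochastic forward passes of one network with dropout active at test time; the aggregation above stays the same.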
About the authors

Steven Landgraf is a research associate and Ph.D. student with the Institute of Photogrammetry and Remote Sensing (IPF), Karlsruhe Institute of Technology (KIT). His research areas include machine vision applications and deep learning-based uncertainty quantification.

Markus Hillemann received his doctorate in geodesy and geoinformatics from KIT in 2020. From 2016 to 2020, he was a Research Associate at IPF, KIT, and an external employee at Fraunhofer IOSB. He later joined IPF's Machine Vision Metrology group as a Postdoc. His research focuses on machine vision, image and point cloud processing, machine learning, and photogrammetry.

Theodor Kapler is a Master’s student in Geodesy and Geoinformatics at the Karlsruhe Institute of Technology (KIT), with a particular interest in machine vision.

Markus Ulrich received his doctorate in geodesy from TUM in 2003. He was a Software Engineer at MVTec, later leading its research team. From 2005 to 2020, he was a Guest Lecturer at TUM and KIT. In 2017, he completed his habilitation at KIT, where he became a professor in 2020. His research focuses on machine vision, close-range photogrammetry, image processing, and machine learning for industrial applications.
Acknowledgments
The authors acknowledge support by the state of Baden-Württemberg through bwHPC. This work is supported by the Helmholtz Association Initiative and Networking Fund on the HAICORE@KIT partition.
- Research ethics: Not applicable.
- Informed consent: Not applicable.
- Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.
- Use of Large Language Models, AI and Machine Learning Tools: None declared.
- Conflict of interest: The authors state no conflict of interest.
- Research funding: None declared.
- Data availability: Not applicable.
© 2025 Walter de Gruyter GmbH, Berlin/Boston
Articles in the same Issue
- Frontmatter
- Editorial
- Image Processing Forum – Forum Bildverarbeitung 2024
- Research Articles
- Optimizing speed and quality in rendering of sensor-realistic images
- Characterisation metrics for performance comparison of area scan and event-based sensors
- A comparative study on multi-task uncertainty quantification in semantic segmentation and monocular depth estimation
- Comparing fast semantic segmentation CNNs for FPGAs with standard methods
- A comparative study of Q-Seg, quantum-inspired techniques, and U-Net for crack image segmentation
- Bayesian optimization of single-pulse laser drilling using advanced image processing
- Robuste Ampeldetektion und Haltelinienfreigabe durch HD-Karten Assoziation