Abstract
This paper presents a highly dependable object recognition model for industrial applications in difficult environments, combining stereo vision, semantic segmentation, and a hybrid CNN–ViT architecture. The pipeline performs pixel-level segmentation with DeepLabV3+, extracts RGB-D features with VGG19, and completes classification through a compact CNN–ViT fusion module. Robust semantic segmentation combined with stereo depth-aware modelling offers an adaptable and effective solution to perennial challenges in industrial recognition, such as occlusion, lighting fluctuations, and complex backdrops, that would otherwise undermine performance. The proposed pipeline has the potential to enhance depth-based perception and spatial reasoning for identifying and interacting with industrial objects in at least three modes: recognition, cognition, and classification. Evaluated on the XYZ-IBD dataset, the model achieved an overall accuracy of 96.54 %, an F1-score of 0.956, and an AUC of 0.996, indicating a significant advantage over existing 3D deep learning-based recognition models and approaches based on binocular images. The combined semantic segmentation and stereo depth approach offers a robust architecture that enhances perception and accuracy for Industry 4.0-driven industrial robotic applications. Performance gains were confirmed against baseline approaches such as the Bilateral Vision-Aided Transformer Network and binocular Mask R-CNN, over which the model achieved higher accuracy, F1-score, and AUC. The framework also introduces a compact RGB-D fusion design and a hybrid CNN–ViT architecture that improve robustness and recognition reliability in complex industrial settings.
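To make the three-stage design concrete, the sketch below outlines one plausible PyTorch realization of the pipeline the abstract describes (segmentation, RGB-D feature extraction, CNN–ViT fusion). It is illustrative only: torchvision's DeepLabV3 stands in for DeepLabV3+, and the mask-gating step, token width, and transformer depth are assumptions for exposition, not the configuration reported in the paper.

```python
# Illustrative sketch of the three-stage pipeline: segmentation -> RGB-D
# feature extraction -> CNN-ViT fusion. All hyperparameters are assumed.
import torch
import torch.nn as nn
from torchvision import models

class RGBDRecognizer(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Stage 1: pixel-level semantic segmentation (DeepLabV3 stand-in
        # for the DeepLabV3+ used in the paper).
        self.segmenter = models.segmentation.deeplabv3_resnet50(weights=None)
        # Stage 2: VGG19 backbone with its first conv adapted to 4-channel
        # RGB-D input (RGB plus a stereo-derived depth map).
        vgg = models.vgg19(weights=None)
        vgg.features[0] = nn.Conv2d(4, 64, kernel_size=3, padding=1)
        self.backbone = vgg.features                      # (B, 512, H/32, W/32)
        # Stage 3: compact CNN-ViT fusion head over CNN feature-map tokens.
        self.proj = nn.Conv2d(512, 256, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Foreground probability from the segmenter gates the RGB input,
        # suppressing cluttered backgrounds before feature extraction.
        seg = self.segmenter(rgb)["out"].softmax(dim=1)
        mask = seg[:, 1:].sum(dim=1, keepdim=True)        # P(not background)
        rgbd = torch.cat([rgb * mask, depth], dim=1)      # (B, 4, H, W)
        feats = self.proj(self.backbone(rgbd))            # (B, 256, h, w)
        tokens = feats.flatten(2).transpose(1, 2)         # (B, h*w, 256)
        pooled = self.vit(tokens).mean(dim=1)             # pool ViT tokens
        return self.classifier(pooled)

# Example: logits for 10 classes from a 256x256 RGB image and depth map.
# logits = RGBDRecognizer(10)(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))
```

In the full system, the depth channel would be computed from the binocular stereo pair (e.g., via disparity estimation) before entering this fusion stage.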
Research ethics: This study does not involve human participants, animals, or sensitive personal data; therefore, ethical approval was not required.
Author contribution: Yancheng Li: Conceptualization, Methodology, Model Development, Writing – Original Draft. Hongli Wang: Experimentation, Data Curation, Validation, Visualization. Tingting Zhan: Literature Review, Editing, Supervision Assistance, Results Analysis. All authors have read and approved the final manuscript.
Conflict of interest: The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.
Research funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Data availability: Not applicable.
Consent to participate: Not applicable. No human subjects were involved in this study.
Consent to publication: All authors consent to the publication of this manuscript.
References
1. Nagy, M, Lăzăroiu, G, Valaskova, K. Machine intelligence and autonomous robotic technologies in the corporate context of SMEs: deep learning and virtual simulation algorithms, cyber-physical production networks, and industry 4.0-based manufacturing systems. Appl Sci 2023;13:1681. https://doi.org/10.3390/app13031681.
2. Ivanov, V, Andrusyshyn, V, Pavlenko, I, Pitel’, J, Bulej, V. New classification of industrial robotic gripping systems for sustainable production. Sci Rep 2024;14:295. https://doi.org/10.1038/s41598-023-50673-5.
3. Zohaib, M, Ahsan, M, Khan, M, Iqbal, J. A featureless approach for object detection and tracking in dynamic environments. PLoS One 2023;18:e0280476. https://doi.org/10.1371/journal.pone.0280476.
4. Alazeb, A, Chughtai, BR, Al Mudawi, N, AlQahtani, Y, Alonazi, M, Aljuaid, H, et al. Remote intelligent perception system for multi-object detection. Front Neurorob 2024;18:1398703. https://doi.org/10.3389/fnbot.2024.1398703.
5. Sarker, IH. AI-based modeling: techniques, applications and research issues towards automation, intelligent and smart systems. SN Comput Sci 2022;3:158. https://doi.org/10.1007/s42979-022-01043-x.
6. Balasubramanian, B, Cetin, K. Vision-based 6D pose analytics solution for high-precision industrial robot pick-and-place applications. Sensors 2025;25:4824. https://doi.org/10.3390/s25154824.
7. Khor, W, Chen, YK, Roberts, M, Ciampa, F. Automated detection and classification of concealed objects using infrared thermography and convolutional neural networks. Sci Rep 2024;14:8353. https://doi.org/10.1038/s41598-024-56636-8.
8. Rahman, MU, Khan, S, Khan, H, Ali, A, Sarwar, F. Computational chemistry unveiled: a critical analysis of theoretical coordination chemistry and nanostructured materials. Chem Prod Process Model 2024;19:473–515. https://doi.org/10.1515/cppm-2024-0001.
9. Yang, C, Kim, J, Kang, D, Eom, D-S. Vision AI system development for improved productivity in challenging industrial environments: a sustainable and efficient approach. Appl Sci 2024;14:2750. https://doi.org/10.3390/app14072750.
10. Li, Y, Wu, Y, Zhu, G. Automatic rating method based on deep transfer learning for machine translation considering contextual semantic awareness. Alex Eng J 2024;105:588–97. https://doi.org/10.1016/j.aej.2024.08.046.
11. Alaba, SY, Ball, JE. Deep learning-based image 3-D object detection for autonomous driving: review. IEEE Sens J 2023;23:3378–94. https://doi.org/10.1109/JSEN.2023.3235830.
12. Hussain, M. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection. Machines 2023;11:677. https://doi.org/10.3390/machines11070677.
13. Li, X, Li, X, Hu, G, Niu, Q, Xu, L. A low-cost 3D mapping system for indoor scenes based on 2D LiDAR and monocular cameras. Remote Sens 2024;16:4712. https://doi.org/10.3390/rs16244712.
14. Mohammed, SA, Ralescu, AL. Insights into image understanding: segmentation methods for object recognition and scene classification. Algorithms 2024;17:189. https://doi.org/10.3390/a17050189.
15. Rong, J, Wang, P, Wang, T, Hu, L, Yuan, T. Fruit pose recognition and directional orderly grasping strategies for tomato harvesting robots. Comput Electron Agric 2022;202:107430. https://doi.org/10.1016/j.compag.2022.107430.
16. Santos, AA, Schreurs, C, da Silva, AF, Pereira, F, Felgueiras, C, Lopes, AM, et al. Integration of artificial vision and image processing into a pick and place collaborative robotic system. J Intell Rob Syst 2024;110:159. https://doi.org/10.1007/s10846-024-02195-z.
17. Zhang, Z, Shen, Z, Liu, J, Shu, J, Zhang, H. A binocular vision-based crack detection and measurement method incorporating semantic segmentation. Sensors 2024;24:3. https://doi.org/10.3390/s24010003.
18. Yang, Y, Yu, Z. A collaborative navigation model based on multi-sensor fusion of Beidou and binocular vision for complex environments. Appl Sci 2025;15:7912. https://doi.org/10.3390/app15147912.
19. Sitaraman, SR, Narayana, MVS, Lande, J, Shnain, AH. Center intersection of union loss with you only look once for object detection and recognition. In: 2024 International Conference on Intelligent Algorithms for Computational Intelligence Systems (IACIS). Hassan, India: IEEE; 2024:1–4 pp. https://doi.org/10.1109/IACIS61494.2024.10721907.
20. Hu, Q, Wang, K, Ren, F, Wang, Z. Research on underwater robot ranging technology based on semantic segmentation and binocular vision. Sci Rep 2024;14:12309. https://doi.org/10.1038/s41598-024-63017-8.
21. Xiao, Y. Integrating CNN and RANSAC for improved object recognition in industrial robotics. Syst Soft Comput 2025;7:200240. https://doi.org/10.1016/j.sasc.2025.200240.
22. Chen, Q, Wen, J, Lin, T, Ren, H. Target recognition and localization of environmental perception system based on binocular cameras for unmanned cleaning vehicle. Eng Appl Artif Intell 2025;143:109994. https://doi.org/10.1016/j.engappai.2024.109994.
23. Liu, G, Li, D, Sun, W, Xie, Z, Liao, R, Feng, J. An obstacle avoidance safety detection algorithm for power lines combining binocular vision technology and improved object detection. Energy Inform 2024;7:72. https://doi.org/10.1186/s42162-024-00378-4.
24. Wen, Y, Xue, J, Sun, H, Song, Y, Lv, P, Liu, S, et al. High-precision target ranging in complex orchard scenes by utilizing semantic segmentation results and binocular vision. Comput Electron Agric 2023;215:108440. https://doi.org/10.1016/j.compag.2023.108440.
25. Feng, W, Liang, Z, Mei, J, Yang, S, Liang, B, Zhong, X, et al. Petroleum pipeline interface recognition and pose detection based on binocular stereo vision. Processes 2022;10:1722. https://doi.org/10.3390/pr10091722.
26. Li, S, Wang, L, Yu, B, Du, S, Yang, Z. A vision-based method for simultaneous instance segmentation and localization of indoor objects. Appl Sci 2023;13:11702. https://doi.org/10.3390/app132111702.
27. Fang, T, Chen, W, Han, L. Location research and picking experiment of an apple-picking robot based on improved mask R-CNN and binocular vision. Horticulturae 2025;11:801. https://doi.org/10.3390/horticulturae11070801.
28. Tian, F, Hu, G, Yu, S, Wang, R, Song, Z, Yan, Y, et al. Navigation path extraction and experimental research of pusher robot based on binocular vision. Appl Sci 2022;12:6641. https://doi.org/10.3390/app12136641.
29. Lu, C, Tang, X. Binocular vision-based recognition method for table tennis motion trajectory. Mob Inf Syst 2022;2022:2093631. https://doi.org/10.1155/2022/2093631.
30. Ramesh, K, Deshmukh, S, Ray, T, Parimi, C. Enhancing manufacturing process accuracy: a multidisciplinary approach integrating computer vision, machine learning, and control systems. J Manuf Process 2025;142:453–67. https://doi.org/10.1016/j.jmapro.2025.03.112.
31. Khan, D. A functional energy minimization framework for the detection of crash stones on road surfaces in intelligent transportation systems. Mechatron Intell Transp Syst 2025;4. https://doi.org/10.56578/mits040203.
32. Khan, MN, Rahi, A, Hasan, MA, Anwar, S. Decision-level multi-sensor fusion to improve limitations of single-camera-based CNN classification in precision farming: application in weed detection. Computation 2025;13:174. https://doi.org/10.3390/computation13070174.
33. Lee, J, Bjelonic, M, Reske, A, Wellhausen, L, Miki, T, Hutter, M. Learning robust autonomous navigation and locomotion for wheeled-legged robots. Sci Robot 2024;9:eadi9641. https://doi.org/10.1126/scirobotics.adi9641.
34. Mehta, V, Sharma, C, Thiyagarajan, K. Large language models and 3D vision for intelligent robotic perception and autonomy. Sensors 2025;25:6394. https://doi.org/10.3390/s25206394.
35. Busam, JH, Liang, J, Hu, J, Sundermeyer, M, Yu, PKT, Navab, N. XYZ-IBD: high-precision bin-picking dataset for 6D pose & depth estimation. Bengaluru, India: XYZ-IBD Dataset; 2025. https://xyz-ibd.github.io/.
36. Alarood, A, Atoum, MS, Manaf, AA, Abubakar, A, Alsmadi, I. Enhanced obstacle detection using bilateral vision-aided transformer neural network for visually impaired persons. Clust Comput 2025;28:997. https://doi.org/10.1007/s10586-025-05740-z.
37. Su, X, Wang, J, Wang, Y, Zhang, D. Steel roll eye pose detection based on binocular vision and mask R-CNN. Sensors 2025;25:1805. https://doi.org/10.3390/sensors25061805.
© 2025 Walter de Gruyter GmbH, Berlin/Boston