
Surgical phase recognition by learning phase transitions

Manish Sahu, Angelika Szengel, Anirban Mukhopadhyay and Stefan Zachow
Published/Copyright: September 17, 2020

Abstract

Automatic recognition of surgical phases is an important component for developing an intra-operative context-aware system. Prior work in this area focuses on recognizing short-term tool usage patterns within surgical phases. However, the difference between intra- and inter-phase tool usage patterns has not been investigated for automatic phase recognition. We developed a Recurrent Neural Network (RNN), in particular a state-preserving Long Short Term Memory (LSTM) architecture, to utilize the long-term evolution of tool usage within complete surgical procedures. For fully automatic tool presence detection from surgical video frames, a Convolutional Neural Network (CNN) based architecture, namely ZIBNet, is employed. Our proposed approach outperformed EndoNet by 8.1% in overall precision for the phase detection task and by 12.5% in meanAP for the tool recognition task.

Introduction

Decomposing a surgical process into a sequence of abstract tasks independent of physical factors was introduced by MacKenzie et al. [1] under the term Surgical Process Modeling (SPM). SPM centers around the concept of multi-scale temporal abstraction, termed granularity [2]. In SPM, the highest granularity level is called the surgical phase. Fully automatic recognition of such phases can enable multiple applications [3], [4]. Understanding the temporal evolution of tool usage patterns and discovering their relation to the respective surgical phases can provide important cues for automatic recognition of surgical phases [5]. Henceforth, the term evolution refers to the sequential process of surgical tool usage (including co-occurrence) over a complete surgical procedure, and the term pattern refers to the visibility of tools within endoscopic images.

The general trend in the literature on fully automatic surgical phase recognition is to explicitly model a low-level SPM abstraction followed by a global temporal alignment, using some variant of Hidden Markov Models (HMM). For example, Padoy et al. [5] and Twinanda et al. [6] modeled tool presence information using Dynamic Time Warping (DTW) and a CNN, respectively, followed by a hierarchical HMM. Dergachyova et al. [7] used image and instrument signals for modeling low-level information, followed by a Hidden semi-Markov Model for global alignment. A spatio-temporal CNN was introduced by Lea et al. [8] for modeling tool motion, followed by a semi-Markov model or DTW. An LSTM was introduced by DiPietro et al. [9] for surgical phase recognition from kinematics data and more recently utilized by Jin et al. [10] in combination with low-level image features (obtained from a CNN) and heuristic post-processing.

The key observation of the present work is the difference between inter- and intra-phase tool usage evolution, as shown in Figure 1. Unlike prior methods, which solely learn within-phase information, we learn the key changes in sequences of tool usage that uniquely identify phase transitions in an endoscopic video sequence. In essence, the proposed method utilizes the tool presence predictions of ZIBNet [11] in a state-preserving LSTM [12] framework to encode the complete evolution of tool usage during a surgical procedure.

Figure 1: Corrplot visualization of the evolution of surgical tool usage (including tool co-occurrences) at different phases of a surgery, as provided with the Cholec80 dataset.

The major contribution of the proposed work is a novel framework for learning the evolution of surgical tool usage patterns within and across surgical phases, where many-to-one LSTM sequence learning is utilized for fully automatic learning of long-term tool usage evolution.

Method

A set of video sequences, each composed of individual frames $\{I_t\}_{t=1}^{T}$, in combination with ground truth annotations of the corresponding phases $\{y_t\}_{t=1}^{T}$, is given as input for training. The goal of fully automatic phase recognition is to learn a mapping from $I_t$ to $y_t$. However, instead of predicting the phase $\hat{y}_t$ directly from $I_t$, we use ZIBNet [11] to automatically detect the tools associated with $I_t$ and train an LSTM on top of this tool presence information, as shown in Figure 2. To keep this article self-contained, we briefly introduce ZIBNet in the next section before describing the state-preserving LSTM methodology.
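To make this two-stage mapping concrete, the following minimal Python sketch illustrates the online pipeline; `tool_detector` (a stand-in for ZIBNet) and `phase_lstm` (a stand-in for the state-preserving LSTM described below) are hypothetical callables, not part of any published API.

```python
def recognize_phases(frames, tool_detector, phase_lstm):
    """Online two-stage sketch: frame -> tool presence -> phase label.

    tool_detector: stand-in for ZIBNet, maps a frame I_t to a vector of
                   tool presence probabilities x_t.
    phase_lstm:    stand-in for the state-preserving LSTM, maps x_t to a
                   phase prediction while carrying its state forward.
    """
    predictions = []
    for frame in frames:               # frames arrive one by one (online)
        x_t = tool_detector(frame)     # tool presence probabilities
        y_t = phase_lstm(x_t)          # phase estimate, LSTM state preserved
        predictions.append(y_t)
    return predictions
```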

Figure 2: Overall design of the proposed surgical phase recognition framework.

ZIBNet

Sahu et al. introduced ZIBNet as a specialized transfer learning framework built on top of generic CNN feature learning architectures for surgical tool presence detection [13], [11]. ZIBNet treats surgical tool presence detection as a multi-label classification task, in which tool co-occurrences (of second and third order) are handled as separate classes. The authors analyzed the detrimental effect of imbalanced tool usage on CNN performance and employed a stratification technique during CNN training to counter this imbalance. Moreover, online post-processing using temporal smoothing was introduced to enhance run-time prediction. In contrast to [11], which only considered AlexNet [14] as the base CNN architecture, we also investigate Residual Neural Networks (ResNet [15]) trained for ImageNet classification as the base CNN architecture of ZIBNet.

Considering that a set of $L$ tools is used to perform a surgical procedure, ZIBNet learns the mapping between $I_t$ and the ground truth tool labels. During testing, given a video frame $I_t$, ZIBNet predicts a vector $x_t \in [0, 1]^L$ of tool presence probabilities for that frame.
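As a rough illustration of this multi-label setup (not ZIBNet's exact architecture, which additionally models co-occurrence classes, stratified sampling, and temporal smoothing), an ImageNet-pretrained ResNet backbone with one sigmoid output per tool could be sketched in Keras as follows:

```python
import tensorflow as tf

L = 7  # number of surgical tools in Cholec80

# ImageNet-pretrained ResNet-50 backbone with global average pooling.
backbone = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))

# One sigmoid per tool: multi-label, so several tools can co-occur.
outputs = tf.keras.layers.Dense(L, activation="sigmoid")(backbone.output)
model = tf.keras.Model(backbone.input, outputs)

# Binary cross-entropy treats each tool label independently.
model.compile(optimizer="adam", loss="binary_crossentropy")
```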

State-preserving LSTM

The goal of the LSTM is to learn the mapping between the tool predictions $x_t$ generated by ZIBNet and the corresponding phase label $y_t$. The main reason for choosing an LSTM unit is its built-in memory state, through which the LSTM learns to store, update and output information [12].

Typical LSTM architectures map an input of fixed dimensionality to an output vector. However, this setup poses a serious limitation for online phase detection, since the length of a surgical procedure is not known a priori. One key design choice of our learning strategy is to focus on learning phase transitions rather than an overall description of phases. In this work, we formulate online phase detection as a many-to-one LSTM sequence learning task. Figure 2 visualizes the pipeline of the proposed method. In particular, the input sequence of tool predictions is fed into the LSTM one mini-sequence at a time, and the respective state of the LSTM is preserved for the following mini-sequence in order to retain long-term dependencies.
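The following sketch illustrates how mini-sequences of tool predictions could be fed through such a stateful model during online inference; `model` is assumed to be a stateful Keras LSTM (see the configuration sketch in the next section), and the helper name is illustrative.

```python
import numpy as np

SEQ_LEN = 5  # mini-sequence length (see experimental settings)

def predict_phases(model, tool_probs):
    """Many-to-one inference: one phase label per mini-sequence.

    tool_probs: array of shape (T, L) with per-frame tool probabilities.
    The stateful model keeps its LSTM state between consecutive
    mini-sequences; the state is reset before the next video begins.
    """
    phases = []
    for t in range(0, len(tool_probs) - SEQ_LEN + 1, SEQ_LEN):
        chunk = tool_probs[np.newaxis, t:t + SEQ_LEN]  # (1, SEQ_LEN, L)
        phases.append(int(model.predict(chunk).argmax(axis=-1)[0]))
    model.reset_states()  # a new video starts with a fresh state
    return phases
```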

Experiments and results

The performance of the proposed method was quantitatively evaluated with respect to fully automatic surgical phase recognition. In the following, the data set used, the experimental strategy, the quantitative analysis and the comparison with related state-of-the-art methods are described.

Data preparation and experimental settings

All our experiments were conducted on the Cholec80 dataset [6] in an online setting (i.e., no future information was used). The Cholec80 dataset comprises 80 videos of cholecystectomy procedures performed by 13 surgeons. The original frame rate of 25 fps was down-sampled to 1 fps for the ground truth annotation of tool labels as well as for further processing. For SPM, a cholecystectomy procedure is divided into seven surgical phases, and seven tools are commonly used to perform the procedure (see Figure 1). We follow the evaluation strategy of Twinanda et al. [6] for both tool presence detection and phase recognition in order to provide a direct comparison with their method.

The training data was created by splitting each video into mini-sequences of five frames while maintaining the original sequential order. We utilized eight LSTM units, categorical cross-entropy as the loss function and Adam as the optimization algorithm with a learning rate of 0.001.
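Although the original implementation used Keras with a Theano backend, an equivalent tf.keras sketch of the stated configuration (eight stateful LSTM units, categorical cross-entropy, Adam with learning rate 0.001, mini-sequences of five frames) might look as follows; the constants are taken from the text, everything else is a plausible reconstruction rather than the authors' code.

```python
import tensorflow as tf

SEQ_LEN, N_TOOLS, N_PHASES = 5, 7, 7  # from the experimental settings

model = tf.keras.Sequential([
    # stateful=True carries the hidden/cell state across consecutive
    # mini-sequences of the same video; states are reset between videos.
    tf.keras.layers.LSTM(8, stateful=True,
                         batch_input_shape=(1, SEQ_LEN, N_TOOLS)),
    tf.keras.layers.Dense(N_PHASES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy")

# Training must preserve the sequential order of mini-sequences, e.g.:
# model.fit(x, y, batch_size=1, shuffle=False)
```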

Quantitative comparison

Our proposed framework outperforms EndoNet [6] on the phase recognition evaluation metrics developed by Padoy et al. [5], namely average precision and average recall. Results of the top performing methods on the Cholec80 dataset from Twinanda et al. [6] are listed together with our results in Table 1. In particular, the performance of our proposed method is compared to that of PhaseNet, EndoNet, and EndoNet followed by HHMM-based global alignment (EndoNet + HHMM). Our proposed method yields an improvement of approx. 8% in average precision over the second best result (EndoNet + HHMM).

Table 1:

Comparison of phase recognition results with other approaches on Cholec80 dataset (PN → PhaseNet [6], EN → EndoNet [6] and EH → EndoNet + HHMM [6])

Metric            PN     EN     EH     Proposed
Avg. precision    67.0   70.0   73.7   81.8
Avg. recall       63.4   66.0   79.6   80.9

ZIBNet performance for tool detection

Since our proposed method depends on tool presence information, it is important to employ an accurate and reliable tool presence detection method right at the outset. To this end, we compared the performance of two state-of-the-art tool detection methods on the Cholec80 dataset using Average Precision (AP) as the metric. We considered two flavors of ZIBNet [11] with either AlexNet [14] or ResNet [15] as the base CNN, reported as ZIBNet-AlexNet and ZIBNet-ResNet, respectively. It becomes evident from Table 2 that both versions of ZIBNet outperformed EndoNet [6] in terms of Mean Average Precision (MeanAP) over all tools. Moreover, the ResNet flavor beats the other two methods in every tool category and achieved an overall MeanAP of 93.5 (a sketch of the metric computation follows Table 2). An interesting observation is that the less frequent tools, such as scissors, irrigator and specimen bag, are more closely related to phase transitions. The generally superior performance of ZIBNet over EndoNet for these tools (for example, an increase of approx. 30% for scissors), as shown in Table 2, prompted us to choose ZIBNet with ResNet [15] for the initial fully automatic tool presence detection. Note that the results reported in the columns entitled Proposed in Tables 1 and 3 were obtained using the ResNet flavor of ZIBNet as the tool detector.

Table 2:

Comparison of tool recognition performance between EndoNet [6] (EN) and two flavors of ZIBNet: ZIBNet-AlexNet (ZA) and ZIBNet-ResNet (ZR) on the Cholec80 dataset.

Tool            EN     ZA     ZR
Grasper         84.8   86.6   90.5
Bipolar         86.9   94.8   95.6
Hook            95.6   98.8   99.3
Scissors        58.6   68.8   86.8
Clipper         80.1   85.0   96.5
Irrigator       74.4   77.2   89.7
Specimen bag    86.8   88.3   96.1
MeanAP          81.0   85.7   93.5
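The evaluation code is not published with the paper; as a plausible reading of the reported metric, per-tool Average Precision and its mean (MeanAP) can be computed with scikit-learn as sketched below.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def tool_mean_ap(y_true, y_score):
    """Per-tool AP and MeanAP over frame-wise predictions.

    y_true:  (N, L) binary ground-truth tool presence.
    y_score: (N, L) predicted tool presence probabilities.
    """
    aps = [average_precision_score(y_true[:, j], y_score[:, j])
           for j in range(y_true.shape[1])]
    return aps, float(np.mean(aps))
```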
Table 3:

Analysis and comparison of recognition performance for each phase using Precision (Recall) on the Cholec80 dataset (EH → EndoNet + HHMM [6])

Phase                        EH            Proposed
Preparation                  90.0 (85.5)   99.1 (87.6)
Calot triangle dissection    96.4 (81.1)   98.5 (85.3)
Clipping and cutting         69.8 (71.2)   73.6 (68.8)
Gall bladder dissection      82.8 (86.5)   81.6 (92.7)
Gall bladder packaging       55.5 (75.5)   83.0 (83.4)
Cleaning and coagulation     63.9 (68.7)   64.7 (77.4)
Gall bladder retraction      57.5 (88.9)   71.9 (68.1)
Average                      73.7 (79.6)   81.8 (80.9)

Quantitative analysis

We quantitatively analyzed the specific design choices of our proposed method and its performance on each of the seven surgical phases. Precision and recall are considered as the metrics for measuring the performance of fully automatic phase recognition. The combined performance of EndoNet and HHMM (reported as EndoNet + HHMM) is listed for each surgical phase in Table 3. On average, our proposed method outperformed EndoNet + HHMM in terms of both precision and recall.
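The per-phase Precision (Recall) values in Table 3 correspond to a standard frame-wise computation following Padoy et al. [5]; a scikit-learn sketch, assuming integer phase labels 0 through 6, is shown below.

```python
from sklearn.metrics import precision_score, recall_score

def phase_precision_recall(y_true, y_pred, n_phases=7):
    """Frame-wise precision and recall for each surgical phase.

    y_true, y_pred: integer phase labels per frame (0 .. n_phases-1).
    Returns one precision and one recall value per phase, as in Table 3.
    """
    labels = list(range(n_phases))
    precision = precision_score(y_true, y_pred, labels=labels,
                                average=None, zero_division=0)
    recall = recall_score(y_true, y_pred, labels=labels,
                          average=None, zero_division=0)
    return precision, recall
```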

Our proposed framework is implemented in Keras with a Theano backend and runs at 6 fps on an NVIDIA GTX 1080 Ti when applied to a surgical video.

Discussion and conclusion

We presented a fully automatic technique for the recognition of surgical phases from endoscopic videos of cholecystectomy procedures. Unlike prior work, we focused on differentiating between the evolution of tool usage at phase transitions and within phases. In particular, a many-to-one state-preserving LSTM was trained on the tool presence predictions of ZIBNet to learn the evolution of tool usage patterns from endoscopic videos. The proposed method outperformed the EndoNet-based methods on the Cholec80 dataset by 8.1% in average precision. This study concentrated solely on tool presence as the initial source of information; however, future studies of surgical phase recognition from endoscopic videos may benefit from tool localization within the image context as the initial source. Finally, such fully automatic techniques are expected to be instrumental in advancing computer assistance during surgical interventions from bench to bedside.


Corresponding author: Manish Sahu, Zuse Institute Berlin, Berlin, Germany, E-mail:

Award Identifier / Grant number: 16 SV 8019

  1. Research funding: This study was funded by the German Federal Ministry of Education and Research (BMBF) under the project COMPASS (grant no. 16 SV 8019).

  2. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  3. Conflict of interest: The authors state no conflict of interest.

  4. Informed consent: This study contains patient data from a publicly available dataset.

  5. Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

References

1. MacKenzie, L, Ibbotson, J, Cao, C, Lomax, A. Hierarchical decomposition of laparoscopic surgery: a human factors approach to investigating the operating room environment. Minim Invasive Ther Allied Technol 2001;10:121–7. https://doi.org/10.1080/136457001753192222.

2. Lalys, F, Jannin, P. Surgical process modelling: a review. IJCARS 2014;9:495–511. https://doi.org/10.1007/s11548-013-0940-5.

3. Blum, T, Feußner, H, Navab, N. Modeling and segmentation of surgical workflow from laparoscopic video. In: MICCAI. Springer; 2010, pp. 400–7. https://doi.org/10.1007/978-3-642-15711-0_50.

4. Franke, S, Meixensberger, J, Neumuth, T. Multi-perspective workflow modeling for online surgical situation models. J Biomed Inf 2015;54:158–66. https://doi.org/10.1016/j.jbi.2015.02.005.

5. Padoy, N, Blum, T, Ahmadi, SA, Feussner, H, Berger, MO, Navab, N. Statistical modeling and recognition of surgical workflow. Med Image Anal 2012;16:632–41. https://doi.org/10.1016/j.media.2010.10.001.

6. Twinanda, AP, Shehata, S, Mutter, D, Marescaux, J, de Mathelin, M, Padoy, N. EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE TMI 2017;36:86–97. https://doi.org/10.1109/TMI.2016.2593957.

7. Dergachyova, O, Bouget, D, Huaulmé, A, Morandi, X, Jannin, P. Automatic data-driven real-time segmentation and recognition of surgical workflow. IJCARS 2016;11:1081–9. https://doi.org/10.1007/s11548-016-1371-x.

8. Lea, C, Choi, JH, Reiter, A, Hager, GD. Surgical phase recognition: from instrumented ORs to hospitals around the world. M2CAI workshop, MICCAI; 2016.

9. DiPietro, R, Lea, C, Malpani, A, Ahmidi, N, Vedula, SS, Lee, GI, et al. Recognizing surgical activities with recurrent neural networks. In: MICCAI. Springer; 2016, pp. 551–8. https://doi.org/10.1007/978-3-319-46720-7_64.

10. Jin, Y, Dou, Q, Chen, H, Yu, L, Heng, PA. SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE TMI 2018;37:1114–26. https://doi.org/10.1109/TMI.2017.2787657.

11. Sahu, M, Mukhopadhyay, A, Szengel, A, Zachow, S. Addressing multi-label imbalance problem of surgical tool detection using CNN. IJCARS 2017;12:1013–20. https://doi.org/10.1007/s11548-017-1565-x.

12. Hochreiter, S, Schmidhuber, J. Long short-term memory. Neural Comput 1997;9:1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.

13. Sahu, M, Mukhopadhyay, A, Szengel, A, Zachow, S. Tool and phase recognition using contextual CNN features. Tech report, M2CAI challenge. MICCAI; 2016.

14. Krizhevsky, A, Sutskever, I, Hinton, GE. ImageNet classification with deep convolutional neural networks. In: NeurIPS; 2012, pp. 1097–105. https://doi.org/10.1145/3065386.

15. He, K, Zhang, X, Ren, S, Sun, J. Deep residual learning for image recognition. In: IEEE CVPR; 2016, pp. 770–8. https://doi.org/10.1109/CVPR.2016.90.

Published Online: 2020-09-17

© 2020 Manish Sahu et al., published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
