Abstract
Line segment detection provides essential technical support for a variety of visual tasks, helping computer systems comprehend image content and carry out higher-level analysis and applications. Most current line segment detection methods predict the center and the displacement maps of the two endpoints of a line segment separately, without exploiting the coupling between them. Addressing this, this article proposes a dynamic label assignment strategy and an improved deformable convolution for center prediction that use displacement priors, which enhances the model's ability to perceive line segments and effectively improves detection performance. The predicted displacements serve as prior information to guide both the center label assignment and the sampling of the deformable convolution in the center branch, which significantly improves center prediction. In addition, HRNet and UNet3+ are introduced to enhance the feature expression capability of the backbone network. Finally, experiments on the Wireframe and YorkUrban datasets show that the proposed method outperforms the baseline and achieves competitive detection performance among existing line segment detectors.
1 Introduction
Line segments are important visual features in images that can provide basic information for more advanced visual tasks, such as object detection, image segmentation, pose estimation, and 3D reconstruction. In object detection [1–3], line segment detection can be used as a preprocessing step to extract the structural information of the image, enhance features such as edges and contours of the object, and reduce noise interference. It can also serve as a post-processing step to optimize the position and shape of the detection results, improve detection accuracy, and assist in identification and object localization. In image segmentation [4,5], line segment detection can refine rough rectangular boxes into line segments with semantic and perceptual meanings. This process further helps determine which object the pixels belong to, enhancing accuracy and more precisely delineating different regions. In 3D reconstruction [6–8], line segment detection can be used to recognize structural information, such as edges of buildings and road contours in an image, and thus estimate the shape, size, and relative position of an object, aiding in the reconstruction of a 3D scene. In addition, as one of the most basic geometric features in image structure, line segments have important applications in fields such as pose estimation [9–11], template matching [12], and vanishing point estimation [13].
According to the way line segments are generated, current deep line segment detection methods can be categorized into several groups: endpoint matching, attraction field, center point, and transformer-based [14]. Deep wireframe parser (DWP) [15] and end-to-end wireframe parsing (L-CNN) [16] are two highly representative methods that utilize endpoint matching. The fundamental idea is to detect the endpoints in the image and subsequently match these endpoints to form line segments. Holistically-attracted wireframe parser (HAWP) [17] represents each pixel in the image as an attraction-field vector of its nearest line segment and predicts line segments from this field. A predicted line segment is retained only if its two endpoints satisfy a Euclidean distance threshold with the corresponding endpoints in the endpoint heatmap. The transformer-based method LinE segment TRansformers (LETR) [18] uses a multi-scale encoder–decoder structure to iteratively refine a fixed number of pre-defined line entities and produces an equal number of final line entities; a feedforward network then directly predicts the endpoint coordinates and confidence of the line segment for each entity.
Line segment detection methods using center points are similar to CenterNet [19]. Tri-points based line segment detector (TP-LSD) [20] is the first to use tri-points to predict line segments, representing each segment by a center and two displacements. This avoids the extensive computation caused by endpoint matching and the intricate attraction-field representation, significantly improving the speed and efficiency of line segment detection. F-Clip [21] and efficient line segment detector and descriptor (ELSD) [22] generate line segments by predicting their angles and lengths. They outperform other line segment detection methods, including LETR, and achieve better efficiency. Existing deep learning-based methods tend to use large network models to improve detection performance, which restricts their applicability in real-time environments. In pursuit of lightweight, high-efficiency detection, mobile line segment detection (M-LSD) [23] eliminates the multi-module prediction process of traditional methods by minimizing the backbone network. This reduction significantly decreases computational cost, speeds up network inference, and enables real-time detection on mobile devices with strong results.
The key to improving a model's performance in line segment detection is to enhance its ability to perceive line segments. To better extract line segment features, TP-LSD uses a pixel-wise line map to provide auxiliary information for the center branch, M-LSD uses dilated convolution [24] to increase the receptive field of the network, and ELSD uses deformable convolution [25] to adapt to the shape of line segments. Rehman et al. [26] implement a modified resilient backpropagation algorithm to improve the convergence and efficiency of convolutional neural network training, and Rehman et al. [27] provide a comprehensive survey of the relationship between ConvNets with different pre-training methodologies and their optimization effects. These works suggest that, beyond the network structure and loss function, often-overlooked factors such as preprocessing and backpropagation can also have a large impact on model performance. In this article, we exploit the coupling between the center and the displacements of a line segment by using displacements as prior information: the predicted displacements are used to dynamically assign labels to centers and to guide the sampling of the deformable convolution in the center branch.
The main contributions of this article are as follows:
The sampling of deformable convolution for the center branch is guided by the predicted displacements, which enhances the model’s perception of line segment features.
The labels of the centers during training are dynamically adjusted based on the prediction results of the displacements. This adjustment allows the model to concentrate more on learning the challenging-to-score center samples.
2 Related works
2.1 Deep line segment detection
M-LSD [23] uses tri-points to predict line segments, producing superior results in model parameters, network complexity, efficiency, and precision. Therefore, this article builds on M-LSD to dive deeper into line segment detection. The backbone of M-LSD is an encoder–decoder structure: the MobileNetV2 [30] network serves as the encoder, and the decoder is designed as a top-down architecture.
As shown in Figure 1, M-LSD outputs 16 feature maps, including center maps, displacement maps, angle maps, length maps, a line map, and an endpoint map. Each feature map corresponds to a label map of the same resolution. To address the model's limited performance on long line segments, caused by an insufficient network receptive field, M-LSD uses segments-of-line (SoL) augmentation: a long line segment is split into several shorter segments, each of which is treated as a separate entity for prediction. The SoL maps play no direct role in the final line segment prediction; instead, they provide auxiliary information that aids the prediction of line segments.

Overall architecture of M-LSD-tiny.
As shown in Figure 2, the center map predicts the centers of line segments, while the displacement maps predict the displacements from each center to the two endpoints. Given a center $c = (x_c, y_c)$ and the predicted displacement vectors $d_1$ and $d_2$, the two endpoints are recovered as

$$e_1 = c + d_1, \qquad e_2 = c + d_2,$$

where $e_1$ and $e_2$ denote the two endpoints of the line segment.

Tri-points representation of line segments.
The center-based representation avoids the significant computation required by direct endpoint matching and greatly enhances the detection efficiency of line segments. Compared with representing line segments by length and angle, the tri-points representation also makes the model less sensitive to angular error, particularly for long line segments. In this article, the tri-points representation is utilized to predict line segments.
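The tri-points decoding described above can be sketched in a few lines. The map names, the (dx1, dy1, dx2, dy2) channel layout, and the score threshold below are illustrative assumptions rather than the exact M-LSD interface:

```python
# Hedged sketch: decoding line segments from a center heatmap and two
# displacement maps, following the tri-points idea (center + two
# endpoint displacements). Thresholds and tensor names are illustrative.
import torch

def decode_segments(center_map, disp_maps, score_thresh=0.5):
    """center_map: (H, W) scores; disp_maps: (4, H, W) holding
    (dx1, dy1, dx2, dy2) from each center to the two endpoints."""
    ys, xs = torch.nonzero(center_map > score_thresh, as_tuple=True)
    d = disp_maps[:, ys, xs]                     # (4, K)
    x1, y1 = xs + d[0], ys + d[1]                # first endpoints
    x2, y2 = xs + d[2], ys + d[3]                # second endpoints
    scores = center_map[ys, xs]
    return torch.stack([x1, y1, x2, y2], dim=1), scores

# Toy example: one confident center at (y=2, x=3) with displacements
# (+2, +1) and (-2, -1) gives a segment from (5, 3) to (1, 1).
cmap = torch.zeros(8, 8)
cmap[2, 3] = 0.9
dmap = torch.zeros(4, 8, 8)
dmap[:, 2, 3] = torch.tensor([2.0, 1.0, -2.0, -1.0])
segs, scores = decode_segments(cmap, dmap)
print(segs)  # tensor([[5., 3., 1., 1.]])
```

In a full detector, this decoding is typically preceded by a local-maximum filter on the center map and followed by non-maximum suppression over the decoded segments.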
2.2 Deformable convolution
In visual tasks, learning essentially involves adjusting model parameters to adapt to the size, shape, and pose of objects. Vanilla convolutional kernels sample the feature map at fixed positions and apply the same weights everywhere, so they handle irregularly shaped objects poorly and struggle to capture an object's shape and position accurately. Since different positions in the feature map may correspond to objects of various sizes and shapes, precision-demanding visual tasks require convolutions that adaptively adjust their sampling positions and receptive field. In 2017, Dai et al. [25] proposed deformable convolution to enhance the ability to express deformation when processing objects: spatial offsets are introduced so that the convolution kernel can shift its sampling positions on the input feature maps.
In 2D convolution, sampling occurs at fixed positions on a regular grid $\mathcal{R}$ around each output position $p_0$ (for a $3 \times 3$ kernel, $\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (1,1)\}$):

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n),$$

where $x$ is the input feature map, $w$ denotes the kernel weights, and $p_n$ enumerates the grid positions. As shown in Figure 3, deformable convolution introduces a set of offsets $\{\Delta p_n\}$ so that the sampling positions become adaptive:

$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n).$$
Deformable convolution adjusts the sampling position of the convolution based on the object’s shape. This adjustment ensures the validity and accuracy of the features extracted by the convolution kernel, allowing the network to model the object more precisely and enhancing its robustness to shape variations.

Illustration of deformable convolution sampling.
2.3 Datasets
In this article, the model is trained and tested using the Wireframe dataset [15] and the YorkUrban dataset [6]. The training set consists of 5,000 images from the Wireframe dataset, while the test set includes 462 images from the Wireframe dataset and 102 images from YorkUrban. During training, we perform the following dataset augmentations: (1) keep the image unchanged; (2) flip horizontally, vertically, or in both directions simultaneously; (3) rotate 90 degrees clockwise or counterclockwise; (4) randomly crop the image and then resize it to the network input resolution.
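The augmentation pipeline above might be sketched as follows; the crop ratio and the use of PyTorch tensors are assumptions for illustration, and in practice the same geometric transforms must also be applied to the line segment annotations:

```python
# Hedged sketch of the augmentations described above: flips, 90-degree
# rotations, and random crop followed by resize. The crop ratio and
# target resolution are illustrative assumptions.
import random
import torch
import torch.nn.functional as F

def augment(img):  # img: (C, H, W) float tensor
    if random.random() < 0.5:
        img = torch.flip(img, dims=[2])          # horizontal flip
    if random.random() < 0.5:
        img = torch.flip(img, dims=[1])          # vertical flip
    if random.random() < 0.5:
        k = random.choice([1, 3])                # 90 deg CW or CCW
        img = torch.rot90(img, k, dims=[1, 2])
    # random crop, then resize back to the network input resolution
    c, h, w = img.shape
    ch, cw = int(h * 0.8), int(w * 0.8)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    crop = img[:, top:top + ch, left:left + cw]
    img = F.interpolate(crop.unsqueeze(0), size=(h, w), mode="bilinear",
                        align_corners=False).squeeze(0)
    return img

out = augment(torch.rand(3, 512, 512))
print(out.shape)  # torch.Size([3, 512, 512])
```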
3 Line segment detection using displacement prior (D-LSD)
In this section, we present the details of line segment detection using displacement prior (D-LSD). First, HRNet [28] and UNet3+ [29] are combined to form the backbone network. In the prediction head, the predicted displacements serve as prior information to guide the sampling of the deformable convolution in the center branch. Additionally, the predicted displacements are used to dynamically adjust the label assignment on the center map.
3.1 Overall network architecture
As shown in Figure 4, our proposed network is one-stage and consists of a backbone and a prediction head. The backbone takes a 512 × 512 input image and produces a shared feature map, from which the prediction head generates the final feature maps.

Overall architecture of D-LSD.
In the final feature maps, displacement maps predict the displacements of line segments, the center map predicts the centers of line segments, the junction map and line map predict junctions, and the pixel-wise map predicts the line segments’ details. During training, displacements are regressed using smooth L1 loss, while centers, junctions, and pixels are trained using focal loss.
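A minimal sketch of these losses follows, assuming a CornerNet/CenterNet-style penalty-reduced focal loss for the heatmap-like maps; the exact hyperparameters and map names are not given in the text and are illustrative here:

```python
# Hedged sketch of the training losses named above: smooth L1 for the
# displacement maps and a heatmap focal loss (CornerNet-style, an
# assumption) for the center/junction/pixel maps.
import torch
import torch.nn.functional as F

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """pred, gt: (N, H, W); gt uses 1 for positives, <1 elsewhere."""
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

pred_center = torch.rand(1, 32, 32)
gt_center = torch.zeros(1, 32, 32)
gt_center[0, 10, 10] = 1.0

center_loss = heatmap_focal_loss(pred_center, gt_center)
disp_loss = F.smooth_l1_loss(torch.randn(4, 32, 32), torch.randn(4, 32, 32))
print(center_loss.item() >= 0, disp_loss.item() >= 0)
```

With the dynamic label assignment of Section 3.4, positions marked as ignore samples would additionally be masked out of the focal loss.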
3.2 Backbone network
HRNet [28] is a deep learning architecture originally developed for human pose estimation. It has also been employed as a backbone network for line segment detection, with promising results. The decoder of UNet3+ [29] fully integrates high-level and low-level semantic features in full-size feature maps through full-scale skip connections, allowing information to be transmitted more comprehensively within the network and effectively enhancing model performance. As shown in Figure 5, our backbone network is an encoder–decoder structure. To enhance the network's ability to extract line segment features, we utilize HRNet as the encoder of the backbone network. To fully integrate features at different levels, capture detailed information at various scales, and enhance the network's capability to represent fine-grained structures and edges in the image, the decoder of UNet3+ is selected as the decoder of the backbone network.

Overall architecture of the backbone network.
The decoder takes the multi-scale features produced by the HRNet encoder and, through the full-scale skip connections of UNet3+, fuses them into the shared feature map consumed by the prediction head.
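The full-scale skip connection idea can be illustrated with a small module that resizes encoder features from every level to one decoder resolution and fuses them by convolution; channel sizes and the fusion kernel here are illustrative, not the paper's exact configuration:

```python
# Hedged sketch of a UNet3+-style full-scale skip connection: encoder
# features from every level are resized to the decoder level's
# resolution, concatenated, and fused by one convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleFusion(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels), out_channels, 3, padding=1)

    def forward(self, feats, target_hw):
        resized = [F.interpolate(f, size=target_hw, mode="bilinear",
                                 align_corners=False) for f in feats]
        return self.fuse(torch.cat(resized, dim=1))

# Encoder features at 1/4, 1/8, 1/16 scale of a 256x256 input:
feats = [torch.randn(1, 32, 64, 64),
         torch.randn(1, 64, 32, 32),
         torch.randn(1, 128, 16, 16)]
decoder_feat = FullScaleFusion([32, 64, 128], 64)(feats, (64, 64))
print(decoder_feat.shape)  # torch.Size([1, 64, 64, 64])
```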
3.3 Deformable convolution using displacement prior
In this section, the predicted displacements are used as prior information to guide the sampling position of the deformable convolution in the center branch. As shown in Figure 6, the offsets of the deformable convolution for the center branch do not need to be obtained by convolution; instead, they are entirely derived from the prediction results of the displacement maps.

Center branch using displacement prior.
Let the shared feature map be the input to the center branch. At each position, the predicted displacement maps give the vectors from that position to the two endpoints of its line segment. The offsets of the deformable convolution are then derived directly from these displacement vectors rather than learned by an extra convolution: each of the kernel's sampling positions is placed along the predicted center-to-endpoint vectors, so the value at each output position is computed from features sampled along the predicted line segment.
The offsets of the deformable convolution using the displacements are illustrated in Figure 7, where all the sampling points are distributed between the center and the endpoints.

Deformable convolution using prior displacements.
3.4 Dynamic label assignment for the center of line segment
In M-LSD, the centers of line segments are predicted on a feature map at the output resolution. Under the static label assignment, the center pixel of each ground-truth line segment is labeled as a positive sample, and all remaining pixels are labeled as negative samples, even though pixels adjacent to a true center may still yield accurate line segments through their own displacements.

Center label of line segment: (a) the static center label assignment strategy used by M-LSD, where "1" denotes a positive sample and "0" a negative sample; (b) our dynamic center label assignment strategy, where "X" denotes an ignore sample and "?" a candidate negative sample whose final label depends on its displacement prediction. (a) Center label in M-LSD and (b) dynamic center label.
The displacement maps predict the vectors from the centers to the endpoints, and the final line segment is obtained from the center map and the displacement maps: for a predicted center $c$ with displacement vectors $d_1$ and $d_2$, the predicted endpoints are $\hat{e}_1 = c + d_1$ and $\hat{e}_2 = c + d_2$.
The line segment is predicted jointly by the center and the displacements. During training, the proposed center label assignment strategy dynamically decides whether a candidate negative center sample is kept as a negative sample or turned into an ignore sample, according to its displacement prediction: if the line segment implied by the candidate's displacements is accurate, the candidate should be ignored rather than penalized. The accuracy of the implied line segment is measured by the distance between its predicted endpoints and the corresponding ground-truth endpoints. If this distance falls below a threshold, the candidate is marked as an ignore sample; otherwise, it remains a negative sample.
| Algorithm 1. Dynamic label assignment for the centers of line segments |
|---|
| Require: set of ground-truth line segments; predicted displacement maps |
| Ensure: label map for the centers |
| For each candidate negative center, compute the line segment implied by its predicted displacements; if its endpoint distance to the matched ground-truth segment is below the threshold, relabel the center as an ignore sample; otherwise, keep it as a negative sample. |
| In the above process, a label of 1 indicates a positive sample, 0 indicates a negative sample, and a distinct ignore value marks samples excluded from the loss. |
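A hedged Python sketch of this assignment follows; the distance threshold, the ignore value, and the map names are illustrative assumptions:

```python
# Hedged sketch of dynamic center-label assignment: a candidate negative
# center whose predicted displacements already yield an accurate line
# segment is ignored instead of penalized. Threshold is illustrative.
import torch

IGNORE = -1  # excluded from the focal loss

def dynamic_center_labels(static_labels, pred_disp, gt_segments, tau=2.0):
    """static_labels: (H, W) with 1 positives / 0 negatives;
    pred_disp: (4, H, W); gt_segments: (M, 4) as (x1, y1, x2, y2)."""
    labels = static_labels.clone()
    ys, xs = torch.nonzero(static_labels == 0, as_tuple=True)
    for y, x in zip(ys.tolist(), xs.tolist()):
        d = pred_disp[:, y, x]
        e1 = torch.stack([x + d[0], y + d[1]])
        e2 = torch.stack([x + d[2], y + d[3]])
        for gx1, gy1, gx2, gy2 in gt_segments.tolist():
            g1 = torch.tensor([gx1, gy1])
            g2 = torch.tensor([gx2, gy2])
            # endpoint distance of the better of the two orderings
            err = min(max((e1 - g1).norm(), (e2 - g2).norm()),
                      max((e1 - g2).norm(), (e2 - g1).norm()))
            if err < tau:           # implied segment is accurate: ignore
                labels[y, x] = IGNORE
                break
    return labels

static = torch.zeros(8, 8)
static[2, 3] = 1.0                                    # true center
disp = torch.zeros(4, 8, 8)
disp[:, 2, 4] = torch.tensor([1.0, 1.0, -3.0, -1.0])  # accurate neighbor
gt = torch.tensor([[5.0, 3.0, 1.0, 1.0]])
labels = dynamic_center_labels(static, disp, gt)
print(labels[2, 4].item())  # -1.0: accurate neighbor becomes ignore
```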
Furthermore, the pixels surrounding a center no longer share that center's displacements; instead, each pixel's displacements to the endpoints of the line segment are computed separately. This enhancement significantly improves the accuracy of the displacements.
The dynamic label assignment strategy enables the network to concentrate more on learning to classify the center and the background without allocating excessive attention to the regions surrounding the centers, which have less impact on the results. This approach allows the network to demonstrate improved performance.
4 Experiments
In this section, we will validate the effectiveness of the proposed methods through experiments, all of which are conducted on the Wireframe dataset and the YorkUrban dataset.
4.1 Ablation study
In order to verify the effectiveness of the dynamic label assignment strategy and the backbone network proposed in Section 3, the following ablation experiments are conducted using M-LSD-tiny as a benchmark. All experiments are carried out on NVIDIA RTX 3080 Ti GPUs using the PyTorch framework. The model parameter settings and optimization are kept consistent with those of M-LSD-tiny.
As shown in Table 1, we use M-LSD-tiny as the baseline for the ablation experiments. Replacing the baseline's label assignment strategy with our dynamic label assignment strategy improves every metric on both datasets, and replacing the backbone with the HRNet + UNet3+ backbone yields a larger gain; combining both gives the best results in Table 1.
Ablation study of dynamic label and backbone
| Dynamic | Backbone | Wireframe $F^H$ | Wireframe sAP$^5$ | Wireframe sAP$^{10}$ | YorkUrban $F^H$ | YorkUrban sAP$^5$ | YorkUrban sAP$^{10}$ |
|---|---|---|---|---|---|---|---|
|  |  | 77.2 | 52.3 | 58.0 | 62.4 | 22.1 | 25.0 |
| ✓ |  | 77.9 | 52.8 | 58.7 | 64.1 | 22.8 | 25.5 |
|  | ✓ | 79.7 | 61.7 | 66.6 | 65.0 | 25.4 | 27.8 |
| ✓ | ✓ | 80.0 | 61.9 | 66.9 | 65.1 | 25.8 | 28.2 |
We use M-LSD-tiny as the baseline; Dis-Deform denotes replacing Block C with the prediction head described in Section 3.3, and Deform denotes replacing the displacement-prior deformable convolution in that prediction head with a normal deformable convolution. As shown in Table 2, the proposed prediction head improves every metric on both datasets over both the baseline and the plain deformable convolution variant.
Ablation study of deformable convolution using displacement prior
| Model | Wireframe $F^H$ | Wireframe sAP$^5$ | Wireframe sAP$^{10}$ | YorkUrban $F^H$ | YorkUrban sAP$^5$ | YorkUrban sAP$^{10}$ |
|---|---|---|---|---|---|---|
| Baseline | 77.2 | 52.3 | 58.0 | 62.4 | 22.1 | 25.0 |
| Deform | 78.6 | 57.4 | 62.5 | 63.8 | 23.9 | 26.5 |
| Dis-Deform | 78.8 | 59.4 | 64.3 | 64.7 | 25.1 | 27.8 |
Then, simultaneously using the dynamic label assignment strategy, the prediction head, and the backbone network proposed in this article, the model's performance improves further, as reflected in the first row of Table 3.
In addition, two non-maximum suppression methods proposed by F-Clip [21], SoftNMS and StructNMS, are applied in the post-processing of our algorithm; the results are shown in Table 3. With both enabled, the proposed model reaches 81.2, 65.5, and 69.5 for $F^H$, sAP$^5$, and sAP$^{10}$ on the Wireframe dataset, and 65.3, 29.5, and 32.2 on YorkUrban.
Ablation study of SoftNMS and StructNMS
| SoftNMS | StructNMS | Wireframe $F^H$ | Wireframe sAP$^5$ | Wireframe sAP$^{10}$ | YorkUrban $F^H$ | YorkUrban sAP$^5$ | YorkUrban sAP$^{10}$ |
|---|---|---|---|---|---|---|---|
|  |  | 81.0 | 64.4 | 68.7 | 65.2 | 27.8 | 30.4 |
| ✓ |  | 81.1 | 65.1 | 69.1 | 65.3 | 29.2 | 31.9 |
|  | ✓ | 81.2 | 64.7 | 68.9 | 65.3 | 28.6 | 31.4 |
| ✓ | ✓ | 81.2 | 65.5 | 69.5 | 65.3 | 29.5 | 32.2 |
4.2 Comparison with other methods
The results of comparing our proposed line segment detection algorithm with other top-performing methods, such as HAWP, L-CNN, and F-Clip, on the Wireframe and YorkUrban datasets are shown in Table 4. On the Wireframe dataset, our method achieves the best sAP$^5$, sAP$^{10}$, and sAP$^{15}$ among all listed methods, and it likewise attains the best sAP scores on the YorkUrban dataset.
Quantitative comparisons with existing LSD methods
| Category | Method | Input size | Wireframe $F^H$ | sAP$^5$ | sAP$^{10}$ | sAP$^{15}$ | YorkUrban $F^H$ | sAP$^5$ | sAP$^{10}$ | sAP$^{15}$ |
|---|---|---|---|---|---|---|---|---|---|---|
|  | LSD | 320 | 64.1 | 6.7 | 8.8 | — | 60.6 | 7.5 | 9.2 | — |
| Two-stage | HAWP | 512 | 80.3 | 62.5 | 66.5 | 68.2 | 64.8 | 26.1 | 28.5 | 29.7 |
|  | L-CNN | 512 | 76.9 | 58.9 | 62.8 | 64.9 | 63.8 | 24.3 | 26.4 | 27.5 |
|  | ELSD | 512 | 83.1 | 64.3 | 68.9 | 70.9 | 64.8 | 27.6 | 30.2 | 31.8 |
| One-stage | DWP | 512 | 72.2 | 3.7 | 5.1 | 5.9 | 61.6 | 1.5 | 2.1 | 2.6 |
|  | AFM | 320 | 77.2 | 18.5 | 24.4 | 27.5 | 63.3 | 7.3 | 9.4 | 11.1 |
|  | LETR | 512 | **83.3** | — | 65.2 | 67.7 | 66.6 | — | 29.4 | 31.7 |
|  | TP-LSD | 512 | 80.6 | 57.6 | 57.2 | — | **67.2** | 27.6 | 27.7 | — |
|  | M-LSD-tiny | 512 | 77.2 | 52.3 | 58.0 | — | 62.4 | 22.1 | 25.0 | — |
|  | F-Clip | 512 | 80.9 | 64.3 | 68.3 | 70.1 | 64.5 | 28.5 | 30.8 | 31.3 |
|  | Ours | 512 | 81.2 | **65.6** | **69.5** | **71.1** | 65.3 | **29.5** | **32.2** | **33.8** |
Bold values highlight the best-performing method for each evaluation metric.
Among the existing methods for line segment detection, two-stage methods demonstrate better performance. The method described in this article utilizes displacements as prior information to guide the sampling process of deformable convolutions for the center branch and the label assignment of centers. Finally, our method achieves better detection performance with a one-stage structure than the existing two-stage methods.
Comparisons of detection results with L-CNN, HAWP, and F-Clip on the Wireframe dataset and the YorkUrban dataset are shown in Figure 9. Both L-CNN and HAWP rely on endpoint detection and sampling of line segment features, and they face challenges at connectivity points and under texture variations. F-Clip predicts line segments by angle and length, making it more sensitive to angular error, which limits its accuracy on long line segments. In contrast, the method proposed in this article achieves higher detection precision, recall, and overall performance.

Visualization of line segment detection methods on Wireframe dataset and YorkUrban dataset. (a) Label, (b) L-CNN, (c) HAWP, (d) F-Clip, and (e) Ours.
5 Conclusion
This article proposes a one-stage line segment detection method using displacement priors, in which line segments are predicted from centers and displacements. To enhance the accuracy of the centers, we utilize the displacements as prior information both to dynamically adjust the label assignment of center samples and to guide the sampling of the deformable convolution in the center branch. In addition, HRNet and UNet3+ are adopted to strengthen the backbone network, resulting in further gains in detection performance.
In the future, we will further investigate the dynamic label assignment strategy for other feature maps and use line segment angles and lengths as priors to further enhance the line segment perception capability of the model.
- Funding information: Authors state no funding involved.
- Author contributions: Xin Zhu: conceptualization, methodology, software, writing. Hancheng Yu: conceptualization, methodology, supervision, funding acquisition. Yupu Zhang: visualization, validation, writing. Ming Zhou: visualization, validation, writing.
- Conflict of interest: Authors state no conflict of interest.
- Data availability statement: The data that support the findings of this study are openly available in the following repositories: The Wireframe dataset is available at https://github.com/huangkuns/wireframe, under a permissive academic license. The YorkUrban dataset is available at https://github.com/NamgyuCho/Linelet-code-and-YorkUrban-LineSegment-DB, also under a permissive academic license. These datasets were used for training and evaluating the proposed method.
References
[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2014, pp. 580–587. 10.1109/CVPR.2014.81
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788. 10.1109/CVPR.2016.91
[3] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, et al., "SSD: Single shot multibox detector," In: Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 21–37. 10.1007/978-3-319-46448-0_2
[4] M. Bai and R. Urtasun, "Deep watershed transform for instance segmentation," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 5221–5229. 10.1109/CVPR.2017.305
[5] S. Peng, W. Jiang, H. Pi, H. Bao, and X. Zhou, "Deep snake for real-time instance segmentation," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 8533–8542. 10.1109/CVPR42600.2020.00856
[6] P. Denis, J. H. Elder, and F. J. Estrada, "Efficient edge-based methods for estimating manhattan frames in urban imagery," In: Proc. Eur. Conf. Comput. Vis. (ECCV), 2008, pp. 197–210. 10.1007/978-3-540-88688-4_15
[7] Y. Zhou, J. Huang, X. Dai, L. Luo, Z. Chen, and Y. Ma, HoliCity: A city-scale data platform for learning holistic 3D structures, 2020, arXiv:2008.03286.
[8] Y. Zhou, H. Qi, Y. Zhai, Q. Sun, Z. Chen, and L. Y. Wei, "Learning to reconstruct 3D manhattan wireframes from a single image," In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 7698–7707. 10.1109/ICCV.2019.00779
[9] B. Přibyl, P. Zemčík, and M. Čadík, "Absolute pose estimation from line correspondences using direct linear transformation," Comput. Vis. Image Underst., vol. 161, pp. 130–144, Aug. 2017. 10.1016/j.cviu.2017.05.002
[10] C. Xu, L. Zhang, L. Cheng, and R. Koch, "Pose estimation from line correspondences: A complete analysis and a series of solutions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, pp. 1209–1222, 2016. 10.1109/TPAMI.2016.2582162
[11] A. Elqursh and A. Elgammal, "Line-based relative pose estimation," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2011, pp. 3049–3056. 10.1109/CVPR.2011.5995512
[12] N. Xue, G. S. Xia, X. Bai, L. Zhang, and W. Shen, "Anisotropic-scale junction detection and matching for indoor images," IEEE Trans. Image Process., vol. 27, pp. 78–91, 2017. 10.1109/TIP.2017.2754945
[13] Y. Zhou, H. Qi, J. Huang, and Y. Ma, "NeurVPS: Neural vanishing point scanning via conic convolution," In: Proc. Adv. Neural Inform. Process. Syst. (NeurIPS), vol. 32, 2019.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., "Attention is all you need," In: Proc. Adv. Neural Inform. Process. Syst. (NeurIPS), vol. 30, 2017, pp. 6000–6010.
[15] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, "Learning to parse wireframes in images of man-made environments," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 626–635. 10.1109/CVPR.2018.00072
[16] Y. Zhou, H. Qi, and Y. Ma, "End-to-end wireframe parsing," In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 962–971. 10.1109/ICCV.2019.00105
[17] N. Xue, T. Wu, S. Bai, F. Wang, G. S. Xia, L. Zhang, et al., "Holistically-attracted wireframe parsing," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 2788–2797. 10.1109/CVPR42600.2020.00286
[18] Y. Xu, W. Xu, D. Cheung, and Z. Tu, "Line segment detection using transformers without edges," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 4257–4266. 10.1109/CVPR46437.2021.00424
[19] X. Zhou, D. Wang, and P. Krähenbühl, Objects as points, 2019, arXiv:1904.07850.
[20] S. Huang, F. Qin, P. Xiong, N. Ding, Y. He, and X. Liu, "TP-LSD: Tri-points based line segment detector," In: Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 770–785. 10.1007/978-3-030-58583-9_46
[21] X. Dai, H. Gong, S. Wu, X. Yuan, and Y. Ma, "Fully convolutional line parsing," Neurocomputing, vol. 506, pp. 1–11, 2022. 10.1016/j.neucom.2022.07.026
[22] H. Zhang, Y. Luo, F. Qin, Y. He, and X. Liu, "ELSD: Efficient line segment detector and descriptor," In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 2969–2978. 10.1109/ICCV48922.2021.00296
[23] G. Gu, B. Ko, S. H. Go, S. H. Lee, J. Lee, and M. Shin, "Towards light-weight and real-time line segment detection," In: Proc. AAAI Conf. Artif. Intell., 2022, pp. 726–734. 10.1609/aaai.v36i1.19953
[24] F. Yu and V. Koltun, Multi-scale context aggregation by dilated convolutions, 2015, arXiv:1511.07122.
[25] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, and H. Hu, "Deformable convolutional networks," In: Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2017, pp. 764–773. 10.1109/ICCV.2017.89
[26] S. U. Rehman, S. Tu, O. U. Rehman, Y. Huang, C. M. S. Magurawalage, and C. C. Chang, "Optimization of CNN through novel training strategy for visual classification problems," Entropy, vol. 20, 2018, id. 290. 10.3390/e20040290
[27] S. U. Rehman, S. Tu, M. Waqas, Y. F. Huang, O. U. Rehman, B. Ahmad, et al., "Unsupervised pre-trained filter learning approach for efficient convolution neural network," Neurocomputing, vol. 365, pp. 171–190, 2019. 10.1016/j.neucom.2019.06.084
[28] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, et al., "Deep high-resolution representation learning for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, pp. 3349–3364, 2020. 10.1109/TPAMI.2020.2983686
[29] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, et al., "UNet 3+: A full-scale connected UNet for medical image segmentation," In: Proc. ICASSP, 2020, pp. 1055–1059. 10.1109/ICASSP40776.2020.9053405
[30] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," In: Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4510–4520. 10.1109/CVPR.2018.00474
© 2025 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.