
Urban road surface condition detecting and integrating based on the mobile sensing framework with multi-modal sensors

Haiyang Lyu, Yu Huang, Xiaoyang Qian and Donglai Jiao
Published/Copyright: October 6, 2025

Abstract

The urban road surface condition, encompassing elements such as road surface damage and facility distributions, collectively referred to as bump features of the road surface (BFRS), is critical for road maintenance and management in smart cities. Traditional BFRS collection methods relying on specialized equipment are often time-consuming and labor-intensive, whereas crowdsourced data frequently suffer from inconsistent quality and reliability. To address these issues, BFRS are detected and integrated via a mobile sensing framework (MSF), which leverages multi-modal Internet of Things sensors, including a global positioning system (GPS) receiver, gyroscope, accelerometer, camera, and 4G module. First, the MSF was designed: a comprehensive set of multi-modal sensors was integrated into a Jetson Nano and deployed on a vehicle. Second, the software for BFRS computing was implemented to process the collected multi-modal sensor data. This software comprises three key components: the collection of multi-modal sensor data at the mobile end, data processing on the server side, and map interaction and visualization via the web browser. Finally, the fine-tuned TimesFM model (FTTM) dedicated to BFRS acquisition was proposed. BFRS were detected using spatially transformed acceleration data derived from the Inertial Measurement Unit (IMU), and these detections were temporally synchronized and integrated with corresponding road images identified by a specially trained YOLO model. Experiments were conducted, BFRS were detected and integrated using the proposed FTTM, and the results were presented through web visualizations. The results demonstrate that the proposed method effectively captures the road surface condition using the MSF, highlighting its significant potential for applications in smart cities.

1 Introduction

The urban road network serves as the backbone of the urban transportation system, facilitating the movement of people and goods and supporting the rhythm of city life [1–3]. However, vehicle traffic inherently impacts pavement integrity [4,5], causing damage to the road surface, while municipal infrastructure, including speed bumps, manhole covers, and drainage systems, poses a threat to transportation safety [6,7]. These factors generate bump features of the road surface (BFRS) and thereby directly influence the road network’s capacity [8,9]. Therefore, accurately acquiring BFRS is critical for the maintenance and management of municipal roads, as well as for the sustainable development of smart cities [10].

Traditional BFRS collection methods utilize professional surveying equipment to acquire specific road surface conditions, including remote sensing, the global positioning system (GPS), ground-penetrating radar, Light Detection and Ranging (LiDAR), and unmanned aerial vehicles (UAVs). Torbaghan et al. [11] utilized ground-penetrating radar to collect BFRS, applied singular value decomposition to the radar images, and detected cracks in the road surface based on the peak signal-to-noise ratio. Sui et al. [12] extracted road boundaries from LiDAR-acquired point cloud data through an analysis of the spatial distribution of the raw point clouds. BFRS detection can also be conducted with a combination of LiDAR and camera, since direct visual information can enhance feature identification [13,14]. Alternatively, Tan and Li [10] obtained road surface conditions via UAV: they computed 3D point clouds from collected oblique images and detected road surface distress based on the reconstructed 3D surface model. BFRS can also be observed from remote sensing images [15]. These tools are highly effective in capturing precise road surface information. However, the reliance on such sophisticated technology requires significant expertise and entails substantial logistical planning and execution, making the process both labor-intensive and time-consuming. With the widespread adoption of devices such as smartphones, the use of non-specialized equipment for road surface data collection has become increasingly feasible. Smartphones, along with street-view videos, can capture sensor-based vibration data from road surfaces, enabling the extraction of features related to surface unevenness. Consequently, these non-professional devices have emerged as practical alternatives to traditional methods for acquiring BFRS. Based on street-view images, Ren et al. [7] used the Generalized Feature Pyramid Network structure and an improved YOLO v5 model to detect road surface damage. Other deep learning-based image processing approaches, such as the U-Net model [16] and the diffusion model [17], have also been applied. These methods are adapted to detect specific BFRS features from 2D images [3,5,6,18]. In addition to image-based methods, some approaches detect BFRS directly from one-dimensional, time-serialized sensor recordings. For instance, Zang et al. [19] mounted a smartphone equipped with an accelerometer and GPS on a bicycle and detected BFRS based on local threshold values of acceleration. Since acceleration changes directly reflect surface irregularities, smartphone sensor data have been widely used for road surface condition detection [20,21]. Various detection techniques have been explored, including pattern recognition [8], spectral analysis, and deep learning-based feature detection methods [21,22], to extract re-encoded vibration features from acceleration data. For instance, Lyu et al. [23] proposed a method for detecting BFRS using smartphone-based sensors (accelerometer, gyroscope, GPS) and a Bidirectional Long Short-Term Memory (Bi-LSTM) network [24], achieving high-accuracy identification by transforming sensor data into an inertial coordinate system and extracting features via a sliding window. Moreover, detection methods leveraging multi-source smartphone sensors have gained increasing attention. However, crowdsourced data collected via these non-professional devices often contain substantial noise and exhibit inconsistent quality. As a result, processing such data remains a significant challenge.

In recent years, the advancement of Internet of Things (IoT) technology has increasingly facilitated the application of IoT sensors for the collection and acquisition of road surface conditions, including audio microphones [25], cameras [26], ultrasonic sensors [27], GPS, and Inertial Measurement Units (IMUs) [28]. These sensors enable the rapid acquisition of high-quality geospatial data, such as geographic coordinates, road surface vibration signals, and road imagery. Moreover, the integration of edge computing capabilities in mobile computing terminals, such as the Jetson Nano, has significantly enhanced the data processing capacity of IoT systems. Utilizing multi-sensor systems for object detection effectively compensates for the limitations inherent in single-sensor approaches, harnessing the complementary strengths of various sensors. This strategy is particularly critical in simultaneous localization and mapping systems integrating cameras, LiDAR, and IMU sensors. Additionally, leveraging deep learning techniques to construct unified multi-modal fusion frameworks has emerged as a prominent area of current research [29]. In BFRS detection, integrating multiple sensors such as cameras, GPS, and IMUs significantly enhances both detection efficiency and accuracy. Mednis et al. [25] proposed a method for pothole detection in moving vehicles using onboard microphones and a GPS device connected to the vehicle’s onboard computer. As the vehicle moves, pothole-induced sound signals are captured and analyzed, demonstrating the potential of such methods for broader event detection tasks involving diverse sensor modalities. Constructing ground-based sensing networks from these IoT devices thus provides practical references for the acquisition of road surface conditions. Artificial intelligence-based deep learning models offer significant advantages in object detection, particularly in terms of speed and accuracy. These benefits are especially pronounced when dealing with multi-modal data, such as camera video streams and LiDAR measurements. After appropriate sensor calibration, deep learning approaches are well-suited for fusing multi-modal detection results, thereby improving overall detection accuracy [30]. For BFRS detection, integrating camera data with IMU time series information represents a highly efficient strategy for enhancing detection performance. The rapid development of large foundation models in natural language processing, whose core module is the self-attention module in the transformer [31], has shown remarkable zero-shot learning capabilities, inspiring efforts to develop similar models for time series forecasting. Existing time series forecasting models can generally be categorized into three types: (1) local univariate models [32], which are trained independently on each time series to predict its future values; (2) global univariate models [33], which are trained across multiple time series but infer each series independently based on its own historical data and covariates; and (3) global multivariate models [34,35], which take all historical time series data as input to jointly predict future values. Long-term time series forecasting based on Transformer models has become a research hotspot in this field.
Although standard Transformer models can capture long-term dependencies via the self-attention mechanism, their point-wise attention calculation suffers from computational complexity and memory consumption that grow quadratically with the sequence length. Du et al. [36] introduced the Multi-Scale Segment-Correlation mechanism, which directly calculates the correlation between time series segments as attention weights, aiming to preserve intra-segment continuity and capture inter-segment dependencies with lower complexity. More importantly, the researchers specifically designed the Predictive Multi-Scale Segment-Correlation paradigm for the decoding process, which explicitly utilizes historical information to guide the aggregation of future information, thereby more directly serving the forecasting task. Ni et al. [37] proposed an end-to-end framework that utilizes a novel self-supervised contrastive learning mechanism to learn interpretable basis vectors and leverages a cross-attention mechanism to achieve flexible, instance-adaptive associations between time series and basis vectors, thereby improving prediction accuracy and model interpretability. Liu et al. [38] proposed the AutoTimes model, which systematically repurposes large language models (LLMs) as autoregressive time series forecasters by freezing the core LLM parameters, introducing a minimal number of trainable parameters for modality alignment (time series segment embedding and projection), and employing a novel method that uses LLM-embedded textual timestamps for temporal/positional encoding. Its lightweight adaptation strategy provides a new paradigm for the construction of time series foundation models. These recent models have also begun exploring semi-supervised and transfer learning approaches. Nevertheless, there is still a lack of general-purpose foundation models capable of efficient zero-shot prediction across diverse time series datasets. Recent progress has been made with the time series foundation model (TimesFM) [39], which was pretrained on a large-scale time series corpus constructed from both real-world and synthetic data. TimesFM utilizes a decoder-only attention architecture with input chunking for pre-training. Despite its relatively small model size and limited pretraining data, TimesFM achieves near state-of-the-art performance in zero-shot forecasting on multiple unseen datasets. These approaches leverage the ubiquity and portability of mobile devices equipped with sensors such as accelerometers, offering promising solutions for scalable and efficient road information detection. However, while deep learning models are highly effective for multi-modal data detection and result fusion, a key challenge lies in aligning multi-modal experimental results, particularly under dynamic environmental and motion changes, where fixed system parameters often prove inadequate [40]. In addition, the availability of large-scale, high-quality annotated time series datasets, particularly those tailored for acceleration-based data, remains limited.

To address the challenges posed by the sophisticated technology of professional equipment and the data quality uncertainty associated with crowdsourced devices, this study integrates multiple IoT sensors, designs the mobile sensing framework (MSF), and implements the BFRS computation system. Based on deep learning models for time-series analysis, a pre-trained and fine-tuned TimesFM model (FTTM) tailored for BFRS detection is constructed and combined with a YOLO model. Additionally, corresponding BFRS images are generated and integrated by leveraging the consistent spatiotemporal mapping between the camera and IMU sensors. Unlike traditional approaches that rely on fixed system parameters and are thus limited in adapting to dynamic environments, this method emphasizes dynamic prediction and fusion of detection results from both IMU and camera sensors, enhancing adaptability and robustness under varying conditions.

The main contributions are as follows: (1) Hardware integration for data collection within the MSF. The hardware device for road surface condition acquisition was developed by integrating multiple components, including a Jetson Nano, GPS, IMU, camera, and Wi-Fi module. This integrated device is deployed on vehicles to continuously record vehicle movement data and the corresponding BFRS information; (2) Software design for BFRS computation within the MSF. Utilizing the Message Queuing Telemetry Transport (MQTT) protocol, multi-modal sensor data are acquired and transmitted to the server through stream data encoding. The BFRS are then computed and published on the server, with the results made accessible via a set of RESTful application programming interfaces (APIs); (3) Detection and integration of BFRS via the MSF based on the FTTM with related images. The BFRS detection method is constructed using the FTTM. The detection results are subsequently fused with related images captured and identified by a specially trained YOLO v5 model, using a multi-result fusion strategy.

The remainder of this article is structured as follows. Section 2 discusses the hardware integration of the MSF for data collection and the deployment of the sensors on the vehicle. Section 3 illustrates the software design of the MSF for BFRS computation and the implementation of the system. Section 4 provides a detailed description of BFRS detection and integration by the MSF based on the proposed FTTM, where the BFRS detection results are further integrated with the images identified by the specially trained YOLO model. Experiments and results are discussed in Section 5, followed by the conclusion and future work in Section 6.

2 Hardware integration of the MSF for data collection

The MSF is designed based on multi-modal sensors, as illustrated in Figure 1. The hardware comprises a suite of multi-modal sensors, including GPS, IMU, camera, and WIFI modules, together with a Jetson Nano processing unit. This integration enables the acquisition of comprehensive sensor network terminal information, effectively enhancing the system’s capacity for robust road surface data collection and processing.

Figure 1: Multiple IoT sensors for the hardware integration. (a) GPS, (b) IMU, (c) camera, (d) WIFI, and (e) Jetson Nano.

2.1 Multi-modal sensors for BFRS data collection

2.1.1 GPS

The GPS sensor acquires positional information, enabling precise localization of detected BFRS, as depicted in Figure 1(a). Operating within the MSF, the GPS sensor receives signals from a minimum of four GPS satellites. Satellite distance is calculated from signal travel time, and coordinates (latitude/longitude) are derived via trilateration.

The sensor consists of a GPS sensor board and an external antenna, mounted on the vehicle. The antenna transmits signals to the board, which then computes location data and transmits real-time recordings to the Jetson Nano processing unit via universal serial bus (USB). Upon MSF system initialization, the GPS sampling rate is set to 1 Hz, with collected data encoded in the format {time, latitude, longitude}.
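To make the encoding concrete, the following is a minimal sketch of the mobile-end GPS reader, assuming an NMEA-emitting board on /dev/ttyUSB0 and the third-party pyserial and pynmea2 packages; the port name and record layout are illustrative rather than the system’s actual implementation.

```python
# Minimal sketch: reading GPS fixes over USB serial and emitting
# {time, latitude, longitude} records at roughly 1 Hz.
import serial   # pyserial
import pynmea2

def read_gps_records(port="/dev/ttyUSB0", baud=9600):
    with serial.Serial(port, baud, timeout=1.0) as ser:
        while True:
            line = ser.readline().decode("ascii", errors="ignore").strip()
            # GGA sentences carry the position fix (GNGGA on multi-constellation boards)
            if not (line.startswith("$GPGGA") or line.startswith("$GNGGA")):
                continue
            msg = pynmea2.parse(line)
            yield {"time": str(msg.timestamp),
                   "latitude": msg.latitude,
                   "longitude": msg.longitude}
```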

2.1.2 IMU

The IMU sensor, consisting of an accelerometer, gyroscope, and electronic compass, collects BFRS by recording vibrations detected by the accelerometer, as shown in Figure 1(b). However, accurate BFRS detection cannot rely solely on total acceleration recordings due to horizontal accelerations and inherent noise. To mitigate this, the IMU sensor’s 3D orientation data are used to transform sensor readings into a reference frame whose z-axis is perpendicular to the ground. This isolates the acceleration along the z-axis, which reflects road surface irregularities. Within the MSF, the IMU sensor connects to the Jetson Nano via a USB-to-TTL converter, enabling a modular and easily assembled system. The accelerometer, with a sampling rate of 0.2–200 Hz and a measurement range of −16 g to 16 g, primarily captures bumps on the road surface. During data collection, the MSF sets the accelerometer sampling rate to 100 Hz, with collected data encoded as {time, 3D-axis acceleration, 4-dimensional quaternion}.

2.1.3 Camera

The camera sensor in the MSF captures 2D images of the road surface, as shown in Figure 1(c). This sensor is designed to be modular and connects to the system via a USB connector, allowing for flexible assembly and customization. Two distinct BFRS collection modes are supported: (1) Online mode: road surface images are captured and processed in real time by the Jetson Nano for immediate BFRS detection. (2) Offline mode: video of the road surface is recorded for later analysis and BFRS detection, either on the Jetson Nano or on the data server. To capture a wide field of view of the road surface, the camera sensor is mounted on the vehicle’s rearview mirror, enabling the collection of forward-facing imagery. The camera outputs standard-definition video with a resolution of 480p and a field of view of 120°. When operating in online mode, the MSF sets the camera sampling rate to 20 frames per second (FPS), and the detection results are encoded as {time, BFRS type}.

2.1.4 Jetson Nano and WIFI

The MSF leverages a Jetson Nano as an edge computing device for efficient real-time data acquisition and processing from multi-modal sensors. The platform integrates various sensors, including GPS, IMU, and a camera, connected via USB interfaces, with data subsequently packaged and transmitted to a central server. The Jetson Nano’s powerful CPU, GPU, and 4 GB of RAM provide robust mobile edge computing, crucial for handling bandwidth limitations of IoT networks and efficiently processing video data from the camera. The BFRS detection model embedded on the Jetson Nano enables real-time detection with results transmitted to the database server via the MQTT protocol. A dedicated WIFI module, integrated to address the Jetson Nano’s lack of a built-in wireless interface, establishes a 4G connection for transferring collected multi-modal sensor data, as illustrated in Figure 1(d). Figure 1(e) depicts the Jetson Nano.

2.2 Deployment of the hardware device

To facilitate road surface data collection using the MSF, multi-modal sensors were mounted on a vehicle as illustrated in Figure 2. The GPS antenna, positioned on the vehicle’s roof for optimal satellite signal reception, was secured using its built-in magnet. This antenna connects to the GPS sensor board housed inside the vehicle, enabling GPS data transmission to the Jetson Nano via a USB connector, as shown in Figure 2(a) and (b). The IMU sensor is positioned inside the vehicle and connected to the Jetson Nano using a USB to TTL converter, establishing a USB connection. The USB camera, mounted on the vehicle’s rearview mirror, is oriented to capture imagery in the direction of travel, as shown in Figure 2(c). Furthermore, the WIFI sensor is integrated within the Jetson Nano, providing network connectivity for the platform and associated sensors, as depicted in Figure 2(d). To ensure optimal performance of the Jetson Nano, a power inverter was employed to convert the vehicle’s 12 V power supply to 220 V, providing sufficient power for the mobile data collection platform. With sampling rates set at 1 Hz for the GPS, 100 Hz for the IMU, and 20 FPS for the camera, multi-modal datasets were collected using the mobile platform.

Figure 2: The deployment of multi-modal sensors in the MSF. (a) GPS antenna mounted on the vehicle, (b) GPS and IMU, (c) camera mounted on the rearview mirror, and (d) Jetson Nano assembled with WIFI.

3 Software design of the MSF for BFRS computation

Considering the computational limitations of mobile devices, network bandwidth constraints, and the complexity of BFRS detection, video-based detection and sensor-based detection are stored separately and processed on the server for computation and fusion. To handle the multi-modal data and operate on the dataset during BFRS computation, the detailed data flow is designed as follows: (1) collection of multi-modal sensor data on the mobile device, which integrates the multi-modal sensors, encompassing the IMU, accelerometer, GPS, camera, and WIFI module, and supports collection from multiple devices; (2) data processing for BFRS detection on the data server, which receives the multi-sensor stream data from the distributed mobile devices via the MQTT protocol, fuses the multi-modal dataset, detects BFRS, and provides RESTful APIs for web interoperation; (3) map interaction and visualization via the web browser, which sets the detection parameters on the data server, manages the collected multi-modal dataset from the mobile device via the HTTP protocol, and provides layer control and thematic maps in the web browser. The overall software design is illustrated in Figure 3.

Figure 3: The overall software design of the MSF.

3.1 Multi-modal sensor data processing

Upon initialization of the MSF system, connections with the data server are established, and real-time data streams are generated for each sensor, as illustrated in Figure 4. The system utilizes the MQTT protocol, which operates on a publish/subscribe messaging pattern, enabling devices to publish messages to a central broker while other devices subscribe to specific topics to receive these messages. This architecture facilitates efficient and reliable data transmission within the MSF.

Figure 4: Data collection in the MSF. (a) GPS recordings, (b) IMU recordings, (c) camera recordings, and (d) MQTT connections.

3.1.1 Data collection and transmission

The central broker and subscriber reside on the data server within the MSF, acting as a central hub for receiving messages published by the deployed IoT sensors, including the GPS, IMU, and camera. This decoupled design effectively separates data acquisition from the data server, promoting flexibility within the inherently complex communication demands of an IoT system. Consequently, the MSF can readily support a variety of sensor combinations and facilitates the implementation of scalable, robust ground-based sensing networks. Recordings from the GPS, IMU, and camera sensors are illustrated in Figure 4(a)–(c), respectively, demonstrating the data flow. As depicted, distinct MQTT messages are generated for each sensor, ensuring data clarity and traceability. To guarantee continuous data reception, a critical requirement for real-time analysis, connections are either actively maintained or periodically re-subscribed to secure the required sensor data streams. Figure 4(d) presents the dedicated webpage responsible for managing these MQTT connections and visualizing the incoming multi-modal sensor data. Within this interface, subscribed MQTT messages from the various sensors are received and dynamically displayed in the web browser, offering a real-time visualization of the incoming data streams. Upon arrival, these datasets are fused and transformed into a unified time series format, enabling accurate and comprehensive representation of the road surface condition. Following the sensor and data transmission specifications outlined in the hardware design of Section 2, data from the GPS and IMU sensors are transmitted to the data server as time-series stream data. Camera data are handled differently depending on the operational mode: in offline mode, they are sent as video streams to the data server, while in online mode, they are processed on the Jetson Nano, and only the resulting BFRS detections are transmitted. Given the bandwidth and resource constraints inherent in IoT scenarios, efficient data transmission is crucial. The MQTT protocol, a lightweight messaging protocol specifically designed for reliable and secure communication in such environments, is employed for this purpose. Multi-modal sensor data are packaged into MQTT packets to establish a real-time connection with the data server. Data are transmitted from the mobile end and subsequently stored in the database. The structure of the packaged dataset is represented in equation (1).

(1) $\text{Multi-modal sensor data} = \mathrm{MQTT}(\mathrm{GPS}, \mathrm{IMU}, \mathrm{camera})$.
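As a concrete illustration of equation (1), the following minimal sketch packages per-sensor records into MQTT messages with the paho-mqtt client; the broker address, topic names, and payload fields are assumptions, not the system’s actual configuration.

```python
# Minimal sketch: publishing multi-modal sensor records over MQTT from the
# mobile end. Uses the paho-mqtt 1.x constructor; v2 additionally takes a
# CallbackAPIVersion argument.
import json
import time
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("data-server.example.org", 1883)  # assumed broker address
client.loop_start()

def publish_record(sensor: str, record: dict):
    # One topic per sensor keeps the streams separable on the server side.
    payload = json.dumps({"ts": time.time(), "sensor": sensor, "data": record})
    client.publish(f"msf/{sensor}", payload, qos=1)

publish_record("gps", {"latitude": 32.05, "longitude": 118.78})
publish_record("imu", {"acc": [0.02, -0.01, 9.81], "quat": [1, 0, 0, 0]})
```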

3.1.2 Time alignment and spatial transformation

The data processing integrates data from the GPS and IMU sensors. Specifically, location and acceleration data are aligned with orientation recordings based on their respective timestamps. This alignment utilizes linear interpolation, assuming that minimal changes occur between consecutive recordings due to the high sampling rate (e.g., 100 Hz corresponds to a 0.01 s interval between recordings, during which movement is assumed to be relatively constant). With the time-aligned acceleration and orientation dataset, each acceleration reading is associated with a corresponding orientation at the same timestamp. Spatial transformation of the acceleration data is then performed based on the orientation, which is represented as a quaternion. This transformation ensures that acceleration readings are correctly interpreted within a consistent reference frame, facilitating accurate BFRS detection. For example, given an acceleration recording $\mathbf{Acc}(acc_x, acc_y, acc_z)$ and the related quaternion $q(w, x, y, z)$, the direction cosine matrix $R$ is generated from $q(w, x, y, z)$, and the spatial transformation is implemented based on equation (2).

(2) $\mathbf{Acc}' = \begin{bmatrix} 1-2y^2-2z^2 & 2xy-2zw & 2xz+2yw \\ 2xy+2zw & 1-2x^2-2z^2 & 2yz-2xw \\ 2xz-2yw & 2yz+2xw & 1-2x^2-2y^2 \end{bmatrix} \cdot \mathbf{Acc}.$
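The following minimal sketch illustrates the alignment and transformation described above in NumPy; the array shapes and helper names are assumptions, while the matrix follows equation (2).

```python
# Minimal sketch of Section 3.1.2: time-align acceleration with orientation
# by linear interpolation, then rotate accelerations with the direction
# cosine matrix of equation (2).
import numpy as np

def quat_to_dcm(w, x, y, z):
    # Direction cosine matrix from a unit quaternion, as in equation (2).
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*z*w,     2*x*z + 2*y*w],
        [2*x*y + 2*z*w,     1 - 2*x*x - 2*z*z, 2*y*z - 2*x*w],
        [2*x*z - 2*y*w,     2*y*z + 2*x*w,     1 - 2*x*x - 2*y*y],
    ])

def align_and_transform(t_acc, acc, t_quat, quat):
    """t_acc: (N,) timestamps; acc: (N, 3); t_quat: (M,); quat: (M, 4) as w,x,y,z."""
    # Interpolate each quaternion component to the accelerometer timestamps
    # (valid because motion between 100 Hz samples is nearly constant).
    q = np.stack([np.interp(t_acc, t_quat, quat[:, k]) for k in range(4)], axis=1)
    q /= np.linalg.norm(q, axis=1, keepdims=True)  # re-normalize after interpolation
    out = np.empty_like(acc)
    for i, (w, x, y, z) in enumerate(q):
        out[i] = quat_to_dcm(w, x, y, z) @ acc[i]  # z-axis now ground-normal
    return out
```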

3.1.3 BFRS detection based on camera

Utilizing the collected multi-modal sensor data, BFRS detection is initially performed on the video stream captured by the camera sensor. As described in Section 2, the video is recorded at 20 FPS. Each frame of the video is extracted as a separate image and processed using the YOLO model, which offers a balance of real-time detection speed and high accuracy. The YOLO model is trained on a dataset of BFRS images with corresponding labels, enabling it to detect BFRS in real-time during operation. The detection results are recorded as BFRS type and associated timestamp. These results are then mapped onto the road network using the corresponding GPS coordinates recorded simultaneously with the video frames. This spatial mapping provides a comprehensive representation of BFRS locations and their associated attributes.
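A minimal sketch of this per-frame loop is given below, assuming a custom-trained YOLOv5 checkpoint (here called bfrs.pt) loaded through torch.hub and an offline video file; the file names and the 0.7 threshold (see Section 5.2.2) are illustrative.

```python
# Minimal sketch: per-frame BFRS detection on recorded video with a
# custom-trained YOLOv5 model, emitting {time, BFRS type} records.
import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="bfrs.pt")
model.conf = 0.7  # confidence threshold

cap = cv2.VideoCapture("road.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 20.0
frame_idx, detections = 0, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    for *_, conf, cls in results.xyxy[0].tolist():
        detections.append({"time": frame_idx / fps,
                           "type": model.names[int(cls)],
                           "conf": conf})
    frame_idx += 1
cap.release()
```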

3.1.4 BFRS detection based on IMU

In addition to image-based detection, BFRS are also detected using acceleration data recorded by the IMU sensor. Through the spatial transformation process, acceleration readings are transformed into a reference frame where the z-axis acceleration is perpendicular to the road surface. This allows for the clear observation and detection of BFRS features from the time-series acceleration data. The 3-axis acceleration data are then fed into the FTTM, which is specifically designed to handle long sequential data while mitigating the vanishing gradient problem common in traditional recurrent neural networks. The FTTM effectively captures temporal patterns and dependencies within the acceleration data, enabling accurate BFRS detection. The detected results are labeled with corresponding timestamps, providing precise temporal information for each detected BFRS.

3.1.5 Road surface condition integration

The initial BFRS detections are represented as discrete points with associated labels. To effectively represent and locate specific road surface condition within the road network, these discrete detections need to be integrated into meaningful BFRS objects. This integration is achieved using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, which groups data points based on density connectivity. DBSCAN does not require a predetermined number of clusters. Instead, it utilizes a distance parameter (d) to determine the density of BFRS detections and group them accordingly. This approach effectively handles clusters of varying shapes and sizes, providing a robust method for integrating BFRS detections from multiple data sources and modalities. Equation (3) outlines the complete BFRS detection and integration process.

(3) $\text{BFRS} = \mathrm{DBSCAN}(\{\mathrm{YOLO}(\text{camera}), \mathrm{FTTM}(\text{IMU}), \mathrm{GPS}\}, d).$
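As an illustration of equation (3), the following sketch clusters point-wise detections with scikit-learn’s DBSCAN; interpreting the distance parameter d in meters under a haversine metric is an assumption about the implementation.

```python
# Minimal sketch: grouping discrete BFRS detections into BFRS objects with
# DBSCAN over geographic coordinates.
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0

def integrate_bfrs(points_deg, d_meters=5.0):
    """points_deg: (N, 2) array of [lat, lon] pooled from YOLO, FTTM, and GPS mapping."""
    X = np.radians(points_deg)                   # haversine metric expects radians
    labels = DBSCAN(eps=d_meters / EARTH_RADIUS_M,
                    min_samples=2, metric="haversine").fit_predict(X)
    return labels  # -1 marks noise; equal labels form one BFRS object
```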

3.2 Web processing via RESTful APIs

To facilitate Web Processing Service (WPS) functionality for BFRS detection, a web system was developed using the Python Django framework. Collected multi-modal sensor data are stored on the data server, and each BFRS detection method can be invoked and applied to the dataset via RESTful APIs accessed through the web browser. This architecture allows for flexible and on-demand processing of road surface data. Furthermore, detection results are also stored and managed within the system. Different data requests from the web browser are handled using the Model-View-Controller architectural pattern provided by Django, ensuring efficient and organized data access and manipulation. This integrated system provides a robust platform for web-based BFRS detection and analysis, as depicted in Figure 5.

Figure 5: Interaction of BFRS detection and representation. (a) Parameter setting, (b) WPS invoke, (c) detection result by camera, and (d) detection result by IMU.

3.2.1 Web configuration

In the MSF, the road surface condition is represented as BFRS, and the detection process is executed as a WPS on the data server. However, parameters controlling the various detection methods are configured interactively through the web browser interface. This interaction is implemented using RESTful APIs provided by the Django framework, allowing for seamless communication between the web browser and the data server. Specifically, users can adjust parameters related to the different BFRS detection methods directly within the web browser. This flexible approach allows for customized analysis and exploration of road surface data. These parameters, including $\text{YOLO}_{para}$, $\text{FTTM}_{para}$, and $\text{DBSCAN}_{d}$, are serialized as JSON packages and transferred to the data server. In addition, access to the detected BFRS is affected by the filtering conditions $\text{BFRS}_{filter}$. The interaction of parameter settings in the web browser and their impact on the WPS are denoted in equation (4).

(4) $\text{WebSettings}: \mathrm{JSON}(\{\text{YOLO}_{para}, \text{FTTM}_{para}, \text{DBSCAN}_{d}, \text{BFRS}_{filter}\}) \rightarrow \mathrm{WPS}.$
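A minimal sketch of such a parameter-setting endpoint is shown below, written against the Django framework used by the system; the handler name, parameter store, and field names are illustrative assumptions rather than the actual API.

```python
# Minimal sketch of the parameter-setting endpoint in equation (4):
# GET returns the current settings, POST merges new JSON settings.
import json
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

PARAMS = {"YOLO_para": {}, "FTTM_para": {}, "DBSCAN_d": 5.0, "BFRS_filter": {}}

@csrf_exempt
def web_settings(request):
    if request.method == "POST":                 # browser pushes new settings
        PARAMS.update(json.loads(request.body))
        return JsonResponse({"status": "ok", "params": PARAMS})
    return JsonResponse(PARAMS)                  # GET returns current settings
```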

3.2.2 Data management

As detailed in the hardware design of Section 2, multi-modal sensors are integrated into the Jetson Nano mobile platform, allowing for flexible sensor configurations such as GPS and camera or GPS and IMU combinations. Multi-modal sensor data are collected from these mobile platforms and transmitted to the data server via the MQTT protocol. Besides, the MSF supports data acquisition from multiple mobile platforms, enabling the integration of diverse data sources for a more comprehensive understanding of road surface conditions. The web browser interface provides functionalities for managing these multi-source datasets, ensuring efficient organization and accessibility. This includes tools for data visualization, exploration, and download.

Visualization of BFRS detection results is facilitated by Leaflet.js within the web browser interface. To achieve this, the first step involves accessing road surface conditions collected at various times and locations by different sensors and mobile platforms. Leveraging RESTful APIs from the data server, these data are organized and managed as distinct layers within the web browser. Each layer represents a unique dataset reflecting specific aspects of road conditions, such as potholes or speed bumps. Layer controls enable users to select and visualize data based on various criteria, including base maps, sensor sources, timestamps, and geographic locations. This interactive visualization allows for comprehensive exploration and analysis of road surface data. BFRS are visualized as thematic maps within the web browser using Leaflet.js. These maps can be customized based on data source, timestamp, location, and other relevant parameters. Recorded trajectories and detected BFRS are represented as points along the road network. The BFRS detection process functions as a WPS accessible via various web requests, including those for configuring detection parameters and retrieving data, as illustrated in Figure 5(a). Parameters related to YOLO or FTTM training, such as execution environment, initial learning rate, maximum epochs, and mini-batch size, can be accessed and modified using GET and POST requests. This facilitates interactive parameter configuration through RESTful APIs provided to the web browser. Modified settings are then transmitted to the data server for execution. Based on these settings, the BFRS detection service is invoked via its corresponding API, as illustrated in Figure 5(b). For instance, the detection service can be initiated using parameters such as model, sensor type, and dataset, specified within the URL. This invocation of the WPS utilizes RESTful APIs. The detections are then processed on the data server, represented as BFRS, and visualized through the web interface. Figure 5(c) illustrates BFRS detection results derived from images captured by the camera sensor. To validate the effectiveness of this detection method, a road segment exhibiting BFRS features was analyzed. The resulting detection, indicated by the highlighted blue points in Figure 5(d), demonstrates the capability of the FTTM to accurately identify BFRS based on acceleration data.

4 BFRS detection and integration by the MSF

The aim of this article is to obtain BFRS and the corresponding image information through IoT sensors. Hence, the focus is on BFRS detection using the collected sensor data. For BFRS detection based on the video sensor, the YOLO model is specially trained and deployed on the mobile device. The two kinds of BFRS results are then fused using the consistent spatial-temporal mapping relationship between the two modalities.

4.1 BFRS detection based on the FTTM

4.1.1 TimesFM and BFRS detection

TimesFM is a pretrained time-series foundation model developed by Google Research for time-series forecasting. It leverages large-scale pretraining to enhance its generalization capabilities. At the core of the TimesFM model lies the self-attention mechanism (self-attention and feed-forward network), as depicted in the middle part of Figure 6. This mechanism enables the model to capture long-range dependencies within time series data and allows it to attend to information at any position in the sequence through Position Encoding (PE), thereby effectively modeling long-term temporal dependencies. Despite its relatively small parameter size, TimesFM achieves strong forecasting performance. It adopts an autoregressive approach during both training and inference, predicting the next patch based on previously observed patches. The model first segments the input time series into small patches. Each patch is passed through a residual block, transformed into a vector of the model’s dimensionality, and enriched with PE before being fed into an encoder composed of stacked transformer layers.

Figure 6: The FTTM for BFRS detection.

To detect BFRS, a sliding window is used to encode the acceleration data, thereby constructing multiple patches that serve as the input to TimesFM, as depicted in the left part of Figure 6. Furthermore, a Multilayer Perceptron (MLP) specifically designed for BFRS detection was constructed, as illustrated in the right part of Figure 6. The MLP consists of an input layer, a hidden layer, and an output layer (utilizing Softmax activation). Its input is the sequence of $n$ time points output by the TimesFM model, where each time point corresponds to the feature representation of the respective time point in the original input sequence. The input consists of sub-sequences extracted from the tri-axial acceleration data using a sliding window of length $n$. For example, for each time point $t_{\text{Acc}}$, the model receives the acceleration data within the sliding window $(t_{\text{Acc}-(n-1)}, t_{\text{Acc}-(n-2)}, \ldots, t_{\text{Acc}})$, encompassing the acceleration values of the $m$ axes at each time point. Consequently, the model’s input dimensionality is $n \times m$. For every input sequence, the TimesFM model generates an output sequence of the same length $n$, which can be interpreted as a denoised or feature-extracted representation of the input sequence. The output layer of the MLP contains $q$ neurons, corresponding to the BFRS detection results as depicted in Figure 6. The MLP is trained using standard supervised learning methods, employing the cross-entropy loss function.

Based on the proposed method, BFRS detection proceeds sequentially through the following steps (a minimal sketch follows the list):

  1. Data preprocessing: Segment the acceleration data using the sliding window, which encompasses a fixed length of n consecutive sampling points.

  2. Feature encoding: Feed each segmented window into the FTTM and compute the encoded acceleration features.

  3. Feature representation: Feed the encoded feature sequence into the TimesFM backbone and calculate the represented BFRS features.

  4. BFRS detection: Feed the represented features into the MLP, compute the BFRS, and map the detections to GPS recordings based on the timestamps, which enables the identification and localization of BFRS within the detection results.
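The following minimal sketch traces these four steps, treating the fine-tuned TimesFM encoder as an opaque callable (fttm_encode) since its exact checkpoint interface is not specified here; the window length, class count, and MLP sizes are illustrative.

```python
# Minimal sketch of the four-step detection pipeline around the FTTM.
import numpy as np
import torch
import torch.nn as nn

N, M, Q = 128, 3, 3   # window length, acceleration axes, BFRS classes

def sliding_windows(acc, n=N):
    # Step 1: segment the (T, M) acceleration stream into (T-n+1, n, M) windows.
    return np.stack([acc[i:i + n] for i in range(len(acc) - n + 1)])

class BFRSHead(nn.Module):
    # Steps 3-4: MLP over the TimesFM-represented features, softmax output.
    def __init__(self, n=N, m=M, hidden=256, q=Q):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(n * m, hidden),
                                 nn.ReLU(), nn.Linear(hidden, q))
    def forward(self, feats):                    # feats: (B, n, m)
        return self.net(feats).softmax(dim=-1)

def detect(acc, timestamps, fttm_encode, head):
    wins = torch.from_numpy(sliding_windows(acc)).float()
    with torch.no_grad():
        feats = fttm_encode(wins)                # step 2: FTTM feature encoding
        probs = head(feats)                      # steps 3-4: MLP classification
    labels = probs.argmax(dim=-1).numpy()
    # Map each window's decision to the timestamp of its last sample,
    # which is then joined to the GPS track for localization.
    return list(zip(timestamps[N - 1:], labels))
```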

4.1.2 Fine-tuning for the TimesFM model

In this article, the pre-trained time series foundation model TimesFM is fine-tuned to learn the acceleration patterns of vehicles when passing over BFRS. The objective of fine-tuning is to minimize the discrepancy between the model’s output and the ground-truth labels. The specific processes are as follows:

  1. Data sampling and preprocessing (see the sketch after this list). To prepare the dataset for fine-tuning, timestamps corresponding to events when the vehicle passes over BFRS are manually labeled. Based on these timestamps, segments from the raw acceleration time series are extracted to include the event and a surrounding temporal window, thereby capturing the characteristic acceleration changes associated with each event. For instance, a sliding window of size 128 is applied, consisting of 32 data points before the event, 64 data points centered on the event, and 32 data points after the event. The raw time series data $S = \{s_t\}_{t=1}^{T}$ are standardized using the mean and standard deviation of the training set, as in equation (5)

    (5) $s_t' = \dfrac{s_t - \mu_{\text{train}}}{\sigma_{\text{train}}},$

    where $\mu_{\text{train}}$ and $\sigma_{\text{train}}$ denote the mean and standard deviation of the training data, respectively. To construct model inputs, a sliding window approach is employed. The context window is defined using equation (6)

    (6) $x_{\text{context}}^{(i)} = \{s_t\}_{t=i}^{i+L_c-1},$

    and the corresponding prediction window is denoted by equation (7)

    (7) $x_{\text{future}}^{(i)} = \{s_t\}_{t=i+L_c}^{i+L_c+L_h-1},$

    where $L_c$ is the context length, $L_h$ is the prediction step length, and $i \in [0, T - L_c - L_h]$ is the sliding index. The dataset is subsequently partitioned into training, validation, and test sets in a ratio of 6:2:2, with temporal continuity preserved to maintain the integrity of time-dependent structures.

  2. Model configuring for fine-tuning. The fine-tuning process is initialized by loading the base model architecture derived from the pre-trained TimesFM framework, which is centered around a multi-layer spatial-temporal attention mechanism. The model is defined using equation (8)

    (8) $\text{Model} = \mathrm{TimesFM}(d_{\text{model}} = 512,\ n_{\text{head}} = 8,\ N_{\text{layer}} = 6).$

    Pre-trained parameters are loaded using the TimesFmCheckpoint, after which the lower spatial-temporal encoder layers are frozen. Only the top-level prediction head is fine-tuned to adapt to the specific task.

    The fine-tuning process was configured using a set of hyperparameters $\Theta = \{\eta, \lambda, B, E\}$ selected to balance detection accuracy, training stability, and computational efficiency. The learning rate $\eta$ was determined through a grid search in the range $[1 \times 10^{-5}, 1 \times 10^{-3}]$, with the chosen value providing stable convergence and optimal performance. The batch size $B$ was set to 128, reflecting the IMU sampling frequency and the average duration of BFRS, while the number of training epochs $E$ was fixed at 50, as validation performance plateaued beyond this point and further training increased the risk of overfitting. The model architecture ($d_{\text{model}} = 512$, $n_{\text{head}} = 8$, $N_{\text{layer}} = 6$) followed the base configuration recommended by TimesFM [39], which has demonstrated strong performance in diverse time-series tasks.

  3. Training convergence and model saving. The training objective is defined using the average quantile loss function, which is formulated as equations (9) and (10), respectively

(9) $L_\tau(y, \hat{y}) = \max(\tau \times (y - \hat{y}),\ (\tau - 1) \times (y - \hat{y})),$

(10) $\text{avg}_{\text{qloss}} = \frac{1}{|Q|} \times \sum_{\tau \in Q} L_\tau(y, \hat{y}),$

where $\tau$ denotes an individual quantile such that $0 < \tau < 1$, $y$ represents the ground truth value, $\hat{y}$ is the model prediction for quantile $\tau$, and $Q$ is the set of quantiles used during training, with $|Q|$ indicating the total number of quantiles.

Model parameters are optimized using the Adam optimizer, with the learning rate adaptively adjusted via the ReduceLROnPlateau (Reduce Learning Rate on Plateau) strategy. To ensure training stability, the maximum gradient norm is limited to 1.0. An early stopping criterion is applied, whereby training is halted if the validation loss fails to improve for five consecutive epochs. In Low-Rank Adaptation (LoRA) mode, only the adapter weights are saved during checkpointing to reduce storage overhead.
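The following minimal sketch implements the windowing of equations (5)–(7) and the quantile loss of equations (9) and (10) in NumPy; the context/horizon lengths and the quantile set are illustrative assumptions.

```python
# Minimal sketch of the fine-tuning data pipeline and average quantile loss.
import numpy as np

def make_windows(s, L_c=96, L_h=32):
    # Standardize with training-set statistics (equation (5)), then slide.
    n_train = int(0.6 * len(s))                  # 6:2:2 split, time-ordered
    mu, sigma = s[:n_train].mean(), s[:n_train].std()
    s = (s - mu) / sigma
    ctx, fut = [], []
    for i in range(len(s) - L_c - L_h + 1):
        ctx.append(s[i:i + L_c])                 # context window, equation (6)
        fut.append(s[i + L_c:i + L_c + L_h])     # prediction window, equation (7)
    return np.stack(ctx), np.stack(fut)

def quantile_loss(y, y_hat, tau):
    # Equation (9): pinball loss for a single quantile tau.
    diff = y - y_hat
    return np.maximum(tau * diff, (tau - 1) * diff)

def avg_qloss(y, y_hat, Q=(0.1, 0.5, 0.9)):
    # Equation (10): average over the quantile set Q.
    return np.mean([quantile_loss(y, y_hat, t).mean() for t in Q])
```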

4.2 BFRS integration of multi-modal sources

This article utilizes multi-modal sensors to capture the road surface condition, employing the FTTM to detect BFRS from IMU data and a YOLO detector to identify BFRS from camera images. Consequently, it is necessary to integrate the BFRS detection results from the camera-based YOLO model and the IMU-based sensor system. The YOLO detector, which performs object detection based on camera images, typically has a broader perception range and can identify distant BFRS in advance. In contrast, the IMU detects changes in vehicle dynamics (e.g., vertical acceleration) and tends to recognize BFRS only when the vehicle physically passes over them. This leads to a temporal and spatial misalignment between the two sources, with YOLO detections generally occurring earlier than IMU detections. To enable accurate BFRS identification and localization, a spatial-temporal alignment of the detection results from YOLO and the IMU is required. This article proposes a spatial-temporal fusion strategy to address the asynchronous nature of the two modalities. The proposed method effectively integrates outputs from the onboard IMU and the camera-based YOLO detector to achieve precise and semantically enriched BFRS detection. By leveraging the spatial data from the IMU and the semantic information provided by the YOLO model, the method aligns sensor outputs in both space and time, thereby improving the accuracy and completeness of the detection results. The core assumption of this method is that the YOLO detector consistently anticipates the IMU detection by a systematic temporal offset, which is inversely proportional to the vehicle’s speed at the time of passing over the speed bump. This assumption is based on the following considerations:

  1. Difference in perception range: The YOLO detector relies on camera images and typically has a much longer effective perception range compared to the IMU. As the vehicle approaches BFRS, YOLO can detect the obstacle from a considerable distance, whereas the IMU can only register the event when the vehicle is much closer – usually when the wheels experience a noticeable vertical jolt.

  2. Relationship between speed and lead time: When the vehicle passes over BFRS at high speed, the lead time of YOLO detection is relatively short due to the shorter duration required to cover a given distance. Conversely, at lower speeds, the YOLO detector identifies the BFRS earlier in terms of time, since it takes longer for the vehicle to traverse the same distance.

Based on these considerations, an inverse proportional function model can be constructed to characterize the relationship between the YOLO detector’s lead time and the vehicle speed. This model is then used to calibrate the YOLO detection results, aligning them temporally and spatially with those from the IMU, thereby enhancing the consistency of BFRS detection across modalities. The specific steps are as follows:

  1. Data collection and preprocessing: During vehicle traversal over various BFRS, data are synchronously collected from multiple sensors. The YOLO detector outputs the position of the detected BFRS in the image (pixel coordinates), the detection confidence, and the corresponding timestamp $t_{\text{YOLO}}$. Simultaneously, the IMU and GPS provide the vehicle’s position $(\text{Lat}_{\text{IMU}}, \text{Lon}_{\text{IMU}})$, speed $v$, and the associated timestamp $t_{\text{IMU}}$. The data preprocessing involves applying filters to the velocity data to suppress noise and reduce outliers; anomalies such as timestamp discontinuities, sudden speed fluctuations, and YOLO false positives or missed detections are additionally identified and adjusted. To ensure temporal consistency across all sensor modalities, all timestamps are verified to originate from a unified clock source.

  2. Time difference computation and parameter estimation: For each BFRS in the training set, the approximate time difference $\Delta t$ of the YOLO detection relative to the IMU is calculated.

  1. Per-sample time difference computation. For each training sample $i$, the distance difference $d_i$ between the YOLO detection position and the IMU detection position is calculated. Given that the YOLO detection typically occurs earlier and the physical distance between the sensors is relatively fixed, this distance difference $d_i$ can be approximated as the distance traveled by the vehicle during the corresponding time interval $\Delta t_i$. Given the vehicle speed $v_i$ at the moment of YOLO’s initial detection for sample $i$, this specific lead time $\Delta t_i$ is approximated by equation (11)

    (11) $\Delta t_i = \dfrac{d_i}{v_i}.$

  2. Inverse proportional model construction. Assuming an inverse proportional relationship between the YOLO lead time and vehicle speed, the following model is formulated as equation (12):

    (12) $\Delta t = \dfrac{a}{v},$

    where $a$ is a model parameter representing the average spatial offset between YOLO and IMU detections, i.e., the distance the vehicle travels during the average lead time;

  3. Parameter estimation. Utilizing nonlinear optimization tools, such as the least squares method, and based on the training dataset $\{(v_i, \Delta t_i)\}$ (where $i$ denotes the index of the training sample), the parameter $a$ is estimated by minimizing the sum of squared residuals between the model’s predicted values and the actual values, as denoted by equation (13)

(13) $a^* = \arg\min_a \sum_{i=1}^{N} \left(\Delta t_i - \frac{a}{v_i}\right)^2,$

where $N$ is the number of training samples, and $a^*$ is the estimated optimal parameter value.

In equation (13), the parameter $a^*$ represents the average spatial offset between the location at which the camera first visually detects a BFRS and the location at which the IMU physically records the event. This spatial offset provides the physical basis for the proposed inverse proportional model, i.e., equation (12), which effectively captures the dominant relationship between the vehicle speed $v$ and the lead time $\Delta t$. Although fixed factors such as the spatial configuration of the sensors and differences in vehicle types may introduce minor variations in the value of $a$, these effects are secondary compared to the noise and positioning errors inherent to the sensors used in this study. Consequently, $a^*$ is treated as an aggregated parameter that encapsulates these average effects. Considering the practical accuracy limitations of the hardware system, this modeling strategy helps to avoid overfitting to sensor noise while ensuring robustness and applicability in real-world environments.
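Because the residual in equation (13) is linear in $a$, the least squares estimate has a closed form; the following sketch computes it, where the sample speeds and lead times are invented illustrative values.

```python
# Minimal sketch of equation (13). Setting the derivative of the squared
# residuals to zero gives a closed form, so no iterative solver is needed:
#   a* = sum_i(dt_i / v_i) / sum_i(1 / v_i^2)
import numpy as np

def estimate_offset(v, dt):
    """v: speeds at YOLO detection (m/s); dt: observed lead times from equation (11)."""
    v, dt = np.asarray(v, float), np.asarray(dt, float)
    return np.sum(dt / v) / np.sum(1.0 / v**2)

# Illustrative samples only: lower speeds show longer lead times.
a_star = estimate_offset(v=[2.8, 4.2, 6.9, 8.3], dt=[3.1, 2.0, 1.3, 1.0])
print(f"a* = {a_star:.2f} m")  # average spatial offset between YOLO and IMU
```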

  3. Multi-modal fusion (see the sketch after this list). Compute the time difference of each BFRS detected by YOLO, assign the related GPS location to the detection result, and integrate the detection results.

    1. Time difference prediction. For each BFRS, the YOLO time difference $\Delta t_{\text{pred}}$ is computed using the estimated parameter $a^*$ and the vehicle speed $v$ at the moment of YOLO detection, as denoted by equation (14):

      (14) $\Delta t_{\text{pred}} = \dfrac{a^*}{v}.$

    2. YOLO timestamp adjustment. The YOLO timestamp $t_{\text{YOLO}}$ is adjusted by subtracting the predicted time difference $\Delta t_{\text{pred}}$ to obtain the adjusted YOLO detection time $t_{\text{YOLO}}^{\text{adjusted}}$ based on equation (15)

      (15) $t_{\text{YOLO}}^{\text{adjusted}} = t_{\text{YOLO}} - \Delta t_{\text{pred}}.$

    3. Position alignment. Since the IMU and GPS typically offer higher temporal and spatial precision, the IMU detection result is used as the reference. The adjusted YOLO timestamp $t_{\text{YOLO}}^{\text{adjusted}}$ corresponds to the moment when the YOLO detection is expected to align with the IMU detection. Under ideal conditions, the YOLO detection result at the timestamp $t_{\text{YOLO}}^{\text{adjusted}}$ should spatially match the location $(\text{Lat}_{\text{IMU}}, \text{Lon}_{\text{IMU}})$, achieving spatial-temporal alignment between the two sensors.
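The following minimal sketch chains the three steps above: predicting the lead time (equation (14)), adjusting the YOLO timestamp (equation (15)), and snapping each adjusted detection to the nearest IMU event; the record field names and the temporal gate are assumptions.

```python
# Minimal sketch of the fusion step: semantics from YOLO, position from IMU/GPS.
import numpy as np

def fuse(yolo_dets, imu_events, a_star, max_gap_s=1.0):
    """yolo_dets: [{'t': float, 'v': float, 'type': str}, ...]
    imu_events: [{'t': float, 'lat': float, 'lon': float}, ...]"""
    imu_t = np.array([e["t"] for e in imu_events])
    fused = []
    for det in yolo_dets:
        t_adj = det["t"] - a_star / det["v"]     # equations (14) and (15)
        j = int(np.argmin(np.abs(imu_t - t_adj)))
        if abs(imu_t[j] - t_adj) <= max_gap_s:   # temporal gate before pairing
            fused.append({"type": det["type"],            # semantics from YOLO
                          "lat": imu_events[j]["lat"],     # position from IMU/GPS
                          "lon": imu_events[j]["lon"]})
    return fused
```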

In Figure 7, a BFRS is detected by the IMU but without image information, making it difficult to judge what kind of BFRS has been detected. However, the same BFRS is also detected by YOLO. Based on the BFRS integration process, the detection results are clustered to the same BFRS with specific acceleration and image information.

Figure 7: BFRS integration of multi-modal sources.

5 Experiments and discussion

Following the proposed MSF, multi-modal sensors were deployed on vehicles to capture BFRS during transit. The collected data were subsequently transmitted to the designated data server via the MQTT protocol. To facilitate the accurate detection of BFRS, spatial transformation and time alignment techniques were applied to the processed data. Results derived from these multiple data sources were then integrated to yield a comprehensive representation of the prevailing road surface conditions. This section utilizes the methodology developed in this article to perform data collection and BFRS detection within the experiment area, and subsequently presents a comparative analysis of the experiment results obtained from the different sensors employed.

5.1 Collected multi-modal sensor data and the research area

According to the designed MSF, the system was deployed on a vehicle to collect data in the experiment area, where the sampling frequency of the IMU was 100 Hz, the sampling frequency of the GPS was 10 Hz, the sampling frequency of the camera was 20 FPS, and the vehicle speed ranged from 0 to 30 km/h. The experimental area was divided into two distinct subregions, as illustrated in Figure 8. The green-outlined area was designated as the training region for the FTTM and YOLO models, covering approximately 0.36 km² with a total road length of 2.2 km and containing 86 BFRS (20 speed breakers and 66 potholes). The red-outlined area was designated as the testing region for evaluating BFRS detection performance. This area covered 0.25 km², with 60 BFRS (17 speed breakers and 43 potholes) distributed along a 1.4 km road. During data processing, road samples from the green-outlined region were used as the training set, while those from the red-outlined region were used as the test set. The spatial separation between the training and testing regions ensured objectivity in performance evaluation and enhanced the generalizability of the experimental results. Multiple datasets were then acquired using the setup described in this article, and the trajectories were preprocessed to remove records with obvious anomalies or errors, from which two datasets sampled at different times were selected. Using the timestamp information from the video and IMU data, both were time-aligned with the GPS trajectory to map the position information.

Figure 8: Research area and collected datasets.

In Figure 8, the sampling times for Dataset D1 and Dataset D2 were 16:00 and 18:00, respectively, and the lighting conditions differed accordingly. The trajectory in D1 was 1,431 m long, with 2,112 GPS sampling points, 21,204 IMU sampling points, and 216 s of video, recorded under relatively good lighting conditions; the trajectory in D2 was 1,442 m long, with 2,242 GPS sampling points, 22,486 IMU sampling points, and 229 s of video, recorded under relatively poor lighting conditions. The data sampling results are shown in Table 1.

Table 1

Statistics of the collected datasets

Dataset  Start time  Length (m)  GPS points  IMU points  Video duration
D1  16:00  1,431  2,112  21,204  216 s
D2  18:00  1,442  2,242  22,486  229 s

5.2 Comparison of BFRS detection results under different conditions

According to the method proposed in this article, experiments were conducted on Datasets D1 and D2, comparing the IMU and camera data under different time points and environmental conditions.

5.2.1 Detection result of the IMU

For the IMU data, the sliding window length n was set to 128, including 32 data points before the event, 64 data points centered on the event, and 32 data points after it, and BFRS were detected using the FTTM constructed in this article. The spatial transformation was applied to the raw acceleration recordings to orient the z-axis perpendicular to the ground, after which the data were encoded and fed into the FTTM. The experiment results are shown in Figures 9(a) and 10(a). As depicted in Figure 9(a), in Dataset D1, the IMU correctly detected 36 of the actual features, incorrectly identified 4 instances as features when they were not, and failed to detect 22 of the actual features that were present. In Dataset D2, as shown in Figure 10(a), the results were very similar: 34 features were correctly detected, 5 instances were falsely identified, and 24 actual features were missed.

Figure 9: Detection results for Dataset D1. (a) BFRS detection result of IMU, (b) BFRS detection result of YOLO, (c) BFRS detection results, (d) BFRS1 with related image, (e) BFRS2 with related image, and (f) road crack with related image.

Figure 10: Detection results for Dataset D2. (a) BFRS detection result of IMU, (b) BFRS detection result of YOLO, (c) BFRS detection results, (d) BFRS1 with related image, (e) BFRS2 with related image, and (f) road crack with related image.

It is important to note that the IMU sensor can only record BFRS features when the vehicle physically traverses them. As a result, isolated driving trajectories capture only the events encountered along those specific paths, and not all BFRS present on the road surface can be detected. Nevertheless, in the experiments, the IMU sensor demonstrated consistent performance across the two datasets, with the number of BFRS detected in each dataset being comparable.

5.2.2 Detection result by the camera

For the video data, BFRS detection was conducted using a YOLOv5-based model. The model configuration followed the official default recommendations [7]: the network comprises 212 layers with approximately 7 million parameters, including 9 convolutional layers, 15 skip connections, and 2 up-sampling layers.

The optimal confidence threshold of the YOLOv5 model was investigated together with its effect on the BFRS detection results. Experiments were conducted with thresholds of 0.9, 0.7, and 0.5 on both datasets, and the results are summarized in Table 2. In Dataset D1, collected under good lighting conditions, a threshold of 0.5 produced seven false detections (N_fd = 7). Increasing the threshold to 0.7 reduced false detections to two (N_fd = 2) while maintaining the same number of correct detections (N_cd = 52). A stricter threshold of 0.9 eliminated all false detections (N_fd = 0) but increased missed detections from 6 to 13 (N_md = 13). In Dataset D2, acquired under poor lighting conditions, the threshold had a more pronounced influence. At 0.5, the model generated 18 false detections (N_fd = 18). A threshold of 0.7 reduced false detections to six, with correct detections decreasing only slightly from 15 to 13 (N_cd = 13). By contrast, a threshold of 0.9 was overly restrictive, correctly detecting only three targets while missing nearly all others (N_md = 55). A confidence threshold of 0.7 was therefore selected for subsequent experiments, as it achieved a balanced trade-off between maximizing true detections and minimizing false detections in both datasets; the counting procedure is sketched below.
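The following hedged sketch shows how such a threshold sweep can be tallied. The data layout (each detection paired with the ground-truth BFRS it matches within 10 m, or None) is an assumption for illustration.

```python
# Sketch of the confidence-threshold sweep: count correct (N_cd), false (N_fd),
# and missed (N_md) detections at each threshold. Data layout is assumed.
def sweep_thresholds(detections, ground_truth, thresholds=(0.9, 0.7, 0.5)):
    """detections: list of (confidence, matched_gt_id or None) pairs, where
    matched_gt_id links a detection to a ground-truth BFRS within 10 m.
    ground_truth: set of ground-truth BFRS ids."""
    for t in thresholds:
        kept = [d for d in detections if d[0] >= t]
        hit_ids = {gt for _, gt in kept if gt is not None}
        n_cd = len(hit_ids)                            # distinct BFRS found
        n_fd = sum(1 for _, gt in kept if gt is None)  # unmatched detections
        n_md = len(ground_truth) - n_cd                # BFRS never found
        print(f"threshold={t}: N_cd={n_cd}, N_fd={n_fd}, N_md={n_md}")

# Toy example: four detections against three ground-truth BFRS.
sweep_thresholds([(0.95, 1), (0.85, 2), (0.60, None), (0.55, 2)],
                 ground_truth={1, 2, 3})
```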

Table 2: Detection results by different YOLO thresholds

Dataset | YOLO threshold | N_cd | N_fd | N_md
D1 | 0.9 | 45 | 0 | 13
D1 | 0.7 | 52 | 2 | 6
D1 | 0.5 | 52 | 7 | 6
D2 | 0.9 | 3 | 0 | 55
D2 | 0.7 | 13 | 6 | 45
D2 | 0.5 | 15 | 18 | 43

The final experiment results are shown in Figures 9(b) and 10(b). As depicted in Figure 9(b), in Dataset D1, the camera-based method performed exceptionally well, correctly identifying 52 of the 58 actual features. It made very few errors, with only two falsely identified instances and only six missed features. This demonstrates the potential for very high accuracy and completeness using visual detection under favorable conditions. However, in Dataset D2, as shown in Figure 10(b), the camera’s effectiveness decreased sharply. It correctly detected only 13 features, while failing to detect a large number (45) of the actual features. The number of falsely identified instances remained relatively low at 6.

The direct comparison clearly reveals the strengths and limitations of the vision-based approach. Under good lighting conditions (Dataset D1), the number of BFRS detections on the map (Figure 9(b)) exceeded that obtained using only IMU-based detection (Figure 9(a)), demonstrating the camera's superior capability to capture a broader range of surface features. However, under poor lighting conditions (Dataset D2), the number of detections dropped dramatically, highlighting the high sensitivity of the camera-based method to environmental conditions.

5.3 BFRS integration from multi-modal sources

The aim was to collect BFRS together with related images using the MSF; hence, the BFRS detected in Figures 9(a), (b) and 10(a), (b) need to be integrated across the IMU and camera results. To fuse the detections from YOLO and the IMU, and to address the spatio-temporal misalignment caused by YOLO detecting features ahead of the vehicle, the inverse proportional relationship between the YOLO detection lead time and vehicle speed was first learned from the training data. Specifically, for each BFRS sample in the training set, the distance difference d_i (typically 4–6 m) between the YOLO and IMU detections and the vehicle speed v_i (around 25 km/h) at the initial YOLO detection moment were computed, giving the approximate lead time Δt_i = d_i / v_i. Then, using these (v, Δt) pairs, the inverse proportional model Δt = a / v was fitted via least squares to estimate the parameter a*, which represents the average detection distance difference (its value is expected to lie between 4 and 6 m). In the application phase, for each new detection, the lead time Δt_pred = a* / v is predicted from the speed v at the initial YOLO detection, and this lead time is subtracted from the original YOLO timestamp t_YOLO to obtain the adjusted timestamp t_YOLO_adj. The key integration step uses the DBSCAN clustering algorithm to associate detection results from different sensors: the IMU-detected positions serve as reference points, and the timestamp-adjusted YOLO detections (whose theoretical positions should align with the IMU) are likewise mapped onto spatial coordinates. A minimal fitting sketch follows.
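For the one-parameter model Δt = a / v, least squares has the closed form a* = Σ(Δt_i / v_i) / Σ(1 / v_i²). The following sketch applies this and adjusts a YOLO timestamp; all variable values are illustrative assumptions.

```python
# Hedged sketch of fitting the inverse-proportional lead-time model Δt = a / v
# by least squares, then shifting a YOLO timestamp back by the predicted lead.
import numpy as np

def fit_lead_time(v, dt):
    """v: speeds (m/s) at the initial YOLO detection; dt: observed lead times (s).
    Closed-form least squares for dt ≈ a / v; returns a* in meters."""
    x = 1.0 / np.asarray(v)
    return float(np.sum(np.asarray(dt) * x) / np.sum(x * x))

# Illustrative training pairs: d ≈ 4–6 m ahead, v ≈ 25 km/h ≈ 6.9 m/s.
v = np.array([6.9, 7.1, 6.5, 7.3])        # m/s
d = np.array([4.2, 5.8, 4.9, 5.1])        # m
a_star = fit_lead_time(v, d / v)          # expected to fall between 4 and 6 m

# Application phase: adjust the YOLO timestamp by the predicted lead time.
t_yolo, v_now = 1717400000.0, 6.8         # placeholder timestamp (s), speed (m/s)
t_yolo_adj = t_yolo - a_star / v_now
```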

The DBSCAN neighborhood radius governs this association: too small a value splits a single BFRS into multiple instances, while too large a value merges adjacent, distinct BFRS. Several values were tested, and 8 m provided the best trade-off. The neighborhood radius was therefore set to 8 m and the minimum number of samples (min_samples) to three, and these spatial points (IMU detection points and the theoretical positions corresponding to the adjusted YOLO detections) were clustered.

IMU detection points and YOLO detection points that can be clustered into the same cluster are considered to refer to the same BFRS event, thereby successfully integrating the IMU’s precise position and acceleration information with YOLO’s image semantic information.

Isolated YOLO detection points (or low-confidence IMU signal points) that cannot be clustered with detections from the other sensor are treated as noise or false detections and discarded, so that the retained BFRS are those validated through multi-modal verification. A minimal clustering sketch follows.
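The following sketch uses scikit-learn's DBSCAN with the stated parameters (eps = 8 m, min_samples = 3). It assumes positions have been projected to a local metric frame so that eps can be expressed in meters; the point coordinates are illustrative.

```python
# Minimal sketch of the DBSCAN association step (illustrative data).
import numpy as np
from sklearn.cluster import DBSCAN

# (x, y) positions in meters in a local projected frame; label each point's sensor.
imu_pts = np.array([[0.0, 0.0], [1.2, 0.5], [50.0, 3.0]])
yolo_pts = np.array([[2.5, 1.0], [51.5, 3.5], [300.0, 10.0]])
points = np.vstack([imu_pts, yolo_pts])
source = np.array(["imu"] * len(imu_pts) + ["yolo"] * len(yolo_pts))

labels = DBSCAN(eps=8.0, min_samples=3).fit_predict(points)

for k in set(labels) - {-1}:               # -1 marks DBSCAN noise
    members = source[labels == k]
    if "imu" in members and "yolo" in members:
        print(f"cluster {k}: multi-modal BFRS confirmed ({list(members)})")
# Points labeled -1, or clusters lacking IMU support, are discarded as noise
# or false detections, implementing the filtering described above.
```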

5.4 Analysis and discussion of BFRS experiment results

5.4.1 Classification results

To evaluate the performance of individual sensors, and verify robustness under different environmental conditions, comparisons were made with the proposed MSF. Confusion matrices were generated for three configurations – IMU, camera, and IMU & camera – using D1 and D2, as shown in Figure 11.

Figure 11: The confusion matrix across different sensors. For D1: (a) confusion matrix of IMU sensor, (b) confusion matrix of camera sensor, and (c) confusion matrix of IMU and camera. For D2: (d) confusion matrix of IMU sensor, (e) confusion matrix of camera sensor, and (f) confusion matrix of IMU and camera.

As shown by Figure 11(a)–(c), the camera sensor was able to perform semantic classification and achieved the highest number of detected BFRS in dataset D1. It identified 16 speed breakers and 36 potholes, giving a total of 52 BFRS. In comparison, the IMU also detected 36 BFRS; however, as a binary classifier, it could not distinguish specific BFRS types. The IMU & camera method integrated information from both sensors and produced 46 validated features, including 16 speed breakers and 30 potholes. The pothole count from the fusion method (30) was slightly lower than that from the camera alone (36). This difference did not indicate reduced performance. Rather, it reflected the strict filtering strategy adopted in the fusion process. BFRS detected only by the camera, without confirmation from the IMU, were excluded. As a result, a small amount of Recall was sacrificed in exchange for improved reliability and Precision.

As shown by Figure 11(d)–(f), the camera’s performance dropped significantly due to poor lighting conditions in dataset D2. The number of correctly detected BFRS fell to 13, including 6 speed breakers and 7 potholes, revealing its vulnerability in challenging environments. In contrast, the IMU was almost unaffected by lighting conditions and maintained stable performance, detecting 34 BFRS, which was close to its D1 result of 36. The fusion method, as shown in its confusion matrix, successfully detected and classified 38 BFRS, including 13 speed breakers and 25 potholes. This outcome not only compensated for the severe performance drop of the camera in poor lighting conditions but also provided semantic labels for IMU detections. Moreover, the number of potholes identified through fusion (25) was substantially higher than that obtained from the camera alone (7).

The comparative analysis of the confusion matrices demonstrated that each single-sensor approach had inherent limitations that could not be fully overcome. The proposed MSF addressed these limitations by combining the environmental robustness of the IMU with the semantic recognition capability of the camera. Through this integration, the fusion method achieved higher accuracy in the detection results.

5.4.2 Performance analysis

To quantitatively evaluate the experimental outcomes, an analysis was performed on the results presented in Figures 9 and 10, with the resulting statistics summarized and discussed in Table 3.

Table 3: Statistics of BFRS detection results

Dataset | Sensor | TT | FT | TF | Precision | Recall | Fscore
D1 | IMU | 36 | 4 | 22 | 0.9000 | 0.6207 | 0.7347
D1 | Camera | 52 | 2 | 6 | 0.9630 | 0.8966 | 0.9286
D1 | IMU and camera | 46 | 4 | 12 | 0.9200 | 0.7931 | 0.8519
D2 | IMU | 34 | 5 | 24 | 0.8718 | 0.5862 | 0.7010
D2 | Camera | 13 | 6 | 45 | 0.6842 | 0.2241 | 0.3377
D2 | IMU and camera | 38 | 4 | 20 | 0.9048 | 0.6552 | 0.7600

In the combined views of Figures 9 and 10, the results for datasets D1 and D2 can be compared. For D1, Figure 9(c) presents the final fusion results, showing the detected locations of BFRS: a total of 46 features were correctly identified, 4 BFRS were falsely detected, and 12 actual BFRS were missed. Ground-truth verification of these locations is provided on the map. Figure 9(d)–(f) respectively display the actual images captured by the camera for one pothole, one speed breaker, and one road crack. These examples demonstrate the successful association between sensor data and semantic image information and highlight the system's capability to accurately identify BFRS.

Similarly, for D2, collected under poor lighting conditions, the fusion results are shown in Figure 10(c). The fusion map indicates 38 correctly detected BFRS (N_cd = 38), 4 false detections (N_fd = 4), and 20 missed features (N_md = 20). Compared with the IMU-only approach (Figure 10(a)), the fusion method consistently identified more correct features while reducing missed detections. Although the system's robustness under different visibility conditions was enhanced, environmental influences remained evident: a comparison between Figure 10(d)–(f) and the corresponding images in Figure 9 reveals a noticeable decline in image clarity. Nevertheless, by leveraging the complementary strengths of each sensor, the system was able to compensate for these limitations, enabling stable data integration and producing reliable detection results.

A detected BFRS was considered correct only if its location was within 10 m of a corresponding ground truth BFRS location.

Based on this criterion, the detection results were categorized as follows: True Positives (TT: ground truth BFRS correctly detected), False Positives (FT: instances falsely detected as BFRS), and False Negatives (TF: ground truth BFRS missed by the detection system). Precision, calculated using equation (16), reflects the accuracy of the detections (i.e., the proportion of correctly identified BFRS among all detected instances). Recall, determined by equation (17), indicates the system’s capability to identify all actual BFRS present (i.e., the proportion of ground truth BFRS that were successfully detected). Furthermore, to provide a consolidated performance measure, the Fscore, which represents the harmonic mean of Precision and Recall, was computed according to equation (18). Detailed results are presented in Table 3.

(16) Precision = TT / (TT + FT),

(17) Recall = TT / (TT + TF),

(18) Fscore = (2 × Precision × Recall) / (Precision + Recall).
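To make equations (16)–(18) and the 10 m matching criterion concrete, the following hedged sketch matches detections to ground truth by greedy nearest-neighbor assignment (the matching strategy is an assumption; the article specifies only the 10 m radius) and computes the three metrics.

```python
# Sketch of the evaluation: match detections to ground truth within 10 m,
# then compute Precision, Recall, and Fscore per equations (16)–(18).
import numpy as np

def evaluate(detected, truth, radius=10.0):
    """detected, truth: arrays of (x, y) positions in meters (truth non-empty)."""
    detected, truth = np.asarray(detected), np.asarray(truth)
    used, tt = set(), 0
    for d in detected:
        dists = np.linalg.norm(truth - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] <= radius and j not in used:
            tt += 1            # TT: ground-truth BFRS correctly detected
            used.add(j)
    ft = len(detected) - tt    # FT: instances falsely detected as BFRS
    tf = len(truth) - tt       # TF: ground-truth BFRS missed
    precision = tt / (tt + ft) if tt + ft else 0.0
    recall = tt / (tt + tf) if tt + tf else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
    return precision, recall, fscore

# Toy example: two detections against three ground-truth BFRS.
print(evaluate([[0.0, 1.0], [60.0, 2.0]], [[0.0, 0.0], [55.0, 0.0], [120.0, 0.0]]))
```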

As depicted in Table 3, the camera-based, IMU-based, and fused detection methods were evaluated under varying visibility conditions, with Precision, Recall, and Fscore computed for Datasets D1 and D2. It is important to note that BFRS detection has traditionally relied on two primary modalities: vibration analysis, which provides limited information, and image-based methods, which are susceptible to lighting variations. In Dataset D1, the IMU detected 36 TT BFRS, similar to the 34 TT BFRS in Dataset D2, whereas the camera performed very differently between Dataset D1 (52 TT BFRS) and Dataset D2 (13 TT BFRS). With the integration of IMU and camera, 46 TT BFRS were detected with 4 FT and 12 TF in Dataset D1, which is better than the 38 TT BFRS detected with 4 FT and 20 TF in Dataset D2. The fusion achieved a Precision of 0.9200 and Recall of 0.7931, yielding an Fscore of 0.8519 in Dataset D1, and a Precision of 0.9048 and Recall of 0.6552, yielding an Fscore of 0.7600 in Dataset D2. These results highlight the advantage of leveraging the complementary strengths of IMUs (motion state information, unaffected by lighting) and cameras (rich visual details). The fusion approach therefore mitigates the limitations of individual sensors, providing a more robust solution.

In addition, several BFRS in Dataset D2 were not detected by YOLO, leading to a marked reduction in the Precision, Recall, and Fscore of the camera sensor compared with the results obtained from D1. In both D1 and D2, there were cases where certain BFRS were recorded by neither the camera nor the IMU and thus were not identified. For example, a BFRS exists at the location marked in Figures 9(g) and 10(g). In D1, this feature was successfully detected by YOLO. In D2, however, the vehicle's trajectory may have deviated from the location, preventing the IMU from recording a valid vibration, and due to the poor lighting conditions during data acquisition, YOLO was unable to reliably identify the feature from the images. Consequently, during the fusion of IMU and camera results, this BFRS was omitted from the final output, influenced by the clustering process and its decision criteria. Although the Precision, Recall, and Fscore of the IMU sensor remained relatively stable, the performance metrics of the fused results were also lower in D2 than in D1, particularly in terms of Recall and Fscore.

5.4.3 Computation efficiency

The proposed MSF utilizes a collaborative architecture between the mobile device and the server, and its overall computational efficiency is influenced by several stages. First, on the mobile side, real-time performance is mainly determined by the YOLO model's capacity to process video streams. In the experiments, the video sampling rate was fixed at 20 FPS to ensure real-time operation, and the YOLO model on the device processed frames at rates above 20 FPS, enabling real-time detection and analysis of visual information. Second, IMU data were sampled at 100 Hz, and the FTTM applied a sliding window of 128 samples (1.28 s) to analyze the time series, which introduced a fixed detection delay as a necessary trade-off to maintain accuracy in temporal pattern recognition. Finally, because the system operates in a distributed architecture across the mobile device and the server, the end-to-end time from data acquisition to final fusion is substantially affected by network transmission delays, which are unpredictable and unstable. For this reason, end-to-end processing time was not adopted as a core evaluation metric. Instead, the evaluation focused on detection performance, using Precision, Recall, and Fscore to assess the framework's ability to accurately identify BFRS in real applications.

5.5 Comparison with other methods

The proposed method relies primarily on IMU data, with video data as an auxiliary source, and the two results are fused to improve BFRS detection performance. Accordingly, the experimental comparison focuses on results derived from IMU data. To benchmark performance, two IMU-based baselines were evaluated: the threshold method [19] and the LSTM method [24]. Parameter settings were fixed to ensure reproducibility. For the threshold method, with reference to the mean acceleration change observed around BFRS events, any fluctuation exceeding ±0.7 m/s² was detected as a BFRS event (a minimal sketch is given below). For the LSTM model, consistent with the settings of the proposed framework, the hidden size was 128, the network depth was two stacked layers, and the sliding window length was 128. To ensure a fair comparison, detection results were clustered using identical DBSCAN parameters (neighborhood radius 8 m, minimum samples 3) to form the final BFRS objects. The comparative results are shown in Figure 12.
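The following sketch shows the threshold baseline as configured here: flag an event whenever the z-axis acceleration deviates more than ±0.7 m/s² from its local mean. The moving-average window of 1 s and the synthetic signal are illustrative assumptions.

```python
# Hedged sketch of the threshold baseline [19] as configured in this comparison.
import numpy as np

def threshold_detect(accel_z, fs=100, delta=0.7, win_s=1.0):
    """Return sample indices where |a_z - local mean| exceeds delta (m/s²).
    The local mean is a moving average over win_s seconds (an assumption)."""
    w = int(fs * win_s)
    local_mean = np.convolve(accel_z, np.ones(w) / w, mode="same")
    return np.flatnonzero(np.abs(accel_z - local_mean) > delta)

accel_z = np.random.randn(22486) * 0.2   # quiet road (placeholder data)...
accel_z[5000:5010] += 1.5                # ...with one synthetic bump
events = threshold_detect(accel_z)       # indices clustered around sample 5000
```

As in the main method, the flagged indices would then be mapped to GPS positions and clustered with DBSCAN (eps = 8 m, min_samples = 3) to form BFRS objects.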

Figure 12: BFRS detection results of different methods. For D1: (a) BFRS detection result by Threshold and (b) BFRS detection result by LSTM. For D2: (c) BFRS detection result by Threshold and (d) BFRS detection result by LSTM.

Detection results of the two baseline methods are shown in Figure 12. Figure 12(a) and (b) display the spatial distributions of detections on D1 for the threshold method and the LSTM model, respectively, while Figure 12(c) and (d) show the corresponding results on D2. Both methods detect BFRS to a certain extent, although local discrepancies are observed in several areas. To further assess the proposed method, its outcomes are compared with those in Figures 9 and 10. Precision, Recall, and Fscore were computed for all methods, and the statistics are summarized in Table 4.

Table 4: Statistics of BFRS detection results of different methods

Dataset | Method | TT | FT | TF | Precision | Recall | Fscore
D1 | Threshold | 34 | 16 | 24 | 0.6800 | 0.5862 | 0.6297
D1 | LSTM | 17 | 4 | 41 | 0.8095 | 0.2931 | 0.4304
D2 | Threshold | 32 | 8 | 26 | 0.8000 | 0.5517 | 0.6531
D2 | LSTM | 21 | 6 | 37 | 0.7778 | 0.3621 | 0.4941

In Table 4, for the D1 dataset, 34 BFRS are correctly detected using the threshold method (TT = 34), with 16 false positives (FT = 16) and 24 false negatives (TF = 24); the resulting Precision, Recall, and Fscore are 0.6800, 0.5862, and 0.6297, respectively. The LSTM model yields fewer false positives (FT = 4) but misses 41 BFRS (TF = 41); its Precision, Recall, and Fscore are 0.8095, 0.2931, and 0.4304. Based on Table 3, the proposed method achieves a better balance on D1, with a Precision of 0.9200 and a Recall of 0.7931, leading to an Fscore of 0.8519. For the D2 dataset, the Fscores of the threshold and LSTM methods are 0.6531 and 0.4941. For the proposed method on D2, TT = 38, FT = 4, and TF = 20; Precision = 0.9048 and Recall = 0.6552; the Fscore reaches 0.7600, which is higher than both baselines. Although the LSTM model is trained on the same data, its performance is weaker and in some cases lower than that of the threshold method. A possible reason is overfitting or underfitting caused by the network architecture, whereas the proposed approach benefits from fine-tuning a pretrained model.

Across D1 and D2, and in comparison with the results presented in Figures 9 and 10 and Table 3, the proposed MSF achieves the best balance between Precision and Recall, as well as the highest Fscore, outperforming the baseline methods.

5.6 BFRS detection in urban public area

5.6.1 Comparison with different methods

To evaluate the generalization capability of the framework, an urban public road segment of approximately 0.6 km was selected as dataset D3. This segment differs from the initial datasets; the overall condition is good, with five speed breakers and three potholes distributed locally. The MSF and the baseline methods were applied to this urban environment using the same experimental parameters as above. A comparison of BFRS detection results is depicted in Figure 13.

Figure 13: BFRS detection results and comparison for Dataset D3. (a) BFRS detection result of Threshold, (b) BFRS detection result of LSTM, (c) BFRS detection result of MSF, and (d) BFRS with related image.

As shown in Figure 13, the baseline methods show a marked performance decline in this area. The threshold method produces many false positives, whereas the LSTM method yields many missed detections. Quantitative statistics are summarized in Table 5.

Table 5: Statistics of BFRS detection results and comparison for Dataset D3

Method | TT | FT | TF | Precision | Recall | Fscore
Threshold | 7 | 15 | 1 | 0.3182 | 0.8750 | 0.4667
LSTM | 5 | 5 | 2 | 0.5000 | 0.7143 | 0.5882
MSF | 8 | 3 | 0 | 0.7273 | 1.0000 | 0.8421

The Fscore of the threshold method drops to 0.4667, and the Fscore of the LSTM method is 0.5882. In contrast, the MSF attains an Fscore of 0.8421 on D3, which is close to its performance on D1 (0.8519) and indicates strong robustness. The Recall reaches 1.0000, meaning that all BFRS in this area are detected. This perfect Recall is accompanied by three false positives, which lowers Precision to 0.7273. This trade-off is acceptable and highlights the robustness and generalization ability of the proposed approach.

5.6.2 Capability of different BFRS types

To further assess applicability and generalization, another urban public road segment of approximately 0.25 km was selected as dataset D4. Unlike the initial datasets, this area contains multiple BFRS types, including speed breakers, potholes, and road cracks, specifically two speed breakers, three potholes, and one road crack, along with several minor surface irregularities. Data were collected under good lighting conditions, and the MSF framework was applied using the same parameter settings as in the previous experiments. The detection outcomes are shown in Figure 14.

Figure 14: BFRS detection results for Dataset D4. (a) BFRS detection results, (b) road crack with related image, and (c) BFRS with related image.

The proposed method correctly identified all annotated BFRS on this segment, including the 2 speed breakers, 3 potholes, and 1 road crack. Figure 14(b) and (c) display field images corresponding to a detected road crack and a detected pothole, respectively, illustrating the association between sensor data and visual evidence. These results indicate strong generalization to new urban public areas and demonstrate that the method is not limited to speed breakers and potholes, but can also effectively detect other BFRS types.

From experiments in D1, D2, D3, and D4, it can be concluded that the proposed method maintains stable performance. It also detects diverse BFRS types, including cracks and potholes, demonstrating strong applicability, robustness, and generalizability.

6 Conclusion and future work

Road networks are a critical component of urban transportation systems, and road surface conditions directly impact their efficiency and safety. Addressing the limitations inherent in existing specialized equipment and the data quality uncertainties associated with crowdsourced approaches, this article proposed and developed the MSF to detect and integrate BFRS using multi-modal sensors.

The system involves both hardware and software components. On the hardware side, a platform was designed and integrated based on Jetson Nano, incorporating sensors such as GPS, IMU, camera, and Wi-Fi. This platform was deployed on vehicles to capture real-time data on vehicle movement and corresponding road surface conditions. The integrated multi-sensor system enables the collection of comprehensive road condition data, and its scalability allows for deployment across multiple devices simultaneously to enrich the data stream.

Complementing the hardware, the MSF software was developed for BFRS computation. Utilizing the MQTT protocol, multi-modal sensor data are acquired and transmitted to the server. Server-side processes handle data storage, computation, multi-modal data fusion, and BFRS detection. A web-based interface, built upon RESTful APIs, facilitates data requests, parameter configuration, access to collected datasets, and visualization of experiment results.

Methodologically, BFRS detection and integration are achieved by combining the FTTM with corresponding image data. Specifically, the YOLO v5 model is employed for processing video data on the mobile end, while the FTTM is used for detecting BFRS from accelerometer data. The results from both sources are subsequently fused by modeling the correlation between accelerometer-derived features and those identified in the video stream. Real-world experiments using the MSF validate the effectiveness of the proposed method in terms of both data collection and processing efficiency, as well as its usability for monitoring and analyzing BFRS. The proposed approach offers a practical and scalable solution for sustainable urban transportation management. The MSF provides a comprehensive, ground-based sensing platform for multi-source road surface condition acquisition, significantly enhancing the efficiency of road data collection. This framework lays a robust methodological foundation for future research and applications in GIS technology.

Future research will focus on improving the stability and robustness of the MSF and reducing the physical size of the device. Although robustness under varied lighting has been demonstrated, performance in adverse weather remains to be validated, for example, in heavy rain or when standing water obscures visual cues. To enable this evaluation, the multi-modal dataset should be expanded to cover diverse weather conditions. Another direction is a unified multi-modal BFRS detection and integration scheme that computes BFRS jointly from 1D sensor streams and 2D video streams: IMU, camera, and GPS data will be input to a single model, with dynamic feature alignment adapted to the characteristics of each modality.

  1. Funding information: This research was funded by the National Natural Science Foundation of China, Grant number 42101466.

  2. Author contributions: Haiyang Lyu performed the theory analysis and methodology and contributed to drafting the manuscript. Yu Huang collected and analyzed the data, design, and coding. Xiaoyang Qian conducted the data visualization and related coding. Donglai Jiao performed the literature reviews, provided the background knowledge, and improved the writing.

  3. Conflict of interest: The authors state no conflict of interest.

  4. Data availability statement: Publicly available datasets were analyzed in this research, and they can be found at: https://doi.org/10.6084/m9.figshare.28877003. The street map is available from OpenStreetMap accessed on 30 May 2025.

References

[1] Li X, Goldberg DW. Toward a mobile crowdsensing system for road surface assessment. Comput Environ Urban Syst. 2018;69:51–62. doi:10.1016/j.compenvurbsys.2017.12.005.

[2] Ghahramani M, Zhou M, Wang G. Urban sensing based on mobile phone data: Approaches, applications, and challenges. IEEE/CAA J Autom Sin. 2020;7(3):627–37. doi:10.1109/JAS.2020.1003120.

[3] Ma N, Fan J, Wang W, Wu J, Jiang Y, Xie L, et al. Computer vision for road imaging and pothole detection: a state-of-the-art review of systems and algorithms. Transp Saf Environ. 2022;4(4):tdac026. doi:10.1093/tse/tdac026.

[4] Kassas ZZ, Maaref M, Morales JJ, Khalife JJ, Shamei K. Robust vehicular localization and map matching in urban environments through IMU, GNSS, and cellular signals. IEEE Intell Transp Syst Mag. 2020;12(3):36–52. doi:10.1109/MITS.2020.2994110.

[5] Azhar K, Murtaza F, Yousaf MH, Habib HA. Computer vision based detection and localization of potholes in asphalt pavement images. In 2016 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). IEEE; 2016. p. 1–5. doi:10.1109/CCECE.2016.7726722.

[6] Kulambayev B, Beissenova G, Katayev N, Abduraimova B, Zhaidakbayeva L, Sarbassova A, et al. A deep learning-based approach for road surface damage detection. Comput Mater Continua. 2022;73(2):3403. doi:10.32604/cmc.2022.029544.

[7] Ren M, Zhang X, Chen X, Zhou B, Feng Z. YOLOv5s-M: A deep learning network model for road pavement damage detection from urban street-view imagery. Int J Appl Earth Obs Geoinf. 2023;120:103335. doi:10.1016/j.jag.2023.103335.

[8] Rajamohan D, Gannu B, Rajan KS. MAARGHA: A prototype system for road condition and surface type estimation by fusing multi-sensor data. ISPRS Int J Geo-Inf. 2015;4(3):1225–45. doi:10.3390/ijgi4031225.

[9] Mihoub A, Krichen M, Alswailim M, Mahfoudhi S, Bel Hadj Salah R. Road scanner: A road state scanning approach based on machine learning techniques. Appl Sci. 2023;13(2):683. doi:10.3390/app13020683.

[10] Tan Y, Li Y. UAV photogrammetry-based 3D road distress detection. ISPRS Int J Geo-Inf. 2019;8(9):409. doi:10.3390/ijgi8090409.

[11] Torbaghan ME, Li W, Metje N, Burrow M, Chapman DN, Rogers CD. Automated detection of cracks in roads using ground penetrating radar. J Appl Geophys. 2020;179:104118. doi:10.1016/j.jappgeo.2020.104118.

[12] Sui L, Zhu J, Zhong M, Wang X, Kang J. Extraction of road boundary from MLS data using laser scanner ground trajectory. Open Geosci. 2021;13(1):690–704. doi:10.1515/geo-2020-0264.

[13] Kang BH, Choi SI. Pothole detection system using 2D LiDAR and camera. In 2017 Ninth International Conference on Ubiquitous and Future Networks (ICUFN). IEEE; 2017. p. 744–6. doi:10.1109/ICUFN.2017.7993890.

[14] Kuduev A, Abdykalykkyzy Z, Shumilov B. Laser and photogrammetric modeling of roads surface damages. In XIV International Scientific Conference "INTERAGROMASH 2021": Precision Agriculture and Agricultural Machinery Industry. Cham: Springer International Publishing; 2021. vol. 2, p. 279–86. doi:10.1007/978-3-030-80946-1_28.

[15] Zhang N, Li L, Li J, Jiang G, Ma Y, Ge Y. Multisource remote sensing image fusion processing in plateau seismic region feature information extraction and application analysis – An example of the Menyuan Ms6.9 earthquake, 2022. Open Geosci. 2024;16(1):20220599. doi:10.1515/geo-2022-0599.

[16] Jenkins MD, Carr TA, Iglesias MI, Buggy T, Morison G. A deep convolutional neural network for semantic pixel-wise segmentation of road and pavement surface cracks. In 2018 26th European Signal Processing Conference (EUSIPCO). IEEE; 2018. p. 2120–4.

[17] Zhang H, Chen N, Li M, Mao S. The crack diffusion model: An innovative diffusion-based method for pavement crack detection. Remote Sens. 2024;16(6):986. doi:10.3390/rs16060986.

[18] Li L, Sun L, Ning G, Tan S. Automatic pavement crack recognition based on BP neural network. PROMET-Traffic Transp. 2014;26(1):11–22. doi:10.7307/ptt.v26i1.1477.

[19] Zang K, Shen J, Huang H, Wan M, Shi J. Assessing and mapping of road surface roughness based on GPS and accelerometer sensors on bicycle-mounted smartphones. Sensors. 2018;18(3):914. doi:10.3390/s18030914.

[20] Singh G, Bansal D, Sofat S, Aggarwal N. Smart patrolling: An efficient road surface monitoring using smartphone sensors and crowdsourcing. Pervasive Mob Comput. 2017;40:71–88. doi:10.1016/j.pmcj.2017.06.002.

[21] Chen C, Seo H, Zhao Y. A novel pavement transverse cracks detection model using WT-CNN and STFT-CNN for smartphone data analysis. Int J Pavement Eng. 2022;23(12):4372–84. doi:10.1080/10298436.2021.1945056.

[22] Basavaraju A, Du J, Zhou F, Ji J. A machine learning approach to road surface anomaly assessment using smartphone sensors. IEEE Sens J. 2019;20(5):2635–47. doi:10.1109/JSEN.2019.2952857.

[23] Lyu H, Xu K, Jiao D, Zhong Q. Bump feature detection of the road surface based on the Bi-LSTM. Open Geosci. 2023;15(1):20220478. doi:10.1515/geo-2022-0478.

[24] Siami-Namini S, Tavakoli N, Namin AS. The performance of LSTM and BiLSTM in forecasting time series. In 2019 IEEE International Conference on Big Data (Big Data). IEEE; 2019. p. 3285–92. doi:10.1109/BigData47090.2019.9005997.

[25] Mednis A, Strazdins G, Liepins M, Gordjusins A, Selavo L. RoadMic: Road surface monitoring using vehicular sensor networks with microphones. In Networked Digital Technologies: Second International Conference, NDT 2010, Prague, Czech Republic, 2010. Proceedings, Part II. Berlin, Heidelberg: Springer; 2010. p. 417–29. doi:10.1007/978-3-642-14306-9_42.

[26] Zhang Z, Ai X, Chan CK, Dahnoun N. An efficient algorithm for pothole detection using stereo vision. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2014. p. 564–8. doi:10.1109/ICASSP.2014.6853659.

[27] Sharma SK, Phan H, Lee J. An application study on road surface monitoring using DTW based image processing and ultrasonic sensors. Appl Sci. 2020;10(13):4490. doi:10.3390/app10134490.

[28] Ng JR, Wong JS, Goh VT, Yap WJ, Yap TT, Ng H. Identification of road surface conditions using IoT sensors and machine learning. In Computational Science and Technology: 5th ICCST 2018, Kota Kinabalu, Malaysia, 2018. Singapore: Springer; 2019. p. 259–68. doi:10.1007/978-981-13-2622-6_26.

[29] Zhu J, Li H, Zhang T. Camera, LiDAR, and IMU based multi-sensor fusion SLAM: A survey. Tsinghua Sci Technol. 2023;29(2):415–29. doi:10.26599/TST.2023.9010010.

[30] Nowakowski M, Kurylo J, Dang PH. Camera-based AI models used with LiDAR data for improvement of detected object parameters. Int Conf Model Simul Auton Syst. 2023;14615:287–301. doi:10.1007/978-3-031-71397-2_18.

[31] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:6000.

[32] Taylor SJ, Letham B. Forecasting at scale. Am Stat. 2018;72(1):37–45. doi:10.1080/00031305.2017.1380080.

[33] Salinas D, Flunkert V, Gasthaus J, Januschowski T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int J Forecast. 2020;36(3):1181–91. doi:10.1016/j.ijforecast.2019.07.001.

[34] Zhou T, Ma Z, Wen Q, Wang X, Sun L, Jin R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning. PMLR; 2022. p. 27268–86.

[35] Wang X, Liu H, Du J, Yang Z, Dong X. CLformer: Locally grouped auto-correlation and convolutional transformer for long-term multivariate time series forecasting. Eng Appl Artif Intell. 2023;121:106042. doi:10.1016/j.engappai.2023.106042.

[36] Du D, Su B, Wei Z. Preformer: Predictive transformer with multi-scale segment-wise correlations for long-term time series forecasting. In ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2023. p. 1–5. doi:10.1109/ICASSP49357.2023.10096881.

[37] Ni Z, Yu H, Liu S, Li J, Lin W. BasisFormer: Attention-based time series forecasting with learnable and interpretable basis. Adv Neural Inf Process Syst. 2023;36:71222–41.

[38] Liu Y, Qin G, Huang X, Wang J, Long M. AutoTimes: Autoregressive time series forecasters via large language models. Adv Neural Inf Process Syst. 2024;37:122154–84.

[39] Das A, Kong W, Sen R, Zhou Y. A decoder-only foundation model for time-series forecasting. In Forty-first International Conference on Machine Learning; 2024.

[40] Li Q, Zhuang Y, Huai J, Wang X, Wang B, Cao Y. A robust data-model dual-driven fusion with uncertainty estimation for LiDAR–IMU localization system. ISPRS J Photogramm Remote Sens. 2024;210:128–40. doi:10.1016/j.isprsjprs.2024.03.008.

Received: 2025-06-02
Revised: 2025-09-10
Accepted: 2025-09-11
Published Online: 2025-10-06

© 2025 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
