Article Open Access

Handling imbalanced data in supervised machine learning for lithological mapping using remote sensing and airborne geophysical data

  • Hary Nugroho, Ketut Wikantika, Satria Bijaksana and Asep Saepuloh
Published/Copyright: August 4, 2023

Abstract

With balanced training sample (TS) data, learning algorithms offer good results in lithology classification. However, first-time lithological mapping in remote places is expected to be difficult, resulting in limited and imbalanced samples. To address this issue, a variety of techniques can be used, including ensemble learning (such as random forest [RF]), over/undersampling, class weight tuning, and hybrid approaches. This work investigates and analyzes several strategies for dealing with imbalanced data in lithological classification based on RF algorithms with limited drill log samples using remote sensing and airborne geophysical data. The research was carried out at Komopa, Paniai District, Papua Province, Indonesia. The class weight tuning, oversampling, and balanced class weight procedures were used, with TSs ranging from 25 to 500. In general, the oversampling approach outperformed the class weight tuning and balanced class weight procedures, with the following metric values: 0.70–0.80 (testing accuracy), 0.43–0.56 (F1 score), and 0.32–0.59 (Kappa score). The visual comparison also revealed that the oversampling strategy gave the most reliable classifications: if the imbalance ratio is proportionate to the coverage area in each lithology class, the classifier capability is optimal.

1 Introduction

1.1 Background

Lithology classification is critical for geological investigation and mineral resource prospecting. However, the standard lithological mapping method is knowledge-driven, requires experts, and is time-consuming. As a result, data-driven lithological mapping technologies, such as machine learning algorithms (MLAs) capable of rapidly processing enormous amounts of data, have been developed [1,2,3]. The use of machine learning with remote sensing data combined with airborne geophysical data has been extensively researched for efficient and timely large-scale lithological mapping [4,5,6,7].

Advances in machine learning have improved lithological mapping. Random forest (RF) is one of the more recent techniques utilized in this field, and it has shown promising results in predicting lithology from well logs [8] and seismic data [9]. Another technique that has gained prominence in machine learning is semantic analysis [10]. This method extracts meaning from natural language documents using statistical and machine learning techniques. It has been used in image recognition, natural language processing, and speech recognition, among other areas. Semantic analysis can be utilized in lithological mapping to extract geological features from textual descriptions of lithology [11] and to reflect the contextual meaning of terms [12], thereby assisting in predicting lithology types. The combination of RF with semantic analysis [13,14] can result in more accurate and efficient lithological mapping methods, which has significant implications for a variety of industries, such as oil and gas exploration, geothermal energy, and groundwater resource management.

The RF learning algorithm has been identified as a highly accurate and robust MLA for lithology classification using remote sensing data, with the ability to classify lithology types in each pixel of the study area and generate robust prediction uncertainties, making it a valuable tool for determining the spatial distribution of lithology [15]. Many studies have used MLAs, such as the RF algorithm, for lithology classification [6,7,16,17,18]. RF is a supervised, ensemble learning approach that has been recognized as a good classifier, moderately robust to outliers and noise [8,19], simple to use, and highly accurate for inferring the spatial distribution and generating robust prediction uncertainties [20]. For instance, Cracknell and Reading [21] found that RF is straightforward to train, computationally efficient, highly stable in the face of changes in classification model parameter values, and as accurate as, or significantly more accurate than, the other MLAs tested. Similarly, Harris and Grunsky [22] explained how the RF algorithm can map geology and how it can reduce classification uncertainty on prediction maps.

MLA implementation necessitates the use of training samples (TSs) to carry out the “learning” process. TSs that do not have the same number of samples for all lithology classes (imbalanced) are more likely to bias the classification process in favor of the classes that hold the majority of the samples. Furthermore, imbalance complicates the identification of regularities, particularly the homogeneity patterns in the minority class [23]. As a result, classifiers tend to have excellent accuracy for the majority class while obtaining poor results for the minority class [24]. In such instances, the minority class is frequently the most critical; hence, measures to boost its recognition rates are required [25]. Generally, classifiers are trained to maximize accuracy. In an imbalanced scenario, maximizing the overall accuracy may not be the most beneficial method [26] because, even if the prediction model yields a high overall accuracy, the minority class may remain unrecognized [27,28]. Collecting TSs for lithological mapping in areas that are difficult to access is challenging. Because the samples are limited and unevenly distributed, the data suffer from absolute rarity. This means that the minority class lacks sufficient samples to learn the decision boundaries [29].

The imbalance ratio (IR) describes the degree of class imbalance in a dataset [30]. The IR compares the number of majority (negative) samples with the number of minority (positive) samples, such as 100:1 or 100,000:1. Although the IR is used to sort diverse datasets, it does not always provide an accurate estimate of the difficulty of a dataset [24]. Instead, it gives the number of instances of the most likely class for each instance of the least likely class [31]. Minority or rare classes are nevertheless difficult to identify. This is due to their infrequency and irregular occurrence, and misclassifying rare classes results in high costs [27]. In this respect, it is more crucial to accurately predict or identify the rare classes than the common ones [29]. Furthermore, classical classifiers are prone to bias in evaluation and perform poorly for the minority class compared with the majority.
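As a small illustration of the IR, the sketch below computes the ratio of the largest to the smallest class for the 500-sample simple random training set listed in Table 2. Extending the two-class definition to five classes by dividing the largest class by the smallest one is an assumption made here for illustration, and the function name `imbalance_ratio` is hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (size of the largest class) / (size of the smallest class)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Class counts of the 500-sample simple random training set (Table 2).
counts_500 = {
    "inferred porphyry": 10,
    "pseudo gossan": 6,
    "quaternary alluvium": 23,
    "sedimentary rocks": 107,
    "undifferentiated porphyry": 354,
}

# Expand the counts into one label per sample, as a classifier would see them.
labels = [name for name, n in counts_500.items() for _ in range(n)]

ir = imbalance_ratio(labels)
print(f"IR = {ir:.0f}:1")  # 354 / 6 = 59, so this prints "IR = 59:1"
```

The same function can be applied to any of the other training sets in Table 2 to see how the IR changes with sample size.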

There are two strategies for overcoming data imbalance problems: (i) improvements at the data level and (ii) improvements at the algorithm level [24,30,32]. At the data level, the data are preprocessed and the class distribution is rebalanced through resampling with over- or undersampling techniques [32]. Solutions at this level primarily focus on changing the class distribution to obtain a more balanced sample [33]. At the algorithm level, the learning procedure is adjusted, including ensemble learning approaches (e.g., RF) and cost-sensitive learning (CSL) techniques [24,32]. Combinations of the two strategies are called hybrid methods [25], for instance, RF-over/undersampling and RF-CSL.

Improvements at the data level include numerous techniques, for example, oversampling of small classes, undersampling of large classes, informative oversampling of small classes, informative undersampling of large classes, oversampling of small classes using synthetic data, and combinations of the aforementioned techniques [32]. The oversampling method works on the minority class by replicating its data until the classes become balanced. The key advantage is that it minimizes information loss, but the disadvantage is that it merely duplicates data and can lead to overfitting [3].

CSL (also known as class weight tuning) modifies the learning algorithm itself [24,25,34]. It comprises a set of algorithms that are sensitive to the different costs associated with distinct properties of the classification problem. CSL intends to train classifiers to prioritize classes with greater costs. One way to do this is to assign a cost to each class: the classifier pays a substantial penalty if it misclassifies a specific class. The algorithm is modified by giving weights to the majority and minority classes, and the difference in weight affects the classification during the training process. The adjustment penalizes misclassification of the minority class by increasing its class weight and decreasing that of the majority. According to Karlhede [35], the provided weights should enter the algorithm at two points: when calculating the Gini index, which determines the best feature for splitting data at a decision node, and when a leaf acquires its class label. However, it should also be noted that the weights have a practical limit. When a minority class is given a very high weight, the algorithm is likely to be biased toward that class, lowering the model’s overall performance [36].

In lithological mapping, data imbalances are common. However, little research has looked into the effect of imbalance and of the handling procedures on classification accuracy, especially with small sample sizes [17,33]. The degree of imbalance may affect the performance of predictive models trained on imbalanced data. For some samples, the imbalance can have a considerable impact on the classification quality [37]. This work investigates and analyzes multiple approaches to handling imbalanced data in lithological classification based on RF algorithms, using remote sensing and airborne geophysical data with limited borehole log samples in Komopa, Papua Province, Indonesia. RF was combined with oversampling and with class weight tuning to form two hybrid approaches. The oversampling method was chosen because it has successfully addressed the problem in lithological mapping [3]. The class weight tuning approach was chosen because it has received considerable attention and many studies report successful outcomes when a large number of TSs are used [38,39]. To date, however, no research has applied this method to lithological mapping with a limited sample collection.

1.2 RF classifier

Breiman and Cutler developed RF at the University of California, Berkeley [19,22]. The algorithm has evolved into a trustworthy and useful classifier for scientists in a variety of fields, including geology [1,40]. Numerous lithological mapping investigations have demonstrated that it outperforms other algorithms [1,22,41]. RF is an excellent starting point for classifying multi-dimensional data. It can also generate a ranked list of the best predictors without requiring normally distributed TSs [42].

RF is a supervised MLA that builds several decision trees and combines their decisions (ensemble classification method). To generate the decision trees, this approach uses bootstrap aggregation [19]. Like other supervised classification algorithms, it requires training data (i.e., locations of different rock types) [22]. Sampling with replacement, the algorithm randomly selects about two-thirds of the TSs to form an “in-bag” dataset. The in-bag data are used to grow a decision tree using the Gini index [43], which identifies the best split for a particular class, while the remaining data (“out-of-bag” data) are used to validate the model. The main advantage of this algorithm is that it predicts classes based on the average of several decision trees, improving predictions and reducing errors caused by outliers [41].
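The in-bag and out-of-bag split can be illustrated with a minimal sketch, assuming a plain bootstrap in which as many indices as there are samples are drawn with replacement. Under that assumption the expected share of distinct in-bag samples is 1 − 1/e ≈ 0.632, which is the "about two-thirds" quoted above. The helper name `bootstrap_split` and the sample count are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement.

    Returns the in-bag draws (with repeats) and the set of indices
    that were never drawn (the out-of-bag samples).
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = set(range(n_samples)) - set(in_bag)
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
in_bag, oob = bootstrap_split(n, rng)

unique_fraction = len(set(in_bag)) / n
print(f"distinct in-bag fraction: {unique_fraction:.3f}")
print(f"out-of-bag fraction:      {len(oob) / n:.3f}")
```

Each tree in the forest would receive its own draw, so a given sample is out-of-bag for roughly one-third of the trees and can be used to validate them.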

1.3 The objectives

Because of the restricted number of TSs, the problem of data imbalance necessitates research into the number of samples required to achieve accurate classifications. In addition, few studies have described the behavior of minority data and its association with the IR [28]. The specific goals of this research are therefore to determine: 1) which method produces the best classifications; 2) the effect of the number of TSs on the accuracy of the classifications; 3) the behavior of minority data in the classification process; and 4) the effect of the IR on the classifier’s ability. The quality of the classifications was determined not only by performance metrics computed from confusion matrices but also by a visual comparison of the lithological map obtained from the RF classification with the existing map. In lithological mapping, performance indicators alone are deemed insufficient to describe classification success. The visual analysis shows how similar the distribution of lithological types and boundaries is to the existing map.

This study included the following experiments. First, both hybrid approaches were used to classify imbalanced data with varied numbers of TSs, ranging from 25 to 500 samples selected by simple random sampling. These experiments identify the number of samples, the IR, and the distribution of the minority TSs that produce the best outcomes. Next, the classification results of the two hybrids were compared with those of the balanced class weight technique and the balanced TS. The balanced class weight approach assigns equal weight to each lithology class. The balanced TS, on the other hand, is a model with the same number of TSs for each class, selected using stratified random sampling with 25 and 50 samples.

2 Materials and methods

2.1 Study areas and data used

2.1.1 Study area

The research area is the Komopa area in the Paniai District, Papua Province, Indonesia (the western part of New Guinea Island) (Figure 1). We chose this area because a lithological map produced by Mine Serve International (MSI) with conventional geological mapping techniques is already available and provides an accurate reference map. In 2000, MSI created a lithological map of this area at a scale of 1:25,000 [44]. Furthermore, MSI has collected a series of geophysical datasets and a large number of borehole logs, so this area is well suited to lithological mapping trials with several geoscience datasets and TSs. According to the technical report written by Skead [45], the Komopa area is bounded to the south by the Aga and Bogodide Rivers, both the southwest and southeast of the Wodege River, respectively. The Komopa area has moderate relief, with elevations ranging from 1,720 to 2,200 m. The terrain consists of 600–1,200-m-wide valleys filled with alluvial material and divided by low hills. Longitudes 136°28′3.88″ E–136°33′54.13″ E and latitudes 3°44′38.86″ S–3°48′59.32″ S define the survey area. The majority of the land is densely forested, with trees reaching heights of 30 m and thick humus covering the ground (Figure 2b). Quaternary alluvium, pseudo gossan, inferred porphyry, sedimentary rocks, and undifferentiated porphyry are the lithologies present in the area (Figure 2a). Pseudo gossan is a more detailed lithology class than the others. Nevertheless, MSI used this class as a guide for their mineral exploration.

Figure 1

The area of interest is in the middle of Papua Province (Indonesia), presented by a red rectangle [48].

Figure 2

(a) A local lithological map of Komopa, scale 1:25,000, released by MSI in 2000, indicated the variation of rock types in different colors [44]. (b) Study Area, Komopa, Paniai District, Papua Province, presented by the color composite of Sentinel 2A RGB:11-08-02.

According to Glover Consulting’s geological survey reports for MSI [46], New Guinea Island is a well-known late Miocene–Pliocene copper–gold porphyry province-associated gold resource. Island arc volcanism with co-magmatic plutons is linked to mineralization. The island was formed by a collision between the Australian plate and the southward-migrating Pacific plate. A prominent central fold belt has formed and is split into three east–west trending structural zones: 1) the southern coastal strip comprising Palaeozoic and Tertiary shelf sediments; 2) the central mobile belt comprising Palaeozoic, Mesozoic, and Tertiary sediments, which are folded and faulted by high-angle reverse thrusts and strike-slip faults; and 3) the northern belt of Tertiary sediments and volcanics. The central mobile belt has been intruded by diorite to quartz-monzonite stocks, which are linked to skarn and porphyry copper–gold mineralization. Both the intrusions and the mineralization are considered structurally controlled [47].

2.1.2 Field sampling data

The study area has no outcrops, and to collect TSs, soil drilling was conducted to the depths of 0–3 m, 3–6 m, or until bedrock was observed using winkie or Longyear drilling [45]. There are 1002 borehole log samples available for this research, of which 500 were utilized as TSs and the rest as testing samples.

2.1.3 Remote sensing and airborne geophysical data

The remote sensing data used were composed of Sentinel-2A, Advanced Land Observing Satellite (ALOS)-Phased Array type L-band Synthetic Aperture Radar (PALSAR), and Digital Elevation Model (DEM) data. The Sentinel-2A image, acquired on January 09, 2019, is a wide-swath (290 km), high-resolution, and multispectral dataset. The Multispectral Instrument measures the Earth’s reflected radiance in 13 spectral bands: (i) four visible and near infrared bands (B2, B3, B4, and B8) at 10 m of spatial resolution; (ii) four vegetation red edge bands (B5, B6, B7, and B8a) and two short-wave infrared (SWIR) bands (B11 and B12) at 20 m of spatial resolution; and (iii) three bands (B1, B9, and B10 for aerosol, water vapor, and cirrus SWIR, respectively), at 60 m of spatial resolution [49]. This study only used the Sentinel-2A bands with a spatial resolution of 10 and 20 m.

The Level-2A granule was obtained from the United States Geological Survey Earth Explorer Open Access Hub (https://earthexplorer.usgs.gov/; image code in Supplementary Material S1). A granule is the minimum indivisible product partition (containing all possible spectral bands). Granules are also known as tiles and are 100 × 100 km² ortho-images in the UTM/WGS84 projection [50].

The study area was projected to Universal Transverse Mercator (UTM) zone 53S. The Level-2A image data provide Bottom of Atmosphere reflectance images derived from the associated Level-1C image data. Such images can be used directly in downstream applications without further processing [50]. Before use, the influence of vegetation on the image data was removed using the Vegetation Suppression tool available in the Environment for Visualizing Images (ENVI) software [51].

The radar data used were ALOS PALSAR Fine Beam Double Polarization Data (FBD), L-band, 3.17 m × 14.9 m (azimuth × range) resolution dual-polarization (HH + HV). The data were collected on July 02, 2007, with ALOS’s Single Look Complex (SLC) data type. The granule data were obtained from https://www.asf.alaska.edu/ (Image code in Supplementary Material S2).

The DEM data utilized were DEMNAS (National DEM) from the Geospatial Information Agency, Republic of Indonesia. The National DEM was developed from several data sources, such as IFSAR (5 m resolution), TERRASAR-X (5 m resolution), and ALOS PALSAR (11.25 m resolution), by including the stereo-plotting mass point data from aerial photos. The spatial resolution of DEMNAS is 0.27-arcsecond (8.1 m), using the EGM2008 vertical datum [52].

The airborne geophysical data comprised magnetic, electromagnetic, and radiometric data. Residual field magnetic data effectively reflect the distribution of magnetic material within the survey area. In nature, the most dominant magnetic mineral is magnetite (Fe3O4). Lithologies containing even small amounts of magnetite will produce distinctive magnetic properties [47]. Electromagnetic data give precise information on the structure and lithological variations. Airborne electromagnetics data can deliver beneficial additional information, whereas magnetic data might only offer minimal information [47]. A quantity of 30 cm of rock, 60 cm of soil, or 1 m of water effectively obscures the underlying radiation sources no matter how intense owing to the minimal depth of penetration and the inherently complex nature of the gamma-ray spectra. Thus, the radiometric data reflect lithological variations on the surface [47]. Magnetic and radiometric data surveys were carried out simultaneously with a flight spacing of 400 m. The electromagnetic data survey was performed separately with a flight spacing of 100 m.

All the datasets used are shown in Table 1.

Table 1

A summary of the data used for lithology classification

| Dataset | Acquisition date | Data | Resolution/Flight spacing | Source |
| --- | --- | --- | --- | --- |
| Sentinel 2A | January 09, 2019 | Band 2 | Res. 10 m | USGS |
| | | Band 3 | Res. 10 m | |
| | | Band 4 | Res. 10 m | |
| | | Band 5 | Res. 20 m | |
| | | Band 6 | Res. 20 m | |
| | | Band 7 | Res. 20 m | |
| | | Band 8 | Res. 10 m | |
| | | Band 8a | Res. 20 m | |
| | | Band 11 | Res. 20 m | |
| | | Band 12 | Res. 20 m | |
| DEM | No data | DEM | Res. 8.1 m | Geospatial Information Agency, Indonesia |
| ALOS PALSAR | July 02, 2007 | HH | Res. 3.17 m × 14.9 m | www.asf.alaska.edu |
| | | HV | | |
| | | HH-HV | | |
| Magnetic | August 6–25, 1993 | RTP | 400 m flight spacing | MSI |
| Electromagnetic | August 6–25, 1993 | 2 kHz | 100 m flight spacing | MSI |
| | | 20 kHz | | |
| | | 36 kHz | | |
| Radiometric | August 6–25, 1993 | Thorium (Th) | 400 m flight spacing | MSI |
| | | Potassium (K) | | |
| | | Uranium (U) | | |
| | | K/Th ratio | | |

2.2 Methodology

Figure 3 illustrates the research methodology for this study. It primarily comprises five parts: (1) preprocessing of remote sensing and geophysical data; (2) training and testing data preparation; (3) lithological classification utilizing the RF algorithm; (4) accuracy evaluation; and (5) visual comparative assessment.

Figure 3

Flowchart of research methodology to categorize lithology based on imbalanced or balanced data.

2.2.1 Data preprocessing

Preprocessing was performed for the Sentinel-2A, DEM, and ALOS PALSAR data, which contain information on geological and lithological features for mineral exploration [53] beneath the surface layer. The preprocessing stage started with the Sentinel-2A data and took two steps. First, we used the Sentinel Application Platform software from the European Space Agency to carry out atmospheric correction and orthorectification [54]. Second, we eliminated vegetation spectral signatures using the Vegetation Suppression tool from ENVI [51].

The ALOS PALSAR data types applied were FBD, L-band, 3.17 m × 14.9 m (azimuth × range) resolution dual-polarization (HH + HV) with SLC data type. The Japan Aerospace Exploration Agency-Earth Observation Research Center obtained the data on July 02, 2007, and preprocessed it (orthorectification, slope correction, and mosaicking). SAR calibration aims to provide imagery where the pixel values may correlate directly with the scene’s radar backscatter [55]. Hence, it is crucial to apply the radiometric correction to SAR images for the pixel values to represent the radar backscatter of the reflecting surface. The SLC data were converted to ground range detected data, and a spatial resolution of 20 m × 20 m was acquired. The data were in the form of digital number intensities, which were then processed into HH-HV polarization backscattering data [56].

The geophysical data preprocessing is initiated with the magnetic data. Filtering against a specified range of frequencies from the dataset was followed by compilation and leveling work, non-linear filtering, gridding, spectral analysis, low-pass filter, decorrugation, international geomagnetic reference field removal, contouring, and reduction to pole [47]. The magnetic data used for this classification were reduced to pole (RTP). The radiometric data comprised Thorium (Th), Potassium (K), Uranium (U), and the ratio of Thorium and Potassium (K/Th). The process started with validating the data utilizing preliminary grids and stacked profiles. Then, the minimum curvature gridding process, decorrugation using linear contour intervals, and color contour plots were constructed using an equal area distribution [47]. Electromagnetic data included 2, 20, and 36 kHz and were processed in two stages, namely quality control (frequency and time domain) and 1D inversion (layered model and inversion result) [57].

The next stage involved resampling all the data (24 features) to a common pixel size of 20 m using inverse distance weighted interpolation. Finally, we normalized each dataset to the range between 0 and 1 to homogenize their magnitudes and avert adverse effects on classification [3].
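A minimal sketch of the normalization step is given below. The text only states that each dataset is scaled to between 0 and 1; per-layer min-max scaling is assumed here as the most common way to do this, and the random array merely stands in for the 24 resampled feature layers. The function name `min_max_normalize` is hypothetical.

```python
import numpy as np


def min_max_normalize(stack):
    """Scale each layer of a (n_features, rows, cols) stack to [0, 1]."""
    stack = np.asarray(stack, dtype=float)
    lo = np.nanmin(stack, axis=(1, 2), keepdims=True)
    hi = np.nanmax(stack, axis=(1, 2), keepdims=True)
    # Constant layers would divide by zero; give them a span of 1 instead.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (stack - lo) / span


# Synthetic stand-in for 24 co-registered feature layers on a 20 m grid.
rng = np.random.default_rng(0)
features = rng.normal(loc=100.0, scale=25.0, size=(24, 50, 50))

scaled = min_max_normalize(features)
print(scaled.min(), scaled.max())
```

Scaling each layer separately keeps features with large numeric ranges, such as magnetic intensity, from dominating features with small ranges.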

2.2.2 Training and testing data preparation

Table 2 shows the number and distribution of training and testing samples used in the lithology classifications. All the imbalanced samples were acquired by simple random sampling, whereas the balanced samples were obtained by stratified random sampling. Figure 4 depicts the spatial distribution of the training and testing samples.

Table 2

Training and testing sample size and distribution

| Model (number of TSs) | Inferred porphyry | Pseudo gossan | Quaternary alluvium | Sedimentary rocks | Undifferentiated porphyry | Sum |
| --- | --- | --- | --- | --- | --- | --- |
| Stratified Random 25 | 5 | 5 | 5 | 5 | 5 | 25 |
| Stratified Random 50 | 10 | 10 | 10 | 10 | 10 | 50 |
| Simple Random 25 | 2 | 2 | 2 | 5 | 14 | 25 |
| Simple Random 50 | 2 | 2 | 3 | 12 | 31 | 50 |
| Simple Random 100 | 4 | 2 | 8 | 22 | 64 | 100 |
| Simple Random 200 | 5 | 2 | 12 | 41 | 140 | 200 |
| Simple Random 300 | 6 | 3 | 16 | 67 | 208 | 300 |
| Simple Random 400 | 7 | 4 | 19 | 89 | 281 | 400 |
| Simple Random 500 | 10 | 6 | 23 | 107 | 354 | 500 |
| Number of testing samples | 8 | 9 | 17 | 124 | 344 | 502 |

Figure 4

The distribution of training and testing sample datasets with variations in the number and type of lithology. Figure (a)–(g) are TSs with a simple random distribution with a sample size variation of 25 to 500. Figure (h) and (i) depict stratified random distributions with 25 and 50 TSs. Figure (j) represents the distribution of testing samples, with a total of 502, where the class distribution is imbalanced.

The class weight tuning technique was applied using GridSearchCV from Scikit-learn [58]. Class weights were tuned only for the minority classes, namely inferred porphyry, pseudo gossan, and quaternary alluvium. Weight tuning is a search for the weights that give the highest training accuracy and F1 score. The search was performed in two stages. In the first stage, a series of classifications was run with widely spaced weight values for the minority classes, for instance, 0.1, 1, 2, 5, 10, 15, and 20. Each of these classes was given the same weight at this stage, and the resulting training accuracy and F1 score were examined. The aim was to find the interval of weight values with the highest training accuracy and F1 score. Once the best interval was found, the second stage searched for the best combination of minority class weights. Here, the classification was performed as in the first stage using GridSearchCV, but the weights were drawn from the interval obtained in the first stage and increased in relatively small steps. For instance, when the interval is between 0.1 and 2.0, the weight is increased by 0.1 at each step.
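The two-stage search can be sketched as follows. This is not the configuration used in the study: the data are synthetic (three classes, two of them minorities), macro-averaged F1 under 3-fold cross-validation is assumed as the selection score, and the fine step of 0.5 and the helper name `grid_search` are illustrative choices made to keep the example quick.

```python
import itertools

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the TSs: classes 0 and 1 are minorities, 2 is the majority.
X, y = make_classification(
    n_samples=300, n_features=24, n_informative=8, n_classes=3,
    weights=[0.08, 0.12, 0.80], random_state=0,
)
MAJORITY = 2


def grid_search(weight_dicts):
    """Cross-validated search over candidate class_weight dictionaries."""
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=30, random_state=0),
        param_grid={"class_weight": weight_dicts},
        scoring="f1_macro",  # assumed score; the study inspects accuracy and F1
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    )
    return search.fit(X, y)


# Stage 1: widely spaced values, the same weight for every minority class.
coarse = [0.1, 1, 2, 5, 10, 15, 20]
stage1 = grid_search([{0: w, 1: w, MAJORITY: 1} for w in coarse])
w_best = stage1.best_params_["class_weight"][0]

# Stage 2: small steps around the best coarse value, all weight combinations.
step = 0.5
fine = [round(w_best + k * step, 2) for k in (-1, 0, 1) if w_best + k * step > 0]
stage2 = grid_search(
    [{0: a, 1: b, MAJORITY: 1} for a, b in itertools.product(fine, fine)]
)

print("stage 1 best weight:", w_best)
print("stage 2 best weights:", stage2.best_params_["class_weight"])
print("stage 2 macro F1:", round(stage2.best_score_, 3))
```

Keeping the majority weight fixed at 1 mirrors the description above, in which only the minority class weights are tuned.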

In the oversampling approach, data preparation is performed by replicating TSs (augmentation technique) so that the number of samples in every class equals the number of TSs in the majority class. Finally, in the balanced class weight method, all classes are allotted the same weight, equal to 1 (known as a classic RF).
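Replication-based oversampling can be sketched as below, assuming that the extra samples are drawn at random with replacement from each class until it matches the majority class. The class counts are those of the 100-sample simple random set in Table 2; the feature values are random placeholders, and `random_oversample` is a hypothetical helper rather than the routine used in the study.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Replicate samples of every class until each matches the largest class."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing samples from the class itself, with replacement.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Class counts of the 100-sample simple random training set (Table 2).
y = np.repeat(np.arange(5), [4, 2, 8, 22, 64])
X = np.random.default_rng(1).normal(size=(y.size, 24))  # placeholder features

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # every class now has 64 samples
```

Because the added rows are exact copies, the balanced set carries no new information, which is why the overfitting risk noted in the Introduction applies.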

2.2.3 Lithological classification using RF algorithm

At this point, the lithology classification was carried out using the RF algorithm implemented in the Python Scikit-learn library. RF is an ensemble learning algorithm that comprises many decision trees [19]. RF uses decision trees as base learners [59], and each tree is developed according to numerous hyperparameters. Hyperparameters set the number of decision trees (n_estimators) and control the decision tree structure [60]. To handle the class weights, the decision tree uses the class_weight hyperparameter, which contains the class weights set before the decision tree is built. In this research, the hyperparameters used to build the decision trees were set to the default values of Scikit-learn [61]. The exception was the class weight tuning method, where the class_weight hyperparameter for the minority classes followed the outcomes of the class weight tuning, while the majority classes were given a weight of 1. In RF, the individual decision trees are built from many randomly generated subsets, and the class with the most votes is taken as the final classification result [19]. Furthermore, each decision tree is built from training data acquired through bootstrap aggregation (bagging) [62]. About two-thirds of the TSs in the dataset are utilized for training, and the remaining one-third is used for internal model validation.
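The difference between the classic and the class-weighted RF can be sketched with Scikit-learn as follows. The data are synthetic, and the minority weight of 5.0 is a placeholder rather than a tuned value from the study; `oob_score=True` is switched on only to expose the internal out-of-bag validation described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 24-feature training set; class 0 is the minority.
X, y = make_classification(
    n_samples=500, n_features=24, n_informative=8, n_classes=3,
    weights=[0.05, 0.25, 0.70], random_state=0,
)

# Classic RF: Scikit-learn defaults, so every class carries the same weight.
classic = RandomForestClassifier(oob_score=True, random_state=0).fit(X, y)

# Class-weighted RF: the minority class gets a (placeholder) larger weight,
# while the remaining classes keep a weight of 1.
weighted = RandomForestClassifier(
    class_weight={0: 5.0, 1: 1.0, 2: 1.0}, oob_score=True, random_state=0
).fit(X, y)

print("number of trees:", classic.n_estimators)
print("classic  OOB accuracy:", round(classic.oob_score_, 3))
print("weighted OOB accuracy:", round(weighted.oob_score_, 3))
```

Overall out-of-bag accuracy is dominated by the majority class, so per-class metrics are still needed to judge whether the weighting helps the minority class.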

2.2.4 Accuracy evaluation

The performance of the classifications was quantitatively evaluated by computing metrics from confusion matrices. Accuracy was assessed from different viewpoints, including the overall testing accuracy, precision, recall, F1 score, and Kappa score [63]. The 502 testing samples from the borehole log data were used to compute the metrics. The TSs were used in cross-validation of the RF classification to determine the overall training accuracy.

The overall testing accuracy is the ratio of correctly classified pixels to the total number of pixels in the confusion matrix. With imbalanced data, the majority class dominates the classifications, permitting even a poor model to achieve high accuracy, depending on the degree of imbalance. Hence, metrics other than accuracy, such as precision, recall, and F1 score, are required to measure classification performance. Precision quantifies the fraction of positive class predictions that actually belong to the positive class. Recall quantifies the fraction of all positive examples in the dataset that are predicted as positive. When applied alone, neither precision nor recall tells the whole story [63]. The F1 score is the harmonic mean of precision and recall and summarizes both in a single value. A good algorithm should simultaneously maximize precision and recall [64]. We used equations (1)–(4) for the calculation of accuracy, precision, recall, and F1 score, respectively [65]:

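(1) $\mathrm{Accuracy} = \sum_{i=1}^{5} N_{ii} \Big/ \sum_{i=1}^{5} \sum_{j=1}^{5} N_{ij}$,

(2) $\mathrm{Precision}_i = N_{ii} \Big/ \sum_{k=1}^{5} N_{ki}$,

(3) $\mathrm{Recall}_i = N_{ii} \Big/ \sum_{k=1}^{5} N_{ik}$,

(4) $\mathrm{F1\ score} = 2 \times \mathrm{Precision}_i \times \mathrm{Recall}_i / (\mathrm{Precision}_i + \mathrm{Recall}_i)$.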

The confusion matrix is obtained according to a combination of actual and predicted values, as demonstrated in Table 3.

Table 3

Five classification confusion matrix

                      Predicted grade
                      I      II     III    IV     V
Actual grade   I      N11    N12    N13    N14    N15
               II     N21    N22    N23    N24    N25
               III    N31    N32    N33    N34    N35
               IV     N41    N42    N43    N44    N45
               V      N51    N52    N53    N54    N55

Description: Class I: inferred porphyry, Class II: pseudogossan, Class III: quaternary alluvium, Class IV: sedimentary rocks, Class V: undifferentiated porphyry.
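Equations (1)–(4) can be evaluated directly from a confusion matrix laid out as in Table 3 (rows are actual classes, columns are predicted classes). The sketch below uses NumPy and a small hypothetical three-class matrix; the function name and the convention of returning 0 when a class is never predicted are assumptions made for illustration.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Accuracy and per-class precision, recall, and F1 from a confusion matrix.

    Rows are actual classes and columns are predicted classes, as in Table 3.
    """
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)                 # N_ii: correctly classified counts
    predicted_totals = cm.sum(axis=0)  # column sums: sum over k of N_ki
    actual_totals = cm.sum(axis=1)     # row sums:    sum over k of N_ik
    accuracy = diag.sum() / cm.sum()
    # Return 0 instead of dividing by zero when a class is never predicted or never occurs.
    precision = np.divide(diag, predicted_totals,
                          out=np.zeros_like(diag), where=predicted_totals > 0)
    recall = np.divide(diag, actual_totals,
                       out=np.zeros_like(diag), where=actual_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(diag), where=denom > 0)
    return accuracy, precision, recall, f1

# Hypothetical 3-class example (not data from the study).
cm = [[50, 5, 5],
      [6, 3, 1],
      [4, 0, 6]]
accuracy, precision, recall, f1 = metrics_from_confusion(cm)
print("accuracy:", accuracy)
print("precision per class:", precision)
print("recall per class:", recall)
print("F1 per class:", f1, "mean F1:", f1.mean())
```

Averaging the per-class values, as in the last line, gives one summary score per model across all classes.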

The Kappa score is another performance metric. It tests inter-rater reliability, that is, how well the assigned classes agree with the reference classes (equation (5)). Inter-rater reliability is often expressed simply as a percentage of agreement (i.e., the sum of the agreement scores divided by the total score) [66]; the Kappa score additionally accounts for the agreement expected by chance.

(5) $\kappa = \dfrac{N \sum_{i=1}^{n} m_{i,i} - \sum_{i=1}^{n} (G_i C_i)}{N^2 - \sum_{i=1}^{n} (G_i C_i)}$,

where $i$ represents the class number, $n$ is the number of classes, $N$ is the total number of classified values compared to truth values, and $m_{i,i}$ signifies the number of values of truth class $i$ that were correctly classified as class $i$ (the diagonal of the confusion matrix). The total numbers of truth values and predicted values corresponding to class $i$ are designated $C_i$ and $G_i$, respectively [67].
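Equation (5) can be computed from the same confusion matrix layout as Table 3. The following is a minimal NumPy sketch with a hypothetical three-class matrix; the function name is illustrative.

```python
import numpy as np

def kappa_from_confusion(cm):
    """Kappa score from a confusion matrix (rows = truth, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    n_total = cm.sum()                 # N: all classified values
    agreement = np.trace(cm)           # sum of correctly classified values (diagonal)
    truth_totals = cm.sum(axis=1)      # C_i: truth values per class
    predicted_totals = cm.sum(axis=0)  # G_i: predicted values per class
    chance = (predicted_totals * truth_totals).sum()  # sum of G_i * C_i
    return (n_total * agreement - chance) / (n_total ** 2 - chance)

# Hypothetical 3-class example (not data from the study).
cm = [[50, 5, 5],
      [6, 3, 1],
      [4, 0, 6]]
kappa = kappa_from_confusion(cm)
print("kappa:", kappa)
```

For this matrix the overall accuracy is about 0.74 while the Kappa score is about 0.35, which shows how a dominant class can inflate accuracy relative to chance-corrected agreement.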

2.2.5 Visual comparative assessment

A visual comparative assessment was performed to compare the outcomes of the RF classification with the existing lithology map. The quantitative and visual evaluations need to be performed together because performance metrics alone cannot describe the quality of the classification results with regard to the location of the identified lithology classes and their boundaries. Low performance metrics do not necessarily signify poor classifications, and vice versa; this becomes apparent when the classification results are visualized and compared with the existing map. The distribution of lithological types and boundaries obtained from the RF classifications was compared with the existing map from MSI. The comparison determined which approach for handling imbalanced data produced the lithological map closest to the existing map.

Both evaluations were also performed specifically for the minority lithology classes to identify how the minority data behave in the classification and how this relates to the IR, the number of TSs, and their composition in each class. We superimposed the MSI map, displayed as lithology class boundaries, on the RF classification results to estimate the deviation between the two. Errors in identifying lithology classes and boundary deviations can then be observed; these lead to an increase or decrease in the precision and recall values of the lithology classes.

3 Results

3.1 Class weight tuning results

As described in Section 2.2.2, the class weight tuning results were obtained after conducting a series of classifications. The first stage showed that the best weight values lay between 0.1 and 2.0. In the second stage, GridSearchCV was applied to evaluate weights between 0.1 and 2.0 in increments of 0.1, using 5-fold cross-validation and 40,000 combinations. Table 4 displays the final results of class weight tuning.
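The second tuning stage can be sketched with Scikit-learn's GridSearchCV as below. Several points are assumptions made for illustration: the data are synthetic, the class labels 0 to 2 stand for the three minority classes and 3 to 4 for the majority classes, the helper `class_weight_grid` is hypothetical, and macro-averaged F1 is used as the selection score because the scoring actually used is not stated here. Note that 20 weight values for each of three minority classes give 8,000 candidate settings, which under 5-fold cross-validation amounts to 40,000 model fits; this may be what the figure of 40,000 refers to.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def class_weight_grid(minority_labels, majority_labels, values):
    """Build every class_weight dict: minority classes sweep `values`, majority classes stay at 1."""
    grid = []
    for combo in product(values, repeat=len(minority_labels)):
        weights = {label: 1.0 for label in majority_labels}
        weights.update(dict(zip(minority_labels, combo)))
        grid.append(weights)
    return grid

# Full sweep described in the text: 0.1, 0.2, ..., 2.0 for three minority classes.
full_values = [round(0.1 * k, 1) for k in range(1, 21)]
full_grid = class_weight_grid([0, 1, 2], [3, 4], full_values)
print("candidates:", len(full_grid), "fits with 5 folds:", 5 * len(full_grid))

# Reduced demonstration on synthetic data so that the example runs quickly.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=6, n_classes=5,
    n_clusters_per_class=1, weights=[0.06, 0.06, 0.08, 0.2, 0.6], random_state=0,
)
demo_grid = class_weight_grid([0, 1, 2], [3, 4], [0.5, 2.0])
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"class_weight": demo_grid},
    scoring="f1_macro",  # assumption: the selection metric is not specified in the text
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print("best class weights:", search.best_params_["class_weight"])
```

Replacing `demo_grid` with `full_grid` would run the complete sweep, at a correspondingly higher computational cost.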

Table 4

Class weight tuning results

Model (TSs)          Class weight tuning results                               Training accuracy   F1 score
                     Inferred porphyry   Pseudo gossan   Quaternary alluvium
Simple Random 25     0.1                 0.4             1.7                   0.76                0.52
Simple Random 50     1.0                 0.8             0.9                   0.89                0.57
Simple Random 100    0.3                 0.5             0.1                   0.96                0.68
Simple Random 200    1.3                 0.1             0.2                   0.95                0.65
Simple Random 300    1.5                 0.4             0.2                   0.97                0.72
Simple Random 400    1.5                 0.5             0.1                   0.97                0.77
Simple Random 500    0.2                 0.4             0.7                   0.97                0.77

Table 4 lists the tuned class weights for the minority classes; each majority class was assigned a weight of 1. Not all minority classes received a weight greater than 1, meaning that the weights of these classes did not surpass those of the majority classes. Brownlee [63] reports that a class with greater importance is allotted a greater weight, while a class with lower importance is given a smaller weight. Similarly, Fernández et al. [24] mentioned that higher weights are assigned to instances from the class with a higher misclassification cost. A minority class with a weight of less than 1 therefore receives less attention than the majority classes. The effects of these class weights can be observed in the performance metrics for the minor classes.

3.2 General accuracy evaluation

Table 5 summarizes the performance metrics for 23 models with varying numbers of TSs. The average score across the five lithology classes is displayed in this table.

Table 5

Performance metrics summary of the lithological classification results

Models Training accuracy Testing accuracy Precision Recall F1 score Kappa score
Random_25_bal 1.00 0.71 0.44 0.41 0.39 0.23
Random_25_CS 0.84 0.69 0.27 0.24 0.23 0.16
Random_25_OV 1.00 0.70 0.44 0.56 0.43 0.32
Stratified_25 1.00 0.61 0.35 0.54 0.38 0.27
Random_50_bal 1.00 0.71 0.51 0.33 0.37 0.23
Random_50_CS 0.94 0.71 0.45 0.26 0.27 0.19
Random_50_OV 1.00 0.74 0.48 0.53 0.49 0.38
Stratified_50 1.00 0.62 0.38 0.65 0.41 0.30
Random_100_bal 1.00 0.75 0.67 0.47 0.52 0.43
Random_100_CS 0.85 0.75 0.28 0.30 0.29 0.44
Random_100_OV 1.00 0.75 0.54 0.59 0.56 0.47
Random_200_bal 1.00 0.77 0.61 0.37 0.41 0.45
Random_200_CS 0.91 0.76 0.48 0.31 0.32 0.40
Random_200_OV 1.00 0.77 0.45 0.41 0.42 0.46
Random_300_bal 1.00 0.78 0.72 0.44 0.49 0.49
Random_300_CS 0.91 0.78 0.39 0.35 0.36 0.48
Random_300_OV 1.00 0.80 0.69 0.50 0.53 0.55
Random_400_bal 1.00 0.82 0.81 0.45 0.51 0.57
Random_400_CS 0.92 0.81 0.41 0.36 0.38 0.54
Random_400_OV 1.00 0.82 0.54 0.47 0.48 0.59
Random_500_bal 1.00 0.81 0.66 0.47 0.52 0.55
Random_500_CS 0.94 0.81 0.42 0.38 0.39 0.56
Random_500_OV 1.00 0.81 0.70 0.49 0.51 0.57

Description: bal: balance class weight method, CS: class weight tuning method, OV: oversampling method; random_500_CS: simple random distribution, 500 TSs, class weight tuning method; stratified_50: stratified random distribution, 50 TSs; bold number: the highest value in each group.

Table 5 illustrates that the testing accuracy of the simple random models ranged between 0.69 and 0.82. The accuracy increased with the number of TSs. The F1 score, by contrast, was lower than the testing accuracy. This is attributed to the limited number of outcrops available for classification and the imbalance of the sample data, which cause the RF algorithm to predict lithology incorrectly in many cases [17]. Accuracy is appropriate when the class distribution is balanced; with imbalanced classes, the F1 score is preferable because it better reflects the actual classification performance [68].
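The gap between accuracy and F1 score under imbalance can be illustrated with a deliberately poor model. The sketch below reuses the class counts of the 500-TS split in Table 7 as a hypothetical label set (the actual testing set has a different composition) and scores a "classifier" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts of the 500-TS split in Table 7, reused here as a hypothetical label set.
counts = {1: 10, 2: 6, 3: 23, 4: 107, 5: 354}
y_true = np.repeat(list(counts.keys()), list(counts.values()))

# A degenerate model that always predicts the majority class (class 5).
y_pred = np.full_like(y_true, 5)

accuracy = accuracy_score(y_true, y_pred)
# Macro F1 averages the per-class F1 scores; classes that are never predicted score 0.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print("accuracy:", accuracy)
print("macro F1:", macro_f1)
```

The accuracy is about 0.71 even though four of the five classes are never identified, whereas the macro-averaged F1 score is below 0.2.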

The Kappa scores follow the same trend as the testing accuracy: a larger number of TSs yields a higher Kappa score. The imbalanced models that used the oversampling method produced the highest Kappa scores (ranging from 0.32 to 0.59). Kappa scores in this range indicate a minimal to weak level of agreement, with data reliability of 4–35%, which implies that more than 65% of the evaluated data contain errors [66]. The model that produced the highest Kappa score used 400 TSs (random_400_OV). For the balanced (stratified) models using 25 and 50 TSs, the Kappa scores were relatively low (0.27 and 0.30), corresponding to a minimal level of agreement. These were lower than those of the imbalanced models with the same number of TSs that applied the oversampling method. These findings imply that the oversampling method enhanced the classifier’s performance.

Generally, the best F1 scores were yielded by models that applied the oversampling method, ranging from 0.43 to 0.56. These F1 scores were higher than those of the balance class weight model, the baseline in this research, whose F1 scores ranged from 0.37 to 0.52. The oversampling method could thus handle imbalanced data, increasing the F1 score by 0.04–0.06. Specifically, the best model was random_100_OV, which used 100 TSs. This number of TSs was optimal because increasing it up to 500 did not raise the F1 score; instead, the F1 score tended to decrease by approximately 0.03–0.14. This decrease in the F1 score is linked to the IR value.

3.3 Accuracy evaluation of minority class and IR

Table 6 shows that, in general, the class weight tuning method did not perform well in identifying minority classes. The precision and recall scores are 0 in most models. At most one of the three minor classes was identified in each model, and none was identified in the model with 100 TSs (random_100_CS). The classification results indicate that this approach is unsuitable for classifying lithology classes with imbalanced samples. Nevertheless, it identified the major classes, particularly undifferentiated porphyry, which has the highest number of TSs (IR = 1:1) (Table 7) and therefore high precision, recall, and F1 scores. Its F1 score ranges from 0.82 to 0.90. The other major class (sedimentary rocks) has fewer TSs (IR = 3:1), and its performance metric scores were lower than those of undifferentiated porphyry, with an F1 score ranging from 0.26 to 0.65. For both major classes, the more TSs were used, the higher the performance metric scores were.

Table 6

Summary of classification performance metrics for each lithology class

Number of TSs Lithological class Method
Balance class weight Class weight tuning Oversampling
Precision Recall F1 score Precision Recall F1 score Precision Recall F1 score
25 1 0.50 0.25 0.33 0 0 0 0.38 0.62 0.48
2 0.46 0.67 0.55 0 0 0 0.20 0.89 0.33
3 0 0 0 0.14 0.06 0.08 0.33 0.18 0.23
4 0.49 0.17 0.25 0.46 0.18 0.26 0.50 0.19 0.28
5 0.75 0.95 0.83 0.73 0.94 0.82 0.80 0.91 0.85
50 1 0.50 0.12 0.2 0 0 0 0.33 0.38 0.35
2 0.60 0.33 0.43 1 0.11 0.20 0.50 0.89 0.64
3 0.20 0.06 0.09 0 0 0 0.18 0.18 0.18
4 0.53 0.21 0.30 0.54 0.22 0.31 0.61 0.31 0.41
5 0.74 0.95 0.83 0.73 0.95 0.83 0.80 0.92 0.85
100 1 0.50 0.25 0.33 0 0 0 0.22 0.25 0.24
2 1 0.33 0.5 0 0 0 0.67 0.89 0.76
3 0.43 0.35 0.39 0 0 0 0.39 0.41 0.40
4 0.58 0.51 0.54 0.55 0.62 0.58 0.59 0.51 0.55
5 0.82 0.88 0.85 0.83 0.88 0.85 0.84 0.87 0.85
200 1 1 0.12 0.22 1 0.12 0.22 0.33 0.12 0.18
2 0 0 0 0 0 0 0 0 0
3 0.56 0.29 0.38 0 0 0 0.43 0.53 0.47
4 0.67 0.48 0.56 0.60 0.47 0.53 0.66 0.49 0.56
5 0.80 0.94 0.87 0.80 0.94 0.86 0.82 0.92 0.87
300 1 0.67 0.25 0.36 0.50 0.25 0.33 0.50 0.25 0.33
2 1 0.11 0.2 0 0 0 1 0.22 0.36
3 0.41 0.41 0.41 0 0 0 0.36 0.53 0.43
4 0.70 0.50 0.58 0.65 0.56 0.60 0.74 0.58 0.65
5 0.82 0.94 0.87 0.82 0.93 0.87 0.85 0.92 0.88
400 1 1 0.12 0.22 0.5 0.25 0.33 0.67 0.25 0.36
2 1 0.22 0.36 0 0 0 0 0 0
3 0.46 0.35 0.4 0 0 0 0.38 0.53 0.44
4 0.77 0.56 0.65 0.72 0.59 0.65 0.77 0.60 0.67
5 0.84 0.97 0.90 0.84 0.97 0.90 0.86 0.95 0.90
500 1 0.67 0.25 0.36 0 0 0 0.5 0.25 0.33
2 0.67 0.22 0.33 0 0 0 1 0.11 0.2
3 0.37 0.41 0.39 0.5 0.35 0.41 0.37 0.59 0.45
4 0.78 0.52 0.63 0.74 0.59 0.65 0.77 0.56 0.65
5 0.84 0.96 0.90 0.84 0.96 0.90 0.86 0.95 0.90

Description: Class 1: inferred porphyry, Class 2: pseudo gossan, Class 3: quaternary alluvium, Class 4: sedimentary rocks, Class 5: undifferentiated porphyry; italic number: the lowest value in each group, bold number: the highest value in each group.

Table 7

IR and distribution of TSs

Lithological class   25 TSs       50 TSs       100 TSs      200 TSs      300 TSs      400 TSs      500 TSs
                     n    IR      n    IR      n    IR      n    IR      n    IR      n    IR      n    IR
1                    2    7:1     2    16:1    4    16:1    5    28:1    6    35:1    7    40:1    10   35:1
2                    2    7:1     2    16:1    2    32:1    2    70:1    3    69:1    4    70:1    6    59:1
3                    2    7:1     3    10:1    8    8:1     12   12:1    16   13:1    19   15:1    23   15:1
4                    5    3:1     12   3:1     22   3:1     41   3:1     67   3:1     89   3:1     107  3:1
5                    14   1:1     31   1:1     64   1:1     140  1:1     208  1:1     281  1:1     354  1:1

Description: n = number of TSs in each class, IR = imbalance ratio.
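The IR values in Table 7 are consistent with dividing the number of TSs in the largest class by the number of TSs in each class and rounding to the nearest integer. The sketch below applies that reading, which is an inference from the table rather than a definition quoted from the text, to the 25-TS and 500-TS splits; the function name is illustrative.

```python
import math

def imbalance_ratios(class_counts):
    """Imbalance ratio per class: largest class count divided by the class count, rounded."""
    majority = max(class_counts.values())
    # floor(x + 0.5) rounds halves upward, e.g. 15.5 -> 16.
    return {label: math.floor(majority / n + 0.5) for label, n in class_counts.items()}

# Numbers of TSs per lithological class, taken from Table 7.
counts_25 = {1: 2, 2: 2, 3: 2, 4: 5, 5: 14}
counts_500 = {1: 10, 2: 6, 3: 23, 4: 107, 5: 354}

for name, counts in [("25 TSs", counts_25), ("500 TSs", counts_500)]:
    ratios = imbalance_ratios(counts)
    print(name, {label: f"{ir}:1" for label, ir in ratios.items()})
```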

Generally, the balance class weight method yielded good results. Most models identified the three minor classes (inferred porphyry, pseudo gossan, and quaternary alluvium), except for the models using 25 (random_25_bal) and 200 (random_200_bal) TSs, which did not identify quaternary alluvium and pseudo gossan, respectively. Almost all classes have higher precision than recall; some even have a precision score of 1, indicating no false positives, combined with a low recall score (many false negatives).

In general, the oversampling method provided the best classification performance for minority classes. Although the models with 200 (random_200_OV) and 400 (random_400_OV) TSs did not identify pseudo gossan, this technique generally enhanced the F1 score of the minority classes, and the overall F1 score increased compared with the balance class weight method used as the baseline. The quaternary alluvium class, which was not identified by the balance class weight method using 25 TSs (random_25_bal), was identified by the oversampling method (random_25_OV). The models with 200 (random_200_OV) and 400 (random_400_OV) TSs, in which pseudo gossan (IR = 70:1) was not identified, showed that the oversampling method cannot always identify minor classes with very high IR values. Other minority classes with an IR ≤ 40:1 could be identified. Quaternary alluvium was the minority class whose F1 score improved consistently across all models. This performance was possibly associated with the number of TSs in this class, which was larger than in the other two minor classes and gave a lower IR value, implying that the classifier’s performance was linked to each class’s IR value. Zhu et al. [69] reported that datasets with the same IR can show distinct classification performances when their dimensionalities vary, rendering IR suboptimal for reflecting the extent of imbalance for classification. The classification performance improved with more discriminatory features.
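The effect of oversampling on the class composition can be sketched as follows. This example assumes simple random oversampling with replacement, which may differ from the oversampling variant actually used in the study; the feature values are random placeholders, and only the class counts (the 25-TS split of Table 7) come from the text. The function name is hypothetical.

```python
import numpy as np
from sklearn.utils import resample

def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows at random until every class matches the largest class."""
    X, y = np.asarray(X), np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for i, (label, count) in enumerate(zip(labels, counts)):
        X_cls = X[y == label]
        if count < target:
            # Draw the missing rows with replacement from the existing rows of this class.
            extra = resample(X_cls, replace=True, n_samples=target - count,
                             random_state=random_state + i)
            X_cls = np.vstack([X_cls, extra])
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), label))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Class counts of the 25-TS split (Table 7); the features are random placeholders.
counts = {1: 2, 2: 2, 3: 2, 4: 5, 5: 14}
y = np.repeat(list(counts.keys()), list(counts.values()))
X = np.random.default_rng(0).normal(size=(len(y), 4))

X_os, y_os = random_oversample(X, y)
labels_os, counts_os = np.unique(y_os, return_counts=True)
print(dict(zip(labels_os.tolist(), counts_os.tolist())))
```

Oversampling of this kind would normally be applied to the training samples only, so that the testing samples keep their original composition.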

3.4 Visual comparative assessment

Figures 5 and 6 show the results of the visual comparative assessment. The RF classification result was overlaid with lithological boundaries from the existing map as a visual assessment reference. Three criteria were used: (i) the ability to detect all lithology classes; (ii) the ability to classify correctly with regard to the position and coverage area of each class; and (iii) the ability to detect lithological boundaries. To support this visual comparison, we also counted the incorrectly identified pixels, as displayed in Tables 8 and 9.

Figure 5

Visualization of classifications from balanced models using (a) 25, (b) 50 TSs, and (c) the existing lithological map.

Figure 6

Visualization of classifications from imbalanced models using a different number of TSs. (a) Random_25_bal, (b) Random_25_CS, (c) Random_25_OV, (d) Random_50_bal, (e) Random_50_CS, (f) Random_50_OV, (g) Random_100_bal, (h) Random_100_CS, (i) Random_100_OV, (j) Random_200_bal, (k) Random_200_CS, (l) Random_200_OV, (m) Random_300_bal, (n) Random_300_CS, (o) Random_300_OV, (p) Random_400_bal, (q) Random_400_CS, (r) Random_400_OV, (s) Random_500_bal, (t) Random_500_CS, and (u) Random_500_OV.

Table 8

Summary of misclassification pixel for each lithology class of balanced model

TS Lithological class Pixel Error (%) OE (%)
25 1 6,484 59.83 55.77
2 31 5.07
3 13,507 28.62
4 83,582 72.27
5 15,950 39.79
50 1 6,347 58.56 56.04
2 8 1.31
3 13,219 28.01
4 79,872 69.06
5 20,697 51.64

Description: OE: overall error.

Table 9

Summary of misclassification pixel for each lithology class

TS Lithological class BAL (pixel) Error (%) OE (%) CW (pixel) Error (%) OE (%) OV (pixel) Error (%) OE (%)
25 1 8,679 80.08 70.74 10,838 100.00 70.55 5,958 54.97 70.09
2 454 74.30 611 100.00 139 22.75
3 41,951 88.89 38,607 81.81 39,697 84.12
4 97,519 84.32 98,318 85.01 97,005 83.88
5 3,047 7.60 2,873 7.17 7,464 18.62
50 1 8,999 83.03 62.98 10,587 97.68 67.92 3,775 34.83 59.78
2 463 75.78 571 93.45 67 10.97
3 40,181 85.14 45,461 96.33 39,165 82.99
4 82,238 71.11 86,228 74.56 79,406 68.66
5 3,131 7.81 2,750 6.86 5,747 14.34
100 1 8,298 76.56 43.69 10,837 99.99 44.79 6,158 56.82 44.64
2 522 85.43 611 100.00 255 41.73
3 31,170 66.05 47,191 100.00 33,723 71.46
4 46,236 39.98 29,448 25.46 46,139 39.90
5 7,439 18.56 7,925 19.77 9,431 23.53
200 1 10,327 95.29 45.90 10,813 99.77 50.28 9,681 89.32 43.06
2 604 98.85 611 100.00 603 98.69
3 33,937 71.91 47,192 100.00 26,193 55.50
4 47,359 40.95 43,833 37.90 48,326 41.79
5 6,160 15.37 5,340 13.32 7,516 18.75
300 1 9,344 86.22 38.61 9,898 91.33 43.28 9,465 87.33 34.83
2 580 94.93 611 100.00 500 81.83
3 30,578 64.79 47,191 100.00 19,244 40.78
4 36,398 31.47 29,698 25.68 38,595 33.37
5 5,862 14.63 5,377 13.41 6,872 17.14
400 1 10,038 92.62 35.74 9,250 85.35 38.17 9,483 87.50 31.76
2 572 93.62 610 99.84 588 96.24
3 32,876 69.66 47,140 99.89 22,360 47.38
4 28,219 24.40 20,378 17.62 29,387 25.41
5 4,909 12.25 4,454 11.11 6,277 15.66
500 1 9,370 86.46 31.93 10,838 100.00 35.11 9,682 89.33 29.46
2 554 90.67 593 97.05 593 97.05
3 28,198 59.75 37,193 78.81 18,116 38.39
4 24,845 21.48 21,363 18.47 29,104 25.17
5 5,483 13.68 5,269 13.15 5,653 14.10

Description: BAL: balance class weight, CW: class weight tuning, OV: oversampling, OE: overall error.

Figure 5a and b demonstrate that the balanced models can identify all lithology types. In both models, the pseudo gossan and inferred porphyry classes are identified over a reasonably large area. However, the area of these two classes does not match their area on the existing lithological map. This mismatch occurs in all classes; Table 8 shows the percentage error in the classification results. Moreover, the visualization shows the effect of false positives for the pseudo gossan and inferred porphyry classes, which results in low precision scores and high recall scores in these models (Table 5). Ideally, these two metrics should have balanced values. However, because the testing points in the southern region are few and unevenly distributed, the two values differ (Figure 4j).

The visualizations in Figures 5 and 6 reflect the performance metric scores in Tables 5, 6, and 9. The class weight tuning method gave poor visualization results, in line with the performance metric scores of the minority classes (Tables 6 and 9). This method clearly shows the effect of the weight of a lithology class on the classifier’s ability to identify minority classes. A lithology class becomes difficult to detect at low weight values, e.g., inferred porphyry in the 25, 100, or 500 TS models. If the lithology class receives a higher weight, it is easier to detect (e.g., quaternary alluvium in the 25, 50, or 500 TS models). Nonetheless, a high weight does not always make a lithology class easy to detect, as can be observed for the inferred porphyry class in the 200, 300, or 400 TS models. In these three models, no significant difference is seen between the inferred porphyry class, which has a high weight, and the pseudo gossan and quaternary alluvium classes, which have low weights. This suggests that the weight of a class must be proportional to the weights of the other classes so that the classification results are not biased toward the class with a high weight. Therefore, it can be assumed that the weights of the minor or major classes must satisfy specific proportions based on the population [70] or be inversely proportional to the frequency of the class [71]. Additional research is required to establish the relationship between the weights and the weight ratios between lithology classes.

The balance class weight method provided better visualization than the class weight tuning method. The classification results of the 25 (random_25_bal) and 50 (random_50_bal) TS models indicate that the classifier can identify all lithology classes. However, the coverage of each lithology class does not correspond to the coverage of the same class on the existing map. All lithology classes were adequately identified in the 100, 300, 400, and 500 TS models. Nonetheless, this method cannot fully identify the minor classes when compared with the existing lithological map. For instance, in the model with 100 TSs, the pseudo gossan class had only 2 TSs but a fairly low IR value (32:1). In the models with 300, 400, and 500 TSs, this class was identified, although only over a small area. The classifier could identify the pseudo gossan class in these models because the number of TSs for this class was slightly higher (3, 4, and 6, respectively), which made the impurity value slightly larger than in the model with 200 TSs, even though the IR values were similar (69:1, 70:1, and 59:1, respectively). In this regard, Zhu et al. [69] explained that classification performance can vary between datasets with the same IR when their dimensionalities differ. Hence, IR is not the best way to express the degree of imbalance for classification because it does not consider dimensionality.

The oversampling method offered better visual classifications than the other two methods when evaluated against the three criteria described earlier. The visualization of the classifications shows that the models with 25, 50, 100, 300, and 500 TSs identified all lithology classes. In the oversampling model with 100 TSs, the position and area of the lithology classes appeared adequate. The undifferentiated porphyry and sedimentary rock classes appear to be less misclassified in many locations. In this model, quaternary alluvium was lacking in the central and southeast regions; however, the misclassification of this class decreases as the number of TSs increases. The best classification in terms of position accuracy and lithology class area was generally obtained by the model with 500 TSs, although the pseudo gossan class was not correctly identified. The best identification of the pseudo gossan class was in the model with 100 TSs.

The visual analysis demonstrated that lithological boundaries were accurately identified from the model with 100 TSs onward. The classification error for the oversampling method decreases consistently as the number of TSs increases, as shown in Table 9. The improvement in the classifier’s ability to identify the lithology classes is attributed to the use of geophysical data. Integrating geophysical and remote sensing data is crucial for producing reliable lithological maps when limited TSs are available [41]. The sedimentary rocks were generally well classified because this class has high resistivity and low magnetite, in contrast to the other four classes. This result is consistent with Zhu et al. [69], who state that providing more discriminatory features can improve classification performance. The model with 500 TSs gives the best result in identifying lithological boundaries, although the pseudo gossan class was not properly identified.

The oversampling results revealed a relationship between the IR and the classifier’s ability to identify minority classes. When the IR of a minority class was sufficiently low (the numbers of TSs in the majority and minority classes did not differ greatly), the oversampling method led the classifier to treat the minority class as though it were in a balanced state [63]. Conversely, when the IR was high (the numbers of TSs in the majority and minority classes differed greatly), the minority class was incorrectly identified.

4 Discussion

The application of the oversampling method with limited and imbalanced TSs in this study has yielded good results. This is understandable because oversampling leads the classifier to perceive the imbalanced data as balanced data [63]. Nevertheless, the classification results demonstrated that the oversampling method could not identify all classes (the pseudo gossan class was not identified in the 200 and 400 TS models, Table 6). This raises the question of why this occurs. Three factors require investigation: (i) the proportion of TSs to the coverage area of each lithology class, (ii) the IR value, and (iii) class overlapping or class complexity, which is part of the intrinsic characteristics [63,72].

Concerning the proportion of TSs, Qian [70] reported that in stratified sampling (balanced distribution), the number of TSs is sometimes disproportionate. This proportion refers to the total population of each lithology class. Meanwhile, Noorhalim et al. [30] stated that optimal classification performance requires a balanced distribution and an appropriate number of samples, so that the TSs provide more information for the learning process. Hence, when classifying areas with multiple classes, we must strive to fulfill both conditions. Examples include the stratified distribution classification using 25 and 50 TSs (Figure 5) and the oversampling models using 25 and 50 TSs (Figure 6c and f).

The need for a proportional number of TSs is reflected in the IR value. Figure 6 and Table 7 demonstrate the influence of IR on the quality of the classifications. For a minor class to be identified appropriately, its number of TSs should not be too small and its IR value should not be too high. The minor class should have enough TSs to enable the decision trees to develop so that the class is detected proportionally [70,71]. This condition occurred in the oversampling method with 100 TSs (Figure 6i), where the classifications of pseudo gossan and inferred porphyry show the coverage most similar to the existing lithological map. In other models, the coverage of these two classes became smaller, and even disappeared in some locations, when the number of TSs and the IR value increased (Figures 6r and u). Hence, for the classifier to detect a lithology class accurately, it is recommended to provide a substantial number of TSs [70], proportional to the coverage area, with an appropriate (not too high) IR value. Additional research is required to verify the relationship between the number of TSs, IR, and coverage area.

Intrinsic characteristics are crucial for applying existing techniques and developing new ones to deal with imbalanced data [23,63]. One of the intrinsic characteristics that greatly influences the identification of lithology classes is class overlapping or class complexity. This can be seen in the area of the inferred porphyry class, between the quaternary alluvium and undifferentiated porphyry classes, located in the west-central region. Because of this class overlap, the identification of minor classes in this region is always difficult, which results in low classification accuracy [23]. Further study is therefore needed to treat this area separately by carrying out a partial classification, so that the TSs for the inferred porphyry class are not isolated in a narrow area surrounded by the TSs of other classes.

5 Conclusion

The problem of imbalanced data is often encountered when using MLAs in lithology classification. The results demonstrated that the oversampling method, hybridized with the RF algorithm, generally outperformed the other two methods with the following performance metric values: 1.00 (training accuracy), 0.70–0.82 (testing accuracy), 0.43–0.56 (F1 score), and 0.32–0.59 (Kappa score). Additionally, the visual comparative assessment showed that the oversampling method produced the classification result closest to the existing map regarding the class types, coverage areas, and boundaries.

The IR plays a significant role in the classification. Adding TSs improved the accuracy of the classifications, whereas an increased IR in a minor class affected the classifier’s ability to identify that class. The IR therefore cannot be too large and must be proportional to the population. Hence, there is an association between the class identification ability and the number of TSs in proportion to the coverage area of each lithology class.

Considering the limitations of our study, we want to explore several aspects in future work. As a starting point, it would be worthwhile to investigate the influence of learning strategies on several classifiers. As stated previously, we selected the RF algorithm for our study since it has been shown to be a viable solution for unbalanced classification problems in several areas. Nevertheless, alternative options may also be explored to examine how various classifiers and learning strategies handle imbalanced data samples, especially with limited TSs. Since the CSL method was unable to provide good results in this study, we aim to further investigate how class weights are determined as a penalty for misclassification costs, which is the basis for the effectiveness of the CSL method in identifying classes correctly. Likewise, although the oversampling method yielded the best results in this research, its performance metric values remain low. Further studies must therefore be carried out on this model to improve its ability to recognize lithology classes.

Acknowledgments

The authors wish to acknowledge PT. Eksplorasi Nusa Jaya and Mine Serve International for their support of this research. This research was partially funded by Institut Teknologi Nasional Bandung, Indonesia.

  1. Funding information: This research was partially funded by Institut Teknologi Nasional Bandung, Indonesia (Grant No.: 077/G.22.01/Rektorat/Itenas/VII/2022).

  2. Author Contributions: Conceptualization, H.N.; methodology, H.N.; software, S.B. and A.S.; validation, H.N., S.B. and A.S.; formal analysis, H.N. and K.W.; investigation, H.N., S.B. and A.S.; resources, K.W., S.B. and A.S.; data curation, H.N., K.W. and S.B.; writing – original draft preparation, H.N.; writing – review and editing, H.N., S.B., and A.S.; visualization, H.N.; supervision, K.W.; project administration, K.W. and H.N.; funding acquisition, H.N. All authors have read and agreed to the published version of the manuscript.

  3. Conflict of interest: The authors declare no conflict of interest.

  4. Data availability statement: DEMNAS data are available at the Geospatial Information Agency, Republic of Indonesia, and geophysical data are owned by PT. Eksplorasi Nusa Jaya. Requests for both types of data can be addressed to each organization.

References

[1] Merembayev T, Kurmangaliyev D, Bekbauov B, Amanbek Y. A comparison of machine learning algorithms in predicting lithofacies: Case studies from Norway and Kazakhstan. Energies. 2021;14:1–16.Search in Google Scholar

[2] Xi Y, Taha AMM, Hu A, Liu X. Accuracy comparison of various remote sensing data in lithological classification based on random forest algorithm. Geocarto Int. 2022;37(26):14451–79. 10.1080/10106049.2022.2088859.Search in Google Scholar

[3] Zhou K, Zhang J, Ren Y, Huang Z, Zhao L. A gradient boosting decision tree algorithm combining synthetic minority oversampling technique for lithology identification. Geophysics. 2020;85(4):WA147–58.Search in Google Scholar

[4] De Araújo Neto JF, Santos GL, De Albuquerque E, Souza IMB, De Brito Barreto S, De Lira Santos LCM, et al. Integration of remote sensing, airborne geophysics and structural analysis to geological mapping: A case study of the Vieirópolis region, Borborema Province, NE Brazil. Geol USP - Ser Cient. 2018;18(3):89–103.Search in Google Scholar

[5] Harvey AS, Fotopoulos G. Geological mapping using machine learning algorithms. Int Arch Photogramm Remote Sens Spat Inf Sci - ISPRS Arch. 2016;41(July):423–30. https://ui.adsabs.harvard.edu/abs/2016ISPAr41B8.423H.Search in Google Scholar

[6] Kuhn S, Cracknell MJ, Reading AM. Lithological mapping in the Central African Copper Belt using Random Forests and clustering: Strategies for optimised results. Ore Geol Rev. 2019;112:103015. 10.1016/j.oregeorev.2019.103015.Search in Google Scholar

[7] Kuhn S, Cracknell MJ, Reading AM, Sykora S. Case history identification of intrusive lithologies in volcanic terrains in British Columbia by machine learning using random forests: The value of using a soft classifier. Geophysics. 2020;85(6):235–44.Search in Google Scholar

[8] Halotel J, Demyanov V, Gardiner A. Value of geologically derived features in machine learning facies classification. Math Geosci. 2020;52(1):5–29. 10.1007/s11004-019-09838-0.Search in Google Scholar

[9] Li G, Zheng Y, Li Y, Wu W, Hong Y, Zhou X. Recognition of stratum lithology of seismic facies based on deep belief network. 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE2016); 2016. p. 354–7.Search in Google Scholar

[10] Fuentes I, Padarian J, Iwanaga T, Vervoort RW. 3D lithological mapping of borehole descriptions using word embeddings. Comput Geosci. 2020;141:32. 10.1016/j.cageo.2020.104516.Search in Google Scholar

[11] Onan A. Hybrid supervised clustering based ensemble scheme for text classification Abstract. Kybernetes. 2017;46(2):330–48.Search in Google Scholar

[12] Onan A, Korukoğlu S, Bulut H. LDA-based topic modelling in text sentiment classification: An empirical analysis. Int J Comput Linguist Appl. 2016;7(1):101–19.Search in Google Scholar

[13] Onan A, Korukoǧlu S, Bulut H. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst Appl. 2016;57:232–47.Search in Google Scholar

[14] Onan A, Korukoğlu S, Bulut H. A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Inf Process Manag. 2017;53(4):814–33.Search in Google Scholar

[15] Ao Y, Zhu L, Guo S, Yang Z. Probabilistic logging lithology characterization with random forest probability estimation. Comput Geosci. 2020;144:104556. 10.1016/j.cageo.2020.104556.Search in Google Scholar

[16] Kuhn S, Cracknell MJ, Reading AM. The utility of machine learning in identification of key geophysical and geochemical datasets: A case study in lithological mapping in the Central African Copper Belt. ASEG Ext Abstr. 2018;1:1–4.Search in Google Scholar

[17] Kuhn S, Cracknell MJ, Reading AM. Lithological mapping using Random Forests applied to geophysical and remote sensing data: A demonstration study from the Eastern Goldfields of Australia. Geophysics. 2018;84(4):1–37.Search in Google Scholar

[18] Wenhua W, Zhuwen W, Ruiyi H, Fanghui X, Xinghua Q, Yitong C. Lithology classification of volcanic rocks based on conventional logging data of machine learning: A case study of the eastern depression of Liaohe oil field. Open Geosci. 2021;13:1245–58.Search in Google Scholar

[19] Breiman L. Random forests. Mach Learn J Pap. 2001;45:1–33.Search in Google Scholar

[20] Cracknell MJ, Reading AM. The upside of uncertainty: Identification of lithology contact zones from airborne geophysics and satellite data using random forests and support vector machines. Geophysics. 2013;78(3):113–26.Search in Google Scholar

[21] Cracknell MJ, Reading AM. Geological mapping using remote sensing data: A comparison of five machine learning algorithms, their response to variations in the spatial distribution of training data and the use of explicit spatial information. Comput Geosci. 2014;63:22–33. 10.1016/j.cageo.2013.10.008.Search in Google Scholar

[22] Harris JR, Grunsky EC. Predictive lithological mapping of Canada’s North using Random Forest classification applied to geophysical and geochemical data. Comput Geosci. 2015;80(July):9–25. 10.1016/j.cageo.2015.03.013.Search in Google Scholar

[23] Ali A, Shamsuddin SM, Ralescu AL. Classification with class imbalance problem: A review. Int J Adv Soft Comput Appl. 2015;7(3):176–204.Search in Google Scholar

[24] Fernández A, García S, Galar M, Prati RC. Learning from imbalanced data sets. Springer Nature Switzerland; 2018. p. 377.Search in Google Scholar

[25] Krawczyk B. Learning from imbalanced data: Open challenges and future directions. Prog Artif Intell. 2016;5(4):221–32.Search in Google Scholar

[26] Thabtah F, Hammoud S, Kamalov F, Gonsalvesv AH. Data imbalance in classification: experimental evaluation. Inf Sci (NY). 2019;513:429–41. 10.1016/j.ins.2019.11.004.Search in Google Scholar

[27] Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst Appl. 2017;73:220–39. 10.1016/j.eswa.2016.12.035.Search in Google Scholar

[28] Loyola-González O, Martínez-Trinidad JF, Carrasco-Ochoa JA, García-Borroto M. Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing. 2016;175:935–47. 10.1016/j.neucom.2015.04.120.Search in Google Scholar

[29] Weiss GM. Foundations of imbalanced learning. In: He H, Ma Y, editors. Imbalanced learning: Foundations, algorithms, and applications. Berlin, Germany: John Wiley & Sons; 2013. p. 216.Search in Google Scholar

[30] Noorhalim N, Ali A, Shamsuddin SM. Handling imbalanced ratio for class imbalance problem using SMOTE. Proceedings of the Third International Conference on Computing, Mathematics and Statistics (iCMS2017). Springer Nature Singapore; 2019. p. 19–30.Search in Google Scholar

[31] Ortigosa-Hernández J, Inza I, Lozano JA. Measuring the class-imbalance extent of multi-class problems. Pattern Recognit Lett. 2017;98:32–8.Search in Google Scholar

[32] Sun Y, Wong AKC, Kamel MS. Classification of imbalanced data: A review. Int J Pattern Recogn Artif Intell. 2009;23(4):687–719. 10.1142/S0218001409007326.Search in Google Scholar

[33] Menardi G, Torelli N. Training and assessing classification rules with imbalanced data. Data Min Knowl Discov. 2014;28:92–122.Search in Google Scholar

[34] López V, Fernández A, Moreno-Torres JG, Herrera F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst Appl. 2012;39(7):6585–608.Search in Google Scholar

[35] Karlhede A. Tackling imbalanced data in random forest to predict free-to-fee transitions of a subscription. Stockholm, Sweden: KTH Royal Institute of Technology; 2020.Search in Google Scholar

[36] Sinha S, Ohashi H. Class-wise difficulty-balanced loss for solving class-imbalance. Computer Vision – ACCV 2020; 2020. p. 1–17.Search in Google Scholar

[37] Makienko D, Seleznev I, Safonov I. The effect of the imbalanced training dataset on the quality of classification of lithotypes via whole core photos. In: Fursov V, Goshin Y, Kudryashov D, editors. The VI International Conference Information Technology and Nanotechnology. Samara, Russia: CEUR-WS; 2020. p. 132–6.Search in Google Scholar

[38] Weiss G, McCarthy K, Zabar B. Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? Proceedings of the 2007 International Conference on Data Mining, DMIN 2007, June 25-28, 2007. Las Vegas, Nevada, USA; 2007. p. 1–7. http://storm.cis.fordham.edu/∼gweiss/papers/dmin07-weiss.pdf.Search in Google Scholar

[39] Kaewwichian P. Multiclass classification with imbalanced datasets for car ownership demand model – Cost-sensitive learning. Promet–Traffic Transp. 2021;33(3):361–71.Search in Google Scholar

[40] He J, Harris JR, Sawada M, Behnia P. A comparison of classification algorithms using Landsat-7 and Landsat-8 data for mapping lithology in Canada’s Arctic. Int J Remote Sens. 2015;36(8):2252–76.Search in Google Scholar

[41] Costa I, Tavares F, Oliveira J. Predictive lithological mapping through machine learning methods: a case study in the Cinzento Lineament, Carajás Province, Brazil. J Geol Surv Braz. 2019;2(1):26–36.Search in Google Scholar

[42] Harris JR, Juan HX, Rainbird RH, Behnia P. Remote predictive mapping 6: A comparison of different remotely sensed data for classifying bedrock types in Canada’s Arctic: Application of the robust classification method and Ra. Geosci Can. 2014;41(December):557–84.Search in Google Scholar

[43] Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and regression trees. 1st edn. New York: Taylor & Francis; 1984. p. 368.Search in Google Scholar

[44] Mine Serve International. Geological Map Scale of 1:25.000. 2nd edn. Komopa, Papua, Indonesia; 2000.Search in Google Scholar

[45] Skead MB. 1994-1996 Fieldwork in Komopa-Dawagu area, general synthesis. Jakarta, Indonesia: Nabire Bakti Mining; 1996. Search in Google Scholar

[46] Glover JK. The Structural and Lithological Setting, Controls of Mineralization and Potential in the Area of The Komopa-Dawagu Prospects, NBM BLOCK II. Jakarta, Indonesia: Mine Serve International; 1999.Search in Google Scholar

[47] Moore CB. Interpretation of The 1993 Irian jaya airborne geophysical surveys. Jakarta, Indonesia: Nabire Bakti Mining; 1994.Search in Google Scholar

[48] Google Map [Internet]; 2022 [cited 2022 Feb 22]. https://www.google.co.id/maps/@-3.7555498,136.5555741,46322m/data=!3m1!1e3?hl=en.Search in Google Scholar

[49] Satimagingcorp. Sentinel-2A (10m) Satellite Sensor [Internet]; 2022 [cited 2022 Aug 31]. p. 3. https://www.satimagingcorp.com/satellite-sensors/other-satellite-sensors/sentinel-2a/.Search in Google Scholar

[50] European Space Agency. Sentinel 2A [Internet]; 2019 [cited 2019 Dec 10]. https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/product- types/level-2a.Search in Google Scholar

[51] L3Harris Geospatial Solution. Vegetation Suppression [Internet]. L3Harris Geospatial; 2020 [cited 2020 Aug 1]. p. 2–4. https://www.l3harrisgeospatial.com/docs/vegetationsuppression.htmlSearch in Google Scholar

[52] Geospatial Information Agency-Republic of Indonesia. DEMNAS Seamless Digital Elevation Model (DEM) dan Batimetri Nasional [Internet]; 2018 [cited 2019 Mar 20]. http://tides.big.go.id/DEMNAS/#Info.Search in Google Scholar

[53] Bannari A, El-Battay A, Saquaque A, Miri A. PALSAR-FBS L-HH mode and landsat-TM data fusion for geological mapping. Adv Remote Sens. 2016;5(4):246–68.Search in Google Scholar

[54] European Space Agency. SNAP [Internet]; 2022 [cited 2022 Sep 16]. https://earth.esa.int/eogateway/tools/snapSearch in Google Scholar

[55] European Space Agency. Level-1 radiometric calibration [Internet]; 2020 [cited 2020 Apr 10]. https://sentinel.esa.int/web/sentinel/radiometric-calibration-of-level-1-productsSearch in Google Scholar

[56] Ottinger M, Kuenzer C. Spaceborne L-band synthetic aperture radar data for geoscientific analyses in coastal land applications: A review. Remote Sens. 2020;12(14):1–36. 10.3390/rs12142228.Search in Google Scholar

[57] GeoSci. Electromagnetic Data Processing [Internet]; 2018 [cited 2022 Feb 6]. https://em.geosci.xyz/content/case_histories/bookpurnong/processing.html.Search in Google Scholar

[58] Scikitlearn. GridSearchCV [Internet]; 2020 [cited 2021 Jun 10]. p. 1–7. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.Search in Google Scholar

[59] Tyralis H, Papacharalampous G, Langousis A. A brief review of random forests for water scientists and practitioners and their recent history inwater resources. Water (Switzerland). 2019;11(5):910.Search in Google Scholar

[60] Probst P, Wright M, Boulesteix A-L. Hyperparameters and tuning strategies for random forest. WIREs Data Min Knowl Discov. 2019;9:1–19. 10.1002/widm.1301.Search in Google Scholar

[61] Scikitlearn. Sklearn.ensembleRandomForestClassifier [Internet]; 2020 [cited 2020 Jan 20]. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.Search in Google Scholar

[62] Breiman L. Bagging predictors. Mach Learn. 1996;140:123–40.Search in Google Scholar

[63] Brownlee J. Imbalanced classification with Python: Choose better metrics, balance skewed classes, and apply cost-sensitive learning [Internet]. v1.2. Machine Learning Mastery; 2020. https://machinelearningmastery.com/imbalanced-clas. https://machinelearningmastery.com/imbalanced-classification-with-python/.Search in Google Scholar

[64] Mohamed IM, Mohamed S, Mazher I, Chester P. Formation lithology classification: insights into machine learning methods. In SPE Annual Technical Conference and Exhibition. Calgary, Alberta, Canada: Society of Petroleum Engineers; 2019. 10.2118/196096-MS.Search in Google Scholar

[65] Zhang C, Wen H, Liao M, Lin Y, Wu Y, Zhang H. Study on machine learning models for building resilience evaluation in mountainous area: A Case Study of Banan District, Chongqing, China. Sensors. 2022;22(3):1163.Search in Google Scholar

[66] McHugh ML. Lessons in biostatistics Interrater reliability: The kappa statistic. Biochem Medica. 2012;22(3):276–82.Search in Google Scholar

[67] Shebl A, Kusky T, Csámer Á. Advanced land imager superiority in lithological classification utilizing machine learning algorithms. Arab J Geosci. 2022;15(923):1–13. 10.1007/s12517-022-09948-w.Search in Google Scholar

[68] Tischio RM, Weiss GM. Identifying classification algorithms most suitable for imbalanced data. Bronx, New York, USA: Dept. of Computer & Info. Science Fordham University; 2019.Search in Google Scholar

[69] Zhu R, Guo Y, Xue JH. Adjusting the imbalance ratio by the dimensionality of imbalanced data. Pattern Recognit Lett. 2020;133:217–23.Search in Google Scholar

[70] Qian J. Sampling. In: Peterson P, Baker E, McGaw B, editors. International Encyclopedia of Education. 3rd edn. Amsterdam: Elsevier; 2010. p. 390–5. https://doi.org/10.1016/B978-0-08-044894-7.01719-X.Search in Google Scholar

[71] Fernando KRM, Tsokos CP. Dynamically weighted balanced loss: Class imbalanced learning & confidence calibration of deep neural networks. IEEE Trans Neural Netw Learn Syst. 2022;33(7):2940–51. 10.1109/TNNLS.2020.3047335.Search in Google Scholar

[72] Ali A, Shamsuddin SM, Ralescu A. Classification with class imbalance problem: A review. Int J Adv Softw Comput Appl. 2013;5(3):31.Search in Google Scholar

Received: 2022-12-05
Revised: 2023-05-04
Accepted: 2023-05-05
Published Online: 2023-08-04

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 8.9.2025 from https://www.degruyterbrill.com/document/doi/10.1515/geo-2022-0487/html