A novel workflow for shale lithology identification – A case study in the Gulong Depression, Songliao Basin, China

Liying Xu; Ruiyi Han; Xuehong Yan; Xue Han; Zhenlin Li; Hui Wang; Linfu Xue; Yuhang Guo; Xiuwen Mo

doi:10.1515/geo-2022-0672

Article Open Access


Published/Copyright: August 21, 2024


From the journal Open Geosciences Volume 16 Issue 1

Abstract

The identification of shale lithology is of great importance for the exploration and development of shale reservoirs. The lithology and mineralogical composition of shale are closely related, but a small number of laboratory core analysis samples are insufficient to evaluate the lithology of the entire formation. In this study, a lithology identification method using conventional logging curves is proposed for the shale stratigraphy of the Qingshankou Formation in the Gulong Depression of the Songliao Basin, northeastern China. First, a mineral pre-training model is constructed using discrete petrophysical experimental data with logging data, and features are generated for the logging data. Second, an adaptive multi-objective swarm crossover optimization method is employed to address the imbalance of logging data. Finally, the model is combined with a Bayesian gradient boosting algorithm for lithology identification. The proposed method demonstrates superior performance to eXtreme Gradient Boosting, Support Vector Machines, Multilayer Perceptron, and Random Forest in terms of accuracy, weight perspective, and macro perspective evaluation indexes. The method has been successfully applied in actual wells, with excellent results. The results indicate that the workflow is a reliable means of shale lithology identification.

Keywords: shale; shale oil; lithology identification; machine learning; Bayesian gradient boosting algorithm

1 Introduction

Terrestrial shale reservoirs are a focus of unconventional reservoir development. In China, terrestrial shales are concentrated in Triassic, Cretaceous, and Paleocene strata. They are characterized by diverse lithology types, frequent vertical lithology changes, and strong reservoir heterogeneity. Shale lithology carries geological information such as mineral composition and sedimentary background. Therefore, accurate identification of shale lithology is of great significance for shale reservoir evaluation and development [1,2,3]. Laboratory core analysis, for example by scanning electron microscopy or X-ray diffraction (XRD), is the most accurate approach to lithology identification. However, its high cost and the limited availability of core make it difficult to analyze the lithology of an entire formation [4]. Well logging, by contrast, has become an effective means of lithology identification because its data are continuous and have high resolution [5].

Regarding the use of logging data to identify shale, Mulhern et al. combined logging data with petrophysical experiments to define electrofacies and used these electrofacies to identify shale lithology in the Upper Monterey and Reef Ridge formations [6]. The cross-plot method is the classical approach to lithology recognition, but its results are often unsatisfactory.

Machine learning methods handle nonlinear problems well and have been widely used for shale lithology identification. Wang and Carr proposed an artificial neural network-based model for identifying shale facies from mineral composition in the Appalachian Basin [7]. Bhattacharya et al. combined self-organizing maps, support vector machines, multi-resolution map-based clustering techniques, and artificial neural networks for lithological identification of Devonian shale strata in North America [8]. Han et al. used a BP neural network model with 23 logging curves to predict the lithology of terrestrial shales in the Dongying and Jiyang depressions of the Bohai Bay Basin [9]. Wang et al. proposed a shale lithology identification method that combines a Hidden Markov Model with Random Forest [10]. Song et al. proposed an improved adversarial learning-based method for recognizing reservoir shale lithology [11]. Hou et al. used multilayer perceptron, support vector machine, random forest, and XGBoost to identify organic siliciclastic shales in the Gulong Depression [12]. Song et al. used a Bayesian neural network, Fisher's discriminant analysis, and a classification and regression decision tree to identify the shale of the Shahejie Formation in the Raoyang Depression, and effectively identified the strata and thin interlayers [13]. However, these studies either did not account for the imbalance in the data set or used an artificially balanced data set, which may limit the practicality of the resulting lithology identification methods.

When applying machine learning to lithology identification, researchers have become aware of the class imbalance problem that arises when logging data are used for classification. The imbalance stems from differences in the sample sizes of the individual lithologies. To address this issue, a number of classical data rebalancing techniques have been incorporated into the data preprocessing pipeline. He processed a logging dataset using the MAHAKIL resampling method combined with a deep neural network to identify tight sandstone reservoirs [14]. Zheng et al. balanced the dataset using NCR combined with SMOTE, and then applied a multilayer perceptron, a support vector machine, and extreme gradient boosting to identify lithofacies [15]. Ibrahim et al. proposed an objective hybrid approach based on the synthetic minority oversampling technique and extreme gradient boosting for lithological classification within the Tarkwaian paleoplacer formation, using assay data obtained through X-ray fluorescence analysis [16]. However, conventional oversampling techniques generate data randomly according to a distance strategy, which may produce samples that differ from the actual samples and may introduce implausible instances into the data set.

In this study, an adaptive resampling Bayesian gradient boosting (ARBGB) method is proposed for shale lithology identification. The method considers the class imbalance of logging data and resamples the dataset using adaptive multi-objective swarm crossover optimization (AMSCO). Feature generation was performed with the help of a Bayesian gradient boosting regressor using XRD data and then shale lithology identification was performed using a Bayesian gradient boosting classifier. After discussing the performance of the model, the method was applied to the shale stratigraphy of the Gulong Depression in the Songliao Basin, northeastern China. The application results show that ARBGB outperforms other methods and provides a reliable solution for shale lithology identification in the study area. The feature engineering approach to the workflow represents an improvement on previous work. ARBGB establishes a nonlinear mapping between discrete mineral content information and continuous well logs. This feature engineering method, which incorporates prior physical data constraints, contains more physical information than the initial data set. At the same time, the use of AMSCO provides more options for logging dataset rebalancing.

2 Study area and data

2.1 Geological setting

In a previous study, it was postulated that the Songliao Basin is a superimposed basin, encompassing multiple successor basins resting atop Carboniferous–Permian folded strata, which constitute the basement of the basin. During the syn-rift stage, spanning from 150 to 105 Ma [17], regional tensile stress led to extension and general crustal thinning; the fault blocks in the basement separated and subsided differentially, resulting in the formation of numerous faulted basin groups within the Songliao Basin. Transitioning into the post-rift stage (105–79.1 Ma) [17], the Songliao Basin experienced rapid subsidence, giving rise to a large depression basin [18]. During the structural inversion stage (79–64 Ma), the significant alteration in the direction and speed of Pacific Plate motion during the Campanian [18,19,20] triggered an abrupt transition from rapid subsidence to uplift. The slip directions of both normal and reverse faults in the basin were inverted to varying degrees, resulting in the formation of numerous NNE-trending inversion faults, broad anticlinal arches, and domes [21]. These structures play a crucial role in oil trapping within the Songliao Basin [22].

In the Songliao Basin, a large-scale lake flooding event occurred in the Cretaceous period [23]. In the Central Depression, the depocenter of the Qingshankou Formation, the deposits comprise dolomite, clay-bearing felsic shale, carbonate-bearing shale, thinly laminated interbedded coquina, and clay felsic shale. Figure 1 shows thin sections of typical lithologies in the study area. Figure 1(a) shows a thin section of dolomite, which is mainly composed of mud and sand; the mud is recrystallized and the sand is enriched in bands. Figure 1(b) shows a thin section of clay-bearing felsic shale, which is striated and consists mainly of mud and sand; the mud is recrystallized and the sand is enriched in thin layers. Figure 1(c) shows a thin section of carbonate-bearing shale, which has a laminar structure and is mainly composed of mud, sand, and shell clasts; the mud is recrystallized, the sand is enriched in bands and lenses, and carbonate minerals are seen replacing clastic grains in the sand-enriched areas. Figure 1(d) shows a thin section of coquina, which has a granular structure with granules of mesquite and cavities mostly filled with dolomite. Figure 1(e) shows a thin section of clay felsic shale, which has a grainy structure and consists mainly of recrystallized mudstone.

Figure 1

Photomicrographs of thin sections from selected samples showing the main lithologies in the study area (plane- and cross-polarized light): (a) dolomite, (b) clay-bearing felsic shale, (c) carbonate-bearing shale, (d) coquina, and (e) clay felsic shale.

The Gulong Depression is a negative secondary tectonic unit within the Central Depression Zone of the Songliao Basin (Figure 2(b)), a successional depression formed on the basis of the basement tectonic morphology. It is adjacent to the Longhupao-Da'an terrace in the west, the Qijia Depression in the north, and the Daqing paleoanticline in the east. The Qingshankou Formation shale in the Gulong Depression is highly mature, and the depression is currently a key area for the exploration and development of continental shale oil in China [24,25].

Figure 2

Geologic overview map of the study area. (a) The location of the Songliao Basin, (b) distribution of primary tectonic units in the Songliao Basin, (c) distribution of secondary tectonic units in the Central Depression, and (d) distribution of cores in the study area.

2.2 Data

In this study, the logging data and petrophysical experimental data of the Qingshankou Formation in the study area are divided into a mineral pre-training model dataset and a lithology identification dataset covering five lithologies. The mineral pre-training model dataset comprises XRD data and logging data from 290 rock samples. Table 1 shows selected samples from this dataset, which records the contents of quartz, potash feldspar, plagioclase, calcite, ankerite, dolomite, siderite, pyrite, and clay. The lithology identification dataset consists of logging data from 594 rock samples, of which 475 were used for training and testing and 119 were held out for back-judgment validation. Table 2 shows the range of logging responses for each lithology in the lithology identification dataset, which includes acoustic (AC), compensated neutron-porosity logging (CNL), density (DEN), gamma ray log (GR), photoelectric absorption coefficient (PE), and deep lateral resistivity (RLLD).

Table 1

Mineralogical proportions of various lithological units of some samples

Lithology	Quartz (%)	Potash feldspar (%)	Plagioclase (%)	Calcite (%)	Ankerite (%)	Dolomite (%)	Siderite (%)	Pyrite (%)	Clay (%)
Dolomite	8.4	0	3.1	33.8	0	46.1	0	0	8.7
Clay-bearing felsic shale	33.1	0	25.5	17	3.3	0	0.3	2.1	18.7
Carbonate-bearing shale	30.8	1.9	19.2	17.8	1.1	0	0	2.4	26.8
Coquina	1.0	0.7	0.7	44.9	48.9	0	0	3.8	0
Clay felsic shale	30.3	0.8	17.0	2.9	1.1	0	0.6	3.8	43.4

Table 2

Range of lithologic petrophysical response in the study area

Lithology	AC (μs/ft)	CNL (%)	DEN (g/cm³)	GR (API)	PE (b/e)	RLLD (Ωm)
Dolomite	92.43–329.35	18.66–30.23	2.35–2.53	103.98–139.14	3.69–14.03	5.15–8.31
Clay-bearing felsic shale	91.5–373.52	16.31–31.82	2.29–2.59	99.68–153.58	3.71–23.28	3.97–20.23
Carbonate-bearing shale	95–121.26	21.1–30.73	2.4–2.52	114.25–134.4	3.76–5.76	6.52–7.22
Coquina	98.86–351.41	24.64–29.7	2.39–2.56	102.13–143.63	4.28–15.72	4.31–6.25
Clay felsic shale	78.08–393.23	10.94–38.63	2.16–2.63	90.79–177.23	3.31–17.81	3.38–16.42

Notes: AC, acoustic; CNL, compensated neutron-porosity logging; DEN, density; GR, gamma ray log; PE, photoelectric absorption coefficient; RLLD, deep lateral resistivity.

To examine the logging response of each lithology in the lithology identification dataset more closely, this study plotted a cross plot matrix (Figure 3), in which the off-diagonal panels are logging cross plots and the diagonal panels show the data distribution of the corresponding curve for each lithology. Combined with Table 2, the matrix shows that the lithologies are very difficult to distinguish in a two-dimensional feature space such as a cross plot. In addition, the data distributions show that clay felsic shale has the largest sample size in the lithology identification dataset and that the dataset is seriously class-imbalanced. In a classification scenario, such imbalance biases the classification boundary toward the majority class samples, which affects the final prediction. Therefore, the imbalance problem of the dataset needs to be solved before lithology identification [26,27,28].
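As a rough illustration of how the degree of class imbalance can be quantified before resampling, the following Python sketch counts the samples per lithology and reports the ratio of the largest class to the smallest. The label counts are hypothetical placeholders, not the class sizes of the study's dataset, and `imbalance_ratio` is an illustrative helper name.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (per-class counts, majority/minority size ratio)."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical labels for illustration only (not the study's real class sizes).
labels = (["Clay felsic shale"] * 60 + ["Clay-bearing felsic shale"] * 25
          + ["Dolomite"] * 8 + ["Coquina"] * 4 + ["Carbonate-bearing shale"] * 3)

counts, ratio = imbalance_ratio(labels)
for lithology, n in counts.most_common():
    print(f"{lithology:28s} {n:4d} ({n / len(labels):.1%})")
print(f"majority/minority ratio: {ratio:.1f}")
```

A ratio far above 1 signals that a classifier trained on the raw data is likely to favor the majority lithology.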

Figure 3

Logging data cross plot matrix.

3 Method

There are several challenges in the shale lithology identification workflow. The first is the effect of data imbalance, which makes the choice of an appropriate dataset resampling method very important. The second is the trade-off between the model optimization method, the model effectiveness, and the generalization ability. This section details the methodology of our workflow based on these considerations.

3.1 Methodology

Constraining datasets using a priori data has been shown to be of great help in geophysical exploration, so this study innovatively uses the method of constructing a mineral pre-training model to introduce mineral content data measured in the laboratory to constrain the logging dataset [29]. The mineral pre-training model is a Bayesian Gradient Boosted Regression model that takes the mineral content data of laboratory samples and the corresponding logging data of the samples to obtain the correspondence between the logging data and the mineral content, and generates a predictive model for the mineral content. The mineral pre-training model can give constraints on the logging dataset from the prior data and form new features to increase the dimensionality of the dataset.

The workflow for shale lithology identification in this study is divided into three stages (Figure 4). The first stage is the construction of the mineral pre-training model: the mineral pre-training model dataset is used to train a Bayesian extreme gradient boosting regressor, with logging data as features and mineral content as labels, and the pre-training model is saved. The second stage is model training: the training set is resampled using the AMSCO method and then fed into the pre-training model to generate mineral features; the logging data and the generated mineral contents serve as features, lithology serves as the label, and a Bayesian extreme gradient boosting classifier is trained to produce the prediction model. The third stage is lithology prediction: the data to be identified are fed into the pre-training model to generate mineral features, which are then fed, together with the logging data, into the prediction model to obtain the final lithology prediction. In this workflow, XRD data constrain the data set as prior physical information, and the pre-trained model supplies new information to the training set. This gives the data set features that are more relevant to the label.
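The data flow of the three stages can be sketched in Python as follows. This is a minimal illustration only: the nearest-neighbor learners are stand-ins for the Bayesian gradient boosting regressor and classifier, the AMSCO resampling step is indicated by a comment rather than implemented, and all class names, function names, and toy values are hypothetical.

```python
import math


def knn_predict(train_x, train_y, x, k=1):
    """Stand-in learner: return the targets of the k nearest training rows."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return [train_y[i] for i in order[:k]]


class MineralPretrainModel:
    """Stage 1 stand-in: maps a well-log vector to mineral contents."""

    def fit(self, logs, minerals):
        self.logs, self.minerals = logs, minerals
        return self

    def predict(self, x):
        return knn_predict(self.logs, self.minerals, x)[0]


class LithologyClassifier:
    """Stage 2 stand-in: classifies augmented feature vectors."""

    def fit(self, features, labels):
        self.features, self.labels = features, labels
        return self

    def predict(self, x):
        return knn_predict(self.features, self.labels, x)[0]


def augment(pretrain, logs):
    """Append the predicted mineral contents to each log vector."""
    return [list(x) + list(pretrain.predict(x)) for x in logs]


# Hypothetical toy data: logs are [GR, DEN]; minerals are [quartz %, clay %].
xrd_logs = [[105.0, 2.50], [150.0, 2.30]]
xrd_minerals = [[8.0, 9.0], [30.0, 43.0]]
train_logs = [[108.0, 2.48], [148.0, 2.32], [152.0, 2.28]]
train_labels = ["Dolomite", "Clay felsic shale", "Clay felsic shale"]

# Stage 1: fit the mineral pre-training model on samples that have XRD data.
pretrain = MineralPretrainModel().fit(xrd_logs, xrd_minerals)
# Stage 2: (resampling of the training set would happen here), then train the
# classifier on logs plus generated mineral features.
clf = LithologyClassifier().fit(augment(pretrain, train_logs), train_labels)
# Stage 3: generate mineral features for new logs and predict lithology.
new_logs = [[106.0, 2.49]]
predictions = [clf.predict(x) for x in augment(pretrain, new_logs)]
print(predictions)
```

The point of the sketch is the feature augmentation: the classifier never sees XRD measurements directly, only mineral contents predicted from the logs.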

Figure 4

Flowchart of shale lithology identification.

3.2 Adaptive multi-objective swarm crossover optimization – AMSCO

Conventional logging datasets are usually unbalanced. Because the information carried by the majority class in an unbalanced dataset far exceeds that of the minority class, a classifier trained on such data is prone to overfitting the majority class. Many methods have been proposed to deal with unbalanced datasets by changing the sample distribution to rebalance the dataset [30,31,32]. Chawla proposed the SMOTE method, which improves dataset balance by oversampling the minority class using a distance strategy [33]. Saez proposed the SMOTE-IPF algorithm [34]. Abdi and Sattar proposed the Mahalanobis Distance-based Over-sampling technique (MDO) [35]. Douzas proposed a heuristic oversampling method based on K-means and SMOTE [36]. All of these are classical resampling methods. They mainly focus on improving the sensitivity of the learner to the minority class [37] or on rebalancing the number of samples [38]. Such preprocessing methods usually oversample the minority class, undersample the majority class, or combine both [39]. However, focusing only on the sample sizes of the majority and minority classes does not optimize the classifier. The AMSCO method instead searches for the optimal combination of the majority and minority class sample sets during rebalancing [40,41].

AMSCO uses an adaptive rebalancing model, called population fusion, to deal with unbalanced classification problems. The core idea is to decompose the original dataset, optimize two independent populations, and then recombine only the eligible samples selected during these optimizations. During the rebalancing process, the parameters are tuned automatically and adaptively. The core optimizer of AMSCO is Particle Swarm Optimization (PSO). The PSO algorithm simulates the foraging behavior of birds and is a classical swarm intelligence algorithm. It is easy to implement, converges quickly, and has fewer parameters than other intelligent algorithms. Because the objective function and constraints of the PSO algorithm are relatively simple, it is widely used in various fields.
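For readers unfamiliar with PSO, a minimal global-best variant is sketched below in Python. It is a generic textbook form, not the AMSCO optimizer itself: the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the swarm size, and the sphere test function are all illustrative assumptions. In AMSCO the objective would instead be the performance of the candidate classifier on the rebalanced dataset.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=30, n_iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global-best PSO; returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_val = [objective(p) for p in pos]     # personal best values
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clip to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_pos, best_val = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best_pos, best_val)
```

Only three tunable coefficients and one update rule are needed, which reflects the ease of implementation noted above.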

In the optimization process of PSO, the initial dataset is used as the current dataset. Candidate classifiers are used to verify the quality of the current dataset. The current dataset will undergo two parallel population optimizations to best change the sample distribution until the performance of the candidate classifier reaches a threshold. Although the two candidate solutions are different in nature, a dataset instance close to the size of the initial dataset is cross-generated by selectively merging information from the best dataset instance into one instance. This dataset instance will be used as the current dataset in an iterative loop and then checked by an evaluation metric. If the evaluation metrics are less than a threshold, the dataset will again be divided into two parallel population optimization processes.

AMSCO couples these two optimization methods into a unified iterative process. Through iteration, it gradually improves the mix of data from the two population optimizations until a high-quality dataset is generated. The subgroups divide the search space, search it in parallel, and share their best information so that the best solution can be found in the shortest possible time. AMSCO constructs two subgroups: one optimizes the majority class through Swarm Instance Selection (SIS), and the other optimizes the minority class through the Oversampling Technique for Synthesizing Minority Class Instances (OSMOTE). OSMOTE optimizes the minority class by inserting synthetic samples along the line segments that join a minority sample to any or all of its K nearest minority-class neighbors, with the parameters tuned by PSO.
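The line-segment interpolation that underlies SMOTE-style oversampling can be sketched in Python as follows. This shows only the basic interpolation rule with a fixed `k`; the PSO tuning of the oversampling parameters in OSMOTE is not reproduced, and the function name and sample values are hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment joining a minority
    sample to one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen base sample.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        t = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class samples in a (GR, DEN) feature plane.
minority = [[105.0, 2.50], [110.0, 2.47], [120.0, 2.44], [130.0, 2.40]]
new_points = smote_like_oversample(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because every synthetic point lies between two real minority samples, it stays inside the region already occupied by the minority class, but nothing guarantees that it is geologically plausible, which is the limitation the adaptive approach aims to reduce.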

3.3 Bayesian gradient boosting

Logging data are typically low-dimensional discrete data with typical class imbalance properties. Gradient boosting methods have been shown to be advantageous in logging lithology identification scenarios. Gradient boosting refers to a class of ensemble learning methods based on decision trees, in which multiple decision trees are combined as weak learners. The basic idea of gradient boosting is to minimize the residual of the objective function through iterations. In each round, a new learner is fitted to the residuals of the current model, and the residuals are then minimized by gradient descent. Typical boosting methods include AdaBoost [42] and GBDT. Because they build strong learners from decision trees as weak learners, their prediction accuracy is high and they retain the robustness of decision trees. However, decision trees may overfit, which in this study means that performance on actual prediction data is worse than the performance measured during model building.

Extreme Gradient Boosting (XGBoost) introduces Lasso (L1) and Ridge (L2) regularization terms in the objective function, aiming to reduce overfitting through feature selection, handling of noisy data, and control of model complexity. The objective function $O^{(t)}$ of gradient boosting is as follows:

$$O^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),$$

where $l$ is the loss function, $y_i$ is the true value of the $i$th sample, $n$ is the number of samples, $t$ is the iteration index, $\hat{y}_i^{(t-1)}$ is the predicted value after round $t-1$, $x_i$ is the $i$th sample, $f_t(x_i)$ is the output of the weak learner added in the $t$th iteration, and $\Omega(f_t)$ is the regularization term.

Unlike the conventional gradient boosting method, XGBoost takes second-order derivatives into account by applying a second-order Taylor expansion of the loss around the current prediction. The weak learner of XGBoost is a decision tree. To determine the optimal weak learner parameters, XGBoost parameterizes $f_t(x_i)$ and $\Omega(f_t)$ in terms of the tree; after substituting the decision tree parameters, the objective function becomes:

$$O^{(t)} = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right)\omega_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right)\omega_j^{2}\right] + \gamma T,$$

where $g_i$ and $h_i$ are the first- and second-order derivatives of the loss $l$ with respect to the prediction $\hat{y}_i^{(t-1)}$, $\omega_j$ denotes the weight of the $j$th leaf node of the decision tree, $I_j$ is the set of samples at leaf node $j$, $T$ is the number of leaf nodes, and $\gamma$ and $\lambda$ are regularization parameters used to control the complexity of the tree.
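A standard consequence of this leaf-wise form, added here for reference rather than taken from the text, is that the objective is quadratic in each leaf weight, so the optimal weight and the minimized objective have closed forms. The shorthand $G_j$ and $H_j$ is introduced here for the summed first- and second-order derivatives in leaf $j$:

```latex
G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i, \qquad
\omega_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad
O^{(t)}_{\min} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T .
```

A larger $\lambda$ shrinks every leaf weight toward zero, and $\gamma$ adds a fixed cost per leaf, which is how the two parameters limit tree complexity.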

In the process of finding the optimal learner parameters for XGBoost, there is a process of maximizing the objective evaluation metrics S as follows:

$$p^{*} = \arg\max_{p \in P} S(p),$$

where p denotes the parameter vector and P denotes the set of parameter vectors. In determining the optimization direction, this study uses Expected Improvement as the acquisition function, which is represented as follows:

$$\mathrm{EI}_{q^{*}}(p) = \int_{-\infty}^{+\infty} \max\left(q^{*} - q,\ 0\right)\, p_M(q \mid p)\, \mathrm{d}q,$$

where $q$ denotes the model metric, $q^{*}$ denotes the metric threshold, $p_M(q \mid p)$ is the surrogate model's predictive density of the metric under the parameter vector $p$, and $\mathrm{EI}_{q^{*}}(p)$ denotes the expected extent to which the metric improves under $p$. The parameter vector for the next search, $p_{\mathrm{new}}$, is

$$p_{\mathrm{new}} = \arg\max_{p \in P} \mathrm{EI}_{q^{*}}(p).$$

In this process, the metric threshold $q^{*}$ is obtained through Gaussian process regression. The target evaluation metrics and target parameters are assumed to obey a Gaussian distribution with a mean of 0, which serves as the prior distribution. The posterior distributions of the target evaluation metrics and target parameters are then obtained from the observed points. As the number of observed points increases, the target evaluation metrics are continuously updated, and the best metric observed so far is used as the threshold $q^{*}$. The search for better parameters then continues until the iteration stops.
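When the surrogate's predictive distribution for a candidate is Gaussian, as it is under Gaussian process regression, the Expected Improvement integral has a well-known closed form. The Python sketch below evaluates it and picks the next candidate. It follows the sign convention of the integral as written, in which values of the metric below the threshold count as improvement (for example an error such as RMSE); for a metric that is maximized the signs are flipped. The candidate names and their posterior means and standard deviations are hypothetical.

```python
import math


def expected_improvement(mu, sigma, q_star):
    """Closed-form E[max(q_star - q, 0)] for q ~ Normal(mu, sigma**2)."""
    if sigma <= 0.0:
        # Degenerate case: the prediction is a point mass at mu.
        return max(q_star - mu, 0.0)
    z = (q_star - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (q_star - mu) * cdf + sigma * pdf


# Hypothetical candidates: name -> (posterior mean, posterior std) of the metric.
candidates = {"p_a": (0.30, 0.01), "p_b": (0.28, 0.05), "p_c": (0.35, 0.02)}
q_star = 0.29  # best (lowest) metric value observed so far

scores = {name: expected_improvement(mu, sd, q_star)
          for name, (mu, sd) in candidates.items()}
p_new = max(scores, key=scores.get)  # candidate with the largest EI
print(scores, p_new)
```

The acquisition rewards both a promising mean and a large uncertainty, so a candidate that is only slightly better on average but poorly explored can still be selected.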

4 Validation of the methodology

In this section, we introduce evaluation metrics suited to classification with unbalanced data. To test the hypothesis proposed in the previous section, we compare the AMSCO method with other data resampling methods by projecting the classifier's decision boundaries onto a two-dimensional feature plane, and we evaluate each resampling method by its classification effect. Generalization ability is the performance of a model on data outside the training set; in this study, it means the performance of the model in identifying lithology in wells that were not used for training. A model that performs well on training data but poorly on new data is said to be overfitted; here, overfitting shows up as a significant difference between the predictions on the test set and those on the back-judgment set. Investigating the generalization ability of the lithology identification method is therefore important for assessing the practical value of the model. We evaluate the optimization method, the classifier selection, and the generalization ability of the model by comparing it with classical machine learning methods on the test and back-judgment datasets, respectively.

4.1 Evaluation indicators

In the class imbalance classification problem, the global evaluation index results will lose representativeness due to the inconsistency in the number of samples in each class. In this study, the correctness of each minority class lithology identification has little impact on the global evaluation index because the single minority class lithology sample represents a rather small proportion of all samples. However, due to the rarity of minority lithologies, the accuracy of a single minority lithology sample has a significant impact on the effect of the lithology evaluation. Therefore, the global evaluation index cannot comprehensively evaluate the lithology classification scenario with unbalanced classes.

In this study, to evaluate the performance of the lithology identification model effectively, we use, in addition to accuracy, the indicators Precision-macro, Recall-macro, F1-macro, Precision-weight, Recall-weight, and F1-weight. True positive (TP), false positive (FP), false negative (FN), and true negative (TN) are the four possible outcomes of a classification result. Precision-macro, Recall-macro, and F1-macro show the performance of the model in the macro view, in which every category counts equally. Precision-weight, Recall-weight, and F1-weight show the performance in the weight view, in which each category is weighted according to its proportion of the samples so that category imbalance is taken into account. The details are as follows:

$$\text{Precision-macro} = \frac{1}{k} \sum_{l=1}^{k} \frac{\mathrm{TP}_l}{\mathrm{TP}_l + \mathrm{FP}_l},$$

where k is the number of categories.

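$$\text{Recall-macro} = \frac{1}{k} \sum_{l=1}^{k} \frac{\mathrm{TP}_l}{\mathrm{TP}_l + \mathrm{FN}_l},$$

$$\text{F1-macro} = \frac{2 \times \text{Precision-macro} \times \text{Recall-macro}}{\text{Precision-macro} + \text{Recall-macro}},$$

$$\text{Precision-weight} = \sum_{l=1}^{k} \omega_l \times \frac{\mathrm{TP}_l}{\mathrm{TP}_l + \mathrm{FP}_l},$$

where $\omega_l$ is the weight of category $l$, set according to its proportion of the samples.

$$\text{Recall-weight} = \sum_{l=1}^{k} \omega_l \times \frac{\mathrm{TP}_l}{\mathrm{TP}_l + \mathrm{FN}_l},$$

$$\text{F1-weight} = \frac{2 \times \text{Precision-weight} \times \text{Recall-weight}}{\text{Precision-weight} + \text{Recall-weight}}.$$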
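The following Python sketch computes the six indicators from true and predicted labels. It assumes that each F1 value is the standard harmonic mean, 2PR/(P + R), of the corresponding aggregated precision and recall, and that the category weight is the category's share of the true labels. The function name and the toy labels are hypothetical. Note that some libraries define macro and weighted F1 differently, as an average of per-class F1 scores, so their values can differ from those computed here.

```python
from collections import Counter


def macro_weighted_metrics(y_true, y_pred):
    """Macro and weighted precision, recall, and F1 for multi-class labels."""
    classes = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    precisions, recalls, weights = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        weights.append(support[c] / n)  # category weight = share of samples

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    k = len(classes)
    p_macro, r_macro = sum(precisions) / k, sum(recalls) / k
    p_weight = sum(w * p for w, p in zip(weights, precisions))
    r_weight = sum(w * r for w, r in zip(weights, recalls))
    return {
        "precision_macro": p_macro, "recall_macro": r_macro,
        "f1_macro": f1(p_macro, r_macro),
        "precision_weight": p_weight, "recall_weight": r_weight,
        "f1_weight": f1(p_weight, r_weight),
    }


# Hypothetical toy labels: class 0 is the majority, class 2 the rarest.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
metrics = macro_weighted_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:17s} {value:.4f}")
```

In this toy case Recall-weight equals the overall accuracy (5 of 7 correct), which illustrates why the weight view tracks global performance while the macro view gives rare lithologies the same influence as common ones.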

4.2 Model validation

Figure 5 compares the resampling effect of the different methods, showing the data distribution and the XGBoost decision boundaries in the two-dimensional feature space of GR and DEN. Before resampling, dolomite, as a minority class, has a much smaller decision region than clay-bearing felsic shale, and some dolomite samples are classified as clay-bearing felsic shale. The SMOTE (Figure 5(b)), SMOTE-IPF (Figure 5(c)), MDO (Figure 5(d)), and Kmeans_SMOTE (Figure 5(e)) methods are selected for comparison. After resampling, the number of minority class samples increases with all methods, and the minority class decision region expands. However, the accuracy obtained with SMOTE, SMOTE-IPF, MDO, and Kmeans_SMOTE decreases after resampling. Although these methods reduce the imbalance of the data set, the resampling does not improve the classification effect. After AMSCO resampling (Figure 5(f)), new dolomite samples are generated, the decision boundary moves into the original decision region of clay-bearing felsic shale, and the decision region of dolomite becomes larger, indicating that the dataset has become more balanced. At the same time, the accuracy of the model increases from 88.889% to 91.429%, indicating that classification improves after AMSCO resampling. Overall, AMSCO resampling effectively reduces the class imbalance and, in this comparison, outperforms SMOTE and the other resampling methods.

Figure 5

Comparison of rebalancing effect of logging data, red scatters are clay-bearing felsic shale samples, and blue scatters are dolomite samples. (a) Decision boundary of the original data with 88.889% accuracy. (b) Decision boundary of the model after SMOTE resampling with an accuracy of 88.710%. (c) Decision boundary of the model after SMOTE-IPF resampling with an accuracy of 88.710%. (d) Decision boundary of the model after MDO resampling with an accuracy of 88.710%. (e) Decision boundary of the model after Kmeans_SMOTE resampling with an accuracy of 88.710%. (f) Decision boundary of the model after AMSCO resampling with an accuracy of 91.429%.

In the pre-training model optimization, minimizing the root-mean-square error was used as the search direction, and the hyperparameter combination for each mineral content was determined (Table 3). For model evaluation, Extreme Gradient Boosting (XGBoost), Multilayer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM) were selected for comparison. Among them, RF is a classical tree ensemble method, MLP is a neural network method, and SVM is a typical kernel method. Figure 6 shows the histogram matrix of evaluation metrics for the model on the test set; the color of each bar corresponds to the algorithm, and the X-axis shows the evaluation metrics. The table below the histograms records the specific values of the evaluation metrics; the closer a cell's color is to blue, the better the method's value for that metric. The evaluation metrics of the ARBGB method are superior to those of the other methods in terms of both accuracy and the weight perspective. From the macro perspective, ARBGB's Precision-macro is higher than XGBoost's, while ARBGB's Recall-macro and F1-macro are slightly lower than XGBoost's.

Table 3

Mineral content pre-training model hyperparameters

Model	Max depth	Learning rate	Min child weight	Gamma	RMSE
Quartz	5	0.0537	1.4514	1.6972 × 10⁻⁵	0.3541
Potash feldspar	3	0.0946	9.9296	0.0759	0.7925
Plagioclase	3	0.0959	9.9992	7.683 × 10⁻⁵	6.6444
Calcite	4	0.0834	5.7139	0.0428	10.5353
Ankerite	3	0.0704	5.1391	1.8684 × 10⁻⁷	15.7298
Dolomite	3	0.0522	1.8749	0.0022	0.4939
Siderite	4	0.0888	1.363	0.0002	0.5606
Pyrite	3	0.0749	6.6834	2.1689 × 10⁻⁵	1.7297
Clay	4	0.0928	6.8875	1.4587 × 10⁻⁸	9.4024

Notes: RMSE, root mean square error.
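The selection criterion behind Table 3 can be sketched as follows: each candidate hyperparameter setting is scored by its RMSE, and the setting with the lowest value is kept. This is only a minimal sketch of the criterion, not the optimization procedure used in the study. The candidate names and all numerical values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical measured mineral contents and predictions from two candidate settings.
measured = [30.0, 25.0, 40.0, 35.0]
candidates = {
    "max_depth=3": [28.0, 27.0, 41.0, 33.0],
    "max_depth=5": [30.5, 24.0, 39.0, 36.0],
}

scores = {name: rmse(measured, pred) for name, pred in candidates.items()}
best = min(scores, key=scores.get)  # the setting with the smallest RMSE
print(best, round(scores[best], 4))
```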

Figure 6

Histogram matrix of evaluation metrics for the model test set.

To assess the generalization ability of the model, validation is performed on the back-judgment dataset, which is independent of the training set. Figure 7 shows the histogram matrix of the evaluation metrics on the back-judgment dataset. Overall, the evaluation metrics on the back-judgment dataset are lower than those obtained for the training set model. In both accuracy and the weight perspective, ARBGB scores better than the other methods. In the macro perspective, ARBGB is slightly inferior to XGBoost in Precision-macro but outperforms XGBoost in Recall-macro and F1-macro.

Figure 7

Histogram matrix of evaluation indicators for the model back judgment set.

Because lithology identification is an imbalanced classification problem in which the prediction results of every category should be equally important, the evaluation indexes under the weight perspective better reflect the specific effect of the model. From the weight perspective, ARBGB is clearly superior to the other methods compared. In an actual workflow, however, only the recognition of some specific lithologies may matter, in which case the evaluation indexes under the macro perspective are emphasized.
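The difference between the two perspectives can be illustrated with a small worked example. The sketch below assumes the common definitions, which this section does not spell out: the macro average is the unweighted mean of the per-class scores, and the weighted average weights each class by its number of true samples. The labels and predictions are hypothetical.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


def average_scores(scores, weighted):
    """Average (precision, recall, f1) over classes.

    weighted=False: every class counts equally (macro average).
    weighted=True: classes count in proportion to their support.
    """
    weights = {c: (s[3] if weighted else 1) for c, s in scores.items()}
    total = sum(weights.values())
    return tuple(
        sum(s[i] * weights[c] for c, s in scores.items()) / total for i in range(3)
    )


# Hypothetical imbalanced case: 8 shale samples, 2 dolomite samples,
# with one dolomite sample misclassified as shale.
y_true = ["shale"] * 8 + ["dolomite"] * 2
y_pred = ["shale"] * 8 + ["dolomite", "shale"]

scores = per_class_scores(y_true, y_pred)
macro = average_scores(scores, weighted=False)
weighted = average_scores(scores, weighted=True)
print("macro    P/R/F1:", [round(v, 3) for v in macro])
print("weighted P/R/F1:", [round(v, 3) for v in weighted])
```

In this example, a single error on the minority class lowers the macro recall much more than the weighted recall, so the two perspectives can rank methods differently on imbalanced data.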

The training-model and back-judgment results of ARBGB and XGBoost are hardly distinguishable in the macro perspective, so we compare the generalization ability of the two methods. Tree-based learners are prone to overfitting, which shows up as lower evaluation metrics on the back-judgment set, so the ability of a model to perform on independent data, i.e., its generalization ability, is also an important indicator of its capability. We use the differences between the metrics of the training model and the back-judged model, together with the average of these differences, to measure generalization ability. Figure 9 compares the generalization ability of ARBGB and XGBoost. ARBGB has smaller values than XGBoost in all four difference indicators, which shows that ARBGB generalizes better than XGBoost; in the macro difference in particular, ARBGB is much smaller than XGBoost. To a certain extent, this also settles the comparison in the macro perspective: ARBGB performs better than XGBoost in accuracy, the weight perspective, and the macro perspective.
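The difference-based measure described above can be sketched as follows. The metric names, the two model labels, and all values are hypothetical and are not the results of this study; the sketch also assumes that the average is taken over the individual metric differences.

```python
def generalization_gaps(train_metrics, back_metrics):
    """Per-metric drop from the training model to the back-judgment set,
    plus the mean of those drops. Smaller values suggest better generalization."""
    gaps = {name: train_metrics[name] - back_metrics[name] for name in train_metrics}
    gaps["mean"] = sum(gaps.values()) / len(gaps)
    return gaps


# Hypothetical metric values for two models.
model_a = generalization_gaps(
    {"accuracy": 0.90, "f1_weighted": 0.89, "f1_macro": 0.80},
    {"accuracy": 0.88, "f1_weighted": 0.87, "f1_macro": 0.77},
)
model_b = generalization_gaps(
    {"accuracy": 0.89, "f1_weighted": 0.88, "f1_macro": 0.81},
    {"accuracy": 0.85, "f1_weighted": 0.84, "f1_macro": 0.70},
)

for name in model_a:
    print(f"{name:12s} A: {model_a[name]:.3f}  B: {model_b[name]:.3f}")
```

In this hypothetical case, model A loses less than model B on every metric, which is the pattern the comparison of difference indicators is meant to detect.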

Figure 8

Diagram of lithology identification results of X well, spanning depths from 2,300 to 2,350 m.

The results show that, compared with XGBoost, MLP, RF, and SVM, the ARBGB model has better prediction performance and generalization ability for shale lithology.

4.3 Actual well validation

To verify the applicability of the model, the ARBGB model is applied to the Qingshankou Formation in well X in the study area; Figure 8 shows the lithology identification results for well X between 2,300 and 2,350 m. The discrete data track in the figure shows the sidewall coring identification results, which indicate that this interval is dominated by clay-bearing felsic shale and clay felsic shale, with a few thin interbeds of dolomite, coquina, and carbonate-bearing shale. The ARBGB prediction yields a continuous lithology profile: most layers are predicted accurately, and the lithological gaps in the uncored intervals are filled. However, at 2,320 m, the sidewall coring results show thin coquina and carbonate-bearing shale layers that are interbedded and close in depth, and ARBGB identifies the result as coquina. This may be because the thickness of the thin layers is below the resolution of the logging data, and the two samples are too close in depth to produce a significant difference in logging response, which is confirmed by the logging curves. Overall, ARBGB performs well in the actual stratigraphy.

5 Discussion

The lithology identification method proposed in this study is based on logging data, which typically form class-imbalanced datasets. Part of our method therefore focuses on handling this imbalance. This characteristic of the dataset has not been discussed in previous studies of the study area. Internationally, researchers have addressed it using SMOTE and the Tomek link method [43,44].

However, class imbalance is in fact a widespread problem that has been extensively discussed in fields such as produced water reinjection [45] and hydrothermal alteration [46]. To address class imbalance itself, Chawla et al. proposed the SMOTE method, which improves the balance of a dataset by oversampling the minority class using a distance strategy [33]. Sáez et al. proposed the SMOTE-IPF algorithm [34]. Abdi and Sattar proposed the Mahalanobis Distance-based Over-sampling (MDO) technique [35]. Douzas et al. proposed a heuristic oversampling method based on K-means and SMOTE to improve dataset balance [36]. The core of this work still lies in optimizing oversampling and undersampling methods.

Returning to this study, we compared the AMSCO method with the classical methods SMOTE, SMOTE-IPF, MDO, and Kmeans_SMOTE. The results show that AMSCO achieves the best accuracy. Successful applications have also been reported in other fields, such as malware detection [47] and frequency-hopping spread spectrum [48,49,50,51].

In addition to the rebalancing method, when generalization ability is considered, the ARBGB method also achieves the best recognition effect compared with the XGBoost, MLP, RF, and SVM methods. The recognition effect was assessed on two validation sets, and we believe that a discussion that includes generalization ability is more persuasive than an ordinary comparison of evaluation metrics (Figure 9).

Figure 9

Comparison of the generalization capabilities of the ARBGB and XGBoost algorithms.

6 Conclusions

In this study, an innovative lithology identification method is proposed for the shale stratigraphy of the Qingshankou Formation in the Gulong Depression, Songliao Basin, northeastern China. The method uses a priori data to constrain the training set: XRD data are combined with Bayesian gradient boosting training to obtain a mineral pre-trained model for feature generation. It uses AMSCO to reduce the imbalance of logging data and finally combines logging data with a Bayesian gradient boosting classifier for lithology identification. It was found that AMSCO can effectively reduce the imbalance of logging data and improve the recognition results. The mineral pre-training model used limited petrophysical experimental data to generate features for the logging dataset, and, combined with Bayesian gradient boosting, continuous stratigraphic lithology data were obtained. The ARBGB method is a reliable shale lithology identification method whose evaluation indexes exceed those of the other methods, and it has good application results in real strata.

Funding information: This research was funded by the National Key Research and Development Program of China (2023YFC3707901) and the National Natural Science Foundation of China (Nos. 42204122 and 42072323).
Author contributions: Conceptualization, L.X., X.M., and L.X.; methodology, L.X., R.H., and X.M.; software, L.X., and Y.G.; validation, X.Y., X.H., Z.L., and H.W.; writing – original draft preparation, L.X. All authors have read and agreed to the published version of the manuscript.
Conflict of interest: The authors declare no conflict of interest.
Data availability statement: The data are not publicly available due to data privacy. Other relevant materials from the current study are available from the corresponding author upon reasonable request.

References

[1] Ross DJK, Bustin RM. The importance of shale composition and pore structure upon gas storage potential of shale gas reservoirs. Mar Pet Geol. 2009;26(6):916–27. 10.1016/j.marpetgeo.2008.06.004.

[2] Li H. Coordinated development of shale gas benefit exploitation and ecological environmental conservation in China: A mini review. Front Ecol Evol. 2023;11:1232395. 10.3389/fevo.2023.1232395.

[3] Li H. Deciphering the formation period and geological implications of shale tectonic fractures: A mini review and forward-looking perspectives. Front Energy Res. 2023;11:1320366. 10.3389/fenrg.2023.1320366.

[4] Shan SC, Wu YZ, Fu YK, Zhou PH. Shear mechanical properties of anchored rock mass under impact load. J Min Strata Control Eng. 2021;3(4):043034. 10.13532/j.jmsce.cn10-1638/td.20211014.001.

[5] Wang J, Wang XL. Seepage characteristic and fracture development of protected seam caused by mining protecting strata. J Min Strata Control Eng. 2021;3(3):033511. 10.13532/j.jmsce.cn10-1638/td.20201215.001.

[6] Mulhern ME, Laing JE, Senecal JE, Widdicombe RE, Isselhardt C, Bowersox JR. Electrofacies identification of lithology and stratigraphic trap, Southeast Lost Hills fractured shale pool, Kern County, California. AAPG Bull. 1985;69(4):671. 10.1306/AD4626C8-16F7-11D7-8645000102C1865D.

[7] Wang G, Carr TR. Marcellus shale lithofacies prediction by multiclass neural network classification in the Appalachian Basin. Math Geosci. 2012;44(8):975–1004. 10.1007/s11004-012-9421-6.

[8] Bhattacharya S, Carr TR, Pal M. Comparison of supervised and unsupervised approaches for mudstone lithofacies classification: Case studies from the Bakken and Mahantango-Marcellus shale, USA. J Nat Gas Sci Eng. 2016;33:1119–33. 10.1016/j.jngse.2016.04.055.

[9] Han L, Fuqiang L, Zheng D, Weixu X. A lithology identification method for continental shale oil reservoir based on BP neural network. J Geophys Eng. 2018;15(3):895–908. 10.1088/1742-2140/aaa4db.

[10] Wang P, Chen X, Wang B, Li J, Dai H. An improved method for lithology identification based on a hidden Markov model and random forests. Geophysics. 2020;85(6):IM27–36. 10.1190/geo2020-0108.1.

[11] Song L, Yin X, Yin L. Reservoir lithology identification based on improved adversarial learning. IEEE Geosci Remote Sens Lett. 2023;20:1–5. 10.1109/LGRS.2023.3281545.

[12] Hou M, Xiao Y, Lei Z, Yang Z, Lou Y, Liu Y. Machine learning algorithms for lithofacies classification of the Gulong shale from the Songliao Basin, China. Energies. 2023;16(6):2581. 10.3390/en16062581.

[13] Song Z, Xiao D, Wei Y, Zhao R, Wang X, Tang J. The research on complex lithology identification based on well logs: A case study of lower 1st member of the Shahejie formation in Raoyang Sag. Energies. 2023;16(4):1748. 10.3390/en16041748.

[14] He M, Gu H, Wan H. Log interpretation for lithology and fluid identification using deep neural network combined with MAHAKIL in a tight sandstone reservoir. J Pet Sci Eng. 2020;194:107498. 10.1016/j.petrol.2020.107498.

[15] Zheng D, Hou M, Chen A, Zhong H, Qi Z, Ren Q, et al. Application of machine learning in the identification of fluvial-lacustrine lithofacies from well logs: A case study from Sichuan Basin, China. J Pet Sci Eng. 2022;215:110610. 10.1016/j.petrol.2022.110610.

[16] Ibrahim B, Isaac A, Anthony E, Fareed M. A novel XRF-based lithological classification in the Tarkwaian paleo placer formation using SMOTE-XGBoost. J Geochem Explor. 2023;245:107147. 10.1016/j.gexplo.2022.107147.

[17] Wang P, Frank M, Didenko AN, Zhu DB, Singer B, Sun X. Tectonics and cycle system of the Cretaceous Songliao Basin: An inverted active Continental Margin Basin. Earth-Sci Rev. 2016;159:82–102. 10.1016/j.earscirev.2016.05.004.

[18] Stepashko AA. The Cretaceous dynamics of the Pacific plate and stages of magmatic activity in Northeastern Asia. Geotectonics. 2006;40:225–35. 10.1134/S001685210603006X.

[19] Maruyama S, Tetsuzo S. Orogeny and relative plate motions: Example of the Japanese islands. Tectonophysics. 1986;127:305–29. 10.1016/0040-1951(86)90067-3.

[20] Didenko AN, Khanchuk AI, Tikhomirova AI, Voinova IP. Eastern Segment of the Kiselevka-Manoma terrane (Northern Sikhote Alin): Paleomagnetism and geodynamic implications. Russ J Pac Geol. 2014;8:18–37. 10.1134/S1819714014010023.

[21] Song T. Inversion styles in the Songliao basin (Northeast China) and estimation of the degree of inversion. Tectonophysics. 1997;283:173–88. 10.1016/S0040-1951(97)00147-9.

[22] Ren J, Kensaku T, Li S, Zhang J. Late Mesozoic and Cenozoic rifting and its dynamic setting in Eastern China and adjacent areas. Tectonophysics. 2002;344:175–205. 10.1016/S0040-1951(01)00271-2.

[23] Zhao W, Bian C, Li Y, Zhang J, He K, Liu W, et al. Enrichment factors of movable hydrocarbons in lacustrine shale oil and exploration potential of shale oil in Gulong Sag, Songliao Basin, NE China. Pet Explor Dev. 2023;50(3):520–33. 10.1016/S1876-3804(23)60407-0.

[24] Huo Z, Hao S, Liu B, Zhang J, Ding J, Tang X, et al. Geochemical characteristics and hydrocarbon expulsion of source rocks in the first member of the Qingshankou formation in the Qijia-Gulong Sag, Songliao Basin, Northeast China: Evaluation of shale oil resource potential. Energy Sci Eng. 2020;8(5):1450–67. 10.1002/ese3.603.

[25] Liu B, Wang H, Fu X, Bai Y, Bai L, Jia M, et al. Lithofacies and depositional setting of a highly prospective lacustrine shale oil succession from the Upper Cretaceous Qingshankou formation in the Gulong Sag, Northern Songliao Basin, Northeast China. AAPG Bull. 2019;103(2):405–32. 10.1306/08031817416.

[26] Han R, Wang Z, Guo Y, Wang X, Zhong G. Multi-label prediction method for lithology, lithofacies and fluid classes based on data augmentation by Cascade forest. Adv Geo-Energy Res. 2023;9(1):25–37. 10.46690/ager.2023.07.04.

[27] Han R, Wang Z, Wang W, Xu F, Qi X, Cui Y, et al. Igneous rocks lithology identification with deep forest: Case Study from Eastern Sag, Liaohe Basin. J Appl Geophysics. 2023;208:104892. 10.1016/j.jappgeo.2022.104892.

[28] Han R, Wang Z, Zhang Z, Wang X, Cui Y, Guo Y. Prediction of igneous lithology and lithofacies based on ensemble learning with data optimization. Geophysics. 2024;89(2):JM1–11. 10.1190/geo2022-0782.1.

[29] Wu X, Ma J, Si X, Bi Z, Yang J, Gao H, et al. Sensing prior constraints in deep neural networks for solving exploration geophysical problems. Proc Natl Acad Sci. 2023;120(23):e2219573120. 10.1073/pnas.2219573120.

[30] Estabrooks A, Taeho J, Nathalie J. A multiple resampling method for learning from imbalanced data sets. Comput Intell. 2004;20(1):18–36. 10.1111/j.0824-7935.2004.t01-1-00228.x.

[31] Galar M, Alberto F, Edurne B, Humberto B, Francisco H. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C (Appl Rev). 2012;42(4):463–84. 10.1109/TSMCC.2011.2161285.

[32] Fernández A, Victoria L, Mikel G, María J, Francisco H. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowl Syst. 2013;42:97–110. 10.1016/j.knosys.2013.01.018.

[33] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57. 10.1613/jair.953.

[34] Sáez JA, Luengo J, Stefanowski J, Herrera F. SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci. 2015;291:184–203. 10.1016/j.ins.2014.08.051.

[35] Abdi L, Sattar H. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Trans Knowl Data Eng. 2016;28(1):238–51. 10.1109/TKDE.2015.2458858.

[36] Douzas G, Bacao F, Last F. Improving Imbalanced Learning through a Heuristic oversampling method based on K-means and SMOTE. Inf Sci. 2018;465:1–20. 10.1016/j.ins.2018.06.056.

[37] Cieslak DA, Chawla NV, Striegel A. Combating imbalance in network intrusion datasets. In Proceedings of the 2006 IEEE International Conference on Granular Computing; 2006. p. 732–737. 10.1109/GRC.2006.1635905.

[38] Nekooeimehr I, Lai-Yuen SK. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Syst Appl. 2016;46:405–16. 10.1016/j.eswa.2015.10.031.

[39] Douzas G, Bacao F. Self-organizing map oversampling (SOMO) for imbalanced data set learning. Expert Syst Appl. 2017;82:40–52. 10.1016/j.eswa.2017.03.073.

[40] Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F. Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets. Inf Sci. 2016;354:178–96. 10.1016/j.ins.2016.02.056.

[41] Li J, Fong S, Wong RK, Chu VW. Adaptive multi-objective swarm fusion for imbalanced data classification. Inf Fusion. 2018;39:1–24. 10.1016/j.inffus.2017.03.007.

[42] Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Computer Syst Sci. 1997;55(1):119–39. 10.1006/jcss.1997.1504.

[43] Hasan ML, Tóth M. Localization of potential migration pathways inside a fractured metamorphic hydrocarbon reservoir using well log evaluation (Mezősas field, Pannonian Basin). Geoenergy Sci Eng. 2023;225:105710. 10.1016/j.geoenergy.2023.105710.

[44] Gohari M, Niri M, Sadeghnejad S, Ghiasi-Freez J. Synthetic graphic well log generation using an enhanced deep learning workflow: Imbalanced multiclass data, sample size, and scalability challenges. SPE J. 2024;29(1):1–20. 10.2118/217466-PA.

[45] Frota RA, Tanscheit R, Vellasco M. Fuzzy logic for control of injector wells flow rates under produced water reinjection. J Pet Sci Eng. 2022;215:110574. 10.1016/j.petrol.2022.110574.

[46] Ishitsuka K, Ohta H, Murakami T, Kawai T, Sudo T, Aoyagi H. Characterization of hydrothermal alteration along geothermal wells using unsupervised machine-learning analysis of X-ray powder diffraction data. Earth Sci Inform. 2022;15(1):73–87. 10.1007/s12145-021-00694-3.

[47] Li T, Liu Y, Wang X, Li Q, Liu Q, Wang R, et al. A Malware detection model based on imbalanced heterogeneous graph embeddings. Expert Syst Appl. 2024;246:123109. 10.1016/j.eswa.2023.123109.

[48] Khan MT, Sheikh UU. A hybrid convolutional neural network with fusion of handcrafted and deep features for FHSS signals classification. Expert Syst Appl. 2023;225:120153. 10.1016/j.eswa.2023.120153.

[49] Amuda YJ. Impact of COVID-19 on oil and gas sector in Nigeria: A condition for diversification of economic resources. Emerg Sci J. 2023;7(Special Issue):264–80. 10.28991/ESJ-2023-SPER-019.

[50] Susilo A, Juwono AM, Aprilia F, Hisyam F, Rohmah S, Hasan MFR. Subsurface analysis using microtremor and resistivity to determine soil vulnerability and discovery of new local fault. Civ Eng J. 2023;9(9):2286–99. Article 9. 10.28991/CEJ-2023-09-09-014.

[51] Edris WF, Al-Fhaid H, Al-Tamimi M. Evolution of durability and mechanical behaviour of mud mortar stabilized with oil shale ash, lime, and cement. Civ Eng J. 2023;9(9):2175–92. Article 9. 10.28991/CEJ-2023-09-09-06.

Received: 2024-03-15

Revised: 2024-04-26

Accepted: 2024-05-10

Published Online: 2024-08-21

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/geo-2022-0672

Keywords for this article

shale; shale oil; lithology identification; machine learning; Bayesian gradient boosting algorithm
