
Incorporation of structural properties of the response surface into oblique model trees

Marvin Schöne, Martin Kohlhase and Oliver Nelles
Published/Copyright: October 10, 2025

Abstract

Oblique model trees are nonparametric regression models that partition the input space using axis-oblique splits to define local regions for regression. The Hierarchical Local Model Tree algorithm (HILOMOT) efficiently constructs these trees and produces a continuous output from overlapping local models, making it well suited for industrial applications. We enhance HILOMOT with a novel semi-greedy splitting method to reduce the short-sightedness of its greedy splitting, eliminate an undesirable trade-off during tree construction that can cause underfitting, and improve split interpretability. Our approach first estimates the strongest direction of curvature in a partition that the existing model cannot approximate. Then, a non-greedy method determines the optimal split orthogonal to this direction. We compare our method in experiments on synthetic data against HILOMOT and other established regression techniques. Qualitative results confirm that our method produces more interpretable splits, while quantitative results show it generally constructs more accurate and compact trees. Regardless of the splitting method, HILOMOT consistently outperforms all reference methods in accuracy.

Zusammenfassung

Schräge Modellbäume sind nichtparametrische Regressionsmodelle, die den Eingangsraum mittels achsenschräger Splits in lokale Regionen zur Regression aufteilen. HILOMOT ist ein leistungsfähiger Algorithmus zur effizienten Erzeugung dieser Bäume. Durch überlappende lokale Modelle entsteht eine kontinuierliche Modellausgabe, was für industrielle Anwendungen vorteilhaft ist. In dieser Arbeit verbessern wir HILOMOT durch eine neuartige Semi-Greedy-Splitting-Methode, die Nachteile des ursprünglichen Greedy-Splittings reduziert, einen unerwünschten Zielkonflikt während der Baumerzeugung eliminiert, der zu einer Unteranpassung führen kann, und die Interpretierbarkeit der Splits verbessert. Unsere Methode schätzt zunächst die stärkste Krümmungsrichtung, die im aktuellen Modell nicht darstellbar ist. Anschließend bestimmt eine Non-Greedy-Methode die optimale Aufteilungsstelle orthogonal zur identifizierten Richtung. Wir analysieren unsere Methode experimentell mit synthetischen Daten im Vergleich zu HILOMOT und anderen Regressionsverfahren. Qualitative Ergebnisse bestätigen die verbesserte Interpretierbarkeit der Splits. Quantitative Ergebnisse zeigen zudem, dass unsere Methode meist genauere und kompaktere Bäume erzeugt als die ursprüngliche. Unabhängig vom Splitting übertrifft HILOMOT alle Referenzmethoden deutlich in der Vorhersagegenauigkeit.

1 Introduction

In industrial digitalization, Machine Learning (ML) offers significant potential to enhance manufacturing, e.g. by optimizing or controlling processes through data-driven models [1]. The function f of the process to be approximated often depends on multiple inputs with unknown nonlinear effects on the output. When $f: \mathbb{R}^M \to \mathbb{R}$, nonparametric multivariate regression models are promising [2]. Unlike parametric models such as Polynomial Regression, they adapt their structure during learning to capture complex nonlinear relationships and prevent under- or overfitting. Therefore, no prior knowledge about f is required. Well-established nonparametric models include k-Nearest Neighbors and Gaussian Processes. In contrast, Oblique Model Trees are less common, but similarly powerful, particularly for small datasets with high-dimensional inputs [3], [4].

An oblique model tree captures data patterns using multiple local models (e.g. linear affine models). Their regions are generated by a Divide-and-Conquer strategy that recursively partitions the input space with axis-oblique splits. This is typically done by a greedy algorithm that selects locally optimal splits without considering future ones, ensuring high computational efficiency [5]. Beyond their accuracy and efficiency, oblique model trees are interpretable, with local models providing insights into the approximated function [4]. Compared to the more common axis-orthogonal model trees, their axis-oblique splits allow greater structural flexibility, reducing both bias error and the number of required local models [5], [6].

However, oblique model trees also have certain drawbacks. Since axis-oblique splits are more complex than axis-orthogonal splits, these trees must address a tradeoff between increasing accuracy and decreasing interpretability and computational efficiency [5], [7]. The axis-oblique orientation (split direction) and location (splitting point) are crucial in this regard. As we show in Section 2, there are few methods that sufficiently address this tradeoff, which is the main reason why they are less common and the creation of axis-oblique splits is still noted as an open research question [5], [7], [8]. Moreover, many oblique model trees are discontinuous, which makes the application of the models (e.g. for process optimization or control) much more difficult [4]. Among the few existing methods, the Hierarchical Local Model Tree algorithm (HILOMOT) [9] provides the most notable advantages, as we outline in Section 2.

Due to overlapping local models, the trees generated by HILOMOT are continuous. In addition, the axis-oblique splits are determined by a very effective nonlinear optimization, resulting in accurate models. However, since a certain degree of overlap of the local models is required for successful optimization, HILOMOT struggles to adequately approximate less continuous functions. Moreover, the overlap has a regularization effect that is amplified in deeper layers of the tree, requiring HILOMOT to make a trade-off between larger overlaps for successful optimization and smaller overlaps to avoid underfitting [4]. Another drawback arises from the greedy construction of the tree. As with nearly all greedy algorithms, the splits are not always optimal for the entire global model [5]. When complex relationships are approximated, some previously executed splits must be corrected by additional splits, resulting in unnecessary local models and a decrease of accuracy. Furthermore, the axis-oblique split direction cannot be interpreted as it only reduces a loss function.

In our work, we improve HILOMOT through the following main contributions:

  1. We develop a methodology that identifies the direction of strongest curvature in a partition that cannot be approximated by the tree generated so far, providing a more interpretable split direction.

  2. We identify the necessary splitting points along that direction to best approximate the remaining nonlinearity in a partition by local affine models.

  3. We employ the split direction and the splitting points in HILOMOT for a semi-greedy splitting that is unaffected by the degree of overlap, does not require major correction splits, and provides an additional pre-pruning criterion.

2 Oblique model trees

Figure 1a displays the structure of an oblique model tree, approximating $f(\underline{u}) = 10\sin(\pi u_1 u_2)$ (Figure 1c) with M = 2 inputs u1 and u2 gathered in the input vector $\underline{u} = [u_1\ u_2]^T$. The tree contains three splitting nodes $t_k$ with k ∈ {1, 3, 4}, which divide the input space into four partitions by their respective axis-oblique splitting function $\Psi_k$ and its complement $\tilde{\Psi}_k = 1 - \Psi_k$. Depending on the training algorithm, $\Psi_k$ is either a smoothed (e.g. a sigmoidal) function

(1) $\Psi_k(\underline{u}) = \dfrac{1}{1 + \exp\left(-\kappa\left(\underline{v}^T\underline{u} - c\right)\right)},$

as for the tree in Figure 1, or a crisp indicator function

(2) $\Psi_k(\underline{u}) = \begin{cases} 1 & \text{if } \underline{v}^T\underline{u} \geq c \\ 0 & \text{otherwise,} \end{cases}$

generating respectively a continuous or a discontinuous model. The steepness κ defines the smoothness, the splitting coefficients $\underline{v} = [v_1 \cdots v_M]^T$ the split direction, and the threshold c the splitting point of $\Psi_k$. Each of the four partitions is represented by a leaf $\tilde{t}_{k'}$ with k′ ∈ {2, 5, 6, 7}, containing a local model. Here, the local models are local affine functions $\hat{y}_{k'}(\underline{u}) = \beta_0 + \underline{\beta}^T\underline{u}$ with the offset β0 and the regression coefficients $\underline{\beta} = [\beta_1 \cdots \beta_M]^T$. Figure 1b shows the partitions of each local model. The global model output $\hat{y}(\underline{u})$ results from a weighted average of all local models and is shown in Figure 1d. The weightings are computed by multiplying $\Psi_k$ and $\tilde{\Psi}_k$ along the path of each leaf [4].
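As an illustration of Eq. (1) and the weighted averaging of overlapping local models, the following minimal Python sketch (using NumPy, with made-up parameters that are not taken from the paper) evaluates a single smoothed split and blends two hypothetical local affine models into a continuous output:

```python
import numpy as np

def split_fn(u, v, c, kappa):
    """Smoothed sigmoidal splitting function Psi_k(u), cf. Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-kappa * (u @ v - c)))

def affine(u, beta0, beta):
    """Local affine model y_k(u) = beta0 + beta^T u."""
    return beta0 + u @ beta

# Illustrative parameters (not taken from the paper)
v, c, kappa = np.array([0.7, 0.7]), 0.7, 20.0
u = np.array([[0.2, 0.3], [0.8, 0.9]])            # two query points

psi = split_fn(u, v, c, kappa)                    # weight of the "right" child
y_left  = affine(u, 0.5, np.array([ 2.0, 1.0]))   # local model of left leaf
y_right = affine(u, 4.0, np.array([-1.0, 0.5]))   # local model of right leaf

# Continuous global output: weighted average of the overlapping local models
y_hat = (1.0 - psi) * y_left + psi * y_right
print(psi, y_hat)
```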

Figure 1: Oblique model tree approximating $f(\underline{u}) = 10\sin(\pi u_1 u_2)$.

Only a few oblique model tree algorithms exist in the literature. Table 1 summarizes them chronologically, classifying them by construction strategy (greedy or non-greedy), global model output (continuous or discontinuous), and the methods used to generate v ̲ and c, while also outlining their major drawbacks.

Table 1:

Overview on existing algorithms to generate oblique model trees.

Ref. Non-greedy Continuous Split direction Splitting point Major drawbacks (except greediness and discontinuity)
[10] Smoothed Hinging Hyperplane (HH) [11] fitted by Conjugate Gradient Algorithm [12] Splits and local models require same inputs; tradeoff between overlap and underfitting
[13] First Principal Hessian Direction (PHD) [14] Hinge-identification in residuals along first PHD PHD requires multivariate normal distribution and only captures concave/convex curvatures
[15] Linear Discriminant Analysis (LDA) on 2 Gaussian clusters Quadratic Discriminant Analysis on LDA’s projection Nonlinearity of approximated function not taken into account
[16] Linear regression Split which best reduces variance (brute force) Nonlinearity of approximated function not taken into account
[9] Separable Nonlinear Least Squares optimization [17] Tradeoff between overlap and underfitting; split directions are not interpretable
[18] Smoothed HH fitted by restricted Fuzzy c-Regression [19] Same drawbacks as for [10]
[20] Sparsity-regularized generalized logistic regression problem optimized by Orthogonal Matching Pursuit [21] Both the local models and the split directions are not interpretable
[22] Optimization of randomly chosen hyperplane (called long dipole) and splitting point by Evolutionary Algorithm Optimization has many hyperparameters; split directions are not interpretable
[6] First direction of refined Gradient Outer Product [23] Residual-based averaging function of SUPPORT [24] Refined Gradient Outer Product has high computational complexity on large scale data
[25] Formulation of tree as nonlinear continuous optimization problem, solved by Sequential Least Squares Programming [26] Optimization has many hyperparameters; parametric due to pre-selected symmetrical tree structure

The table shows that most algorithms (6 out of 10) are greedy and that non-greedy algorithms have only become relevant in recent years. Furthermore, 6 out of 10 algorithms determine $\underline{v}$ and c by a simultaneous optimization. This often results in axis-oblique splits that improve a loss function but lack interpretability in terms of split direction. Among the remaining 4 algorithms that determine $\underline{v}$ and c separately, PHDRT [13] and COMT [6] provide the most advanced methods for explicitly capturing and interpreting the nonlinear behavior of f (which is crucial for good splits [4]). Both methods apply supervised dimension reduction to identify the direction of strongest curvature in the response surface. Apart from the listed drawbacks, however, both algorithms generate discontinuous models, limiting their applicability. Only 4 listed algorithms ensure continuity, with HILOMOT [9] offering the greatest advantages. Unlike the non-greedy methods of Czajkowski and Kretowski [22] and Blanquero et al. [25], HILOMOT uses fewer hyperparameters and is more efficient. Furthermore, HILOMOT is more flexible in model selection than the Hinging Hyperplane Tree [10], as its local models are not limited to affine forms and may use inputs different from those of $\Psi_k$. HILOMOT thus provides a solid foundation to explore open research questions in oblique model trees, as done in this work.

3 Proposed method

The basic concept of our novel splitting method is illustrated in Figure 2 and partially motivated by PHDRT and COMT. Since $\underline{v}$ is generated greedily, while c is selected by a non-greedy method, our method is semi-greedy.

Figure 2: Basic idea of how we generate a splitting function in HILOMOT from a given set of samples and residuals in sequential steps.

First, we determine the average direction along which the local gradients $\underline{\nabla}$ of the residuals e vary most within a partition. This direction captures the strongest curvature and is used as the split direction. For this purpose, we developed a more efficient and accurate variant of the refined Gradient Outer Product (rGOP) by Schöne and Kohlhase [6], described in detail in Section 3.1. Then, we perform a non-greedy splitting point selection along $\underline{v}$ to reduce the short-sightedness of greedy algorithms (Section 3.2). To this end, we project the residuals onto the determined direction $p = \underline{v}^T\underline{u}$ and approximate them by a one-dimensional Additive Hinging Hyperplane Model (AHHM) [27], fitted non-greedily. The L hinges of the AHHM provide L optimal candidate splitting points $\mathcal{C} = \{c_1, \ldots, c_L\}$ to decompose the curvature orthogonal to $\underline{v}$ into more linear segments. From these candidates, we select the one that splits the data into two sub-partitions with the most balanced number of samples, without considering any performance metrics. This ensures enough data in the child partitions for further operations (e.g. splitting or local model estimation) while creating a shallower tree, reducing the regularization caused by the overlaps.

To generate $\Psi_k$, we use prior knowledge from the split direction and the AHHM of the parent node to generate additional candidate pairs of $\underline{v}$ and $\mathcal{C}$. Among all pairs, we select the one whose AHHM identifies the most significant curvature. If none detects curvature, further partitioning is unnecessary (pre-pruning). We describe this splitting process in more detail in Section 3.3.

3.1 Curvature-oriented split direction estimation

Algorithm 1 shows how we determine the direction of strongest curvature in an iterative procedure. This procedure is primarily based on locally Weighted Least Squares (WLS) [23] to estimate the gradients, and Principal Component Analysis (PCA) to identify the directions with greatest variance.

Algorithm 1:

Determination of the curvature-oriented split direction in the partition of $t_k$.

The data set $\mathcal{D}_k = \{\underline{X}_k, \underline{e}_k\}$ contains the N samples $\underline{x}^T(n) = [x_1(n) \cdots x_M(n)]$ with n = 1, …, N of the training data within the partition of node $t_k$, forming the N × M regression matrix

(3) $\underline{X}_k = \begin{bmatrix} \underline{x}^T(1) \\ \vdots \\ \underline{x}^T(N) \end{bmatrix} = \begin{bmatrix} x_1(1) & \cdots & x_M(1) \\ \vdots & \ddots & \vdots \\ x_1(N) & \cdots & x_M(N) \end{bmatrix},$

and the corresponding residuals

(4) $\underline{e}_k = \begin{bmatrix} e(1) \\ \vdots \\ e(N) \end{bmatrix} = \begin{bmatrix} y(1) - \hat{y}\left(\underline{x}(1)\right) \\ \vdots \\ y(N) - \hat{y}\left(\underline{x}(N)\right) \end{bmatrix}$

from the tree $\hat{y}$ trained so far. We use the residual e(n) instead of the response y(n) for WLS to ensure that the gradients reflect only residual curvature not yet captured by the current tree. Furthermore, the bandwidth b of the kernel that determines the WLS weights in (5) and a refinement matrix $\underline{R}$ are preselected. Unless a known split direction is incorporated as prior knowledge (see Section 3.3), $\underline{R}$ is the identity matrix $\underline{I}$. Before the algorithm's iterative procedure, the samples in $\underline{X}_k$ are standardized with the mean values $\underline{\bar{x}} = [\bar{x}_1 \cdots \bar{x}_M]^T$ and the standard deviations $\underline{\sigma} = [\sigma_1 \cdots \sigma_M]^T$ of the whole data set $\mathcal{D}$ to $\underline{z}^T(n) = \left[\frac{x_1(n) - \bar{x}_1}{\sigma_1} \cdots \frac{x_M(n) - \bar{x}_M}{\sigma_M}\right]$, forming the normalized regression matrix $\underline{Z}$.

Our algorithm iterates through four steps, which we explain in the following, until either $\underline{v}$ has converged or a maximum of j = J iterations is reached.

3.1.1 Incorporate knowledge from identified directions

At each iteration, the PCA provides the d* most significant principal components (eigendirections) $\underline{R} = [\underline{\gamma}_1 \cdots \underline{\gamma}_{d^*}]$ that contain directional information about curvatures. The choice of d* is detailed in Eq. (8). From iteration 2 onward, the directions of iteration j − 1 are used to generate a lower-dimensional subspace $\underline{Z}'$ for a refined estimation of the WLS weights [23]. To this end, $\underline{Z}$ is projected onto $\underline{R}$ by the transformation $\underline{Z}' = \underline{Z}\,\underline{R}$, which makes the estimation of the weights more sensitive to important inputs and weakens the curse of dimensionality.

3.1.2 Estimate local gradients of residuals

To estimate the local gradients, we perform local linear regression by WLS around each sample z ̲ ( n ) in Z ̲ .

For each of the N local regressions, a vector $\underline{w}(n) = [w_1(n) \cdots w_N(n)]$ of N weights $w_i(n)$ with i = 1, …, N is determined by the discontinuous cosine kernel

(5) $w_i(n) = \begin{cases} \cos\left(\dfrac{\pi D_{\mathrm{Mh}}}{2b}\right) & \text{if } D_{\mathrm{Mh}} \leq b \\ 0 & \text{otherwise} \end{cases}$

with $\underline{z}'(n)$ as the center of the corresponding kernel. The distance $D_{\mathrm{Mh}} = \frac{1}{d^*}\left\|\underline{z}'(i) - \underline{z}'(n)\right\|_1$ is determined on the refined samples $\underline{z}'(i)$ and $\underline{z}'(n)$ using the Manhattan distance, and is normalized by d* to avoid its effect on the kernel width. This avoids singularities in high-dimensional input spaces caused by too narrow kernels. The advantage of a discontinuous kernel over a continuous kernel (e.g., a Gaussian) is that samples farther away from $\underline{z}'(n)$ are not weighted. Therefore, WLS can be performed on a subset of $\underline{Z}$. Consequently, local gradients are estimated more efficiently than in rGOP [6], reducing the dominating complexity of our semi-greedy splitting from $O(M^2N^2)$ to $O(M^2\tilde{N}N)$. $\tilde{N}$ is the average number of non-zero weights, which decreases with b. To reliably capture local structures, the kernels must be narrow (small b), so that $\tilde{N} \ll N$ holds.

After the weights are determined, our algorithm approximates the residuals in the local region of each sample by linear regression using WLS

(6) $\begin{bmatrix} \nabla_0(n) \\ \underline{\nabla}(n) \end{bmatrix} = \left(\underline{\tilde{Z}}^T \underline{W}(n)\, \underline{\tilde{Z}}\right)^{-1} \underline{\tilde{Z}}^T \underline{W}(n)\, \underline{e}_k$

with the augmented matrix $\underline{\tilde{Z}} = [\underline{1}\ \ \underline{Z}]$ and the weight matrix $\underline{W}(n) = \mathrm{diag}\left(\underline{w}(n)\right)$. In this context, the slopes $\underline{\nabla}(n) = [\nabla_1(n) \cdots \nabla_M(n)]^T$ of the linear regression represent an estimate of the local gradient around $\underline{z}(n)$.
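The following Python sketch outlines one pass of this gradient estimation under the simplifying assumption $\underline{R} = \underline{I}$, so that the refined samples equal the standardized ones; the function name and interface are hypothetical and not taken from the LMN-Tool implementation:

```python
import numpy as np

def local_gradients(Z, e, b):
    """Estimate local residual gradients by WLS, cf. Eqs. (5)-(6).

    Z : (N, M) standardized samples (here also used as the refined samples,
        i.e. R = I), e : (N,) residuals, b : kernel bandwidth.
    Returns an (N, M) matrix of local gradient estimates.
    """
    N, M = Z.shape
    Zt = np.hstack([np.ones((N, 1)), Z])          # augmented regression matrix
    grads = np.zeros((N, M))
    d_star = M                                    # dimension of the refined space
    for n in range(N):
        # discontinuous cosine kernel around sample n (Manhattan distance / d*)
        D = np.sum(np.abs(Z - Z[n]), axis=1) / d_star
        w = np.where(D <= b, np.cos(np.pi * D / (2.0 * b)), 0.0)
        W = np.diag(w)
        # weighted least squares: offset and slopes of the local affine fit
        theta = np.linalg.lstsq(Zt.T @ W @ Zt, Zt.T @ W @ e, rcond=None)[0]
        grads[n] = theta[1:]                      # slopes = local gradient estimate
    return grads
```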

3.1.3 Identify significant directions

From the local gradients, the most significant directions with the highest variation are identified using a PCA. To this end, the gradients are merged into an N × M matrix, from which the covariance matrix

(7) $\underline{\Sigma} = \mathrm{Cov}\left(\begin{bmatrix} \underline{\nabla}^T(1) \\ \vdots \\ \underline{\nabla}^T(N) \end{bmatrix}\right)$

is constructed. From this matrix, the eigendirections $\underline{\Gamma} = [\underline{\gamma}_1 \cdots \underline{\gamma}_M]$ and eigenvalues $\underline{\lambda} = [\lambda_1 \cdots \lambda_M]$ are extracted by an eigenvalue decomposition. The eigendirections are sorted in descending order according to their respective eigenvalues, and reduced to the

(8) $d^* = \min\left(d_{\mathrm{pre}},\ d_\alpha\right)$

most significant eigendirections $[\underline{\gamma}_1 \cdots \underline{\gamma}_{d^*}]$, with the number of selected directions from the previous iteration $d_{\mathrm{pre}}$ and the number of sorted eigendirections

(9) $d_\alpha = \underset{d}{\arg\min}\left\{ d \;\middle|\; \sum_{i=1}^{d} \lambda_i \geq \xi^{\lceil j/\tau \rceil} \sum_{i=1}^{M} \lambda_i \right\},$

that cover a proportion of $\xi^{\lceil j/\tau \rceil}$ of the total variance, with ξ ∈ (0, 1) and $\tau \in \mathbb{N}$. Raising ξ to the power of ⌈j/τ⌉ ensures that, as the iterations progress, fewer eigendirections are selected and the refinement is concentrated on the most significant direction. Empirically, ξ = 0.85 and τ = 2 yielded good results over many synthetic functions, so no further tuning is needed. In conjunction with the limitation that d* cannot be increased, our algorithm converges significantly faster than rGOP [6].
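A minimal NumPy sketch of this direction-selection step (Eqs. (7)–(9)) could look as follows; the function name and default values are illustrative assumptions:

```python
import numpy as np

def significant_directions(grads, j, xi=0.85, tau=2, d_pre=None):
    """PCA of the local gradients, cf. Eqs. (7)-(9): returns the d* most
    significant eigendirections (as columns) and their eigenvalues."""
    Sigma = np.cov(grads, rowvar=False)                  # covariance of gradients
    lam, Gamma = np.linalg.eigh(Sigma)                   # eigenvalues (ascending)
    order = np.argsort(lam)[::-1]                        # sort descending
    lam, Gamma = lam[order], Gamma[:, order]
    # smallest d whose leading eigenvalues cover xi^ceil(j/tau) of the variance
    ratio = np.cumsum(lam) / np.sum(lam)
    d_alpha = int(np.argmax(ratio >= xi ** np.ceil(j / tau))) + 1
    d_star = d_alpha if d_pre is None else min(d_pre, d_alpha)   # Eq. (8)
    return Gamma[:, :d_star], lam[:d_star]
```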

3.1.4 Selection of splitting coefficients

As soon as the refinement is reduced to a single direction (d* = 1), we select the eigendirection $\underline{\gamma}_1$ with the greatest eigenvalue λ* as the most significant split direction. Since λ* has an upper bound, convergence of $\underline{v}$ is guaranteed. To construct $\underline{v}$ from this direction, the coefficients of $\underline{\gamma}_1$ must be rescaled to the original input space by $\underline{v} = \underline{\gamma}_1 \odot \underline{\sigma}_{\mathrm{inv}}$ (element-wise product), using the inverse standard deviations $\underline{\sigma}_{\mathrm{inv}} = \left[\frac{1}{\sigma_1} \cdots \frac{1}{\sigma_M}\right]^T$ of the training data set.

3.2 Hinge-based splitting point identification

To identify a splitting point along v ̲ that is advantageous for future splits (non-greedy), the entire structure of the curvature must be considered. For complex nonlinear curvature, both the number and placement of local affine models must be determined while accounting for all local models. Estimating such a nonparametric local structure requires non-greedy algorithms that adapt the full model (local models and their transitions) to nonlinear characteristics. These transitions then define suitable splitting points to recursively approximate the nonlinear structure along the identified direction, minimizing the need for further correction splits. This non-greedy splitting can be characterized as a one-dimensional dynamic lookahead search [28].

Non-greedy algorithms are much more computationally expensive than greedy algorithms. However, since our method adapts in a one-dimensional search space, the computational impact remains limited. The Hinging Hyperplane (HH) algorithm of Breiman [29] and its modification [27] offer a good trade-off between model quality and efficiency. The latter generates an AHHM

(10) $\hat{h}(\underline{u}) = \sum_{l=1}^{L} \hat{h}_l\left(\underline{\theta}_l, \underline{u}\right)$

of L superimposed HHs $\hat{h}_l$ by an iterative procedure. An HH is defined by a 2(M + 1) × 1 parameter vector $\underline{\theta}_l = \left[\underline{\theta}_l^{+T}\ \ \underline{\theta}_l^{-T}\right]^T$, with $\underline{\theta}_l^{+}$ and $\underline{\theta}_l^{-}$ forming two affine models. Their lines or (hyper)planes are joined together at $[1\ \underline{u}^T]\left(\underline{\theta}_l^{+} - \underline{\theta}_l^{-}\right) = 0$, forming a hinge. This hinge serves as a transition boundary between $\underline{\theta}_l^{+}$ and $\underline{\theta}_l^{-}$ to define either a concave or a convex regression function

(11) $\hat{h}_l\left(\underline{\theta}_l, \underline{u}\right) = \begin{cases} \min\left([1\ \underline{u}^T]\,\underline{\theta}_l^{+},\ [1\ \underline{u}^T]\,\underline{\theta}_l^{-}\right) \\ \max\left([1\ \underline{u}^T]\,\underline{\theta}_l^{+},\ [1\ \underline{u}^T]\,\underline{\theta}_l^{-}\right), \end{cases}$

whose surface looks like the shape of an open book [27].
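For a one-dimensional projected coordinate p, a hinge and an AHHM can be evaluated as in the following sketch; the helper names and parameter values are hypothetical and only meant to illustrate Eqs. (10) and (11):

```python
import numpy as np

def hinge_eval(theta_plus, theta_minus, p, convex=True):
    """One-dimensional hinging hyperplane h_l(theta_l, p), cf. Eq. (11):
    the max (convex) or min (concave) of two affine segments in p."""
    a = theta_plus[0] + theta_plus[1] * p
    b = theta_minus[0] + theta_minus[1] * p
    return np.maximum(a, b) if convex else np.minimum(a, b)

def ahhm_eval(hinges, p):
    """Additive hinging hyperplane model, cf. Eq. (10): sum of L hinges.
    `hinges` is a list of (theta_plus, theta_minus, convex) triples."""
    return sum(hinge_eval(tp, tm, p, cv) for tp, tm, cv in hinges)

# Illustrative AHHM with two hinges on the projected coordinate p
p = np.linspace(-1.0, 1.0, 5)
hinges = [((0.0, 1.0), (0.0, -1.0), True),    # convex kink at p = 0
          ((1.0, 0.5), (0.8, 1.5), False)]    # concave kink at p = 0.2
print(ahhm_eval(hinges, p))
```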

Algorithm 2 shows how we estimate an optimal AHHM $\hat{h}^*(p)$ from the projected data $\underline{p} = \underline{X}_k \underline{v}$ and $\underline{e}_k$ to generate $\mathcal{C} = \{c_l\}_{l=1}^{L}$. In order to incorporate prior knowledge, an initial AHHM $\hat{h}_{\mathrm{Init}}(p)$ can be selected.

Algorithm 2:

Determination of candidate splitting points along the identified split direction in the partition of $t_k$.

To determine the appropriate number of HHs for the underlying nonlinear structure, our algorithm 1) incrementally extends $\hat{h}$ by an additional HH and 2) measures the quality of each extended $\hat{h}$ by the corrected Akaike Information Criterion (AIC) [4]

(12) $\mathrm{AIC}_{\mathrm{C}} = N \ln\left(\dfrac{RSS}{N}\right) + 2o + \dfrac{2o\left(o + 1\right)}{N - o - 1}$

with the residual sum of squares RSS and the model complexity o = 2L. The additional HH is initialized with its hinge placed at the largest gap between existing hinges. Once added to h ̂ , all HHs are sequentially adjusted using a backfitting algorithm [4] until convergence, ensuring each HH corrects the residual error

(13) $\underline{e}_l = \underline{e}_k - \sum_{i=1,\, i \neq l}^{L} \hat{h}_i\left(\underline{\theta}_i, \underline{p}\right)$

of the others, with $\hat{h}_i\left(\underline{\theta}_i, \underline{p}\right) = \left[\hat{h}_i(\underline{\theta}_i, p_1) \cdots \hat{h}_i(\underline{\theta}_i, p_N)\right]^T$. To this end, we apply the damped Newton algorithm from [27] to $\underline{p}$ and $\underline{e}_l$. Convergence was proven for the backfitting algorithm by Ansley and Kohn [30] and for the damped Newton algorithm by Pucar and Sjoberg [27]. After $\hat{h}$ has converged, we compare its quality with the quality $\mathrm{AIC}_{\mathrm{C}}^*$ of the best AHHM so far and select a new one if necessary. The expansion process stops if no improvement is observed in two consecutive steps.

To verify that p contains sufficient curvature for splitting (pre-pruning criterion), we estimate an affine model $\hat{y}_{\mathrm{am}}(p)$ on $\underline{p}$ and $\underline{e}_k$, and calculate its quality $\mathrm{AIC}_{\mathrm{C}}^{\mathrm{am}}$ by Eq. (12) as a reference of significance. For $\hat{y}_{\mathrm{am}}$, the complexity is o = 1. If $\mathrm{AIC}_{\mathrm{C}}^* < \mathrm{AIC}_{\mathrm{C}}^{\mathrm{am}}$ holds, $\hat{h}^*$ has detected a significant curvature and candidate splitting points can be extracted from $\hat{h}^*$ by

(14) $c_l = \dfrac{\theta_{0,l}^{-} - \theta_{0,l}^{+}}{\theta_{1,l}^{+} - \theta_{1,l}^{-}}.$
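The quality measure of Eq. (12), the hinge extraction of Eq. (14), and the pre-pruning comparison can be sketched as follows; the helper names are hypothetical and the RSS values are assumed to be computed elsewhere:

```python
import numpy as np

def aic_c(rss, N, o):
    """Corrected Akaike Information Criterion, cf. Eq. (12)."""
    return N * np.log(rss / N) + 2 * o + 2 * o * (o + 1) / (N - o - 1)

def candidate_splits(hinges):
    """Candidate splitting points of a fitted 1-D AHHM, cf. Eq. (14): c_l is
    where the two affine segments (offset, slope) of hinge l intersect."""
    return [(tm[0] - tp[0]) / (tp[1] - tm[1]) for tp, tm in hinges]

def is_significant(rss_ahhm, L, rss_affine, N):
    """Pre-pruning check: the AHHM (o = 2L) must beat a plain affine fit
    (o = 1) in terms of AIC_C."""
    return aic_c(rss_ahhm, N, 2 * L) < aic_c(rss_affine, N, 1)
```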

3.3 Integration into HILOMOT

Figure 3 illustrates our splitting method, enhancing HILOMOT to a Semi-Greedy Model Tree algorithm (SG-MOT). To identify the most significant curvature in $\mathcal{D}_k$, SG-MOT generates up to three candidate pairs $(\underline{v}, \mathcal{C})$. The primary pair $(\underline{v}_1, \mathcal{C}_1)$ is obtained by executing Algorithm 1 with $\underline{R} = \underline{I}$ and Algorithm 2 without $\hat{h}_{\mathrm{Init}}(p)$. For nodes with a parent (k > 1), we generate a second pair $(\underline{v}_2, \mathcal{C}_2)$ by incorporating the parent's split direction $\underline{v}_P$ as prior knowledge. Unlike for the first pair, we run Algorithm 1 with $\underline{R} = \underline{v}_P \odot \underline{\sigma}$, transforming $\underline{Z}$ into a one-dimensional subspace (d* = 1) for the refined WLS weight estimation. If hinges of the parent's AHHM $\hat{h}_P$ exist in $t_k$, we incorporate additional prior knowledge about this curvature to create a third pair $(\underline{v}_P, \mathcal{C}_3)$. To this end, we retain $\underline{v}_P$ and initialize Algorithm 2 with the submodel $\hat{h}_{\mathrm{Sub}}$, containing the L′ < L HHs within $t_k$. Since Algorithm 1 converges significantly faster with $\underline{R} = \underline{v}_P \odot \underline{\sigma}$ than with $\underline{R} = \underline{I}$, and only Algorithm 2 is needed for the last pair, $(\underline{v}_2, \mathcal{C}_2)$ and $(\underline{v}_P, \mathcal{C}_3)$ can be computed very efficiently.

Figure 3: Process of generating a splitting function in SG-MOT from up to three candidate pairs of $\underline{v}$ and $\mathcal{C}$.

If no candidate pair identifies significant curvature, no splitting is performed and the node becomes a leaf $\tilde{t}_k$. Otherwise, from all pairs that have identified a significant curvature, we select the pair $(\underline{v}^*, \mathcal{C}^*)$ whose AHHM best fits the curvature in $\mathcal{D}_k$. To generate the final splitting function $\Psi_k$ from $(\underline{v}^*, \mathcal{C}^*)$ by Eq. (1), we select the candidate splitting point $c_l^*$ out of $\mathcal{C}^*$ that splits $\mathcal{D}_k$ as evenly as possible.
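A minimal sketch of this balance-based selection of $c_l^*$ (with a hypothetical helper name) is:

```python
import numpy as np

def most_balanced_split(p, candidates):
    """From the candidate splitting points C*, pick the one that divides the
    projected samples p = X_k v* most evenly into the two child partitions."""
    p = np.asarray(p)
    balance = [abs(np.sum(p <= c) - np.sum(p > c)) for c in candidates]
    return candidates[int(np.argmin(balance))]

# e.g. most_balanced_split([0.1, 0.2, 0.4, 0.7, 0.9], [0.3, 0.8]) returns 0.3
```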

4 Experimental analysis

To demonstrate the advantages of SG-MOT in split interpretability, model accuracy, and model complexity, we conduct an extensive experimental analysis on synthetic data. We first illustrate in Section 4.1 the interpretability of a tree generated by SG-MOT. We then compare SG-MOT with the original HILOMOT and restrict our benchmarking to other related nonparametric methods for a fair comparison. Specifically, we generate axis-orthogonal model trees using the well-established algorithms GUIDE [3] and M5′ [31]. A boosting ensemble (BE) of trees, known for its strong performance, serves as an additional accuracy baseline [32]. Further details on our experimental setup and results are provided in Sections 4.2 and 4.3, respectively. Finally, we discuss the results in Section 4.4.

4.1 Illustrative example

Figure 4 shows the first four splits of SG-MOT to approximate $f_1(\underline{u}) = 10\sin(\pi u_1 u_2) + 20(u_3 - 0.5)^2 + 10u_4 + 5u_5$ from $N_{\mathrm{tr}} = 100$ samples. The function consists of two nonlinear terms, $10\sin(\pi u_1 u_2)$ and $20(u_3 - 0.5)^2$, leading to a nonlinearly structured response surface. This nonlinear structure is oriented along two directions: univariate along u3 and bivariate along u1 and u2. More details about f1 are listed in Table 2.

Figure 4: First four splits of SG-MOT to approximate $f_1(\underline{u}) = 10\sin(\pi u_1 u_2) + 20(u_3 - 0.5)^2 + 10u_4 + 5u_5$. For all four splits, our method identified a splitting direction $p = \underline{v}^T\underline{u}$ that targets the nonlinear effects of $10\sin(\pi u_1 u_2)$ and $20(u_3 - 0.5)^2$.

Table 2:

Functions and data distributions to generate synthetic data for the experimental analysis.

Function $f : \mathbb{R}^M \to \mathbb{R}$ | Ref. | M | Distribution of $\underline{u}$
$f_1(\underline{u}) = 10\sin(\pi u_1 u_2) + 20(u_3 - 0.5)^2 + 10u_4 + 5u_5$ | [33] | 5 | $\underline{u} \sim U(0,1)$
$f_2(\underline{u}) = 10\sin(\pi u_1 u_2) + 20(u_3 - 0.5)^2 + 10u_4 + 5u_5$ | [6] | 5 | $\{u_1, u_2\} \sim U(0,2)$, $\{u_3, u_4, u_5\} \sim U(0,1)$
$f_3(\underline{u}) = e^{(3u_1 u_2 + u_3)/3} + 2\sin\left(\pi(u_4 - u_5)\right) + 2\log(2u_6 + u_7)/\left(3(2u_6 + u_7)\right)$ | [34] | 7 | $\underline{u} \sim U(0.01, 1)$
$f_4(\underline{u}) = 0.1 e^{4u_1} + 4/\left(1 + e^{-20(u_2 - 0.5)}\right) + 3u_3 + 2u_4 + u_5 + 0\sum_{i=6}^{10} u_i$ | [33] | 10 | $\underline{u} \sim U(0,1)$
$f_5(\underline{u}) = \begin{cases} \underline{\zeta}_1^T\underline{u} - 3\underline{\zeta}_2^T\underline{u} + 1 & \text{if } \underline{\zeta}_2^T\underline{u} \geq 0 \ \&\ 3\underline{\zeta}_1^T\underline{u} + \underline{\zeta}_2^T\underline{u} \geq 0 \\ \underline{\zeta}_1^T\underline{u} + 3\underline{\zeta}_2^T\underline{u} + 1 & \text{if } \underline{\zeta}_2^T\underline{u} < 0 \ \&\ 3\underline{\zeta}_1^T\underline{u} - \underline{\zeta}_2^T\underline{u} \geq 0 \\ 2\underline{\zeta}_1^T\underline{u} + 1 & \text{if } 3\underline{\zeta}_1^T\underline{u} + \underline{\zeta}_2^T\underline{u} < 0 \ \&\ 3\underline{\zeta}_1^T\underline{u} - \underline{\zeta}_2^T\underline{u} < 0 \end{cases}$ | [13] | 10 | $\underline{u} \sim U(-1,1)$
with $\underline{\zeta}_1 = [1\ 1\ 1\ 1\ 1\ 0\ 0\ 0\ 0\ 0]^T$ and $\underline{\zeta}_2 = [0\ 0\ 0\ 0\ 0\ 1\ 1\ 1\ 1\ 1]^T$
$f_6(\underline{u}) = e^{\sin\left((0.9(u_1 + 0.48))^{10}\right)} + u_2 u_3 + u_4 + 0\sum_{i=5}^{6} u_i$ | [35] | 6 | $\underline{u} \sim U(0,1)$
$f_7(\underline{u}) = 4\left(u_1 - 2 + 8u_2 - 8u_2^2\right)^2 + (3 - 4u_2)^2 + 16\sqrt{u_3 + 1}\,(2u_3 - 1)^2 + \sum_{i=4}^{8} i \ln\left(\sum_{j=3}^{i} u_j\right)$ | [36] | 8 | $\underline{u} \sim U(0,1)$
$f_8(\underline{u}) = \frac{5u_{12}}{1 + u_1} + 5(u_4 - u_{20})^2 + u_5 + 40u_{19}^3 + 5u_{19} + 0.05u_2 + 0.08u_3 - 0.03u_6 + 0.03u_7 - 0.09u_9 - 0.01u_{10} - 0.07u_{11} + 0.25u_{13}^2 - 0.04u_{14} + 0.06u_{15} - 0.01u_{17} - 0.03u_{18}$ | [37] | 20 | $\underline{u} \sim U(-0.5, 0.5)$
$f_9(\underline{u}) = \left[\left(\sum_{i=1}^{4} u_i \cos\left(\sum_{j=5}^{i+4} u_j\right)\right)^2 + \left(\sum_{i=1}^{4} u_i \sin\left(\sum_{j=5}^{i+4} u_j\right)\right)^2\right]^{0.5}$ | [38] | 8 | $u_i \sim U(0,1)$, $u_j \sim U(0, 2\pi)$
$f_{10}(\underline{u}) = \frac{1}{6}\left[\left(30 + 5u_1\sin(5u_1)\right)\left(4 + e^{-5u_2}\right) - 100\right]$ | [39] | 2 | $\underline{u} \sim U(0,1)$
$f_{11}(\underline{u}) = \left(u_2 - \frac{5.1}{4\pi^2}u_1^2 + \frac{5}{\pi}u_1 - 6\right)^2 + \left(10 - \frac{10}{8\pi}\right)\cos(u_1) + 10$ | [40] | 2 | $u_1 \sim U(-5, 10)$, $u_2 \sim U(0, 15)$

To effectively approximate f1 by local affine models, the input space must be partitioned axis-orthogonally in u3 and axis-obliquely in the subspace spanned by u1 and u2. Moreover, the coefficients $v_4^*$ and $v_5^*$ in $\underline{v}^*$ should be close to zero to allow interpretation and visualization of the nonlinear structure.

When generating the tree, SG-MOT sequentially selects the node with the largest error for splitting until a termination criterion is met. It begins with the root t1, where our method generates an almost axis-orthogonal split in u3. Projecting the residuals onto this direction reveals the nonlinear quadratic effect of u3. In this one-dimensional space, a single splitting point is identified, dividing the samples almost evenly between t2 and t3. The next split occurs in t3, where SG-MOT detects the nonlinear sinusoidal structure along u1 and u2, mirrored at the origin of p due to the negative signs of $v_1^*$ and $v_2^*$. The coefficients $v_3^*$, $v_4^*$ and $v_5^*$ are close to zero. Contrary to the splitting in t1, the AHHM here has two hinges, of which the right one is chosen as c*. The splitting along u1 and u2 is continued in t4, where our method uses $\underline{v}_P$ and $\hat{h}_{\mathrm{Sub}}$ (candidate pair 3) to generate the split. The fourth split is made in t2, which contains data similar to t3, leading SG-MOT to perform a comparable split.

Based on these splits, we can recognize that u1, u2 and u3 have nonlinear effects on f1 and that u1 and u2 interact with each other. With HILOMOT, for example, the splitting direction in t1 would be p = 0.27u1 − 0.07u2 + 0.74u3 + 0.01u4 + 0.18u5. With such a split, the effects of u2 and u5 and interactions with u1 are misinterpreted. A projection onto this direction would not reveal any meaningful nonlinear structure.

4.2 Experimental setup

The data for our experimental analysis are generated by 11 different functions, which are listed in Table 2. Except for f6, f7 and f11, all functions are common test functions for evaluating the performance of regression models. The functions f6 and f7 have been used respectively to evaluate active and passive learning strategies with regression models, and f11 for global optimization methods. The function f2 differs from f1 only in terms of the definition range of u1 and u2, and f4 and f6 have inputs that do not affect the response.

For each function, we conduct three experiments with different training data sizes of Ntr = 200, 400, 800, resulting in 33 experiments. To ensure meaningful results, we average each experiment over 200 independently sampled training sets. In each run, all models are trained and evaluated on the same data, using a test set of Ntest = 2,000 labeled samples, randomly drawn from the same distribution as the training data (see Table 2). White Gaussian noise with mean ϵ ̄ = 0 and variance σ ϵ = 0.1σ y is added to the training labels, where σ y is the training response variance.
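As a sketch of this data generation for f1, assuming NumPy and interpreting the noise scale as 0.1 times the standard deviation of the training response (the wording "variance" above leaves this scaling slightly ambiguous):

```python
import numpy as np

rng = np.random.default_rng(0)

def f1(U):
    """Test function f1 from Table 2."""
    return (10 * np.sin(np.pi * U[:, 0] * U[:, 1]) + 20 * (U[:, 2] - 0.5) ** 2
            + 10 * U[:, 3] + 5 * U[:, 4])

N_tr, N_test, M = 200, 2000, 5
U_tr = rng.uniform(0, 1, (N_tr, M))        # training inputs, u ~ U(0, 1)
U_test = rng.uniform(0, 1, (N_test, M))    # test inputs from the same distribution
y_tr, y_test = f1(U_tr), f1(U_test)

# white Gaussian noise on the training labels only, scaled to the response spread
# (assumption: sigma_y taken as the standard deviation of the training response)
y_tr_noisy = y_tr + rng.normal(0.0, 0.1 * np.std(y_tr), N_tr)
```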

To evaluate the models, we calculate their average root mean square error

(15) $\bar{E} = \dfrac{1}{200} \sum_{q=1}^{200} \sqrt{\dfrac{RSS_q}{N_{\mathrm{test}}}}$

for each experiment over all independent runs. To clarify differences between the tested model trees and the baseline, we normalize $\bar{E}$ of SG-MOT, HILOMOT, GUIDE and M5′ by the averaged error of BE, yielding $\bar{E}_{\mathrm{SG}}$, $\bar{E}_{\mathrm{H}}$, $\bar{E}_{\mathrm{G}}$ and $\bar{E}_{\mathrm{M5'}}$, respectively. The model complexity is evaluated only for SG-MOT and HILOMOT, as Gama [16] has shown that axis-oblique trees are generally smaller than axis-orthogonal ones. For this purpose, we calculate the average number of local models $\bar{T}_{\mathrm{SG}}$ and $\bar{T}_{\mathrm{H}}$ of the trees. To verify whether one model performs significantly better than the others, we apply a Wilcoxon test to the results of each experiment.
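A small sketch of this normalized error measure (hypothetical helper name, assuming the per-run RSS values are available) follows:

```python
import numpy as np

def normalized_avg_rmse(rss_model, rss_baseline, n_test=2000):
    """Average RMSE over the 200 runs (Eq. (15)), normalized by the
    boosting-ensemble baseline as reported in Table 3."""
    e_model = np.mean(np.sqrt(np.asarray(rss_model) / n_test))
    e_base = np.mean(np.sqrt(np.asarray(rss_baseline) / n_test))
    return e_model / e_base
```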

We perform our benchmarking in MATLAB, where we use the LMN-Tool toolbox [41] for SG-MOT and HILOMOT, the M5PrimeLab toolbox [42] for M5′, and the Statistics and Machine Learning toolbox [43] for BE. GUIDE is generated by an external toolbox [44]. Except for GUIDE, where a suitable model structure is already generated by the integrated post-pruning and a cross validation (CV), we carry out the following hyperparameter tuning.

  • SG-MOT: Independent line search for steepness κ ∈ [1, 8] and bandwidth b ∈ [0.4, 2] using CV.

  • HILOMOT: Line search for κ ∈ [1, 8] using CV.

  • M5′: Grid search for aggressive pruning {true, false} and κ ∈ [0.5, 30] using CV.

  • BE: Integrated Bayesian Optimization for number of trees, size of trees and learning rate for shrinkage.

In order to achieve comparable model structures, we use local affine models without any stepwise selection in GUIDE and M5′. Apart from this, the default settings are used for the respective toolboxes.

4.3 Experimental results

Table 3 presents the results of our experimental analysis, in which the overall best results are obtained by SG-MOT. In 19 out of 33 experiments, $\bar{E}_{\mathrm{SG}}$ is significantly the lowest, with averages of 0.796, 0.760 and 0.701 across all experiments with Ntr = 200, 400 and 800, respectively. Compared to the baseline, the error is also lower in 25 experiments, indicated by values of $\bar{E}_{\mathrm{SG}} < 1$. The second best results are achieved by HILOMOT, with the significantly lowest error in 9 experiments and averages of 0.829, 0.806 and 0.716 for experiments with Ntr = 200, 400 and 800, respectively. Its accuracy relative to the baseline is similar to that of SG-MOT (better in 26 experiments).

Table 3:

Results for 33 experiments. The test error E ̄ of each method is normalized by the error of BE as a baseline and increasingly colored green or red if the error is lower or higher than BE. For SG-MOT and HILOMOT, the number of leaves T ̄ SG and T ̄ H is also listed. The best results are marked in italics and, if significant, in bold. The last line summarizes the results.

On average, SG-MOT produces 5.9 % lower errors and 10.1 % smaller trees than HILOMOT. In 24 out of 33 experiments, $\bar{T}_{\mathrm{SG}}$ is significantly smaller than $\bar{T}_{\mathrm{H}}$. For 13 of these 24 experiments, $\bar{E}_{\mathrm{SG}}$ is also lower. A more detailed comparison between SG-MOT and HILOMOT with respect to the different functions shows that SG-MOT achieves a significantly lower error for all experiments with f2, f3, f5 and f7. The differences are largest for f2 and f5, with averages of 23.4 % and 28.1 %, respectively, across the three experiments of each function. For f1, f4, f10 and f11, HILOMOT performs better in the experiments with fewer samples. For the experiments with more samples, however, HILOMOT is outperformed by SG-MOT. Over these four functions, $\bar{E}_{\mathrm{SG}}$ is on average 7.3 % higher for Ntr = 200, but on average 13.2 % lower for Ntr = 800 than $\bar{E}_{\mathrm{H}}$. This effect is particularly visible for f4: for Ntr = 200, $\bar{E}_{\mathrm{SG}}$ is 26.3 % higher than $\bar{E}_{\mathrm{H}}$, and for Ntr = 400 it is already slightly lower. Compared to the baseline, the performance of SG-MOT on f4 increases by 45.4 % from Ntr = 200 to Ntr = 400. Only for f8 and f9 does SG-MOT perform consistently worse, with the average of $\bar{E}_{\mathrm{SG}}$ being 3.9 % and 10.3 % higher than $\bar{E}_{\mathrm{H}}$ over the three experiments of f8 and f9, respectively. The biggest difference can be seen in experiment 24, where $\bar{E}_{\mathrm{SG}}$ is 25.7 % higher than $\bar{E}_{\mathrm{H}}$.

4.4 Discussion

As shown in Section 4.1, SG-MOT produces more interpretable splits, which can also become nearly axis-orthogonal if necessary. The split direction captures only inputs or interacting input groups with a nonlinear effect on the residuals, enabling identification and visualization of these effects in a one-dimensional plot. These advantages are not provided by HILOMOT.

Our findings in Section 4.3 further show that SG-MOT produces smaller trees than HILOMOT while maintaining comparable or even superior accuracy, enhancing the overall interpretability of an oblique tree [5]. SG-MOT usually outperforms its competitors in terms of accuracy across all sizes of Ntr. This is particularly relevant for Ntr = 200, as it demonstrates the efficacy of SG-MOT for applications on small data. Furthermore, it is remarkable that both SG-MOT and HILOMOT outperform BE, with average error reductions of 24.8 % and 21.7 %, respectively.

A detailed analysis confirms that the semi-greedy properties of SG-MOT overcome the short-sightedness of greedy algorithms, producing splits that require no correction. For example, for f2 our method identifies 3 optimal candidate splitting points along the direction in which the sinusoidal nonlinear structure of 10 sin(πu1u2) oscillates (Figure 5a). Additionally, our method’s independence from the overlap degree in Ψ yields two key advantages. The overlap can be adapted without any restrictions to the (dis)continuity of f and the required model complexity for a suitable fit. The first advantage becomes evident with f5, where SG-MOT effectively handles non-differentiable transitions (Figure 5b) by an almost crisp model output. The second advantage is confirmed by f1, f4, f10 and f11, for which SG-MOT becomes better than HILOMOT as N increases. With more training data, SG-MOT applies lower smoothing, reducing overlap regularization at deeper levels. This enables a more complex model, whereas HILOMOT tends to underfit.

Figure 5: Exemplary plots to explain the benchmarking results.

The results in Section 4.3 also reveal limitations of SG-MOT. Some nonlinear effects of individual inputs remain partially undetected, e.g. u19 in f8 and u2 in f4 for Ntr = 200. A specific challenge arises with f6, where the increasing oscillations along u1 (see Figure 5d) become inseparable in p = 0.99u1 − 0.04u2 − 0.04u3 + 0.11u4 + 0.09u5 + 0.06u6 once coefficients in $\underline{v}^*$ other than $v_1^*$ are nonzero (right side of Figure 5c). Additional disadvantages are the need to set the hyperparameter b and a 2–3 times longer computation time of SG-MOT compared to HILOMOT.

5 Conclusion and future work

In this work, we improved HILOMOT with a novel semi-greedy splitting method, yielding SG-MOT. Our approach first estimates the direction of the strongest curvature in a partition that cannot be approximated by the existing tree. To this end, local residual gradients are analyzed in an iterative procedure. Then, the residuals are projected onto this direction and fitted with a one-dimensional AHHM, whose hinges serve as optimal candidate splitting points. SG-MOT selects the candidate that best balances the partitioned data.

Experiments on synthetic data show that SG-MOT produces meaningful, interpretable splits that reveal and visualize nonlinear properties of the training data. Extensive benchmarking confirms that SG-MOT and HILOMOT significantly outperform established regression methods in accuracy. Compared to HILOMOT, SG-MOT yields more precise and smaller trees by generating better splits that require no significant corrective adjustments and by better matching the tree complexity to the data. This is achieved by decoupling the overlap from the splitting method and introducing an additional pre-pruning criterion.

For future work, we will address the limitations of our method. Since SG-MOT has a 2–3 times longer computation time than HILOMOT, we aim to improve the computational efficiency by an already developed but not yet validated subset selection method that limits the residuals to significant ones. We also seek to reduce the overhead of tuning an additional hyperparameter by introducing a self-tuning approach. While our method currently supports only axis-oblique trees with local affine models, an extension to more complex local models is in development. Finally, we aim to enhance accuracy and interpretability by enforcing sparsity in the axis-oblique splits, e.g. via an L1 regularization.


Corresponding authors: Marvin Schöne and Martin Kohlhase, Institute for Data Science Solutions, University of Applied Sciences and Arts Bielefeld, Interaktion 1, 33619 Bielefeld, Germany.

About the authors

Marvin Schöne

Marvin Schöne is a Ph.D. student at the University of Siegen and researcher at the Institute for Data Science Solutions. He received his master’s degree in electrical engineering from the University of Applied Sciences and Arts Bielefeld in 2019. His research topics are design of experiments, local model networks and active learning.

Martin Kohlhase

Martin Kohlhase is Professor of Control and Automation at the University of Applied Sciences and Arts Bielefeld and co-chair of the Institute for Data Science Solutions. He received his doctor’s degree in 2011 at the Technical University of Darmstadt. His research topics are nonlinear system identification, design of experiments, data-driven models and active learning.

Oliver Nelles

Oliver Nelles is Professor at the University of Siegen in the Department of Mechanical Engineering and chair of Automatic Control – Mechatronics. He received his doctor’s degree in 1999 at the Technical University of Darmstadt. His research topics are nonlinear system identification, design of experiments, metamodeling and local model networks.

  1. Research ethics: Not applicable.

  2. Informed consent: Not applicable.

  3. Author contributions: All authors have accepted responsibility for the entire content of this manuscript and approved its submission.

  4. Use of Large Language Models, AI and Machine Learning Tools: DeepL to improve the language.

  5. Conflict of interest: The authors state no conflict of interest.

  6. Research funding: This work was funded by the Ministry of Economic Affairs, Industry, Climate Action and Energy of the State of North Rhine-Westphalia within the project AI4ScaDa, grant number 005-2111-0015.

  7. Data availability: Not applicable.

References

[1] A. Dogan and D. Birant, “Machine learning and data mining in manufacturing,” Expert Syst. Appl., vol. 166, p. 114060, 2021. https://doi.org/10.1016/j.eswa.2020.114060.

[2] D. L. Banks, R. T. Olszewski, and R. A. Maxion, “Comparing methods for multivariate nonparametric regression,” Commun. Stat. Simulat. Comput., vol. 32, no. 2, pp. 541–571, 2003. https://doi.org/10.1081/sac-120017506.

[3] W.-Y. Loh, “Regression trees with unbiased variable selection and interaction detection,” Stat. Sin., vol. 12, no. 2, pp. 361–386, 2002.

[4] O. Nelles, Nonlinear System Identification: From Classical Approaches to Neural Networks, Fuzzy Models, and Gaussian Processes, Cham, Springer International Publishing, 2020. https://doi.org/10.1007/978-3-030-47439-3.

[5] V. G. Costa and C. E. Pedreira, “Recent advances in decision trees: an updated survey,” Artif. Intell. Rev., vol. 56, no. 5, pp. 4765–4800, 2023. https://doi.org/10.1007/s10462-022-10275-5.

[6] M. Schöne and M. Kohlhase, “Curvature-oriented splitting for multivariate model trees,” in 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, IEEE, 2021, pp. 1–9. https://doi.org/10.1109/SSCI50451.2021.9659858.

[7] W.-Y. Loh, “Fifty years of classification and regression trees,” Int. Stat. Rev., vol. 82, no. 3, pp. 329–348, 2014. https://doi.org/10.1111/insr.12016.

[8] M. Kretowski, Evolutionary Decision Trees in Large-Scale Data Mining, vol. 59 of Studies in Big Data, Cham, Springer International Publishing, 2019. https://doi.org/10.1007/978-3-030-21851-5.

[9] O. Nelles, “Axes-oblique partitioning strategies for local model networks,” in 2006 IEEE Conference on Computer Aided Control System Design, 2006 IEEE International Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, Munich, Germany, IEEE, 2006, pp. 2378–2383. https://doi.org/10.1109/CACSD-CCA-ISIC.2006.4777012.

[10] S. Ernst, “Hinging hyperplane trees for approximation and identification,” in Proceedings of the 37th IEEE Conference on Decision and Control, vol. 2, Tampa, FL, USA, IEEE, 1998, pp. 1266–1271. https://doi.org/10.1109/CDC.1998.758452.

[11] P. Pucar and M. Millnert, “Smooth hinging hyperplanes – an alternative to neural nets,” in European Control Conference (ECC), vol. 3, Rome, Italy, 1995, pp. 1173–1178.

[12] L. E. Scales, Introduction to Non-Linear Optimization, London, Macmillan Education UK, 1985. https://doi.org/10.1007/978-1-349-17741-7.

[13] K.-C. Li, H.-H. Lue, and C.-H. Chen, “Interactive tree-structured regression via principal hessian directions,” J. Am. Stat. Assoc., vol. 95, no. 450, pp. 547–560, 2000. https://doi.org/10.2307/2669398.

[14] R. D. Cook, “Principal hessian directions revisited,” J. Am. Stat. Assoc., vol. 93, no. 441, pp. 84–94, 1998. https://doi.org/10.2307/2669605.

[15] A. Dobra and J. Gehrke, “SECRET: a scalable linear regression tree algorithm,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Canada, ACM, 2002, pp. 481–487. https://doi.org/10.1145/775047.775117.

[16] J. Gama, “Functional trees,” Mach. Learn., vol. 55, no. 3, pp. 219–250, 2004. https://doi.org/10.1023/b:mach.0000027782.67192.13.

[17] G. H. Golub and V. Pereyra, “The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate,” SIAM J. Numer. Anal., vol. 10, no. 2, pp. 413–432, 1973. https://doi.org/10.1137/0710036.

[18] T. Kenesei and J. Abonyi, “Hinging hyperplane based regression tree identified by fuzzy clustering and its application,” Appl. Soft Comput., vol. 13, no. 2, pp. 782–792, 2013. https://doi.org/10.1016/j.asoc.2012.09.027.

[19] R. Hathaway and J. Bezdek, “Switching regression models and fuzzy clustering,” IEEE Trans. Fuzzy Syst., vol. 1, no. 3, pp. 195–204, 1993. https://doi.org/10.1109/91.236552.

[20] J. Wang, R. Fujimaki, and Y. Motohashi, “Trading interpretability for accuracy: oblique treed sparse additive models,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney NSW Australia, ACM, 2015, pp. 1245–1254. https://doi.org/10.1145/2783258.2783407.

[21] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Trans. Inf. Theor., vol. 53, no. 12, pp. 4655–4666, 2007. https://doi.org/10.1109/tit.2007.909108.

[22] M. Czajkowski and M. Kretowski, “The role of decision tree representation in regression problems – an evolutionary perspective,” Appl. Soft Comput., vol. 48, no. 1, pp. 458–475, 2016. https://doi.org/10.1016/j.asoc.2016.07.007.

[23] B. Li, Sufficient Dimension Reduction – Methods and Applications with R, New York, CRC Press, 2018. https://doi.org/10.1201/9781315119427.

[24] P. Chaudhuri, M.-C. Huang, W.-Y. Loh, and R. Yao, “Piecewise-polynomial regression trees,” Stat. Sin., vol. 4, no. 1, pp. 143–167, 1994.

[25] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. R. Morales, “On sparse optimal regression trees,” Eur. J. Oper. Res., vol. 299, no. 3, pp. 1045–1054, 2022. https://doi.org/10.1016/j.ejor.2021.12.022.

[26] J. Nocedal and S. J. Wright, Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd ed. New York, Springer, 2006.

[27] P. Pucar and J. Sjoberg, “On the hinge-finding algorithm for hinging hyperplanes,” IEEE Trans. Inf. Theor., vol. 44, no. 3, pp. 1310–1319, 1998. https://doi.org/10.1109/18.669422.

[28] S. Esmeir and S. Markovitch, “Anytime learning of decision trees,” J. Mach. Learn. Res., vol. 8, no. 33, pp. 891–933, 2007.

[29] L. Breiman, “Hinging hyperplanes for regression, classification, and function approximation,” IEEE Trans. Inf. Theor., vol. 39, no. 3, pp. 999–1013, 1993. https://doi.org/10.1109/18.256506.

[30] C. F. Ansley and R. Kohn, “Convergence of the backfitting algorithm for additive models,” J. Aust. Math. Soc. A Pure Math. Stat., vol. 57, no. 3, pp. 316–329, 1994. https://doi.org/10.1017/s1446788700037721.

[31] J. R. Quinlan, “Learning with continuous classes,” in Proceedings of AI ’92, Hobart, Tasmania, World Scientific, 1992, pp. 343–348.

[32] M. Fernández-Delgado, M. Sirsat, E. Cernadas, S. Alawadi, S. Barro, and M. Febrero-Bande, “An extensive experimental survey of regression methods,” Neural Netw., vol. 111, no. 1, pp. 11–34, 2019. https://doi.org/10.1016/j.neunet.2018.12.010.

[33] J. H. Friedman, “Multivariate adaptive regression splines,” Ann. Stat., vol. 19, no. 1, pp. 1–66, 1991. https://doi.org/10.1214/aos/1176347963.

[34] H. Zhang, C.-Y. Yu, H. Zhu, and J. Shi, “Identification of linear directions in multivariate adaptive spline models,” J. Am. Stat. Assoc., vol. 98, no. 462, pp. 369–376, 2003. https://doi.org/10.1198/016214503000152.

[35] R. B. Gramacy and H. K. H. Lee, “Adaptive design and analysis of supercomputer experiments,” Technometrics, vol. 51, no. 2, pp. 130–145, 2009. https://doi.org/10.1198/tech.2009.0015.

[36] H. Dette and A. Pepelyshev, “Generalized Latin hypercube design for computer experiments,” Technometrics, vol. 52, no. 4, pp. 421–429, 2010. https://doi.org/10.1198/tech.2010.09157.

[37] E. N. Ben-Ari and D. M. Steinberg, “Modeling data from computer experiments: an empirical comparison of kriging with MARS and projection pursuit regression,” Qual. Eng., vol. 19, no. 4, pp. 327–338, 2007. https://doi.org/10.1080/08982110701580930.

[38] J. An and A. Owen, “Quasi-regression,” J. Complex, vol. 17, no. 4, pp. 588–607, 2001. https://doi.org/10.1006/jcom.2001.0588.

[39] Y. B. Lim, J. Sacks, W. J. Studden, and W. J. Welch, “Design and analysis of computer experiments when the output is highly correlated over the input space,” Can. J. Stat., vol. 30, no. 1, pp. 109–126, 2002. https://doi.org/10.2307/3315868.

[40] L. C. W. Dixon and G. P. Szego, “The global optimization problem: an introduction,” Towards Glob. Optim., vol. 2, no. 1, pp. 1–15, 1978.

[41] B. Hartmann, T. Ebert, T. Fischer, J. Belz, G. Kampmann, and O. Nelles, “LMNTOOL – Toolbox zum automatischen Trainieren lokaler Modellnetze,” in Proceedings 22. Workshop Computational Intelligence, Dortmund, KIT Scientific Publishing, 2012.

[42] G. Jekabsons, M5PrimeLab, 2020 [Online]. http://www.cs.rtu.lv/jekabsons/Files/M5PrimeLab.pdf [Accessed: Feb. 17, 2025].

[43] The MathWorks, Inc., “Fitrensemble: fit ensemble of learners for regression,” [Online]. https://de.mathworks.com/help/stats/fitrensemble.html [Accessed: Feb. 17, 2025].

[44] W.-Y. Loh, User Manual for GUIDE Ver. 42.6∗, 2024 [Online]. https://pages.stat.wisc.edu/loh/treeprogs/guide/guideman.pdf [Accessed: Feb. 17, 2025].

Received: 2025-02-17
Accepted: 2025-08-04
Published Online: 2025-10-10
Published in Print: 2025-10-27

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
