High dimensional threshold model with a time-varying threshold based on Fourier approximation

Lixiong Yang
Published/Copyright: May 30, 2022

Abstract

This paper studies high-dimensional threshold models with a time-varying threshold approximated by a Fourier function. We develop a weighted LASSO estimator of the regression coefficients as well as the threshold parameters. Our LASSO estimator can not only select covariates but also distinguish between linear and threshold models. We derive non-asymptotic oracle inequalities for the prediction risk, the $\ell_1$ and $\ell_\infty$ bounds for the regression coefficients, and provide an upper bound on the $\ell_1$ estimation error of the time-varying threshold estimator. These bounds translate easily into asymptotic consistency for prediction and estimation. We also establish variable selection consistency and threshold detection consistency based on the $\ell_\infty$ bounds. Through Monte Carlo simulations, we show that the thresholded LASSO works reasonably well in finite samples in terms of variable selection, and that there is little harm in allowing for the Fourier approximation in the estimation procedure even when the threshold has no time-varying feature. By contrast, estimation and variable selection are inconsistent when the threshold is time-varying but misspecified as a constant. The model is illustrated with an empirical application to the famous debt-growth nexus.

JEL Classification: C13; C51; C52

Corresponding author: Lixiong Yang, School of Management, Lanzhou University, 222 South Tianshui Road, Lanzhou 730000, China, E-mail:

Award Identifier / Grant number: 71803072

Acknowledgements

The author thanks the editor and anonymous referees for very valuable comments and suggestions which improved the quality of the paper. Remaining errors and omissions are my own. The author acknowledges the financial support from the National Natural Science Foundation of China (Grant No. 71803072).

  1. Author contribution: The author has accepted responsibility for the entire content of this submitted manuscript and approved submission.

  2. Research funding: This study is supported by the National Natural Science Foundation of China (Grant No. 71803072).

  3. Conflict of interest statement: The author declares no conflicts of interest regarding this paper.

Appendix A: Mathematical proofs

This appendix provides the proofs of Theorems 1–6 in the paper. Define

(A1) $V_{1j} := \bigl(\sqrt{T}\,\sigma\,\|X^{(j)}\|_T\bigr)^{-1}\sum_{t=1}^{T} e_t x_t^{(j)},$

(A2) $V_{2j}(\gamma) := \bigl(\sqrt{T}\,\sigma\,\|X^{(j)}(\gamma)\|_T\bigr)^{-1}\sum_{t=1}^{T} e_t x_t^{(j)}\,\mathbf{1}\{q_t < \gamma_t(\gamma)\},$

(A3) $R_T(\beta,\gamma) = 2\frac{1}{T}\sum_{t=1}^{T}\delta' e_t x_t\bigl(\mathbf{1}\{q_t < \hat\gamma_t(\hat\gamma)\} - \mathbf{1}\{q_t < \gamma_t(\gamma)\}\bigr).$

For a constant μ ∈ (0, 1), define the events

$A := \bigcap_{j=1}^{m}\bigl\{2|V_{1j}| \le \mu\lambda/\sigma\bigr\},$

$B := \bigcap_{j=1}^{m}\Bigl\{2\sup_{\gamma\in\Gamma}|V_{2j}(\gamma)| \le \mu\lambda/\sigma\Bigr\}.$

Lemma 1

(Basic Inequalities). On the events A and B, we have

(A4) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0) - \hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} \le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + \lambda\bigl|\|\hat D\beta_0\|_{\ell_1} - \|D\beta_0\|_{\ell_1}\bigr| + R_T(\beta_0,\gamma_0),$

and

(A5) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0) - \hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} \le \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0) - \beta_0'x_t(\hat\gamma)\bigr)^2 + 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

Proof of Lemma 1

Note that

(A6)
$$\begin{aligned}
\mathrm{SSR}_T(\hat\beta,\hat\gamma) - \mathrm{SSR}_T(\beta,\gamma)
&= \frac{1}{T}\sum_{t=1}^{T}\bigl(y_t - \hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{1}{T}\sum_{t=1}^{T}\bigl(y_t - \beta(\gamma)'x_t(\gamma)\bigr)^2\\
&= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0) - \hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0) - \beta(\gamma)'x_t(\gamma)\bigr)^2\\
&\qquad - \frac{2}{T}\sum_{t=1}^{T} e_t\bigl(\hat\beta(\hat\gamma)'x_t(\hat\gamma) - \beta(\gamma)'x_t(\gamma)\bigr)\\
&= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0) - \hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0) - \beta(\gamma)'x_t(\gamma)\bigr)^2\\
&\qquad - \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_2)'e_t x_t - \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\}\\
&\qquad - \frac{2}{T}\sum_{t=1}^{T}\delta'e_t x_t\bigl(\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t\}\bigr),
\end{aligned}$$

where the second equality substitutes $y_t = \beta_0'x_t(\gamma_0) + e_t$.

By the arg-min property of the LASSO estimator, we have

(A7) $\mathrm{SSR}_T(\hat\beta,\hat\gamma) + \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1} \le \mathrm{SSR}_T(\beta,\gamma) + \lambda\|D(\gamma)\beta\|_{\ell_1}.$

Thus,

(A8) $\mathrm{SSR}_T(\hat\beta,\hat\gamma) - \mathrm{SSR}_T(\beta,\gamma) \le \lambda\|D(\gamma)\beta\|_{\ell_1} - \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}.$

Based on (A6) and (A8), for the prediction risk we have

(A9)
$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2
&\le \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta(\gamma)'x_t(\gamma)\bigr)^2 + \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_2)'e_t x_t\\
&\quad + \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\} + \frac{2}{T}\sum_{t=1}^{T}\delta'e_t x_t\bigl(\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)\\
&\quad + \lambda\|D(\gamma)\beta\|_{\ell_1} - \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}.
\end{aligned}$$

Note that on the events A and B, we have

(A10) $\Bigl|\frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_2)'e_t x_t + \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\}\Bigr| \le \mu\lambda\|\hat D(\hat\beta-\beta)\|_{\ell_1}.$

By definition, $R_T(\beta,\gamma) = 2\frac{1}{T}\sum_{t=1}^{T}\delta'e_t x_t\bigl(\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)$. Then, (A9) can be rewritten as (for any $(\beta,\gamma)$)

(A11) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta(\gamma)'x_t(\gamma)\bigr)^2 + \mu\lambda\|\hat D(\hat\beta-\beta)\|_{\ell_1} + R_T(\beta,\gamma) + \lambda\|D(\gamma)\beta\|_{\ell_1} - \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}.$

Note that for $j\notin J_0$, we have $|\hat\beta^{(j)}-\beta_0^{(j)}| + |\beta_0^{(j)}| - |\hat\beta^{(j)}| = 0$. Thus, on the events A and B, we have (evaluating at $(\beta_0,\gamma_0)$)

(A12) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le \mu\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} + R_T(\beta_0,\gamma_0) + \lambda\|D\beta_0\|_{\ell_1} - \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}.$

Thus, we obtain

(A13)
$$\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1}\\
&\quad\le \lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} + R_T(\beta_0,\gamma_0) + \lambda\|D\beta_0\|_{\ell_1} - \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}\\
&\quad= \lambda\bigl(\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} + \|\hat D\beta_0\|_{\ell_1} - \|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}\bigr) + R_T(\beta_0,\gamma_0) + \lambda\|D\beta_0\|_{\ell_1} - \lambda\|\hat D\beta_0\|_{\ell_1}\\
&\quad\le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + \lambda\bigl|\|\hat D\beta_0\|_{\ell_1} - \|D\beta_0\|_{\ell_1}\bigr| + R_T(\beta_0,\gamma_0).
\end{aligned}$$

Similarly, on the events A and B, we have (evaluating at $(\beta_0,\hat\gamma)$)

(A14) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta_0'x_t(\hat\gamma)\bigr)^2 + \mu\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} + R_T(\beta_0,\hat\gamma) + \lambda\|D(\hat\gamma)\beta_0\|_{\ell_1} - \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}.$

Thus,

(A15)
$$\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1}\\
&\quad\le \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta_0'x_t(\hat\gamma)\bigr)^2 + \lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} + R_T(\beta_0,\hat\gamma) + \lambda\|D(\hat\gamma)\beta_0\|_{\ell_1} - \lambda\|\hat D(\hat\gamma)\hat\beta\|_{\ell_1}\\
&\quad= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta_0'x_t(\hat\gamma)\bigr)^2 + 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + R_T(\beta_0,\hat\gamma).
\end{aligned}$$

Noting that $R_T(\beta_0,\hat\gamma) = 0$, we have proved Lemma 1. □

Lemma 2

Under Assumption 1, $\Pr(A\cap B)\to 1$ as m → ∞.

Proof of Lemma 2

This proof is similar to Lemma 1 in Medeiros and Mendes (2016) and Lemma 6 in Lee, Seo, and Shin (2016). Set $\lambda = A\sigma m^{1/d}\sqrt{\log(3m)/T}$. Then, using the Markov inequality, we have

$$\begin{aligned}
\Pr(A^c) &\le \sum_{j=1}^{m}\Pr\Bigl(\sqrt{T}\,|V_{1j}| > \mu\sqrt{T}\lambda/(2\sigma)\Bigr)
= \sum_{j=1}^{m}\Pr\Bigl(\Bigl|\frac{2}{\sqrt{T}}\sum_{t=1}^{T}e_t x_t^{(j)}\Bigr| > \|X^{(j)}\|_T\,\mu\sqrt{T}\lambda/2\Bigr)\\
&\le 2^d\sum_{j=1}^{m}\frac{E\bigl|T^{-1/2}\sum_{t=1}^{T}e_t x_t^{(j)}\bigr|^{d}}{\bigl(\|X^{(j)}\|_T\,\mu\sqrt{T}\lambda/2\bigr)^{d}}
= 2^d m^{-1}\sum_{j=1}^{m}\frac{E\bigl|T^{-1/2}\sum_{t=1}^{T}e_t x_t^{(j)}\bigr|^{d}}{\bigl(\|X^{(j)}\|_T\,\mu A\sigma/2\bigr)^{d}\,(\log(3m))^{d/2}}\\
&\le 2^d C_d c_d\Bigl(\frac{1}{X_{\min}\,\mu A\sigma/2\,\sqrt{\log(3m)}}\Bigr)^{d},
\end{aligned}$$

in which the last inequality holds because $E\bigl|T^{-1/2}\sum_{t=1}^{T}e_t x_t^{(j)}\bigr|^{d} \le C_d\,E\bigl|T^{-1}\sum_{t=1}^{T}\bigl(e_t x_t^{(j)}\bigr)^{2}\bigr|^{d/2} \le C_d\,\frac{1}{T}\sum_{t=1}^{T}E\bigl|e_t x_t^{(j)}\bigr|^{d} \le C_d c_d$ by the Burkholder–Davis–Gundy inequality and the $C_r$-inequality.
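As a numerical aside, the following minimal sketch shows how this choice of penalty scales with m and T (the values of the constant A, the noise level σ, and the moment order d are illustrative assumptions, not values from the paper):

```python
import math

def penalty(T: int, m: int, d: int = 4, A: float = 1.0, sigma: float = 1.0) -> float:
    """Compute lambda = A * sigma * m^(1/d) * sqrt(log(3m) / T)."""
    return A * sigma * m ** (1.0 / d) * math.sqrt(math.log(3 * m) / T)

# lambda grows slowly in the number of regressors m and shrinks at the root-T rate:
print(penalty(T=200, m=10))   # ~0.23
print(penalty(T=200, m=100))  # ~0.53
```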

As we assume a fixed regressor design and independent normal errors, $e_t x_t^{(j)}$ is a sequence of symmetric random variables (i.e., $e_t x_t^{(j)}$ and $-e_t x_t^{(j)}$ have the same distribution). Then, as in Lemma 6 in Lee, Seo, and Shin (2016), by Lévy's symmetrization inequalities (see, e.g., Cam and Yang 2015, pp. 83–84), we have

$$\Pr\Bigl(\sup_{\gamma\in\Gamma}\sqrt{T}\,|V_{2j}(\gamma)| > \mu\sqrt{T}\lambda/(2\sigma)\Bigr)
\le \Pr\Bigl(\sup_{1\le s\le T}\Bigl|\frac{1}{\sigma\sqrt{T}}\sum_{t=1}^{s}e_t x_t^{(j)}\Bigr| > \frac{\|X^{(j)}(\gamma)\|_T\,\mu\sqrt{T}\lambda}{2\sigma}\Bigr)
\le 2\Pr\Bigl(\sqrt{T}\,|V_{1j}| > \frac{\|X^{(j)}(\gamma)\|_T}{\|X^{(j)}\|_T}\cdot\frac{\mu\sqrt{T}\lambda}{2\sigma}\Bigr).$$

Thus,

$$\Pr(B^c) \le 2\sum_{j=1}^{m}\Pr\Bigl(\Bigl|\frac{2}{\sqrt{T}}\sum_{t=1}^{T}e_t x_t^{(j)}\Bigr| > \|X^{(j)}(\gamma)\|_T\,\mu\sqrt{T}\lambda/2\Bigr) \le 2^{d+1} C_d c_d\Bigl(\frac{1}{X_{\min}\,\mu A\sigma/2\,\sqrt{\log(3m)}}\Bigr)^{d}.$$

Since $\Pr(A\cap B) \ge 1 - \Pr(A^c) - \Pr(B^c)$, we have

$$\Pr(A\cap B) \ge 1 - \Pr(A^c) - \Pr(B^c) \ge 1 - 3\times 2^{d}\,C_d c_d\Bigl(\frac{1}{X_{\min}\,\mu A\sigma/2\,\sqrt{\log(3m)}}\Bigr)^{d}.$$

We obtain Lemma 2 as m → ∞. □

Lemma 3

(Consistency of prediction). Conditional on the events A and B,

$$\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le K\lambda M(\beta_0)$$

for some constant K.

Proof of Lemma 3

Note that $R_T(\beta,\gamma) = 2\frac{1}{T}\sum_{t=1}^{T}\delta'e_t x_t\bigl(\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)$. Then, on the event B, we have

(A16) $\bigl|R_T(\beta,\gamma)\bigr| \le 2\mu\lambda\sum_{j=1}^{m}\|x^{(j)}\|_T\,|\delta_0^{(j)}| \le 2\mu\lambda X_{\max}\|\delta_0\|_{\ell_1}.$

By Lemma 1, conditional on the events A and B, we have

(A17)
$$\begin{aligned}
&\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1}\\
&\quad\le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + \lambda\bigl|\|\hat D\beta_0\|_{\ell_1}-\|D\beta_0\|_{\ell_1}\bigr| + R_T(\beta_0,\gamma_0)\\
&\quad\le 6\lambda X_{\max}\beta_{\max}M(\beta_0) + 2\mu\lambda X_{\max}\|\delta_0\|_{\ell_1}.
\end{aligned}$$

Note that $\|\delta_0\|_{\ell_1} \le \beta_{\max}M(\beta_0)$. Thus, we have $6\lambda X_{\max}\beta_{\max}M(\beta_0) + 2\mu\lambda X_{\max}\|\delta_0\|_{\ell_1} \le \lambda M(\beta_0)\bigl[6X_{\max}\beta_{\max} + 2\mu X_{\max}\beta_{\max}\bigr] \le K\lambda M(\beta_0)$. This implies that the LASSO prediction is consistent with probability converging to one. □

Lemma 4

(Sparsity of the LASSO). Conditional on the events A and B,

(A18) $M(\hat\beta) \le \frac{4\,\mathrm{maxeig}\bigl(X(\hat\gamma)'X(\hat\gamma)/T\bigr)\,K\lambda M(\beta_0)}{(1-\mu)^2\lambda^2 X_{\min}^2}.$

Proof of Lemma 4

We first rewrite the model $y_t = \beta'x_t(\gamma) + e_t$ in matrix form:

(A19) $y = X(\gamma)\beta + e.$

For any given threshold parameter $\gamma\in R = R_0\times R_1\times R_2\times R_k$, the LASSO solution $\hat\beta$ satisfies the KKT conditions given by

(A20) $\frac{2}{T}X^{(j)\prime}\bigl(y - X(\gamma)\hat\beta(\gamma)\bigr) = \lambda\|X^{(j)}\|_T\,\mathrm{sign}\bigl(\hat\beta^{(j)}(\gamma)\bigr), \text{ if } \hat\beta_2^{(j)}(\gamma)\ne 0,$

(A21) $\Bigl|\frac{2}{T}X^{(j)\prime}\bigl(y - X(\gamma)\hat\beta(\gamma)\bigr)\Bigr| \le \lambda\|X^{(j)}\|_T, \text{ if } \hat\beta_2^{(j)}(\gamma) = 0,$

(A22) $\frac{2}{T}X^{(j)}(\gamma)'\bigl(y - X(\gamma)\hat\beta(\gamma)\bigr) = \lambda\|X^{(j)}(\gamma)\|_T\,\mathrm{sign}\bigl(\hat\delta^{(j)}(\gamma)\bigr), \text{ if } \hat\delta^{(j)}(\gamma)\ne 0,$

(A23) $\Bigl|\frac{2}{T}X^{(j)}(\gamma)'\bigl(y - X(\gamma)\hat\beta(\gamma)\bigr)\Bigr| \le \lambda\|X^{(j)}(\gamma)\|_T, \text{ if } \hat\delta^{(j)}(\gamma) = 0,$

in which j = 1, …, m.

Conditional on the events A and B, we have

(A24) $\Bigl|\frac{2}{T}\sum_{t=1}^{T}e_t x_t^{(j)}\Bigr| \le \mu\lambda\|X^{(j)}\|_T,$

and

(A25) $\Bigl|\frac{2}{T}\sum_{t=1}^{T}e_t x_t^{(j)}\,\mathbf{1}\{q_t<\gamma_t\}\Bigr| \le \mu\lambda\|X^{(j)}(\gamma)\|_T$

for any γ and j = 1, …, m.

Since y = X( γ 0) β 0 + e, we have

(A26) $\Bigl|\frac{2}{T}X^{(j)\prime}\bigl(X(\gamma_0)\beta_0 - X(\gamma)\hat\beta(\gamma)\bigr)\Bigr| \ge (1-\mu)\lambda\|X^{(j)}\|_T, \text{ if } \hat\beta_2^{(j)}(\gamma)\ne 0,$

(A27) $\Bigl|\frac{2}{T}X^{(j)}(\gamma)'\bigl(X(\gamma_0)\beta_0 - X(\gamma)\hat\beta(\gamma)\bigr)\Bigr| \ge (1-\mu)\lambda\|X^{(j)}(\gamma)\|_T, \text{ if } \hat\delta^{(j)}(\gamma)\ne 0,$

for any γ ∈ Γ.

Using the inequalities above, we obtain

$$\begin{aligned}
&\frac{1}{T^2}\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr)'X(\hat\gamma)X(\hat\gamma)'\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr)\\
&\quad= \frac{1}{T^2}\sum_{j=1}^{m}\Bigl[X^{(j)\prime}\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr)\Bigr]^2 + \frac{1}{T^2}\sum_{j=1}^{m}\Bigl[X^{(j)}(\hat\gamma)'\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr)\Bigr]^2\\
&\quad\ge \frac{1}{T^2}\sum_{j:\hat\beta^{(j)}\ne 0}\Bigl[X^{(j)\prime}\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr)\Bigr]^2 + \frac{1}{T^2}\sum_{j:\hat\delta^{(j)}\ne 0}\Bigl[X^{(j)}(\hat\gamma)'\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr)\Bigr]^2\\
&\quad\ge \frac{(1-\mu)^2\lambda^2}{4}\Bigl(\sum_{j:\hat\beta^{(j)}\ne 0}\|X^{(j)}\|_T^2 + \sum_{j:\hat\delta^{(j)}\ne 0}\|X^{(j)}(\hat\gamma)\|_T^2\Bigr)
\ge \frac{(1-\mu)^2\lambda^2}{4}X_{\min}^2\,M(\hat\beta).
\end{aligned}$$

By Lemma 3, we have

(A28) $\frac{1}{T^2}\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr)'X(\hat\gamma)X(\hat\gamma)'\bigl(X(\gamma_0)\beta_0 - X(\hat\gamma)\hat\beta\bigr) \le \mathrm{maxeig}\bigl(X(\hat\gamma)'X(\hat\gamma)/T\bigr)\,K\lambda M(\beta_0),$

in which $\mathrm{maxeig}\bigl(X(\hat\gamma)'X(\hat\gamma)/T\bigr)$ denotes the largest eigenvalue of $X(\hat\gamma)'X(\hat\gamma)/T$. We therefore obtain Lemma 4. □

Theorem 1

(Consistency of the LASSO estimator). Suppose $\delta_0 = 0$. Suppose Assumption 1.3 holds with $\kappa = \kappa\bigl(s,\frac{1+\mu}{1-\mu},\Gamma\bigr)$ for μ ∈ (0, 1) and $M(\beta_0)\le s\le m$. Then, under Assumption 1, on the events A and B we have

(A29) $\frac{1}{\sqrt{T}}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2} \le \frac{2\lambda}{\kappa}\sqrt{s}\,X_{\max},$

(A30) $\|\hat\beta-\beta_0\|_{\ell_1} \le \frac{4\lambda}{\kappa^2(1-\mu)}\,s\,\frac{X_{\max}^2}{X_{\min}},$

(A31) $M(\hat\beta) \le \frac{16\,\phi_{\max}\,s\,X_{\max}^2}{(1-\mu)^2\kappa^2 X_{\min}^2}.$

Proof of Theorem 1

Note that $\delta_0 = 0$ implies $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta_0'x_t(\hat\gamma)\bigr)^2 = 0$. Thus, by Lemma 1 we have

(A32) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} \le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

Hence, we have

(A33) $(1-\mu)\lambda\bigl(\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + \|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1}\bigr) \le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1},$

which is equivalent to

(A34) $(1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1} \le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} - (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

We rewrite (A34) as

(A35) $(1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1} \le (1+\mu)\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1},$

which implies that

(A36) $\|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1} \le \frac{1+\mu}{1-\mu}\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

By Assumption 1.3 (uniform restricted eigenvalue), URE$\bigl(s,\frac{1+\mu}{1-\mu},\Gamma\bigr)$ holds with constant $\kappa\bigl(s,\frac{1+\mu}{1-\mu},\Gamma\bigr)$. Thus, we have (taking $r = \hat D(\hat\beta-\beta_0)$ in Assumption 1.3)

(A37)
$$\begin{aligned}
\kappa^2\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_2}^2
&\le \frac{\|X(\hat\gamma)\hat D(\hat\beta-\beta_0)\|_{\ell_2}^2}{T}
= \frac{1}{T}(\hat\beta-\beta_0)'\hat D X(\hat\gamma)'X(\hat\gamma)\hat D(\hat\beta-\beta_0)\\
&\le \frac{\bigl(\max(\hat D)\bigr)^2}{T}(\hat\beta-\beta_0)'X(\hat\gamma)'X(\hat\gamma)(\hat\beta-\beta_0)
= \frac{\bigl(\max(\hat D)\bigr)^2}{T}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2}^2,
\end{aligned}$$

where $\kappa = \kappa\bigl(s,\frac{1+\mu}{1-\mu},\Gamma\bigr)$ and the last equality holds because of the assumption that $\delta_0 = 0$.

By Lemma 1 and (A32), we have

(A38) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le 2\lambda\sqrt{s}\,\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_2} \le \frac{2\lambda\sqrt{s}}{\kappa}\max(\hat D)\,\frac{1}{\sqrt{T}}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2},$

where $\max(\hat D) = X_{\max}$ by definition. This inequality yields the first conclusion, the consistency of the LASSO prediction. Using a similar argument, we have

(A39) $\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} = \|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + \|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1} \le \frac{2}{1-\mu}\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le \frac{2\sqrt{s}}{1-\mu}\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_2} \le \frac{2\sqrt{s}}{\kappa(1-\mu)}\max(\hat D)\,\frac{1}{\sqrt{T}}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2}.$

Noting that $\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} \ge \min(\hat D)\,\|\hat\beta-\beta_0\|_{\ell_1}$, by (A39) we have

$$\|\hat\beta-\beta_0\|_{\ell_1} \le \frac{2\sqrt{s}}{\kappa(1-\mu)}\frac{\max(\hat D)}{\min(\hat D)}\,\frac{1}{\sqrt{T}}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2} \le \frac{4\lambda}{\kappa^2(1-\mu)}\,s\,\frac{\max(\hat D)}{\min(\hat D)}\max(\hat D).$$

Noting that $\max(\hat D) = X_{\max}$ and $\min(\hat D) = X_{\min}$, we have therefore proved the consistency of the LASSO estimator for the case of $\delta_0 = 0$.

By Lemma 4 and the bound for the prediction risk, we have

(A40) $M(\hat\beta) \le \frac{4\,\mathrm{maxeig}\bigl(X(\hat\gamma)'X(\hat\gamma)/T\bigr)\,\bigl(2\lambda\sqrt{s}\,X_{\max}/\kappa\bigr)^2}{(1-\mu)^2\lambda^2 X_{\min}^2}.$

Then the third conclusion of the theorem follows immediately. □

Lemma 5

(Probability of the event $C(\eta)$). $\Pr(A)\to 1$, $\Pr(B)\to 1$, and $\Pr(C(\eta))\to 1$ as m → ∞.

Proof of Lemma 5

For some constant η > 0, define the event

$$C \equiv C(\eta) := \Bigl\{\sup_{|\gamma_t-\gamma_t^0|<\eta}\Bigl|\frac{2}{T}\sum_{t=1}^{T}e_t x_t'\delta_0\bigl(\mathbf{1}\{q_t<\gamma_t^0\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)\Bigr| \le \lambda\eta\Bigr\}.$$

By Lévy's symmetrization inequalities (as in Lemma 2), we obtain

(A41)
$$\begin{aligned}
\Pr\bigl(C(\eta_j)^c\bigr)
&\le \Pr\Bigl(\sup_{|\gamma_t-\gamma_t^0|\le\eta_j}\Bigl|\frac{2}{T}\sum_{t=1}^{T}e_t x_t'\delta_0\bigl(\mathbf{1}\{q_t<\gamma_t^0\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)\Bigr| > \lambda\eta_j\Bigr)\\
&\le 2\Pr\Bigl(\Bigl|\frac{2}{T}\sum_{t=T(\gamma_t^0-\eta_j)}^{T(\gamma_t^0+\eta_j)}e_t x_t'\delta_0\Bigr| > \lambda\eta_j\Bigr)
= 2\Pr\Bigl(\Bigl|\frac{2}{\sqrt{T\eta_j}}\sum_{t=T(\gamma_t^0-\eta_j)}^{T(\gamma_t^0+\eta_j)}e_t x_t'\delta_0\Bigr| > \sqrt{T}\lambda\Bigr)\\
&\le E\Bigl|(2T\eta)^{-1/2}\sum_{t=T(\gamma_t^0-\eta)}^{T(\gamma_t^0+\eta)}e_t x_t'\delta_0\Bigr|^2\Bigl(\frac{\sqrt{T}\lambda}{2\beta_{\max}}\Bigr)^{-2}
\le \frac{1}{m}\sum_{j=1}^{m}E\Bigl|(2T\eta)^{-1/2}\sum_{t=T(\gamma_t^0-\eta)}^{T(\gamma_t^0+\eta)}e_t x_t^{(j)}\Bigr|^2\Bigl(\frac{m^{-1/2}\sqrt{T}\lambda}{2\beta_{\max}}\Bigr)^{-2}\\
&\le C_d\,\frac{1}{m}\sum_{j=1}^{m}(2T\eta)^{-1}\sum_{t=T(\gamma_t^0-\eta)}^{T(\gamma_t^0+\eta)}E\bigl|e_t x_t^{(j)}\bigr|^2\,\Bigl(\frac{1}{A\sigma\sqrt{\log(3m)}}\Bigr)^{2},
\end{aligned}$$

in which $(2T\eta)^{-1}\sum_{t=T(\gamma_t^0-\eta)}^{T(\gamma_t^0+\eta)}E|e_t x_t^{(j)}|^2$ is bounded (by Assumption 1.6, which guarantees well-defined second moments). Hence, as in Lemma 2, we have $\Pr(C(\eta_j)^c)\to 0$ as m → ∞. □

Lemma 6

(Boundary of the threshold estimator). Suppose Assumption 1.4 holds. Let

$$\eta^* = c^{-1}\bigl[6\lambda X_{\max}\beta_{\max}M(\beta_0) + 2\mu\lambda X_{\max}\|\delta_0\|_{\ell_1}\bigr],$$

in which c is defined as in Assumption 1.4. Then conditional on the events A and B ,

$$\|\hat\gamma-\gamma_0\|_{\ell_1} \le \eta^*.$$

Proof of Lemma 6

As in the proof of Lemma 1, we have

(A42)
$$\begin{aligned}
\mathrm{SSR}_T(\hat\beta,\hat\gamma) - \mathrm{SSR}_T(\beta,\gamma)
&= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta(\gamma)'x_t(\gamma)\bigr)^2\\
&\quad - \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_2)'e_t x_t - \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\} - \frac{2}{T}\sum_{t=1}^{T}\delta'e_t x_t\bigl(\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)
\end{aligned}$$

(A43) $= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta(\gamma)'x_t(\gamma)\bigr)^2 - \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_2)'e_t x_t - \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\} - R_T(\beta,\gamma).$

Thus, based on (A10) we obtain

(A44)
$$\begin{aligned}
\mathrm{SSR}_T(\hat\beta,\hat\gamma) - \mathrm{SSR}_T(\beta_0,\gamma_0)
&= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_{02})'e_t x_t\\
&\quad - \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta_0)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\} - R_T(\beta_0,\gamma_0)\\
&\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \mu\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} - R_T(\beta_0,\gamma_0).
\end{aligned}$$

Note that for $j\notin J_0$, we have $|\hat\beta^{(j)}-\beta_0^{(j)}| + |\beta_0^{(j)}| - |\hat\beta^{(j)}| = 0$. Thus, on the events A and B, we have

(A45)
$$\begin{aligned}
&\bigl[\mathrm{SSR}_T(\hat\beta,\hat\gamma) + \lambda\|\hat D\hat\beta\|_{\ell_1}\bigr] - \bigl[\mathrm{SSR}_T(\beta_0,\gamma_0) + \lambda\|D\beta_0\|_{\ell_1}\bigr]\\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \mu\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} - \lambda\bigl[\|D\beta_0\|_{\ell_1} - \|\hat D\hat\beta\|_{\ell_1}\bigr] - R_T(\beta_0,\gamma_0)\\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} - \lambda\bigl[\|D\beta_0\|_{\ell_1} - \|\hat D\beta_0\|_{\ell_1}\bigr] - R_T(\beta_0,\gamma_0)\\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \bigl[6\lambda X_{\max}\beta_{\max}M(\beta_0) + 2\mu\lambda X_{\max}\|\delta_0\|_{\ell_1}\bigr],
\end{aligned}$$

where the last inequality comes from Eq. (A16).

Define $\eta^* = c^{-1}\bigl[6\lambda X_{\max}\beta_{\max}M(\beta_0) + 2\mu\lambda X_{\max}\|\delta_0\|_{\ell_1}\bigr]$, in which c is defined as in Assumption 1.4. Suppose that $\|\hat\gamma-\gamma_0\|_{\ell_1} > \eta^*$. Then Assumption 1.4 and the above inequality together imply that, on the events A and B,

(A46) $\bigl[\mathrm{SSR}_T(\hat\beta,\hat\gamma) + \lambda\|\hat D\hat\beta\|_{\ell_1}\bigr] - \bigl[\mathrm{SSR}_T(\beta_0,\gamma_0) + \lambda\|D\beta_0\|_{\ell_1}\bigr] > 0,$

which leads to a contradiction, as $(\hat\beta,\hat\gamma)$ is the minimizer of the criterion function defined in Section 2. Thus, we obtain $\|\hat\gamma-\gamma_0\|_{\ell_1} \le \eta^*$. □

This lemma implies that, as T → ∞, m → ∞ and $\lambda M(\beta_0)\to 0$, with probability approaching one we have $\hat\gamma\to\gamma_0$.

The following lemma states that the bounds for the prediction risk and the slope estimator may become smaller as the bound on the threshold estimator gets smaller.

Lemma 7

Suppose that $\|\hat\gamma-\gamma_0\|_{\ell_1} < c_\gamma$ and $\|\hat\beta-\beta_0\|_{\ell_1} < c_\beta$ for some $(c_\gamma, c_\beta)$. Suppose further that Assumptions 1.5 and 1.3 hold with $\Gamma = \{\gamma: \|\gamma-\gamma_0\|_{\ell_1} < c_\gamma\}$, $\kappa = \kappa\bigl(s,\frac{2+\mu}{1-\mu},\Gamma\bigr)$ for μ ∈ (0, 1) and $M(\beta_0)\le s\le m$. Then, conditional on the events A, B and $C(c_\gamma)$, we have

$$\|\hat f - f_0\|_T^2 \le 3\lambda\Bigl[\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr) \vee \bigl(6X_{\max}^2\kappa^{-2}\lambda s\bigr) \vee \Bigl(2X_{\max}\kappa^{-1}\bigl(c_\beta c_\gamma C\|\delta_0\|_{\ell_1}\,s\bigr)^{1/2}\Bigr)\Bigr],$$

$$\|\hat\beta-\beta_0\|_{\ell_1} \le \frac{3}{(1-\mu)X_{\min}}\Bigl[\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr) \vee \bigl(6X_{\max}^2\kappa^{-2}\lambda s\bigr) \vee \Bigl(2X_{\max}\kappa^{-1}\bigl(c_\beta c_\gamma C\|\delta_0\|_{\ell_1}\,s\bigr)^{1/2}\Bigr)\Bigr],$$

in which $\|\hat f - f_0\|_T^2 = \frac{1}{T}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2}^2$.

Proof of Lemma 7

Note that on the event $C(c_\gamma)$, we have

(A47) $\bigl|R_T(\beta_0,\gamma_0)\bigr| = \Bigl|2\frac{1}{T}\sum_{t=1}^{T}\delta_0'e_t x_t\bigl(\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t^0\}\bigr)\Bigr| \le \lambda c_\gamma.$

Using the mean value theorem (for $f(x)=\sqrt{x}$, $\frac{f(b)-f(a)}{b-a} = f'(c) = \frac{1}{2\sqrt{c}}$ for some c between a and b), the triangle inequality $|a|-|b| \le |a\pm b| \le |a|+|b|$, and Assumption 1.5 on smoothness of the design matrix, we obtain

(A48) $\bigl|\|\hat D\beta_0\|_{\ell_1} - \|D\beta_0\|_{\ell_1}\bigr| = \sum_{j=1}^{m}\Bigl|\|X^{(j)}(\hat\gamma)\|_T - \|X^{(j)}(\gamma^0)\|_T\Bigr|\,|\delta_0^{(j)}| \le \sum_{j=1}^{m}\bigl(2\|X^{(j)}(\gamma^0)\|_T\bigr)^{-1}|\delta_0^{(j)}|\,\frac{1}{T}\sum_{t=1}^{T}\bigl(x_t^{(j)}\bigr)^2\Bigl|\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t^0\}\Bigr| \le 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}.$

We consider two cases: (i) $\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} > c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}$; and (ii) $\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}$.

Case (i): $\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} > c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}$. In this case, we have

(A49) $\lambda\bigl|\|\hat D\beta_0\|_{\ell_1} - \|D\beta_0\|_{\ell_1}\bigr| + R_T(\beta_0,\gamma_0) \le \lambda\,2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1} + \lambda c_\gamma = \lambda\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr) \le \lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

By Lemma 1, we have

(A50) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} \le 2\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + \lambda\bigl|\|\hat D\beta_0\|_{\ell_1}-\|D\beta_0\|_{\ell_1}\bigr| + R_T(\beta_0,\gamma_0) \le 3\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

We rewrite (A50) as

$$\|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1} \le \frac{2+\mu}{1-\mu}\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1},$$

which is because

(A51) $(1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} = (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1} + (1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le 3\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

We assume that Assumption 1.3 holds with $\Gamma = \{\gamma: \|\gamma-\gamma_0\|_{\ell_1} < c_\gamma\}$, $\kappa = \kappa\bigl(s,\frac{2+\mu}{1-\mu},\Gamma\bigr)$ for μ ∈ (0, 1) and $M(\beta_0)\le s\le m$. By Assumption 1.3 (uniform restricted eigenvalue), URE$\bigl(s,\frac{2+\mu}{1-\mu},\Gamma\bigr)$ holds with constant $\kappa\bigl(s,\frac{2+\mu}{1-\mu},\Gamma\bigr)$, and we have (taking $r = \hat D(\hat\beta-\beta_0)$ in Assumption 1.3)

(A52)
$$\begin{aligned}
\kappa^2\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_2}^2
&\le \frac{\|X(\hat\gamma)\hat D(\hat\beta-\beta_0)\|_{\ell_2}^2}{T}
= \frac{1}{T}(\hat\beta-\beta_0)'\hat D X(\hat\gamma)'X(\hat\gamma)\hat D(\hat\beta-\beta_0)\\
&\le \frac{\bigl(\max(\hat D)\bigr)^2}{T}(\hat\beta-\beta_0)'X(\hat\gamma)'X(\hat\gamma)(\hat\beta-\beta_0)\\
&\le X_{\max}^2\Bigl[\frac{1}{T}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2}^2 + 2\bigl(c_\beta + X_{\min}^{-1}\|\delta_0\|_{\ell_1}\bigr)c_\gamma C\Bigr],
\end{aligned}$$

where $\kappa = \kappa\bigl(s,\frac{2+\mu}{1-\mu},\Gamma\bigr)$ and the last inequality follows from the assumed smoothness of the design matrix.

Thus,

(A53) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le 3\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le 3\lambda\sqrt{s}\,\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_2} \le 3\lambda\sqrt{s}\,\kappa^{-1}X_{\max}\Bigl[\frac{1}{T}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2}^2 + 2\bigl(c_\beta + X_{\min}^{-1}\|\delta_0\|_{\ell_1}\bigr)c_\gamma C\Bigr]^{1/2}.$

Noting that a + b ≤ 2a ∨ 2b, conditional on the events A and B we obtain the upper bound given by

(A54) $\frac{1}{T}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2}^2 = \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le 18X_{\max}^2\kappa^{-2}\lambda^2 s \;\vee\; 6X_{\max}\kappa^{-1}\lambda\Bigl[2\bigl(c_\beta + X_{\min}^{-1}\|\delta_0\|_{\ell_1}\bigr)c_\gamma C\,s\Bigr]^{1/2}.$

Using a similar argument as in Lemma 4, we have

(A55) $\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} = \|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} + \|\hat D(\hat\beta-\beta_0)_{J_0^c}\|_{\ell_1} \le \frac{3}{1-\mu}\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le \frac{3\sqrt{s}}{1-\mu}\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_2} \le \frac{3X_{\max}\sqrt{s}}{\kappa(1-\mu)}\Bigl[\frac{1}{T}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2}^2 + 2\bigl(c_\beta + X_{\min}^{-1}\|\delta_0\|_{\ell_1}\bigr)c_\gamma C\Bigr]^{1/2}.$

Noting that $\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} \ge \min(\hat D)\,\|\hat\beta-\beta_0\|_{\ell_1}$, using Eqs. (A54) and (A55) we have

(A56) $\|\hat\beta-\beta_0\|_{\ell_1} \le \frac{12}{(1-\mu)\kappa^2}\frac{X_{\max}^2}{X_{\min}}\lambda s \;\vee\; \frac{6}{(1-\mu)\kappa}\frac{X_{\max}}{X_{\min}}\Bigl[2\bigl(c_\beta + X_{\min}^{-1}\|\delta_0\|_{\ell_1}\bigr)c_\gamma C\,s\Bigr]^{1/2}.$

Case (ii): $\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}$. In this case, we have (by Lemma 1, as in (A49))

(A57) $\frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 \le 3\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1} \le 3\lambda\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr),$

and

(A58) $(1-\mu)\lambda\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} \le 3\lambda\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}.$

By (A58), we have

(A59) $\|\hat\beta-\beta_0\|_{\ell_1} \le \frac{3}{(1-\mu)X_{\min}}\|\hat D(\hat\beta-\beta_0)_{J_0}\|_{\ell_1}$

(A60) $\le \frac{3}{(1-\mu)X_{\min}}\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr).$

Combining the above results establishes the Lemma. □

The following lemma shows that the bound for the threshold estimator can be further tightened if we combine the results obtained in Lemmas 6 and 7.

Lemma 8

(Tightened bound of the threshold estimator). Suppose that $\|\hat\gamma-\gamma_0\|_{\ell_1} < c_\gamma$ and $\|\hat\beta-\beta_0\|_{\ell_1} < c_\beta$ for some $(c_\gamma, c_\beta)$. Let $\tilde\eta \equiv c^{-1}\lambda\bigl[(1+\mu)X_{\max}c_\beta + c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr]$. If Assumption 1.4 holds, then conditional on the events A, B and $C(c_\gamma)$, we have

$$\|\hat\gamma-\gamma_0\|_{\ell_1} < \tilde\eta.$$

Proof of Lemma 8

Conditional on the events A, B and $C(c_\gamma)$, we have

(A61) $\Bigl|\frac{2}{T}\sum_{t=1}^{T}\bigl[e_t x_t'(\hat\beta_2-\beta_{02}) + e_t x_t'\,\mathbf{1}\{q_t<\hat\gamma_t\}(\hat\delta-\delta_0)\bigr]\Bigr| \le \mu\lambda X_{\max}\|\hat\beta-\beta_0\|_{\ell_1} \le \mu\lambda X_{\max}c_\beta,$

and

(A62) $\Bigl|\frac{2}{T}\sum_{t=1}^{T}e_t x_t'\delta_0\bigl(\mathbf{1}\{q_t<\gamma_t^0\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)\Bigr| \le \lambda c_\gamma.$

Suppose $\tilde\eta \le \|\hat\gamma-\gamma_0\|_{\ell_1} < c_\gamma$. Then, as in the proof of Lemma 6, we have

(A63)
$$\begin{aligned}
\mathrm{SSR}_T(\hat\beta,\hat\gamma) - \mathrm{SSR}_T(\beta,\gamma)
&= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta(\gamma)'x_t(\gamma)\bigr)^2\\
&\quad - \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_2)'e_t x_t - \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\} - \frac{2}{T}\sum_{t=1}^{T}\delta'e_t x_t\bigl(\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t\}\bigr)
\end{aligned}$$

(A64) $= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\beta(\gamma)'x_t(\gamma)\bigr)^2 - \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_2)'e_t x_t - \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\} - R_T(\beta,\gamma).$

Thus,

(A65)
$$\begin{aligned}
\mathrm{SSR}_T(\hat\beta,\hat\gamma) - \mathrm{SSR}_T(\beta_0,\gamma_0)
&= \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \frac{2}{T}\sum_{t=1}^{T}(\hat\beta_2-\beta_{02})'e_t x_t - \frac{2}{T}\sum_{t=1}^{T}(\hat\delta-\delta_0)'e_t x_t\,\mathbf{1}\{q_t<\hat\gamma_t\} - R_T(\beta_0,\gamma_0)\\
&\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \mu\lambda X_{\max}c_\beta - \lambda c_\gamma.
\end{aligned}$$

Note that for $j\notin J_0$, we have $|\hat\beta^{(j)}-\beta_0^{(j)}| + |\beta_0^{(j)}| - |\hat\beta^{(j)}| = 0$. Thus, on the events A and B, we have

(A66)
$$\begin{aligned}
&\bigl[\mathrm{SSR}_T(\hat\beta,\hat\gamma) + \lambda\|\hat D\hat\beta\|_{\ell_1}\bigr] - \bigl[\mathrm{SSR}_T(\beta_0,\gamma_0) + \lambda\|D\beta_0\|_{\ell_1}\bigr]\\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \mu\lambda X_{\max}c_\beta - \lambda c_\gamma - \lambda\bigl[\|D\beta_0\|_{\ell_1} - \|\hat D\hat\beta\|_{\ell_1}\bigr]\\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \mu\lambda X_{\max}c_\beta - \lambda c_\gamma - \lambda\Bigl[\|\hat D(\hat\beta-\beta_0)\|_{\ell_1} + \bigl|\|\hat D\beta_0\|_{\ell_1}-\|D\beta_0\|_{\ell_1}\bigr|\Bigr]\\
&\quad\ge \frac{1}{T}\sum_{t=1}^{T}\bigl(\beta_0'x_t(\gamma_0)-\hat\beta(\hat\gamma)'x_t(\hat\gamma)\bigr)^2 - \lambda\bigl[(1+\mu)X_{\max}c_\beta + c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr]\\
&\quad> c\tilde\eta - \lambda\bigl[(1+\mu)X_{\max}c_\beta + c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr],
\end{aligned}$$

in which the last inequality is due to Assumption 1.4.

Noting that $c\tilde\eta = \lambda\bigl[(1+\mu)X_{\max}c_\beta + c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr]$, the right-hand side equals zero, and we obtain the lemma using a contradiction argument as in Lemma 6. □

Lemma 7 provides us with three different bounds for $\|\hat\beta-\beta_0\|_{\ell_1}$, and two of them are functions of $c_\gamma$ and $c_\beta$. This leads us to apply Lemmas 7 and 8 iteratively to tighten up the bounds. As can be seen from Lemmas 7 and 8, when the sample size becomes large and thus λ is small enough, the bounds for $\|\hat\beta-\beta_0\|_{\ell_1}$ are dominated by the middle term in Lemma 7.

Theorem 2

Suppose $\delta_0 \ne 0$. Suppose Assumptions 1.4, 1.5, and 1.3 hold with $\Gamma^* = \{\gamma: \|\gamma-\gamma_0\|_{\ell_1} < \eta^*\}$, $\kappa = \kappa\bigl(s,\frac{2+\mu}{1-\mu},\Gamma^*\bigr)$ for μ ∈ (0, 1) and $M(\beta_0)\le s\le m$. Let $(\hat\beta,\hat\gamma)$ be the LASSO estimator defined in Section 2 with λ given in Section 2. Then, conditional on the events A, B and $C(\eta^*)$, we have

(A67) $\frac{1}{\sqrt{T}}\bigl\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\bigr\|_{\ell_2} \le \frac{\lambda X_{\max}}{\kappa}\sqrt{18 s},$

(A68) $\|\hat\beta-\beta_0\|_{\ell_1} \le \frac{3}{(1-\mu)X_{\min}}\,6X_{\max}^2\kappa^{-2}\lambda s,$

(A69) $\|\hat\gamma-\gamma_0\|_{\ell_1} \le \Bigl[\frac{3(1+\mu)X_{\max}}{(1-\mu)X_{\min}} + 1\Bigr]6X_{\max}^2 c^{-1}\kappa^{-2}\lambda^2 s,$

(A70) $M(\hat\beta) \le \frac{72\,\phi_{\max}X_{\max}^2}{(1-\mu)^2\kappa^2 X_{\min}^2}\,s.$

Proof of Theorem 2

Let $c_\beta^*$ and $c_\gamma^*$ denote the bounds given in the theorem for $\|\hat\beta-\beta_0\|_{\ell_1}$ and $\|\hat\gamma-\gamma_0\|_{\ell_1}$, respectively. Suppose that

(A71) $\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr) \vee \bigl(6X_{\max}^2\kappa^{-2}\lambda s\bigr) \vee \Bigl(2X_{\max}\kappa^{-1}\bigl(c_\beta c_\gamma C\|\delta_0\|_{\ell_1}\,s\bigr)^{1/2}\Bigr) = 6X_{\max}^2\kappa^{-2}\lambda s.$

That is, given the choice of λ, the bounds for $\|\hat\beta-\beta_0\|_{\ell_1}$ and $\|\hat\gamma-\gamma_0\|_{\ell_1}$ are small enough that the bound is dominated by the middle term in Lemma 7. The same argument applies to $\frac{1}{T}\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\|_{\ell_2}^2$.

By Lemma 8 with $c_\beta = c_\beta^*$, we have

(A72) $\|\hat\gamma-\gamma_0\|_{\ell_1} \le c^{-1}\lambda\bigl[(1+\mu)X_{\max}c_\beta^* + c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr] \le \Bigl[\frac{3(1+\mu)X_{\max}}{(1-\mu)X_{\min}} + 1\Bigr]6X_{\max}^2 c^{-1}\kappa^{-2}\lambda^2 s,$

which is $c_\gamma^*$.

By Lemmas 4 and 7, we have the following three inequalities:

(A73)
$$\begin{aligned}
\|\hat f - f_0\|_T^2 &\le 3\lambda\Bigl[\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr) \vee \bigl(6X_{\max}^2\kappa^{-2}\lambda s\bigr) \vee \Bigl(2X_{\max}\kappa^{-1}\bigl(c_\beta c_\gamma C\|\delta_0\|_{\ell_1}\,s\bigr)^{1/2}\Bigr)\Bigr],\\
\|\hat\beta-\beta_0\|_{\ell_1} &\le \frac{3}{(1-\mu)X_{\min}}\Bigl[\bigl(c_\gamma + 2X_{\min}^{-1}c_\gamma C\|\delta_0\|_{\ell_1}\bigr) \vee \bigl(6X_{\max}^2\kappa^{-2}\lambda s\bigr) \vee \Bigl(2X_{\max}\kappa^{-1}\bigl(c_\beta c_\gamma C\|\delta_0\|_{\ell_1}\,s\bigr)^{1/2}\Bigr)\Bigr],\\
M(\hat\beta) &\le \frac{4\,\mathrm{maxeig}\bigl(X(\hat\gamma)'X(\hat\gamma)/T\bigr)\,\frac{1}{T}\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\|_{\ell_2}^2}{(1-\mu)^2\lambda^2 X_{\min}^2},
\end{aligned}$$

in which $\|\hat f - f_0\|_T^2 \equiv \frac{1}{T}\|X(\hat\gamma)\hat\beta - X(\gamma_0)\beta_0\|_{\ell_2}^2$. Plugging (A71) into the above inequalities, we obtain the desired results. When (A71) does not hold, we can use a chaining argument, iteratively applying Lemmas 7 and 8 to tighten the bounds for the prediction risk and the estimation errors in $\hat\beta$ and $\hat\gamma$ as in Lee, Seo, and Shin (2016), and use the fixed point theorem to show that the bound can be reached within a finite number of iterative applications of Lemmas 7 and 8. □

Lemma 10

(Sup norm of the empirical process). Under Assumption 1, we have

$$\Bigl\|\frac{1}{T}X(\hat\gamma)'e\Bigr\|_{\ell_\infty} = O_p(\lambda) = O_p\Bigl(m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr).$$

Proof of Lemma 10

The proof follows immediately from Lemma 2. □

Theorem 3

Suppose that Assumption 1 holds. When $\delta_0 = 0$, we have

$$\|\hat\beta-\beta_0\|_{\ell_\infty} = O_p(\lambda) = O_p\Bigl(m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr).$$

Proof of Theorem 3

When $\delta_0 = 0$, we can rewrite the model $y_t = \beta'x_t(\gamma) + e_t$ in matrix form:

(A74) $y = X\beta_{02} + e,$

which holds for any given threshold parameter γ. The LASSO solution $\hat\beta$ satisfies the KKT conditions given by

(A75) $-\frac{2}{T}X(\hat\gamma)'\bigl(y - X(\hat\gamma)\hat\beta\bigr) + \lambda D(\hat\gamma)z(\hat\gamma) = 0,$

where $\|z(\hat\gamma)\|_{\ell_\infty} \le 1$ and $z^{(j)}(\hat\gamma) = \mathrm{sign}(\hat\beta^{(j)})$ if $\hat\beta^{(j)}(\hat\gamma)\ne 0$. This can be reorganized as

(A76) $\frac{2}{T}X(\hat\gamma)'X(\hat\gamma)(\hat\beta-\beta_0) = \frac{2}{T}X'e - \lambda D(\hat\gamma)z(\hat\gamma),$

which is equivalent to

(A77) $2\Sigma(\hat\gamma)(\hat\beta-\beta_0) = 2\Bigl[\Sigma(\hat\gamma) - \frac{1}{T}X(\hat\gamma)'X(\hat\gamma)\Bigr](\hat\beta-\beta_0) + \frac{2}{T}X'e - \lambda D(\hat\gamma)z(\hat\gamma).$

By assumption, $\Theta(\gamma) = \Sigma(\gamma)^{-1}$ exists for all γ ∈ Γ. Thus, we obtain

(A78) $\hat\beta-\beta_0 = \Theta(\hat\gamma)\Bigl[\Sigma(\hat\gamma) - \frac{1}{T}X(\hat\gamma)'X(\hat\gamma)\Bigr](\hat\beta-\beta_0) + \Theta(\hat\gamma)\frac{1}{T}X'e - \frac{\lambda}{2}\Theta(\hat\gamma)D(\hat\gamma)z(\hat\gamma).$

Thus, we have

(A79) $\|\hat\beta-\beta_0\|_{\ell_\infty} \le \sup_{\gamma\in\Gamma}\|\Theta(\gamma)\|_{\ell_\infty}\Bigl\|\Sigma(\hat\gamma) - \frac{1}{T}X(\hat\gamma)'X(\hat\gamma)\Bigr\|_{\ell_\infty}\|\hat\beta-\beta_0\|_{\ell_1} + \sup_{\gamma\in\Gamma}\|\Theta(\gamma)\|_{\ell_\infty}\Bigl\|\frac{1}{T}X'e\Bigr\|_{\ell_\infty} + \lambda\|\Theta(\hat\gamma)\|_{\ell_\infty}X_{\max}.$

By assumption, $\sup_{\gamma\in\Gamma}\|\Theta(\gamma)\|_{\ell_\infty}$ is bounded. By Callot et al. (2017) and Theorem 2 above, $\sup_{\gamma\in\Gamma}\bigl\|\Sigma(\gamma) - \frac{1}{T}X(\gamma)'X(\gamma)\bigr\|_{\ell_\infty} = O_p\Bigl(\sqrt{\frac{\log(mT)}{T}}\Bigr)$ and $\|\hat\beta-\beta_0\|_{\ell_1} = O_p(\lambda s) = O_p\Bigl(s\,m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr)$. Thus, as $s\sqrt{\log(mT)/T}\to 0$, we obtain

(A80) $\|\hat\beta-\beta_0\|_{\ell_\infty} = O_p\Bigl(m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr) = O_p(\lambda).$ □

Theorem 4

Suppose that Assumption 1 holds. When $\delta_0 \ne 0$, we have

$$\|\hat\beta-\beta_0\|_{\ell_\infty} = O_p(\lambda) = O_p\Bigl(m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr).$$

Proof of Theorem 4

When $\delta_0 \ne 0$, we can rewrite the model $y_t = \beta'x_t(\gamma) + e_t$ in matrix form:

(A81) $y = X(\gamma_0)\beta_0 + e.$

The LASSO solution $\hat\beta = (\hat\beta_2', \hat\delta')'$ satisfies the KKT conditions given by

(A82) $-\frac{2}{T}X(\hat\gamma)'\bigl(y - X(\hat\gamma)\hat\beta\bigr) + \lambda D(\hat\gamma)z(\hat\gamma) = 0,$

where $\|z(\hat\gamma)\|_{\ell_\infty} \le 1$ and $z^{(j)}(\hat\gamma) = \mathrm{sign}(\hat\beta^{(j)})$ if $\hat\beta^{(j)}(\hat\gamma)\ne 0$. This can be reorganized as

(A83) $\frac{1}{T}X(\hat\gamma)'X(\hat\gamma)(\hat\beta-\beta_0) - \frac{1}{T}X(\hat\gamma)'\bigl[X(\gamma_0) - X(\hat\gamma)\bigr]\beta_0 = \frac{1}{T}X(\hat\gamma)'e - \lambda D(\hat\gamma)z(\hat\gamma)/2.$

We rewrite (A83) as

(A84) $\Sigma(\gamma_0)(\hat\beta-\beta_0) - \frac{1}{T}X(\hat\gamma)'\bigl[X(\gamma_0) - X(\hat\gamma)\bigr]\beta_0 = \Bigl[\Sigma(\gamma_0) - \frac{1}{T}X(\hat\gamma)'X(\hat\gamma)\Bigr](\hat\beta-\beta_0) + \frac{1}{T}X(\hat\gamma)'e - \lambda D(\hat\gamma)z(\hat\gamma)/2.$

By assumption, $\Theta(\gamma_0) = \Sigma(\gamma_0)^{-1}$ exists. Thus, we obtain

(A85) $\hat\beta-\beta_0 = \Theta(\gamma_0)\frac{1}{T}X(\hat\gamma)'\bigl[X(\gamma_0) - X(\hat\gamma)\bigr]\beta_0 + \Theta(\gamma_0)\Bigl[\Sigma(\gamma_0) - \frac{1}{T}X(\hat\gamma)'X(\hat\gamma)\Bigr](\hat\beta-\beta_0) + \Theta(\gamma_0)\frac{1}{T}X(\hat\gamma)'e - \lambda\Theta(\gamma_0)D(\hat\gamma)z(\hat\gamma)/2.$

As in Theorem 3, we have

(A86) $\|\hat\beta-\beta_0\|_{\ell_\infty} \le \|\Theta(\gamma_0)\|_{\ell_\infty}\Bigl\|\frac{1}{T}X(\hat\gamma)'\bigl[X(\gamma_0) - X(\hat\gamma)\bigr]\beta_0\Bigr\|_{\ell_\infty} + \|\Theta(\gamma_0)\|_{\ell_\infty}\Bigl\|\Sigma(\gamma_0) - \frac{1}{T}X(\hat\gamma)'X(\hat\gamma)\Bigr\|_{\ell_\infty}\|\hat\beta-\beta_0\|_{\ell_1} + \|\Theta(\gamma_0)\|_{\ell_\infty}\Bigl\|\frac{1}{T}X(\hat\gamma)'e\Bigr\|_{\ell_\infty} + \lambda\|\Theta(\gamma_0)\|_{\ell_\infty}X_{\max}.$

By Theorem 2, we have $\|\hat\gamma-\gamma_0\|_{\ell_1} = O_p(s\lambda^2)$ with probability approaching one. Thus,

(A87) $\Bigl\|\frac{1}{T}X(\hat\gamma)'\bigl[X(\gamma_0) - X(\hat\gamma)\bigr]\beta_0\Bigr\|_{\ell_\infty} \le \sup_{j}\frac{1}{T}\sum_{t=1}^{T}\bigl(x_t^{(j)}\bigr)^2\Bigl|\mathbf{1}\{q_t<\hat\gamma_t\}-\mathbf{1}\{q_t<\gamma_t^0\}\Bigr|\,\|\delta_0\|_{\ell_1} \le C\,s\,M(\delta_0)\,m^{2/d}\,\frac{\log(m)}{T}.$

As we assume $s\,M(\delta_0)\,m^{1/d}\sqrt{\log(mT)/T}\to 0$, we thus have

(A88) $\Bigl\|\frac{1}{T}X(\hat\gamma)'\bigl[X(\gamma_0) - X(\hat\gamma)\bigr]\beta_0\Bigr\|_{\ell_\infty} = O_p\Bigl(m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr) = O_p(\lambda).$

By assumption, $\|\Theta(\gamma_0)\|_{\ell_\infty}$ is bounded. By Callot et al. (2017) and Theorem 2 above, we have $\bigl\|\Sigma(\gamma_0) - \frac{1}{T}X(\hat\gamma)'X(\hat\gamma)\bigr\|_{\ell_\infty} = O_p(\lambda)$ and $\|\hat\beta-\beta_0\|_{\ell_1} = O_p(\lambda s) = O_p\Bigl(s\,m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr)$. Thus, we obtain

(A89) $\|\hat\beta-\beta_0\|_{\ell_\infty} = O_p\Bigl(m^{1/d}\sqrt{\frac{\log(m)}{T}}\Bigr) = O_p(\lambda).$ □

Theorem 5

(Consistency of variable selection). Assume $\min_{j\in M(\beta_0)}|\beta_0^{(j)}| > 3C\lambda$, and Assumption 1 holds. Then $\Pr\bigl(J(\tilde\beta) = J(\beta_0)\bigr)\to 1$ as T → ∞.

Proof of Theorem 5

We consider the zero and nonzero coefficients separately and show that both groups can be classified correctly. By Theorems 3 and 4, conditional on the events A, B and C, there exists a constant C > 0 such that $\|\hat\beta-\beta_0\|_{\ell_\infty} \le C\lambda$ with probability approaching one for sufficiently large T.

We first consider the truly zero coefficients. Let $j\in M(\beta_0)^c$. By the sup-norm bounds in Theorems 3 and 4, we have

(A90) $\max_{j\in M(\beta_0)^c}|\hat\beta^{(j)}| \le C\lambda < 2C\lambda = H,$

such that $\tilde\beta^{(j)} = 0$ by the definition of the thresholded LASSO estimator.

On the other hand, consider the nonzero coefficients. Let $j\in M(\beta_0)$ and note that (by the triangle inequality)

(A91) $|\hat\beta^{(j)}| \ge \min_{j\in M(\beta_0)}|\beta_0^{(j)}| - |\hat\beta^{(j)}-\beta_0^{(j)}| \ge 3C\lambda - C\lambda \ge 2C\lambda = H,$

such that $\tilde\beta^{(j)} = \hat\beta^{(j)} \ne 0$ by the definition of the thresholded LASSO estimator and the beta-min assumption. □
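The classification rule used in this proof is simply hard-thresholding of the LASSO estimate at the cutoff H = 2Cλ. A minimal numpy sketch (the function name and the value of the constant C are illustrative assumptions, not the paper's code):

```python
import numpy as np

def hard_threshold(beta_hat: np.ndarray, lam: float, C: float = 1.0) -> np.ndarray:
    """Thresholded LASSO: keep beta_hat[j] only if |beta_hat[j]| exceeds H = 2*C*lam."""
    H = 2.0 * C * lam
    return np.where(np.abs(beta_hat) > H, beta_hat, 0.0)

# Estimates within H of zero are classified as zero; under the beta-min condition
# min_{j in M(beta_0)} |beta_0[j]| > 3*C*lam, the true nonzeros survive the cut.
```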

Theorem 6

(Threshold effect). Assume $\min_{j\in M(\beta_0)}|\beta_0^{(j)}| > 3C\lambda$, and Assumption 1 holds. Then $\Pr\bigl(J(\tilde\delta) = J(\delta_0)\bigr)\to 1$ as T → ∞.

Proof of Theorem 6

This proof follows immediately from Theorem 5, as $\tilde\beta = (\tilde\beta_2', \tilde\delta')'$. □

Appendix B: A block coordinate descent (BCD) algorithm

In this appendix, we modify the block coordinate descent (BCD) algorithm of Lee et al. (2021) to select covariates in the proposed model. For notational simplicity, we define $f_t = \bigl(q_t, \sin(2\pi kt/T), \cos(2\pi kt/T), -1\bigr)'$. Then, the model defined in (1) and (2) can be rewritten as

(B1) $y_t = \beta_1'x_t + \delta^{*\prime}x_t\,\mathbf{1}\{f_t'\gamma^* > 0\} + e_t,$

where $\delta^* = \beta_2 - \beta_1$ and $\gamma^* = (1, -\gamma_1, -\gamma_2, \gamma_0)'$. Denote $d_t = \mathbf{1}\{f_t'\gamma^* > 0\}$, and $l_{j,t} = \delta_j^* d_t$ for $j = 1,\ldots,m_T$, $t = 1,\ldots,T$, where $\delta_j^*$ is the jth element of $\delta^*$. Let $d = (d_1,\ldots,d_T)'$, $l = (l_{1,1},\ldots,l_{j,t},\ldots,l_{m_T,T})'$, and $e = (e_1,\ldots,e_{2m_T})'$.
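To fix ideas, the regime indicator $d_t = \mathbf{1}\{f_t'\gamma^* > 0\}$ simply compares $q_t$ with the Fourier threshold $\gamma_t = \gamma_0 + \gamma_1\sin(2\pi kt/T) + \gamma_2\cos(2\pi kt/T)$. A minimal numpy sketch under illustrative parameter values (the function name is an assumption of this sketch, not the paper's code):

```python
import numpy as np

def regime_indicator(q, gamma0, gamma1, gamma2, k=1):
    """Return d_t = 1{f_t' gamma* > 0} and the time-varying threshold gamma_t, where
    f_t = (q_t, sin(2*pi*k*t/T), cos(2*pi*k*t/T), -1)' and gamma* = (1, -gamma1, -gamma2, gamma0)',
    so that f_t' gamma* = q_t - gamma_t and hence d_t = 1{q_t > gamma_t}."""
    T = len(q)
    t = np.arange(1, T + 1)
    gamma_t = gamma0 + gamma1 * np.sin(2 * np.pi * k * t / T) + gamma2 * np.cos(2 * np.pi * k * t / T)
    return (q > gamma_t).astype(float), gamma_t

q = np.random.default_rng(0).normal(size=200)
d, gamma_t = regime_indicator(q, gamma0=0.5, gamma1=0.5, gamma2=0.5)  # threshold values as in Table A1
```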

Following Lee et al. (2021), we consider the variable selection problem in the l 0-penalization framework by solving the following mixed integer quadratic programming (MIQP):

(B2) $\min_{\beta_1,\delta^*,\gamma^*,d,l,e}\;\frac{1}{T}\sum_{t=1}^{T}\Bigl(y_t - \beta_1'x_t - \sum_{j=1}^{m_T}x_{j,t}\,l_{j,t}\Bigr)^2 + \lambda\sum_{m=1}^{2m_T}e_m$

subject to

(B3) $\beta_1, \delta^* \in A, \quad \gamma^* \in \Gamma^*,$

(B4) $L_j \le \delta_j^* \le U_j,$

(B5) $(d_t - 1)(M_t + \epsilon) < f_t'\gamma^* \le d_t M_t,$

(B6) $d_t \in \{0, 1\},$

(B7) $d_t L_j \le l_{j,t} \le d_t U_j,$

(B8) $L_j(1 - d_t) \le \delta_j^* - l_{j,t} \le U_j(1 - d_t),$

(B9) $\tau_1 \le \frac{1}{T}\sum_{t=1}^{T}d_t \le \tau_2,$

(B10) $e_m\underline{\beta} \le \beta_1 \le e_m\bar{\beta},$

(B11) $e_m L \le \delta^* \le e_m U,$

(B12) $e_m \in \{0, 1\},$

(B13) $\underline{p} \le \sum_{m=1}^{2m_T}e_m \le \bar{p},$

for $j = 1,\ldots,m_T$, $t = 1,\ldots,T$, and $m = 1,\ldots,2m_T$, where $0 < \tau_1 < \tau_2 < 1$, $L_j$ and $U_j$ are the lower and upper bounds for $\delta_j^*$, L and U are the lower and upper bounds for $\delta^*$, $\underline{\beta}$ and $\bar{\beta}$ are the lower and upper bounds for $\beta_1$, and $\underline{p}$ and $\bar{p}$ are the lower and upper bounds on the number of active elements of $(\beta_1', \delta^{*\prime})'$.
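Constraint (B5) is a standard big-M device: with $M_t$ a bound on $|f_t'\gamma^*|$ and ϵ a small tolerance, the pair of linear inequalities forces the binary $d_t$ to coincide with the indicator $\mathbf{1}\{f_t'\gamma^* > 0\}$. A minimal sketch checking this logic (the values of M and ϵ, and the function name, are illustrative assumptions):

```python
def big_m_feasible(d_t: int, s: float, M: float = 10.0, eps: float = 1e-6) -> bool:
    """Check constraint (B5): (d_t - 1) * (M + eps) < s <= d_t * M, with s = f_t' gamma*."""
    return (d_t - 1) * (M + eps) < s <= d_t * M

# d_t = 1 is feasible iff 0 < s <= M, and d_t = 0 is feasible iff -(M + eps) < s <= 0,
# so the only feasible d_t equals the indicator 1{s > 0}:
assert big_m_feasible(1, 0.3) and not big_m_feasible(0, 0.3)
assert big_m_feasible(0, -0.3) and not big_m_feasible(1, -0.3)
```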

As discussed in Lee et al. (2021), the algorithm based on MIQP may run slowly when the dimension of $x_t$ is large. Therefore, we follow Lee et al. (2021) and consider a block coordinate descent (BCD) algorithm, which is an iterative algorithm based on mixed integer linear programming (MILP). The modified BCD algorithm is described as follows; a schematic implementation is sketched after the steps.

Step 1. Obtain an initial estimate $(\hat\beta_1^0, \hat\delta^{*0}, \hat\gamma^{*0})$ using MIQP with a given time limit, say, MaxTime1. If a solution is obtained before reaching MaxTime1, then the initial estimate is set as the final estimate.

Step 2. If the final estimate is not obtained in Step 1, iterate the following steps (a)–(b) beginning with k = 1.

  (a) For the given $(\hat\beta_1^{k-1}, \hat\delta^{*k-1})$, update $\hat\gamma^*$ using MILP:

    $\hat\gamma^{*k} \leftarrow \arg\min_{\gamma^*, d, e}\;\frac{1}{T}\sum_{t=1}^{T}\Bigl[\bigl(\hat\delta^{*k-1\prime}x_t\bigr)^2 - 2\bigl(y_t - \hat\beta_1^{k-1\prime}x_t\bigr)\bigl(\hat\delta^{*k-1\prime}x_t\bigr)\Bigr]d_t$

    subject to (B5), (B6), and (B9)–(B12).

  (b) For the given $\hat\gamma^{*k}$, update the nonzero elements of $(\hat\beta_1^{k-1}, \hat\delta^{*k-1})$ by OLS:

    $\bigl(\hat\beta_1^{k}, \hat\delta^{*k}\bigr) \leftarrow \Bigl[\frac{1}{T}\sum_{t=1}^{T}x_t(\hat\gamma^{*k})x_t(\hat\gamma^{*k})'\Bigr]^{-1}\frac{1}{T}\sum_{t=1}^{T}x_t(\hat\gamma^{*k})y_t.$
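A schematic implementation of Step 2 follows, reusing regime_indicator from the sketch at the beginning of this appendix. For illustration only, the MILP update in step (a) is replaced by a grid search over candidate threshold parameters, the MIQP warm start of Step 1 is replaced by a zero initialization, and the OLS step refits all columns rather than only the active set; the grid, the use of the first column of X as $q_t$, and the function names are assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def bcd(y, X, gamma_grid, n_iter=2):
    """Schematic BCD: alternate between (a) updating the threshold parameters
    (grid search standing in for the MILP step) and (b) refitting slopes by OLS.
    gamma_grid holds candidate rows (gamma0, gamma1, gamma2, k)."""
    T, p = X.shape
    q = X[:, 0]                      # threshold variable (assumed first column)
    beta1, delta = np.zeros(p), np.zeros(p)
    gamma = gamma_grid[0]
    for _ in range(n_iter):
        # (a) minimize (1/T) sum_t [ (delta'x_t)^2 - 2 (y_t - beta1'x_t)(delta'x_t) ] d_t
        best_obj = np.inf
        u, r = X @ delta, y - X @ beta1
        for g in gamma_grid:
            d, _ = regime_indicator(q, g[0], g[1], g[2], k=int(g[3]))
            obj = np.mean((u ** 2 - 2.0 * r * u) * d)
            if obj < best_obj:
                best_obj, gamma = obj, g
        # (b) OLS on x_t(gamma) = (x_t', x_t' d_t)'
        d, _ = regime_indicator(q, gamma[0], gamma[1], gamma[2], k=int(gamma[3]))
        Z = np.hstack([X, X * d[:, None]])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        beta1, delta = coef[:p], coef[p:]
    return beta1, delta, gamma
```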

We next compare the BCD algorithm described above and the grid search (GS) method suggested in the paper through Monte Carlo simulations. To this end, the data are generated based on (17) and (18) with the sample size T = 200. We vary the dimension of $x_t$, say, d.x = {2, 5, 10}, and set $(\beta_2', \delta')$ = ([1, 0], [1, 0]), ([1, 1, 1, 0, 0], [1, −1, 1, 0, 0]), and ([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, −1, 1, −1, 1, 0, 0, 0, 0, 0]), respectively. The other parameter values are set as in Section 3. The parameter space for the threshold parameters is set as $[q_{(\tau)}, q_{(1-\tau)}]^4$, where $q_{(\tau)}$ is the τth order statistic of the threshold variable $q_t$, and τ is specified as τ = 0.15. The parameter space for the slope parameters is set as $[-6, 6]^{d.x}$. In the BCD algorithm, we set MaxTime1 as 1200 s and k = 2, because k = 2 iterations suffice in the BCD algorithm as shown in Lee et al. (2021). As variable selection consistency requires λ → 0 and λT → ∞ (Lee et al. 2021), we simply set λ = log(T)/T in the simulations.
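For reference, the GS method evaluates the fit over a grid built from order statistics of $q_t$. A minimal sketch with the frequency k fixed at 1 and the slopes profiled out by plain least squares standing in for the penalized criterion (the grid resolution and the function name are assumptions of this sketch; regime_indicator is the earlier sketch):

```python
import itertools
import numpy as np

def grid_search(y, X, q, n_grid=10, tau=0.15):
    """Grid search over (gamma0, gamma1, gamma2), each on [q_(tau), q_(1-tau)]."""
    lo, hi = np.quantile(q, [tau, 1.0 - tau])
    grid = np.linspace(lo, hi, n_grid)
    best, best_ssr = None, np.inf
    for g0, g1, g2 in itertools.product(grid, repeat=3):
        d, _ = regime_indicator(q, g0, g1, g2)          # regime indicator d_t
        Z = np.hstack([X, X * d[:, None]])              # x_t(gamma) = (x_t', x_t' d_t)'
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        ssr = np.sum((y - Z @ coef) ** 2)
        if ssr < best_ssr:
            best, best_ssr = (g0, g1, g2), ssr
    return best, best_ssr
```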

In Table A1, we report the same statistics as in Section 3 to compare the accuracy of the different algorithms. As can be seen from Table A1, when d.x < 50 the BCD algorithm performs better than the grid search (GS) method in terms of variable selection, according to the FNZ and PS statistics, while the two methods perform similarly in terms of threshold estimation and regime classification. When d.x = 50, the BCD algorithm is outperformed by the GS method, because we set MaxTime1 = 1200 s in the BCD algorithm and BCD does not converge within this time budget, as can be seen from Table A2.

Table A1:

The BCD algorithm and the grid search method.

| d.x | Method | MSE | FNZ | FZ | PS | l1 error | l∞ error | γ0 = 0.5 | γ1 = 0.5 | γ2 = 0.5 | AC | Regime1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | WL | 0.252 | 2.505 | 0.000 | 0.005 | 0.294 | 0.105 | 0.518 | 0.503 | 0.505 | 0.977 | 0.653 |
| | TH | 0.252 | 1.797 | 0.000 | 0.108 | 0.288 | 0.105 | | | | | |
| | BCD | 0.255 | 0.126 | 0.000 | 0.927 | 0.097 | 0.061 | 0.500 | 0.494 | 0.504 | 0.982 | |
| 5 | WL | 0.255 | 4.520 | 0.000 | 0.000 | 0.800 | 0.161 | 0.506 | 0.505 | 0.507 | 0.991 | 0.656 |
| | TH | 0.255 | 3.640 | 0.000 | 0.000 | 0.795 | 0.161 | | | | | |
| | BCD | 0.256 | 0.130 | 0.000 | 0.912 | 0.287 | 0.089 | 0.495 | 0.503 | 0.505 | 0.989 | |
| 10 | WL | 0.266 | 9.349 | 0.000 | 0.000 | 1.491 | 0.198 | 0.505 | 0.510 | 0.505 | 0.988 | 0.654 |
| | TH | 0.267 | 6.771 | 0.000 | 0.000 | 1.460 | 0.198 | | | | | |
| | BCD | 0.271 | 0.373 | 0.000 | 0.812 | 0.499 | 0.106 | 0.501 | 0.497 | 0.507 | 0.991 | |
| 20 | WL | 0.234 | 13.699 | 0.000 | 0.000 | 1.622 | 0.270 | 0.484 | 0.485 | 0.475 | 0.989 | 0.656 |
| | TH | 0.251 | 1.625 | 0.000 | 0.516 | 1.369 | 0.270 | | | | | |
| | BCD | 0.228 | 0.251 | 0.000 | 0.965 | 0.601 | 0.124 | 0.499 | 0.476 | 0.495 | 0.992 | |
| 50 | WL | 0.301 | 15.396 | 0.000 | 0.000 | 2.248 | 0.413 | 0.496 | 0.496 | 0.510 | 0.943 | 0.652 |
| | TH | 0.342 | 0.281 | 0.000 | 0.716 | 1.912 | 0.414 | | | | | |
| | BCD | 0.982 | 3.123 | 0.000 | 0.221 | 4.730 | 0.793 | 0.526 | 0.404 | 0.423 | 0.990 | |
Table A2:

Computation time (in seconds) for the different algorithms.

| Statistic | Algorithm | d.x = 2 | d.x = 5 | d.x = 10 | d.x = 20 | d.x = 50 |
|---|---|---|---|---|---|---|
| Min | BCD | 5.36 | 20.27 | 69.58 | 308.90 | 1219.00 |
| | GS | 14.92 | 18.25 | 23.34 | 33.82 | 93.78 |
| Median | BCD | 10.54 | 36.08 | 153.67 | 715.60 | 1221.01 |
| | GS | 31.11 | 37.48 | 43.05 | 68.00 | 175.80 |
| Mean | BCD | 10.80 | 37.87 | 164.11 | 765.40 | 1221.00 |
| | GS | 32.33 | 37.60 | 45.19 | 68.05 | 176.80 |
| Max | BCD | 21.20 | 77.57 | 355.42 | 1205.60 | 1230.00 |
| | GS | 69.81 | 62.04 | 84.01 | 96.27 | 310.19 |

In Table A2, we report summary statistics of computation time to compare the computational burden of the two methods. As can be seen from Table A2, the BCD algorithm is faster than the GS method when the dimension of x t is small (e.g., d.x = 2); however, the BCD algorithm would become slower than the GS method when the dimension of x t is very large (e.g., d.x = 50).

Overall, it seems that the GS method is useful when the dimension of x t is large and the dimension of γ * is small. In the proposed model, the dimension of x t can be larger than the sample size, while the dimension of γ * can be set at 4, because the Fourier function with a single frequency can be a reasonable approximation for different time-varying features; thus, the GS method is appropriate for the suggested high-dimensional threshold model. It is worth noting that the BCD algorithm works well when the threshold parameter γ * is of high dimension, in which case the GS method would not work well.

References

Becker, R., W. Enders, and J. Lee. 2006. "A Stationarity Test in the Presence of an Unknown Number of Smooth Breaks." Journal of Time Series Analysis 27: 381–409. https://doi.org/10.1111/j.1467-9892.2006.00478.x.

Callot, L., M. Caner, A. B. Kock, and J. A. Riquelme. 2017. "Sharp Threshold Detection Based on Sup-Norm Error Rates in High-Dimensional Models." Journal of Business & Economic Statistics 35 (2): 250–64. https://doi.org/10.1080/07350015.2015.1052461.

Cam, L. L., and G. L. Yang. 2015. Asymptotics in Statistics: Some Basic Concepts. New York: Springer.

Cecchetti, S., M. Mohanty, and F. Zampolli. 2012. "The Real Effects of Debt." In BIS Working Papers No. 352.

Chan, K. S. 1993. "Consistency and Limiting Distribution of the Least Squares Estimator of a Threshold Autoregressive Model." Annals of Statistics 21: 520–33. https://doi.org/10.1214/aos/1176349040.

Chen, H. 2015. "Robust Estimation and Inference for Threshold Models with Integrated Regressors." Econometric Theory 31 (4): 778–810. https://doi.org/10.1017/s0266466614000553.

Dueker, M. J., Z. Psaradakis, and M. Sola. 2013. "State-Dependent Threshold Smooth Transition Autoregressive Models." Oxford Bulletin of Economics & Statistics 75 (6): 835–54. https://doi.org/10.1111/j.1468-0084.2012.00719.x.

Enders, W., and J. Lee. 2012. "A Unit Root Test Using a Fourier Series to Approximate Smooth Breaks." Oxford Bulletin of Economics & Statistics 74 (4): 574–99. https://doi.org/10.1111/j.1468-0084.2011.00662.x.

Hansen, B. E. 2000. "Sample Splitting and Threshold Estimation." Econometrica 68 (3): 575–603. https://doi.org/10.1111/1468-0262.00124.

Hansen, B. E. 2017. "Regression Kink with an Unknown Threshold." Journal of Business & Economic Statistics 35 (2): 228–40. https://doi.org/10.1080/07350015.2015.1073595.

Lee, S., M. H. Seo, and Y. Shin. 2016. "The Lasso for High Dimensional Regression with a Possible Change Point." Journal of the Royal Statistical Society: Series B 78 (1): 193–210. https://doi.org/10.1111/rssb.12108.

Lee, S., Y. Liao, M. H. Seo, and Y. Shin. 2021. "Factor-Driven Two-Regime Regression." Annals of Statistics 49 (3): 1656–78. https://doi.org/10.1214/20-aos2017.

Medeiros, M. C., and E. F. Mendes. 2016. "l1-Regularization of High-Dimensional Time-Series Models with Non-Gaussian and Heteroskedastic Errors." Journal of Econometrics 191 (1): 255–71. https://doi.org/10.1016/j.jeconom.2015.10.011.

Omay, T. 2015. "Fractional Frequency Flexible Fourier Form to Approximate Smooth Breaks in Unit Root Testing." Economics Letters 134: 123–6. https://doi.org/10.1016/j.econlet.2015.07.010.

Seo, M. H., and O. Linton. 2007. "A Smoothed Least Squares Estimator for Threshold Regression Models." Journal of Econometrics 141 (2): 704–35. https://doi.org/10.1016/j.jeconom.2006.11.002.

Wang, H., B. Li, and C. Leng. 2009. "Shrinkage Tuning Parameter Selection with a Diverging Number of Parameters." Journal of the Royal Statistical Society: Series B 71 (3): 671–83. https://doi.org/10.1111/j.1467-9868.2008.00693.x.

Yang, L., C. Lee, and J.-J. Su. 2017. "Behavior of the Standard Dickey-Fuller Test when There Is a Fourier-Form Break under the Null Hypothesis." Economics Letters 159 (21): 128–33. https://doi.org/10.1016/j.econlet.2017.07.016.

Yang, L., and J.-J. Su. 2018. "Debt and Growth: Is There a Constant Tipping Point?" Journal of International Money and Finance 87: 133–43. https://doi.org/10.1016/j.jimonfin.2018.06.002.

Yang, L., C. Lee, and I.-P. Chen. 2021a. "Threshold Model with a Time-Varying Threshold Based on Fourier Approximation." Journal of Time Series Analysis 42 (4): 406–30. https://doi.org/10.1111/jtsa.12574.

Yang, L., C. Zhang, C. Lee, and I.-P. Chen. 2021b. "Panel Kink Threshold Regression Model with a Covariate-Dependent Threshold." The Econometrics Journal 24 (3): 462–81. https://doi.org/10.1093/ectj/utaa035.

Yang, L. 2019. "Regression Discontinuity Designs with State-Dependent Unknown Discontinuity Points: Estimation and Testing." Studies in Nonlinear Dynamics & Econometrics 23 (2): 1–18. https://doi.org/10.1515/snde-2017-0059.

Yang, L. 2022. "Time-Varying Threshold Cointegration with an Application to the Fisher Hypothesis." Studies in Nonlinear Dynamics & Econometrics 26 (2): 257–74. https://doi.org/10.1515/snde-2018-0101.

Yu, P., and X. Fan. 2021. "Threshold Regression with a Threshold Boundary." Journal of Business & Economic Statistics 39 (4): 1–59. https://doi.org/10.1080/07350015.2020.1740712.

Zhu, Y., H. Chen, and M. Lin. 2019. "Threshold Models with Time-Varying Threshold Values and Their Application in Estimating Regime-Sensitive Taylor Rules." Studies in Nonlinear Dynamics & Econometrics 23 (5): 1–17. https://doi.org/10.1515/snde-2017-0114.

Zou, H., T. Hastie, and R. Tibshirani. 2007. "On the Degrees of Freedom of the Lasso." Annals of Statistics 35: 2173–92. https://doi.org/10.1214/009053607000000127.


Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/snde-2021-0047).


Received: 2021-05-18
Revised: 2022-02-14
Accepted: 2022-03-17
Published Online: 2022-05-30

© 2022 Walter de Gruyter GmbH, Berlin/Boston
