# multistep kernel regression smoothing by charles/ ¢ multistep kernel regression smoothing

Post on 11-Mar-2020

1 views

Embed Size (px)

TRANSCRIPT

Multistep kernel regression smoothing by boosting

Marco Di Marzio

DMQTE, G.d’Annunzio University, ITALY

Charles C. Taylor

Department of Statistics, University of Leeds, UK

Summary. In this paper we propose a simple multistep regression smoother which is con- structed in a boosting fashion, by learning the Nadaraya–Watson estimator with L2Boosting. Differently from the usual approach, we do not focus on L2Boosting for ever. Given a kernel smoother as a learner, we explore the boosting capability to build estimators using a finite number of boosting iterations. This approach appears fruitful since it simplifies the boosting in- terpretation and application. We find, in both theoretical analysis and simulation experiments, that higher order bias properties emerge. Relationships between our smoother and previous work are explored. Moreover, we suggest a way to successfully employ our method for estimat- ing probability density functions (pdf) and cumulative distribution functions (cdf) via binning procedures and the smoothing of the empirical cumulative distribution function, respectively. The practical performance of the method is illustrated by a large simulation study which shows an encouraging finite sample behaviour paricularly in comparison with other methods.

Keywords: Bias Reduction; Convolution; Local Polynomial Fitting; Machine Learning; Squared Loss; Twicing.

1. Introduction

1.1. Objectives and motivation Due to impressive performance, boosting (Shapire, 1990; Freund, 1995) has become one of the most studied machine learning ideas in the statistics community. Basically, a B-steps boosting algorithm (b.a.) iteratively computes B estimates by applying a given method, called a weak learner (w.l.), to B different re–weighted samples. The estimates are then combined into a single one which is the final output. This ensemble rule can be viewed as a ‘powerful committee’, which is expected to be significantly more accurate than every single estimate. A great deal of effort is being spent in developing theory to explain the practical behaviour of boosting, and at the moment a couple of crucial questions appear to have been successfully addressed. An important question is why boosting works. Now it seems that a satisfying response has been provided in that boosting is viewed as a greedy function approximation technique (Breiman, 1997; Friedman, 2001), where a loss function is optimized by B iterative adjustments of the estimate in the function space; at each iteration the weighting system indicates the direction of steepest descent. A second question concerns the Bayes risk consistency, and recent theoretical results show that boosting is not Bayes risk consistent and regularization techniques are needed. Recent results suggest that some regularization methods exist such that the prediction error should be nearly optimal for sufficiently large samples. Jiang (2004) analyzes the original Adaboost algorithm with ‘early stopping’, whereas alternative regularization methods are considered by Lugosi and

2 M. Di Marzio & C.C. Taylor

Vayatis (2004) and by Zhang (2004). In practical applications, a stopping rule is often determined by recourse to cross-validation based on subsamples.

To implement boosting we need to choose a loss function and a w.l.. Every loss function leads to a specifically ‘shaped’ b.a., consider for example, AdaBoost (Shapire, 1990; Freund and Shapire, 1996) for exponential losses, or L2boost (Friedman, 2001; Bühlmann and Yu, 2003) for L2 losses and so on. Clearly, if a specific w.l. is considered as well, a b.a. can be explicitly expressed as a multistep estimator, and some statistical properties can thus be derived. As a consequence, it could be possible to investigate boosting from a slightly new perspective. In particular, it could be worthwhile to explore the boosting ability to build new estimators in correspondence of a finite number of iterations. On the contrary, as mentioned, until now the predominant interest has been devoted to ensuring asymptotic behaviour such as resistance to overfitting or existence of bounds for stopping rules. We find three main advantages in a finite–steps approach: i) if we find that an estimator has optimal properties at a specific step, the need for regularization techniques vanishes; ii) when moving in the field of classical inference theory the interpretation becomes easier, also because underpinnings of related work could arise; iii) since classical statistical estimators result, the proposed methods are immediately and easily employable. This approach was recently pursued in density estimation and classification. Di Marzio and Taylor (2004) find that the second step of their b.a. for density estimation is a higher order bias estimator quite similar to the a well known higher order method due to Jones et al. (1995). Similar conclusions are also reached by Di Marzio and Taylor (2005a) that learn kernel density discrimination by boosting obtaining improved estimates of the frontier.

In the present paper we adopt the finite–steps perspective for exploring the potential of boosting in the regression setting. Specifically, we propose a new higher order biased nonparametric regression smoother that results from learning the Nadaraya–Watson (N-W) estimator by L2Boosting. Note that in polynomial regression bias becomes more serious the higher the curvature of the regression function. In this latter case the use of a higher polynomial fit – that is asymptotically less biased – is not always feasible since it is not sure that the regression function is sufficiently smooth, and larger samples are required. On this point it will emerge that one of the main features of the proposed method is that small bias properties are reached by requiring only first order differentiability. Relationships with previous work are outlined and utilized to explore the potential generalizations of our smoother. We also show that our estimator can be used in pdf and cdf estimation problems by fitting counting measures. Actually, fitting counting data has been used to transfer some useful properties of regression smoothers to density estimation. For example, Jones (1993) and Cheng et al. (1997) use local linear smoothers to fit respectively empirical density functions and bin counts to get less biased estimates at density boundaries. Similarly we try to utilize the higher order bias properties of our smoother in pdf and cdf estimation.

In Section 2 we briefly recall the N-W smoother which is used as the w.l. in our L2boostNW algorithm. In Section 3 we show the bias reduction properties of our b.a. and make clear links to previous work. Section 4 contains some possible generalizations of the method. In Section 5 it is shown how L2boostNW can be employed for pdf and cdf estimation and the results of some simulation experiments using regression curves, cdfs, and pdfs are summarized in Section 6. Overall, consistent gains almost always emerge with respect to both the N-W estimator and other boosting methods present in the literature. In agreement with our asymptotic theory, we have observed improving performance when sample sizes increase, and a certain robustness – not usual for kernel methods – to a wrong bandwidth selection was verified for our estimators. Finally, a few concluding remarks are

Kernel smoothing by boosting 3

contained in Section 7.

1.2. L2Boosting In what follows a description of L2Boosting suitable for our aims is given, more extensive treatments can be found in the references.

Given three real random variables, X , Y and ε assume the following regression model for their relationship

Y = m (X) + σ(X)ε, with Eε = 0, varε = 1, (1)

where X and ε are independent. Assuming that n i.i.d. observations S = {(Xi, Yi), i = 1, ..., n} drawn from (X, Y ) are available, the aim is to estimate the mean response curve m(x) = E(Y | X = x). This is the random design model, in the fixed design model as design observations we have a set of fixed, ordered points so the sample elements are s = (xi, Yi; i = 1, ..., n) that are often assumed equispaced.

L2Boosting stagewisely optimizes the squared loss function (m− m̂)2/2. Specifically, it is a procedure of iterative residual fitting where the final output is simply the sum of the fits. Formally, consider a w.l. m̂ (·; S, γ), that in the L2Boosting terminology is simply a crude smoother. An initial least squares fit is m̂0(·) = m̂ (·; S, γ0). For b = 1, 2, . . . , B, m̂b(·) is the sum of m̂b−1(·) and a least squares fit of the residuals Se = {Xi, ei = Yi−m̂b−1 (Xi)}, i.e. m̂ (·; Se, γb). The L2Boosting estimator is m̂B(·).

As mentioned, boosting is typically not Bayes consistent, i.e. if L∗ is the minimal loss obtained over all boosting iterations, then limB→∞ L(m, m̂B) > L

∗. Actually, the more B increases, the more m̂B becomes complex and tends to closely reproduce the sample (overfitting). Therefore, a stopping rule indicating when L∗ is achieved, i.e. the optimal number of iterations B∗, is needed.

2. L2Boosting and local polynomial fitting

2.1. Local Polynomial fitting from a L2Boosting perspective For an estimate of m(x) by L2Boosting we could use the polynomial

∑p j=0 βj (· − x)

j as the

w.l.. Recalling that L2Boosting optimizes squared losses, this would amount to optimizing this problem by fitting

∑p j=0 βj (· − x)

j to S

min β0,...,βp

n∑

i=1

Yi −

p∑

j=0

βj (Xi − x)j

2

K

( Xi − x

h

) (2)

The weight function K is a non-negative,