Censoring, Selection, and Nonparametric Methods

Theory Review

Censoring and the Tobit Model

Core Intuition

Suppose you want to model household expenditures on a luxury good, or remittances received from migrants abroad. Many households report exactly zero – not because their optimal demand is zero, but because some constraint prevents them from achieving it. The latent variable \(Y^*\) represents the unconstrained optimum; what we observe is \(Y = \max(Y^*, 0)\).

The key insight is that a pile-up of zeros at a boundary is different from a binary outcome. The zeros are not a separate decision – they are a constrained version of the same continuous decision. OLS ignores this constraint and treats the zeros as if they were true outcomes, producing bias.

The fraction of censored observations \(\pi = \Pr[Y = 0]\) governs how severe the problem is. Greene (1981) showed:

\[E[\hat{\beta}_{\text{OLS}}] \approx \beta(1-\pi)\]

So if 40% of observations are censored, OLS attenuates the slope coefficients toward zero by roughly 40%.
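This attenuation is easy to verify by simulation. A minimal sketch (assuming NumPy; the data-generating values below are illustrative choices, not from the text):

```python
# Monte Carlo check of Greene's approximation E[beta_OLS] ~ beta(1 - pi).
# beta, sigma, and the sample size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma = 100_000, 1.0, 1.0

x = rng.normal(0, 1, n)
y_star = beta * x + rng.normal(0, sigma, n)   # latent outcome Y*
y = np.maximum(y_star, 0)                     # observed Y = max(Y*, 0)

pi = np.mean(y == 0)                          # censored fraction

# OLS of the censored y on x (with intercept); keep the slope
X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"censored fraction pi = {pi:.2f}")
print(f"OLS slope = {b_ols:.3f} vs Greene approx = {beta * (1 - pi):.3f}")
```

With roughly half the sample censored here, the OLS slope lands near 0.5 rather than the true 1.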

The Three Conditional Expectations

With a latent variable \(Y^*\) and observed \(Y = \max(Y^*, 0)\), there are three distinct objects we might want to estimate:

\[m^*(X) = E[Y^* \mid X] = X'\beta\]

\[m(X) = E[Y \mid X] = X'\beta \, \Phi\!\left(\frac{X'\beta}{\sigma}\right) + \sigma \phi\!\left(\frac{X'\beta}{\sigma}\right) \tag{27.2}\]

\[m^\#(X) = E[Y^\# \mid X] = X'\beta + \sigma \lambda\!\left(\frac{X'\beta}{\sigma}\right) \tag{27.3}\]

where \(\lambda(c) = \phi(c)/\Phi(c)\) is the inverse Mills ratio and \(Y^\#\) is \(Y\) with zeros dropped. These satisfy:

\[m^*(X) \leq m(X) \leq m^\#(X)\]

The ordering is intuitive: the latent mean includes negative values, the observed mean replaces negatives with zeros (raising the average), and the truncated mean drops all zeros entirely (raising it further).
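The ordering can be confirmed numerically from (27.2) and (27.3). A quick check (assuming NumPy and SciPy; the grid of index values is illustrative):

```python
# Evaluate m*(x), m(x), m#(x) on a grid of X'beta values and confirm
# m* <= m <= m#. The grid and sigma are illustrative.
import numpy as np
from scipy.stats import norm

sigma = 1.0
xb = np.linspace(-2, 2, 9)                                # grid of X'beta values

m_star = xb                                               # latent mean E[Y* | X]
m_obs = xb * norm.cdf(xb / sigma) + sigma * norm.pdf(xb / sigma)   # (27.2)
lam = norm.pdf(xb / sigma) / norm.cdf(xb / sigma)         # inverse Mills ratio
m_trunc = xb + sigma * lam                                # (27.3)

print(np.column_stack([xb, m_star, m_obs, m_trunc]).round(3))
```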

OLS on the truncated sample \(Y^\#\) also fails, but in the opposite direction: for low \(X\), only observations with unusually high \(e\) end up positive, creating upward bias on the slope.

Tobit MLE

The distribution of \(Y\) given \(X\) mixes a discrete mass at zero with a continuous density for positive values:

\[ f(y \mid x) = \Phi\!\left(\frac{-x'\beta}{\sigma}\right)^{\mathbf{1}[y=0]} \left[\sigma^{-1} \phi\!\left(\frac{y - x'\beta}{\sigma}\right)\right]^{\mathbf{1}[y>0]} \]

The log-likelihood combines a probit piece (for the zeros) and a normal regression piece (for the positives):

\[ \ell_n(\beta, \sigma^2) = \sum_{Y_i=0} \log\Phi\!\left(\frac{-X_i'\beta}{\sigma}\right) - \frac{1}{2}\sum_{Y_i>0}\left[\log(2\pi\sigma^2) + \frac{(Y_i - X_i'\beta)^2}{\sigma^2}\right] \]

Global concavity (Olsen, 1978): Reparameterizing as \(\gamma = \beta/\sigma\) and \(\nu = 1/\sigma\) makes the likelihood globally concave in \((\gamma, \nu)\), so optimization converges to the unique global maximum.

Interpreting coefficients: The raw \(\beta_j\) is the effect on latent \(Y^*\), not on observed \(Y\). The marginal effect on observed \(Y\) is:

\[ \frac{\partial E[Y \mid X]}{\partial X_j} = \Phi\!\left(\frac{X'\beta}{\sigma}\right) \beta_j \]

The scaling factor \(\Phi(X'\beta/\sigma)\) is the probability of being uncensored at that value of \(X\).
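A minimal Tobit MLE sketch using Olsen's \((\gamma, \nu)\) reparameterization (assuming NumPy and SciPy; the data-generating values and optimizer settings are illustrative):

```python
# Tobit MLE under Olsen's reparameterization (gamma, nu) = (beta/sigma, 1/sigma),
# under which the log-likelihood is globally concave. DGP is illustrative.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, beta0, beta1, sigma = 5000, 0.5, 1.0, 1.0
x = rng.normal(0, 1, n)
y = np.maximum(beta0 + beta1 * x + rng.normal(0, sigma, n), 0)
X = np.column_stack([np.ones(n), x])

def neg_loglik(theta):
    gamma, nu = theta[:-1], theta[-1]          # gamma = beta/sigma, nu = 1/sigma
    idx = X @ gamma
    zero = y == 0
    ll_zero = norm.logcdf(-idx[zero]).sum()    # probit piece for the zeros
    # normal regression piece: density is nu * phi(nu*y - x'gamma) for y > 0
    ll_pos = (np.log(nu) + norm.logpdf(nu * y[~zero] - idx[~zero])).sum()
    return -(ll_zero + ll_pos)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0, 1.0]),
               method="L-BFGS-B",
               bounds=[(None, None), (None, None), (1e-4, None)])
sigma_hat = 1 / res.x[-1]
beta_hat = res.x[:-1] * sigma_hat              # undo the reparameterization
print("beta_hat =", beta_hat.round(3), " sigma_hat =", round(sigma_hat, 3))
```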

CLAD

The Tobit MLE assumes normality and homoskedasticity. If either fails, the estimator is inconsistent. The Censored Least Absolute Deviations (CLAD) estimator (Powell, 1984) is distribution-free.

CLAD models the median rather than the mean. Under \(\text{Med}[e \mid X] = 0\):

\[\text{Med}[Y \mid X] = \max(X'\beta, 0)\]

CLAD minimizes:

\[M_n(\beta) = \frac{1}{n}\sum_{i=1}^n \left|Y_i - \max(X_i'\beta, 0)\right|\]

No distributional assumption on \(e\) is needed. The cost: \(M_n(\beta)\) is not globally convex because of the \(\max(\cdot, 0)\) kink, so the optimizer may converge to a local minimum. Running from multiple starting values is advisable.
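A sketch of the multiple-starts strategy (assuming NumPy and SciPy; the heavy-tailed DGP is an illustrative choice to show where CLAD shines):

```python
# CLAD sketch: minimize the LAD criterion with the max(., 0) kink from several
# starting values and keep the best, since the objective is not globally convex.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, beta0, beta1 = 4000, 0.5, 1.0
x = rng.normal(0, 1, n)
e = rng.standard_t(df=3, size=n)               # heavy-tailed, median-zero errors
y = np.maximum(beta0 + beta1 * x + e, 0)
X = np.column_stack([np.ones(n), x])

def clad_objective(b):
    return np.mean(np.abs(y - np.maximum(X @ b, 0)))

starts = [np.zeros(2), np.array([1.0, 1.0]), np.array([-1.0, 2.0])]
fits = [minimize(clad_objective, s, method="Nelder-Mead") for s in starts]
best = min(fits, key=lambda r: r.fun)          # lowest objective across starts
print("CLAD estimate:", best.x.round(3))
```

Nelder-Mead is used here because the kinked objective is not differentiable everywhere; gradient-based optimizers can stall on it.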

When to use each:

  • Use Tobit when comfortable with normality and wanting to estimate the mean
  • Use CLAD when suspecting heavy tails, heteroskedasticity, or outliers
  • Always report both as a robustness check; large disagreements signal misspecification
  • Never use OLS on censored data; it is inconsistent, and the bias does not shrink as the sample grows

Check Your Understanding
  1. A researcher drops all households with zero remittances and runs OLS on the remaining sample. In what direction is the bias on the income coefficient? Why is this the opposite direction from OLS on the full sample?

The bias is upward. Among low-income households, only those with unusually large positive errors \(e\) end up with positive remittances and remain in the truncated sample. This selective inclusion of positive-error observations for low \(X\) steepens the estimated slope relative to the truth. OLS on the full sample runs in the opposite direction: the zeros pull the fitted line toward the horizontal, flattening the slope and producing downward attenuation bias. The two biases point in opposite directions because they make different errors; the full-sample OLS wrongly treats zeros as true outcomes, while the truncated OLS wrongly treats the truncated sample as random.

  2. Explain why \(\Phi(X'\beta/\sigma)\) appears as a scaling factor in \(\partial E[Y|X]/\partial X_j\). What does this equal when there is almost no censoring? When censoring is extreme?

\(\Phi(X'\beta/\sigma)\) is the probability that the latent variable \(Y^*\) is positive at that value of \(X\), i.e., the probability of being uncensored. A marginal increase in \(X_j\) shifts the latent demand by \(\beta_j\), but only uncensored observations translate this into a change in observed \(Y\). Censored units are already at zero and a small push does not move them. So the effect on the population average is \(\beta_j\) weighted by the fraction who are responsive, which is \(\Phi(X'\beta/\sigma)\).

When there is almost no censoring, \(\Phi(X'\beta/\sigma) \approx 1\), and the marginal effect on \(Y\) is approximately equal to \(\beta_j\) (censoring barely distorts the relationship). When censoring is extreme, \(\Phi(X'\beta/\sigma) \approx 0\), and even a large \(\beta_j\) produces almost no change in observed \(Y\) at that value of \(X\) because nearly everyone is stuck at zero.

  3. Tobit MLE is inconsistent under heteroskedasticity. Explain why CLAD remains consistent in that case.

Tobit MLE maximizes a likelihood derived under the assumption \(e \mid X \sim N(0, \sigma^2)\) with constant variance. If \(\sigma^2\) actually varies with \(X\), the likelihood is misspecified and the estimator converges to the wrong parameter value.

CLAD only requires \(\text{Med}[e \mid X] = 0\), which says the median of the error is zero regardless of how the variance behaves. Even under heteroskedasticity, if the median condition holds, the property \(\text{Med}[Y \mid X] = \max(X'\beta, 0)\) remains valid and CLAD consistently estimates \(\beta\). CLAD’s consistency does not depend on knowing or correctly specifying the shape of the error distribution.

  4. Olsen’s reparameterization makes the Tobit likelihood globally concave. CLAD’s objective is not globally convex. What practical consequence does each property have?

Global concavity of the Tobit likelihood means any numerical optimization algorithm that climbs the likelihood surface is guaranteed to converge to the unique global maximum. You can start from any initial values and reach the same answer. This makes Tobit computationally reliable – the estimator is well-defined and unique.

CLAD’s objective \(M_n(\beta) = \frac{1}{n}\sum |Y_i - \max(X_i'\beta, 0)|\) is not globally convex because the \(\max(\cdot, 0)\) kink creates flat regions and slope discontinuities. Optimization algorithms can converge to local minima that are not the global solution. In practice this means: (a) results may depend on starting values, (b) you should run CLAD from multiple starting points and compare, and (c) the estimator may be harder to compute reliably in small samples or with many parameters.

Exercise: Derive equations (27.2) and (27.3)

Starting from \(Y^* \sim N(X'\beta, \sigma^2)\) and \(Y = Y^* \mathbf{1}\{Y^* > 0\}\), derive:

  1. \(E[Y \mid X]\)
  2. \(E[Y^\# \mid X]\).

Hint: Use Theorem 5.8 from Hansen’s Probability and Statistics for Economists. Let \(c^* = -X'\beta/\sigma\).

We know \(Y^* \sim N(X'\beta, \sigma^2)\) and \(Y = Y^* \mathbf{1}\{Y^* > 0\}\). Let \(c^* = -X'\beta/\sigma\).

By Theorem 5.8.4 and the symmetry of the normal distribution,

\[ \begin{aligned} E[Y \mid X] &= E[Y^* \mathbf{1}\{Y^* > 0\} \mid X] \\ &= X'\beta\left(1 - \Phi\!\left(-\frac{X'\beta}{\sigma}\right)\right) + \sigma \phi\!\left(-\frac{X'\beta}{\sigma}\right) \\ &= X'\beta \, \Phi\!\left(\frac{X'\beta}{\sigma}\right) + \sigma \phi\!\left(\frac{X'\beta}{\sigma}\right) \end{aligned} \]

where the last step uses \(1 - \Phi(-c) = \Phi(c)\) and \(\phi(-c) = \phi(c)\) by symmetry of the normal.

By Theorem 5.8.6, the expectation of a truncated normal satisfies:

\[ \begin{aligned} E[Y^\# \mid X] &= E[Y^* \mid Y^* > 0, X] \\ &= X'\beta + \sigma \lambda\!\left(\frac{X'\beta}{\sigma}\right) \end{aligned} \]

where \(\lambda(c) = \phi(c)/\Phi(c)\) is the inverse Mills ratio. Conditioning on \(Y^* > 0\) raises the mean by \(\sigma\lambda(\cdot)\), which is the expected upward shift from truncating a normal distribution at zero.
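Both closed forms can be verified by Monte Carlo at a fixed index value (assuming NumPy and SciPy; the index, \(\sigma\), and simulation size are illustrative):

```python
# Monte Carlo check of (27.2) and (27.3) at a fixed index value X'beta = 0.5.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
xb, sigma, n = 0.5, 1.0, 2_000_000

y_star = xb + sigma * rng.normal(0, 1, n)      # latent Y* ~ N(X'beta, sigma^2)
y = np.maximum(y_star, 0)                      # censored Y

m_formula = xb * norm.cdf(xb / sigma) + sigma * norm.pdf(xb / sigma)        # (27.2)
m_trunc_formula = xb + sigma * norm.pdf(xb / sigma) / norm.cdf(xb / sigma)  # (27.3)

m_sim = y.mean()
m_trunc_sim = y_star[y_star > 0].mean()        # truncated sample: zeros dropped
print(f"E[Y|X]:  sim {m_sim:.4f}  formula {m_formula:.4f}")
print(f"E[Y#|X]: sim {m_trunc_sim:.4f}  formula {m_trunc_formula:.4f}")
```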


Sample Selection

Censoring vs Selection

These two problems are easy to confuse but are conceptually distinct:

  • Censoring: You observe every unit in the population but the outcome is only partially observed (clipped at a boundary).
  • Selection: You only observe a non-random subset of units. The outcome is fully observed for those in the sample, but certain units are missing entirely.

The canonical selection example: estimating the wage equation for all potential workers using data only on the employed. If employment correlates with unobserved ability, and ability also affects wages, OLS on the employed subsample is biased.

The Selection Model and Heckman’s Correction

Model the data-generating process in two stages:

\[ Y^* = X'\beta + e \quad \text{(outcome equation)} \] \[ S = \mathbf{1}\{Z'\gamma + u > 0\} \quad \text{(selection equation)} \]

You observe \(Y = Y^*\) only when \(S = 1\). The conditional expectation in the selected sample is:

\[ E[Y \mid X, S=1] = X'\beta + E[e \mid u > -Z'\gamma] \]

If \(e\) and \(u\) are correlated, the second term is nonzero and OLS on the selected sample is biased.

Heckman (1979) showed that if \((e, u) \sim N(0, \Sigma)\) with \(\text{Var}(u) = 1\), then:

\[ E[Y \mid X, Z, S=1] = X'\beta + \sigma_{21}\,\lambda(Z'\gamma) \tag{27.7} \]

where \(\sigma_{21} = \text{Cov}(e,u)\) and \(\lambda(\cdot)\) is the inverse Mills ratio. The two-step Heckit estimator:

  1. Estimate \(\gamma\) by probit on the full sample (including \(S = 0\) units)
  2. Construct \(\hat{\lambda}_i = \lambda(Z_i'\hat{\gamma})\) and run OLS of \(Y_i\) on \(X_i\) and \(\hat{\lambda}_i\) for \(S_i = 1\) only

The coefficient on \(\hat{\lambda}_i\) estimates \(\sigma_{21}\); testing whether it equals zero is a test for selection bias.

Identification: At least one variable in \(Z\) should be excluded from \(X\). Without an exclusion restriction, identification relies entirely on the nonlinearity of \(\lambda(\cdot)\), which is fragile.
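The two steps can be sketched end to end (assuming NumPy and SciPy; the DGP, the excluded instrument `z_excl`, and \(\sigma_{21} = 0.5\) are illustrative choices):

```python
# Heckit two-step sketch: probit for selection on the full sample, then OLS
# with the inverse Mills ratio on the selected sample.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(0, 1, n)                        # appears in both equations
z_excl = rng.normal(0, 1, n)                   # exclusion: shifts selection only
u = rng.normal(0, 1, n)                        # selection error, Var(u) = 1
e = 0.5 * u + rng.normal(0, np.sqrt(0.75), n)  # Cov(e, u) = sigma21 = 0.5

Z = np.column_stack([np.ones(n), x, z_excl])
s = (Z @ np.array([0.0, 0.5, 1.0]) + u > 0).astype(float)  # selection indicator
y = 1.0 + 1.0 * x + e                          # outcome, observed only if s = 1
X = np.column_stack([np.ones(n), x])

# Step 1: probit of s on Z over the FULL sample (S = 0 units included)
def probit_nll(g):
    idx = Z @ g
    return -(norm.logcdf(idx[s == 1]).sum() + norm.logcdf(-idx[s == 0]).sum())
gamma_hat = minimize(probit_nll, np.zeros(3), method="BFGS").x

# Step 2: OLS of y on X and lambda_hat, selected sample only
lam = norm.pdf(Z @ gamma_hat) / norm.cdf(Z @ gamma_hat)
W = np.column_stack([X, lam])[s == 1]
coef = np.linalg.lstsq(W, y[s == 1], rcond=None)[0]
print("beta_hat =", coef[:2].round(3), " sigma21_hat =", round(coef[2], 3))
```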

Check Your Understanding
  1. A researcher studies the effect of education on wages using a survey of the employed. Describe one scenario where selection bias is likely severe and one where it is likely negligible.

Severe: Estimating returns to education for women in a context where highly educated women have much higher employment rates. The employed women with low education are positively selected on unobservables (high ability or motivation overcomes the low-education barrier to employment), so their wages overstate what a typical low-education woman would earn if employed. The estimated education coefficient would be attenuated.

Negligible: Estimating returns to education for prime-age men in a strong economy where nearly all are employed. When \(\Pr[S=1]\) is close to 1 for all values of \(X\), conditioning on \(S=1\) barely changes the sample, and \(E[e \mid u > -Z'\gamma] \approx E[e] = 0\).

  2. In Heckman’s two-step estimator, the first step uses a probit on the full sample. Why do observations with \(S = 0\) matter for this step?

The first step estimates \(\gamma\) in the selection equation \(S = \mathbf{1}\{Z'\gamma + u > 0\}\). To estimate which values of \(Z\) make selection more or less likely, you need to observe both units that are selected (\(S=1\)) and units that are not (\(S=0\)). Without the \(S=0\) observations, you have no variation in the outcome of the selection process; it would be like trying to estimate a binary choice model when everyone chose the same option. The probit requires both zeros and ones to identify \(\gamma\).

  3. What does it mean if \(\hat{\sigma}_{21}\) in the Heckit is large and positive? Large and negative?

\(\sigma_{21} = \text{Cov}(e, u)\) where \(e\) is the error in the outcome equation and \(u\) is the error in the selection equation.

Large and positive: Unobservables that make someone more likely to be selected (\(u\) large) are also associated with higher outcomes (\(e\) large). In the wage example, unobserved ability raises both the probability of employment and the wage conditional on being employed. The selected sample has above-average unobserved ability, so OLS on the selected sample overstates wages for the full population.

Large and negative: Unobservables that make someone more likely to be selected are associated with lower outcomes. For example, workers who take jobs in a recession (high \(u\) because they are selected into employment) may be taking lower-quality jobs they would otherwise decline, so conditional wages are below average (\(e\) negative). OLS on the selected sample would understate wages for the full population.

  4. Why does the exclusion restriction matter for identification? What goes wrong if all variables in \(Z\) are also in \(X\)?

The Heckit includes both \(X'\beta\) and \(\sigma_{21}\lambda(Z'\gamma)\) in the second-step regression. If all variables in \(Z\) are also in \(X\), then \(\lambda(Z'\gamma) = \lambda(X'\gamma)\), and you are trying to separately identify a linear function \(X'\beta\) from a nonlinear transformation \(\lambda(X'\gamma)\) of the same variables. In principle these are distinct functions, so there is formal identification through nonlinearity. In practice, \(\lambda(\cdot)\) is nearly linear over most of its range, so \(X'\beta\) and \(\lambda(X'\gamma)\) are nearly collinear. The coefficients \(\beta\) and \(\sigma_{21}\) become very imprecisely estimated (standard errors explode). An exclusion restriction provides a variable that shifts selection without directly affecting the outcome, giving genuine variation to separately identify the two terms.


Nonparametric Density Estimation

Why Nonparametric?

Everything so far has imposed functional form assumptions: the CEF is linear in \(X\), errors are normal, and so on. Nonparametric methods ask: what can we learn without imposing any functional form?

In censoring models, we worried about whether errors are normal. But we should also ask: is the linear index \(X'\beta\) itself the right model for \(m(x) = E[Y \mid X=x]\)? Kernel methods let the data answer this question.

The key tradeoff: Parametric models converge at rate \(\sqrt{n}\) but are inconsistent when misspecified. Nonparametric estimators are consistent under weak assumptions but converge more slowly – at rate \(n^{2/5}\) for a single regressor at the optimal bandwidth. The slower rate is the genuine cost of not imposing a functional form.

From Histograms to Kernel Density

A histogram estimates \(f(x)\) by counting observations in bins of width \(w\):

\[ \hat{f}(x) = \frac{n_j}{nw} \]

for all \(x\) in bin \(j\), where \(n_j\) counts the observations falling in that bin. Problems: flat within bins, discontinuous at boundaries, arbitrary bin width.

The kernel density estimator replaces the hard bin with a smooth weighting function centered at each evaluation point:

\[ \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right) \tag{17.2} \]

where \(K(\cdot)\) is a kernel (symmetric, non-negative, integrates to 1) and \(h > 0\) is the bandwidth. Common kernels: Gaussian, Epanechnikov, biweight. The kernel choice has little practical effect; the bandwidth \(h\) is what matters.
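Equation (17.2) is a few lines of code. A hand-rolled sketch (assuming NumPy; the data, grid, and bandwidth are illustrative):

```python
# Hand-rolled kernel density estimator (17.2) with a Gaussian kernel.
import numpy as np

def kde(x_eval, data, h):
    """Gaussian-kernel density estimate at each point of x_eval."""
    u = (data[None, :] - x_eval[:, None]) / h          # scaled gaps (n_eval, n)
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)       # Gaussian kernel K(u)
    return k.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(5)
data = rng.normal(0, 1, 5000)
grid = np.linspace(-4, 4, 81)
f_hat = kde(grid, data, h=0.3)

# sanity check: the estimate should integrate to (approximately) one
integral = f_hat.sum() * (grid[1] - grid[0])
print(f"integral of f_hat over the grid: {integral:.3f}")
```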

The Bias-Variance Tradeoff and Optimal Bandwidth

For any fixed \(x\):

\[ E[\hat{f}(x)] \approx f(x) + \frac{1}{2}f''(x)h^2 \quad \text{(bias increases with } h \text{)} \] \[ \text{Var}[\hat{f}(x)] \approx \frac{f(x)R_K}{nh} \quad \text{(variance decreases with } h \text{)} \]

where \(R_K = \int K(u)^2\,du\). The asymptotic integrated mean squared error (AIMSE) combines both:

\[ \text{AIMSE} = \frac{1}{4}R(f'')h^4 + \frac{R_K}{nh} \tag{17.10} \]

where \(R(f'') = \int (f''(x))^2\,dx\) measures the curvature of \(f\). Minimizing over \(h\) gives \(h_0 \propto n^{-1/5}\).

Silverman’s rule of thumb: If \(f\) is approximately \(N(0, \sigma^2)\):

\[ h_r = \sigma C_K n^{-1/5} \tag{17.12} \]

where \(C_K \approx 1.059\) for the Gaussian kernel. Works well for unimodal symmetric distributions; oversmooths for bimodal data. For those cases, cross-validation (minimizing leave-one-out prediction error) is preferred.
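Computing (17.12) with the sample standard deviation in place of \(\sigma\) (assuming NumPy; the data are illustrative):

```python
# Silverman rule-of-thumb bandwidth (17.12) with the Gaussian-kernel constant.
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(0, 2.0, 1000)

# sigma estimated by the sample standard deviation
h_rot = 1.059 * data.std(ddof=1) * len(data) ** (-1 / 5)
print(f"Silverman bandwidth: {h_rot:.3f}")
```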

Check Your Understanding
  1. Explain why the AIMSE formula has one term increasing in \(h\) and one decreasing in \(h\). What does the optimal \(h_0\) balance?

The AIMSE is \(\frac{1}{4}R(f'')h^4 + \frac{R_K}{nh}\).

The first term, \(\frac{1}{4}R(f'')h^4\), is the integrated squared bias. Larger \(h\) means each evaluation point averages over a wider neighborhood, so curvature in \(f\) gets smoothed away. The estimate lags behind turns in the true density. Bias grows as \(h^4\).

The second term, \(\frac{R_K}{nh}\), is the integrated variance. Smaller \(h\) means each estimate uses fewer effective observations (the kernel weight falls off quickly), so the estimate is noisier. Variance shrinks as \(nh\) grows.

The optimal \(h_0\) is where the derivative of AIMSE with respect to \(h\) equals zero, balancing these two forces. At the optimum, the marginal benefit from reducing bias exactly equals the marginal cost of increased variance.

  2. You estimate a kernel density with bandwidth \(h = 500\) and it looks like a smooth single hill; with \(h = 50\) it shows two humps. Which do you trust more, and why?

The smaller bandwidth (\(h = 50\)) is more trustworthy if the two-hump shape persists across a range of reasonable bandwidths near 50. A large bandwidth like \(h = 500\) is almost certainly oversmoothing; it averages over such a wide neighborhood that genuine multimodality is hidden. The fact that reducing \(h\) reveals structure is evidence that the structure is real, not just noise.

However, you should not trust \(h = 50\) blindly either. If you reduce \(h\) further to \(h = 10\) and get ten noisy spikes, that suggests \(h = 50\) may itself be slightly undersmoothing. The right approach is to try a range of bandwidths and look for features (like two modes) that are stable across a sensible range. Cross-validation provides an objective criterion for choosing among them.


Nonparametric Regression

From Density Estimation to Regression

The conditional expectation \(m(x) = E[Y \mid X=x]\) can be estimated without imposing a functional form. The model is:

\[ Y = m(X) + e, \quad E[e \mid X] = 0, \quad E[e^2 \mid X] = \sigma^2(X) \]

where \(m(x)\) can be any smooth function and heteroskedasticity is allowed. The approach parallels kernel density: average \(Y\) values near \(x\), weighted by a kernel. The key distinction is whether you average with a constant weight (Nadaraya-Watson) or fit a local line (local linear).

Nadaraya-Watson and Local Linear Estimators

The Nadaraya-Watson (local constant) estimator:

\[ \hat{m}^{nw}(x) = \frac{\sum_{i=1}^n K\!\left(\frac{X_i-x}{h}\right) Y_i}{\sum_{i=1}^n K\!\left(\frac{X_i-x}{h}\right)} \tag{19.2} \]

Its asymptotic bias has two components:

\[ B^{nw}(x) = h^2\left(\frac{1}{2}m''(x) + \frac{f'(x)}{f(x)}m'(x)\right) \]

The first is smoothing bias (curvature averaged away). The second is density gradient bias: if observations are denser to the right of \(x\) and \(m\) is increasing, the average pulls upward. This bias is largest at the boundaries of the data.

The local linear estimator fits a line rather than a constant near \(x\):

\[ \{\hat{m}^{ll}(x), \hat{m}'^{ll}(x)\} = \arg\min_{\alpha,\beta} \sum_{i=1}^n K\!\left(\frac{X_i-x}{h}\right)(Y_i - \alpha - \beta(X_i-x))^2 \]

Its bias is only \(B^{ll}(x) = \frac{1}{2}m''(x)h^2\) – the density gradient term disappears because the local slope compensates for the asymmetric neighborhood. Both estimators share the same asymptotic variance:

\[ \text{Var}(\hat{m}(x)) \approx \frac{R_K \sigma^2(x)}{f(x) \cdot nh} \]

Local linear has less bias and equal variance; it strictly dominates Nadaraya-Watson. Use local linear as the default.
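The dominance is visible in a simple comparison at one interior point (assuming NumPy; all values are illustrative, with \(x\) drawn from a sloped density so the density gradient bias shows up):

```python
# Nadaraya-Watson vs local linear at one interior point. x has a sloped
# density f(x) = x/2 on [0,2], so f'(x) > 0 and the NW bias is visible.
import numpy as np

rng = np.random.default_rng(7)
n, h = 20_000, 0.3
x = rng.triangular(0, 2, 2, n)                 # density increasing in x
y = x**2 + rng.normal(0, 0.5, n)               # m(x) = x^2, so m'(x) > 0 too

def gauss_k(u):
    return np.exp(-0.5 * u**2)

def nw(x0):
    w = gauss_k((x - x0) / h)
    return (w * y).sum() / w.sum()             # locally constant fit

def local_linear(x0):
    w = np.sqrt(gauss_k((x - x0) / h))
    X = np.column_stack([np.ones(n), x - x0])
    # weighted least squares of y on (1, x - x0); the intercept is m_hat(x0)
    return np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0][0]

x0 = 1.0                                       # true m(x0) = 1
print(f"true {x0**2:.2f}  NW {nw(x0):.3f}  local linear {local_linear(x0):.3f}")
```

Here the NW estimate is pulled upward by the denser right side of each neighborhood, while local linear stays much closer to the truth.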

Bandwidth Selection in Regression

The optimal bandwidth is \(h_0 \propto n^{-1/5}\), but the constants depend on unknown \(m''(x)\) and \(\sigma^2(x)\). Two practical approaches:

Rule of Thumb (Fan and Gijbels, 1996): Fit a global 4th-order polynomial to approximate \(m''(x)\), then plug into the optimal bandwidth formula. Fast but relies on the polynomial approximation.

Cross-Validation: Choose \(h\) to minimize leave-one-out prediction error:

\[ CV(h) = \frac{1}{n}\sum_{i=1}^n \left(Y_i - \hat{m}_{-i}(X_i, h)\right)^2 \]

The leave-one-out approach is essential: in-sample residuals \(Y_i - \hat{m}(X_i)\) cannot be used because as \(h \to 0\) the estimator overfits to each point and residuals approach zero even when the model is wrong.
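A compact leave-one-out implementation for the Nadaraya-Watson case (assuming NumPy; the DGP and bandwidth grid are illustrative). Zeroing the diagonal of the weight matrix is what removes each observation from its own fit:

```python
# Leave-one-out cross-validation for the Nadaraya-Watson bandwidth.
import numpy as np

rng = np.random.default_rng(8)
n = 400
x = rng.uniform(0, 3, n)
y = np.sin(2 * x) + rng.normal(0, 0.3, n)

def cv_error(h):
    u = (x[None, :] - x[:, None]) / h
    w = np.exp(-0.5 * u**2)
    np.fill_diagonal(w, 0.0)                   # leave observation i out
    m_loo = (w * y[None, :]).sum(axis=1) / w.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

grid = np.array([0.05, 0.1, 0.2, 0.4, 0.8, 1.6])
scores = np.array([cv_error(h) for h in grid])
h_cv = grid[scores.argmin()]
print("CV errors:", scores.round(3), " chosen h =", h_cv)
```

The CV curve turns up sharply at large \(h\) (bias dominates) and more gently at small \(h\) (variance dominates), and the chosen bandwidth sits in between.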

Check Your Understanding
  1. Explain the density gradient bias term \((f'(x)/f(x))m'(x)\) in the Nadaraya-Watson bias formula. In what region of the data would you expect this to be largest?

The Nadaraya-Watson estimator computes a weighted average of \(Y\) values near \(x\), where the weights are \(K((X_i - x)/h)\). If observations are not symmetric around \(x\), because the density \(f(x)\) is sloping, the weighted average is pulled toward the denser side. When \(f'(x) > 0\) (more observations to the right) and \(m'(x) > 0\) (the function is increasing), the extra right-side observations pull the weighted average above the true \(m(x)\), creating upward bias.

This bias is largest where \(f'(x)/f(x)\) is large, which occurs at the boundaries of the data’s support. Near the left boundary there are almost no observations to the left and many to the right (\(f'(x) > 0\)), and near the right boundary the opposite holds. Interior regions with roughly symmetric neighborhoods have \(f'(x) \approx 0\) and little density gradient bias. (This is the core reason local linear is preferred: it eliminates this boundary problem entirely.)

  2. Local linear has the same asymptotic variance as Nadaraya-Watson but less bias. Does this mean you should always use a smaller bandwidth with local linear? Why or why not?

Not necessarily. The bias reduction from local linear comes from eliminating the density gradient term, not from changing the bandwidth. Both estimators face the same variance formula \(R_K\sigma^2(x)/(f(x) \cdot nh)\), and both have smoothing bias proportional to \(h^2\). The optimal bandwidth for local linear may actually be larger than for Nadaraya-Watson because with one source of bias eliminated, you can afford more smoothing (accepting slightly more smoothing bias) to gain variance reduction.

Choosing a smaller bandwidth because “local linear is better” would increase variance without a compensating reduction in the remaining bias term. The right approach is to select the bandwidth optimally for local linear rather than inheriting the bandwidth choice from a Nadaraya-Watson analysis.

  3. Cross-validation uses leave-one-out residuals. Why do in-sample residuals understate prediction error, especially for small \(h\)?

The in-sample residual is \(Y_i - \hat{m}(X_i)\), where \(\hat{m}(X_i)\) is computed using observation \(i\) itself. As \(h \to 0\), the kernel assigns almost all weight to the single nearest neighbor – which for observation \(i\) is often \(i\) itself. The estimator nearly interpolates: \(\hat{m}(X_i) \approx Y_i\), so the residual approaches zero even if the true \(m(x)\) has nothing to do with \(Y_i\). You are measuring how well the estimator memorizes the data, not how well it predicts new observations.

The leave-one-out residual \(Y_i - \hat{m}_{-i}(X_i)\) removes observation \(i\) before computing the estimate at \(X_i\). This forces the estimator to actually predict \(Y_i\) from nearby observations. When \(h\) is too small, \(\hat{m}_{-i}(X_i)\) becomes noisy (there are few other observations nearby) and the leave-one-out error is large, correctly signaling that the bandwidth is too small.

  4. In a partially linear model \(Y = m(X) + \beta_1 D + e\) where \(D\) is a binary treatment, why does \(\hat{\beta}_1\) achieve the parametric convergence rate even though \(m(X)\) is estimated nonparametrically?

Robinson’s estimator recovers \(\beta_1\) by projecting out the nonparametric component. After subtracting \(\hat{E}[Y \mid X]\) and \(\hat{E}[D \mid X]\), the estimator runs OLS of \(\tilde{Y} = Y - \hat{E}[Y \mid X]\) on \(\tilde{D} = D - \hat{E}[D \mid X]\). The regression is:

\[\tilde{Y}_i = \beta_1 \tilde{D}_i + \text{small nonparametric errors}\]

The key point is that the nonparametric estimation errors in \(\hat{E}[Y \mid X]\) and \(\hat{E}[D \mid X]\) are small enough, of order \(n^{-2/5}\), that when they are multiplied together and summed over \(n\) observations, the total contribution to the OLS estimator is \(o(n^{-1/2})\). In other words, the nonparametric errors average out fast enough that they do not affect the asymptotic distribution of \(\hat{\beta}_1\). The parametric \(\sqrt{n}\) rate is driven by the \(n\) independent observations of \(\tilde{D}_i\), not by the slower nonparametric estimation of \(m(X)\).
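The partialling-out logic can be sketched directly (assuming NumPy; the DGP, bandwidth, and the leave-one-out NW first stage are illustrative choices, not Robinson's original construction):

```python
# Sketch of Robinson-style partialling out for Y = m(X) + beta1*D + e:
# residualize Y and D against X nonparametrically, then OLS without intercept.
import numpy as np

rng = np.random.default_rng(9)
n, beta1, h = 2000, 2.0, 0.25
x = rng.uniform(0, 3, n)
p = 1 / (1 + np.exp(-(x - 1.5)))               # treatment probability varies with x
d = (rng.uniform(0, 1, n) < p).astype(float)
y = np.sin(2 * x) + beta1 * d + rng.normal(0, 0.5, n)

def nw_fit(target):
    """Leave-one-out Nadaraya-Watson regression of `target` on x."""
    u = (x[None, :] - x[:, None]) / h
    w = np.exp(-0.5 * u**2)
    np.fill_diagonal(w, 0.0)
    return (w * target[None, :]).sum(axis=1) / w.sum(axis=1)

y_tilde = y - nw_fit(y)                        # Y - E_hat[Y | X]
d_tilde = d - nw_fit(d)                        # D - E_hat[D | X]
beta1_hat = (d_tilde @ y_tilde) / (d_tilde @ d_tilde)   # no-intercept OLS
print(f"beta1_hat = {beta1_hat:.3f}")
```

Despite the nonparametric first stage, the residual-on-residual OLS recovers \(\beta_1\) with parametric precision, as the text argues.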