Final Exam Review Session

Overview

Cross-Sectional and Limited Dependent Variable Models
MLE (general principle)
  • When to use: you can specify a parametric distribution \(f(X \mid \theta)\) for the data.
  • Estimator: maximize \(\ell_n(\theta) = \sum_i \log f(X_i \mid \theta)\).
  • Key assumptions: correctly specified model; i.i.d. observations.
  • Caveat: if misspecified, converges to the pseudo-true value \(\theta_0\); use the sandwich variance \(H^{-1}\mathcal{I}H^{-1}\) instead of \(H^{-1}\).

Probit / Logit
  • When to use: binary outcome \(Y \in \{0,1\}\); want to model \(P(Y=1 \mid X)\) respecting the \([0,1]\) bounds.
  • Estimator: MLE with \(\Phi(X\beta)\) (probit) or \(\Lambda(X\beta)\) (logit).
  • Key assumptions: latent-variable errors are normal (probit) or logistic (logit); correct index specification.
  • Caveat: coefficients are not marginal effects; compute the AME \(= \hat{\beta} \cdot \frac{1}{n}\sum_i g(X_i\hat{\beta})\); inconsistent if the distribution is misspecified.

LPM
  • When to use: binary outcome; want a quick linear approximation to \(P(Y=1 \mid X)\).
  • Estimator: OLS on \(Y \in \{0,1\}\).
  • Key assumptions: linear probability model; homoscedasticity (often violated).
  • Caveat: predicted probabilities can fall outside \([0,1]\); errors are heteroscedastic by construction; use robust SEs.

Multinomial Logit
  • When to use: unordered discrete outcome with \(J \geq 3\) alternatives; one decision maker chooses among them.
  • Estimator: MLE; one \(\beta_j\) normalized to zero (base category).
  • Key assumptions: IIA (Independence of Irrelevant Alternatives): the odds ratio \(P_j/P_k\) depends only on \(j\) and \(k\), not on other alternatives.
  • Caveat: IIA fails when alternatives are close substitutes; use nested logit or mixed logit if IIA is implausible.

Ordered Probit / Logit
  • When to use: ordered discrete outcome (e.g., disagree / neutral / agree); the order is meaningful.
  • Estimator: MLE; single index \(X\beta\) with thresholds \(\alpha_1 < \cdots < \alpha_{J-1}\).
  • Key assumptions: proportional odds: a single coefficient vector governs the transitions between all categories.
  • Caveat: if the proportional odds assumption fails, a more flexible model is needed.

Tobit
  • When to use: continuous latent outcome \(Y^*\) but observed \(Y = \max(Y^*, 0)\) (or similar censoring); the censoring mechanism is known.
  • Estimator: MLE; joint likelihood over censored and uncensored observations.
  • Key assumptions: normally distributed errors; exogeneity of \(X\); known censoring threshold.
  • Caveat: OLS on \(Y\) or on the uncensored subsample is biased toward zero; the marginal effect on \(E[Y\mid X]\) is \(\beta_j \cdot \Phi(X\beta/\sigma)\), not \(\beta_j\).

Heckman Selection
  • When to use: outcome \(Y\) observed only for a selected subsample (\(S=1\)); selection may be endogenous.
  • Estimator: 2-step: probit for selection, then OLS with the inverse Mills ratio \(\hat{\lambda}\) as a control.
  • Key assumptions: joint normality of \((e_i, u_i)\); at least one exclusion restriction in \(Z\) not in \(X\).
  • Caveat: without an exclusion restriction, identification relies on the nonlinearity of \(\lambda(\cdot)\) and is very fragile in practice; test for selection bias via a \(t\)-test on \(\hat{\sigma}_{21}\).

GMM / IV
  • When to use: endogenous regressors; moment conditions \(E[Z_i e_i] = 0\) available; possibly overidentified (\(l > k\)).
  • Estimator: minimize \(J(\beta) = n\bar{g}_n(\beta)'W\bar{g}_n(\beta)\); efficient GMM sets \(W = \hat{\Omega}^{-1}\).
  • Key assumptions: valid instruments: relevance (\(\text{Cov}(Z,X) \neq 0\)) and exogeneity (\(E[Ze]=0\)).
  • Caveat: weak instruments inflate variance and bias; test overidentifying restrictions with Hansen's \(J \sim \chi^2_{l-k}\); the efficiency gain of GMM over 2SLS disappears under homoscedasticity.
Nonparametric, Time Series, Panel, and Machine Learning Methods
Kernel Density Estimation
  • When to use: estimate the density \(f(x)\) without imposing a parametric form.
  • Estimator: \(\hat{f}(x) = \frac{1}{nh}\sum_i K\!\left(\frac{X_i - x}{h}\right)\).
  • Key assumptions: i.i.d. data; \(f(x)\) is twice differentiable; bandwidth \(h \to 0\), \(nh \to \infty\).
  • Caveat: bias \(\propto h^2 f''(x)\); variance \(\propto (nh)^{-1}\); optimal \(h \propto n^{-1/5}\); the \(n^{-2/5}\) convergence rate is slower than parametric.

Kernel / Local Linear Regression
  • When to use: estimate \(m(x) = E[Y\mid X=x]\) nonparametrically; no functional form assumed.
  • Estimator: Nadaraya-Watson (weighted average) or local linear (weighted OLS at each \(x\)).
  • Key assumptions: smoothness of \(m(x)\); i.i.d. data; bandwidth chosen by cross-validation.
  • Caveat: NW has an extra bias term \([f'(x)/f(x)]m'(x)\) wherever \(m\) has slope, plus boundary bias; local linear corrects both; still converges at \(n^{-2/5}\) in one dimension.

AR(\(p\)) / ARMA
  • When to use: univariate time series; model dynamic dependence and forecast future values.
  • Estimator: OLS (AR) or MLE (ARMA); model order selected by AIC on a fixed estimation sample.
  • Key assumptions: weak stationarity; ergodicity; MDS errors; AR roots outside the unit circle.
  • Caveat: unit roots (\(\phi_1 = 1\)) break stationarity: random walk, no mean reversion; test with Dickey-Fuller; AIC selects the order, but always use the same sample across candidate models.

Panel Data FE-IV
  • When to use: panel data with fixed effects \(u_i\) and endogenous \(X_{it}\); both must be handled simultaneously.
  • Estimator: within transformation (\(\tilde{M}_D\)) to remove \(u_i\), then 2SLS on the demeaned variables.
  • Key assumptions: strict exogeneity of instruments after demeaning; instruments vary within individual over time.
  • Caveat: time-invariant instruments are differenced away; need within-individual variation in \(Z_{it}\).

Dynamic Panel (AB / BB)
  • When to use: panel data with a lagged dependent variable \(Y_{i,t-1}\); standard FE is inconsistent.
  • Estimator: Arellano-Bond: GMM on the first-differenced equation using lagged levels as instruments; Blundell-Bond adds level moment conditions.
  • Key assumptions: no serial correlation in \(\varepsilon_{it}\); stationarity of initial conditions (BB only).
  • Caveat: Nickell bias in FE: \(\text{plim}(\hat{\rho}_{FE} - \rho) \approx -(1+\rho)/(T-1)\); AB has weak instruments when \(\rho \approx 1\); use BB instead.

Ridge Regression
  • When to use: \(p\) large or \(p > n\); goal is prediction; multicollinearity present.
  • Estimator: \(\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1}X'Y\); \(\lambda\) chosen by leave-one-out CV.
  • Key assumptions: linear model; no sparsity required.
  • Caveat: shrinks all coefficients toward zero but never exactly to zero, so it cannot do variable selection; choose Lasso instead if sparsity is desired.

Lasso
  • When to use: \(p\) large or \(p > n\); goal is prediction and variable selection; true model believed sparse.
  • Estimator: minimize \(\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1\); \(\lambda\) by \(K\)-fold CV.
  • Key assumptions: sparsity (few truly nonzero coefficients); incoherence / restricted eigenvalue condition.
  • Caveat: shrinkage biases retained coefficients toward zero; use post-Lasso (refit OLS on the selected variables) to remove the bias; for causal inference use double selection.

Double-Selection Lasso
  • When to use: causal effect of a treatment \(D\) on \(Y\) with high-dimensional controls \(X\).
  • Estimator: Lasso of \(Y\) on \(X\), Lasso of \(D\) on \(X\); OLS of \(Y\) on \(D\) plus the union of selected controls.
  • Key assumptions: approximate sparsity in both the outcome and treatment equations.
  • Caveat: standard Lasso may drop relevant confounders if they have weak effects on \(Y\) alone; double selection protects against this omitted variable bias.

Topic Review

MLE finds the parameter values \(\theta\) that make the observed data most likely. For a sample of \(n\) i.i.d. observations, the likelihood function is the joint density evaluated at the data: \[L_n(\theta) = \prod_{i=1}^n f(X_i \mid \theta)\]

Because products are inconvenient, we maximize the log-likelihood instead (equivalent, since \(\log\) is monotone increasing): \[\ell_n(\theta) = \sum_{i=1}^n \log f(X_i \mid \theta)\]

The MLE is \(\hat{\theta} = \arg\max_{\theta \in \Theta} \ell_n(\theta)\), found by setting the score \(S_n(\theta) = \partial \ell_n(\theta)/\partial\theta = 0\) and verifying the second-order condition.
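As a concrete illustration, here is a minimal numerical MLE for a hypothetical exponential model \(f(x \mid \theta) = \theta e^{-\theta x}\); the closed-form answer \(\hat{\theta} = 1/\bar{X}\) serves as a check on the optimizer (all data and values are simulated for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical example: MLE of the rate theta of an exponential density
# f(x | theta) = theta * exp(-theta * x).  The closed-form solution is
# theta_hat = 1 / mean(X), which we use to check the numerical answer.
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=1000)  # true theta = 1/scale = 0.5

def neg_loglik(theta):
    # -l_n(theta) = -(n log theta - theta * sum(X))
    return -(len(X) * np.log(theta) - theta * X.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
theta_hat = res.x
print(theta_hat, 1.0 / X.mean())  # the two should agree closely
```

Minimizing the negative log-likelihood is the standard trick, since numerical optimizers minimize by convention.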

Key objects:

  • Score: \(S_n(\theta) = \sum_i \partial \log f(X_i \mid \theta)/\partial\theta\). At the true \(\theta_0\), \(E[S] = 0\).
  • Hessian: \(H_n(\theta) = -\partial^2 \ell_n(\theta)/\partial\theta\partial\theta'\) (negative second derivative; measures curvature).
  • Fisher information: \(\mathcal{I}_\theta = E[SS']\), the variance of the score.
  • Information matrix equality (correctly specified models only): \(\mathcal{I}_\theta = H_\theta\). This equality fails under misspecification.

Asymptotic properties (correctly specified): \[\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0,\ \mathcal{I}_\theta^{-1})\] MLE is asymptotically Cramér-Rao efficient: no consistent, asymptotically unbiased estimator has smaller asymptotic variance.

Three variance estimators (all asymptotically equivalent):

  • Expected Hessian: \(\hat{V}_0 = \hat{\kappa}_\theta^{-1}\), where \(\hat{\kappa}_\theta = H_\theta(\hat{\theta})\).
  • Sample Hessian: \(\hat{V}_1 = \left[-\frac{1}{n}\frac{\partial^2 \ell_n(\hat{\theta})}{\partial\theta\partial\theta'}\right]^{-1}\) (most common in practice).
  • Outer product: \(\hat{V}_2 = \hat{\mathcal{I}}_\theta^{-1}\), where \(\hat{\mathcal{I}}_\theta = \frac{1}{n}\sum_i \hat{S}_i \hat{S}_i'\).

Misspecification: If the model is misspecified, MLE converges to the pseudo-true value \(\theta_0 = \arg\min_\theta \text{KLIC}(f, f_\theta)\), where KLIC is the Kullback-Leibler divergence. The information matrix equality fails, so the robust sandwich variance \(H_\theta^{-1} \mathcal{I}_\theta H_\theta^{-1}\) must be used instead of \(\mathcal{I}_\theta^{-1}\).

Likelihood Ratio (LR) test: To test \(q\) constraints on \(\theta\), compare the unconstrained log-likelihood \(\ell_n(\hat{\theta})\) to the constrained \(\ell_n(\tilde{\theta})\): \[LR = 2[\ell_n(\hat{\theta}) - \ell_n(\tilde{\theta})] \xrightarrow{d} \chi^2_q\] Note \(LR \geq 0\) always, since the unconstrained maximum cannot be smaller than the constrained one.
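A quick simulated illustration of the LR test, again using the exponential model (illustrative numbers; when \(H_0\) is true the statistic behaves like a \(\chi^2_1\) draw):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical LR test in the exponential model, H0: theta = 0.5 (q = 1).
rng = np.random.default_rng(9)
X = rng.exponential(scale=2.0, size=400)  # true theta = 0.5, so H0 holds

def loglik(theta):
    return len(X) * np.log(theta) - theta * X.sum()

theta_hat = 1.0 / X.mean()              # unconstrained MLE (closed form)
LR = 2 * (loglik(theta_hat) - loglik(0.5))
pval = chi2.sf(LR, df=1)
print(LR, pval)  # LR >= 0 by construction
```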

We are interested in \(P(x) = \text{Prob}[Y = 1 \mid X = x]\). Because \(Y \in \{0,1\}\), this fully describes the distribution of \(Y\) given \(X\).

Three main models:

  • LPM: \(P(x) = x\beta\). Easy to estimate with OLS, but predicted probabilities can fall outside \([0,1]\).
  • Probit: \(P(x) = \Phi(x\beta)\), where \(\Phi\) is the standard normal CDF.
  • Logit: \(P(x) = \Lambda(x\beta)\), where \(\Lambda(u) = (1 + e^{-u})^{-1}\).

Both probit and logit are index models: \(P(x) = G(x\beta)\) for some CDF \(G\).

Latent variable interpretation: \[Y^* = X\beta + e, \quad e \sim G(\cdot), \quad Y = \mathbf{1}[Y^* > 0]\] The scale of \(\beta\) is not identified, only \(\beta^* = \beta/\sigma\) is, where \(\sigma = \text{SD}(e)\). By convention, \(\sigma = 1\) for probit and \(\sigma = \pi/\sqrt{3}\) for logit.

Estimation: Both are estimated by MLE. The log-likelihood is globally concave, so the maximum is unique and numerical optimization converges quickly.

Marginal effects: Coefficients are not marginal effects. The average marginal effect (AME) is: \[\widehat{\text{AME}} = \hat{\beta} \cdot \frac{1}{n}\sum_{i=1}^n g(X_i\hat{\beta})\] where \(g(\cdot)\) is the density of \(G(\cdot)\).
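A sketch of probit estimation and the AME on simulated data (all names and values are illustrative; the probit likelihood is maximized numerically):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical simulated example: fit a probit by MLE, then compute the AME.
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
beta_true = np.array([0.2, 0.8])
Y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

def neg_loglik(b):
    p = norm.cdf(X @ b)
    p = np.clip(p, 1e-10, 1 - 1e-10)  # guard the logs
    return -(Y * np.log(p) + (1 - Y) * np.log(1 - p)).sum()

beta_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x

# AME for the slope: beta_j times the sample average of the density g(X beta)
ame = beta_hat[1] * norm.pdf(X @ beta_hat).mean()
print(beta_hat, ame)  # ame is smaller in magnitude than the raw coefficient
```

Note that the AME is always attenuated relative to the coefficient, because the normal density is at most \(1/\sqrt{2\pi} \approx 0.4\).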

Multinomial extensions:

  • Multinomial logit: \(P_j(x) = e^{x\beta_j}/\sum_l e^{x\beta_l}\). One \(\beta_j\) is normalized to zero (base category). Suffers from the IIA (Independence of Irrelevant Alternatives) property.
  • Nested logit: Relaxes IIA by grouping alternatives. Within-group errors are correlated via a dissimilarity parameter \(\lambda_j\).
  • Ordered probit/logit: For ordered outcomes. Uses thresholds \(\alpha_1 < \alpha_2 < \cdots < \alpha_{J-1}\) applied to a single latent variable \(U^* = X\beta + e\).

GMM generalizes IV estimation to overidentified models (\(l > k\), more instruments than regressors). The moment condition is: \[E[g_i(\beta)] = E[Z_i e_i] = E[Z_i(Y_i - X_i\beta)] = 0\]

When \(l > k\), there is no \(\beta\) that sets the sample analog exactly to zero. Instead, GMM minimizes the weighted criterion: \[J(\beta) = n \cdot \bar{g}_n(\beta)' W \bar{g}_n(\beta)\]

The GMM estimator (IV case): \[\hat{\beta}_{GMM} = (X'ZWZ'X)^{-1}X'ZWZ'Y\]

Efficient GMM: Set \(W = \hat{\Omega}^{-1}\), where \(\hat{\Omega} = \frac{1}{n}\sum_i Z_i Z_i' \hat{e}_i^2\). This minimizes the asymptotic variance. Under conditional homoscedasticity, 2SLS is efficient and equals GMM.

Two-step procedure: First estimate with 2SLS (consistent but not efficient). Use residuals to estimate \(\Omega\). Re-estimate using \(\hat{\Omega}^{-1}\) as the weight matrix.
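The two-step procedure can be sketched on simulated IV data (an illustrative setup with \(l = 2\) instruments and \(k = 1\) endogenous regressor):

```python
import numpy as np

# Sketch of two-step efficient GMM in the linear IV model, simulated data.
rng = np.random.default_rng(2)
n = 2000
Z = rng.normal(size=(n, 2))                              # instruments
u = rng.normal(size=n)
x = Z @ np.array([1.0, 0.5]) + u + rng.normal(size=n)    # endogenous regressor
X = x.reshape(-1, 1)
Y = X @ np.array([1.5]) + u                              # true beta = 1.5, e = u

def gmm(W):
    A = X.T @ Z @ W @ Z.T @ X
    b = X.T @ Z @ W @ Z.T @ Y
    return np.linalg.solve(A, b)

# Step 1: 2SLS, i.e. W = (Z'Z)^{-1} -- consistent but not efficient
beta_1 = gmm(np.linalg.inv(Z.T @ Z))
# Step 2: re-weight with Omega_hat^{-1} built from first-step residuals
e = Y - X @ beta_1
Omega = (Z * e[:, None] ** 2).T @ Z / n
beta_2 = gmm(np.linalg.inv(Omega))

# Hansen J-statistic: chi^2 with l - k = 1 df under valid instruments
gbar = Z.T @ (Y - X @ beta_2) / n
J = n * gbar @ np.linalg.inv(Omega) @ gbar
print(beta_2, J)
```

OLS on these data would be biased upward because \(x\) loads on the structural error \(u\); the GMM estimate stays near the true 1.5.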

Key tests:

  • Hansen’s J-test (OverID): \(J(\hat{\beta}_{GMM}) \xrightarrow{d} \chi^2_{l-k}\) under \(H_0: E[Ze] = 0\). Always report when \(l > k\).
  • Endogeneity test (C-test): Compare \(J\) from restricted moment conditions (fewer instruments) to \(J\) from the full set. \(C = J_r - J_u \xrightarrow{d} \chi^2_{k_2}\).
  • GMM Distance test: Analog of the likelihood ratio test. \(D = J_c - J_u \xrightarrow{d} \chi^2_q\) under \(H_0\).

Censoring (Tobit): The latent outcome \(Y^*\) exists for everyone but is only observed above (or below) a threshold. \[Y^* = X\beta + e, \quad e \mid X \sim N(0, \sigma^2), \quad Y = \max(Y^*, 0)\]

Three distinct expectations:

  • \(E[Y^* \mid X] = X\beta\)
  • \(E[Y \mid X] = X\beta \cdot \Phi(X\beta/\sigma) + \sigma\phi(X\beta/\sigma)\)
  • \(E[Y \mid X, Y>0] = X\beta + \sigma\lambda(X\beta/\sigma)\)

where \(\lambda(c) = \phi(c)/\Phi(c)\) is the inverse Mills ratio. Note \(E[Y^*\mid X] \leq E[Y\mid X] \leq E[Y\mid X, Y>0]\).

OLS on either \(Y\) or the uncensored subsample is biased toward zero. Estimate by MLE.

Key marginal effect: \[\frac{\partial E[Y \mid X]}{\partial X_j} = \beta_j \cdot \Phi(X\beta/\sigma)\]
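These formulas can be verified by simulation at a single value of the index (illustrative numbers \(X\beta = 0.5\), \(\sigma = 1\)):

```python
import numpy as np
from scipy.stats import norm

# Check the Tobit expectation formulas by Monte Carlo at one x value.
xb, sigma = 0.5, 1.0
rng = np.random.default_rng(3)
ystar = xb + sigma * rng.normal(size=1_000_000)  # latent outcome Y*
y = np.maximum(ystar, 0.0)                       # censored observation Y

c = xb / sigma
Ey_formula = xb * norm.cdf(c) + sigma * norm.pdf(c)        # E[Y | X]
Ey_trunc_formula = xb + sigma * norm.pdf(c) / norm.cdf(c)  # E[Y | X, Y > 0]
print(y.mean(), Ey_formula)            # censored mean vs formula
print(y[y > 0].mean(), Ey_trunc_formula)  # truncated mean vs formula
```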

Selection (Heckman/Heckit): Selection into the observed sample is endogenous. \[Y_i^* = X_i\beta + e_i, \quad S_i^* = Z_i\gamma + u_i, \quad S_i = \mathbf{1}[S_i^* > 0], \quad Y_i = Y_i^* \text{ if } S_i = 1\]

The key equation is: \[E[Y_i \mid X_i, S_i = 1] = X_i\beta + \sigma_{21}\lambda(Z_i\gamma)\]

Selection bias arises when \(\sigma_{21} \neq 0\) (unobservables in the selection equation are correlated with unobservables in the outcome equation). Heckman’s 2-step corrects this.

Kernel density estimation: \[\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right)\]

Bandwidth \(h\) controls the bias-variance tradeoff:

  • Larger \(h\): smoother, more bias, less variance
  • Smaller \(h\): rougher, less bias, more variance

Asymptotic bias \(\approx \frac{1}{2}f''(x)h^2\). Bias is positive where \(f''(x) > 0\) (density is convex) and negative where \(f''(x) < 0\) (density is concave).

Optimal bandwidth: \(h_0 \propto n^{-1/5}\). Density estimator converges at rate \(n^{-2/5}\), slower than parametric \(n^{-1/2}\).
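A minimal KDE sketch, assuming i.i.d. data and using Silverman's rule-of-thumb bandwidth purely for illustration:

```python
import numpy as np

# Minimal Gaussian-kernel density estimator on simulated N(0,1) data.
rng = np.random.default_rng(4)
X = rng.normal(size=500)
n = X.size
h = 1.06 * X.std() * n ** (-1 / 5)  # Silverman's rule-of-thumb bandwidth

def f_hat(x):
    # (1/nh) sum_i K((X_i - x)/h) with the standard normal kernel
    u = (X[:, None] - np.atleast_1d(x)[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=0) / h

grid = np.linspace(-3, 3, 121)
dens = f_hat(grid)
print(dens.max())  # should be near the N(0,1) peak, 1/sqrt(2*pi) ~ 0.399
```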

Kernel regression (Nadaraya-Watson): \[\hat{m}_{NW}(x) = \frac{\sum_i K\!\left(\frac{X_i-x}{h}\right)Y_i}{\sum_i K\!\left(\frac{X_i-x}{h}\right)}\]

Local linear (LL) estimator is preferred: it corrects for boundary bias and the additional bias term \(f'(x)m'(x)/f(x)\) that affects NW. Both have the same asymptotic variance.

As \(h \to \infty\), kernel regression converges to the OLS fit. As \(h \to 0\), it interpolates the data.
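A minimal Nadaraya-Watson sketch on simulated data (the bandwidth is fixed at an illustrative value rather than cross-validated):

```python
import numpy as np

# Nadaraya-Watson regression with a Gaussian kernel on a sine curve.
rng = np.random.default_rng(10)
X = rng.uniform(-2, 2, size=400)
Y = np.sin(X) + 0.3 * rng.normal(size=400)

def m_hat(x, h=0.3):
    w = np.exp(-0.5 * ((X - x) / h) ** 2)  # kernel weights
    return (w * Y).sum() / w.sum()          # local weighted average of Y

print(m_hat(0.0), np.sin(0.0))  # estimate near 0 at x = 0
print(m_hat(1.0), np.sin(1.0))  # estimate near sin(1)
```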

Partially linear regression: \(Y = m(X) + Z\delta + e\). Robinson (1988) shows you can estimate \(\delta\) by differencing out \(E[Y\mid X]\) and \(E[Z\mid X]\) nonparametrically, then running OLS on the residuals.

Stationarity: \(\{Y_t\}\) is weakly stationary if \(E[Y_t] = \mu\), \(\text{Var}(Y_t) = \sigma^2\), and \(\text{Cov}(Y_t, Y_{t-k})\) depend only on \(k\), not \(t\).

White noise vs. MDS: An MDS satisfies \(E[e_t \mid \mathcal{F}_{t-1}] = 0\). Every MDS is white noise (serially uncorrelated), but not vice versa.

AR(1): \(Y_t = \phi_0 + \phi_1 Y_{t-1} + e_t\)

  • Stationary iff \(|\phi_1| < 1\)
  • \(E[Y_t] = \phi_0/(1-\phi_1)\), \(\text{Var}(Y_t) = \sigma^2/(1-\phi_1^2)\), \(\text{Corr}(Y_t, Y_{t-k}) = \phi_1^k\)
  • Impulse response: \(b_j = \phi_1^j\)
  • Unit root (\(\phi_1 = 1\)): random walk, not stationary, not mean-reverting

MA(q): \(Y_t = \mu + \sum_{j=0}^q \theta_j e_{t-j}\). Autocorrelations are zero beyond lag \(q\).

Wold decomposition: Any weakly stationary non-deterministic process can be written as \(Y_t = \mu + b(L)e_t\) for some square-summable coefficients \(b_j\).

Impulse response function (IRF): \(b_j = \partial Y_{t+j}/\partial e_t\). For AR(\(p\)), computed recursively: \[b_j = \phi_1 b_{j-1} + \phi_2 b_{j-2} + \cdots + \phi_p b_{j-p}, \quad b_0 = 1\]
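The recursion is easy to code; the checks below compare against the AR(1) closed form \(b_j = \phi_1^j\) and the AR(2) expressions \(b_2 = \phi_1^2 + \phi_2\), \(b_3 = \phi_1^3 + 2\phi_1\phi_2\) (coefficient values are illustrative):

```python
import numpy as np

def irf(phi, J):
    """Impulse responses b_0..b_J for an AR(p) with coefficients phi, via
    b_j = phi_1 b_{j-1} + ... + phi_p b_{j-p}, with b_0 = 1."""
    p = len(phi)
    b = np.zeros(J + 1)
    b[0] = 1.0
    for j in range(1, J + 1):
        b[j] = sum(phi[l] * b[j - 1 - l] for l in range(min(p, j)))
    return b

# AR(1) check: b_j = phi_1^j
print(irf([0.9], 3))       # [1, 0.9, 0.81, 0.729]
# AR(2) check: b_2 = phi1^2 + phi2, b_3 = phi1^3 + 2 phi1 phi2
print(irf([0.5, 0.3], 3))  # [1, 0.5, 0.55, 0.425]
```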

Model selection: Use AIC \(= n\log(\hat{\sigma}^2(p)) + 2p\). Choose the model with the lowest AIC, always using the same sample size across models.

Fixed effects IV (within 2SLS): When some \(X_{it}\) are correlated with \(\varepsilon_{it}\) in: \[Y_{it} = X_{it}\beta + u_i + \varepsilon_{it}\] demean all variables (within transformation \(\tilde{M}_D\)) to remove \(u_i\), then apply 2SLS using demeaned instruments \(\tilde{Z}\).

Dynamic panel data: \[Y_{it} = \rho Y_{i,t-1} + X_{it}\beta + u_i + \varepsilon_{it}\]

Standard fixed effects is inconsistent even as \(N \to \infty\) when \(T\) is small (Nickell bias). The intuition: after within-demeaning, \(\tilde{Y}_{i,t-1}\) is correlated with \(\tilde{\varepsilon}_{it}\) because the individual mean \(\bar{Y}_i\) depends on all \(Y_{it}\), including \(Y_{i,t-1}\).

For \(T = 3\): \(\text{plim}(\hat{\rho}_{FE} - \rho) \approx -(1+\rho)/2\).

Anderson-Hsiao: First-difference to remove \(u_i\), then use \(Y_{i,t-2}\) as an IV for \(\Delta Y_{i,t-1}\). Consistent but not efficient.

Arellano-Bond: GMM using all valid lagged levels as instruments for the first-differenced equation. More efficient, but susceptible to weak instruments when \(\rho \approx 1\).

Blundell-Bond: Adds moment conditions in levels (using lagged differences as instruments). Better when \(\rho\) is close to 1 or \(\sigma_u^2\) is large relative to \(\sigma_\varepsilon^2\).

When \(p\) (number of regressors) is very large, OLS breaks down (\(X'X\) may be near-singular or \(p > n\)). Machine learning methods address this via shrinkage and selection.

Ridge regression: \[\hat{\beta}_{ridge} = \underset{\beta}{\arg\min} \; \|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 = (X'X + \lambda I_p)^{-1}X'Y\] Shrinks all coefficients toward zero. Never sets any coefficient exactly to zero. Well-defined even when \(p > n\).

Lasso: \[\hat{\beta}_{Lasso} = \underset{\beta}{\arg\min} \; \|Y - X\beta\|_2^2 + \lambda\|\beta\|_1\] Shrinks coefficients and sets some to exactly zero (variable selection). The \(\ell_1\) penalty creates corners in the constraint set where solutions land on coordinate axes.

Selecting \(\lambda\): Cross-validation (CV). For Lasso, \(K\)-fold CV is standard. For Ridge, leave-one-out CV has a convenient closed form.

Post-Lasso: Use Lasso to select variables, then refit OLS on selected variables only. Removes shrinkage bias for retained variables.

Double-selection Lasso: For causal inference with a treatment \(D\) and controls \(X\). Run Lasso of \(Y\) on \(X\) and Lasso of \(D\) on \(X\). Take the union of selected variables and estimate the model by OLS. Avoids omitted variable bias from imperfect selection.
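A sketch of double selection on simulated data; to keep it self-contained, the Lasso is a tiny coordinate-descent routine rather than a library call (all names and values are illustrative):

```python
import numpy as np

def lasso(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||_2^2 + lam * ||b||_1 by coordinate descent.
    Columns of X are assumed roughly standardized."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]  # partial residual excluding j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0) / col_ss[j]  # soft-threshold
    return b

rng = np.random.default_rng(6)
n, p = 500, 50
X = rng.normal(size=(n, p))
# X[:, 0] is a confounder: it drives both the treatment D and the outcome Y
D = 1.0 * X[:, 0] + rng.normal(size=n)
Y = 2.0 * D + 1.0 * X[:, 0] + rng.normal(size=n)

lam = 0.1
keep_y = np.nonzero(lasso(X, Y, lam))[0]  # controls predicting Y
keep_d = np.nonzero(lasso(X, D, lam))[0]  # controls predicting D
keep = np.union1d(keep_y, keep_d)

W = np.column_stack([D, X[:, keep]])      # treatment + union of selected controls
theta = np.linalg.lstsq(W, Y, rcond=None)[0]
print(theta[0])  # estimate of the treatment effect, near the true value 2
```

Dropping the confounder would inflate the treatment coefficient; taking the union of both selection steps is what guards against that.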

Regression trees / Random forests: Partition the covariate space into regions and predict \(Y\) by the mean within each region. Random forests average over many trees grown on bootstrap samples, with random subsets of variables at each split, to reduce variance.

Practice Exercises

You are interested in estimating the returns to education on wages. However, wages are only observed for employed workers. Let:

\[Y_i^* = X_i\beta + e_i \qquad \text{(wage equation)}\] \[S_i^* = Z_i\gamma + u_i, \quad S_i = \mathbf{1}[S_i^* > 0] \qquad \text{(employment equation)}\]

where \(Y_i = Y_i^*\) is observed only when \(S_i = 1\), and the errors are jointly normal: \[\begin{pmatrix} e_i \\ u_i \end{pmatrix} \sim N\!\left(\mathbf{0},\ \begin{pmatrix} \sigma^2 & \sigma_{21} \\ \sigma_{21} & 1 \end{pmatrix}\right)\]

Note that \(u_i\) has variance normalized to 1, consistent with a probit for the selection equation.

(a) Show that: \[E[Y_i \mid X_i, S_i = 1] = X_i\beta + \sigma_{21}\lambda(Z_i\gamma)\] where \(\lambda(c) = \phi(c)/\Phi(c)\) is the inverse Mills ratio.

Hint: Use the linear projection \(e_i = \sigma_{21} u_i + \varepsilon_i\) where \(\varepsilon_i \perp u_i\). You will need the result that for \(u \sim N(0,1)\), \(E[u \mid u > -c] = \phi(c)/\Phi(c)\).

Start by writing the conditional expectation of wages for observed workers: \[E[Y_i \mid X_i, S_i = 1] = E[X_i\beta + e_i \mid X_i, Z_i\gamma + u_i > 0]\] \[= X_i\beta + E[e_i \mid u_i > -Z_i\gamma]\] where the last step uses the fact that the errors \((e_i, u_i)\) are jointly independent of \((X_i, Z_i)\), so the regressors affect \(e_i\) only through the selection event.

Now use the linear projection of \(e_i\) on \(u_i\): \[e_i = \sigma_{21} u_i + \varepsilon_i\] where \(\varepsilon_i \perp u_i\) by construction of the linear projection. Therefore: \[E[e_i \mid u_i > -Z_i\gamma] = \sigma_{21} E[u_i \mid u_i > -Z_i\gamma] + E[\varepsilon_i \mid u_i > -Z_i\gamma]\] \[= \sigma_{21} E[u_i \mid u_i > -Z_i\gamma] + 0\] since \(\varepsilon_i \perp u_i\) implies \(E[\varepsilon_i \mid u_i > -Z_i\gamma] = E[\varepsilon_i] = 0\).

Using the provided result for truncated normals with \(u_i \sim N(0,1)\): \[E[u_i \mid u_i > -Z_i\gamma] = \frac{\phi(Z_i\gamma)}{\Phi(Z_i\gamma)} = \lambda(Z_i\gamma)\]

Combining: \[E[Y_i \mid X_i, S_i = 1] = X_i\beta + \sigma_{21}\lambda(Z_i\gamma)\]

(b) Under what condition does OLS on the selected sample (\(S_i = 1\)) give an unbiased estimate of \(\beta\)? Provide both the mathematical condition and the economic intuition.

From part (a): \[E[Y_i \mid X_i, S_i = 1] = X_i\beta + \sigma_{21}\lambda(Z_i\gamma)\]

OLS on the selected sample regresses \(Y_i\) on \(X_i\) only, omitting \(\lambda(Z_i\gamma)\). If \(\sigma_{21}\lambda(Z_i\gamma)\) is correlated with \(X_i\), OLS is biased (omitted variable bias).

Mathematical condition: OLS is unbiased if \(\sigma_{21} = 0\).

Economic intuition: \(\sigma_{21}\) is the covariance between the unobservable wage component \(e_i\) (e.g., ability) and the unobservable employment component \(u_i\) (e.g., motivation). If these are uncorrelated, the workers who select into employment are not systematically different from the rest of the population in terms of unobservable productivity, so the selected sample is effectively random and OLS is unbiased. Selection bias arises precisely when unobservables that determine employment are also related to wages.

(c) Describe Heckman’s 2-step estimator. What does each step estimate, and why is the first step necessary before the second?

Step 1: Estimate the selection equation \(S_i = \mathbf{1}[Z_i\gamma + u_i > 0]\) by probit using the full sample (both employed and unemployed). This yields \(\hat{\gamma}\). For each observation, compute the estimated inverse Mills ratio: \[\hat{\lambda}_i = \lambda(Z_i\hat{\gamma}) = \frac{\phi(Z_i\hat{\gamma})}{\Phi(Z_i\hat{\gamma})}\]

Step 2: On the selected sample (\(S_i = 1\) only), regress \(Y_i\) on \(X_i\) and \(\hat{\lambda}_i\) by OLS. The coefficient on \(X_i\) is the consistent estimate of \(\beta\), and the coefficient on \(\hat{\lambda}_i\) estimates \(\sigma_{21}\).

Why is the first step necessary? From part (a), the correct model for wages in the selected sample is \(Y_i = X_i\beta + \sigma_{21}\lambda(Z_i\gamma) + \text{error}\). We do not know \(\gamma\), so we cannot construct \(\lambda(Z_i\gamma)\) directly. The first step estimates \(\gamma\) using the probit model, allowing us to construct the control function \(\hat{\lambda}_i\) that, when included in the second step, removes the selection bias.
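The two steps can be sketched on simulated data (illustrative parameters; the second-step standard errors shown by plain OLS would still need the generated-regressor correction discussed in part (e)):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch of Heckman's 2-step on simulated data.
rng = np.random.default_rng(7)
n = 5000
Z = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
X = Z[:, :2]                                 # exclusion restriction: Z[:, 2] not in X
u = rng.normal(size=n)
e = 0.5 * u + np.sqrt(0.75) * rng.normal(size=n)  # sigma21 = 0.5
S = (Z @ np.array([0.0, 0.5, 1.0]) + u > 0)       # selection indicator
Y = X @ np.array([1.0, 2.0]) + e                  # observed only when S is True

# Step 1: probit of S on Z (full sample), then the inverse Mills ratio
def probit_nll(g):
    p = np.clip(norm.cdf(Z @ g), 1e-10, 1 - 1e-10)
    return -(S * np.log(p) + (~S) * np.log(1 - p)).sum()

gamma_hat = minimize(probit_nll, np.zeros(3), method="BFGS").x
lam = norm.pdf(Z @ gamma_hat) / norm.cdf(Z @ gamma_hat)

# Step 2: OLS of Y on X and lambda_hat, selected sample only
W = np.column_stack([X[S], lam[S]])
coef = np.linalg.lstsq(W, Y[S], rcond=None)[0]
print(coef)  # [beta_0, beta_1, sigma21] estimates, near (1, 2, 0.5)
```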

(d) Why is it important in practice to have at least one variable in \(Z\) that is excluded from \(X\)? What goes wrong if all variables in the selection equation are also in the outcome equation?

If \(Z = X\) (all selection variables are also outcome variables), then \(\lambda(Z_i\gamma) = \lambda(X_i\gamma)\) is a nonlinear function of \(X_i\). In principle, this still identifies \(\beta\) through the nonlinearity of \(\lambda(\cdot)\). However, over a typical range of data \(\lambda(\cdot)\) is nearly linear: it can be closely approximated by a linear function of its argument.

This means \(X_i\beta\) and \(\sigma_{21}\lambda(X_i\gamma)\) become nearly collinear in the second-step regression, making it very difficult to separately identify \(\beta\) and \(\sigma_{21}\). Standard errors become enormous and estimates are highly sensitive to functional form assumptions.

Having an exclusion restriction, that is, a variable in \(Z\) not in \(X\) (e.g., non-labor income, which affects employment but not the wage rate directly), provides independent variation in \(\hat{\lambda}_i\) that is not already explained by \(X_i\). This makes the second-step regression well identified and more robust to the near-linearity of \(\lambda(\cdot)\).

(e) The coefficient on \(\hat{\lambda}_i\) from step 2 is an estimate of \(\sigma_{21}\). How do you use this to test for selection bias? State the null hypothesis clearly.

Null hypothesis: \(H_0: \sigma_{21} = 0\) (no selection bias; unobservables in the wage and employment equations are uncorrelated).

Test: From the second-step OLS regression of \(Y_i\) on \(X_i\) and \(\hat{\lambda}_i\), let \(\hat{\sigma}_{21}\) be the estimated coefficient on \(\hat{\lambda}_i\) and \(\text{se}(\hat{\sigma}_{21})\) its standard error. Compute the \(t\)-statistic: \[t = \frac{\hat{\sigma}_{21}}{\text{se}(\hat{\sigma}_{21})}\] Under \(H_0\), \(t \xrightarrow{d} N(0,1)\). Reject \(H_0\) at the 5% level if \(|t| > 1.96\).

Note on standard errors: The second-step OLS standard errors are not directly valid because \(\hat{\lambda}_i\) is a generated regressor (estimated from step 1). Correct standard errors should account for the estimation error in \(\hat{\gamma}\). In practice, use the standard errors provided by Stata’s heckman command, or use the bootstrap.

Consider the AR(2) process: \[Y_t = \phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t\] where \(e_t\) is a strictly stationary ergodic white noise process with \(E[e_t] = 0\) and \(\text{Var}(e_t) = \sigma^2\).

(a) State the three conditions on \(\phi_1\) and \(\phi_2\) for this process to be weakly stationary. Describe what the stationarity region looks like and give an intuitive interpretation of each boundary condition.

The AR(2) process is stationary if and only if all three of the following hold:

\[\phi_1 + \phi_2 < 1 \tag{i}\] \[\phi_2 - \phi_1 < 1 \tag{ii}\] \[\phi_2 > -1 \tag{iii}\]

Stationarity region: These three inequalities define a triangle in the \((\phi_1, \phi_2)\) plane. The vertices are at \((-2, -1)\), \((2, -1)\), and \((0, 1)\).

Intuition for each boundary:

  • Boundary (i): \(\phi_2 = 1 - \phi_1\) (upper-right edge). When this holds with equality, the AR polynomial has a unit root at \(z = 1\). The process has a positive real unit root and is not mean-reverting.
  • Boundary (ii): \(\phi_2 = 1 + \phi_1\) (upper-left edge). When this holds with equality, the AR polynomial has a unit root at \(z = -1\). The process exhibits persistent alternating oscillations.
  • Boundary (iii): \(\phi_2 = -1\) (bottom edge). This corresponds to a pair of complex unit roots on the unit circle. The process oscillates persistently without damping.

Inside the triangle, all roots of the AR polynomial lie outside the unit circle, and the process is stationary and mean-reverting.

(b) Assuming \(Y_t\) is stationary so that \(E[Y_t] = E[Y_{t-1}] = E[Y_{t-2}] = \mu\), derive \(E[Y_t]\) in terms of \(\phi_0\), \(\phi_1\), and \(\phi_2\).

Take unconditional expectations of both sides of the AR(2) equation: \[E[Y_t] = \phi_0 + \phi_1 E[Y_{t-1}] + \phi_2 E[Y_{t-2}] + E[e_t]\]

Using stationarity (\(E[Y_t] = E[Y_{t-1}] = E[Y_{t-2}] = \mu\)) and \(E[e_t] = 0\): \[\mu = \phi_0 + \phi_1\mu + \phi_2\mu\] \[\mu(1 - \phi_1 - \phi_2) = \phi_0\] \[\boxed{\mu = \frac{\phi_0}{1 - \phi_1 - \phi_2}}\]

Note that this requires \(\phi_1 + \phi_2 \neq 1\), which is guaranteed by stationarity condition (i) from part (a) (the denominator is strictly positive under stationarity).

(c) Derive \(\text{Cov}(Y_t, Y_{t-1})\) directly using the definition of covariance. The following are provided: \[\text{Var}(Y_t) = \frac{(1-\phi_2)\sigma^2}{(1+\phi_2)\left[(1-\phi_2)^2 - \phi_1^2\right]}\] \[\text{Cov}(Y_t, Y_{t-2}) = \phi_1 \text{Cov}(Y_t, Y_{t-1}) + \phi_2 \text{Var}(Y_t)\]

Hint: Apply the definition \(\text{Cov}(Y_t, Y_{t-1}) = \text{Cov}(\phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t,\ Y_{t-1})\) and use the linearity of covariance.

Apply the definition of covariance and linearity: \[\text{Cov}(Y_t, Y_{t-1}) = \text{Cov}(\phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t,\ Y_{t-1})\] \[= \text{Cov}(\phi_0, Y_{t-1}) + \phi_1\text{Cov}(Y_{t-1}, Y_{t-1}) + \phi_2\text{Cov}(Y_{t-2}, Y_{t-1}) + \text{Cov}(e_t, Y_{t-1})\]

Now evaluate each term:

  • \(\text{Cov}(\phi_0, Y_{t-1}) = 0\) (constant has zero covariance)
  • \(\text{Cov}(Y_{t-1}, Y_{t-1}) = \text{Var}(Y_{t-1}) = \text{Var}(Y_t)\) (by stationarity)
  • \(\text{Cov}(Y_{t-2}, Y_{t-1}) = \text{Cov}(Y_t, Y_{t-1})\) (by stationarity, same lag-1 autocovariance)
  • \(\text{Cov}(e_t, Y_{t-1}) = 0\) (white noise \(e_t\) is uncorrelated with past \(Y\))

Therefore: \[\text{Cov}(Y_t, Y_{t-1}) = \phi_1\text{Var}(Y_t) + \phi_2\text{Cov}(Y_t, Y_{t-1})\]

Collecting terms: \[(1 - \phi_2)\text{Cov}(Y_t, Y_{t-1}) = \phi_1\text{Var}(Y_t)\]

\[\boxed{\text{Cov}(Y_t, Y_{t-1}) = \frac{\phi_1\,\text{Var}(Y_t)}{1-\phi_2} = \frac{\phi_1(1-\phi_2)\sigma^2}{(1-\phi_2)(1+\phi_2)\left[(1-\phi_2)^2-\phi_1^2\right]} = \frac{\phi_1\sigma^2}{(1+\phi_2)\left[(1-\phi_2)^2-\phi_1^2\right]}}\]

Note: The provided formula for \(\text{Cov}(Y_t, Y_{t-2})\) is not needed for this part; it would be used to derive higher-order autocovariances.
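The boxed result can also be checked numerically by simulating a long stationary AR(2) path (parameter values are illustrative and lie inside the stationarity triangle):

```python
import numpy as np

# Numerical check of the lag-1 autocovariance formula for a stationary AR(2)
# with phi1 = 0.5, phi2 = 0.2, sigma = 1.
phi1, phi2, sigma = 0.5, 0.2, 1.0
var_y = (1 - phi2) * sigma**2 / ((1 + phi2) * ((1 - phi2) ** 2 - phi1**2))
cov1 = phi1 * var_y / (1 - phi2)

rng = np.random.default_rng(11)
T = 500_000
e = sigma * rng.normal(size=T)
Y = np.zeros(T)
for t in range(2, T):
    Y[t] = phi1 * Y[t - 1] + phi2 * Y[t - 2] + e[t]
Y = Y[10_000:]  # discard burn-in so the zero initial conditions wash out

print(Y.var(), var_y)                     # sample vs formula variance
print(np.cov(Y[1:], Y[:-1])[0, 1], cov1)  # sample vs formula lag-1 autocovariance
```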

(d) Using the recursive method (“Method 1” in Hansen), compute \(b_0\), \(b_1\), \(b_2\), \(b_3\) for the impulse response function \(b_j = \partial Y_{t+j}/\partial e_t\). Then write the general recursion formula for \(b_j\) for all \(j \geq 1\).

Hint: Write out the equation for \(Y_{t+j}\) and substitute backwards. For \(j \geq 2\), the recursion formula \(b_j = \phi_1 b_{j-1} + \phi_2 b_{j-2}\) will be useful.

\(b_0\): Directly from \(Y_t = \phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + e_t\): \[b_0 = \frac{\partial Y_t}{\partial e_t} = 1\]

\(b_1\): Write \(Y_{t+1} = \phi_0 + \phi_1 Y_t + \phi_2 Y_{t-1} + e_{t+1}\): \[b_1 = \frac{\partial Y_{t+1}}{\partial e_t} = \phi_1\frac{\partial Y_t}{\partial e_t} + \phi_2\frac{\partial Y_{t-1}}{\partial e_t} = \phi_1 \cdot 1 + \phi_2 \cdot 0 = \phi_1\]

since \(e_t\) does not appear in \(Y_{t-1}\).

\(b_2\): Write \(Y_{t+2} = \phi_0 + \phi_1 Y_{t+1} + \phi_2 Y_t + e_{t+2}\): \[b_2 = \frac{\partial Y_{t+2}}{\partial e_t} = \phi_1\frac{\partial Y_{t+1}}{\partial e_t} + \phi_2\frac{\partial Y_t}{\partial e_t} = \phi_1 b_1 + \phi_2 b_0 = \phi_1^2 + \phi_2\]

\(b_3\): Write \(Y_{t+3} = \phi_0 + \phi_1 Y_{t+2} + \phi_2 Y_{t+1} + e_{t+3}\): \[b_3 = \phi_1 b_2 + \phi_2 b_1 = \phi_1(\phi_1^2 + \phi_2) + \phi_2\phi_1 = \phi_1^3 + 2\phi_1\phi_2\]

General recursion: For all \(j \geq 2\): \[\boxed{b_j = \phi_1 b_{j-1} + \phi_2 b_{j-2}}\] with initial conditions \(b_0 = 1\) and \(b_1 = \phi_1\). Note this differs from the AR(1) case where \(b_j = \phi_1^j\): the second lag introduces a richer pattern that can include oscillations when the characteristic roots of the AR polynomial are complex.