IV and Dynamic Panel Data Models
Theory Review
IV Estimation for Panel Data
Core Intuition
Recall the fixed effects panel model from ApEc 8212:
\[ Y_{it} = X_{it}'\beta + u_i + \varepsilon_{it} \]
Fixed effects (within transformation) removes bias from unobserved individual factors (\(u_i\)), but not from endogeneity in the error (\(\varepsilon_{it}\)). If any \(X_{it}\) is correlated with \(\varepsilon_{it}\), fixed effects alone is still biased.
The solution is to use IV/2SLS on the within-transformed data: within handles \(u_i\), IV handles endogeneity in \(\varepsilon_{it}\).
Example: To estimate the effect of union membership on wages, fixed effects removes bias from ability (\(u_i\)), but not from transitory wage shocks (\(\varepsilon_{it}\)) that may influence union membership. You need an instrument for union membership that is unrelated to these shocks.
The Within-Transformed 2SLS Estimator
After demeaning (within transformation), apply IV/2SLS to the transformed data:
\[ \dot{Y} = M_D Y, \quad \dot{X} = M_D X, \quad \dot{Z} = M_D Z \]
where \(M_D\) is the demeaning (within) projection matrix for the individual dummies.
The fixed effects 2SLS estimator is just the usual 2SLS formula, but using the demeaned variables:
\[ \hat{\beta}_{2\text{sls}} = (\dot{X}'\dot{Z}(\dot{Z}'\dot{Z})^{-1}\dot{Z}'\dot{X})^{-1}(\dot{X}'\dot{Z}(\dot{Z}'\dot{Z})^{-1}\dot{Z}'\dot{Y}) \]
Key points:
- Instruments must have variation after demeaning and not be collinear.
- Instruments must predict all endogenous \(X\) after demeaning.
- Strict exogeneity: \(E[Z_{is}\varepsilon_{it}] = 0\) for all \(s, t\).
- Always use cluster-robust standard errors (cluster by individual).
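The mechanics above can be sketched end to end with a small simulation. Everything here (the DGP and all parameter values) is hypothetical; the point is only that demeaning plus just-identified IV recovers \(\beta\), while within-OLS alone does not when \(X\) is endogenous:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 5_000, 5
beta = 2.0                                   # true coefficient (hypothetical)

u = rng.normal(size=(N, 1))                  # fixed effect u_i
eps = rng.normal(size=(N, T))                # idiosyncratic error
z = rng.normal(size=(N, T))                  # instrument: strictly exogenous
# x is endogenous: it loads on eps (and on u_i) as well as on z
x = z + u + 0.8 * eps + rng.normal(size=(N, T))
y = beta * x + u + eps

# Within transformation: subtract individual means (removes u_i)
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
zd = z - z.mean(axis=1, keepdims=True)

# Just-identified IV on the demeaned data
beta_iv = (zd * yd).sum() / (zd * xd).sum()
# Within-OLS for comparison: still biased, since it ignores Cov(x, eps)
fe_ols = (xd * yd).sum() / (xd ** 2).sum()
print(beta_iv, fe_ols)   # beta_iv close to 2; fe_ols biased upward here
```

With multiple instruments or regressors, the same demeaned data would go through the full 2SLS formula above; the scalar ratio is just the one-instrument, one-regressor special case.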
Hausman-Taylor: Estimating Time-Invariant Coefficients
Standard fixed effects cannot estimate the coefficient on any variable that does not vary over time; it gets swept out by the within transformation along with \(u_i\). The Hausman-Taylor model solves this by using IV to separately identify the time-invariant effects.
The model allows four types of variables:
| Variable | Varies over time? | Correlated with \(u_i\)? |
|---|---|---|
| \(X_{1it}\) | Yes | No (exogenous) |
| \(X_{2it}\) | Yes | Yes (endogenous) |
| \(Z_{1i}\) | No | No (exogenous) |
| \(Z_{2i}\) | No | Yes (endogenous) |
To estimate \(\gamma_2\) (the coefficient on \(Z_{2i}\), time-invariant and endogenous), use the individual means of the exogenous time-varying variables (\(\bar{X}_1\)) as instruments. Because \(X_1\) is uncorrelated with \(u_i\), so is \(\bar{X}_1\); it is a valid instrument for \(Z_{2i}\) as long as it is also correlated with \(Z_{2i}\) (relevance). Note that the demeaned \(\dot{X}_1\) cannot do this job: a demeaned variable is orthogonal to anything time-invariant by construction.
The instrument set is \((\dot{X}_1,\, \dot{X}_2,\, \bar{X}_1,\, Z_1)\). Only the between variation in \(X_2\) (\(\bar{X}_2\)) is correlated with \(u_i\), so \(\dot{X}_2\) is valid.
Requirement: \(k_1 \geq l_2\) (at least as many exogenous time-varying variables as endogenous time-invariant ones).
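The identification logic can be illustrated with a two-step sketch. This is not the full Hausman-Taylor GLS estimator, and the DGP is hypothetical; it only shows that within estimation recovers \(\beta\), and that the between equation, instrumented with \(\bar{X}_1\), recovers \(\gamma_2\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 20_000, 4
beta, gamma2 = 1.0, 1.0                      # true coefficients (hypothetical)

u = rng.normal(size=(N, 1))                  # fixed effect, correlated with z2
a = rng.normal(size=(N, 1))                  # latent trait, independent of u
x1 = a + rng.normal(size=(N, T))             # exogenous, time-varying (X_1)
z2 = (a + u + rng.normal(size=(N, 1))).ravel()   # endogenous, time-invariant (Z_2)
eps = rng.normal(size=(N, T))
y = beta * x1 + gamma2 * z2[:, None] + u + eps

# Step 1: within estimation -> beta (z2 and u_i are swept out by demeaning)
x1d = x1 - x1.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_fe = (x1d * yd).sum() / (x1d ** 2).sum()

# Step 2: between equation  ybar - beta*x1bar = gamma2*z2 + u_i + epsbar;
# instrument the endogenous z2 with x1bar, which is uncorrelated with u_i
e = y.mean(axis=1) - beta_fe * x1.mean(axis=1)
w = x1.mean(axis=1)                          # instrument: xbar_1
gamma2_iv = ((w - w.mean()) * e).sum() / ((w - w.mean()) * z2).sum()
print(beta_fe, gamma2_iv)                    # both close to 1
```

The relevance requirement is visible in the DGP: \(\bar{X}_1\) and \(Z_2\) share the latent trait `a`, so the instrument actually predicts the endogenous time-invariant variable.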
Check Your Understanding
- A researcher wants to estimate the return to schooling using a panel of workers observed over 10 years. Schooling is time-invariant (most workers complete education before the panel begins) and is likely correlated with unobserved ability \(u_i\). She proposes the Hausman-Taylor model. What observable variables would she need to serve as instruments for schooling, and what assumption must hold for this to work?
She needs at least one time-varying regressor that is uncorrelated with \(u_i\) (an \(X_{1it}\)); its individual mean \(\bar{X}_{1i}\) then serves as the instrument for schooling. Two assumptions must hold: exogeneity (\(X_1\) is uncorrelated with unobserved ability \(u_i\)) and relevance (\(\bar{X}_{1i}\) is correlated with schooling).
For example, if she has data on annual hours worked (which varies over time and is plausibly exogenous), she could use each worker's average hours as an instrument for schooling. The assumptions are that hours worked are not systematically related to unobserved ability, and that average hours predict schooling.
- In the within-transformed 2SLS setup, strict exogeneity requires \(E[Z_{is}\varepsilon_{it}] = 0\) for all \(s\) and \(t\), not just \(s = t\). Give an example of an instrument that satisfies contemporaneous exogeneity (\(s = t\)) but fails strict exogeneity, and explain why this would invalidate the estimator.
If local program funding responds to past earnings shocks (e.g., more funding after a bad year), then funding is correlated with past errors. This violates strict exogeneity, making the instrument invalid for panel IV, even if it’s uncorrelated with current errors.
The Bias in Dynamic Panels
Why Fixed Effects Fails with Lagged Dependent Variables
Adding a lagged dependent variable to a panel model is natural when outcomes persist over time: wages today depend on wages yesterday, health today depends on health yesterday. The model is:
\[Y_{it} = \alpha Y_{i,t-1} + X_{it}'\beta + u_i + \varepsilon_{it}\]
The tempting approach is to apply standard fixed effects (within estimation): demean everything and run OLS. After demeaning (suppressing the \(X\) term for clarity):
\[\dot{Y}_{it} = \alpha \dot{Y}_{i,t-1} + \dot{\varepsilon}_{it}\]
The regressor \(\dot{Y}_{i,t-1} = Y_{i,t-1} - \bar{Y}_i\) contains \(\bar{Y}_i\), the individual mean, which is computed using all time periods including \(t\). But \(\bar{Y}_i\) contains \(Y_{it}\), which depends on \(\varepsilon_{it}\). So \(\dot{Y}_{i,t-1}\) is mechanically correlated with \(\dot{\varepsilon}_{it}\). OLS on the within-transformed model is biased.
Nickell (1981) quantified this bias. For the simple AR(1) panel model with \(T\) time periods:
\[\underset{N \to \infty}{\text{plim}}(\hat{\alpha}_{fe} - \alpha) \approx -\frac{1 + \alpha}{T - 1}\]
For the special case \(T = 3\), this gives \(-(1+\alpha)/2\), which is the exact bias in that case (derived in the exercises below).
This is not just a small-sample problem; it is a fixed-\(T\) bias that does not disappear as \(N \to \infty\). No matter how large the panel, if \(T\) is small the fixed effects estimator of \(\alpha\) is inconsistent. The bias only vanishes as \(T \to \infty\).
How large is the bias in practice? For \(\alpha = 0.5\) and \(T = 3\), the plim of \(\hat{\alpha}_{fe}\) is \(0.5 - (1+0.5)/2 = -0.25\). The estimator not only understates the true persistence; it gets the sign wrong entirely.
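A quick Monte Carlo (numpy-only, hypothetical DGP) makes this concrete: with true \(\alpha = 0.5\) and \(T = 3\), the within estimate converges to about \(-0.25\), no matter how large \(N\) is:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, burn = 200_000, 0.5, 100           # true alpha = 0.5 (hypothetical)

u = rng.normal(size=N)                       # fixed effects
y = np.zeros(N)
for _ in range(burn):                        # burn in to the stationary distribution
    y = alpha * y + u + rng.normal(size=N)
Y = np.empty((N, 3))                         # observed panel: Y_1, Y_2, Y_3
for t in range(3):
    y = alpha * y + u + rng.normal(size=N)
    Y[:, t] = y

# Within estimation: the lag exists only for t = 2, 3, so demean the
# dependent variable over t = 2, 3 and the lag over t = 1, 2
ydot = Y[:, 1:] - Y[:, 1:].mean(axis=1, keepdims=True)
ldot = Y[:, :2] - Y[:, :2].mean(axis=1, keepdims=True)
alpha_fe = (ydot * ldot).sum() / (ldot ** 2).sum()
print(alpha_fe)   # close to alpha - (1+alpha)/2 = -0.25, despite true alpha = 0.5
```

Increasing `N` only tightens the estimate around \(-0.25\); it never moves it toward \(0.5\).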
Check Your Understanding
- A development economist estimates the effect of cash transfers on household consumption using a panel of 5,000 households observed for \(T = 4\) years. She includes lagged consumption as a regressor and uses standard fixed effects (within estimation). Should she be worried about the bias? Does the fact that \(N = 5{,}000\) help?
Yes. The bias is a fixed-\(T\) problem and does not go away as \(N\) increases. She should use an IV-based estimator for dynamic panels.
- Confirm the Nickell bias formula for \(T = 3\) and \(\alpha = 0\). What does the result say about a researcher who estimates a dynamic panel model, finds a negative \(\hat{\alpha}_{fe}\), and concludes that consumption growth mean-reverts?
For \(\alpha = 0\) and \(T = 3\): \(\text{plim}(\hat{\alpha}_{fe} - 0) = -(1+0)/2 = -0.5\). So \(\text{plim}(\hat{\alpha}_{fe}) = -0.5\).
The true model has \(\alpha = 0\), there is no persistence or mean reversion at all. But the fixed effects estimator converges to \(-0.5\), which looks like substantial negative autocorrelation. The researcher would conclude that high consumption growth this period predicts low consumption growth next period, when in fact there is no such pattern.
Equation Details
This result comes from the derivation of the Nickell bias for the fixed effects estimator in dynamic panel models. For an AR(1) panel with \(T = 3\), the bias is:
\[ \underset{N\to\infty}{\text{plim}}(\hat{\alpha}_{fe} - \alpha) = -\frac{1+\alpha}{2} \]
This is derived by writing the first-differenced OLS estimator for \(\alpha\), calculating the expected value of the numerator (\(E[\Delta Y_{i2} \Delta \varepsilon_{i3}] = -\sigma_\varepsilon^2\)), and the denominator (\(E[(\Delta Y_{i2})^2] = 2\sigma_\varepsilon^2/(1+\alpha)\) under stationarity), and then taking the ratio. Plugging in \(\alpha = 0\) gives \(-\frac{1+0}{2} = -0.5\).
So, even when the true process has no persistence, the fixed effects estimator is biased downward and can suggest spurious mean reversion when \(T\) is small.
Estimators for Dynamic Panel Data
Anderson-Hsiao: The Core IV Idea
The strategy for all valid dynamic panel estimators follows the same two-step logic: (1) eliminate \(u_i\) by first-differencing, then (2) use lagged levels as instruments for the endogenous differenced lags.
After first differencing:
\[\Delta Y_{it} = \alpha \Delta Y_{i,t-1} + \Delta X_{it}'\beta + \Delta \varepsilon_{it}\]
The problem is that \(\Delta Y_{i,t-1} = Y_{i,t-1} - Y_{i,t-2}\) is correlated with \(\Delta \varepsilon_{it} = \varepsilon_{it} - \varepsilon_{i,t-1}\) because \(Y_{i,t-1}\) was generated by \(\varepsilon_{i,t-1}\), which appears in \(\Delta \varepsilon_{it}\).
Anderson and Hsiao (1982) proposed \(Y_{i,t-2}\) as an instrument for \(\Delta Y_{i,t-1}\). Why is this valid?
- Relevance: \(Y_{i,t-2}\) is correlated with \(\Delta Y_{i,t-1} = Y_{i,t-1} - Y_{i,t-2}\)
- Exogeneity: Under the assumption that \(\varepsilon_{it}\) is serially uncorrelated, \(Y_{i,t-2}\) is determined before \(\varepsilon_{i,t-1}\) and \(\varepsilon_{it}\), so \(E[Y_{i,t-2} \Delta \varepsilon_{it}] = 0\).
This requires \(T \geq p + 2\), where \(p\) is the autoregressive lag order (you need at least two periods before the current one to form the instrument). For an AR(1) with no \(X\) variables, you need \(T \geq 3\).
Serial correlation: If \(\varepsilon_{it}\) is itself serially correlated, then \(Y_{i,t-2}\) may not be a valid instrument because \(Y_{i,t-2}\) contains \(\varepsilon_{i,t-2}\), which could be correlated with \(\varepsilon_{it}\) via \(\varepsilon_{i,t-1}\). Always test for serial correlation in the residuals before relying on the Anderson-Hsiao estimator.
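A minimal sketch of the Anderson-Hsiao estimator on a simulated AR(1) panel (all parameters hypothetical): OLS on the differenced equation is badly biased, while the IV version using the lagged level \(Y_{i1}\) recovers \(\alpha\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha, burn = 200_000, 0.5, 100           # true alpha = 0.5 (hypothetical)

u = rng.normal(size=N)                       # fixed effects
y = np.zeros(N)
for _ in range(burn):                        # burn in to stationarity
    y = alpha * y + u + rng.normal(size=N)
Y = np.empty((N, 3))                         # keep Y_1, Y_2, Y_3
for t in range(3):
    y = alpha * y + u + rng.normal(size=N)
    Y[:, t] = y

dY3 = Y[:, 2] - Y[:, 1]                      # dependent variable: ΔY_3
dY2 = Y[:, 1] - Y[:, 0]                      # endogenous regressor: ΔY_2
z = Y[:, 0]                                  # instrument: the level Y_1

alpha_fd = (dY2 * dY3).sum() / (dY2 ** 2).sum()   # OLS on differences: biased
alpha_ah = (z * dY3).sum() / (z * dY2).sum()      # just-identified IV: consistent
print(alpha_fd, alpha_ah)   # alpha_fd is negative; alpha_ah close to 0.5
```

The IV ratio here is the just-identified 2SLS estimator with a single instrument; with covariates it would be the same construction applied to the full differenced equation.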
Arellano-Bond: More Instruments, More Efficiency
Anderson and Hsiao use only \(Y_{i,t-2}\) as an instrument for \(\Delta Y_{i,t-1}\). But for observations at time \(t\), any level \(Y_{i,t-2}, Y_{i,t-3}, \ldots, Y_{i1}\) is a valid instrument, because all of them are predetermined relative to \(\Delta \varepsilon_{it}\).
Arellano and Bond (1991) exploit all available lags. The instrument set grows with \(t\):
- At \(t = 3\): use \(Y_{i1}\) as instrument
- At \(t = 4\): use \(Y_{i1}, Y_{i2}\)
- At \(t = T\): use \(Y_{i1}, Y_{i2}, \ldots, Y_{i,T-2}\)
The total number of instruments is \(\sum_{t=3}^{T}(t-2) = (T-1)(T-2)/2\), which grows quadratically with \(T\).
This is naturally cast as a GMM problem: the moment conditions are \(E[Y_{i,t-j}\Delta\varepsilon_{it}] = 0\) for \(j \geq 2\). The GMM estimator optimally weights these conditions, yielding lower variance than Anderson-Hsiao in theory.
The many-weak-instruments problem: In practice, using all available lags can hurt more than it helps. When \(T\) is large, the instrument matrix becomes very large, the GMM weighting matrix becomes difficult to estimate precisely, and the estimator becomes unreliable. Standard advice is to limit the lag depth, using only \(Y_{i,t-2}\) and \(Y_{i,t-3}\) as instruments even when longer lags are available.
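A small helper makes the instrument accounting concrete (the function name and layout are illustrative, not from any package): it lists the Arellano-Bond instruments available at each period and shows how a lag-depth cap shrinks the count:

```python
import numpy as np

def ab_instruments(y, max_lags=None):
    """Arellano-Bond instrument lists for one individual.

    y holds the levels Y_1..Y_T. The entry for period t (t = 3..T) contains
    the valid lagged levels Y_1..Y_{t-2}, optionally truncated to the
    `max_lags` most recent ones (Y_{t-2}, Y_{t-3}, ...).
    """
    T = len(y)
    rows = []
    for t in range(3, T + 1):
        lags = y[:t - 2]                    # Y_1 .. Y_{t-2}
        if max_lags is not None:
            lags = lags[-max_lags:]         # keep only the most recent lags
        rows.append(lags)
    return rows

y = np.arange(1, 7)                         # T = 6: levels Y_1..Y_6
n_full = sum(len(r) for r in ab_instruments(y))
n_capped = sum(len(r) for r in ab_instruments(y, max_lags=2))
print(n_full)    # (T-1)(T-2)/2 = 10
print(n_capped)  # 1 + 2 + 2 + 2 = 7 with lag depth capped at 2
```

In GMM form these lists become a block-diagonal instrument matrix per individual, one block per differenced equation; the cap (or "collapsing" the matrix) is the standard remedy for instrument proliferation.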
Weak Instruments and the Blundell-Bond Estimator
Both Anderson-Hsiao and Arellano-Bond can suffer from weak instruments in two situations:
- When \(\alpha\) is close to 1 (the process is nearly a unit root). Blundell and Bond (1998) showed that the first-stage coefficient for the Anderson-Hsiao instrument is:
\[ \gamma = (\alpha - 1) \cdot \frac{k}{k + \sigma_u^2/\sigma_\varepsilon^2}, \quad k = \frac{1-\alpha}{1+\alpha} \]
When \(\alpha \to 1\), the factor \((\alpha - 1) \to 0\) and \(k \to 0\), so \(\gamma \to 0\). Lagged levels contain almost no information about differenced values when the series is highly persistent.
- When \(\sigma_\varepsilon^2\) is small relative to \(\sigma_u^2\), most variation in \(Y_{it}\) comes from the fixed effect \(u_i\) rather than the idiosyncratic shock. First differencing eliminates this variation, leaving almost nothing for the instrument to work with.
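The first-stage formula above can be tabulated directly. A short sketch (the variance ratio \(\sigma_u^2/\sigma_\varepsilon^2 = 1\) is an arbitrary illustrative choice) shows \(\gamma\) collapsing toward zero as \(\alpha \to 1\):

```python
def ah_first_stage(alpha, var_ratio):
    """First-stage coefficient of Delta Y_{t-1} on Y_{t-2}.

    var_ratio = sigma_u^2 / sigma_eps^2 (Blundell-Bond weak-instrument formula).
    """
    k = (1 - alpha) / (1 + alpha)
    return (alpha - 1) * k / (k + var_ratio)

for a in (0.3, 0.6, 0.9, 0.99):
    print(f"alpha={a:.2f}  gamma={ah_first_stage(a, 1.0):+.4f}")
# gamma shrinks rapidly in magnitude as alpha approaches 1
```

Raising `var_ratio` shrinks \(\gamma\) further, which is exactly the second weak-instrument case in the list above.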
Blundell-Bond: adds a second set of moment conditions by using differenced lags as instruments for the level equation. The idea: instead of only estimating the differenced equation with lagged levels as instruments, also estimate the level equation with lagged differences as instruments. \(\Delta Y_{i,t-1}\) is a valid instrument for \(Y_{i,t-1}\) in the level equation because (under stationarity) the correlation between \(\Delta Y_{i,t-1}\) and \(u_i\) is zero.
This “system GMM” approach, combining the differenced equation and the level equation, is more efficient than Arellano-Bond when instruments are weak, especially for persistent processes. The tradeoff: it requires the additional assumption that \(Y_{it}\) is stationary (the initial conditions \(Y_{i1}\) are not systematically related to \(u_i\)). It can also still suffer from the many-instruments problem.
Predetermined vs. Strictly Exogenous Regressors
Dynamic panel models often include time-varying regressors \(X_{it}\) alongside the lagged dependent variable. The key question is how \(X_{it}\) relates to the error over time.
Strictly exogenous: \(E[X_{is}\varepsilon_{it}] = 0\) for all \(s\) and \(t\). Past, present, and future \(X\) are all uncorrelated with today’s shock. This allows \(\Delta X_{it}\) to serve as its own instrument in the differenced equation.
Predetermined: \(E[X_{i,t-s}\varepsilon_{it}] = 0\) for \(s \geq 0\) only. Current and past \(X\) are uncorrelated with today’s shock, but future \(X\) may not be. In other words, \(X\) does not anticipate future shocks, but today’s shock can influence future \(X\). This is the natural assumption when \(X\) is itself an outcome that responds dynamically to \(Y\).
Under predeterminedness, \(\Delta X_{it}\) is correlated with \(\Delta\varepsilon_{it}\): \(\Delta X_{it} = X_{it} - X_{i,t-1}\) includes \(X_{it}\), which may have responded to \(\varepsilon_{i,t-1}\), and \(\varepsilon_{i,t-1}\) appears (with a negative sign) in \(\Delta\varepsilon_{it}\). The fix is to instrument \(\Delta X_{it}\) with \(X_{i,t-1}\) (or earlier lags), which are uncorrelated with both \(\varepsilon_{i,t-1}\) and \(\varepsilon_{it}\) and therefore with \(\Delta\varepsilon_{it}\).
A summary of valid instruments across all settings:
| Setting | Endogenous variable | Valid instruments |
|---|---|---|
| Anderson-Hsiao (AR(1)) | \(\Delta Y_{i,t-1}\) | \(Y_{i,t-2}\) (levels) |
| Arellano-Bond | \(\Delta Y_{i,t-1}\) | \(Y_{i1}, \ldots, Y_{i,t-2}\) (all lags) |
| Blundell-Bond (level eq.) | \(Y_{i,t-1}\) | \(\Delta Y_{i,t-1}\) (differences) |
| Predetermined \(X\) | \(\Delta X_{it}\) | \(X_{i,t-1}, X_{i,t-2}, \ldots\) |
Check Your Understanding
- You estimate a dynamic investment model with \(T = 6\) using Arellano-Bond and all available lags as instruments. A referee warns of a “many weak instruments” problem. What does this mean, and what should you do?
With \(T = 6\), you have \(\sum_{t=3}^{6}(t-2) = 10\) instruments for one AR coefficient. Too many instruments can make the GMM weighting matrix \(\hat{W}\) unstable and individual lags weak (low first-stage \(R^2\)), especially if the process is persistent.
Remedies: (1) Collapse the instrument matrix; (2) Limit lag depth (e.g., use only \(Y_{i,t-2}\) and \(Y_{i,t-3}\)); (3) Report weak instrument statistics; (4) Compare to Blundell-Bond.
- A researcher estimates a dynamic health model with \(Y_{it}\) (health) and \(X_{it}\) (income, predetermined). After first differencing, why can’t \(\Delta X_{it}\) be its own instrument, and what should be used?
With predetermined \(X_{it}\), \(\Delta X_{it}\) is endogenous because \(X_{it}\) may have responded to \(\varepsilon_{i,t-1}\), which appears in \(\Delta\varepsilon_{it}\). Use \(X_{i,t-1}\) (or earlier lags) as instruments: they are correlated with \(\Delta X_{it}\) (relevance) and uncorrelated with \(\Delta\varepsilon_{it}\) (exogeneity).
- You estimate an AR(1) panel and get \(\hat{\alpha}_{fe} = -0.4\) with \(T = 5\). Should you trust this? When could it make sense?
Nickell bias is always negative: \(\hat{\alpha}_{fe}\) is a lower bound for the true \(\alpha\). With \(T = 5\), the bias is about \(-(1+\alpha)/4\). So \(\hat{\alpha}_{fe} = -0.4\) could mean the true \(\alpha\) is higher (e.g., \(-0.2\) or \(0\)). Always check with Anderson-Hsiao or Arellano-Bond: if those are less negative, the fixed effects result is likely just bias.
Exercise 1: Choosing an Estimator
Exercise
A labor economist uses a panel of workers observed for \(T = 4\) years to estimate the effect of wages on hours worked, allowing for individual fixed effects and dynamic persistence. The model is:
\[ \text{Hours}_{it} = \alpha \, \text{Hours}_{i,t-1} + \beta \, \text{Wage}_{it} + u_i + \varepsilon_{it} \]
She believes wages are predetermined (hours worked this period may affect next period’s offered wage through learning-by-doing, but current-period wage shocks do not contemporaneously affect this period’s hours decision).
(a) She first estimates the model using standard fixed effects (within estimation). Explain why this estimate of \(\alpha\) will be biased. Compute the direction and approximate magnitude of the bias for \(T = 4\) and \(\alpha = 0.6\).
The bias arises because the within transformation creates a mechanical correlation between the demeaned lagged dependent variable and the demeaned error. Specifically, \(\dot{Y}_{i,t-1} = Y_{i,t-1} - \bar{Y}_i\) contains the individual mean \(\bar{Y}_i\), which is computed using all time periods including \(t\). Since \(\bar{Y}_i\) includes \(Y_{it}\), which was generated partly by \(\varepsilon_{it}\), we have \(\text{Cov}(\dot{Y}_{i,t-1}, \dot{\varepsilon}_{it}) \neq 0\). The bias is always negative; fixed effects estimation always understates \(\alpha\).
For \(T = 4\) and \(\alpha = 0.6\), the Nickell (1981) approximation \(\text{plim}(\hat{\alpha}_{fe} - \alpha) \approx -(1+\alpha)/(T-1)\) gives:
\[\text{bias} \approx -\frac{1 + 0.6}{4 - 1} = -\frac{1.6}{3} \approx -0.53\]
So \(\text{plim}(\hat{\alpha}_{fe}) \approx 0.6 - 0.53 = 0.07\). The fixed effects estimator would suggest essentially no persistence when the true process is in fact quite persistent. Do not use standard fixed effects here.
(b) She decides to use the Anderson-Hsiao approach. Write down the first-differenced equation and identify which variables are endogenous in that equation. For each endogenous variable, state a valid instrument and verify the two IV conditions (relevance and exogeneity).
First-differencing eliminates \(u_i\):
\[\Delta \text{Hours}_{it} = \alpha \, \Delta \text{Hours}_{i,t-1} + \beta \, \Delta \text{Wage}_{it} + \Delta \varepsilon_{it}\]
Endogenous variables:
\(\Delta \text{Hours}_{i,t-1} = \text{Hours}_{i,t-1} - \text{Hours}_{i,t-2}\). This is endogenous because \(\text{Hours}_{i,t-1}\) was generated by \(\varepsilon_{i,t-1}\), which appears (negatively) in \(\Delta \varepsilon_{it} = \varepsilon_{it} - \varepsilon_{i,t-1}\).
\(\Delta \text{Wage}_{it} = \text{Wage}_{it} - \text{Wage}_{i,t-1}\). Since wages are only predetermined (not strictly exogenous), \(\text{Wage}_{it}\) may have been affected by the lagged shock \(\varepsilon_{i,t-1}\), which appears (negatively) in \(\Delta \varepsilon_{it}\). So \(\Delta \text{Wage}_{it}\) is also endogenous.
Valid instruments:
- For \(\Delta \text{Hours}_{i,t-1}\): use \(\text{Hours}_{i,t-2}\) (a lagged level).
- Relevance: \(\text{Hours}_{i,t-2}\) is literally part of \(\Delta \text{Hours}_{i,t-1} = \text{Hours}_{i,t-1} - \text{Hours}_{i,t-2}\) and predicts it directly.
- Exogeneity: Under serially uncorrelated \(\varepsilon_{it}\), \(\text{Hours}_{i,t-2}\) was determined before \(\varepsilon_{i,t-1}\) and \(\varepsilon_{it}\), so \(E[\text{Hours}_{i,t-2} \cdot \Delta \varepsilon_{it}] = 0\).
- For \(\Delta \text{Wage}_{it}\): use \(\text{Wage}_{i,t-1}\) (the lagged level).
- Relevance: \(\text{Wage}_{i,t-1}\) predicts \(\Delta \text{Wage}_{it}\) directly (it is part of the expression).
- Exogeneity: Under predeterminedness, \(E[\text{Wage}_{i,t-1} \cdot \varepsilon_{it}] = 0\) and \(E[\text{Wage}_{i,t-1} \cdot \varepsilon_{i,t-1}] = 0\) (wages at \(t-1\) are set before the shock at \(t-1\) realizes, by the predeterminedness assumption). So \(E[\text{Wage}_{i,t-1} \cdot \Delta \varepsilon_{it}] = 0\).
This requires \(T \geq 3\) to have at least one valid observation per person.
(c) A colleague suggests using the Arellano-Bond estimator instead. What is the advantage over Anderson-Hsiao? What is the potential downside with \(T = 4\)?
Advantage: Arellano-Bond uses all available lagged levels as instruments, not just the single lag used by Anderson-Hsiao. At time \(t\), the valid instruments include \(\text{Hours}_{i,t-2}, \text{Hours}_{i,t-3}, \ldots, \text{Hours}_{i1}\). Using more moment conditions allows GMM to weight them optimally, yielding a lower-variance estimator than Anderson-Hsiao in theory.
Downside with \(T = 4\): With only four time periods, the differenced equation is observed at \(t = 3\) and \(t = 4\). The instrument counts are:
- At \(t = 3\): only \(\text{Hours}_{i1}\) is available (one instrument).
- At \(t = 4\): \(\text{Hours}_{i1}\) and \(\text{Hours}_{i2}\) are available (two instruments).
That is a total of just 3 instruments across both periods, essentially the same as Anderson-Hsiao, so there is little efficiency gain. The advantage of Arellano-Bond grows with \(T\); for \(T = 4\) it is modest, and the many-instruments problem is not a concern here because the instrument count is small.
(d) Why might the Blundell-Bond estimator be preferred in this application, and what additional assumption does it require?
With \(\alpha = 0.6\), the process is persistent. The Anderson-Hsiao first-stage coefficient is
\[\gamma = (\alpha - 1) \cdot \frac{k}{k + \sigma_u^2/\sigma_\varepsilon^2}, \quad k = \frac{1-\alpha}{1+\alpha}\]
If \(\sigma_u^2/\sigma_\varepsilon^2\) is large, \(\gamma\) is near zero: lagged levels are weak instruments after differencing.
Blundell-Bond improves efficiency by also using the level equation with lagged differences as instruments, which are more informative when persistence is high.
Extra assumption: The initial value \(\text{Hours}_{i1}\) must be uncorrelated with \(u_i\) (stationarity: \(E[\Delta Y_{i,t-1} u_i] = 0\)). If the panel starts in an unusual period, this may not hold, and Blundell-Bond is invalid.
Exercise 2: Deriving the Nickell Bias
Exercise
Consider the simplest possible dynamic panel model: an AR(1) with no covariates, \(N\) individuals, and exactly \(T = 3\) time periods:
\[ Y_{it} = \alpha Y_{i,t-1} + u_i + \varepsilon_{it}, \quad i = 1,\ldots,N,\quad t = 1,2,3 \]
Assume \(\varepsilon_{it} \overset{i.i.d.}{\sim} (0, \sigma_\varepsilon^2)\), independent of \(u_i\) and of \(Y_{i0}\).
(a) Show that with \(T = 3\), the within (fixed effects) estimator is numerically equivalent to the first-differenced OLS estimator. That is, show that estimating \(\dot{Y}_{it} = \alpha \dot{Y}_{i,t-1} + \dot{\varepsilon}_{it}\) on the demeaned data gives the same \(\hat{\alpha}\) as estimating \(\Delta Y_{i3} = \alpha \Delta Y_{i2} + \Delta \varepsilon_{i3}\) by OLS.
First-Difference Estimator
Differencing eliminates \(u_i\). With \(T = 3\), the lag \(Y_{i,t-1}\) is available for \(t = 2, 3\), so there is exactly one usable differenced equation per individual:
\[ \Delta Y_{i3} = \alpha \Delta Y_{i2} + \Delta\varepsilon_{i3}, \qquad \Delta Y_{it} = Y_{it} - Y_{i,t-1} \]
OLS on this equation gives:
\[ \hat{\alpha}_{FD} = \frac{\sum_i \Delta Y_{i2}\, \Delta Y_{i3}}{\sum_i \Delta Y_{i2}^2} \]
Within Estimator
The within regression can also only use \(t = 2, 3\) (the lag does not exist for \(t = 1\)), so the demeaning is over those two periods:
\[ \dot{Y}_{it} = Y_{it} - \tfrac{1}{2}(Y_{i2} + Y_{i3}), \qquad \dot{Y}_{i,t-1} = Y_{i,t-1} - \tfrac{1}{2}(Y_{i1} + Y_{i2}), \qquad t = 2, 3 \]
Step 1: Express the demeaned variables in terms of differences.
\[ \begin{align} \dot{Y}_{i2} &= -\tfrac{1}{2}\Delta Y_{i3}, & \dot{Y}_{i3} &= \tfrac{1}{2}\Delta Y_{i3} & \text{(dependent variable)}\\ \dot{Y}_{i1} &= -\tfrac{1}{2}\Delta Y_{i2}, & \dot{Y}_{i2} &= \tfrac{1}{2}\Delta Y_{i2} & \text{(lagged regressor)} \end{align} \]
Step 2: Substitute into the within estimator.
\[ \hat{\alpha}_{FE} = \frac{\sum_i \sum_{t=2}^{3} \dot{Y}_{it}\, \dot{Y}_{i,t-1}}{\sum_i \sum_{t=2}^{3} \dot{Y}_{i,t-1}^2} = \frac{\sum_i \left[ \tfrac{1}{4}\Delta Y_{i2}\Delta Y_{i3} + \tfrac{1}{4}\Delta Y_{i2}\Delta Y_{i3} \right]}{\sum_i \left[ \tfrac{1}{4}\Delta Y_{i2}^2 + \tfrac{1}{4}\Delta Y_{i2}^2 \right]} = \frac{\sum_i \Delta Y_{i2}\, \Delta Y_{i3}}{\sum_i \Delta Y_{i2}^2} \]
Therefore \(\hat{\alpha}_{FE} = \hat{\alpha}_{FD}\): with only two usable periods, demeaning and first differencing carry exactly the same information, and the two estimators coincide numerically.
(b) The OLS estimator for \(\alpha\) in the first-differenced regression can be written as:
\[ \hat{\alpha}_{fe} = \alpha + \frac{\sum_{i=1}^N \Delta Y_{i2} \cdot \Delta \varepsilon_{i3}}{\sum_{i=1}^N (\Delta Y_{i2})^2} \]
Show that the expected value of the numerator, \(E[\Delta Y_{i2} \cdot \Delta \varepsilon_{i3}]\), equals \(-\sigma_\varepsilon^2\). This shows the bias term in \(\hat{\alpha}_{fe}\) is negative.
Expand:
\[E[\Delta Y_{i2} \cdot \Delta \varepsilon_{i3}] = E[(Y_{i2} - Y_{i1})(\varepsilon_{i3} - \varepsilon_{i2})]\]
\[= E[Y_{i2}\varepsilon_{i3}] - E[Y_{i2}\varepsilon_{i2}] - E[Y_{i1}\varepsilon_{i3}] + E[Y_{i1}\varepsilon_{i2}]\]
By the model and exogeneity:
- \(E[Y_{i2}\varepsilon_{i3}] = 0\)
- \(E[Y_{i1}\varepsilon_{i3}] = 0\)
- \(E[Y_{i1}\varepsilon_{i2}] = 0\)
- \(E[Y_{i2}\varepsilon_{i2}] = \sigma_\varepsilon^2\)
So \(E[\Delta Y_{i2} \cdot \Delta \varepsilon_{i3}] = 0 - \sigma_\varepsilon^2 - 0 + 0 = -\sigma_\varepsilon^2\).
(c) Now show that \(E[(\Delta Y_{i2})^2] = \frac{2\sigma_\varepsilon^2}{1+\alpha}\) when the panel is stationary (i.e., assume \(Y_{it}\) has reached its stationary distribution so that \(\text{Var}(Y_{it}) = \sigma_\varepsilon^2/(1-\alpha^2)\)).
\[E[(\Delta Y_{i2})^2] = E[(Y_{i2} - Y_{i1})^2] = \text{Var}(Y_{i2}) + \text{Var}(Y_{i1}) - 2\text{Cov}(Y_{i2}, Y_{i1})\]
Under stationarity, \(\text{Var}(Y_{it}) = \sigma_\varepsilon^2/(1-\alpha^2)\) for all \(t\), and the lag-1 autocovariance is \(\text{Cov}(Y_{i2}, Y_{i1}) = \alpha \cdot \text{Var}(Y_{it}) = \alpha\sigma_\varepsilon^2/(1-\alpha^2)\).
(Note: this is the variance of the idiosyncratic component conditional on \(u_i\). The fixed effect \(u_i\) contributes equally to \(Y_{i1}\) and \(Y_{i2}\) and cancels in the difference.)
Therefore:
\[E[(\Delta Y_{i2})^2] = \frac{\sigma_\varepsilon^2}{1-\alpha^2} + \frac{\sigma_\varepsilon^2}{1-\alpha^2} - \frac{2\alpha\sigma_\varepsilon^2}{1-\alpha^2}\]
\[= \frac{2\sigma_\varepsilon^2(1-\alpha)}{1-\alpha^2} = \frac{2\sigma_\varepsilon^2(1-\alpha)}{(1-\alpha)(1+\alpha)} = \frac{2\sigma_\varepsilon^2}{1+\alpha}\]
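Both moments can be checked by simulation. A numpy sketch (burn-in to reach the stationary distribution; all parameters hypothetical, with \(\sigma_\varepsilon^2 = 1\) and \(\alpha = 0.5\)):

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, burn = 1_000_000, 0.5, 100         # sigma_eps = 1, alpha = 0.5

u = rng.normal(size=N)                       # fixed effects
y = np.zeros(N)
for _ in range(burn):                        # burn in to stationarity
    y = alpha * y + u + rng.normal(size=N)
eps = rng.normal(size=(N, 3))                # eps_1, eps_2, eps_3 (kept for the check)
Y = np.empty((N, 3))
for t in range(3):
    y = alpha * y + u + eps[:, t]
    Y[:, t] = y

dY2 = Y[:, 1] - Y[:, 0]                      # ΔY_2
de3 = eps[:, 2] - eps[:, 1]                  # Δeps_3
print((dY2 * de3).mean())   # close to -sigma_eps^2 = -1        (part b)
print((dY2 ** 2).mean())    # close to 2*sigma_eps^2/(1+alpha) = 4/3  (part c)
```

The fixed effect `u` drops out of both moments, as the derivation predicts: it enters \(Y_{i1}\) and \(Y_{i2}\) identically and cancels in the difference.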
(d) Combine parts (b) and (c) to show that:
\[\underset{N\to\infty}{\text{plim}}(\hat{\alpha}_{fe} - \alpha) = -\frac{1+\alpha}{2}\]
and evaluate this for \(\alpha = 0\) and \(\alpha = 0.8\). What happens to \(\text{plim}(\hat{\alpha}_{fe})\) as \(\alpha \to 1\)?
By the law of large numbers:
\[\underset{N\to\infty}{\text{plim}}(\hat{\alpha}_{fe} - \alpha) = \frac{E[\Delta Y_{i2} \cdot \Delta \varepsilon_{i3}]}{E[(\Delta Y_{i2})^2]} = \frac{-\sigma_\varepsilon^2}{2\sigma_\varepsilon^2/(1+\alpha)} = -\frac{1+\alpha}{2}\]
Evaluating:
- \(\alpha = 0\): bias \(= -(1+0)/2 = -1/2\), so \(\text{plim}(\hat{\alpha}_{fe}) = 0 - 1/2 = -1/2\).
- \(\alpha = 0.8\): bias \(= -(1+0.8)/2 = -0.9\), so \(\text{plim}(\hat{\alpha}_{fe}) = 0.8 - 0.9 = -0.1\).
As \(\alpha \to 1\): the bias approaches \(-(1+1)/2 = -1\), so \(\text{plim}(\hat{\alpha}_{fe}) \to 1 - 1 = 0\). For a near-unit-root process (very persistent), the fixed effects estimator converges to zero. It completely misses the persistence and makes the series look like white noise. This is exactly the setting where Blundell-Bond has its largest advantage over Anderson-Hsiao, since lagged levels become weak instruments precisely when \(\alpha\) is close to 1.
The formula also shows that for any \(\alpha \in (-1, 1)\), \(\text{plim}(\hat{\alpha}_{fe}) = \alpha - (1+\alpha)/2 = (\alpha - 1)/2 < 0\): with \(T = 3\), the fixed effects estimate converges to a negative number. In particular, for any true \(\alpha > 0\) it does not just understate the persistence, it gets the sign wrong.