Time Series Econometrics
Theory Review
Stationarity and Ergodicity
Core Intuition
In cross-sectional analysis, you observe many different units from the same population, which makes averaging across units meaningful. In time series, you observe one unit repeatedly. For averaging over time to be meaningful, the distribution of \(Y_t\) must be “stable”. The same process generating \(Y_t\) today should also have generated \(Y_{t-100}\).
Covariance (weak) stationarity requires three things to be constant over time:
- the mean \(\mu = E[Y_t]\),
- the variance \(\sigma^2 = \text{Var}(Y_t)\), and
- the autocovariances \(\gamma(k) = \text{Cov}(Y_t, Y_{t-k})\) for all lags \(k\).
Notice that \(\gamma(k)\) depends only on the gap \(k\), not on when you are in time.
Strict stationarity requires the entire joint distribution of \((Y_t, Y_{t+1}, \ldots, Y_{t+\ell})\) to be the same for all \(t\).
A practical way to think about it: GDP levels are not stationary (they trend upward and the variance grows), but GDP growth rates often are. Taking log differences (growth rates) is the standard transformation to induce stationarity.
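This can be illustrated with a short simulation. The sketch below (numpy, with made-up drift and volatility values, not estimates) builds a GDP-like series whose log-level follows a random walk with drift: the level's subsample means drift apart, while the log-difference fluctuates around a stable mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GDP-like series: the log-level is a random walk with drift,
# so the level trends upward and is clearly non-stationary.
n = 400
drift, sigma = 0.005, 0.01                 # assumed quarterly drift and shock size
log_gdp = np.cumsum(drift + sigma * rng.standard_normal(n))
gdp = np.exp(log_gdp)

# First difference of the log is (approximately) the growth rate.
growth = np.diff(np.log(gdp))

# The level's mean shifts across subsamples; the growth rate's does not.
first_half_level, second_half_level = gdp[: n // 2].mean(), gdp[n // 2 :].mean()
first_half_growth, second_half_growth = growth[: n // 2].mean(), growth[n // 2 :].mean()
```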
Ergodicity: The Missing Piece for Consistency
Stationarity alone is not enough for a sample mean to consistently estimate \(\mu\). To see why, consider the process \(Y_t = Z\) for all \(t\), where \(Z \sim N(0,1)\) is drawn once and held fixed. This process is strictly stationary; the distribution of \(Y_t\) is the same for all \(t\). But the sample mean is:
\[\bar{Y} = \frac{1}{n}\sum_{t=1}^n Y_t = \frac{1}{n}\sum_{t=1}^n Z = Z\]
No matter how large \(n\) is, \(\bar{Y}\) stays stuck at the single draw of \(Z\) and never converges to \(E[Z] = 0\). The problem: consecutive observations carry zero new information because they are all the same.
Ergodicity rules this out. Intuitively, an ergodic process “visits” every part of its sample space over time. Variables that are far apart in time eventually become approximately independent, so successive observations do add genuine new information.
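The non-ergodic example above is easy to reproduce numerically. A minimal sketch (numpy, illustrative sample size) compares \(Y_t = Z\) with an i.i.d. sequence: the first sample mean stays stuck at the single draw of \(Z\), the second converges toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Non-ergodic but strictly stationary: one draw of Z, repeated forever.
z = rng.standard_normal()
y_stuck = np.full(n, z)

# Ergodic comparison: a fresh i.i.d. N(0,1) draw each period.
y_iid = rng.standard_normal(n)

mean_stuck = y_stuck.mean()   # equals z no matter how large n is
mean_iid = y_iid.mean()       # close to E[Y] = 0 by the Ergodic Theorem / LLN
```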
The Ergodic Theorem (Theorem 14.9) is the time series analog of the Law of Large Numbers:
\[\bar{Y} \xrightarrow{p} \mu \quad \text{if } Y_t \text{ is strictly stationary and ergodic with } E\|Y_t\| < \infty\]
The hierarchy: i.i.d. \(\Rightarrow\) strictly stationary + ergodic. Any i.i.d. sequence is ergodic (Theorem 14.4), and ergodicity is preserved under transformations (Theorem 14.5). Most practically useful time series models (AR, MA, ARMA with white noise errors) are ergodic.
Check Your Understanding
- Nominal GDP in the U.S. grows over time, so it is clearly not stationary. A researcher argues that they should use \(\log(\text{GDP}_t)\) instead. Does this fix the stationarity problem? What transformation would you recommend and why?
Taking logs alone does not fix the problem: the series still trends upward; logging merely converts exponential growth into roughly linear growth. The mean of \(\log(\text{GDP}_t)\) still increases over time, violating the constant-mean requirement.
The standard fix is to take first differences of the log: \(\Delta \log(\text{GDP}_t) = \log(\text{GDP}_t) - \log(\text{GDP}_{t-1}) \approx g_t\), the growth rate. Growth rates are typically stationary: they fluctuate around a roughly stable mean (e.g., around 2–3% annually) rather than trending indefinitely.
- Suppose you observe a process for 500 periods and compute \(\bar{Y} = 3.2\). Under what conditions can you interpret this as a consistent estimate of \(E[Y_t]\)? Name one condition under which it would not be consistent, even with many observations.
For \(\bar{Y} \xrightarrow{p} E[Y_t]\), you need the process to be strictly stationary and ergodic with a finite mean (Ergodic Theorem 14.9). Stationarity ensures the mean is actually constant over time (so there is a single population mean to estimate), and ergodicity ensures the time average converges to that mean rather than getting stuck.
A simple case where it fails: \(Y_t = Z\) for all \(t\). The process is stationary but not ergodic. The sample mean equals the single draw of \(Z\), never converging to \(E[Z]\). This can happen in macroeconomic models with permanent unobserved heterogeneity or random trends that do not mean-revert.
Martingale Difference Sequences and White Noise
From i.i.d. to MDS
Much of time series econometrics involves errors that are not i.i.d. but still have the property that their conditional mean given the past is zero. This is precisely the definition of a martingale difference sequence (MDS).
A process \((e_t, \mathcal{F}_t)\) is an MDS if:
\[E[e_t \mid \mathcal{F}_{t-1}] = 0\]
where \(\mathcal{F}_{t-1}\) is the information set: everything known up to time \(t-1\). This says that the past gives no information about the mean of \(e_t\): the process is unforecastable in mean.
The relationship between the key concepts is:
\[\underbrace{\text{i.i.d., } E[e]=0}_{\text{fully unforecastable}} \;\subset\; \underbrace{\text{homoscedastic MDS}}_{\text{mean \& variance unforecastable}} \;\subset\; \underbrace{\text{MDS}}_{\text{mean unforecastable}} \;\subset\; \underbrace{\text{white noise}}_{\text{serially uncorrelated}}\]
The key difference from i.i.d.: an MDS can have forecastable higher moments. For example, \(e_t = u_t u_{t-1}\) (where \(u_t \sim \text{i.i.d. } N(0,1)\)) is an MDS because \(E[e_t \mid \mathcal{F}_{t-1}] = u_{t-1} E[u_t] = 0\). But \(e_t^2 = u_t^2 u_{t-1}^2\) has \(\text{Cov}(e_t^2, e_{t-1}^2) = 2 \neq 0\): the variance is forecastable.
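This example can be checked by simulation. The sketch below (illustrative, numpy) estimates the lag-1 autocovariance of \(e_t\), which should be near zero (the MDS/white-noise property), and of \(e_t^2\), which should be near 2 (forecastable variance).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

u = rng.standard_normal(n + 1)
e = u[1:] * u[:-1]              # e_t = u_t * u_{t-1}: an MDS, but not i.i.d.

def autocov(x, k):
    """Sample autocovariance at lag k (demeaned)."""
    x = x - x.mean()
    return np.mean(x[k:] * x[:-k]) if k > 0 else np.mean(x * x)

acov_e = autocov(e, 1)          # mean unforecastable: Cov(e_t, e_{t-1}) ~ 0
acov_e_sq = autocov(e**2, 1)    # variance forecastable: Cov(e_t^2, e_{t-1}^2) ~ 2
```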
White Noise and the Wold Decomposition
A white noise process has zero mean, finite variance, and zero serial correlation for all lags \(k \neq 0\). It is the weakest useful error structure.
The Wold Decomposition Theorem (Theorem 14.17) is a foundational result: every weakly stationary, non-deterministic process can be written as:
\[Y_t = \mu + \sum_{j=0}^{\infty} b_j e_{t-j}, \quad b_0 = 1, \quad \sum_{j=0}^{\infty} b_j^2 < \infty\]
where \(e_t\) is a white noise process. This justifies linear MA models not as arbitrary functional-form assumptions but as the natural representation of any stationary process. Every AR, MA, and ARMA model is an attempt to approximate this infinite moving average.
The deterministic component \(\mu_t = \lim_{m \to \infty} \mathcal{P}_{t-m}[Y_t]\) captures the part of \(Y_t\) that can be perfectly forecast from the infinite past. In most applications, \(\mu_t = \mu\) (a constant), giving the non-deterministic case.
Central Limit Theorems for Time Series
Two CLTs are used, depending on the error structure:
MDS CLT (Theorem 14.11): If \(u_t\) is strictly stationary, ergodic MDS with \(E[u_t u_t'] = \Sigma < \infty\), then: \[\frac{1}{\sqrt{n}}\sum_{t=1}^n u_t \xrightarrow{d} N(0, \Sigma)\]
This is the CLT you use when errors are serially uncorrelated (as in AR models, after controlling for lags).
Mixing CLT (Theorem 14.15): When observations are serially correlated, the variance of the sample mean is larger. Define the long-run variance: \[\Omega = \sum_{\ell=-\infty}^{\infty} \Gamma(\ell)\]
where \(\Gamma(\ell) = E[u_t u_{t-\ell}']\). Under mixing conditions, \(\frac{1}{\sqrt{n}}\sum u_t \xrightarrow{d} N(0, \Omega)\). The long-run variance \(\Omega\) exceeds the contemporaneous variance \(\Sigma\) whenever there is positive serial correlation. Positive autocorrelation “reinforces” shocks rather than canceling them, inflating the variance of the sum.
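For intuition, the two variances can be compared for a stationary AR(1), where \(\Gamma(\ell) = \sigma^2 \phi^{|\ell|}/(1-\phi^2)\) and \(\Omega = \sigma^2/(1-\phi)^2\). The sketch below (illustrative parameters; the Bartlett-weighted truncated sum is a Newey-West style estimator with an assumed bandwidth) recovers \(\Omega\) from simulated data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stationary AR(1): Y_t = phi * Y_{t-1} + e_t, e_t ~ N(0, sigma2).
phi, sigma2, n = 0.5, 1.0, 200_000
e = rng.standard_normal(n)
y = np.empty(n)
y[0] = e[0]
for t in range(1, n):
    y[t] = phi * y[t - 1] + e[t]

Sigma = sigma2 / (1 - phi**2)      # contemporaneous variance Gamma(0) = 4/3
Omega = sigma2 / (1 - phi) ** 2    # long-run variance = 4: larger under positive correlation

# Newey-West style estimate: Bartlett-weighted sum of sample autocovariances.
yc = y - y.mean()
M = 200                            # assumed truncation lag (bandwidth choice)
omega_hat = np.mean(yc * yc)
for l in range(1, M + 1):
    w = 1 - l / (M + 1)
    omega_hat += 2 * w * np.mean(yc[l:] * yc[:-l])
```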
Check Your Understanding
- You are modeling daily stock returns \(r_t\). An efficient market hypothesis suggests returns should be unforecastable. A colleague says “returns are i.i.d.” but you notice that large absolute returns tend to cluster together (volatility clustering). Are returns consistent with being an MDS? Are they i.i.d.?
Returns can be consistent with MDS but not i.i.d. The MDS condition only requires \(E[r_t \mid \mathcal{F}_{t-1}] = 0\) (the mean return is unpredictable). Efficient markets do imply this.
However, i.i.d. requires the entire conditional distribution of \(r_t\), not just its mean, to be independent of the past. Volatility clustering means \(E[r_t^2 \mid \mathcal{F}_{t-1}]\) varies with past information (large squared returns predict large future squared returns). This violates i.i.d. but does not violate the MDS condition.
Moving Average Processes
MA(1) and MA(q)
A moving average process expresses \(Y_t\) as a weighted sum of current and past shocks. The MA(1) is:
\[Y_t = \mu + e_t + \theta e_{t-1}\]
Think of \(e_t\) as an unexpected shock (e.g., a surprise policy change or weather event). The MA(1) says that today’s \(Y_t\) is influenced by today’s shock and last period’s shock, but not shocks from further back. The parameter \(\theta\) governs how persistent the shock is.
Key moments: \[E[Y_t] = \mu, \quad \text{Var}(Y_t) = (1 + \theta^2)\sigma^2, \quad \rho(1) = \frac{\theta}{1 + \theta^2}, \quad \rho(k) = 0 \text{ for } k \geq 2\]
MA(q) processes have autocorrelations that cut off at lag \(q\). This is the signature pattern in the autocorrelation function (ACF) that tells you a process might be MA. You see significant autocorrelations at lags 1 through \(q\) and then essentially nothing.
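The cutoff pattern is easy to see in simulated data. A minimal sketch (numpy, with an assumed \(\theta = 0.6\)): the sample ACF matches \(\rho(1) = \theta/(1+\theta^2)\) at lag 1 and is essentially zero beyond.

```python
import numpy as np

rng = np.random.default_rng(4)

# MA(1): Y_t = mu + e_t + theta * e_{t-1}.
mu, theta, n = 0.0, 0.6, 200_000
e = rng.standard_normal(n + 1)
y = mu + e[1:] + theta * e[:-1]

def acf(x, k):
    """Sample autocorrelation at lag k."""
    x = x - x.mean()
    return np.mean(x[k:] * x[:-k]) / np.mean(x * x)

rho1_theory = theta / (1 + theta**2)        # ~ 0.441
rho1, rho2, rho3 = acf(y, 1), acf(y, 2), acf(y, 3)
```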
The MA(\(\infty\)) (also called a linear process) is: \[Y_t = \mu + \sum_{j=0}^{\infty} \theta_j e_{t-j}, \quad \sum_{j=0}^{\infty} |\theta_j| < \infty\]
This is exactly the Wold Decomposition: every stationary process is a (possibly infinite-order) MA.
Invertibility and the AR(∞) Representation
Just as AR models can be “inverted” into MA representations (under stationarity conditions), MA models can be inverted into AR representations, but only under an invertibility condition.
For MA(1) with \(|\theta| < 1\), the polynomial \(b(z) = 1 + \theta z\) is invertible (roots lie outside the unit circle), and you can write:
\[Y_t = \frac{\mu}{1+\theta} - \sum_{j=1}^{\infty} (-\theta)^j Y_{t-j} + e_t\]
If \(|\theta| \geq 1\), this infinite sum diverges and the inversion fails. The invertibility condition \(|\theta| < 1\) is the MA analog of the AR stationarity condition \(|\alpha_1| < 1\).
Check Your Understanding
- You estimate the ACF of a quarterly economic variable and find that \(\hat{\rho}(1)\) and \(\hat{\rho}(2)\) are significantly different from zero, but \(\hat{\rho}(k) \approx 0\) for all \(k \geq 3\). What type of model does this suggest? What would the ACF look like if the process were instead AR(1)?
The cutoff after lag 2 is the signature of an MA(2) process: MA(\(q\)) has \(\rho(k) = 0\) for all \(k > q\). An MA(2) allows correlation at lags 1 and 2 only, which matches the pattern.
An AR(1) process has \(\rho(k) = \alpha_1^k\), which decays gradually toward zero rather than cutting off sharply. For example, with \(\alpha_1 = 0.7\): \(\rho(1) = 0.7\), \(\rho(2) = 0.49\), \(\rho(3) = 0.34\), \(\rho(4) = 0.24\), and so on. There is no sharp cutoff; the autocorrelations taper off smoothly. If you saw this pattern, you’d suspect an AR model rather than MA.
Autoregressive Processes
AR(1): The Core Model
The AR(1) model is arguably the most important model in time series econometrics:
\[Y_t = \alpha_0 + \alpha_1 Y_{t-1} + e_t\]
The stationarity condition is \(|\alpha_1| < 1\). When this holds, \(Y_t\) can be written as an MA(\(\infty\)) by repeated substitution:
\[Y_t = \underbrace{\frac{\alpha_0}{1-\alpha_1}}_{\mu} + \sum_{j=0}^{\infty} \alpha_1^j e_{t-j}\]
Key moments (using stationarity, i.e., \(E[Y_t] = E[Y_{t-1}]\)):
\[E[Y_t] = \frac{\alpha_0}{1-\alpha_1}, \quad \text{Var}(Y_t) = \frac{\sigma^2}{1-\alpha_1^2}, \quad \rho(k) = \alpha_1^k\]
Autocorrelation pattern: Unlike MA processes, AR(1) autocorrelations decay gradually to zero. The speed of decay is governed by \(|\alpha_1|\): values close to 1 mean very slow decay (high persistence), while values close to 0 mean fast decay.
Economic intuition: In a job market, if each period a fraction \(1-\alpha_1\) of workers lose their jobs and a flow \(e_t\) of new workers enters, then employment follows AR(1) dynamics. A negative shock (recession) reduces employment, but some recovery occurs in each subsequent period. The AR(1) is mean-reverting when \(|\alpha_1| < 1\): after any shock, \(Y_t\) gradually returns to \(\mu\).
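These moments can be verified in a short simulation (illustrative coefficients): the sample mean, variance, and ACF line up with \(\alpha_0/(1-\alpha_1)\), \(\sigma^2/(1-\alpha_1^2)\), and \(\alpha_1^k\).

```python
import numpy as np

rng = np.random.default_rng(5)

# AR(1): Y_t = a0 + a1 * Y_{t-1} + e_t with |a1| < 1.
a0, a1, n = 1.0, 0.7, 300_000
e = rng.standard_normal(n)
y = np.empty(n)
y[0] = a0 / (1 - a1)                 # start at the stationary mean
for t in range(1, n):
    y[t] = a0 + a1 * y[t - 1] + e[t]

mean_theory = a0 / (1 - a1)          # = 10/3
var_theory = 1.0 / (1 - a1**2)       # ~ 1.96

yc = y - y.mean()
rho = [np.mean(yc[k:] * yc[:-k]) / np.mean(yc * yc) for k in (1, 2, 3)]
```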
Unit Roots and Non-Stationarity
When \(\alpha_1 = 1\) (with \(\alpha_0 = 0\)), the AR(1) becomes a random walk:
\[Y_t = Y_{t-1} + e_t = Y_0 + \sum_{j=1}^t e_j\]
This is fundamentally different from a stationary AR(1):
- No mean reversion: After a shock, \(Y_t\) does not drift back toward any long-run level. The effect of a shock is permanent.
- Variance grows with time: \(\text{Var}(Y_t) = t\sigma^2 \to \infty\), so the process is not covariance stationary.
- The AR polynomial is not invertible: \(\alpha(z) = 1 - z\) has a root at \(z = 1\) (on the unit circle), violating the \(|z| > 1\) condition.
Random walks are standard models for asset prices (the efficient market hypothesis implies that price changes should be unforecastable), exchange rates, and other “near-unit-root” macro variables like GDP levels. First-differencing transforms a random walk into stationary white noise: \(\Delta Y_t = e_t\).
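A quick simulation makes the variance growth visible. The sketch below (numpy, illustrative dimensions) draws many independent random-walk paths: the cross-path variance grows linearly in \(t\), and first-differencing recovers the white-noise shocks exactly.

```python
import numpy as np

rng = np.random.default_rng(6)

# Many independent random-walk paths with Y_0 = 0 and sigma^2 = 1.
n_paths, T = 20_000, 100
e = rng.standard_normal((n_paths, T))
y = np.cumsum(e, axis=1)             # Y_t = e_1 + ... + e_t

var_t50 = y[:, 49].var()             # Var(Y_50) ~ 50
var_t100 = y[:, 99].var()            # Var(Y_100) ~ 100: variance grows with t

# Differencing recovers the shocks exactly: Delta Y_t = e_t.
dy = np.diff(y, axis=1)
max_err = np.max(np.abs(dy - e[:, 1:]))
```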
AR(2) and the Stationarity Region
The AR(2) model:
\[Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \alpha_2 Y_{t-2} + e_t\]
is stationary if and only if all three conditions hold simultaneously:
\[\alpha_1 + \alpha_2 < 1, \quad \alpha_2 - \alpha_1 < 1, \quad \alpha_2 > -1\]
These define the triangular region in \((\alpha_1, \alpha_2)\) space. The characteristic roots \(\lambda_{1,2} = (\alpha_1 \pm \sqrt{\alpha_1^2 + 4\alpha_2})/2\) determine the dynamics:
- Real roots (above the parabola \(\alpha_2 = -\alpha_1^2/4\)): monotone decay, like two AR(1) processes in succession.
- Complex roots (below the parabola): oscillating dynamics. The process cycles around \(\mu\) with a period determined by the imaginary part of the roots. This can generate business-cycle-like patterns in the data.
The AR(2) is considerably more flexible than AR(1), able to capture both persistence and oscillation.
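The root conditions can be checked numerically. A minimal sketch (illustrative coefficients) using numpy's polynomial root finder: stationary parameter pairs give characteristic roots inside the unit circle, real above the parabola and complex below it.

```python
import numpy as np

# Characteristic roots solve lambda^2 - a1*lambda - a2 = 0.
def ar2_roots(a1, a2):
    return np.roots([1.0, -a1, -a2])

def is_stationary(a1, a2):
    # The three triangle conditions from the text.
    return (a1 + a2 < 1) and (a2 - a1 < 1) and (a2 > -1)

# Real roots (a1^2 + 4*a2 > 0, above the parabola): monotone decay.
r_real = ar2_roots(1.1, -0.2)
# Complex roots (a1^2 + 4*a2 < 0, below the parabola): damped oscillations.
r_complex = ar2_roots(1.2, -0.5)
```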
Impulse Response Functions
The impulse response function (IRF) answers: if there is a shock \(e_t = 1\) at time \(t\), what is the effect on \(Y_{t+j}\) for \(j = 0, 1, 2, \ldots\)?
Formally, \(b_j = \partial Y_{t+j}/\partial e_t\), the coefficients of the MA(\(\infty\)) representation. For an AR(p), these can be computed recursively:
\[b_0 = 1, \quad b_j = \alpha_1 b_{j-1} + \alpha_2 b_{j-2} + \cdots + \alpha_p b_{j-p}\]
The IRF tells you about the persistence of shocks:
- For a stationary AR(1): \(b_j = \alpha_1^j \to 0\). Shocks decay geometrically.
- For a random walk: \(b_j = 1\) for all \(j\). Shocks are permanent.
- For AR(2) with complex roots: the IRF oscillates, showing damped cycles.
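The recursion can be sketched directly (the helper name is ours, not from the text); missing lags \(b_{j-i}\) with \(j < i\) are treated as zero, since the shock cannot affect \(Y\) before it occurs.

```python
# Impulse responses via b_0 = 1, b_j = a1*b_{j-1} + ... + ap*b_{j-p}.
def irf(alphas, horizon):
    b = [1.0]
    for j in range(1, horizon + 1):
        b.append(sum(a * b[j - i] for i, a in enumerate(alphas, start=1) if j - i >= 0))
    return b

irf_ar1 = irf([0.7], 5)            # geometric decay: 1, 0.7, 0.49, ...
irf_rw = irf([1.0], 5)             # random walk: all ones, shocks permanent
irf_ar2 = irf([1.2, -0.8], 12)     # complex roots: damped oscillation, sign changes
```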
Check Your Understanding
- You estimate an AR(1) model for quarterly U.S. inflation and obtain \(\hat{\alpha}_1 = 0.85\). A colleague estimates the same model for Germany and gets \(\hat{\alpha}_1 = 0.30\). Compare the persistence of inflation shocks in the two countries. If inflation is 2 percentage points above target today, how long until it is within 0.5 pp of target in each country?
For U.S. with \(\alpha_1 = 0.85\): the autocorrelation at lag \(k\) is \(\rho(k) = 0.85^k\). A shock decays slowly. Starting 2 pp above target, the deviation after \(k\) quarters is \(2 \times 0.85^k\). Setting \(2 \times 0.85^k = 0.5\) gives \(k = \log(0.25)/\log(0.85) \approx 8.5\) quarters, about two years.
For Germany with \(\alpha_1 = 0.30\): \(2 \times 0.30^k = 0.5\) gives \(k = \log(0.25)/\log(0.30) \approx 1.1\) quarters, the deviation dissipates in about one quarter.
This illustrates why \(\alpha_1\) close to 1 is described as “high persistence”: it takes many periods for the effect of a shock to fade. In the extreme case \(\alpha_1 = 1\), the shock never dissipates.
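The back-of-envelope calculation generalizes to any \(\alpha_1\). A minimal helper (the function name is hypothetical): a deviation of `start` pp decays as \(\text{start} \cdot \alpha_1^k\), so solve \(\alpha_1^k = \text{threshold}/\text{start}\).

```python
import math

def periods_to_threshold(a1, start=2.0, threshold=0.5):
    """Periods until a deviation of `start` pp shrinks below `threshold` pp."""
    return math.log(threshold / start) / math.log(a1)

k_us = periods_to_threshold(0.85)   # about 8.5 quarters
k_de = periods_to_threshold(0.30)   # about 1.15 quarters
```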
ARMA Models and Identification
ARMA(p, q): Combining AR and MA
An ARMA(\(p\),\(q\)) model combines \(p\) autoregressive terms and \(q\) moving average terms:
\[Y_t = \alpha_0 + \sum_{i=1}^p \alpha_i Y_{t-i} + e_t + \sum_{j=1}^q \theta_j e_{t-j}\]
or compactly as \(\alpha(L)Y_t = \alpha_0 + \theta(L)e_t\).
The Wold theorem says an MA(\(\infty\)) can represent any stationary process, but fitting even an MA(10) requires estimating 10 MA parameters. An ARMA(1,1) has only 2 dynamic parameters yet can generate rich autocorrelation patterns that a pure AR or MA model would need many more parameters to match.
Stationarity and invertibility: An ARMA(\(p\),\(q\)) is stationary if the AR roots lie outside the unit circle (same condition as pure AR), and invertible if the MA roots lie outside the unit circle (same condition as pure MA).
If \(\Delta^d Y_t\) is ARMA(\(p\),\(q\)) but \(Y_t\) is not, then \(Y_t\) is ARIMA(\(p\), \(d\), \(q\)). The “\(I\)” stands for integrated. \(Y_t\) must be differenced \(d\) times to achieve stationarity. A random walk is ARIMA(0,1,0).
Exercise
In class, we saw how to “convert” an AR(1) process into an MA(\(\infty\)) process. Now, show how to transform an MA(1)
\[Y_t = \mu + e_t + \theta e_{t-1}\]
into an AR(\(\infty\)) process, assuming that the process extends into the infinite past (so the substitution can be iterated indefinitely) and \(|\theta| < 1\).
a) Show the transformation by substituting the terms.
\[ \begin{aligned} Y_t &= \mu + e_t + \theta e_{t-1} \\ Y_t &= \mu + e_t + \theta (Y_{t-1} - \mu - \theta e_{t-2}) \\ Y_t &= \mu (1- \theta) + e_t + \theta Y_{t-1} - \theta^2 e_{t-2} \\ Y_t &= \mu (1- \theta) + e_t + \theta Y_{t-1} - \theta^2 (Y_{t-2} - \mu - \theta e_{t-3}) \\ Y_t &= \mu (1- \theta + \theta^2) + e_t + \theta Y_{t-1} - \theta^2 Y_{t-2} + \theta^3 e_{t-3} \\ Y_t &= \mu (1- \theta + \theta^2 - \theta^3 + \cdots) + e_t + \theta Y_{t-1} - \theta^2 Y_{t-2} + \theta^3 Y_{t-3} - \cdots \\ Y_t - \theta Y_{t-1} + \theta^2 Y_{t-2} - \theta^3 Y_{t-3} + \cdots &= \mu \left(1 - \theta + \theta^2 - \theta^3 + \cdots\right) + e_t \\ \sum_{j=0}^\infty (-\theta)^j Y_{t-j} &= \mu \sum_{j=0}^\infty (-\theta)^j + e_t \end{aligned} \]
The sum \(\sum_{j=0}^\infty (-\theta)^j = \frac{1}{1+\theta}\) for \(|\theta| < 1\), so we have: \[ \sum_{j=0}^\infty (-\theta)^j Y_{t-j} = \frac{\mu}{1+\theta} + e_t \]
This is the AR(\(\infty\)) representation of the original MA(1) process.
b) Show the transformation using the lag operator.
Using the lag operator (note that \(L\mu = \mu\) since \(\mu\) is a constant, so \((1 + \theta L)^{-1}\mu = \mu/(1+\theta)\)): \[ \begin{aligned} Y_t &= \mu + e_t + \theta e_{t-1} \\ Y_t &= \mu + e_t + \theta Le_t \\ Y_t &= \mu + (1 + \theta L) e_t \\ (1 + \theta L)^{-1} Y_t &= (1 + \theta L)^{-1} \mu + e_t \\ (1 - (- \theta L))^{-1} Y_t &= (1 + \theta L)^{-1} \mu + e_t \\ \sum_{j=0}^\infty (-\theta)^j L^j Y_{t} &= \frac{\mu}{1 + \theta} + e_t \\ \sum_{j=0}^\infty (-\theta)^j Y_{t-j} &= \frac{\mu}{1 + \theta} + e_t \end{aligned} \]
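As a sanity check, the AR(\(\infty\)) identity can be verified on simulated data. The sketch below (illustrative parameters) truncates the infinite sum at \(J\) lags, which is harmless since \(\theta^J\) is negligible for \(|\theta| < 1\).

```python
import numpy as np

rng = np.random.default_rng(7)

# MA(1) simulation; check sum_j (-theta)^j Y_{t-j} = mu/(1+theta) + e_t at one date.
mu, theta, n, J = 1.0, 0.5, 5_000, 60
e = rng.standard_normal(n + 1)
y = mu + e[1:] + theta * e[:-1]     # y[t] = mu + e[t+1] + theta * e[t]

t = n - 1
lhs = sum((-theta) ** j * y[t - j] for j in range(J + 1))
rhs = mu / (1 + theta) + e[t + 1]   # "e_t" for y[t] is e[t+1] in this indexing
err = abs(lhs - rhs)
```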