Maximum Likelihood & Binary Choice Models

Theory Review

Maximum Likelihood Estimation

Core Concepts

Interactive visualization: rpsychologist

Likelihood of the data is the joint density as assumed by the model: \[f(x_1, x_2, ..., x_n | \theta) = \prod_{i=1}^n f(x_i|\theta)\]

Likelihood function is the joint density evaluated at the observed data: \[L_n (\theta) = f(X_1, X_2, ..., X_n | \theta) = \prod_{i=1}^n f(X_i|\theta)\]

MLE: \(\hat{\theta}\) is the value of \(\theta\) that maximizes \(L_n (\theta)\): \[\hat{\theta} = \underset{\theta \in \Theta}{\mathrm{argmax}} L_n (\theta)\]

Log-likelihood function has the same maximizer as \(L_n (\theta)\) (log is a monotonic transformation): \[l_n (\theta) \equiv \log(L_n (\theta)) = \sum_{i=1}^n \log(f(X_i|\theta))\]

Why use log-likelihood? Maximizing the log-likelihood is mathematically equivalent to maximizing the likelihood, but products become sums, making differentiation much easier.

Related Concepts:

  • Fisher Information Matrix: Expected value of the Hessian
  • Cramér-Rao Lower Bound: Lower bound on the variance of unbiased estimators
MLE Estimation Steps
  1. Construct \(f(x|\theta)\) as a function of \(x\) and \(\theta\)
  2. Take the logarithm: \(\log(f(x|\theta))\)
  3. Evaluate at \(x = X_i\) and sum: \(l_n(\theta) = \sum_{i=1}^n \log(f(X_i|\theta))\)
  4. If possible, solve the F.O.C. to find the maximum
  5. Check the S.O.C. to verify that it is a maximum
  6. If solving the F.O.C. is not possible, use numerical methods to maximize \(l_n(\theta)\)
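As a minimal numerical sketch of these steps (assuming a simulated Bernoulli sample, where the MLE is known in closed form to be \(\hat{p} = \bar{x}\), so the numerical answer can be checked):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Step 1: Bernoulli density f(x|p) = p^x (1-p)^(1-x); simulate a sample
rng = np.random.default_rng(42)
x = rng.binomial(1, 0.3, size=1000)

# Steps 2-3: take logs, evaluate at each X_i, and sum
def log_lik(p):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Step 6: maximize numerically (minimize the negative log-likelihood
# over the interior of [0, 1] to avoid log(0))
res = minimize_scalar(lambda p: -log_lik(p), bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())  # the numerical MLE matches the analytic solution p_hat = x_bar
```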
Score and Hessian

Likelihood Score is the derivative of the log-likelihood function: \[S_n (\theta) = \frac{\partial l_n(\theta)}{\partial \theta} = \sum_{i=1}^n \frac{\partial \log(f(X_i|\theta))}{\partial \theta}\]

  • \(S_n (\theta)\) measures the “sensitivity” of \(l_n(\theta)\) to \(\theta\)
  • When \(\hat{\theta}\) is an interior solution, \(S_n (\hat{\theta}) = 0\) (first-order condition)

Likelihood Hessian is the negative of the second derivative of the log-likelihood: \[H_n (\theta) = - \frac{\partial^2 l_n(\theta)}{\partial \theta \partial \theta '} = - \sum_{i=1}^n \frac{\partial^2 \log(f(X_i|\theta))}{\partial \theta \partial \theta '}\]

  • Shows the degree of curvature in the log-likelihood
  • Larger values indicate greater curvature → more precise estimation (smaller standard errors)
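To make the curvature-precision link concrete, here is a small sketch using the Bernoulli model, where the Hessian at the MLE works out to \(H_n(\hat{p}) = n/(\hat{p}(1-\hat{p}))\) (the value \(\hat{p} = 0.3\) is hypothetical, held fixed to isolate the effect of \(n\)):

```python
import numpy as np

# Bernoulli log-likelihood: l_n(p) = sum_i [x_i log p + (1 - x_i) log(1 - p)]
# At the MLE p_hat = x_bar, the Hessian (negative second derivative) is n / (p_hat(1-p_hat))
p_hat = 0.3  # hypothetical estimate, the same at both sample sizes
for n in (100, 10_000):
    H = n / (p_hat * (1 - p_hat))  # curvature at the peak of l_n
    se = np.sqrt(1 / H)            # standard error from the inverse curvature
    print(f"n={n}: H={H:.0f}, se={se:.4f}")
# more data -> larger Hessian (sharper peak) -> smaller standard error
```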
Check Your Understanding

Why do we maximize the log-likelihood instead of the likelihood function?

The log transformation converts products into sums, making differentiation easier. Since log is a monotonic transformation, the argmax is preserved - whatever maximizes the likelihood also maximizes the log-likelihood.

What does it mean when the likelihood score \(S_n(\hat{\theta}) = 0\)?

This is the first-order condition for a maximum. It means the derivative of the log-likelihood with respect to \(\theta\) equals zero at \(\hat{\theta}\), indicating we’ve found a critical point (potentially a maximum).

If the Hessian has larger absolute values, what does this tell us about the likelihood function?

Larger absolute values in the Hessian indicate greater curvature - meaning the likelihood is more “peaked” around the maximum. This generally means more precise estimation (smaller standard errors).

Exercise: Logit MLE

Consider a dependent variable \(Y\) that takes values 1 and 0 with probabilities \(G(X'\beta)\) and \(1 - G(X'\beta)\).

Assume a logit model is appropriate: \(G(X'\beta) = \frac{e^{X'\beta}}{1 + e^{X'\beta}}\)

In this exercise, \(X'\beta\) is a scalar (it’s the inner product of vectors), even though we write it in matrix notation. Here’s why the derivatives work out simply:

  • \(X_i\) is a \(k \times 1\) vector of covariates for observation \(i\)
  • \(\beta\) is a \(k \times 1\) vector of parameters
  • \(X_i'\beta\) is a scalar (the linear prediction for observation \(i\))

When we take \(\frac{\partial}{\partial \beta}\) of a scalar function that depends on \(\beta\) through \(X'\beta\), we use the chain rule: \[\frac{\partial f(X'\beta)}{\partial \beta} = \frac{\partial f(X'\beta)}{\partial (X'\beta)} \cdot \frac{\partial (X'\beta)}{\partial \beta} = f'(X'\beta) \cdot X\]

The key: \(\frac{\partial (X'\beta)}{\partial \beta} = X\) (this is a standard vector derivative result).

For the second derivative (Hessian), we’re taking \(\frac{\partial}{\partial \beta'}\) of the \(k \times 1\) score vector, giving us a \(k \times k\) matrix. Since each term of the score is \(f'(X_i'\beta) X_i\), the Hessian involves \(f''(X_i'\beta) X_i X_i'\).

Part (a): Write out the conditional probability mass function.

\[\pi(Y|X) = G(X'\beta)^Y (1 - G(X'\beta))^{1-Y}\]

This is a Bernoulli distribution where the probability parameter depends on \(X\).

Part (b): Show that for the two outcomes, the probabilities can be written as \(G((2Y - 1)X'\beta)\).

For \(Y=1\): \((2(1)-1)X'\beta = X'\beta\), so \(G(X'\beta) = P(Y=1|X)\)

For \(Y=0\): \((2(0)-1)X'\beta = -X'\beta\)

and \(G(-X'\beta) = \frac{e^{-X'\beta}}{1+e^{-X'\beta}} = \frac{1}{1+e^{X'\beta}} = 1 - G(X'\beta) = P(Y=0|X)\)

This compact notation works for both outcomes.
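A quick numerical check of the identity (a sketch with an arbitrary index value \(u = X'\beta = 0.7\)):

```python
import numpy as np

def G(u):  # logistic CDF
    return 1 / (1 + np.exp(-u))

u = 0.7  # an arbitrary value of X'beta
standard = {1: G(u), 0: 1 - G(u)}                  # P(Y=1|X) and P(Y=0|X)
compact = {y: G((2 * y - 1) * u) for y in (0, 1)}  # G((2Y-1)X'beta)
print(standard, compact)  # the two forms agree for both outcomes
```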

Part (c): Given a sample of size \(N\), write out the likelihood function and log-likelihood function.

Likelihood (standard form): \[L_N(\beta) = \prod_{i=1}^N G(X_i'\beta)^{Y_i}(1-G(X_i'\beta))^{1-Y_i}\]

Likelihood (compact form): \[L_N(\beta) = \prod_{i=1}^N G(Z_i'\beta) \quad \text{where } Z_i = (2Y_i-1)X_i\]

The compact form uses the fact that \(G(-u) = 1 - G(u)\) for the logistic function:

  • When \(Y_i = 1\): \(Z_i = X_i\), so \(G(Z_i'\beta) = G(X_i'\beta) = P(Y_i=1|X_i)\)
  • When \(Y_i = 0\): \(Z_i = -X_i\), so \(G(Z_i'\beta) = G(-X_i'\beta) = 1 - G(X_i'\beta) = P(Y_i=0|X_i)\)

Log-likelihood (standard form): \[l_N(\beta) = \sum_{i=1}^N \left[Y_i \log G(X_i'\beta) + (1-Y_i)\log(1-G(X_i'\beta))\right]\]

Log-likelihood (compact form): \[l_N(\beta) = \sum_{i=1}^N \log G(Z_i'\beta) = \sum_{i=1}^N \log G((2Y_i-1)X_i'\beta)\]

Part (d): Derive the likelihood score \(S_N(\beta)\) necessary for maximizing \(l_N(\beta)\).

Using the compact form and chain rule:

\[S_N(\beta) = \frac{\partial l_N(\beta)}{\partial \beta} = \sum_{i=1}^N \frac{G'((2Y_i-1)X_i'\beta)}{G((2Y_i-1)X_i'\beta)} (2Y_i-1)X_i\]

Since \(G'(z) = G(z)(1-G(z))\) for the logistic function:

\[S_N(\beta) = \sum_{i=1}^N (2Y_i-1)(1-G((2Y_i-1)X_i'\beta))X_i\]

We can also derive this directly from the non-compact log-likelihood. Taking the derivative:

\[S_N(\beta) = \sum_{i=1}^N \left[Y_i \frac{G'(X_i'\beta)}{G(X_i'\beta)} - (1-Y_i)\frac{G'(X_i'\beta)}{1-G(X_i'\beta)}\right]X_i\]

Using \(G'(z) = G(z)(1-G(z))\) and simplifying:

\[S_N(\beta) = \sum_{i=1}^N \left[Y_i(1-G(X_i'\beta)) - (1-Y_i)G(X_i'\beta)\right]X_i = \sum_{i=1}^N (Y_i - G(X_i'\beta))X_i\]

This form is cleaner and shows that the score is the sum of residuals \(Y_i - \hat{p}_i\), where \(\hat{p}_i = G(X_i'\beta)\) is the fitted probability, weighted by \(X_i\).
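The simplified score can be checked against a finite-difference gradient of the log-likelihood (a sketch with simulated data; the design and coefficients below are invented for illustration):

```python
import numpy as np

def G(u):  # logistic CDF
    return 1 / (1 + np.exp(-u))

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
beta = np.array([0.5, -1.0, 0.25])
Y = (rng.uniform(size=n) < G(X @ beta)).astype(float)

def log_lik(b):
    p = G(X @ b)
    return np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

# analytic score: sum_i (Y_i - G(X_i'beta)) X_i
score = X.T @ (Y - G(X @ beta))

# central finite differences of l_N along each coordinate of beta
eps = 1e-6
fd = np.array([(log_lik(beta + eps * e) - log_lik(beta - eps * e)) / (2 * eps)
               for e in np.eye(k)])
print(np.allclose(score, fd, atol=1e-4))  # True
```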

Part (e): Derive the likelihood Hessian \(H_N(\beta)\). Does this point to a maximum or minimum likelihood?

\[H_N(\beta) = -\frac{\partial^2 l_N(\beta)}{\partial \beta \partial \beta'} = \sum_{i=1}^N G(X_i'\beta)(1-G(X_i'\beta))X_iX_i'\]

(Differentiating the score \(\sum_i (Y_i - G(X_i'\beta))X_i\) with respect to \(\beta'\) gives \(-\sum_i G'(X_i'\beta)X_iX_i'\); negating and using \(G' = G(1-G)\) gives the expression above.)

This can be written as: \[H_N(\beta) = \sum_{i=1}^N G_i(1-G_i)X_iX_i'\]

where \(G_i = G(X_i'\beta)\).

Since \(0 < G_i < 1\), we have \(G_i(1-G_i) > 0\), and each \(X_iX_i'\) is positive semi-definite. Therefore \(H_N(\beta)\) is positive semi-definite (and positive definite when the stacked \(X_i\) have full column rank), so the second derivative \(\frac{\partial^2 l_N(\beta)}{\partial \beta \partial \beta'} = -H_N(\beta)\) is negative definite: the log-likelihood is globally concave, and the critical point from the F.O.C. is a maximum of the likelihood function (recall the Hessian is defined as the negative of the second derivative).
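Because the score and Hessian are both available in closed form, the logit MLE can be computed by Newton-Raphson; a sketch on simulated data (the design and true coefficients are invented for illustration):

```python
import numpy as np

def G(u):  # logistic CDF
    return 1 / (1 + np.exp(-u))

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
Y = (rng.uniform(size=n) < G(X @ beta_true)).astype(float)

beta = np.zeros(2)
for _ in range(25):
    p = G(X @ beta)
    S = X.T @ (Y - p)                        # score from part (d)
    H = (X * (p * (1 - p))[:, None]).T @ X   # Hessian from part (e), positive definite
    beta = beta + np.linalg.solve(H, S)      # Newton step: climb until S = 0
print(beta)  # close to beta_true in large samples
```

Global concavity of the log-likelihood is what guarantees these iterations converge to the unique maximum.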


Binary Choice Models

Core Concepts

In many situations, your dependent variable is binary (takes only values 0 or 1). Examples include:

  • Employment status (employed/unemployed)
  • Firm exit decisions
  • Purchase decisions (buy/don’t buy)
  • Arrest outcomes

The key relationship of interest is the response probability: \[P(x) = \text{Prob}[Y = 1| X = x] = E[Y | X = x]\]

We also care about marginal effects: \[\frac{\partial P(x)}{\partial x} = \frac{\partial \text{Prob}[Y = 1| X = x]}{\partial x}\]

The Regression Model:

We can write this as: \[Y = P(X) + e, \quad \text{where } E[e| X ] = 0\]

The error term is binary with conditional variance: \[\text{Var}[e| X ] = P(X)(1 - P(X))\]

This means the error is heteroskedastic - its variance depends on \(X\).
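A quick simulation confirms the conditional variance formula (a sketch at a single fixed \(x\), with a hypothetical response probability \(P(x) = 0.8\)):

```python
import numpy as np

rng = np.random.default_rng(3)
P = 0.8                       # response probability at some fixed x (hypothetical)
Y = rng.binomial(1, P, size=100_000)
e = Y - P                     # regression error e = Y - P(X) at this x
print(e.var(), P * (1 - P))   # both close to 0.16: Var[e|X] = P(X)(1 - P(X))
```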
Three Models for Binary Choice

1. Linear Probability Model (LPM): \[P(x) = x'\beta\]

2. Probit Model: \[P(x) = \Phi(x'\beta)\] where \(\Phi(\cdot)\) is the standard normal CDF.

3. Logit Model: \[P(x) = \Lambda(x'\beta) = \frac{1}{1 + e^{-x'\beta}}\] where \(\Lambda(\cdot)\) is the logistic CDF.

Both probit and logit are index models: \[P(x) = G(x'\beta)\] where \(G(\cdot)\) is a CDF that is symmetric around 0: \(G(-u) = 1 - G(u)\).

Key Properties:

  • Both constrain probabilities to [0,1]
  • Both estimated via MLE (the log-likelihood is globally concave)
  • Logit coefficients ≈ 1.8 × probit coefficients
  • Give very similar predictions in practice
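The ≈1.8 coefficient scaling can be seen by fitting both models to the same data by MLE; a sketch simulating from a probit DGP with invented coefficients (using only scipy, no specialized econometrics library assumed):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
Y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)  # probit DGP

def neg_log_lik(beta, cdf):
    p = np.clip(cdf(X @ beta), 1e-10, 1 - 1e-10)  # guard against log(0)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

logistic = lambda z: 1 / (1 + np.exp(-z))
b_probit = minimize(neg_log_lik, np.zeros(2), args=(norm.cdf,)).x
b_logit = minimize(neg_log_lik, np.zeros(2), args=(logistic,)).x
print(b_logit / b_probit)  # elementwise ratios in the rough neighborhood of 1.6-1.8
```

In finite samples the empirical ratio often comes out nearer 1.6 than 1.8; the point is the common scale factor, not its exact value.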

When to use each:

  • LPM: Quick analysis, panel data, IV estimation, when probabilities stay in [0,1]
  • Probit/Logit: Need probabilities in [0,1], testing distributional assumptions
Marginal Effects

Why They Matter:

Unlike LPM, coefficients in probit/logit are not marginal effects. The marginal effect of \(x_j\) is: \[\frac{\partial P(x)}{\partial x_j} = \beta_j \cdot g(x'\beta)\]

where \(g(u) = \frac{\partial G(u)}{\partial u}\) is the density function.

Key insight: The marginal effect depends on both \(\beta\) AND the value of \(x\) (through \(g(x'\beta)\)).

Average Marginal Effects (AME):

Averages over the distribution of \(X\): \[\text{AME}_j = E\left[\frac{\partial P(X)}{\partial x_j}\right] = \beta_j \cdot E[g(X'\beta)]\]

Estimation: \[\widehat{\text{AME}}_j = \hat{\beta}_j \cdot \frac{1}{n}\sum_{i=1}^n g(X_i'\hat{\beta})\]

Check Your Understanding

When can we use OLS for a binary dependent variable?

We can use OLS (this gives the Linear Probability Model), but it has limitations, primarily that the predicted probabilities can fall below 0 or above 1.

Why are logit coefficients about 1.8 times larger than probit coefficients?

This comes from the normalization of the latent variable variance:

  • Probit assumes \(\sigma = 1\) (standard normal)
  • Logit assumes \(\sigma = \pi/\sqrt{3} \approx 1.814\)

Since we estimate \(\beta^* = \beta/\sigma\), and logit has a larger \(\sigma\), its coefficients must be proportionally larger to give the same predicted probabilities. This is just a scaling difference - both models give similar predictions and marginal effects.

Exercise: Binary Choice Applications

Consider a probit model where the probability of employment is: \[P(\text{employed}|X) = \Phi(\beta_0 + \beta_1 \cdot \text{education} + \beta_2 \cdot \text{experience})\]

Suppose you estimate: \(\hat{\beta}_0 = -2\), \(\hat{\beta}_1 = 0.3\), \(\hat{\beta}_2 = 0.05\)

Part (a): What is the predicted probability of employment for someone with 12 years of education and 10 years of experience? Use \(\Phi(2.1) \approx 0.98\).

Calculate the index: \[z = -2 + 0.3(12) + 0.05(10) = -2 + 3.6 + 0.5 = 2.1\]

Since \(\Phi(2.1) \approx 0.98\), the predicted probability is approximately 98%.

Part (b): Calculate the marginal effect of education at this point, given that \(\phi(2.1) \approx 0.044\) (where \(\phi\) is the standard normal PDF).

The marginal effect of education for probit is: \[\frac{\partial P}{\partial \text{education}} = \beta_1 \cdot \phi(X'\beta) = 0.3 \times 0.044 = 0.0132\]

Interpretation: At this combination of education and experience, an additional year of education increases the probability of employment by approximately 1.3 percentage points.
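Parts (a) and (b) can be reproduced with scipy (a sketch; the coefficients are those given in the exercise):

```python
from scipy.stats import norm

b0, b1, b2 = -2.0, 0.3, 0.05        # estimated probit coefficients
z = b0 + b1 * 12 + b2 * 10          # index: 12 yrs education, 10 yrs experience
p = norm.cdf(z)                     # part (a): predicted probability
me_educ = b1 * norm.pdf(z)          # part (b): marginal effect of education
print(round(z, 2), round(p, 3), round(me_educ, 4))  # 2.1, 0.982, 0.0132
```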

Part (c): Why is the marginal effect in part (b) smaller than the coefficient \(\beta_1 = 0.3\)?

The marginal effect is always scaled by the density \(\phi(X'\beta)\), which is at most 0.399 (at \(z=0\)) and decreases as we move into the tails.

At \(z = 2.1\), we’re in the right tail of the distribution where \(\phi(2.1) \approx 0.044\) is quite small. This means:

  • The probability is already very high (98%)
  • There’s not much “room” for it to increase further
  • The marginal effect is correspondingly small

This illustrates a key property of nonlinear models: marginal effects vary with the levels of the explanatory variables.

Part (d): If you instead estimated a logit model and obtained coefficients \(\tilde{\beta}_0 = -3.6\), \(\tilde{\beta}_1 = 0.54\), \(\tilde{\beta}_2 = 0.09\), would you be concerned about model misspecification? Why or why not?

No, this is not concerning. The logit coefficients are approximately 1.8 times the probit coefficients:

  • \(-3.6 / -2 = 1.8\)
  • \(0.54 / 0.3 = 1.8\)
  • \(0.09 / 0.05 = 1.8\)

This ratio arises from the different variance normalizations:

  • Probit: assumes \(\text{Var}(\epsilon) = 1\)
  • Logit: assumes \(\text{Var}(\epsilon) = \pi^2/3\), so uses scaling factor \(\sigma = \pi/\sqrt{3} \approx 1.814\)

As a consistency check, calculate predicted probabilities and marginal effects from both models. If these are similar, both models are giving the same answer, just on different parameter scales.

Part (e): Suppose the sample has 1,000 observations with mean education = 13 years and mean experience = 8 years. Describe how you would calculate the Average Marginal Effect (AME) of education.

For a probit model, the AME of education is: \[\widehat{\text{AME}}_{\text{education}} = \hat{\beta}_1 \cdot \frac{1}{n}\sum_{i=1}^n \phi(X_i'\hat{\beta})\]

Steps:

  1. For each individual \(i\) in the sample, calculate their index: \(z_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{education}_i + \hat{\beta}_2 \cdot \text{experience}_i\)
  2. Evaluate the standard normal density at each index: \(\phi(z_i)\)
  3. Average these densities: \(\bar{\phi} = \frac{1}{1000}\sum_{i=1}^{1000} \phi(z_i)\)
  4. Multiply by the coefficient: \(\widehat{\text{AME}}_{\text{education}} = 0.3 \times \bar{\phi}\)

This gives the average effect across all individuals in the sample, accounting for the fact that marginal effects vary with covariate levels.
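A sketch of the same steps in code, using a hypothetical sample of 1,000 individuals whose education and experience are drawn to roughly match the stated means:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
educ = rng.normal(13, 2, size=1000)   # hypothetical covariates matching the stated means
exper = rng.normal(8, 3, size=1000)

b0, b1, b2 = -2.0, 0.3, 0.05
z = b0 + b1 * educ + b2 * exper       # step 1: each individual's index
dens = norm.pdf(z)                    # step 2: standard normal density at each index
ame_educ = b1 * dens.mean()           # steps 3-4: average the densities, scale by beta_1
print(ame_educ)  # the AME of education over this sample
```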