Generalized Method of Moments & Multinomial Choice Models

Theory Review

Generalized Method of Moments (GMM)

Core Intuition

Instead of specifying the entire distribution of the data (as in MLE), GMM only requires that we specify moment conditions that the parameters should satisfy. These are relationships that we expect to hold in the population.

Example Moment Conditions:

For a regression model \(Y = X'\beta + \epsilon\) with \(E[\epsilon|X] = 0\):

\[E[X\epsilon] = E[X(Y - X'\beta)] = 0\]

This gives us as many moment conditions as we have variables in \(X\).

The GMM Principle:

Find parameter values that make the sample analogs of these moment conditions “as close to zero as possible.”

How GMM Works

Step 1: Specify Moment Conditions

Start with population moment conditions: \[E[g_i(\beta)] = 0\]

where \(g_i(\beta)\) is an \(l \times 1\) vector of moment functions and \(\beta\) is a \(k \times 1\) vector of parameters.

Step 2: Form Sample Moments

Replace population expectations with sample averages: \[\bar{g}_n(\beta) = \frac{1}{n}\sum_{i=1}^n g_i(\beta)\]

Step 3: Choose Parameters

Find \(\hat{\beta}\) that minimizes a weighted distance from zero: \[\hat{\beta}_{GMM} = \underset{\beta}{\mathrm{argmin}} \; n \bar{g}_n(\beta)' W_n \bar{g}_n(\beta)\]

where \(W_n\) is a weighting matrix; the scaling by \(n\) defines the criterion function \(J(\beta)\), whose value at \(\hat{\beta}\) is used to test the over-identifying restrictions.

When we have exactly as many moment conditions as parameters (\(l = k\), just-identified), we can set the moments exactly to zero. When we have more moment conditions than parameters (\(l > k\), over-identified), we minimize a weighted sum of squared moments.
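The three steps can be sketched numerically. Everything below (the simulated design, instrument strength, sample size, seed) is illustrative rather than from the text; the moments are \(g_i(\beta) = Z_i(Y_i - X_i'\beta)\) with \(l = 3 > k = 2\), so this is the over-identified case.

```python
# Sketch of the three GMM steps on simulated data (design is illustrative).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
beta_true = np.array([1.0, -0.5])

Z = rng.normal(size=(n, 3))            # l = 3 instruments, k = 2 parameters
u = rng.normal(size=n)                 # structural error, correlated with X
X = Z[:, :2] + 0.3 * Z[:, 2:3] + 0.5 * u[:, None] + rng.normal(scale=0.5, size=(n, 2))
Y = X @ beta_true + u

def gbar(b):
    """Step 2: sample moments (1/n) sum_i Z_i (Y_i - X_i'b)."""
    return Z.T @ (Y - X @ b) / n

def J(b, W):
    """Step 3: GMM criterion n * gbar' W gbar."""
    g = gbar(b)
    return n * g @ W @ g

W = np.eye(Z.shape[1])                 # simple (non-optimal) weighting matrix
beta_hat = minimize(lambda b: J(b, W), x0=np.zeros(2)).x
```

With more moments than parameters the criterion cannot be driven exactly to zero, but the minimizer is still consistent for \(\beta\) under any positive definite \(W\).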

GMM vs Other Estimators

GMM encompasses many familiar estimators:

OLS as GMM:

Moment condition: \(E[X(Y - X'\beta)] = 0\)

This is just-identified (same number of moments as parameters), so GMM reduces to OLS.

IV as GMM:

Moment condition: \(E[Z(Y - X'\beta)] = 0\)

where \(Z\) are instruments. The GMM estimator is: \[\hat{\beta}_{GMM} = (X'ZWZ'X)^{-1}X'ZWZ'Y\]

When we choose \(W = (Z'Z)^{-1}\), GMM equals 2SLS.
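A minimal numeric check of this equivalence on simulated data (the design and seed are made up): the closed-form GMM estimator with \(W = (Z'Z)^{-1}\) matches two explicit 2SLS stages.

```python
# Closed-form IV-GMM with W = (Z'Z)^{-1} versus explicit two-stage least
# squares, on simulated data (design and seed are illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 3000
beta_true = np.array([2.0, 1.0])

Z = rng.normal(size=(n, 3))
u = rng.normal(size=n)                               # error correlated with X
X = Z[:, :2] + 0.5 * Z[:, 2:3] + 0.5 * u[:, None] + rng.normal(size=(n, 2))
Y = X @ beta_true + u

W = np.linalg.inv(Z.T @ Z)
A = X.T @ Z @ W @ Z.T
beta_gmm = np.linalg.solve(A @ X, A @ Y)             # (X'ZWZ'X)^{-1} X'ZWZ'Y

first_stage = np.linalg.lstsq(Z, X, rcond=None)[0]   # regress X on Z
Xhat = Z @ first_stage                               # first-stage fitted values
beta_2sls = np.linalg.lstsq(Xhat, Y, rcond=None)[0]  # regress Y on fitted X
```

The two estimates agree to machine precision, which is exactly the algebraic identity \(\hat{\beta}_{2SLS} = (X'P_ZX)^{-1}X'P_ZY\) with \(P_Z = Z(Z'Z)^{-1}Z'\).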

MLE as GMM:

Under correct specification, the score function gives moment conditions: \(E[S(\theta)] = 0\)

GMM is more general because it does not require full distributional assumptions.

Advantages of GMM:

  • Flexible: only requires moment conditions, not full distribution
  • Robust: valid as long as the moment conditions hold, even if other features of the model are misspecified
  • Natural framework for models with more moment conditions than parameters
  • Allows for heteroskedasticity and correlation in errors

Efficient GMM:

The optimal weighting matrix is \(W = \Omega^{-1}\) where \(\Omega = E[Z_iZ_i'e_i^2]\). This gives the efficient GMM estimator with asymptotic variance: \[V_{\beta} = (Q'\Omega^{-1}Q)^{-1}\]

where \(Q = E[Z_iX_i']\).
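A two-step sketch of efficient GMM under these formulas, on simulated heteroskedastic data (all numbers illustrative): estimate once with \(W = (Z'Z)^{-1}\), form \(\hat{\Omega}\) from the first-step residuals, then re-estimate with \(\hat{\Omega}^{-1}\).

```python
# Two-step efficient GMM sketch on simulated heteroskedastic data.
import numpy as np

rng = np.random.default_rng(2)
n = 4000
beta_true = np.array([1.0, 0.5])

Z = rng.normal(size=(n, 3))
u = rng.normal(size=n) * (1 + 0.5 * np.abs(Z[:, 0]))   # heteroskedastic error
X = Z[:, :2] + 0.4 * Z[:, 2:3] + 0.5 * u[:, None] + rng.normal(size=(n, 2))
Y = X @ beta_true + u

def iv_gmm(W):
    """Closed-form IV-GMM estimator for a given weighting matrix."""
    A = X.T @ Z @ W @ Z.T
    return np.linalg.solve(A @ X, A @ Y)

b1 = iv_gmm(np.linalg.inv(Z.T @ Z))                    # step 1: 2SLS weighting
e = Y - X @ b1                                         # first-step residuals
Omega_hat = (Z * e[:, None] ** 2).T @ Z / n            # (1/n) sum_i Z_i Z_i' e_i^2
b2 = iv_gmm(np.linalg.inv(Omega_hat))                  # step 2: efficient GMM
```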

When to Use GMM:

  • You have valid moment conditions but do not want to specify the full distribution
  • You have more moment conditions than parameters (over-identification)
  • You want robust inference in the presence of heteroskedasticity or correlation
  • You are working with panel data or time series where standard assumptions may fail

Check Your Understanding

What is the key difference between MLE and GMM?

MLE requires specifying the entire probability distribution of the data. GMM only requires specifying moment conditions (expected values of certain functions). This makes GMM more flexible and robust, though potentially less efficient when the distribution is correctly specified.

Why do we need a weighting matrix in GMM?

When we have more moment conditions than parameters (over-identified case), we cannot set all moments exactly to zero. The weighting matrix determines how much weight to give each moment condition when forming the objective function. The optimal weighting matrix \(W = \Omega^{-1}\) accounts for the variance and covariance of the moments and yields the efficient GMM estimator.

How does OLS fit into the GMM framework?

OLS uses the moment condition \(E[X(Y - X'\beta)] = 0\), which comes from the assumption that errors are uncorrelated with regressors. This gives exactly as many moments as parameters (\(l = k\)), so the GMM estimator sets these sample moments to zero, which yields the OLS formula \(\hat{\beta} = (X'X)^{-1}X'Y\).
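A quick numeric confirmation (simulated data, purely illustrative): solving the sample moment equations \(X'(Y - X\beta) = 0\) reproduces the least-squares fit.

```python
# Solving the sample moment equations X'(Y - X b) = 0 reproduces OLS.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
Y = X @ np.array([1.0, -2.0]) + rng.normal(size=500)

beta_mom = np.linalg.solve(X.T @ X, X.T @ Y)      # root of the moment equations
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]   # direct least-squares fit
```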


Multinomial Choice Models

Core Concepts

Many economic decisions involve choosing among multiple discrete alternatives:

  • Which mode of transportation to take (car, bus, train, bike)
  • Which product to purchase from a set of competing brands
  • Which location to invest in or where to locate a business
  • Which occupation or major to choose

The dependent variable is categorical with more than two outcomes, and these outcomes have no natural ordering (or we ignore the ordering).

Utility Framework:

The general setup assumes that choices reflect underlying utility. Let the latent utility from option \(j\) be: \[U_j^* = X'\beta_j + \epsilon_j\]

While we do not observe \(U_j^*\), we observe \(Y\) (the option chosen): \[Y = j \text{ if } U_j^* \geq U_l^* \text{ for all } l\]

Identification:

Two normalizations are needed:

  1. Only differences in \(\beta_j\) are identified, so we set one \(\beta_j = 0\) (base alternative)
  2. We fix the variance of the error terms (typically to 1 or \(\pi^2/3\))

Multinomial Logit Model

Basic Model:

For individual \(i\) choosing among \(J\) alternatives, the probability of choosing alternative \(j\) is:

\[P_j(x) = \frac{\exp(x'\beta_j)}{\sum_{k=1}^J \exp(x'\beta_k)}\]

This assumes the error terms \(\epsilon_j\) follow a Type I Extreme Value distribution and are independent across alternatives.
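The probability formula is straightforward to evaluate directly. A sketch with made-up coefficients, normalizing the base alternative's coefficient vector to zero as in the identification discussion above:

```python
# Direct evaluation of P_j(x) with made-up coefficients; the base
# alternative's coefficient vector is normalized to zero.
import numpy as np

rng = np.random.default_rng(4)
J, k = 4, 3
betas = np.vstack([np.zeros(k), rng.normal(size=(J - 1, k))])  # beta_1 = 0 (base)
x = rng.normal(size=k)

v = betas @ x                       # utilities x'beta_j, one per alternative
v -= v.max()                        # stabilize before exponentiating
P = np.exp(v) / np.exp(v).sum()     # multinomial logit choice probabilities
```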

Maximum Likelihood Estimation:

The log-likelihood function is: \[\ell_n(\beta) = \sum_{i=1}^n \sum_{j=1}^J \mathbb{1}\{Y_i = j\} \log(P_j(X_i|\beta))\]

The MLE maximizes this function. Since the log-likelihood is globally concave in the parameters, numerical optimization reliably finds the global maximum.
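A minimal MLE sketch on simulated data (the design, sample size, and starting values are illustrative); scipy's minimizer is applied to the negative log-likelihood, with the base alternative's coefficients fixed at zero:

```python
# Multinomial logit MLE sketch: minimize the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, J, k = 3000, 3, 2
beta_true = np.vstack([np.zeros(k), [[1.0, -1.0], [0.5, 0.5]]])  # base = 0

X = rng.normal(size=(n, k))
V = X @ beta_true.T
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
Y = np.array([rng.choice(J, p=p) for p in P])      # simulated choices

def negloglik(theta):
    b = np.vstack([np.zeros(k), theta.reshape(J - 1, k)])
    V = X @ b.T
    V -= V.max(axis=1, keepdims=True)              # numerical stabilization
    logP = V - np.log(np.exp(V).sum(axis=1, keepdims=True))
    return -logP[np.arange(n), Y].sum()

theta_hat = minimize(negloglik, np.zeros((J - 1) * k), method="BFGS").x
beta_hat = theta_hat.reshape(J - 1, k)
```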

Marginal Effects:

Unlike linear models, marginal effects are functions of the data: \[\delta_j(x) = \frac{\partial P_j(x)}{\partial x} = P_j(x)(\beta_j - \sum_{l=1}^J P_l(x)\beta_l)\]

Average Marginal Effects (AME):

\[\hat{\delta}_j = \frac{1}{n}\sum_{i=1}^n \hat{\delta}_j(X_i)\]
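The two formulas combine into a short computation; the coefficients and data below are made up. Because \(\sum_j \delta_j(x) = 0\), the per-observation effects across alternatives sum to zero, which makes a useful sanity check.

```python
# delta_j(x) = P_j(x) (beta_j - sum_l P_l(x) beta_l), averaged over the sample.
import numpy as np

rng = np.random.default_rng(6)
n, J, k = 1000, 3, 2
betas = np.vstack([np.zeros(k), [[1.0, -0.5], [0.3, 0.8]]])   # base alternative = 0
X = rng.normal(size=(n, k))

V = X @ betas.T
V -= V.max(axis=1, keepdims=True)
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)          # n x J probabilities

beta_bar = P @ betas                                          # sum_l P_l(x) beta_l
delta = P[:, :, None] * (betas[None, :, :] - beta_bar[:, None, :])  # n x J x k
ame = delta.mean(axis=0)                                      # J x k average effects
```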

Conditional Logit Model

When to Use:

When choice characteristics vary across alternatives (e.g., price varies by product, distance varies by location).

Basic Specification:

\[U_j^* = X_j'\gamma + \epsilon_j\]

Here \(X_j\) varies across choices but \(\gamma\) does not: for example, price enters every product's utility with the same coefficient, reflecting a common marginal utility of money.

Mixed Specification:

Combining individual and choice characteristics: \[U_j^* = W'\beta_j + X_j'\gamma + \epsilon_j\]

The probability is: \[P_j(w, x) = \frac{\exp(w'\beta_j + x_j'\gamma)}{\sum_{k=1}^J \exp(w'\beta_k + x_k'\gamma)}\]

Marginal Effects:

For choice-specific variables:

Own effect: \(\frac{\partial P_j(w,x)}{\partial x_j} = \gamma P_j(w,x)(1 - P_j(w,x))\)

Cross effect: \(\frac{\partial P_j(w,x)}{\partial x_l} = -\gamma P_j(w,x)P_l(w,x)\)
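Both effects in a few lines (\(\gamma\) and the covariate values are made up); `dP[j, l]` holds \(\partial P_j / \partial x_l\). Note that the effects of a change in \(x_l\) summed over all alternatives equal zero, since total probability is conserved.

```python
# Own and cross marginal effects for a choice-specific scalar covariate.
import numpy as np

gamma = -0.8                       # e.g. a price coefficient, common to all choices
x = np.array([1.0, 1.5, 0.7])     # choice-specific covariate, J = 3 alternatives

v = gamma * x
P = np.exp(v - v.max()) / np.exp(v - v.max()).sum()   # conditional logit probabilities

dP = -gamma * np.outer(P, P)                  # cross effects: -gamma P_j P_l
np.fill_diagonal(dP, gamma * P * (1 - P))     # own effects: gamma P_j (1 - P_j)
```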

Independence of Irrelevant Alternatives (IIA)

The IIA Assumption:

The ratio of probabilities for any two alternatives depends only on the characteristics of those two alternatives:

\[\frac{P_j(x)}{P_k(x)} = \frac{\exp(x'\beta_j)}{\exp(x'\beta_k)} = \exp(x'(\beta_j - \beta_k))\]

Implications:

  • Adding or removing an alternative does not change relative probabilities of other alternatives
  • The odds ratio between any two alternatives is independent of other alternatives

When IIA May Fail:

The classic example is the red bus/blue bus problem. Suppose people choose between:

  • Car (probability 0.5)
  • Red bus (probability 0.5)

If we add a blue bus option (identical to red bus except color), IIA predicts:

  • Car (probability 0.33)
  • Red bus (probability 0.33)
  • Blue bus (probability 0.33)

But intuitively, we expect:

  • Car (probability 0.5)
  • Red bus (probability 0.25)
  • Blue bus (probability 0.25)
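The IIA prediction can be reproduced numerically: give the two buses identical utilities (the zero utilities here are placeholders) and the logit mechanically splits probability three ways rather than the intuitive 0.5/0.25/0.25.

```python
# Red bus / blue bus, numerically: identical utilities force the logit
# to split probability evenly among alternatives.
import numpy as np

def logit_probs(v):
    e = np.exp(v - v.max())
    return e / e.sum()

p_before = logit_probs(np.array([0.0, 0.0]))       # car, red bus
p_after = logit_probs(np.array([0.0, 0.0, 0.0]))   # add an identical blue bus
```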

Alternatives When IIA Fails:

  • Nested logit (allows correlation within groups using dissimilarity parameter \(\tau_j\))
  • Mixed logit (allows random coefficients)
  • Multinomial probit (allows flexible error correlation, but computationally demanding)

Check Your Understanding

What is the key assumption that distinguishes multinomial logit from other multinomial models?

The Independence of Irrelevant Alternatives (IIA) assumption. This assumes that the ratio of choice probabilities between any two alternatives is independent of the presence or characteristics of other alternatives. It arises from assuming the errors follow independent Type I Extreme Value distributions across alternatives.

Why do individual-specific variables need alternative-specific coefficients?

Individual-specific variables like age or income do not vary across the alternatives for a given person. To affect choice probabilities, they must have different effects on the utility of different alternatives (different \(\beta_j\) for each alternative). Otherwise, they would appear in all utilities equally and cancel out when computing choice probabilities.

If a model includes both firm characteristics and region characteristics, which type gets alternative-specific coefficients?

Firm characteristics (individual-specific variables) get alternative-specific coefficients (\(\beta_j\)) because they do not vary across regions for a given firm. Region characteristics (alternative-specific variables) get common coefficients (\(\gamma\)) because they vary across alternatives and their effect is assumed to be the same across all choosers.

Exercise: Multinomial Logit Applications

Consider a model of location choice for foreign investment by Japanese firms that depends only on three firm characteristics: industry, size, and age. Suppose there are only 5 regions available for foreign direct investment: three in the UK (UK1, UK2, UK3) and two in France (FR1, FR2).

Part (a): Given the model specification, write out the random utilities of choosing UK1 and choosing FR1 for firm i. Carefully use subscripts i and j as indices for observations and choices, respectively.

\[ \begin{aligned} U_{i,UK1} &= \alpha_{UK1,0} + \alpha_{UK1,1} \text{industry}_i + \alpha_{UK1,2} \text{size}_i + \alpha_{UK1,3} \text{age}_i + \epsilon_{i,UK1} \\ U_{i,FR1} &= \alpha_{FR1,0} + \alpha_{FR1,1} \text{industry}_i + \alpha_{FR1,2} \text{size}_i + \alpha_{FR1,3} \text{age}_i + \epsilon_{i,FR1} \end{aligned} \]

Part (b): Write out the probability that firm i would prefer UK1 over FR1.

\[ \begin{aligned} \text{Prob}( U_{i,UK1} > U_{i,FR1}) = \text{Prob}(&\alpha_{UK1,0} + \alpha_{UK1,1} \text{industry}_i + \alpha_{UK1,2} \text{size}_i + \alpha_{UK1,3} \text{age}_i + \epsilon_{i,UK1} \\ >& \alpha_{FR1,0} + \alpha_{FR1,1} \text{industry}_i + \alpha_{FR1,2} \text{size}_i + \alpha_{FR1,3} \text{age}_i + \epsilon_{i,FR1} ) \end{aligned} \]

Part (c): Write out the probability for firm i to choose UK1 under the multinomial logit model, and denote it as \(P_{UK1}(X_i|\alpha)\).

\[ P_{UK1}(X_i |\alpha) = \text{Prob}( U_{i,UK1} > U_{i, \sim UK1}) = \frac{\exp(\alpha_{UK1,0} + \alpha_{UK1,1} \text{industry}_i + \alpha_{UK1,2} \text{size}_i + \alpha_{UK1,3} \text{age}_i)}{ \sum_{j=1}^5 \exp(\alpha_{j0} + \alpha_{j1} \text{industry}_i + \alpha_{j2} \text{size}_i + \alpha_{j3} \text{age}_i)} \]

Part (d): Write out the log-likelihood function for the sample of \(n\) firms.

\[ \ell_n (\alpha) = \sum_{i=1}^n \sum_{j=1}^5 \mathbb{1} \{Y_i =j\} \log(P_j(X_i|\alpha)) \]

Part (e): Write out the expression for the marginal effect of industry on choosing UK1.

Denote \(A = \sum_{j=1}^5 \exp(\alpha_{j0} + \alpha_{j1} \text{industry}_i + \alpha_{j2} \text{size}_i + \alpha_{j3} \text{age}_i)\) and \(B = \exp(\alpha_{UK1,0} + \alpha_{UK1,1} \text{industry}_i + \alpha_{UK1,2} \text{size}_i + \alpha_{UK1,3} \text{age}_i)\) so that \(P_{UK1} (X_i|\alpha) = \frac{B}{A}\).

\[ \begin{aligned} \frac{\partial P_{UK1} (X_i|\alpha)}{\partial \text{industry}} &= \frac{B \alpha_{UK1,1} A - B \sum_{j=1}^5 \exp(\alpha_{j0} + \alpha_{j1} \text{industry}_i + \alpha_{j2} \text{size}_i + \alpha_{j3} \text{age}_i) \alpha_{j1}}{A^2} \\ &= \alpha_{UK1,1} P_{UK1} (X_i|\alpha) - P_{UK1} (X_i|\alpha) \sum_{j=1}^5 P_j (X_i|\alpha) \alpha_{j1} \\ &= P_{UK1} (X_i|\alpha) \left( \alpha_{UK1,1} - \sum_{j=1}^5 P_j (X_i|\alpha) \alpha_{j1} \right) \end{aligned} \]

Now suppose that you also have information for each region, including the unemployment rate, the domestic industry count and the total area.

Part (f): Write out the random utilities of UK1 and FR1 for firm i. Use subscripts i and j as indices for observations and choices, respectively.

\[ \begin{aligned} U_{i,UK1} &= \alpha_{UK1,0} + \alpha_{UK1,1} \text{industry}_i + \alpha_{UK1,2} \text{size}_i + \alpha_{UK1,3} \text{age}_i \\ &\quad + \gamma_1 \text{unemp}_{UK1} + \gamma_2 \text{domind}_{UK1} + \gamma_3 \text{area}_{UK1} + \epsilon_{i,UK1} \\ U_{i,FR1} &= \alpha_{FR1,0} + \alpha_{FR1,1} \text{industry}_i + \alpha_{FR1,2} \text{size}_i + \alpha_{FR1,3} \text{age}_i \\ &\quad + \gamma_1 \text{unemp}_{FR1} + \gamma_2 \text{domind}_{FR1} + \gamma_3 \text{area}_{FR1} + \epsilon_{i,FR1} \end{aligned} \]

Part (g): Write out the probability that firm i would prefer UK1 over FR1.

\[ \begin{aligned} \text{Prob}( U_{i,UK1} > U_{i,FR1}) = \text{Prob}(&\alpha_{UK1,0} + \alpha_{UK1,1} \text{industry}_i + \alpha_{UK1,2} \text{size}_i + \alpha_{UK1,3} \text{age}_i \\ &+ \gamma_1 \text{unemp}_{UK1} + \gamma_2 \text{domind}_{UK1} + \gamma_3 \text{area}_{UK1} + \epsilon_{i,UK1} \\ >& \alpha_{FR1,0} + \alpha_{FR1,1} \text{industry}_i + \alpha_{FR1,2} \text{size}_i + \alpha_{FR1,3} \text{age}_i \\ &+ \gamma_1 \text{unemp}_{FR1} + \gamma_2 \text{domind}_{FR1} + \gamma_3 \text{area}_{FR1} + \epsilon_{i,FR1} ) \end{aligned} \]

Part (h): Write out the probability for firm i to choose UK1 under the multinomial logit model, and denote it as \(P_{UK1}(X_i, W_i|\alpha)\).

\[ \begin{aligned} P_{UK1}(X_i, W_i |\alpha) &= \text{Prob}( U_{i,UK1} > U_{i, \sim UK1}) \\ &= \frac{\exp(\alpha_{UK1,0} + \alpha_{UK1,1} \text{industry}_i + \alpha_{UK1,2} \text{size}_i + \alpha_{UK1,3} \text{age}_i + \gamma_1 \text{unemp}_{UK1} + \gamma_2 \text{domind}_{UK1} + \gamma_3 \text{area}_{UK1})}{ \sum_{j=1}^5 \exp(\alpha_{j0} + \alpha_{j1} \text{industry}_i + \alpha_{j2} \text{size}_i + \alpha_{j3} \text{age}_i + \gamma_1 \text{unemp}_j + \gamma_2 \text{domind}_j + \gamma_3 \text{area}_j)} \end{aligned} \]

Part (i): Write out the log-likelihood function for the sample of n firms.

\[ \ell_n (\alpha, \gamma) = \sum_{i=1}^n \sum_{j=1}^5 \mathbb{1} \{Y_i =j\} \log(P_j(X_i,W_i|\alpha, \gamma)) \]

Part (j): Write out the expression for the marginal effect of a change in the unemployment rate of UK1 on the probability of FR1. What marginal effect should this be equal to?

\[ \begin{aligned} \frac{\partial P_{FR1} (w,x)}{\partial \text{unemp}_{UK1}} = - \gamma_1 P_{FR1} (w,x) P_{UK1} (w,x) \end{aligned} \]

This should be equal to the expression for the marginal effect of a change in the unemployment rate of FR1 on the probability of choosing UK1.

Part (k): Suppose that you are only interested in researching investment within the UK. If you dropped the firms that selected France from the data, would this yield a consistent estimate of \(\alpha\)?

If the Independence of Irrelevant Alternatives (IIA) assumption holds, then dropping the firms that selected France still yields a consistent estimate of \(\alpha\): under IIA, the relative probabilities among the UK regions do not depend on the French alternatives. If IIA does not hold, removing the French alternatives changes the relative probabilities of the UK regions, and the estimate will be inconsistent.


Ordered Choice Models

When to Use Ordered Models

Ordered choice models are used when the dependent variable has multiple outcomes with a natural ordering, but the distances between outcomes are not meaningful.

Common Examples:

  • Survey responses (strongly disagree, disagree, neutral, agree, strongly agree)
  • Educational attainment (high school, some college, bachelor’s, graduate degree)
  • Credit ratings (AAA, AA, A, BBB, BB, B)
  • Health status (poor, fair, good, excellent)

Key Difference from Multinomial Models:

Ordered models respect the ordering of outcomes, while multinomial models treat all outcomes symmetrically. If your outcomes have a clear ordering, ordered models are more efficient and provide more interpretable results.

Ordered Logit and Probit Models

Latent Variable Framework:

Assume there is an underlying continuous latent variable \(Y^*\) that determines the observed ordered outcome \(Y\):

\[Y^* = X'\beta + \epsilon\]

We observe:

\[Y = j \text{ if } \kappa_{j-1} < Y^* \leq \kappa_j\]

where \(\kappa_0 = -\infty\), \(\kappa_J = +\infty\), and \(\kappa_1 < \kappa_2 < \ldots < \kappa_{J-1}\) are cutpoint parameters to be estimated.

Unlike multinomial models, ordered models use a single coefficient vector \(\beta\) for all outcomes. The ordering comes from the cutpoints, not from different coefficients.

Ordered Logit:

Assumes \(\epsilon\) follows a logistic distribution: \[P(Y = j | X) = \Lambda(\kappa_j - X'\beta) - \Lambda(\kappa_{j-1} - X'\beta)\]

where \(\Lambda(z) = \frac{1}{1 + e^{-z}}\) is the logistic CDF.

Ordered Probit:

Assumes \(\epsilon\) follows a standard normal distribution: \[P(Y = j | X) = \Phi(\kappa_j - X'\beta) - \Phi(\kappa_{j-1} - X'\beta)\]

where \(\Phi(\cdot)\) is the standard normal CDF.
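Either model's category probabilities follow directly from the cutpoints. A sketch for the ordered logit with made-up cutpoints and index value, using \(\kappa_0 = -\infty\) and \(\kappa_J = +\infty\) as in the setup above:

```python
# Ordered logit category probabilities from cutpoints:
# P(Y = j) = Lambda(kappa_j - x'b) - Lambda(kappa_{j-1} - x'b).
import numpy as np

def lam(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic CDF; lam(-inf)=0, lam(inf)=1

kappa = np.array([-1.0, 0.5, 2.0])    # cutpoints for J = 4 ordered categories
xb = 0.3                              # x'beta for one observation

edges = np.concatenate(([-np.inf], kappa, [np.inf]))
P = lam(edges[1:] - xb) - lam(edges[:-1] - xb)
```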

Interpretation and Marginal Effects

Sign Interpretation:

The sign of \(\beta_k\) tells you:

  • If \(\beta_k > 0\): increases in \(x_k\) increase \(Y^*\), making higher ordered outcomes more likely
  • If \(\beta_k < 0\): increases in \(x_k\) decrease \(Y^*\), making lower ordered outcomes more likely

Marginal Effects:

The effect on the probability of outcome \(j\) is:

\[\frac{\partial P(Y = j | X)}{\partial x_k} = [g(\kappa_{j-1} - X'\beta) - g(\kappa_j - X'\beta)] \beta_k\]

where \(g(\cdot)\) is the density function (logistic or normal).

  • The marginal effect on the highest category always has the same sign as \(\beta_k\), and the effect on the lowest category always has the opposite sign
  • The marginal effect on middle categories can be either positive or negative
  • This is because increasing \(x_k\) shifts probability mass out of one tail and into the other
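A numeric sketch of this formula for an ordered logit (the cutpoints and coefficient are made up): the effects telescope, so they sum to zero across categories, with unambiguous signs at the two extremes.

```python
# Marginal effects [g(kappa_{j-1} - x'b) - g(kappa_j - x'b)] * beta_k for
# every category of an ordered logit.
import numpy as np

def logistic_pdf(z):
    ez = np.exp(-np.abs(z))           # symmetric, numerically stable form
    return ez / (1.0 + ez) ** 2

beta_k = 0.7
kappa = np.array([-1.0, 0.5, 2.0])
xb = 0.2

edges = np.concatenate(([-np.inf], kappa, [np.inf]))
g = logistic_pdf(edges - xb)          # density at each threshold; g(+/-inf) = 0
me = (g[:-1] - g[1:]) * beta_k        # effect on P(Y = j), category by category
```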

Example:

If \(\beta_{income} > 0\) in a model of educational attainment:

  • Higher income increases probability of graduate degree (highest category)
  • Higher income decreases probability of high school only (lowest category)
  • Effect on middle categories (bachelor’s degree) depends on where most of the probability mass is

Parallel Regression Assumption

Ordered logit and probit models assume that the effect of covariates is the same across all cutpoints. This is called the parallel regression (or proportional odds) assumption.

The model assumes that increasing \(x_k\) by one unit has the same effect on the log-odds of being at or above category \(j\) versus below it, regardless of which cutpoint \(j\) we consider.

Testing the Assumption:

You can test this by:

  1. Estimating separate binary models for each cutpoint
  2. Comparing coefficients across models
  3. Using a Brant test or similar diagnostic

What to Do If Violated:

  • Use a generalized ordered logit/probit that allows coefficients to vary by cutpoint
  • Use a multinomial logit/probit that ignores the ordering
  • Try different specifications or transformations of variables

Check Your Understanding

When should you use an ordered model instead of a multinomial model?

Use an ordered model when your dependent variable has a natural ordering and you want to respect that ordering. For example, use ordered models for survey responses (strongly disagree to strongly agree) or educational levels (high school to graduate degree). Ordered models are more efficient when the ordering is meaningful because they use a single coefficient vector rather than separate coefficients for each outcome.

What do the cutpoint parameters represent?

The cutpoints \(\kappa_j\) represent thresholds on the latent variable \(Y^*\) that separate the different observed outcomes. They tell us where the boundaries are between adjacent categories. For instance, if \(Y^* > \kappa_2\), we observe the third ordered outcome. The cutpoints are estimated along with the \(\beta\) coefficients.

Why can the marginal effect on middle categories have either sign?

When a covariate increases, it shifts the distribution of \(Y^*\) to the right (if \(\beta > 0\)). This always increases probability in the highest categories and decreases probability in the lowest categories. But for middle categories, probability can flow both in (from lower categories) and out (to higher categories). The net effect depends on which flow is larger, which depends on where the distribution is concentrated relative to the cutpoints.