GMM & Multinomial Choice Models

Coding Review

Setup: R Packages

Key Packages

For GMM and IV:

  • AER::ivreg(): Instrumental variables regression (2SLS)
  • gmm::gmm(): Generalized method of moments estimation
  • sandwich::vcovHC(): Robust covariance matrices

For Multinomial Choice:

  • mlogit::mlogit(): Multinomial logit estimation
  • mlogit::mlogit.data(): Format data for multinomial models

For Censored Regression:

  • AER::tobit(): Tobit estimation (censored regression)
  • censReg::censReg(): Maximum-likelihood censored regression (an alternative Tobit implementation)

For Standard Errors:

  • lmtest::coeftest(): Coefficient tests with robust SEs
  • sandwich::vcovCL(): Clustered standard errors

GMM and 2SLS Comparison

GMM generalizes 2SLS. When we use the optimal weighting matrix, GMM becomes more efficient than 2SLS, especially with heteroskedasticity.

Exercise 1: 2SLS Estimation

We’ll use the CigarettesSW dataset from the AER package, which contains state-level data on cigarette consumption and prices. We want to estimate the demand for cigarettes, but price is endogenous (correlated with demand shocks), so we’ll use the cigarette tax as an instrument.

First, load the data and create the key variables.
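One plausible construction of the key variables is sketched below (the exact definitions are an assumption made for this exercise: in CigarettesSW, packs is already measured per capita, and price, income, and the tax are deflated by the CPI):

```r
library(AER)

# Load the state-level panel (48 states, 1985 and 1995)
data("CigarettesSW")

# Log quantity: packs is already packs per capita
CigarettesSW$log_quantity <- log(CigarettesSW$packs)

# Log real price and log real per-capita income, deflated by the CPI
CigarettesSW$log_price  <- log(CigarettesSW$price / CigarettesSW$cpi)
CigarettesSW$log_income <- log(CigarettesSW$income / CigarettesSW$population / CigarettesSW$cpi)

# Real cigarette-specific tax: the instrument
CigarettesSW$rtax <- CigarettesSW$tax / CigarettesSW$cpi
```

These names (log_quantity, log_price, log_income, rtax) match the variables used in the ivreg() call in this exercise.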

Now estimate a 2SLS regression of log quantity on log price and log income, using real tax (rtax) as an instrument for log price. Use the ivreg() function.

Note

The syntax for ivreg() is: dependent ~ endogenous + exogenous | instruments + exogenous

Note

The endogenous variable is log_price, the exogenous variable is log_income, and the instrument is rtax.

# 2SLS: Instrument log_price with rtax
model_2sls <- ivreg(log_quantity ~ log_price + log_income | rtax + log_income, 
                    data = CigarettesSW)

summary(model_2sls, vcov = sandwich, diagnostics = TRUE)

The coefficient on log_price is the price elasticity of demand; a negative coefficient indicates that higher prices reduce quantity demanded. The diagnostics = TRUE option reports weak-instrument tests (a first-stage F-statistic above the conventional threshold of roughly 10 suggests the instrument is not weak) and, when the model is over-identified, an over-identification test.

Exercise 2: GMM Estimation

Now let’s compare 2SLS with GMM. We’ll use the same specification but estimate it with GMM, which allows for optimal weighting in the presence of heteroskedasticity.

Note: In this just-identified case (3 instruments, 3 parameters including intercept), GMM and 2SLS will give identical coefficient estimates. The difference only appears in standard errors and in over-identified cases.

Note

For efficient GMM with optimal weighting, use type = "iterative" in the gmm() function.

library(gmm)

# Define moment conditions
g <- function(theta, data) {
  y <- data$log_quantity
  X <- cbind(1, data$log_price, data$log_income)
  Z <- cbind(1, data$rtax, data$log_income)
  
  residuals <- y - X %*% theta
  moments <- Z * as.vector(residuals)
  return(moments)
}

# GMM with optimal weighting
model_gmm <- gmm(g, x = CigarettesSW, 
                 t0 = coef(model_2sls),
                 type = "iterative",
                 wmatrix = "optimal")

summary(model_gmm)

# Compare coefficients
cat("\n=== Coefficient Comparison ===\n")
cat("2SLS:\n")
print(round(coef(model_2sls), 4))
cat("\nGMM:\n")
print(round(coef(model_gmm), 4))
cat("\nDifference:\n")
print(round(coef(model_gmm) - coef(model_2sls), 6))

In the just-identified case (same number of moment conditions as parameters), GMM and 2SLS produce exactly the same coefficient estimates, regardless of the weighting matrix. This is because we have exactly enough equations to solve for the parameters uniquely.

What differs?

  1. Standard Errors: GMM with wmatrix = "optimal" uses a heteroskedasticity-robust covariance matrix, while standard 2SLS assumes homoskedasticity. Here we used vcov = sandwich for 2SLS to make them comparable.

  2. Over-identified Cases: When you have more instruments than endogenous variables (e.g., using both rtax and rtaxs as instruments for log_price), GMM becomes more efficient than 2SLS by optimally weighting the moment conditions.

What’s the weight matrix? The wmatrix = "optimal" option tells GMM to:

  1. Start with a preliminary weighting (often identity or \((Z'Z)^{-1}\))
  2. Estimate the variance of moment conditions: \(\hat{\Omega} = \frac{1}{n}\sum_{i=1}^n g_i(\hat{\theta})g_i(\hat{\theta})'\)
  3. Use optimal weighting: \(W = \hat{\Omega}^{-1}\)
  4. Iterate until convergence

In 2SLS, the weight matrix is fixed at \(W = (Z'Z)^{-1}\). Efficient GMM optimizes this choice.
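To make this concrete, here is a small base-R sketch on simulated data (variable names are made up for the illustration): the GMM estimator \(\hat{\beta} = (X'ZWZ'X)^{-1}X'ZWZ'y\) with the fixed weight \(W = (Z'Z)^{-1}\) reproduces 2SLS computed the usual two-stage way.

```r
set.seed(1)
n <- 500

# Simulate a simple IV setup: z instruments the endogenous x
z <- rnorm(n)
u <- rnorm(n)                       # structural error
x <- 0.8 * z + 0.5 * u + rnorm(n)   # endogenous: correlated with u
y <- 1 + 2 * x + u

X <- cbind(1, x)   # regressors (with intercept)
Z <- cbind(1, z)   # instruments (with intercept)

# GMM with the fixed 2SLS weight W = (Z'Z)^{-1}
W <- solve(crossprod(Z))
A <- t(X) %*% Z %*% W %*% t(Z)
beta_gmm <- solve(A %*% X, A %*% y)

# Classic two-stage computation: regress x on z, then y on fitted x
x_hat <- fitted(lm(x ~ z))
beta_2sls <- coef(lm(y ~ x_hat))

# The two coincide up to numerical error
max(abs(as.vector(beta_gmm) - as.vector(beta_2sls)))
```

In the just-identified case the weighting matrix is irrelevant for the point estimates, so the same equality would hold for any positive-definite W.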


Multinomial Logit

Multinomial logit is used when the dependent variable has multiple unordered categories. The key challenge is understanding three types of variables:

  1. Alternative-specific variables (vary by choice): e.g., costs of each technology
  2. Individual-specific variables (vary by firm/plant): e.g., age, regulatory environment
  3. Interactions: alternative-specific variables interacted with individual characteristics, so that (for example) cost sensitivity can differ across plants
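In utility terms, the three variable types enter a plant's utility for alternative \(j\) roughly as follows (a schematic, not the exact parameterization used by mlogit; \(x_{ij}\) are alternative-specific costs, \(z_i\) are plant characteristics, and \(w_{ij}\) are the alternative-specific variables given alternative-specific coefficients in the third part):

\[
U_{ij} = \alpha_j + \beta' x_{ij} + \gamma_j' z_i + \delta_j' w_{ij} + \varepsilon_{ij}
\]

The single \(\beta\) corresponds to the first part of the formula, the alternative-specific \(\gamma_j\) to the second part, and the alternative-specific \(\delta_j\) to the third.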

We’ll use the NOx dataset, which contains pollution control technology choices by power plants.

Exercise 3: Data Preparation

The mlogit package requires data in a specific “long” format. Load the data and check its structure.

Note

The NOx data is already in long format (one row per plant-technology pair), so use shape = "long".

Note

The variable indicating which alternative/technology each row represents is alt.

library(mlogit)
data("NOx")

# View structure
head(NOx, 12)

# Check alternatives
cat("Available technologies:\n")
unique(NOx$alt)

# Convert to mlogit format
N <- mlogit.data(NOx,
                 shape = "long",
                 choice = "choice",
                 alt.var = "alt")

head(N, 12)

The NOx data contains choices of pollution control technologies by power plants. There are 4 technology alternatives:

  • post: Post-combustion pollution control
  • cm: Combustion modification
  • lnb: Low NOx burners
  • (and a baseline/no control option)

The data is already in long format with one row per plant-technology pair. The choice variable indicates which technology was chosen (TRUE/FALSE). Variables like vcost (variable cost) and kcost (capital cost) vary by technology, while age and env (regulatory environment) vary by plant.

Exercise 4: Four Types of Multinomial Logit Specifications

Now we’ll estimate four different models to understand how different types of variables work in multinomial logit. The formula syntax is:

choice ~ alternative-specific | individual-specific | interactions

Fill in the formulas for each model:

Note

For Model 1 (alternative-specific only): Use choice ~ vcost + kcost | 0 | 0

For Model 2 (individual-specific only): Use choice ~ 0 | age + env | 0

The zeros indicate which parts are excluded.

Note

For Model 3 (both types): Use choice ~ vcost + kcost | age + env | 0

For Model 4 (with interactions): Use choice ~ vcost + kcost | age + env | vcost + kcost

This allows vcost and kcost effects to vary by plant characteristics through the third component.

# Model 1: Alternative-specific variables only
model1 <- mlogit(choice ~ vcost + kcost | 0 | 0, data = N)

cat("=== Model 1: Alternative-Specific Only ===\n")
summary(model1)

# Model 2: Individual-specific variables only
model2 <- mlogit(choice ~ 0 | age + env | 0, data = N)

cat("\n=== Model 2: Individual-Specific Only ===\n")
summary(model2)

# Model 3: Both types
model3 <- mlogit(choice ~ vcost + kcost | age + env | 0, data = N)

cat("\n=== Model 3: Both Variable Types ===\n")
summary(model3)

# Model 4: With interactions
model4 <- mlogit(choice ~ vcost + kcost | age + env | vcost + kcost, data = N)

cat("\n=== Model 4: With Interactions ===\n")
summary(model4)

Model 1 (Alternative-Specific Variables):

  • vcost and kcost vary by pollution control technology but are the same across plants
  • Get single coefficients - the cost effect is assumed the same for all plants
  • Negative coefficients indicate higher costs reduce choice probability
  • These variables directly enter the utility comparison between technologies

Model 2 (Individual-Specific Variables):

  • age varies by plant but not by technology
  • env is the regulatory environment (regulated, deregulated, public)
  • Get alternative-specific coefficients (e.g., age:cm, age:lnb, etc.)
  • One technology serves as the base (coefficients set to 0)
  • Shows how plant characteristics affect preference for each technology relative to the base
  • Example: if age:lnb is positive, older plants are more likely to choose low NOx burners

Model 3 (Combined):

  • Most common specification using both types of variables
  • Costs affect all plants the same way (alternative-specific)
  • Age and regulatory environment effects vary by technology
  • Captures both universal preferences (costs) and heterogeneous preferences

Model 4 (With Interactions):

  • Most flexible specification: allows cost sensitivity to vary by plant characteristics
  • The third part of the formula (| vcost + kcost) creates interactions between:
    • Alternative-specific variables (costs)
    • Individual-specific variables (age, env)
  • Example: coefficients like age:vcost show how plant age affects cost sensitivity
  • Allows older plants to be more/less sensitive to capital costs than newer plants
  • Relaxes the assumption that all plants respond to costs identically

Key Insights:

  1. Why individual-specific variables need alternative-specific coefficients: they don’t vary across technologies for a given plant, so without different coefficients per alternative they would cancel out of every utility comparison.

  2. Why interactions matter: in Model 3, all plants share the same cost sensitivity. Model 4 allows regulated plants to be less price-sensitive (they can pass costs to ratepayers), or older plants to weight capital costs differently.

Model Comparison:

# Compare fit
cat("\nLog-Likelihoods:\n")
cat("Model 1:", logLik(model1), "\n")
cat("Model 2:", logLik(model2), "\n")
cat("Model 3:", logLik(model3), "\n")
cat("Model 4:", logLik(model4), "\n")

# Test if interactions are needed
lrtest(model3, model4)

The likelihood ratio test shows whether allowing cost sensitivity to vary by plant characteristics significantly improves fit.

Practical Example: If regulated plants are less sensitive to operating costs (because they can pass costs to customers), Model 4 would capture this through the env:vcost interaction term, while Model 3 would miss this heterogeneity.

Exercise 5: Predicted Probabilities

Calculate predicted probabilities for the first plant using Model 4 (the full model with interactions). What is the probability they choose each pollution control technology?

# Predicted probabilities: one row per plant, one column per technology
fitted_probs <- fitted(model4, outcome = FALSE)

cat("First plant's choice probabilities:\n")
print(fitted_probs[1, ])

cat("\nSum of probabilities:", sum(fitted_probs[1, ]))

# Compare across models
cat("\n\nCompare first plant across all four models:\n")
cat("\nModel 1 (costs only):\n")
print(fitted(model1, outcome = FALSE)[1, ])

cat("\nModel 2 (age and regulation only):\n")
print(fitted(model2, outcome = FALSE)[1, ])

cat("\nModel 3 (both, no interactions):\n")
print(fitted(model3, outcome = FALSE)[1, ])

cat("\nModel 4 (with interactions):\n")
print(fitted_probs[1, ])

The predicted probabilities show the model’s estimate of how likely each plant is to choose each pollution control technology, given the plant’s characteristics and the costs of each technology. The probabilities sum to 1 for each plant, representing a complete probability distribution over the alternatives.
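The mechanics behind these fitted values can be sketched in base R (with hypothetical utility values, not estimates from the NOx models): multinomial logit turns deterministic utilities \(V_{ij}\) into probabilities via the softmax formula \(P_{ij} = e^{V_{ij}} / \sum_k e^{V_{ik}}\), which guarantees they are positive and sum to 1.

```r
# Hypothetical deterministic utilities for one plant over four alternatives
V <- c(post = -1.2, cm = 0.3, lnb = 0.8, base = 0)

# Softmax: subtracting max(V) first is a standard overflow guard
# (it leaves the probabilities unchanged)
probs <- exp(V - max(V)) / sum(exp(V - max(V)))

round(probs, 3)
sum(probs)   # 1 by construction
```

Note that adding a constant to every utility leaves the probabilities unchanged, which is why one alternative's coefficients must be normalized to zero.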

Comparing Models:

  • Model 1 uses only costs, so predictions vary only based on which technologies are cheaper
    • Plants with similar cost structures will have similar predicted probabilities
  • Model 2 uses only plant characteristics (age and regulatory environment), so predictions vary only based on plant-level factors
    • Can’t distinguish between technologies with different costs
  • Model 3 combines both, accounting for:
    • Universal cost sensitivity (all plants prefer cheaper options)
    • Heterogeneous base preferences based on plant characteristics
  • Model 4 adds interaction effects, allowing:
    • Cost sensitivity to vary by plant type
    • Example: Older plants in regulated environments may be less sensitive to capital costs
    • Most realistic representation of heterogeneous decision-making

Policy Relevance: Understanding these choice probabilities and how they vary by plant characteristics helps regulators:

  • Predict technology adoption patterns under different cost scenarios
  • Design cost-effective environmental policies
  • Target subsidies or regulations to plants most likely to respond

Ordered Logit and Probit

Ordered models are used when the dependent variable has multiple ordered categories, like survey responses (strongly disagree to strongly agree) or educational levels (high school, some college, bachelor’s, graduate degree).

Exercise 6: Ordered Logit

We’ll use the Wage1 dataset from the wooldridge package; its educ variable records years of schooling, which we can bin into an ordered educational-attainment factor.

First, load and prepare the data by creating an ordered education factor.
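A plausible preparation sketch follows; the bin cutoffs and variable mapping are assumptions. In the wooldridge package the data is called wage1, with educ and exper in years and smsa as the urban indicator; wage1 has no age variable, so age is approximated here as educ + exper + 6 (a rough potential-age proxy).

```r
library(wooldridge)  # provides the wage1 data
data("wage1")

educ_data <- data.frame(
  # Bin years of schooling into five ordered attainment categories (cutoffs assumed)
  education = cut(wage1$educ,
                  breaks = c(-Inf, 11, 12, 14, 16, Inf),
                  labels = c("Less than HS", "High school", "Some college",
                             "Bachelor's", "Graduate"),
                  ordered_result = TRUE),
  female     = wage1$female,
  experience = wage1$exper,
  urban      = wage1$smsa,                   # smsa = standard metropolitan area
  age        = wage1$educ + wage1$exper + 6  # proxy: no age variable in wage1
)

table(educ_data$education)
```

The ordered factor is what polr() expects as the dependent variable, and the five levels match the profiles used in Exercise 8.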

Now estimate an ordered logit model with education as the dependent variable and age, female, experience, and urban as predictors.

Note

For ordered logit, use method = "logistic" in the polr() function.

library(MASS)  # provides polr() for ordered logit/probit

# Ordered logit model
model_ologit <- polr(education ~ age + female + experience + urban,
                     data = educ_data,
                     method = "logistic",
                     Hess = TRUE)

summary(model_ologit)

Interpretation:

  • Coefficients: Positive coefficients increase the latent variable, making higher education categories more likely
  • Cutpoints (Intercepts): These are the thresholds between adjacent categories on the latent scale
  • For example, if age has a positive coefficient, older individuals are more likely to have higher education levels
  • The coefficients apply to all cutpoints (parallel regression assumption)

Example: If the coefficient on age is 0.05, each additional year of age increases the log-odds of being in a higher education category by 0.05.
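The latent-variable mechanics can be sketched in base R with made-up numbers (not estimates from this exercise): given a linear index \(x'\beta\) and cutpoints \(\kappa_1 < \dots < \kappa_{K-1}\), each category's probability is the logistic CDF evaluated between adjacent cutpoints, \(P(y = k) = \Lambda(\kappa_k - x'\beta) - \Lambda(\kappa_{k-1} - x'\beta)\).

```r
# Hypothetical linear index and cutpoints (illustrative values only)
xb    <- 0.9                       # x'beta for one individual
kappa <- c(-1.5, -0.2, 0.7, 2.0)   # K - 1 = 4 thresholds -> K = 5 categories

# Cumulative probabilities P(y <= k) at each threshold, padded with 0 and 1
cum <- c(0, plogis(kappa - xb), 1)

# Category probabilities are differences of adjacent cumulative values
probs <- diff(cum)
names(probs) <- paste0("cat", 1:5)

round(probs, 3)
sum(probs)   # 1 by construction
```

Because the single index \(x'\beta\) shifts all cutpoints at once, every covariate moves probability mass in the same direction across the ordered scale: this is the parallel regression assumption.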

Exercise 7: Ordered Probit

Now estimate the same model using ordered probit and compare the results.

Note

For ordered probit, use method = "probit" in the polr() function.

# Ordered probit model
model_oprobit <- polr(education ~ age + female + experience + urban,
                      data = educ_data,
                      method = "probit",
                      Hess = TRUE)

summary(model_oprobit)

# Compare coefficients
cat("\n=== Coefficient Comparison ===\n")
cat("Ratio of logit to probit coefficients:\n")
print(coef(model_ologit) / coef(model_oprobit))

Interpretation:

Just like binary logit vs probit, ordered logit coefficients are approximately 1.8 times larger than ordered probit coefficients. This comes from the different variance assumptions:

  • Ordered logit: variance = π²/3 ≈ 3.29
  • Ordered probit: variance = 1

The ratio √(π²/3) ≈ 1.814 explains the scaling difference. Both models give similar predicted probabilities and substantive conclusions.
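The scaling relationship is easy to see in a binary-outcome simulation using base R's glm() (simulated data; in finite samples the empirical ratio typically lands in the 1.6–1.8 range rather than hitting the theoretical value exactly):

```r
set.seed(42)
n <- 5000
x <- rnorm(n)

# Simulate a binary outcome from a logistic model with slope 1
y <- rbinom(n, 1, plogis(-0.5 + 1.0 * x))

b_logit  <- coef(glm(y ~ x, family = binomial(link = "logit")))["x"]
b_probit <- coef(glm(y ~ x, family = binomial(link = "probit")))["x"]

# Ratio should be in the neighborhood of sqrt(pi^2 / 3) ~ 1.81
unname(b_logit / b_probit)
```

The same logic carries over to polr(): the coefficients rescale with the link, but the fitted category probabilities are nearly identical.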

Exercise 8: Predicted Probabilities

Calculate predicted probabilities for different profiles using the ordered logit model.

Note

Use type = "probs" in the predict() function to get predicted probabilities for each category.

# Create profiles
profile1 <- data.frame(age = 30, female = 1, experience = 5, urban = 1)
profile2 <- data.frame(age = 30, female = 0, experience = 5, urban = 0)

# Predicted probabilities
probs1 <- predict(model_ologit, newdata = profile1, type = "probs")
probs2 <- predict(model_ologit, newdata = profile2, type = "probs")

cat("Urban Female:\n")
print(round(probs1, 3))
cat("\nRural Male:\n")
print(round(probs2, 3))

# Visualize
library(ggplot2)
prob_comparison <- data.frame(
  Education = rep(levels(educ_data$education), 2),
  Probability = c(probs1, probs2),
  Profile = rep(c("Urban Female", "Rural Male"), each = 5)
)

ggplot(prob_comparison, aes(x = Education, y = Probability, fill = Profile)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Predicted Education Probabilities",
       y = "Probability") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation:

The predicted probabilities show how the likelihood of each education level changes based on individual characteristics:

  • Urban Female: Higher probabilities in [observe which categories]
  • Rural Male: Higher probabilities in [observe which categories]

The differences reflect the estimated effects of gender and urban location on educational attainment. If coefficients on these variables were positive, we’d see probability mass shifting toward higher education categories for urban females.

Key Insight: Unlike multinomial logit, ordered models assume the effects are monotonic - variables that increase higher education don’t just shift probability between random categories, they systematically shift it up or down the ordered scale.