Suppose we have the linear equation \(Y = X' \beta+e\) with two sets of instruments \(Z_1\) and \(Z_2\). Then consider the following estimators of \(\beta\):
\[
\begin{aligned}
\hat{\beta} &: \text{2SLS using the instruments } Z_1 \\
\tilde{\beta} &: \text{2SLS using the instruments } Z_2 \\
\bar{\beta} &: \text{GMM using the instruments } Z=(Z_1,Z_2) \\
&\text{ and the weight matrix } \pmb{W} =
\begin{pmatrix} (\pmb{Z_1' Z_1})^{-1} \lambda & 0 \\ 0 & (\pmb{Z_2' Z_2})^{-1} (1-\lambda) \end{pmatrix}
\end{aligned}
\]
for \(\lambda \in (0,1)\).
Find an expression for \(\bar{\beta}\) which shows that it is a specific weighted average of \(\hat{\beta}\) and \(\tilde{\beta}\). (28 points)
Solution (a)
First, we can describe \(\hat{\beta}, \tilde{\beta}, \bar{\beta}\): \[
\begin{aligned}
\hat{\beta} &= (X' P_{Z_1} X)^{-1} X' P_{Z_1} Y \\
\tilde{\beta} &= (X' P_{Z_2} X)^{-1} X' P_{Z_2} Y \\
\bar{\beta} &= (X' ZWZ' X)^{-1} X' ZWZ' Y \\
\end{aligned}
\]
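Because \(W\) is block diagonal, \(ZWZ'\) splits into a weighted sum of the two projection matrices:
\[
ZWZ' = \lambda Z_1 (Z_1' Z_1)^{-1} Z_1' + (1-\lambda) Z_2 (Z_2' Z_2)^{-1} Z_2' = \lambda P_{Z_1} + (1-\lambda) P_{Z_2}
\]
Substituting this into \(\bar{\beta}\) and using \(X' P_{Z_1} Y = (X' P_{Z_1} X) \hat{\beta}\) and \(X' P_{Z_2} Y = (X' P_{Z_2} X) \tilde{\beta}\):
\[
\begin{aligned}
\bar{\beta} &= \left[ \lambda X' P_{Z_1} X + (1-\lambda) X' P_{Z_2} X \right]^{-1} \left[ \lambda X' P_{Z_1} Y + (1-\lambda) X' P_{Z_2} Y \right] \\
&= \left[ \lambda X' P_{Z_1} X + (1-\lambda) X' P_{Z_2} X \right]^{-1} \left[ \lambda (X' P_{Z_1} X) \hat{\beta} + (1-\lambda) (X' P_{Z_2} X) \tilde{\beta} \right]
\end{aligned}
\]
This shows that \(\bar{\beta}\) is a matrix-weighted average of \(\hat{\beta}\) and \(\tilde{\beta}\), with weights \(\lambda X' P_{Z_1} X\) and \((1-\lambda) X' P_{Z_2} X\) that depend on \(\lambda\) and the projection matrices. The two weights, each premultiplied by the inverse of their sum, add up to the identity matrix.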
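The weighted-average claim can be checked numerically. The sketch below is not part of the original solution: it simulates data (the sample size, dimensions, coefficients, and \(\lambda = 0.3\) are all arbitrary assumptions), computes \(\bar{\beta}\) directly from the GMM formula, and compares it with the matrix-weighted average of the two 2SLS estimators, using weights \(\lambda X' P_{Z_1} X\) and \((1-\lambda) X' P_{Z_2} X\).

```r
set.seed(1)
n <- 200
lambda <- 0.3                                  # arbitrary value in (0, 1)

# Simulated instruments, regressors, and outcome (all hypothetical)
Z1 <- matrix(rnorm(n * 3), n, 3)
Z2 <- matrix(rnorm(n * 3), n, 3)
X  <- cbind(Z1[, 1] + Z2[, 1] + rnorm(n),
            Z1[, 2] - Z2[, 2] + rnorm(n))
Y  <- X %*% c(1, -0.5) + rnorm(n)

# Projection matrix onto the column space of Z
proj <- function(Z) Z %*% solve(crossprod(Z), t(Z))
P1 <- proj(Z1)
P2 <- proj(Z2)

# 2SLS with each instrument set
tsls <- function(P) solve(t(X) %*% P %*% X, t(X) %*% P %*% Y)
beta_hat   <- tsls(P1)
beta_tilde <- tsls(P2)

# GMM with the block-diagonal weight matrix from the problem
Z <- cbind(Z1, Z2)
W <- rbind(
  cbind(lambda * solve(crossprod(Z1)), matrix(0, 3, 3)),
  cbind(matrix(0, 3, 3), (1 - lambda) * solve(crossprod(Z2)))
)
XZWZ <- t(X) %*% Z %*% W %*% t(Z)
beta_bar <- solve(XZWZ %*% X, XZWZ %*% Y)

# Matrix-weighted average of the two 2SLS estimators
A1 <- lambda * t(X) %*% P1 %*% X
A2 <- (1 - lambda) * t(X) %*% P2 %*% X
beta_avg <- solve(A1 + A2, A1 %*% beta_hat + A2 %*% beta_tilde)

max(abs(beta_bar - beta_avg))                  # should be numerically zero
```

The two routes agree up to floating-point error for any \(\lambda \in (0,1)\), since the identity is algebraic rather than a property of the simulated data.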
Is this an efficient weight matrix to use for GMM? (Short answer, 1-2 sentences) (2 points)
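Solution (b)
No, this is not efficient in general. The efficient GMM weight matrix is \(W = (E[ZZ'e^2])^{-1}\), the inverse of the covariance matrix of the moment conditions, which has to be estimated from the residuals. The weight matrix here is block diagonal and uses only \((Z_1'Z_1)^{-1}\) and \((Z_2'Z_2)^{-1}\) scaled by \(\lambda\) and \(1-\lambda\), so it ignores both heteroskedasticity in the errors and the correlation between the \(Z_1\) and \(Z_2\) moments.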
2. Multinomial Logits and Probits
(50 points total)
Load the heating data, which contains a sample of 900 Californian households and their choice of heating system.
idcase: id
depvar: heating system
gc (gas central)
gr (gas room)
ec (electric central)
er (electric room)
hp (heat pump)
ic.z: installation cost for heating system z (defined for the 5 heating systems)
oc.z: annual operating cost for heating system z (defined for the 5 heating systems)
income: annual income of the household
agehed: age of the household head
rooms: numbers of rooms in the house
region
Estimate a multinomial logit model for heating system choice using only house characteristics. Report and interpret the coefficients. (6 points)
Solution (a)
Code
library(mlogit)

data("Heating", package = "mlogit")
Heating_mlogit <- dfidx(Heating, choice = "depvar", shape = "wide", varying = 3:12)

fit_mnl <- mlogit(depvar ~ 0 | rooms + region + income + agehed, data = Heating_mlogit)
summary(fit_mnl)
The reference alternative is electric central (ec), the first level of depvar, so every coefficient is a contrast against ec. The coefficient on agehed:er is -0.03, indicating that each additional year of age of the household head decreases the log-odds of choosing electric room heating versus electric central by 0.03 (p = 0.03). This suggests older household heads are significantly less likely to choose electric room systems. Similarly, the coefficient on income:gr is -0.12, suggesting that higher-income households are less likely to choose gas room heating relative to electric central, though this effect is not statistically significant (p = 0.19).
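Log-odds coefficients are often easier to read as odds ratios. A minimal sketch, using the age coefficient for electric room heating from model (a) at four decimals (-0.0263); the variable names are illustrative only:

```r
b_age_er <- -0.0263                  # agehed:er coefficient from model (a)

odds_ratio_1yr  <- exp(b_age_er)       # multiplicative change in odds per year
odds_ratio_10yr <- exp(10 * b_age_er)  # same, for a ten-year age difference

round(c(one_year = odds_ratio_1yr, ten_years = odds_ratio_10yr), 3)
```

An odds ratio below 1 means the odds of the alternative, relative to the reference alternative, fall as age rises.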
Estimate a conditional logit model using only the system-specific variables. Report and interpret the coefficients. (6 points)
Solution (b)
Code
fit_cl <- mlogit(depvar ~ ic + oc, data = Heating_mlogit)
summary(fit_cl)
Call:
mlogit(formula = depvar ~ ic + oc, data = Heating_mlogit, method = "nr")
Frequencies of alternatives:choice
ec er gc gr hp
0.071111 0.093333 0.636667 0.143333 0.055556
nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 9.58E-06
successive function values within tolerance limits
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):er 0.19459102 0.20424212 0.9527 0.3407184
(Intercept):gc 0.05213336 0.46598878 0.1119 0.9109210
(Intercept):gr -1.35058266 0.50715442 -2.6631 0.0077434 **
(Intercept):hp -1.65884594 0.44841936 -3.6993 0.0002162 ***
ic -0.00153315 0.00062086 -2.4694 0.0135333 *
oc -0.00699637 0.00155408 -4.5019 6.734e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -1008.2
McFadden R^2: 0.013691
Likelihood ratio test : chisq = 27.99 (p.value = 8.3572e-07)
The coefficients for installation and operating costs are generic across all alternatives. The coefficient on ic is -0.0015 (p = 0.01), meaning that a $1 increase in installation cost decreases the log-odds of choosing that system by 0.0015. The coefficient on oc is -0.007 (p < 0.001), indicating that a $1 increase in annual operating cost decreases the log-odds by 0.007. Both cost variables significantly reduce the probability of choosing more expensive systems.
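One further way to read the two generic cost coefficients is through their ratio, which does not depend on the scale of the logit error. The sketch below is an illustration added here, not part of the original solution; it uses the point estimates from the output above.

```r
b_ic <- -0.00153315   # installation cost coefficient from the output above
b_oc <- -0.00699637   # annual operating cost coefficient from the output above

# Utility-equivalent trade-off: dollars of installation cost per
# dollar of annual operating cost
oc_ic_ratio <- b_oc / b_ic
round(oc_ic_ratio, 2)
```

Under this model, a $1 reduction in annual operating cost shifts utility by the same amount as a reduction of roughly this many dollars in installation cost.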
Estimate a mixed logit model including both house and system-specific variables. Compare the results to (a) and (b). (6 points)
Solution (c)
Code
fit_mixed <- mlogit(depvar ~ ic + oc | rooms + region + income + agehed, data = Heating_mlogit)
summary(fit_mixed)

# Formal model comparisons
lr_mnl_mixed <- lrtest(fit_mnl, fit_mixed)
lr_cl_mixed <- lrtest(fit_cl, fit_mixed)
Comparison to (a): Adding system-specific costs (ic, oc) significantly improves model fit (χ² = 27.23, p < 0.001). The household characteristic coefficients remain stable: the age coefficient for electric room heating is -0.0263 in model (a) versus -0.0258 in model (c), suggesting robust demographic effects independent of costs.
Comparison to (b): Adding household characteristics does not significantly improve fit over the conditional logit (χ² = 26.57, p = 0.33). The cost coefficients remain nearly identical: ic is -0.0015 in model (b) versus -0.0015 in model (c), and oc is -0.007 versus -0.007. This suggests that system costs are the primary drivers of heating choice, with household demographics adding minimal explanatory power.
Estimate a mixed logit model that allows installation and operating costs to have alternative-specific effects. Test whether this more flexible specification is warranted compared to the mixed model in (c). (7 points)
Solution (d)
Code
# Flexible model with alternative-specific cost coefficients
fit_mixed_flex <- mlogit(depvar ~ 0 | rooms + region + income + agehed | ic + oc, data = Heating_mlogit)
summary(fit_mixed_flex)

# Compare to the restricted model from (c)
lr_test_result <- lrtest(fit_mixed, fit_mixed_flex)
The LR test (χ² = 5.65, p = 0.69) fails to reject the null hypothesis that cost effects are equal across alternatives. The generic specification from (c) is preferred because there’s insufficient evidence that an increase in installation cost has different effects on choosing a heat pump versus electric central heating. The simpler model with common cost coefficients is adequate for these data.
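The p-values of the likelihood ratio tests in (c) and (d) can be recovered from the reported χ² statistics with base R. The degrees of freedom below are assumptions based on counting parameters: 2 for adding the generic ic and oc coefficients, 24 for six household regressors (treating region as a four-level factor, so three dummies) times four non-reference alternatives, and 8 for moving from 2 generic to 10 alternative-specific cost coefficients.

```r
# Upper-tail chi-square probability for an LR statistic
lr_p <- function(chisq, df) pchisq(chisq, df = df, lower.tail = FALSE)

p_c_vs_a <- lr_p(27.23, df = 2)    # (c) vs (a): adds ic and oc
p_c_vs_b <- lr_p(26.57, df = 24)   # (c) vs (b): adds household characteristics (assumed df)
p_d_vs_c <- lr_p(5.65,  df = 8)    # (d) vs (c): alternative-specific cost effects (assumed df)

signif(c(c_vs_a = p_c_vs_a, c_vs_b = p_c_vs_b, d_vs_c = p_d_vs_c), 2)
```

If the degrees of freedom printed by lrtest() differ from these counts, the lrtest() output is the one to trust.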
Using the model from part (c), calculate the marginal effects of household income on the probability of choosing each heating system. Evaluate these at the mean values of all other covariates. Interpret your results. (5 points)
Solution (e)
Code
library(ggplot2)

# Create data at mean values (one row per alternative)
z <- with(Heating, data.frame(
  ic = c(mean(ic.gc), mean(ic.gr), mean(ic.ec), mean(ic.er), mean(ic.hp)),
  oc = c(mean(oc.gc), mean(oc.gr), mean(oc.ec), mean(oc.er), mean(oc.hp)),
  rooms = mean(rooms),
  region = names(sort(table(region), decreasing = TRUE))[1],  # most common region
  income = mean(income),
  agehed = mean(agehed)
))
z$region <- factor(z$region, levels = levels(Heating$region))

# Calculate marginal effects of income
me_income <- effects(fit_mixed, covariate = "income", data = z)

# Visualize
me_df <- data.frame(
  Alternative = names(me_income),
  MarginalEffect = as.numeric(me_income)
)
ggplot(me_df, aes(x = Alternative, y = MarginalEffect)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = round(MarginalEffect, 3)),
            vjust = ifelse(me_df$MarginalEffect >= 0, -0.3, 1.3), size = 3.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Marginal Effect of Income on Heating System Choice",
       subtitle = "Evaluated at mean of other covariates",
       y = "Change in Probability (per unit increase in income)",
       x = "Heating System") +
  theme_minimal()
The marginal effects show how a one-unit increase in the income variable changes the probability of choosing each heating system, holding all other variables at their means. The effects are small and must sum to zero across alternatives, because the probabilities sum to 1.
Heat pumps have a positive marginal effect of 0.0029, meaning a one-unit income increase raises the probability of choosing a heat pump by 0.29 percentage points, which suggests heat pumps are a “normal good” preferred by higher-income households. Electric room heating shows the largest negative effect (-0.0096), while electric central shows the largest positive effect (0.0076). Gas central shows minimal change (0.0006). The pattern suggests higher-income households shift away from room-based systems toward central systems, particularly electric central and heat pumps.
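The sum-to-zero property follows from the multinomial logit formula for the marginal effect of an individual-specific covariate, \(\partial P_j / \partial x = P_j (\beta_j - \sum_k P_k \beta_k)\). The sketch below illustrates it with made-up probabilities and coefficients; none of these numbers come from the fitted model.

```r
# Hypothetical choice probabilities at the evaluation point (sum to 1)
p <- c(ec = 0.07, er = 0.09, gc = 0.64, gr = 0.14, hp = 0.06)

# Hypothetical income coefficients; the reference alternative (ec) is fixed at 0
b <- c(ec = 0, er = -0.10, gc = 0.02, gr = -0.05, hp = 0.05)

# Marginal effect for each alternative: P_j * (b_j - sum_k P_k * b_k)
me <- p * (b - sum(p * b))

round(me, 4)
sum(me)   # zero up to floating-point error
```

Note that the sign of a marginal effect depends on \(\beta_j\) relative to the probability-weighted average coefficient, not on the sign of \(\beta_j\) alone.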
Using the same model, calculate how the probability of choosing a heat pump changes if its installation cost decreases by 20%. Compare this to the marginal effect of installation cost. (5 points)
Simulation approach shows actual change of: 0.0131
Difference (due to nonlinearity): 0.0018
The policy simulation shows the finite change in probability from a substantial (20%) cost reduction. The marginal effect (0.0113) provides a local linear approximation that closely matches the simulation result (0.0131), with a small difference (0.0018) due to the nonlinearity of the logit model. For this moderately large change, both approaches give similar results, though the simulation is more accurate. For smaller policy changes, marginal effects provide excellent approximations.
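The gap between a finite change and its marginal-effect approximation can be illustrated with a small stand-alone calculation. In the sketch below only the ic coefficient is taken from the output in (b); the heat pump's installation cost and the utilities of the other systems are hypothetical placeholders, so the numbers will not match the results reported above.

```r
beta_ic <- -0.00153315      # generic ic coefficient from the output in (b)
ic_hp   <- 1000             # hypothetical heat pump installation cost
v_hp_base <- 0              # hypothetical heat pump utility excluding the ic term
v_other <- c(ec = 0.2, er = 0.4, gc = 2.3, gr = 0.9)  # hypothetical utilities

# Logit probability of choosing the heat pump at a given installation cost
p_hp <- function(ic) {
  v_hp <- v_hp_base + beta_ic * ic
  exp(v_hp) / (exp(v_hp) + sum(exp(v_other)))
}

p0 <- p_hp(ic_hp)
delta_ic <- -0.20 * ic_hp   # 20% decrease in installation cost

# Exact change from re-evaluating the probability
finite_change <- p_hp(ic_hp + delta_ic) - p0

# Linear approximation from the own marginal effect beta * P * (1 - P)
linear_approx <- beta_ic * p0 * (1 - p0) * delta_ic

c(finite = finite_change, linear = linear_approx,
  difference = finite_change - linear_approx)
```

When the baseline probability is below 0.5, the logit curve is convex in utility, so the linear approximation understates the effect of a cost reduction, which is the same direction as the difference reported above.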
Estimate two nested logit models for heating system choice. For each model: (i) report and interpret the inclusive value (dissimilarity) parameters for each nest, and (ii) test whether each inclusive value parameter is statistically different from 1 and explain how to determine if the nested structure is justified (show your code and results). (7 points)
Model 1: Nests by energy source: (1) gas (gc, gr), (2) electric (ec, er, hp)
Model 2: Nests by system type: (1) room systems (gr, er), (2) central systems (gc, ec, hp)
Solution (g)
Code
# Model 1: Nests by energy source
nests1 <- list(gas = c("gc", "gr"), electric = c("ec", "er", "hp"))
fit_nested1 <- mlogit(depvar ~ ic + oc | rooms + region + income + agehed,
                      data = Heating_mlogit, nests = nests1)
summary(fit_nested1)

# Model 2: Nests by system type
nests2 <- list(room = c("gr", "er"), central = c("gc", "ec", "hp"))
fit_nested2 <- mlogit(depvar ~ ic + oc | rooms + region + income + agehed,
                      data = Heating_mlogit, nests = nests2)
summary(fit_nested2)

# Test inclusive value parameters for Model 1
est_iv1 <- coef(fit_nested1)[grep("^iv:", names(coef(fit_nested1)))]
se_iv1 <- sqrt(diag(vcov(fit_nested1)))[grep("^iv:", names(coef(fit_nested1)))]
z_iv1 <- (est_iv1 - 1) / se_iv1
p_iv1 <- 2 * pnorm(-abs(z_iv1))
iv1_results <- data.frame(Nest = names(est_iv1), Estimate = est_iv1, Std.Error = se_iv1,
                          z.value = z_iv1, p.value = p_iv1, row.names = NULL)
knitr::kable(iv1_results, digits = 2, caption = "Model 1 - Energy Source Nests")
Model 1 - Energy Source Nests

| Nest        | Estimate | Std.Error | z.value | p.value |
|:------------|---------:|----------:|--------:|--------:|
| iv:gas      |     8.06 |     11.81 |     0.6 |    0.55 |
| iv:electric |     1.69 |      0.77 |     0.9 |    0.37 |
Code
# Test inclusive value parameters for Model 2
est_iv2 <- coef(fit_nested2)[grep("^iv:", names(coef(fit_nested2)))]
se_iv2 <- sqrt(diag(vcov(fit_nested2)))[grep("^iv:", names(coef(fit_nested2)))]
z_iv2 <- (est_iv2 - 1) / se_iv2
p_iv2 <- 2 * pnorm(-abs(z_iv2))
iv2_results <- data.frame(Nest = names(est_iv2), Estimate = est_iv2, Std.Error = se_iv2,
                          z.value = z_iv2, p.value = p_iv2, row.names = NULL)
knitr::kable(iv2_results, digits = 2, caption = "Model 2 - System Type Nests")
Model 2 - System Type Nests

| Nest       | Estimate | Std.Error | z.value | p.value |
|:-----------|---------:|----------:|--------:|--------:|
| iv:room    |     0.04 |      0.02 |  -38.69 |       0 |
| iv:central |     0.05 |      0.04 |  -26.02 |       0 |
Code
# Compare nested models to standard multinomial logit
lr_nested1 <- lrtest(fit_mixed, fit_nested1)
lr_nested2 <- lrtest(fit_mixed, fit_nested2)
The inclusive value (IV) parameter λ for each nest measures the degree of independence among alternatives within that nest:
λ = 1: Alternatives in the nest are completely independent (IIA holds, reduces to MNL)
0 < λ < 1: Alternatives in the nest are correlated in unobserved utility
λ closer to 0: Higher correlation within nest
Model 1 shows evidence of misspecification: The gas nest has λ = 8.06 (SE = 11.81), well above 1, and the electric nest has λ = 1.69 (p = 0.37). Values of λ > 1 are not consistent with random utility maximization over the full range of the explanatory variables, which casts doubt on this nesting structure. In addition, neither IV parameter is significantly different from 1 (p > 0.05), so the Wald tests give no evidence that the energy source grouping captures meaningful correlation patterns.
Model 2 shows a valid nested structure: Both IV parameters (0.04 and 0.05) fall within the valid range of (0,1) and are significantly different from 1 (both p < 0.001), providing strong evidence that the nesting structure is appropriate. The λ values close to 0 indicate very high correlation within nests, meaning room systems (gr, er) share substantial unobserved attributes, as do central systems (gc, ec, hp). This does not violate theoretical foundations—rather, it reveals important substitution patterns that the standard multinomial logit cannot capture.
Determining if Nested Structure is Justified:
Check validity: 0 < λ < 1
Model 1: ✗ (λ > 1 for both nests)
Model 2: ✓ (both λ values in valid range)
Test H₀: λ = 1 (if rejected, nesting is justified):
Model 1: Cannot reject (p > 0.05) → nesting not justified
Model 2: Strongly reject (p < 0.001) → nesting IS justified
Likelihood ratio test vs. MNL:
Model 1 vs. MNL: χ² = 7.45, p = 0.02
Model 2 vs. MNL: χ² = 9.34, p = 0.01
Conclusion: Model 1 (energy source nesting) is invalid due to λ > 1, while Model 2 (system type nesting) is well-specified with valid IV parameters that are significantly different from 1. The system type structure reveals that households view room-based heating systems as close substitutes, and similarly for central systems. Model 2 should be preferred over the standard multinomial logit, as it better captures the actual substitution patterns in heating system choice.
Combine the five heating systems into three categories: gas (gc, gr), electric (ec, er), and heat pump (hp). Estimate both a mixed multinomial probit and a mixed multinomial logit model for this grouped outcome. Compare the results to the nested logit from (g). (8 points)
Solution (h)
Code
library(data.table)

# Create grouped dataset with data.table
DT <- data.table(Heating)[
  , grouped := fifelse(depvar %in% c("gc", "gr"), "gas",
                       fifelse(depvar %in% c("ec", "er"), "electric", "hp"))
][
  , `:=`(
    ic_gas  = (ic.gc + ic.gr) / 2,
    ic_elec = (ic.ec + ic.er) / 2,
    oc_gas  = (oc.gc + oc.gr) / 2,
    oc_elec = (oc.ec + oc.er) / 2
  )
]

# Create long format: one row per household and grouped alternative
final_data <- rbindlist(list(
  DT[, .(idcase = .I, alt = "gas", chosen = grouped == "gas",
         ic = ic_gas, oc = oc_gas, rooms, region, income, agehed)],
  DT[, .(idcase = .I, alt = "electric", chosen = grouped == "electric",
         ic = ic_elec, oc = oc_elec, rooms, region, income, agehed)],
  DT[, .(idcase = .I, alt = "hp", chosen = grouped == "hp",
         ic = ic.hp, oc = oc.hp, rooms, region, income, agehed)]
))

# Convert to data.frame, then to dfidx
Heating_grouped <- dfidx(as.data.frame(final_data),
                         idx = c("idcase", "alt"),
                         choice = "chosen",
                         idnames = c("chid", "alt"))

# Fit multinomial logit
fit_grouped_mnl <- mlogit(chosen ~ ic + oc | rooms + region + income + agehed,
                          data = Heating_grouped)
summary(fit_grouped_mnl)

# Fit multinomial probit
fit_grouped_mnp <- mlogit(chosen ~ ic + oc | rooms + region + income + agehed,
                          data = Heating_grouped,
                          probit = TRUE)
summary(fit_grouped_mnp)
Grouped MNL vs. Grouped MNP: The probit model allows for more flexible correlation patterns in errors but is computationally more intensive. The cost coefficients have the same signs in both models: ic is -0.0023 (MNL) vs. -0.0018 (MNP), and oc is -0.0087 vs. -0.0051. Because logit and probit coefficients are on different scales, ratios are more comparable than levels: ic/oc is about 0.26 in the MNL and about 0.35 in the MNP. Log-likelihoods are nearly identical (-567.3 vs. -566.01), suggesting that the additional flexibility of probit adds little value for this grouped outcome.
Grouped Models vs. Nested Logit:
The grouped models explicitly combine alternatives before estimation
The nested logit (LL = -991.22) maintains separate alternatives but models correlation
These log-likelihoods are not directly comparable because they use different outcome variables (3 categories vs. 5 categories). The nested logit appears to have a “worse” (more negative) log-likelihood, but this is misleading—it uses more detailed choice information.
The grouped models discard within-group variation by pre-aggregating choices (e.g., treating gc and gr as identical), while the nested logit preserves the original 5-category structure
Summary:
The grouped and nested models give similar substantive conclusions about cost effects (both show strong negative effects of ic and oc), supporting the validity of grouping if within-group distinctions are unimportant
The nested logit provides a more flexible approach that doesn’t require pre-grouping alternatives, but its energy-source version (Model 1 in (g)) suffered from misspecification issues (λ values outside the valid range)
For this application, if the distinction between gc/gr or ec/er is theoretically unimportant, the grouped MNL offers a simpler, well-specified alternative to the energy-source nested logit
3. Ordered Logits and Probits
(20 points total)
Load the Persistence_preferences_rural_Guatemala data from the replication files for the paper “Persistence of Individual and Social Preferences in Rural Settings.” This dataset contains a panel of 1,262 agricultural households in rural Guatemala, surveyed annually from 2019 to 2022. It includes self-reported preference indicators, household socioeconomic characteristics, and contextual variables.
trust_people_0: trust in people, scale 1–4 (1 = almost always can trust, 4 = almost always must be very careful)
head_compprim_above: household head completed elementary or above
head_male: household head is male
head_age: household head age
head_spanish: household head’s main language spoken is Spanish
t_TV_radio: household head owns TV or radio
HH_poor_PL2011: household is poor (below $1.90/day, 2011 PPP)
Tot_AgriLand_ha: household head agricultural land size (hectares)
Estimate an ordered logit model for trust in people (trust_people_0) as a function of household head education (head_compprim_above), sex (head_male), age (head_age), main language spoken (head_spanish), TV/radio ownership (t_TV_radio), poverty status (HH_poor_PL2011), and agricultural land size (Tot_AgriLand_ha). Report and interpret the coefficients. (5 points)
Call:
polr(formula = trust_people_0 ~ head_compprim_above + head_male +
head_age + head_spanish + t_TV_radio + HH_poor_PL2011 + Tot_AgriLand_ha,
data = guat, method = "logistic")
Coefficients:
Value Std. Error t value
head_compprim_above 0.184386 0.133410 1.3821
head_male -0.128022 0.165553 -0.7733
head_age 0.002724 0.004856 0.5610
head_spanish 0.171032 0.132548 1.2903
t_TV_radio -0.073130 0.132798 -0.5507
HH_poor_PL2011 0.088101 0.170908 0.5155
Tot_AgriLand_ha 0.104222 0.055393 1.8815
Intercepts:
Value Std. Error t value
1|2 -1.8900 0.3320 -5.6932
2|3 -1.3156 0.3280 -4.0114
3|4 1.5015 0.3289 4.5656
Residual Deviance: 2520.828
AIC: 2540.828
(71 observations deleted due to missingness)
Positive coefficients indicate higher values of the trust scale (i.e., less trust). The coefficient on head_compprim_above is 0.18, suggesting that household heads with completed elementary education or above have slightly lower trust (higher on the mistrust scale), though this effect is not significant (p ≈ 0.17).
The coefficient on Tot_AgriLand_ha is 0.1 (p ≈ 0.06), indicating that households with larger landholdings tend to have lower trust levels; the effect is significant at the 10% level but not at the 5% level. Each additional hectare of agricultural land is associated with a 0.1 unit increase in the log-odds of being in a higher (less trusting) category. Most other covariates show minimal and non-significant effects on trust.
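polr() prints t values but no p-values, so the approximate p-values quoted above come from a normal approximation. A minimal sketch using the t values from the output above:

```r
# t values copied from the ordered logit output above
t_vals <- c(head_compprim_above = 1.3821, Tot_AgriLand_ha = 1.8815)

# Two-sided p-values under the standard normal approximation
p_vals <- 2 * pnorm(-abs(t_vals))
round(p_vals, 3)
```

The same line applied to the full vector of t values gives the p-value for every coefficient.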
Estimate an ordered probit model for the same outcome and predictors. Compare the results to the ordered logit model. (5 points)
Call:
polr(formula = trust_people_0 ~ head_compprim_above + head_male +
head_age + head_spanish + t_TV_radio + HH_poor_PL2011 + Tot_AgriLand_ha,
data = guat, method = "probit")
Coefficients:
Value Std. Error t value
head_compprim_above 0.104388 0.076061 1.3724
head_male -0.056075 0.092379 -0.6070
head_age 0.001652 0.002744 0.6020
head_spanish 0.103172 0.075759 1.3619
t_TV_radio -0.039368 0.075521 -0.5213
HH_poor_PL2011 0.044139 0.095954 0.4600
Tot_AgriLand_ha 0.044384 0.028013 1.5844
Intercepts:
Value Std. Error t value
1|2 -1.1041 0.1852 -5.9604
2|3 -0.7946 0.1841 -4.3153
3|4 0.9093 0.1846 4.9271
Residual Deviance: 2521.934
AIC: 2541.934
(71 observations deleted due to missingness)
The ordered probit yields similar conclusions to the ordered logit, with coefficients differing in scale. To compare, we can divide the logit coefficients by approximately 1.6, a common rule of thumb; the ratio of the logistic to the normal standard deviation is \(\pi/\sqrt{3} \approx 1.81\). After scaling, the coefficients have the same signs and broadly similar magnitudes.
The substantive interpretations are the same across both models, and the t-statistics are close. The largest gap is for Tot_AgriLand_ha (t = 1.88 in the logit versus 1.58 in the probit), which is significant at the 10% level only in the logit. The choice between ordered logit and ordered probit is therefore largely a matter of preference for this application.
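The scale comparison can be made explicit by putting the two sets of coefficients side by side. The sketch below copies the estimates from the two outputs above; the 1.6 divisor is the rule-of-thumb scaling, and the column names are illustrative.

```r
# Coefficients copied from the ordered logit and ordered probit outputs above
logit <- c(head_compprim_above = 0.184386, head_male = -0.128022,
           head_age = 0.002724, head_spanish = 0.171032,
           t_TV_radio = -0.073130, HH_poor_PL2011 = 0.088101,
           Tot_AgriLand_ha = 0.104222)
probit <- c(head_compprim_above = 0.104388, head_male = -0.056075,
            head_age = 0.001652, head_spanish = 0.103172,
            t_TV_radio = -0.039368, HH_poor_PL2011 = 0.044139,
            Tot_AgriLand_ha = 0.044384)

comparison <- data.frame(
  logit        = logit,
  logit_scaled = logit / 1.6,     # rule-of-thumb rescaling to the probit scale
  probit       = probit,
  ratio        = logit / probit   # implied scale factor for each coefficient
)
round(comparison, 3)
```

The implied scale factor varies by coefficient, which is expected when the estimates are imprecise; only the signs and rough magnitudes should be compared.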
For a female household head, age 40, with completed elementary education, not poor, owns a TV or radio, main language spoken is Spanish, and owns 0.8 ha of agricultural land, calculate the predicted probability of each trust category using the ordered logit model. (5 points)
Predicted Trust Probabilities for Female Household Head

| Trust_Category | Probability |
|:---------------|------------:|
| 1 (High trust) |        0.09 |
| 2              |        0.06 |
| 3              |        0.59 |
| 4 (Low trust)  |        0.26 |
For this profile, the predicted probability of being in the lowest trust category (4 = “almost always must be very careful”) is 0.26, while the probability of the highest trust category (1 = “almost always can trust”) is 0.09. The modal category is category 3 with probability 0.59.
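These probabilities can be reproduced by hand from the printed coefficients and cutpoints, using the polr parameterization \(P(Y \le j) = \Lambda(\zeta_j - x'\beta)\). The sketch below is a base R illustration, not the original solution code; the helper trust_probs() is a hypothetical name introduced here.

```r
# Coefficients and cutpoints copied from the ordered logit output above
beta <- c(head_compprim_above = 0.184386, head_male = -0.128022,
          head_age = 0.002724, head_spanish = 0.171032,
          t_TV_radio = -0.073130, HH_poor_PL2011 = 0.088101,
          Tot_AgriLand_ha = 0.104222)
cuts <- c(-1.8900, -1.3156, 1.5015)    # intercepts 1|2, 2|3, 3|4

# Hypothetical helper: category probabilities for one covariate profile
trust_probs <- function(x) {
  eta <- sum(beta[names(x)] * x)       # linear index x'beta
  cum <- c(plogis(cuts - eta), 1)      # cumulative probabilities P(Y <= j)
  probs <- diff(c(0, cum))             # category probabilities
  names(probs) <- 1:4
  probs
}

female <- c(head_compprim_above = 1, head_male = 0, head_age = 40,
            head_spanish = 1, t_TV_radio = 1, HH_poor_PL2011 = 0,
            Tot_AgriLand_ha = 0.8)
male <- replace(female, "head_male", 1)   # same profile with a male head

p_female <- trust_probs(female)
p_male   <- trust_probs(male)
round(rbind(female = p_female, male = p_male), 4)
```

Because the inputs are the rounded estimates printed by summary(), the results can differ from predict() on the fitted object in the last decimal place.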
Discuss how the predicted probabilities change if the household head is male, holding other characteristics the same. (5 points)
Solution (d)
Code
newdata_male <- newdata_female
newdata_male$head_male <- 1
probs_male <- predict(fit_logit, newdata_male, type = "probs")
Code
knitr::kable(prob_comparison, digits = 2,
             caption = "Comparison of Trust Probabilities by Gender")
Comparison of Trust Probabilities by Gender

| Trust_Category | Female | Male | Difference |
|:---------------|-------:|-----:|-----------:|
| 1 (High trust) |   0.09 | 0.10 |       0.01 |
| 2              |   0.06 | 0.06 |       0.01 |
| 3              |   0.59 | 0.60 |       0.01 |
| 4 (Low trust)  |   0.26 | 0.24 |      -0.02 |
Being male is associated with slightly higher trust (lower values on the mistrust scale). Male household heads have a predicted probability of 23.96% for the lowest trust category versus 26.37% for females (a difference of -2.41 percentage points). The probability of high trust (category 1) increases from 8.59% to 9.65% (1.06 percentage points).
However, the coefficient on head_male is not statistically significant (t = -0.77), suggesting this difference could be due to sampling variability rather than a true gender effect on trust. The small magnitude of the differences (at most about 2.4 percentage points in absolute value) further indicates that gender has minimal impact on trust in this population.