Problem Set 2

Handed out: February 2, 2026 | Due: February 11, 2026

Submit on Gradescope

Download heating.dta from Canvas

Download Persistence_preferences_rural_Guatemala.dta from Canvas

1. GMM and 2SLS

(30 points total)

Suppose we have the linear equation \(Y = X' \beta+e\) with two sets of instruments \(Z_1\) and \(Z_2\). Then consider the following estimators of \(\beta\):

\[ \begin{aligned} \hat{\beta} &: \text{2SLS using the instruments } Z_1 \\ \tilde{\beta} &: \text{2SLS using the instruments } Z_2 \\ \bar{\beta} &: \text{GMM using the instruments } Z=(Z_1,Z_2) \\ &\text{ and the weight matrix } \pmb{W} = \begin{pmatrix} (\pmb{Z_1' Z_1})^{-1} \lambda & 0 \\ 0 & (\pmb{Z_2' Z_2})^{-1} (1-\lambda) \end{pmatrix} \end{aligned} \]

for \(\lambda \in (0,1)\).

  1. Find an expression for \(\bar{\beta}\) which shows that it is a specific weighted average of \(\hat{\beta}\) and \(\tilde{\beta}\). (28 points)

First, we can describe \(\hat{\beta}, \tilde{\beta}, \bar{\beta}\): \[ \begin{aligned} \hat{\beta} &= (X' P_{Z_1} X)^{-1} X' P_{Z_1} Y \\ \tilde{\beta} &= (X' P_{Z_2} X)^{-1} X' P_{Z_2} Y \\ \bar{\beta} &= (X' ZWZ' X)^{-1} X' ZWZ' Y \\ \end{aligned} \]

Next, expand \((X' ZWZ' X)\): \[ \begin{aligned} (X' ZWZ' X) &= X' (Z_1 Z_2) \begin{pmatrix} (Z_1' Z_1)^{-1} \lambda & 0 \\ 0 & (Z_2' Z_2)^{-1} (1-\lambda) \end{pmatrix} \begin{pmatrix} Z_1' \\ Z_2' \end{pmatrix} X \\ &= \begin{pmatrix} X'Z_1 & X'Z_2 \end{pmatrix} \begin{pmatrix} (Z_1' Z_1)^{-1} \lambda & 0 \\ 0 & (Z_2' Z_2)^{-1} (1-\lambda) \end{pmatrix} \begin{pmatrix} Z_1' X \\ Z_2' X \end{pmatrix} \\ &= \lambda (X' Z_1 (Z_1' Z_1)^{-1} Z_1' X) + (1-\lambda) (X' Z_2 (Z_2' Z_2)^{-1} Z_2' X) \\ &= \lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X) \end{aligned} \]

Similarly, \((X' ZWZ' Y) = \lambda (X' P_{Z_1} Y) + (1-\lambda) (X' P_{Z_2} Y)\).

Substitute this into \(\bar{\beta}\), \[ \begin{aligned} \bar{\beta} =& (\lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X))^{-1} (\lambda (X' P_{Z_1} Y) + (1-\lambda) (X' P_{Z_2} Y)) \\ =& \lambda (\lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X))^{-1} X' P_{Z_1} Y \\ &+ (1-\lambda) (\lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X))^{-1} X' P_{Z_2} Y \\ =& \lambda (\lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X))^{-1} (X' P_{Z_1} X) (X' P_{Z_1} X)^{-1} X' P_{Z_1} Y \\ &+ (1-\lambda) (\lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X))^{-1} (X' P_{Z_2} X) (X' P_{Z_2} X)^{-1} X' P_{Z_2} Y \\ =& \lambda (\lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X))^{-1} (X' P_{Z_1} X) \hat{\beta} \\ &+ (1-\lambda) (\lambda (X' P_{Z_1} X) + (1-\lambda) (X' P_{Z_2} X))^{-1} (X' P_{Z_2} X) \tilde{\beta} \\ \end{aligned} \]

This shows that \(\bar{\beta}\) is a matrix-weighted average of \(\hat{\beta}\) and \(\tilde{\beta}\), where the weights depend on \(\lambda\) and the projection matrices:

\[ \begin{aligned} w_1 &= \lambda(\lambda(X'P_{Z_1}X) + (1-\lambda)(X'P_{Z_2}X))^{-1}(X'P_{Z_1}X) \\ w_2 &= (1-\lambda)(\lambda(X'P_{Z_1}X) + (1-\lambda)(X'P_{Z_2}X))^{-1}(X'P_{Z_2}X) \end{aligned} \]

  1. Is this an efficient weight matrix to use for GMM? (Short answer, 1-2 sentences) (2 points)

No, this is not efficient. The optimal GMM weight matrix should be \(W = (E[ZZ'e^2])^{-1}\), which requires estimating the error variance. This weight matrix uses only \((Z'Z)^{-1}\) scaled by \(\lambda\), ignoring heteroskedasticity in the errors.

2. Multinomial Logits and Probits

(50 points total)

Load the heating data, which contains a sample of 900 Californian households and their choice of heating system.

  • idcase: id
  • depvar: heating system
    • gc (gas central)
    • gr (gas room)
    • ec (electric central)
    • er (electric room)
    • hp (heat pump)
  • ic.z: installation cost for heating system z (defined for the 5 heating systems)
  • oc.z: annual operating cost for heating system z (defined for the 5 heating systems)
  • income: annual income of the household
  • agehed: age of the household head
  • rooms: numbers of rooms in the house
  • region
  1. Estimate a multinomial logit model for heating system choice using only house characteristics. Report and interpret the coefficients. (6 points)
Code
library(mlogit)
data("Heating", package = "mlogit")
Heating_mlogit <- dfidx(Heating, choice = "depvar", shape = "wide", varying = 3:12)
fit_mnl <- mlogit(depvar ~ 0 | rooms + region + income + agehed, data = Heating_mlogit)
summary(fit_mnl)

Call:
mlogit(formula = depvar ~ 0 | rooms + region + income + agehed, 
    data = Heating_mlogit, method = "nr")

Frequencies of alternatives:choice
      ec       er       gc       gr       hp 
0.071111 0.093333 0.636667 0.143333 0.055556 

nr method
6 iterations, 0h:0m:0s 
g'(-H)^-1g = 0.000229 
successive function values within tolerance limits 

Coefficients :
                  Estimate Std. Error z-value  Pr(>|z|)    
(Intercept):er   1.7280834  0.9034935  1.9127 0.0557905 .  
(Intercept):gc   2.6574570  0.7314051  3.6334 0.0002798 ***
(Intercept):gr   1.7397017  0.8403717  2.0702 0.0384376 *  
(Intercept):hp   0.6962075  1.0272072  0.6778 0.4979193    
rooms:er        -0.0280525  0.0957705 -0.2929 0.7695878    
rooms:gc        -0.0651531  0.0758723 -0.8587 0.3904947    
rooms:gr        -0.0735756  0.0883564 -0.8327 0.4050063    
rooms:hp        -0.0484459  0.1086953 -0.4457 0.6558115    
regionscostl:er  0.0643634  0.4481357  0.1436 0.8857967    
regionscostl:gc  0.0750504  0.3619851  0.2073 0.8357520    
regionscostl:gr  0.1055048  0.4145433  0.2545 0.7991027    
regionscostl:hp -0.1414429  0.5011087 -0.2823 0.7777442    
regionmountn:er -0.0299505  0.5923048 -0.0506 0.9596713    
regionmountn:gc -0.1250734  0.4784022 -0.2614 0.7937533    
regionmountn:gr -0.0121556  0.5470234 -0.0222 0.9822714    
regionmountn:hp -0.0720965  0.6557692 -0.1099 0.9124554    
regionncostl:er -0.3606209  0.4953449 -0.7280 0.4666014    
regionncostl:gc  0.2193895  0.3841963  0.5710 0.5679759    
regionncostl:gr -0.3505868  0.4554936 -0.7697 0.4414865    
regionncostl:hp -0.4345249  0.5551394 -0.7827 0.4337850    
income:er       -0.0346964  0.0993531 -0.3492 0.7269223    
income:gc       -0.0072686  0.0787568 -0.0923 0.9264663    
income:gr       -0.1205758  0.0914321 -1.3187 0.1872537    
income:hp        0.0602717  0.1138138  0.5296 0.5964141    
agehed:er       -0.0263460  0.0120109 -2.1935 0.0282714 *  
agehed:gc       -0.0050147  0.0094277 -0.5319 0.5947854    
agehed:gr       -0.0027106  0.0109661 -0.2472 0.8047674    
agehed:hp       -0.0197832  0.0136043 -1.4542 0.1458948    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -1008.6
McFadden R^2:  0.013365 
Likelihood ratio test : chisq = 27.325 (p.value = 0.28955)

The coefficient on agehed:er is -0.03, indicating that each additional year of age decreases the log-odds of choosing electric room heating versus gas central by 0.03 (p = 0.03). This suggests older household heads are significantly less likely to choose electric room systems. Similarly, the coefficient on income:gr is -0.12, suggesting that higher-income households are less likely to choose gas room heating relative to gas central, though this effect is not statistically significant (p = 0.19).

  1. Estimate a conditional logit model using only the system-specific variables. Report and interpret the coefficients. (6 points)
Code
fit_cl <- mlogit(depvar ~ ic + oc, data = Heating_mlogit)
summary(fit_cl)

Call:
mlogit(formula = depvar ~ ic + oc, data = Heating_mlogit, method = "nr")

Frequencies of alternatives:choice
      ec       er       gc       gr       hp 
0.071111 0.093333 0.636667 0.143333 0.055556 

nr method
6 iterations, 0h:0m:0s 
g'(-H)^-1g = 9.58E-06 
successive function values within tolerance limits 

Coefficients :
                  Estimate  Std. Error z-value  Pr(>|z|)    
(Intercept):er  0.19459102  0.20424212  0.9527 0.3407184    
(Intercept):gc  0.05213336  0.46598878  0.1119 0.9109210    
(Intercept):gr -1.35058266  0.50715442 -2.6631 0.0077434 ** 
(Intercept):hp -1.65884594  0.44841936 -3.6993 0.0002162 ***
ic             -0.00153315  0.00062086 -2.4694 0.0135333 *  
oc             -0.00699637  0.00155408 -4.5019 6.734e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -1008.2
McFadden R^2:  0.013691 
Likelihood ratio test : chisq = 27.99 (p.value = 8.3572e-07)

The coefficients for installation and operating costs are generic across all alternatives. The coefficient on ic is -0.0015 (p = 0.01), meaning that a $1 increase in installation cost decreases the log-odds of choosing that system by 0.0015. The coefficient on oc is -0.007 (p < 0.001), indicating that a $1 increase in annual operating cost decreases the log-odds by 0.007. Both cost variables significantly reduce the probability of choosing more expensive systems.

  1. Estimate a mixed logit model including both house and system-specific variables. Compare the results to (a) and (b). (6 points)
Code
fit_mixed <- mlogit(depvar ~ ic + oc | rooms + region + income + agehed, data = Heating_mlogit)
summary(fit_mixed)

Call:
mlogit(formula = depvar ~ ic + oc | rooms + region + income + 
    agehed, data = Heating_mlogit, method = "nr")

Frequencies of alternatives:choice
      ec       er       gc       gr       hp 
0.071111 0.093333 0.636667 0.143333 0.055556 

nr method
6 iterations, 0h:0m:0s 
g'(-H)^-1g = 0.000205 
successive function values within tolerance limits 

Coefficients :
                   Estimate  Std. Error z-value  Pr(>|z|)    
(Intercept):er   1.65712020  0.91742490  1.8063   0.07088 .  
(Intercept):gc   0.44487214  0.87015831  0.5113   0.60917    
(Intercept):gr  -0.39561437  0.98337519 -0.4023   0.68746    
(Intercept):hp  -0.74157651  1.11159204 -0.6671   0.50469    
ic              -0.00151382  0.00062611 -2.4178   0.01561 *  
oc              -0.00695406  0.00156315 -4.4488 8.637e-06 ***
rooms:er        -0.02576373  0.09593158 -0.2686   0.78827    
rooms:gc        -0.05220447  0.07600116 -0.6869   0.49215    
rooms:gr        -0.06841328  0.08843305 -0.7736   0.43916    
rooms:hp        -0.04383822  0.10872170 -0.4032   0.68679    
regionscostl:er  0.01082503  0.45032066  0.0240   0.98082    
regionscostl:gc  0.04505269  0.36430136  0.1237   0.90158    
regionscostl:gr  0.06940405  0.41659167  0.1666   0.86769    
regionscostl:hp -0.17499333  0.50256236 -0.3482   0.72769    
regionmountn:er -0.11264645  0.59541445 -0.1892   0.84994    
regionmountn:gc -0.12100751  0.48215126 -0.2510   0.80183    
regionmountn:gr -0.00526292  0.54984361 -0.0096   0.99236    
regionmountn:hp -0.07507769  0.65815292 -0.1141   0.90918    
regionncostl:er -0.35024905  0.49737039 -0.7042   0.48131    
regionncostl:gc  0.24880760  0.38678432  0.6433   0.52005    
regionncostl:gr -0.32716929  0.45772075 -0.7148   0.47475    
regionncostl:hp -0.39192623  0.55673253 -0.7040   0.48145    
income:er       -0.03806051  0.09991780 -0.3809   0.70326    
income:gc       -0.00683752  0.07913516 -0.0864   0.93115    
income:gr       -0.11680820  0.09175928 -1.2730   0.20302    
income:hp        0.05946817  0.11392698  0.5220   0.60168    
agehed:er       -0.02575406  0.01209345 -2.1296   0.03321 *  
agehed:gc       -0.00433579  0.00946517 -0.4581   0.64690    
agehed:gr       -0.00128269  0.01103055 -0.1163   0.90743    
agehed:hp       -0.01939645  0.01363131 -1.4229   0.15476    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -994.95
McFadden R^2:  0.026686 
Likelihood ratio test : chisq = 54.557 (p.value = 0.00086317)
Code
# Formal model comparisons
lr_mnl_mixed <- lrtest(fit_mnl, fit_mixed)
lr_cl_mixed <- lrtest(fit_cl, fit_mixed)

Comparison to (a): Adding system-specific costs (ic, oc) significantly improves model fit (χ² = 27.23, p < 0.001). The household characteristic coefficients remain stable: the age coefficient for electric room heating is -0.0263 in model (a) versus -0.0258 in model (c), suggesting robust demographic effects independent of costs.

Comparison to (b): Adding household characteristics does not significantly improve fit over the conditional logit (χ² = 26.57, p = 0.33). The cost coefficients remain nearly identical: ic is -0.0015 in model (b) versus -0.0015 in model (c), and oc is -0.007 versus -0.007. This suggests that system costs are the primary drivers of heating choice, with household demographics adding minimal explanatory power.

  1. Estimate a mixed logit model that allows installation and operating costs to have alternative-specific effects. Test whether this more flexible specification is warranted compared to the mixed model in (c). (7 points)
Code
# Flexible model with alternative-specific cost coefficients
fit_mixed_flex <- mlogit(depvar ~ 0 | rooms + region + income + agehed | ic + oc, 
                         data = Heating_mlogit)
summary(fit_mixed_flex)

Call:
mlogit(formula = depvar ~ 0 | rooms + region + income + agehed | 
    ic + oc, data = Heating_mlogit, method = "nr")

Frequencies of alternatives:choice
      ec       er       gc       gr       hp 
0.071111 0.093333 0.636667 0.143333 0.055556 

nr method
6 iterations, 0h:0m:0s 
g'(-H)^-1g = 0.000535 
successive function values within tolerance limits 

Coefficients :
                   Estimate  Std. Error z-value Pr(>|z|)  
(Intercept):er   2.22221854  1.50650781  1.4751  0.14019  
(Intercept):gc  -0.01040972  1.22055849 -0.0085  0.99320  
(Intercept):gr  -1.23945973  1.40588070 -0.8816  0.37798  
(Intercept):hp  -0.51818062  1.72517771 -0.3004  0.76390  
rooms:er        -0.02505955  0.09607934 -0.2608  0.79423  
rooms:gc        -0.05681909  0.07620062 -0.7457  0.45588  
rooms:gr        -0.07111573  0.08865102 -0.8022  0.42244  
rooms:hp        -0.04388533  0.10888719 -0.4030  0.68692  
regionscostl:er  0.00605322  0.45055399  0.0134  0.98928  
regionscostl:gc  0.04483994  0.36446916  0.1230  0.90208  
regionscostl:gr  0.07771509  0.41721815  0.1863  0.85223  
regionscostl:hp -0.16915633  0.50253470 -0.3366  0.73641  
regionmountn:er -0.10592193  0.59606970 -0.1777  0.85896  
regionmountn:gc -0.10733199  0.48287704 -0.2223  0.82410  
regionmountn:gr  0.00594645  0.55098349  0.0108  0.99139  
regionmountn:hp -0.06477909  0.65838562 -0.0984  0.92162  
regionncostl:er -0.36825933  0.49779443 -0.7398  0.45943  
regionncostl:gc  0.24947847  0.38710750  0.6445  0.51927  
regionncostl:gr -0.31688082  0.45830881 -0.6914  0.48931  
regionncostl:hp -0.39573308  0.55666593 -0.7109  0.47715  
income:er       -0.04587284  0.10033287 -0.4572  0.64752  
income:gc       -0.00630219  0.07939732 -0.0794  0.93673  
income:gr       -0.11450041  0.09205246 -1.2439  0.21355  
income:hp        0.05922721  0.11414404  0.5189  0.60384  
agehed:er       -0.02554338  0.01210271 -2.1106  0.03481 *
agehed:gc       -0.00406595  0.00947229 -0.4292  0.66774  
agehed:gr       -0.00125802  0.01102922 -0.1141  0.90919  
agehed:hp       -0.01957427  0.01363789 -1.4353  0.15121  
ic:ec           -0.00179901  0.00153614 -1.1711  0.24155  
ic:er           -0.00288304  0.00115485 -2.4965  0.01254 *
ic:gc           -0.00158026  0.00098137 -1.6103  0.10734  
ic:gr           -0.00019242  0.00106365 -0.1809  0.85644  
ic:hp           -0.00078114  0.00139291 -0.5608  0.57493  
oc:ec           -0.00595433  0.00259597 -2.2937  0.02181 *
oc:er           -0.00456550  0.00262519 -1.7391  0.08202 .
oc:gc           -0.00256492  0.00428889 -0.5980  0.54981  
oc:gr           -0.00783141  0.00647471 -1.2095  0.22646  
oc:hp           -0.01036428  0.00656819 -1.5780  0.11458  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -992.12
McFadden R^2:  0.02945 
Likelihood ratio test : chisq = 60.21 (p.value = 0.0036758)
Code
# Compare to the restricted model from (c)
lr_test_result <- lrtest(fit_mixed, fit_mixed_flex)

The LR test (χ² = 5.65, p = 0.69) fails to reject the null hypothesis that cost effects are equal across alternatives. The generic specification from (c) is preferred because there’s insufficient evidence that an increase in installation cost has different effects on choosing a heat pump versus electric central heating. The simpler model with common cost coefficients is adequate for these data.

  1. Using the model from part (c), calculate the marginal effects of household income on the probability of choosing each heating system. Evaluate these at the mean values of all other covariates. Interpret your results. (5 points)
Code
# Create data at mean values
z <- with(Heating, data.frame(
  ic = c(mean(ic.gc), mean(ic.gr), mean(ic.ec), mean(ic.er), mean(ic.hp)),
  oc = c(mean(oc.gc), mean(oc.gr), mean(oc.ec), mean(oc.er), mean(oc.hp)),
  rooms = mean(rooms),
  region = names(sort(table(region), decreasing=TRUE))[1],
  income = mean(income),
  agehed = mean(agehed)
))
z$region <- factor(z$region, levels=levels(Heating$region))

# Calculate marginal effects of income
me_income <- effects(fit_mixed, covariate = "income", data = z)

# Visualize
me_df <- data.frame(
  Alternative = names(me_income),
  MarginalEffect = as.numeric(me_income)
)

ggplot(me_df, aes(x = Alternative, y = MarginalEffect)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = round(MarginalEffect, 3)), vjust = ifelse(me_df$MarginalEffect >= 0, -0.3, 1.3), size = 3.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Marginal Effect of Income on Heating System Choice",
    subtitle = "Evaluated at mean of other covariates",
    y = "Change in Probability (per $1000 income increase)",
    x = "Heating System") +
  theme_minimal()

The marginal effects show how a $1 increase in household income changes the probability of choosing each heating system, holding all other variables at their means. The effects are small but must sum to zero across all alternatives (since probabilities sum to 1).

Heat pumps have a positive marginal effect of 0.0029, meaning a $1 income increase raises the probability of choosing a heat pump by 0.29 percentage points, suggesting heat pumps are a “normal good” preferred by higher-income households. Electric room heating shows the largest negative effect (-0.0096), with electric central (0.0076) also positive. Gas central shows minimal change (6^{-4}). The pattern suggests higher-income households shift away from room-based systems toward central heating, particularly heat pumps.

  1. Using the same model, calculate how the probability of choosing a heat pump changes if its installation cost decreases by 20%. Compare this to the marginal effect of installation cost. (5 points)
Code
# Baseline prediction at means
alts <- c("gc", "gr", "ec", "er", "hp")
baseline_data <- data.frame(
  depvar = alts,
  ic = c(mean(Heating$ic.gc), mean(Heating$ic.gr), mean(Heating$ic.ec), 
         mean(Heating$ic.er), mean(Heating$ic.hp)),
  oc = c(mean(Heating$oc.gc), mean(Heating$oc.gr), mean(Heating$oc.ec), 
         mean(Heating$oc.er), mean(Heating$oc.hp)),
  rooms = mean(Heating$rooms),
  region = names(sort(table(Heating$region), decreasing=TRUE))[1],
  income = mean(Heating$income),
  agehed = mean(Heating$agehed)
)
baseline_data$region <- factor(baseline_data$region, levels=levels(Heating$region))

# Scenario: 20% reduction in heat pump installation cost
policy_data <- baseline_data
policy_data$ic[policy_data$depvar == "hp"] <- 
  policy_data$ic[policy_data$depvar == "hp"] * 0.8

# Calculate probabilities
probs_baseline <- predict(fit_mixed, newdata = baseline_data, type = "probabilities")
probs_policy <- predict(fit_mixed, newdata = policy_data, type = "probabilities")

# Change in heat pump probability
hp_baseline <- probs_baseline["hp"]
hp_policy <- probs_policy["hp"]
hp_change <- hp_policy - hp_baseline
actual_ic_change <- policy_data$ic[policy_data$depvar == "hp"] - 
                    baseline_data$ic[policy_data$depvar == "hp"]

# Calculate marginal effect manually for alternative-specific variable
beta_ic <- coef(fit_mixed)["ic"]
me_ic_hp <- beta_ic * hp_baseline * (1 - hp_baseline)
predicted_change_hp <- me_ic_hp * actual_ic_change

Heat pump probability at baseline: 0.04

Heat pump probability with 20% IC reduction: 0.05

Absolute change: 0.01 (a 35.4% increase)

Actual IC change: $-209.3

Comparison of approaches:

Marginal effect approach predicts change of: 0.0113

Simulation approach shows actual change of: 0.0131

Difference (due to nonlinearity): 0.0018

The policy simulation shows the finite change in probability from a substantial (20%) cost reduction. The marginal effect (0.0113) provides a local linear approximation that closely matches the simulation result (0.0131), with a small difference (0.0018) due to the nonlinearity of the logit model. For this moderately large change, both approaches give similar results, though the simulation is more accurate. For smaller policy changes, marginal effects provide excellent approximations.

  1. Estimate two nested logit models for heating system choice. For each model: (i) report and interpret the inclusive value (dissimilarity) parameters for each nest, and (ii) test whether each inclusive value parameter is statistically different from 1 and explain how to determine if the nested structure is justified (show your code and results). (7 points)
  • Model 1: Nests by energy source: (1) gas (gc, gr), (2) electric (ec, er, hp)
  • Model 2: Nests by system type: (1) room systems (gr, er), (2) central systems (gc, ec, hp)
Code
# Model 1: Nests by energy source
nests1 <- list(
  gas = c("gc", "gr"),
  electric = c("ec", "er", "hp")
)
fit_nested1 <- mlogit(depvar ~ ic + oc | rooms + region + income + agehed, 
                      data = Heating_mlogit, nests = nests1)
summary(fit_nested1)

Call:
mlogit(formula = depvar ~ ic + oc | rooms + region + income + 
    agehed, data = Heating_mlogit, nests = nests1)

Frequencies of alternatives:choice
      ec       er       gc       gr       hp 
0.071111 0.093333 0.636667 0.143333 0.055556 

bfgs method
32 iterations, 0h:0m:0s 
g'(-H)^-1g = 8.08E-07 
gradient close to zero 

Coefficients :
                  Estimate Std. Error z-value  Pr(>|z|)    
(Intercept):er   2.7479260  2.0091893  1.3677 0.1714126    
(Intercept):gc  -1.1984881  4.0700520 -0.2945 0.7684026    
(Intercept):gr  -8.5459905 15.7172959 -0.5437 0.5866262    
(Intercept):hp  -0.6274016  1.9373954 -0.3238 0.7460610    
ic              -0.0031745  0.0010661 -2.9777 0.0029043 ** 
oc              -0.0088668  0.0023394 -3.7902 0.0001505 ***
rooms:er        -0.0516856  0.1772146 -0.2917 0.7705499    
rooms:gc        -0.0760807  0.1469274 -0.5178 0.6045899    
rooms:gr        -0.0552023  0.3736915 -0.1477 0.8825626    
rooms:hp        -0.0830879  0.2087238 -0.3981 0.6905741    
regionscostl:er  0.1010795  0.7746066  0.1305 0.8961777    
regionscostl:gc  0.0477840  0.6955701  0.0687 0.9452303    
regionscostl:gr  0.2246164  1.7283593  0.1300 0.8965986    
regionscostl:hp -0.2127078  0.8590362 -0.2476 0.8044345    
regionmountn:er -0.3250167  1.0155552 -0.3200 0.7489391    
regionmountn:gc -0.3995104  1.0071212 -0.3967 0.6915993    
regionmountn:gr  0.6854109  2.5422520  0.2696 0.7874620    
regionmountn:hp -0.1144052  1.1318936 -0.1011 0.9194916    
regionncostl:er -0.6594775  0.8806322 -0.7489 0.4539366    
regionncostl:gc  0.7002131  1.2540343  0.5584 0.5765928    
regionncostl:gr -3.6758443  5.5906039 -0.6575 0.5108569    
regionncostl:hp -0.5869461  1.0064257 -0.5832 0.5597596    
income:er       -0.0382816  0.1725618 -0.2218 0.8244363    
income:gc        0.1397929  0.2335681  0.5985 0.5494994    
income:gr       -0.6676505  0.9444758 -0.7069 0.4796283    
income:hp        0.1186132  0.1979989  0.5991 0.5491330    
agehed:er       -0.0391520  0.0273393 -1.4321 0.1521208    
agehed:gc       -0.0119770  0.0204779 -0.5849 0.5586339    
agehed:gr       -0.0094012  0.0475383 -0.1978 0.8432332    
agehed:hp       -0.0293793  0.0263954 -1.1130 0.2656873    
iv:gas           8.0604282 11.8093681  0.6825 0.4948942    
iv:electric      1.6908830  0.7697774  2.1966 0.0280500 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -991.22
McFadden R^2:  0.030331 
Likelihood ratio test : chisq = 62.011 (p.value = 0.00022402)
Code
# Model 2: Nests by system type
nests2 <- list(
  room = c("gr", "er"),
  central = c("gc", "ec", "hp")
)
fit_nested2 <- mlogit(depvar ~ ic + oc | rooms + region + income + agehed, 
                      data = Heating_mlogit, nests = nests2)
summary(fit_nested2)

Call:
mlogit(formula = depvar ~ ic + oc | rooms + region + income + 
    agehed, data = Heating_mlogit, nests = nests2)

Frequencies of alternatives:choice
      ec       er       gc       gr       hp 
0.071111 0.093333 0.636667 0.143333 0.055556 

bfgs method
35 iterations, 0h:0m:1s 
g'(-H)^-1g =  3.49 
last step couldn't find higher value 

Coefficients :
                   Estimate  Std. Error z-value Pr(>|z|)  
(Intercept):er  -0.28123910  0.46565968 -0.6040  0.54587  
(Intercept):gc   0.00576324  0.05305657  0.1086  0.91350  
(Intercept):gr  -0.42412866  0.45754474 -0.9270  0.35394  
(Intercept):hp  -0.03698318  0.06721872 -0.5502  0.58219  
ic              -0.00014668  0.00010227 -1.4343  0.15149  
oc              -0.00040744  0.00026614 -1.5309  0.12579  
rooms:er         0.00050418  0.04562206  0.0111  0.99118  
rooms:gc        -0.00249469  0.00500943 -0.4980  0.61848  
rooms:gr        -0.00104067  0.04560075 -0.0228  0.98179  
rooms:hp        -0.00180377  0.00647196 -0.2787  0.78047  
regionscostl:er  0.03121694  0.21515351  0.1451  0.88464  
regionscostl:gc -0.00101978  0.02029818 -0.0502  0.95993  
regionscostl:gr  0.04088605  0.21472747  0.1904  0.84899  
regionscostl:hp -0.01343961  0.02846143 -0.4722  0.63678  
regionmountn:er  0.08025465  0.28453495  0.2821  0.77790  
regionmountn:gc -0.00619686  0.02650717 -0.2338  0.81516  
regionmountn:gr  0.08415093  0.28392910  0.2964  0.76694  
regionmountn:hp -0.00574833  0.03516411 -0.1635  0.87015  
regionncostl:er -0.50411564  0.24302507 -2.0743  0.03805 *
regionncostl:gc  0.01098884  0.02365148  0.4646  0.64221  
regionncostl:gr -0.49330614  0.24248868 -2.0343  0.04192 *
regionncostl:hp -0.02330291  0.03387800 -0.6878  0.49155  
income:er       -0.08566792  0.04794947 -1.7866  0.07400 .
income:gc       -0.00051513  0.00429831 -0.1198  0.90461  
income:gr       -0.08540383  0.04776485 -1.7880  0.07378 .
income:hp        0.00350815  0.00651635  0.5384  0.59033  
agehed:er       -0.00715292  0.00603213 -1.1858  0.23570  
agehed:gc       -0.00014915  0.00059344 -0.2513  0.80156  
agehed:gr       -0.00607883  0.00600091 -1.0130  0.31107  
agehed:hp       -0.00100781  0.00098953 -1.0185  0.30845  
iv:room          0.04215850  0.02475624  1.7029  0.08858 .
iv:central       0.05206765  0.03643434  1.4291  0.15298  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -990.28
McFadden R^2:  0.031252 
Likelihood ratio test : chisq = 63.893 (p.value = 0.00012662)
Code
# Test inclusive value parameters for Model 1
est_iv1 <- coef(fit_nested1)[grep("^iv:", names(coef(fit_nested1)))]
se_iv1 <- sqrt(diag(vcov(fit_nested1)))[grep("^iv:", names(coef(fit_nested1)))]
z_iv1 <- (est_iv1 - 1) / se_iv1
p_iv1 <- 2 * pnorm(-abs(z_iv1))

iv1_results <- data.frame(
  Nest = names(est_iv1), 
  Estimate = est_iv1, 
  Std.Error = se_iv1, 
  z.value = z_iv1, 
  p.value = p_iv1,
  row.names = NULL
)
knitr::kable(iv1_results, digits = 2, caption = "Model 1 - Energy Source Nests")
Model 1 - Energy Source Nests
Nest Estimate Std.Error z.value p.value
iv:gas 8.06 11.81 0.6 0.55
iv:electric 1.69 0.77 0.9 0.37
Code
# Test inclusive value parameters for Model 2
est_iv2 <- coef(fit_nested2)[grep("^iv:", names(coef(fit_nested2)))]
se_iv2 <- sqrt(diag(vcov(fit_nested2)))[grep("^iv:", names(coef(fit_nested2)))]
z_iv2 <- (est_iv2 - 1) / se_iv2
p_iv2 <- 2 * pnorm(-abs(z_iv2))

iv2_results <- data.frame(
  Nest = names(est_iv2), 
  Estimate = est_iv2, 
  Std.Error = se_iv2, 
  z.value = z_iv2, 
  p.value = p_iv2,
  row.names = NULL
)
knitr::kable(iv2_results, digits = 2, caption = "Model 2 - System Type Nests")
Model 2 - System Type Nests
Nest Estimate Std.Error z.value p.value
iv:room 0.04 0.02 -38.69 0
iv:central 0.05 0.04 -26.02 0
Code
# Compare nested models to standard multinomial logit
lr_nested1 <- lrtest(fit_mixed, fit_nested1)
lr_nested2 <- lrtest(fit_mixed, fit_nested2)

The inclusive value (IV) parameter λ for each nest measures the degree of independence among alternatives within that nest:

  • λ = 1: Alternatives in the nest are completely independent (IIA holds, reduces to MNL)
  • 0 < λ < 1: Alternatives in the nest are correlated in unobserved utility
  • λ closer to 0: Higher correlation within nest

Model 1 shows evidence of misspecification: The gas nest has λ = 8.06 (SE = 11.81), well above 1, and the electric nest has λ = 1.69 (p = 0.37). Values of λ > 1 violate the random utility maximization framework and indicate the nesting structure is fundamentally incorrect. Neither IV parameter is significantly different from 1 (p > 0.05), suggesting the energy source grouping does not capture meaningful correlation patterns.

Model 2 shows a valid nested structure: Both IV parameters (0.04 and 0.05) fall within the valid range of (0,1) and are significantly different from 1 (both p < 0.001), providing strong evidence that the nesting structure is appropriate. The λ values close to 0 indicate very high correlation within nests, meaning room systems (gr, er) share substantial unobserved attributes, as do central systems (gc, ec, hp). This does not violate theoretical foundations—rather, it reveals important substitution patterns that the standard multinomial logit cannot capture.

Determining if Nested Structure is Justified:

  1. Check validity: 0 < λ < 1
    • Model 1: ✗ (λ > 1 for both nests)
    • Model 2: ✓ (both λ values in valid range)
  2. Test H₀: λ = 1 (if rejected, nesting is justified):
    • Model 1: Cannot reject (p > 0.05) → nesting not justified
    • Model 2: Strongly reject (p < 0.001) → nesting IS justified
  3. Likelihood ratio test vs. MNL:
    • Model 1 vs. MNL: χ² = 7.45, p = 0.02
    • Model 2 vs. MNL: χ² = 9.34, p = 0.01

Conclusion: Model 1 (energy source nesting) is invalid due to λ > 1, while Model 2 (system type nesting) is well-specified with valid IV parameters that are significantly different from 1. The system type structure reveals that households view room-based heating systems as close substitutes, and similarly for central systems. Model 2 should be preferred over the standard multinomial logit, as it better captures the actual substitution patterns in heating system choice.

  1. Combine the five heating systems into three categories: gas (gc, gr), electric (ec, er), and heat pump (hp). Estimate both a mixed multinomial probit and a mixed multinomial logit model for this grouped outcome. Compare the results to the nested logit from (g). (8 points)
Code
# Create grouped dataset with data.table
DT <- data.table(Heating)[
  , grouped := fifelse(depvar %in% c("gc", "gr"), "gas",
               fifelse(depvar %in% c("ec", "er"), "electric", "hp"))
][, `:=`(
  ic_gas = (ic.gc + ic.gr) / 2,
  ic_elec = (ic.ec + ic.er) / 2,
  oc_gas = (oc.gc + oc.gr) / 2,
  oc_elec = (oc.ec + oc.er) / 2
)]

# Create long format
final_data <- rbindlist(list(
  DT[, .(idcase = .I, alt = "gas", chosen = grouped == "gas", 
         ic = ic_gas, oc = oc_gas, rooms, region, income, agehed)],
  DT[, .(idcase = .I, alt = "electric", chosen = grouped == "electric", 
         ic = ic_elec, oc = oc_elec, rooms, region, income, agehed)],
  DT[, .(idcase = .I, alt = "hp", chosen = grouped == "hp", 
         ic = ic.hp, oc = oc.hp, rooms, region, income, agehed)]
))

# Convert to data.frame, then to dfidx
Heating_grouped <- dfidx(as.data.frame(final_data), 
                         idx = c("idcase", "alt"),
                         choice = "chosen",
                         idnames = c("chid", "alt"))

# Fit multinomial logit
fit_grouped_mnl <- mlogit(chosen ~ ic + oc | rooms + region + income + agehed,
                          data = Heating_grouped)
summary(fit_grouped_mnl)

Call:
mlogit(formula = chosen ~ ic + oc | rooms + region + income + 
    agehed, data = Heating_grouped, method = "nr")

Frequencies of alternatives:choice
electric      gas       hp 
0.164444 0.780000 0.055556 

nr method
5 iterations, 0h:0m:0s 
g'(-H)^-1g = 2.09E-07 
gradient close to zero 

Coefficients :
                    Estimate  Std. Error z-value  Pr(>|z|)    
(Intercept):gas  -1.41020020  0.75214311 -1.8749  0.060805 .  
(Intercept):hp   -2.69979951  0.99623432 -2.7100  0.006728 ** 
ic               -0.00233546  0.00098085 -2.3810  0.017264 *  
oc               -0.00866179  0.00194608 -4.4509 8.552e-06 ***
rooms:gas        -0.04023309  0.05288049 -0.7608  0.446758    
rooms:hp         -0.02895866  0.09444260 -0.3066  0.759127    
regionscostl:gas  0.04115461  0.24782355  0.1661  0.868106    
regionscostl:hp  -0.18484241  0.42845620 -0.4314  0.666167    
regionmountn:gas -0.00935714  0.33070437 -0.0283  0.977427    
regionmountn:hp   0.00405791  0.56121137  0.0072  0.994231    
regionncostl:gas  0.35715212  0.27310026  1.3078  0.190952    
regionncostl:hp  -0.18261228  0.48678068 -0.3751  0.707554    
income:gas       -0.00421967  0.05490109 -0.0769  0.938735    
income:hp         0.08238021  0.09896970  0.8324  0.405196    
agehed:gas        0.01124869  0.00659696  1.7051  0.088170 .  
agehed:hp        -0.00473771  0.01182312 -0.4007  0.688629    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -567.3
McFadden R^2:  0.032078 
Likelihood ratio test : chisq = 37.603 (p.value = 0.00059757)
Code
# Fit multinomial probit
fit_grouped_mnp <- mlogit(chosen ~ ic + oc | rooms + region + income + agehed,
                          data = Heating_grouped,
                          probit = TRUE)
summary(fit_grouped_mnp)

Call:
mlogit(formula = chosen ~ ic + oc | rooms + region + income + 
    agehed, data = Heating_grouped, probit = TRUE)

Frequencies of alternatives:choice
electric      gas       hp 
0.164444 0.780000 0.055556 

bfgs method
43 iterations, 0h:0m:21s 
g'(-H)^-1g = 8.4E-07 
gradient close to zero 

Coefficients :
                    Estimate  Std. Error z-value  Pr(>|z|)    
(Intercept):gas  -0.81450328  0.49857851 -1.6337   0.10233    
(Intercept):hp   -5.87060995  8.12409484 -0.7226   0.46992    
ic               -0.00180130  0.00070237 -2.5646   0.01033 *  
oc               -0.00512234  0.00124139 -4.1263 3.686e-05 ***
rooms:gas        -0.02477752  0.02981188 -0.8311   0.40590    
rooms:hp         -0.02533523  0.18560464 -0.1365   0.89143    
regionscostl:gas  0.02524269  0.14201696  0.1777   0.85892    
regionscostl:hp  -0.36992609  0.88391496 -0.4185   0.67558    
regionmountn:gas -0.00785087  0.18605730 -0.0422   0.96634    
regionmountn:hp   0.10264062  1.02487926  0.1001   0.92023    
regionncostl:gas  0.20012507  0.15843605  1.2631   0.20654    
regionncostl:hp  -0.72707112  1.40273558 -0.5183   0.60423    
income:gas       -0.00099572  0.03266488 -0.0305   0.97568    
income:hp         0.16523254  0.28106392  0.5879   0.55661    
agehed:gas        0.00615218  0.00410626  1.4982   0.13407    
agehed:hp        -0.01912928  0.03806654 -0.5025   0.61530    
gas.hp            0.92278921  5.47410667  0.1686   0.86613    
hp.hp             4.04394782  5.34149050  0.7571   0.44900    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -566.01
McFadden R^2:  0.034294 
Likelihood ratio test : chisq = 40.199 (p.value = 0.00072809)

Comparison:

Grouped MNL vs. Grouped MNP: The probit model allows for more flexible correlation patterns in errors but is computationally more intensive. Both models yield similar coefficient estimates for costs: ic is -0.0023 (MNL) vs. -0.0018 (MNP), and oc is -0.0087 vs. -0.0051. Log-likelihoods are nearly identical (-567.3 vs. -566.01), suggesting that the additional flexibility of probit adds little value for this grouped outcome.

Grouped Models vs. Nested Logit:

  • The grouped models explicitly combine alternatives before estimation
  • The nested logit (LL = -991.22) maintains separate alternatives but models correlation
  • These log-likelihoods are not directly comparable because they use different outcome variables (3 categories vs. 5 categories). The nested logit appears to have a “worse” (more negative) log-likelihood, but this is misleading—it uses more detailed choice information.
  • The grouped models discard within-group variation by pre-aggregating choices (e.g., treating gc and gr as identical), while the nested logit preserves the original 5-category structure

Summary:

  • The grouped and nested models give similar substantive conclusions about cost effects (both show strong negative effects of ic and oc), supporting the validity of grouping if within-group distinctions are unimportant
  • The nested logit provides a more flexible approach that doesn’t require pre-grouping alternatives, but suffered from misspecification issues (λ values outside valid range)
  • For this application, if the distinction between gc/gr or ec/er is theoretically unimportant, the grouped MNL offers a simpler, well-specified alternative to the problematic nested logit

3. Ordered Logits and Probits

(20 points total)

Load the Persistence_preferences_rural_Guatemala data from the replication files for the paper “Persistence of Individual and Social Preferences in Rural Settings.” This dataset contains a panel of 1,262 agricultural households in rural Guatemala, surveyed annually from 2019 to 2022. It includes self-reported preference indicators, household socioeconomic characteristics, and contextual variables.

  • trust_people_0: trust in people, scale 1–4 (1 = almost always can trust, 4 = almost always must be very careful)
  • head_compprim_above: household head completed elementary or above
  • head_male: household head is male
  • head_age: household head age
  • head_spanish: household head’s main language spoken is Spanish
  • t_TV_radio: household head owns TV or radio
  • HH_poor_PL2011: household is poor (below $1.90/day, 2011 PPP)
  • Tot_AgriLand_ha: household head agricultural land size (hectares)
  1. Estimate an ordered logit model for trust in people (trust_people_0) as a function of household head education (head_compprim_above), sex (head_male), age (head_age), main language spoken (head_spanish), TV/radio ownership (t_TV_radio), poverty status (HH_poor_PL2011), and agricultural land size (Tot_AgriLand_ha). Report and interpret the coefficients. (5 points)
Code
guat <- haven::read_dta(here::here("assignment", "data", "Persistence_preferences_rural_Guatemala.dta"))
guat$trust_people_0 <- factor(guat$trust_people_0, ordered = TRUE)
fit_logit <- polr(trust_people_0 ~ head_compprim_above + head_male + head_age + 
                  head_spanish + t_TV_radio + HH_poor_PL2011 + Tot_AgriLand_ha, 
                  data = guat, method = "logistic")
summary(fit_logit)
Call:
polr(formula = trust_people_0 ~ head_compprim_above + head_male + 
    head_age + head_spanish + t_TV_radio + HH_poor_PL2011 + Tot_AgriLand_ha, 
    data = guat, method = "logistic")

Coefficients:
                        Value Std. Error t value
head_compprim_above  0.184386   0.133410  1.3821
head_male           -0.128022   0.165553 -0.7733
head_age             0.002724   0.004856  0.5610
head_spanish         0.171032   0.132548  1.2903
t_TV_radio          -0.073130   0.132798 -0.5507
HH_poor_PL2011       0.088101   0.170908  0.5155
Tot_AgriLand_ha      0.104222   0.055393  1.8815

Intercepts:
    Value   Std. Error t value
1|2 -1.8900  0.3320    -5.6932
2|3 -1.3156  0.3280    -4.0114
3|4  1.5015  0.3289     4.5656

Residual Deviance: 2520.828 
AIC: 2540.828 
(71 observations deleted due to missingness)

Positive coefficients indicate higher values of the trust scale (i.e., less trust). The coefficient on head_compprim_above is 0.18, suggesting that household heads with completed elementary education or above have slightly lower trust (higher on the mistrust scale), though this effect is not significant (p ≈ 0.17).

The coefficient on Tot_AgriLand_ha is 0.1 (p ≈ 0.06), indicating that households with larger landholdings tend to have lower trust levels, with this effect approaching significance at the 10% level. Each additional hectare of agricultural land is associated with a 0.1 unit increase in the log-odds of being in a higher (less trusting) category. Most other covariates show minimal and non-significant effects on trust.

  1. Estimate an ordered probit model for the same outcome and predictors. Compare the results to the ordered logit model. (5 points)
Code
fit_probit <- polr(trust_people_0 ~ head_compprim_above + head_male + head_age + 
                   head_spanish + t_TV_radio + HH_poor_PL2011 + Tot_AgriLand_ha, 
                   data = guat, method = "probit")
summary(fit_probit)
Call:
polr(formula = trust_people_0 ~ head_compprim_above + head_male + 
    head_age + head_spanish + t_TV_radio + HH_poor_PL2011 + Tot_AgriLand_ha, 
    data = guat, method = "probit")

Coefficients:
                        Value Std. Error t value
head_compprim_above  0.104388   0.076061  1.3724
head_male           -0.056075   0.092379 -0.6070
head_age             0.001652   0.002744  0.6020
head_spanish         0.103172   0.075759  1.3619
t_TV_radio          -0.039368   0.075521 -0.5213
HH_poor_PL2011       0.044139   0.095954  0.4600
Tot_AgriLand_ha      0.044384   0.028013  1.5844

Intercepts:
    Value   Std. Error t value
1|2 -1.1041  0.1852    -5.9604
2|3 -0.7946  0.1841    -4.3153
3|4  0.9093  0.1846     4.9271

Residual Deviance: 2521.934 
AIC: 2541.934 
(71 observations deleted due to missingness)

The ordered probit yields similar conclusions to the ordered logit, with coefficients differing in scale. To compare, we can divide logit coefficients by approximately 1.6 (the ratio of logistic to normal standard deviations). After scaling, the coefficients are nearly identical:

  • head_compprim_above: logit = 0.18, probit = 0.1, scaled logit = 0.12
  • Tot_AgriLand_ha: logit = 0.1, probit = 0.04, scaled logit = 0.07

The statistical significance and substantive interpretations are identical across both models, confirming that the choice between ordered logit and ordered probit is largely a matter of preference for this application.

  1. For a female household head, age 40, with completed elementary education, not poor, owns a TV or radio, main language spoken is Spanish, and owns 0.8 ha of agricultural land, calculate the predicted probability of each trust category using the ordered logit model. (5 points)
Code
newdata_female <- data.frame(
  head_compprim_above = 1,
  head_male = 0,
  head_age = 40,
  head_spanish = 1,
  t_TV_radio = 1,
  HH_poor_PL2011 = 0,
  Tot_AgriLand_ha = 0.8
)
probs_female <- predict(fit_logit, newdata_female, type = "probs")
Code
knitr::kable(prob_df, digits = 2, 
             caption = "Predicted Trust Probabilities for Female Household Head")
Predicted Trust Probabilities for Female Household Head
Trust_Category Probability
1 (High trust) 0.09
2 0.06
3 0.59
4 (Low trust) 0.26

For this profile, the predicted probability of being in the lowest trust category (4 = “almost always must be very careful”) is 0.26, while the probability of the highest trust category (1 = “almost always can trust”) is 0.09. The modal category is category 3 with probability 0.59.

  1. Discuss how the predicted probabilities change if the household head is male, holding other characteristics the same. (5 points)
Code
newdata_male <- newdata_female
newdata_male$head_male <- 1
probs_male <- predict(fit_logit, newdata_male, type = "probs")
Code
knitr::kable(prob_comparison, digits = 2,
             caption = "Comparison of Trust Probabilities by Gender")
Comparison of Trust Probabilities by Gender
Trust_Category Female Male Difference
1 (High trust) 0.09 0.10 0.01
2 0.06 0.06 0.01
3 0.59 0.60 0.01
4 (Low trust) 0.26 0.24 -0.02

Being male is associated with slightly higher trust (lower values on the mistrust scale). Male household heads have a predicted probability of 23.96% for the lowest trust category versus 26.37% for females (a difference of -2.41 percentage points). The probability of high trust (category 1) increases from 8.59% to 9.65% (1.06 percentage points).

However, the coefficient on head_male is not statistically significant (t = -0.77), suggesting this difference could be due to sampling variability rather than a true gender effect on trust. The small magnitude of the differences (all under 2.4 percentage points) further indicates that gender has minimal impact on trust in this population.