Let \(X\) be distributed as N(\(\mu,\sigma^2\)). The unknown parameters are \(\mu\) and \(\sigma^2\).
Find the log-likelihood function \(\ell_n (\mu,\sigma^2)\). (8 points)
Solution (a)
The probability density function for a normal distribution is: \[f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]
The log-likelihood for a single observation is: \[\log f(x_i|\mu,\sigma^2) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2}\]
For \(n\) observations, the log-likelihood function is: \[\ell_n(\mu,\sigma^2) = \sum_{i=1}^n \log f(x_i|\mu,\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\]
Take the first order condition with respect to \(\mu\), and show that the solution for \(\hat{\mu}\) does not depend on \(\sigma^2\). (10 points)
Solution (b)
Taking the derivative with respect to \(\mu\): \[\frac{\partial \ell_n}{\partial \mu} = \frac{1}{2\sigma^2} \cdot 2\sum_{i=1}^n(x_i-\mu) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu)\]
Setting the first order condition equal to zero: \[\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = 0\]
Note that \(\sigma^2\) cancels out of the first order condition, so the solution \(\hat{\mu} = \bar{x}\) does not depend on \(\sigma^2\).
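Solving the first order condition explicitly:

\[\sum_{i=1}^n(x_i-\hat{\mu}) = 0 \quad\Longrightarrow\quad \sum_{i=1}^n x_i = n\hat{\mu} \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}\]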
Define the concentrated log-likelihood function \(\ell_n (\hat{\mu},\sigma^2)\). The concentrated log-likelihood substitutes the MLE \(\hat{\mu} = \bar{x}\) from part (b) into the original log-likelihood, treating it as a function of \(\sigma^2\) only. Take the first order condition with respect to \(\sigma^2\), and find the MLE \(\hat{\sigma}^2\) for \(\sigma^2\). (12 points)
Solution (c)
From part (b), we found that \(\hat{\mu} = \bar{x}\). We now substitute this into the log-likelihood from part (a), treating \(\hat{\mu}\) as given (no longer a parameter to estimate).
The concentrated log-likelihood is: \[\ell_n(\hat{\mu},\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\bar{x})^2\]
Taking the derivative with respect to \(\sigma^2\): \[\frac{\partial \ell_n}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\bar{x})^2\]
Setting equal to zero: \[-\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\bar{x})^2 = 0\]
Multiplying through by \(2(\sigma^2)^2\) and solving gives the MLE: \[\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\]
Using the chain rule and the result from part (a): \[\frac{d}{dx}\log\Lambda(x) = \frac{1}{\Lambda(x)} \cdot \frac{d\Lambda(x)}{dx} = \frac{1}{\Lambda(x)} \cdot \Lambda(x)(1-\Lambda(x)) = 1-\Lambda(x)\]
From part (c), \(H_{\text{logit}}(x) = \Lambda(x)(1-\Lambda(x))\).
Since \(\Lambda(x) = \frac{1}{1+e^{-x}}\), we have \(0 < \Lambda(x) < 1\) for all \(x\).
Therefore, \(0 < 1-\Lambda(x) < 1\) as well.
The product \(\Lambda(x)(1-\Lambda(x))\) is maximized when \(\Lambda(x) = 1-\Lambda(x) = 0.5\), which gives: \[H_{\text{logit}}(x) \le 0.5 \times 0.5 = 0.25 < 1\]
Since \(H_{\text{logit}}(x) > 0\) for all \(x\) (as both factors are positive), we have: \[0 < H_{\text{logit}}(x) \le 0.25 < 1\]
Therefore, \(|H_{\text{logit}}(x)| \le 1\).
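The claim that the product is maximized at \(\Lambda(x) = 0.5\) can be checked with one differentiation step, treating \(H_{\text{logit}}\) as a function of \(\Lambda\):

\[\frac{d}{d\Lambda}\left[\Lambda(1-\Lambda)\right] = 1 - 2\Lambda = 0 \quad\Longrightarrow\quad \Lambda = \frac{1}{2}, \qquad H_{\text{logit}} = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}\]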
3. Binary Choice Estimation
(40 points total)
The data for Question 3 come from Grogger, J. (1991), “Certainty vs. Severity of Punishment,” Economic Inquiry 29, 297-309. The dataset grogger.dta is available on Canvas.
Dataset description:
Unit of observation: Individual men
Sample: Men arrested at least once prior to 1986
Time period: 1986 arrests and prior criminal history
Key variables:
narr86: Number of arrests in 1986
pcnv: Proportion of prior arrests resulting in conviction (probability of conviction)
avgsen: Average sentence length (months) from prior convictions
tottime: Total time spent in prison since age 18 (months)
Define a binary variable, \(arr86\), equal to unity if a man was arrested at least once during 1986, and zero otherwise. Estimate an LPM relating \(arr86\) to \(pcnv\), \(avgsen\), \(tottime\), \(ptime86\), \(inc86\), \(black\), \(hispan\), and \(born60\). Report the usual and heteroskedasticity-robust standard errors. What is the estimated effect on the probability of arrest if \(pcnv\) goes from .25 to .75? (8 points)
Solution (a)
```r
# Estimate LPM
lpm <- lm(arr86 ~ pcnv + avgsen + tottime + ptime86 + inc86 + black + hispan + born60,
          data = crime)

# Get robust standard errors
robust_se <- sqrt(diag(vcovHC(lpm, type = "HC1")))

# Create comparison table
stargazer(lpm, lpm,
          se = list(NULL, robust_se),
          type = "text",
          column.labels = c("Standard SE", "Robust SE"),
          model.names = FALSE,
          title = "Linear Probability Model Results")
```
The estimated effect on the probability of arrest when \(pcnv\) increases from 0.25 to 0.75 is approximately \(-0.0772\): a 0.50 increase in the proportion of prior arrests leading to conviction is associated with a 7.72 percentage point decrease in the probability of arrest in 1986. The robust standard errors are generally larger than the usual standard errors, reflecting the heteroskedasticity that is built into the LPM.
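Because the LPM is linear in \(pcnv\), this effect is simply the estimated coefficient scaled by the change in the regressor, and is the same at every point in the covariate space:

\[\Delta\widehat{P}(arr86=1) = \hat{\beta}_{pcnv} \times (0.75 - 0.25) = 0.5\,\hat{\beta}_{pcnv} \approx -0.0772\]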
Test the joint significance of \(avgsen\) and \(tottime\), using a nonrobust and robust test. (6 points)
```
Linear hypothesis test

Hypothesis:
avgsen = 0
tottime = 0

Model 1: restricted model
Model 2: arr86 ~ pcnv + avgsen + tottime + ptime86 + inc86 + black + hispan +
    born60

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F Pr(>F)
1   2718
2   2716  2 0.1839  0.832
```
The nonrobust F-test gives an F-statistic of 0.179 with a p-value of 0.836; the robust F-test gives an F-statistic of 0.184 with a p-value of 0.832. In both cases we fail to reject the null hypothesis that the coefficients on \(avgsen\) and \(tottime\) are jointly zero at the 5% level. With p-values above 0.80, these variables appear to have virtually no joint explanatory power for arrest probability.
Now estimate the model by probit. At the average values of \(avgsen\), \(tottime\), \(inc86\), and \(ptime86\) in the sample, and with \(black=1\), \(hispan=0\), and \(born60=1\), what is the estimated effect on the probability of arrest if \(pcnv\) goes from .25 to .75? Compare this result with the answer from part a. (8 points)
```r
# Estimate the model by probit (this estimation step is implied by the question;
# the original chunk starts from the fitted object `probit`)
probit <- glm(arr86 ~ pcnv + avgsen + tottime + ptime86 + inc86 + black + hispan + born60,
              data = crime, family = binomial(link = "probit"))

# Calculate averages for continuous variables
avg_avgsen  <- mean(crime$avgsen, na.rm = TRUE)
avg_tottime <- mean(crime$tottime, na.rm = TRUE)
avg_inc86   <- mean(crime$inc86, na.rm = TRUE)
avg_ptime86 <- mean(crime$ptime86, na.rm = TRUE)

# Create data for prediction at pcnv = 0.25
newdata_25 <- data.frame(pcnv = 0.25, avgsen = avg_avgsen, tottime = avg_tottime,
                         ptime86 = avg_ptime86, inc86 = avg_inc86,
                         black = 1, hispan = 0, born60 = 1)

# Create data for prediction at pcnv = 0.75
newdata_75 <- data.frame(pcnv = 0.75, avgsen = avg_avgsen, tottime = avg_tottime,
                         ptime86 = avg_ptime86, inc86 = avg_inc86,
                         black = 1, hispan = 0, born60 = 1)

# Predict probabilities and calculate the effect
prob_25 <- predict(probit, newdata = newdata_25, type = "response")
prob_75 <- predict(probit, newdata = newdata_75, type = "response")
probit_effect <- prob_75 - prob_25
```
At the specified covariate values, the probit model predicts an arrest probability of 0.3979 when \(pcnv = 0.25\) and 0.2962 when \(pcnv = 0.75\), for an effect of \(-0.1017\) (a 10.17 percentage point decrease). The LPM estimates a 7.72 percentage point decrease. The probit effect is larger in magnitude at these covariate values: the predicted probabilities lie near the steep middle portion of the normal CDF, where marginal effects are largest, whereas the LPM imposes the same constant effect everywhere.
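In probit notation, with \(\mathbf{x}_{0.25}\) and \(\mathbf{x}_{0.75}\) denoting the covariate vector evaluated at \(pcnv = 0.25\) and \(pcnv = 0.75\) respectively, this effect is a difference of two evaluations of the standard normal CDF \(\Phi\):

\[\Delta\widehat{P} = \Phi(\mathbf{x}_{0.75}'\hat{\beta}) - \Phi(\mathbf{x}_{0.25}'\hat{\beta}) = 0.2962 - 0.3979 = -0.1017\]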
For the probit model estimated in part c, obtain the percent correctly predicted. What is the percent correctly predicted when \(narr86=0\)? When \(narr86=1\)? What do you make of these findings? (8 points)
Solution (d)
```r
# Get predicted probabilities for all observations
crime$pred_prob <- predict(probit, type = "response")

# Use 0.5 as the cutoff for prediction
crime$pred_arr86 <- ifelse(crime$pred_prob > 0.5, 1, 0)

# Overall percent correctly predicted
overall_correct <- mean(crime$pred_arr86 == crime$arr86, na.rm = TRUE)

# Percent correct when arr86 = 0 (narr86 = 0)
correct_0 <- mean(crime$pred_arr86[crime$arr86 == 0] == 0, na.rm = TRUE)

# Percent correct when arr86 = 1 (narr86 >= 1)
correct_1 <- mean(crime$pred_arr86[crime$arr86 == 1] == 1, na.rm = TRUE)

# Create summary table
accuracy_table <- data.frame(
  Category = c("Overall", "When arr86=0", "When arr86=1"),
  Percent_Correct = c(overall_correct, correct_0, correct_1) * 100
)
knitr::kable(accuracy_table, digits = 2,
             col.names = c("Category", "Percent Correct"),
             caption = "Probit Model Prediction Accuracy")
```
Probit Model Prediction Accuracy

| Category     | Percent Correct |
|--------------|----------------:|
| Overall      |           72.70 |
| When arr86=0 |           96.60 |
| When arr86=1 |           10.33 |
```r
# Confusion matrix
table_pred <- table(Predicted = crime$pred_arr86, Actual = crime$arr86)
print(table_pred)
```

```
         Actual
Predicted    0    1
        0 1903  677
        1   67   78
```
The overall percent correctly predicted is 72.7%. When arr86=0, the model correctly predicts 96.6% of cases. When arr86=1, the model correctly predicts 10.33% of cases.
These findings suggest that the model is much better at predicting non-arrest (arr86=0) than arrest (arr86=1). This is common in binary choice models when one outcome is much more frequent than the other. The model tends to predict the more common outcome (no arrest) for most observations. This indicates that while the model has high overall accuracy, it has limited ability to predict actual arrests, which may be the outcome of greater interest.
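These three rates can be verified directly from the confusion matrix:

\[\frac{1903+78}{2725} \approx 0.727 \;(\text{overall}), \qquad \frac{1903}{1903+67} \approx 0.966 \;(arr86=0), \qquad \frac{78}{677+78} \approx 0.103 \;(arr86=1)\]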
In the probit model, add the terms \(pcnv^2\), \(ptime86^2\), and \(inc86^2\) to the model. Are these individually or jointly significant? Describe the estimated relationship between the probability of arrest and pcnv. In particular, at what point does the probability of conviction have a negative effect on probability of arrest? (10 points)
```r
# Estimate the probit model with squared terms (this estimation step is implied
# by the question; the original chunk starts from the fitted object `probit_quad`)
crime$pcnv2    <- crime$pcnv^2
crime$ptime862 <- crime$ptime86^2
crime$inc862   <- crime$inc86^2
probit_quad <- glm(arr86 ~ pcnv + pcnv2 + avgsen + tottime + ptime86 + ptime862 +
                     inc86 + inc862 + black + hispan + born60,
                   data = crime, family = binomial(link = "probit"))

# Extract coefficients for interpretation
coef_pcnv     <- coef(probit_quad)["pcnv"]
coef_pcnv2    <- coef(probit_quad)["pcnv2"]
coef_ptime862 <- coef(probit_quad)["ptime862"]
coef_inc862   <- coef(probit_quad)["inc862"]

# Extract p-values
summary_quad  <- summary(probit_quad)
pval_pcnv2    <- summary_quad$coefficients["pcnv2", "Pr(>|z|)"]
pval_ptime862 <- summary_quad$coefficients["ptime862", "Pr(>|z|)"]
pval_inc862   <- summary_quad$coefficients["inc862", "Pr(>|z|)"]

# Turning point: where beta_pcnv + 2 * beta_pcnv2 * pcnv = 0
turning_point <- -coef_pcnv / (2 * coef_pcnv2)

# Plot the relationship between pcnv and the predicted arrest probability
pcnv_seq <- seq(0, 1, length.out = 100)
newdata_plot <- data.frame(
  pcnv = pcnv_seq, pcnv2 = pcnv_seq^2,
  avgsen = avg_avgsen, tottime = avg_tottime,
  ptime86 = avg_ptime86, ptime862 = avg_ptime86^2,
  inc86 = avg_inc86, inc862 = avg_inc86^2,
  black = 1, hispan = 0, born60 = 1
)
pred_probs <- predict(probit_quad, newdata = newdata_plot, type = "response")
plot(pcnv_seq, pred_probs, type = "l", lwd = 2,
     xlab = "Prior Conviction Rate (pcnv)",
     ylab = "Predicted Probability of Arrest",
     main = "Relationship between pcnv and Arrest Probability")
grid()
abline(v = turning_point, col = "red", lty = 2, lwd = 2)
text(turning_point, max(pred_probs) * 0.9,
     sprintf("Turning point\npcnv = %.3f", turning_point),
     pos = 4, col = "red")
```
The likelihood ratio test gives a chi-square statistic of 87.68 with 3 degrees of freedom (p < 0.001), indicating that the squared terms are jointly highly significant.
Individually examining the coefficients on the squared terms:

- \(pcnv^2\): coefficient = -0.8571, p = 0.0015 (significant)
- \(ptime86^2\): coefficient = -0.1035, p < 0.001 (significant)
- \(inc86^2\): coefficient = \(8.747 \times 10^{-6}\), p = 0.035 (significant)
The relationship between pcnv and arrest probability is nonlinear. The coefficient on pcnv is 0.2168 (positive but not significant), while the coefficient on pcnv² is -0.8571 (negative and significant). This creates an inverted-U relationship.
The turning point occurs at pcnv = 0.1265. Below this threshold, increases in conviction probability are associated with increases in arrest probability (contrary to deterrence theory). However, beyond pcnv = 0.1265, further increases in conviction probability are associated with decreases in arrest probability, consistent with a strong deterrence effect.
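The turning point follows from setting the derivative of the probit index with respect to \(pcnv\) to zero:

\[\frac{\partial}{\partial\, pcnv}\left(\hat{\beta}_{pcnv}\,pcnv + \hat{\beta}_{pcnv^2}\,pcnv^2\right) = 0 \quad\Longrightarrow\quad pcnv^* = -\frac{\hat{\beta}_{pcnv}}{2\hat{\beta}_{pcnv^2}} = -\frac{0.2168}{2 \times (-0.8571)} \approx 0.1265\]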
This suggests that deterrence effects may be weak or absent for individuals with low conviction rates, but become substantial for those with higher conviction histories. Alternatively, this could reflect selection effects where individuals with very high conviction rates modify their behavior to avoid detection.