Problem Set 2
Regression and Uncertainty
Instructions
• Submit both files on Canvas: one knitted PDF and the original .Rmd file.
  – Failure to submit both the knitted PDF and the .Rmd will result in a 20-point deduction.
• Your knitted PDF should be readable on its own. Organize your work using clear section headers,
include brief text explanations between code chunks, and display only the outputs needed to answer
each question (avoid printing large objects or long intermediate outputs).
  – Make sure all tables are clearly labeled and easy to read in the final PDF.
  – For every plot: include a clear title, axis labels (with units if relevant), and readable text.
• Include a short “Use of AI” subsection at the end of your submission (even if you used none).
  – You can use AI tools for debugging error messages, clarifying how an R function works, or checking
syntax, etc. Do not use AI as a substitute for your own understanding; you should be able to
explain and justify everything you submit.
  – Failure to disclose AI usage will result in a 20-point deduction.

This exercise is based on the article: Testa, A., Young, J. K., & Mullins, C. (2017). “Does Democracy
Enhance or Reduce Lethal Violence? Examining the Role of the Rule of Law.” Homicide Studies,
Vol. 21, No. 3, pp. 219-239.
The paper examines the cross-national causes of homicide rates. Briefly, many scholars have looked at how
institutions influence homicide rates, arguing that democracy can reduce lethal violence. The authors of the
paper claim that all previous claims are incomplete because they do not unpack the concept of democracy and
examine its different dimensions, such as how having an independent jury affects homicide rates.
In short, some democracies may have more homicides than others. Let’s find out why.
Here is a description of the variables in homicide2.dta:
Name        Description
country     country names
year        year
homi_rate   homicide rate
region1     the numeric code indicating the region of the world that the country is in
polity2     the democracy score for the country from the Polity data set, ranging from -10 (autocracy) to 10 (democracy)
jury        the binary variable indicating whether the country has an independent jury
library(haven)  # provides read_dta() for Stata .dta files
d <- read_dta("homicide2.dta")
Question 1
Examine the data. Look at summary statistics of the data: (1) how many observations and variables are
included in the data, (2) what is the range of year in the data, (3) how many unique countries are included
in the data, (4) the summary statistics of the key variables including homi_rate, polity2, and jury.
The dataset has 2992 rows and 52 columns. The years of the observations in this dataset range from 1950
to 2007. This dataset contains a total of 103 different countries.
summary(d[,c("homi_rate", "polity2", "jury")])
##    homi_rate           polity2             jury
##  Min.   :  0.0000   Min.   :-10.000   Min.   :0.0000
##  1st Qu.:  0.9642   1st Qu.: -2.000   1st Qu.:0.0000
##  Median :  1.7557   Median :  9.000   Median :1.0000
##  Mean   :  4.2901   Mean   :  4.697   Mean   :0.6041
##  3rd Qu.:  4.4758   3rd Qu.: 10.000   3rd Qu.:1.0000
##  Max.   :107.9936   Max.   : 10.000   Max.   :1.0000
##  NA's   :650                          NA's   :759
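The summary shows the distribution of the key variables. homi_rate is right-skewed (mean 4.29 versus
median 1.76, with a maximum near 108), polity2 spans the full autocracy-democracy scale from -10 to 10,
and jury is a binary indicator. The output also reports missing values for some of these variables.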
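The counts reported in this answer (rows, columns, year range, unique countries) can be reproduced with a few base R calls. The sketch below is illustrative only: it uses a small made-up data frame named `toy` rather than `homicide2.dta`, so the printed numbers are not the ones from the real data.

```r
# Toy stand-in for the real data; all values are invented for illustration.
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C"),
  year      = c(1950, 1951, 1950, 2007, 2007),
  homi_rate = c(1.2, NA, 3.4, 0.8, 10.5)
)

n_obs       <- nrow(toy)                      # number of observations (rows)
n_vars      <- ncol(toy)                      # number of variables (columns)
year_range  <- range(toy$year, na.rm = TRUE)  # earliest and latest year
n_countries <- length(unique(toy$country))    # number of distinct countries

cat(n_obs, "rows and", n_vars, "columns\n")
cat("Years:", year_range[1], "to", year_range[2], "\n")
cat("Unique countries:", n_countries, "\n")
```

Applied to the real data, the same calls would be run on `d` instead of `toy`.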
Question 2
Before running actual analyses, we need to recode and clean the data. Currently, the region variable (region1)
is coded as a numeric variable, which cannot be directly included in the regression. Recode region1 using
the region names and create a new region2 variable. The coding for region1 is Europe=1, Middle East=2,
Africa=3, Asia & Oceania=4, Americas=5. How many units are there for each region? How do you think this
would affect our inference? (i.e. can the cases represent the relationship between homicide and democracy
for all the countries in the world?)
# Recode the numeric region codes into a labeled factor.
d$region2 <- factor(
  d$region1,
  levels = c(1, 2, 3, 4, 5),
  labels = c("Europe", "Middle East", "Africa", "Asia & Oceania", "Americas")
)
# Count observations per region (missing codes excluded), largest first.
region_table <- as.data.frame(table(d$region2, useNA = "no"))
region_table <- region_table[order(-region_table$Freq), ]
names(region_table) <- c("Region", "Observations")
knitr::kable(region_table, caption = "Observations by region")
Table 1: Observations by region
Region Observations
1 Europe 1423
5 Americas 694
4 Asia & Oceania 391
2 Middle East 89
3 Africa 47
Of the 2,992 observations, 348 have no region code and therefore do not appear in the table. The coded
observations are heavily concentrated in Europe (1,423) and the Americas (694), while Asia & Oceania (391),
the Middle East (89), and Africa (47) contribute far fewer. This skewed distribution reduces the external
validity of the sample, because the estimates mostly reflect European and American cases. Region controls
can absorb average differences between regions, but the resulting sample is still not fully representative
of all countries in the world (as of 2022).
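The count of observations with a missing region code can be obtained directly by asking `table()` to keep `NA` as its own category. The following is a minimal sketch on an invented vector of region codes (`toy_region1` is hypothetical, not the real `region1` column), using the same labels as the recoding above.

```r
# Invented region codes, including two missing values, for illustration only.
toy_region1 <- c(1, 1, 5, 4, NA, 2, NA, 3)

toy_region2 <- factor(
  toy_region1,
  levels = 1:5,
  labels = c("Europe", "Middle East", "Africa", "Asia & Oceania", "Americas")
)

# useNA = "ifany" adds an <NA> column when missing codes are present.
counts <- table(toy_region2, useNA = "ifany")
print(counts)

# Number of observations with no region code.
n_missing <- sum(is.na(toy_region2))

# Share of each region among the coded observations, in percent.
share <- round(100 * prop.table(table(toy_region2)), 1)
print(share)
```

Reporting shares alongside raw counts makes the regional imbalance easier to see at a glance.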
Question 3
Start with the basic regression. Regress the outcome variable (homi_rate) on the democracy score (polity2)
as in the previous studies. What do you find? Interpret the coefficients of polity2.
m1 <- lm(homi_rate ~ polity2, data = d)
summary(m1)
##
## Call:
## lm(formula = homi_rate ~ polity2, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.806 -3.054 -2.328 -0.094 101.234
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.71369    0.19684  29.027   <2e-16 ***
## polity2     -0.20926    0.02337  -8.956   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.908 on 2340 degrees of freedom
## (650 observations deleted due to missingness)
## Multiple R-squared: 0.03314, Adjusted R-squared: 0.03273
## F-statistic: 80.21 on 1 and 2340 DF, p-value: < 2.2e-16
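The coefficient on polity2 is -0.209: a one-point increase in the democracy score is associated with a
homicide rate that is about 0.209 lower on average. The estimate is statistically significant (p < 2e-16),
but the model explains only about 3% of the variation in homicide rates (R-squared = 0.033). The intercept,
5.714, is the predicted homicide rate when polity2 = 0. In this simple model, democracy
is negatively correlated with the homicide rate.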
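To make the slope interpretation concrete, the fitted line can be evaluated by hand at a few democracy scores. The sketch below plugs the rounded coefficients printed in the `summary(m1)` output into the regression equation; the chosen `polity2` values (-10, 0, 10) are illustrative.

```r
# Coefficients copied (rounded) from the summary(m1) output.
b0 <- 5.71369   # intercept
b1 <- -0.20926  # slope on polity2

# Illustrative democracy scores: full autocracy, midpoint, full democracy.
polity_values <- c(-10, 0, 10)

# Predicted homicide rate = intercept + slope * polity2.
predicted <- b0 + b1 * polity_values
names(predicted) <- paste0("polity2=", polity_values)
print(round(predicted, 3))

# Gap implied by the fitted line between the two ends of the scale
# (20 scale points times the slope, in absolute value).
gap <- unname(predicted[1] - predicted[3])
print(round(gap, 3))
```

Under this bivariate model, moving from -10 to 10 on the Polity scale corresponds to a predicted homicide rate about 4.19 lower.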
Question 4
Next, let’s include independent jury (jury) as an explanatory variable and additionally control for the
region variable (region2) we created in Question 2 as it may be the case democracy plays a different role
in influencing homicide depending upon different regions. What do you find? Interpret all the coefficients
including the intercept/constant and effects of polity2, jury, and region2.
m2 <- lm(homi_rate ~ polity2 + jury + region2, data = d)
summary(m2)
##
## Call:
## lm(formula = homi_rate ~ polity2 + jury + region2, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.733 -2.945 -0.968 0.462 97.347
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)             6.1303     0.3476  17.636  < 2e-16 ***
## polity2                 0.1738     0.0375   4.635 3.79e-06 ***
## jury                   -5.9691     0.5448 -10.956  < 2e-16 ***
## region2Middle East     -2.6158     0.8439  -3.100  0.00196 **
## region2Africa          -5.6896     1.3188  -4.314 1.68e-05 ***
## region2Asia & Oceania   0.9641     0.4634   2.081  0.03760 *
## region2Americas         5.3855     0.4021  13.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.284 on 2047 degrees of freedom
## (938 observations deleted due to missingness)
## Multiple R-squared: 0.1907, Adjusted R-squared: 0.1883
## F-statistic: 80.39 on 6 and 2047 DF, p-value: < 2.2e-16
Using Europe as the reference category, the intercept of 6.13 is the predicted homicide rate for a European
country with polity2 = 0 and jury = 0.
Holding jury and region constant, the coefficient on polity2 is 0.174, so a one-point increase in the democracy
score is associated with a homicide rate about 0.174 higher on average.
The coefficient on jury is -5.969, so countries with an independent jury are estimated to have homicide
rates about 5.969 points lower, all else equal.
Relative to Europe, and holding polity2 and jury constant, the estimated regional differences are:
• Middle East: -2.616
• Africa: -5.690
• Asia & Oceania: 0.964
• Americas: 5.386
All coefficients are statistically significant at the 5% level. The sign on polity2 flips from negative in
Question 3 to positive here, which suggests that the pooled bivariate estimate was confounded by jury and
region: once these are held constant, a higher democracy score is associated with a higher homicide rate.
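One way to check the reading of the intercept and the dummy coefficients is to compute a few predicted values by hand from the rounded estimates in the `summary(m2)` output. The helper `predict_rate()` below is a hypothetical convenience function written for this sketch, and the scenarios are illustrative.

```r
# Coefficients copied (rounded) from the summary(m2) output; Europe is the baseline.
b <- c(intercept = 6.1303, polity2 = 0.1738, jury = -5.9691,
       middle_east = -2.6158, africa = -5.6896,
       asia_oceania = 0.9641, americas = 5.3855)

# Hypothetical helper: predicted homicide rate for a given profile.
# region_effect is 0 for Europe, or one of the region coefficients.
predict_rate <- function(polity2, jury, region_effect = 0) {
  unname(b["intercept"] + b["polity2"] * polity2 + b["jury"] * jury + region_effect)
}

europe_baseline   <- predict_rate(0, 0)                  # equals the intercept
europe_jury_dem   <- predict_rate(10, 1)                 # Europe, polity2 = 10, jury = 1
americas_jury_dem <- predict_rate(10, 1, b["americas"])  # same profile in the Americas
africa_jury       <- predict_rate(0, 1, b["africa"])     # Africa, polity2 = 0, jury = 1

print(round(c(europe_baseline, europe_jury_dem, americas_jury_dem, africa_jury), 3))
```

The last scenario yields a negative predicted rate, which is impossible for a homicide rate. That is a reminder that a linear model can extrapolate poorly for sparsely observed groups such as Africa.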
Question 5
This is a country-level longitudinal data set, which means it includes country cases across multiple years.
Previously, we pooled all the cases together and fitted regression models, which is not totally correct. Specifically,
the effects we estimated average over both between-country and within-country comparisons, while we
left out many country- and year-specific effects. For example, the model will treat USA 1996, USA 1998,
and Russia 1996 as independent cases. But we know it’s more relevant to compare USA between 1996 and
1998 or compare USA and Russia in the same year of 1996. We can resolve this by controlling for the year and
country variables, which is called fixed effects. It allows us to control for variables we cannot observe or
measure, such as country-specific characteristics that affect the outcome, or variables that change over
time but not across countries. Regress homi_rate on polity2, jury, year, and country. Note that year is
currently a continuous/numeric variable, which wouldn’t be appropriate to include directly in the model.
You can use factor(year) in the regression specification to make it a factor variable. After controlling for
year and country effects, what do you find for the effects of Polity scores and independent jury? Does the
implication from previous questions still hold?
m3 <- lm(homi_rate ~ polity2 + jury + factor(year) + country, data = d)
coef(summary(m3))[c("polity2", "jury"), ]
## Estimate Std. Error t value Pr(>|t|)
## polity2 0.1240071 0.0359356 3.450815 0.0005709686
## jury -1.7382361 0.5269434 -3.298715 0.0009890087
After accounting for country and year fixed effects, the coefficients on both polity2 and jury remain
statistically significant (p < 0.001).
polity2 is still positively associated with the homicide rate (0.124), and jury is still negatively
associated with it (-1.738). Both estimates are smaller in magnitude than in Question 4 (0.174 and -5.969),
so part of the earlier associations reflected differences between countries and years rather than changes
within countries. Thus, the earlier pooled results only partially hold: the signs from Question 4 survive,
but the negative polity2 coefficient from Question 3 does not.
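The idea that fixed effects isolate within-country variation can be illustrated on simulated data. The sketch below is an assumption-laden toy example, not the real analysis: it simulates a small panel with country effects only (no year effects, unlike `m3`). It then shows that including country dummies gives the same slope as regressing country-demeaned `y` on country-demeaned `x`. All names (`panel`, `country_effect`, `x`, `y`) are hypothetical.

```r
set.seed(42)

# Simulated balanced panel: 4 countries observed over 6 years.
panel <- expand.grid(country = c("A", "B", "C", "D"), year = 2000:2005)

# Unobserved country-specific level that shifts both x and y.
country_effect <- c(A = 1, B = 4, C = 8, D = 12)
ce <- unname(country_effect[as.character(panel$country)])

panel$x <- rnorm(nrow(panel)) + ce / 4
panel$y <- 0.5 * panel$x + ce + rnorm(nrow(panel), sd = 0.3)

# Pooled regression ignores the country structure.
pooled <- lm(y ~ x, data = panel)

# Fixed effects via country dummies.
fe_dummies <- lm(y ~ x + country, data = panel)

# Fixed effects via the within transformation: subtract country means.
panel$x_within <- panel$x - ave(panel$x, panel$country)
panel$y_within <- panel$y - ave(panel$y, panel$country)
fe_within <- lm(y_within ~ x_within, data = panel)

slope_pooled  <- unname(coef(pooled)["x"])
slope_dummies <- unname(coef(fe_dummies)["x"])
slope_within  <- unname(coef(fe_within)["x_within"])

print(c(pooled = slope_pooled, dummies = slope_dummies, within = slope_within))
```

The two fixed-effects slopes coincide, while the pooled slope also picks up between-country differences. Only the point estimates should be compared here, since the standard errors from the demeaned regression do not account for the estimated country means.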
Question 6
Finally, let’s rethink this study. What is the research question in this study? What do the authors of the paper
contribute to this line of research? What may be the authors’ hypothesis? Do the results from the previous
questions support the hypothesis? What are the limitations of the current study? (i.e. case selection,
causality, confounders)
The research question is whether democracy lowers homicide rates, and specifically whether particular
institutional dimensions of democracy, such as an independent jury, affect homicide rates.
The authors contribute by moving from aggregate measures of democracy to its specific dimensions, asking
how the strength of legal institutions differs across countries and how that relates to lethal violence.
A plausible hypothesis is therefore that stronger democratic institutions, and specifically an independent
jury, are associated with lower homicide rates.
The results offer partial support for this hypothesis. In the pooled bivariate model, democracy is negatively
correlated with homicide rates; however, once we control for jury and region, and then for country and year
fixed effects, the polity2 coefficient turns positive. The jury variable shows a consistently negative
association, although its magnitude shrinks from -5.969 to -1.738 with fixed effects. These results should
not be read as evidence of a causal relationship, because all of the evidence is observational.
The main limitations include:
• Case selection and missing data limit how far these results can be generalized.
• The sample is not evenly distributed across regions.
• Because the analyses rely on observational data, they cannot establish causality on their own.
• Omitted confounding variables may still bias the estimated relationships.
• Pooling country-year observations without fixed effects ignores within-country structure and changes
over time.
Question 7 (Bonus)
To create confidence intervals for polity2 and jury, I used the model from Question 4.
# 95% and 99% confidence intervals for the two coefficients of interest.
ci95 <- confint(m2, c("polity2", "jury"), level = 0.95)
ci99 <- confint(m2, c("polity2", "jury"), level = 0.99)
# Stack both sets of intervals into one table.
ci_table <- data.frame(
  Term = rep(rownames(ci95), 2),
  Level = rep(c("95%", "99%"), each = nrow(ci95)),
  Lower = c(ci95[, 1], ci99[, 1]),
  Upper = c(ci95[, 2], ci99[, 2])
)
knitr::kable(ci_table, digits = 3, caption = "Confidence intervals for the Question 4 model")
Table 2: Confidence intervals for the Question 4 model
Term Level Lower Upper
polity2 95% 0.100 0.247
jury 95% -7.038 -4.901
polity2 99% 0.077 0.270
jury 99% -7.374 -4.564
As anticipated, the intervals are wider at the 99% level than at the 95% level, because more confidence
requires a wider interval.
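The intervals in Table 2 can be approximated by hand as estimate ± critical t value × standard error, using the residual degrees of freedom of the Question 4 model (2047). The sketch below uses the rounded estimates and standard errors from the printed output, so its results may differ from `confint()` in the third decimal place. `ci_by_hand()` is a hypothetical helper written for this illustration.

```r
# Rounded values copied from the summary(m2) output.
est      <- c(polity2 = 0.1738, jury = -5.9691)
se       <- c(polity2 = 0.0375, jury = 0.5448)
df_resid <- 2047  # residual degrees of freedom reported for m2

# Hypothetical helper: two-sided t-based interval at a given confidence level.
ci_by_hand <- function(level) {
  t_crit <- qt(1 - (1 - level) / 2, df = df_resid)
  cbind(Lower = est - t_crit * se, Upper = est + t_crit * se)
}

ci95_hand <- ci_by_hand(0.95)
ci99_hand <- ci_by_hand(0.99)

print(round(ci95_hand, 3))
print(round(ci99_hand, 3))

# Interval widths: the 99% intervals should be wider than the 95% ones.
width95 <- ci95_hand[, "Upper"] - ci95_hand[, "Lower"]
width99 <- ci99_hand[, "Upper"] - ci99_hand[, "Lower"]
```

The only thing that changes between the two levels is the critical value, which is larger at 99% than at 95%. Neither interval for either coefficient contains zero.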
Use of AI
None