Week 5
Data Analysis
Soci—316
Qualtrics Survey Assessment Deadline
Your Qualtrics survey assignments are due by 8:00 PM on Monday, March 2nd.
We should treat regressions as
counterfactual prediction machines.
Today, we’ll explore this idea—and work to demystify
regressions—using a basic example.
Image can be retrieved here.
To predict the probability of surviving the sinking of the Titanic as
a function of predictors like age, sex and passenger class.
P(\text{Survival} = 1 \mid X) = \beta_0 + X\beta + \varepsilon
We’re going to estimate a linear probability model.
Caveat Emptor
Using a linear probability model (LPM) to predict a binary outcome can produce predicted probabilities—P(Y=1\mid X)—that fall outside the [0, 1] interval.
We’re going to fit one anyway
(cf. von Hippel 2015).
# A tibble: 9 × 3
name survived age
<chr> <dbl> <dbl>
1 Aubart, Mme. Leontine Pauline 1 24
2 Stone, Mrs. George Nelson (Mart 1 62
3 Cardeza, Mr. Thomas Drake Marti 1 36
4 Buss, Miss. Kate 1 36
5 Smith, Miss. Marion Elsie 1 40
6 Hocking, Mr. Richard George 0 23
7 Cohen, Mr. Gurshon Gus 1 18
8 Cribb, Mr. John Hatfield 0 44
9 Skoog, Miss. Mabel 0 9
P(\text{Survival} = 1 \mid X) = \beta_0 + \beta_1\text{age} + \varepsilon
We’re modelling the probability of survival as a function of age.
To fit this model, we need to calculate the slope and intercept parameters—that is, \beta_0 and \beta_1\text{age}, respectively.
Slopes can be derived using the following formula:
b = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{\sum{(X - \bar{X})^2}}
Intercepts can be calculated as follows:
\alpha = \bar{Y} - b\bar{X}
| Y | X | X - \bar{X} | Y - \bar{Y} | (X - \bar{X})(Y - \bar{Y}) | (X - \bar{X})^2 |
|---|---|---|---|---|---|
| 1 | 24 | -8.44 | 0.33 | -2.81 | 71.31 |
| 1 | 62 | 29.56 | 0.33 | 9.85 | 873.53 |
| 1 | 36 | 3.56 | 0.33 | 1.19 | 12.64 |
| 1 | 36 | 3.56 | 0.33 | 1.19 | 12.64 |
| 1 | 40 | 7.56 | 0.33 | 2.52 | 57.09 |
| 0 | 23 | -9.44 | -0.67 | 6.33 | 89.14 |
| 1 | 18 | -14.44 | 0.33 | -4.81 | 208.53 |
| 0 | 44 | 11.56 | -0.67 | -7.74 | 133.64 |
| 0 | 9 | -23.44 | -0.67 | 15.67 | 549.64 |
| \sum (X - \bar{X}) = 0 | \sum (Y - \bar{Y}) = 0 | \sum (X - \bar{X})(Y - \bar{Y}) = 21.33 | \sum (X - \bar{X})^2 = 2008.92 |
\beta_1 = \frac{21.33}{2008.92} \approx 0.0106
\beta_0 = 0.667 - (0.0106 \times 32.44) \approx 0.3220
Our first LPM can therefore be summarized as follows:
\hat{P}(\text{Survived} = 1) = 0.3220 + 0.0106 \cdot \text{age} + \varepsilon
We can use programs like to streamline hand calculations.
library(tidyverse)
titanic_subset |> mutate(# Averages
mean_y = mean(survived),
mean_x = mean(age),
# To establish covariation between x and y
# Deviations from average, x
x_minus_mean_x = age - mean(age),
# Corresponding y value--deviations from average
y_minus_mean_y = survived - mean(mean_y),
# Tapping co-movement
slope_numerator = x_minus_mean_x * y_minus_mean_y,
# Squared residuals, positive values are yielded
slope_denominator = x_minus_mean_x^2) |>
# Putting it all together
mutate(# Summing up the values for the numerator ...
slope_numerator = sum(slope_numerator),
# And the denominator ...
slope_denominator = sum(slope_denominator),
# Yielding the slope estimate
slope = slope_numerator/slope_denominator,
# Which allows us to "solve" for the intercept:
intercept = mean_y - slope * mean_x) |>
select(intercept, slope) |>
distinct()# A tibble: 1 × 2
intercept slope
<dbl> <dbl>
1 0.322 0.0106
Kate Buss was 36 years old when the Titanic sank.
# A tibble: 1 × 3
name survived age
<chr> <dbl> <dbl>
1 Buss, Miss. Kate 1 36
We can compute her probability of survival Using our clearly misspecified model. as follows:
Or we can use functions to generate these predictions for us.
library(marginaleffects)
avg_predictions(survival_truncated, variables = "age") |>
as_tibble() |>
ggplot(mapping = aes(x = age, y = estimate)) +
geom_line(colour = "#b7a5d3", linewidth = 1.5) +
geom_ribbon(mapping = aes(ymin = conf.low,
ymax = conf.high),
fill = "lightgrey",
alpha = 0.2) +
theme_bw()survival_full <- lm(survived ~ age, data = titanic)
avg_predictions(survival_full, variables = "age") |>
as_tibble() |>
ggplot(mapping = aes(x = age, y = estimate)) +
geom_line(colour = "#b7a5d3", linewidth = 1.5) +
geom_ribbon(mapping = aes(ymin = conf.low,
ymax = conf.high),
fill = "lightgrey",
alpha = 0.2) +
theme_bw()Uh-oh.
Our initial model painted a misleading portrait.
Call:
lm(formula = survived ~ age + sex + passengerClass, data = titanic)
Residuals:
Min 1Q Median 3Q Max
-1.09442 -0.24408 -0.08375 0.23289 0.99397
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1049549 0.0438210 25.215 < 2e-16 ***
age -0.0052695 0.0009316 -5.656 2.00e-08 ***
sexmale -0.4914131 0.0255520 -19.232 < 2e-16 ***
passengerClass2nd -0.2113738 0.0348568 -6.064 1.85e-09 ***
passengerClass3rd -0.3703874 0.0325039 -11.395 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3913 on 1041 degrees of freedom
(263 observations deleted due to missingness)
Multiple R-squared: 0.3691, Adjusted R-squared: 0.3667
F-statistic: 152.3 on 4 and 1041 DF, p-value: < 2.2e-16
\begin{align*} \hat{P}(\text{Survived} = 1) = 1.105 - 0.00527 \cdot \text{age} - 0.4914 \cdot \text{sex}_\text{male} \\ - 0.2114 \cdot \text{class}_\text{2nd} - 0.3704 \cdot \text{class}_\text{3rd} + \varepsilon \end{align*}
Call:
lm(formula = survived ~ age + sex + passengerClass, data = titanic)
Coefficients:
(Intercept) age sexmale passengerClass2nd
1.104955 -0.005269 -0.491413 -0.211374
passengerClass3rd
-0.370387
We’re now modelling the probability of survival as a function of age, sex and passenger class.
Here are three passengers selected at random:
# A tibble: 3 × 5
name survived sex age passengerClass
<chr> <dbl> <fct> <dbl> <fct>
1 Veal, Mr. James 0 male 40 2nd
2 Petersen, Mr. Marius 0 male 24 3rd
3 Johnson, Master. Harold Theodor 1 male 4 3rd
In groups of 2-3, calculate the probability of survival for each passenger.
Call:
lm(formula = survived ~ age + sex + passengerClass, data = titanic)
Coefficients:
(Intercept) age sexmale passengerClass2nd
1.104955 -0.005269 -0.491413 -0.211374
passengerClass3rd
-0.370387
Scroll
# A tibble: 3 × 2
name estimate
<chr> <dbl>
1 Veal, Mr. James 0.191
2 Petersen, Mr. Marius 0.117
3 Johnson, Master. Harold Theodor 0.222
| (1) | |
|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
| (Intercept) | 1.105*** |
| (0.044) | |
| age | -0.005*** |
| (0.001) | |
| sexmale | -0.491*** |
| (0.026) | |
| passengerClass2nd | -0.211*** |
| (0.035) | |
| passengerClass3rd | -0.370*** |
| (0.033) | |
| Num.Obs. | 1046 |
| R2 | 0.369 |
| R2 Adj. | 0.367 |
| AIC | 1012.7 |
| BIC | 1042.4 |
| Log.Lik. | -500.354 |
| F | 152.257 |
| RMSE | 0.39 |
SE(\hat{\beta}) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i = 1}^n (x_i- \bar{x})^2}}
Parameter estimates based on our full model.
# A tibble: 1 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 age -0.00527 0.000932 -5.66 0.0000000155
t = \frac{\hat{\beta} - \color{856cb0}{\beta_{H_\theta}}}{SE(\hat{\beta})}
t = \frac{\hat{\beta} - \color{856cb0}{\beta_{H_0}}}{SE(\hat{\beta})}
# A tibble: 1 × 6
term estimate std.error statistic manual_statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age -0.00527 0.000932 -5.66 -5.66 0.0000000200
A result is statistically significant at the 5% level when the corresponding parameter is distinguishable from zero using 5% as the rejection threshold. Conversely, a result is not statistically significant at the 5% level when the corresponding parameter is not distinguishable from zero using 5% as the rejection threshold.
(Llaudet and Imai 2023:215, EMPHASIS ADDED)
As Llaudet and Imai (2023) note, we can use simple linear models to approximate the difference-in-means estimator and
derive our key estimand, the \widehat{\textit{ATE}}.
In groups, discuss how conceptualizing regressions as counterfactual prediction machines can help us understand what the
\widehat{\textit{ATE}} represents.
For the rest of today’s session, please work on your
Qualtrics survey assignment.
Click to Expand Flyer
Qualtrics Survey Assessment Deadline
Your Qualtrics survey assignments are due by 8:00 PM on Monday, March 2nd.
Grimmer and colleagues (2021:396, EMPHASIS ADDED)
A class of flexible algorithmic and statistical techniques for prediction and dimension reduction.
Molina and Garip (2019:28, EMPHASIS ADDED)
A way to learn from data and estimate complex functions that discover representations of some input (X), or link the input to an output (Y) in order to make predictions on new data.
Once deployed, SML algorithms learn the complex patterns linking X—a set of features Independent variables or predictors. —to a target variable Outcome or dependent variable. , Y.
The goal of SML is to optimize predictions—i.e., to find functions or algorithms that offer substantial predictive power when
confronted with new or unseen data.
SML algorithms include, but are not limited to, logistic regressions, random forests, ridge regressions, support vector machines and neural networks.
A quick note on terminology:
If a target variable is quantitative, we are dealing with a regression problem.
If a target variable is qualitative, we are dealing with a classification problem.
# A tibble: 80,000 × 6
fruit shape weight_g colour texture origin
<chr> <chr> <dbl> <chr> <chr> <chr>
1 apple spherical 150 red crispy china
2 banana curved 120 yellow creamy ecuador
3 orange spherical 148 orange juicy egypt
4 watermelon spherical 4500 green juicy spain
5 strawberry conical 13 red juicy mexico
6 grape spherical 5 green juicy chile
7 mango ellipsoidal 240 yellow juicy india
8 pineapple conical 2100 yellow juicy costa rica
9 apple spherical 140 green crispy usa
10 banana curved 110 yellow creamy ecuador
# ℹ 79,990 more rows
# A tibble: 1 × 6
fruit shape weight_g colour texture origin
<chr> <chr> <int> <chr> <chr> <chr>
1 <NA> conical 6 red juicy china
# A tibble: 8 × 2
fruit probability
<chr> <dbl>
1 apple 0.03
2 banana 0
3 orange 0.05
4 watermelon 0
5 strawberry 0.71
6 grape 0.21
7 mango 0
8 pineapple 0
# A tibble: 1 × 1
prediction
<chr>
1 strawberry

UML techniques search for a representation of the inputs (or features) that is more useful than X itself (Molina and Garip 2019).
With UML, analysts mine for hidden structure or latent patterns in
high-dimensional space.
What does this mean, exactly?
With UML, there is no observed outcome variable Y—or target—to supervise the estimation process. Instead, we only have inputs.
The primary goal in UML is to develop a lower-dimensional representation of complex data by inductively learning from the
interrelationships among inputs.
Reducing a vector of features to a small set of scales (e.g., via principal component analysis) or dividing a sample into a small number of hidden groups (e.g., via k-means clustering).
| V-Party Name | Label | Underlying Question |
|---|---|---|
v2paanteli |
Anti-Elitism | How important is anti-elite rhetoric for this party? |
v2papeople |
People-Centrism | Do leaders of this party glorify the ordinary people and identify themselves as part of them? |
v2paculsup |
Cultural Chauvinism | To what extent does the party leadership promote the cultural superiority of a specific social group or the nation as a whole? |
v2paminor |
Minority Rights | According to the leadership of this party, how often should the will of the majority be implemented even if doing so would violate the rights of minorities? |
v2paplur |
Political Pluralism | Prior to this election, to what extent was the leadership of this political party clearly committed to free and fair elections with multiple parties, freedom of speech, media, assembly and association? |
v2paopresp |
Demonization of Opponents | Prior to this election, have leaders of this party used severe personal attacks or tactics of demonization against their opponents? |
v2paviol |
Rejection of Political Violence | To what extent does the leadership of this party explicitly discourage the use of violence against domestic political opponents? |
Karim and Lukk’s The Radicalization of Mainstream Parties in the 21st Century
As Grimmer and colleagues note (2021:397, EMPHASIS ADDED) note,
“machine learning is as much a culture defined by a distinct set of values and tools as it is a set of algorithms.”
This point has, of course, been made elsewhere.
The terms generative and predictive—as opposed to data and algorithmic—come from
David Donoho’s (2017) 50 Years of Data Science.
| Quantity of Interest | Primary Goals | Key Strengths | Key Limitations |
|---|---|---|---|
| Generative (i.e., Classical Statistics) |
|||
| \hat{\beta} | Inferring relationships between X and Y | Interpretability, emphasis on uncertainty, explanatory power | Bounded by statistical assumptions, inattention to variance across samples. |
| Predictive (i.e., Machine Learning) |
|||
| \hat{y} | Generating accurate predictions of Y | Predictive power, can simplify high-dimensional data, relatively unbeholden to statistical assumptions. | Inattention to explanatory processes, opaque links between X and Y. |
To be sure, the putative strengths and weaknesses of these modelling “cultures” have been hotly debated.
Advances in machine learning can provide empirical leverage to social scientists and sharpen social theory in one fell swoop.
Lundberg, Brand and Jeon (2022), for instance, argue that adopting a machine learning framework can help social scientists:
Image can be retrieved here.
Emerges when we build SML algorithms that fail to sufficiently map the patterns—or pick up the empirical signal–linking X and Y. Think: underfitting.
Arises when our algorithms not only pick up the signal linking X and Y, but some of the noise in our data as well. Think: overfitting.
To strike the optimal balance between bias and variance.
In an SML setting, we want to reduce our algorithm’s generalization or test error—i.e., “the prediction error of a model on new data” (Molina and Garip 2019).
To arrive at an estimate of our model’s performance, we can (randomly) partition our global sample of observations into disjoint sets or subsamples.
We can use a validation set to find the right estimator out of a series of candidate algorithms—or select the best-fitting parameterization of a single algorithm.
Using both training and validation sets can be costly: data sparsity can amplify variance.
Thus, when limited to smaller samples, analysts often combine training and validation—say, by recycling training data for model tuning and selection.
We can use a testing set to generate a measure of our model’s predictive accuracy (e.g., the F1 score for classification problems)—or to derive our generalization error.
This subsample is used only once (to report the performance metric); put another way, it cannot be used to train, tune or select our algorithm.
Unlike conventional approaches to sample partition, v or k-fold cross-validation allows us to learn from all our data.
k-fold cross-validation proceeds as follows:
Stratified k-fold cross-validation ensures that the distribution of class labels—or for numeric targets, the mean—is relatively constant across folds.
In SML settings, we automatically learn the parameters (e.g., coefficients) of our algorithms during estimation.
Hyperparameters, on the other hand, are chosen by the analyst, guide the entire learning or estimation process, and can powerfully shape our algorithm’s predictive performance.
Test different values via trial and error.
Use automated procedures like GridSearchCV from scikit-learn.
Figure 8 from Mittleman (2022).
Figure 8 from Boutyline and colleagues (2023)
For the rest of today’s session, please work on your
Qualtrics survey assignment once again.
Note: Scroll to access the entire bibliography
