library(tidyverse)
library(randomForest)
setwd("/Users/kevinli/Documents/GitHub/kevinli03.github.io/election/us2024")
data <- read_csv("data.csv")
Presidential Prediction
Disclaimer: I collected all the data and ran all these models in about 2 hours. I did not put much effort into this, so it's not very good. Just for fun. I might add ridge and lasso later, but I'm lazy.
- Very low sample size as well - just 22 presidential elections included as training data.
- Also, this is a very unique presidential election - may be flawed to assume that past data can predict what will happen this time.
3 types of models on here (use table of contents to access or scroll down)
- Electoral College Prediction Models
- Winner/Loser Prediction Models
- My own gut based prediction (with map!)
Electoral Votes Predictions
I create 3 different predictions for the 2024 election for each model.
Most Likely Scenario: polls say Harris leads by 1.2% (according to the 538 aggregator, which seems to be the most accurate)
Moderately Likely Scenario: polls say Harris leads by 2% (The Economist's aggregator, rounded up)
Unlikely Scenario: polls say Harris leads by 3% (a few polls say this, but it is unlikely to be true).
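The three scenarios differ only in the poll-margin input. A minimal sketch of how such scenario rows could be built is below. The column names `poll_margin` and `other_predictor`, and the placeholder value 0, are assumptions for illustration and are not taken from the actual data file.

```r
# Hypothetical 2024 baseline row; column names and values are placeholders,
# not the real contents of data.csv.
base_2024 <- data.frame(year = 2024, poll_margin = NA_real_, other_predictor = 0)

# One row per scenario, identical except for the poll margin
# (incumbent-party candidate minus challenger, in points).
scenarios <- base_2024[rep(1, 3), ]
scenarios$poll_margin <- c(1.2, 2, 3)
rownames(scenarios) <- c("most_likely", "moderately_likely", "unlikely")
scenarios
```

Keeping every other predictor fixed means any difference between the three predictions comes from the poll margin alone.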
2 Best Models (by recent performance)
Random Forest Model (5 Variables Bootstrap Sampled) With Height Excluded:
Most Likely: If Harris leads 1.2% in polls: Harris 266.97, Trump 271.03
If Harris leads 2% in polls: Harris 269.24, Trump 268.76
If Harris leads 3% in polls (unlikely): Harris 297, Trump 241
Past Performance (5 out of last 6 elections correct winner)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent) | 250.5 | 264.2 | 317.4 | 188.6 | 285.1 | 272.9 |
Correct Winner? | Yes | Yes | Yes | Yes | Yes | No |
Actual (Incumbent) | 232 | 227 | 332 | 173 | 286 | 266 |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
% of Variance Explained in \(Y\) (Electoral Votes Received by Incumbent): 42.15%
Random Forest Model (5 Variables Bootstrap Sampled) with Height Included:
Most Likely: If Harris leads 1.2% in polls: Harris 267.88, Trump 270.12
If Harris leads 2% in polls: Harris 271.96, Trump 266.04
If Harris leads 3% in polls (unlikely): Harris 296.45, Trump 241.55
Past performance (5 out of last 6 elections correct winner)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent) | 253.5 | 267.6 | 313.7 | 191.2 | 272.7 | 276.2 |
Correct Winner? | Yes | Yes | Yes | Yes | Yes | No |
Actual (Incumbent) | 232 | 227 | 332 | 173 | 286 | 266 |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
% of Variance Explained in \(Y\) (Electoral Votes Received by Incumbent): 37.64%
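The "% of Variance Explained" figures are the out-of-bag pseudo R-squared that `randomForest` prints for regression forests. To the best of my understanding of the package, it is one minus the out-of-bag mean squared error divided by the variance of Y. The sketch below reproduces that arithmetic; the vectors `y` and `oob_pred` are illustrative stand-ins, not the election data or real model output.

```r
# Illustrative values only: y stands in for the incumbent's share of
# electoral votes, oob_pred for out-of-bag predictions from a forest.
y        <- c(0.43, 0.42, 0.62, 0.32, 0.53, 0.49)
oob_pred <- c(0.47, 0.50, 0.58, 0.36, 0.50, 0.51)

mse_oob <- mean((y - oob_pred)^2)     # "Mean of squared residuals"
var_y   <- mean((y - mean(y))^2)      # variance with an n denominator
pct_var_explained <- 100 * (1 - mse_oob / var_y)
round(pct_var_explained, 2)
```

Because the error is measured out of bag, this number can be much lower than an in-sample R-squared, and it can even go negative for a model that predicts worse than the mean.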
Other Models
Random Forest (20 Variables Bootstrap Sampled) with Height:
If Harris leads 1.2% in polls: Harris 229.14 Electoral Votes, Trump 308.86
If Harris leads 2% in polls: Harris 232.33 Electoral Votes, Trump 305.67
If Harris leads 3% in polls (unlikely): Harris 290.29 Electoral Votes, Trump 247.71
Past performance (4 out of last 6 elections correct winner)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent) | 237.6 | 278.9 | 300.2 | 186.5 | 259 | 261.4 |
Correct Winner? | Yes | No | Yes | Yes | No | Yes |
Actual (Incumbent) | 232 | 227 | 332 | 173 | 286 | 266 |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
% of Variance Explained in \(Y\) (Electoral Votes Received by Incumbent): 56.94%
Bagging Model with Height Excluded:
If Harris leads 1.2% in polls: 229.66 Harris, 308.34 Trump
If Harris leads 2% in polls: 235.41 Harris, 302.59 Trump
If Harris leads 3% in polls (unlikely): 307 Harris, 231 Trump
Past Performance (4 out of last 6 elections correct winner)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent) | 225 | 294.7 | 302.8 | 190.4 | 263.2 | 253.7 |
Correct Winner? | Yes | No | Yes | Yes | No | Yes |
Actual (Incumbent) | 232 | 227 | 332 | 173 | 286 | 266 |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
% of Variance Explained in \(Y\) (Electoral Votes Received by Incumbent): 54.12%
Random Forest (19 Variables Bootstrap Sampled) With Height Excluded:
If Harris leads 1.2% in polls: 227.07 Harris, 310.93 Trump
If Harris leads 2% in polls: 229.9 Harris, 308.1 Trump
If Harris leads 3% in polls (unlikely): 290.72 Harris, 247.28 Trump
Past Performance (4 out of last 6 elections correct winner)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent) | 227.5 | 281.8 | 306.2 | 184.2 | 264.3 | 258.2 |
Correct Winner? | Yes | No | Yes | Yes | No | Yes |
Actual (Incumbent) | 232 | 227 | 332 | 173 | 286 | 266 |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
% of Variance Explained in \(Y\) (Electoral Votes Received by Incumbent): 56.42%
Bagging With Height:
If Harris leads 1.2% in polls: Harris 215.74, Trump 322.26
If Harris leads 2% in polls: Harris 221.1, Trump 316.9
If Harris leads 3% in polls (unlikely): Harris 293.18, Trump 244.82
Past Performance (4 out of last 6 elections correct winner)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent) | 230.7 | 292.4 | 296.2 | 185.2 | 254.4 | 259.4 |
Correct Winner? | Yes | No | Yes | Yes | No | Yes |
Actual (Incumbent) | 232 | 227 | 332 | 173 | 286 | 266 |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
% of Variance Explained in \(Y\) (Electoral Votes Received by Incumbent): 55.92%
Win/Lose Probability Predictions
Note: Every model in this section correctly predicts the election winners of the last 6 elections.
Aggregate: All believe Trump will win if Harris leads by only 1.2% or 2% in polls.
Naive Bayes Without Height:
Most Likely - If Harris leads 1.2% in polls: Trump wins
Moderately Likely - If Harris leads 2% in polls: Trump wins
Unlikely - If Harris leads 3% in polls: Trump wins
Past performance (6/6 for the last 6 elections)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent Result) | Lost | Lost | Won | Lost | Won | Lost |
Correct Winner? | Yes | Yes | Yes | Yes | Yes | Yes |
Actual (Incumbent) | Lost | Lost | Won | Lost | Won | Lost |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
Error rate over last 22 elections: NA (too lazy to calculate)
Bagging Model Without Height:
Most Likely - If Harris leads 1.2% in polls: Trump wins
Moderately Likely - If Harris leads 2% in polls: Trump wins
Unlikely - If Harris leads 3% in polls: Trump wins
Past performance (6/6 for the last 6 elections)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent Result) | Lost | Lost | Won | Lost | Won | Lost |
Correct Winner? | Yes | Yes | Yes | Yes | Yes | Yes |
Actual (Incumbent) | Lost | Lost | Won | Lost | Won | Lost |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
Error rate over last 22 elections: 22.73%
Random Forest (5 variables bootstrapped) Without Height:
Most Likely - If Harris leads 1.2% in polls: Trump wins
Moderately Likely - If Harris leads 2% in polls: Trump wins
Unlikely - If Harris leads 3% in polls: Trump wins
Past performance (6/6 for the last 6 elections)
Year | 2020 | 2016 | 2012 | 2008 | 2004 | 2000 |
---|---|---|---|---|---|---|
Model (Incumbent Result) | Lost | Lost | Won | Lost | Won | Lost |
Correct Winner? | Yes | Yes | Yes | Yes | Yes | Yes |
Actual (Incumbent) | Lost | Lost | Won | Lost | Won | Lost |
Incumbent | Trump | Clinton | Obama | McCain | Bush | Gore |
Error rate over last 22 elections: 27.27%
My Personal (Gut-Based) Predictions
My Personal Map
The Data in My Models
I used the following variables (which I gathered in about an hour). I did not check data quality, so it could be terrible.
- Year
- Incumbent Party
- Reelection (is one of the candidates the current president?)
- Number of consecutive terms the incumbent party has been in office
- Poll margin (incumbent party candidate - challenger)
- Real GDP growth in election year
- Unemployment rate in election year
- Inflation rate in election year
- Incumbent party president average approval rating (gallup)
- Incumbent party president highest approval rating (gallup)
- Incumbent party president lowest approval rating (gallup)
- Recession occurred in the past 4 years?
- House net change in seats (of incumbent party) during the midterm election 2 years before the election
- House net change in seats (of incumbent party) during the election 4 years ago
- House change in seats (of incumbent party) for both midterm and 4 year ago election combined.
- Incumbent party faced a primary challenge? (A challenge is defined as the primary winner receiving less than 60% of the votes. I classify Harris as facing no primary challenge.)
- S&P 500 Returns in election year
- Midterm house elections from 2 years ago, incumbent party’s vote share overall (entire country)
- % Change in Jobs from last election to this election (a full presidential term)
- Height of the candidates (incumbent - challenger)
- Real estate returns in election year
- Treasury 10-year bond returns in election year.
- Does incumbent party have majority in House?
- Does incumbent party have majority in Senate?
- How many workers were furloughed as a result of a government shutdown during the current administration?
- Did any of the 15 largest tax rises in history occur during the current administration? If so, how much was the tax rise (% wise)
- Did any of the 15 largest tax breaks in history occur during the current administration? If so, how much was the tax break (% wise)
- The net of the last two variables.
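Several variables in the list are derived from others (the combined House seat change and the net tax figure). A minimal sketch of how such columns could be computed is below. All column names and values are hypothetical, and the sign convention for the net tax figure (rises minus breaks) is an assumption, since the list does not specify it.

```r
# Hypothetical column names and toy values; the real data file may differ.
toy <- data.frame(
  house_midterm = c(-9, -40),  # seat change in the midterm 2 years before
  house_prev    = c(-13, 6),   # seat change in the election 4 years ago
  tax_rise_pct  = c(0, 0.5),
  tax_break_pct = c(1.2, 0)
)

# Combined House seat change over both elections.
toy$house_combined <- toy$house_midterm + toy$house_prev

# Net tax change; rises minus breaks is an assumed convention.
toy$tax_net <- toy$tax_rise_pct - toy$tax_break_pct
toy
```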
Start with the data analysis.
Load packages, clean data
clean data
dta <- data[-c(1:3),] # get rid of the first 3 rows (the 2024 scenarios)
dta <- dta %>%
  select(-c(Incumbent, Challenger))
Electoral College (R Code)
Models with Height
Bagging:
set.seed(32435)
bagging <- randomForest(Pct_incumb ~ .,
  ntree = 501,
  nodesize = 1,
  data = dta,
  na.action = na.omit,
  mtry = 28,
  importance = TRUE)
bagging
Call:
randomForest(formula = Pct_incumb ~ ., data = dta, ntree = 501, nodesize = 1, mtry = 28, importance = TRUE, na.action = na.omit)
Type of random forest: regression
Number of trees: 501
No. of variables tried at each split: 28
Mean of squared residuals: 0.03047664
% Var explained: 55.92
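Setting `mtry` to the full number of predictors makes every split consider all variables, which is what turns a random forest into bagging. The sketch below shows how the `mtry` values relate, assuming the model with height has 28 predictors (as the bagging call implies) and using the default `mtry` rules documented for `randomForest`.

```r
p <- 28  # assumed number of predictors in the model with height

mtry_bagging       <- p                     # all variables tried at each split
mtry_default_reg   <- max(floor(p / 3), 1)  # package default for regression
mtry_default_class <- floor(sqrt(p))        # package default for classification

c(bagging = mtry_bagging,
  regression_default = mtry_default_reg,
  classification_default = mtry_default_class)
```

Smaller `mtry` values decorrelate the trees at the cost of weaker individual trees, which is the trade-off behind the 5-variable and 20-variable forests here.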
set.seed(32435)
#prediction
data2024 <- data[1:9,]
data2024 <- data2024[,-c(2,3,4)]
set.seed(32435)
bagging_pred <- predict(bagging, newdata = data2024)
bagging_pred * 538
1 2 3 4 5 6 7 8
215.7469 221.1062 293.1835 230.7959 292.4063 296.2063 185.2304 254.4689
9
259.4328
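The nine values are easier to read once labeled: rows 1 to 3 are the 2024 scenarios and rows 4 to 9 are the 2020 to 2000 elections, matching the summary tables earlier. A small sketch that labels them, derives the challenger's total, and re-checks the "correct winner" count against the actual results from those tables:

```r
# Predicted incumbent-party electoral votes, copied from the output above.
ev_pred <- c(215.7469, 221.1062, 293.1835, 230.7959, 292.4063,
             296.2063, 185.2304, 254.4689, 259.4328)
names(ev_pred) <- c("2024 (+1.2)", "2024 (+2)", "2024 (+3)",
                    "2020", "2016", "2012", "2008", "2004", "2000")

# The challenger gets whatever is left of the 538 total.
ev_challenger <- 538 - ev_pred

# Back-check on 2020-2000: 270 electoral votes are needed to win.
actual  <- c(232, 227, 332, 173, 286, 266)
correct <- (ev_pred[4:9] >= 270) == (actual >= 270)
sum(correct)  # number of elections with the correct winner
```

Note that the training data includes these six elections, so this check is in-sample rather than a true out-of-sample test.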
Random Forest (5 Variables Bootstrap Sampled):
set.seed(32435)
forest <- randomForest(Pct_incumb ~ .,
  ntree = 501,
  nodesize = 1,
  data = dta,
  na.action = na.omit,
  mtry = 5,
  importance = TRUE)
forest
Call:
randomForest(formula = Pct_incumb ~ ., data = dta, ntree = 501, nodesize = 1, mtry = 5, importance = TRUE, na.action = na.omit)
Type of random forest: regression
Number of trees: 501
No. of variables tried at each split: 5
Mean of squared residuals: 0.043118
% Var explained: 37.64
# predictions
set.seed(32435)
forest_pred <- predict(forest, newdata = data2024)
forest_pred * 538
1 2 3 4 5 6 7 8
267.8893 271.9680 296.4508 253.5039 267.6710 313.7579 191.2754 272.7308
9
276.2091
Random Forest (20 Variables Bootstrap Sampled):
set.seed(32435)
forest <- randomForest(Pct_incumb ~ .,
  ntree = 501,
  nodesize = 1,
  data = dta,
  na.action = na.omit,
  mtry = 20,
  importance = TRUE)
forest
Call:
randomForest(formula = Pct_incumb ~ ., data = dta, ntree = 501, nodesize = 1, mtry = 20, importance = TRUE, na.action = na.omit)
Type of random forest: regression
Number of trees: 501
No. of variables tried at each split: 20
Mean of squared residuals: 0.02977701
% Var explained: 56.94
set.seed(32435)
forest_pred <- predict(forest, newdata = data2024)
forest_pred * 538
1 2 3 4 5 6 7 8
229.1486 232.3351 290.2950 237.6001 278.9861 300.2055 186.5987 259.0048
9
261.4613
Importance
varImpPlot(bagging, type = 2)
varImpPlot(forest, type = 2)
Models without Height
dta_noheight <- dta %>%
  select(-height)
data2024_noheight <- data2024 %>%
  select(-height)
bagging
set.seed(32435)
bagging1 <- randomForest(Pct_incumb ~ .,
  ntree = 501,
  nodesize = 1,
  data = dta_noheight,
  na.action = na.omit,
  mtry = 27,
  importance = TRUE)
bagging1
Call:
randomForest(formula = Pct_incumb ~ ., data = dta_noheight, ntree = 501, nodesize = 1, mtry = 27, importance = TRUE, na.action = na.omit)
Type of random forest: regression
Number of trees: 501
No. of variables tried at each split: 27
Mean of squared residuals: 0.03172727
% Var explained: 54.12
set.seed(32435)
bagging_pred1 <- predict(bagging1, newdata = data2024_noheight)
bagging_pred1 * 538
1 2 3 4 5 6 7 8
229.6651 235.4156 307.0048 225.0212 294.7422 302.8635 190.4628 263.2217
9
253.7652
Random Forest (19 Variables Bootstrap Sampled):
set.seed(32435)
forest1 <- randomForest(Pct_incumb ~ .,
  ntree = 501,
  nodesize = 1,
  data = dta_noheight,
  na.action = na.omit,
  mtry = 19,
  importance = TRUE)
forest1
Call:
randomForest(formula = Pct_incumb ~ ., data = dta_noheight, ntree = 501, nodesize = 1, mtry = 19, importance = TRUE, na.action = na.omit)
Type of random forest: regression
Number of trees: 501
No. of variables tried at each split: 19
Mean of squared residuals: 0.03013152
% Var explained: 56.42
set.seed(32435)
forest_pred1 <- predict(forest1, newdata = data2024_noheight)
forest_pred1 * 538
1 2 3 4 5 6 7 8
227.0768 229.9066 290.7208 227.5034 281.8953 306.2310 184.2865 264.3723
9
258.2720
Random Forest (5 Variables Bootstrap Sampled):
set.seed(32435)
forest1 <- randomForest(Pct_incumb ~ .,
  ntree = 501,
  nodesize = 1,
  data = dta_noheight,
  na.action = na.omit,
  mtry = 5,
  importance = TRUE)
forest1
Call:
randomForest(formula = Pct_incumb ~ ., data = dta_noheight, ntree = 501, nodesize = 1, mtry = 5, importance = TRUE, na.action = na.omit)
Type of random forest: regression
Number of trees: 501
No. of variables tried at each split: 5
Mean of squared residuals: 0.04000085
% Var explained: 42.15
set.seed(32435)
forest_pred1 <- predict(forest1, newdata = data2024_noheight)
forest_pred1 * 538
1 2 3 4 5 6 7 8
266.9741 269.2456 297.0022 250.5545 264.2572 317.4231 188.6188 285.1010
9
272.9281
Win/Lose (R Code)
With Height
clean data
win <- read_csv("win.csv")
win2024 <- win[1:9,]
win2024 <- win2024[,-2]
win <- win[-c(1:3),] # get rid of the first 3 rows
win$win <- as.factor(win$win)
bagging
set.seed(32435)
bagging_win <- randomForest(win ~ .,
  ntree = 501,
  nodesize = 1,
  data = win,
  na.action = na.omit,
  mtry = 28,
  importance = TRUE)
bagging_win
Call:
randomForest(formula = win ~ ., data = win, ntree = 501, nodesize = 1, mtry = 28, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 501
No. of variables tried at each split: 28
OOB estimate of error rate: 27.27%
Confusion matrix:
0 1 class.error
0 8 2 0.2000000
1 4 8 0.3333333
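The OOB error rate and the per-class errors follow directly from the confusion matrix. A short sketch of that arithmetic, using the counts printed above and assuming the usual layout of rows as actual classes and columns as predicted classes:

```r
# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
cm <- matrix(c(8, 2,
               4, 8),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

class_error <- 1 - diag(cm) / rowSums(cm)   # per-class error
oob_error   <- 1 - sum(diag(cm)) / sum(cm)  # overall OOB error rate
round(100 * oob_error, 2)
```

With only 22 elections, each misclassified election moves the error rate by about 4.5 percentage points, so small differences between models should not be over-read.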
#prediction
set.seed(32435)
bagging_winpred <- predict(bagging_win, newdata = win2024, type = "prob")
bagging_winpred
0 1
1 0.6866267 0.31337325
2 0.6846307 0.31536926
3 0.6227545 0.37724551
4 0.8642715 0.13572854
5 0.7844311 0.21556886
6 0.1836327 0.81636727
7 0.9900200 0.00998004
8 0.1876248 0.81237525
9 0.8722555 0.12774451
attr(,"class")
[1] "matrix" "array" "votes"
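To turn these probabilities into the Won/Lost labels used in the summary tables, one option is a simple 0.5 cutoff on the "1" column. The sketch below assumes class "1" means the incumbent party wins (which matches the 2012 and 2004 rows) and uses the same row order as before: three 2024 scenarios, then 2020 to 2000.

```r
# P(incumbent party wins): the "1" column of the output above.
p_win <- c(0.31337325, 0.31536926, 0.37724551, 0.13572854, 0.21556886,
           0.81636727, 0.00998004, 0.81237525, 0.12774451)
names(p_win) <- c("2024 (+1.2)", "2024 (+2)", "2024 (+3)",
                  "2020", "2016", "2012", "2008", "2004", "2000")

# Above 0.5 means most trees voted for an incumbent-party win.
result <- ifelse(p_win > 0.5, "Won", "Lost")
result
```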
Random Forest (5 Variables Bootstrap Sampled):
bagging_win <- randomForest(win ~ .,
  ntree = 501,
  nodesize = 1,
  data = win,
  na.action = na.omit,
  mtry = 5,
  importance = TRUE)
bagging_win
Call:
randomForest(formula = win ~ ., data = win, ntree = 501, nodesize = 1, mtry = 5, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 501
No. of variables tried at each split: 5
OOB estimate of error rate: 31.82%
Confusion matrix:
0 1 class.error
0 8 2 0.2000000
1 5 7 0.4166667
set.seed(32435)
bagging_winpred <- predict(bagging_win, newdata = win2024, type = "prob")
bagging_winpred
0 1
1 0.5888224 0.41117764
2 0.5868263 0.41317365
3 0.5489022 0.45109780
4 0.8423154 0.15768463
5 0.8003992 0.19960080
6 0.1956088 0.80439122
7 0.9600798 0.03992016
8 0.2195609 0.78043912
9 0.8323353 0.16766467
attr(,"class")
[1] "matrix" "array" "votes"
naive bayes
library(e1071)
bayes <- naiveBayes(win ~ ., data = win)
set.seed(32435)
bayes_winpred <- predict(bayes, newdata = win2024, type = "raw")
bayes_winpred
0 1
[1,] 0.9995677419 4.322581e-04
[2,] 0.9994574316 5.425684e-04
[3,] 0.9992597549 7.402451e-04
[4,] 0.9999948160 5.184010e-06
[5,] 0.9999765940 2.340603e-05
[6,] 0.0001471148 9.998529e-01
[7,] 1.0000000000 1.548431e-13
[8,] 0.1160488759 8.839511e-01
[9,] 0.9941196876 5.880312e-03
Without Height
win_noheight <- win %>%
  select(-height)
win2024_noheight <- win2024 %>%
  select(-height)
Bagging:
set.seed(32435)
bagging_win1 <- randomForest(win ~ .,
  ntree = 501,
  nodesize = 1,
  data = win_noheight,
  na.action = na.omit,
  mtry = 27,
  importance = TRUE)
bagging_win1
Call:
randomForest(formula = win ~ ., data = win_noheight, ntree = 501, nodesize = 1, mtry = 27, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 501
No. of variables tried at each split: 27
OOB estimate of error rate: 22.73%
Confusion matrix:
0 1 class.error
0 8 2 0.20
1 3 9 0.25
set.seed(32435)
bagging_winpred1 <- predict(bagging_win1, newdata = win2024_noheight, type = "prob")
bagging_winpred1
0 1
1 0.6506986 0.34930140
2 0.6487026 0.35129741
3 0.5748503 0.42514970
4 0.8722555 0.12774451
5 0.7684631 0.23153693
6 0.1756487 0.82435130
7 0.9680639 0.03193613
8 0.2055888 0.79441118
9 0.8403194 0.15968064
attr(,"class")
[1] "matrix" "array" "votes"
Random Forest (5 Variables bootstrapped):
set.seed(32435)
bagging_win1 <- randomForest(win ~ .,
  ntree = 501,
  nodesize = 1,
  data = win_noheight,
  na.action = na.omit,
  mtry = 5,
  importance = TRUE)
bagging_win1
Call:
randomForest(formula = win ~ ., data = win_noheight, ntree = 501, nodesize = 1, mtry = 5, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 501
No. of variables tried at each split: 5
OOB estimate of error rate: 27.27%
Confusion matrix:
0 1 class.error
0 6 4 0.4000000
1 2 10 0.1666667
set.seed(32435)
bagging_winpred1 <- predict(bagging_win1, newdata = win2024_noheight, type = "prob")
bagging_winpred1
0 1
1 0.5489022 0.45109780
2 0.5489022 0.45109780
3 0.5329341 0.46706587
4 0.8243513 0.17564870
5 0.8283433 0.17165669
6 0.1377246 0.86227545
7 0.9640719 0.03592814
8 0.1696607 0.83033932
9 0.8103792 0.18962076
attr(,"class")
[1] "matrix" "array" "votes"
Naive Bayes:
set.seed(32435)
bayes1 <- naiveBayes(win ~ ., data = win_noheight)
bayes_winpred1 <- predict(bayes1, newdata = win2024_noheight, type = "raw")
set.seed(32435)
bayes_winpred1
0 1
[1,] 0.9397623361 6.023766e-02
[2,] 0.9255270741 7.447293e-02
[3,] 0.9010603924 9.893961e-02
[4,] 0.9999976463 2.353692e-06
[5,] 0.9976060912 2.393909e-03
[6,] 0.0001890752 9.998109e-01
[7,] 1.0000000000 6.021154e-13
[8,] 0.0326591074 9.673409e-01
[9,] 0.9966782550 3.321745e-03
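The three "Without Height" classifiers can be compared side by side for the 2024 scenarios. The sketch below copies the "1" column (rows 1 to 3) from each output above and counts how many models favor the incumbent party. The 0.5 cutoff and the simple vote count are an illustration, not part of the original analysis, and class "1" is assumed to mean an incumbent-party (Harris) win.

```r
# P(incumbent party wins) for the three 2024 scenarios, copied from the
# "Without Height" outputs above (column "1", rows 1-3).
p_win <- rbind(
  bagging       = c(0.34930140, 0.35129741, 0.42514970),
  random_forest = c(0.45109780, 0.45109780, 0.46706587),
  naive_bayes   = c(0.06023766, 0.07447293, 0.09893961)
)
colnames(p_win) <- c("+1.2", "+2", "+3")

# How many of the three models put Harris above 0.5 in each scenario?
votes_for_harris <- colSums(p_win > 0.5)
votes_for_harris
```

The random forest probabilities sit closest to 0.5, so that model is the least certain of the three.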