What Factors Help Predict the Outcome of the 2020 Election?

An Analysis by Neha Swamy

December 21, 2020

Table of Contents

  1. Introduction
  2. Data Collection
  3. Data Processing
  4. Data Exploration and Visualization
  5. Analysis and Hypothesis Testing
  6. Insight
  7. Works Cited

1. Introduction

The result of the 2020 presidential election was a critical moment in our lives. Many of us sat gripping the arms of our couches as the news anchors announced the projected winner of each state one by one.

Although the overall outcome was not obvious on the day of the election, we usually can predict the result within certain states. It's no shock that states like Kentucky, Alabama, and Arkansas voted for Trump, while states like California, Massachusetts, and Connecticut voted for Biden. However, what exactly allows us to make such predictions? Is it the racial makeup? How about income? Education? Unemployment? In this project, I attempt to find the factors that are associated with the outcome of the 2020 election.

2. Data Collection

This is the data collection stage of the data life cycle. In this part, we collect data from websites, databases, CSV files, etc.

I will specifically be using a dataset from Kaggle that includes the percentage of voters who voted for Trump, the percentage of voters who voted for Biden, demographic information, coronavirus information, income information, and employment information by county. The dataset also includes the percentage of voters who voted for Trump and Clinton in the 2016 election.

I will first use the pandas library to read the data into a table known as a DataFrame. Since the file is a .csv file, I will use the pd.read_csv() function.

For further reading, here is the pandas DataFrame documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [1]:
import pandas as pd

data = pd.read_csv('county_statistics.csv', sep = ',')

3. Data Processing

The next step of the data lifecycle is data processing. Here, we attempt to "clean up" our data to put it in a more readable form. We can remove unnecessary columns, duplicate data, rows with missing information, etc.

In this project, I will be removing an unnecessary column titled "Unnamed: 0" and rows with missing information.

In [2]:
data = data.drop('Unnamed: 0', axis = 1)
data = data.dropna()
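
Before dropping rows, it can be useful to check how much data is actually missing. The following is a minimal sketch on a made-up three-row table (the `toy` DataFrame and its values are hypothetical and are not taken from the Kaggle dataset); it counts missing values per column and reports how many rows `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the three rows contain a missing value.
toy = pd.DataFrame({
    'county': ['A', 'B', 'C'],
    'Income': [50000.0, np.nan, 61000.0],
    'Poverty': [12.0, 9.5, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print('Rows dropped:', rows_dropped)
```

Running the same two checks on `data` before the cell above would show whether the missing values are concentrated in a few columns or spread across many counties.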

Since the percentages of voters who voted for Trump, Biden, and Clinton are stored as decimals (proportions), let's multiply each of these columns by 100 so that they read as percentages.

In [3]:
# Scale the four vote-share columns from proportions (0-1) to percentages (0-100).
vote_share_columns = ['percentage16_Donald_Trump', 'percentage16_Hillary_Clinton',
                      'percentage20_Donald_Trump', 'percentage20_Joe_Biden']
data[vote_share_columns] = data[vote_share_columns] * 100.0

Now, let's look at the first 10 values of the DataFrame to see what our data looks like.

In [4]:
data.head(10)
Out[4]:
county state percentage16_Donald_Trump percentage16_Hillary_Clinton total_votes16 votes16_Donald_Trump votes16_Hillary_Clinton percentage20_Donald_Trump percentage20_Joe_Biden total_votes20 ... Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
0 Abbeville SC 62.9 34.6 10724.0 6742.0 3712.0 66.1 33.0 12433.0 ... 1.8 1.8 6.5 25.8 9505.0 78.8 13.3 7.8 0.1 9.4
1 Acadia LA 77.3 20.6 27386.0 21159.0 5638.0 79.5 19.1 28425.0 ... 1.6 2.2 2.5 27.6 24982.0 80.0 12.1 7.6 0.3 8.9
2 Accomack VA 54.5 42.8 15755.0 8582.0 6737.0 54.2 44.7 16938.0 ... 2.6 1.8 4.5 22.0 13837.0 74.6 18.1 7.1 0.2 5.4
3 Ada ID 47.9 38.7 195587.0 93748.0 75676.0 50.4 46.5 259389.0 ... 1.5 2.8 6.9 20.4 214984.0 78.3 15.0 6.6 0.1 4.3
4 Adair IA 65.3 30.0 3759.0 2456.0 1127.0 69.7 28.6 4183.0 ... 2.8 0.4 6.2 22.3 3680.0 73.8 15.3 10.4 0.5 3.0
5 Adair KY 80.6 16.1 8231.0 6637.0 1323.0 83.0 15.9 8766.0 ... 2.6 0.5 3.4 22.2 7988.0 74.1 15.8 9.9 0.1 6.2
6 Adair MO 59.4 34.5 10137.0 6019.0 3495.0 61.8 35.8 10337.0 ... 4.0 2.6 4.0 17.1 11274.0 73.6 20.9 5.3 0.2 5.5
7 Adair OK 73.5 21.2 6468.0 4753.0 1374.0 78.6 19.5 7108.0 ... 2.8 1.0 3.2 23.1 8130.0 71.6 20.4 7.5 0.5 5.5
8 Adams CO 42.1 49.4 175125.0 73807.0 86471.0 40.4 56.7 234599.0 ... 1.2 1.1 5.0 29.2 246450.0 83.6 11.2 5.1 0.1 5.1
9 Adams IA 66.9 27.1 2082.0 1393.0 565.0 70.8 27.3 2158.0 ... 3.3 0.8 5.3 19.6 1796.0 72.0 12.1 15.5 0.3 4.2

10 rows × 50 columns

4. Data Exploration and Visualization

The next step of the data lifecycle is to explore and visualize our data. This allows us to discover potential patterns and find points of interest before analyzing the data.

4.A Histogram and Boxplot

Let's first examine our data more generally. We can make a histogram and a boxplot of the percentage of voters who voted for Biden. I will also be using the pyplot module from the matplotlib library (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.html) to add information (x-axis label, y-axis label, and title) to the graphs.

In [5]:
from matplotlib import pyplot 

data.plot(y = 'percentage20_Joe_Biden', kind = 'hist', legend = None)
pyplot.xlabel('Percentage')
pyplot.title('Histogram of The Percentage of Voters Who Voted for Biden')
pyplot.show()

data.plot(y = 'percentage20_Joe_Biden', kind = 'box', legend = None)
pyplot.xlabel('')
pyplot.ylabel('Percentage')
pyplot.title('Boxplot of The Percentage of Voters Who Voted for Biden')
pyplot.show()

Even though Joe Biden won the election, the histogram suggests that the mean county-level percentage of voters who voted for Biden was only around 30%! The boxplot also shows that counties in which more than roughly 70-75% of voters voted for Biden are outliers. However, this is not necessarily surprising: every county counts equally in these plots regardless of its size, and the counties that voted for Biden were most likely more populous and more densely populated. Thus, these counties would have influenced the overall outcome more than less populous counties.
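
To illustrate why a low county-level mean is compatible with winning the overall vote, here is a small sketch that compares the unweighted mean across counties with a mean weighted by total votes. The column names match the dataset, but the three rows are made-up numbers chosen only to show the effect.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two small counties and one very large county.
toy = pd.DataFrame({
    'percentage20_Joe_Biden': [30.0, 25.0, 70.0],
    'total_votes20': [1000.0, 2000.0, 100000.0],
})

# Every county counts equally (this is what the histogram reflects).
unweighted_mean = toy['percentage20_Joe_Biden'].mean()

# Every vote counts equally (closer to the actual popular vote).
weighted_mean = np.average(toy['percentage20_Joe_Biden'],
                           weights=toy['total_votes20'])

print('Unweighted county mean:', round(unweighted_mean, 2))
print('Vote-weighted mean:', round(weighted_mean, 2))
```

In this toy example the large county pulls the vote-weighted mean far above the unweighted county mean, which is the same effect described above.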

4.B Regression Plots

Now, I will examine different factors to see if there is an association between them and the percentage of voters who voted for Joe Biden. Specifically, I will look at the white population, income, income per capita, unemployment rate, poverty rate, and percentage of voters who voted for Hillary Clinton in 2016 of each county.

I will be using the seaborn library (https://seaborn.pydata.org/), a statistical data visualization library, to plot linear regression plots. I will also be using matplotlib to add information to the plots.

In [6]:
import seaborn as sns

'''
This function creates a regression plot, where x is the column name of one of the factors listed above, xlabel is
the label of the x-axis, title is the title of the plot, and data is the DataFrame with all the data.
'''
def plt(x, xlabel, title, data):
    # Use pyplot to create a canvas for the plot and seaborn to create the regression plot. 
    pyplot.figure(figsize = (10, 5))
    sns.regplot(x = x, y = 'percentage20_Joe_Biden', data = data)
    
    # Add the x-label, y-label, and title
    pyplot.xlabel(xlabel)
    pyplot.ylabel('Percentage of Voters Who Voted for Joe Biden')
    pyplot.title(title)
    pyplot.show()
In [7]:
x = 'White'
xlabel = 'White Population (%)'
title = 'The Association Between The White Population and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)
In [8]:
x = 'Income'
xlabel = 'Income ($)'
title = 'The Association Between Income and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)
In [9]:
x = 'IncomePerCap'
xlabel = 'Income Per Capita ($)'
title = 'The Association Between Income per Capita and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)
In [10]:
x = 'Unemployment'
xlabel = 'Unemployment Rate (%)'
title = 'The Association Between Unemployment Rate and the Percentage of Voters who Voted for Biden'
plt(x, xlabel, title, data)
In [11]:
x = 'Poverty'
xlabel = 'Poverty Rate (%)'
title = 'The Association Between Poverty Rate and the Percentage of Voters who Voted for Biden'
plt(x, xlabel, title, data)
In [12]:
x = 'percentage16_Hillary_Clinton'
xlabel = 'Percentage of Voters Who Voted for Hillary Clinton'
title = 'The Association Between The Percentage of Voters who Voted for Hillary Clinton and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)

Looking at these plots, there seems to be a relationship between each factor and the percentage of voters who voted for Joe Biden. In particular, there is a negative relationship between the percentage of white Americans in the county and the percentage of those who voted for Biden, while there is a positive relationship for the other factors. It also looks like there is a very strong relationship between the percentage of voters who voted for Clinton in 2016 and those who voted for Biden in 2020.

5. Analysis and Hypothesis Testing

Now that we have visualized the relationship between each factor and the outcome of the 2020 election, we want to analyze the data and perform hypothesis testing to see if the relationships are statistically significant. In other words, we want to determine if the results are likely not caused by chance. In this section, I will be performing linear regression and chi-square tests for independence.

5.A Linear Regression

Since we created scatter plots, let's now look at the correlation coefficient of each factor to determine the strength of each relationship. I will be using sklearn (https://scikit-learn.org/stable/), a machine learning library. sklearn has a class called LinearRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) that allows us to perform linear regression. I will use its .score() method to obtain the r-squared value, a statistical measure of how close the data are to the fitted line. I will then take the square root of this value to get the magnitude of the correlation coefficient (r). Because a square root is never negative, the sign of r (the direction of the relationship) has to come from the slope of the fitted line, which we already saw in the regression plots. The closer r is to 1 (or -1), the stronger the relationship is.

In [13]:
from sklearn.linear_model import LinearRegression
from math import sqrt

# Get the data of each factor 
X1 = [[w] for w in data['White']]
X2 = [[i] for i in data['Income']]
X3 = [[i] for i in data['IncomePerCap']]
X4 = [[u] for u in data['Unemployment']]
X5 = [[p] for p in data['Poverty']]
X6 = [[h] for h in data['percentage16_Hillary_Clinton']]

# Our dependent variable is the percentage of voters who voted for Biden
y = data['percentage20_Joe_Biden']

'''
Perform linear regression for each factor vs. the percentage of voters who voted for Biden: use the .fit() function 
to fit the model and the .score() function to get the r-squared value. Then, take the square root of that value to 
get the magnitude of the correlation coefficient. 
'''
reg = LinearRegression().fit(X1, y)

'''
Since we know that the relationship between white population and the percentage of voters who voted for Biden is 
negative, we can multiply the square root of the r-squared value by -1.
'''
r1 = sqrt(reg.score(X1, y)) * -1 

reg = LinearRegression().fit(X2, y)
r2 = sqrt(reg.score(X2, y))

reg = LinearRegression().fit(X3, y)
r3 = sqrt(reg.score(X3, y))

reg = LinearRegression().fit(X4, y)
r4 = sqrt(reg.score(X4, y))

reg = LinearRegression().fit(X5, y)
r5 = sqrt(reg.score(X5, y))
      
reg = LinearRegression().fit(X6, y)
r6 = sqrt(reg.score(X6, y))

# Print the correlation coefficients
print('% of White Residents vs. % of Biden voters: ' + str(r1))
print('Income vs. % of Biden voters: ' + str(r2))
print('Income per Capita vs. % of Biden voters: ' + str(r3))
print('Unemployment vs. % of Biden voters: ' + str(r4))
print('Poverty rate vs. % of Biden voters: ' + str(r5))
print('% of Clinton voters vs. % of Biden voters: ' + str(r6))
% of White Residents vs. % of Biden voters: -0.527246244784012
Income vs. % of Biden voters: 0.2138775046831241
Income per Capita vs. % of Biden voters: 0.2668367505170639
Unemployment vs. % of Biden voters: 0.2593405266953418
Poverty rate vs. % of Biden voters: 0.13413622169828548
% of Clinton voters vs. % of Biden voters: 0.9761973275142386

As shown by the correlation coefficients, there is a very strong relationship between the percentage of voters who voted for Clinton in 2016 and that of those who voted for Biden in 2020. There is a moderately strong association between the percentage of white residents and the percentage of those who voted for Biden. There is a weaker relationship between income, income per capita, unemployment rate, or poverty rate and the percentage of voters who voted for Biden. Thus, it seems that race and the percentage of those who voted for Clinton are the two biggest factors that help predict the outcome of the 2020 election.

Although I am looking at the correlation coefficient here, the r-squared value is also important as it indicates how well the independent variable explains the variation of the dependent variable. Among these factors, the percentage of voters who voted for Hillary Clinton seems to be the only variable that explains this variation very well, as it has an r-squared value of approximately 0.95 (0.976 * 0.976).
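
As a sanity check on the relationship between r-squared and r, the sketch below fits a one-variable least-squares line with NumPy, computes r-squared from the residuals, and attaches the sign of the slope to its square root. The helper name `signed_r` and the five data points are hypothetical; the point is only that the result should agree with the coefficient returned by `np.corrcoef`.

```python
import numpy as np

def signed_r(x, y):
    """Return the correlation coefficient as sign(slope) * sqrt(r-squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Fit y = slope * x + intercept by ordinary least squares.
    slope, intercept = np.polyfit(x, y, 1)

    # r-squared = 1 - (residual sum of squares / total sum of squares).
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = max(1.0 - ss_res / ss_tot, 0.0)

    # The square root only gives the magnitude; the slope gives the sign.
    return np.sign(slope) * np.sqrt(r_squared)

# Made-up example with a negative relationship.
x = [90.0, 80.0, 70.0, 60.0, 50.0]
y = [20.0, 35.0, 30.0, 50.0, 55.0]

r_from_fit = signed_r(x, y)
r_direct = np.corrcoef(x, y)[0, 1]
print(r_from_fit, r_direct)
```

Taking the sign from the slope avoids having to hard-code a multiplication by -1 for factors that are known in advance to have a negative relationship.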

What if we use the chosen factor AND state? Does that make a difference in the coefficient of correlation?

I will be using NumPy (https://numpy.org/), a library for numerical computing, and the np.corrcoef() function (https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) to derive the correlation coefficients. I want to use NumPy instead of sklearn here because np.corrcoef() returns a signed coefficient, and we do not know the direction of the regression line for each state in advance.

In [14]:
import numpy as np 

# Get a list of all the states 
states = data['state'].drop_duplicates()

# Create a DataFrame that will hold the data for each factor of each state
state_data = pd.DataFrame(columns = ['State', 'Percentage16_Hillary_Clinton', 'Income', 'White', 'Income Per Capita', 
                                     'Unemployment', 'Poverty'])
index = 0

# Iterate through the states and get the data of each state 
for state in states:
    X1 = data[data.state == state]['percentage16_Hillary_Clinton']
    X2 = data[data.state == state]['Income']
    X3 = data[data.state == state]['White']
    X4 = data[data.state == state]['IncomePerCap']
    X5 = data[data.state == state]['Unemployment']
    X6 = data[data.state == state]['Poverty']
    y = data[data.state == state]['percentage20_Joe_Biden']
    
    # Compute the correlation coefficients
    r1 = np.corrcoef(X1, y)
    r2 = np.corrcoef(X2, y)
    r3 = np.corrcoef(X3, y)
    r4 = np.corrcoef(X4, y)
    r5 = np.corrcoef(X5, y)
    r6 = np.corrcoef(X6, y)
    
    # Put the results in the DataFrame
    state_data.at[index, 'State'] = state
    state_data.at[index, 'Percentage16_Hillary_Clinton'] = r1[0, 1]
    state_data.at[index, 'Income'] = r2[0, 1]
    state_data.at[index, 'White'] = r3[0, 1]
    state_data.at[index, 'Income Per Capita'] = r4[0, 1]
    state_data.at[index, 'Unemployment'] = r5[0, 1]
    state_data.at[index, 'Poverty'] = r6[0, 1]
    
    index += 1

state_data.index = range(1, len(state_data) + 1)
/opt/conda/lib/python3.8/site-packages/numpy/lib/function_base.py:2551: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar)
/opt/conda/lib/python3.8/site-packages/numpy/lib/function_base.py:2480: RuntimeWarning: divide by zero encountered in true_divide
  c *= np.true_divide(1, fact)
/opt/conda/lib/python3.8/site-packages/numpy/lib/function_base.py:2480: RuntimeWarning: invalid value encountered in multiply
  c *= np.true_divide(1, fact)
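
The RuntimeWarnings above appear when a state has too few rows for a correlation to be defined. As a hedged alternative to the loop in the previous cell, the sketch below groups the data by state and explicitly returns NaN for any state with fewer than two rows. The function name `state_correlations` and the toy values are hypothetical; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

def state_correlations(df, factors, target, group_col='state'):
    """Correlation of each factor with the target, computed per state."""
    rows = {}
    for state, group in df.groupby(group_col):
        if len(group) < 2:
            # A correlation needs at least two observations.
            rows[state] = {factor: np.nan for factor in factors}
        else:
            rows[state] = {
                factor: np.corrcoef(group[factor], group[target])[0, 1]
                for factor in factors
            }
    return pd.DataFrame.from_dict(rows, orient='index')

# Made-up toy data: one state with three rows and one with a single row.
toy = pd.DataFrame({
    'state': ['MD', 'MD', 'MD', 'DC'],
    'Income': [1.0, 2.0, 3.0, 5.0],
    'percentage20_Joe_Biden': [10.0, 20.0, 30.0, 90.0],
})

result = state_correlations(toy, ['Income'], 'percentage20_Joe_Biden')
print(result)
```

With this approach the single-row state still shows up as NaN in the result, but without triggering the divide-by-zero warnings.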

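Now that we have the correlation coefficients for each factor of each state, let's look at each factor individually. I will be sorting the results by the correlation coefficient. Note that DC shows up as NaN in every table: it appears to have only a single row in the data, so a correlation cannot be computed for it, which is also what triggers the RuntimeWarnings above.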

In [15]:
HC = state_data[['State', 'Percentage16_Hillary_Clinton']].sort_values(by = 'Percentage16_Hillary_Clinton')
HC.index = range(1, 51)
HC
Out[15]:
State Percentage16_Hillary_Clinton
1 NH -0.590559
2 RI -0.36134
3 VT -0.0965266
4 MA 0.372058
5 ME 0.41645
6 CT 0.671977
7 HI 0.929397
8 AR 0.952888
9 TX 0.954245
10 TN 0.965956
11 OK 0.967547
12 UT 0.977928
13 IN 0.978512
14 KY 0.978843
15 CA 0.979895
16 IA 0.980555
17 SC 0.982192
18 MO 0.982246
19 FL 0.983104
20 MI 0.983134
21 VA 0.983513
22 WV 0.983998
23 AZ 0.984135
24 NJ 0.984394
25 MN 0.984538
26 NM 0.984825
27 OH 0.985903
28 ID 0.986036
29 NC 0.987008
30 MD 0.987383
31 GA 0.987784
32 NY 0.98828
33 NE 0.988642
34 WI 0.988949
35 SD 0.989771
36 IL 0.990238
37 ND 0.990511
38 CO 0.992321
39 LA 0.99246
40 MS 0.992848
41 KS 0.993661
42 WA 0.993889
43 PA 0.994549
44 MT 0.995081
45 AL 0.995579
46 NV 0.995817
47 OR 0.996947
48 WY 0.997737
49 DE 0.999999
50 DC NaN

It appears that the correlation between the percentage of voters who voted for Clinton in 2016 and the percentage of voters who voted for Biden in 2020 is very strong and positive for every state except for:

  • New Hampshire (moderate and negative)
  • Rhode Island (moderate and negative)
  • Vermont (weak and negative)
  • Massachusetts (moderate and positive)
  • Maine (moderate and positive)
  • Connecticut (moderate and positive)
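
The labels in the list above (weak, moderate, very strong) are informal. To make them reproducible, one could use a small helper like the sketch below. The cutoffs of 0.3, 0.7, and 0.9 are my own assumption, chosen so that the output matches the labels in the list; they are not a standard and other sources use different thresholds.

```python
import math

def describe_correlation(r):
    """Label a correlation coefficient using assumed cutoffs of 0.3, 0.7, and 0.9."""
    if math.isnan(r):
        return 'undefined'
    if r == 0:
        return 'no correlation'

    magnitude = abs(r)
    if magnitude >= 0.9:
        strength = 'very strong'
    elif magnitude >= 0.7:
        strength = 'strong'
    elif magnitude >= 0.3:
        strength = 'moderate'
    else:
        strength = 'weak'

    direction = 'positive' if r > 0 else 'negative'
    return strength + ' and ' + direction

# Coefficients taken from the table above.
for state, r in [('NH', -0.590559), ('VT', -0.0965266), ('CT', 0.671977), ('HI', 0.929397)]:
    print(state, describe_correlation(r))
```
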
In [16]:
income = state_data[['State', 'Income']].sort_values(by = 'Income')
income.index = range(1, 51)
income
Out[16]:
State Income
1 HI -0.750996
2 RI -0.718499
3 VT -0.607722
4 MS -0.58954
5 CT -0.501043
6 ND -0.450082
7 AL -0.445215
8 SC -0.394257
9 SD -0.363785
10 NH -0.258198
11 NV -0.220829
12 LA -0.192887
13 MT -0.175531
14 AR -0.171966
15 NJ -0.127815
16 TX -0.121695
17 AZ -0.0985675
18 GA -0.0484172
19 OK 0.0114567
20 WI 0.0336888
21 NM 0.0342603
22 NC 0.0868865
23 IN 0.128729
24 WY 0.159356
25 KS 0.165077
26 OH 0.175408
27 IA 0.247594
28 UT 0.271511
29 NY 0.291386
30 FL 0.294713
31 MN 0.323589
32 NE 0.33665
33 CO 0.387512
34 ID 0.397837
35 MI 0.417794
36 TN 0.428772
37 IL 0.432659
38 MA 0.502154
39 PA 0.524413
40 VA 0.540581
41 WV 0.570742
42 WA 0.578414
43 MD 0.58299
44 MO 0.593352
45 KY 0.593803
46 ME 0.620414
47 CA 0.649854
48 OR 0.699365
49 DE 0.947134
50 DC NaN

The correlation coefficients of income vs. % of voters who voted for Biden have a much wider range, from about -0.75 (Hawaii) to 0.95 (Delaware). Delaware is the only state where the correlation is very strong. Hawaii and Rhode Island have the strongest negative correlations (about -0.75 and -0.72), and most other states show a weak or moderate association in either direction.

In [17]:
income_per_capita = state_data[['State', 'Income Per Capita']].sort_values(by = 'Income Per Capita')
income_per_capita.index = range(1, 51)
income_per_capita
Out[17]:
State Income Per Capita
1 RI -0.68652
2 HI -0.592613
3 NH -0.589962
4 ND -0.58532
5 SD -0.556433
6 MS -0.519906
7 SC -0.335362
8 AL -0.317739
9 MT -0.31224
10 CT -0.215345
11 TX -0.18883
12 VT -0.183628
13 AZ -0.129121
14 NE -0.0415492
15 LA -0.0415393
16 NJ -0.0113731
17 KS -0.0100427
18 AR -0.00208738
19 OK 0.0250902
20 GA 0.0307808
21 WI 0.0335671
22 NV 0.0563796
23 NM 0.0792746
24 NC 0.20003
25 IA 0.270782
26 IN 0.343657
27 FL 0.408883
28 MN 0.440998
29 OH 0.473479
30 TN 0.473664
31 MA 0.499457
32 VA 0.510046
33 NY 0.519625
34 IL 0.522339
35 WY 0.527473
36 ME 0.538223
37 MI 0.544536
38 DE 0.562595
39 CO 0.580211
40 UT 0.580612
41 MD 0.584102
42 PA 0.61154
43 CA 0.616729
44 ID 0.617091
45 OR 0.628642
46 KY 0.66888
47 WV 0.698353
48 MO 0.698403
49 WA 0.732731
50 DC NaN

As with income, the correlation coefficients of income per capita vs. % of voters who voted for Biden have a very large range. Some states have a negative correlation, while others have a positive correlation. It also does not appear that any state has a very strong correlation. Most states have a weak or moderate association. Some states, including Arkansas, Kansas, and New Jersey, have virtually no correlation.

In [18]:
white = state_data[['State', 'White']].sort_values(by = 'White')
white.index = range(1, 51)
white
Out[18]:
State White
1 AL -0.97708
2 MS -0.974944
3 LA -0.942384
4 SC -0.914585
5 GA -0.889558
6 DE -0.881234
7 NJ -0.876839
8 AR -0.869588
9 MD -0.86808
10 RI -0.864532
11 OH -0.850809
12 TN -0.842246
13 NY -0.806288
14 VA -0.798785
15 CT -0.798214
16 IN -0.796976
17 IL -0.792852
18 PA -0.789526
19 SD -0.743702
20 NC -0.7276
21 ND -0.705169
22 AZ -0.696396
23 KY -0.691665
24 TX -0.689369
25 MI -0.665021
26 WI -0.64277
27 MO -0.642377
28 FL -0.642003
29 NV -0.641419
30 NE -0.614493
31 MT -0.574308
32 WV -0.571435
33 NM -0.555562
34 UT -0.508254
35 ME -0.505665
36 WY -0.505389
37 MN -0.496125
38 OK -0.480392
39 IA -0.448081
40 KS -0.41579
41 CA -0.382094
42 OR -0.337704
43 CO -0.15264
44 ID -0.0195283
45 WA -0.00487945
46 VT 0.0870963
47 MA 0.172901
48 NH 0.708457
49 HI 0.747494
50 DC NaN

Most states appear to have a moderate to strong correlation between % of white residents and % of voters who voted for Biden, and in most states the association is negative. Interestingly, Hawaii and New Hampshire have a strong positive association, with correlation coefficients of 0.747 and 0.708. In these two states, counties with a higher percentage of white residents tended to vote for Biden more, the opposite of most other states. Some states like Vermont, Washington, Idaho, and Colorado have a very weak correlation, meaning that the percentage of white residents was not a strong predictor in these states.

In [19]:
unemployment = state_data[['State', 'Unemployment']].sort_values(by = 'Unemployment')
unemployment.index = range(1, 51)
unemployment
Out[19]:
State Unemployment
1 MA -0.800506
2 NH -0.465392
3 KY -0.436563
4 WV -0.348316
5 OR -0.347594
6 FL -0.285034
7 CA -0.23465
8 VA -0.19873
9 MO -0.186072
10 MI -0.18373
11 WA -0.173067
12 MD -0.156455
13 ME -0.155592
14 TN -0.150419
15 WY -0.142678
16 CO -0.0568616
17 NY 0.000819107
18 AZ 0.0173685
19 ID 0.0309651
20 UT 0.103872
21 MN 0.155736
22 NJ 0.166749
23 OH 0.180378
24 LA 0.183782
25 IL 0.186151
26 WI 0.187164
27 PA 0.206611
28 NC 0.221562
29 IA 0.225419
30 NM 0.2371
31 OK 0.258212
32 IN 0.271783
33 TX 0.324982
34 NV 0.328191
35 GA 0.330534
36 KS 0.44246
37 AR 0.444144
38 DE 0.475839
39 SC 0.489858
40 MT 0.503218
41 NE 0.529737
42 AL 0.600389
43 MS 0.610996
44 ND 0.621748
45 VT 0.631073
46 SD 0.666756
47 CT 0.714619
48 HI 0.74209
49 RI 0.928021
50 DC NaN

The correlation coefficients range from strong and negative to strong and positive. However, it seems that most states have a weak to moderate correlation between the unemployment rate and the percentage of voters who voted for Biden.

In [20]:
poverty = state_data[['State', 'Poverty']].sort_values(by = 'Poverty')
poverty.index = range(1, 51)
poverty
Out[20]:
State Poverty
1 NH -0.567405
2 KY -0.490007
3 MD -0.404241
4 ME -0.355767
5 WA -0.338037
6 MA -0.331302
7 CA -0.306772
8 DE -0.295615
9 MO -0.292322
10 TN -0.289936
11 WV -0.251787
12 VA -0.237587
13 OR -0.232595
14 CO -0.145048
15 FL -0.134862
16 ID -0.134024
17 MI -0.0545347
18 PA -0.0116987
19 UT 0.00155584
20 IL 0.0232928
21 OH 0.0895015
22 NM 0.116249
23 OK 0.143382
24 WY 0.144068
25 MN 0.147682
26 IA 0.175163
27 NY 0.226164
28 NE 0.245384
29 NC 0.252249
30 GA 0.277153
31 IN 0.317951
32 NJ 0.322967
33 NV 0.376401
34 KS 0.387356
35 LA 0.402918
36 TX 0.423468
37 AZ 0.445527
38 SC 0.46419
39 AR 0.473291
40 WI 0.474729
41 VT 0.562333
42 MT 0.592008
43 ND 0.639071
44 SD 0.670651
45 AL 0.691283
46 HI 0.691484
47 MS 0.795287
48 RI 0.857395
49 CT 0.866259
50 DC NaN

The correlation coefficients for poverty rate range from negative and moderate to positive and strong. Connecticut, Rhode Island, and Mississippi have the strongest positive correlations. All other states have a weak to moderate association between poverty and the percentage of voters who voted for Biden.

Across the board, race and the percentage of voters who voted for Hillary Clinton are the two factors that have a strong correlation with the outcome of the 2020 election. Some states have a strong or moderate correlation between poverty, income, income per capita, and unemployment and the outcome of this election, but the correlation varies a lot more between states.

5.B Hypothesis Testing: Chi-Square

Now, I will use hypothesis testing to see if the results are statistically significant. In other words, we want to see if the outcome of the 2016 election, the percentage of white residents, income, income per capita, unemployment rate, and poverty rate are truly associated with the outcome of the 2020 election.

I will specifically be running a chi-square test of independence using scipy.stats (https://docs.scipy.org/doc/scipy/reference/stats.html). For each variable, we will be dividing the percentages into low and high. The low/high cutoffs will be explained for each test later. The null and alternative hypotheses are the following:

H_0: There is no association between [factor] and the outcome of the 2020 election.

H_a: There is an association between [factor] and the outcome of the 2020 election.

I will be rejecting the null hypothesis if we get a p-value of less than 0.05.
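
To make explicit what the test computes, here is a minimal hand-rolled sketch of the chi-square statistic for a 2x2 table, using made-up counts. The function name `chi_square_2x2` is hypothetical. For one degree of freedom, the p-value can be written with the complementary error function as erfc(sqrt(statistic / 2)). Note that this sketch applies no continuity correction, whereas scipy's chi2_contingency applies Yates' correction to 2x2 tables by default, so its p-values will differ somewhat from these.

```python
import math
import numpy as np

def chi_square_2x2(observed):
    """Uncorrected chi-square test of independence for a 2x2 table."""
    observed = np.asarray(observed, dtype=float)

    # Expected count for each cell = (row total * column total) / grand total.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()

    # Sum of (observed - expected)^2 / expected over all four cells.
    statistic = float(((observed - expected) ** 2 / expected).sum())

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value, expected

# Made-up counts: rows are "Biden won" / "Biden lost", columns are "Low" / "High".
statistic, p_value, expected = chi_square_2x2([[10, 20], [30, 40]])
print(statistic, p_value)
print(expected)
```

In this toy table the observed counts are close to the expected counts, so the statistic is small and the p-value is well above 0.05; we would fail to reject the null hypothesis.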

Below, I created a function to create the chi-square tables and compute the p-value.

Here is more information about chi-square tests in case you are unfamiliar: https://libguides.library.kent.edu/spss/chisquare

Here is more information about the chi2_contingency function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

In [21]:
# Library to perform chi square test of independence
from scipy.stats import chi2_contingency as chisquare

def compute_chi_square(data, factor, cutoff):
    '''
    Create a table to hold the number of observations (counties) that fall under the 'Low' (i.e. low percentage of
    voters who voted for Biden/Clinton, low income, low income per capita, etc.) category and the number of 
    observations that fall under the 'High' category. The first row corresponds to counties in which Biden won, 
    and the second row corresponds to counties in which Biden lost. 
    '''
    table = pd.DataFrame(columns = ['Low', 'High'])
    table.at[0, 'Low'] = 0
    table.at[1, 'Low'] = 0 
    table.at[0, 'High'] = 0
    table.at[1, 'High'] = 0
    
    # Iterate through the data to count the number of observations that fall under the low and high categories. 
    for index, row in data.iterrows(): 
        biden = data.at[index, 'percentage20_Joe_Biden']
        obs = data.at[index, factor]
        
        '''
        If Biden won in the county, we will increase the count in the first row. Otherwise, we will increase the 
        count in the second row. We will determine which column ('Low'/'High') to increment based on whether or not 
        the observation is considered low or high. 
        '''
        if (biden > 50.0):
            if (obs <= cutoff):
                table.at[0, 'Low'] += 1
            
            else:
                table.at[0, 'High'] += 1
        
        else:
            if (obs <= cutoff):
                table.at[1, 'Low'] += 1
            
            else:
                table.at[1, 'High'] += 1
    
    cs = chisquare(table)
    # Return the contingency table, the full test result, and the p-value
    return table, cs, cs[1]
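
The same 2x2 table can also be built without looping over rows. The sketch below is one possible alternative using pd.crosstab; the function name `contingency_table` and the toy rows are hypothetical, and the row and column order is arranged to match the layout described in the function above (first row for counties Biden won, 'Low' before 'High').

```python
import pandas as pd

def contingency_table(df, factor, cutoff, target='percentage20_Joe_Biden'):
    """Count counties by (Biden won / lost) and (factor low / high)."""
    biden_won = df[target] > 50.0
    high = df[factor] > cutoff

    # Cross-tabulate the two boolean Series.
    table = pd.crosstab(biden_won, high)

    # Make sure all four cells exist and are in a fixed order.
    table = table.reindex(index=[True, False], columns=[False, True], fill_value=0)
    table.index = ['Biden won', 'Biden lost']
    table.columns = ['Low', 'High']
    return table

# Made-up toy data with five counties.
toy = pd.DataFrame({
    'percentage20_Joe_Biden': [60.0, 55.0, 40.0, 30.0, 45.0],
    'Income': [70000, 50000, 80000, 40000, 30000],
})

table = contingency_table(toy, 'Income', 68703)
print(table)
```

A table built this way holds integer counts, so it can be passed directly to chi2_contingency.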

First, I will focus on the percentage of voters who voted for Hillary Clinton in 2016. I will be using a cutoff value of 50.0 since > 50.0 means that Clinton won in a given county.

In [22]:
table, cs, p = compute_chi_square(data, 'percentage16_Hillary_Clinton', 50.0)
print(p)
0.0

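Since we have a p-value less than 0.05, we can reject the null hypothesis of no association between the percentage of voters who voted for Clinton in 2016 and the percentage of voters who voted for Biden in 2020. (The printed value of 0.0 means that the p-value is too small to be represented as a floating-point number, not that it is exactly zero.)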

Next, I will focus on income. Since the median national household income in 2019 was $68,703, I will be using that value as the cutoff.

In [23]:
table, cs, p = compute_chi_square(data, 'Income', 68703)
print(p)
5.2201075823590994e-30

Since we have a p-value less than 0.05, we can reject the null hypothesis of no association between income and the percentage of voters who voted for Biden.

I will then focus on income per capita. I will be using the national income per capita in 2019 of $34,103 (U.S. Census Bureau) as the cutoff.

In [24]:
table, cs, p = compute_chi_square(data, 'IncomePerCap', 34103)
print(p)

We have another p-value less than 0.05, so we can reject the null hypothesis of no association between income per capita and the percentage of voters who voted for Biden.

Now, I will focus on the percentage of residents who are white. I will be using the national percentage of 76.3 as the cutoff (U.S. Census Bureau).

In [25]:
table, cs, p = compute_chi_square(data, 'White', 76.3)
print(p)
6.209847965620941e-65

Since the p-value is very close to 0, we can reject the null hypothesis of no association between the percentage of residents who are white and the percentage of voters who voted for Biden.

The next factor is poverty. I will be using the national poverty rate in 2019 (10.5%) as the cutoff (U.S. Census Bureau).

In [26]:
table, cs, p = compute_chi_square(data, 'Poverty', 10.5)
print(p)
0.2770381385183103

Here, the p-value is greater than 0.05, so we fail to reject the null hypothesis of no association between poverty rate and the percentage of voters who voted for Biden.

The last factor is unemployment rate. I will be using a cutoff of 4.5% since unemployment rates of 4.5% or below are considered healthy (Amadeo).

In [27]:
table, cs, p = compute_chi_square(data, 'Unemployment', 4.5)
print(p)
4.394533162225231e-09

Since the p-value is less than 0.05, we reject the null hypothesis of no association between the unemployment rate and the percentage of voters who voted for Biden.

According to the results of these tests, there appears to be an association between the percentage of voters who voted for Biden in 2020 and each of the following: the percentage of voters who voted for Clinton in 2016, income, income per capita, the percentage of the population that is white, and the unemployment rate. There does not appear to be an association with the poverty rate.

6. Insight

Here, we will use the results from our analysis and hypothesis tests to draw conclusions.

As noted above, there appears to be an association between the percentage of voters who voted for Biden in 2020 and the percentage of voters who voted for Clinton in 2016, income, income per capita, the percentage of the population that is white, and the unemployment rate. However, there does not appear to be an association between poverty and the outcome of the election. Moreover, even though the only variable with a very high r-squared value was the percentage of voters who voted for Clinton in 2016, the p-values from the chi-square tests still indicate that there is an association between the other factors (except for poverty) and the percentage of voters who voted for Biden. This is not surprising because we are focusing on people, whose behavior is fairly unpredictable, so low r-squared values are common in this kind of data. The results still indicate that there is a real relationship between our independent variables and our dependent variable (“How to Interpret a Regression Model with Low R-Squared and Low P Values” 2014). Here is some more information explaining the case of a low p-value but a low r-squared value: https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values

One thing to keep in mind is that correlation does not equal causation. In other words, being white does not cause one to not vote for Biden, nor does being a Clinton voter cause one to vote for Biden. Thus, the results do raise the question, what DOES cause a county with more people of color, higher income, higher income per capita, etc. to vote for Biden? That is something that we can explore in the future.

We could potentially use this data to predict the outcome of future elections, but instead of using the outcome of the 2016 election as a factor to predict the outcome of the 2024 election, we can use the outcome of this election. We could also use different factors, like the total number of coronavirus cases in 2020 per county since the pandemic certainly was an important aspect of this year, to say the least.

7. Works Cited

Amadeo, Kimberly. “Why Zero Unemployment Isn't as Good as It Sounds.” The Balance, The Balance, 30 Aug. 2020, www.thebalance.com/natural-rate-of-unemployment-definition-and-trends-3305950.

“How to Interpret a Regression Model with Low R-Squared and Low P Values.” The Minitab Blog, Minitab, 12 June 2014, blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values.

“Income and Poverty in the United States: 2019.” The United States Census Bureau, The United States Census Bureau, 15 Sept. 2020, www.census.gov/library/publications/2020/demo/p60-270.html.

“U.S. Census Bureau QuickFacts: United States.” The United States Census Bureau, The United States Census Bureau, 1 July 2019, www.census.gov/quickfacts/fact/table/US/PST045219.

“U.S. Census Bureau QuickFacts: United States.” The United States Census Bureau, The United States Census Bureau, 1 July 2019, www.census.gov/quickfacts/fact/table/US/SEX255219.