The result of the 2020 presidential election was a critical moment in our lives. Many of us sat gripping the arms of our couches as the news anchors announced the projected winner of each state one by one.
Although the overall outcome was not obvious on the day of the election, we usually can predict the result within certain states. It's no shock that states like Kentucky, Alabama, and Arkansas voted for Trump, while states like California, Massachusetts, and Connecticut voted for Biden. However, what exactly allows us to make such predictions? Is it the racial makeup? How about income? Education? Unemployment? In this project, I attempt to find the factors that are associated with the outcome of the 2020 election.
This is the data collection stage of the data life cycle. In this part, we collect data from websites, databases, CSV files, etc.
I will specifically be using a dataset from Kaggle that includes the percentage of voters who voted for Trump, the percentage of voters who voted for Biden, demographic information, coronavirus information, income information, and employment information by county. The dataset also includes the percentage of voters who voted for Trump and Clinton in the 2016 election.
I will first be using the pandas library to read the data and put the data in a table known as a DataFrame. I will be using the .read_csv() function since the file is a .csv file.
Here is the link to the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html for further reading if you are interested.
import pandas as pd
data = pd.read_csv('county_statistics.csv', sep = ',')
The next step of the data lifecycle is data processing. Here, we attempt to "clean up" our data to put it in a more readable form. We can remove unnecessary columns, duplicate data, rows with missing information, etc.
In this project, I will be removing an unnecessary column titled "Unnamed: 0" and rows with missing information.
data = data.drop('Unnamed: 0', axis = 1)
data = data.dropna()
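As a quick sanity check of what these two calls do, here is a minimal sketch on a made-up three-row table. The column names and values are invented for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Toy table standing in for the real file; 'Unnamed: 0' mimics a saved index column.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'county': ['A', 'B', 'C'],
    'Income': [50000.0, None, 62000.0],
})

# Drop the leftover index column, then every row with at least one missing value.
cleaned = toy.drop('Unnamed: 0', axis=1).dropna()

print(list(cleaned.columns))
print(len(toy), '->', len(cleaned))
```

Note that .dropna() with no arguments removes a row if any of its columns is missing, so it is worth comparing the row count before and after to see how much data was discarded.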
Since the vote shares for Trump, Biden, and Clinton are stored as decimal fractions, let's multiply each of these columns by 100 so that they read as percentages.
# Scale all four vote-share columns at once instead of looping over the rows
vote_share_columns = ['percentage16_Donald_Trump', 'percentage16_Hillary_Clinton',
                      'percentage20_Donald_Trump', 'percentage20_Joe_Biden']
data[vote_share_columns] = data[vote_share_columns] * 100.0
Now, let's look at the first 10 rows of the DataFrame to see what our data looks like.
data.head(10)
The next step of the data lifecycle is to explore and visualize our data. This allows us to discover potential patterns and find points of interest before analyzing the data.
Let's first examine our data more generally. We can make a histogram and a boxplot of the percentage of voters who voted for Biden. I will also be using the pyplot module from the matplotlib library (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.html) to add information (x-axis label, y-axis label, and title) to the graphs.
from matplotlib import pyplot
data.plot(y = 'percentage20_Joe_Biden', kind = 'hist', legend = None)
pyplot.xlabel('Percentage')
pyplot.title('Histogram of The Percentage of Voters Who Voted for Biden')
pyplot.show()
data.plot(y = 'percentage20_Joe_Biden', kind = 'box', legend = None)
pyplot.xlabel('')
pyplot.ylabel('Percentage')
pyplot.title('Boxplot of The Percentage of Voters Who Voted for Biden')
pyplot.show()
Even though Joe Biden won the election, the histogram shows us that the county-level mean percentage of voters who voted for Biden was only around 30%! The boxplot also shows that counties in which over 70-75% of voters voted for Biden are outliers. However, this is not necessarily surprising: the histogram counts every county equally, and the counties that voted for Biden most likely had a higher population density. Thus, these counties would have contributed more votes to the outcome than counties with a smaller population density.
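To illustrate how a low county-level average can coexist with a majority of the votes, here is a small sketch that compares an unweighted mean with a vote-weighted mean. The four counties and their vote totals are invented, and I am not assuming that the dataset has a particular vote-count column.

```python
# Hypothetical counties: (Biden share in %, total votes cast). Numbers are invented.
counties = [
    (25.0, 5_000),
    (30.0, 8_000),
    (28.0, 6_000),
    (70.0, 400_000),
]

# Unweighted mean: every county counts equally, as it does in the histogram.
unweighted = sum(share for share, _ in counties) / len(counties)

# Vote-weighted mean: each county counts in proportion to the votes cast there.
total_votes = sum(votes for _, votes in counties)
weighted = sum(share * votes for share, votes in counties) / total_votes

print(round(unweighted, 2))
print(round(weighted, 2))
```

In this toy example the single large county pulls the weighted figure far above the unweighted one, which is the same effect that lets a candidate win overall while losing most counties.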
Now, I will examine different factors to see if there is an association between them and the percentage of voters who voted for Joe Biden. Specifically, I will look at the white population, income, income per capita, unemployment rate, poverty rate, and percentage of voters who voted for Hillary Clinton in 2016 of each county.
I will be using the seaborn library (https://seaborn.pydata.org/), a statistical data visualization library, to plot linear regression plots. I will also be using matplotlib to add information to the plots.
import seaborn as sns
def plt(x, xlabel, title, data):
    '''
    This function creates a regression plot, where x is one of the factors listed above, xlabel is the label of the
    x-axis, title is the title of the plot, and data is the DataFrame with all the data.
    '''
    # Use pyplot to create a canvas for the plot and seaborn to create the regression plot.
    pyplot.figure(figsize = (10, 5))
    sns.regplot(x = x, y = 'percentage20_Joe_Biden', data = data)
    # Add the x-label, y-label, and title
    pyplot.xlabel(xlabel)
    pyplot.ylabel('Percentage of Voters Who Voted for Joe Biden')
    pyplot.title(title)
    pyplot.show()
x = 'White'
xlabel = 'White Population (%)'
title = 'The Association Between The White Population and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)
x = 'Income'
xlabel = 'Income ($)'
title = 'The Association Between Income and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)
x = 'IncomePerCap'
xlabel = 'Income Per Capita ($)'
title = 'The Association Between Income per Capita and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)
x = 'Unemployment'
xlabel = 'Unemployment Rate (%)'
title = 'The Association Between Unemployment Rate and the Percentage of Voters who Voted for Biden'
plt(x, xlabel, title, data)
x = 'Poverty'
xlabel = 'Poverty Rate (%)'
title = 'The Association Between Poverty Rate and the Percentage of Voters who Voted for Biden'
plt(x, xlabel, title, data)
x = 'percentage16_Hillary_Clinton'
xlabel = 'Percentage of Voters Who Voted for Hillary Clinton'
title = 'The Association Between The Percentage of Voters who Voted for Hillary Clinton and the Percentage of Voters who Voted for Joe Biden'
plt(x, xlabel, title, data)
Looking at these plots, there seems to be a relationship between each factor and the percentage of voters who voted for Joe Biden. In particular, there is a negative relationship between the percentage of white Americans in the county and the percentage of those who voted for Biden, while there is a positive relationship for the other factors. It also looks like there is a very strong relationship between the percentage of voters who voted for Clinton in 2016 and those who voted for Biden in 2020.
Now that we have visualized the relationship between each factor and the outcome of the 2020 election, we want to analyze the data and perform hypothesis testing to see if the relationships are statistically significant. In other words, we want to determine if the results are likely not caused by chance. In this section, I will be performing linear regression and chi-square tests for independence.
Since we created scatter plots, let's now look at the correlation coefficient of each factor to determine the strength of each relationship. I will be using sklearn (https://scikit-learn.org/stable/), a machine learning library. sklearn has a class called LinearRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) that allows us to perform linear regression. I will use its .score() method to get the r-squared value, a statistical measure of how close the data are to the fitted line. I will then take the square root of this value to get the magnitude of the correlation coefficient (r), which represents the strength of the relationship; the direction (positive or negative) is the sign of r, which matches the slope of the fitted line. The closer r is to 1 (or -1), the stronger the relationship is.
from sklearn.linear_model import LinearRegression
from math import sqrt, copysign
# Get the data of each factor
X1 = [[w] for w in data['White']]
X2 = [[i] for i in data['Income']]
X3 = [[i] for i in data['IncomePerCap']]
X4 = [[u] for u in data['Unemployment']]
X5 = [[p] for p in data['Poverty']]
X6 = [[h] for h in data['percentage16_Hillary_Clinton']]
# Our dependent variable is the percentage of voters who voted for Biden
y = data['percentage20_Joe_Biden']
def correlation(X, y):
    '''
    Perform linear regression for one factor vs. the percentage of voters who voted for Biden and use the .score()
    function to get the r-squared value. Then, take the square root of that value to get the magnitude of the
    correlation coefficient. A square root is never negative, so the sign is copied from the slope of the fitted
    line (reg.coef_[0]) instead of being assumed from the plots.
    '''
    reg = LinearRegression().fit(X, y)
    return copysign(sqrt(reg.score(X, y)), reg.coef_[0])
r1 = correlation(X1, y)
r2 = correlation(X2, y)
r3 = correlation(X3, y)
r4 = correlation(X4, y)
r5 = correlation(X5, y)
r6 = correlation(X6, y)
# Print the correlation coefficients
print('% of White Residents vs. % of Biden voters: ' + str(r1))
print('Income vs. % of Biden voters: ' + str(r2))
print('Income per Capita vs. % of Biden voters: ' + str(r3))
print('Unemployment vs. % of Biden voters: ' + str(r4))
print('Poverty rate vs. % of Biden voters: ' + str(r5))
print('% of Clinton voters vs. % of Biden voters: ' + str(r6))
As shown by the correlation coefficients, there is a very strong relationship between the percentage of voters who voted for Clinton in 2016 and that of those who voted for Biden in 2020. There is a moderately strong association between the percentage of white residents and the percentage of those who voted for Biden. There is a weaker relationship between income, income per capita, unemployment rate, or poverty rate and the percentage of voters who voted for Biden. Thus, it seems that race and the percentage of those who voted for Clinton are the two biggest factors that help predict the outcome of the 2020 election.
Although I am looking at the correlation coefficient here, the r-squared value is also important as it indicates how well the independent variable explains the variation of the dependent variable. Among these factors, the percentage of voters who voted for Hillary Clinton seems to be the only variable that explains this variation very well, as it has an r-squared value of approximately 0.95 (0.976 * 0.976).
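To make the link between r-squared and r concrete, here is a small standard-library sketch on invented numbers (it does not use the election data or sklearn). It fits a one-variable least-squares line by hand and compares the square root of r-squared, signed by the slope, with the correlation coefficient computed directly from its definition. The helper names pearson_r and slope_and_r_squared are my own and exist only for this illustration.

```python
from math import sqrt, copysign

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def slope_and_r_squared(xs, ys):
    """Least-squares slope and r-squared for a one-variable linear fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, 1 - ss_res / ss_tot

# Invented data with a clearly negative trend.
xs = [10.0, 20.0, 30.0, 40.0, 50.0]
ys = [62.0, 55.0, 41.0, 38.0, 24.0]

slope, r2 = slope_and_r_squared(xs, ys)
r_from_fit = copysign(sqrt(r2), slope)  # magnitude from r-squared, sign from the slope
r_direct = pearson_r(xs, ys)

print(r_from_fit, r_direct)
```

The two printed values should agree up to floating-point rounding. The square root alone would be positive here even though the trend is negative, which is why the sign has to come from the slope.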
What if we look at each factor within each state separately? Does that make a difference in the correlation coefficient?
I will be using NumPy (https://numpy.org/), a library for numerical computing, and the .corrcoef() function (https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) to derive the correlation coefficients. I want to use NumPy instead of sklearn here because .corrcoef() returns the signed correlation coefficient directly, and we do not know the direction of the regression line for each state in advance.
import numpy as np
# Get a list of all the states
states = data['state'].drop_duplicates()
# Create a DataFrame that will hold the data for each factor of each state
state_data = pd.DataFrame(columns = ['State', 'Percentage16_Hillary_Clinton', 'Income', 'White', 'Income Per Capita',
                                     'Unemployment', 'Poverty'])
index = 0
# Iterate through the states and get the data of each state
for state in states:
    X1 = data[data.state == state]['percentage16_Hillary_Clinton']
    X2 = data[data.state == state]['Income']
    X3 = data[data.state == state]['White']
    X4 = data[data.state == state]['IncomePerCap']
    X5 = data[data.state == state]['Unemployment']
    X6 = data[data.state == state]['Poverty']
    y = data[data.state == state]['percentage20_Joe_Biden']
    # Compute the correlation coefficients
    r1 = np.corrcoef(X1, y)
    r2 = np.corrcoef(X2, y)
    r3 = np.corrcoef(X3, y)
    r4 = np.corrcoef(X4, y)
    r5 = np.corrcoef(X5, y)
    r6 = np.corrcoef(X6, y)
    # Put the results in the DataFrame
    state_data.at[index, 'State'] = state
    state_data.at[index, 'Percentage16_Hillary_Clinton'] = r1[0, 1]
    state_data.at[index, 'Income'] = r2[0, 1]
    state_data.at[index, 'White'] = r3[0, 1]
    state_data.at[index, 'Income Per Capita'] = r4[0, 1]
    state_data.at[index, 'Unemployment'] = r5[0, 1]
    state_data.at[index, 'Poverty'] = r6[0, 1]
    index += 1
state_data.index = range(1, len(state_data) + 1)
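A per-state table like this one can probably be built more compactly by grouping the rows by state. The sketch below shows the idea on an invented two-state table whose column names mirror the ones above. It is an alternative formulation under those assumptions, not the code that produced the results in this project.

```python
import pandas as pd

# Invented mini-dataset: two made-up states with three counties each.
toy = pd.DataFrame({
    'state': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'White': [90.0, 70.0, 50.0, 80.0, 60.0, 40.0],
    'Income': [40000.0, 50000.0, 65000.0, 70000.0, 55000.0, 45000.0],
    'percentage20_Joe_Biden': [30.0, 45.0, 60.0, 35.0, 50.0, 62.0],
})

factors = ['White', 'Income']

# For each state, correlate every factor column with the Biden share.
# The dict maps state -> Series of correlations; transposing puts states in rows.
per_state = pd.DataFrame({
    state: group[factors].corrwith(group['percentage20_Joe_Biden'])
    for state, group in toy.groupby('state')
}).T

print(per_state)
```

Each row of per_state is a state and each column is a factor, so sorting one column with .sort_values() gives the same kind of ranking as the tables that follow.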
Now that we have the correlation coefficients for each factor of each state, let's look at each factor individually. I will be sorting the results by the correlation coefficient.
HC = state_data[['State', 'Percentage16_Hillary_Clinton']].sort_values(by = 'Percentage16_Hillary_Clinton')
HC.index = range(1, len(HC) + 1)
HC
It appears that the correlation between the percentage of voters who voted for Clinton in 2016 and the percentage of voters who voted for Biden in 2020 is very strong and positive for every state except for
income = state_data[['State', 'Income']].sort_values(by = 'Income')
income.index = range(1, len(income) + 1)
income
The correlation coefficients of Income vs. % of voters who voted for Biden have a much wider range. It seems that Hawaii and Delaware are the only two states where the correlation is very strong, with Hawaii having a negative association and Delaware having a positive correlation.
income_per_capita = state_data[['State', 'Income Per Capita']].sort_values(by = 'Income Per Capita')
income_per_capita.index = range(1, len(income_per_capita) + 1)
income_per_capita
As with income, the correlation coefficients of income per capita vs. % of voters who voted for Biden have a very large range. Some states have a negative correlation, while others have a positive correlation. It also does not appear that any state has a very strong correlation. Most states have a weak or moderate association. Some states, including Arkansas, Kansas, and New Jersey, have virtually no correlation.
white = state_data[['State', 'White']].sort_values(by = 'White')
white.index = range(1, len(white) + 1)
white
Most states appear to have a moderate to strong correlation between % of white residents and % of voters who voted for Biden. Most states also have a negative association. Interestingly, Hawaii has a strong positive association between % of white residents and % of voters who voted for Biden, with a correlation coefficient of 0.748. Thus, in Hawaii, counties with a higher percentage of white residents tended to vote for Biden more, the opposite of most other states. Some states like Vermont, Washington, Idaho, and Colorado have a very weak correlation, meaning that the percentage of white residents was not a strong predictor in these states.
unemployment = state_data[['State', 'Unemployment']].sort_values(by = 'Unemployment')
unemployment.index = range(1, len(unemployment) + 1)
unemployment
The correlation coefficients range from strong and negative to strong and positive. However, it seems that most states have a weak to moderate correlation between the unemployment rate and the percentage of voters who voted for Biden.
poverty = state_data[['State', 'Poverty']].sort_values(by = 'Poverty')
poverty.index = range(1, len(poverty) + 1)
poverty
The correlation coefficients for poverty rate range from negative and moderate to positive and strong. Connecticut, Rhode Island, and Mississippi have the strongest positive correlations. All other states have a weak to moderate association between poverty and the percentage of voters who voted for Biden.
Across the board, race and the percentage of voters who voted for Hillary Clinton are the two factors that have a strong correlation with the outcome of the 2020 election. Some states have a strong or moderate correlation between poverty, income, income per capita, and unemployment and the outcome of this election, but the correlation varies a lot more between states.
Now, I will use hypothesis testing to see if these results are statistically significant. In other words, we want to see if the outcome of the 2016 election, the percentage of white residents, income, income per capita, unemployment rate, and poverty rate are truly factors that helped predict the outcome of the 2020 election.
I will specifically be running a chi-square test of independence using scipy.stats (https://docs.scipy.org/doc/scipy/reference/stats.html). For each variable, we will be dividing the percentages into low and high. The low/high cutoffs will be explained for each test later. The null and alternative hypotheses are the following:
H_0: There is no association between [factor] and the outcome of the 2020 election.
H_a: There is an association between [factor] and the outcome of the 2020 election.
I will be rejecting the null hypothesis if we get a p-value of less than 0.05.
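To show what the test computes, here is a minimal hand calculation on an invented 2x2 table of county counts, using only the standard library. It is a sketch of the uncorrected Pearson chi-square statistic, not a replacement for scipy. As far as I know, scipy.stats.chi2_contingency applies Yates' continuity correction by default for 2x2 tables, so its statistic and p-value will differ slightly from this calculation.

```python
from math import erfc, sqrt

# Invented 2x2 table of county counts.
# Rows: Biden won / Biden lost. Columns: factor is Low / High.
observed = [[20, 80],
            [300, 100]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

# A 2x2 table has one degree of freedom, for which the upper-tail
# probability of the chi-square distribution is erfc(sqrt(chi2 / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(chi2, p_value)
print('reject the null hypothesis' if p_value < 0.05 else 'fail to reject the null hypothesis')
```

The further the observed counts sit from the counts expected under independence, the larger the statistic and the smaller the p-value.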
Below, I created a function to create the chi-square tables and compute the p-value.
Here is more information about chi-square tests in case you are unfamiliar with them: https://libguides.library.kent.edu/spss/chisquare
Here is more information about the chi2_contingency function: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
# Library to perform chi square test of independence
from scipy.stats import chi2_contingency as chisquare
def compute_chi_square(data, factor, cutoff):
    '''
    Create a table to hold the number of observations (counties) that fall under the 'Low' (i.e. low percentage of
    voters who voted for Biden/Clinton, low income, low income per capita, etc.) category and the number of
    observations that fall under the 'High' category. The first row corresponds to counties in which Biden won,
    and the second row corresponds to counties in which Biden lost.
    '''
    table = pd.DataFrame(columns = ['Low', 'High'])
    table.at[0, 'Low'] = 0
    table.at[1, 'Low'] = 0
    table.at[0, 'High'] = 0
    table.at[1, 'High'] = 0
    # Iterate through the data to count the number of observations that fall under the low and high categories.
    for index, row in data.iterrows():
        biden = data.at[index, 'percentage20_Joe_Biden']
        obs = data.at[index, factor]
        # If Biden won in the county, we will increase the count in the first row. Otherwise, we will increase the
        # count in the second row. We will determine which column ('Low'/'High') to increment based on whether or not
        # the observation is considered low or high.
        if (biden > 50.0):
            if (obs <= cutoff):
                table.at[0, 'Low'] += 1
            else:
                table.at[0, 'High'] += 1
        else:
            if (obs <= cutoff):
                table.at[1, 'Low'] += 1
            else:
                table.at[1, 'High'] += 1
    cs = chisquare(table)
    # Return the table, the full test result, and the p-value
    return table, cs, cs[1]
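For comparison, the same kind of 2x2 count table can likely be produced without an explicit loop by using pd.crosstab. The sketch below runs on invented county rows that reuse two of the column names above. The labels 'Won', 'Lost', 'Low', and 'High' are my own choices for readability.

```python
import pandas as pd

# Invented county rows; the column names mirror the ones used above.
toy = pd.DataFrame({
    'percentage20_Joe_Biden': [62.0, 55.0, 48.0, 30.0, 25.0, 71.0],
    'Income': [80000, 60000, 72000, 45000, 50000, 90000],
})
cutoff = 68703

# Label each county, then count every (outcome, level) combination at once.
biden_won = (toy['percentage20_Joe_Biden'] > 50.0).map({True: 'Won', False: 'Lost'})
level = (toy['Income'] <= cutoff).map({True: 'Low', False: 'High'})
table = pd.crosstab(biden_won, level)

print(table)
```

One caveat with this approach: crosstab leaves out a row or column entirely when no county falls into it, so the result is only guaranteed to be 2x2 when every combination of outcome and level can occur. The row order also differs from the function above, which does not affect a chi-square test of independence.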
First, I will focus on the percentage of voters who voted for Hillary Clinton in 2016. I will be using a cutoff value of 50.0 since > 50.0 means that Clinton won in a given county.
table, cs, p = compute_chi_square(data, 'percentage16_Hillary_Clinton', 50.0)
print(p)
Since we have a p-value less than 0.05, we can reject the null hypothesis of no association between the percentage of voters who voted for Clinton in 2016 and the percentage of voters who voted for Biden in 2020.
Next, I will focus on income. Since the median national household income in 2019 was $68,703, I will be using that value as the cutoff.
table, cs, p = compute_chi_square(data, 'Income', 68703)
print(p)
Since we have a p-value less than 0.05, we can reject the null hypothesis of no association between income and the percentage of voters who voted for Biden.
I will then focus on income per capita. I will be using the national income per capita in 2019 of $34,103 (U.S. Census Bureau) as the cutoff.
table, cs, p = compute_chi_square(data, 'IncomePerCap', 34103)
print(p)
We have another p-value less than 0.05, so we can reject the null hypothesis of no association between income per capita and the percentage of voters who voted for Biden.
Now, I will focus on the percentage of residents who are white. I will be using a value of 76.3, the national percentage.
table, cs, p = compute_chi_square(data, 'White', 76.3)
print(p)
Since the p-value is very close to 0, we can reject the null hypothesis of no association between the percentage of residents who are white and the percentage of voters who voted for Biden.
The next factor is poverty. I will be using the national poverty rate in 2019 (10.5%) as the cutoff (U.S. Census Bureau).
table, cs, p = compute_chi_square(data, 'Poverty', 10.5)
print(p)
Here, the p-value is greater than 0.05, so we fail to reject the null hypothesis of no association between poverty rate and the percentage of voters who voted for Biden.
The last factor is unemployment rate. I will be using a cutoff of 4.5% since unemployment rates of 4.5% or below are considered healthy.
table, cs, p = compute_chi_square(data, 'Unemployment', 4.5)
print(p)
Since the p-value is less than 0.05, we reject the null hypothesis of no association between the unemployment rate and the percentage of voters who voted for Biden.
According to the results of these tests, the percentage of voters who voted for Biden in 2020 appears to be associated with the percentage of voters who voted for Clinton in 2016, income, income per capita, the percentage of the population that is white, and the unemployment rate, but not with the poverty rate.
Here, we will use the results from our analysis and hypothesis tests to draw conclusions.
As said before, the percentage of voters who voted for Biden in 2020 appears to be associated with the percentage of voters who voted for Clinton in 2016, income, income per capita, the percentage of the population that is white, and the unemployment rate. However, there does not appear to be an association between poverty and the outcome of the election. Moreover, even though the only variable to have a very high r-squared value was the percentage of voters who voted for Clinton in 2016, the p-values from the chi-square tests still indicate that there is an association between the other factors (except for poverty) and the percentage of voters who voted for Biden. This is not surprising because we are studying people, whose behavior is hard to predict, so low r-squared values are common. The results still indicate that there is a statistically significant relationship between our independent variables and our dependent variable (“How to Interpret a Regression Model with Low R-Squared and Low P Values” 2014). Here is some more information explaining the case of a low p-value but a low r-squared value: https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values
One thing to keep in mind is that correlation does not equal causation. In other words, being white does not cause one to not vote for Biden, nor does being a Clinton voter cause one to vote for Biden. Thus, the results do raise the question, what DOES cause a county with more people of color, higher income, higher income per capita, etc. to vote for Biden? That is something that we can explore in the future.
We could potentially use this data to predict the outcome of future elections, but instead of using the outcome of the 2016 election as a factor to predict the outcome of the 2024 election, we can use the outcome of this election. We could also use different factors, like the total number of coronavirus cases in 2020 per county since the pandemic certainly was an important aspect of this year, to say the least.
Amadeo, Kimberly. “Why Zero Unemployment Isn't as Good as It Sounds.” The Balance, The Balance, 30 Aug. 2020, www.thebalance.com/natural-rate-of-unemployment-definition-and-trends-3305950.
“How to Interpret a Regression Model with Low R-Squared and Low P Values.” The Minitab Blog, Minitab, 12 June 2014, blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values.
“Income and Poverty in the United States: 2019.” The United States Census Bureau, The United States Census Bureau, 15 Sept. 2020, www.census.gov/library/publications/2020/demo/p60-270.html.
“U.S. Census Bureau QuickFacts: United States.” The United States Census Bureau, The United States Census Bureau, 1 July 2019, www.census.gov/quickfacts/fact/table/US/PST045219.
“U.S. Census Bureau QuickFacts: United States.” The United States Census Bureau, The United States Census Bureau, 1 July 2019, www.census.gov/quickfacts/fact/table/US/SEX255219.