Analytics – Arden Ott

Anyone with an interest in politics and statistics has almost certainly come across FiveThirtyEight’s website. Their in-depth analysis of American politics is unlike so much punditry found elsewhere. One of their more recognizable features is their presidential approval tracker–an impressively robust and reliable aggregate of thousands of presidential approval polls.

You can visit their site if you want an in-depth analysis on how exactly they calculate their numbers, and we aren’t going to emulate that today. Rather, we are going to build a similar chart in appearance using Python. The finished result looks something like this:

Not an exact replica, but something similar that is also relatively easy to build with basic tools.

Getting The Data

The first step is to get our data. Head over to FiveThirtyEight’s data page, and download (the blue arrow) the data for “How Popular is Donald Trump?” A folder containing two CSV files will be downloaded to your machine. The “approval_polllist.csv” is the one we are looking for.

For purposes of this tutorial, we are going to covert the CSV into an Excel file. Open the CSV in Excel, and choose Save As… and save the file somewhere convenient. I chose my desktop.

Once you’ve done this, we need to make a couple changes to the data to make our code simpler. First, we need to rename a couple columns. While the columns as labeled work fine, changing them now will make it easier down the line to determine which is which. Columns “L” and “M” should have the header row changed to “yes” and “no” respectively:

Next, we need to select all of the data (ctrl-A on a PC, ⌘-A on a Mac), and filter it:

Once that’s been done, sort column “E” (endnote) descending:

Getting Started With Python

If you don’t already have Python and an IDE installed, you will need to do that first. I won’t go over it here, but there are countless tutorials across the web that can get you started.

First we will need to install two third party packages. MatPlotLib and Pandas. Installation instructions can be found on their respective websites.

Once installed, we will import them at the top of our Python file, starting with MatPlotLib, like so:

import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
import matplotlib.dates as dt

A few notes here. The .pyplot library is what will actually build our charts, or plots as they are otherwise known. The .patheffects will allow us to change some of the styling of our chart, and the .dates will allow us to format dates a little more cleanly.

Next, we need to import pandas:

import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

Pandas gets imported all at once. The second and third lines are simply there for compatibility reasons.

Next, we can pull the data into our program, using Pandas. Replace the file path with the file path on your machine:

# Read the data
df = pd.read_excel("/Users/ardenott/Desktop/approval_polllist.xlsx")

We’ve used “df” as the variable for our data because it is short for “data frame,” an object type frequently used in data science. It also keeps the variable name short as we will be referring to it often.

Great work! The data should now be pulled into your program. If you run the program right now, you won’t see any output, as we haven’t told MatPlotLib to output anything yet. So let’s work on that.

Displaying the Data

Next, lets see if we can output our data to a chart, even if the formatting will be off. Insert this code below what you’ve already written:

plt.plot(df['enddate'], df['yes'])
plt.plot(df['enddate'], df['no'])

plt.show()

All we are doing here is having the program grab each “yes” and “no”, and place them on a chart with their end date. The “pot.show()” line is required at the end of the program every time you run it. If it isn’t there, no plot will be shown. If everything has been done correctly, it should look something like this:

Pretty good, right? It’s obviously messy and difficult to read, but we can see some trends. The dates on the bottom certainly need some work. Let’s make a few changes to clean this up.

Let’s change the code above into this:

# Adjust figure size
fig, ax = plt.subplots(figsize=(11,9.5), dpi=300)

plt.plot(df['enddate'], df['yes'], marker='o', linestyle=' ', alpha=.15, color='g', markersize=6)
plt.plot(df['enddate'], df['no'], marker='o', linestyle=' ', alpha=.15, color='orange', markersize=6)

plt.show()

We’ve done a few things here. First, we’ve changed the figure size. This increases the actual size of the chart, making it much clearer and easier to read. It will also increase the processing time, so don’t be worried if it take a little longer to run than before.

We’ve also changed a lot of the formatting on the datapoints. The marker shape, outline style, alpha (transparency), color, and marker size have all been adjusted. Now it should look something like this:

Now we’re getting much closer. The dots are where we want them. Next, let’s add the trend line to each side of the chart. In order to do this we need to first calculate the rolling average:

# Calculate the rolling average to smooth out data
rolling_mean_yes = df.yes.rolling(window=100).mean()
rolling_mean_no = df.no.rolling(window=100).mean()

Each of these variables will allow us to superimpose a line on top of our current chart. To draw the lines, add this underneath where you are drawing the dots:

plt.plot(df['enddate'], rolling_mean_yes)
plt.plot(df['enddate'], rolling_mean_no)

Your chart should now look something like this:

Again, it’s not pretty, but a little bit of formatting should fix that quickly. Change those two lines, adding the formatting details:

plt.plot(df['enddate'], rolling_mean_yes, color='g', lw=5,
         path_effects=[pe.Stroke(linewidth=10, foreground='white'), 
         pe.Normal()])
plt.plot(df['enddate'], rolling_mean_no, color='orange', lw=5,
         path_effects=[pe.Stroke(linewidth=10, foreground='white'), 
         pe.Normal()])

And with that, all of the major formatting has been finished. Now, we need to add some finishing touches to the rest of the chart to enhance readability.

Finishing Touches

The first thing we want to fix is the dates, as they are a bit of a mess at the moment. The are spaced out evenly, but the year-month combo isn’t the easiest to read. Underneath the code where you output the data (and before the plt.show()), insert the following:

# Reformat dates to enhance readability
date_form = dt.DateFormatter('%b %y')
ax.xaxis.set_major_formatter(date_form)
ax.tick_params(labelsize=15)

This will change the date format as well as the table sizes.

Next, we will add some basic gridlines. This is fairly straightforward:

# Add gridlines
ax.xaxis.grid()
ax.yaxis.grid()

After the previous two steps, the chart should look this this:

The last thing we need to do is add some labels so viewers know what the chart is actually talking about. That can be done like so:

plt.title('Presidential Approval', size=40)
plt.ylabel('Approval in %', fontsize=20)
plt.legend(['Approve', 'Disapprove'], numpoints=50, fontsize=20)

And voilà! The finished result.

I hope this guide has been useful. It is by no means comprehensive-there are still many improvements that could be made to enhance the visual. Feel free to experiment yourself! If you would like to view the full code for the project, I have it saved below. Thank you for reading!

import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
import matplotlib.dates as dt
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()


# Read the data
df = pd.read_excel("/Users/ardenott/Desktop/approval_polllist.xlsx")

# Calculate the rolling average to smooth out data
rolling_mean_yes = df.yes.rolling(window=100).mean()
rolling_mean_no = df.no.rolling(window=100).mean()

# Adjust figure size
fig, ax = plt.subplots(figsize=(11,9.5), dpi=300)

# Plot all data points and rolling average line
plt.plot(df['enddate'], df['yes'], marker='o', linestyle=' ', alpha=.15, color='g', markersize=6)
plt.plot(df['enddate'], df['no'], marker='o', linestyle=' ', alpha=.15, color='orange', markersize=6)
plt.plot(df['enddate'], rolling_mean_yes, color='g', lw=5,
         path_effects=[pe.Stroke(linewidth=10, foreground='white'), pe.Normal()])
plt.plot(df['enddate'], rolling_mean_no, color='orange', lw=5,
         path_effects=[pe.Stroke(linewidth=10, foreground='white'), pe.Normal()])

# Reformat dates to enhance readability
date_form = dt.DateFormatter('%b %y')
ax.xaxis.set_major_formatter(date_form)
ax.tick_params(labelsize=15)

# Add gridlines
ax.xaxis.grid()
ax.yaxis.grid()

# Format and print graph
plt.title('Presidential Approval', size=40)
plt.ylabel('Approval in %', fontsize=20)
plt.legend(['Approve', 'Disapprove'], numpoints=50, fontsize=20)
plt.show()

Earlier this year, in January, I joined a team focused on entering the AIS Student Chapter Competition, Scholastic Analytics Challenge. My team took second place in the graphics portion of the competition, and I wanted to review how we performed our analysis, arrived at our conclusions, and built our graphic.

The data we received was large (over four million rows) and disorienting. On top of that, many rows were missing data. Our first task was to load the data into PowerBI in order to do some basic data cleansing and perform some initial analytics. The majority of the work in this phase involved correcting erroneous price values, misspelled genres, and other small, easy to catch problems.

From there we needed a better way to wrangle the data as it was too large to effectively manage on a local machine. Utah State University has a Microsoft SQL server accessible to students, and we were able to load the entire dataset onto it.

The next task was to begin mining the data for insights. Breaking down the sales by genre told us very little, as well as breaking down sales by state. After some trial and error, we noticed that households earning more money were spending less on books, on average. This was interesting, mostly because of how counterintuitive it was. Wouldn’t you expect higher earning households to spend more on books than their lower earning counterparts? I certainly would.

With this nugget of information we decided to focus our submission on sales broken down by income. The below visual shows the average book price and average number of books sold per transaction, broken down by household income.*

The SQL to generate these graphs, if you’re interested:

SELECT HHI_Band, CAST(SUM(UNIT_PRICE * total_units) AS FLOAT)/CAST(COUNT(HHI_BAND) AS FLOAT) UnitsPerHousehold
    FROM Scholastic
    WHERE HHI_BAND IS NOT NULL
GROUP BY HHI_Band
ORDER BY [UnitsPerHousehold] DESC;

SELECT HHI_BAND, AVG(CAST(total_units AS FLOAT)) AS AVG
    FROM Scholastic
    WHERE HHI_BAND IS NOT NULL
GROUP BY HHI_BAND
ORDER BY AVG DESC;

We picked up a couple of insights, first was that the trend was surprisingly clear. Almost without exception, as households earned more, they spent less on books. Second was just how big the difference could be. Below is shown the average transaction amount by income, with the transaction totals being cut nearly in half from the lowest earners to the highest. Over millions of transactions, this can add up to tens of millions of dollars of revenue.

So, why is this happening? We aren’t sure. The data provided to us is pretty sparse, but we can make a couple of assumptions. First, is that we are likely missing some data. Scholastic themselves released a study showing that families earning over $100 thousand per year owned twice the number of books as families earning less than $35 thousand. It wouldn’t be a huge surprise that Scholastic didn’t offer us all of their data.

Second, we theorized that higher income households may be obtaining books from other places. These households are more likely to own Kindles or iPads, and may be purchasing books through digital platforms instead of Scholastic book fairs. They also may have inherited books from other family members.

There is also the possibility of birth rate affecting this statistic. Looking at the birth rates by household incomes, we see a similar trend to the sales data. Lower earning households tend to have more children, and may be buying more books as a result.

In any case, we concluded that the children’s book market is likely not saturated, especially in the higher income brackets. By focusing marketing efforts in these markets, Scholastic could likely tap into sales they are missing out on. We also suggest they market to lower earning households, as these households have been shown to be high spenders, and Scholastic can use these sales to maximize margins.

To see the entire infographic, head over to the portfolio page. All of the data was queried using SQL and visuals were created in Excel. The final infographic was created using Canva.

* The <$10 thousand category is not statistically significant, so data for that class cannot be trusted. In addition, the data provided cutoff at >$100 thousand, so all households making more will be lumped into the same class, potentially skewing results.

Category: Analytics

Replicating FiveThirtyEight’s Presidential Approval Tracker

Getting The Data

Getting Started With Python

Displaying the Data

Finishing Touches

Analyzing Scholastic Children’s Book Sales