Analyzing Scholastic Children’s Book Sales

A look at my 2nd place entry for the AIS Scholastic Analytics challenge.

Earlier this year, in January, I joined a team focused on entering the AIS Student Chapter Competition, Scholastic Analytics Challenge. My team took second place in the graphics portion of the competition, and I wanted to review how we performed our analysis, arrived at our conclusions, and built our graphic.

The data we received was large (over four million rows) and disorienting. On top of that, many rows were missing data. Our first task was to load the data into PowerBI in order to do some basic data cleansing and perform some initial analytics. The majority of the work in this phase involved correcting erroneous price values, misspelled genres, and other small, easy to catch problems.

From there we needed a better way to wrangle the data as it was too large to effectively manage on a local machine. Utah State University has a Microsoft SQL server accessible to students, and we were able to load the entire dataset onto it.

The next task was to begin mining the data for insights. Breaking down the sales by genre told us very little, as well as breaking down sales by state. After some trial and error, we noticed that households earning more money were spending less on books, on average. This was interesting, mostly because of how counterintuitive it was. Wouldn’t you expect higher earning households to spend more on books than their lower earning counterparts? I certainly would.

With this nugget of information we decided to focus our submission on sales broken down by income. The below visual shows the average book price and average number of books sold per transaction, broken down by household income.*

The SQL to generate these graphs, if you’re interested:

SELECT HHI_Band, CAST(SUM(UNIT_PRICE * total_units) AS FLOAT)/CAST(COUNT(HHI_BAND) AS FLOAT) UnitsPerHousehold
    FROM Scholastic
    WHERE HHI_BAND IS NOT NULL
GROUP BY HHI_Band
ORDER BY [UnitsPerHousehold] DESC;

SELECT HHI_BAND, AVG(CAST(total_units AS FLOAT)) AS AVG
    FROM Scholastic
    WHERE HHI_BAND IS NOT NULL
GROUP BY HHI_BAND
ORDER BY AVG DESC;

We picked up a couple of insights, first was that the trend was surprisingly clear. Almost without exception, as households earned more, they spent less on books. Second was just how big the difference could be. Below is shown the average transaction amount by income, with the transaction totals being cut nearly in half from the lowest earners to the highest. Over millions of transactions, this can add up to tens of millions of dollars of revenue.

So, why is this happening? We aren’t sure. The data provided to us is pretty sparse, but we can make a couple of assumptions. First, is that we are likely missing some data. Scholastic themselves released a study showing that families earning over $100 thousand per year owned twice the number of books as families earning less than $35 thousand. It wouldn’t be a huge surprise that Scholastic didn’t offer us all of their data.

Second, we theorized that higher income households may be obtaining books from other places. These households are more likely to own Kindles or iPads, and may be purchasing books through digital platforms instead of Scholastic book fairs. They also may have inherited books from other family members.

There is also the possibility of birth rate affecting this statistic. Looking at the birth rates by household incomes, we see a similar trend to the sales data. Lower earning households tend to have more children, and may be buying more books as a result.

In any case, we concluded that the children’s book market is likely not saturated, especially in the higher income brackets. By focusing marketing efforts in these markets, Scholastic could likely tap into sales they are missing out on. We also suggest they market to lower earning households, as these households have been shown to be high spenders, and Scholastic can use these sales to maximize margins.

To see the entire infographic, head over to the portfolio page. All of the data was queried using SQL and visuals were created in Excel. The final infographic was created using Canva.


* The <$10 thousand category is not statistically significant, so data for that class cannot be trusted. In addition, the data provided cutoff at >$100 thousand, so all households making more will be lumped into the same class, potentially skewing results.