Advanced Placement and You: Analyzing AP Program and Census Data

Samuel Winter
9 min readApr 20, 2021
Photo by NeONBRAND on Unsplash

Motivation

May brings sunshine, warmth, and, for many North American high school students, Advanced Placement (AP) exam season. The AP program, run by the College Board, allows students to access college-level curricula while they are in high school, culminating in college-credit granting end of year examinations. Last year, the Advanced Placement (AP) program administered more than 4.5 million tests to over 2.5 million students in the United States and Canada. This program continues to grow and has become a staple in schools across the continent. However, the AP program is also a source of frequent controversy. I am curious about the implications of this program. How does general economic well-being impact the ability of students to access both higher education or programs such as the AP program? Could the expansion of the AP program lead to a higher number of Americans seeking higher education, or is it reflective of this trend? Such questions could have broad effects on the educational landscape of the United States. Let’s get some insight into these issues with SQL, Python, and freely-available data sets.

Data Sources

The first source I analyzed was the AP program Wikipedia page, found here. Along with general information about the Advanced Placement program, this page includes an HTML table with data for each individual AP exam administered (39 in total) in 2019, along with overall data. The fields included are the name of the exam, number of exams administered, percentage of tests scored above 3 (a passing grade), and the mean score. I retrieved every entry and added the results to a SQL database to provide some exploratory data about the program.

I used census data for a second source. Census data is open and available to download for a variety of metrics, but I was specifically interested in educational attainment. These datasets can be found here. The available data is separated by year, so I downloaded 10 individual CSV documents spanning the years 2010–2020. Each sheet contains information regarding educational attainment, the poverty rate, and median income, broken down by demographic. As every sheet contained about 37 entries, I retrieved around 370 entries in total.

Finally, I accessed data directly from the College Board as well. I found this data on the AP program summary page, which can be accessed here. On this page, the College Board provides a wealth of data about the Advanced Placement program, covering individual state participation, exam volume, exam scores, and other metrics from 2010–2020. Each data set is downloadable and available in both PDF and XLS format. I retrieved Excel Sheets containing overall data, volume data, and data for individual subjects. These spreadsheets contained about 150 entries total.

Data Processing

Wikipedia page data extraction to a SQL table

I was able to process the data from Wikipedia relatively easily. I used the pandas read_html method on the page’s URL, which returned a list where the fourth item contained a DataFrame representing the table I was interested in. I extracted this DataFrame from the list, cut out the entry for total data (as I only cared about data for individual exams), and set the index of the frame as the exam name. Then, I used the pandas to_sql method to create an SQL table for easy querying of the data. To do so, I first had to rename the “Scored ≥3 (%)” column to “Scored greater than 3”, as SQL had issues interpreting the symbols in the original column name.

Processing census data

Processing the census data Excel documents proved to be more of a challenge. Each year was represented by an individual sheet, so I knew I had to concatenate the ten sheets together into a single pandas DataFrame. I downloaded each sheet into a folder in chronological order, starting at 2010 and ending with 2020. However, each file contained no indication of the year it represented. Knowing that the earliest uploaded file would correspond to 2010, I created a sorted list of file names based on their creation date and iterated through them with a year counter, incrementing it between each file. Because of this, I knew the corresponding year of each set of data when I converted it to a pandas DataFrame with the read_csv method. This allowed me to create a “Year” column with the correct entry. From there, I worked on extracting only the data that I wanted — the educational attainment and median yearly income for Americans aged 18–24 (those most immediately after the AP program). Before doing so, I first changed all percent values within the DataFrame (formatted like “N%”) to usable float values by using the pandas apply method.

CSV indexing issue

Then, I ran into another issue (pictured above). When downloading each data set as a CSV, the multilevel index of the document was collapsed. For each individual metric, there was first a row with the name of the metric corresponding to overall data followed by rows containing individual demographic information about that metric. Because of this, I could not simply query “Median yearly income” and “Ages 18–24”, as rows did not indicate the metric that they corresponded to. However, each document was processed using the same format. This meant that I could query based on index (with iloc) and “hard-code” the metric of that data for each sheet. In this way, I was able to correctly format a pandas DataFrame from each CSV document. I then concatenated each frame into a larger DataFrame that contained all of the census data.

Census data functions applied to year groups

Finally, I created separate functions to extract the data for each individual metric that I wanted to analyze. I grouped the greater DataFrame by year and applied my extraction functions in order to create series with yearly data for the percentage of Americans 18–24 with at least some college and the yearly median income of Americans 18–24.

Extracting and processing AP data

Luckily, the data directly from the College Board was much simpler to process. For the spreadsheet showing AP exam volume changes, I loaded the file into pandas, set the header as the first line, and queried only for data that encompassed total volume changes. I then removed all columns that had percent change data, as I was only interested in the total number of exams administered. This resulted in a series of yearly volume data. Next, I extracted two series from the Excel spreadsheet containing AP program information for individual schools — one containing yearly data for the number of subjects offered per school, and another containing the total number of schools each year that offered at least one AP course. To do so, I once again loaded the document into pandas with the first row as a header, sliced the resulting DataFrame to only keep relevant data, and set the subject name as the index. Then, I queried each section of the data individually to create two series for AP subjects and AP schools. After this, I dropped the “% change” columns and returned both series in a tuple. At that point, I had processed all of the necessary information from my sources.

Analysis and Visualization

First, let’s take a look at the general AP program data from 2019. With basic SQL queries, we can see which tests were administered the most and least, and which tests had the highest and lowest mean scores.

Some basic queries on 2019 AP data

Surprisingly, tests such as Calculus BC and Physics C were among the tests with the highest mean scores. Language tests tended to be taken the least, but scored highly. Tests that were highly administered, such as English Literature and Composition and United States Government and Politics, also had some of the lowest mean scores.

Average mean score of all AP exams in 2019

What about relationships between this data? The average mean AP test score was about 3.1 in 2019. Could there be any major differences between exams with higher than average mean scores and those with lower than average mean scores? At 196,656, the mean number of exams administered for higher-scored exams was nearly 4 times greater than the mean number of exams administered for lower-scored tests (50,462).

Average volume for higher and lower score groups

When performing a T-Test for independence with the stats package from the scipy module, the resulting p-value was about 0.00097, indicating that this difference is statistically significant.

Correlation between number of exams administered and mean score

The relationship between mean score and number of exams administered is quite clear in the following scatter plot, generated by matplotlib:

There are many potential explanations for this correlation. Perhaps the only schools that administer niche AP tests are the ones with high amounts of resources that can be used to best prepare students for their examinations. For a simpler explanation, maybe smaller tests are simply graded less harshly. Regardless, with the information available for this analysis, it is impossible to make any such conclusion.

Next, let’s explore our other two data sets. We’ll use the Pearson correlation coefficient to measure the correlation between two variables. The scipy stats package contains a function that performs this test. First, as a sanity check, let’s see if there’s a correlation between the number of AP exams administered and the number of schools that offer at least one AP course.

Correlation between volume of AP exams administered and number of schools offering AP exams

As expected, the correlation coefficient is very high (0.975), indicating a heavy correlation. Further, let’s see if there’s a correlation between median yearly income and percentage of Americans with some college for the age group 18–24. We would expect that as more people are attending college, median incomes are rising, right? Surprisingly, there seems to be no correlation between these two variables. The correlation coefficient is -0.163, and the p-value is 0.653.

Correlation between median yearly income and percentage of Americans 18–24 with some college

Hopefully, we will have more exciting findings when we combine datasets. We’ll look first at the median yearly income for Americans aged 18–24 and the volume of AP exams administered. Using the Pearson correlation coefficient, there seems to be a relatively weak, negative correlation between the two variables (-0.368). However, the high p-value (0.296) means that this correlation is not statistically significant.

Correlation between median yearly income and volume of AP exams administered

As economic well-being does not seem to be correlated with the number of AP exams administered, we’ll finally address our other main question — does the growing number of AP exams administered reflect trends within educational attainment? To do so, we’ll check the correlation between AP exam volume and the percentage of Americans aged 18–24 who have attained some college. This appears to be a moderately high positive correlation (Pearson correlation coefficient of 0.665) with a low p-value of 0.0358.

Correlation between percentage of Americans 18–24 with some college and volume of AP exams administered

Given a typical alpha level of 0.05, we have enough evidence to conclude that there is a statistically significant correlation between these two metrics. This correlation is quite clearly visible on the line plot below:

It is possible, then, that the AP program impacts the number of students who choose and are able to attend college, or vice versa. More research could be done to affirm such a relationship. Nevertheless, it is clear that the AP program has been a titanic institution in the North American educational system, and will continue to do so for quite some time.

Source code, datasets, testing, and documentation

Please see the GitHub repository for this project here

--

--