Five lessons from mining public school data

Some students take it easy for the spring semester of their senior year; I loaded up on Introduction to Algorithms and Statistical Methods for Data Mining. The stats class covered theoretical foundations for data mining techniques like logistic regression and neural networks and finished with an open-ended group project assignment.

In 2013, Chicago Public Schools closed 49 schools, drawing criticism from groups like the Chicago Teachers Union, which claimed CPS unfairly selected the schools it closed. Protesters and internet memes cried racism. For our final project, classmates Jim Garrison, Jaya Sah and I used three data mining techniques – logistic regression, neural networks and classification trees – to determine whether racial demographics could predict whether a school closed.

You can read our 15-page report or see the slides from our in-class presentation, but here are five data observations I made during this stats project.

1. Get the basics
Before we opened SPSS to apply our data mining methods to determine if race is a predictor for schools closing, we ran some basic averages to get an initial read on the protesters’ claims. Indeed, the Chicago school closures disproportionately affected black students, who make up 40% of the Chicago student population but 90% of the student body at the closed schools.
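The kind of basic average we started with can be sketched in a few lines. The numbers and column names below are invented for illustration, not the real report card data:

```python
import csv
import io

# Hypothetical miniature version of the report card: one row per school,
# a closed/open flag, and the share of black students.
raw = """school,closed,black_pct
A,1,95
B,1,88
C,0,30
D,0,20
E,0,10
"""

rows = list(csv.DictReader(io.StringIO(raw)))
closed = [r for r in rows if r["closed"] == "1"]

def avg(rs):
    """Average share of black students across a set of schools."""
    return sum(float(r["black_pct"]) for r in rs) / len(rs)

print(f"all schools: {avg(rows):.0f}%, closed schools: {avg(closed):.0f}%")
```

If the closed-school average towers over the overall average, as it did in the real data, you have a disparity worth investigating before any modeling.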

We used the Illinois State Report Cards throughout our analysis. The report card is released annually, and schools are required by law to disclose all kinds of statistics about their campus demographics, test scores and resources. We still found many CPS schools missing from the data set, and its 9,655 columns of data for each school created problems for our analysis. Wrangling the report card dataset was such a challenge that I spent the majority of my time reducing its complexity. Luckily I was working at the Knight Lab with Joe Germuska, who is very familiar with the Illinois data thanks to his work with the Chicago Tribune report cards news app.

2. csvkit rules
Thanks to Joe, I could use a schema to convert the 225-megabyte, semicolon-delimited 2012 report card file to a csv. SPSS still couldn’t import 9,655 data attributes at once, and understanding everything in the dataset was a struggle.

csvkit was a godsend. Chris Groskopf’s tool let us examine and splice the data just how we wanted before importing it into SPSS (more on our data cleaning methods below). csvkit gave us a smooth workflow for testing csvs with different attributes, so we only stayed in the library until 2 a.m., not 5 a.m.
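The two csvkit steps we leaned on – re-reading a semicolon-delimited file as a csv (`in2csv -d ";"`) and splicing out a subset of columns (`csvcut -c`) – boil down to something like this sketch in plain Python. The column names here are made up, not the report card’s real headers:

```python
import csv
import io

# A tiny stand-in for the 2012 report card: semicolon-delimited, with
# more columns than we need (hypothetical column names).
raw = "School;City;Attendance Rate;Mobility Rate\nFoo Elementary;Chicago;94.2;12.1\n"

# Step 1: parse the semicolon-delimited input (what in2csv -d ";" does).
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

# Step 2: write back only the attributes we care about
# (the equivalent of csvcut -c "School,Attendance Rate").
keep = ["School", "Attendance Rate"]
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=keep, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The payoff of doing this on the command line with csvkit instead is speed: you can iterate on which columns to keep without writing any code at all.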

3. Have a dirty data plan
Being aware of and dealing with “dirty data” – information that is inaccurate or incomplete – is vital for any data-driven project, but it was especially crucial to our data mining techniques. Neural networks and logistic regression fail with missing data, so we had to clean our dataset.

*This is a neural network with lots of public school nodes.*
We observed that the report cards included fields for high school graduation rates and high school test scores, even though the majority of our closed schools were elementary schools, so those fields contained no values. We used csvkit to exclude those fields. For the remaining missing data fields, we could also have used SPSS procedures that compute averages to fill in the gaps.
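Both cleaning steps – dropping columns that are empty for every school, then mean-imputing whatever is still missing – can be sketched like this (toy rows and hypothetical field names, with `None` marking missing data):

```python
import statistics

# Elementary schools have no high school graduation rate, and one school
# is also missing its attendance rate.
rows = [
    {"school": "A", "hs_grad_rate": None, "attendance": 91.0},
    {"school": "B", "hs_grad_rate": None, "attendance": None},
    {"school": "C", "hs_grad_rate": None, "attendance": 95.0},
]
numeric = ["hs_grad_rate", "attendance"]

# Step 1: drop columns empty for every school (our csvkit step).
empty = [c for c in numeric if all(r[c] is None for r in rows)]
for r in rows:
    for c in empty:
        del r[c]

# Step 2: mean-impute any remaining gaps (the SPSS-style step).
for c in (c for c in numeric if c not in empty):
    mean = statistics.mean(r[c] for r in rows if r[c] is not None)
    for r in rows:
        if r[c] is None:
            r[c] = mean
```

Mean imputation is the bluntest option; it keeps the models running but shrinks variance, which is worth remembering when interpreting results.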

4. Another dimension, another dimension
The thousands of data attributes for each school also posed a challenge for SPSS and prevented our data mining methods from drawing conclusions. Having lots of columns in your csv is called “high-dimensional data,” and high-dimensional data is an increasingly common situation in our data-driven world. Although techniques like “feature screening” and “multidimensional scaling” address this problem algorithmically, we came up with our own approach. We called it “the bracket,” or “March Madness.” Read our report if you’re curious.
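To show what feature screening means in the simplest case, here is one common screen: drop attributes whose variance is essentially zero, since a column that is the same for every school can’t separate closed schools from open ones. (This is a generic illustration with invented attribute names, not our bracket method.)

```python
import statistics

# Toy high-dimensional table: a few of a school's many attributes,
# one of which is constant and therefore uninformative.
schools = [
    {"attendance": 91.0, "state_id_prefix": 15.0, "mobility": 12.0},
    {"attendance": 85.0, "state_id_prefix": 15.0, "mobility": 30.0},
    {"attendance": 96.0, "state_id_prefix": 15.0, "mobility": 8.0},
]

def screen(rows, threshold=1e-9):
    """Keep only attributes whose variance across schools exceeds the threshold."""
    return [c for c in rows[0]
            if statistics.pvariance(r[c] for r in rows) > threshold]

print(screen(schools))  # the constant state_id_prefix column is dropped
```

Real feature screening gets more sophisticated (for example, ranking columns by correlation with the outcome), but the idea is the same: cheaply discard columns that can’t help before running expensive models.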

5. Consider the source
Wrangling the state dataset was one of the most difficult aspects of this data mining project, so we could have considered using other sources, or mashing up several. We tracked these data projects, which draw on various data sources and employ different levels of analysis:

ProPublica Opportunity Gap
Chicago Tribune school report cards

So what did logistic regression, neural networks and classification trees say about the CPS school closures? Our tests largely found attendance rate – “the aggregate days of student attendance, divided by the sum of the aggregate days of student attendance and aggregate days of student absence” – to be the best predictor of whether a school would close.
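We ran our models in SPSS, but a single-feature logistic regression like the attendance-rate one can be sketched in a few lines of plain Python. The data points below are invented for illustration (attendance rate expressed as percentage points above or below a 90% mean, paired with a closed/open label):

```python
import math

# Toy, invented data: (attendance rate minus 90%, closed?).
data = [(-8, 1), (-5, 1), (-2, 1), (3, 0), (5, 0), (7, 0)]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# One-feature logistic regression fit by stochastic gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    for x, y in data:
        p = sigmoid(w * x + b)       # predicted probability of closing
        w -= 0.1 * (p - y) * x       # gradient step on the weight
        b -= 0.1 * (p - y)           # gradient step on the intercept

def predict(x):
    """Probability of closing for a school x points below/above 90% attendance."""
    return sigmoid(w * x + b)
```

The fitted weight on attendance comes out negative: the lower a school’s attendance rate, the higher its predicted probability of closing, which matches what our SPSS models found on the real data.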

This supports CPS claims that schools were closed based on its own utilization metric. However, as Ramsin Canon writes, just because a decision isn’t based on race doesn’t necessarily mean its effects aren’t discriminatory…