Studying journalism and computer science at Northwestern University brought me a train ride away from Chicago. In addition to covering city government for The Daily Northwestern, I explored the city’s rich neighborhoods and taught a computer literacy class for Spanish-speaking immigrants.
I also worked for a non-profit in Los Angeles during the summer of 2010 and documented the salary scandal involving officials in the incorporated city of Bell as it erupted.
While reporting a story on Chicago gun regulations, I pulled a year’s worth of gun possession and shooting crimes from the city data portal (first mistake). I wanted to explore the relationship between the types of crimes but wasn’t confident in a statistical or visualization method. I applied the correlation and Bayesian probability techniques I was learning in my math class to the data, but I couldn’t grasp the output.
For two years, an ONA recap of an event in Minnesota that mentions my side project haunted me because the same data files continued to sit on my hard drive, unfinished.
Then I learned about bivariate choropleth maps and the possibilities of showing two variables at the same time with colors. This fantastic how-to by Joshua Stevens showed me the way, and I found a real-world journalism application of the method with goats and sheep in the Washington Post.
So here’s my first bivariate choropleth, which shows gun possession and assault with firearm crimes in Chicago binned into hexagons. Each hexagon has at least one possession or shooting incident. As with any visualization, make sure you understand the legend.
Selecting color breaks was tricky because the distributions of the possession and shooting crime datasets are both skewed. I’m relieved to have finally mapped this dataset, but you’ll probably learn more about gun regulation by watching my final project video than by looking at that map.
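For readers curious how two skewed variables can be turned into bivariate classes, here is a minimal sketch in Python. It assumes quantile (tercile) breaks, which are one common choice for skewed data and not necessarily the breaks used for the map above. The per-hexagon counts and the function names are made up for illustration.

```python
from statistics import quantiles

def tercile_breaks(values):
    """Return the two cut points that split values into three roughly equal-count bins."""
    return quantiles(values, n=3)

def bin_index(value, breaks):
    """Return 0, 1 or 2: the number of cut points the value exceeds."""
    return sum(value > b for b in breaks)

def bivariate_class(possession, shootings, p_breaks, s_breaks):
    """Combine the two bin indexes into one of nine classes, 'A1' through 'C3'."""
    row = "ABC"[bin_index(possession, p_breaks)]
    col = bin_index(shootings, s_breaks) + 1
    return f"{row}{col}"

# Hypothetical per-hexagon counts (not the real Chicago data), skewed on purpose
possession = [1, 1, 2, 2, 3, 4, 6, 9, 25]
shootings = [0, 1, 1, 1, 2, 2, 3, 5, 14]

p_breaks = tercile_breaks(possession)
s_breaks = tercile_breaks(shootings)
classes = [
    bivariate_class(p, s, p_breaks, s_breaks)
    for p, s in zip(possession, shootings)
]
print(classes)
```

Each of the nine classes then gets its own color in the legend. With heavily tied counts (lots of hexagons with one incident), values equal to a cut point fall into the lower bin here, so the bins may not end up the same size.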
Some students take it easy for the spring semester of their senior year; I loaded up on Introduction to Algorithms and Statistical Methods for Data Mining. The stats class covered theoretical foundations for data mining techniques like logistic regression and neural networks and finished with an open-ended group project assignment.
Chicago Public Schools closed 49 schools and drew criticism from groups like the Chicago Teachers Union, which claimed CPS unfairly selected its closing schools. Protesters and internet memes called racism. For our final project, classmates Jim Garrison, Jaya Sah and I used three data mining techniques – logistic regression, neural networks and classification trees – to determine if racial demographics could predict whether a school closed.
1. Get the basics
Before we opened SPSS to apply our data mining methods to determine if race is a predictor for schools closing, we ran some basic averages to get an initial evaluation of the protesters’ claims. Indeed, the Chicago school closures disproportionately affect black students, who make up 40% of the Chicago student population but 90% of the student body in closed schools.
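As a rough sketch of that kind of first-pass check, the two percentages above can be compared as a simple ratio. The function name is my own for illustration; only the 40% and 90% figures come from our analysis.

```python
def representation_ratio(share_in_subset, share_overall):
    """How many times more common a group is in the subset than overall."""
    return share_in_subset / share_overall

# Shares quoted above: black students as a fraction of all CPS students
# and as a fraction of students in closed schools
overall_share = 0.40
closed_share = 0.90

ratio = representation_ratio(closed_share, overall_share)
print(f"{ratio:.2f}")
```

A ratio of 1 would mean closed schools mirror the district as a whole; here it comes out to about 2.25. It says nothing about why the gap exists, which is what the data mining methods were meant to probe.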
We used the Illinois State Report Cards throughout our analysis. The report card is released annually, and schools are required by law to disclose all kinds of statistics about their campus demographics, test scores and resources. We still found many CPS schools missing from the data set, and its 9,655 columns of data for each school created problems for our analysis. Wrangling the report card dataset was such a challenge that I spent the majority of my time reducing its complexity. Luckily I was working at the Knight Lab with Joe Germuska, who is very familiar with the Illinois data thanks to his work with the Chicago Tribune report cards news app.
2. csvkit rules
Thanks to Joe, I could use a schema to convert the 225-megabyte, semicolon-delimited 2012 report card file to a CSV. SPSS still couldn’t import 9,655 data attributes at once, and understanding everything in the dataset was a struggle.
csvkit was a godsend. Chris Groskopf’s tool allowed us to examine and splice the data just how we wanted before importing it to SPSS. (More on our data cleaning methods below.) csvkit gave us a smooth workflow for testing CSVs with different attributes, so we only stayed in the library until 2 a.m., not 5 a.m.
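To give a sense of what that slicing looks like, here is a minimal sketch using Python’s built-in csv module in place of the csvkit command line. It reads a semicolon-delimited file and writes a comma-delimited one with only the chosen columns. The column names and values are hypothetical, not the real report card fields.

```python
import csv
import io

def cut_columns(src, dst, keep, delimiter=";"):
    """Copy only the columns in `keep` from a delimited file to a comma-delimited CSV."""
    reader = csv.DictReader(src, delimiter=delimiter)
    writer = csv.DictWriter(
        dst, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # columns not listed in `keep` are silently dropped

# Hypothetical stand-in for the semicolon-delimited report card file
raw = io.StringIO(
    "school_name;pct_black;pct_low_income;hs_grad_rate\n"
    "School A;90;95;\n"
    "School B;40;60;\n"
)
out = io.StringIO()
cut_columns(raw, out, keep=["school_name", "pct_black"])
print(out.getvalue())
```

With real files, swap the StringIO objects for open(path, newline="") handles. Keeping the column list in one place made it easy to produce several small CSVs with different attributes.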
3. Have a dirty data plan
Being aware of and dealing with “dirty data” – information that is inaccurate or incomplete – is vital for any data-driven project, but it was especially crucial to our data mining techniques. Neural networks and logistic regression fail with missing data, so we had to clean our dataset.
We observed that the report cards included fields for high school graduation and high school test scores, even though the majority of our closed schools were elementary schools, for which those fields contained no values. We used csvkit to exclude those fields. For other missing data fields, however, we could also have used SPSS processes that compute averages and assign them to the missing fields.
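The two cleaning ideas above, dropping fields that are empty for every school and filling remaining gaps with an average, can be sketched in plain Python. This is an illustration rather than what SPSS or csvkit does internally, and the rows and column names are hypothetical.

```python
from statistics import mean

def drop_empty_columns(rows):
    """Remove columns that are blank in every row."""
    columns = list(rows[0].keys())
    keep = [c for c in columns if any(r[c].strip() for r in rows)]
    return [{c: r[c] for c in keep} for r in rows]

def impute_mean(rows, column):
    """Replace blanks in a numeric column with the mean of the observed values."""
    observed = [float(r[column]) for r in rows if r[column].strip()]
    fill = mean(observed)
    return [
        {**r, column: float(r[column]) if r[column].strip() else fill}
        for r in rows
    ]

# Hypothetical elementary school rows: no graduation rate, one missing test score
rows = [
    {"school": "School A", "reading_score": "80", "hs_grad_rate": ""},
    {"school": "School B", "reading_score": "", "hs_grad_rate": ""},
    {"school": "School C", "reading_score": "60", "hs_grad_rate": ""},
]

cleaned = impute_mean(drop_empty_columns(rows), "reading_score")
print(cleaned)
```

Mean imputation keeps every school in the analysis, but it also shrinks the variance of the filled column, so it is worth noting which values were filled in.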
5. Consider the source
Wrangling the state dataset was one of the most difficult aspects of this data mining project, so we could have considered using other sources or mashing several. We tracked these data projects, which use various data sources and employ different levels of data analysis:
This supports CPS claims that schools were closed based on its own utilization metric. However, as Ramsin Cannon writes, just because a decision isn’t based on race doesn’t necessarily mean its effects aren’t discriminatory…
Health and faith leaders in Chicago’s Roseland neighborhood responded to a spree of overnight shootings with a movement they’re calling “Arms Around Roseland.”
Roseland Community Hospital joined with neighborhood ministers to bring congregants out of the pews and onto the streets to raise awareness about gun violence. There have been 11 murders this year in the police district that comprises the South Side neighborhood, up four from this time last year.
“Arms Around Roseland” organizers say congregations will pray outside their churches every Sunday until the month of October.
“Clean and Green” day has taken place in Chicago for more than 20 years; this is the first time neighborhood organizations have collaborated on “Greater Englewood Unity Day.” Inspiration for the unity day came from a community meeting with Illinois senator Mattie Hunter where community organizations expressed a desire to collaborate, volunteers said.
Data journalist from Sacramento, not the R&B singer