Category Archives: Metro

Studying journalism and computer science at Northwestern University brought me a train ride away from Chicago. In addition to covering city government for The Daily Northwestern, I explored the city’s rich neighborhoods and taught a computer literacy class for Spanish-speaking immigrants.

I also worked for a non-profit in Los Angeles during the summer of 2010 and documented the salary scandal involving officials in the incorporated city of Bell as it erupted.

How bivariate hexbins saved my side project

I sat on this one dataset for two years.

While reporting a story on Chicago gun regulations, I pulled a year’s worth of gun possession and shooting crimes from the city data portal (first mistake). I wanted to explore the relationship between the types of crimes but wasn’t confident in a statistical or visualization method. I applied the correlation and Bayesian probability techniques I was learning in my math class to the data, but I couldn’t grasp the output.

For two years, an ONA recap of an event in Minnesota that mentions my side project haunted me because the same data files continued to sit on my hard drive, unfinished.

Then I learned about bivariate choropleth maps and the possibilities of showing two variables at the same time with colors. This fantastic how-to by Joshua Stevens showed me the way, and I found a real-world journalism application of the method with goats and sheep in the Washington Post.

So here’s my first bivariate choropleth, which shows gun possession and assault with firearm crimes in Chicago binned into hexagons. Each hexagon has at least one possession or shooting incident. As with any visualization, make sure you understand the legend.


Selecting color breaks was tricky because the distributions of the possession and shooting crime datasets are both skewed. I’m relieved to have finally mapped this dataset, but you’ll probably learn more about gun regulation by watching my final project video than by looking at that map.
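For anyone trying the same thing, here is a minimal sketch of one common way to pick breaks for skewed counts: quantile breaks on each variable, then a combined class per hexagon. This is not necessarily how the map above was built. The function names, the three-class "A1" to "C3" labels and the sample counts are all assumptions made up for illustration.

```python
from bisect import bisect_right

def quantile_breaks(values, classes=3):
    """Cut points that put roughly equal numbers of values in each class."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // classes] for k in range(1, classes)]

def classify(value, breaks):
    """0-based class index: how many cut points the value meets or exceeds."""
    return bisect_right(breaks, value)

def bivariate_class(possession, shootings, p_breaks, s_breaks):
    """Combine the two class indices into one legend key, 'A1' through 'C3'."""
    row = "ABC"[classify(possession, p_breaks)]
    col = str(classify(shootings, s_breaks) + 1)
    return row + col

# Hypothetical per-hexagon counts, skewed on purpose; not the real Chicago data.
possession_counts = [1, 1, 2, 3, 5, 8, 13, 40, 90]
shooting_counts = [0, 0, 1, 1, 1, 2, 4, 9, 30]

p_breaks = quantile_breaks(possession_counts)
s_breaks = quantile_breaks(shooting_counts)
labels = [
    bivariate_class(p, s, p_breaks, s_breaks)
    for p, s in zip(possession_counts, shooting_counts)
]
print(labels)
```

Each label maps to one cell of a 3x3 color legend. With heavily tied data (lots of hexagons with a count of zero or one), two cut points can land on the same value and leave a class empty, so it is worth checking the breaks before coloring anything.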

Five lessons from mining public school data

Some students take it easy for the spring semester of their senior year; I loaded up on Introduction to Algorithms and Statistical Methods for Data Mining. The stats class covered theoretical foundations for data mining techniques like logistic regression and neural networks and finished with an open-ended group project assignment.

Chicago Public Schools closed 49 schools and drew criticism from groups like the Chicago Teachers Union, which claimed CPS unfairly selected its closing schools. Protesters and internet memes called racism. For our final project, classmates Jim Garrison, Jaya Sah and I used three data mining techniques – logistic regression, neural networks and classification trees – to determine if racial demographics could predict whether a school closed.

You can read our 15-page report or see the slides from our in-class presentation, but here are five data observations I made during this stats project.

1. Get the basics
Before we opened SPSS to apply our data mining methods to determine if race is a predictor for schools closing, we ran some basic averages to get an initial evaluation of the protesters’ claims. Indeed, the Chicago school closures disproportionately affect black students, who make up 40% of the Chicago student population but 90% of the student body in closed schools.

We used the Illinois State Report Cards throughout our analysis. The report card is released annually, and schools are required by law to disclose all kinds of statistics about their campus demographics, test scores and resources. We still found many CPS schools missing from the data set, and its 9,655 columns of data for each school created problems for our analysis. Wrangling the report card dataset was such a challenge that I spent the majority of my time reducing its complexity. Luckily I was working at the Knight Lab with Joe Germuska, who is very familiar with the Illinois data thanks to his work with the Chicago Tribune report cards news app.

2. csvkit rules
Thanks to Joe I could use a schema to convert the 225-megabyte, semicolon-delimited 2012 report card file to a csv. SPSS still couldn’t import 9,655 data attributes at once, and understanding what was in the dataset was a struggle.

csvkit was a godsend. Chris Groskopf’s tool allowed us to examine and splice the data just how we wanted before importing it to SPSS. (More on our data cleaning methods below.) csvkit gave us a smooth workflow for testing csvs with different attributes, so we only stayed in the library until 2 a.m., not 5 a.m.
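csvkit does this from the command line. To show the idea without guessing at the exact commands we ran, here is a rough standard-library Python sketch of the same two steps: read a semicolon-delimited file whose column names come from a separate schema, and write out only the columns you want. It assumes the raw file has no header row. The column names and sample rows are hypothetical, not fields from the actual report card.

```python
import csv
import io

def convert_and_select(src, dest, column_names, keep):
    """Read semicolon-delimited rows from src; write only the `keep` columns to dest as CSV."""
    # Look up each wanted column's position once, using the schema's names.
    indexes = [column_names.index(name) for name in keep]
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dest, lineterminator="\n")
    writer.writerow(keep)  # header row for the trimmed file
    for row in reader:
        writer.writerow([row[i] for i in indexes])

# Hypothetical schema and data standing in for the real report card file.
schema = ["school_id", "school_name", "attendance_rate"]
raw = io.StringIO("001;Example Elementary;93.5\n002;Sample Academy;88.1\n")
out = io.StringIO()

convert_and_select(raw, out, schema, ["school_id", "attendance_rate"])
print(out.getvalue())
```

With real files you would pass open file handles instead of the `io.StringIO` objects. Trimming columns before import is what let us try different sets of attributes quickly.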

3. Have a dirty data plan
Being aware of and dealing with “dirty data” – information that is inaccurate or incomplete – is vital for any data-driven project, but it was especially crucial to our data mining techniques. Neural networks and logistic regression fail with missing data, so we had to clean our dataset.

This is a neural network with lots of public school nodes.
We observed that the report cards included fields for high school graduation and high school test scores, even though the majority of our closed schools were elementary schools, for which those fields contained no values. We used csvkit to exclude those fields. For other missing data fields, however, we could also have used SPSS processes that compute averages to assign to missing data fields.
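The two cleaning moves described here, dropping fields that are empty for every school and filling the remaining gaps with an average, can be sketched in a few lines of plain Python. This is an illustration of the idea, not our csvkit or SPSS workflow. The function names, field names and values are made up.

```python
def is_missing(value):
    """Treat empty strings and None as missing."""
    return value in ("", None)

def drop_empty_columns(rows):
    """Remove fields that have no value in any row."""
    keep = [key for key in rows[0] if any(not is_missing(row[key]) for row in rows)]
    return [{key: row[key] for key in keep} for row in rows]

def impute_mean(rows, field):
    """Fill blanks in a numeric field with the mean of the values that are present."""
    present = [float(row[field]) for row in rows if not is_missing(row[field])]
    mean = sum(present) / len(present)  # assumes at least one value is present
    return [
        {**row, field: mean if is_missing(row[field]) else float(row[field])}
        for row in rows
    ]

# Hypothetical elementary schools: grad_rate is blank everywhere, attendance has one gap.
schools = [
    {"school": "A", "grad_rate": "", "attendance": "90"},
    {"school": "B", "grad_rate": "", "attendance": ""},
    {"school": "C", "grad_rate": "", "attendance": "96"},
]

cleaned = impute_mean(drop_empty_columns(schools), "attendance")
print(cleaned)
```

Mean imputation keeps every school in the analysis, at the cost of pretending the missing schools are average, so it is a trade-off rather than a free fix.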

4. Another dimension, another dimension
The thousands of data attributes for each school also posed a challenge for SPSS and prevented our data mining methods from drawing conclusions. Having lots of columns in your csv is called “high-dimensional data,” and high-dimensional data is an increasingly common situation in our data-driven world. Although techniques like “feature screening” and “multi-dimensional scaling” algorithmically address this issue, we came up with our own approach. We called it “the bracket,” or “March Madness.” Read our report if you’re curious.
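To make feature screening concrete, here is a minimal sketch of one simple version: score every column by how strongly it correlates with the closed/open label, then keep only the top few. To be clear, this is not “the bracket” (that is in the report), and it is only one of several ways to screen features. The column names and numbers are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)

def screen_features(columns, labels, top_k):
    """Return the top_k column names ranked by absolute correlation with the labels."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical data: four schools, 1 = closed, 0 = stayed open.
closed = [0, 0, 1, 1]
columns = {
    "attendance": [96, 95, 88, 85],
    "enrollment": [400, 300, 410, 310],
    "constant": [1, 1, 1, 1],
}

print(screen_features(columns, closed, top_k=2))
```

Screening like this looks at each column on its own, so it can miss attributes that only matter in combination. That is part of why it is a first pass and not a final model.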

5. Consider the source
Wrangling the state dataset was one of the most difficult aspects of this data mining project, so we could have considered using other sources or mashing several. We tracked these data projects, which use various data sources and employ different levels of data analysis:

ProPublica Opportunity Gap
Chicago Tribune school report cards

So what did logistic regressions, neural networks and classification trees say about the CPS school closures? Our tests largely found attendance rate – “the aggregate days of student attendance, divided by the sum of the aggregate days of student attendance and aggregate days of student absence” – to be the best at predicting whether a school would close.

This supports CPS claims that schools were closed based on its own utilization metric. However, as Ramsin Cannon writes, just because a decision isn’t based on race doesn’t necessarily mean its effects aren’t discriminatory…

Mental Health Movement protests closures of Chicago public clinics

The City of Chicago closed six public mental health clinics in the month of April, drawing protest from the “Mental Health Movement.”

The cuts, formalized in the 2012 city budget, save the Department of Public Health $2.3 million. The department budget dropped 25 percent between 2011 and 2012.

Chicago police arrested 23 people on April 13 for barricading themselves inside the Woodlawn Mental Health Clinic. Protesters set up outside the closed Northwest Mental Health Clinic on May 9.

Six public clinics remain open, and the city announced plans to partner with community health providers to expand mental health services.

Open mental health clinics are shown in green on the map below; closed clinics are red.

Arms Around Roseland leaders pray for peace in Chicago neighborhoods

Health and faith leaders in Chicago’s Roseland neighborhood responded to a spree of overnight shootings with a movement they’re calling “Arms Around Roseland.”

Roseland Community Hospital joined with neighborhood ministers to bring congregants out of the pews and onto the streets to raise awareness about gun violence. There have been 11 murders in the police district that comprises the South Side neighborhood this year, up four from this time last year.

“Arms Around Roseland” organizers say congregations will pray outside their churches every Sunday until the month of October.

Englewood organizations clean Chicago streets on neighborhood unity day

The city of Chicago held its annual “Clean and Green” volunteer effort on Earth Day, and community organizations in the Englewood neighborhood used the day to clean up and organize.

Resident Association of Greater Englewood and Imagine Englewood If… brought volunteers and tools to the 69th Street corridor, attacking weeds and picking up trash.

“Clean and Green” day has taken place in Chicago for more than 20 years; this is the first time neighborhood organizations have collaborated on “Greater Englewood Unity Day.” Inspiration for the unity day came from a community meeting with Illinois senator Mattie Hunter where community organizations expressed a desire to collaborate, volunteers said.