Archive FM

Data Skeptic

Streetlight Outage and Crime Rate Analysis with Zach Seeskin

Duration:
33m
Broadcast on:
18 Jul 2014

This episode features a discussion with statistics PhD student Zach Seeskin about a project he was involved in as part of the Eric and Wendy Schmidt Data Science for Social Good Summer Fellowship.  The project involved exploring the relationship (if any) between streetlight outages and crime in the City of Chicago.  We discuss how the data was accessed via the City of Chicago data portal, how the analysis was done, and what correlations were discovered in the data.  Won't you listen and hear what was found? 

(upbeat music)
- The Data Skeptic Podcast features conversations with researchers and professionals working on problems or projects related to data science.
- So welcome back to the Data Skeptic Podcast. I'm here today with my guest, Zach Seeskin. How you doing, Zach?
- Thanks for having me on the podcast, Kyle.
- Yeah, thanks so much for joining me. I got wind of your work, and I can't remember exactly how I ended up there, but I landed on this website that I'll link to in the show notes. It was part of a study you were involved in that looked at streetlights and crime rates in Chicago and how outages seem pretty highly correlated with crime. So I thought it'd be a really interesting topic to explore on the podcast.
- Excellent, yeah, I'm excited to discuss it with you.
- Before we get too deep, maybe you could tell me a little bit about your background and how you got involved in that work.
- Sure. So first of all, I'm a doctoral student in statistics at Northwestern University, so I'm in Evanston, Illinois, just outside of Chicago. Last summer, I was involved in a summer fellowship called the Eric and Wendy Schmidt Data Science for Social Good Fellowship. It was in Chicago, and last year was the inaugural summer, so it's continuing into this summer. That fellowship brought together graduate and undergraduate students from a variety of backgrounds, all with skills and an interest in data science. So there were students from computer science backgrounds, statistics or machine learning backgrounds, and also students with expertise in subject areas. We collaborated on data science projects, all of which had to have a social good element of benefit to the greater public. There were about 12 different projects or so that we worked on last summer. We mostly partnered with nonprofits and local government agencies, in areas such as education, transportation, and energy.
My team teamed up with the Chicago Department of Transportation to look at whether there was an association between streetlight outages and crime around Chicago.
- That's very cool. It's a neat program. How long has it been going on?
- This is just its second year, and it's going really strong. It was amazing last summer. They got the funding for it starting in March, put out the application process, and had the program all put together by the summer of 2013. So that's continuing. I think they plan to continue it for a long time.
- That's awesome. Yeah, it seems like Chicago in particular has had this sort of renaissance in data projects over the last few years. Or maybe that's just some homesickness from me being from Chicago. How do you see it? Are things really on the move in Chicago in particular?
- I completely agree with you. The past few years have seen a lot of change. It's a great city to be in, with an awesome data community. I can speak mostly from my own experiences, which might be different from another Chicagoan who's in the data science community. But I think, first of all, it starts with the great universities in the Chicago area. I'm so fortunate to be at Northwestern. I'm involved with the Institute for Policy Research there, which is an interdisciplinary organization, so researchers from a variety of fields are using quantitative methods to study policy questions in depth. Many questions are related to issues in the Chicago community, and they often partner with organizations in Chicago. Then, of course, there's the University of Chicago and the University of Illinois at Chicago, which I know is your alma mater. They have a very strong presence in the data science community. One institute I'd particularly mention at U of Chicago is the new center, the Urban Center for Computation and Data.
And so that's a collaboration between some tech people and people with expertise in urban issues, but from a more academic perspective, from different departments. They're working to build greater data infrastructures to study cities and understand them better than ever before. And they're also a sponsor for Data Science for Social Good. Beyond the universities, you'll have a great community that's using data science to work on civic issues. So we have a weekly hack night here. I don't think that's more than two years old; if it is, it's not much more than that. That's been computer-science-oriented people working with community organizations to collaborate on a variety of projects that matter to Chicagoans. There are about five or so other data-science-type meetups that I'm aware of, possibly more than that. Many of them are excellent. I've gotten involved in the R community here. So R is a statistical software package. There's a great community here, and there are many tips to pick up on statistical methods or computing techniques from people with other experiences. So that's a taste of what I've experienced in the Chicago data science community.
- That's awesome. Sounds really vibrant.
- It really is.
- So maybe we can get into the particular project that we're gonna talk about. Tell me a little bit about the problem you guys set out to solve, or maybe the analysis you set out to do.
- Our partner was the Chicago Department of Transportation, or CDOT for short, so I'll probably refer to them as CDOT throughout our discussion. They're responsible for a number of projects that affect Chicagoans in their everyday lives, and one of those is to manage the network of streetlights around Chicago. There are over 250,000 streetlights in Chicago, so that's a pretty big task. The question that they came to us with was to investigate whether there was a connection between streetlight outages and crime around Chicago.
If we found a connection, that could lead CDOT to place more priority on fixing streetlights. They came to us with some specific stories for the motivation. Outages may be caused by vandalism with the intent of committing crime. That's something that we weren't able to look into directly; possibly it could be in the future. But that was a motivating story for them. They were also interested in seeing if there were certain neighborhoods or specific outages where there was a particularly greater risk of an increase in crime. That would tell them if there are certain geographic areas of Chicago that they should prioritize, instead of a uniform improvement across the city.
- How do they collect the outage data?
- Well, most of it comes from 311 service requests. So people call into a 311 service line, saying I have this outage in my neighborhood or on this street that I pass. This data is publicly available on the City of Chicago data portal. We also had some help from CDOT, which had cleaned up the service request data and maybe added in a few things that weren't in that initial data set. But most of it is just service requests from residents who see things.
- So I also noticed one of the things you reported in your paper was recognizing that temperature played a strong seasonal effect. I was wondering how you go about deciding what variables like that to look at. I mean, temperature, in retrospect, seems obvious, but on day one of the project, there are an infinite number of things you could be looking at.
- One thing to do before you choose any modeling strategy and do your estimation is just explore the data, get to know it, see what you find, test out some of your hypotheses, and have that inform your modeling strategy. We want to be aware of factors that would cause differences in crime rates. And so we looked at variation across two different dimensions. One is temperature.
So that's related to the time dimension: there's strong seasonality in crime rates, particularly in Chicago. Anybody who watches the news in Chicago in the summer hears a lot about increased crime, and that's definitely confirmed by the data. We have some information about that in our blog post. Also, on the spatial dimension, we looked at variation in crime rates across different neighborhoods in Chicago. And certainly there was tremendous variation.
- You have two kinds of data points you're looking at, I would guess: the outage, and then when repairs got completed. On the outage front, I suspect because it's 311 calls, there's some variability there. Not everyone calls, and some people call quicker than others. And maybe even on the repair end, the repair guy, it's Friday, he files his report Monday morning, something like that. How noisy was your data set? And did that cause any problems for you?
- That's one of those things that I think is impossible to know exactly. You certainly can't fully trust when an outage started or when an outage ended. So we thought, is there something we could do to help make a cleaner comparison for the city? What we ended up doing for our analysis, and I can get to this in more detail later, was we looked within a block and compared the crime rate during an outage, based upon the report date and the completion date when the outage was repaired, to the crime rate in the same block in a before and an after period. We had a buffer period between the before period and the outage period, and between the outage period and the after period. So, after discussing with the city, we excluded a period for which we didn't know if the outage had already started, since that could cause some confusion in our estimation. You have to think about what the data really represents, and then whether there's something you can do to get around that and make a little bit better comparison.
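The within-block comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual code: the function name `window_rates`, the 30-day comparison windows, and the 7-day buffers are all hypothetical choices, since the interview does not give the real lengths.

```python
from datetime import date, timedelta

def window_rates(crime_dates, reported, completed, window_days=30, buffer_days=7):
    """Crimes per day on one block before, during, and after an outage.

    window_days and buffer_days are hypothetical values for illustration.
    """
    one = timedelta(days=1)
    buf = timedelta(days=buffer_days)
    win = timedelta(days=window_days)
    # Each period is an inclusive (start, end) pair. The buffer days on either
    # side of the outage are left out because the true outage status is unknown.
    periods = {
        "before": (reported - buf - win, reported - buf - one),
        "during": (reported, completed),
        "after": (completed + buf + one, completed + buf + win),
    }
    rates = {}
    for name, (start, end) in periods.items():
        days = (end - start).days + 1
        count = sum(start <= d <= end for d in crime_dates)
        rates[name] = count / days
    return rates

# Made-up crime dates for a single block, with an outage reported on
# July 1 and repaired on July 10.
crimes = [date(2013, 6, 1), date(2013, 6, 28), date(2013, 7, 5),
          date(2013, 7, 6), date(2013, 8, 1)]
rates = window_rates(crimes, date(2013, 7, 1), date(2013, 7, 10))
```

Because the block is compared only with itself in nearby weeks, differences between neighborhoods and most of the seasonal swing drop out of the comparison. The crime on June 28 falls inside the buffer and is ignored.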
- Yeah, that was one of the novel things that I really liked about the work: you compared a block to itself before and after, as sort of your test and control group. Especially in Chicago, or maybe it's like this everywhere, neighborhood to neighborhood makes such a huge difference. You wouldn't compare the Chatham neighborhood to Rogers Park. They're totally different scenarios.
- When we started out thinking about how we were going to do this project, we were like, wow, it's going to be really hard to get a clean comparison. We thought maybe there's something in a controlling, regression-type framework where we could make comparisons between different neighborhoods or between different time periods, but that would be a misleading and fairly complex model. So what we came up with was: look in a block that was affected by an outage, look at the crime rate during that outage, and then compare it to the crime rate in periods just before and just after the outage occurred. We're controlling for space by looking at the same place, and we're controlling for time by looking at pretty similar time periods. So if we see a change in the crime rates, we can be pretty sure that what's changed is the outage itself.
- Another really nice step, one that's intuitive once you read it, but that I wouldn't have thought of on my own if I had done the project, was that you segmented out crimes that were plausibly related to streetlights. So some SEC violation or white collar crime naturally wouldn't be related, and indoor crime wouldn't be related. Were those easy categorical variables available in your data set, or was that something you had to go through and do some classification on?
- Fortunately, the crime data that's on the data portal is maintained by the police department, and it has these fields.
So that was fortunately pretty straightforward. We discussed with the Department of Transportation to pick out different location descriptions that were places that could be affected by streetlight outages. We were looking for alleys, for alley lights, streets, for streetlights, if there was a yard, anything like that. We picked about 15 different location descriptions. And then we also worked to pick out about nine crime types that we thought could plausibly be affected by outages. So we focused on [unclear], narcotics, battery, criminal damage, motor vehicle theft, and a few others. There were nine in total. That was something we discussed with the city as well. But fortunately, the data was set up pretty well for that.
- Were you at all able to consider privately held light sources, like a citizen who puts a big streetlight in their backyard or something like that?
- That's an interesting idea to look at. Unfortunately, no, that's not something that we looked at. We had a limited amount of time to complete the analysis and just stuck to the data sets that were already on the City of Chicago data portal or provided to us by the Department of Transportation. I think that's a good idea to be thorough, if we could identify the data set. It would definitely be helpful to have data on other light sources. I just wanna say that if there's some listener out there who's inspired to work to improve our analysis, there are links to our work and our programming on GitHub. So if you were inspired and had an idea for how to improve our work or dive into our code, I would definitely encourage you to do that.
- Yeah, I'll put links to all that in the show notes for anybody who wants to take a look. Having that, you know, what with the City of Chicago portal and things like GitHub and your recommendation, maybe somebody says, hey, I've got a novel way, I wanna play with this data as well.
It's a really interesting time for, I guess, what I would call citizen scientists to step up and make novel contributions.
- Absolutely. Today it's so easy to share work with each other that it's great to start from somebody else's work as a jumping-off point and improve it, and there are many more ways to look at this problem than the way we did. So it's definitely an opportunity to improve it, and I would definitely encourage anybody to jump into that.
- Let's talk a little bit about the conclusion of your analysis.
- So our main conclusion is that, again, if we looked within the block where an outage occurred, we saw that a "streetlights all out" outage, which means multiple streetlights out down a block, was associated on average with a 7% increase in the nine crime types that we looked at. That was on average, and there's great variation across the city in our neighborhood estimates of that number. We also looked at "streetlights one out" outages, so just one or two streetlights out, and also alley-light outages. At the block level, we did not find an association for those. Just one thing to point out about interpreting these estimates is that it's a very helpful comparison for the city, but we can't necessarily conclude a causal relationship, in case something else is happening within a block that tends to happen simultaneously with outages. It's possible you could come up with a story there. But nonetheless, it's a very helpful comparison for the city.
- Sure, yeah, and even if it is causal or reverse causal, there's a pronounced phenomenon that's definitely worth exploring.
- Absolutely.
- I know you may or may not have any real hard calculations on this, but from my point of view, I think we could assume that there's certainly some cost per repair, right, to go out and have the city fix a streetlight. But for that 7% of crime that theoretically could be prevented, there are damages, both public and private.
So roughly speaking, and I know this wasn't exactly part of your analysis, but if you had to take a back-of-the-envelope guess, what would you say might be the potential savings from implementing a plan that maybe invested more in repairing streetlights, or doing them faster or more strategically, and the kind of gain that the city would get in terms of reduced crime, and also the private sector in terms of reduced loss and things like that?
- This is a really good question, and one that's really important for the city to think about carefully. I do have to say that any numbers I put on it would be outside my expertise and shots in the dark. There are obviously costs associated with repairing the streetlights more quickly, but benefits from decreasing crime, and you have to think carefully about prioritization across different neighborhoods. We're really glad that we provided the inputs for the city to do a cost-benefit analysis and to think about how to prioritize the streetlights. Unfortunately, I just don't know enough to do a precise cost-benefit calculation.
- Sure, sure, yeah, that's very fair. Intuitively, I would guess that the cost of streetlight repair, while non-trivial, and as you point out, probably more expensive if you wanna reduce the time to do it, is probably a fraction of the potential savings in police cost and damages and things like that, which is why I got really excited about the work you guys were doing.
- I think that's definitely a possibility. The Department of Transportation, if they find this cost-effective, could really do some work to improve the lives of the residents of Chicago and help make them safer.
- Yeah, so beyond the block analysis, I think you guys also did a more aggregate, neighborhood-level analysis.
If I read correctly, I think there were seven neighborhoods where you found a statistically significant change related to, I guess, faster streetlight repair. Help me understand what the conclusion was there.
- All right, so first of all, we have a map that's a nice visualization on our blog post. We took each of the 77 community areas around Chicago, and we estimated our model just within that community area, still looking at the block level at the association between "streetlights all out" outages and the nine crime types. One thing to point out is that there are fewer outages and fewer crimes within a community area than in all of Chicago. So we lose statistical power, and we lose the ability to detect associations and find statistical significance. But there were six community areas where we found a statistically significant increase in crime rates, just using a 5% p-value. We found two that were in northwest Chicago and four that were on the South Side. We actually found one community area where there was a statistically significant decrease, interestingly, so I'll just point that out. That was the Lower West Side, which is a Hispanic neighborhood that's just west of the Loop, which is downtown Chicago. The goal of all of this was to help give the city a geographic picture of where there was a particularly strong association between outages and crime. It's not a perfectly detailed geographic resolution, but we also provided the city with data that merged streetlight outages and crime by block and by time period, and that way they could look at particular outages that caught their eye and were interesting.
- The fact that you had enough resolution to go block by block was really interesting to me. It seemed like a very powerful tool for both police and legislators who make these sorts of choices. But I was curious a little bit about the one neighborhood that saw that counterintuitive decrease.
If I had to put a guess on it, I would say maybe there's some missing explanatory variable. Like if that community got a new police station, you might expect a decrease in crime, or maybe there was an independent trend, that neighborhood is getting gentrified or something like that. Did you have any thoughts on why that one might have been an outlier?
- Those are all good ideas. We can't know the reason for sure. It could be that there's just something happening in that neighborhood. It could be that there's something really perverse happening. You could tell a story, maybe, about why there was an association that was negative in that particular neighborhood. We don't know for sure; that's a possibility. Another issue that's important to think about here is what statisticians call multiple comparisons. That comes up when you're making 77 estimates, which is what we did for the 77 community areas of Chicago. We just used a straightforward 5% p-value for all of our estimates. What that means is, if you had no effect, no association, and you're doing 77 estimates, you'd expect about 2.5% of them to show a statistically significant increase and 2.5% to show a statistically significant decrease, just by random chance. So it's possible that we just got something kind of fluky, and that's something you have to deal with given the way statistical testing works. But the fact that we had one with a statistically significant decrease and six with a statistically significant increase, and the fact that there was an overall 7% increase on average, supports the idea that we had still caught on to something in aggregate.
- To my eye as a former resident, one thing that jumped out to me concerned the neighborhoods I considered to have the highest crime rates. And again, that's just eyeballing. I didn't pull out facts and figures. And granted, I haven't lived there in a couple of years, and I know things have been shifting.
But it didn't seem to me that there was much effect from treatment in really high-crime areas, like Englewood and places like that. Do you think that's the nature of those neighborhoods, that these are more densely crime-ridden areas and therefore, streetlight or not, crime's going to go on? Or perhaps it's that the city is less involved in repairs, or reports are fewer, so the outages last longer? I know you may not have access to a lot of those data points, but was it surprising to your team as well that the highest-crime neighborhoods, again, maybe just to my eye, didn't seem to show that much benefit compared to other areas?
- Yeah, those are all really interesting points. Certainly, looking at our neighborhood visualization with the estimates by community area, it's hard to see any pattern in which kinds of neighborhoods have the association between streetlights and crime and which don't. We found two on the northwest, and there's a cluster on the South Side, but it's definitely not a hard rule that we found the association in high-crime neighborhoods. Again, our estimates are limited in that we're dealing with reported crimes, so that's a good point that you made. But this kind of justifies why the analysis we did is so important. The Department of Transportation may have some a priori policy, say, let's put prioritization on all the high-crime neighborhoods. But what our estimates suggest is that it's not a rule for all high-crime neighborhoods that there's this association. You need to figure out particular neighborhoods to focus on, not just work on all of them. And the fact that we found this variation, and no consistent pattern in the kinds of neighborhoods that had an association, really justifies why you have to do this kind of analysis.
- Yeah, absolutely. I think that's actually one of the best takeaways, in that this treatment seemed to work well in certain areas, and it bears further research, really.
- Exactly.
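The multiple-comparisons point raised earlier in this exchange can be checked with quick arithmetic. The sketch below uses the two numbers from the interview, 77 community areas and a 5% threshold, and assumes the 77 tests are independent, which real neighboring communities may not be. The Bonferroni threshold is shown only as a common stricter alternative; the interview does not say it was used.

```python
n_tests = 77   # community areas, one estimate each
alpha = 0.05   # two-sided significance threshold mentioned in the interview

# Expected number of "significant" results if there were no association at all.
expected_false_positives = n_tests * alpha
# A two-sided test splits those evenly between apparent increases and decreases.
expected_per_direction = expected_false_positives / 2

# Chance of at least one false positive, assuming independent tests.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni-corrected per-test threshold (illustrative, not from the study).
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"expected per direction:   {expected_per_direction:.3f}")
print(f"P(at least one):          {prob_at_least_one:.3f}")
print(f"Bonferroni threshold:     {bonferroni_alpha:.5f}")
```

Under these assumptions, roughly two spurious increases and two spurious decreases would be expected by chance, so a single significant decrease is unremarkable, while six significant increases is somewhat more than chance alone would suggest.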
- So if I were, I don't know if it's an alderman or some sort of decision maker in the city, I would want to take your research and use it to help me prioritize repair schedules and maybe even adjust my repair budget. I think my next question maybe lies outside the scope of your work, but just for fun: do you think there's any concern about secondary effects or collinear variables that would come into play here, where someone should take great care in making a decision based on the initial analysis?
- Yeah, this is a really key point, and it's really cool that you've caught on to this. This was something we discussed with the Chicago Department of Transportation. Let's set a thought experiment up. We fix a streetlight quickly, and we prevent crime from occurring in a specific block. That's good for that block, but what we don't know is: does that crime shift to another neighborhood, or does that crime not occur at all? What does that criminal do instead of committing that crime in that specific place? That's something we don't know, and something the city has to try to understand and grapple with before deciding what to do. It's something that an observational study like ours can't get at, just dealing with retrospective data. But with some kind of controlled experiment, you could possibly get into that issue more deeply, and the city, looking at their future repairs, could try to monitor this and see what happens with crime, working together with the police department.
- Yeah, that was going to be my next question, and you've sort of basically answered it. What's next for this line of research? Would you propose a policy change, or a controlled and, good point you made, ethical experiment, or are there further data points you would want to look at? What do you think should come next?
- Sure. I definitely think that you cannot conclude causality from an observational study very easily.
There's definitely work that could be done to confirm that we've caught on to something that's really happening. There are ethical questions anytime you do an experiment that involves crime in neighborhoods, and it could be political around Chicago. So it's not for me to say if they really should do this kind of experiment. I'd like to see finer geographic resolution. I think the block is larger than the area that's typically affected by a streetlight outage, and crime data from the City of Chicago is private at any finer resolution than that. So if somebody were able to get access to that data, or the city were to do a kind of internal study, I think that would be very valuable. A related thought is about the next steps that the city takes. Because it's not for sure a causal relationship, I don't think there are definite steps that the city can know for sure are going to make an impact. But we've identified neighborhoods where there was at least an association during this time period. You can look to these neighborhoods and try to figure out what's going on, more closely monitor what's happening, experiment with quicker repair times in these neighborhoods, and talk with the police department to see if they're seeing crime drop as a result in those areas. So those are the possible further steps that the Department of Transportation is thinking about.
- Yeah, that's really exciting. What are the next steps for you, professionally and academically?
- I just finished the third year of my PhD program in statistics at Northwestern. So I'm looking to propose my dissertation in the fall or in the winter coming up, and to graduate in about two more years. My research is actually a bit different from what I worked on with this study. I look at questions related to survey statistics, and also official statistics and government estimates, related to public policy issues. After grad school, I'm really open to a number of directions. I'm certainly interested in the academic route.
I could also be open to government, or to doing survey research at an organization. So we'll see what opportunities are out there when I'm on the job market. It's a real opportunity, with the new kinds of data sets that are emerging, things like clicks on the internet or things like the City of Chicago data portal. It's a real opportunity to use these data sets to look at these kinds of problems, to help departments with the problems they have and understand things better. There have been some novel uses of the data portals that cities maintain. I don't know of too many like ours that are using this to untangle a new relationship for the city. So I definitely look forward to seeing more in the future, and I encourage anybody who's listening and so inspired to think about it: is there a way that we can use these data sets that hasn't been done before, to really help cities and to help residents of cities? I think that there's a lot of opportunity.
- Yeah, would you have any advice for maybe an undergrad, even a high school student potentially, or a grad student who wants to get involved? What are some of the steps they could take?
- I would say, first of all, don't be afraid to get your hands dirty, just learning about the data sets and starting to do some basic summary statistics. That's the way that these projects get started. So just dive in and pick a problem that you're interested in, that you're passionate about. We used more statistical coding and statistical models in our study, but there's a lot of good information that you can provide just by doing good summary statistics on issues or crimes in a city. So I definitely encourage anyone who's inspired to think about something they're interested in and explore whether there's a data set out there on some city data portal or something similar.
- Yeah, and while I truly believe R can be for everyone, I've seen a ton of great data science done in Excel. So...
- Absolutely. Yeah, Excel is still the most commonly used tool.
I use it all the time when I'm doing basic calculations with data. Excel is a good way to go. It's a good start for working with data.
- Definitely. Lastly, I like to ask all my guests to give me two references. A reference can be anything: a book, a paper, a Twitter account, a blog. First, I ask for what I call the benevolent reference, which is something you've got no affiliation to, but would like to give a nod towards, and hopefully some publicity through the podcast. And secondly, a completely self-serving link, something that benefits you directly, whether it be a link to something at your school, or your blog, or anything you might have like that.
- So I've learned a lot from Coursera and their Data Science Specialization. Coursera, for those who don't know, runs a collection of MOOCs, which are massive open online courses. There's any number of topics on there, from social science to, you name it. And there happens to be a lot about computer programming and about data science. A new thing, which I think is just a few months old, is that they have a Data Science Specialization. That's a collection of courses from these guys at Johns Hopkins: Roger Peng, Jeff Leek, and Brian Caffo. Great lectures on programming skills. So R, again, is the statistical software language. I've taught myself things like creating apps from R, using something called Shiny, so an app where there's R code underlying it, or making neat publications with R Markdown in RPubs. You can take these courses for credit if you want. I'm just kind of watching the lectures that are most useful to me. I also learn best by watching lectures, so that's been great. So I highly recommend Coursera's Data Science Specialization for somebody who wants to learn more about data science and statistical programming.
- Yeah, great one.
- For something that's self-serving for me, I really recommend the Data Science for Social Good blog.
Again, Data Science for Social Good was the organization that I did my fellowship with, through which I did this project. You can go to the website, DSSG, so Data Science for Social Good: DSSG.UChicago.edu. There you can learn about the different projects that people are doing, and the blogs, and each project links to a GitHub repo that has the underlying code that was used for that project. So you can dive in and learn more about the project. And I think they're doing a lot of great work, both to train data scientists and to work on meaningful projects that show how data science can be used to help people in their everyday lives. So it's really neat work, and that's a fun place to look at data science projects.
- Yeah, very exciting stuff going on there, for sure. Well, awesome, Zach. This has been really great. Thank you so much for joining me and for the work you and your team did on this project. I'm really glad to learn about it, and I hope everybody listens and enjoys as well.
- Thanks so much, Kyle. It was a great conversation with you. I wish you the best of luck, and thanks for having me on the podcast.
- Thanks.
- Thank you for listening to the Data Skeptic podcast. For show notes or other information related to the show, please visit our website at www.dataskeptic.com. Follow us on Twitter @dataskeptic. If you enjoyed the program, leave us an iTunes review and help others find us.