Publishing data on Kaggle is a way organizations can reach a diverse audience of data scientists with an enthusiasm for learning, knowledge, and collaboration. For Dr. Erin Miller of START, the National Consortium for the Study of Terrorism and Responses to Terrorism, making her organization's Global Terrorism Database available for analysis by Kaggle users has brought new awareness to their cause.
In this Open Data Spotlight, Erin discusses how setting aside agendas and focusing on understanding this unparalleled dataset of over 150,000 attack events allows users to undertake constructive analyses that may defy common conceptions about terrorism. Read on to learn more about the Global Terrorism Database project and the ways users of open data can make valuable contributions to the organizations that make it possible.
I’m a criminologist at the University of Maryland, and I’m currently a Program Manager for the Global Terrorism Database (GTD) project at START. My role started out (more than 12 years ago) as a graduate assistant cleaning raw data, and now I manage the project team, workflow, resources, and interaction with end users and related research projects.
START was created in 2005 by the US Department of Homeland Security, Office of University Programs as a Center of Excellence. The idea of the COE program is to have multidisciplinary university-based researchers focused on issues related to homeland security, and START’s organizing framework is social science. We develop research, training, and educational resources on the human causes and consequences of terrorism.
The GTD is an event-level database on terrorist attacks that have occurred worldwide, dating back to 1970. The “story” of GTD collection is a long one, but the short version is that it currently includes data on more than 150,000 attacks, with more than 100 variables describing when and where each attack happened, who the perpetrators and victims were, what tactics were involved, and what the outcome of the attack was, to the extent that this information is known. It’s based entirely on unclassified information, almost always media reports. Data collection is ongoing and we currently update the database annually.
With the expansion of online media, we’ve developed what we call a “hybrid” data collection strategy. We leverage automated processes (natural language processing, machine learning models) to sift through millions of news articles each month. We then rely on human processes (reading thousands of articles about terrorist attacks) to create entries in the database and maximize its accuracy.
Making the GTD available to users has always been a major priority, for both principled and practical reasons. We initially spent a few years digitizing and cleaning tens of thousands of hand-written data records, but ever since the data were presentable we’ve had the GTD posted on START’s website. We've seen a growing interest in objective data on this important topic, and there’s certainly far more potential for the analytical community at large to generate important findings than if we had kept it to ourselves for the last 10 years.
In addition, for any data collection project, transparency is critical. It’s important that people understand and can actually see how the data are collected and what individual records look like, in order to promote smart usage and credibility. Finally, making the data available is great for the quality of the dataset itself. The best way to improve the accuracy of the data is to have eyes on it, flagging potential problems for review.
Two things. First, Kaggle’s platform has much cooler functionality than ours. Allowing users to do custom analysis and then share it with other users is really powerful and promotes collaboration and knowledge growth.
Second, although we’ve shared our data for about a decade on START’s website, our user community seems to overlap with Kaggle’s user community only a little bit.