Open Data Spotlight: The Global Terrorism Database

Publishing data on Kaggle is a way organizations can reach a diverse audience of data scientists with an enthusiasm for learning, knowledge, and collaboration. For Dr. Erin Miller of START, the National Consortium for the Study of Terrorism and Responses to Terrorism, making her organization's Global Terrorism Database available for analysis by Kaggle users has brought new awareness to their cause.

In this Open Data Spotlight, Erin discusses how setting aside agendas and focusing on understanding this unparalleled dataset of over 150,000 attack events allows users to undertake constructive analyses that may defy common conceptions about terrorism. Read on to learn more about the Global Terrorism Database project and the ways users of open data can make valuable contributions to the organizations that make them possible.

I’m a criminologist at the University of Maryland, and I’m currently a Program Manager for the Global Terrorism Database (GTD) project at START. My role started out (more than 12 years ago) as a graduate assistant cleaning raw data, and now I manage the project team, workflow, resources, and interaction with end users and related research projects.

START was created in 2005 by the US Department of Homeland Security, Office of University Programs as a Center of Excellence. The idea of the COE program is to have multidisciplinary university-based researchers focused on issues related to homeland security, and START’s organizing framework is social science. We develop research, training, and educational resources on the human causes and consequences of terrorism.

The GTD is an event-level database on terrorist attacks that have occurred worldwide, dating back to 1970. The “story” of GTD collection is a long one, but the short version is that it currently includes data on more than 150,000 attacks, with more than 100 variables describing when and where the attack happened, who the perpetrators and victims were, what tactics were involved, and what the outcome of the attack was… to the extent that this information is known. It’s based entirely on unclassified information–almost always media reports. Data collection is ongoing and we currently update the database annually.

With the expansion of online media, we’ve developed what we call a “hybrid” data collection strategy. We leverage automated processes (natural language processing, machine learning models) to sift through millions of news articles each month, and human processes (reading thousands of articles about terrorist attacks) to create entries in the database and maximize its accuracy.
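
To make the hybrid idea concrete, here is a minimal sketch of an automated triage step that narrows a stream of articles down to a queue for human readers. It is not START's actual pipeline: the keyword weights stand in for a trained relevance model, and the names `Article`, `relevance_score`, `triage`, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical keyword weights standing in for a trained relevance model.
# A real system would use an NLP classifier; naive substring matching like
# this would also flag phrases such as "heart attack".
KEYWORD_WEIGHTS = {
    "bombing": 0.6,
    "explosion": 0.5,
    "hostage": 0.5,
    "attack": 0.4,
}


@dataclass
class Article:
    title: str
    body: str


def relevance_score(article: Article) -> float:
    """Toy score: sum the weights of the keywords present, capped at 1.0."""
    text = f"{article.title} {article.body}".lower()
    score = sum(weight for kw, weight in KEYWORD_WEIGHTS.items() if kw in text)
    return min(score, 1.0)


def triage(articles, threshold=0.5):
    """Automated step: keep articles scoring at or above the threshold.

    The returned list is the queue a human analyst would read before
    deciding whether to create a database entry.
    """
    return [a for a in articles if relevance_score(a) >= threshold]


if __name__ == "__main__":
    sample = [
        Article("Bombing reported downtown", "An explosion damaged a market."),
        Article("Local team wins final", "The match ended 2-1."),
    ]
    for article in triage(sample):
        print("Needs human review:", article.title)
```

The design point is the division of labor: the automated filter only decides what is worth reading, while the decision to record an event stays with a person.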

Making the GTD available to users has always been a major priority, for both principled and practical reasons. We initially spent a few years digitizing and cleaning tens of thousands of hand-written data records, but ever since the data were presentable we’ve had the GTD posted on START’s website. We've seen a growing interest in objective data on this important topic, and there’s certainly far more potential for the analytical community at large to generate important findings than if we had kept it to ourselves for the last 10 years.

In addition, for any data collection project, transparency is critical. It’s important that people understand and can actually see how the data are collected and what individual records look like, in order to promote smart usage and credibility. Finally, making the data available is great for the quality of the dataset itself. The best way to improve the accuracy of the data is to have eyes on it, flagging potential problems for review.

Two things. First, Kaggle’s platform has much cooler functionality than we do. Allowing users to do custom analysis and then share it with other users is really powerful and promotes collaboration and knowledge growth.

Second, although we’ve shared our data for about a decade on START’s website, our user community seems to overlap with Kaggle’s user community only a little bit.


