Mining a trove of text
With his innovative method for analyzing language, political science student Andrew Halterman maps civilian deaths in Syria
Few students boast as precocious a start in their field as Andrew Halterman. At age seven, Halterman accompanied his mom, a political scientist, on a research trip to Bosnia. It was just a few months after the ceasefire in that region's civil war. "I learned all about the conflict and ethnic cleansing," he says.
With dinner conversations revolving around this kind of fieldwork and the academic world at the University of Oklahoma, where both parents served on the faculty, Halterman was hooked. "I became focused on a life in political science at a pretty young age, and by high school, it seemed like a natural path," he says.
Today, the fourth-year doctoral student in political science is pursuing an ambitious two-part research agenda, deploying an original computational strategy to answer questions about the casualties of war.
Halterman's methodology involves a new way of analyzing large collections of written communications—whether newspaper stories or social media postings. It offers him, and potentially other researchers, a way of ferreting out connections between people and places that might otherwise remain concealed.
"I take innovative tools from computer science—different ways of representing sentences and analyzing word order—and assemble them like Lego pieces to link events and locations," says Halterman. "This approach is something new for political scientists, and will make possible new techniques."
In testing out his approach, Halterman chose a pressing contemporary problem. "I am interested in political violence, and I wanted to understand the vast number of civilian casualties in the Syrian conflict," he says. "There are a lot of theories why armed groups kill civilians, and I wondered if these theories applied to a war like Syria."
To answer this question, Halterman collected data from 2011 to 2016 comprising news texts from international wire services and local newspapers, along with Syrian information posted online concerning civilian deaths. "I don't think there had ever been a data set like this, which gave us a pretty good sense of where and how most civilians were actually killed," he says.
Halterman then trained a natural language processing program he had designed to parse the nearly 10,000 sentences, seeking in particular to link verbs such as "attack" and "advance" with location words. To ensure the accuracy of his model, he included non-Syrian text from Wikipedia and The New York Times.
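For readers curious what linking a verb to a location word can look like in code, here is a deliberately simple sketch. It is not Halterman's trained model: it swaps in a nearest-word heuristic, and the word lists, the function name link_events_to_locations, and the example sentence are all made up for illustration.

```python
import re

# Hypothetical word lists for illustration only; a real system would
# learn these associations from annotated text rather than fixed sets.
EVENT_VERBS = {"attack", "attacked", "advance", "advanced"}
LOCATIONS = {"aleppo", "homs", "damascus", "idlib"}


def link_events_to_locations(sentence):
    """Pair each event verb with the nearest known location word.

    Distance is measured in tokens. This is a toy heuristic standing in
    for a trained model; it ignores grammar entirely.
    """
    tokens = re.findall(r"[A-Za-z]+", sentence)
    lowered = [t.lower() for t in tokens]
    loc_positions = [i for i, t in enumerate(lowered) if t in LOCATIONS]

    links = []
    for i, tok in enumerate(lowered):
        if tok in EVENT_VERBS and loc_positions:
            # Choose the location with the smallest token distance to the verb.
            nearest = min(loc_positions, key=lambda j: abs(j - i))
            links.append((tokens[i], tokens[nearest]))
    return links


# Invented sentence, not drawn from the Syria data set.
example = "Rebels attacked a checkpoint near Aleppo while troops advanced toward Homs."
links = link_events_to_locations(example)
print(links)  # [('attacked', 'Aleppo'), ('advanced', 'Homs')]
```

A nearest-word rule like this breaks down quickly on longer sentences, where the closest place name often belongs to a different event. That is the kind of error that attention to sentence structure and word order is meant to reduce.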
One conventional theory about civilian loss in conflicts suggests "that civilian casualties are like a bell curve: if civilians are right next to the front line, casualties are low, then as you move away, they go up, and then as you go further still, the numbers go down again," Halterman says. "But I didn’t find evidence for this idea, showing that in Syria, the closer you are to the front line, the more killing of civilians there is."
Halterman's research links civilian deaths to one particular cause:
"What explains most violence in Syria is the government's deliberate targeting of civilians," he says. "Some deaths plausibly occur as collateral damage—artillery and barrel bomb strikes against rebel fighters—but those areas where the regime saw the greatest threat saw the highest rates of violence."
Deep dive into data
Halterman has long been concerned with conflict and its consequences. His thesis at Amherst College examined the efficacy of American aid in the Balkans during and after the war. On a research trip, he visited Belgrade, Sarajevo and Pristina, observing the flow of American and UN personnel. "I realized the power of asking people about what they think is going on, and how official statistics can be inaccurate and really misleading," he says.
On a Fulbright Fellowship to Kosovo after Amherst, Halterman was able to flesh out some of the themes of his undergraduate research, looking at the mechanisms of local and international NGO partnerships. This research launched him into a series of formative opportunities in Washington, D.C., at government-related think tanks and consultancies.
"I was introduced to working in offices on a team—understanding how that world works," he recalls. The government agencies he assisted looked at such problems as developing effective counterinsurgency plans for Afghanistan. "These jobs also introduced me to quantitative approaches to problems, as we developed new software to help analysts find meaning in large datasets," he says.
Halterman took online courses in Python and R and taught himself programming, so he could help academics and government officials visualize questions. "I'd ask, 'What do you need to do your job, and does this kind of software make your job easier?'"
After several years, "I was ready to go back to grad school," says Halterman. "I'd learned a ton of technical skills but had gotten away from substantive problems." He chose MIT's graduate program in part because of its rigorous methods sequence. "I had no formal training formulating research questions, or a good process for answering them."
Today, Halterman continues to develop his data tools in the service of political science. "I'm motivated to make things better for others in my field," he says. He has published his event extraction software on open source websites. "My method can be useful for a range of purposes—looking at protests, attacks, meetings, speeches, finding relationships between people and places," he notes.
As he closes in on his degree, Halterman is polishing his Syria paper before turning his attention to text analysis of conflicts in other countries. He draws considerable support from MIT's political science graduate student community. "We get together for dinner, invite a faculty member over, and through the years I've developed really close friendships," says Halterman. "I'm surrounded by people who read drafts and give advice, which helps make me a better scholar."