Drew Conway is Scientist-in-Residence at IA Ventures and a Ph.D. candidate in Politics at New York University. His research focuses on the modeling of social systems and the behavior of groups, particularly with regard to conflict and terrorism. Drew has a background as an analyst in the U.S. intelligence and defense communities. He is also the organizer of the New York City R Statistical Programming Meetup, which meets monthly and has over 1,000 members.
Disclosure: IA Ventures is an investor in Metamarkets.
Metamarkets: What is your day‑to‑day job as a data scientist like?
Drew Conway: I would say that I wear two hats. One is as a PhD student in political science at NYU, where I’ll be finishing up my dissertation in the fall. I’m very interested in studying human behavior and trying to understand that in aggregate. I think we basically have the tools to do that analysis now because of the large amount of data that’s being generated. We have the language of mathematics and statistics to actually get in there and understand a bit about what’s going on. In my academic work, I’m really interested in networks.
I also worked for the Department of Defense doing counter‑terrorism work. They didn’t call it data science then, they called it computational social science. It’s sort of a rebranding. I tried to understand groups’ behavior and how they would evolve over time through social networks, and I carried over a lot of that work into my academic work.
I’m a student part of the time, and the other part of the time I work as a Scientist-in-Residence at IA Ventures, which is a venture capital firm. This is kind of a data scientist’s dream job because I get to work with all the portfolio companies in lots of different ways. I’m always thinking about data and how the companies use data to improve the way they’re running their business. IA Ventures is a venture capital fund that has a thematic focus on finding companies that have defensible data access, so they use data to create their product. It cuts horizontally across lots of different industries.
I work with companies in a number of different ways: being the initial power user, helping them refine their model, or using a product the same way a client might. It ranges, but really the way I like to describe it to people is the chief sandbox player. I’m in all the companies’ data one way or another, and I get to play around. It’s nice for the companies to have somebody who has the ability to do that and say, “That worked, that didn’t work,” and then brainstorm with them.
Metamarkets: Could you talk about what you did in the federal intelligence community, and about the application of data science in that area?
Drew: The intel community has a lot of different agencies and I spent my time primarily at a few of them. My time would be split between doing two things. One would be what I call direct tactical analysis. This would be a structural analysis of a group we were studying, with deadlines because I supported a lot of troop operations.
It could also be giving analysis about individuals and what might happen if a certain individual was taken into custody, how might the group reconstitute. Who should we be targeting for collection or interrogating? Who would give us the most information given where they are in the organization? It was very tactical. I would be working a 24-hour mission cycle.
The other side of my work was much more research‑based. My organization was always interested in answering the question of how leadership transition happens in organizations that are networks. There’s not a clear managed control structure. I looked at groups where many of the members don’t know each other, so there was not a clear transition of power when somebody gets removed or killed or whatever. That was much more scientific research where we were experimenting and using data in a more academic way.
My work flipped between those two. I’d say mine was probably more on the research side than the tactical side, although the team I worked on was certainly much closer to 50/50.
Metamarkets: How exactly did you become a data scientist? Have you always been a technical person?
Drew: Yeah, in my undergrad I was actually a computer science and political science double major. I’ve always been interested in programming and understanding computers. When I got further down in my education, towards the end of undergrad, I realized that I was much more interested in answering questions that came out of social science, because I was more interested in human behavior, using tools from computer science and math and statistics as a means to an end. I was interested in using those tools because I thought they were valuable in representing all of the questions I was interested in, especially around social networks and the way that information was moving through these networks. Very often the tools in math and computer science were representing that kind of data, so it was this natural combination.
In the federal government, I really cut my teeth on becoming a data scientist, learning on the job the tools and how to apply them properly. It was when I left work and became a graduate student that this idea of a data science community became more apparent to me and then I started to self‑identify with what I was doing as a new discipline. I have this idea that data science is fundamentally about understanding human behavior. We actually have the ability to study humans in a way that we were never able to do before because we’re collecting so much data online. That’s why companies like Google and Facebook and Twitter are synonymous with data science, because they’re the keepers of all the data on people. When I realized that connection I understood and gravitated towards this idea of data science.
I’d be perfectly happy calling myself a computational social scientist or just a social scientist. But I think there is value in identifying that way, and that’s how I think I became part of that community of data scientists.
Metamarkets: What do you think is one of the biggest challenges that data scientists face today?
Drew: That’s a tough question. I actually think that a lot of the conversation right now in data science is very focused on engineering‑type questions, which we’re pretty good at specifying and solving in a specific way. The thing that I think data scientists are not so good at is telling the story about data science in a way that describes the challenging sociological questions that we really need to be focusing our attention on and spending much more time thinking about. So again, this gets back to the thesis that I have, that data science is really about understanding human behavior and trying to find interesting patterns in it so that we can inform lots of problem areas that we haven’t been able to address yet: things like social policy, health care and medicine, and local, national, and international policies about national security and war and peace. We haven’t really addressed those before.
I think that data science, if we do it right, will allow us to enter into those areas that we haven’t been able to yet. But the problem is that we haven’t really pushed the conversation in that direction yet. We’re still focused on engineering problems, and we need to be focused on social problems, because now is the time, now we have the ability to study them because we have the data and the tools are there.
Recently, I think health care’s probably a natural bridge for this, because there are clear avenues to profitability, but there are also clear avenues to how that actually makes society better as a whole. It’s less clear to me how you take that same motivation and turn it to industries like education and other social welfare kinds of things or local government things, where we should be focusing our attention. It’s less clear what the business proposition is there. I’m hopeful that much of that will happen. I think if there is a certain amount of success in places like health care that it will happen, because people will see it as valuable.
But again, I think there isn’t a lot of energy within the community of data scientists right now to tell the story from that lens. It’s much more about telling the story through tools. There are people who talk about it, and I count myself among the people who try to talk about it in other ways. It’s just about collecting momentum behind that and then having a larger platform.
Metamarkets: What’s a current project you’re working on that you’re excited about?
Drew: There’s this classic problem in political science with trying to quantify the political ideology of parties based on the texts of their manifestos. So, we could assign a number between ‑1 and 1 based on how socially liberal or socially conservative they are. The way that this has been done traditionally in political science is to have experts code political party platforms. So let’s say that we want to understand the political ideology of some party in Estonia. You would find some Eastern European politics expert and send them a copy of the document with some basic rubric for how to score it. And then they would read through the document and take some notes in the margin and score it, and then send back that score.
There’s a project called the Comparative Manifesto Project, which has been around since the 70s, that’s done that. And it’s been a really useful source of data for the discipline, but of course it’s fraught with error in the sense that there’s a tremendous amount of bias in it because, A, you’re using experts, and B, you’re only using one expert, and that expert has lots of historical context that they bring to the table when they read the document. So they might read one sentence and know, or at least think based on the context of what they know, oh, this is a liberal statement. But the words themselves may not have that context to them.
So, the experiment that I’m running right now is to say, can you actually do the same thing, but have the documents coded by non‑experts not as entire documents, but as little chunks of text? Then aggregate up all those chunks of text to get a mean, then have a more accurate reading, actually, of the ideologies. In the preliminary experiments where I’m comparing the expert coding to the crowd coding, the crowd actually does really well, which is an interesting phenomenon where the crowd obviously has a much higher variance. So, if 10 people code the same thing, you might have a much fatter distribution over the scores.
But the mean is actually very, very close, significantly close to what the experts are coding. This is a very encouraging thing, because we can do the same coding but do it much faster and much more cheaply, and probably much more accurately because they’re not bringing in all the bias. So, that’s what I’m working on now that I’m pretty excited about.
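Editor's note: the aggregation Drew describes can be sketched in a few lines of Python. This is only a minimal illustration, not his actual pipeline. The function name, the chunk identifiers, every score, and the expert value below are made up for the example, and averaging within each chunk before averaging across chunks is an assumption about how the chunks would be combined.

```python
from statistics import mean, pstdev


def crowd_document_score(chunk_scores):
    """Aggregate non-expert codings into one document-level score.

    chunk_scores maps a chunk id to the list of scores (each in [-1, 1])
    that different crowd coders gave that chunk. Hypothetical helper.
    """
    for chunk_id, scores in chunk_scores.items():
        if not scores:
            raise ValueError(f"no scores for {chunk_id}")
        if any(not -1.0 <= s <= 1.0 for s in scores):
            raise ValueError(f"score out of range in {chunk_id}")
    # Average within each chunk first, so a chunk coded by more people
    # does not count for more than a chunk coded by fewer people.
    chunk_means = {cid: mean(scores) for cid, scores in chunk_scores.items()}
    document_score = mean(list(chunk_means.values()))
    return document_score, chunk_means


# Illustrative, invented codings for one manifesto split into three chunks.
crowd = {
    "chunk_1": [0.5, 0.25, 0.75, 0.5],
    "chunk_2": [-0.25, 0.25, 0.0, 0.0],
    "chunk_3": [1.0, 0.5, 0.0, 0.5],
}
expert_score = 0.3  # invented single-expert score for the same document

doc_score, chunk_means = crowd_document_score(crowd)
# Spread of the individual crowd codings (the "fatter distribution").
all_scores = [s for scores in crowd.values() for s in scores]
spread = pstdev(all_scores)

print(f"crowd mean: {doc_score:.3f}")
print(f"expert:     {expert_score:.3f}")
print(f"difference: {abs(doc_score - expert_score):.3f}")
print(f"crowd spread (std dev): {spread:.3f}")
```

The point of the sketch is the shape of the comparison: individual codings can vary widely, while the aggregated mean is what gets compared against the expert's single score.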
Look to Metamarkets next week for a profile of Pete Skomoroch, Principal Data Scientist at LinkedIn.