Claudia Perlich is the Chief Scientist at M6D, a marketing technology company that does targeted online display advertising. She previously worked with the Predictive Modeling Group at the IBM T.J. Watson Research Center. Claudia is a three-time winner of the KDD Cup. She received her Ph.D. in Information Systems from NYU Stern in 2004.
Metamarkets: How did you land on data science as a career? Did you just sort of fall into it, or did you have an idea that this was what you wanted to do?
Claudia Perlich: I’ve always been reasonably good at math and logic. On the other hand, I wasn’t terribly excited by really abstract topics. I can do math if I have to, but in its abstract form it doesn’t excite me. In the same way, I can’t really get excited by chess games. I think males are much more attracted by the purity of the discipline. I just can’t really get excited about something that’s not real.
That said, I ended up in computer science. I ran into a guy in Boulder, Colorado, who taught a course on neural networks. Through him, I got interested in machine learning. And that’s when I started to get hooked on having data actually tell you something about the real world, which is what fascinates me. Then I did my German master’s thesis around data and I came back for more, to get my Ph.D. in data mining, just because I loved the combination of doing something that’s somehow real, connected to something that actually happened. It’s different from writing compilers, or even GUIs. By the same token, I’m not a terribly good programmer. I’m not an industry-trained coder who can produce production code. I don’t have the dedication to detail to deal with all of that overhead. I’m more of a Perl hacker who gets jobs done, and I combine that with fiddling with data. It’s like a big puzzle. You just try to understand what they put in front of you, because every data set has its secrets and little dirty things they don’t really want you to know, and that’s exactly what excites me.
Metamarkets: What’s your day-to-day work as a data scientist like?
Claudia: Today I’m taking a quick look at a lot of reports that come in from our model to just see if my babies are behaving. It’s like a system check for whether the things that are supposed to be doing something are actually doing the right thing. We have a lot of different QA [quality assurance] processes in place that report back the status of different models and just give me the impression that things are alright. So that’s just a little bit of supervision. I spend some time on just starting up analytical packages or analytical jobs. We build analytical reports for our customers on what the model found in terms of interesting patterns. This process isn’t fully automated, so I make sure the models are doing what they should be.
Some part of my day goes to trying to improve the current modeling methodology that we have. I’m doing display advertising, so for instance, we have a model that’s supposed to help us estimate the correct bid price. You have to decide at certain moments in time how much you want to bid to show an ad for a given company to a specific person. There are models that estimate how good a prospect that person is, whether he’s a really good candidate for running shoes, say. But then the question is also, what is he doing right now? Is he reading his email on Yahoo, is he sitting on Facebook, is he reading a blog about the New York Marathon? So we have different layers of models that then kick in to say, should we pay the usual price or should we pay more? I’m constantly fiddling to see if we can add additional features or information to those models, and checking whether the information we get from the real-time bidding systems is clean. There’s a lot of gaming involved. What they send you is not necessarily what it really is. I’m also spending some time taking the worst of my models that are running to see if I can make them a little bit better. This usually means doing some prototyping, pulling some data from somewhere, trying to build a different type of model, then comparing it to the existing model.
And then, sometimes, you’re just thinking about new ways of using models, just coming up with something cool that somebody might find useful. You’re looking around and saying, “Let me try to understand better how our internal system works and if there’s anything that we could improve upon.” We have a lot of internal brainstorming, when a colleague comes to me and says, “I’d love to evaluate the impact that this model has on our process. What would you recommend? How should I do this?” And then we just sit around and debate what we think might work best. Even if I don’t execute all of it, we talk a lot about what different people are engaged in, and I like that part of it as well.
Metamarkets: Give us a favorite example of how data science techniques improved a company or an organization’s practices.
Claudia: One thing that I feel somewhat proud of is a piece of work I did for IBM. We had some effort around wallet estimation, where we tried to build models that estimate the opportunity or the potential of a customer and what percentage of the IT spend we currently capture. We looked for customers that we already had somewhat good relationships with, but where we were still not capturing anywhere close to 90% of the budget. That’s really where you want to try and push a little harder. So we built a model for them, and it took a couple of years before it was used on a larger scale. We started the first year just in America, then the year after we included Europe, and about five years down the road it touched about 99% of IBM revenue. I would argue that this really had an impact on a pretty large scale: something like 30,000 salespeople at IBM were affected one way or another by this piece of analytics that we put together.
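For readers unfamiliar with the term, the arithmetic behind wallet share can be sketched in a few lines of Python. This is only an illustration of the idea described in the answer above, not IBM's actual model: the `Customer` fields, the function names, and the sample figures are all hypothetical, and the estimated wallet is assumed to come from some separate predictive model. The 0.9 default mirrors the "90% of the budget" remark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Customer:
    name: str
    current_revenue: float   # what the vendor already bills this customer
    estimated_wallet: float  # modeled total IT spend (assumed to be given)


def wallet_share(c: Customer) -> float:
    """Fraction of the customer's estimated spend that is already captured."""
    if c.estimated_wallet <= 0:
        raise ValueError("estimated_wallet must be positive")
    # Clamp at 1.0 in case billed revenue exceeds the modeled estimate.
    return min(c.current_revenue / c.estimated_wallet, 1.0)


def growth_candidates(customers: List[Customer],
                      max_share: float = 0.9,
                      min_revenue: float = 0.0) -> List[Customer]:
    """Existing customers whose captured share is below max_share,
    ordered by the largest uncaptured spend first."""
    picked = [c for c in customers
              if c.current_revenue >= min_revenue and wallet_share(c) < max_share]
    return sorted(picked,
                  key=lambda c: c.estimated_wallet - c.current_revenue,
                  reverse=True)


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    book = [
        Customer("a", 50.0, 100.0),
        Customer("b", 95.0, 100.0),
        Customer("c", 10.0, 200.0),
    ]
    for c in growth_candidates(book):
        print(c.name, round(wallet_share(c), 2))
```

Here `min_revenue` stands in, very crudely, for "customers we already have a good relationship with"; the hard part in practice is estimating the wallet itself, which this sketch simply takes as an input.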
Metamarkets: What do you think the biggest challenge is that data scientists face today?
Claudia: I suspect that depending on the breed of data scientist, you will get very different answers for this one. I actually think that the technical challenges have gotten easier. In the old days, you had a hard time finding somebody who could do something useful with 2 GB of data. That’s different now, just because the tools have grown dramatically and made things a lot easier.
My challenge is still trying to communicate to the powers in charge what can and cannot be done with data. I think there’s a breed of people who understand data and know what to do and then there’s this huge expectation that people have for what data should be able to do. Some people have expectations that are too low; they just don’t get certain aspects of it. And others have expectations that are too high, believing that just because you have data you can answer any question. I spend time working with people, trying to understand what they expect to see, and helping them understand what they can realistically hope for or how long it takes to get there. Even though a lot more people are embarking on big data, you almost have a backlash coming with it, a certain disappointment, because it’s very hard to explain what exactly is possible.
Metamarkets: Where do you see the field of data science heading in the next decade? Do you anticipate it growing a lot, and needs changing?
Claudia: The practical field has an extreme demand for people who can do something with data. I certainly see education catching up, for instance with programs at Columbia and NYU. I think there’s a tremendous amount of demand, just judging by the number of headhunters calling me each week. I don’t think that will subside any time soon. I think we will be behind in education.
The biggest challenge, in my opinion, on the supply side, is that evaluating a data scientist is really hard. Quality control around data science is incredibly difficult. Even myself, if I build a model that predicts something, I have a hunch, but I don’t know how good it is. And then you ask me to evaluate somebody else’s work, where I only get exposed to about 5% of what the person really did. It’s impossible for me to judge how good a job the other person did. And that makes it extremely hard to evaluate candidates as well. To actually figure out whether they are really good with what they are doing is incredibly hard. And that’s one of the challenges that the educational piece has to somehow solve.
I think that the demand for good data scientists certainly will stay up. It seems that technology is far ahead. They can store anything and everything, but you’re still kind of feeling that nobody really knows what to do with all the stuff. You have a lot of vendors who tell you “Oh, it’s all very easy. You just integrate X, Y, and Z with your Hadoop, and then all works beautifully.” Well, honestly it doesn’t. So, the technology kind of promises things at this point that don’t really work without some good data scientist in the loop. Or at least, I’m not convinced that it’ll ever work without a good data scientist in between. I do anticipate some degree of backlash when people realize it takes more than fancy software to really make sense of the stuff. Finding the right people is much harder than buying fancy software.
Metamarkets: What’s a current project that you’re excited about?
Claudia: I’m currently really excited about all the analytical challenges and opportunities that we see in the real-time bidding systems. I work in online display advertising. Every time you are reading some blog and an ad shows up, chances are there was an auction in which your browser sent a request to an ad exchange. All of us (and there are a lot of firms like Media6Degrees) have to think about whether we want to show you an ad, and then in real time we submit a bid. The technology’s already amazing; just think about what happens in that short period of time. For me, on the analytics side, the question is how I honestly evaluate such an opportunity. I think that’s really fascinating. I’m not sure whether it’s going to make such a big business impact. I think there’s a lot of money in it, but at the end of the day, what does it really matter whether I show you an ad or not? To you, probably very little. To me, a little bit more. But it’s really the scale that makes all this fascinating. And of course there are a lot of incentives for people to try to take advantage of this system, because there’s a lot of money to be had. The moment email was born, spam wasn’t far behind, right? I think you have the same thing happening now in the display targeting system. I really find that piece of technology absolutely fascinating right now.
Starting this week, Metamarkets is kicking off a series of data scientist profiles, acquainting a broader audience with the field of data science and the work of data scientists. Thank you to Claudia Perlich for being such an interesting subject! Check back next week for a profile of Drew Conway, Scientist-in-Residence at IA Ventures.