Hadley Wickham is Metamarkets’ inaugural Data Scientist in Residence. He is an Assistant Professor of Statistics at Rice University and is the creator of popular R packages including ggplot2 (which was used to create the visualization above), plyr, and reshape. Hadley’s research focuses on data analysis and the development of visualization tools. He earned his Ph.D. from Iowa State in 2008, writing a thesis titled “Practical tools for exploring data and models.”
Metamarkets: What kind of stuff have you been working on recently?
Metamarkets: Is that the sort of thing you’ve been helping people at Metamarkets with?
Hadley: Yeah, partly I’ve been talking to people about the Metamarkets visualizations. I design tools that allow experts to make visualizations, where you want to give them as much freedom as possible, so the Metamarkets visualizations, obviously, are pretty different. I’ve been talking to people about the infinite pivot idea and these types of graphics: how, given the types of inputs, they can come up with the best graphics. One thing I discussed with a front-end developer here that I thought was quite cool is this idea that you can come up with a good default if you know what types of data are going into the plot. You can do even better if you know something about the questions they’re interested in. Are they interested in comparing within companies or across companies? Trying to figure out what questions people ask is, I think, really important in terms of automatically coming up with the best plots.
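Editor's note: the idea of picking a default graphic from the types of the inputs, and then refining it by the question being asked, can be sketched roughly as below. This is only an illustration, not Metamarkets' or Hadley's actual method. The function name `default_plot`, the type labels ("time", "continuous", "categorical"), the `compare` argument, and the particular chart chosen in each case are all assumptions made for the example.

```python
def default_plot(x_type, y_type, compare=None):
    """Suggest a default chart from the types of two variables.

    All type labels and chart choices here are illustrative assumptions.
    """
    kinds = {x_type, y_type}  # order of the two variables does not matter
    if kinds == {"time", "continuous"}:
        chart = "line chart"
    elif kinds == {"continuous"}:
        chart = "scatter plot"
    elif kinds == {"categorical", "continuous"}:
        chart = "box plot"
    elif kinds == {"categorical"}:
        chart = "heat map of counts"
    else:
        chart = "table"  # fall back when no rule matches

    # Knowing the question lets the default do better (assumed mapping):
    # within-company comparisons get one panel each, across-company
    # comparisons share a single panel.
    if compare == "within":
        return chart + ", one panel per company"
    if compare == "across":
        return chart + ", all companies overlaid in one panel"
    return chart


print(default_plot("time", "continuous"))
print(default_plot("continuous", "categorical", compare="within"))
```

The point of the sketch is the two-stage decision: the data types narrow the choice to a sensible chart, and the question (within versus across) decides how the groups are laid out.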
Metamarkets: What’s been particularly challenging during your month here?
Metamarkets: Is that something you would ever imagine creating, a guide to R like that?
Hadley: Yeah, I think it’s not quite there in R. There are bits of R that are still its own unique craft, but I like tools. I think tools that have a theoretical framework are really important. Even if you don’t explicitly think about the theory, there is some underlying, organizing theory. It just makes it better than a collection of special cases. A lot of my work has been driven by teaching. I’ll try and teach a subject. If it’s just special cases, then all you can do is memorize all these special cases, which is not a particularly satisfying intellectual effort.
In R, there are three different ways of doing object‑oriented programming. Basically, none of them is the clear winner. In code in the wild, people generally pick one of the three, but if you’re working across packages there will be mixtures. Somebody will even mix them within a package. It’s just complicated. You need to know so many different things to understand how R works.
Metamarkets: What’s your own background? How did you make it here to Metamarkets, as our first Data Scientist in Residence?
Hadley: In college, I actually started off in medical school. In New Zealand, you go into med school straight out of high school. I’d been doing it for over three years and basically realized I had no interest in being a doctor. One big thing was that med school was basically just about memorization. The human body is so complicated that there aren’t that many general patterns. You just memorize, memorize, memorize. I switched to computer science and statistics, which I enjoyed in high school. I like programming okay, but I really appreciate theoretical computer science, statistics, and visualization. I asked my professors, “Where would be good places to do this?” They suggested a few, and I ended up at Iowa State. Dianne Cook and Heike Hofmann were both really into visualization, and that was really influential. The other thing that happened at Iowa State was I had an assistantship where I did statistical consulting, so people doing Ph.D.s in the departments would come to us with their problems and we’d help them solve them.
That really drove home to me that one of the biggest problems of applied statistics or data analysis is getting the data in the right form. It’s hugely painful. And you don’t need terribly sophisticated statistical tools, but you do need to show people what’s going on with their data, and then help them. Once you’ve got that, then you can tie it into some models.
The other thing with consulting is it’s always this balance between the right thing and the thing that your client will actually understand, because if they don’t understand it, it’s useless. And that’s basically what I’d call data science now. In data analysis, you come with a dataset and you might have some questions. Often the initial questions aren’t the most interesting or the right questions for the dataset. As well as answering those questions, you’ve got to iterate and think about what other questions to ask. Interestingly, one of the things I found most useful from med school was that we got trained in how to take a medical history, like how to do an interview. Really, there are a lot of similarities. When you’re a doctor, someone will come to you and say, “I’ve broken my arm. I need you to put a cast on it.” It’s the same thing when you’re a statistician: someone comes to you and says, “I’ve got this problem, I need you to fit a linear model and give me a p‑value.” The first task of any consulting appointment is to think about what they actually need, not what they think they want. It’s the same thing in medicine: people self‑diagnose and you’ve got to try and break through that and figure out what they really need.
After I finished my Ph.D. at Iowa State, I joined Rice University, where I’ve been for four years. What I primarily teach there is data analysis or data science. I don’t tend to do big data. I’m more into medium‑sized data. I think a lot of the time, big data is more of a computation or analysis problem, or a modeling problem or analytics problem. When you want to really explore and learn something, you’re pretty limited on size, not just because of your computing power, but also the cognitive load. You simply can’t look at a billion observations and figure out what’s going on. You’ve got to somehow constrain that. In a lot of what I do and what the people I teach do, there might be a big data engine on the backend, but you say, “I’ve got this huge dataset, but I’m interested in this question, so I’m just going to pull out this medium‑sized dataset.” And that might be a sub‑sample, it might be a summary or aggregation or something. Once you’ve got that smaller subset, how do you figure out what’s going on, and then how do you reflect that back to the larger dataset?
I think a pretty good process is you might pull out one subset, like you might pull out one person or one company, and you figure out what’s going on with that one person, that one company, write some model, and then apply that model to every other person or every other company. And then you can say, “Well, does that model work for everyone?” and then you look at the ones where it doesn’t work well and try and improve your model.
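Editor's note: the process described above (work out a model on one subset, apply it to every other group, then look at where it fits badly) can be sketched as follows. This is a minimal illustration under stated assumptions, not Hadley's own code. It reads "apply that model" as refitting the same model form to each group, which is an interpretation. The company names and revenue figures are made up, and `fit_line` and `rmse` are hypothetical helpers written for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted line on the given points."""
    intercept, slope = model
    return mean((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) ** 0.5


# Made-up data: weekly revenue for three hypothetical companies.
weeks = [1, 2, 3, 4, 5]
revenue = {
    "acme": [10, 12, 14, 16, 18],
    "globex": [11, 13, 15, 17, 19],
    "initech": [10, 30, 5, 40, 2],
}

# Step 1: pull out one company and settle on a model that describes it.
acme_model = fit_line(weeks, revenue["acme"])

# Step 2: apply the same model form to every company and record the fit error.
errors = {name: rmse(fit_line(weeks, ys), weeks, ys) for name, ys in revenue.items()}

# Step 3: list companies from worst fit to best; the top entries are where
# the model needs improving.
worst_first = sorted(errors, key=errors.get, reverse=True)
print(worst_first)
```

In this toy data a straight line describes two of the companies exactly and the third very poorly, so the third is the one to look at when improving the model.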
Metamarkets: You’ve created a bunch of packages for R, including a visualization one called ggplot2. What was that process like?
Hadley: ggplot2 came out of my frustration with the state of the art in R graphics at the time, which was called lattice. Lattice was practically frustrating because there were things that seemed like they should be really easy to do but were just impossible. For example, if you want to do a map and then overlay locations on that map, it’s incredibly, incredibly difficult. It was also theoretically inelegant, because you could take a scatter plot, and if you gave a certain argument to that function, you’d get a box‑and‑whisker plot. That just seemed wrong. If it’s called a scatter plot and you can turn it into a box‑and‑whisker plot, there’s just something wrong with that.
And so, around that time, I was also reading this book called “The Grammar of Graphics” by Leland Wilkinson, which outlines, for a very large number of graphics, what the common pieces are that you can think about and recombine.
I had a problem, and I saw this thing that looked like a solution. The book was really, really cool, because you read it, and you’re like, “Wow, this is so true,” but if you wanted to actually use it, at the time, the only software that implemented it cost like $100,000 or something. So, that was the impetus to make an open‑source version.
Metamarkets: Are you thinking of staying in academia, or do you want to move on from it eventually?
Hadley: My goal is to be an academic, but I’m not sure that a university is necessarily the right place. I really like the mix of research and teaching and hands‑on software development, but it’s not obvious that a university is best if you’ve got a really practical bent, and with data science being so popular… I think it’d be pretty easy to do those things outside of a university.
Metamarkets: Have you seen any rise in interest in R that you can tell over the last few years?
Hadley: One thing I’ve noticed, I think, is way more people now know what R is. They don’t know the details but at least they’ve heard of it. Even in academia, like at the Joint Statistical Meetings, which is the big statistics conference, the number of talks about R has risen. I’ve been going for six or seven years. The first year I was there maybe there were two or three talks and now there are tons. It’s definitely exploded. This is the thing: even if people aren’t using R they feel guilty about it, like they should be switching from SAS or SPSS to do their analysis.
The other thing I think has had an impact is the financial crisis, because most university budgets got cut. SAS and SPSS and stuff, those site licenses are relatively expensive, so there’s quite a big pressure to get rid of those expensive commercial programs and switch to an open‑source alternative. That has this long‑term effect because now, for people coming out of college, it might be that the only statistical environment they’ve used is R, so when they start their first job they’ll use R when they have a problem.