Data Scientist Interview: Pete Warden, Co-Founder of Jetpac
August 17th, 2012 Rachel Hyman
Pete Warden is the co-founder and CEO of Jetpac, an app that curates travel photos from friends. He founded Open Heat Map, and formerly worked as a Senior Engineer at Apple. He is a prolific blogger and wrote a few things for O’Reilly Media.
Metamarkets: How did you get where you are today?
Pete Warden: I actually started off in the computer graphics world. I worked writing low-level code for game consoles like PlayStation back in the day when that was big. Then I ended up doing a bunch of open source video effects for live concerts and clubs. I was doing video processing on low level laptops, just because that’s something I liked to do anyway in my free time, and I found it fun. That turned into a company that ended up getting acquired by Apple. I worked at Apple for five years, again doing stuff on big streams of video for some of their applications like Final Cut and Motion.
I loved playing with these big sets of really, really messy data. With video, you’ve got stuff coming through at 30 frames a second, or 60 frames a second. You’ve only got a very short time to process each frame that contains millions and millions of pixels. It’s pretty challenging. You have to jump through quite a few hoops to actually get enough processing done in a short amount of time.
That was a great workout for trying to figure out how to deal with very large amounts of data with very limited resources. When I left Apple I wanted to get back to the startup world. I felt like you could do a lot more with these other types of data that weren’t video. I definitely learned some stuff in the video world, especially from other really good people. There were a lot of techniques that were invented to do image processing and video processing. I learned a way of thinking about problems from these really smart people who’d spent decades working on these problems that hadn’t made it into all of these other areas where they’re actually really useful. That’s really the thing that I found talking to people a lot. Data science is a very fuzzy term, but what seems to unite a lot of people is that they’ve come from a lot of different directions and ended up in the same place. It almost feels like this crossroads where a lot of different disciplines actually meet and share and exchange tools and techniques and talk and collaborate.
We’ve got this toolkit that’s got almost the best techniques from all of these different areas, but we’ve all been dealing with it in our own little enclaves. Data science is finally seeing all of these different specialties actually start to share a lot of what they’ve learned over the past few decades, which is really exciting.
After I left Apple, I wanted to create my own startup. The first thing I wanted to do was, one of the big problems at Apple was we had thousands and thousands of engineers. You knew that if you had a problem, somebody out there knew what the answer to it was. But finding that right person who knew the answer was almost impossible. There was no company‑wide directory of experts. It would be really hard to manually create one. But I realized that my emails I sent actually gave a very good picture of the things I was at least interested in, and maybe a bit knowledgeable about. I set out to take an entire company’s emails, hundreds of millions or billions of emails on an email server, and just analyze them very simply, just looking for keywords. Things like C++ or image processing, or anything that might be related to a job skill.
If you see a lot of those words showing up in the emails that somebody sends, you use that to help build a profile for them which they then get to approve or deny. They still have complete control over what’s published about them, but you go through and create this almost internal LinkedIn profile for people without them having to do any work. Then, you’d actually have a chance to find people within these massive organizations that you needed. It just seemed like a win‑win because people generally liked actually sharing knowledge and helping out. The phrase I kept coming back to at Apple was like, they’re all getting paid by Steve. There are very different departments, but they’re all trying to work towards the same end goal. That was massive fun, and it turned out to be actually surprisingly easy to build. Even just with scripting languages, it didn’t have to be super fast optimized code. I was able to build a pipeline that could suck in hundreds of millions of email messages or billions of email messages, analyze them, and look for experts.
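The keyword approach Pete describes can be sketched in a few lines of Python. This is a minimal illustration and not his actual pipeline: the function name `build_profiles`, the `min_mentions` threshold, and the input shape (sender, body) are all assumptions, and only "C++" and "image processing" come from the interview.

```python
from collections import Counter, defaultdict

# Example skill keywords. Only these two are named in the interview;
# a real list would be much longer (assumption).
SKILL_KEYWORDS = ["c++", "image processing"]


def build_profiles(emails, keywords=SKILL_KEYWORDS, min_mentions=2):
    """Suggest skill keywords per sender from the emails they sent.

    emails: iterable of (sender, body) pairs.
    Returns {sender: {keyword: count}}, keeping only keywords a sender
    mentioned at least min_mentions times across all their emails.
    """
    counts = defaultdict(Counter)
    for sender, body in emails:
        text = body.lower()
        for kw in keywords:
            n = text.count(kw)  # simple case-insensitive substring count
            if n:
                counts[sender][kw] += n

    profiles = {}
    for sender, counter in counts.items():
        kept = {kw: n for kw, n in counter.items() if n >= min_mentions}
        if kept:
            profiles[sender] = kept
    return profiles
```

The output is only a set of suggestions. In the workflow described above, each person would still approve or deny the profile before anything is published.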
I could also map out the social networks within companies. That was the other thing at Apple, or any big company I think, is that the real power structures and the real way that things actually get done, they usually involve a conspiracy of engineers from different departments within tech companies getting together over beers and figuring out what actually needs to get done. Then going back and individually selling the management structure and prototyping stuff together. One of the things I found hard at Apple was the people who were really involved in this, they were really crucial but not formally recognized in any structure. They didn’t get rewarded for all of this help that they were giving to the company. They were invisible. No special effort was made to keep those really crucial people who were these known, connected people in the networks.
But I really sucked at selling. I really was not good at enterprise sales. It was so hard, but I discovered I could actually apply the same code I’d built to Twitter. I took 400 million Twitter messages and actually started analyzing who talked to whom on Twitter and building out these conversation graphs of, who talks to Evan Williams and who does he talk back to, and how do they talk amongst themselves? You end up with these really interesting different groupings. Very much like the LinkedIn connection graphs. That was a very visual way of exploring stuff, so that proved to be pretty popular. I did a blog post about it, and then it got picked up by a few different bloggers because people like Marshall Kirkpatrick, when they were actually writing about somebody they would go and look at the person’s graph to figure out who they were close to so they could actually get some quotes or background from the real people they had some relationship with.
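A conversation graph of this kind can be approximated by counting who addresses whom. The sketch below assumes that conversations are detected through @mentions in the message text, which the interview does not specify. The function names and the (author, text) input shape are hypothetical.

```python
import re
from collections import Counter

# Matches @username mentions (assumed to be the signal for "talking to").
MENTION = re.compile(r"@(\w+)")


def conversation_edges(tweets):
    """Count directed edges (author -> mentioned user).

    tweets: iterable of (author, text) pairs.
    Returns a Counter keyed by (author, mentioned), names lowercased.
    """
    edges = Counter()
    for author, text in tweets:
        src = author.lower()
        for name in MENTION.findall(text):
            dst = name.lower()
            if dst != src:  # ignore self-mentions
                edges[(src, dst)] += 1
    return edges


def mutual_pairs(edges):
    """Return unordered pairs where each user has addressed the other."""
    return {
        tuple(sorted(pair))
        for pair in edges
        if (pair[1], pair[0]) in edges
    }
```

Mutual pairs separate two-way conversations ("who does he talk back to") from one-way mentions, which is the distinction that makes the groupings interesting.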
Metamarkets: Then how’d you get from there to Jetpac?
Pete: That’s a convoluted route. First, one stop along the way was getting sued by Facebook. That was fun. Once I realized that the enterprise thing wasn’t really working, I moved over to trying to create more of a personal email service. I tried to think of things I wanted out of my email that could be useful. One of the things that I really wanted was when I first started emailing back and forth with somebody, I wanted to get a little report that had their Twitter, their LinkedIn and their Facebook. So if I wanted to, I could actually just easily connect so you don’t have to go through that whole Googling people process. It’s what everybody does when they want to find out about people or get that social connection, but there was no API to do that. I realized what everybody does is they go to Google and they search for keyword “LinkedIn” to find my profile, so those search engines have access to it. How hard can it be for that search engine?
It turned out that it was pretty easy. I knew a lot of the techniques that I started doing for the email analysis stuff. So I actually wrote a search engine robot that crawled 220 million Facebook profiles while obeying robots.txt, so behaving exactly like a normal search engine. I was using that and I was actually able to start connecting email contacts with these other services without having to use an API.
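Obeying robots.txt means checking each URL against the site's published rules before fetching it. Here is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `allowed_urls`, the user agent string, and the example rules and URLs are made up for illustration and do not reflect the crawler Pete wrote.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt lets user_agent fetch.

    robots_txt: the text of a robots.txt file (already downloaded).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules and URLs, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

EXAMPLE_URLS = [
    "https://example.com/people/alice",
    "https://example.com/private/settings",
]

if __name__ == "__main__":
    print(allowed_urls(EXAMPLE_ROBOTS, "example-bot", EXAMPLE_URLS))
```

A polite crawler would run this check for every URL it queues and would also rate-limit its requests, which this sketch leaves out.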
I also realized there was some really interesting data in those Facebook public profiles because they weren’t the full selection of data you see when you’re logged in. They were quite sparse, but they at least had a person’s name, where they lived and a selection of their friends. So I created a visualization that actually mapped, for every city in the US, the top ten other cities that people in that city had friends in. So you could actually see the people in Utah only have friends in Utah. Whereas people in places like California and New York are very nomadic. All of the big cities just have these fan-outs to everywhere and then the smaller cities not so much.
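The per-city aggregation behind such a map can be sketched as a counting pass over friendship pairs. The input shapes here (a list of friend pairs and a person-to-city lookup) and the function name `top_friend_cities` are assumptions, since the interview does not describe how the data was stored.

```python
from collections import Counter, defaultdict


def top_friend_cities(friendships, city_of, n=10):
    """For each city, list the n other cities its residents have most friends in.

    friendships: iterable of (person_a, person_b) pairs, each pair listed once.
    city_of: dict mapping person -> home city.
    Friendships within the same city, or with an unknown city, are skipped.
    """
    links = defaultdict(Counter)
    for a, b in friendships:
        city_a, city_b = city_of.get(a), city_of.get(b)
        if city_a is None or city_b is None or city_a == city_b:
            continue
        # A friendship connects both cities, so count it in each direction.
        links[city_a][city_b] += 1
        links[city_b][city_a] += 1
    return {
        city: [other for other, _ in counter.most_common(n)]
        for city, counter in links.items()
    }
```

Keeping the same-city count separately (it is dropped here) would show how inward-looking a city is, which is the pattern described above for Utah versus the big coastal cities.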
I was just really fascinated by the data, so I did a blog post and colored in and grouped together different areas in the US that I thought were closely connected and gave them silly names like “Mormonia” for the sections of Utah that were tightly connected. People love visualizations, so that ended up becoming popular. But I also wanted to make the dataset that used the public profile data available to academics. I mentioned that in a blog post and that was obviously likely to make Facebook freak out a bit, especially when the blog post became very popular. It’s a very sensitive data set and for me it was almost an afterthought because anybody else could have written a robot. It only cost me a few hundred bucks to gather those 220 million profiles, so for me I felt like I was just making things a little bit easier for people.
Because I had been working in isolation a bit and all these techniques seemed fairly obvious to me, because I had had my head wrapped around them for a couple of years and I had this background, I didn’t realize that other people were a bit surprised. And that was a nice thing that initially got me connected to a bit of a movement, back in 2010. There were other people who were also discovering that all of these tools weren’t too crazy.
We ended up settling quite amicably with Facebook and I never released that data set, but that got me connected with Julian [Green] at Jetpac and really with the data science movement. At that point, I started open sourcing a lot of my tools because some of the stuff I had built over the last couple of years was actually going to be useful. There were people asking me how to do some of this stuff, so I started doing some writing for O’Reilly. I got talking to people like Julian who had interesting problems. And what blew me away about what Julian was envisioning was how much really interesting information we capture in all of our photos. On average, a person has access to 200,000 photos that their friends have shared on Facebook, and they’ve seen almost none of them.
So if you’re actually able to go through all of those photos that your friends have shared with you and find the ones that are really awesome and really interesting, that feels like that’s actually a really cool thing to build. It has a really emotional impact because photos are so evocative.
Just sitting down with users when we get them to sign into the app and load up with Facebook, they suddenly see photo after photo from their friends that look great and that are showing them having been to all of these places they never knew their friends had been to. They’re like, “Oh my god. I never knew that Mike had been to Peru.” You actually get to see what they are doing there. That was pretty magical.
What I keep coming back to with a lot of this work is that data really helps me connect people together who should be connected. The best feeling is when you can get a “You guys should talk” moment. With the expert finding thing, that was always kind of the magic moment of that. With Jetpac, it’s when you take you and one of your friends, you’re thinking of going somewhere, you realize your friend has already been there. It looks like she had a wonderful time, and you’re able to start a conversation saying, “Wow, you took amazing photos. Tell me about your trip.”
Normally you just upload photos to Facebook, and if somebody sees them in their timeline, you might have that moment. But most of the time it feels like you’re just throwing them into a void. Whereas with this, you actually get to hear from your friends about how much they like your photos. You get to help out your friends.
Metamarkets: What data science challenges are unique to working with photos versus your past experiences?
Pete: One of the things was, it’s almost not a challenge. One of the things that makes it a lot easier is that people caption photos in a really useful way on the data side. They tend to describe what’s in the photos. There doesn’t tend to be too much sarcasm. It tends to be people talking about what they’ve done, not people uploading photos taken by somebody else and commenting on them. They’re actually talking about them, so it makes a lot of the semantic analysis stuff a lot easier because normally, if you’re taking a Twitter stream from somebody, it’s full of all of these quotes from other people, all of these links and hashtags. It’s very hard to make sense of it. It’s not somebody talking about themselves. It’s like somebody changing the TV channel constantly, and all this noise coming in. It’s been really nice just working with this data source. People just describing what they’ve been doing.
At the moment, we’re just using metadata. That’s actually been enough to get us some really good results. That’s one thing I learned from my past doing image processing at Apple: video and images are very large packets of data. If you can actually start off by using just the words and the metadata around the photos, that is a lot easier to handle.
Metamarkets: So how do you build a business around what you guys are doing?
Pete: We think of this very much like those travel magazines. There’s not an online equivalent to opening up a gorgeous travel magazine and leaning back and just flipping through it. A lot of the appeal of travel magazines has been the beautiful ads that show up. Often they’re better photos than the actual stories themselves. They don’t feel like intrusive billboards. They’re very, very relevant and inspirational. One of our most popular slideshows is actually the Hot 100 List of Hotels, the 100 best hotels in the world. People love flipping through that. Essentially, each one of those, you can look at it as an advertisement. Especially if you’ve got beautiful photos showing up when you’re looking for a destination. It doesn’t feel like an intrusion. It feels like a very natural part of the experience to have a professional photo of a hotel or other destination showing up amongst your friends’ photos. Then, obviously, you might want to click and say, “Oh, yeah. That looks interesting.”
Metamarkets: Do you use data science techniques internally to evaluate your own processes, like building internal tools to take a look into customers?
Pete: Yes, we do. A key thing for us is to understand what’s working for our users, and what isn’t. We look at how effective our slideshows are, trying to figure out, what is actually turning people off? What things do they hit and do they then tend to leave the app so that we can actually improve the experience? We’re very, very data focused when it comes to trying to figure out what the top of the priority list for improving the app is.
Metamarkets: Where do you see this field of data science heading?
Pete: It feels like this toolkit is going to be useful all over the place. That’s what’s been really exciting about the data science world is seeing all of these other industries adopt some of the techniques. One of my favorite examples is Kaggle, where they’re able to run data competitions with all of these data scientists who’ve mostly been around machine learning, but they’ve learned this basic toolkit. These techniques that have been developed by all these different people for analyzing data…they’re able to go into all these different industries and produce solutions that are better than the best that the traditional experts within those industries have actually managed to create. Everything from healthcare to astronomy, to government stuff, to traffic. It’s a pretty clear demonstration there is something new and valuable here. If there wasn’t something different about data science, and especially the machine learning stuff in their case, you would have already been getting these better results. The proof is in the pudding. It’s obviously very effective. I see that happening across the board as people realize that traditionally, within bigger corporations, anything involving databases is this massive, expensive, very slow process. There’s the whole world of agile tools and techniques. They’re the Rebel Alliance. It’s been pretty amazing seeing these big companies start to pick these techniques up and see how effective they can be.
Metamarkets: What challenges would you say data scientists uniquely face in evangelizing or spreading their discipline?
Pete: I think the biggest challenge is it’s such an eclectic toolkit of techniques that there isn’t an agreed set of things that belong to data science. As the term becomes more popular, anybody who’s dealing with data is going to use the label data science until it becomes so diluted that it’s hard to tell what it actually means. It’s a good problem to have and there’s been some interesting efforts by O’Reilly and people like that to try and define what it is, to try and just draw a fuzzy line around the things you should know if you want to be a data scientist.
Metamarkets: What advice would you give to someone who’s interested in data science and wants to develop their skills more?
Pete: I would really tell them to find a project that they actually care about. Grab a problem that’s interesting and has some data floating around somewhere, and take it all the way from figuring out where to get the data, to processing them, to analyzing them, to visualizing them, to trying to come up with something actionable at the end. That’s really what I feel like distinguishes data scientists from statisticians, or database analysts or these other more specialized roles, is that we’re able to do this full stack of stuff.
It feels to me like a very, very hands‑on and practical thing. You should just be able to find something out there. There’s so much data flowing around now. There’s so many interesting problems in the world. You really should be able to find something in an area that you’re interested in, and even just do a visualization to show something.
Metamarkets: Name a data scientist that you admire most or name a project that has caught your eye recently.
Pete: I think Steve Coast of OpenStreetMap, just for how he managed to galvanize this incredible community of people to create the OpenStreetMap project, and the fact that he’s now trying to create an open geocoder. He’s actually working at Microsoft now. He’s created this new, open source project where it’s almost like Wikipedia for place names. You actually type in a location. If it’s not found, it says, “Can you drag a bounding box on a map around what that string is associated with?” You end up with this open source, public data set. That’s one of the biggest problems. People don’t know what lat-long they’re at. They enter an address, and there are very few open source public solutions for finding addresses. Yeah, Steve is doing amazing stuff. I’m in awe of him.