Data Scientist Interview: Florian Leibert of Airbnb
February 5th, 2013 Rachel Hyman
Florian Leibert is currently a Tech Lead at Airbnb. He has previously worked as a software engineer and applications developer at Twitter, Ning, a large ad company, and other startups. He has extensive experience with distributed data structures, Hadoop, algorithm design, predictive modeling, lexical semantics, and more. In contrast to our previous data scientist profiles, Florian sits lower down on the stack, working more on architecture than in analysis. He sits down with us to discuss Apache Mesos, the cluster management platform.
Metamarkets: What is your background like, starting from university and going through your various jobs?
Florian Leibert: I studied computer science in Germany. I was interested in both natural language processing and cloud computing, although it wasn’t called cloud back then yet. It was called grid computing, but the idea was close to today’s version of cloud computing.
Afterwards, in 2006, I started to work for a company called HeadCase. We were riding the wave of the Second Life movement, where people were using these immersive 3D worlds and building characters. You could actually program these characters to do certain tasks. We wanted to script a lot of these characters, parsing and generating content that was relevant, allowing them to do automated tasks within this virtual world, such as having conversations with human players.
We built the parsing part first, so we’d parse all this textual data that someone would write. Then we tried to figure out what this text was about. What we were really good at is if someone wrote sentences that contained the word “street,” “car,” “truck,” and “airplane,” then we could justify that the category was transportation.
We could do that across a wide number of languages, and we used, back then, a pretty novel technique called semantic analysis of lexical shape. One of the algorithms we used was called Efficient Text Summarization using Lexical Chains. We ported that over to Hadoop, and that’s how I started out using Hadoop.
Then we pivoted, because the company didn’t really go anywhere. Paco [Nathan], who was my boss at the time, and I, both left to join a large ad company and tried to use our technology in contextual advertising.
However, contextual advertizing wasn’t considered important, so instead we focused on porting their existing collaborative filter system built on top of a Netezza cluster over to Hadoop. Netezza is this proprietary SQL data warehousing system that gets expensive very quickly. It performs very well, but it’s not built on commodity hardware. We used Hadoop and rewrote their algorithms to run in the Map/Reduce framework.
This Netezza system was already at terabytes scale. We were computing click probabilities–like determining, which ad should you show to which user to get the maximum value from the ad. In order to do that, you had these gigantic datasets of historic events, like which ads users had clicked on.
We wanted to do that at scale, and we wanted to do it outside in the cloud. That’s when I started to really dive into Hadoop. We worked with Tom White, the author of the Hadoop book [Hadoop: The Definitive Guide]. He developed the Hadoop S3 file system support for us as a contractor which we allowed him to open source. All of our analytics infrastructure back then ran on EC2, data was read from S3. But the web services remained in a private cloud.
Then, I moved out to California and worked for Ning, where I worked on the Data Services team, along with [Metamarkets Lead Architect] Eric Tschetter. I was working on the ad product that Ning was developing. We were trying to classify the networks that Ning had. That was very experimental, but the results looked pretty good.
After that, I joined Twitter, where I worked for their renowned Chief Scientist Abdur Chowdhury, and I stayed there for the next two years. I built the user search system with Jimmy Lin, an authority in information retrieval, and Jake Mannix, who’s a big contributor on the Mahout project. User-search had a large ETL processing pipeline behind it – it went beyond just matching for a user’s name and instead also allowed finding people known for certain topics.
Another big project I worked on while I was at Twitter was a large scale system that computed derived user data. For example, it classified users by the language used in their tweets, how often a user signed into the service, etc. These signals would be loaded into a system that could be queried by other parts of the company at very low latency and high throughput. Many other products built on top of that system such as ads, graph recommendations and discovery.
It was challenging in terms of the sheer size of the data, but also in the infrastructure that we had to build, to serve the data back. Because, if you can imagine, if a lot of products within the company use a single service, that service has to be reliable. It has to be really fast, because if that service’s response rate is slow, then all these other products that use it will be at least as slow as that service.
Metamarkets: Was your pathway always clear? How has your domain of expertise developed since you’ve been in school?
Florian: If you look at what I’m working on now, cloud computing, Hadoop and Mesos [below], I had already worked on it in college when I did a bit of research on a European Union funded project called GridEcon. It was about dynamic resource allocation, virtualization and developing a marketplace for computing resources. Now, this is reality. Back then, these things were still very academic.
Metamarkets: Was there a sense back when you were still in school and working on research projects that this stuff would be moving this rapidly and put into actual production and execution this soon, or was it still all up in the air at that point?
Florian: No, it was totally up in the air. It was pretty surprising to me to see things moving so quickly. Today, Amazon is the largest cloud service provider. Even though Amazon is such a large company, which would make you think they’re executing much slower, Amazon managed to push out feature after feature really quickly. I’ve been very impressed by their pace.
Metamarkets: It seems like there’s been continuity in all your various jobs, with similar challenges and similar infrastructure that you’re using in each position. What’s stayed the same and what’s presented new challenges from position to position?
Florian: There are some key elements that are present everywhere, but the environment changes and has changed a lot. For example, with Twitter we were working in a private data center, and so you have to think about a lot more how you interact with the operations team and how you do things in a private cloud where you have to order machines in advance. Here at Airbnb, we use Amazon and you can spin up a new cluster within five minutes.
One of the trends that I have been seeing is that we are not exclusively using Hadoop anymore. We are using many more tools. We’re still doing most of the ETL with Hadoop, but then we’re running queries against MySQL and are currently evaluating a data warehousing solution from Amazon which replaces some of our need for the SQL-like interface called Hive which runs on top of Hadoop.
Metamarkets: Do you see that trend continuing? Do you see a whole new suite of tools being developed as these skills get more commonplace and more necessary, too?
Florian: Yes, I think the space, especially the data analytic stack, is getting much more fragmented because there are various use cases that don’t really work well with Hadoop. Graph computation is one of those examples where Hadoop is not a good fit.
When Twitter does recommendations on the Twitter graph, Hadoop would be really slow because graph algorithms are usually very iterative containing multiple steps. The data in Hadoop is stored on disc, streamed from HDFS. These are iterative algorithms that run over a dataset multiple times, so we want the data to be in memory.
Since I deal more with infrastructure right now, I really care for the lower level systems which most tools are built on top of. The ecosystem for data analysis is quite complex right now and we need a common platform to make it easier to develop new tools and deploy them within an organization. One of the systems for this that’s gaining traction is Mesos. When I was at Twitter, I was chatting with my close friend Ben Hindman who wrote Mesos over at UC Berkeley, and Mike Abbott and Abdur [Chowdhury] saw the potential for cost saving and productivity gain of Mesos at Twitter. Thus, it was introduced via the research group and is now used within Twitter at a very large scale. Mesos is basically an operating system for the cloud – a platform that can be used for writing new tools to process data for example. It runs on thousands of nodes at Twitter and at Airbnb we also run it – albeit on a much smaller scale.
Mesos provides abstractions that allow you to build distributed systems in the cloud as well as tools for the data stack, and you don’t have to re-implement every piece over and over again, like storing state in a distributed system and dealing with failure. You can run different tools on the same cluster, which makes maintenance much easier and removes friction because you don’t need to order and configure new hardware if you want to try out a new application. It also removes the need to think about individual hosts, your infrastructure becomes much more service oriented. You often don’t really care where exactly your services run, all you do is instruct Mesos to run your service in a fault tolerant way and it will take care of failover and distribution of the workload.
You can build a lot of different applications on top of Mesos. That’s why I’m mentioning this. It seems to me that it has the potential of becoming a really crucial basis of the data stack. First of all, you can run Hadoop on Mesos, but you can also run MPI, Storm, or Spark. Storm is a scalable real-time processing system which runs on top of Mesos at Twitter. It allows you to run real-time aggregates of data so you can, for example, keep a tally on the number of impressions you’re seeing of a certain tweet. Spark is an in memory analytic system which isn’t very mature yet, but it looks promising, and also runs via Mesos. You could potentially also port Metamarkets’ Druid and run it on Mesos.
To summarize – I see a really interesting platform called Mesos deeper in the stack. It can be used to build lots of long-lived and short-lived services as well as tools for the analytics pipeline, such as in-memory query engines and other tools that will compete with Hadoop. It’s a new level of abstraction in the cloud. As software engineers, we take certain primitives and basically combine them to build bigger systems. The better these primitives are, the better the resulting systems and the easier it is to build them.
Metamarkets: Speaking of Airbnb, where you currently are—what is your position like? What sorts of projects have you been working on?
Florian: At Airbnb, I’m trying to create an infrastructure for the analytics pipeline. We run in the public cloud on Amazon. A sample project is a logging system we built here that writes log data into S3. That logging system is used throughout the web-application. People can easily create new logging events.
This is a very foundational component for most of our other log analytics. Example events are reservations, searches, and more generally page views. We log all of these, and then we run A/B tests, we evaluate them based on the log data.
I’ve been trying to build up these lower levels so that we have very reliable logging with minimal to no loss of data. That was one of the pieces. The other piece was trying to just strengthen the ETL pipeline.
Let’s assume that analysts write jobs in Hive, Pig or Cascading and they want to run these jobs on a daily basis. Often times, these jobs have lots of dependencies. A web company for example has http requests, maybe they’re logged raw and then if you have a search interface, you also log the searches and once a user tries to make a booking and logs in via her username you also record that event. In order to restructure sessions, you need to join all of this data together, since the user was not logged in initially, you might use a cookie as the join key. This big join is costly and the data is used throughout other parts of your analytics stack. Thus, you may want to run it every day and it becomes a first step in your ETL pipeline.
A lot of other jobs now depend on this first step and should only be executed upon completion of this join. For example if you want to see how good your search ranking is, you’ll likely use this data. There really are very few tools that allow you to tie these multi-step jobs together, especially in a heterogeneous environment where you use Pig, Hive and Cascading together. We wrote this system that we are going to open source soon, it’s called Chronos. It basically allows creating arbitrary job dependencies, it assumes everything is triggered via Bash. It’s distributed and built directly on top of Mesos.
Metamarkets: What is unique about your work at Airbnb, whether it’s the type of data you’re dealing with, or the scale, or the tools you’re using?
Florian: I think one of the unique components is that all our services are in the cloud. At Airbnb, almost everything runs on Amazon. And the Amazon network is not as reliable as your own network. You get really high variance in your network latency, for example. Also, Amazon will warn you that tomorrow they’ll turn off a component. So you have to work around a lot of these issues, and you have to heavily use retries in all of your logic that deals with these components. You have to really adjust your timeouts, for example. It’s very different from Twitter and Ning.
Metamarkets: How much do you deal with the business and strategy side of things? Many of the other data scientists we’ve profiled do a lot of translating of this mass of data into actionable insights and relay those to the business side of their teams. They might refer to themselves as storytellers. In contrast, you seem much deeper down in the data.
Florian: I’m really interested in building robust systems and tools that help analysts get their work done faster and want to make sure they work with reliable data. I am basically a service provider to our analytics and finance team, who then can use these tools that we are building in data infrastructure to make their decisions. I work on the plumbing. For example, people come to my team and say, “Hey, we’re trying to run these queries,” and my team has to figure out how much infrastructure do we need to build to make this happen. Can we figure out if we can make these queries faster and more cost efficient? Ultimately, every query you run costs you something, especially in the cloud where you have elasticity, these costs are much more obvious. We’re also building tools to save costs and that help the decision makers and storytellers to be more productive.
Metamarkets: We talked about how fast a lot of these advances have been. If you were looking into the future 10 years, where do you see this all heading, particularly at your level of the stack? What changes do you anticipate happening?
Florian: On the infrastructure side, I think our systems will become much more service-centric. I mentioned previously, that I think things are going to go away from this idea of having a single server that fulfills a single purpose, that model wastes resources. I think we are going to share resources much more efficiently within a cluster. There will be different types of analytics pieces coming together on the same cluster, for example segmented into real time, ad-hoc and batch analytics.
I think Hadoop itself will probably not be used for running many queries – but given that it is lower in the stack than many other components and really a good fit for ETL, which is the basis for many other analytics workflows, we may still refer to the whole ecosystem as the Hadoop ecosystem. What we really need is to make analysts and data scientists much more productive. In order to make them more productive, we need to invest in new tools. Hadoop is a very cool tool that’s really good for certain things, but you can’t build a house with just a hammer. I think it’s really important to realize that there will be many other tools that allow people to make fast decisions.
I’ve also thought about the parallels of the Hadoop ecosystem and the SAP ecosystem. The difference being, Hadoop is free and SAP charges a fortune for a license – but a lot of the money is made from consulting. Deloitte and Accenture employ so many consultants for this complex piece of software that’s more than 40 years old. But the complexity of SAP created a need for all these consultants – and Hadoop is also really complex.
Metamarkets: Do you predict we’ll continue to see specialization of function within the stack? With people like you, who are working on the plumbing, and then the people higher up who are more analytics-focused?
Florian: Yeah, I think so. It takes time to build systems and it takes time to think about which questions to ask and how to ask them. At a lot of bigger companies you can see already that they employ researchers who just focus on graph problems or spam and others deal with traditional business intelligence. I see this at Airbnb, we have statisticians who are experts on how to conduct a great A/B test and others who understand the subtleties of search ranking. The more specialized you are the quicker you usually are with implementing a solution in your domain. But that doesn’t mean you have to stay in the same domain throughout your career, it’s always good to try new things and get exposure to different problems.