Jeff Magnusson, manager of Data Platform Architecture at Netflix, gave a presentation at QCon SF 2013 Conference about their Data Platform as a Service. Following up to this presentation, we will try to further explain how the technology stack lies and how it helps Netflix to tackle important business decisions.
Netflix has more than 30 million subscribers worldwide. Every user provides several data points during a visit to Netflix web site. Playing, rating or searching for a video are events that are captured and analyzed. Time, date, geolocation, device and browsing or scrolling behavior in a page can also be used to provide the context in which such events occurred and place users in different buckets. The company uses the data to increase the engagement with their site and make business decisions like which series to fund next.
Metadata from third parties like Nielsen or social media also contribute to driving engagement or new subscribers to the platform.
Netflix has been running on the cloud and using the Hadoop platform since 2009. The key infrastructure Big Data blocks they use are:
Amazon S3: The Amazon S3 technology is used to capture billions of events from various devices using Ursula, an internal data pipeline tool. S3 is used as the source of truth for the Elastic Map Reduce (EMR) clusters running Hadoop jobs.
Hadoop: Apache Hadoop is used as the baseline library for distributed computations. Hadoop is deployed in Elastic Map Reduce clusters on AWS and instead of using HDFS in the storage each node provides, it’s exploiting S3 bucket storage. This is counterintuitive as it can cause data movement from S3 to the EMR nodes against the data locality principle Hadoop exploits but on the other hand, means that S3 can hold as a single source of truth and the EMR clusters can be expendable and resized as seen fit in almost real time.
Hive: Hive in Netflix is used for ad hoc queries and lightweight aggregation. Pig on the other hand is used for ETL and more complex data flows. Its power in data movement is also exploited to connect between complex operations.
Genie: Genie, a Hadoop PaaS technology is used to submit jobs in EMR. Genie provides a REST-ful API that can be used by developers without having to deal with the intrinsics of spinning or maintaining a Hadoop cluster. Genie can be forked from their GitHub repository.
Franklin: Franklin, a metadata API, can be used to extract information from RDS, Redshift, Cassandra, Teradata or S3 sources. Cassandra has been in use for online data collection since 2011 in Netflix, after migrating successfully from an Oracle data center based solution to AWS cloud. Teradata is mostly used in a datacenter but this will change with Teradata’s announcement that they have signed up Netflix for their Teradata Cloud.
Forklift: Forklift can be used to move analytical data between different data stores. Source and target destinations can be Hive, RDBMS, S3, R and others.
Sting: Sting is then used to visualize Genie jobs results in an ad hoc fashion. Sting provides sub second response time for common OLAP operations like slicing and dicing by keeping the datasets in memory.
Lipstick: Lipstick is used to allow users visualizing data flow in pig jobs and job progress overall. This way stalled jobs, wrong output data or failed jobs can be inspected at a glance and revised to execute correctly.
Along these tools, Netflix has developed several helper tools like Curator. Curator is a set of Java libraries that facilitate Apache Zookeeper usage. Building robust clients is a breeze with Curator and you can avoid several pitfalls like unsafe client calls and false assuming success in a request.
A really important application of all the parts of the technology stack described above is Netflix recommendations. Recommendation results drive around 75% of all Netflix videos streamed. One of the systems driving recommendations is using Markov chains to model movies as states and calculate the probability of transitioning between these states. In an RDBMS, this would run as a stored procedure, once a week, as an expensive copy that can’t scale well. Using Hadoop this problem is inherently scalable, without any need to copy data and using Pig or Java Map Reduce jobs it can be easier to maintain than stored procedures.
A Markov chain is describing a discrete time stochastic process over a set of states according to a transition probability matrix. Modelling each movie as a node and using a double pass Map Reduce job Netflix can calculate the probabilities of transitioning from one node to another, which is the recommendation value. The future values are only dependent on current values, which makes it a good candidate for a Map Reduce job as they don’t have to keep state in Hadoop nodes.
This is not the only parameter that Netflix is taking into account for the recommendations engine. Context is an interesting dimension to take into account. A user may want to view different content based on the viewing device, if he is at home or on vacation and the hour or day of week. This is a problem that not even Netflix has managed to solve yet, as it has several challenges to correlate context with viewing preferences.
Netflix’s Big Data architecture is not something that can be easily replicated in another industry or even a competitor. However, several of the building blocks are open source and available at their GitHub account. These can serve as starting point for any organization willing to start developing a Big Data architecture. And as Netflix has shown, a Big Data strategy is not an afterthought, but something that has to be planned ahead and thoroughly executed throughout the years.
Enterprise Architecture Operations & Infrastructure Architecture & Design Development
Database Infrastructure Data Analytics IaaS Cloud Computing Netflix Big Data Infrastructure Data Analysis Big Data