Sunday, May 9, 2010
Facebook has the world's largest Hadoop cluster!
It is not a secret anymore!
The Datawarehouse Hadoop cluster at Facebook has become the largest known Hadoop storage cluster in the world. Here are some of the details about this single HDFS cluster:
- 21 PB of storage in a single HDFS cluster
- 2000 machines
- 12 TB per machine (a few machines have 24 TB each)
- 1200 machines with 8 cores each + 800 machines with 16 cores each
- 32 GB of RAM per machine
- 15 map-reduce tasks per machine
That's a total of more than 21 PB of configured storage capacity! This is larger than the previously largest known cluster, Yahoo!'s 14 PB cluster. Here are the cluster statistics from the HDFS cluster at Facebook:
Hadoop started at Yahoo! and full marks to Yahoo! for developing such critical infrastructure technology in the open. I started working with Hadoop when I joined Yahoo! in 2006. Hadoop was in its infancy at that time and I was fortunate to be part of the core set of Hadoop engineers at Yahoo!. Many thanks to Doug Cutting for creating Hadoop and to Eric14 for convincing the executive management at Yahoo! to develop Hadoop as open source software.
Facebook engineers work closely with the Hadoop engineering team at Yahoo! to push Hadoop to greater scalability and performance. Facebook has many Hadoop clusters; the largest among them is the one used for Datawarehousing. Here are some statistics that describe a few characteristics of Facebook's Datawarehousing Hadoop cluster:
- 12 TB of compressed data added per day
- 800 TB of compressed data scanned per day
- 25,000 map-reduce jobs per day
- 65 million files in HDFS
- 30,000 simultaneous clients to the HDFS NameNode
A majority of this data arrives via scribe, as described in scribe-hdfs integration. This data is loaded into Hive. Hive provides a very elegant way to query the data stored in Hadoop. Almost 99.9% of the Hadoop jobs at Facebook are generated by a Hive front-end system. We provide lots more details about our scale of operations in our SIGMOD paper titled Datawarehousing and Analytics Infrastructure at Facebook.
Here are two pictorial representations of the rate of growth of the Hadoop cluster:
Details about our Hadoop configuration
I have fielded many questions from developers and system administrators about the Hadoop configuration that is deployed in the Facebook Hadoop Datawarehouse. Some of these questions are from Linux kernel developers who would like to make Linux swapping work better with Hadoop workloads; other questions are from JVM developers who may attempt to make Hadoop run faster for processes with large heap sizes; yet others are from GPU architects who would like to port a Hadoop workload to run on GPUs. To enable this type of outside research, here are the details of Facebook's Hadoop warehouse configuration. I hope this open sharing of infrastructure details from Facebook jumpstarts research into ways to optimize systems for Hadoop workloads.
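If you would like to compare your own deployment with ours, a quick way to get started is to dump the relevant properties with Hadoop's Configuration API. The sketch below is my own illustration; the file paths and the handful of keys shown are common Hadoop 0.20-era names, not our actual settings or values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Minimal sketch: print a few commonly tuned Hadoop properties from the
// site configuration files. The paths and keys are illustrative defaults;
// the values printed depend entirely on your own deployment.
public class DumpHadoopConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));

    String[] keys = {
      "fs.default.name",                         // NameNode URI
      "dfs.block.size",                          // HDFS block size in bytes
      "dfs.replication",                         // default replication factor
      "mapred.tasktracker.map.tasks.maximum",    // map slots per TaskTracker
      "mapred.tasktracker.reduce.tasks.maximum"  // reduce slots per TaskTracker
    };
    for (String key : keys) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}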
Sunday, April 25, 2010
The Curse of the Singletons! The Vertical Scalability of Hadoop NameNode
Introduction
What are the bottlenecks of the NameNode?
Network: We have around 2000 nodes in our cluster and each node runs 9 mappers and 6 reducers simultaneously. This means that there are around 30K simultaneous clients requesting service from the NameNode. The Hive Metastore and the HDFS RaidNode impose additional load on the NameNode. The Hadoop RPC Server has a singleton Listener Thread that pulls data from all incoming RPCs and hands it to a bunch of NameNode handler threads. Only after all the incoming parameters of an RPC are copied and deserialized by the Listener Thread do the NameNode handler threads get to process the RPC. One CPU core on our NameNode machine is completely consumed by the Listener Thread. This means that during times of high load, the Listener Thread is unable to copy and deserialize all incoming RPC data in time, leading to clients encountering RPC socket errors. This is one big bottleneck to vertically scaling the NameNode.
CPU: The second bottleneck to scalability is the fact that most critical sections of the NameNode are protected by a singleton lock called the FSNamesystem lock. I had done some major restructuring of this code about three years ago via HADOOP-1269, but even that is not enough for supporting current workloads. Our NameNode machine has 8 cores, but a fully loaded system can use at most 2 cores simultaneously on average; the reason is that most NameNode handler threads encounter serialization via the FSNamesystem lock.
Memory: The NameNode stores all its metadata in the main memory of the singleton machine on which it is deployed. In our cluster, we have about 60 million files and 80 million blocks; this requires the NameNode to have a heap size of about 58 GB. This is huge! There isn't any more memory left to grow the NameNode's heap! What can we do to support an even greater number of files and blocks in our system?
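As a back-of-the-envelope check (my own arithmetic, not a measured breakdown of the NameNode heap), those numbers work out to a few hundred bytes of heap per namespace object:

// Derived estimate only: average heap cost per namespace object,
// using the file, block, and heap figures quoted above.
public class NameNodeHeapEstimate {
  public static void main(String[] args) {
    long files   = 60000000L;                   // ~60 million files
    long blocks  = 80000000L;                   // ~80 million blocks
    long heap    = 58L * 1024 * 1024 * 1024;    // ~58 GB heap, in bytes
    long objects = files + blocks;              // ~140 million namespace objects
    System.out.println(heap / objects + " bytes per object on average");  // prints roughly 444
  }
}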
Can we break the impasse?
RPC Server: We enhanced the Hadoop RPC Server to have a pool of Reader Threads that work in conjunction with the Listener Thread. The Listener Thread accepts a new connection from a client and then hands over the work of RPC-parameter deserialization to one of the Reader Threads. In our case, we configured the pool to consist of 8 Reader Threads. This change has doubled the number of RPCs that the NameNode can process at full throttle. This change has been contributed to the Apache code via HADOOP-6713.
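For readers unfamiliar with the pattern, here is a simplified sketch of the division of labor. The real Hadoop RPC server (org.apache.hadoop.ipc.Server) is built on NIO selectors; this blocking-I/O version only illustrates the idea: the accept loop stays single-threaded while the expensive copy-and-deserialize step is fanned out to a small pool of Reader Threads, which queue fully parsed calls for the handler threads.

import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of the Listener/Reader split; not the actual Hadoop code.
public class ListenerWithReaderPool {
  private final ServerSocket serverSocket;
  private final ExecutorService readers;   // pool of Reader Threads
  private final BlockingQueue<byte[]> callQueue = new LinkedBlockingQueue<byte[]>();

  public ListenerWithReaderPool(int port, int numReaders) throws Exception {
    this.serverSocket = new ServerSocket(port);
    this.readers = Executors.newFixedThreadPool(numReaders);
  }

  // The singleton Listener: accepts connections only, never deserializes.
  public void listen() throws Exception {
    while (true) {
      final Socket client = serverSocket.accept();
      readers.submit(new Runnable() {
        public void run() {
          try {
            // Reader Thread: copy and deserialize the RPC parameters,
            // then enqueue the parsed call for the handler threads.
            DataInputStream in = new DataInputStream(client.getInputStream());
            int len = in.readInt();          // toy wire format: length-prefixed payload
            byte[] payload = new byte[len];
            in.readFully(payload);
            callQueue.put(payload);          // handler threads drain this queue
          } catch (Exception e) {
            // drop the connection on error in this sketch
          }
        }
      });
    }
  }

  // Handler threads call this to pick up fully deserialized requests.
  public byte[] takeCall() throws InterruptedException {
    return callQueue.take();
  }
}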
The above change allowed a simulated workload to consume 4 of the 8 CPU cores on the NameNode machine. Sadly enough, we still cannot get it to use all 8 CPU cores!
FSNamesystem lock: A review of our workload showed that our NameNode typically has the following distribution of requests:
- stat a file or directory 47%
- open a file for read 42%
- create a new file 3%
- create a new directory 3%
- rename a file 2%
- delete a file 1%
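Close to 90% of these requests (stat and open-for-read) only read the namespace and do not modify it, so they do not really need to serialize against one another. Replacing the exclusive FSNamesystem lock with a readers-writer lock lets read-only requests proceed in parallel while mutations still take the lock exclusively. Here is a minimal sketch of that locking pattern (an illustration of the idea, not the actual NameNode code):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the FSNamesystem locking pattern with a readers-writer lock.
// Class and method names are illustrative; the real FSNamesystem is far larger.
public class NamespaceLockSketch {
  private final ReadWriteLock fsLock = new ReentrantReadWriteLock();

  // Read-only operations (stat, open-for-read) share the read lock,
  // so many of them can run concurrently on different handler threads.
  public String getFileStatus(String path) {
    fsLock.readLock().lock();
    try {
      return lookup(path);
    } finally {
      fsLock.readLock().unlock();
    }
  }

  // Mutations (create, rename, delete) take the write lock exclusively,
  // just as the old singleton FSNamesystem lock did for every operation.
  public void createFile(String path) {
    fsLock.writeLock().lock();
    try {
      insert(path);
    } finally {
      fsLock.writeLock().unlock();
    }
  }

  // Placeholders standing in for real namespace operations.
  private String lookup(String path) { return path; }
  private void insert(String path)   { /* no-op in this sketch */ }
}

Since stat and open dominate the workload, most handler threads end up in the read path and no longer serialize behind a single exclusive lock.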
The memory bottleneck issue is still unresolved. People have asked me if the NameNode can keep some portion of its metadata on disk, but this will require a change in the locking model design first. One cannot hold the FSNamesystem lock while reading in data from the disk: this would cause all other threads to block, thus throttling the performance of the NameNode. Could one use flash memory effectively here? Maybe an LRU cache of file system metadata will work well with current metadata access patterns? If anybody has good ideas here, please share them with the Apache Hadoop community.
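To make the LRU idea concrete, here is a toy sketch of such a cache built on java.util.LinkedHashMap. It is purely illustrative and says nothing about how eviction, the on-disk format, or the locking model would actually be wired into the NameNode:

import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache for file system metadata keyed by path. Entries evicted from
// memory would have to be re-read from disk or flash; that part is omitted.
public class MetadataLruCache<V> extends LinkedHashMap<String, V> {
  private final int maxEntries;

  public MetadataLruCache(int maxEntries) {
    super(16, 0.75f, true);   // accessOrder=true gives LRU iteration order
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
    // Evict the least recently used entry once the cache is full.
    return size() > maxEntries;
  }
}

Whether a cache like this gets a good enough hit rate depends entirely on the metadata access pattern, which is exactly the open question above.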
In a Nutshell
The two proposed enhancements have improved NameNode scalability by a factor of 8. Sweet, isn't it?
Labels:
hadoop,
hadoop distributed file system,
hdfs,
hdfs scalability