Towards a Scalable and Highly Available HDFS Namenode

After 3 months of intense hacking, I’m pleased to be writing about a little something I worked on for a project course here at KTH.

The premise

So we’re all familiar with Hadoop, right? It’s the little yellow elephant that provides an excellent platform for distributed computing, is seeing rapid adoption in industry, and has attracted involvement from major players like Yahoo!, Facebook and, recently, Microsoft. Hadoop and friends use the Hadoop Distributed File System (HDFS) as their underlying storage layer. Given the kind of jobs that are expected to run on top of it, HDFS is designed to store large files, and is optimised for throughput rather than latency.

HDFS is a single-master distributed file system. Architecturally speaking, HDFS comprises three important entities:

  1. Clients, who read/write files from/to the filesystem.
  2. Datanodes, which actually store the data (blocks) associated with the files.
  3. The Namenode, a central server that stores all the metadata associated with files and blocks.
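
To make this division concrete, here’s a minimal sketch (in Java, against the standard Hadoop FileSystem API) of what a read looks like from a client’s point of view. The filesystem address and path are made up; the point is that only the metadata lookups hit the Namenode, while the bytes themselves come from the Datanodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClientReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical Namenode address.
            conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

            FileSystem fs = FileSystem.get(conf);

            // Metadata operations: these are answered by the Namenode alone.
            FileStatus status = fs.getFileStatus(new Path("/user/foo/bar.txt"));
            System.out.println("File length: " + status.getLen());

            // Data operation: the client streams the actual blocks from the Datanodes.
            FSDataInputStream in = fs.open(new Path("/user/foo/bar.txt"));
            byte[] buffer = new byte[4096];
            int bytesRead = in.read(buffer);
            System.out.println("Read " + bytesRead + " bytes from the Datanodes.");
            in.close();
        }
    }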

This division between metadata storage and data storage is important, because typical HDFS use cases are data intensive, not metadata intensive. That’s fine, but the problem is that if the Namenode crashes, the entire file system becomes inoperable, because clients and Datanodes still need the metadata to do anything useful. Furthermore, since the Namenode maintains all the metadata in memory, the number of files you can store on the filesystem is directly proportional to the amount of RAM the Namenode has. As if that weren’t enough, the Namenode becomes completely saturated under write-intensive workloads, and is then unable to respond to even simple client-side queries like “ls”. Have a look at Shvachko’s paper, on which we’ve based our work; it describes these problems at great length and depth.

Long story short, the needs of the hour are:

  • High availability for the Namenode, i.e., no single point of failure.
  • Horizontal scalability for the Namenode, i.e., to handle heavier loads, one should only need to add more Namenodes to the system rather than upgrade a single Namenode’s hardware.

Our solution

In order to recover from crashes, the Namenode maintains a journal of all changes that it makes to the metadata. This pretty much involves logging every operation to disk, and there is quite a large piece of code dedicated to it. However, the database community has been doing journaling, checkpointing and replicated storage for quite a while. So if you haven’t guessed our solution yet, here it is:

“Move all of the Namenode’s metadata storage into an in-memory, replicated, share-nothing distributed database.”

In short, the Namenodes themselves are reduced to stateless frontends to the database, and fetch state into memory only when required. This comes with the added advantage of being able to run multiple stateless Namenodes for the same filesystem namespace. We chose MySQL Cluster as our database because of its widespread use and stability. So for the filesystem to scale to a larger number of files, one needs to add more MySQL Cluster Datanodes, which moves the bottleneck from the Namenode’s RAM to the DB’s storage capacity. For the filesystem to handle heavier workloads, one only needs to add more Namenode machines and divide the load amongst them. Another interesting aspect is that if a single Namenode machine has to reboot, it needn’t fetch any state into memory and will be ready for action within a few seconds (although it still has to sync with the Datanodes). Another advantage of our design is that the modifications do not affect clients or Datanodes in any way, except that we might need to find a way to divide the load among the Namenodes.
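
To give a flavour of what a “stateless frontend to the database” looks like in practice, here’s a minimal sketch, assuming ClusterJ, of how a Namenode process could obtain database sessions. The connect string, database name and class name are illustrative; the actual connection handling in our code lives in DBConnector.java.

    import java.util.Properties;

    import com.mysql.clusterj.ClusterJHelper;
    import com.mysql.clusterj.Session;
    import com.mysql.clusterj.SessionFactory;

    // Sketch of a connection helper for a stateless Namenode: no namespace is held
    // in the Namenode's heap; metadata is read from MySQL Cluster on demand.
    public class DBConnectorSketch {
        private static SessionFactory factory;

        public static synchronized SessionFactory getFactory() {
            if (factory == null) {
                Properties props = new Properties();
                // Hypothetical NDB management server address and database name.
                props.setProperty("com.mysql.clusterj.connectstring", "ndb-mgmd.example.com:1186");
                props.setProperty("com.mysql.clusterj.database", "kthfs");
                factory = ClusterJHelper.getSessionFactory(props);
            }
            return factory;
        }

        // Sessions are cheap to obtain, so a rebooted Namenode is ready within seconds.
        public static Session obtainSession() {
            return getFactory().getSession();
        }
    }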

How we did it

We first dissected all the internal protocols used in HDFS, i.e., the client-Namenode, Namenode-Datanode, and client-Datanode protocols. Next, we stripped out all the Namenode code that we didn’t need. This was pretty much the code related to journaling, checkpointing, the secondary Namenode and so forth.

Next, we identified the key data structures we needed to move to the DB. We picked the two most memory-intensive data structures to migrate first: the Inodes and the Blocks.

Since we were heavily time constrained (three months to deliver the project and the report), we decided to focus on functional correctness first and optimise later. So the easiest course of action seemed to be to modify the lowest levels of the call chain, replacing reads/writes from/to memory with query, insert, update and delete operations on the DB. We developed two helper classes, one each for Inodes and Blocks, and interfaced with the DB through them, using the ClusterJ connector to talk to MySQL Cluster. This obviously meant that we needed a flat row representation for Inodes and Blocks in the DB, and we had some other problems to think about along the way. How do we index Inodes? How do we index Blocks? What about Triplets?
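
For the flat row representation, ClusterJ lets you map a table in MySQL Cluster onto an annotated Java interface. Here’s a hedged sketch of what an Inode row could look like; the table name, the columns and the parent-id indexing scheme are illustrative choices rather than the exact schema in our prototype.

    import com.mysql.clusterj.annotation.Column;
    import com.mysql.clusterj.annotation.Index;
    import com.mysql.clusterj.annotation.PersistenceCapable;
    import com.mysql.clusterj.annotation.PrimaryKey;

    // Illustrative flat-row mapping for an Inode.
    @PersistenceCapable(table = "inodes")
    public interface InodeTable {
        @PrimaryKey
        long getId();
        void setId(long id);

        // One answer to "how do we index Inodes?": an index on the parent id (together
        // with the name column) turns path resolution into a series of indexed lookups.
        @Index(name = "idx_parent")
        @Column(name = "parent_id")
        long getParentId();
        void setParentId(long parentId);

        String getName();
        void setName(String name);

        // 1 for directories, 0 for files (kept as a plain numeric column for simplicity).
        byte getIsDir();
        void setIsDir(byte isDir);
    }

A helper class for Inodes then essentially reduces to ClusterJ calls such as session.find(), session.makePersistent() and session.deletePersistent() against row interfaces like this one, and similarly for Blocks.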

All in all, we tackled the problem of scaling the Namenode with a set of design decisions which we later found to be consistent with Shvachko’s later paper on the Namenode’s scalability, except that he suggests using HBase as the DB.

Current status

  • Multiple stateless Namenodes run merrily, storing Inode- and Block-related metadata in MySQL Cluster. As a validation test, clients can issue an “ls” query to any Namenode and see a consistent view of the filesystem, regardless of which Namenode wrote the content to the DB.
  • We’re trying to ensure functional correctness using the HDFS unit tests. We got the most important ones to pass, and decided to leave some further bug fixing for later because we needed to evaluate the system as part of the course.
  • We’ve been evaluating the system using the Synthetic Load Generator. Horizontal scalability has clearly been achieved: adding more Namenodes improves the average number of operations per second for different workloads. With write-intensive workloads, the scalability is linear in terms of total operations/sec executed.

Current limitations

Obviously, our work isn’t all rainbows and sunshine; there’s a long way to go. Here’s what we don’t have yet and are currently addressing:

  • Performance improvements. With a single load-generator thread throwing requests at the Namenode, we’re within a tenth of the original Namenode’s performance, because reads/writes that used to go to memory now go over a network to a database cluster (which is OK, I guess). But with more LoadGen threads, we’re hitting a hefty bottleneck, which I’ll describe in the next point.
  • The Namenode isn’t fully stateless yet. The most important data structures we’ve yet to move are the DatanodeDescriptor entities and the file leases. There’ll surely be more, but these are the most crucial ones. Once full statelessness is achieved, we can eliminate the read-write locks in the code, which are no longer needed in our implementation (the Namenode currently uses a multiple-reader-single-writer concurrency model). Profiling experiments indicated that the Namenode spends around 70% of its time waiting to acquire write locks. Once the Namenode is fully stateless, we can wrap FSNamesystem operations in database transactions which can be batched, and let MySQL Cluster (which handles write-heavy transactions really well) take care of serialisability for us; see the sketch after this list. We could even break away from the single-writer model that the Namenode currently uses. Will this lead to a higher throughput for write operations than the original Namenode? Maybe. 🙂
  • Clients and Datanodes have to be statically partitioned across the Namenodes for now (this sufficed for our evaluations). We need a way for them to pick a random Namenode to perform their operations against.
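
To illustrate the transaction-wrapping idea from the second point above, here’s a rough sketch, assuming ClusterJ, of how a metadata operation could be delimited by a database transaction once the Namenode is fully stateless. The operation shown (a simple rename), the InodeTable interface from the earlier sketch and the surrounding class are all hypothetical; the point is that MySQL Cluster, rather than a Namenode-wide write lock, serialises concurrent writers.

    import com.mysql.clusterj.Session;
    import com.mysql.clusterj.Transaction;

    public class TransactionalOpSketch {
        // Sketch: one FSNamesystem-style operation wrapped in a ClusterJ transaction.
        public void renameInode(Session session, long inodeId, String newName) {
            Transaction tx = session.currentTransaction();
            tx.begin();
            try {
                // Row interface from the earlier Inode sketch.
                InodeTable inode = session.find(InodeTable.class, inodeId);
                inode.setName(newName);
                session.updatePersistent(inode);
                tx.commit();
            } catch (RuntimeException e) {
                tx.rollback();
                throw e;
            }
        }
    }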

Talk is cheap, show me the code!

The code is publicly available here for thy scrutiny. You’ll also need a MySQL Cluster set up in order to test this (we have a hard-coded set of defaults in DBConnector.java which you can politely ignore). 🙂 Here’s our presentation on it as well. We’ve dubbed our prototype implementation KTHFS (because we’re students at KTH, but yes, no points for creativity on that one).

Future work

As an academic obligation, here’s the future work (and no, I’m not going to write stuff we’ll never do).

One member of the team (not me) will be continuing this work as part of his Master’s thesis, and plans to address the above-mentioned limitations as part of it. I’ll try to contribute as well during my free time (what are weekends for anyway?). Let’s see how this goes. 🙂

12 thoughts on “Towards a Scalable and Highly Available HDFS Namenode”

  1. Pingback: Scalable and Highly Available HDFS Namenode

  2. Aditya

    Interesting stuff… the problem is widely known. Using a database cluster to do the work you need for availability is an interesting thought; many systems already work along similar lines. But the fact that you’ve got it working!!! AWESOME!!! Way to go!!!

    I’ll try this out in my free time. Keep me posted on further developments.

  3. mudit

    Awesome work Lalith. You guys are truly the EMDC rockstars 🙂 This article is making a lot of news.

    Well, just a small thought though. With the small-scale regression testing you guys have done so far, it is evident that the efficiency of HDFS is at stake (speaking in terms of KTHFS’s ability to perform operations when compared to the original file system).

    The basic layman’s idea behind any distributed system is to be able to perform operations faster, and so is the case with Hadoop. With this new approach, the whole system’s performance while executing tasks (which is to get/put data from Datanodes) is proportionally dependent on the rate at which metadata can be fetched from and stored in the distributed in-memory MySQL DB. Not to mention that above the file system we have the MapReduce execution environment.

    So we end up having three clusters stacked one over another, in terms of data requirements & job execution:

    map-reduce cluster
    HDFS cluster
    MySQL cluster

    With the problem of high availability, which your solution addresses, it is reasonable enough to use the distributed approach to prevent a single point of failure. However, it is not necessary to use the distributed approach when the machine is working fine.

    For example, let’s say we have three Namenodes x, y, z with associated MySQL datanodes d1, d2, d3, and now x needs to save metadata m. Assuming that we have redundancy, m is stored in d2 & d3. The next time we need m, it will be fetched from d2 or d3 across the wire. But do we really need this distributed approach for a normal operation, when it would have been better served if m were in d1? Locality!!

    To recover from failure, yes.

    So let the metadata be saved as it is saved by Hadoop now, and on top of that, just save a copy in the MySQL distributed DB. HDFS would keep working as it does now, using its own efficient mechanism; in case there is a failure, all we need to do is recover state from the MySQL cluster. Every time metadata is written on the Namenode, just make a copy in the MySQL cluster beneath.

    In short, the distributed approach can be used only for preparing the system (the Namenode) to recover from failure safely & efficiently. For the rest of the work, let it be as is. Load balancing can be used only when the memory on the Namenode is not enough to fulfil the requirement.

    I am not sure if I am making any sense here, but these were some random thoughts I wanted to share. Of course, you guys know much more than me, so even a single-line comment saying “it’s all rubbish” will do 🙂

    Cheers,
    Mudit

    1. lalithsuresh Post author

      Thanks for the feedback Mudit!

      You are correct in pointing out that there is higher latency involved in our implementation. But bear in mind that HDFS is optimised for throughput, not latency. It doesn’t matter how long an individual operation takes, as long as we can have a whole bunch of operations being executed per second in the cluster.

      Further, note that operations in HDFS are data intensive, not metadata intensive. Let’s consider a client trying to simply read a file named “foo” of size 10GB from HDFS. Assume that the blocks are of size 512MB (implying that we have 20 blocks in total). The client first asks the Namenode for the list of blocks corresponding to the file “foo”. The HDFS Namenode would return this list in around 1ms; the KTHFS one would take around 10-20ms in our not-so-optimised version now, but let’s put it at 100ms even. The client then proceeds to read 10GB of data from the Datanodes over the course of several minutes. So the 100ms of latency that we add to the metadata operation is pretty much shadowed by the duration of the normal data operation, which remains unaffected compared to normal HDFS and is on the order of minutes.

      Again, since throughput is the key, not latency, we can add more and more Namenodes in KTHFS to increase the overall number of metadata operations per second that can be executed. Even at this stage of development, we have linear scalability in the number of operations per second for write-intensive workloads. So it’s not only about high availability, it’s also about scaling out.

      Like I said, we’re still far from getting the most out of this design, and are working on mapping the Namenode internals better onto the kind of DB operations we’re doing. And by the way, “over the wire” on the cluster here at SICS is around 0.1ms. 🙂

  4. Mat Keep

    Absolutely fantastic work. The MySQL Cluster development team is based in Stockholm, so feel free to mail me and we can hook you all up: mat dot keep at oracle dot com

  5. Pingback: Let the thesis begin! | Wasif Malik

  6. Pingback: Command history « Comfortably Geek
