Hello Friend

In the last few months, there has been numerous blogposts about the end of the Hadoop-era. It is true that:

Hadoop, as a distributed-system, is hard to operate, but can be essential for some type of workload. As Hadoop is based on GFS, we can wonder how GFS evolved inside Google.

Hadoop's story

Hadoop is based on a Google's paper called The Google File System published in 2003. There are some key-elements on this paper:

  • It was designed to be deployed with Borg,
  • to “simplify the overall design problem", they:
    • implemented a single master architecture
    • dropped the idea of a full POSIX-compliant file system
  • Metadatas are stored in RAM in the master,
  • Datas are stored within chunkservers,
  • There is no YARN or Map/Reduce or any kind of compute capabilities.

Is Hadoop still revelant?

Google with GFS and the rest of the world with Hadoop hit some issues:

  • One (Metadata) machine is not large enough for large FS,
  • Single bottleneck for metadata operations,
  • Not appropriate for latency sensitive applications,
  • Fault tolerant not HA,
  • Unpredictable performance,
  • Replication's cost,
  • HDFS Write-path pipelining,
  • fixed-size of blocks,
  • cost of operations,

Despite all the issues, Hadoop is still relevant for some usecases, such as Map/Reduce, or if you need Hbase as a main datastore. There is stories available online about the scalability of Hadoop:

Nowadays, Hadoop is mostly used for Business Intelligence or to create a datalake, but at first, GFS was designed to provide a distributed file-system on top of commodity servers.

Google's developers were/are deploying applications into “containers”, meaning that any process could be spawned somewhere into the cloud. Developers are used to work with the file-system abstraction, which provide a layer of durability and security. To mimic that process, they developed GFS, so that processes don't need to worry about replication (like Bigtable/HBase).

This is a promise that, I think, was forgotten. In a world where Kubernetes seems to be the standard, the need of a global distributed file-system is now higher than before. By providing a “file-system” abstraction for applications deployed in Kubernetes, we may be solving many problems Kubernetes-adopters are hitting, such as:

Well, let's put Hadoop in Kubernetes!

Putting a distributed systems inside Kubernetes is currently a unpleasant experience because of the current tooling:

How GFS evolved within Google

As GFS's paper was published in 2003, we can ask ourselves if GFS has evolved. And it did! The sad part is that there is only a few informations about this project codenamed Colossus. There is no papers, and not a lot informations available, here's what can be found online:

  • From Storage Architecture and Challenges(2010):

    • They moved from full-replication to Reed-Salomon. This feature is acually in Hadoop 3,
    • replication is handled by the client, instead of the pipelining,
    • the metadata layer is automatically sharded. We can find more informations about that in the next ressource!
  • From Cluster-Level Storage @ Google(2017):

    • GFS master replaced by Colossus
    • GFS chunkserver replaced by D
    • Colossus rebalances old, cold data
    • distributes newly written data evenly across disks
    • Metadatas are stored into BigTable. each Bigtable row corresponds to a single file.

The “all in RAM” GFS master design was a severe single-point-of-failure, so getting rid of it was a priority. They didn't had a lof of options for a scalable and rock-solid datastore beside BigTable. When you think about it, a key/value datastore is a great replacement for a distributed file-system master:

  • automatic sharding of regions,
  • scan capabilities for files in the same “directory”,
  • lexical ordering,

The funny part is that they now need a Colossus for Colossus. The only things saving them is that storing the metametametadata (the metadata of the metadata of the metadata) can be hold in Chubby.

  • From GFS: Evolution on Fast-forward(2009)

    • they moved to chunks of 1MB of files, as the limitations of the master disappeared. This is also allowing Colossus to support latency sensitive applications,
  • From a Github comment on Colossus:

    • File reconstruction from Reed-Salomnon was performed on both client-side and server-side
    • on-the-fly recovery of data is greatly enhanced by this data layout(Reed Salomon)
  • From a Hacker News comment:

    • Colossus and D are two separate things.

What is that “D”?

Is there an open-source effort to create a Colossus-like DFS?

I did not found any point towards a open-source version of Colossus, beside some work made for The Baidu File System in which the Nameserver is implemented as a raft group.

There is some work to add colossus's features in Hadoop but based on the bad publicity Hadoop has now, I don't think there will be a lot of money to power those efforts.

I do think that rewriting an distributed file-system based on Colossus would be a huge benefit for the community:

  • Reimplement D may be easy, my current question is how far can we use modern FS such as OpenZFS to facilitate the work? FS capabilities such as OpenZFS checksums seems pretty interesting.
  • To resolve the distributed master issue, we could use Tikv as a building block to provide an “BigTable experience” without the need of a distributed file-system underneath.

But remember:

Like crypto, Do not roll your own DFS!


Thank you for reading my post! Feel free to react to this article, I am also available on Twitter if needed.