Pierre Zemb's Blog

What can be gleaned about GFS's successor, codenamed Colossus?


In the last few months, there have been numerous blog posts about the end of the Hadoop era. It is true that:

Hadoop, as a distributed system, is hard to operate, but it can be essential for some types of workloads. Since Hadoop is based on GFS, we can wonder how GFS evolved inside Google.

πŸ”—Hadoop's story

Hadoop is based on a Google paper called The Google File System, published in 2003. There are some key elements in this paper:

πŸ”—Is Hadoop still relevant?

Google with GFS, and the rest of the world with Hadoop, hit some issues:

Despite all the issues, Hadoop is still relevant for some use cases, such as MapReduce, or if you need HBase as your main datastore. There are stories available online about the scalability of Hadoop:

Nowadays, Hadoop is mostly used for Business Intelligence or to create a data lake, but at first, GFS was designed to provide a distributed file-system on top of commodity servers.

Google's developers were/are deploying applications into "containers", meaning that any process could be spawned somewhere in the cloud. Developers are used to working with the file-system abstraction, which provides a layer of durability and security. To preserve that abstraction, they developed GFS, so that processes don't need to worry about replication (much like Bigtable/HBase).

This is a promise that, I think, was forgotten. In a world where Kubernetes seems to be the standard, the need for a global distributed file-system is higher than ever. By providing a "file-system" abstraction for applications deployed in Kubernetes, we may solve many of the problems Kubernetes adopters are hitting, such as:

πŸ”—Well, let's put Hadoop in Kubernetes

Putting a distributed system inside Kubernetes is currently an unpleasant experience because of the current tooling:

πŸ”—How GFS evolved within Google

πŸ”—Technical overview

As GFS's paper was published in 2003, we can ask ourselves whether GFS has evolved. And it did! The sad part is that there is only a little information about this project, codenamed Colossus. There are no papers and not much information available; here's what can be found online:

The "all in RAM" GFS master design was a severe single-point-of-failure, so getting rid of it was a priority. They didn't had a lof of options for a scalable and rock-solid datastore beside BigTable. When you think about it, a key/value datastore is a great replacement for a distributed file-system master:

The funny part is that they now need a Colossus for Colossus. The only thing saving them is that the metametametadata (the metadata of the metadata of the metadata) is small enough to be held in Chubby.
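To make the idea concrete, here is a minimal sketch (in Python, with made-up key and value layouts; Colossus's real schema is not public) of how a file-system master's state can live in an ordered key/value store: a path lookup becomes a point read, and a directory listing becomes a prefix scan.

```python
import json


class KVMaster:
    """File-system metadata stored as key/value pairs (hypothetical layout)."""

    def __init__(self):
        # Stand-in for a distributed, ordered KV store such as Bigtable.
        self.kv = {}

    def create(self, path, chunks):
        # chunks: list of (chunk_id, [replica locations]) pairs.
        self.kv[path] = json.dumps({"chunks": chunks})

    def lookup(self, path):
        # A client resolves a path with a single point read.
        raw = self.kv.get(path)
        return json.loads(raw) if raw is not None else None

    def list_dir(self, prefix):
        # A directory listing is an ordered prefix scan over the keys.
        if not prefix.endswith("/"):
            prefix += "/"
        return sorted(k for k in self.kv if k.startswith(prefix))


master = KVMaster()
master.create("/logs/2020/app.log", [("c1", ["srvA", "srvB", "srvC"])])
master.create("/logs/2020/web.log", [("c2", ["srvB", "srvD", "srvE"])])
print(master.lookup("/logs/2020/app.log")["chunks"][0][0])  # c1
print(master.list_dir("/logs/2020"))
```

Because the metadata shards like any other key/value data, the master can scale horizontally instead of being bound by a single machine's RAM.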

What is that "D"?

πŸ”—Deployments

How is everything deployed? Using Borg!

πŸ”—Migration

The migration process is described in the now free Case Studies in Infrastructure Change Management book.

πŸ”—Is there an open-source effort to create a Colossus-like DFS?

I did not find any pointer to an open-source version of Colossus, besides some work done for The Baidu File System, in which the Nameserver is implemented as a Raft group.
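For reference, the core idea behind a Raft-replicated nameserver can be sketched as a deterministic state machine that applies namespace operations from a replicated log. This is a generic illustration with hypothetical field names, not Baidu's actual code: Raft itself (leader election, log replication) is elided, and only the replay step is shown.

```python
class NameserverStateMachine:
    """Namespace state rebuilt by replaying committed log entries in order."""

    def __init__(self):
        self.namespace = {}  # path -> metadata

    def apply(self, entry):
        # `entry` is one committed log record; the schema is hypothetical.
        op = entry["op"]
        if op == "create":
            self.namespace[entry["path"]] = entry["meta"]
        elif op == "delete":
            self.namespace.pop(entry["path"], None)


# In a Raft group, every replica applies the same committed entries in the
# same order, so replaying the log yields an identical namespace everywhere.
log = [
    {"op": "create", "path": "/data/a", "meta": {"size": 0}},
    {"op": "create", "path": "/data/b", "meta": {"size": 0}},
    {"op": "delete", "path": "/data/a"},
]

r1, r2 = NameserverStateMachine(), NameserverStateMachine()
for entry in log:
    r1.apply(entry)
    r2.apply(entry)

print(r1.namespace == r2.namespace)  # True
print(sorted(r1.namespace))          # ['/data/b']
```

Determinism of `apply` is what makes the replication safe: as long as replicas agree on the log, they agree on the namespace.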

There is some work to add Colossus's features to Hadoop, but given the bad publicity Hadoop is getting now, I don't think there will be much money to power those efforts.

I do think that writing a distributed file-system based on Colossus's design would be a huge benefit for the community:

But remember:

Like crypto: do not roll your own DFS!


Thank you for reading my post! Feel free to react to this article; I am also available on Twitter if needed.

Tags: #colossus #gfs #hadoop