In the last few months, there have been numerous blog posts about the end of the Hadoop era. It is true that:
Hadoop, as a distributed system, is hard to operate, but it can be essential for some types of workload. As Hadoop is based on GFS, we can wonder how GFS evolved inside Google.
Hadoop is based on a Google paper called The Google File System, published in 2003. There are some key elements in this paper:
Google with GFS and the rest of the world with Hadoop hit some issues:
Despite all the issues, Hadoop is still relevant for some use cases, such as Map/Reduce, or if you need HBase as a main datastore. There are stories available online about the scalability of Hadoop:
Nowadays, Hadoop is mostly used for Business Intelligence or to create a data lake, but at first, GFS was designed to provide a distributed file system on top of commodity servers.
Google’s developers were (and still are) deploying applications into “containers”, meaning that any process could be spawned anywhere in the cloud. Developers are used to working with the file-system abstraction, which provides a layer of durability and security. To mimic that experience, they developed GFS, so that processes (like Bigtable/HBase) don’t need to worry about replication.
This is a promise that, I think, was forgotten. In a world where Kubernetes seems to be the standard, the need for a global distributed file system is now higher than before. By providing a “file-system” abstraction for applications deployed in Kubernetes, we may solve many problems Kubernetes adopters are hitting, such as:
Putting a distributed system inside Kubernetes is currently an unpleasant experience because of the current tooling:
As GFS’s paper was published in 2003, we can ask ourselves if GFS has evolved. And it has! The sad part is that very little information is available about this project, codenamed Colossus. There are no papers; here’s what can be found online:
The “all in RAM” GFS master design was a severe single point of failure, so getting rid of it was a priority. They didn’t have a lot of options for a scalable and rock-solid datastore besides Bigtable. When you think about it, a key/value datastore is a great replacement for a distributed file-system master:
The funny part is that they now need a Colossus for Colossus. The only thing saving them is that the metametametadata (the metadata of the metadata of the metadata) can be held in Chubby.
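To make the key/value idea more concrete, here is a minimal sketch of a file-system master backed by a sorted key/value store. Everything in it is an assumption made for illustration: `SortedKV` is a toy in-memory stand-in (not Bigtable’s API), and `FileMeta`, `Master`, and the path-as-key layout are hypothetical names, not Colossus’s actual design, which is not public.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """Hypothetical per-file metadata: only the list of chunk IDs."""
    chunks: list = field(default_factory=list)


class SortedKV:
    """Toy in-memory sorted key/value store (an assumption, not Bigtable's API)."""

    def __init__(self):
        self._keys = []   # kept sorted so a prefix scan is a range scan
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan_prefix(self, prefix):
        # Jump to the first key >= prefix, then read until the prefix stops matching.
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._data[key]


class Master:
    """File-system master whose whole state lives in the key/value store."""

    def __init__(self, store):
        self.store = store

    def create(self, path, chunks):
        # Key = full path, value = metadata row.
        self.store.put(path, FileMeta(chunks=list(chunks)))

    def lookup(self, path):
        return self.store.get(path)

    def list_dir(self, directory):
        # Lists every path under the directory (recursively) with one range scan.
        prefix = directory.rstrip("/") + "/"
        return [key for key, _ in self.store.scan_prefix(prefix)]
```

With this layout, `Master(SortedKV())` can `create("/logs/a", ["c1", "c2"])`, look the file up again by path, and list `/logs` with a single prefix scan. The master itself holds no state in RAM, so scaling and durability become the datastore’s problem rather than the master’s.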
From a Hacker News comment:
What is that “D”?
I did not find any pointers to an open-source version of Colossus, besides some work done for the Baidu File System, in which the Nameserver is implemented as a Raft group.
There is some work to add Colossus’s features to Hadoop, but given the bad publicity Hadoop has now, I don’t think there will be a lot of money to power those efforts.
I do think that rewriting a distributed file system based on Colossus would be a huge benefit for the community:
Like crypto, do not roll your own DFS!
Thank you for reading my post! Feel free to react to this article; I’m also available on Twitter if needed.