TLDR;  A new class of hierarchical distributed file system with scaleout metadata has taken over at Google, Facebook, and Microsoft that provides a s

Scaleout Metadata: hardest problem in Data. How we build HopsFS on it.

submited by
Style Pass
2021-05-31 13:00:07

TLDR;  A new class of hierarchical distributed file system with scaleout metadata has taken over at Google, Facebook, and Microsoft that provides a single centralized file system that manages the data for an entire data center, scaling to Exabytes in size. The common architectural feature of these systems is scaleout metadata, so we call them scaleout metadata file systems. Scaleout metadata file systems belie the myth that hierarchical distributed file systems do not scale, so you have to redesign your applications to work with object stores, and their weaker semantics. We have built a scaleout metadata file system, HopsFS, that is open-source, but its primary use case is not Exabyte storage, rather customizable consistent metadata for the Hopsworks Feature Store. Scaleout metadata is also the key technology behind Snowflake, but here we stick to file systems.

Google, Microsoft, and Facebook have been pushing out the state-of-the-art in scalable systems research in the last 15 years. Google has presented systems like MapReduce, GFS, Borg, and Spanner. Microsoft introduced CosmosDB, Azure Blob Storage and federated YARN. Facebook has provided Hive, Haystack, and F4 systems. All of these companies have huge amounts of data (Exabytes) under management, and need to efficiently, securely, and durably store that data in data centers. So, why not unify all storage systems within a single data center to more efficiently manage all of its data? That’s what Google and Facebook have done with Colossus and Tectonic, respectively. The other two scaleout metadata file systems covered here, ADLSv2 and HopsFS, were motivated by similar scalability challenges but, although they could be, they are typically not deployed as data center scale file systems, just as scalable file systems for analytics and machine learning.

Leave a Comment