Hadoop Distributed File System (HDFS) Platform

Context:
- It can (typically) be implemented in a Java Programming Language.
- It can be associated with an HDFS Cluster.
- It can use the TCP/IP layer for communication
- It can enable clients to communicate between to each other via RPCs.
- It can (typically) store 64 MB sized file blocks.
- It can because like a single storage volume that has been optimized for many concurrent serialized reads of very large datasets (GB, PB). [1]
- It does not efficiently support random access operations. [2]
- It can be interacted with via a Hadoop Shell Command [3].
Example(s):
- HDFS v1.
- HDFS v2 (in Hadoop v2).
- HDFS v3, e.g. HDFS v3.1.1 ~2018-08-08.
- AWS EMRFS.
- …
Counter-Example(s):
See: Apache Spark.

References

http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/#N20095
- … The storage layer, called the Hadoop File System (HDFS), looks like a single storage volume that has been optimized for many concurrent serialized reads of large data files - where "large" might be measured in gigabytes or petabytes. However, it does have limitations. For example, random access to the data is not really possible in an efficient manner. And Hadoop only supports a single writer for output. But this limit helps make Hadoop very performant and reliable, in part because it allows for the data to be replicated across the cluster, reducing the chance of data loss.

http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System
- … Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes form the HDFS cluster. The situation is typical because each node does not require a datanode to be present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate between each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB), across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files. The HDFS does not provide High Availability.

(Borthaku, 2007) ⇒ Dhruba Borthakur. (2007). “The Hadoop Distributed File System: Architecture and Design." Hadoop Project Website.
- QUOTE: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.