Hadoop Distributed File System (HDFS) Platform
(Redirected from HDFS-based)
Jump to navigation
Jump to search
A Hadoop Distributed File System (HDFS) Platform is a distributed file system platform within the Hadoop framework (to create HDFS instances) .
- Context:
- It can (typically) be implemented in a Java Programming Language.
- It can be associated with an HDFS Cluster.
- It can use the TCP/IP layer for communication
- It can enable clients to communicate between to each other via RPCs.
- It can (typically) store 64 MB sized file blocks.
- It can because like a single storage volume that has been optimized for many concurrent serialized reads of very large datasets (GB, PB). [1]
- It does not efficiently support random access operations. [2]
- It can be interacted with via a Hadoop Shell Command[3].
- Example(s):
- Counter-Example(s):
- See: Apache Spark.
References
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system Retrieved:2020-8-23.
- The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Some consider it to instead be a data store due to its lack of POSIX compliance, but it does provide shell commands and Java application programming interface (API) methods that are similar to other file systems. A Hadoop is divided into HDFS and MapReduce. HDFS is used for storing the data and MapReduce is used for processing data.
- HDFS has five services as follows:
- Name Node
- Secondary Name Node
- Job tracker
- Data Node
- Task Tracker
- Top three are Master Services/Daemons/Nodes and bottom two are Slave Services. Master Services can communicate with each other and in the same way Slave services can communicate with each other. Name Node is a master node and Data node is its corresponding Slave node and can talk with each other.
2015
- http://ci.apache.org/projects/flink/flink-docs-release-0.8.1/faq.html#is-flink-a-hadoop-project
- Flink is a data processing system and an alternative to Hadoop’s MapReduce component. It comes with its own runtime, rather than building on top of MapReduce. As such, it can work completely independently of the Hadoop ecosystem. However, Flink can also access Hadoop’s distributed file system (HDFS) to read and write data, and Hadoop’s next-generation resource manager (YARN) to provision cluster resources. Since most Flink users are using Hadoop HDFS to store their data, we ship already the required libraries to access HDFS.
2012
- http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/#N20095
- … The storage layer, called the Hadoop File System (HDFS), looks like a single storage volume that has been optimized for many concurrent serialized reads of large data files - where "large" might be measured in gigabytes or petabytes. However, it does have limitations. For example, random access to the data is not really possible in an efficient manner. And Hadoop only supports a single writer for output. But this limit helps make Hadoop very performant and reliable, in part because it allows for the data to be replicated across the cluster, reducing the chance of data loss.
2011
- http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System
- … Each node in a Hadoop instance typically has a single datanode; a cluster of datanodes form the HDFS cluster. The situation is typical because each node does not require a datanode to be present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication; clients use RPC to communicate between each other. The HDFS stores large files (an ideal file size is a multiple of 64 MB), across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence does not require RAID storage on hosts. With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant because the requirements for a POSIX filesystem differ from the target goals for a Hadoop application. The tradeoff of not having a fully POSIX compliant filesystem is increased performance for data throughput. The HDFS was designed to handle very large files. The HDFS does not provide High Availability.
2007
- (Borthaku, 2007) ⇒ Dhruba Borthakur. (2007). “The Hadoop Distributed File System: Architecture and Design." Hadoop Project Website.
- QUOTE: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.