Apache Hadoop Data-Processing Framework
An Apache Hadoop Data-Processing Framework is a Java-based, distributed, fault-tolerant batch data-processing software framework.
- Context:
- It can (typically) include a Hadoop Distributed Storage System.
- It can (typically) include a Hadoop Computation Framework.
- It can (typically) be used to develop a Hadoop-based System (a MapReduce system on a Hadoop cluster; a minimal sketch follows this list).
- It can (typically) be produced by the Apache Hadoop Project.
- It can define a Hadoop API (to interface with Non-Java Code), such as: Hadoop Pipes (for C++ code) and Hadoop Streaming (for any executable that reads from stdin and writes to stdout).
- It can be a foundational component of: Apache Hive, Apache Mahout, Apache Spark, ....
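A minimal sketch of such a Hadoop-based MapReduce system, written against the native Java MapReduce API (rather than the Hadoop Streaming or Hadoop Pipes interfaces mentioned above), is the canonical word-count job below; the class and job names are illustrative, not taken from any specific distribution.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```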
- Example(s):
- Hadoop v3[1]
- …
- Hadoop v2.
- Hadoop r2.7.1[2] (2015-07-06).
- Hadoop r2.6.0[3] (2014-11-18).
- Hadoop r2.4.0[4] (2014-03-07).
- Hadoop v1.
- Hadoop r1.2.1[5] (2013-08-01).
- Counter-Example(s):
- See: Apache Lucene, Apache UIMA, Apache OpenNLP, Hortonworks.
References
2014
- (Stonebraker, 2014a) ⇒ Michael Stonebraker. (2014). “Hadoop at a Crossroads?.” In: BLOG@CACM, August 5, 2014
- QUOTE: Historically, “Hadoop” referred to the open source version of MapReduce written by Yahoo. However, Impala has thrown this layer out of the stack. How can one be a Hadoop vendor, when Hadoop is no longer in the mainstream stack?
The answer is simple: redefine “Hadoop”, and that is exactly what the Hadoop vendors have done. The word “Hadoop” is now used to mean the entire stack. In other words, HDFS is at the bottom, on top of which run Impala, MapReduce and other systems. On top of these systems run higher-level software such as Mahout. The word “Hadoop” is used to refer to the entire collection.
2013
- http://en.wikipedia.org/wiki/Hadoop
- QUOTE: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce[1], where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.[2] It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.
The entire Apache Hadoop “platform” is now commonly considered to consist of the Hadoop kernel, MapReduce and Hadoop Distributed File System (HDFS), as well as a number of related projects – including Apache Hive, Apache HBase, and others.[3]
Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors.[4] Hadoop and its related projects (Hive, HBase, Zookeeper, and so on) have many contributors from across the ecosystem.[5]
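To make the automatic failure handling described in the quote concrete: in Hadoop 2.x, per-task re-execution and speculative execution are controlled through job configuration properties. The snippet below is a minimal sketch; the property names are standard Hadoop 2 keys, and the values shown (the framework defaults) are for illustration only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Each failed map or reduce task is retried (re-executed, possibly on
    // another node) up to this many times before the whole job is failed.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);

    // Speculative execution: run backup copies of slow ("straggler") tasks
    // on other nodes and keep whichever copy finishes first.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", true);

    Job job = Job.getInstance(conf, "fault-tolerance demo"); // illustrative name
    // ... mapper/reducer setup and submission as in any MapReduce driver ...
  }
}
```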
2012
- http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/#N20095
- QUOTE: … Hadoop does this by providing a storage layer that holds vast amounts of data, and an execution layer that runs an application in parallel across the cluster, using coordinated subsets of the stored data. … Hadoop, according to its documentation, can be configured to run in three modes: standalone mode (i.e., on the local computer, useful for testing and debugging), pseudo-distributed mode (i.e., on an emulated "cluster" of one computer, useful for testing), and fully-distributed mode (on a full cluster, for production purposes). The pseudo-distributed mode does not add value for most purposes, and will not be discussed further.
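These three modes are selected through a handful of configuration properties rather than different binaries. As a rough sketch (the property names are standard Hadoop 2 keys; the localhost address is an illustrative assumption), a client-side Configuration might distinguish local from distributed operation like this:

```java
import org.apache.hadoop.conf.Configuration;

public class ModeConfigSketch {
  // Standalone/local mode: local filesystem, in-process job runner.
  static Configuration standalone() {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");
    conf.set("mapreduce.framework.name", "local");
    return conf;
  }

  // Pseudo-distributed or fully-distributed mode: HDFS plus YARN.
  // In pseudo-distributed mode all daemons run on one machine (hence
  // localhost); on a real cluster the NameNode host would differ.
  static Configuration distributed() {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // illustrative address
    conf.set("mapreduce.framework.name", "yarn");
    return conf;
  }
}
```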
2011
- http://hadoop.apache.org/#What+Is+Hadoop%3F
- QUOTE: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these subprojects:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
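To make the HDFS subproject concrete, the sketch below uses the standard org.apache.hadoop.fs.FileSystem client API to write a file and read it back; the cluster address and file path are illustrative assumptions, not part of the quoted documentation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIoSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // illustrative address

    // FileSystem is the uniform client-side entry point; the same code
    // runs against HDFS, the local filesystem, and other implementations.
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/hello.txt"); // hypothetical path

      // Write: HDFS splits the file into large blocks and replicates
      // each block across DataNodes for fault tolerance.
      try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
        out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back.
      try (FSDataInputStream in = fs.open(path);
           BufferedReader reader = new BufferedReader(
               new InputStreamReader(in, StandardCharsets.UTF_8))) {
        System.out.println(reader.readLine());
      }
    }
  }
}
```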
- (Apache Blog, 2011-02-14) ⇒ Apache Blog. (2011). “Apache UIMA and Apache Hadoop Advance Data Intelligence and Semantics Capabilities of Watson Supercomputer.” Feb 14, 2011
- Apache UIMA: standards-based frameworks, infrastructure and components that facilitate the analysis and annotation of an array of unstructured content (such as text, audio and video). Watson uses Apache UIMA for real-time content analytics and natural language processing, to comprehend clues, find possible answers, gather supporting evidence, score each answer, compute its confidence in each answer, and improve contextual understanding (machine learning) – all under 3 seconds.
- Apache Hadoop: software framework that enables data-intensive distributed applications to work with thousands of nodes and petabytes of data. A foundation of Cloud computing, Apache Hadoop enables Watson to access, sort, and process data in a massively parallel system (90+ server cluster/2,880 processor cores/16 terabytes of RAM/4 terabytes of disk storage).
- The Watson system uses UIMA as its principal infrastructure for component interoperability and makes extensive use of the UIMA-AS scale-out capabilities that can exploit modern, highly parallel hardware architectures. UIMA manages all work flow and communication between processes, which are spread across the cluster. Apache Hadoop manages the task of preprocessing Watson's enormous information sources by deploying UIMA pipelines as Hadoop mappers, running UIMA analytics.
2010
- http://wiki.apache.org/hadoop
- QUOTE: Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
- http://wiki.apache.org/hadoop/ProjectDescription
- QUOTE: Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs. …
… The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server or jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker is the point of interaction between users and the framework. Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
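The user-facing side of this jobtracker/tasktracker architecture is the job-submission client. The sketch below uses the classic org.apache.hadoop.mapred API of that era with the library-provided identity mapper and reducer so it stays self-contained; it is an illustrative sketch, not the wiki's own example. JobClient is the component that hands the job to the jobtracker's pending queue.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitJobSketch.class);
    conf.setJobName("submit-sketch"); // illustrative job name

    // Identity mapper/reducer pass key/value pairs through unchanged;
    // with the default TextInputFormat, keys are byte offsets
    // (LongWritable) and values are lines of text (Text).
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job to the jobtracker, which queues it and assigns its
    // map and reduce tasks to tasktrackers; blocks until the job completes.
    JobClient.runJob(conf);
  }
}
```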