MapReduce Job
A MapReduce Job is a distributed data-processing job (composed of a MapReduce Mapper Job and a MapReduce Reducer Job) that runs on a MapReduce system.
- Context:
- It can (typically) be applied to Highly Parallelizable large data tasks.
- It can (typically) be a Hadoop MapReduce Job.
- Example(s):
- an Apache Pig Job.
- …
- Counter-Example(s):
- See: MapReduce Mapping Phase, MapReduce Reducer Phase, Hadoop API, Data Processing Job.
References
2012
- http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Overview
- … A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. ...
...
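To make the excerpt above concrete, here is a minimal sketch of a driver that configures and submits such a job through the Hadoop API. It assumes the newer org.apache.hadoop.mapreduce API (the quoted tutorial documents the older r0.20.2 interface), and the class names WordCountDriver, WordCountMapper, and WordCountReducer are hypothetical; the mapper and reducer themselves are sketched after the Wikipedia excerpt below.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");

    // Bundle the (hypothetical) Mapper and Reducer classes with the job.
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Both the input and the output of the job live in a file system.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // The framework splits the input, schedules map and reduce tasks across the
    // cluster, monitors them, and re-executes failed tasks; the driver just waits.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a driver would typically be launched with something like hadoop jar wordcount.jar WordCountDriver <input path> <output path>.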
- http://en.wikipedia.org/wiki/MapReduce#Overview
- MapReduce is a framework for processing embarrassingly parallel problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease transmission of data.
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice it is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time, or if the reduction function is associative. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.
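To illustrate the "Map" and "Reduce" steps just described, below is a minimal word-count sketch against the Hadoop org.apache.hadoop.mapreduce API; the class names match the hypothetical driver sketch above, and in practice each class would live in its own source file.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// "Map" step: each input split is processed independently, so all map tasks can run
// in parallel; the mapper emits one (word, 1) pair per token.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// "Reduce" step: the framework sorts and groups the map output so that every value
// sharing a key is presented to the same reducer, which combines them (here, a sum).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    total.set(sum);
    context.write(word, total);
  }
}
```

Because summing is associative, the same reducer class could also safely be reused as a combiner (see the sketch after the 2011 excerpt below).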
2011
- http://binarynerd.com/java-tutorials/distributed-computing/hadoop-mapreduce-101.html
- The mapping part of a job is distributed to all nodes on a cluster. Each node can then spawn multiple mappers (e.g., one per core of a quad-core CPU) to process the data residing on that computer. All of this occurs in parallel.
The process that a MapReduce job goes through to completion is as follows:
- The number of map and reduce functions is defined by the job you are running. Each map processes its input data, sorts it to group the keys, and writes it to disk. The job defines how many reduce functions you wish to apply to the output from the maps.
- Each reduce needs to see all the data for a given key. For a single reduce running for the job, all the outputs from each map are sent to the node in the cluster that is running that reduce. Before the reduce runs, the data from each map is merged to group all the keys.
- For multiple reducers, the maps partition their output, creating one partition per reduce. The partitions are sent to the correct reduce. This ensures that all the data for a given key is processed by a single reduce.
- To help reduce the amount of data that needs to be sent over the network, a combine function can be applied to the output of a map. This has the effect of running a reduce on the output from the map, minimizing the amount of data that needs to be transferred to the reducers and speeding up the execution time of the overall job.
- The mapper processes the unindexed data using a Key/Value method. ...
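The partitioning and combine function described in this excerpt correspond to Hadoop's Partitioner and combiner hooks. Below is a small sketch, with a hypothetical FirstLetterPartitioner and a purely illustrative partitioning scheme, showing how they could be wired onto the Job object from the earlier driver sketch; reusing WordCountReducer as the combiner is only safe because summing is associative and commutative.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

// A partitioner decides which reduce task receives a given map output key, so that
// all values sharing that key are processed by a single reduce.
class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0) {
      return 0; // defensive: map-only jobs do not use the partitioner
    }
    // Illustrative scheme: bucket keys by their first character.
    String k = key.toString();
    char first = k.isEmpty() ? '\0' : k.charAt(0);
    return first % numReduceTasks;
  }
}

class PartitionAndCombineSetup {
  // Extra settings on the same Job object built in the earlier driver sketch.
  static void configure(Job job) {
    job.setNumReduceTasks(4);                               // maps create one partition per reduce
    job.setPartitionerClass(FirstLetterPartitioner.class);  // route each key to one reduce
    // The combiner runs a "mini-reduce" on each map's local output before it is sent
    // over the network, cutting down the data transferred to the reducers.
    job.setCombinerClass(WordCountReducer.class);
  }
}
```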