MapReduce

What is MapReduce?

MapReduce is a programming model that runs on Hadoop, a data analytics engine widely used for Big Data. It is used to write applications that run in parallel to process large volumes of data stored on clusters.

Elastic Flexibility

Although MapReduce often performs more slowly than newer processing models, its main advantage is its elasticity: it can scale out rapidly by allocating more compute nodes to shorten computation times. MapReduce can scale across thousands of nodes, largely because it builds on a distributed file system and runs processing close to where the data is stored, rather than moving the data itself. This scalability reduces the cost of storing and processing growing data volumes.

Parallel Processing

With MapReduce, developers do not need to write code for parallelism, distributing data or other complex coding tasks because those are already built into the model. This alone shortens analytical programming time.
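
As a rough illustration of that division of labour, the sketch below uses a toy, single-machine stand-in for the framework. It is not Hadoop's API: `run_job`, `run_map_task`, `map_max_temp` and `reduce_max_temp` are hypothetical names, the input records are made up, and a thread pool stands in for the cluster. The point is only that the two developer-supplied functions contain no parallelism or data-distribution code.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_map_task(map_fn, split):
    """Apply the user's map function to every record in one input split."""
    return [pair for record in split for pair in map_fn(record)]


def run_job(map_fn, reduce_fn, splits, workers=4):
    """Toy stand-in for the framework: it owns parallelism and grouping."""
    # Run one map task per input split, in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_outputs = list(pool.map(lambda s: run_map_task(map_fn, s), splits))

    # Group intermediate values by key (the framework's shuffle step).
    grouped = defaultdict(list)
    for task_output in map_outputs:
        for key, value in task_output:
            grouped[key].append(value)

    # Call the user's reduce function once per key.
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


# Everything below is the only "application" code the developer writes.

def map_max_temp(record):
    """Emit (year, temperature) for one 'year,temperature' record."""
    year, temp = record.split(",")
    yield year, int(temp)


def reduce_max_temp(year, temps):
    """Keep the highest temperature seen for a year."""
    return max(temps)


# Hypothetical input, already divided into three splits.
splits = [
    ["1949,111", "1950,0"],
    ["1950,22", "1949,78"],
    ["1950,-11"],
]
result = run_job(map_max_temp, reduce_max_temp, splits)
print(result)
```

Swapping in a different pair of map and reduce functions changes the analysis without touching `run_job`, which is the property the model relies on.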

Unstructured Data Support

Unstructured data makes up the majority (80 per cent) of data generated, so MapReduce’s ability to process data of any structure helps enterprises analyse their data much more effectively.

Fault Tolerance

MapReduce is less vulnerable to hardware failures that would halt the system because it distributes data across many computers and servers. It does not send a complete set of data to every node. Instead, the underlying distributed file system stores replicated copies of each data block on several nodes (three by default in HDFS), so if one node or piece of hardware fails, the data survives on other nodes and failed tasks can be re-executed automatically.
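
The effect of replication can be sketched with a small simulation. This is a simplified model under stated assumptions, not how HDFS actually places blocks: the round-robin placement, the node and block names, and the functions `place_blocks` and `readable_blocks` are all hypothetical, and real placement policies also take rack topology into account.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); real placement is more sophisticated.
    """
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3"]
placement = place_blocks(blocks, nodes)

# Losing any single node leaves every block readable.
survives_single_failure = all(
    readable_blocks(placement, [node]) == set(blocks) for node in nodes
)
print(survives_single_failure)

# Losing all three holders of a block is more than three replicas can absorb.
print(sorted(readable_blocks(placement, ["node1", "node2", "node3"])))
```

With three replicas, any one or two simultaneous node failures leave every block readable in this model. Only losing all three holders of the same block makes it unavailable.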

 

How does MapReduce work?

MapReduce works in three phases, with an optional fourth phase, the combiner, which runs between mapping and shuffling.

  • Mapper: In this first phase, each node applies user-defined map logic to its share of the input and emits intermediate key-value pairs. For plain-text input, the “key” passed to the mapper is the byte offset of each record and the “value” contains the record content.
  • Shuffle: During the second phase, the output from mapping is sorted and consolidated. Values are grouped by key, and no values are discarded. The shuffle phase output is also arranged in key-value pairs, but this time each value is the list of all values emitted for that key rather than the content of one record.
  • Reducer: In the third phase, the grouped output from the shuffle phase is aggregated, with the values for each key reduced to a result for that key. The results are then written to a single output directory.
  • Combiner: Running this optional phase can optimise MapReduce job performance. It takes each mapper’s output at the node level, before the shuffle, and merges pairs that share a key into a single key-value pair, thereby reducing the amount of data that the shuffle phase must move.
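
The four phases above can be sketched as a word count in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names and the sample input are assumptions made for the example, and a real job would run the map tasks on different nodes.

```python
from collections import defaultdict


def mapper(offset, line):
    """Map: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combine: pre-sum counts per word within one map task's output."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Shuffle: group every value under its key and sort by key."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return sorted(grouped.items())


def reducer(word, counts):
    """Reduce: aggregate all values for one key."""
    return word, sum(counts)


# Two input splits, each a list of (byte offset, line) records.
splits = [
    [(0, "the quick brown fox"), (20, "the lazy dog")],
    [(0, "the fox")],
]

combined = []
for split in splits:  # one map task per split
    mapped = [pair for offset, line in split for pair in mapper(offset, line)]
    combined.extend(combiner(mapped))  # optional combiner, run per map task

shuffled = shuffle(combined)
word_counts = dict(reducer(word, counts) for word, counts in shuffled)
print(word_counts)
```

Without the combiner, the first split would send two separate `("the", 1)` pairs into the shuffle. With it, they travel as a single `("the", 2)` pair, and the final counts are the same because addition is associative and commutative.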

As to how these phases are carried out, the architecture is built around two daemon services, the Resource Manager and the Node Manager, which run mapper and reducer tasks and monitor and re-execute any tasks that fail. These two also manage the parallel processing and fault-tolerance components of all MapReduce jobs.

Within this architecture, resource management and job scheduling/monitoring are split between separate daemons: a global resource manager and an application master for each application.

When a job comes in, the resource manager coordinates the allocation of cluster resources among all running applications. It works with the application master and the node managers to determine which nodes can take on the job. The application master then coordinates with those node managers, which are the ones that actually launch and monitor the compute containers that get the job done.
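
The interaction between these daemons can be pictured with a toy model. The sketch below is only an illustration under heavy assumptions: the classes mirror the roles named above but are not Hadoop's classes, the "most free memory first" rule is a made-up policy, and the memory figures and task names are hypothetical. Real cluster schedulers are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node daemon: launches containers and tracks local resources."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def launch(self, task, memory_mb):
        self.free_memory_mb -= memory_mb
        self.containers.append(task)


class ResourceManager:
    """Global daemon: decides which node can host the next container."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb):
        # Hypothetical policy: choose the node with the most free memory.
        candidates = [n for n in self.node_managers if n.free_memory_mb >= memory_mb]
        return max(candidates, key=lambda n: n.free_memory_mb) if candidates else None


class ApplicationMaster:
    """Per-application daemon: requests containers for its own tasks."""

    def __init__(self, resource_manager):
        self.resource_manager = resource_manager

    def run(self, tasks, memory_mb):
        placed, pending = {}, []
        for task in tasks:
            node = self.resource_manager.allocate(memory_mb)
            if node is None:
                pending.append(task)  # no capacity yet; would be retried later
            else:
                node.launch(task, memory_mb)
                placed[task] = node.name
        return placed, pending


cluster = [NodeManager("nodeA", 4096), NodeManager("nodeB", 2048)]
app_master = ApplicationMaster(ResourceManager(cluster))
placed, pending = app_master.run(["map-0", "map-1", "map-2", "reduce-0"], memory_mb=2048)
print(placed, pending)
```

In this run the cluster has room for three 2,048 MB containers, so the fourth task stays pending until capacity is freed.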

How do companies use MapReduce?

As the data processing market has matured, MapReduce’s market share has declined to less than one per cent. Nevertheless, it is still used by nearly 1,500 companies in the United States, with some uptake in other countries. 

By and large, MapReduce is used by the computer software and IT services industry. Other industries include financial services, hospitals and healthcare, higher education, retail, insurance, telecommunications and banking. The following are a few example use cases:

  • Financial services: Retail banks use a Hadoop system to validate data accuracy and quality to comply with federal regulations.
  • Healthcare: A health IT company uses a Hadoop system to archive years of claims and remittance data, which amounts to processing terabytes of data every day and storing them for further analytical purposes. Separately, a hospital system monitors patient vitals by collecting billions of constantly streaming data points from sensors attached to patients.
  • IT services: A major IT services provider collects diagnostic data from its storage systems deployed at its customers’ sites. It uses a Hadoop system that runs MapReduce on unstructured logs and system diagnostic information.
  • Research: Ongoing research on the human genome project uses Hadoop MapReduce to process massive amounts of data. A popular family genetics research provider also processes an increasing flood of gene-sequencing data, alongside structured and unstructured data on births, deaths, census results, and military and immigration records, which amounts to many petabytes and continues to grow.
  • Retail: A leading online marketplace uses MapReduce to analyse huge volumes of log data to determine customer behaviour, search recommendations and more. And a major department store runs marketing campaign data through a Hadoop system to gain insights for making more targeted campaigns, down to the individual customer.
  • Telecommunications: A major telecom vendor stores billions of call records with real-time access for customers, which amounts to hundreds of terabytes of data to be processed.

How does HPE help with MapReduce?

HPE offers several solutions that can help you save time, money and workforce resources on managing Hadoop systems running MapReduce.

For example, HPE Pointnext Services offers advice and technical assistance in planning, designing and integrating your Big Data analytics environment. It simplifies designing and implementing Hadoop, and MapReduce, so that you can focus on finding analytical insights to make informed business decisions.

In addition, HPE GreenLake offers a scalable solution that radically simplifies the whole Hadoop lifecycle. It is an end-to-end solution that includes the required hardware, software and support for both symmetrical and asymmetrical environments. The unique HPE pricing and billing method makes it easier to understand your existing Hadoop costs and to more accurately predict future costs associated with your solution.

Following many years of customer engagement experiences in which HPE helped with Hadoop environments, HPE created two editions of an enterprise-grade Hadoop solution that are tested and ready to implement. They are complemented by the HPE Insight Cluster Management Utility, which enables IT I&O leaders to quickly provision, manage and monitor their infrastructure and choice of Hadoop implementations. The HPE enterprise-grade Hadoop standard edition solution can be supported in the HPE GreenLake solution.