Hadoop was designed as a big data system; it is distributed and can handle massive data sets by coordinating the computing power of a large number of commodity machines. In this section, I will introduce the Hadoop big data framework.
The Hadoop Big Data Framework
Hadoop is one of the representative approaches for dealing with big data challenges. Compared with traditional data frameworks, Hadoop redefines the way data is managed and analyzed by leveraging clusters of commodity hardware as its computing resource.
What is Hadoop?
Hadoop was first developed as a big data processing platform starting in 2006 at Yahoo!. The idea is based on Google's MapReduce, which Google first described in a paper about its proprietary MapReduce implementation. Hadoop is implemented in Java and was open sourced under the Apache license. Over the past few years, Hadoop has become a widely used platform and runtime environment for deploying big data applications.
Hadoop was designed to be parallel and resilient. It was built to work on commodity hardware and can automatically recover from failures. Despite its success as a big data platform, Hadoop is notorious for being difficult to configure and manage. A number of Hadoop-oriented companies were founded to provide enterprise-ready Hadoop distributions and Hadoop-based big data solutions.
Although designed for processing big data, the distributed parallel computing framework can also be useful for small-data but computation-heavy problems. One such example is the estimation of pi (π) shipped with the Hadoop examples package.
Next, I will introduce the data management and computation frameworks, highlighting how they differ from traditional data systems.
Data Management with HDFS
The Hadoop Distributed File System (HDFS) is the default filesystem of Hadoop. It was designed as a distributed filesystem that provides high-throughput access to application data. Data on HDFS is stored as data blocks. These blocks are replicated on several nodes, and checksums are computed for each block. In case of a checksum error or system failure, corrupted or lost blocks can be recovered from replicas located on other nodes.
HDFS has two types of nodes: the NameNode and the DataNode. The NameNode keeps track of metadata, such as the locations of data blocks, in main memory. DataNodes hold the data blocks and communicate with clients for reading and writing of the blocks. In addition, each DataNode periodically reports the list of blocks it hosts to the NameNode.
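To make this division of labor concrete, the sketch below uses the standard HDFS Java client API (org.apache.hadoop.fs.FileSystem) to write a small file and read it back. It is a minimal sketch, assuming the cluster's configuration files are on the classpath; the path /tmp/hdfs-demo.txt is an arbitrary example, not anything required by HDFS.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Loads the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the client streams data to DataNodes block by block,
        // while the NameNode only records the block locations in its metadata.
        Path path = new Path("/tmp/hdfs-demo.txt");
        FSDataOutputStream out = fs.create(path, true);
        out.writeBytes("hello HDFS\n");
        out.close();

        // Read the file back: the client fetches each block directly from a DataNode.
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
        System.out.println(reader.readLine());
        reader.close();
    }
}
```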
Some HDFS configurations contain a SecondaryNameNode, which maintains a checkpoint of the HDFS filesystem image; if the NameNode crashes, the filesystem can be recovered from this checkpoint. Unfortunately, the SecondaryNameNode is not a hot backup of the NameNode. Newer versions of Hadoop introduce a feature called high availability (HA), which runs two or more NameNodes: one is the primary and the others are hot standbys of the primary. When the primary NameNode fails, one of the standbys automatically becomes the new primary, reducing system downtime to a minimum.
Unlike traditional database systems, where data records are managed by a database management system (DBMS) and must conform to a schema, HDFS imposes no schema constraints on the data sets it stores. For this reason, HDFS is often grouped with NoSQL data stores.
Data Processing with MapReduce
MapReduce provides a programming model that expresses a complex computation as operations over sets of keys and values. The framework coordinates the processing of tasks on a cluster of nodes by scheduling jobs, monitoring their progress, and re-executing failed tasks.
|Figure 1. The MapReduce Data Processing Framework.|
The MapReduce framework has two types of nodes: master nodes and slave nodes. The JobTracker is the daemon on a master node, and the TaskTracker is the daemon on a slave node. The master node manages MapReduce jobs: it splits each job into smaller tasks, including map tasks and reduce tasks, and then assigns the tasks to slave nodes to run. When a slave node receives a task from the master node, the TaskTracker on that node forks a Java process to run it. The TaskTracker is also responsible for tracking and reporting the progress of individual tasks. MapReduce takes data locality into consideration, so a node processes its locally hosted data first, which reduces data transfer cost over the network. This differs from traditional data processing systems, which move data to a centralized system for processing.
The anatomy of a MapReduce job is described in the following figure. Multiple mappers run in parallel on the slave nodes, and their results are buffered on the local machines. Once some (or sometimes all) of the map tasks have finished, the shuffle process begins: it aggregates the mapper outputs and moves the intermediate partitions to the reducer machine(s). The reducers then run on the partitioned data and generate the final results, which are written to HDFS. Once the job finishes, the results reside in multiple files, one per reducer used in the job.
By default, MapReduce programs are written in Java. The programming model of a basic MapReduce job is simple and straightforward: assuming default settings, application programmers only need to implement a mapper function and a reducer function, and the framework automatically handles the input, output, and shuffling of data. Programmers do not need to worry about task failures or resource contention either. And if your programming language of choice is not Java, MapReduce Streaming makes it even simpler to use MapReduce, since the mapper and reducer can be any executables that read from standard input and write to standard output.
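To illustrate this programming model, here is a minimal sketch of the classic word count job using the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs, the framework shuffles them by key, and the reducer sums the counts for each word. Everything else is left at its defaults; the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The mapper turns each input line into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reducer receives all counts for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar and submitted with the hadoop jar command, the job leaves input splitting, task scheduling, shuffling, and the re-execution of failed tasks entirely to the framework.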
The Hadoop Big Data Ecosystem
At first, the name Hadoop was essentially a synonym for the open source counterpart of Google's MapReduce. Gradually, as more and more features and subsystems have been added, Hadoop has become the umbrella of a full-fledged big data ecosystem. Besides the distributed file system (HDFS) and the computing framework (MapReduce) introduced in the previous section, the ecosystem also includes common utilities, a column-oriented data store (HBase), high-level data management systems (Pig and Hive), a big data analytics library (Mahout), a distributed coordination system (ZooKeeper), a data integration framework (Sqoop), and a workflow management module (Oozie). As the ecosystem grows, more complementary services and higher-level abstractions are being added.
|Figure 2. The Hadoop Big Data System|
Hadoop Common is a collection of components and interfaces that form the foundation of the Hadoop-based big data platform. It provides interfaces for distributed filesystems and general I/O operations, general parallel computation interfaces, and logging and security management modules.
HBase is an open source, distributed, versioned, column-oriented data store built on top of Hadoop and HDFS. HBase supports random, real-time access to big data. It can scale to host very large tables containing billions of rows and millions of columns.
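To give a flavor of that random-access model, the sketch below uses the classic HBase Java client API (the HTable interface from the 0.9x era) to write one cell and read it back by row key. It assumes a table named webtable with a column family named content has already been created; both names are placeholders for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (e.g., the ZooKeeper quorum) from the classpath.
        Configuration conf = HBaseConfiguration.create();
        // Assumes the table "webtable" with column family "content" already exists.
        HTable table = new HTable(conf, "webtable");

        // Write one cell: row key -> column family:qualifier -> value.
        Put put = new Put(Bytes.toBytes("com.example/index.html"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Read the same row back by key: a random, real-time lookup.
        Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"))));

        table.close();
    }
}
```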
Mahout is an open source, scalable machine learning library based on Hadoop. It has a very active community and is still under active development. Currently, the library supports four use cases: recommendation mining, clustering, classification, and frequent itemset mining.
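As a small illustration of the recommendation use case, the following sketch uses Mahout's Taste recommender API on a local ratings.csv file, which is assumed to contain userID,itemID,preference rows. This variant runs in memory on a single machine; Mahout's Hadoop-based jobs apply the same ideas to data sets that do not fit on one node.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        // ratings.csv is an assumed input file of "userID,itemID,preference" lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 42 (an arbitrary example user).
        List<RecommendedItem> items = recommender.recommend(42, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```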
Apache Pig is a high-level system for expressing big data analysis programs. It supports big data by compiling Pig statements into sequences of MapReduce jobs. Pig programs are written in Pig Latin, a language that is easy to program in, provides built-in optimization of execution, and is extensible.
Hive is a high-level system for managing and analyzing big data stored in Hadoop-based systems. It uses a SQL-like language called HiveQL. Similar to Apache Pig, the Hive runtime engine translates HiveQL statements into a sequence of MapReduce jobs for execution.
ZooKeeper is a centralized coordination service for large-scale distributed systems. It maintains configuration and naming information and provides distributed synchronization and group services for applications in distributed environments.
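A minimal sketch of that coordination model with the ZooKeeper Java client: one process stores a piece of shared configuration under a znode, and any other process connected to the same ensemble can read (or watch) it. The ensemble address localhost:2181 and the znode name /demo-config are assumptions made for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperDemo {
    public static void main(String[] args) throws Exception {
        // Connect to an assumed ZooKeeper ensemble with a 10-second session timeout;
        // the watcher here ignores connection events to keep the sketch short.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> { });

        // Store a piece of shared configuration under a znode.
        zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any other process in the cluster can now read (and watch) the same znode.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```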
Hadoop was designed to process data stored in non-relational systems (such as HDFS and HBase). Sometimes, however, data in relational database systems needs to be integrated with data in non-relational ones. Sqoop is an open source tool for extracting structured data from relational databases into non-relational data systems.
Oozie is a scalable workflow management and coordination service for Hadoop jobs. It is data-aware and manages and coordinates jobs based on their dependencies. In addition, Oozie is integrated with the Hadoop stack and supports all types of Hadoop jobs.
What is Hadoop NOT for?
Hadoop has been successful as a big data platform, but this does not mean it is a good fit for every data problem. For example, Hadoop is not an optimal choice for the following cases:
- small structured data sets that require interactive queries
- data processing that requires transactions
- streaming data analytics
- real time data analytics
In particular, Hadoop was designed as a batch processing engine, yet real-time processing of data is sometimes required. For example, online big data querying, stream processing, and interactive big data analytics all require low-latency responses.
As a big data solution manager, you always need to weigh many technical and non-technical factors before making a final decision: the ease of use and management of the system, its functional ability to meet your data processing needs, the cost of deployment, and so on. Fortunately, as the big data industry develops, alternatives to Hadoop are emerging.
Next, I am going to introduce a few Hadoop alternatives, including improvements over the current Hadoop implementation as well as alternative big data platforms that aim to tackle Hadoop's fundamental drawbacks.
As open source software, Hadoop is difficult to configure and manage, mainly due to the instability of the software and the lack of proper documentation and support. To work as a Hadoop system administrator, you need solid Unix/Linux and network management skills. Sometimes, even for an experienced system administrator, it is hard to configure a large cluster in limited time. Fortunately, if your organization has a sufficient budget, several Hadoop-oriented companies provide enterprise-ready Hadoop solutions. We will introduce several such companies, including Cloudera, Hadapt, MapR, and Hortonworks.
On the other hand, Hadoop was not designed for real-time big data problems. We will introduce Spark and Storm as alternatives for dealing with such problems. Finally, we introduce the Message Passing Interface (MPI) and the High Performance Computing Cluster (HPCC).
Cloudera is one of the first companies to offer enterprise Hadoop big data solutions. The company provides Hadoop consulting, training, and certification services, and it is also one of the biggest contributors to the Hadoop code base. The Cloudera big data solution provides Cloudera Desktop as a cluster manager, which simplifies the installation, management, and monitoring of Hadoop clusters. You can visit Cloudera's corporate website at www.cloudera.com.
Hadapt differentiates itself from other Hadoop-oriented companies by aiming to integrate structured, semi-structured, and unstructured data into a uniform data manipulation platform. Hadapt's Hadoop-based data platform unifies SQL and Hadoop, which makes it easy to handle a wide variety of data. You can visit the Hadapt corporate website at http://hadapt.com/.
Other well-known Hadoop companies include MapR and Hortonworks, which were founded to provide more stable Hadoop distributions and Hadoop-based big data solutions.
Spark is a real-time, in-memory big data processing platform. It can be up to 40 times faster than Hadoop, which makes it ideal for iteration-intensive machine learning and real-time online big data analytics. Spark can be integrated with Hadoop, and its Hadoop-compatible storage APIs enable it to access any Hadoop-supported system, such as HDFS.
Another famous real-time big data processing platform is Storm, which was developed and open sourced by Twitter.
MPI is a library specification for message passing. Unlike Hadoop, MPI was designed for high performance on both massively parallel machines and workstation clusters. However, MPI lacks fault tolerance, and its performance becomes a bottleneck when data grows large.
HPCC is an open source big data platform developed by LexisNexis Risk Solutions. It achieves high performance by clustering commodity hardware. The system includes configurations for both parallel batch processing and high-performance online query applications using indexed data files.
The HPCC platform contains two cluster processing subsystems: the Data Refinery subsystem and the Data Delivery subsystem. The Data Refinery subsystem is responsible for the general processing of massive volumes of raw data, while the Data Delivery subsystem is responsible for delivering clean data for online queries and analytics.
As big data continues to flood the world, handling it well becomes increasingly important. Hadoop has become successful as a resilient and powerful parallel cluster computing framework that runs on commodity hardware, and the Hadoop-based big data ecosystem, including a number of subsystems and services, is being widely adopted by industry. Hadoop alternatives can help us deal with problems for which the open source Hadoop distribution is not an optimal choice.