Hadoop was designed as a big data system; it is distributed
and can handle massive data sets by coordinating the computation power of a large
number of commodity machines. I will introduce the Hadoop big data framework.
The Hadoop Big Data Framework
Hadoop is one of the representative approaches for dealing
with big data challenges. Compared with traditional data frameworks, Hadoop redefines
the way data is managed and analyzed by leveraging the power of computing
resources composed of commodity hardware.
What is Hadoop?
Hadoop was first developed as a big data processing platform
from 2006 at Yahoo! The idea is based on Google’s MapReduce, which was first
published by Google based on their proprietary MapReduce implementation. Hadoop
is implemented using java and was open sourced under Apache license. In the
past few years, Hadoop has become a widely used platform and runtime
environment for the deployment of Big Data applications.
Hadoop Properties
Hadoop was designed to be parallel and resilient. It was
built to work on commodity hardware and can automatically recover from failures.
Despite of its success as being a big data platform, Hadoop is notorious for its
difficulty in configuration and management. A few Hadoop oriented companies
were founded to provide enterprise ready Hadoop distributions and Hadoop based
big data solutions.
Although designed for processing big data, sometimes the distributed
parallel computing framework can be useful for dealing with small data but big computation
problems. One such example is the computation of pi (π) shipped with the Hadoop
sample package.
Next I will introduce the data management and
computation framework, identifying their differences from traditional data systems.
Data Management with HDFS
Hadoop Distributed File System (HDFS) is the default
filesystem of Hadoop. It was designed as a distributed filesystem that provides
high-throughput access to application data. Data on HDFS is stored as data blocks.
These data blocks are replicated on several nodes and checksums of the blocks
are computed. In case of checksum error or system failure, erroneous or lost
data blocks can be recovered from backup blocks located on other nodes.
HDFS has two types of nodes, NameNode and DataNode. NameNode
keeps track of the metadata such as location of data blocks in main memory. DataNode
holds data blocks, and communicates with clients for reading and writing of the
blocks. In addition, DataNode periodically reports the list of its hosting
blocks to the NameNode.
Some HDFS configurations contain a SecondaryNameNode, which
maintains a backup of the HDFS filesystem image, once the NameNode crashes, the
whole filesystem can be recovered from the SecondaryNameNode. Unfortunately, SecondaryNameNode
is not a hot backup of the NameNode. Newer versions of Hadoop have introduced a
feature called high availability (HA), which has two or more NameNodes, one of
which is the primary and others are hot backup of the primary. When the primary
NameNode fails, one of the backups can automatically become the new primary NameNode,
reducing the system downtime to the minimum.
Unlike traditional database systems, where data records are
managed by a DataBase Management System (DBMS), no schema constraints exist
among the data sets. Due to this reason, HDFS is also known as a NoSQL
database.
Data Processing with MapReduce
MapReduce provides a programming model that transforms a
complex computation into the computation over a set of keys and values. Applications
coordinate the processing of tasks on a cluster of nodes by scheduling jobs,
monitoring activity, and re-executing failed tasks.
![]() |
| Figure 1. The MapReduce Data Processing Framework. |
The MapReduce framework has two types of nodes, Master Node
and Slave Node. Jobtracker is the daemon on a master node, and Tasktracker is
the daemon on a slave node. The master node is the manager of MapReduce jobs;
it splits jobs into smaller tasks, including map tasks and reduce tasks. After
splitting, the master node assigns tasks to slave nodes to run. When a slave
node gets a task from master node, the Tasktracker on the slave node will fork
a java process to run the task. The Tasktracker is also responsible for
tracking and reporting the progress of individual tasks. MapReduce takes data
locality into consideration so that a node will process its locally hosted data
first. This reduces the data transfer cost over the network. It is different
from traditional data processing system, which retrieves data for processing in
a centralized system.
The anatomy of a MapReduce job is described in the following
Figure. Multiple mappers on the slave nodes are executed in parallel. Results from
mappers will be buffered on local machine. Once some or sometimes all of the
mapper tasks have finished, the random shuffling process begins, which
aggregates the mapper outputs and shuffle intermediate partitions into reducer machine(s).
The reducers will run on the partitioned data generating final results which
are written to HDFS. Once the job finishes, the result will reside in multiple
files, depending on the number of reducers used in the job.
By default MapReduce programs are written in java. The
programming model of a basic MapReduce job is simple and straightforward. Assuming
all settings using default, application programmers only need to implement a
mapper function and a reducer function, the framework will automatically handle
the input, output and shuffle of data. Programmers don’t need to consider task
failures or resource contention either. And if your programming language is not
java, MapReduce streaming makes it even simpler to use MapReduce.
The Hadoop Big Data Ecosystem
The name Hadoop appears as the synonym of Google’s
MapReduce. Gradually, with more and more features and subsystems being added,
Hadoop is becoming the umbrella of a full-fledged big data ecosystem. Besides
the distributed file system (HDFS) and computing framework (MapReduce) that we have
introduced in the previous section, the ecosystem also includes common
utilities, a column oriented data storage table (HBase), high level data
management systems (Pig and Hive), a big data analytics library (Mahout), a
distributed coordination system (Zookeeper), a data integration framework
(Sqoop) and a workflow management module (Oozie). As the ecosystem grows, more complementary
services or higher-level abstractions systems are being added.
![]() |
| Figure 2. The Hadoop Big Data System |
Common
Hadoop common is a collection
of components and interfaces for the foundation of Hadoop based big data
platform. It provides the following modules: interfaces for distributed
filesystem and I/O operations, general parallel computation interfaces, logging
and security management modules.
HBase
HBase is an open source,
distributed, versioned, column oriented data store. It was built on top of
Hadoop and HDFS. HBase supports random, real time access to big data. It can
scale to be hosting very large table, containing billions of rows and millions
of columns.
Mahout
Mahout is an open source scalable machine learning library
based on Hadoop. It has a very active community and is still under active
development. Currently, the library supports four use cases: recommendation
mining, clustering, classification and frequent item set mining.
Pig
Apache Pig is a high level
system for expressing big data analysis programs. It supports big data by
compiling the pig statement into a sequence of MapReduce jobs. Pig uses Pig
Latin as the programming language, which has the feature of easy to program,
build in optimization for execution and extensibility.
Hive
Hive is a high level system for management
and analysis of big data stored in Hadoop based systems. It uses a SQL-like
language called HiveQL. Similar to Apache Pig, the Hive runtime engine
translates the HiveQL statements into a sequence of MapReduce jobs for
execution.
ZooKeeper
ZooKeeper is a centralized
coordination service for large scale distributed systems. It maintains the configuration
and naming information and provides distributed synchronization and group
services for applications in distributed environment.
Sqoop
Hadoop was designed to process
data on non-relational database systems (such as HDFS and HBase). Sometimes, there
are requirements of integrating data on relational database systems with data
on non-relational ones. Sqoop is an open source tool for extracting structured
data from relational database to non-relation data systems.
Oozie
Oozie is a scalable workflow management and coordination
service for Hadoop jobs. It is data aware and manages and coordinates jobs
based on their dependencies. In addition, Oozie has been integrated with the
Hadoop stack and can support all types of Hadoop jobs.
Hadoop Alternatives
What is Hadoop NOT for?
Hadoop has been successful as a big data platform, but this
does not mean that it is a good fit for all data problems. For example, Hadoop
is not an optimal choice for the following cases:
- small structured data sets that require interactive queries
- data processing that requires transaction
- streaming data analytics
- real time data analytics
Especially, Hadoop was designed as a batch processing engine.
Sometimes, real time processing of the data is required. For example, online
big data querying, stream data processing and interactive big data analytics
all requires low latency responses.
As a big data solution manager, you always need to consider
a lot of technical and non-technical situations before making the final
decision. For example, you need to consider the ease of use and management of
the system, the functional ability of meeting data processing needs and cost of
system deployment etc. Fortunately, as the big data industry develops, alternatives
of Hadoop are emerging. In the next section, we will introduce these Hadoop
alternatives.
Next, I am going to introduce a few Hadoop alternatives,
including improvements over current Hadoop implementation and alternative big
data implementations that aims to tackle the fundamental drawbacks of Hadoop.
As open source software, Hadoop is difficult to configure and
manage, mainly due to the instability of the software and the lack of proper documentation
and support. To work as a Hadoop system administrator, you need to have
sufficient Unix/Linux and network management skills. Sometimes even for an
experience system administrator, it is still hard to configure a large cluster
in limited time. Fortunately, if your organization has sufficient budget, there
are several Hadoop oriented companies that provide enterprise ready Hadoop
solutions. We will introduce several such companies including Cloudera, Hadapt,
MapR and Horntonworks.
On the other hand, Hadoop was not designed to work on real
time big data problems. We will introduce Spark and Storm as alternatives to
deal with such problems. In the end, we introduce Message Passing Interface
(MPI) and High Performance Computing Cluster (HPCC).
Enterprise Hadoop
Cloudera is one of the first few companies that do
enterprise Hadoop big data solutions. This company provides Hadoop consulting,
training and certification services. It is also one of the biggest contributors
of the Hadoop code base. The Cloudera big data solution provides Cloudera
Desktop as cluster manager. It simplifies the installation, management and
monitoring of the Hadoop clusters. You can visit the corporate website of
Cloudera through www.cloudera.com.
Hadapt differentiate itself from the other Hadoop oriented
companies by the goal of integrating structured, semi-structured and
unstructured data into a uniform data manipulation platform. The Hadoop based
data platform by Hadapt unifies SQL and Hadoop which makes it easy to handle
different variety of data. You can visit the Hadapt corporate website at http://hadapt.com/.
Other well-known Hadoop companies include MapR and
Hortonworks, which are founded to provide more stable Hadoop distribution and
Hadoop based big data solutions.
Spark
Spark is a real time in memory big data processing platform.
It can be up to 40 times faster than Hadoop. So it is ideal for iteration
intensive machine learning and real time online big data analytics. Spark can
be integrated with Hadoop, and the Hadoop compatible storage APIs enables it to
access any Hadoop supported systems such as HDFS.
Another famous real time big data processing platform is
Storm, which is developed and open-sourced by Twitter.
MPI
MPI is a library specification for message passing.
Different from Hadoop, MPI was designed for high performance on both massively
parallel machines and on workstation clusters.
In addition, MPI lacks fault tolerance, and performance will be bounded
when data becomes large.
HPCC
HPCC is an open source big data platform developed by
LexisNexis Risk Solutions. It achieves high performance by clustering commodity
hardware. The system includes configurations for both parallel batch processing
and high performance online query applications using indexed data files.
The HPCC platform contains two cluster processing subsystems,
Data Refinery subsystem and Data Delivery subsystem. The Data Refinery
subsystem is responsible for the general processing of massive raw data and the
Data Delivery subsystem is responsible for the delivery of clean data for online
queries and analytics.
Summary
As big data continues to flood the whole world, handling of
big data becomes important. Hadoop becomes successful as a resilient and powerful
parallel cluster computing framework on commodity hardware. And the Hadoop
based big data ecosystem, including a number of subsystems and services, are
being widely accepted by the industry. Hadoop alternatives can help us deal with problems that the open source community Hadoop is not an optimal choice.


No comments:
Post a Comment