Hadoop 1.0 vs Hadoop 2.0
Hadoop 2.0 introduced several major features.
Let's look at each of them and compare it with Hadoop 1.0.
1. NameNode in High Availability (HA) mode
The NameNode is the most important node in a Hadoop cluster because it stores all of the file system metadata. In Hadoop 1.0 it is a single point of failure: if it goes down because of an unplanned event such as a machine crash, the whole Hadoop cluster goes down with it.
Hadoop 2.0 comes with a solution for this problem.
- HDFS now has a High Availability feature, which solves this problem by providing the option of running two redundant NameNodes in the same cluster in an active/passive configuration: one active NameNode and one hot standby NameNode.
- Both NameNodes share an edits log. All namespace edits are logged to shared NFS storage, and only one NameNode writes to this shared storage at any point in time. The passive NameNode reads from this storage and keeps its copy of the cluster metadata up to date. If the active NameNode fails, the passive NameNode becomes the active NameNode and starts writing to the shared storage.
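The failover behaviour described above can be sketched as a small toy model. This is not Hadoop code: the class and function names (SharedEditLog, NameNode, failover) are hypothetical and exist only for illustration, and the "metadata" is reduced to a set of paths. The point it shows is the single-writer rule: once the standby is promoted, the old active node can no longer append to the shared log.

```python
class SharedEditLog:
    """Toy shared edits log that accepts writes from exactly one NameNode."""

    def __init__(self):
        self.entries = []
        self.writer = None  # name of the only node currently allowed to write

    def grant_writer(self, node_name):
        # Granting write access to one node shuts out the previous writer.
        self.writer = node_name

    def append(self, node_name, edit):
        if node_name != self.writer:
            raise PermissionError(f"{node_name} is not the active writer")
        self.entries.append(edit)


class NameNode:
    """Toy NameNode (hypothetical): metadata is just a set of known paths."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()
        self.applied = 0  # number of log entries already replayed

    def create(self, path):
        """Active-only operation: log the edit first, then apply it locally."""
        self.log.append(self.name, ("create", path))
        self.catch_up()

    def catch_up(self):
        """Replay log entries not yet applied (what the standby keeps doing)."""
        for op, path in self.log.entries[self.applied:]:
            if op == "create":
                self.namespace.add(path)
        self.applied = len(self.log.entries)


def failover(log, standby):
    """Promote the standby: replay outstanding edits, then become the writer."""
    standby.catch_up()
    log.grant_writer(standby.name)
    return standby


log = SharedEditLog()
nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
log.grant_writer("nn1")
nn1.create("/data/a")          # nn1 is active and writes the edit
active = failover(log, nn2)    # nn1 "crashes"; nn2 takes over
active.create("/data/b")
print(sorted(active.namespace))  # ['/data/a', '/data/b']
```

After the failover, a call such as nn1.create("/data/c") raises PermissionError in this sketch, which mirrors the requirement that only one node writes to the shared storage at a time.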
- Ability to run non-MapReduce applications on Hadoop 2.0
In Hadoop 1.0, you can only run MapReduce jobs to process the data stored in HDFS; no other data processing model is available. For other kinds of processing on the same data, such as real-time or graph analysis, you need to move that data out of HDFS to some alternate storage like HBase, because Hadoop 1.0 supports only the MapReduce processing model.
Hadoop 2.0 introduces a new framework, YARN (Yet Another Resource Negotiator), which makes it possible to run non-MapReduce applications.
Hadoop 2.0 provides YARN APIs for writing other frameworks that run on top of HDFS. This enables non-MapReduce big data applications to run on Hadoop. Spark, MPI, Giraph, and HAMA are a few of the applications written or ported to run within YARN.
- Improved Resource Utilization
In Hadoop 1.0, the JobTracker is responsible for both managing the cluster's resources and driving the execution of MapReduce jobs.
YARN splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons:
- a global Resource Manager (RM), which focuses on managing the cluster resources, and
- a per-application Application Master (AM), one per running application, which manages that application (such as a MapReduce job).
There are no more fixed map and reduce slots. YARN provides a central resource manager, so you can now run multiple applications in Hadoop, all sharing a common pool of resources.
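To make the split of responsibilities concrete, here is a minimal sketch, assuming a cluster whose only resource is memory. The classes below are hypothetical stand-ins, not the YARN API: the resource manager only hands out containers of whatever size is requested (no fixed map or reduce slots), and each application master asks for containers on behalf of its own application.

```python
class ResourceManager:
    """Toy global resource manager: tracks free memory (MB) per node."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)

    def allocate(self, app_id, memory_mb):
        """Grant a container on the first node with enough memory, else None."""
        for node, free_mb in self.free.items():
            if free_mb >= memory_mb:
                self.free[node] = free_mb - memory_mb
                return (app_id, node, memory_mb)
        return None

    def release(self, container):
        """Return a container's memory to the node it was taken from."""
        _, node, memory_mb = container
        self.free[node] += memory_mb


class ApplicationMaster:
    """Toy per-application master: requests containers for its own tasks."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.containers = []

    def request(self, count, memory_mb):
        """Ask for `count` containers; return how many this app now holds."""
        for _ in range(count):
            container = self.rm.allocate(self.app_id, memory_mb)
            if container is None:
                break  # cluster is full; a real AM would wait and retry
            self.containers.append(container)
        return len(self.containers)


rm = ResourceManager({"node1": 4096, "node2": 4096})
mapreduce_am = ApplicationMaster("mapreduce-job", rm)
spark_am = ApplicationMaster("spark-job", rm)
print(mapreduce_am.request(3, 1024))  # 3
print(spark_am.request(2, 2048))      # 2
print(rm.free)                        # {'node1': 1024, 'node2': 0}
```

Two different kinds of application draw from the same pool here, with container sizes chosen per request rather than fixed in advance.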
- Native Windows Support
Hadoop was originally developed to support the
UNIX family of operating systems. With Hadoop 2, the Windows operating system
is natively supported. This extends the reach of Hadoop significantly to the
sizable Windows Server market.
- Beyond batch-oriented applications: Hadoop 2.0 goes beyond its batch-oriented nature and can now also run interactive and streaming applications.
- HDFS Federation
The Hadoop cluster storage subsystem has been generalized to support storage services other than HDFS. Similar to YARN, the new storage architecture generalizes the block storage layer so that it can be used not only by HDFS but also by other storage services. The first use of this feature is HDFS federation, which allows multiple HDFS namespaces to share the underlying storage. In future versions of Hadoop, other storage services (such as key-value storage) are intended to use the same storage layer.
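The idea of several namespaces sharing one block storage layer can be illustrated with a toy model. In HDFS the blocks belonging to one namespace form a block pool; the classes below are hypothetical and only mimic that idea by keying every stored block with a pool id, so two namespaces can use the same path without colliding on the shared DataNode.

```python
class DataNode:
    """Toy DataNode: one physical block store shared by several namespaces."""

    def __init__(self):
        self.blocks = {}  # (pool_id, block_id) -> data

    def store(self, pool_id, block_id, data):
        self.blocks[(pool_id, block_id)] = data

    def read(self, pool_id, block_id):
        return self.blocks[(pool_id, block_id)]


class Namespace:
    """Toy namespace: maps its own paths to blocks in its own block pool."""

    def __init__(self, pool_id, datanode):
        self.pool_id = pool_id
        self.datanode = datanode
        self.files = {}      # path -> block_id
        self.next_block = 0

    def put(self, path, data):
        block_id = self.next_block
        self.next_block += 1
        self.datanode.store(self.pool_id, block_id, data)
        self.files[path] = block_id

    def get(self, path):
        return self.datanode.read(self.pool_id, self.files[path])


shared = DataNode()
sales = Namespace("pool-sales", shared)
logs = Namespace("pool-logs", shared)
sales.put("/2014/report", b"sales data")
logs.put("/2014/report", b"log data")   # same path, different namespace
print(sales.get("/2014/report"))        # b'sales data'
print(len(shared.blocks))               # 2
```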
- HDFS: Multiple Storage
One more fundamental change is the support for heterogeneous storage.
Hadoop 1.0 treated all storage devices on a DataNode (whether spinning disks or SSDs) as a single uniform pool; although one could store data on an SSD, one could not control which data went there. From Hadoop 2.0 onwards, the system distinguishes between storage types and makes the storage type information available to frameworks and applications so that they can take advantage of storage properties. The approach is general enough to treat even memory as a storage tier for cached and temporary data.
- Faster access to data: DataNode caching
Users and applications (such as Hive, Pig, or HBase) can now identify a set of files that need to be cached. For example, dimension tables in Hive can be configured for caching in DataNode RAM, enabling quick reads for Hive queries against these frequently looked-up tables.
- HDFS Snapshots
Hadoop 2 adds support for file system
snapshots. A snapshot is a point-in-time image of the entire file system or a
subtree of the file system. A snapshot has many uses:
- Protection against user errors: An admin can set up a process to take snapshots periodically. If a user accidentally deletes files, these can be restored from the snapshot that contains the files.
- Backup: If an admin wants to back up the entire file system or a subtree in the file system, the admin takes a snapshot and uses it as the starting point of a full backup. Incremental backups are then taken by copying the difference between two snapshots.
- Disaster recovery: Snapshots can be used for copying consistent point-in-time images over to a remote site for disaster recovery.
The snapshots feature supports read-only snapshots; it is implemented only in
the NameNode, and no copy of data is made when the snapshot is taken. Snapshot
creation is instantaneous. All the changes made to the snapshotted directory
are tracked using modified persistent data structures to ensure efficient
storage on the NameNode.
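The restore and incremental-backup uses listed above can be sketched with a toy in-memory directory. This is only an illustration under simplifying assumptions: SnapshottableDir and its methods are hypothetical names, and the sketch copies the path-to-content mapping at snapshot time, whereas HDFS, as described above, copies nothing and tracks later changes instead.

```python
from types import MappingProxyType


class SnapshottableDir:
    """Toy directory (hypothetical) with read-only, named snapshots."""

    def __init__(self):
        self.files = {}      # path -> content (live, mutable state)
        self.snapshots = {}  # snapshot name -> read-only mapping

    def write(self, path, content):
        self.files[path] = content

    def delete(self, path):
        del self.files[path]

    def create_snapshot(self, name):
        # Simplification: copy the mapping (references only, not contents)
        # and wrap it so the snapshot cannot be modified afterwards.
        self.snapshots[name] = MappingProxyType(dict(self.files))

    def restore(self, name, path):
        """Recover one file from a snapshot, e.g. after an accidental delete."""
        self.files[path] = self.snapshots[name][path]

    def diff(self, older, newer):
        """Return (created, deleted, modified) paths between two snapshots."""
        old, new = self.snapshots[older], self.snapshots[newer]
        created = sorted(new.keys() - old.keys())
        deleted = sorted(old.keys() - new.keys())
        modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
        return created, deleted, modified


d = SnapshottableDir()
d.write("/logs/a", "v1")
d.write("/logs/b", "v1")
d.create_snapshot("s1")
d.delete("/logs/a")            # accidental delete
d.write("/logs/b", "v2")
d.write("/logs/c", "v1")
d.create_snapshot("s2")
print(d.diff("s1", "s2"))      # (['/logs/c'], ['/logs/a'], ['/logs/b'])
d.restore("s1", "/logs/a")     # bring the deleted file back
print(d.files["/logs/a"])      # v1
```

An incremental backup would copy only the paths reported by diff, rather than the whole directory.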