Sql Tour: Difference between Hadoop 1.0 & Hadoop 2.0

Hadoop 1.0 vs Hadoop 2.0

Hadoop 2.0 introduces several significant features. Let's go through them and compare each with Hadoop 1.0.

NameNode in High Availability (HA) Mode

The NameNode is the most critical component of a Hadoop cluster because it stores all the file system metadata. In Hadoop 1.0, if it goes down because of an unplanned event such as a machine crash, the whole cluster goes down with it. How can this situation be handled?

Hadoop 2.0 comes with the solution for this problem.

· HDFS now includes a High Availability feature, which addresses this by letting you run two redundant NameNodes in the same cluster in an Active/Passive configuration (one active NameNode and one hot standby NameNode).

· Both NameNodes share an edits log. All namespace edits are logged to shared NFS storage, and only one NameNode writes to this storage at any point in time. The passive NameNode reads from this storage and keeps its copy of the cluster metadata up to date. If the active NameNode fails, the passive NameNode becomes the active one and starts writing to the shared storage.
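The single-writer rule and the failover step can be sketched with a small toy model. This is only an illustration in plain Python, not Hadoop code: the class names `SharedEditLog` and `NameNode`, and the methods `become_active` and `catch_up`, are made up for this example.

```python
class SharedEditLog:
    """Toy stand-in for the shared edits storage; one writer at a time."""

    def __init__(self):
        self.edits = []
        self.writer = None  # name of the node currently allowed to write

    def append(self, node_name, edit):
        if node_name != self.writer:
            raise PermissionError(f"{node_name} is not the active writer")
        self.edits.append(edit)


class NameNode:
    """Hypothetical, heavily simplified NameNode holding only a set of paths."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()  # in-memory metadata (just paths here)
        self.applied = 0        # number of edits already replayed

    def catch_up(self):
        # Replay the edits this node has not applied yet.
        for op, path in self.log.edits[self.applied:]:
            if op == "create":
                self.namespace.add(path)
        self.applied = len(self.log.edits)

    def become_active(self):
        self.catch_up()
        self.log.writer = self.name  # take over as the single writer

    def create(self, path):
        self.log.append(self.name, ("create", path))
        self.catch_up()


log = SharedEditLog()
nn1, nn2 = NameNode("nn1", log), NameNode("nn2", log)
nn1.become_active()
nn1.create("/data/a")
nn2.catch_up()        # the standby reads the shared log
nn2.become_active()   # failover: nn2 takes over writing
nn2.create("/data/b")
```

After the failover, `nn2` knows about both paths, and any further write attempted by `nn1` is rejected because it is no longer the writer.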

Ability to Run Non-MapReduce Applications on Hadoop 2.0

In Hadoop 1.0, you could only run MapReduce jobs to process the data stored in HDFS. No other data processing model was available. For other kinds of processing, such as real-time or graph analysis on the same data, you had to move that data to an alternate store such as HBase.

Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which makes it possible to run non-MapReduce applications.

Hadoop 2.0 provides YARN APIs for writing other frameworks that run on top of HDFS. This enables non-MapReduce big data applications on Hadoop. Spark, MPI, Giraph, and HAMA are a few of the applications written or ported to run on YARN.

Improved Resource Utilization

In Hadoop 1.0, the JobTracker is responsible for both managing the cluster's resources and driving the execution of MapReduce jobs.

YARN splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons:

· A global Resource Manager (RM), which focuses on managing the cluster resources.

· A per-application Application Master (AM), one per running application, which manages that application (such as a MapReduce job).

There are no more fixed map and reduce slots. YARN provides a central resource manager, so you can now run multiple applications in Hadoop, all sharing a common pool of resources.
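To make the difference from fixed slots concrete, here is a minimal sketch of a central resource manager handing out memory to several applications from one shared pool. It is a toy model, not the YARN API: `Node`, `ResourceManager`, `allocate`, the first-fit placement and the memory figures are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_mb: int


@dataclass
class ResourceManager:
    """Hypothetical central manager that tracks free memory per node."""
    nodes: list
    grants: list = field(default_factory=list)

    def allocate(self, app_id, memory_mb):
        """Grant memory on the first node with enough free capacity."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                self.grants.append((app_id, node.name, memory_mb))
                return node.name
        return None  # no capacity right now; the application has to wait


rm = ResourceManager([Node("worker1", 8192), Node("worker2", 4096)])
placements = [
    rm.allocate("mapreduce-job", 2048),
    rm.allocate("spark-app", 6144),
    rm.allocate("giraph-app", 4096),
    rm.allocate("mapreduce-job", 1024),
]
```

Requests of different sizes from different kinds of applications draw on the same pool, and the last request is refused only because the cluster as a whole is full, not because a particular slot type ran out.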

Native Windows Support

Hadoop was originally developed to support the UNIX family of operating systems. With Hadoop 2, the Windows operating system is natively supported. This extends the reach of Hadoop to the sizable Windows Server market.

Beyond Batch-Oriented Applications

Hadoop 2.0 goes beyond its batch-oriented nature and can now run interactive and streaming applications as well.

HDFS Federation

The storage subsystem of a Hadoop cluster has been generalized to support other frameworks besides HDFS. Similar to what YARN does for processing, the new storage architecture generalizes the block storage layer so that it can be used not only by HDFS but also by other storage services. The first use of this feature is HDFS federation, which allows multiple HDFS namespaces to share the underlying storage. In future versions of Hadoop, other storage services (such as key-value storage) will use the same storage layer.
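The idea of several namespaces sharing one block storage layer can be sketched as follows. This is a simplified illustration, not how HDFS is implemented: `DataNode`, `NamespaceService`, and the round-robin block placement are hypothetical names and choices.

```python
class DataNode:
    """Toy block store shared by every namespace."""

    def __init__(self):
        self.blocks = {}  # (namespace_id, block_id) -> bytes

    def store(self, namespace_id, block_id, data):
        self.blocks[(namespace_id, block_id)] = data


class NamespaceService:
    """One independent namespace; it keeps only its own file metadata."""

    def __init__(self, namespace_id, datanodes):
        self.namespace_id = namespace_id
        self.datanodes = datanodes
        self.files = {}      # path -> (datanode index, block_id)
        self.next_block = 0

    def write(self, path, data):
        block_id = self.next_block
        self.next_block += 1
        index = block_id % len(self.datanodes)  # naive placement
        self.datanodes[index].store(self.namespace_id, block_id, data)
        self.files[path] = (index, block_id)

    def read(self, path):
        index, block_id = self.files[path]
        return self.datanodes[index].blocks[(self.namespace_id, block_id)]


shared = [DataNode(), DataNode()]
ns_a = NamespaceService("ns-a", shared)
ns_b = NamespaceService("ns-b", shared)
ns_a.write("/logs/app", b"a-data")
ns_b.write("/logs/app", b"b-data")  # same path, different namespace
```

Both namespaces use the same path and the same storage nodes without colliding, because each block is keyed by the namespace that owns it.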

HDFS: Multiple Storage Types

One more fundamental change is the support for heterogeneous storage.

Hadoop 1.0 treated all storage devices (be it spinning disks or SSDs) on a DataNode as a single uniform pool; although one could store data on an SSD, one could not control which data went there. With heterogeneous storage, available from Hadoop 2 onwards, the system distinguishes between storage types and also makes the storage type information available to frameworks and applications so that they can take advantage of storage properties. Indeed, the approach is general enough to treat even memory as a storage tier for cached and temporary data.
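As a rough sketch of what "distinguishing between storage types" makes possible, the snippet below picks a volume by walking an ordered list of preferred types. The function `pick_volume`, the volume paths and the type labels are assumptions for illustration only; they are not HDFS configuration or API names.

```python
def pick_volume(volumes, preferred_types):
    """Return the path of the first volume matching the preference order."""
    for storage_type in preferred_types:
        for volume in volumes:
            if volume["type"] == storage_type and volume["free_gb"] > 0:
                return volume["path"]
    return None  # no volume of any preferred type has space


# Hypothetical volumes on one DataNode.
volumes = [
    {"path": "/mnt/disk1", "type": "DISK", "free_gb": 900},
    {"path": "/mnt/ssd1", "type": "SSD", "free_gb": 0},
    {"path": "/mnt/ssd2", "type": "SSD", "free_gb": 120},
]

hot = pick_volume(volumes, ["SSD", "DISK"])  # prefer SSD, fall back to disk
cold = pick_volume(volumes, ["DISK"])        # spinning disk only
```

With a uniform pool there would be no `type` field to consult, which is exactly the Hadoop 1.0 limitation described above.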

Faster Access to Data: DataNode Caching

Users and applications (such as Hive, Pig or HBase) can now identify a set of files that need to be cached. For example, dimension tables in Hive can be configured for caching in DataNode RAM, enabling quick reads for Hive queries against these frequently looked-up tables.

HDFS Snapshots

Hadoop 2 adds support for file system snapshots. A snapshot is a point-in-time image of the entire file system or a subtree of the file system. A snapshot has many uses:

· Protection against user errors: An admin can set up a process to take snapshots periodically. If a user accidentally deletes files, these can be restored from the snapshot that contains the files.

· Backup: If an admin wants to back up the entire file system or a subtree in the file system, the admin takes a snapshot and uses it as the starting point of a full backup. Incremental backups are then taken by copying the difference between two snapshots.

· Disaster recovery: Snapshots can be used for copying consistent point-in-time images over to a remote site for disaster recovery.

The snapshot feature supports read-only snapshots. It is implemented only in the NameNode, and no copy of the data is made when a snapshot is taken, so snapshot creation is instantaneous. All changes made to a snapshotted directory are tracked using modified persistent data structures to ensure efficient storage on the NameNode.
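The restore and incremental-backup uses can be illustrated with a toy namespace that records only metadata. This is a sketch under simplifying assumptions: the `Namespace` class and its methods are hypothetical, and it copies the whole path-to-block mapping on each snapshot for brevity, whereas the description above says HDFS tracks changes instead.

```python
class Namespace:
    """Toy metadata store: paths map to block IDs, never to file contents."""

    def __init__(self):
        self.files = {}      # path -> tuple of block IDs
        self.snapshots = {}  # snapshot name -> frozen copy of self.files

    def write(self, path, block_ids):
        self.files[path] = tuple(block_ids)

    def delete(self, path):
        del self.files[path]

    def create_snapshot(self, name):
        # Copies path -> block-ID references only; no block data is touched.
        self.snapshots[name] = dict(self.files)

    def diff(self, older, newer):
        old, new = self.snapshots[older], self.snapshots[newer]
        common = old.keys() & new.keys()
        return {
            "created": sorted(new.keys() - old.keys()),
            "deleted": sorted(old.keys() - new.keys()),
            "modified": sorted(p for p in common if old[p] != new[p]),
        }

    def restore(self, name, path):
        self.files[path] = self.snapshots[name][path]


ns = Namespace()
ns.write("/sales/jan", [1, 2])
ns.write("/sales/feb", [3])
ns.create_snapshot("s1")
ns.delete("/sales/jan")         # accidental deletion
ns.write("/sales/feb", [3, 4])  # file grows by one block
ns.write("/sales/mar", [5])
ns.create_snapshot("s2")
changes = ns.diff("s1", "s2")   # what an incremental backup would copy
ns.restore("s1", "/sales/jan")  # recover the deleted file's metadata
```

The `changes` dictionary lists what differs between the two snapshots, which is the information an incremental backup needs, and `restore` brings back the accidentally deleted entry from the earlier snapshot.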

Friday, June 26, 2015
