Sql Tour

Friday, May 5, 2017

HDFS Command

HDFS commands

Print the Hadoop version

hadoop version

List the contents of the root directory in HDFS

hadoop fs -ls /

Run a DFS filesystem checking utility

hadoop fsck /

Create a new directory named “hadoop” below the /user/training directory in HDFS

hadoop fs -mkdir /user/training/hadoop

Add a sample text file from the local “data” directory to the new directory in HDFS

hadoop fs -put data/sample.txt /user/training/hadoop

List the contents of this new directory in HDFS.

hadoop fs -ls /user/training/hadoop

Add the entire local directory called “retail” to the /user/training/hadoop directory in HDFS

hadoop fs -put data/retail /user/training/hadoop

Delete a file ‘customers’ from the “retail” directory.

hadoop fs -rm hadoop/retail/customers

Lists the contents that match the specified path. If no path is specified, it lists the contents of the current user's home directory.

-ls <path>

Similar to the ls command, but lists the contents recursively (deprecated in Hadoop 2 in favor of -ls -R)

-lsr <path>

Moves files from <src> to <dst>. If <src> is a pattern that may match multiple files, <dst> must be a directory.

-mv <src> <dst>

Similar to the “mv” command, but the source is not removed after the file(s) are copied

-cp <src> <dst>

Copies files from the local file system to HDFS

-copyFromLocal <localsrc> <HDFSdst>

Same as copyFromLocal, but the local source is deleted once the file has been copied

-moveFromLocal <localsrc> <HDFSdst>

Similar to copyFromLocal

-put <localsrc> <HDFSdst>

Displays the space used, in bytes, by the files that match <path>

-du <path>

Displays the total space used, in bytes, by the specified directory (deprecated in Hadoop 2 in favor of -du -s)

-dus <path>

Deletes all files that match the <src> pattern

-rm <src>

Recursively deletes all directories that match the <src> pattern, along with their contents (deprecated in Hadoop 2 in favor of -rm -r)

-rmr <src>

Sets the file/directory permissions on <src>. The mode is three octal digits (0-7 each) for the owner, the group, and others, for example 755.

-chmod <mode> <src>
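
To make the three octal digits concrete, here is a minimal Python sketch that expands a mode such as 755 into the familiar rwx notation. The function name mode_to_string is hypothetical; it only illustrates how each digit encodes read (4), write (2), and execute (1) for the owner, the group, and others, and it does not call Hadoop.

```python
def mode_to_string(mode: str) -> str:
    """Translate a three-digit octal mode such as '755' into 'rwxr-xr-x'."""
    if len(mode) != 3 or any(digit not in "01234567" for digit in mode):
        raise ValueError("mode must be exactly three octal digits (0-7)")

    parts = []
    for digit in mode:  # order: owner, group, others
        bits = int(digit)
        parts.append(
            ("r" if bits & 4 else "-")
            + ("w" if bits & 2 else "-")
            + ("x" if bits & 1 else "-")
        )
    return "".join(parts)


print(mode_to_string("755"))  # prints rwxr-xr-x
print(mode_to_string("640"))  # prints rw-r-----
```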

Count the number of directories, files and bytes under <path>

-count <path>

Displays statistics about file/directory at <path>

-stat <path>

Creates a zero-length file at <path>. An error is returned if a file with non-zero length already exists there.

-touchz <path>

Sets the replication factor of a file to <rep>. With -w, the command waits until the replication completes.

-setrep [-w] <rep> <path>

Show the last 1KB of the file.

-tail <file>

Outputs the source file in text format. The allowed formats are zip and TextRecordInputStream.

-text <src>

Displays help for the specified command

-help [cmd]

Wednesday, April 26, 2017

Share a folder in Windows (OS) & access it in Hadoop (VM)

Hi all,

Since we are focusing on the practical side, we need big files to test and learn big data technology. We can keep those files on the local Windows machine, share the folder, and access it from Hadoop in the virtual machine using the steps below.

1. Create a folder on the Windows system and share it (ShareWindow).

2. In the virtual machine, create a folder where the shared data should appear (ShareVirtual).

3. Run the command below to mount the share and see its contents:

sudo mount -t vboxsf ShareWindow ShareVirtual

(Use the full path, for example /ShareVirtual, if the folder is in a different location.)

Hadoop Configuration File

Hello Guys,

Please keep these configuration (.xml) files handy when working in the Hadoop ecosystem.

core-site.xml: Contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

hdfs-site.xml: Contains the configuration settings for the HDFS daemons (background processes): the NameNode, the Secondary NameNode, and the DataNodes.

mapred-site.xml: Contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.
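
All three files share the same layout: a <configuration> element that holds <property> entries, each with a <name> and a <value>. As a rough illustration, the Python sketch below parses a sample core-site.xml held in a string. The fs.defaultFS value (hdfs://localhost:8020) and the helper name read_properties are assumptions chosen for the example, not settings taken from a real cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content; the host and port are placeholders.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
"""


def read_properties(xml_text: str) -> dict:
    """Return the <name>/<value> pairs of a Hadoop-style configuration file."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


print(read_properties(SAMPLE_CORE_SITE))
```

The same helper works for hdfs-site.xml and mapred-site.xml, since only the property names differ.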

Monday, September 26, 2016

My first Pig script to count the number of occurrences of IP addresses in a log file

Hello all,

I'm glad to share that today I successfully wrote a Pig script to count the occurrences of each IP address in a log file of 26 lakh (2.6 million) records.

I got the chance to handle a big data problem at my current company, and I used a Pig script for the task.

Here is my script to count the occurrences of IP addresses in the log file:

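-- Load each line of the log file as a single string
Ldata = LOAD '/user/cloudera/Pigdata/totalIPcount.txt' AS (line:chararray);
-- Split each line into tokens, one IP address per record
IP = FOREACH Ldata GENERATE FLATTEN(TOKENIZE(line)) AS IPaddress;
grouped = GROUP IP BY IPaddress;
IPCount = FOREACH grouped GENERATE group, COUNT(IP);
DUMP IPCount;
-- Alternatively, write the result to HDFS instead of printing it:
-- STORE IPCount INTO '/Pigdata';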
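
For readers who want to check the logic on a small sample before running it on the cluster, here is a minimal Python sketch of the same tokenize, group, and count steps. The function name count_ips and the sample addresses are made up for illustration. Note that str.split() splits on whitespace only, while Pig's TOKENIZE also splits on a few other delimiters such as commas, so treat this as an approximation.

```python
from collections import Counter


def count_ips(lines):
    """Count how often each whitespace-separated token appears in the lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # TOKENIZE + FLATTEN, then GROUP + COUNT
    return counts


# Hypothetical sample data standing in for the log file.
sample = ["10.0.0.1", "10.0.0.2 10.0.0.1", "10.0.0.1"]
for ip, total in count_ips(sample).most_common():
    print(ip, total)
```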

Thanks
Pradeep

Friday, June 26, 2015

Difference between Hadoop 1.0 & Hadoop 2.0

Hadoop 1.0 vs Hadoop 2.0

Hadoop 2.0 comes with a few great new features. Let’s look at them and compare them with Hadoop 1.0.

Name node in High Availability (HA) mode

The Name Node is the most important part of a Hadoop cluster because it stores all the metadata. If it goes down due to an unplanned event such as a machine crash, the whole Hadoop cluster goes down with it. How do we handle this situation?

Hadoop 2.0 comes with a solution to this problem.

· HDFS now has a High Availability feature, which solves this problem by providing the option of running two redundant Name Nodes in the same cluster in an Active/Passive configuration (one active Name Node and one hot standby Name Node).

· Both share an edits log. All namespace edits are logged to shared NFS storage, and there is only a single writer to this shared storage at any point in time. The passive Name Node reads from this storage and keeps its metadata for the cluster up to date. If the Active Name Node fails, the passive Name Node becomes the Active Name Node and takes over as the single writer to the shared storage.
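
The Active/Passive idea can be sketched as a toy simulation in Python. This is only an illustration of the single-writer rule and the failover step described above, not how HDFS is implemented; the class names SharedEditLog and NameNode, the node names nn1 and nn2, and the paths are all hypothetical.

```python
class SharedEditLog:
    """Shared storage that accepts edits from exactly one writer at a time."""

    def __init__(self, writer):
        self.writer = writer  # name of the node currently allowed to write
        self.edits = []       # ordered list of (operation, path) tuples

    def append(self, node_name, edit):
        if node_name != self.writer:
            raise PermissionError(f"{node_name} is not the current writer")
        self.edits.append(edit)


class NameNode:
    """Toy Name Node that rebuilds its namespace by replaying edits."""

    def __init__(self, name):
        self.name = name
        self.namespace = set()
        self.applied = 0  # number of edits already replayed

    def catch_up(self, log):
        for operation, path in log.edits[self.applied:]:
            if operation == "create":
                self.namespace.add(path)
            elif operation == "delete":
                self.namespace.discard(path)
        self.applied = len(log.edits)


def failover(log, standby):
    """Promote the standby: replay outstanding edits, then take over writing."""
    standby.catch_up(log)
    log.writer = standby.name


active, standby = NameNode("nn1"), NameNode("nn2")
log = SharedEditLog(writer=active.name)
log.append("nn1", ("create", "/user/training/hadoop"))
failover(log, standby)  # simulate a crash of nn1
log.append("nn2", ("create", "/user/training/retail"))
standby.catch_up(log)
print(sorted(standby.namespace))
```

After the failover, any write attempted by the old active node is rejected, which mirrors the rule that the shared storage has a single writer.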

Ability to run non-MapReduce applications on Hadoop 2.0

In Hadoop 1.0, you could only run MapReduce jobs to process the data stored in HDFS; no other data processing model was available. For other kinds of processing, such as real-time or graph analysis on the same data, you had to move the data out of HDFS to alternate storage such as HBase, because Hadoop 1.0 supported only MapReduce processing.

Hadoop 2.0 introduced a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.

Hadoop 2.0 provides YARN APIs for writing other frameworks that run on top of HDFS. This enables running non-MapReduce big data applications on Hadoop. Spark, MPI, Giraph, and HAMA are a few of the applications written or ported to run within YARN.

Improved Resource Utilization

In Hadoop 1.0, the JobTracker is responsible for both managing the cluster's resources and driving the execution of MapReduce jobs.

YARN splits the two major functions of the overburdened JobTracker (resource management and job scheduling/monitoring) into two separate daemons:

a global Resource Manager (RM), which focuses on managing the cluster resources, and
a per-application Application Master (AM), which manages one running application (such as a MapReduce job).

There are no more fixed map and reduce slots. YARN provides a central resource manager, so you can now run multiple applications in Hadoop, all sharing a common pool of resources.

Native Windows Support

Hadoop was originally developed to support the UNIX family of operating systems. With Hadoop 2, the Windows operating system is natively supported. This extends the reach of Hadoop significantly to a sizable Windows Server market.

Beyond batch-oriented applications: Hadoop 2.0 goes beyond its batch-oriented nature and can now also run interactive and streaming applications.

HDFS Federation

The Hadoop cluster storage subsystem has been generalized to support other frameworks besides HDFS. Similar to YARN, the new storage architecture generalizes the block storage layer so that it can be used not only by HDFS but also by other storage services. The first use of this feature is HDFS federation, which allows multiple instances of HDFS namespaces to share the underlying storage. In future versions of Hadoop, other storage services (such as key-value storage) will use the same storage layer.

HDFS- Multiple Storage

One more fundamental change is the support for heterogeneous storage.

Hadoop 1.0 treated all storage devices (be it spinning disks or SSDs) on a DataNode as a single uniform pool; although one could store data on an SSD, one could not control which data. Heterogeneous storage is part of Hadoop 2.0 onwards, where the system will distinguish between storage types and also make the storage type information available to frameworks and applications so that they can take advantage of storage properties. Indeed, the approach is general enough to allow us to treat even memory as a storage tier for cached and temporary data.

Faster access to data—Data Node caching

Users and applications (such as Hive, Pig or HBase) can now identify a set of files that need to be cached. For example, dimension tables in Hive can be configured for caching in the DataNode RAM, enabling quick reads for Hive queries to these frequently looked up tables.

HDFS Snapshots

Hadoop 2 adds support for file system snapshots. A snapshot is a point-in-time image of the entire file system or a sub tree of a file system. A snapshot has many uses:

Protection against user errors: An admin can set up a process to take snapshots periodically. If a user accidentally deletes files, these can be restored from the snapshot that contains the files.
Backup: If an admin wants to back up the entire file system or a subtree in the file system, the admin takes a snapshot and uses it as the starting point of a full backup. Incremental backups are then taken by copying the difference between two snapshots.

Disaster recovery: Snapshots can be used for copying consistent point-in-time images over to a remote site for disaster recovery.

The snapshots feature supports read-only snapshots; it is implemented only in the NameNode, and no copy of data is made when the snapshot is taken. Snapshot creation is instantaneous. All the changes made to the snapshotted directory are tracked using modified persistent data structures to ensure efficient storage on the NameNode.
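
The incremental backup idea (copying only the difference between two snapshots) can be sketched in a few lines of Python. Here each snapshot is modeled as a plain dictionary that maps a path to its modification time. That representation, the function name snapshot_diff, and the sample paths are assumptions made for illustration and say nothing about how the NameNode stores snapshots internally.

```python
def snapshot_diff(older, newer):
    """Compare two snapshots given as {path: modification_time} dictionaries."""
    return {
        "created": sorted(p for p in newer if p not in older),
        "deleted": sorted(p for p in older if p not in newer),
        "modified": sorted(
            p for p in newer if p in older and newer[p] != older[p]
        ),
    }


# Hypothetical snapshots taken on two consecutive days.
monday = {"/data/a.txt": 100, "/data/b.txt": 100}
tuesday = {"/data/a.txt": 100, "/data/b.txt": 250, "/data/c.txt": 240}

changes = snapshot_diff(monday, tuesday)
print(changes)
```

An incremental backup would then copy only the created and modified paths and remove the deleted ones at the backup site.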

Sunday, March 8, 2015

Difference Between SQL Server 2005/2008 & SQL Server 2008/2012

Hi Friends,

Here I am with one of the most common SQL interview questions:

What is the difference between SQL Server 2005 and SQL Server 2008?

Most of us (not all) keep using the functionality that was already present in the earlier version because of our project scope. But to impress the interviewer, here are the differences.

Table2

Difference between SQL Server 2008 & SQL Server 2012

Table4

Tuesday, December 4, 2012

Getting duplicate records with count

SELECT Type, COUNT(*) AS TotalCount FROM tblAgentTran
WHERE kcode = '4' GROUP BY Type
HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC
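
To try the query without a SQL Server instance, here is a small Python sketch that runs the same GROUP BY / HAVING pattern against an in-memory SQLite database. The table layout (only the Type and kcode columns), the sample rows, and the helper name duplicate_types are assumptions made for the example; the real tblAgentTran table will have more columns.

```python
import sqlite3


def duplicate_types(rows, kcode="4"):
    """Return (Type, count) pairs for types that occur more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tblAgentTran (Type TEXT, kcode TEXT)")
    conn.executemany("INSERT INTO tblAgentTran VALUES (?, ?)", rows)
    query = """
        SELECT Type, COUNT(*) AS TotalCount
        FROM tblAgentTran
        WHERE kcode = ?
        GROUP BY Type
        HAVING COUNT(*) > 1
        ORDER BY COUNT(*) DESC
    """
    result = conn.execute(query, (kcode,)).fetchall()
    conn.close()
    return result


# Hypothetical sample rows: (Type, kcode)
sample_rows = [
    ("A", "4"), ("A", "4"),
    ("B", "4"),
    ("C", "4"), ("C", "4"), ("C", "4"),
    ("A", "5"),
]
print(duplicate_types(sample_rows))  # prints [('C', 3), ('A', 2)]
```

Type B is filtered out by the HAVING clause because it appears only once, and the row with kcode '5' is excluded by the WHERE clause before grouping.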