Tuesday 21 June 2011

Hadoop

What is Hadoop?
An open-source framework for writing and running distributed applications that process large amounts of data.

Why Hadoop:

Problems that come with distributed computing:

● Communication
● Coordination
● Dealing with failures
● Dealing with transient data
● Scalability
● Performance per dollar

Features:
● Accessible
● Fast!
● Robust
● Scalable
● Simple

Hadoop: core components
● MapReduce – parallel applications
● HDFS – distributed storage
● Users write MapReduce jobs that process data stored in HDFS.

The building blocks of Hadoop

■NameNode
■DataNode
■Secondary NameNode
■JobTracker
■TaskTracker

NameNode
The master of HDFS: it directs the slave DataNode daemons to perform the low-level I/O tasks.

The NameNode is the bookkeeper of HDFS: it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.

It is a single point of failure for your Hadoop cluster.

DataNode
Performs the grunt work of the distributed filesystem on the slave machines.

Clients communicate directly with the DataNodes to process the local files corresponding to the blocks.

A DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.

DataNodes constantly report back to the NameNode.
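
To make this read path concrete, here is a minimal sketch using Hadoop's Java FileSystem API (it assumes a configured HDFS client, and the path /data/sample.txt is a hypothetical example): the client asks the NameNode for the block locations of the file, then streams the bytes directly from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);         // handle to the distributed filesystem
        Path file = new Path("/data/sample.txt");     // hypothetical HDFS path
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(file)));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);                 // bytes are streamed from the DataNodes
        }
        reader.close();
    }
}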

Secondary NameNode
An assistant daemon for monitoring the state of the cluster's HDFS.

Doesn’t receive or record any real-time changes to HDFS.

It communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

SNN snapshots help minimize the downtime and loss of data.

JobTracker
When you submit a job to your cluster, the JobTracker determines the execution plan: it decides which files to process, assigns nodes to different tasks, and monitors all tasks as they're running.


TaskTracker
While the JobTracker is the master overseeing the overall execution of a MapReduce job, TaskTrackers manage the execution of individual tasks on each slave node.

TaskTrackers constantly communicate with the JobTracker through heartbeats.

HDFS
Hadoop Distributed File System
● Handles very large files
● Files can be spread across many machines
● Divides files into chunks and replicates each chunk
● If a node crashes, its data is re-replicated from the remaining copies
● The NameNode stores all metadata in main memory for faster access, so if the NameNode goes down, all metadata is lost
● The Secondary NameNode periodically saves the NameNode state to disk
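
As a small illustration of this chunking, the sketch below (assuming an existing HDFS file, here the hypothetical /data/big.log) prints each block of the file together with the DataNodes that hold a replica of it, using the FileSystem.getFileBlockLocations API.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/big.log")); // hypothetical file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // one line per block: where it starts, how long it is, and which hosts store a replica
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}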


HDFS - Replication
Rack aware: one copy is kept inside the rack, another copy on a node outside the rack.
The number of replicas per block can be configured; the default is 3.
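
The default of 3 comes from the dfs.replication property. As a minimal sketch (hypothetical path and factor), the replication of an individual file can also be changed through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // ask HDFS to keep 2 replicas of this particular file instead of the cluster default
        fs.setReplication(new Path("/data/sample.txt"), (short) 2);
    }
}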

What happens when we submit a job?
– Hadoop determines where the input data is located.
– Calculates the number of splits required.
– Creates the tasks.
– Copies the necessary files to all nodes.
– Each node runs its tasks.
– Once the map tasks are over, the reduce tasks start.
– The output is collected.
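
All of these steps are triggered by a small driver program. A minimal sketch (class names such as WordCountDriver, WordCountMapper and WordCountReducer are illustrative; the Mapper and Reducer themselves are sketched in the Programming section below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");   // job name
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // user-written map class
        job.setReducerClass(WordCountReducer.class);    // user-written reduce class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input already stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
    }
}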


Programming

– Mapper class
– Reducer class
– Job configuration: job name, number of maps and reduces, any values required by the map and reduce classes, etc.
– Job configuration is done through the API.
– Build the code into a jar file and submit it.
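
To complete the word-count example started with the driver sketched above, here is a minimal sketch of the two user-written classes (the names are illustrative): the Mapper emits a (word, 1) pair for every word in its input split, and the Reducer sums those counts per word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line of text itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);            // emit (word, 1) into the shuffle
            }
        }
    }
}

class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();                      // add up the 1s for this word
        }
        context.write(key, new IntWritable(sum));    // final (word, total) pair
    }
}

Package these classes together with the driver into a jar and submit it to the cluster with the hadoop jar command.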

References:
Hadoop: The Definitive Guide (O'Reilly, June 2009)
Pro Hadoop
Other websites.