What is Hadoop?
An open-source framework for writing and running distributed applications that process large amounts of data.
Why Hadoop?
Problems it addresses:
● Communication
● Coordination
● Dealing with failures
● Dealing with transient data
● Scalability
● Performance per dollar
Features:
● Accessible
● Fast!
● Robust
● Scalable
● Simple
Hadoop: core components
● MapReduce – parallel applications
● HDFS – distributed storage
● Users write MapReduce jobs that process data stored in HDFS.
The building blocks of Hadoop
■ NameNode
■ DataNode
■ Secondary NameNode
■ JobTracker
■ TaskTracker
NameNode
The master of HDFS: it directs the slave DataNode daemons to perform the low-level I/O tasks.
The NameNode is the bookkeeper of HDFS: it keeps track of how your files are broken down into blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
It is a single point of failure for your Hadoop cluster.
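The block map the NameNode maintains is visible to clients through the FileSystem API. A minimal sketch (not from the slides), assuming the cluster configuration is on the classpath and a hypothetical file /data/input.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS
        // /data/input.txt is an illustrative path, not one from the slides.
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
        // The NameNode answers this query from its in-memory block map.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}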
DataNode
A slave daemon that performs the grunt work of the distributed filesystem.
Clients communicate directly with the DataNodes to process the local files corresponding to the blocks.
A DataNode may also communicate with other DataNodes to replicate its data blocks for redundancy.
DataNodes constantly report back to the NameNode.
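The read path shows this division of labour: the client asks the NameNode where the blocks live, then streams the bytes directly from the DataNodes. A minimal sketch, again using the hypothetical path /data/input.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The NameNode supplies block locations; the data itself comes from DataNodes.
        FSDataInputStream in = fs.open(new Path("/data/input.txt"));
        try {
            // Bytes are served block by block from whichever DataNode holds them.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}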
Secondary NameNode
An assistant daemon for monitoring the state of the cluster's HDFS.
It doesn't receive or record any real-time changes to HDFS.
Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.
These SNN snapshots help minimize downtime and loss of data when the NameNode fails.
JobTracker
When you submit a job to your cluster, the JobTracker determines the execution plan: it decides which files to process, assigns nodes to different tasks, and monitors all tasks as they're running.
TaskTracker
While the JobTracker is the master overseeing the overall execution of a MapReduce job, the TaskTrackers are the slave daemons that manage the execution of individual tasks on each slave node.
They constantly communicate with the JobTracker through heartbeats.
HDFS
Hadoop Distributed File System.
Handles very large files, which can be spread across many machines.
Divides files into chunks (blocks) and replicates each chunk.
If a node crashes, its blocks are re-replicated from the remaining copies.
The NameNode stores all metadata in main memory for faster access, so if the NameNode goes down, that in-memory metadata is lost.
The Secondary NameNode periodically saves the NameNode state to disk.
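The chunk size and replication factor can also be chosen per file when it is written. A minimal sketch, assuming a 64 MB block size (a common Hadoop 1.x default) and an illustrative output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 64L * 1024 * 1024;   // the file is split into 64 MB chunks
        short replication = 3;                // each chunk is stored on 3 DataNodes
        // /data/out.txt is only an example path.
        FSDataOutputStream out =
                fs.create(new Path("/data/out.txt"), true, 4096, replication, blockSize);
        out.writeBytes("hello hdfs\n");
        out.close();
        fs.close();
    }
}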
HDFS - Replication
Replica placement is rack aware: one copy is kept on a node inside the writer's rack and at least one copy on a node outside that rack.
The number of replicas per block can be configured; the default is 3.
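A minimal sketch of reading and changing the replication factor through the API. The property name dfs.replication is the standard one; the file path is only an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, taken from dfs.replication (3 if unset).
        System.out.println("default replication = " + conf.getInt("dfs.replication", 3));
        FileSystem fs = FileSystem.get(conf);
        // Ask the NameNode to keep 2 copies of this file's blocks instead of 3.
        fs.setReplication(new Path("/data/input.txt"), (short) 2);
        fs.close();
    }
}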
What happens when we submit a job?
– Hadoop determines where the input data is located
– Calculates the number of input splits required
– Creates the map and reduce tasks
– Copies the necessary files to all nodes
– Each node runs its assigned tasks
– Once the map tasks are over, the reduce tasks start
– The output is collected
Programming
– Write a Mapper class
– Write a Reducer class
– Job configuration: job name, number of maps and reduces, any values required by the map and reduce classes, etc.
– Job configuration is done through the API.
– Build the code into a jar file and submit it.
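A minimal WordCount sketch using the org.apache.hadoop.mapreduce API, tying together the three pieces listed above: a Mapper class, a Reducer class, and the job configuration. The input and output paths are assumed to come from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in a line of input.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Job configuration: name, mapper/reducer classes, output types, paths.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packed into a jar, the job can be submitted with: hadoop jar wordcount.jar WordCount <input dir> <output dir>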
References:
Hadoop: The Definitive Guide, O'Reilly, June 2009.
Pro Hadoop.
Other websites.