Thursday 2 November 2017

Sqoop Import with Secure Password

Sqoop is a very popular tool for importing data from an RDBMS (e.g. Oracle, MySQL, DB2) into HDFS, and for exporting data from HDFS back to an RDBMS.

Problem definition: a Sqoop import/export needs the RDBMS credentials. How do we secure the password so that unauthorized users cannot access it?

Solution:

1. Using the Hadoop credential provider API.

Here are the steps:

hadoop credential create oracle.john -provider jceks://hdfs/user/test/sqoop/pass.jceks

-- Enter the RDBMS password when prompted

After that, you will see a success message like the following:
oracle.john has been successfully created.
Provider jceks://hdfs/user/test/sqoop/pass.jceks has been updated.
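You can verify that the alias was stored by listing the entries in the provider (using the same provider path as above):

hadoop credential list -provider jceks://hdfs/user/test/sqoop/pass.jceks

This prints only the alias names; the password itself is never displayed.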

Now, in the Sqoop import command, reference the stored password by its alias as follows:

sqoop import \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/test/sqoop/pass.jceks \
--connect jdbc:oracle:thin:@server:1521:xyz  \
--username user_name \
--password-alias oracle.john \
--table db_name.table_name \
--mapreduce-job-name table_name_LOAD \
--delete-target-dir \
--fields-terminated-by '\001' \
--null-string "" \
--null-non-string "" \
--hive-drop-import-delims \
--escaped-by "\\" \
--split-by "col_name" \
--num-mappers 1

2. Using an options file.

Before starting the Sqoop job we can generate an options file containing the credentials, and after the job completes we can delete it.
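A minimal sketch of this approach is shown below; the file path, credentials, and table name are placeholders, and each option and its value go on separate lines inside the file:

# create the options file (one option or value per line)
cat > /home/test/import_options.txt <<'EOF'
import
--connect
jdbc:oracle:thin:@server:1521:xyz
--username
user_name
--password
secret_password
EOF

# restrict it to the owner so other users cannot read the password
chmod 400 /home/test/import_options.txt

# run the job using the options file, then delete the file
sqoop --options-file /home/test/import_options.txt --table db_name.table_name --num-mappers 1
rm -f /home/test/import_options.txt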

Thanks a lot for the read.

Regards,
Ratan Kumar Nath
Email: ratanKrNath@gmail.com





Thursday 22 March 2012

Hadoop Testing: MRUnit

MRUnit: Hadoop Testing tool
The distributed nature of MapReduce programs makes debugging a challenge. Attaching a debugger to a remote process is cumbersome, and the lack of a single console makes it difficult to inspect what is occurring when several distributed copies of a mapper or reducer are running concurrently.
MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily tested using standard tools and practices.
MRUnit is a testing framework for MapReduce programs that run on Hadoop. It makes testing Mapper and Reducer classes much easier.


Setup of the development environment:
1. Download junit-4.10.jar.
2. Download hadoop-mrunit-0.20.2-cdh3u1.jar. On CDH3 it is found under /usr/lib/hadoop-0.20/contrib/mrunit/.
*** If you are using Hadoop 0.23.x, use mrunit-x.x.x-incubating-hadoop023.jar instead.
3. Add both jars to the classpath (a command-line sketch follows this list).
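If you are not using an IDE, one way to put the jars on the classpath and run the tests from a terminal is sketched below. The jar locations are the ones mentioned above; the Hadoop core jar path and the lib directory are assumptions that depend on your installation.

# add JUnit, MRUnit, the Hadoop core jar and its library dependencies to the classpath
# (the Hadoop jar locations below are assumed; adjust them to your install)
export CLASSPATH=.:/path/to/junit-4.10.jar:/usr/lib/hadoop-0.20/contrib/mrunit/hadoop-mrunit-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/hadoop-core.jar:"/usr/lib/hadoop-0.20/lib/*"

# compile the job and its test, then run the test with the JUnit console runner
javac WordCount.java WordCountTest.java
java org.junit.runner.JUnitCore WordCountTest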
Writing the test case:
1. Create a source folder named 'test' in your project.
2. Right-click the class for which you want to create a test case.
3. Select New -> JUnit Test Case -> select the 'test' source folder -> Finish.


Code:
The corresponding WordCount code is presented below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class WordCount extends Configured implements Tool {

   static public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      final private static LongWritable ONE = new LongWritable(1);
      private Text tokenValue = new Text();

      @Override
      protected void map(LongWritable offset, Text text, Context context) throws IOException, InterruptedException {
         for (String token : text.toString().split("\\s+")) {
            tokenValue.set(token);
            context.write(tokenValue, ONE);
         }
      }
   }

   static public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      private LongWritable total = new LongWritable();

      @Override
      protected void reduce(Text token, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
         long n = 0;
         for (LongWritable count : counts)
            n += count.get();
         total.set(n);
         context.write(token, total);
      }
   }

   public int run(String[] args) throws Exception {
      Configuration configuration = getConf();

      Job job = new Job(configuration, "Word Count");
      job.setJarByClass(WordCount.class);

      job.setMapperClass(WordCountMapper.class);
      job.setCombinerClass(WordCountReducer.class);
      job.setReducerClass(WordCountReducer.class);

      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);

      // input and output paths are taken from the command-line arguments
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);

      return job.waitForCompletion(true) ? 0 : -1;
   }

   public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new WordCount(), args));
   }
}



Test Case Code:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;



public class WordCountTest {

   /* We declare three variables: a MapDriver, a ReduceDriver, and a MapReduceDriver.
      The generic parameters of each driver are worth noting: the MapDriver generics
      match our mapper's generics,

      WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable>

      and the ReduceDriver generics match our reducer's,

      WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> */

   MapReduceDriver<LongWritable, Text, Text, LongWritable, Text, LongWritable> mapReduceDriver;
   MapDriver<LongWritable, Text, Text, LongWritable> mapDriver;
   ReduceDriver<Text, LongWritable, Text, LongWritable> reduceDriver;

  
   // Create instances of our Mapper and Reducer and attach them
   // to the drivers using the setMapper()/setReducer() methods.
   @Before
   public void setUp() {
      WordCount.WordCountMapper mapper = new WordCount.WordCountMapper();
      WordCount.WordCountReducer reducer = new WordCount.WordCountReducer();
      mapDriver = new MapDriver<LongWritable, Text, Text, LongWritable>();
      mapDriver.setMapper(mapper);
      reduceDriver = new ReduceDriver<Text, LongWritable, Text, LongWritable>();
      reduceDriver.setReducer(reducer);
      mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, LongWritable, Text, LongWritable>();
      mapReduceDriver.setMapper(mapper);
      mapReduceDriver.setReducer(reducer);
   }

   @Test
   public void testMapper() {
      // give one sample line of input to the mapper
      mapDriver.withInput(new LongWritable(1), new Text("sky sky sky oh my beautiful sky"));
      //expected output for the mapper
      mapDriver.withOutput(new Text("sky"), new LongWritable(1));
      mapDriver.withOutput(new Text("sky"), new LongWritable(1));
      mapDriver.withOutput(new Text("sky"), new LongWritable(1));
      mapDriver.withOutput(new Text("oh"), new LongWritable(1));
      mapDriver.withOutput(new Text("my"), new LongWritable(1));
      mapDriver.withOutput(new Text("beautiful"), new LongWritable(1));
      mapDriver.withOutput(new Text("sky"), new LongWritable(1));
      // runTest() runs the mapper with the given input and verifies the expected output
      mapDriver.runTest();
   }

   @Test
   public void testReducer() {
      List<LongWritable> values = new ArrayList<LongWritable>();
      values.add(new LongWritable(1));
      values.add(new LongWritable(1));
      reduceDriver.withInput(new Text("sky"), values);
      reduceDriver.withOutput(new Text("sky"), new LongWritable(2));
      reduceDriver.runTest();
   }

   @Test
   public void testMapReduce() {
      mapReduceDriver.withInput(new LongWritable(1), new Text("sky sky sky"));
      mapReduceDriver.addOutput(new Text("sky"), new LongWritable(3));
      mapReduceDriver.runTest();
   }
}


An explanation of the above code is as follows:

MapReduceDriver<LongWritable, Text, Text, LongWritable, Text, LongWritable> mapReduceDriver;
MapDriver<LongWritable, Text, Text, LongWritable> mapDriver;
ReduceDriver<Text, LongWritable, Text, LongWritable> reduceDriver;

We declare three variables: a MapDriver, a ReduceDriver, and a MapReduceDriver. The generic parameters of each driver are worth noting: the MapDriver generics match our mapper's generics, and the ReduceDriver generics match our reducer's. The MapReduceDriver is declared with the mapper's input and intermediate key/value types followed by the reducer's output key/value types. We provide the input key and value that should be sent to the mapper, and the outputs we expect the reducer to emit for those inputs.

In the setUp() method we create instances of our Mapper and Reducer and attach them to the corresponding drivers using the setMapper()/setReducer() methods.

In each test method:
We give one sample line of input to the mapper using the withInput() method.
We declare the expected output with the withOutput() method.
The runTest() method runs the mapper (or reducer) with the given input and verifies the output.
Finally, run the test class as a JUnit test.

Thank You.


Tuesday 21 June 2011

Hadoop

What is Hadoop?
An open-source framework for writing and running distributed applications that process large amounts of data.

Why Hadoop?

Problems it helps address:

● Communication
● Coordination
● Dealing with failures
● Dealing with transient data
● Scalability
● Performance per dollar

Features:
● Accessible
● Fast!
● Robust
● Scalable
● Simple

Hadoop: core components
● MapReduce – parallel applications
● HDFS – distributed storage
● Users write MapReduce jobs that process data stored in HDFS.

The building blocks of Hadoop

■ NameNode
■ DataNode
■ Secondary NameNode
■ JobTracker
■ TaskTracker

NameNode
The master of HDFS: it directs the slave DataNode daemons to perform the low-level I/O tasks.

The NameNode is the bookkeeper of HDFS. It keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.

It is a single point of failure of your Hadoop cluster.
DataNode
Performs the grunt work of the distributed filesystem on the slave nodes.

A client communicates directly with DataNodes to read and write the local files corresponding to the blocks.

A DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.

DataNodes constantly report back to the NameNode.
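To see which DataNodes are currently reporting in, along with their capacity and usage, you can run the dfsadmin report (typically as the HDFS superuser):

hadoop dfsadmin -report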

Secondary NameNode
An assistant daemon for monitoring the state of the cluster HDFS.

It doesn't receive or record any real-time changes to HDFS.

Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration.

These snapshots help minimize the downtime and data loss if the NameNode fails.

JobTracker
The master overseeing the overall execution of a MapReduce job. When you submit a job to your cluster, the JobTracker determines the execution plan: it decides which files to process, assigns nodes to the different tasks, and monitors all tasks as they're running.

TaskTracker
TaskTrackers manage the execution of individual tasks on each slave node.

They constantly communicate with the JobTracker through heartbeats.
HDFS
The Hadoop Distributed File System:
– Handles very large files.
– Files can be spread across many machines.
– Divides files into chunks and replicates each chunk.
– If a node crashes, its data is replicated again from the remaining copies.
– The NameNode stores all metadata in main memory for faster access, so if the NameNode goes down, that in-memory metadata is lost.
– The Secondary NameNode periodically saves the NameNode state to disk.


HDFS – Replication
HDFS is rack aware: it keeps one copy inside the rack and another copy on a node outside the rack.
The number of replicas per block can be configured; the default is 3.
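For example, you can inspect where the replicas of a file's blocks actually live (and on which racks, if rack awareness is configured) with fsck; the path below is just a placeholder:

hadoop fsck /user/test/somefile.txt -files -blocks -racks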

What happens when we submit a job?
– Hadoop determines where the input data is located.
– Calculates the number of splits required.
– Creates tasks.
– Copies the necessary files to all nodes.
– Each node runs its assigned tasks.
– Once the map tasks are over, the reduce tasks start.
– The output is collected.


Programming
– Mapper class
– Reducer class
– Job configuration: the job name, the number of maps and reduces, and any values required by the map and reduce classes, etc.
– Job configuration is done through the API.
– Build the code into a jar file and submit it (see the example after this list).
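For example, packaging the WordCount job shown earlier on this blog and submitting it might look like this (the jar name, classes directory, and HDFS paths are placeholders):

# package the compiled classes (assumed to be under classes/) into a jar
jar -cvf wordcount.jar -C classes/ .

# submit the job; the last two arguments are the HDFS input and output paths
hadoop jar wordcount.jar WordCount /user/test/input /user/test/output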

References:
Hadoop: The Definitive Guide, O'Reilly, June 2009
Pro Hadoop
Other websites