How to use a custom partitioner in Pentaho MapReduce.

In some situations you may wish to specify which reducer a particular key goes to. For example, you are parsing a weblog, have a complex key containing IP address, year, and month, and need all of the data for a year to go to a particular reducer. For more information on partitioners see: http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
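The difference from Hadoop's default behavior can be sketched in plain Java: the stock partitioner hashes the whole key, so records for one year usually scatter across reducers, while partitioning on the year field alone keeps them together. This is a standalone sketch; the class and method names are illustrative and not part of the guide's code:

```java
public class PartitionSketch {
    // Default-style partitioning: hash of the full key (IP, year, and month).
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Year-based partitioning: all records for one year land on one reducer.
    static int yearPartition(String key, int numReduceTasks) {
        String[] fields = key.split("\t");      // key is client_ip \t year \t month
        return Integer.parseInt(fields[1]) % numReduceTasks;
    }

    public static void main(String[] args) {
        String jan = "10.0.0.1\t2010\t01";
        String feb = "10.0.0.2\t2010\t02";
        // Same year, different IP and month: year-based partitioning agrees...
        System.out.println(yearPartition(jan, 3) == yearPartition(feb, 3)); // true
        // ...while a hash of the full key generally does not.
        System.out.println(hashPartition(jan, 3) + " vs " + hashPartition(feb, 3));
    }
}
```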

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Pentaho Hadoop Distribution

Sample Files

The sample data file needed for this guide is:

File Name              Content
weblogs_parse.txt.zip  Parsed, raw weblog data

Note: If you have already completed the Using Pentaho MapReduce to Parse Weblog Data guide, the data should already be in the correct spot.

Add the file to your cluster by running the following:
Code Block
hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000


Sample Code

This guide expands upon the Using Pentaho MapReduce to Generate an Aggregate Dataset guide. If you have completed that guide, you should already have the necessary code; otherwise download aggregate_mapper.ktr, aggregate_reducer.ktr, and aggregate_mr.kjb.

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.

Create a Custom Partitioner in Java

In this task you will create a Java partitioner that takes a key in the format client_ip tab year tab month and partitions on the year.

Speed Tip: You can download CustomPartitioner.jar, which contains the partitioner, if you don't want to do every step.

  1. Create Year Partitioner Class: In a text editor create a new file named YearPartitioner.java containing the following code:
    Code Block
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;
    
    
    public class YearPartitioner implements Partitioner<Text, LongWritable> {
    
    	public void configure(JobConf job) {}
    
    	public int getPartition(Text key, LongWritable value,
    			int numReduceTasks) {
    		String sKey = key.toString();
    		String[] splits=sKey.split("\t");  //Split the key on tab
    		int year = Integer.parseInt(splits[1]);  //The year is the second field
    		return year % numReduceTasks;  //Return the year mod number of reduce tasks as the partitioner number to send the record to.
    	}
    }
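The arithmetic in getPartition can be checked outside Hadoop with a small standalone sketch. The helper below re-implements just the string handling on a plain String key; it is not the Hadoop class itself, and the class name is illustrative:

```java
public class YearPartitionCheck {
    // Mirrors the logic of YearPartitioner.getPartition, but on a plain String key.
    static int partitionFor(String key, int numReduceTasks) {
        String[] splits = key.split("\t");       // key: client_ip \t year \t month
        int year = Integer.parseInt(splits[1]);  // the year is the second field
        return year % numReduceTasks;            // partition number for this record
    }

    public static void main(String[] args) {
        // With 3 reducers, the two sample years land on two different partitions,
        // which leaves the third reducer with no records at all.
        System.out.println(partitionFor("192.168.0.5\t2010\t06", 3)); // 0
        System.out.println(partitionFor("192.168.0.5\t2011\t06", 3)); // 1
    }
}
```

Because there are only two distinct years and three reducers, one partition never receives a record, which is why one of the job's output files ends up empty.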

  2. Compile the Class: Run the following command:
    Code Block
    javac -classpath ${HADOOP_HOME}/hadoop-core.jar YearPartitioner.java

  3. Collect the Class into a Jar: Run the following command:
    Code Block
    jar cvf CustomPartitioner.jar YearPartitioner.class

Deploy the Custom Partitioner

In this task you will deploy the custom partitioner to the cluster so it may be used in the Distributed Cache.

  1. Create a Directory: Create a directory to store the custom partitioner:
    Code Block
    hadoop fs -mkdir /distcache

  2. Add the Custom Partitioner to the Cluster: Add the CustomPartitioner.jar to HDFS:
    Code Block
    hadoop fs -put CustomPartitioner.jar /distcache

Configure Pentaho MapReduce to Use Custom Partitioner

In this task you will configure the aggregate_mr.kjb job to use the custom partitioner.

Speed Tip: You can download the already completed aggregate_mr_partition.kjb if you do not want to do every step.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select the 'aggregate_mr.kjb', then click 'OK'.
  2. Configure Number of Reducers: Double click on the 'Pentaho MapReduce' job entry. Once it is open, switch to the 'Cluster' tab and set 'Number of Reducer Tasks' to '3'.
  3. Configure Partitioner to Use: Switch to the User Defined tab and enter the following:

    Name                        Value                             Explanation
    mapred.cache.files          /distcache/CustomPartitioner.jar  Adds the Custom Partitioner to the distributed cache for the job.
    mapred.job.classpath.files  /distcache/CustomPartitioner.jar  Adds the Custom Partitioner from the distributed cache to the java classpath for the job.
    mapred.partitioner.class    YearPartitioner                   Tells the job to use the YearPartitioner class.
  4. Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the job as 'aggregate_mr_partition.kjb' into a folder of your choice.
  5. Run the Job: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the job toolbar. An 'Execute a job' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show you the progress of the job as it runs. After a few seconds the job should finish successfully.

Check Hadoop

  1. View the first Output File: This command should return an empty file. There are only 2 years of data in the sample file, but you specified 3 reducers, therefore one reducer will receive no data.
    Code Block
    hadoop fs -cat /user/pdi/aggregate_mr/part-00000

  2. View the second Output File: This command should only return data for the year 2010.
    Code Block
    hadoop fs -cat /user/pdi/aggregate_mr/part-00001 | head -10

  3. View the third Output File: This command should only return data for the year 2011.
    Code Block
    hadoop fs -cat /user/pdi/aggregate_mr/part-00002 | head -10