This guide shows how to use a custom partitioner in Pentaho MapReduce. In some situations you may wish to specify which reducer a particular key goes to. For example, when parsing a weblog you might have a complex key containing IP address, year, and month, and need all of the data for a given year to go to the same reducer. For more information on partitioners, see: http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning

Prerequisites

In order to follow along with this how-to guide you will need the following:

Hadoop
Pentaho Data Integration
Pentaho Hadoop Distribution

Sample Files

The sample data file needed for this guide is:

File Name	Content
weblogs_parse.txt.zip	Parsed, raw weblog data

Note: If you have already completed the Using Pentaho MapReduce to Parse Weblog Data guide, the data should already be in the correct location.

Unzip weblogs_parse.txt.zip, then add the extracted file to your cluster by running the following:

```
hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000
```

Sample Code

This guide expands upon the Using Pentaho MapReduce to Generate an Aggregate Dataset guide. If you have completed that guide you should already have the necessary code; otherwise, download aggregate_mapper.ktr, aggregate_reducer.ktr, and aggregate_mr.kjb.

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.

Create a Custom Partitioner in Java

In this task you will create a Java partitioner that takes a key in the format client_ip tab year tab month and partitions on the year.
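To make the key layout concrete, here is a minimal sketch of how such a key breaks into its three fields. The class name WeblogKeySketch and the sample key are hypothetical and are not part of the guide's downloads. The month is kept as a string because its exact format is not stated here.

```java
// Hypothetical helper for illustration only: splits a composite key of the
// form client_ip<TAB>year<TAB>month into its three fields.
final class WeblogKeySketch {
    final String clientIp;
    final int year;
    final String month; // kept as text; the month format is an assumption

    private WeblogKeySketch(String clientIp, int year, String month) {
        this.clientIp = clientIp;
        this.year = year;
        this.month = month;
    }

    static WeblogKeySketch parse(String key) {
        String[] fields = key.split("\t");
        if (fields.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 tab-separated fields, got " + fields.length);
        }
        // The year is the second field (index 1).
        return new WeblogKeySketch(fields[0], Integer.parseInt(fields[1]), fields[2]);
    }

    public static void main(String[] args) {
        // Sample key made up for this sketch.
        WeblogKeySketch k = parse("10.0.0.1\t2011\t07");
        System.out.println(k.clientIp + " / " + k.year + " / " + k.month);
    }
}
```

Only the year field matters for partitioning; the other two fields are carried along unchanged as part of the key.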

Speed Tip

If you don't want to work through every step, you can download CustomPartitioner.jar, which contains the compiled partitioner.

Create Year Partitioner Class: In a text editor, create a new file named YearPartitioner.java containing the following code:

```
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class YearPartitioner implements Partitioner<Text, LongWritable> {

	// No job-level configuration is needed for this partitioner.
	public void configure(JobConf job) {}

	public int getPartition(Text key, LongWritable value,
			int numReduceTasks) {
		// The key has the form client_ip<TAB>year<TAB>month.
		String[] splits = key.toString().split("\t");
		// The year is the second field.
		int year = Integer.parseInt(splits[1]);
		// Send the record to reducer number (year mod number of reduce tasks).
		return year % numReduceTasks;
	}
}
```
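To see which reducer each year lands on, the following sketch repeats the same arithmetic on plain strings, so it runs without Hadoop on the classpath. The class YearPartitionSketch, its method names, and the sample keys are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone illustration; not part of CustomPartitioner.jar.
final class YearPartitionSketch {

    // Same arithmetic as YearPartitioner.getPartition, on a plain String key.
    static int partitionFor(String key, int numReduceTasks) {
        String[] fields = key.split("\t");
        int year = Integer.parseInt(fields[1]);
        return year % numReduceTasks;
    }

    // Groups the given years by the reducer index they map to.
    static Map<Integer, List<Integer>> yearsByReducer(int[] years, int numReduceTasks) {
        Map<Integer, List<Integer>> result = new TreeMap<>();
        for (int year : years) {
            // The IP address and month are made up; only the year affects the result.
            int partition = partitionFor("10.0.0.1\t" + year + "\t01", numReduceTasks);
            result.computeIfAbsent(partition, unused -> new ArrayList<>()).add(year);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] years = {2009, 2010, 2011, 2012};
        // Prints {0=[2010], 1=[2011], 2=[2009, 2012]}
        System.out.println(yearsByReducer(years, 3));
    }
}
```

With four years and three reducers, 2009 and 2012 share reducer 2. All records for a given year always reach the same reducer, but a reducer only handles a single year when the job has at least as many reduce tasks as there are distinct years and their remainders do not collide.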

Compile the Class: Run the following command:

```
javac -classpath ${HADOOP_HOME}/hadoop-core.jar YearPartitioner.java
```

Collect the Class into a Jar: Run the following command:
```
jar cvf CustomPartitioner.jar YearPartitioner.class
```

Deploy the Custom Partitioner

In this task you will deploy the custom partitioner to the cluster so it may be used in the Distributed Cache.

Create a Directory: Create a directory to store the custom partitioner:
```
hadoop fs -mkdir /distcache
```
Add the Custom Partitioner to the Cluster: Add the CustomPartitioner.jar to HDFS:
```
hadoop fs -put CustomPartitioner.jar /distcache
```

Configure Pentaho MapReduce to Use Custom Partitioner

In this task you will configure the aggregate_mr.kjb job to use the custom partitioner.