How to use HBase TableInputFormat in Pentaho MapReduce

This guide explains how to configure Pentaho MapReduce to use the TableInputFormat for reading data from HBase and how to configure a map-reduce transformation to process that data using the HBaseRowDecoder step.

Prerequisites

To follow along with this guide, you will need the following:

HBase
Hadoop configured to access HBase
Pentaho Data Integration

Step-By-Step Instructions

Using HBaseRowDecoder

The HBaseRowDecoder step is designed specifically for use in map-reduce transformations in order to decode the key and value data that is output by the TableInputFormat. The key output is the row key from HBase and the value is an HBase "Result" object containing all the column values for the row in question.

First, configure a Pentaho MapReduce input step by specifying that both the incoming key and value fields have type "Serializable".

Next, specify the incoming row key and HBase result fields in the HBaseRowDecoder step.

Finally, define or load a mapping using the Mapping editor tab.

Once defined (or loaded), this mapping is stored in the transformation metadata.

Configure the Pentaho MapReduce Job Entry Step

To ensure that input splits are created using the TableInputFormat, configure the Input Format and Input Path fields of the Job Setup tab as shown in the following screenshot.

The following table shows various properties that can be supplied in the User Defined tab of the step in order to configure the scan performed by the TableInputFormat. Entries shown in bold are mandatory.

Property	Description
hbase.mapred.inputtable	Name of the HBase table to read from
hbase.mapred.tablecolumns	Space-delimited list of columns in ColFam:ColName format (ColName can be omitted to read all columns from a family)
hbase.mapreduce.scan.cachedrows	Number of rows for caching that will be passed to scanners
hbase.mapreduce.scan.timestamp	Time stamp used to filter columns with a specific time stamp
hbase.mapreduce.scan.timerange.start	Starting time stamp to filter in a given time stamp range
hbase.mapreduce.scan.timerange.end	End time stamp to filter in a given time stamp range
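
As an illustration of how the values in this table fit together, the following Java sketch assembles a set of name/value pairs that could then be typed into the User Defined tab. It is not part of Pentaho or HBase: the class ScanProperties, the table name "weblogs", and the column names are hypothetical. It also assumes that the table name is one of the mandatory entries and that time stamps are expressed as milliseconds since the epoch, which is the usual HBase convention; check both against your own installation.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: builds the name/value pairs for the User Defined tab.
class ScanProperties {

    // Joins column specs ("ColFam:ColName", or "ColFam" alone for a whole
    // family) into the space-delimited form used by hbase.mapred.tablecolumns.
    static String columnList(List<String> columns) {
        return String.join(" ", columns);
    }

    static Map<String, String> build(String table, List<String> columns,
                                     Instant rangeStart, Instant rangeEnd) {
        // Assumption: the input table name is mandatory.
        if (table == null || table.isEmpty()) {
            throw new IllegalArgumentException("hbase.mapred.inputtable must be set");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put("hbase.mapred.inputtable", table);
        props.put("hbase.mapred.tablecolumns", columnList(columns));
        // Assumption: time stamps are milliseconds since the epoch.
        props.put("hbase.mapreduce.scan.timerange.start",
                Long.toString(rangeStart.toEpochMilli()));
        props.put("hbase.mapreduce.scan.timerange.end",
                Long.toString(rangeEnd.toEpochMilli()));
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = build(
                "weblogs",                          // hypothetical table name
                List.of("pageviews:jan", "meta"),   // one column, one whole family
                Instant.parse("2012-01-01T00:00:00Z"),
                Instant.parse("2012-02-01T00:00:00Z"));
        // Print one tab-separated name/value pair per line.
        props.forEach((name, value) -> System.out.println(name + "\t" + value));
    }
}
```

Running main prints one tab-separated property per line, in the same layout as the table above, so the values can be copied into the User Defined tab by hand.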
