Processing HBase data in Pentaho MapReduce using TableInputFormat


How to use HBase TableInputFormat in Pentaho MapReduce.

This guide explains how to configure Pentaho MapReduce to use the TableInputFormat for reading data from HBase and how to configure a map-reduce transformation to process that data using the HBaseRowDecoder step.

Prerequisites

To follow along with this guide, you will need the following:

  • HBase
  • Hadoop configured to access HBase
  • Pentaho Data Integration

Step-By-Step Instructions

Using HBaseRowDecoder

The HBaseRowDecoder step is designed specifically for use in map-reduce transformations in order to decode the key and value data that is output by the TableInputFormat. The key output is the row key from HBase and the value is an HBase "Result" object containing all the column values for the row in question.

First, configure a Pentaho MapReduce input step by specifying that both the incoming key and value fields have type "Serializable".

Next, specify the incoming row key and HBase result fields in the HBaseRowDecoder step.

Finally, define or load a mapping using the Mapping editor tab.

Once defined (or loaded), this mapping is stored in the transformation metadata.
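As a conceptual illustration of what this decoding amounts to, the following sketch simulates a row as a raw row key plus a dictionary of cell values and applies a mapping to produce typed fields. It is plain Python, not Pentaho or HBase code: the `MAPPING` layout, the `decode_row` and `decode_value` functions, and the sample columns are all hypothetical. The sketch also assumes that values were written with the HBase Java client's `Bytes.toBytes`, which encodes a long as 8 big-endian bytes and a string as UTF-8, and that the row key is a UTF-8 string.

```python
import struct

# Hypothetical mapping: (column family, column name) -> (output field, type).
MAPPING = {
    ("info", "name"): ("name", "string"),
    ("info", "visits"): ("visits", "long"),
}


def decode_value(raw, type_name):
    """Decode one cell value, assuming Bytes.toBytes-style encoding."""
    if type_name == "string":
        return raw.decode("utf-8")
    if type_name == "long":
        return struct.unpack(">q", raw)[0]  # 8-byte big-endian signed integer
    raise ValueError(f"unsupported type: {type_name}")


def decode_row(row_key, cells, mapping=MAPPING):
    """Turn a raw row key and its cells into a dict of typed fields.

    Mapped columns absent from the row yield None; cells that are not
    in the mapping are ignored.
    """
    fields = {"key": row_key.decode("utf-8")}  # assumes a UTF-8 string key
    for column, (field_name, type_name) in mapping.items():
        raw = cells.get(column)
        fields[field_name] = None if raw is None else decode_value(raw, type_name)
    return fields


# Simulated stand-in for the key and "Result" pair emitted for one row.
row_key = b"user-0001"
cells = {
    ("info", "name"): b"Ada",
    ("info", "visits"): struct.pack(">q", 42),
}
decoded = decode_row(row_key, cells)
print(decoded)
```

In the real step the mapping is defined in the Mapping editor rather than in code; the dictionary above only mirrors the idea that each column is paired with an output field name and a type.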

Configure the Pentaho MapReduce Job Entry Step

To ensure that input splits are created using the TableInputFormat, configure the Input Format and Input Path fields of the Job Setup tab as shown in the following screenshot.

The following table shows various properties that can be supplied in the User Defined tab of the step in order to configure the scan performed by the TableInputFormat. Entries shown in bold are mandatory.

Property                               Description
hbase.mapred.inputtable                Name of the HBase table to read from
hbase.mapred.tablecolumns              Space-delimited list of columns in ColFam:ColName format (ColName can be omitted to read all columns from a family)
hbase.mapreduce.scan.cachedrows        Number of rows for caching that will be passed to scanners
hbase.mapreduce.scan.timestamp         Time stamp used to filter columns with a specific time stamp
hbase.mapreduce.scan.timerange.start   Starting time stamp to filter in a given time stamp range
hbase.mapreduce.scan.timerange.end     End time stamp to filter in a given time stamp range
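To make the value formats concrete, the sketch below assembles a set of name/value pairs of the kind that could be entered in the User Defined tab. It is plain Python used only to show the formats. The table name `weblogs`, the column names, the dates, and the helpers `to_millis` and `scan_properties` are hypothetical. Expressing time stamps as milliseconds since the Unix epoch is an assumption based on how HBase stores cell time stamps, and treating the table name as required is likewise an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(dt):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def scan_properties(table, columns=None, cached_rows=None, time_range=None):
    """Build the property name/value pairs (all values as strings)."""
    if not table:
        raise ValueError("a table name is required")  # assumption, see lead-in
    props = {"hbase.mapred.inputtable": table}
    if columns:
        # Columns are given as "ColFam:ColName" strings and joined with spaces.
        props["hbase.mapred.tablecolumns"] = " ".join(columns)
    if cached_rows is not None:
        props["hbase.mapreduce.scan.cachedrows"] = str(cached_rows)
    if time_range is not None:
        start, end = time_range
        if start >= end:
            raise ValueError("time range start must be before end")
        props["hbase.mapreduce.scan.timerange.start"] = str(to_millis(start))
        props["hbase.mapreduce.scan.timerange.end"] = str(to_millis(end))
    return props


props = scan_properties(
    table="weblogs",
    columns=["info:name", "info:visits"],
    cached_rows=500,
    time_range=(
        datetime(2013, 1, 1, tzinfo=timezone.utc),
        datetime(2013, 2, 1, tzinfo=timezone.utc),
    ),
)
for name, value in props.items():
    print(f"{name} = {value}")
```

Each printed pair corresponds to one Name/Value row in the User Defined tab; properties that are not needed for a given scan are simply left out.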
