Processing HBase data in Pentaho MapReduce using TableInputFormat
How to use HBase TableInputFormat in Pentaho MapReduce.
This guide explains how to configure Pentaho MapReduce to use the TableInputFormat for reading data from HBase and how to configure a map-reduce transformation to process that data using the HBaseRowDecoder step.
Prerequisites
To follow this guide, you will need:
- HBase
- Hadoop configured to access HBase
- Pentaho Data Integration
Step-By-Step Instructions
Using HBaseRowDecoder
The HBaseRowDecoder step is designed specifically for use in map-reduce transformations in order to decode the key and value data that is output by the TableInputFormat. The key output is the row key from HBase and the value is an HBase "Result" object containing all the column values for the row in question.
First, configure the Pentaho MapReduce input step so that both the incoming key and value fields have the type "Serializable".
Next, specify the incoming row key and HBase result fields in the HBaseRowDecoder step.
Finally, define or load a mapping using the Mapping editor tab.
Once defined (or loaded), this mapping is stored in the transformation metadata.
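To make the role of the mapping more concrete, the following is a minimal conceptual sketch in Python of what decoding a row against a mapping involves. It is not the HBaseRowDecoder implementation. The column family `info`, the column names, the output field names, and the helper functions `decode_cell` and `decode_row` are all hypothetical, and the sketch assumes values were written with HBase's standard byte encodings (UTF-8 strings and 8-byte big-endian longs).

```python
import struct

# Hypothetical mapping: (column family, column name) -> (output field, type).
MAPPING = {
    ("info", "name"): ("customer_name", "String"),
    ("info", "visits"): ("visit_count", "Long"),
}


def decode_cell(raw, type_name):
    """Decode one cell value, assuming HBase's standard byte encodings."""
    if type_name == "String":
        return raw.decode("utf-8")
    if type_name == "Long":
        # Assumption: the long was stored as 8 bytes, big-endian, signed.
        return struct.unpack(">q", raw)[0]
    raise ValueError(f"unsupported type: {type_name}")


def decode_row(row_key, cells, mapping):
    """Flatten a row key and its cells into named output fields."""
    fields = {"key": row_key.decode("utf-8")}
    for column, (field_name, type_name) in mapping.items():
        raw = cells.get(column)
        # A column missing from the row yields an empty (None) field.
        fields[field_name] = None if raw is None else decode_cell(raw, type_name)
    return fields


row = decode_row(
    b"cust-0001",
    {("info", "name"): b"Ada", ("info", "visits"): struct.pack(">q", 42)},
    MAPPING,
)
print(row)
```

The point of the sketch is that the mapping, not the data, carries the field names and types: the same raw bytes decode differently under a different mapping, which is why the mapping has to travel with the transformation.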
Configure the Pentaho MapReduce Job Entry Step
To ensure that input splits are created using the TableInputFormat, configure the Input Format and Input Path fields of the Job Setup tab as shown in the following screenshot.
The following table shows various properties that can be supplied in the User Defined tab of the step in order to configure the scan performed by the TableInputFormat. Entries shown in bold are mandatory.
Property | Description |
---|---|
hbase.mapred.inputtable | Name of the HBase table to read from |
hbase.mapred.tablecolumns | Space-delimited list of columns in ColFam:ColName format (ColName can be omitted to read all columns from a family) |
hbase.mapreduce.scan.cachedrows | Number of rows for caching that will be passed to scanners |
hbase.mapreduce.scan.timestamp | Timestamp used to filter columns with a specific timestamp |
hbase.mapreduce.scan.timerange.start | Starting timestamp to filter in a given timestamp range |
hbase.mapreduce.scan.timerange.end | End timestamp to filter in a given timestamp range |
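As an illustration of how these values fit together, the sketch below assembles a set of name/value pairs that could be typed into the User Defined tab. It is only a helper for preparing the values, not part of Pentaho or HBase. The function names `to_epoch_millis` and `build_scan_properties`, the table name `weblogs`, and the column names are hypothetical. The sketch assumes timestamps are expressed as milliseconds since the Unix epoch, which is how HBase stamps cells by default.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime")
    return (moment - EPOCH) // timedelta(milliseconds=1)


def build_scan_properties(table, columns, cached_rows=None, start=None, end=None):
    """Return property names and string values for the User Defined tab."""
    if not table:
        raise ValueError("table name is required")
    for column in columns:
        # The column list is space-delimited, so entries cannot contain spaces.
        if any(ch.isspace() for ch in column):
            raise ValueError(f"column spec contains whitespace: {column!r}")
    props = {
        "hbase.mapred.inputtable": table,
        "hbase.mapred.tablecolumns": " ".join(columns),
    }
    if cached_rows is not None:
        props["hbase.mapreduce.scan.cachedrows"] = str(cached_rows)
    if start is not None:
        props["hbase.mapreduce.scan.timerange.start"] = str(to_epoch_millis(start))
    if end is not None:
        props["hbase.mapreduce.scan.timerange.end"] = str(to_epoch_millis(end))
    return props


props = build_scan_properties(
    "weblogs",
    ["pageviews:jan", "pageviews:feb"],
    cached_rows=500,
    start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    end=datetime(2024, 2, 1, tzinfo=timezone.utc),
)
for name, value in props.items():
    print(f"{name} = {value}")
```

Every value is rendered as a string because the User Defined tab takes plain text. Optional properties are left out entirely when no value is supplied rather than being written as empty entries.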