{scrollbar}
{excerpt}How to use compression with Pentaho MapReduce.{excerpt} This guide uses the Snappy compression codec in its examples, but you may use any compression codec supported by your Hadoop cluster.  The following scenarios are covered:
* Reading Compressed Files
* Writing Compressed Files
* Compressing Intermediate Data

h1. Prerequisites

In order to follow along with this how-to guide you will need the following:
* Hadoop
* Pentaho Data Integration
* Pentaho Hadoop Distribution
* Compression Codec Installed on Hadoop

h1. Step-By-Step Instructions

h2. Reading Compressed Files

Normally there is nothing you need to do to have Pentaho MapReduce use a compressed file as the input: it will automatically decompress the file using any compression codec installed on the Hadoop cluster.
{tip}Codec detection is automatic.  You do not need any configuration to read a file compressed with a codec that is installed on the cluster.
{tip}
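
If you want to see the mechanism this relies on, the sketch below uses Hadoop's CompressionCodecFactory, which resolves a codec from the file extension; this is the same auto-detection that lets Pentaho MapReduce read compressed input.  The input path is hypothetical; any extension registered with an installed codec (for example .snappy or .gz) will work.
{code:java}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadCompressedFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("/user/pdi/weblogs/part-00000.snappy"); // hypothetical path

    // Hadoop resolves the codec from the file extension.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);

    FileSystem fs = FileSystem.get(conf);
    InputStream in = (codec == null)
        ? fs.open(input)                           // no codec matched: read as-is
        : codec.createInputStream(fs.open(input)); // decompress transparently

    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      System.out.println(reader.readLine());       // first decompressed line
    }
  }
}
{code}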

h2. Writing Compressed Files

# *Start PDI on your desktop.* Once it is running choose 'File' \-> 'Open', browse to and select the job containing your Pentaho MapReduce entry, then click 'OK'.
# *Configure the Compression Codec:* Double-click the 'Pentaho MapReduce' job entry, switch to the 'User Defined' tab, and enter the following information:
|| Name || Value ||
| mapred.output.compression.codec | The compression codec to use.  For example org.apache.hadoop.io.compress.SnappyCodec |
| mapred.output.compress | true |
| mapred.output.compression.type | BLOCK |
!ConfigureOutputCompression.PNG|width=543,height=308!
# *Run your job*

The output from the job should be compressed using the codec you specified.
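
For reference, the same three properties can be set programmatically in a Java driver using the classic mapred API.  This is a minimal sketch, not part of the PDI job above; the class name is hypothetical, and SnappyCodec is assumed as in the table.
{code:java}
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class OutputCompressionConfig {  // hypothetical helper class
  public static void configure(JobConf conf) {
    // Equivalent to the three User Defined entries above:
    //   mapred.output.compress = true
    //   mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec
    //   mapred.output.compression.type = BLOCK
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(conf,
        SequenceFile.CompressionType.BLOCK);
  }
}
{code}
Note that mapred.output.compression.type only takes effect when the output format is a SequenceFile; plain text output ignores it.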

h2. Compressing Intermediate Data

You may want to compress the intermediate data passed between the mappers and reducers to reduce network I/O and, in some cases, improve performance.

# *Start PDI on your desktop.* Once it is running choose 'File' \-> 'Open', browse to and select the job containing your Pentaho MapReduce entry, then click 'OK'.
# *Configure the Compression Codec:* Double-click the 'Pentaho MapReduce' job entry, switch to the 'User Defined' tab, and enter the following information:
|| Name || Value ||
| mapred.map.output.compression.codec | The compression codec to use.  For example org.apache.hadoop.io.compress.SnappyCodec |
| mapred.compress.map.output | true |
!ConfigureIntermediateCompression.PNG|width=554,height=316!
# *Run your job*
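
As with output compression, these two properties have typed setters on JobConf if you are configuring a job in Java rather than through the 'User Defined' tab.  A minimal sketch, again assuming Snappy:
{code:java}
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.JobConf;

public class IntermediateCompressionConfig {  // hypothetical helper class
  public static void configure(JobConf conf) {
    // Equivalent to the two User Defined entries above:
    conf.setCompressMapOutput(true);                     // mapred.compress.map.output = true
    conf.setMapOutputCompressorClass(SnappyCodec.class); // mapred.map.output.compression.codec
  }
}
{code}
Because intermediate data is written and read once per job, a fast codec such as Snappy is usually a better fit here than a heavier codec like gzip.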