
How to use compression with Pentaho MapReduce. This guide uses the Snappy compression codec in its examples, but you may use any compression codec you choose that is supported in Hadoop. The following scenarios are covered:

Reading Compressed Files
Writing Compressed Files
Compressing Intermediate Data

Prerequisites

In order to follow along with this how-to guide you will need the following:

Hadoop
Pentaho Data Integration
Pentaho Hadoop Distribution
Compression Codec Installed on Hadoop

Step-By-Step Instructions

Reading Compressed Files

Normally you do not need to do anything to have Pentaho MapReduce use a compressed file as input. Pentaho MapReduce automatically decompresses input files that were compressed with any codec installed on the Hadoop cluster.
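As background, Hadoop's standard file input formats generally choose the decompression codec from the suffix of the input file name. The following is a minimal Python sketch of that lookup, not Pentaho or Hadoop code: the helper `codec_for` and the example paths are hypothetical, and the suffix table is only an illustrative subset of the codecs a cluster might have installed.

```python
import os

# Default file suffixes for a few codecs that ship with Hadoop.
# Illustrative subset only; the codecs actually installed vary by cluster.
CODEC_BY_SUFFIX = {
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
}


def codec_for(path):
    """Return the codec class name matching the file suffix, or None.

    Hypothetical helper: None means the file would be read as-is,
    without decompression.
    """
    suffix = os.path.splitext(path)[1]
    return CODEC_BY_SUFFIX.get(suffix)


# Hypothetical input paths, for illustration only.
print(codec_for("/user/pdi/weblogs/weblogs.txt.snappy"))
print(codec_for("/user/pdi/weblogs/weblogs.txt"))
```

If an input file is not being decompressed, checking that its name carries the suffix expected by the codec is a reasonable first step.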

Writing Compressed Files

Start PDI on your desktop. Once it is running, choose 'File' -> 'Open', browse to and select your job running Pentaho MapReduce, then click 'OK'.
Configure the Compression Codec: Double-click the 'Pentaho MapReduce' step, switch to the 'User Defined' tab, and enter the following information:
Name	Value
mapred.output.compression.codec	The compression codec to use, for example org.apache.hadoop.io.compress.SnappyCodec
mapred.output.compress	true
mapred.output.compression.type	BLOCK
Run your job.

The output from the job should be compressed using the codec you specified.
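For comparison with a cluster's configuration files, the same three settings can be written as a Hadoop-style configuration fragment. The sketch below is only an illustration under that assumption: the helper name `to_hadoop_xml` is hypothetical, the codec value is the Snappy example from this guide, and nothing here is part of PDI itself.

```python
import xml.etree.ElementTree as ET

# The name/value pairs entered on the 'User Defined' tab.
# The codec value is this guide's Snappy example; substitute your own codec.
OUTPUT_COMPRESSION = {
    "mapred.output.compress": "true",
    "mapred.output.compression.codec": "org.apache.hadoop.io.compress.SnappyCodec",
    "mapred.output.compression.type": "BLOCK",
}


def to_hadoop_xml(props):
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_hadoop_xml(OUTPUT_COMPRESSION)
print(xml_text)
```

Keeping the settings in one dictionary makes it easy to spot a missing entry, such as a codec that is set while `mapred.output.compress` is left unset.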

Compressing Intermediate Data

You may want to compress the intermediate data passed between the Pentaho Mappers and Reducers to reduce network I/O and, in some cases, improve performance.

Start PDI on your desktop. Once it is running, choose 'File' -> 'Open', browse to and select your job running Pentaho MapReduce, then click 'OK'.
Configure the Compression Codec: Double-click the 'Pentaho MapReduce' step, switch to the 'User Defined' tab, and enter the following information:
Name	Value
mapred.map.output.compression.codec	The compression codec to use, for example org.apache.hadoop.io.compress.SnappyCodec
mapred.compress.map.output	true
Run your job.
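To get a feel for why compressing intermediate data can reduce network I/O, the sketch below compresses a batch of repetitive key/value records and compares the sizes. It rests on loud assumptions: the records are made up, and `zlib` stands in for the codec because Snappy is not in the Python standard library, so the ratio shown says nothing about what Snappy achieves on real mapper output.

```python
import zlib

# Hypothetical mapper output: tab-separated key/value pairs, one per line.
records = "".join(f"word{i % 50}\t1\n" for i in range(10_000)).encode("utf-8")

# zlib at its fastest level is only a stand-in for a speed-oriented codec;
# it is not Snappy and will not match Snappy's ratio or speed.
compressed = zlib.compress(records, 1)

raw_size = len(records)
compressed_size = len(compressed)
print(f"raw: {raw_size} bytes, compressed: {compressed_size} bytes")

# The receiving side decompresses and gets the original bytes back.
restored = zlib.decompress(compressed)
```

Whether the smaller transfer outweighs the extra CPU time depends on the data and the cluster, which is why the improvement is only expected in some cases.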
