How to use compression with Pentaho MapReduce. This guide uses the Snappy compression codec in its examples, but you may use any compression codec you choose that is supported in Hadoop. The following scenarios are covered:

  • Reading Compressed Files
  • Writing Compressed Files
  • Compressing Intermediate Data

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration
  • Pentaho Hadoop Distribution
  • Compression Codec Installed on Hadoop

Step-By-Step Instructions

Reading Compressed Files

Normally there is nothing you need to do to have Pentaho MapReduce use a compressed file as the input. Pentaho MapReduce will automatically decompress input files compressed with any codec installed on the Hadoop cluster.
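Because Hadoop infers the codec from the file extension, preparing a compressed input is simply a matter of writing the file with a supported codec. A minimal sketch using Python's standard gzip module (the file name and log lines are illustrative; Snappy or any other installed codec works the same way):

```python
import gzip

# Write a sample input file using gzip; Hadoop infers the codec from
# the ".gz" extension and decompresses records transparently at read time.
lines = "127.0.0.1 GET /index.html\n127.0.0.1 GET /about.html\n"
with gzip.open("weblogs.txt.gz", "wt", encoding="utf-8") as f:
    f.write(lines)

# Round-trip check: a Hadoop record reader would see these same lines.
with gzip.open("weblogs.txt.gz", "rt", encoding="utf-8") as f:
    restored = f.read()

assert restored == lines
```

The resulting file can then be copied into HDFS as usual (for example with hadoop fs -put); no extra configuration is needed on the read side of the job.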

Writing Compressed Files

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select your job running Pentaho MapReduce, then click 'OK'.
  2. Configure the Compression Codec: Double click on the 'Pentaho MapReduce' step, switch to the 'User Defined' tab and enter the following information:

     Name                               Value
     mapred.output.compression.codec    The compression codec to use. For example org.apache.hadoop.io.compress.SnappyCodec
     mapred.output.compress             true
     mapred.output.compression.type     BLOCK

  3. Run your job.

The output from the job should be compressed using the codec you specified.
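The User Defined entries above are standard Hadoop job properties (classic mapred API names). If you prefer to enable output compression cluster-wide rather than per job, the same settings could be sketched in mapred-site.xml, using the Snappy codec from this guide's examples:

```xml
<!-- Compress final job output with the Snappy codec (BLOCK-level). -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```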

Compressing Intermediate Data

You may want to compress the intermediate data that is passed between the Pentaho Mappers and Reducers to reduce network I/O and, in some cases, improve performance.

  1. Start PDI on your desktop. Once it is running choose 'File' -> 'Open', browse to and select your job running Pentaho MapReduce, then click 'OK'.
  2. Configure the Compression Codec: Double click on the 'Pentaho MapReduce' step, switch to the 'User Defined' tab and enter the following information:

     Name                                   Value
     mapred.map.output.compression.codec    The compression codec to use. For example org.apache.hadoop.io.compress.SnappyCodec
     mapred.compress.map.output             true

  3. Run your job.
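As with output compression, these two entries are ordinary Hadoop job properties and could equally be set cluster-wide; a sketch of the equivalent mapred-site.xml fragment (Snappy used as in the example above):

```xml
<!-- Compress intermediate map output passed from mappers to reducers. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```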