
Pentaho MapReduce Job Entry

The Pentaho MapReduce job entry allows you to build MapReduce jobs using Kettle transformations as the Mapper, Combiner, and/or Reducer.

Architecture Overview

TODO

Distributed Cache

Pentaho MapReduce relies on Hadoop's Distributed Cache to distribute the Kettle environment, configuration, and plugins across the cluster. Because the Kettle environment is automatically configured on each node, using the Distributed Cache reduces network traffic for subsequent executions. It also allows you to use multiple versions of Kettle against a single cluster.

How it works

Hadoop's Distributed Cache is a mechanism for distributing files into the working directory of each map and reduce task. These files originate in HDFS. Pentaho MapReduce automatically configures the job to use a Kettle environment from HDFS (selected via pmr.kettle.installation.id; see Configuration options). If the desired Kettle environment does not exist, Pentaho MapReduce takes care of "installing" it in HDFS before executing the job.

The default Kettle environment installation path within HDFS is /opt/pentaho/mapreduce/$id, where $id is generally the version of Kettle the environment contains, but can also identify a custom build tailored for a specific set of jobs.
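
The check-then-install flow described above can be sketched roughly as follows. This is an illustration only, not Pentaho's actual code: FakeHdfs and ensure_kettle_environment are hypothetical names, the file system is an in-memory stand-in for HDFS, and the id "4.3.0" is just an example value.

```python
import posixpath

DEFAULT_INSTALL_DIR = "/opt/pentaho/mapreduce"  # default path stated in the text


class FakeHdfs:
    """In-memory stand-in for HDFS (hypothetical, for illustration only)."""

    def __init__(self):
        self.dirs = set()

    def exists(self, path):
        return path in self.dirs

    def mkdirs(self, path):
        self.dirs.add(path)


def ensure_kettle_environment(fs, installation_id, install_dir=DEFAULT_INSTALL_DIR):
    """Return (path, installed_now) for the Kettle environment in HDFS."""
    path = posixpath.join(install_dir, installation_id)
    if fs.exists(path):
        return path, False  # environment already present; reuse it
    # Assumption: a real install would also copy the environment's files here.
    fs.mkdirs(path)
    return path, True


fs = FakeHdfs()
first = ensure_kettle_environment(fs, "4.3.0")   # installs on first use
second = ensure_kettle_environment(fs, "4.3.0")  # reuses the existing install
```

Only the first call performs the "install"; later jobs that request the same id find the environment already in place, which is where the reduction in network traffic comes from.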

Configuration options

Pentaho MapReduce can be configured through the pentaho-mapreduce.properties file found in the plugin's base directory. Individual properties can be overridden per Pentaho MapReduce job entry by defining them in the User Defined properties tab.
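
The override behavior can be pictured as a simple layering of two property sets. The sketch below is a hedged illustration, not the plugin's implementation: parse_properties and effective_properties are hypothetical helpers, the parser handles only plain key=value lines (the real Java properties format is richer), and the values shown are made up.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def effective_properties(file_props, user_defined):
    """Values from the job entry's User Defined tab win over file values."""
    merged = dict(file_props)
    merged.update(user_defined)
    return merged


# Illustrative contents of pentaho-mapreduce.properties (values are assumptions).
SAMPLE_FILE = """
# pentaho-mapreduce.properties
pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce
pmr.kettle.installation.id=example-build
"""

file_props = parse_properties(SAMPLE_FILE)
merged = effective_properties(
    file_props, {"pmr.kettle.installation.id": "custom-build"}
)
```

Here the job entry overrides only pmr.kettle.installation.id; the install directory still comes from the properties file.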

...

Property Name | Description
pmr.kettle.installation.id | Version of Kettle to use from the Kettle HDFS installation directory. If not set, the version of Kettle used to submit the Pentaho MapReduce job is used.
pmr.kettle.dfs.install.dir | Installation path in HDFS for the Kettle environment used to execute a Pentaho MapReduce job. This can be a relative path, anchored to the user's home directory, or an absolute path if it starts with a /.
pmr.libraries.archive.file | Pentaho MapReduce Kettle environment runtime archive to be preloaded into pmr.kettle.dfs.install.dir/pmr.kettle.installation.id.
pmr.kettle.additional.plugins | Comma-separated list of additional plugins (by directory name) to be installed with the Kettle environment, e.g. "steps/DummyPlugin,my-custom-plugin".
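
As a rough illustration of how two of these values might be interpreted, the sketch below resolves pmr.kettle.dfs.install.dir against a home directory and splits pmr.kettle.additional.plugins into directory names. The function names and the home directory /user/alice are hypothetical, and trimming whitespace around plugin names is an assumption rather than documented behavior.

```python
import posixpath


def resolve_install_dir(install_dir, user_home):
    """Paths starting with / are absolute; others are anchored to the user's home."""
    if install_dir.startswith("/"):
        return install_dir
    return posixpath.join(user_home, install_dir)


def parse_additional_plugins(value):
    """Split the comma-separated list, ignoring empty entries.

    Assumption: surrounding whitespace is not significant.
    """
    return [name.strip() for name in value.split(",") if name.strip()]


absolute = resolve_install_dir("/opt/pentaho/mapreduce", "/user/alice")
relative = resolve_install_dir("pentaho/mapreduce", "/user/alice")
plugins = parse_additional_plugins("steps/DummyPlugin,my-custom-plugin")
```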

Customizing the Kettle Environment used by Pentaho MapReduce

TODO

Upgrading from the Pentaho Hadoop Distribution (PHD)

The PHD is no longer required and can be safely removed. If you have modified your Pentaho Hadoop Distribution installation, you may wish to preserve those files so that the new Distributed Cache mechanism can take advantage of them. To do so, follow the instructions here.

...