The Pentaho MapReduce job entry allows you to build MapReduce jobs using Kettle transformations as the Mapper, Combiner, and/or Reducer.
Architecture Overview
...
Overview
Kettle transformations are used to manipulate data and function as the map, combine, and reduce phases of a MapReduce application. The Kettle engine is pushed down to each task node and is executed for each task. The implementation that converts between Hadoop data types and Kettle data types, passes tuples between input/output formats and the Kettle engine, and handles all associated configuration for the MapReduce job is collectively called Pentaho MapReduce.
Type Mapping
In order to pass data between Hadoop and Kettle, we must convert between Hadoop IO data types and Kettle data types. Here's the type mapping for the built-in Kettle types:
Kettle Type | Hadoop Type
---|---
 |
 |
 |
 |
 |
 |
 |
Defining your own Type Converter
The Type Converter system is pluggable to support additional data types as required by custom Input/Output formats. The Type Converter SPI is a simple interface to implement: org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter. We use the Service Locator pattern (specifically, Java's ServiceLoader) to resolve available converters at runtime. Providing your own is as easy as implementing ITypeConverter and providing a META-INF/services/org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter file that lists your implementation, both packaged into a jar placed in the plugins/pentaho-big-data-plugin/lib directory. You can find the default implementations defined here.
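The ServiceLoader lookup described above is driven by a standard provider-configuration file inside your jar. A minimal sketch of the layout, assuming a hypothetical converter class com.example.MyWritableConverter:

```
my-converters.jar
`- META-INF/services/org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter
```

where the provider-configuration file contains the fully qualified class name of each implementation, one per line:

```
com.example.MyWritableConverter
```

Dropping this jar into plugins/pentaho-big-data-plugin/lib makes the converter discoverable at runtime.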
Distributed Cache
Pentaho MapReduce relies on Hadoop's Distributed Cache to distribute the Kettle environment, configuration, and plugins across the cluster. Leveraging the Distributed Cache reduces network traffic for subsequent executions, since the Kettle environment is automatically configured on each node after the first run. This also allows you to use multiple versions of Kettle against a single cluster.
...
The default Kettle environment installation path within HDFS is pmr.kettle.dfs.install.dir/$id, where pmr.kettle.dfs.install.dir defaults to /opt/pentaho/mapreduce and $id is a uniquely identifying string: generally the version of Kettle the environment contains, but it can just as easily identify a custom build tailored for a specific set of jobs.
The Kettle environment is staged to HDFS at {pmr.kettle.dfs.install.dir}/{pmr.kettle.installation.id} as follows:
- The contents of plugins/pentaho-big-data-plugin/pentaho-mapreduce-libraries.zip are extracted into HDFS at hdfs://{pmr.kettle.dfs.install.dir}/{pmr.kettle.installation.id}
- The Big Data Plugin contents are copied into {pmr.kettle.installation.id}/plugins/
  - Only the active Hadoop configuration is copied; specifically:
    - The active Hadoop configuration's client-only libraries (config/lib/client) are not copied
    - The active Hadoop configuration's "pmr"-specific libraries are copied into the main hdfs://{pmr.kettle.dfs.install.dir}/{pmr.kettle.installation.id}/lib/ directory of the installation. This allows the Hadoop configuration to provide libraries that are accessible within an Input or Output format (or otherwise outside of the standard transformation execution environment; this is necessary for reading directly out of HBase using the HBase TableInputFormat, for example).
Configuration options
Pentaho MapReduce can be configured through the pentaho-mapreduceplugin.properties file found in the plugin's base directory. Any of these properties can be overridden per Pentaho MapReduce job entry by defining them in the User Defined properties tab.
...
Property Name | Description
---|---
pmr.kettle.installation.id | Version of Kettle to use from the Kettle HDFS installation directory. If not set, a unique id is generated from the version of Kettle in use, the Big Data Plugin version, and the Hadoop configuration used to communicate with the cluster and submit the Pentaho MapReduce job.
pmr.kettle.dfs.install.dir | Installation path in HDFS for the Kettle environment used to execute a Pentaho MapReduce job. This can be a relative path, anchored to the user's home directory, or an absolute path if it starts with a /.
pmr.libraries.archive.file | Pentaho MapReduce Kettle environment runtime archive to be preloaded into pmr.kettle.dfs.install.dir.
pmr.kettle.additional.plugins | Comma-separated list of additional plugins (by directory name) to be installed with the Kettle environment.
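A minimal sketch of how these properties might look in pentaho-mapreduceplugin.properties; the values are illustrative only, drawn from the defaults and examples elsewhere on this page:

```
pmr.kettle.installation.id=4.3.0
pmr.kettle.dfs.install.dir=/opt/pentaho/mapreduce
pmr.libraries.archive.file=pentaho-mapreduce-libraries.zip
pmr.kettle.additional.plugins=my-custom-plugin
```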
Customizing the Kettle Environment used by Pentaho MapReduce
...
The installation environment used by Pentaho MapReduce will be installed to {pmr.kettle.dfs.install.dir}/{pmr.kettle.installation.id} when the Pentaho MapReduce job entry is executed. If the installation already exists, no modifications are made and the job uses the environment as-is. That means any modifications made after the initial run, or any custom pre-loading of a Kettle environment, will be used as-is by Pentaho MapReduce.
...
- Unzip pentaho-mapreduce-libraries.zip; it contains a single lib/ directory with the required Kettle dependencies
- Copy additional libraries to the lib/ directory
- Zip up the lib/ directory into pentaho-mapreduce-libraries-custom.zip so the archive contains lib/ with all jars within it. (You may create subdirectories within lib/; all jars found in lib/ and its subdirectories will be added to the classpath of the executing job.)
- Update pentaho-mapreduceplugin.properties to set the following properties:

  ```
  pmr.kettle.installation.id=custom
  pmr.libraries.archive.file=pentaho-mapreduce-libraries-custom.zip
  ```
The next time you execute Pentaho MapReduce, the custom Kettle environment will be copied into HDFS at pmr.kettle.dfs.install.dir/custom and used when executing the job. You can switch between Kettle environments by specifying the pmr.kettle.installation.id property as a User Defined property per Pentaho MapReduce job entry, or globally in the pentaho-mapreduceplugin.properties file*.

*Note: Only if the installation referenced by pmr.kettle.installation.id does not exist will the currently configured archive file and additional plugins be used to "install" it into HDFS.
...
See Appendix B for the supported directory structure in HDFS.
Adding JDBC drivers to the Kettle environment
JDBC drivers and their required dependencies must be placed in the installation directory's lib/ directory.
...
- Remove the pentaho.* properties from your mapred-site.xml
- Remove the directories those properties referenced
- Restart the TaskTracker process
Appendix A: pentaho-mapreduce-libraries.zip structure
...
```
pentaho-mapreduce-libraries.zip/
`- lib/
   +- kettle-core-{version}.jar
   +- kettle-engine-{version}.jar
   `- .. (all other required Kettle dependencies and optional jars)
```
Appendix B: Example Kettle environment installation directory structure within DFS
...
```
/opt/pentaho/mapreduce/
+- 4.3.0/
|  +- lib/
|  |  +- kettle-core-{version}.jar
|  |  +- kettle-engine-{version}.jar
|  |  +- .. (any files in the active Hadoop configuration's lib/pmr/ directory)
|  |  `- .. (all other required Kettle dependencies and optional jars - including JDBC drivers)
|  `- plugins/
|     +- pentaho-big-data-plugin/
|     |  `- hadoop-configurations/
|     |     `- hadoop-20/ (the active Hadoop configuration used to communicate with the cluster)
|     |        +- lib/ (the lib/pmr/ and lib/client/ directories are omitted here)
|     |        `- .. (all other jars)
|     `- .. (additional optional plugins)
`- custom/
   +- lib/
   |  +- kettle-core-{version}.jar
   |  +- kettle-engine-{version}.jar
   |  +- my-custom-code.jar
   |  `- .. (all other required Kettle dependencies and optional jars - including JDBC drivers)
   `- plugins/
      +- pentaho-big-data-plugin/
      |  ..
      `- my-custom-plugin/
         ..
```