

The Pentaho MapReduce job entry allows you to build MapReduce jobs using Kettle transformations as the Mapper, Combiner, and/or Reducer.

Architecture Overview

...

Overview

Kettle transformations are used to manipulate data and function as the map, combine, and reduce phases of a MapReduce application. The Kettle engine is pushed down to each task node and is executed for each task. The implementation that supports the data type conversion from Hadoop data types to Kettle data types, the passing of tuples between input/output formats to the Kettle engine, and all associated configuration for the MapReduce job is collectively called Pentaho MapReduce.

Type Mapping

To pass data between Hadoop and Kettle we must convert between Hadoop IO data types and Kettle data types. Here's the type mapping for the built-in Kettle types:

...

Defining your own Type Converter

The Type Converter system is pluggable to support additional data types as required by custom Input/Output formats. The Type Converter SPI is a simple interface to implement: org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter. We use the Service Locator pattern, specifically Java's ServiceLoader, to resolve available converters at runtime. Providing your own is as easy as implementing ITypeConverter and providing a META-INF/services/org.pentaho.hadoop.mapreduce.converter.spi.ITypeConverter file that lists your implementation, both packaged into a jar placed in the plugins/pentaho-big-data-plugin/lib directory. You can find the default implementations defined here.
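
The ServiceLoader lookup described above can be sketched roughly as follows. This is only an illustration of the pattern, not Pentaho's code: SketchConverter is a hypothetical stand-in whose methods are assumptions (the real ITypeConverter signature may differ), and StringToLongConverter is an invented example implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ConverterLookupSketch {

    /** Hypothetical stand-in for the SPI; the real ITypeConverter signature may differ. */
    public interface SketchConverter {
        boolean canConvert(Class<?> from, Class<?> to);
        Object convert(Object value);
    }

    /** Invented example implementation: parses a String into a Long. */
    public static class StringToLongConverter implements SketchConverter {
        @Override
        public boolean canConvert(Class<?> from, Class<?> to) {
            return String.class.equals(from) && Long.class.equals(to);
        }

        @Override
        public Object convert(Object value) {
            return Long.valueOf((String) value);
        }
    }

    /** Collects every converter registered under META-INF/services on the classpath. */
    public static List<SketchConverter> discover() {
        List<SketchConverter> found = new ArrayList<>();
        for (SketchConverter c : ServiceLoader.load(SketchConverter.class)) {
            found.add(c);
        }
        return found;
    }

    /** Returns the first converter that handles the given pair, or null if none does. */
    public static SketchConverter find(List<SketchConverter> converters,
                                       Class<?> from, Class<?> to) {
        for (SketchConverter c : converters) {
            if (c.canConvert(from, to)) {
                return c;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<SketchConverter> converters = discover();
        // No services file ships with this sketch, so one converter is registered by hand.
        converters.add(new StringToLongConverter());
        SketchConverter c = find(converters, String.class, Long.class);
        System.out.println(c.convert("42"));
    }
}
```

In a real plugin jar the manual registration in main would not be needed: ServiceLoader picks up any implementation class whose fully qualified name is listed in the services file named after the interface.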

Distributed Cache

Pentaho MapReduce relies on Hadoop's Distributed Cache to distribute the Kettle environment, configuration, and plugins across the cluster. By leveraging the Distributed Cache, network traffic is reduced for subsequent executions, as the Kettle environment is automatically configured on each node. This also allows you to use multiple versions of Kettle against a single cluster.

...