Hadoop Configurations

Hadoop Configurations, also known as shims and the Pentaho Big Data Adaptive layer, are collections of Hadoop libraries required to communicate with a specific version of Hadoop (and related tools: Hive, HBase, Sqoop, Pig, etc.). They are designed to be easily configured.

Configuring the default Hadoop configuration

The Pentaho Big Data Plugin uses the Hadoop configuration defined in its plugin.properties file to communicate with Hadoop. By default, the hadoop-20 configuration is used. Update this property to match the Hadoop configuration you wish to use when communicating with Hadoop:

# The Hadoop Configuration to use when communicating with a Hadoop cluster. This is used for all Hadoop client tools
# including HDFS, Hive, HBase, and Sqoop.
active.hadoop.configuration=hadoop-20

Structure

Hadoop configurations reside in pentaho-big-data-plugin/hadoop-configurations. They all share a basic structure:

configuration/
 |-- lib/ : contains all libraries specific to the version of Hadoop this configuration was created to communicate with.
 |   |-- client/ : Libraries that are only required on a Hadoop client (e.g. hadoop-core-*, hadoop-client-*)
 |   |-- pmr/ : Jars containing libraries that must be available to the entire JVM of Hadoop job tasks, e.g. for parsing
 |   |   data in input/output formats or otherwise running outside of any PDI-based execution
 |   `-- *.jar : All other libraries required for this shim that are neither client-only nor special "pmr" jars
 |-- config.properties : metadata and configuration options for this Hadoop configuration
 |-- core-site.xml : Hadoop core-site configuration file (left as a placeholder to indicate it could be used)
 `-- configuration-implementation.jar : Implementation of abstraction required by the Big Data Plugin to communicate 
     with this configuration

Creating new configurations

New configurations can be created by identifying the existing configuration that most closely matches the version of Hadoop you wish to communicate with, copying it, and swapping out the jar files in its lib/ directory to match the cluster you want to communicate with. If you compare the default configurations included with the plugin, the differences between them are apparent.
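
For example, assuming the bundled hadoop-20 configuration is the closest match and your cluster's client libraries live in /path/to/cluster-client-libs (both the my-custom-shim name and that path are placeholders), the process might look like:

cd pentaho-big-data-plugin/hadoop-configurations
cp -r hadoop-20 my-custom-shim
# Swap the client-only jars for the versions shipped with your cluster
rm my-custom-shim/lib/client/*.jar
cp /path/to/cluster-client-libs/*.jar my-custom-shim/lib/client/
# Repeat for the remaining jars under lib/ (and lib/pmr/) as needed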

Developing a new configuration

Sometimes it is not enough to simply copy an existing compiled configuration to communicate with a specific cluster. In some cases the code that interfaces with the Hadoop libraries must be recompiled (relinked) against the new libraries.

Anatomy of a configuration

The pentaho-hadoop-shims-api project provides the API/SPI for developing a shim implementation. A Hadoop configuration is a combination of shim implementation and supporting metadata and libraries. The following SPIs exist for interfacing with Hadoop-related libraries:

  • org.pentaho.hadoop.shim.spi.HadoopShim: Hadoop-related functions including HDFS, Hadoop Configuration, and Hive JDBC driver
  • org.pentaho.hadoop.shim.spi.SqoopShim: Ability to execute Sqoop tools
  • org.pentaho.hadoop.shim.spi.PigShim: Simple interface for executing Pig scripts

Default implementations are provided for all shims as well as supporting objects.

SPIs are registered via Java's ServiceLoader mechanism (META-INF/services/<interface-name> files whose contents are the names of the concrete implementation classes).
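
For example, a configuration that provides its own HadoopShim implementation would register it with a provider-configuration file named after the interface; the implementation class shown here is hypothetical:

# Contents of META-INF/services/org.pentaho.hadoop.shim.spi.HadoopShim
org.pentaho.hadoop.shim.mydistro.MyDistroHadoopShim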

Hadoop configurations are loaded with a special class loader that delegates loading of resources to the configuration's directory (and configured classpath) before walking up the class loader hierarchy. The class loading scheme closely resembles that of an application server.

config.properties

The config.properties file defines a friendly name for the configuration as well as any additional classpath entries and native libraries the configuration requires. See the file's in-line comments for more details.
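
A minimal config.properties might look like the following sketch; the property names shown are illustrative, so treat the in-line comments of a bundled configuration as the authoritative reference:

# Friendly name for this Hadoop configuration
name=My Custom Hadoop Distribution
# Additional entries to append to this configuration's classpath
classpath=
# Native libraries this configuration requires
library.path=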

Project Structure

A shim project relies upon a set of common source, test, resource, and build scripts to reduce the amount of code duplication. Shims are built with Subfloor.

.
|-- shim-project/
|   |-- src/
|   |-- test-src/
|   |-- build.properties
|   |-- build.xml
|   |-- ivy.xml
|   |-- ivysettings.xml
|   `-- package-ivy.xml
|
.
.
| Other files shim projects rely upon (these are at the same directory depth as the shim project):
|-- common/
|   |-- package-res/
|   |-- src/
|   |-- src-mapred/
|   `-- test-src/
|-- build.xml
|-- common-build.properties
`-- common-shims-build.xml

The common source and tests are implementations that are common across all 0.20-based Hadoop configurations. For now this covers all of our configurations (including CDH4 as well as any 1.x configurations). The common build script (common-shims-build.xml) overrides Subfloor build targets to include the common source files where necessary. The build.xml in the root of the shims directory provides a simple place to execute all shim module build scripts from one location (an attempt at a multi-module "project" script).

Building the project

The shim projects are Ant-based projects that rely on Subfloor. To build the project:

ant resolve dist

The resolve target will preload Apache Ivy and download all jar dependencies required for the project. The dist target will compile, jar, and package the configuration.

This package is what the Pentaho Big Data Plugin project's assembly phase uses; it is extracted into pentaho-big-data-plugin/hadoop-configurations/.

To use your new shim, extract the packaged tar.gz or zip archive from the dist directory of your shim project into the hadoop-configurations folder within the Big Data Plugin, then update the active.hadoop.configuration property in plugin.properties to match the folder name (the identifier) of your new shim.

Example:

.
|-- hadoop-configurations/
|   |-- my-custom-shim/
|   `-- hadoop-20/
`-- plugin.properties
plugin.properties:
active.hadoop.configuration=my-custom-shim