
Hadoop Configurations

New in the Pentaho Big Data Plugin v1.3, Hadoop Configurations, also known as shims or the Hadoop Configurations Adaptive layer, are collections of Hadoop libraries required to communicate with a specific version of Hadoop and related tools (Hive, HBase, Sqoop, Pig, etc.). They are designed to be easily configured.

...

Code Block
configuration/
 |-- lib/ : contains all libraries specific to the version of Hadoop this configuration was created to communicate with.
 |   |-- client/ : Libraries that are only required on a Hadoop client (e.g. hadoop-core-*, hadoop-client-*)
 |   |-- pmr/ : Jars that contain libraries that need to be available to the entire JVM of a Hadoop Job task.
 |   |   Implementations of parsing data in specific input/output formats are the most common case.
 |   `-- *.jar : All other libraries required for this shim that are not client-only or special "pmr" jars that need to be
         available to the entire JVM of Hadoop Job tasks
 |-- config.properties : metadata and configuration options for this Hadoop configuration
 |-- core-site.xml : Hadoop core-site configuration file (left as a placeholder to indicate it could be used)
 `-- configuration-implementation.jar : Implementation of abstraction required by the Big Data Plugin to communicate
     with this configuration

...

Developing a new configuration

Sometimes it is not enough to simply copy an existing compiled configuration to communicate with a specific cluster; occasionally all code that interfaces with the Hadoop libraries must be recompiled (relinked) against the new libraries.

Basing off an existing configuration

New configurations can be created by identifying the included configuration that most closely matches the version of Hadoop you wish to communicate with, copying it, and swapping out the jar files in the lib/ directory to match the cluster you want to communicate with. If you compare the default configurations included with the plugin, the differences are apparent.
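The copy-and-swap step can be sketched programmatically. The following is a minimal, hypothetical sketch (the directory names, such as hadoop-20 and my-custom-shim, are placeholder assumptions, not shipped paths) that clones an existing configuration directory and clears lib/client/ so the cluster's own client jars can be dropped in:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class CloneConfiguration {

    /** Recursively copy an existing shim configuration directory to a new one. */
    static void copyConfiguration(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            // Files.walk visits a directory before its contents, so parent
            // directories are created in the target before their children.
            for (Path p : (Iterable<Path>) paths::iterator) {
                Files.copy(p, target.resolve(source.relativize(p)),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    /** Delete the jars under lib/client/ so the cluster's versions can replace them. */
    static void clearClientJars(Path configuration) throws IOException {
        Path client = configuration.resolve("lib/client");
        if (!Files.isDirectory(client)) {
            return;
        }
        try (Stream<Path> entries = Files.list(client)) {
            for (Path entry : (Iterable<Path>) entries::iterator) {
                if (entry.toString().endsWith(".jar")) {
                    Files.delete(entry);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths: adjust to the configuration you are basing off.
        Path base = Paths.get("plugins/pentaho-big-data-plugin/hadoop-configurations/hadoop-20");
        Path custom = base.resolveSibling("my-custom-shim");
        if (!Files.isDirectory(base)) {
            System.err.println("Base configuration not found: " + base);
            return;
        }
        copyConfiguration(base, custom);
        clearClientJars(custom);
        // Now drop the cluster's hadoop-core-*/hadoop-client-* jars into my-custom-shim/lib/client/.
    }
}
```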

Compatible Distributions

We support various versions of the most common distributions. The best way to see the full list is to refer to the GitHub repo itself: https://github.com/pentaho/pentaho-hadoop-shims, but here are the highlights:

  • Apache Hadoop 0.20 -- Plain vanilla distro enabled by default
  • Cloudera -- Earliest version supported was cdh3u4. We support several dot releases under CDH4 as well as CDH5.
  • HortonWorks -- We support hdp12, hdp13, and hdp20 so far.
  • MapR -- We support several dot releases of mapr2 as well as mapr30 and mapr31. We have also provided initial support for MapR under Windows.
    • There is a special page in the wiki that provides detailed configuration settings for MapR on the different major platforms.
  • Intel -- We support the idh23 distribution that Intel released before dropping their distribution.

Anatomy of a configuration

The pentaho-hadoop-shims-api project provides the API/SPI for developing a shim implementation. A Hadoop configuration is a combination of shim implementation and supporting metadata and libraries. The following SPIs exist for interfacing with Hadoop-related libraries:

...

The config.properties file defines a friendly name for the configuration as well as any additional classpath entries and native libraries the configuration requires. See the file's in-line comments for more details.
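As an illustration, a minimal config.properties might look like the following. The keys and values shown here are assumptions for illustration only; the in-line comments of a shipped shim's config.properties are the authoritative reference.

```properties
# Friendly name for this configuration (hypothetical values throughout)
name=My Custom Hadoop Distribution

# Additional entries to add to this configuration's classpath (comma-separated)
classpath=lib/custom-auth.jar

# Directories containing native libraries this configuration requires
library.path=native/Linux-amd64-64
```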

Project Structure

A shim project relies upon a set of common source, test, resource, and build scripts to reduce the amount of code duplication. Shims are built with Subfloor.

...

Code Block
plugin.properties:
active.hadoop.configuration=my-custom-shim
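How the plugin might resolve the active configuration from this property can be sketched with plain java.util.Properties. This is a simplified illustration of the lookup, not the plugin's actual implementation, and the directory layout it assumes (plugin.properties beside a hadoop-configurations/ folder) is taken from the structure described above:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class ActiveConfigurationResolver {

    /**
     * Read active.hadoop.configuration from plugin.properties and return the
     * matching directory under hadoop-configurations/.
     */
    static Path resolveActiveConfiguration(Path pluginDir) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(pluginDir.resolve("plugin.properties"))) {
            props.load(in);
        }
        String active = props.getProperty("active.hadoop.configuration");
        if (active == null || active.isEmpty()) {
            throw new IllegalStateException("active.hadoop.configuration is not set");
        }
        return pluginDir.resolve("hadoop-configurations").resolve(active);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path: the big data plugin directory inside a PDI install.
        Path pluginDir = Paths.get("plugins/pentaho-big-data-plugin");
        if (Files.exists(pluginDir.resolve("plugin.properties"))) {
            System.out.println("Active shim: " + resolveActiveConfiguration(pluginDir));
        }
    }
}
```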

Published Shim Packages

The pentaho-hadoop-shims repo in GitHub contains the core API and SPI classes for Hadoop interaction, a "common" set of implementations from which most shims extend, and a directory for each supported distribution version.
Each shim folder contains the distribution-specific libraries, configuration settings, and SPI implementations. All shims use Subfloor to compile and package for deployment.

A packaged shim has the following structure:

Code Block

configuration/
 |-- lib/ : contains all libraries specific to the version of Hadoop this configuration was created to communicate with.
 |   |-- client/ : Libraries that are only required on a Hadoop client (e.g. hadoop-core-*, hadoop-client-*)
 |   |-- pmr/ : Jars that contain libraries that need to be available to the entire JVM of a Hadoop Job task. Implementations of parsing data in specific input/output formats are the most common case.
 |   `-- *.jar : All other libraries required for this shim that are not client-only or special "pmr" jars that need to be
         available to the entire JVM of Hadoop Job tasks
 |-- config.properties : metadata and configuration options for this Hadoop configuration
 |-- core-site.xml : Hadoop core-site configuration file (left as a placeholder to indicate it could be used)
 `-- configuration-implementation.jar : Implementation of abstraction required by the Big Data Plugin to communicate
     with this configuration