{excerpt} !Common Images^new-icon.png! How to set up and configure Pentaho (Kettle, Pentaho Data Integration, Pentaho Business Analytics Suite) for your specific Hadoop distribution.{excerpt}

*{color:red}This page applies to Kettle, Pentaho Data Integration (DI) and Pentaho Business Analytics (BA) Suite version 5.0 only.  For Kettle 4.4 (or Pentaho BA suite 4.8) see [this page|BAD:4.4 Configuring Pentaho for your Hadoop Distro and Version].{color}*

_NOTE: Pentaho is pre-configured for Apache Hadoop 0.20.2. If you are using  this distribution and version, no further configuration is required._

Pentaho supports different versions of Hadoop distributions from many vendors, such as Apache, Cloudera, DataStax, Hortonworks, Intel, and MapR. How can Pentaho support so many Hadoop distributions? Pentaho uses an abstraction layer, called a _shim_, that connects to the different Hadoop distributions. A shim is a small library that intercepts API calls and either redirects them, handles them, or changes their parameters. Pentaho periodically develops new shims as vendors release new Hadoop distributions and versions. These big data shims are tested and certified by Pentaho engineers. The following steps will help you set up Pentaho to work with your Hadoop cluster.
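
To make the shim idea concrete, here is a minimal sketch in Python. It is not Pentaho's implementation: every class, method, and return value below is hypothetical and exists only to illustrate how a thin layer can redirect a call or change its parameters so that the caller never sees the vendor-specific API.

```python
# Illustrative only: these classes are hypothetical stand-ins, not Pentaho or Hadoop APIs.

class HadoopClient20:
    """Stand-in for one vendor client: takes only a path."""

    def list_dir(self, path):
        return f"hadoop-20 listing of {path}"


class HadoopClientCdh42:
    """Stand-in for another vendor client: different method name, extra argument."""

    def list_status(self, path, recursive):
        return f"cdh42 listing of {path} (recursive={recursive})"


class Hadoop20Shim:
    """Redirects the common call to the hadoop-20 style client unchanged."""

    def __init__(self):
        self._client = HadoopClient20()

    def list_files(self, path):
        return self._client.list_dir(path)


class Cdh42Shim:
    """Redirects the common call and supplies the extra parameter the client expects."""

    def __init__(self):
        self._client = HadoopClientCdh42()

    def list_files(self, path):
        return self._client.list_status(path, recursive=False)


# Registry keyed by shim name, mirroring the idea of one active shim at a time.
SHIMS = {"hadoop-20": Hadoop20Shim, "cdh42": Cdh42Shim}


def load_shim(name):
    """Return an instance of the shim registered under the given name."""
    return SHIMS[name]()
```

Calling code only ever uses {{list_files}}, so switching distributions means switching which shim is loaded, not rewriting the caller.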

{anchor:GetShim}
h1. Determine the proper shim for your Hadoop Distro and version

In the following table,  click the tab of the Hadoop distribution that you are interested in,  then locate the version of the distribution you want to use. Note the  name of the corresponding shim and the minimum version of the Pentaho  software that supports it.

For example, if you want to use Cloudera's CDH 4.2.1, click the Cloudera tab, then look in the Hadoop Version column. CDH4.2.x is supported by the cdh42 shim. You need Pentaho Business Analytics (or Pentaho Data Integration) version 5.0 or later installed to use this shim.
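
The same lookup can be sketched in code. The function below is a hypothetical helper, not part of Pentaho; it encodes only the CDH 4.0 through 4.5 rows of the Cloudera card in the support matrix on this page and deliberately leaves out the CDH 4.6 and CDH5 Beta rows, which carry extra caveats.

```python
def shim_for_cdh(version):
    """Return the shim name listed for a CDH 4.0-4.5 version, or None if not covered.

    Hypothetical helper: the mapping is copied from the Cloudera rows of the
    support matrix on this page and is not read from any Pentaho API.
    """
    # Versions that have their own dedicated shim in the matrix.
    exact = {"4.1.2": "cdh412", "4.1.3": "cdh413"}
    if version in exact:
        return exact[version]

    parts = version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts):
        return None
    major, minor = int(parts[0]), int(parts[1])

    if major == 4 and minor in (0, 1):
        return "cdh4"      # CDH4.0, 4.0.1, 4.1, 4.1.1
    if major == 4 and 2 <= minor <= 5:
        return "cdh42"     # CDH4.2.x through CDH4.5
    return None            # not covered by this sketch


print(shim_for_cdh("4.2.1"))
```

Note that the matrix also says the cdh42 shim supports the earlier CDH 4.x versions, so returning {{cdh42}} for those would be an equally valid choice.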

----
*Pentaho Shim Support Matrix*
{composition-setup}{composition-setup}
{deck:id=MyDeck|class=tan}


{card:label=Apache}
|| Hadoop Version || Shim || Pentaho Suite Ver || Download || Notes ||
| 0.20.x | hadoop-20 | 5.0 | included | |
| 1.0.x | NS\* | | | Planned see [PDI-10984|http://jira.pentaho.com/browse/PDI-10984] |
| 1.1.x | NS\* | | | Not likely to be done in favor of 1.2.x [PDI-9964|http://jira.pentaho.com/browse/PDI-9964] |
| 1.2.x | NS\* | | | Possibly in patch post 5.0 but not committed [http://jira.pentaho.com/browse/PDI-10393] |
| 2.x.x | NS\* | | | Distro is Alpha |
_Go to_ _[Apache releases|http://hadoop.apache.org/releases.html]_
{card}

{card:label=Cloudera}
|| Hadoop Version || Shim || Pentaho Suite Ver || Download || Notes ||
| CDH4.0, 4.0.1, 4.1, 4.1.1 | cdh4 | 5.0 | [download|https://pentaho.box.com/50-cdh4] | The cdh42 shim also supports this configuration |
| CDH4.1.2 | cdh412 | 5.0 | [download|https://pentaho.box.com/50-cdh412] | The cdh42 shim also supports this configuration |
| CDH4.1.3 | cdh413 | 5.0 | [download|https://pentaho.box.com/50-cdh413] | The cdh42 shim also supports this configuration |
| CDH4.2.x | cdh42 | 5.0 | included | Backward compatible with all earlier cdh4.x distros |
| CDH4.3 - CDH4.5 | cdh42 | 5.0 | included | |
| CDH4.6 | ++cdh42 | 5.0 | included | {color:red}++Not yet QA tested but minor releases rarely have issues{color} [PDI-11605|http://jira.pentaho.com/browse/PDI-11605]|
| CDH5 Beta | cdh5beta | \*\*5.0.4 | [download|https://pentaho.box.com/s/sa8ni1wjkhcmq1rs5lt9] | CDH 5 is currently in beta. |

_Go to_ _[Cloudera releases|https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads]_

{color:navy}\*NOTE: the cdh42 shim supports all versions of CDH from 4.0 through 4.5.x{color}
{card}

{card:label=DataStax}
|| Hadoop Version || Shim || Pentaho Suite Ver || Download || Notes ||
| DSE 3.0.x | NS\* | | | Possibly in patch post 5.0 but not committed [PDI-8036|http://jira.pentaho.com/browse/PDI-8036] |
| DSE 2.2.x | NS\* | | | No current plans to support |
_Go to_ _[DataStax releases|http://www.datastax.com/docs/datastax_enterprise3.0/dse_release_notes]_
{card}

{card:label=Hortonworks}
|| Hadoop Version || Shim || Pentaho Suite Ver || Download || Notes ||
| HDP 1.2.x | hdp12 | 4.8 + BD Plugin 1.3.2\+ | [download|https://pentaho.box.com/50-hdp12] | |
| HDP 1.3.x | hdp13 | 4.8 + BD Plugin 1.3.2\+ | included | |
| HDP 2.0 | hdp20 | \*\*5.0.4 | included | |
| HDP 1.3 for Win | NS\* | | | On hold, testing and support is waiting for customer demand. Vote here: [PDI-10266|http://jira.pentaho.com/browse/PDI-10266] |
_Go to_ _[Hortonworks releases|http://hortonworks.com/download/]_
{card}

{card:label=Intel}
|| Hadoop Version || Shim || Pentaho Suite Ver || Download || Notes ||
| IDH 2.3 | idh23 | 4.8 + BD Plugin 1.3.2\+ | [download|https://pentaho.box.com/50-idh23] | |
_Go to_ _[Intel releases|http://hadoop.intel.com/]_
{card}

{card:label=MapR}
|| Hadoop Version || Shim || Pentaho Suite Ver || Download || Notes ||
| 1.1.3, 1.2.0 | mapr | 4.8\+ | [download|https://pentaho.box.com/50-mapr] | |
| 2.0.x | NS\* | | | No Support planned [PDI-9648|http://jira.pentaho.com/browse/PDI-9648] |
| 2.1.x | mapr21 | 4.8 + BD Plugin 1.3.2\+ | included | |
| 3.0.x | mapr30 | \*\*5.0.4 | included with 5.0.4 |  |
_Go to_ _[MapR releases|http://www.mapr.com/doc/display/MapR/MapR+Release+Notes]_
{card}
{deck}
*_\* NS - Not supported._* _See_ _[Hadoop Configurations]_ _for information on how to create or modify a shim to support your configuration_

*_+ Pentaho Ver{_}* _is the earliest version of the Pentaho suite that supports this shim.  Subsequent Pentaho versions will also support this shim unless otherwise noted._

*_\*\* 5.0.4 - Only supported with Big Data Plugin 5.0.4 or later._* EE customers can upgrade to 5.0.4 by going to [support.pentaho.com|http://support.pentaho.com]. CE users can upgrade by following the [Upgrade Hadoop in Community Edition to 5.0.4] instructions.
----

h1. If the Hadoop distribution you want is supported but not installed by default

Download the shim from our support site by clicking the *Download* link for the shim in the table above. For a list of all available shims, go to [5.0.0 shims|https://pentaho.app.box.com/50Shims] for versions 5.0.0 to 5.0.3 and [5.0.4 shims|https://pentaho.box.com/s/qti5t2x11t0lgz1rxajl] for version 5.0.4. _We recommend that you upgrade to 5.0.4 if possible._

Go to [BAD:Install Hadoop Distribution Shim] for instructions on how to install the shim.

h1. If the Hadoop distribution you want *is not* listed as supported in the table
* Check the [archives|https://pentaho.app.box.com/50Shims] for older shims. If you find the shim you want there, download it, then go to [Install Hadoop Distribution Shim|http://wiki.pentaho.com/display/BAD/Install+Hadoop+Distribution+Shim] for instructions on how to install it.
* Look at the table of Jira cases below. The shim you want might be scheduled for development but not yet released.
* If you still can't find the shim and want to request that Pentaho develop one, [fill out a Jira ticket|http://pedroalves-bi.blogspot.com/2013/12/on-continuous-effort-to-try-to-improve.html].
* If you still can't find the shim and want to develop it yourself, see the [Hadoop Configuration page|http://wiki.pentaho.com/display/BAD/Hadoop+Configurations] for more information.



{anchor:SetActiveShim}
h2. Set Active Hadoop Distribution

These steps apply to DI and BA Servers as well as the design tools Spoon, Report Designer, and Metadata Editor.

Specify which Hadoop distribution (shim) you want to make active. You must do this for each Pentaho application that needs access to the Hadoop cluster. Only one distribution can be active at a time, so each time you change the distribution or version, you need to reset the active Hadoop distribution.

# *Stop* the application (e.g. Spoon, DI Server, Report Designer, BA Server, Metadata Editor) if it is running.
# *Navigate* to the pentaho-big-data-plugin folder. Its location differs for each application:
** DI Server - *data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin*
** BA Server - *biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin*
** Spoon - *data-integration/plugins/pentaho-big-data-plugin*
** Report Designer - *report-designer/plugins/pentaho-big-data-plugin*
** Metadata Editor - *metadata-editor/plugins/pentaho-big-data-plugin*
# *Edit* the *plugin.properties* file.
# *Set* the *active.hadoop.configuration* property to match the name of the shim you want to make active. For example, if the name of the shim is *cdh42*, the line would look like this: *active.hadoop.configuration=cdh42*.
# *Save* and close the *plugin.properties* file.
# *Start* the application.

*MapR users* need to do further configuration, as described in [Additional Configuration for MapR Shims].
*CDH 5 users* who want to use Map Reduce 1 instead of Map Reduce 2 should follow the instructions in [Additional Configuration for using MR1 with CDH5].
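
If you change shims often, the edit to *plugin.properties* can be scripted. The following is a minimal sketch, not a Pentaho tool: the function names and the {{PLUGIN_DIRS}} dictionary are hypothetical, the folder paths are the ones listed in the steps above (relative to wherever each application is installed), and it assumes the property is written as a plain {{key=value}} line. Stop the application before running it, as in the manual steps.

```python
import re
from pathlib import Path

# Hypothetical lookup table; the relative paths are copied from the steps above.
PLUGIN_DIRS = {
    "DI Server": "data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin",
    "BA Server": "biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin",
    "Spoon": "data-integration/plugins/pentaho-big-data-plugin",
    "Report Designer": "report-designer/plugins/pentaho-big-data-plugin",
    "Metadata Editor": "metadata-editor/plugins/pentaho-big-data-plugin",
}

KEY = "active.hadoop.configuration"


def set_active_shim(properties_text, shim):
    """Return the properties text with active.hadoop.configuration set to shim.

    An existing (uncommented) assignment is replaced in place; if none is
    found, the assignment is appended as a new last line.
    """
    new_line = f"{KEY}={shim}"
    pattern = re.compile(rf"\s*{re.escape(KEY)}\s*[=:]")
    out = []
    replaced = False
    for line in properties_text.splitlines():
        if pattern.match(line):
            out.append(new_line)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


def update_plugin_properties(plugin_dir, shim):
    """Rewrite plugin.properties inside plugin_dir so that shim becomes active."""
    path = Path(plugin_dir) / "plugin.properties"
    path.write_text(set_active_shim(path.read_text(), shim))
```

For example, {{update_plugin_properties(PLUGIN_DIRS["Spoon"], "cdh42")}}, run from the directory that contains the {{data-integration}} folder, would make cdh42 the active shim for Spoon. Keeping the text transformation separate from the file I/O makes it easy to check the result before writing anything to disk.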

h1. Open JIRA Cases for Hadoop Distribution Support
{jiraissues:anonymous=true|columns=key;fixVersion;summary;status;assignee;updated|url=http://jira.pentaho.com/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?jqlQuery=labels+%3D+BD_Distro+AND+status+in+%28Open%2C+%22Resolved%22%2C+%22In+Progress%22%2C+Reopened%2C+%22Ready+For+Test%22%2C+%22Ready+for+Publishing%22%29&tempMax=1000}


h2. Next Steps

Now that you've configured Pentaho for your Hadoop distribution, there are many things you can do.  Here are a few links to get you started\!


* Check out how to [load data in a Hadoop cluster|http://wiki.pentaho.com/display/BAD/Loading+Data+into+a+Hadoop+Cluster].
* Learn how to [transform data within a cluster|http://wiki.pentaho.com/display/BAD/Transforming+Data+within+a+Hadoop+Cluster].
* Read about how to [extract data from a cluster|http://wiki.pentaho.com/display/BAD/Extracting+Data+from+the+Hadoop+Cluster].
* View information on how to [report data in Hadoop|http://wiki.pentaho.com/display/BAD/Reporting+on+Data+in+Hadoop].
* Learn more about [Pentaho MapReduce|http://wiki.pentaho.com/display/BAD/Advanced+Pentaho+MapReduce].
* [Explore the Pentaho Infocenter|http://infocenter.pentaho.com/help/index.jsp] to learn more about Pentaho software.

Want to switch gears and read something a little different? Check out these articles on the evolution of Hadoop.


* [Part I|http://drcos.boudnik.org/2012/01/what-you-wanted-to-know-about-hadoop.html?showComment=1357367337850#c4727344900010484442] and [Part II|http://drcos.boudnik.org/2013/01/what-you-wanted-to-know-about-hadoop.html] of Genealogy of Elephants
* [A brief history of Apache Hadoop branches and releases|http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/]

[!http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AAAAAAAAAD0/dEWFFYTRgYw/s1600/output-file.png|width=100,height=75!|http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AAAAAAAAAD0/dEWFFYTRgYw/s1600/output-file.png] [!http://hortonworks.com/wp-content/uploads/2013/05/hdp13.png|width=100,height=75!|http://hortonworks.com/wp-content/uploads/2013/05/hdp13.png]