How to set up and configure Pentaho (Kettle, Pentaho Data Integration, Pentaho Business Analytics Suite) for your specific Hadoop distribution.

This page applies only to version 5.0 of Kettle, Pentaho Data Integration (DI), and the Pentaho Business Analytics (BA) Suite. For Kettle 4.4 (or Pentaho BA Suite 4.8) see this page.

NOTE: Pentaho is pre-configured for Apache Hadoop 0.20.2. If you are using this distribution and version, no further configuration is required.

Pentaho supports multiple versions of Hadoop distributions from vendors such as Apache, Cloudera, DataStax, Hortonworks, Intel, and MapR. It can support this many distributions because it connects to each one through an abstraction layer called a shim. A shim is a small library that intercepts API calls and then redirects them, handles them itself, or changes their parameters. Pentaho develops new shims periodically as vendors release new Hadoop distributions and versions. These big data shims are tested and certified by Pentaho engineers. The following steps help you set up Pentaho to work with your Hadoop cluster.
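To make the shim idea concrete, here is a minimal Python sketch of the pattern described above. It is not Pentaho's implementation: the client classes, the shim classes, the registry names "legacy" and "newer", and the open_file method are all hypothetical and exist only to show one caller-facing interface being redirected to, or adapted for, differing back ends.

```python
# Illustrative sketch only. Every class and name here is hypothetical and is
# not part of Pentaho or Hadoop; it shows the general shim (adapter) pattern.


class LegacyClusterClient:
    """Stand-in for one distribution's client; it wants the port as text."""

    def open_path(self, host, port_text, path):
        return f"legacy://{host}:{port_text}{path}"


class NewerClusterClient:
    """Stand-in for another distribution's client with a different method."""

    def open(self, uri):
        return f"opened {uri}"


class LegacyShim:
    """Changes parameters: converts the integer port to the expected string."""

    def __init__(self, client):
        self._client = client

    def open_file(self, host, port, path):
        return self._client.open_path(host, str(port), path)


class NewerShim:
    """Redirects the call to a differently named method taking a single URI."""

    def __init__(self, client):
        self._client = client

    def open_file(self, host, port, path):
        return self._client.open(f"hdfs://{host}:{port}{path}")


# Registry keyed by a shim name, loosely analogous to choosing one active shim.
SHIMS = {
    "legacy": lambda: LegacyShim(LegacyClusterClient()),
    "newer": lambda: NewerShim(NewerClusterClient()),
}


def load_shim(name):
    """Return the shim registered under the given name."""
    return SHIMS[name]()
```

The calling code only ever uses open_file; which back end answers depends solely on the shim name that was selected.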

1) Determine the proper shim for your Hadoop Distro and version

In the following table, click the tab of the Hadoop distribution that you are interested in, then locate the version of the distribution you want to use. Note the name of the corresponding shim and the minimum version of the Pentaho software that supports it.

For example, if you want to use Cloudera's CDH 4.2.1, click the Cloudera tab, then look in the Hadoop Version column. CDH4.2.x is supported by the cdh42 shim. You need Pentaho Business Analytics (or Pentaho Data Integration) version 5.0 or later installed to use this shim.
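The lookup in this example can be sketched as a small Python function. The data below is copied from the Cloudera rows of the support matrix on this page (CDH5 Beta is left out); the function name find_cdh_shim and the two-dictionary layout are assumptions made for illustration, not part of any Pentaho tool.

```python
# Sketch of the manual table lookup, using the Cloudera rows from this page.
# find_cdh_shim and the dictionary layout are hypothetical, for illustration.

# Rows that name exact CDH versions.
EXACT_VERSIONS = {
    "4.0": "cdh4",
    "4.0.1": "cdh4",
    "4.1": "cdh4",
    "4.1.1": "cdh4",
    "4.1.2": "cdh412",
    "4.1.3": "cdh413",
}

# Rows written as "4.2.x", "4.3.x", and so on, keyed by major.minor.
VERSION_SERIES = {
    "4.2": "cdh42",
    "4.3": "cdh42",
    "4.4": "cdh42",
    "4.5": "cdh42",
}


def find_cdh_shim(version):
    """Return the shim name listed for a CDH version, or None if not listed."""
    if version in EXACT_VERSIONS:
        return EXACT_VERSIONS[version]
    series = ".".join(version.split(".")[:2])
    return VERSION_SERIES.get(series)
```

For instance, find_cdh_shim("4.2.1") matches the CDH4.2.x row and returns "cdh42". A version with no row in the table returns None, which corresponds to step 3 below.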


Pentaho Shim Support Matrix

Apache

| Hadoop Version | Shim      | Pentaho Suite Ver | Download | Notes |
|----------------|-----------|-------------------|----------|-------|
| 0.20.x         | hadoop-20 | 5.0               | included |       |
| 1.0.x          | NS*       |                   |          | Planned, see PDI-10984 |
| 1.1.x          | NS*       |                   |          | Not likely to be done, in favor of 1.2.x (PDI-9964) |
| 1.2.x          | NS*       |                   |          | Possibly in a patch after 5.0, but not committed: http://jira.pentaho.com/browse/PDI-10393 |
| 2.x.x          | NS*       |                   |          | Distro is Alpha |

Go to Apache releases

Cloudera

| Hadoop Version            | Shim     | Pentaho Suite Ver | Download | Notes |
|---------------------------|----------|-------------------|----------|-------|
| CDH4.0, 4.0.1, 4.1, 4.1.1 | cdh4     | 5.0               | download | The cdh42 shim also supports this configuration |
| CDH4.1.2                  | cdh412   | 5.0               | download | The cdh42 shim also supports this configuration |
| CDH4.1.3                  | cdh413   | 5.0               | download | The cdh42 shim also supports this configuration |
| CDH4.2.x                  | cdh42    | 5.0               | included | Backward compatible with all earlier cdh4.x distros |
| CDH4.3.x                  | cdh42    | 5.0               | included |       |
| CDH4.4.x                  | cdh42    | 5.0               | included |       |
| CDH4.5.x                  | cdh42    | 5.0               | included |       |
| CDH5 Beta                 | cdh5beta | **5.0.4           | download | CDH 5 is currently in beta. |

Go to Cloudera releases

*NOTE: the cdh42 shim supports all versions of CDH from 4.0 through 4.4.x

DataStax

| Hadoop Version | Shim | Pentaho Suite Ver | Download | Notes |
|----------------|------|-------------------|----------|-------|
| DSE 3.0.x      | NS*  |                   |          | Possibly in a patch after 5.0, but not committed (PDI-8036) |
| DSE 2.2.x      | NS*  |                   |          | No current plans to support |

Go to DataStax releases

Hortonworks

| Hadoop Version  | Shim  | Pentaho Suite Ver      | Download | Notes |
|-----------------|-------|------------------------|----------|-------|
| HDP 1.2.x       | hdp12 | 4.8 + BD Plugin 1.3.2+ | download |       |
| HDP 1.3.x       | hdp13 | 4.8 + BD Plugin 1.3.2+ | included |       |
| HDP 2.x         | NS*   |                        |          | In patch post 5.0 - PDI-8962 |
| HDP 1.3 for Win | NS*   |                        |          | In patch post 5.0 - PDI-10266 |

Go to Hortonworks releases

Intel

| Hadoop Version | Shim  | Pentaho Suite Ver      | Download | Notes |
|----------------|-------|------------------------|----------|-------|
| IDH 2.3        | idh23 | 4.8 + BD Plugin 1.3.2+ | download |       |

Go to Intel releases

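MapR

| Hadoop Version | Shim   | Pentaho Suite Ver      | Download            | Notes |
|----------------|--------|------------------------|---------------------|-------|
| 1.1.3, 1.2.0   | mapr   | 4.8+                   | download            |       |
| 2.0.x          | NS*    |                        |                     | No support planned (PDI-9648) |
| 2.1.x          | mapr21 | 4.8 + BD Plugin 1.3.2+ | included            |       |
| 3.0.x          | mapr30 | **5.0.4                | included with 5.0.4 |       |

Go to MapR releases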

* NS - Not supported. See Hadoop Configurations for information on how to create or modify a shim to support your configuration.

+ Pentaho Ver is the earliest version of the Pentaho suite that supports this shim. Subsequent Pentaho versions will also support this shim unless otherwise noted.

** 5.0.4 - Only supported with Big Data Plugin 5.0.4 or later. EE customers can upgrade to 5.0.4 by going to support.pentaho.com. CE users can upgrade by following the Upgrade Hadoop in Community Edition to 5.0.4 instructions.
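The "earliest supported version" rule in the notes above amounts to a version comparison. The following Python sketch shows one way to express it, assuming plain dotted version strings such as 5.0 or 5.0.4. The function names are hypothetical, and entries that combine a suite version with a plugin version (for example "4.8 + BD Plugin 1.3.2+") are outside what this sketch handles.

```python
# Sketch of the "earliest Pentaho version" rule. Function names are
# hypothetical. Handles only plain dotted versions such as "5.0" or "5.0.4".


def parse_version(text, width=3):
    """Turn "5.0" into (5, 0, 0) so that short and long forms compare equal."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def meets_minimum(installed, minimum):
    """True if the installed version is the minimum version or a later one."""
    return parse_version(installed) >= parse_version(minimum)
```

With this rule, an installed 5.0 does not satisfy a shim marked 5.0.4, while 5.0.4 or any later version does.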


2) If the Hadoop distribution you want is supported

Determine whether the shim is included with the Pentaho software or whether you need to download it from our site. If you need to download the shim, click the Download link for the shim. Go to Install Hadoop Distribution Shim for instructions on how to install the shim.

3) If the Hadoop distribution you want isn't listed as supported in the table

  • Check the archives for older shims. If you find the shim you want there, download it, then go to Install Hadoop Distribution Shim for instructions on how to install the shim.
  • Look at the following table of Jira cases. The shim you want might be scheduled for development, but not yet released.
  • If you still can't find the shim, but want to request that Pentaho develop one, fill out a Jira ticket.
  • If you still can't find the shim, but want to develop it yourself, check out the Hadoop Configuration page for more information.

Open JIRA Cases for Hadoop Distribution Support

Set Active Hadoop Distribution

NOTE:  If you want to set MapR as the active distribution, go to the [Special Pentaho Configuration Instructions for Your Hadoop Distributions] page.

Specify which Hadoop distribution you want to make active. Do this for each Pentaho component from which you want to access the Hadoop distribution. Pentaho components include the DI and BA Servers as well as design tools such as Spoon, Report Designer, and Metadata Editor. Only one distribution can be active at a time, so each time you want to change distributions you must reset the active Hadoop distribution.

      1. If you have not done so already, stop the components (Spoon, DI Server, BA Server, Report Designer, Metadata Editor) from which you want to access the Hadoop distribution.

      2. Repeat the remaining steps for each Pentaho component from which you want to access the Hadoop distribution.

      3. Navigate to the directory that contains the plugin.properties file for the component.

  • DI Server - data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
  • BA Server - biserver-ee/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin
  • Spoon - data-integration/plugins/pentaho-big-data-plugin
  • Report Designer - report-designer/plugins/pentaho-big-data-plugin
  • Metadata Editor - metadata-editor/plugins/pentaho-big-data-plugin

     4. Open the plugin.properties file.   

     5. Set the active.hadoop.configuration property to the name of the shim you want to make active. For example, if the name of the shim is cdh42, the line would look like this: active.hadoop.configuration=cdh42

     6. Save and close the plugin.properties file.

     7. If you want to configure CDH 5 to use MapReduce 1 instead of MapReduce 2, follow the instructions on the [Special Pentaho Configuration Instructions for Your Hadoop Distributions] page.

     8. Start the component.
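Steps 3 through 6 above can also be scripted. The Python sketch below is one possible way to do it, not a Pentaho-provided tool: the function names are hypothetical, the latin-1 encoding is an assumption based on the traditional encoding of Java properties files, and the directory passed in must be one of the component directories listed in step 3. Stop the component before running it and start it again afterward, as in steps 1 and 8.

```python
# Sketch of editing plugin.properties from a script. Function names are
# hypothetical; the latin-1 encoding is an assumption about the file.
import re
from pathlib import Path

PROPERTY = "active.hadoop.configuration"


def set_active_shim(properties_text, shim_name):
    """Return the properties text with the active shim set to shim_name.

    An existing assignment is rewritten in place; all other lines, including
    comments, are kept. If the property is missing, it is appended.
    """
    pattern = re.compile(r"^\s*" + re.escape(PROPERTY) + r"\s*[=:].*$")
    lines = properties_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if pattern.match(line):
            lines[index] = f"{PROPERTY}={shim_name}"
            replaced = True
    if not replaced:
        lines.append(f"{PROPERTY}={shim_name}")
    return "\n".join(lines) + "\n"


def update_plugin_properties(plugin_dir, shim_name):
    """Rewrite plugin.properties inside the given big data plugin directory."""
    path = Path(plugin_dir) / "plugin.properties"
    original = path.read_text(encoding="latin-1")
    path.write_text(set_active_shim(original, shim_name), encoding="latin-1")
```

For Spoon, for example, the call would be update_plugin_properties("data-integration/plugins/pentaho-big-data-plugin", "cdh42"), run from the directory that contains the data-integration folder.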

Next Steps

Now that you've configured Pentaho for your Hadoop distribution, there are many things you can do.  Here are a few links to get you started!

Want to switch gears and read something a little different? Check out these articles on the evolution of Hadoop.
