	Warning - Pentaho 5.2, 5.3, and 5.4 - Configuration

Additional Configuration for YARN Shims

For information on configuring EMR shims, see the Configuring EMR Shims section.

Configuring CDH, HDP, and MapR Shims

Prerequisite: Make sure that your client has been configured so it can access any node in your cluster.

1. Set the active Hadoop distribution.
2. Configure the cluster settings.

NOTE: For all supported Hadoop distributions except MapR, add the yarn user on the cluster to the group defined by the dfs.permissions.superusergroup property. You can find the dfs.permissions.superusergroup property in the hdfs-site.xml file on your cluster or in the cluster management application.
3. Navigate to the folder that contains the shim, then open the yarn-site.xml file in a text editor. Adjust the following parameters as needed.

Parameter	Values
yarn.application.classpath

HDP 2.2 -

Code Block


<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CONF_DIR,/usr/hdp/current/hadoop-client/*,
/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,
/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,
/usr/hdp/current/hadoop-yarn-client/lib/*</value>
</property>

MapR 4.0.1 Windows Client -

Code Block


<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*
:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*
:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*
:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*
:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$PWD/*:%PWD%/*
</value>
</property>

All other shims - Classpaths needed to execute YARN applications. Separate paths with a comma.

yarn.resourcemanager.hostname

CDH 5.x - Update the hostname in your environment or use the default: clouderamanager.cdh5.test
HDP 2.x - Update the hostname in your environment or use the default: sandbox.hortonworks.com
All other shims: Hostname in your environment.

yarn.resourcemanager.address

All shims: Update hostname and port to match your environment.

yarn.resourcemanager.admin.address

All shims: Update hostname and port to match your environment.
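
The resource manager settings in the table can also be applied with a short script instead of a text editor. The following is a minimal sketch, not part of the Pentaho tooling: the helper names `set_property` and `update_yarn_site` are hypothetical, and the ports 8032 and 8033 are assumed defaults that you should replace with the values used by your cluster.

```python
import xml.etree.ElementTree as ET


def set_property(root, name, value):
    """Set the <value> of the <property> named `name`, adding it if absent."""
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == name:
            val = prop.find("value")
            if val is None:
                val = ET.SubElement(prop, "value")
            val.text = value
            return prop
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


def update_yarn_site(path, hostname, rm_port=8032, admin_port=8033):
    """Point the shim's yarn-site.xml at one resource manager host.

    The port numbers are assumptions; check them against your cluster.
    """
    tree = ET.parse(path)
    root = tree.getroot()
    set_property(root, "yarn.resourcemanager.hostname", hostname)
    set_property(root, "yarn.resourcemanager.address", f"{hostname}:{rm_port}")
    set_property(root, "yarn.resourcemanager.admin.address", f"{hostname}:{admin_port}")
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

Note that the default ElementTree parser drops XML comments, so back up the original file first if it contains comments you want to keep.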

There are a few more things that you need to do:

4. (CDH 5.1 and 5.2 only):
* Navigate to the folder that contains the shim, then open the hive-site.xml file in a text editor.
* Modify the hive.metastore.uris property so that it points to the location of your Hive metastore.
* Save and close the hive-site.xml file.
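
Before saving hive-site.xml, it can help to sanity-check the value you entered for hive.metastore.uris. The sketch below assumes the usual form of that value, a comma-separated list of thrift://host:port entries; the function name `parse_metastore_uris` and the example hostnames are hypothetical.

```python
from urllib.parse import urlsplit


def parse_metastore_uris(value):
    """Split a hive.metastore.uris value into (host, port) pairs.

    Raises ValueError if an entry is not of the form thrift://host:port.
    """
    endpoints = []
    for uri in value.split(","):
        uri = uri.strip()
        if not uri:
            continue
        parts = urlsplit(uri)
        if parts.scheme != "thrift" or not parts.hostname or parts.port is None:
            raise ValueError(f"expected thrift://host:port, got {uri!r}")
        endpoints.append((parts.hostname, parts.port))
    return endpoints
```

For example, `parse_metastore_uris("thrift://metastore.example.com:9083")` yields one host and port pair that you can then test for reachability from the client.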

5. (All shims): Navigate to the folder that contains the shim, then open the mapred-site.xml file in a text editor. Then make the changes in the table.

Parameter	Value
mapreduce.jobhistory.address

Set this to the place where job history logs are stored.

mapreduce.app-submission.cross-platform

Add this parameter to the mapred-site.xml file between the <property> tags, then set it to true, like this:

Code Block
<name>mapreduce.app-submission.cross-platform</name>
<value>true</value>

This property allows MapReduce jobs to be submitted across platforms, for example from a Windows client to a Linux cluster, or vice versa.
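
To confirm that the edit took effect, you can read the property back from the file. This is a minimal sketch with hypothetical helper names (`read_properties`, `is_cross_platform`); the sample string shows the name and value pair wrapped in its `<property>` element inside `<configuration>`, which is how Hadoop *-site.xml files are normally laid out.

```python
import xml.etree.ElementTree as ET

# Sample content for illustration only; a real mapred-site.xml has more properties.
SAMPLE_MAPRED_SITE = """
<configuration>
  <property>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping property names to values from *-site.xml text."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): (prop.findtext("value") or "").strip()
        for prop in root.findall("property")
    }


def is_cross_platform(xml_text):
    """True if mapreduce.app-submission.cross-platform is present and set to true."""
    value = read_properties(xml_text).get("mapreduce.app-submission.cross-platform", "false")
    return value.lower() == "true"
```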

6. (HDP 2.2): In the mapred-site.xml file, make the changes shown in the table.

Parameter

Value

mapreduce.application.classpath

Add this property to the mapred-site.xml file, inside the <configuration> element.

Code Block


<property>
<name>mapreduce.application.classpath</name>
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*
:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*
:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*
:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*
:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*
:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure
</value>
</property>

mapreduce.application.framework.path

Add this property to the mapred-site.xml file, inside the <configuration> element.

Code Block
<property>
<name>mapreduce.application.framework.path</name>
<value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
</property>

7. (HDP 2.2 only) In the HDP configuration.properties file on the client, add the following line.

Code Block
java.system.hdp.version=2.2.0.0-2041
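
The ${hdp.version} placeholder used in the mapred-site.xml values above is filled in from the version supplied in this step. The sketch below only imitates that substitution in Python so you can preview the resolved path; it is not Hadoop's own expansion code, and the `expand` helper is hypothetical.

```python
import re


def expand(value, props):
    """Replace ${name} placeholders with entries from props.

    Placeholders with no matching entry are left untouched.
    """
    def repl(match):
        return props.get(match.group(1), match.group(0))

    return re.sub(r"\$\{([^}]+)\}", repl, value)


framework_path = "/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework"
resolved = expand(framework_path, {"hdp.version": "2.2.0.0-2041"})
# resolved == "/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework"
```

If the resolved path does not exist on your cluster's HDFS, the version string in the properties file probably does not match the installed HDP build.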

Note: Not all shim properties can be set in the Spoon user interface, nor are instructions for modifying them listed here. If you need to set additional properties that are not addressed in these instructions, you will need to set them manually in the *-site.xml files that are in the shim directory. Consult your Hadoop distribution's vendor for details about the properties you want to set.

High Availability for CDH 5.3

Note: If you are configuring CDH 5.3 to be used in High Availability mode, we recommend that you use the Cloudera Manager "Download Client Configuration" feature. This feature provides a convenient way to get configuration files from the cluster for a service (such as HBase, HDFS, or YARN). Use it to download the configuration zip files, then unzip them to the pentaho-big-data-plugin/hadoop-configurations/cdh5x directory. For more information on how to do this, see the Cloudera documentation:
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_client_config.html
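
Unpacking a downloaded client configuration archive into the shim directory can be sketched as follows. This example assumes that the *-site.xml files may sit inside a subfolder of the zip and should be written flat into the cdh5x directory; the function name `extract_client_config` and the archive layout are assumptions, so inspect your own download before relying on it.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_client_config(zip_path, shim_dir):
    """Copy every *-site.xml in the archive into shim_dir, ignoring subfolders.

    Returns the sorted list of file names that were written.
    """
    shim_dir = Path(shim_dir)
    shim_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = PurePosixPath(info.filename).name
            if info.is_dir() or not name.endswith("-site.xml"):
                continue
            (shim_dir / name).write_bytes(archive.read(info))
            written.append(name)
    return sorted(written)
```

Existing files with the same name in the shim directory are overwritten, so keep a backup of any hand-edited *-site.xml files.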

Configuring EMR Shims

Amazon EMR is an Amazon Web Services (AWS) offering for big data processing and analysis. It is a popular alternative to hosting cluster computing in-house.

Note: Pentaho does not support HBase on HDFS in EMR 3.4.

Prerequisites

Set the active Hadoop distribution.
PDI should be installed on an EC2 instance.
A working EC2 cluster should already be set up and configured.
We also recommend that you review the Amazon EMR documentation.

Copy *-site.xml Cluster Files To PDI Directories

Copy the core-site.xml, hdfs-site.xml, httpfs-site.xml, mapred-site.xml, yarn-site.xml, and emrfs-site.xml files to these directories:

data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/emr3x
server/data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/emr3x
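
The copy step above can be scripted. The following is a minimal sketch, assuming the cluster files have already been downloaded to a local folder and that both PDI directories live under one root folder; `copy_site_files`, `source_dir`, and `pentaho_root` are hypothetical names.

```python
import shutil
from pathlib import Path

SITE_FILES = [
    "core-site.xml",
    "hdfs-site.xml",
    "httpfs-site.xml",
    "mapred-site.xml",
    "yarn-site.xml",
    "emrfs-site.xml",
]

SHIM_DIRS = [
    "data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/emr3x",
    "server/data-integration-server/pentaho-solutions/system/kettle/plugins/"
    "pentaho-big-data-plugin/hadoop-configurations/emr3x",
]


def copy_site_files(source_dir, pentaho_root, shim_dirs=SHIM_DIRS):
    """Copy each cluster *-site.xml file into every shim directory.

    Returns (copied, missing): names that were copied and names not found
    in source_dir.
    """
    copied, missing = [], []
    for name in SITE_FILES:
        src = Path(source_dir) / name
        if not src.is_file():
            missing.append(name)
            continue
        for rel in shim_dirs:
            dest = Path(pentaho_root) / rel
            dest.mkdir(parents=True, exist_ok=True)  # creates the folder if absent
            shutil.copy2(src, dest / name)
        copied.append(name)
    return copied, missing
```

Check the returned `missing` list: a file that is absent from the download folder is skipped silently rather than raising an error.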

Edit *-site.xml Files in PDI Directories

Edit *-site.xml files using the instructions that follow.
Note: For more details about the properties below, consult the Apache documentation.

Edit the core-site.xml File

Open the core-site.xml file and make the following changes.

Add these lines. Enter your AWS Access Key ID and Secret Access Key as indicated.

Code Block
<property>
   <name>fs.s3.awsAccessKeyId</name>
   <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
   <name>fs.s3.awsSecretAccessKey</name>
   <value>[INSERT YOUR VALUE HERE]</value>
</property>

If needed, enter the AWS Access Key ID and Secret Access Key for S3N, like this:

Code Block
<property>
   <name>fs.s3n.awsAccessKeyId</name>
   <value>[INSERT YOUR VALUE HERE]</value>
</property>

<property>
   <name>fs.s3n.awsSecretAccessKey</name>
   <value>[INSERT YOUR VALUE HERE]</value>
</property>

Then, change this:

Code Block


<property>   
   <name>fs.s3n.impl</name>   
   <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property> 


<property>   
   <name>fs.s3.impl</name>   
   <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>

to this:

Code Block


<property>
   <name>fs.s3n.impl</name>
   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

<property>
   <name>fs.s3.impl</name>
   <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
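
The same replacement of the two filesystem implementation classes can be sketched as a script. This is an illustration only: `override_s3_impls` is a hypothetical helper that operates on the text of core-site.xml and changes only properties that are already present.

```python
import xml.etree.ElementTree as ET

# Target values taken from the "to this" block above.
IMPL_OVERRIDES = {
    "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
    "fs.s3.impl": "org.apache.hadoop.fs.s3.S3FileSystem",
}


def override_s3_impls(xml_text):
    """Rewrite fs.s3n.impl and fs.s3.impl values in core-site.xml text.

    Returns (new_xml_text, changed_property_names).
    """
    root = ET.fromstring(xml_text)
    changed = []
    for prop in root.findall("property"):
        name = prop.findtext("name", "").strip()
        value = prop.find("value")
        if name in IMPL_OVERRIDES and value is not None:
            value.text = IMPL_OVERRIDES[name]
            changed.append(name)
    return ET.tostring(root, encoding="unicode"), changed
```

If the returned list of changed names is empty, the file did not contain either property and was left as it was.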

Edit the mapred-site.xml File

Navigate to the folder that contains the shim, then open the mapred-site.xml file in a text editor. Make the changes in the table. When you are finished, save and close the file.

Parameter

Value

mapreduce.app-submission.cross-platform

Add this parameter to the mapred-site.xml file between the <property> tags, then set it to true, like this:

Code Block
<name>mapreduce.app-submission.cross-platform</name>
<value>true</value>

When set to true, the user can submit an application cross-platform, which means the application can be submitted from a Windows client to a Linux server or vice versa.

Configure for LZO Compression

LZO is a compression format that EMR supports. If you want to configure for LZO compression, you will need to download a jar file. If you do not, you will need to remove a parameter from the core-site.xml file.

If you are not going to use LZO compression: Remove any references to com.hadoop.compression.lzo.LzoCodec from the io.compression.codecs parameter in the core-site.xml file.
If you are going to use LZO compression: Download the LZO jar and add it to the pentaho-big-data-plugin/hadoop-configurations/emr3x/lib directory. The LZO jar can be found here: http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/
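
Removing the LZO entries from an io.compression.codecs value can be sketched as below. The value is a comma-separated list of codec class names; this example assumes that every class in the com.hadoop.compression.lzo package is an LZO codec you want to drop, and the function name `drop_lzo_codecs` is hypothetical.

```python
LZO_PACKAGE = "com.hadoop.compression.lzo."  # assumed package prefix for LZO codecs


def drop_lzo_codecs(codecs_value):
    """Return the io.compression.codecs value with LZO codec classes removed."""
    kept = []
    for codec in codecs_value.split(","):
        codec = codec.strip()
        if codec and not codec.startswith(LZO_PACKAGE):
            kept.append(codec)
    return ",".join(kept)
```

Paste the returned string back into the `<value>` element of io.compression.codecs in core-site.xml.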

Version	Old Version 10	New Version Current
Changes made by	Former user	Former user
Saved on	Jun 11, 2014	Oct 12, 2015
