Additional Configuration for YARN Shims
Prerequisite: Make sure that your client has been configured so it can access any node in your cluster.
- If you are looking for Amazon EMR instructions, see the Configuration for Amazon EMR section below.
- Set the active Hadoop distribution.
- Configure the cluster settings.
- Navigate to the folder that contains the shim, then open the yarn-site.xml file in a text editor. Adjust the following parameters as needed.
...
Note: Not all shim properties can be set in the Spoon user interface, nor are instructions for modifying them listed here. If you need to set additional properties that are not addressed in these instructions, you will need to set them manually in the *-site.xml files that are in the shim directory. Consult your Hadoop distribution's vendor for details about the properties you want to set.
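For reference, every *-site.xml file shares the same structure: name/value property pairs nested inside a single <configuration> element. The sketch below is purely illustrative; yarn.resourcemanager.hostname is a standard YARN property used here only as an example, and the host name shown is an assumption you would replace with your own value.

<configuration>
  <!-- Illustrative example only: substitute the properties your
       cluster and shim actually require. -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <!-- Assumed host name; replace with your ResourceManager host. -->
    <value>resourcemanager.example.com</value>
  </property>
</configuration>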
Configuration for Amazon EMR
EMR is an Amazon Web Services (AWS) product for big data processing and analysis that is a popular alternative to hosting an in-house computing cluster.
NOTE: Pentaho does not support HBase on HDFS in EMR 3.4.
Versions Supported
The versions of EMR that Pentaho 5.4 supports appear in the Support Matrix.
Prerequisites
- PDI should be installed on an EC2 instance.
- A working EMR cluster should already be set up and configured.
- We also recommend that you review Amazon EMR documentation. Here are a few links to get you started.
- http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
- http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
- http://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/ExerciseOverview.html
- http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Introduction.html
- http://docs.aws.amazon.com/AmazonVPC/latest/NetworkAdminGuide/Welcome.html
- http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
Copy *-site.xml Cluster Files to PDI Directories
Copy the core-site.xml, hdfs-site.xml, httpfs-site.xml, mapred-site.xml, yarn-site.xml, and emrfs-site.xml files from your cluster to these directories:
- data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/emr3x
- server/data-integration-server/pentaho-solutions/system/kettle/plugins/pentaho-big-data-plugin/hadoop-configurations/emr3x
Edit *-site.xml Files in PDI Directories
Edit *-site.xml files using the instructions that follow.
If you need more information, consult the Apache documentation for additional details about the properties below.
Edit the core-site.xml File
Open the core-site.xml file and make the following changes.
First, add these lines inside the <configuration> element, entering your AWS Access Key ID and Secret Access Key where indicated:
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>INSERT YOUR VALUE HERE</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>INSERT YOUR VALUE HERE</value>
</property>
In the core-site.xml file, change this:
<property>
  <name>fs.s3n.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
  <name>fs.s3.impl</name>
  <value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
to this:
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
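Taken together, once both edits are made, the relevant portion of core-site.xml should look roughly like the sketch below (the credential values are placeholders for your own keys):

<!-- Sketch of the edited section; replace the placeholder values
     with your own AWS credentials. -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>INSERT YOUR VALUE HERE</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>INSERT YOUR VALUE HERE</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>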
Edit the mapred-site.xml File
Navigate to the folder that contains the shim, then open the mapred-site.xml file in a text editor and make the changes shown in the following table. When you are finished, save and close the file.
Parameter | Value
---|---
mapreduce.app-submission.cross-platform | Add this parameter to the mapred-site.xml file between the <property> tags, then set it to true, like this:

<name>mapreduce.app-submission.cross-platform</name>
<value>true</value>

When set to true, applications can be submitted cross-platform, which means an application can be submitted from a Windows client to a Linux server or vice versa.
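For clarity, the complete entry in mapred-site.xml would look like the following sketch, which simply wraps the name and value lines above in the standard <property> element:

<!-- Complete property entry for mapred-site.xml. -->
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>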
Configure for LZO Compression
LZO is a compression format that EMR supports. If you want to use LZO compression, you will need to download a jar file and place it in the shim's lib directory. If you do not, you will need to remove a codec reference from the core-site.xml file.
- If you are not going to use LZO compression: Remove any references to com.hadoop.compression.lzo.LzoCodec from the io.compression parameters in the core-site.xml file (see the sketch after this list).
- If you are going to use LZO compression: Download the LZO jar and add it to the pentaho-big-data-plugin/hadoop-configurations/emr3x/lib directory. The LZO jar can be found here: http://maven.twttr.com/com/hadoop/gplcompression/hadoop-lzo/0.4.19/
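To make the first option concrete, codec references in core-site.xml typically appear under the io.compression.codecs property. The sketch below is an assumption about what that entry might look like on your cluster; the other codecs shown are common Hadoop defaults, and only the com.hadoop.compression.lzo.LzoCodec entry needs to be removed if you are not using LZO:

<property>
  <name>io.compression.codecs</name>
  <!-- Illustrative codec list; delete the LzoCodec entry if you are
       not using LZO compression. -->
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec</value>
</property>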
High Availability for CDH 5.3
...