Amazon EMR Job Executor
PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.
This job entry executes Hadoop jobs on an Amazon Elastic MapReduce (EMR) account. To use this job entry, you must have an Amazon Web Services (AWS) account configured for EMR and a prebuilt Java JAR to control the remote job.
Option | Definition |
---|---|
Name | The name of this Amazon EMR Job Executor job entry. |
EMR Job Flow Name | The name of the Amazon EMR job flow (series of steps) you are executing. |
Existing Job Flow ID | The ID of an existing job flow, if you are using one. This field is optional. |
AWS Access Key | Your Amazon Web Services access key. |
AWS Secret Key | Your Amazon Web Services secret key. |
S3 Staging Directory | The Amazon Simple Storage Service (S3) address of the working directory for this Hadoop job. This directory will contain the MapReduce JAR, and log files will be placed here as they are created. |
MapReduce JAR | The Java JAR that contains your Hadoop mapper and reducer classes. The job must be configured and submitted using a static main method in any class in the JAR. |
Command line arguments | Any command line arguments that must be passed to the static main method in the specified JAR. |
Number of Instances | The number of Amazon Elastic Compute Cloud (EC2) instances you want to assign to this job. |
Master Instance Type | The Amazon EC2 instance type that will act as the Hadoop "master" in the cluster, which handles MapReduce task distribution. |
Slave Instance Type | The Amazon EC2 instance type that will act as one or more Hadoop "slaves" in the cluster. Slaves are assigned tasks from the master. This option applies only if the number of instances is greater than 1. |
Enable Blocking | Forces the job to wait until each step completes before continuing to the next step. This is the only way for PDI to be aware of a Hadoop job's status. If this option is left unchecked, the Hadoop job is submitted without any monitoring, and PDI moves on to the next job entry. Error handling and routing will not work unless this option is checked. |
Logging Interval | Number of seconds between log messages. |
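
The interaction between Enable Blocking and Logging Interval can be pictured as a polling loop: check the job status, write a log message, wait for the interval, and repeat until the job finishes. The following is a minimal sketch of that idea, not PDI's actual implementation. The class name `BlockingPollSketch`, the `State` enum, and the `waitForCompletion` method are hypothetical names introduced for illustration, and the status source is a stand-in for whatever remote call reports the job state. The sketch takes the interval in milliseconds so the demo runs quickly, whereas the Logging Interval option is expressed in seconds.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

public class BlockingPollSketch {

    /** Hypothetical job states; illustrative only, not the names EMR or PDI use. */
    enum State { RUNNING, COMPLETED, FAILED }

    /**
     * Polls statusSource until the job leaves RUNNING, printing one log
     * line per check and sleeping intervalMillis between checks.
     * Returns the final state so the caller can route success or failure.
     */
    static State waitForCompletion(Supplier<State> statusSource, long intervalMillis) {
        while (true) {
            State state = statusSource.get();
            System.out.println("Job state: " + state);
            if (state != State.RUNNING) {
                return state;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for job", e);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a remote status call: two RUNNING reports, then COMPLETED.
        Iterator<State> canned =
                Arrays.asList(State.RUNNING, State.RUNNING, State.COMPLETED).iterator();
        State result = waitForCompletion(canned::next, 10L);
        System.out.println("Final state: " + result);
    }
}
```

Without a loop of this kind, the caller never sees the final state, which is why error handling and routing depend on blocking being enabled.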