Amazon Hive Job Executor
PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.
This job entry executes Hive jobs on an Amazon Elastic MapReduce (EMR) account. To use this entry, you must have an Amazon Web Services (AWS) account configured for EMR and a Hive script, stored in Amazon S3, to execute remotely.
Option | Definition |
---|---|
Name | The name of this job entry as it appears in the workspace. |
Hive Job Flow Name | The name of the Hive job flow to execute. |
Existing JobFlow Id (optional) | The ID of an existing EMR job flow to which the Hive script step will be added, instead of creating a new job flow. |
AWS Access Key | Your Amazon Web Services access key. |
AWS Secret Key | Your Amazon Web Services secret key. |
Bootstrap Actions | References to scripts to invoke before the node begins processing data. See http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/Bootstrap.html for more information. |
S3 Log Directory (optional) | The URL of the Amazon S3 bucket where your job flow logs are stored. Artifacts required for execution (for example, the Hive script) are also staged here before execution. |
Hive Script | The Amazon S3 URL of the Hive script to execute. |
Command Line Arguments | A list of arguments (space-separated strings) to pass to Hive. |
Number of Instances | The number of Amazon EC2 instances used to execute the job flow. |
Master Instance Type | The Amazon EC2 instance type that will act as the Hadoop "master" in the cluster, which handles MapReduce task distribution. |
Slave Instance Type | The Amazon EC2 instance type used for the Hadoop "slave" nodes in the cluster. Slaves are assigned tasks by the master. This option is only valid if Number of Instances is greater than 1. |
Keep Job Flow Alive | Specifies whether the job flow should be kept alive, rather than terminated, after completing all steps. |
Enable Blocking | Specifies whether this job entry should block until the EMR Hive job completes. |
Logging Interval in Seconds | If blocking is enabled, a status message is written to the log every specified number of seconds. |
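To see how these options relate to what EMR receives, the sketch below assembles a request in the shape of the EMR RunJobFlow API (as exposed by, for example, boto3's `run_job_flow`). This is an illustrative sketch, not the job entry's actual implementation: the function name, the use of `command-runner.jar`, and the `hive-script` step arguments are assumptions, and all values are placeholders. The sketch only builds the request dictionary and does not contact AWS.

```python
# Hypothetical sketch of mapping the dialog options onto an EMR
# RunJobFlow request. The step-runner JAR and "hive-script" arguments
# are assumptions; the exact invocation varies by EMR release.

def build_hive_job_flow_request(job_flow_name, s3_log_dir, hive_script_url,
                                script_args, num_instances,
                                master_instance_type, slave_instance_type,
                                keep_alive):
    """Assemble a RunJobFlow-style request that runs one Hive script step."""
    # Hive is invoked as a step; extra Command Line Arguments are
    # appended after the script reference.
    step_args = ["hive-script", "--run-hive-script", "--args",
                 "-f", hive_script_url] + list(script_args)
    return {
        "Name": job_flow_name,                      # Hive Job Flow Name
        "LogUri": s3_log_dir,                       # S3 Log Directory
        "Instances": {
            "MasterInstanceType": master_instance_type,
            "SlaveInstanceType": slave_instance_type,
            "InstanceCount": num_instances,         # Number of Instances
            "KeepJobFlowAliveWhenNoSteps": keep_alive,  # Keep Job Flow Alive
        },
        "Steps": [{
            "Name": "Run Hive script",
            "ActionOnFailure": "TERMINATE_JOB_FLOW",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # assumption: EMR built-in runner
                "Args": step_args,
            },
        }],
    }

# Placeholder usage; bucket, script path, and arguments are illustrative.
request = build_hive_job_flow_request(
    job_flow_name="My Hive flow",
    s3_log_dir="s3://my-bucket/logs/",
    hive_script_url="s3://my-bucket/scripts/query.q",
    script_args=["-d", "DATE=2012-01-01"],
    num_instances=2,
    master_instance_type="m1.small",
    slave_instance_type="m1.small",
    keep_alive=False,
)
```

When the Existing JobFlow Id option is set, only the `Steps` portion would be submitted (via an add-steps call) against that job flow instead of creating a new one.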