Spark Submit
Description
Apache Spark is an open-source cluster computing framework that is an alternative to the Hadoop MapReduce paradigm. The Spark Submit entry allows you to submit Spark jobs to CDH 5.3 and later, HDP 2.3 and later, MapR 5.1 and later, and EMR 3.10 and later clusters.
Install and Configure Spark Client for PDI Use
Before you use this entry, you will need to install and configure a Spark client on any node from which you will run Spark jobs.
Installation Prerequisites
- Install and configure a supported version of CDH that supports Spark. See our Support Matrix for more details on the supported version. You do not need to set CDH as the active Hadoop Configuration.
- Before you install Spark, we strongly recommend that you review the Spark documentation, release notes, and known issues first.
- To learn how to submit jobs for Spark, see the instructions at https://spark.apache.org/docs/1.2.0/submitting-applications.html
Configuring the Spark Client
You will need to configure the Spark client to work with the cluster on every machine from which Spark jobs will be run. Complete the following steps (a consolidated shell example appears after the list):
- Set the HADOOP_CONF_DIR environment variable to the following location: pentaho-big-data-plugin/hadoop-configurations/<shim directory>
- Navigate to <SPARK_HOME>/conf and create the spark-defaults.conf file using the instructions at https://spark.apache.org/docs/latest/configuration.html
- In the spark-defaults.conf file, add the following line of code. (If necessary, adjust the HDFS name and location to match the path to the spark-assembly.jar file in your environment.) Here are a couple of examples:
- CDH example: spark.yarn.jar hdfs://cdh53unsecure/user/spark/share/lib/spark-assembly.jar
- HDP example: spark.yarn.jar hdfs://svqxbdcn6hdp23n1.pentahoqa.com:8020/user/spark/hadoop27/spark-assembly.jar
- If you are connecting to an HDP cluster, add the following lines to the spark-defaults.conf file:
- spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557
- spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557
Note: The HDP version must match the version used on the cluster.
- Create a text file named java-opts in the <SPARK_HOME>/conf folder and add your HDP version to that file. For example: -Dhdp.version=2.3.0.0-2557
To determine your HDP version, run the command hdp-select status hadoop-client.
- If you are connecting to a supported version of an HDP cluster, a CDH 5.5 cluster, or a CDH 5.7 cluster, open the core-site.xml file, then comment out the net.topology.script.file.name property like this:
<!--
<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf/topology_script.py</value>
</property>
-->
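As a quick reference, here is a minimal shell sketch of the client configuration steps above, assuming an HDP 2.3 shim and the example paths shown earlier; the Pentaho installation directory, HDFS path, and all version numbers are placeholders to adjust for your environment, and SPARK_HOME is assumed to be set.
# Point the Spark client at the shim's cluster configuration files
# (installation path and shim directory are examples).
export HADOOP_CONF_DIR=/opt/pentaho/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/hdp23
# Register the spark-assembly.jar location (example HDFS path from above).
echo "spark.yarn.jar hdfs://svqxbdcn6hdp23n1.pentahoqa.com:8020/user/spark/hadoop27/spark-assembly.jar" >> $SPARK_HOME/conf/spark-defaults.conf
# HDP clusters only: pass the cluster's HDP version to the driver and
# application master, and create the java-opts file with the same version.
echo "spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2557" >> $SPARK_HOME/conf/spark-defaults.conf
echo "spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2557" >> $SPARK_HOME/conf/spark-defaults.conf
echo "-Dhdp.version=2.3.0.0-2557" > $SPARK_HOME/conf/java-opts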
Troubleshooting
If you are connecting to a CDH 5.7 cluster using Apache Spark 1.6.0 on your client node, an error may occur when you try to run a job containing a Spark Submit entry in yarn-client mode. The error is similar to the following message:
- Caused by: java.io.InvalidClassException: org.apache.spark.rdd.MapPartitionsRDD; local class incompatible: stream classdesc serialVersionUID = -1059539896677275380, local class serialVersionUID = 6732270565076291202
Perform one of the following tasks to resolve this error:
- Install and configure CDH 5.7 Spark on the client machine where Pentaho is running instead of Apache Spark 1.6.0. See Cloudera documentation for Spark installation instructions.
- If you want to use Apache Spark 1.6.0 on a client machine, upload the spark-assembly.jar file from the client machine to your cluster in HDFS, and point the spark.yarn.jar property in the spark-defaults.conf file to this uploaded file, as sketched below.
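For the second option, the upload might look like the following sketch; the local assembly path, HDFS target directory, and HDFS name are assumptions to adjust for your cluster.
# Upload the Apache Spark 1.6.0 assembly jar to HDFS
# (local path and HDFS directory are examples).
hdfs dfs -mkdir -p /user/spark/share/lib
hdfs dfs -put /opt/spark-1.6.0/lib/spark-assembly-1.6.0-hadoop2.6.0.jar /user/spark/share/lib/
# Then point spark.yarn.jar at the uploaded jar (HDFS name is an example).
echo "spark.yarn.jar hdfs://cdh57unsecure/user/spark/share/lib/spark-assembly-1.6.0-hadoop2.6.0.jar" >> $SPARK_HOME/conf/spark-defaults.conf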
Install and Configure MapR Spark Client (Unsecured Clusters)
This section explains how to set up the Spark client to connect to unsecured MapR clusters.
- Set up your MapR packages, repositories, and MapR client for your version of MapR using the instructions at the following link: http://maprdocs.mapr.com/51/AdvancedInstallation/InstallingMapRSoftware.html
- Copy the hive-site.xml file from the /opt/mapr/spark/spark-1.6.1/conf folder on the MapR cluster to the client machine's MapR configuration folder.
- Install the MapR Spark client using the following command:
sudo apt-get install mapr-spark
- Navigate to the <SPARK_HOME>/conf folder and create the spark-defaults.conf file using the instructions at the following link: https://spark.apache.org/docs/latest/configuration.html
- Edit the spark-defaults.conf file to add the following line, using your HDFS name and spark-assembly.jar file path:
spark.yarn.jar maprfs:///user/spark/lib/spark-assembly-1.6.1-mapr-1609-hadoop2.7.0-mapr-1607.jar
Note: If necessary, adjust the HDFS name and location to match the path to the spark-assembly.jar file in your environment. A consolidated example of these steps appears below.
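Here is a minimal shell sketch of the unsecured MapR client setup; the cluster hostname and destination folder are assumptions for your environment, and SPARK_HOME is assumed to be set.
# Copy the Hive configuration from a MapR cluster node
# (hostname and destination folder are examples).
scp mapr@cluster-node1:/opt/mapr/spark/spark-1.6.1/conf/hive-site.xml /opt/mapr/spark/spark-1.6.1/conf/
# Install the MapR Spark client and register the assembly jar.
sudo apt-get install mapr-spark
echo "spark.yarn.jar maprfs:///user/spark/lib/spark-assembly-1.6.1-mapr-1609-hadoop2.7.0-mapr-1607.jar" >> $SPARK_HOME/conf/spark-defaults.conf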
Install and Configure MapR Spark Client (Secured Clusters)
This section explains how to set up the Spark client to connect to secured MapR clusters.
- Set up your MapR packages, repositories, and MapR client for your version of MapR using the instructions at the following link: http://maprdocs.mapr.com/51/AdvancedInstallation/InstallingMapRSoftware.html
- Copy the hive-site.xml file from the /opt/mapr/spark/spark-1.6.1/conf folder on the MapR cluster to the client machine's MapR configuration folder.
- Install the MapR Spark client using the following command:
sudo apt-get install mapr-spark
- Navigate to the <SPARK_HOME>/conf folder and create the spark-defaults.conf file using the instructions at the following link: https://spark.apache.org/docs/latest/configuration.html
- Edit the spark-defaults.conf file to add the following line, using your HDFS name and spark-assembly.jar file path:
spark.yarn.jar maprfs:///user/spark/lib/spark-assembly-1.6.1-mapr-1609-hadoop2.7.0-mapr-1607.jar
Note: If necessary, adjust the HDFS name and location to match the path to the spark-assembly.jar in your environment.
Note: When Spark runs on YARN, MapR client nodes require the hadoop-yarn-server-web-proxy.jar file to run Spark applications. The mapr-client package does not include this jar file, so you must copy the /opt/mapr/hadoop/hadoop-2.x.x/share/hadoop/yarn/hadoop-yarn-server-web-proxy-<version>.jar file from a MapR cluster node to the same location on the MapR client node where you want to run the Spark application, as in the example below.
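A minimal sketch of that copy, assuming Hadoop 2.7.0 and an example hostname:
# Copy the YARN web proxy jar from a cluster node to the same location
# on the client (hostname and Hadoop version are examples).
scp mapr@cluster-node1:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.7.0.jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/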
Spark Submit Entry Properties
Note that only the yarn-cluster and yarn-client modes are supported. See the Spark documentation for descriptions of these modes.
Note: If you have configured your Hadoop cluster and Spark for Kerberos, a valid Kerberos ticket must already be in the ticket cache on your client machine before you launch and submit the Spark Submit job, as shown in the example below.
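For example, using the standard Kerberos tools (the principal shown is a placeholder):
# Obtain a ticket for the submitting user, then confirm it is
# present in the ticket cache before launching the job.
kinit pdi-user@EXAMPLE.COM
klist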
Job Setup
Option | Description |
---|---|
Entry Name | Name of the entry. You can customize this, or leave it as the default. |
Spark-Submit Utility | Script that launches the Spark job. |
Spark Master URL | The master URL for the cluster. Two options are supported: yarn-cluster and yarn-client. |
Jar | Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes. |
Class Name | The entry point for your application. |
Arguments | Arguments passed to the main method of your main class, if any. |
Executor Memory | Amount of memory to use per executor process. Use the JVM format (e.g., 512m, 2g). |
Driver Memory | Amount of memory to use for the driver process. Use the JVM format (e.g., 512m, 2g). |
Block Execution | This option is enabled by default. If selected, the job entry waits until the Spark job finishes running. If not selected, the job entry proceeds with its execution once the Spark job is submitted. |
Help | Displays documentation on this entry. |
OK | Saves the information and closes the window. |
Cancel | Closes the window without saving changes. |
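For orientation, the entry drives the spark-submit utility, so the properties above map roughly to a command like the following sketch; the class name, jar path, memory sizes, and arguments are placeholders.
# Rough spark-submit equivalent of the Job Setup properties
# (class, jar, and argument values are examples).
spark-submit \
  --master yarn-cluster \
  --class org.example.MySparkApp \
  --executor-memory 512m \
  --driver-memory 512m \
  hdfs://namenode:8020/user/spark/my-spark-app.jar arg1 arg2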
Parameters
For more information on Spark parameters, including memory parameters, see the Spark configuration documentation: https://spark.apache.org/docs/latest/configuration.html
Option | Description |
---|---|
Entry Name | Name of the entry. You can customize this, or leave it as the default. |
# | Number of the parameter. |
Name | Name of the parameter. |
Value | Value of the parameter. |
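Assuming these name/value pairs are passed through as Spark configuration properties, they correspond roughly to --conf options on the spark-submit command line, as in this sketch; the property names, values, class, and jar path are all placeholders.
# Parameters expressed as --conf options (all values are examples).
spark-submit --master yarn-client \
  --class org.example.MySparkApp \
  --conf spark.executor.cores=2 \
  --conf spark.yarn.queue=default \
  hdfs://namenode:8020/user/spark/my-spark-app.jar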