Start a PDI Cluster on YARN
Description
Start a PDI Cluster on YARN is used to start a cluster of Carte servers on Hadoop nodes, assign them ports, and pass cluster access credentials. When this step is run and a cluster is created, the metadata for that cluster is stored in the shared.xml file or, if you are using the enterprise repository, in the DI Repository. For more information on Carte clusters, see Use Carte Clusters in the Pentaho Help documentation.
In earlier versions of Spoon, this step was labeled Start a YARN Kettle Cluster.
Context
Use this step to start a cluster of Carte servers. The Carte servers in the cluster continue to run until a Stop a PDI Cluster on YARN step is executed or you stop the cluster manually.
Prerequisites
If you assign the cluster a name that has not been used before, you need to create a cluster schema in Spoon. You only need to specify the cluster name when you create the schema; see the Create a Cluster Schema in Spoon topic in the Pentaho Help documentation for more information. A YARN Hadoop configuration should already be set up; for instructions, see Additional Configuration for YARN Shims.
Options
You can configure the cluster through the Start Kettle Cluster on YARN dialog, which appears when you double-click the entry icon in your job. The dialog contains a Step Name field and two tabs. The Step Name field is the entry name, which you can customize or leave as the default. The two tabs, Cluster and Files, contain the configuration options described below.
Cluster
The options in the Cluster tab specify the cluster configuration details:

Option | Description
---|---
Name Cluster Schema | Name of the cluster schema.
Carte User Name | User name needed to access the Carte cluster.
Carte Password | Password needed to access the Carte cluster.
Number of Nodes | Number of nodes in the cluster.
Virtual Cores Per Node | Number of virtual cores per node.
Carte Port Range Start | Port number assigned to the master node. Slave nodes are assigned port numbers sequentially, using this number as the starting point. For example, if the start port is 40000 and there are 4 nodes in the cluster, the master node is assigned port 40000, slave node #1 is assigned port 40001, slave node #2 is assigned port 40002, and slave node #3 is assigned port 40003.
Cluster Data Range Start | Data port number assigned to the master node. Slave nodes are assigned data port numbers sequentially, using this number as the starting point (see the sketch after this table).
Application Master Memory | Amount of memory assigned to the YARN application master.
Nodes Memory | Amount of memory, in megabytes, assigned to each node.
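The port assignment described for Carte Port Range Start and Cluster Data Range Start is simple sequential numbering. The following minimal sketch (not Pentaho code; the starting values are arbitrary examples) illustrates the arithmetic:

```python
# Sequential port assignment for a 4-node cluster (1 master + 3 slaves).
# The starting values below are arbitrary examples, not defaults.
carte_port_start = 40000   # Carte Port Range Start
data_port_start = 41000    # Cluster Data Range Start
number_of_nodes = 4        # Number of Nodes

for node in range(number_of_nodes):
    role = "master" if node == 0 else f"slave #{node}"
    print(f"{role}: Carte port {carte_port_start + node}, "
          f"data port {data_port_start + node}")

# Output:
# master: Carte port 40000, data port 41000
# slave #1: Carte port 40001, data port 41001
# slave #2: Carte port 40002, data port 41002
# slave #3: Carte port 40003, data port 41003
```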
Files
The options in the Files tab specify the file configuration details:

Option | Description
---|---
File System Path | URL for the default HDFS file system. Make sure that the Default FS setting matches the configured hostname for the HDFS NameNode. If it does not, an error message is displayed and the Kettle cluster does not start.
PDI Client Archive | Path to the PDI Client (Spoon) installation on the DI Server. When the Start a PDI Cluster on YARN entry is in a job, the entry can be executed in one of three places: locally on the same host on which you build the job and transformation, on a Carte server, or on a remote DI Server. If you plan to execute this entry only locally or on a Carte server, leave this field blank. If you want to run the entry remotely on a DI Server, enter the path to the PDI Client (Spoon) installed on the DI Server host; otherwise, the Kettle cluster does not start properly. If you enter a value in this field, when the job containing this entry runs on the DI Server, it locates the directory (or zip file) and places a copy of it on HDFS.
Copy local resource files to YARN | Copies the contents of the current user's kettle.properties, repositories.xml, and shared.xml files to the YARN Workspace folder. When a job containing this entry runs, the contents of the YARN Workspace folder are copied to the cluster, even when these checkboxes are deselected. If you do not want the contents of the YARN Workspace folder to be copied, remove them manually. For more information, see the Using the YARN Workspace Folder article in the Pentaho Help documentation.
When a Kettle Cluster is started on YARN, the configuration files (kettle.properties, shared.xml, repositories.xml) and any additional resource files it might need are deployed from the workspace folder in the shim plugin (pentaho-big-data-plugin/plugin/pentaho-kettle-yarn-plugin). Files can be placed there manually, but the three primary configuration files can be copied to the workspace at runtime if their corresponding checkboxes in the Copy local resource files to YARN section of the Files tab are selected.
If you run the job from a user’s PDI installation, the config files from that user’s KETTLE_HOME directory are used. If the job is scheduled or otherwise runs on a Pentaho DI Server, the config files from that server’s configured KETTLE_HOME are copied when the job starts.
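As a rough illustration of this behavior, the sketch below mimics what selecting all three checkboxes effectively does: it copies the three primary configuration files from the active KETTLE_HOME into the YARN workspace folder before deployment. This is not the plugin's actual code; the workspace path is an assumption for this sketch, and the .kettle subdirectory is the standard PDI location for these files.

```python
import os
import shutil

# Illustrative sketch only -- not the plugin's implementation.
# Copies the three primary configuration files from the active KETTLE_HOME
# into the YARN workspace folder, which is roughly the effect of selecting
# all three "Copy local resource files to YARN" checkboxes.
# The workspace path below is an assumption; check your
# pentaho-big-data-plugin installation for the real location.
kettle_home = os.environ.get("KETTLE_HOME", os.path.expanduser("~"))
workspace = "pentaho-big-data-plugin/plugin/pentaho-kettle-yarn-plugin/workspace"

os.makedirs(workspace, exist_ok=True)
for name in ("kettle.properties", "repositories.xml", "shared.xml"):
    src = os.path.join(kettle_home, ".kettle", name)
    if os.path.isfile(src):
        shutil.copy(src, os.path.join(workspace, name))
        print(f"copied {src} -> {workspace}")
    else:
        print(f"{src} not found; skipping")
```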
If you want to use configuration files other than those in your KETTLE_HOME directory or the server's, copy those files manually into the YARN Workspace folder and make sure the corresponding checkboxes in the Copy local resource files to YARN section of the Files tab are not selected.
If your KETTLE_HOME directory contains configuration files appropriate for development, testing, or staging, and the Pentaho DI Server has production configuration files in its KETTLE_HOME, select the corresponding checkboxes to ensure that the Kettle cluster deployed by YARN uses the configuration files appropriate for the environment from which the job is run.