Now that the Transformation has been created and executed, the next task is enhancing it.
These are the improvements that you'll make to your existing Transformation:
Here's what happens:
This will be accomplished via a Job, a component made up of Job Entries linked by Hops. These Entries and Hops are arranged according to the expected order of execution; for that reason, a Job is said to be flow-control oriented.
A Job Entry is a unit of execution inside a Job. Each Job Entry is designed to accomplish a specific function, ranging from verifying the existence of a table to sending an email.
From a Job it is possible to execute a Transformation or another Job; that is, Transformations and Jobs can themselves be Job Entries.
A Hop is a graphical representation that identifies the sequence of execution between two Job Entries.
Even though a Hop has only one origin and one destination, a particular Job Entry can be reached by more than one Hop, and more than one Hop can leave any particular Job Entry.
This is the process:
Graphically it's represented like this:
In this part of the tutorial, the input and output files will be in a new folder called Files - go ahead and create it now. Copy the list.csv file to this new directory.
In order to avoid writing the full path each time you need to reference the folder or the files, it makes sense to create a variable containing this information. To do this, edit the kettle.properties configuration file, located in the C:\Documents and Settings\<username>\.kettle folder on Windows, or the ~/.kettle directory on other platforms. Put this line at the end of the file, changing the path to the one specific to the Files directory you just created:
FILES=/home/PentahoUser/Files
Spoon reads this file when it starts, so for this change to take effect, you must restart Spoon.
Now you are ready to start. This process involves three stages:
This Step captures information from sources outside the Transformation, like the system date or parameters entered in the command line. In this case, you will use the Step to get the first and only parameter. The configuration window of this Step has a grid. In this grid, each row you fill will become a new column containing system data.
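As a sketch of how that grid row might look (the field name my_file is a hypothetical choice; the type shown assumes the Step offers an option for the first command-line argument):

Name: my_file    Type: command line argument 1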
This Step divides the output in two, based upon a condition. Those rows for which the condition evaluates to true follow one path in the diagram, the others follow another.
Now a NULL parameter will reach the Abort Step, and a NOT NULL parameter will reach the Set Variable Step.
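A minimal sketch of that condition, assuming the field coming from the previous Step is named my_file (a hypothetical name) and that the Step provides an IS NOT NULL comparison:

my_file IS NOT NULL

Rows that meet the condition follow the true path to the Set Variable Step; the rest follow the false path to the Abort Step.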
You don't have anything to configure in this Step. If a row of data reaches this Step, the Transformation aborts and reports a failure, a result that you will use in the main Job.
This Step allows you to create variables and put the content of some of the input fields into them. The configuration window of the Step has a grid. Each row in this grid is meant to hold a new variable.
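As a sketch of one such row, assuming the incoming field is named my_file (a hypothetical name) and that the variable should be visible to the rest of the Job:

Field name: my_file    Variable name: MY_FILE    Scope: Valid in the root job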
Now you'll create a new variable to use later:
Set Variables.0 - Set variable MY_FILE to value [list]
Abort.0 - Row nr 1 causing abort : []
Abort.0 - Aborting after having seen 1 rows.
Near the top of the window you will see a Step Abort message, which indicates that an error occurred and that the Transformation failed, as expected.
Now it's time to modify the Hello Transformation in order to match the names of the files to their corresponding parameters. If the parameter were foo, the Transformation would read the file foo.csv and create the file foo_with_greetings.xml. It would also be helpful to add a filter to discard the empty rows in the input file.
${FILES}/${MY_FILE}.csv
${FILES}/${MY_FILE}_with_greetings.xml
To test the changes you made, you need to make sure that the variable MY_FILE exists and has a value. Because this Transformation is independent of the one that creates the variable, you'll have to create the variable manually in order to execute it.
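One way to do that (a sketch, assuming you run the Transformation from Spoon's execution dialog, which includes a grid for setting variables) is to type the variable name and a test value there:

MY_FILE = list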
The last task in this part of the tutorial is the construction of the main Job:
To the left of the workspace there is a palette of Job Entries. Unlike the palette of Transformation Steps, this palette doesn't group the Entries into categories.
Now build the Job:
${Internal.Job.Filename.Directory}/get_file_name.ktr
${Internal.Job.Filename.Directory}/Hello_with_parameter.ktr
${FILES}/${MY_FILE}.csv
Note: Remember that the variable ${FILES} was defined in the kettle.properties file and the variable ${MY_FILE} was created in the Job Entry that is going to be executed before this one.
The file ${FILES}/${MY_FILE}.csv does not exist
Note: At runtime, the tool will replace the variable names with their values, showing, for example: "The file c:/Pentaho/Files/list.csv does not exist".
A Job Entry can be executed unconditionally (it is always executed), only when the previous Job Entry was successful, or only when the previous Job Entry failed. These conditions are represented by different colors in the Hops: a black Hop indicates that the following Job Entry is always executed; a green Hop indicates that the following Job Entry is executed only if the previous Job Entry was successful; and a red Hop indicates that the following Job Entry is executed only if the previous Job Entry failed.
As a consequence of the order in which the Job Entries of your Job were created and linked, all of the Hops took the right colors, that is, the Entries will execute as you need:
If you wanted to change the condition for the execution of a Job Entry, the steps to follow would be:
When you execute a Job, the execution is governed by the order of the Job Entries, the direction of the Hops, and the condition under which each entry is or is not executed. Execution is sequential: a Job Entry cannot begin until all the Job Entries that precede it have finished.
In real-world situations, a Job can solve problems related to the sequencing of tasks within Transformations. If you need one part of a Transformation to finish before another part begins, a solution could be to split the Transformation into two independent Transformations and execute them from a Job, one after the other.
To execute the Job, you first must supply a parameter. Because the only place where the parameter is used is the get_file_name Transformation (after that you only use the variable where the parameter is saved), write the parameter as follows:
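As a sketch of what this looks like, assuming you launch the Job from Spoon's execution dialog and pass the value through its arguments grid (list is the example parameter used throughout this tutorial):

Argument 1: list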
The tabbed windows corresponding to Jobs are divided into an upper and a lower half. The upper half shows the Job Entries of your Job. For each executed Job Entry, you'll see, among other data, the result of the execution. The execution of the entries follows a sequence; as a result, if an entry fails, you won't see the entries that follow, because they never start. In the lower half of the window, you can see the log detail, including the starting and ending time of the Job Entries. In particular, when an Entry is a Transformation, the log corresponding to that Transformation is also included.
You'll know the new file has been created when you see this at the end of the log text:
Spoon - Job has ended.
If the input file was list.csv, then the output file should be list_with_greetings.xml and should be in the same folder. Find it and check its content.
Now replace the parameter with a nonexistent file name and execute the Job again. You'll see that the Job aborts, and the log shows the following message (where <parameter> is the parameter you supplied):
Abort - The file <parameter> does not exist
Now try deleting the parameter and executing the Job one more time. In this case the Job aborts as well, and in the log you can see this message, as expected:
Abort - The file name is missing
Kitchen is the tool used to execute Jobs from a terminal window. The script is kitchen.bat on Windows, and kitchen.sh on other platforms, and you'll find it in the installation folder. If you execute it, you'll see a description of the command with a list of the available options.
To execute the Job, try the simplest command:
kitchen /file <Jobs_path>/Hello.kjb <par> /norep
c:/Pentaho/Tutorial (Windows)
/home/PentahoUser/Tutorial (other platforms)
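For example, assuming the Job was saved as Hello.kjb in one of the folders above and list is the parameter, the full command would look something like this (a sketch; adjust the paths to your own installation):

kitchen.bat /file c:/Pentaho/Tutorial/Hello.kjb list /norep
./kitchen.sh /file /home/PentahoUser/Tutorial/Hello.kjb list /norep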
After you enter this command, the Job will be executed just as it was inside Spoon. In this case, the log will be written to the terminal unless you redirect it to a file. The format of the log text will vary a little, but the information will be basically the same as in the graphical environment.
Try executing the Job without parameters, with an invalid parameter (a nonexistent file), and with a valid parameter, and verify that everything works as expected. Also experiment with Kitchen by changing some of the options, such as the log level.
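As a sketch of one such experiment, assuming your version of Kitchen accepts the /level option and Detailed as one of its logging levels:

./kitchen.sh /file /home/PentahoUser/Tutorial/Hello.kjb list /level:Detailed /norep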