.09 Transformation Steps
Transformation Steps
A step is one part of a transformation. Steps can provide you with a wide range of functionality ranging from reading text-files to implementing slowly changing dimensions. This chapter describes various step settings followed by a detailed description of available step types.
The following topics are covered in this section:
- #Launching Several Copies of a Step
- #Distribute or Copy?
- #Step Error Handling
- #Apache Virtual File System (VFS) Support
- #Transformation Step Types
Note: Several parts of this section are still a work in progress.
Launching Several Copies of a Step
For performance reasons, (such as reducing latency), launch a database lookup step three times or more. Launching the same step several times keeps the database busy on different connections, effectively lowering the latency. To launch several copies of a step in a transformation, right-click a step in the graphical view and choose Change number of copies to start...:
The Nr copies of step... dialog box appears. If you enter three (3) this will be shown: (more detail needed here)
It is the technical equivalent of this:(more detail needed here)
Distribute or Copy?
In the example above, (do you mean, the previous example?) there are green lines between each step. This indicates that rows are distributed among the target steps. In this instance, the first row coming from step A goes to step Database lookup 1, the second to Database lookup 2, the third to Database lookup 3, the fourth back to Database lookup 1, and so on. If you right-click step A, however, and select Copy data, you hops are drawn in red:
Copy data indicates that all rows from step A are copied to all three target steps. Step B gets three copies of all the rows that A has sent out.
Note: Because steps are run as different threads, the order in which the single rows arrive at step B is probably not going to be the same as the order they left step A.
Step Error Handling
Step error handling allows you to configure a step so that instead of halting a transformation when an error occurs, the rows that caused an error are passed to a different step. Enable error handling in the Step error handling settings page.
In the example below, an error is generated in the Script Values step when an ID is higher than five (5).
To configure the error handling, right click the step and select Define Error handing:
Note: This menu item only appears when clicking on steps that support the error handling.
At minimum you need to set a target step for the data stream in error and check "Enable the error handling". When you do a preview on this target step, you can see extra fields being added to the "error rows":
This way, you can easily define new data flows in transformations. The typical use-case for this is an alternative way of doing an Upsert (Insert/Update):
This transformation performs an insert regardless of the content of the table. If you put a primary key on the ID (in this case the customer ID) the insert into the table causes an error. Because of the error handling you can pass the rows in error to the update step. Preliminary tests have shown this strategy of performing upserts to be three times faster in some situations (with a low updates to inserts ratio).
A list of error codes can be found here: Step error handling codes
Apache Virtual File System (VFS) Support
Kettle provides support for the Apache Virtual File System (VFS) as an additional way to reference source files, transformations and jobs from any location you like. For more information about VFS, visit http://commons.apache.org/vfs/.
Example: Referencing remote job files
Here is a simple example of using VFS to reference the location of a job file to execute using Kitchen:
sh kitchen.sh -[file:http://www.kettle.be/GenerateRows.kjb]
Open the job using VFS in Spoon; select File|Open file from URL. Enter the URL 'http://www.kettle.be/GenerateRows.kjb' and click OK to load the job in Spoon.
The transformation about to be launched is also located on the Web server. The internal variable for the job name directory is:
Internal.Job.Filename.Directory [http://www.kettle.be/]
This allows the transformation to be referenced as follows:
Note: Although this example is supported, you will not be able to save the job back to the Web server because of permissions.
For more information and examples of working with VFS, visit: http://jakarta.apache.org/commons/vfs/filesystems.html. Examples include direct loading from zip-files, gz-files, jar-files, ram drives, SMB, (s)ftp, (s)http, etc.
Pentaho will extend this list in the future with drivers for the Pentaho solutions repository and for the Kettle repository (something like: psr:// and pdi:// URIs).
Example: Referencing files inside a Zip
The example below illustrates the ability to use a wildcard to select files directly inside of a zip file.
Apache VFS support was implemented in all steps and job entries that are part of the Pentaho Data Integration suite as well as in the recent Pentaho platform code and in Pentaho Analyses (Mondrian).
Transformation Step Types
For a full indexed list of step types, see this page: Pentaho Data Integration Steps