PDI Performance tuning checklist

Mar 31, 2008
Submitted by Matt Casters, Chief of Data Integration, Pentaho


JavaScript : turn off compatibility mode

Rewriting your JavaScript so that it no longer needs the (slower) compatibility mode is in general easy to do, and it makes the scripts easier to read and to work with.  See also the Pentaho Data Integration 3.0 migration guide on the subject, or more to the point: Migrating JavaScript from 2.5.x to 3.0.0

JavaScript: Combine steps

To help reduce the overhead: one big JavaScript step will run faster than three small ones chained one after the other.  

To regain the parallelism you lose by combining them, you can always start multiple copies of the one big step (see below).
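As a sketch of the idea, plain functions stand in here for the per-row scripts a JavaScript step would run, and the field names (firstname, lastname, amount) are made up for illustration:

```javascript
// Hypothetical per-row logic spread over three separate JavaScript steps.
// Each step pays the scripting-engine entry cost for every row.
function step1(row) { row.fullname = row.firstname + " " + row.lastname; return row; }
function step2(row) { row.amount_eur = row.amount * 0.64; return row; }
function step3(row) { row.flag = row.amount_eur > 100 ? "Y" : "N"; return row; }

// The same work combined into one big step: one pass per row,
// one scripting-engine entry instead of three.
function combinedStep(row) {
  row.fullname = row.firstname + " " + row.lastname;
  row.amount_eur = row.amount * 0.64;
  row.flag = row.amount_eur > 100 ? "Y" : "N";
  return row;
}
```

Both versions produce the same rows; the combined step simply does the work with less overhead per row.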

JavaScript: avoid it or write a custom plugin

Remember that while JavaScript is one of the faster scripting languages available for Java, it is still a scripting language.
If you do the same amount of work in a native step or plugin, you will avoid the overhead of the JS scripting engine.  This has been known to result in massive performance gains.  It's also the main reason why we wrote the Calculator step: to avoid the use of JavaScript for simple calculations.

JavaScript: create a copy of a field

No JavaScript is required for this.  A "Select Values" step will do the trick.  You can specify the same field twice.  Once without a rename, once (or more) with a rename.

Another trick is to use B=NVL(A,A) in a Calculator step where B is forced to be a copy of A.   In version 3.1, we are adding an explicit "create copy of field A" function in the Calculator.
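To see why B=NVL(A,A) yields a copy of A: NVL returns its second argument only when the first is null, so with the same field in both slots the result is always A's value.  A tiny sketch of the semantics (the function name mirrors the Calculator operation; the JavaScript itself is just for illustration):

```javascript
// NVL: return the fallback only when the first argument is null/undefined.
function nvl(a, fallback) {
  return a == null ? fallback : a;
}

// With the same value in both slots, the result is always that value,
// which is why NVL(A, A) behaves as "create a copy of A".
```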

JavaScript: data conversion

Consider doing conversions between data types (dates, numeric data, etc) in a "Select Values" step (version 3.0.2 or higher).  You can do this in the Meta-data tab of the step.

JavaScript: variable creation

If you have variables that can be declared once at the start of the transformation, make sure you put them in a separate script and mark that script as a startup script (right click on the script name in the tab).

JavaScript object creation is very expensive, so if you can avoid creating a new object for every row you are transforming, this will translate into a performance boost for the step.
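A sketch of the difference, using Intl.NumberFormat as a stand-in for any expensive object (date formats, regular expressions, lookup tables, ...); PDI's scripting engine differs from a browser or Node.js, so treat this as an illustration of the pattern, not a literal step script:

```javascript
// Anti-pattern: a new object is constructed for EVERY row.
function perRowStep(rows) {
  return rows.map(function (r) {
    var fmt = new Intl.NumberFormat("en-US"); // created once per row
    return fmt.format(r);
  });
}

// Startup-script pattern: create the object once, reuse it for all rows.
// In PDI you would put the creation in a script marked as "Startup".
var sharedFmt = new Intl.NumberFormat("en-US"); // created once
function startupStep(rows) {
  return rows.map(function (r) { return sharedFmt.format(r); });
}
```

Both functions return identical output; only the number of object constructions differs.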

Launch several copies of a step

There are 2 main reasons why launching multiple copies of a step might result in better performance:

  • The step uses a lot of CPU and you have multiple processor cores in your computer.   Example: a JavaScript step
  • The step has to deal with network latencies and launching multiple copies can reduce the average latency.  If you have a network latency of, say, 5 ms and you need a round trip to the database for every row, the maximum throughput of a single copy is 200 rows per second (1000 ms / 5 ms), even if the database itself is running smoothly.  You can try to reduce the round trips with caching; if that is not possible, you can try running multiple copies.  Example: a database lookup or table output
  • Please consider the drawback of running multiple copies of a step in parallel: each extra copy consumes additional CPU threads.
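The latency arithmetic above can be sketched as a simple model; it is a rough ceiling that ignores database load and queuing effects:

```javascript
// Throughput ceiling for a latency-bound step: with one round trip per row,
// each copy can do at most 1000 / latencyMs rows per second, and N copies
// multiply that ceiling by roughly N (until something else becomes the
// bottleneck).
function maxRowsPerSecond(latencyMs, copies) {
  return copies * (1000 / latencyMs);
}
```

For the example in the text: maxRowsPerSecond(5, 1) gives 200 rows/s for a single copy, and five copies raise the ceiling to about 1000 rows/s.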

Manage thread priorities

In versions 3.0.2 and higher, this feature, found in the "Transformation Settings" dialog (Misc tab), improves performance by reducing the locking overhead in certain situations.   It is enabled by default for new transformations created in recent versions, but for older transformations this can be different.

Reference JIRA case: (fixed in 3.0.2 or 3.1.0)


Remove fields in Select Values

If you don't need to do this, don't do it.  It's an expensive step as the engine needs to reconstruct the complete row.
It's almost always faster to add fields to a row compared to deleting fields from a row.
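An illustration of the asymmetry, assuming rows stored as simple arrays of values (this is illustrative only, not Kettle's actual internals):

```javascript
// Adding a field: the existing values stay where they are, the new value
// is simply appended.
function addField(row, value) {
  row.push(value);
  return row;
}

// Removing a field: the row has to be rebuilt without the dropped slot,
// touching every remaining value.
function removeField(row, index) {
  return row.slice(0, index).concat(row.slice(index + 1));
}
```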

Get Variables

Apparently this step can also cause a slowdown in recent versions when you use it in a high-volume stream (accepting input).
To work around it, take the "Get Variables" step out of the stream (right click, detach).  Then join it back in with a "Join Rows (cart prod)" step.  Make sure to specify the main step to read from in the "Join Rows" step: set it to the step that originally provided the "Get Variables" step with data.

Text File Input

It might make a big difference if you use the new "CSV Input" or "Fixed Input" steps as they have been re-written for optimal performance.
If you have a fixed width (field/row) input file, you can even read data in parallel. (multiple copies)

These new steps have been rewritten using Java NIO (New I/O) features.  Typically, the larger the NIO buffer you specify in the step, the better your read performance will be. 

Lazy conversion

In cases where you read data from a text file and write it back to a text file, you can use Lazy conversion to speed up the process.  The principle behind it is that it delays data conversion in the hope that the conversion doesn't need to take place at all (reading from a file and writing it straight back is the obvious example).
Beyond helping with data conversion, lazy conversion also helps to keep the data in "binary" storage form.  This in turn will help the internal Kettle engine to do faster data serialization.  (sort, clustering, etc)

The Lazy Conversion option is available in the "CSV Input" and "Fixed input" text file reading steps.
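A minimal sketch of the principle (illustrative only, not Kettle's actual implementation): the raw string from the file is kept as-is, and a typed value is only produced, at most once, when a step actually asks for one.

```javascript
// Lazy conversion sketch: hold the raw value, parse on first typed access.
function lazyValue(raw) {
  var parsed;
  var parsedYet = false;
  return {
    // Writing the value straight back to a file: no conversion needed.
    raw: function () { return raw; },
    // A step needs the typed value: convert once, then reuse the result.
    asNumber: function () {
      if (!parsedYet) { parsed = parseFloat(raw); parsedYet = true; }
      return parsed;
    }
  };
}
```

If a transformation only ever calls raw() (read a file, write it back), the conversion never happens at all, which is exactly the case Lazy conversion is designed for.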

Join rows

You need to specify the main step to read from.  This will prevent the step from doing any spooling to disk when this is not needed.
If you are joining with a set of data that can fit into memory, make sure that the cache size (in rows of data) is large enough.  This will avoid (slow) spooling to disk.

Review the big picture: database, commit size, row set size and other factors

Always consider the whole environment that influences your performance.  There can be limiting factors within the transformation, but also in the outside world or in Pentaho Data Integration itself. You can experiment with the optimal commit size, for instance, but it depends on many factors and we can only give you some background and encourage you to test for yourself. It depends at least on your database, your tables, indexes, the JDBC driver, your hardware, the speed of the LAN connection to the database, the row size of the data, and the transformation itself.
You can try different commit sizes, change the number of rows in the row sets in your transformation settings, or even adjust the buffers in your JDBC driver or database.

In a lot of cases a smaller row set size actually improves performance, since it keeps rows flowing through all steps of a (parallel executing) transformation.

Step Performance Monitoring

As of version 3.1.0 of Pentaho Data Integration, we are introducing a way to track the performance of individual steps in a transformation. This is an important tool that allows you to fine-tune the performance of a transformation, because overall performance is determined by the slowest step in it. Further information can be found here.