User Defined Java Class
PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.
Description
This step allows you to enter a user-defined Java class to drive the functionality of a complete step. In essence, it lets you program your own plugin inside a step.
The goal of the "User Defined Java Class" step is not to allow full-scale Java development inside a step; a whole plugin system is available for that (see: The PDI SDK).
The goal is to allow users to define methods and logic with as little code as possible, executed as fast as possible. For this we use the Janino project libraries, which compile Java code in the form of classes at runtime.
Not 100% Java
The first thing to know is that Janino, and as a consequence this step, doesn't need the complete Java class, only the class body: the imports, constructors and methods you need. In other words, you do not write the class declaration itself. The developers of this step chose this approach over the definition of the full class because it hides a lot of technical details and methods from the user.
Kettle adds the following imports:
- org.pentaho.di.trans.steps.userdefinedjavaclass.*
- org.pentaho.di.trans.step.*
- org.pentaho.di.core.row.*
- org.pentaho.di.core.*
- org.pentaho.di.core.exception.*
If you need others, add them yourself at the very top of your code, for example:
import java.util.*;
Another thing to note is that Janino, essentially a Java byte-code generator, only supports a subset of the Java 1.5 specification. For a complete list of features and limitations, please go to the Janino homepage. At the time of writing, the most apparent limitation is the absence of generics.
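Without generics, collection code is written with raw types and explicit casts. The following is a minimal sketch of that style in plain Java; the names RawCollectionsSketch and joinNames are made up for illustration and are not part of Kettle or Janino.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper written without generics (raw types and casts only).
class RawCollectionsSketch {

    // Joins the String elements of a raw List with the given separator.
    static String joinNames(List names, String separator) {
        StringBuffer result = new StringBuffer();
        boolean firstElement = true;
        for (Iterator it = names.iterator(); it.hasNext();) {
            // Without generics every element comes back as Object and needs a cast.
            String name = (String) it.next();
            if (!firstElement) {
                result.append(separator);
            }
            result.append(name);
            firstElement = false;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List names = new ArrayList(); // raw type instead of List<String>
        names.add("Ada");
        names.add("Grace");
        System.out.println(joinNames(names, ", ")); // prints "Ada, Grace"
    }
}
```

A regular Java compiler reports unchecked warnings for this code; that is expected and harmless here.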
Again, if you need to do a lot of Java development we advise you to do this in a Java IDE like Eclipse, not inside this step. You can always expose your Java code to this step by packaging it in a jar file and placing that library on the classpath of Kettle (try the libext/ folder).
Input fields
Most of the time, working with input and output fields is the most important thing you'll be doing in your UDJC code. As such, there are a number of ways to handle the manipulation of fields. To start with let's look at the description of the input row:
RowMetaInterface inputRowMeta = getInputRowMeta();
The "inputRowMeta" object contains the metadata of the input row. This includes all the fields, their data types, lengths, names, format masks and much more. You can use it to look up input fields. For example, if you want to look for a field named "customer" you use the following code:
ValueMetaInterface customer = inputRowMeta.searchValueMeta("customer");
Because looking up field names is slow if you do it for every row that passes through a transformation, we advise you to look up field names in advance in a "first" block like this (in the processRow() method):
if (first) {
    yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
    if (yearIndex < 0) {
        throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
    }
}
To get your hands on the Integer value contained in field "year" you can then use the following construct:
Object[] r = getRow();
...
Long year = inputRowMeta.getInteger(r, yearIndex);
To make this process easier you can use a shortcut in this form:
Long year = get(Fields.In, "year").getInteger(r);
This method will also take into account the index based optimization mentioned above.
IMPORTANT: The Java data types that you get from previous steps always correspond to the Kettle data types as described on the PDI Rows Of Data page.
Output fields
You can define all the new fields you want in the output of the step in the "Fields" section of the step's dialog:
Doing this will automatically calculate the layout of the output row metadata and store it in "data.outputRowMeta". That in turn allows you to create the output row. In case the step writes as many (or fewer) rows as it reads, you can simply resize the row you get on input:
Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
or more memorable:
Object[] outputRowData = createOutputRow(r, data.outputRowMeta.size());
If rows are being copied make sure to create separate copies to prevent subsequent steps from modifying the same Object[] copy many times at once:
Object[] outputRowData = RowDataUtil.createResizedCopy(r, data.outputRowMeta.size());
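The difference between resizing and copying matters because of array aliasing. The sketch below uses simplified stand-ins (resize and resizedCopy in a made-up RowCopySketch class), not the actual RowDataUtil source. It assumes that a resize may hand back the very same array when it is already large enough, while a copy always allocates a new one.

```java
import java.util.Arrays;

// Simplified stand-ins for the resize/copy idea; NOT the actual RowDataUtil source.
class RowCopySketch {

    // Returns the same array when it is already large enough, otherwise a larger copy.
    static Object[] resize(Object[] row, int newSize) {
        if (row.length >= newSize) {
            return row;
        }
        return Arrays.copyOf(row, newSize);
    }

    // Always returns a fresh array, so the caller never shares state with the input.
    static Object[] resizedCopy(Object[] row, int newSize) {
        return Arrays.copyOf(row, Math.max(row.length, newSize));
    }

    public static void main(String[] args) {
        Object[] input = new Object[] { "Easter", Long.valueOf(2010), null };

        Object[] shared = resize(input, 3);
        shared[2] = "changed";
        System.out.println(input[2]); // "changed": the input row was modified too

        Object[] copy = resizedCopy(input, 3);
        copy[2] = "independent";
        System.out.println(input[2]); // still "changed": the copy is separate
    }
}
```

If the same input row is sent to several outputs, the aliasing shown in the first half is what lets later steps overwrite each other's data.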
Similar to accessing input fields, output fields can be addressed through the index in the output row or using the field helper.
Using the index you can set a value like this:
outputRowData[getInputRowMeta().size()] = easterDate(year.intValue());
or like this with the shortcut:
get(Fields.Out, "easter").setValue(outputRowData, easterDate(year.intValue()));
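The easterDate() helper used above is not shown on this page. As a rough sketch, assuming it computes Gregorian Easter Sunday and returns a java.util.Date, it could be written with the well-known anonymous Gregorian (Meeus/Jones/Butcher) algorithm. This is an illustration only and not necessarily what the sample transformation contains; the class name EasterSketch is made up.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

// Hypothetical implementation of an easterDate() helper (assumption, see text).
class EasterSketch {

    // Computes Easter Sunday for a Gregorian calendar year.
    static Date easterDate(int year) {
        int a = year % 19;
        int b = year / 100;
        int c = year % 100;
        int d = b / 4;
        int e = b % 4;
        int f = (b + 8) / 25;
        int g = (b - f + 1) / 3;
        int h = (19 * a + b - d - g + 15) % 30;
        int i = c / 4;
        int k = c % 4;
        int l = (32 + 2 * e + 2 * i - h - k) % 7;
        int m = (a + 11 * h + 22 * l) / 451;
        int month = (h + l - 7 * m + 114) / 31;      // 3 = March, 4 = April
        int day = ((h + l - 7 * m + 114) % 31) + 1;

        // Calendar months are zero-based, hence month - 1.
        Calendar cal = new GregorianCalendar(year, month - 1, day);
        return cal.getTime();
    }

    public static void main(String[] args) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(format.format(easterDate(2010)));
    }
}
```

For the year 2010 this yields 4 April, and for 2024 it yields 31 March.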
IMPORTANT: The Java data types that you pass on to next steps always need to correspond to the Kettle data types as described on the PDI Rows Of Data page.
Data types
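The Java objects that you get from previous steps or pass on to next steps can't be just anything: they need to correspond to the Kettle data types described on the PDI Rows Of Data page.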
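A common mistake is to put a java.lang.Integer into a Kettle Integer field, which is carried as a java.lang.Long (compare the getInteger() calls above). The sketch below checks a value against the Java class expected for a Kettle type name. The mapping in it reflects the usual Kettle convention and should be verified against the PDI Rows Of Data page; the class KettleTypeSketch and its matches() method are hypothetical and not part of the Kettle API.

```java
import java.math.BigDecimal;
import java.util.Date;

// Hypothetical checker; the type mapping is an assumption to verify against the PDI docs.
class KettleTypeSketch {

    // Returns true when the value is null or an instance of the Java class
    // that the named Kettle type is expected to carry.
    static boolean matches(String kettleType, Object value) {
        if (value == null) return true;
        if ("String".equals(kettleType)) return value instanceof String;
        if ("Integer".equals(kettleType)) return value instanceof Long;
        if ("Number".equals(kettleType)) return value instanceof Double;
        if ("BigNumber".equals(kettleType)) return value instanceof BigDecimal;
        if ("Date".equals(kettleType)) return value instanceof Date;
        if ("Boolean".equals(kettleType)) return value instanceof Boolean;
        if ("Binary".equals(kettleType)) return value instanceof byte[];
        return false; // unknown type name
    }

    public static void main(String[] args) {
        System.out.println(matches("Integer", Long.valueOf(2010)));    // true
        System.out.println(matches("Integer", Integer.valueOf(2010))); // false
    }
}
```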
Parameters
Because hard-coding string values such as field names (for example "customer" in the section above) is not good practice, this step supports parameters:
In this example, taken from your Kettle distribution file "samples/transformations/User Defined Java Class - Calculate the date of Easter.ktr", we have a parameter called YEAR that is referenced with the getParameter() method, for example:
getParameter("YEAR")
At runtime this will return the "year" String value.
Processing rows
The processRow() method is the heart of the step. This method is called by the transformation in a tight loop and will continue until false is returned. A very simple example that calculates firstname+" "+lastname and stores it into a "name" field is this:
String firstnameField;
String lastnameField;
String nameField;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    // Let's look up parameters only once for performance reason.
    //
    if (first) {
        firstnameField = getParameter("FIRSTNAME_FIELD");
        lastnameField = getParameter("LASTNAME_FIELD");
        nameField = getParameter("NAME_FIELD");
        first=false;
    }

    // First, get a row from the default input hop
    //
    Object[] r = getRow();

    // If the row object is null, we are done processing.
    //
    if (r == null) {
        setOutputDone();
        return false;
    }

    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large
    // enough to handle any new fields you are creating in this step.
    //
    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());

    String firstname = get(Fields.In, firstnameField).getString(r);
    String lastname = get(Fields.In, lastnameField).getString(r);

    // Set the value in the output field
    //
    String name = firstname+" "+lastname;
    get(Fields.Out, nameField).setValue(outputRow, name);

    // putRow will send the row on to the default output hop.
    //
    putRow(data.outputRowMeta, outputRow);

    return true;
}
IMPORTANT: The getRow() method must be called before the first get(Fields.In, FIELD_NAME) call. This avoids problems with unexpected field ordering in the data obtained from the previous step (for example with a Mapping input specification).
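To make the calling contract concrete, here is a toy model in plain Java, meant to run outside the step (so it can use generics). It does not use any Kettle classes: ProcessRowLoopSketch and its getRow(), putRow() and processRow() methods are simplified stand-ins. It only shows how a caller keeps invoking processRow() until it returns false.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Toy stand-in for a step; the real step classes live in Kettle and are not used here.
class ProcessRowLoopSketch {
    private final LinkedList<Object[]> input = new LinkedList<Object[]>();
    final List<Object[]> output = new ArrayList<Object[]>();
    boolean outputDone = false;

    ProcessRowLoopSketch(Object[][] rows) {
        input.addAll(Arrays.asList(rows));
    }

    // Returns the next input row, or null when the input is exhausted.
    Object[] getRow() {
        return input.poll();
    }

    void putRow(Object[] row) {
        output.add(row);
    }

    // Handles exactly one row per call; false tells the caller to stop.
    boolean processRow() {
        Object[] r = getRow();
        if (r == null) {
            outputDone = true;
            return false;
        }
        // Make room for one extra field and fill it with firstname + " " + lastname.
        Object[] out = Arrays.copyOf(r, r.length + 1);
        out[r.length] = r[0] + " " + r[1];
        putRow(out);
        return true;
    }

    // Plays the role of the transformation: a tight loop around processRow().
    static ProcessRowLoopSketch run(Object[][] rows) {
        ProcessRowLoopSketch step = new ProcessRowLoopSketch(rows);
        while (step.processRow()) {
            // nothing to do here, all the work happens in processRow()
        }
        return step;
    }

    public static void main(String[] args) {
        ProcessRowLoopSketch step = run(new Object[][] { { "John", "Doe" }, { "Jane", "Roe" } });
        System.out.println(step.output.get(0)[2]); // prints "John Doe"
    }
}
```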
Examples
Look in the samples/transformations folder of your Kettle/PDI distribution for files starting with "User Defined Java Class" like "User Defined Java Class - Calculate the date of Easter.ktr".
Notes on Info Steps and Streams
The getRow() method returns the first available row from any incoming stream, whether that is a regular input stream or an info stream. Info steps are typically used when the row metadata of the info stream differs from that of the input stream, so a row read through getRow() may not have the layout you expect.
The adopted approach is therefore to read all data from the info stream before calling the getRow() method, as in the following example (see also issues PDI-8738 and PDI-8740):
if (first){
    first = false;

    /* TODO: Your code here. (Using info fields)

    FieldHelper infoField = get(Fields.Info, "info_field_name");

    RowSet infoStream = findInfoRowSet("info_stream_tag");

    Object[] infoRow = null;

    int infoRowCount = 0;

    // Read all rows from info step before calling getRow() method, which returns first row from any
    // input rowset. As rowMeta for info and input steps varies getRow() can lead to errors.
    while((infoRow = getRowFrom(infoStream)) != null){

        // do something with info data
        infoRowCount++;
    }
    */
}

Object[] r = getRow();

if (r == null) {
    setOutputDone();
    return false;
}
Notes on class member variables and using getVariable()
When reading variables that point to transformation parameters, the UDJC behaves differently depending on when the getVariable() function is called: in the init() method everything works fine, but in the initializer of a class member variable the variable is not resolved, by design (see PDI-8963).
private final String par = getVariable("somePar"); // DOES NOT resolve correctly
private String par2 = null;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    logBasic("Parameter value="+par+"[MEMBER INIT]");
    logBasic("Parameter value="+par2+"[INIT FUNCTION]");
    setOutputDone();
    return false;
}

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface)
{
    par2 = getVariable("somePar"); // WORKS FINE
    return parent.initImpl(stepMetaInterface, stepDataInterface);
}
Logging
You need to implement logging yourself, because only you know which counters matter for your step (lines read, written, output, updated and so on). Other steps log like so:
putRow( data.outputMeta, r );
if ( checkFeedback( getLinesOutput() ) ) {
    if ( log.isBasic() ) {
        logBasic( "Have I got rows for you! " + getLinesOutput() );
    }
}
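The idea behind the snippet above is to log progress only once every so many rows rather than for each row. The sketch below models that in plain Java, run outside the step. It assumes that checkFeedback() returns true whenever the line count is a positive multiple of a configured feedback size; that behaviour is an assumption about Kettle, and the FeedbackLogSketch class with its methods is a made-up stand-in.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in model of interval-based progress logging; not Kettle code.
class FeedbackLogSketch {
    private final int feedbackSize;
    private long linesOutput = 0;
    final List<String> log = new ArrayList<String>();

    FeedbackLogSketch(int feedbackSize) {
        this.feedbackSize = feedbackSize;
    }

    // Assumed behaviour: true on every positive multiple of the feedback size.
    boolean checkFeedback(long lines) {
        return feedbackSize > 0 && lines > 0 && lines % feedbackSize == 0;
    }

    // Counts one output row and logs only when the interval is reached.
    void putRow() {
        linesOutput++;
        if (checkFeedback(linesOutput)) {
            log.add("Lines output: " + linesOutput);
        }
    }

    public static void main(String[] args) {
        FeedbackLogSketch step = new FeedbackLogSketch(3);
        for (int i = 0; i < 7; i++) {
            step.putRow();
        }
        System.out.println(step.log); // prints [Lines output: 3, Lines output: 6]
    }
}
```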
Resources:
Blog about this step and its usage in different scenarios: http://type-exit.org/adventures-with-open-source-bi/2010/10/the-user-defined-java-class-step