Priority on development

Correctness/Consistency

If a tool is not correct it's not going to be trusted however fast it may be. It can't be that the same input will produce output A in one case, and output B in another case.

...

It should not be a torment to use a tool. It should allow both novice and expert users to get their job done. For example: any XML or configuration file should have a GUI element to manage it and should never have to be edited manually.

Create JIRA cases

For all your bug fixes, feature implementations and translation efforts, please create a JIRA case: http://jira.pentaho.org/browse/PDI

Then please mention the case number (for example PDI-9999, with PDI in uppercase) in your commit message to help us keep track of the changes. This way JIRA automatically links the Subversion commits to the JIRA case.
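As an illustration of the convention above, here is a minimal sketch of a check for a case number in a commit message. The class and method names are hypothetical and not part of PDI or JIRA. The pattern assumes a case number is the uppercase key PDI, a dash, and one or more digits.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDI or JIRA.
final class CommitMessages {
    // Assumed format: uppercase project key, a dash, then the case number (e.g. PDI-9999).
    private static final Pattern CASE_NUMBER = Pattern.compile("\\bPDI-\\d+\\b");

    private CommitMessages() {}

    /** Returns true when the message mentions at least one PDI case number. */
    static boolean mentionsCase(String message) {
        return message != null && CASE_NUMBER.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(mentionsCase("PDI-9999: fix null handling")); // true
        System.out.println(mentionsCase("pdi-9999: fix null handling")); // false (lowercase key)
    }
}
```

A check like this could run locally before committing. The lowercase variant is rejected on purpose, since the text asks for PDI in uppercase.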

Use English in the source code

Since PDI is developed by an international group of people, use English for everything in the source code: identifiers, comments, and so on.

Division of functionality in steps and job entries

One of the ideas of Pentaho Data Integration is to make simple steps and job entries which have a single purpose, and be able to make complex transformations by linking them together, much like UNIX utilities.

Putting too much (diverse) functionality in one step/job entry makes it less intuitive for people to use, and since most people only start reading manuals when they run into problems, we need all the intuitiveness we can get.

Rows on a single hop have to be of the same structure

As stated in the user section of this document, all rows that flow over a single hop have to be of the same structure. You shouldn't try to build things that circumvent this; doing so is harder as of v2.5.0 because of the design-time check on the structure of the rows.

Null and "" are the same in PDI

As in Oracle, the empty string "" and NULL should be considered the same by all steps. This keeps every step in line with the rest of PDI.
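A minimal sketch of what this rule could look like in step code follows. The class and its methods are hypothetical helpers written for this example and are not claimed to be the PDI API. It only illustrates comparing strings so that null and "" are treated as the same value.

```java
import java.util.Objects;

// Hypothetical helper for illustration; not part of the PDI API.
final class EmptyValues {
    private EmptyValues() {}

    /** Returns true when the string is null or has zero length. */
    static boolean isEmpty(String value) {
        return value == null || value.isEmpty();
    }

    /** Compares two strings, treating null and "" as equal. */
    static boolean sameValue(String a, String b) {
        if (isEmpty(a) && isEmpty(b)) {
            return true;
        }
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        System.out.println(sameValue(null, "")); // true
        System.out.println(sameValue("", "x"));  // false
    }
}
```

Routing every emptiness check in a step through one such helper avoids the case where a step handles null correctly but treats "" as a real value.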

About converting data to fit the corresponding Metadata

Length & Precision are just metadata pieces.

...

The strategy of keeping the same datatype as long as possible has saved us from many a conversion error like the one described above.

About logging in steps in Pentaho Data Integration

Q: How to use logging in PDI steps? This applies to the main PDI steps as well as to any PDI steps you develop yourself.

...

catch(Exception e)
{
    String message = Messages.getString("FilterRows.Exception.UnexpectedErrorFoundInEvaluationFuction");  //$NON-NLS-1$
    logError(message);
    logError(Messages.getString("FilterRows.Log.ErrorOccurredForRow")+rowMeta.getString(row)); //$NON-NLS-1$
    logError(Const.getStackTracker(e));
    throw new KettleException(message, e);
}

About using XML in Pentaho Data Integration

Q: What's the idea of using XML in PDI?

A: XML is for machines, not for humans. So unless the functionality is about processing XML itself (XML input step/XML output step), the XML should be kept hidden. Behind the scenes XML will be used, but users of PDI should not be required to know this or to manipulate XML in any way. Every XML configuration/setup file should be managed through a GUI element in PDI.

About dropdown boxes and storing values

Don't use the index of the dropdown box as a value in the XML export file or database repository.

...

  • if someone wants to add extra values in the future he must use the order you defined first;
  • it makes the XML output very much unreadable.
    It's better to convert from a localized string in the GUI to some English equivalent which is then stored. For example:
  • Suppose on the GUI you have a dropdown box with values "Date mask" and "Date time mask";
  • Instead of using a 1 in the output for "Date mask" and 2 for "Date time mask", it would be better to put in the output "DATE_MASK" for "Date mask" and "DATE_TIME_MASK" for "Date time mask";
  • Also note that DATE_MASK/DATE_TIME_MASK would then not be allowed to be subject to I18N translation (which is ok for transformation/job files).
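The mapping described above might be sketched as follows. The class and method names are hypothetical and chosen for this example. Only the DATE_MASK and DATE_TIME_MASK codes come from the text. The translated labels shown in the GUI are assumed to be loaded elsewhere, in the same order as the codes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names are illustrative and not PDI API.
final class MaskType {
    // Stable English codes: these are written to the XML file / repository
    // and are never translated.
    static final List<String> CODES = Arrays.asList("DATE_MASK", "DATE_TIME_MASK");

    private MaskType() {}

    /** Code to store for the item selected at the given dropdown index. */
    static String codeForIndex(int index) {
        return CODES.get(index);
    }

    /** Dropdown index to select when loading a stored code; -1 if the code is unknown. */
    static int indexForCode(String code) {
        return CODES.indexOf(code);
    }

    public static void main(String[] args) {
        System.out.println(codeForIndex(1));           // DATE_TIME_MASK
        System.out.println(indexForCode("DATE_MASK")); // 0
    }
}
```

Because only the code is persisted, the order of the dropdown entries can change later without invalidating saved transformations or jobs.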

About using I18N in PDI

Q: Some more details on using I18N

...

  • Only translate what a normal user will see; it doesn't make sense to translate all debug messages in PDI. Some performance improvements were achieved in PDI just by removing some of the translations for debug messages;
  • Make sure you don't translate strings used in the control logic of PDI:
  • If you would e.g. make the default name of a new step "language dependent" this would still make jobs/transformations usable across different locales;
  • If you would e.g. make tags used in the XML generated for the step language dependent there would be a problem when a user would switch his locale;
  • If you would translate non-tag strings used in the control logic you will also have a problem. E.g. in the repository manager "Administrator" is used to indicate which user is administrator (and this is used in the PDI control logic). So if you would translate administrator to a certain language, this would work as long as you wouldn't switch locales.

About using Locales in PDI

PDI should always use the default Locale, so the Locale should not be hardcoded anywhere to English or any other specific value. However, some steps may allow the default Locale to be overridden; this is then step-specific, and it should always be possible to select the Locale via the GUI of the step.
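One way to read this rule is sketched below, under stated assumptions: the class, its methods, and the idea of storing the override as a language tag such as "de-DE" are hypothetical and not taken from PDI. The sketch falls back to the JVM default Locale whenever the step has no override configured.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical sketch; not PDI API. The override is assumed to come from the step's GUI.
final class StepLocale {
    private StepLocale() {}

    /** Returns the Locale chosen in the step dialog if present, otherwise the JVM default. */
    static Locale effectiveLocale(String overrideTag) {
        if (overrideTag == null || overrideTag.isEmpty()) {
            return Locale.getDefault(); // never hardcode a specific Locale here
        }
        return Locale.forLanguageTag(overrideTag);
    }

    /** Formats a number using the effective Locale of the step. */
    static String formatNumber(double value, String overrideTag) {
        return NumberFormat.getNumberInstance(effectiveLocale(overrideTag)).format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234.5, null));    // output depends on the default Locale
        System.out.println(formatNumber(1234.5, "de-DE")); // formatted with the German Locale
    }
}
```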

About reformatting source code

Try to keep reformatting of code to a minimum, especially on things like {'s at the end of the line or at the start of the next line. Not all people like the same style, so why should your specific preference be used? When changing code, try to use the same formatting as the surrounding code, even if it's not your preference.

If you really feel a need for some reformatting, do it in a separate SVN check-in. DON'T mix reformatting source code with real changes. It's VERY annoying not being able to easily see what changed because someone decided to reformat source code at the same time. "My tool does automatic formatting" is a very lame excuse, as all known IDEs allow you to switch it off.

About checking persistence correctness

All steps and job entries need to implement persistence to XML format AND the repository. Since saving to XML format and saving to the repository are done using separate methods, it has sometimes happened that the metadata was saved properly in one format, but not in the other. The following procedure does a basic check of whether data is properly saved. It's not a 100% check, but it will detect the most obvious mistakes.

...

Part 1 shows that the loading/saving to XML works. If part 1 succeeds but part 2 fails, chances are very high that something is wrong in the methods loading/saving to the repository.

About using non-temporary storage

Don't implement functionality that stores data in non-temporary storage in a PDI specific/internal format. With non-temporary we mean surviving a job or a transformation execution. The reason for this is to avoid having to deal with conversions.

Reason: suppose, for example, that a step serializes rows into a file to be read by a subsequent transformation. If you upgrade PDI between runs, the rows may no longer de-serialize correctly. To solve this, conversion applications would be required (or some old/new format logic, if that is even possible). As long as you don't save data into an internal format that survives a job/transformation, you're always ok.

How do you start developing your own plug-in step

Q: I need some complex calculation in my ETL. I have created my own logic in Java classes. How can I make my own step and integrate it in PDI?

A: See Writing your own Pentaho Data Integration Plug-In

Checklist for end of step/job entry development

The following is a list of things to check before you consider a change to a step/job entry, or a completely new step/job entry, to be complete.

  • Does the change break anything compared to the previous release(s)? If possible nothing should break, but sometimes there's no other way. If something breaks and there's no way around it, at least inform the PDI tech lead;
  • Is the source code completely in English? Especially check whether the key fields that will be saved in the XML file/repository are in English. It's hard to change these later on;
  • Does loading/saving work correctly with both XML format and a repository? This is mostly for new steps/job entries or for changes in existing attributes. A check for this is included in the section "About checking persistence correctness";
  • Is the documentation up to date with the changes?

About using Subversion

  1. Always make sure that whatever you put in SVN keeps PDI buildable and startable. Nothing is more annoying than not being able to even start PDI because someone checked in half-working code. If the change is too big to do at once, work in small steps towards the full change (but at all times keep PDI buildable/runnable).
  2. Always comment your commit. It is best to add the PDI-xx number from the bug / feature tracker and a short description (so that readers do not have to look up the description from the PDI-xx number). See the other developers' commits for examples.
  3. To keep track of the changes and to follow the software development process you HAVE to add a JIRA PDI-xx number to your commit. The only exceptions are cosmetic changes or fixes of spelling mistakes. This also helps other users who are not involved in the development process to keep track of the changes.

About Serializable and Binary

Q: If I need to select a type I can choose between Serializable and Binary. Are they the same, or what's the difference?

...

Serializable is used by some proprietary plugins, built by a company that uses it to pass Java objects from one step to another. Unless you're developing your own steps/plugins, Serializable is not something to be used. The way to read/write data depends on the objects being stored in a Serializable.

Success factors of PDI

Modular design

Pentaho Data Integration as a runtime engine consists of a "row engine" taking care of transporting data from one step to the next. The steps are separate and can even use a plug-in architecture.

...

The biggest disadvantage would be speed, but comparing the speed of PDI to other ETL tools the disadvantage doesn't seem that big (in a lot of cases PDI is even faster than similar jobs in other ETL tools).

More developer information

can be found here: PDI Developer information