What's new in PDI version 3.1

Index

Introduction
Ease of use
Step changes
Job entry changes
Databases
Community and codebase

Introduction

While version 3.1 was taking shape, we shipped 5 other releases: 2.5.2, 3.0.1, 3.0.2, 3.0.3 and 3.0.4. All the same, we managed to get quite a bit of work done.

The first theme for this release was "Ease of use". It's a theme shared with the rest of the Pentaho platform and tool set. Traditionally, Kettle isn't the worst player in that department, but you can always do better.

The second theme of this release was the complete rework of the documentation set. To keep things manageable for larger groups of people, we moved everything we could to the central Pentaho wiki.
Documenting is a difficult task that can never be considered complete, but the wiki will help us keep up with the incredible pace of development that we again achieved in Kettle.

Ease of use

Execution results

To do away with the tab-clutter that came about in the previous release we decided to put the results of executions in a split pane below the graphical view:

Performance graph

To make it easier to see which steps perform well and which do not, we gather performance statistics at a configurable interval and can show them on a graph:

We also allow you to store the raw data behind the graph in a database table so that you can create your own statistics.

FAQ attack

We're constantly on the lookout to reduce the size of our FAQ, not increase it. We do this by informing users of the consequences of certain decisions or by answering frequently asked questions right in the Spoon GUI.

Some of these FAQ attack measures are subtle, like the fact that you can now execute a stored procedure without any input going to the step (it simply executes once).

Others are less subtle, like the tool-tip we show after you drag the second step onto the canvas:

New database dialog

The old database dialog was sometimes a bit confusing. It became one of the most complete database connection configuration tools, but usability and clarity suffered because of this.
At the same time we needed a shared database dialog that could be used by different tools in the Pentaho stack. Because of this, we opted to create the dialog in the Mozilla-backed XUL standard.
An SWT layer was created, and the new dialog is now much easier on the eyes and much easier to use:

As you can see, only those options that are relevant to the selected database and access type are shown.

Zoom

If you are dealing with large transformations or jobs, it can be useful to zoom in and out to keep an overview:

Snap to grid

Some people love it, some people hate it, but here it is, the long-awaited "snap-to-grid" functionality:

Welcome page / Getting started

We created a "Getting Started" page and linked it on the welcome page. We also linked a number of extra blogs.

Changes in steps

INPUT

CSV File Input
- Parallel reading
- Multiple files support
- Encoding support
Fixed File Input
- Multiple files support
- Encoding support
Property Input
- new step to read properties
Get data from XML
- New step to parse any type of XML from any source
- Uses XPath
Generate Random Value
- Handy step in case you want to generate random numbers and strings
Get Files Rows Count
- get row counts from text files
LDIF Input
- Support for reading LDIF (LDAP Data Interchange Format) files
Mondrian Input
- Now also supports version 3 of Mondrian
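
To give an idea of what XPath-based parsing in "Get data from XML" amounts to, here is a minimal Python sketch. It is not the step's implementation: the sample document, the function name rows_from_xml and the field mapping are all assumptions made for illustration, and Python's ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are made up for illustration.
SAMPLE_XML = """
<orders>
  <order id="1"><customer>Alice</customer><total>10.50</total></order>
  <order id="2"><customer>Bob</customer><total>7.25</total></order>
</orders>
""".strip()


def rows_from_xml(xml_text, loop_xpath, field_xpaths):
    """Return one dict per node matched by loop_xpath.

    field_xpaths maps an output field name to a path relative to the
    loop node. A path starting with '@' reads an attribute of the loop
    node itself (a convention of this sketch, not of ElementTree).
    """
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(loop_xpath):
        row = {}
        for name, path in field_xpaths.items():
            if path.startswith("@"):
                row[name] = node.get(path[1:])
            else:
                row[name] = node.findtext(path)
        rows.append(row)
    return rows


rows = rows_from_xml(
    SAMPLE_XML,
    "./order",
    {"id": "@id", "customer": "customer", "total": "total"},
)
for row in rows:
    print(row)
```

Each node matched by the loop path becomes one output row, and every field is read relative to that node.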

OUTPUT

Property Output
- Write to a Java properties file
SQL File Output
- Write data to a file in the form of SQL statements
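
The idea behind writing rows as SQL statements can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names sql_literal and rows_to_insert_statements are made up, only NULL, numbers and strings are handled, and the real step has options of its own that are not modelled here.

```python
def sql_literal(value):
    """Render a value as a SQL literal (simplified: None, numbers, strings)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings: wrap in single quotes and double any embedded quote.
    return "'" + str(value).replace("'", "''") + "'"


def rows_to_insert_statements(table, fields, rows):
    """Yield one INSERT statement per row, in field order."""
    columns = ", ".join(fields)
    for row in rows:
        values = ", ".join(sql_literal(row[field]) for field in fields)
        yield f"INSERT INTO {table} ({columns}) VALUES ({values});"


# Hypothetical table and data, for illustration only.
sample_rows = [
    {"id": 1, "name": "O'Brien"},
    {"id": 2, "name": None},
]
statements = list(rows_to_insert_statements("customers", ["id", "name"], sample_rows))
print("\n".join(statements))
```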

LOOKUP

Database Lookup
- cache entire table for better performance
Web services lookup
- complex data types, etc.
HTTP client
- Accept URL from an input field
Check if a column exists
- Verify if a column exists in a database table
Table Exists
- Verify if a database table exists
File exists
- Verify if a file exists
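
Why caching the entire table helps the Database Lookup step can be shown with a small Python sketch: the table is read once into memory, after which each row is served from that cache instead of triggering a query. The table, the column names and the function load_lookup_cache are assumptions for illustration (using an in-memory SQLite database), not Kettle code.

```python
import sqlite3


def load_lookup_cache(conn, table, key_column, value_column):
    """Read the whole lookup table once and return it as a dict.

    Identifiers are interpolated directly, which is acceptable for a
    sketch but not for untrusted input.
    """
    cursor = conn.execute(f"SELECT {key_column}, {value_column} FROM {table}")
    return dict(cursor.fetchall())


# Hypothetical lookup table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("BE", "Belgium"), ("NL", "Netherlands")],
)

cache = load_lookup_cache(conn, "country", "code", "name")

# Incoming rows are enriched from the cache, without further queries.
incoming = [
    {"customer": "A", "country_code": "BE"},
    {"customer": "B", "country_code": "XX"},  # no match: the value stays None
]
enriched = [dict(row, country_name=cache.get(row["country_code"])) for row in incoming]
print(enriched)
```

The trade-off is memory: this only pays off when the lookup table fits comfortably in RAM.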

TRANSFORM

Add a checksum
- Calculate a checksum over one or more fields
Calculator
- Various new calculation types
Clone row
- Create one or more copies of the passing rows
Data validator
- Extensive tool to validate your data
Delay row
- Delay for a certain period before passing each row
JavaScript
- Support for ECMA v4
- Additional new functions for file handling and much more
Group By
- Support for cumulative sum and average, stddev, concatenation with specific separator

Metadata structure
- Document the metadata structure of a stream of data
Split field to rows
- split a row containing a delimited field into multiple new rows, one per split value.
Switch / Case
- Send rows to different streams depending on a field value
XSD Validator
- Validate an XML file/string using a schema
XSL Transformation
- Transform an XML file/string
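
As an illustration of what "Split field to rows" in the list above does, here is a minimal Python sketch. The function name and the sample field names are assumptions; the sketch only mirrors the described behaviour of producing one new row per split value.

```python
def split_field_to_rows(rows, field, delimiter, new_field):
    """Yield one copy of each row per delimited value found in `field`."""
    for row in rows:
        for part in str(row[field]).split(delimiter):
            new_row = dict(row)        # copy, so the input row is untouched
            new_row[new_field] = part  # add the split value as a new field
            yield new_row


# Hypothetical input row with a semicolon-delimited field.
source = [{"id": 1, "tags": "a;b;c"}]
result = list(split_field_to_rows(source, "tags", ";", "tag"))
for row in result:
    print(row)
```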

SCRIPTING

Regex Evaluation
- Validate strings using regular expressions
- Grab capture groups and turn them into fields
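
Both aspects of Regex Evaluation, validating a string and turning capture groups into fields, can be sketched in Python. The pattern, the field names and the function evaluate_regex are assumptions for illustration; the step itself uses Java regular expressions, whose syntax differs in some details from Python's.

```python
import re

# Hypothetical pattern: an ISO-style date with three capture groups.
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def evaluate_regex(row, field, pattern, group_fields):
    """Add a 'matched' flag plus one new field per capture group."""
    match = pattern.fullmatch(row[field])
    result = dict(row, matched=match is not None)
    for index, name in enumerate(group_fields, start=1):
        result[name] = match.group(index) if match else None
    return result


good = evaluate_regex({"d": "2008-09-18"}, "d", DATE_PATTERN, ["year", "month", "day"])
bad = evaluate_regex({"d": "18/09/2008"}, "d", DATE_PATTERN, ["year", "month", "day"])
print(good)
print(bad)
```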

JOINS

XML Join
- The XML Join step lets you add XML tags from one stream into a leading XML structure from
  a second stream.
- Allows you to create complex XML strings

BULK LOADING

Oracle Bulk Loader

EXPERIMENTAL

Get sub folder names
Mail
Mail validator
MonetDB bulk loader
Greenplum bulk loader

PostgreSQL bulk loader

Job entry changes

The first thing you'll notice is that the job entries are now also split into different categories.

Many job entries have been added in this release and a number of existing ones were changed too...

File management

Add filenames to result
- allows you to add a set of files or folders to the result list of the job entry
Compare folders
Copy or move result filenames
Create a folder
Delete filenames from result
Delete folders

Conditions

Scripting

Shell
- you can now specify the script to execute in the dialog

File transfer

SSH2 Get
SSH2 Put

Repository

Check if connected to repository
Export repository to XML file

Databases

Besides the new database dialog (see above) we also added support for a few new database types. We now have support for 34 database types and a generic database connection for the others.

Here are the new ones...

MonetDB: the Dutch open source column database
KingbaseES: the popular Chinese RDBMS (PostgreSQL based)
Vertica: the upcoming high performance column database
HP NeoView: HP's answer to operational BI

Internationalization

In the i18n department, all teams made great strides but we would like to especially thank the Korean (Kim YoungWoo) and Japanese (Hiroyuki Kawaguch) translators for an excellent job.

Here is an overview of the translation status:

Language	% Complete	Keys done (shown in the language)	Keys missing (shown in English)
en_US	100%	9442	0
it_IT	100%	9442	0
fr_FR	100%	9442	0
es_AR	64%	6069	3373
ko_KR	61%	5740	3702
ja_JP	57%	5341	4101
zh_CN	53%	5021	4421
de_DE	48%	4539	4903
es_ES	41%	3853	5589
nl_NL	15%	1432	8010
pt_BR	13%	1237	8205
pt_PT	13%	1236	8206
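
The percentage column follows directly from the two key counts: keys done divided by the total number of keys (9442 for every language in the table). A small Python sketch of that arithmetic, with a function name chosen here for illustration and a few rows copied from the table:

```python
def percent_complete(keys_done, keys_missing):
    """Share of translated keys, as a percentage of all keys."""
    total = keys_done + keys_missing
    return 100.0 * keys_done / total


# (keys done, keys missing), copied from the table above.
status = {
    "es_AR": (6069, 3373),
    "ko_KR": (5740, 3702),
    "nl_NL": (1432, 8010),
}

for language, (done, missing) in status.items():
    print(language, round(percent_complete(done, missing)))
```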

Also many kudos to the Italian (The great Nico Ben) and French (Super Samatar Hassan) translators for keeping up there at 100%. Given the ever so fast development pace, this is no small feat!!

Community and codebase

A word of thanks

As in any good open source project, our community was the driving force behind this excellent release. Pentaho obviously spent a large amount of time on this release but it wouldn't have been the same without the valuable help of all our developers, testers, bug reporters, partners, customers, documenters, translators, forum members, etc. It would lead us too far to thank everyone but it's all of you that keep Kettle going!

Even though all contributions are valued a lot, I would like to give special thanks to Samatar Hassan, Daniel Einspanjer (at Mozilla) and Ingo Klose (at SHS-Viveon) for their contributions to this release.

On the Pentaho team I would like to applaud Jem for porting that pesky Spoon users guide over to the Wiki. Many thanks to the whole team for all the help!

Codebase

Even though we try our best to re-factor and simplify the codebase all the time, there is no denying that the codebase keeps growing.
Right before every release we run the following command:

find . -name "*.java" -exec wc -l {} \; | awk '{ sum+=$1 } END { print sum }'

This is what that gave us over the last releases:

Version	Lines of code	Increase
2.1.4	160,000
2.2.2	177,450	17,450
2.3.0	213,489	36,039
2.4.0	256,030	42,541
2.5.0	292,241	36,211
3.0.0	348,575	56,334
3.1.0	456,772	108,197

As you can see, there is no sign of any slowdown in the development of the Kettle codebase. Looking at the roadmap this is bound to stay like that for the foreseeable future.
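
For those who like to check the numbers, the "Increase" column can be recomputed from the line counts with a few lines of Python. This is a sketch: the function name is made up, and the relative growth it also prints is derived here rather than taken from the table.

```python
# (version, lines of code), copied from the table above.
LINES_OF_CODE = [
    ("2.1.4", 160000),
    ("2.2.2", 177450),
    ("2.3.0", 213489),
    ("2.4.0", 256030),
    ("2.5.0", 292241),
    ("3.0.0", 348575),
    ("3.1.0", 456772),
]


def increases(history):
    """Return (version, absolute increase, percent increase) per release."""
    result = []
    for (_, previous), (version, current) in zip(history, history[1:]):
        delta = current - previous
        result.append((version, delta, 100.0 * delta / previous))
    return result


growth = increases(LINES_OF_CODE)
for version, delta, percent in growth:
    print(f"{version}: +{delta} lines ({percent:.1f}%)")
```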

Matt Casters - Okegem/Belgium - September 18th 2008