Feature checkboxes

Flat files: can it take flat files as input and also generate flat files as output?

Yes, all kinds of flat files. Parallel reading is supported on a single machine and across a cluster/SAN.

Can it support COBOL, XML and Excel spreadsheets? What is the relevant transformation?

COBOL is a programming language, not a file format.  All the other file formats are supported.

Can it support mainframes? If yes, how will we connect to them?

A mainframe is a system with an operating system.  We can connect to the databases on it using JDBC or ODBC.

Can it support Oracle, Teradata, Netezza, SQL Server 2005/2000 and MS Access?

Yes, around 34 database types.

How many transformations does the tool have?

Around 100 steps are present in the latest version.

How is the performance when loading large amounts of data (3 TB), and do you have any documentation on best practices for performance?

Impossible to say in general.  We do have bulk loaders for a number of databases though, and they are as fast as it gets.

Docs on performance...
Performance monitoring: http://wiki.pentaho.com/display/EAI/Step+performance+monitoring
And our performance checklist:


Mozilla is loading billions of rows with Pentaho Data Integration:


Is there a development repository?

You can create as many repositories as you like, usually 3 or 4 for Development, Test, User Acceptance and Production.

Can users copy & paste objects and sessions into one or more workflows?

Yes, all copy/paste operations go over the clipboard using XML.  This makes it easy to copy/paste not only into the same application but also into others.

Can users nest sessions within other sessions?

I'm not sure what that means, but you can open as many GUI instances as you like.

Is there check-in, check-out versioning?

This is actually next on our roadmap. At the moment we recommend that people with versioning needs stick to XML-based jobs and transformations and check those into their favorite versioning tool (CVS, Subversion, etc.).

Is there a visual debugger?


Can users set breakpoints?

See this page for a sample: http://wiki.pentaho.com/display/EAI/Getting+Started

What code does the product generate internally?

No code is generated; we execute directly based on the ETL metadata.  This is safer because it rules out the occasional errors in a code generation and deployment process.
Also, with generated code, when things go wrong (and they always do) you need to become an expert in the generated language.  Here you don't.

What scripting languages does it support for custom coding?

We support JavaScript and regular expressions besides the usual SQL.
We also have an extensive framework to allow you to write your own plugins in Java.

What third party objects does the tool support?

At the moment we support 34 databases + a generic connection in JDBC, ODBC, OCI etc.
That includes SAP/R3.
Besides that we can read all sorts of XML, Excel, Access, Text files, LDAP, LDAP input files, Java properties files, etc. We do that in all sorts of variants and in nearly all locales and code pages.

Can reports visually depict dependencies among components in multiple workflows?

Not yet, but this is also on our road map.

Can the tool execute multiple jobs sequentially and also in parallel?

Yes, both on the same machine and in clustered mode across different servers.

Do you have any partnerships with a company for data quality?

For example: http://www.infosolvetech.com/
They actually OEM Pentaho Data Integration.

Does the tool support full parallel partitioning, and is there an extra price for it or is it embedded in the tool?

It does support it.  You can even write your own partitioning plug-in although you would probably need our help to do it (since few people need it).
It's embedded in the tool.

What are your upcoming enhancements in the next version?

At the time of writing, Pentaho is releasing a management services console (monitoring, alerts, trending, performance graphs, etc).
Next up, for the end of this year or the beginning of next, is a profiling suite (batch oriented and on-line, with a repository and a profiling service).
Also, we're working on support for version management linked to third party version control systems (from typical uses to CMS systems).
A cluster manager is also up next, and so on; there is too much to mention.  Check our JIRA system and the homepage for updates.

Does it have a consistent graphical user interface among all modules? That is, can you navigate from one window to all the other modules (source, target, repository manager, transformation development, output window), or from the source analyzer to the target analyzer, repository manager, workflow navigator or output monitor? Or do I have to log in to the repository each time to connect to a different module?

Yes. The only exception is when you switch repositories, from test to production for example.  We want to make this a deliberate choice, not an accident.

Can you run the jobs from the command line on UNIX?

Yes, obviously:  http://wiki.pentaho.com/display/EAI/Kitchen+User+Documentation
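
As a rough illustration, a job saved as an XML file can be started from a script or a scheduler by calling Kitchen with the `-file` and `-level` options described on that page. The sketch below only builds and prints the command line; the install directory `/opt/pdi` and the job path are hypothetical, and the helper names are ours, not part of the tool.

```python
import subprocess
from pathlib import PurePosixPath


def kitchen_command(pdi_home, job_file, log_level="Basic"):
    """Build the argument list for running one job file with Kitchen."""
    script = PurePosixPath(pdi_home) / "kitchen.sh"  # hypothetical install location
    return ["sh", str(script), f"-file={job_file}", f"-level={log_level}"]


def run_job(pdi_home, job_file, log_level="Basic"):
    """Run the job and return Kitchen's exit code (0 means success)."""
    return subprocess.run(kitchen_command(pdi_home, job_file, log_level)).returncode


if __name__ == "__main__":
    # Print the command instead of running it, since the paths are made up.
    print(" ".join(kitchen_command("/opt/pdi", "/etl/jobs/load_dwh.kjb")))
```

Checking the exit code in the calling script is what lets an external scheduler decide whether to continue or raise an alert.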

Do you have any wizards for supporting SCD type 1, type 2 and type 3? Or does the tool support any other wizards?

We have no need for a wizard since we have a Dimension Lookup/Update step that manages it all.

Do you have a concept called reusable mappings? Suppose you have common business functionality that you want to use any number of times: can you create one reusable mapping and use it many times?

Yes, it's called the "Mapping step".

In the same way, do you have reusable transformations that can be used many times without recreating them? (For example, you have a lookup transformation that checks the zip code and you make it reusable; then everyone in your team can use the same reusable transformation without creating it again.)

Same answer.

Can the user access and modify the internal code?

Yes, our code is licensed under the GNU Lesser General Public License (LGPL).  Check the license for the limitations that apply.
As mentioned earlier, that's all the code there is; we don't generate anything.

Can the tool generate SQL ?

Yes, in all database steps, we have a SQL button that will generate the required table and/or indexes.  It's also available on the transformation and job level.

Suppose you have 3 jobs. Can you chain them one after the other, either serially or in parallel, as they might be dependent? (E.g. job2 can start only after job1 completes, or job1, job2 and job3 can run in parallel.)

That's how our jobs work.  We do also support parallel execution of job entries.
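
The dependency pattern from the question can be pictured with a small sketch. This is not how the tool implements jobs internally; it is plain Python, with a stand-in `run_job` function, showing "job1 first, then job2 and job3 in parallel".

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(name, log):
    """Stand-in for a real job: record that it ran and report success."""
    log.append(name)
    return True


def run_flow(log):
    """Run job1, then job2 and job3 in parallel; return True if all succeed."""
    # job1 must finish successfully before anything else starts.
    if not run_job("job1", log):
        return False
    # job2 and job3 do not depend on each other, so run them side by side.
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(lambda name: run_job(name, log), ["job2", "job3"]))
    return all(results)


if __name__ == "__main__":
    executed = []
    print(run_flow(executed), executed)
```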

Can users automatically update copied objects or sessions by reconfiguring a base template?

No, so far we didn't have a need for it given all the functionality mentioned above.
That being said, we do allow steps and job entries to be shared meaning they'll be available in all transformations and jobs.
You can use those as templates.  Then again, very few people actually use that functionality.

Can it generate visual reports that enable business users to view the origins of a component? This is related to data lineage reports.

Yes, in each step in a transformation you can see where the data is coming from.
Specifically for databases, we also have a (simple) database impact analysis tool.

What other reports can the repository generate out of the box?

Since the ERD of the repository is public, you can create as many reports on it as you like.  One community member went as far as creating a complete documentation system on top of it.

Can it scale linearly across multiple CPUs?

There isn't a tool in the world that can do that, unfortunately.  That being said, we do run all our steps in parallel across all available CPUs if needed.

Can it support load balancing and failover across clustered servers?

Not out of the box, you need a cluster manager for that. You can either script that, integrate it in existing fail over software (like a few of our customers do) or wait for the upcoming PDI cluster manager.

What target systems are supported?

Pretty much all of them.

What bulk load utilities are supported? I know I saw an Oracle bulk loader yesterday.

It's sometimes difficult to put these in production since they often require intimate RDBMS vendor knowledge, but these are the ones you can use:

MS SQL Server

Can users turn off referential integrity and indexes?

You would want to use a tool suited for the database you're working with.

Can it load target partitions?

Creating / deleting partitions is something that can be set up in a transformation dynamically.
Writing to a partitioned table usually doesn't require any extra logic.
We do allow writing to "manually" partitioned tables in the "TABLE_200810" style.
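
To make the "TABLE_200810" naming concrete: the target table name is simply a base name plus the year and month of the row's date. A minimal sketch, with a hypothetical helper name:

```python
from datetime import date


def monthly_table_name(base, day):
    """Return the name of the month table that a row dated `day` belongs in."""
    return f"{base}_{day:%Y%m}"


print(monthly_table_name("TABLE", date(2008, 10, 17)))  # TABLE_200810
```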

Automatically generate DDL?

Absolutely not!  Technically it would be very easy to create this. However, we don't allow the users to do that since it's a very bad feature, not in line with DBA best practices and typical system ownership rules.

What type of complex transformations does the tool support out of the box?

About 100: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps

How does it generate surrogate keys? Does it have a transformation for that?

They are generated automatically in the "Dimension Lookup/Update" (SCD) and "Combination Lookup/Update" (junk dimensions) steps.
These use either auto-increment primary keys, sequences or our own counters.
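
The counter variant can be sketched as follows. This only illustrates the idea of handing out one key per distinct combination of values; the class and method names are hypothetical and it is not the steps' actual implementation.

```python
class KeyGenerator:
    """Hands out one surrogate key per distinct natural-key combination."""

    def __init__(self, start=1):
        self._next = start
        self._keys = {}

    def lookup_or_create(self, *natural_key):
        # Reuse the key if this combination was seen before,
        # otherwise take the next counter value.
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]


if __name__ == "__main__":
    keys = KeyGenerator()
    print(keys.lookup_or_create("M", "single"))   # new combination
    print(keys.lookup_or_create("F", "married"))  # new combination
    print(keys.lookup_or_create("M", "single"))   # same key as the first call
```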

Does it support incremental dimensional aggregates? Please let us know.

Sorry, I never even heard of that one.  A definition would be nice to have.
In general, incremental aggregation is something that is not easy to solve automatically either, especially with late arriving data.

Does it have a scheduler, or does it have to integrate with third party tools?

See the Kitchen documentation linked to above.
It is also possible to use the Pentaho platform for scheduling, and the next version of the management services platform will also integrate advanced scheduling.

Can the tool validate jobs before running them?

Manually, not automatically.

Can the tool recover to the last checkpoint or point of failure without manual work?

No. Actually, looking beyond marketing claims, I don't know any tool that can do that in a generic way.
There's a lot you can do to set it up, we even allow rollbacks on a transformation level.  However, if you want to go further, you need to do manual work.
The only exceptions are the "Text File Input" and "Excel Input" steps, which can, for example, log the error rows and process only those error rows when run again.
Many steps also support error handling allowing you to reroute error rows to different steps in a transformation for automatic processing.
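
The idea of rerouting error rows, rather than failing the whole run, looks roughly like this. It is a plain Python sketch with made-up field names (`id`, `amount`), not the tool's own error handling code.

```python
def split_rows(rows, convert):
    """Apply `convert` to each row; collect failures instead of aborting."""
    good, errors = [], []
    for row in rows:
        try:
            good.append(convert(row))
        except (ValueError, KeyError) as exc:
            # Keep the original row and the reason so it can be fixed and re-run.
            errors.append({"row": row, "error": str(exc)})
    return good, errors


def parse_amount(row):
    """Example conversion: turn the text field `amount` into a number."""
    return {"id": row["id"], "amount": float(row["amount"])}


if __name__ == "__main__":
    rows = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "n/a"}]
    good, errors = split_rows(rows, parse_amount)
    print(len(good), "good rows,", len(errors), "error rows")
```

The `errors` list plays the role of the second output stream: it can be written to a file or table and processed by a separate branch.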

Can the tool restart from the point of failure?

Yes, but not automatically.
See the answer on recovering to the last checkpoint above.

Can the tool restart an entire session?

If you like, yes.  We rarely see these try/retry loops though.

Does the tool generate job logs, statistics, diagnostics, and any error reports and fixing?

Yes, although I don't know what you mean by "fixing".

Are there any additional products or add-ins that customers need to buy from your company or third party to install this product?

No, all you need is a Java Runtime Environment version 1.5 or higher.

How do you export mappings that you create in dev to the test or prod environments? Can they be migrated to test and deployed as XML files, or is there a different way?

Either as XML or directly saved into the test repository.

Does it have an event that can do something like a file watch? (E.g. you are looking for a file on the UNIX box at 2:00 am, and the job runs only when it sees the file.) What is the event called?

No, you can wait for a file, but it's not event driven.  For that you would need operating system specific drivers.  We prefer not to do that.
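
"Waiting for a file" without operating system events amounts to polling: check whether the file exists, sleep, and check again until a timeout expires. A minimal sketch of that loop, with a hypothetical function name and default intervals chosen only for illustration:

```python
import os
import time


def wait_for_file(path, timeout_s=3600.0, poll_s=5.0):
    """Poll until `path` exists; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # The current directory always exists, so this returns immediately.
    print(wait_for_file(os.getcwd(), timeout_s=1.0, poll_s=0.1))
```

The trade-off against an event-driven watch is latency: the file is noticed at most one polling interval after it appears.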

Does it have something like an "event succeeded" condition? (E.g. you have job1 and job2 running sequentially, and you want job2 to run only when job1 succeeds.) And likewise, email generation when a task fails?

Yes, as answered above.

Can the tool join heterogeneous sources, like a flat file and a database table, or two tables from two different databases? Does any transformation accomplish this?

Yes, obviously. We have various ways to do it: in-memory (Stream) lookups, merge joins, database joins, database lookup, etc.
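
To illustrate the in-memory lookup variant: the smaller source is loaded into a keyed structure in memory and every row of the main stream is matched against it, regardless of where either side came from. The sketch below uses inline CSV text for the flat file side and a hard-coded list for the database side; all names and data are made up.

```python
import csv
import io


def stream_lookup(main_rows, lookup_rows, key):
    """Add lookup fields to each main row by matching on `key` in memory."""
    index = {row[key]: row for row in lookup_rows}  # lookup side cached in RAM
    for row in main_rows:
        match = index.get(row[key], {})             # no match: add nothing
        yield {**row, **match}


# Flat file side: a CSV with orders (inline here to keep the sketch runnable).
orders = list(csv.DictReader(io.StringIO("order_id,zip\n1,9000\n2,1000\n3,0000\n")))
# Database side: rows as they might come back from a query (hypothetical data).
cities = [{"zip": "9000", "city": "Gent"}, {"zip": "1000", "city": "Brussel"}]

joined = list(stream_lookup(orders, cities, "zip"))

if __name__ == "__main__":
    for row in joined:
        print(row)
```

This works well when the lookup side fits in memory; for two large sorted inputs a merge join is the more usual choice.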

How does the tool handle SCD type 2 (history), where it has to take care of the insert/update strategy? Does it have a transformation for that?

Again, it's handled automatically by the "Dimension Lookup/Update" step.

You don't need to do anything, just fill in the dialog.
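
For readers unfamiliar with what a type 2 update involves, here is a minimal sketch of the general insert/update logic: close the current version of a dimension row when its attributes change and insert a new version. The column names (`sk`, `nk`, `version`, `valid_from`, `valid_to`) and the end-of-time date are assumptions made for this example, not the step's actual field names or implementation.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed marker for "still valid"


def apply_scd2(dimension, natural_key, attributes, today):
    """Insert a new version when attributes change; otherwise do nothing.

    `dimension` is a list of dicts with keys: sk, nk, attrs, version,
    valid_from, valid_to. Returns the surrogate key of the current version.
    """
    current = next((r for r in dimension
                    if r["nk"] == natural_key and r["valid_to"] == OPEN_END), None)
    if current is not None and current["attrs"] == attributes:
        return current["sk"]              # nothing changed: reuse the row
    if current is not None:
        current["valid_to"] = today       # close the old version
    new_row = {
        "sk": len(dimension) + 1,         # simple counter as surrogate key
        "nk": natural_key,
        "attrs": dict(attributes),
        "version": current["version"] + 1 if current else 1,
        "valid_from": today,
        "valid_to": OPEN_END,
    }
    dimension.append(new_row)
    return new_row["sk"]


if __name__ == "__main__":
    dim = []
    apply_scd2(dim, "C1", {"city": "Gent"}, date(2008, 1, 1))
    apply_scd2(dim, "C1", {"city": "Brussel"}, date(2008, 6, 1))
    for row in dim:
        print(row)
```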