Kettle dependency management

These notes are from a short spike I did on adding dependency management to Kettle.

Releasing Kettle as a product is a pretty interesting situation. Many of the jars checked into Kettle come from other Pentaho projects, some of which also depend on Kettle. These jars must be built at release time and then checked back into Kettle. If a core library is updated, we have to remember to check the new version into Kettle (if desired).

The bi-server, Pentaho Report Designer and Pentaho Metadata Editor are built with Kettle as a server/engine resource. Since those projects have graduated from the days of having binary jars checked into the source repository, they explicitly list Kettle and all of its dependencies using ivy, which can resolve these artifacts. The artifacts are specified by version, which can be either the latest version (a snapshot) or a specific released version. When Kettle's dependencies are updated, we have to audit the libext folder for changes so we can update the other projects.
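As a rough illustration of the two ways a downstream project can specify a version, the fragment below shows one dependency on a dynamic "latest" revision and one pinned to a fixed release. The organisation, module names and revision numbers are placeholders chosen for the example, not values copied from an actual Pentaho ivy.xml; `latest.integration` is a standard Ivy dynamic revision.

```xml
<ivy-module version="2.0">
  <!-- Hypothetical downstream project; all names here are placeholders -->
  <info organisation="example" module="downstream-project"/>
  <dependencies>
    <!-- Track the newest published build (snapshot-style) -->
    <dependency org="pentaho-kettle" name="kettle-engine" rev="latest.integration"/>
    <!-- Pin to a specific released version (placeholder revision) -->
    <dependency org="pentaho-kettle" name="kettle-core" rev="4.0.0"/>
    <!-- Kettle's own dependencies would be listed here as well, one entry per jar -->
  </dependencies>
</ivy-module>
```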

Pentaho has created a set of ant scripts, called subfloor, which can automatically install ivy, resolve jars (or other artifacts), and build and publish artifacts. Kettle has been upgraded to use subfloor, which simply means that build.xml inherits the subfloor build script. Subfloor resolves jars using ivy, which retrieves them from the pentaho artifactory (http://repository.pentaho.org/artifactory/) or the ibiblio maven2 repository. The ibiblio repository is used for most 3rd party jars (e.g. apache-commons). The pentaho repository is used for internal pentaho projects and for third-party libraries not available on ibiblio.

In order to resolve Kettle's dependencies we have to create a list of them in ivy.xml. This file explicitly lists every jar and does not rely on transitive dependencies. This means that the libext folder maps 1-1 to the ivy.xml file. When the resolver finishes, the resulting set of jars should not be magic; it should be quite natural and logical.

Having said this, one of the goals was to investigate the possibility of removing all jars checked into Kettle. libext has now been pruned, as well as test/libext. test/libext is pruned by way of specifying a test configuration for ivy.
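A minimal sketch of what such an ivy.xml might look like is shown below. It is not the actual Kettle file: the module info, the two libraries and their revisions are assumptions picked for illustration. What it does show is the pattern described above, with each jar listed individually, `transitive="false"` so nothing arrives implicitly, and a separate `test` configuration for jars that would otherwise live in test/libext.

```xml
<ivy-module version="2.0">
  <!-- Module info is assumed for this sketch -->
  <info organisation="pentaho-kettle" module="kettle"/>
  <configurations>
    <!-- Jars that would have lived in libext -->
    <conf name="default"/>
    <!-- Jars that would have lived in test/libext -->
    <conf name="test" visibility="private"/>
  </configurations>
  <dependencies defaultconf="default->default">
    <!-- One entry per jar; transitive="false" keeps the list 1-1 with libext -->
    <dependency org="commons-lang" name="commons-lang" rev="2.4" transitive="false"/>
    <!-- Test-only jar, resolved only for the test configuration -->
    <dependency org="junit" name="junit" rev="4.7" transitive="false" conf="test->default"/>
  </dependencies>
</ivy-module>
```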

More about Ivy...

Apache Ivy™ is a popular dependency manager focusing on flexibility and simplicity. Find out more about its unique enterprise features, what people say about it, and how it can improve your build system!

Using Ivy/Kettle with an IDE

First, check out the development branch of Kettle into your favorite IDE:

svn://source.pentaho.org/svnkettleroot/Kettle/branches/mdd-ivy/

If you are using the ivyde plugin for Eclipse, simply check out the source code for Kettle and ivyde will automatically resolve and build the project.

If you are not using ivyde, you can still quickly and easily get up and running.

  1. Invoke the ant target 'resolve'. This will populate a lib folder called 'resolved-libs'.
  2. Update the classpath using either of the following methods:
    1. manually add these jars to the classpath using your IDE, or
    2. invoke the ant target 'create-dot-classpath', which will modify your .classpath (be sure to refresh the project to pick up the changes).

Building Kettle

Although there are plenty of things we can take advantage of in the future, the Kettle build has been largely preserved. You may invoke the ant target 'distrib' as you have done in the past. distrib depends on resolve, so the latest jars are resolved automatically before the build.

As always, you can check out Kettle and immediately invoke "ant distrib".

I have also added the ability to include Kettle plugins in the build process. Kettle plugins, such as those written by Pentaho for Hadoop, are now resolved as plugins and installed into the distribution at build time. This approach can be used to guarantee that certain plugins are always included in Kettle builds without having to check the plugin source code or artifacts directly into Kettle.
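One way this could be expressed in ivy.xml is with a dedicated configuration for plugins, sketched below. The plugin organisation, name, revision and the zip packaging are all hypothetical; the fragment only shows how a plugin artifact can be kept apart from ordinary library jars so the build can treat it differently. How the build then installs the resolved plugin into the distribution is not shown here.

```xml
<ivy-module version="2.0">
  <!-- Module info is assumed for this sketch -->
  <info organisation="pentaho-kettle" module="kettle"/>
  <configurations>
    <conf name="default"/>
    <!-- Artifacts in this configuration are plugins, not classpath jars -->
    <conf name="plugin"/>
  </configurations>
  <dependencies>
    <!-- Hypothetical plugin, assumed to be published as a zip archive -->
    <dependency org="example" name="example-kettle-plugin" rev="1.0.0"
                transitive="false" conf="plugin->default">
      <artifact name="example-kettle-plugin" type="zip" ext="zip"/>
    </dependency>
  </dependencies>
</ivy-module>
```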

Work In Progress

The ivy.xml file is still a work in progress.  Here are the remaining items that we are still working on:

pentaho-database - This is a commons project that depends on kettle-db, but is used by kettle-ui. This requires Kettle to be half-built, pentaho-database to be published or checked in, and then the rest of Kettle to be compiled. We're going to need to either bring the database project into Kettle or remove the Kettle dependency from the database project before we can build Kettle in a single pass. Right now, the way the dependency is defined in ivy.xml causes a circular resolve dependency.

swt - The SWT jars are still not a part of the ivy.xml file. We'll be working on phasing those into ivy as well.

library configurations - Each Kettle library (kettle-db, kettle-core, etc.) should have its own list of dependencies defined in the ivy.xml file. This way, downstream users of these libraries can inherit just the specific dependencies they need, instead of inheriting the entire list of Kettle dependencies.
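A possible shape for this, assuming one Ivy configuration per library, is sketched below. The library-to-jar assignments, the revisions, and the assumption that kettle-db builds on kettle-core are illustrative only.

```xml
<ivy-module version="2.0">
  <!-- Module info is assumed for this sketch -->
  <info organisation="pentaho-kettle" module="kettle"/>
  <configurations>
    <conf name="kettle-core"/>
    <!-- Assumes kettle-db needs everything kettle-core needs -->
    <conf name="kettle-db" extends="kettle-core"/>
  </configurations>
  <dependencies>
    <!-- Needed by kettle-core, and by kettle-db through "extends" -->
    <dependency org="commons-lang" name="commons-lang" rev="2.4"
                transitive="false" conf="kettle-core->default"/>
    <!-- Needed only by kettle-db -->
    <dependency org="commons-dbcp" name="commons-dbcp" rev="1.2.2"
                transitive="false" conf="kettle-db->default"/>
  </dependencies>
</ivy-module>
```

With a layout like this, a downstream project could map its own configuration to a single library, for example `conf="default->kettle-db"`, and receive only the jars that library needs.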

checked-in plugins - The currently checked-in plugins (DummyJob, DummyPlugin, S3CsvInput, ShapeFileReader3, and versioncheck) should all be moved into the ivy "plugin" configuration.

Feedback Needed

We realize this is a significant development change, so we want to get it right before it becomes standard.  Please check out the source and let us know what you think!