Beginners FAQ
Problems starting Spoon
Q: When I start Spoon.bat in a Windows environment nothing happens. How can I analyze the problem?
A: Use the SpoonDebug.bat file to start Spoon, then check the "SpoonDebug.txt" log file for any errors.
Q: When I start Spoon I get the following error: "Could not find the main class. Program will exit". How do I start Spoon?
A: This message can appear for several reasons, but the root cause is always that the Kettle jars are not on the classpath.
- If you downloaded the binary version of Pentaho Data Integration, check whether you extracted the zip file while maintaining the directory structure: under the main directory there should be a directory called "lib" containing a file called kettle-engine-XXXXX.jar. If this is not the case, re-extract the zip file properly.
- If you fetched the sources of Pentaho Data Integration and compiled them yourself, you are probably executing the spoon script from the wrong directory. The source distribution has a directory called "assembly/package-res" that contains the scripts, but after a proper build the distribution-ready Pentaho Data Integration ends up in a directory called "dist". Start the spoon script from that directory.
Q: When I start Spoon I get one of the following errors, or something similar:
- Exception in thread "main" java.lang.NoSuchMethodError: method java.lang.Class.asSubclass with signature (Ljava.lang.Class;)Ljava.lang.Class; was not found.
- log4j:ERROR Parsing error on line 2 and column 48
- log4j:ERROR Could not parse input source
A: Since Kettle version 5 you need Java 7 (a.k.a. 1.7); download this version from Oracle. If the right version is not found on the path (verify with "java -version" on a command line), you can set it within the Spoon.bat file (see the "set PATH" line).
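For example, you could check which Java version is picked up and, if needed, point Spoon.bat at a Java 7 installation roughly as follows (the installation path below is only an assumption; adjust it to wherever your JRE or JDK actually lives):

      C:\> java -version
      java version "1.7.0_80"

      REM Near the top of Spoon.bat, prepend the Java 7 bin directory to the PATH:
      set PATH=C:\Program Files\Java\jre7\bin;%PATH%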
What's the difference between transformations and jobs?
Q: In Spoon I can make jobs and transformations, what's the difference between the two?
A: Transformations are about moving and transforming rows from source to target. Jobs are more about high-level flow control: executing transformations, sending e-mails on failure, transferring files via FTP, and so on.
Another key difference is that all the steps in a transformation execute in parallel, but the steps in a job execute in order.
Rule on mixing row 'types' on a hop in a transformation
Q: In the manuals I read that row types may not be mixed. What does that mean?
A: Not mixing row types means that every row sent over a single hop needs to have the same structure: the same field names, types, and order of fields. So something like "add an extra field if a condition is true for this row, but not otherwise" will not work, because you would get different types of rows depending on the condition. You can switch on "Enable safe mode" to explicitly check for this at runtime.
PDI checks for mixing of rows automatically at design/verify time, but "Enable safe mode" still needs to be switched on to check it at runtime (as this causes a slight processing overhead).
One way to look at this is that a hop is similar to a database table in some respects: you cannot store different types of rows in a database table either. The underlying reason is that PDI wants to be able to apply uniform, consistent transformations to your data, and having variable row structures would make this much more complex.
Technically, most steps use optimization techniques that map column names to field numbers (e.g. the column named "sid" is field 4). Having different row structures would cause these steps to break.
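As a purely illustrative example (the field names are made up), the first two row layouts below could travel over the same hop because they share one structure, while the third could not, since it adds an extra field:

      id (Integer), name (String), amount (Number)
      id (Integer), name (String), amount (Number)
      id (Integer), name (String), amount (Number), discount (Number)   <-- different structure, not allowed on the same hop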
Q: Is it possible to add/mix different meta-data into one Dummy Step?
A: You can do this, but be aware that only the meta-data of the first row is used to display all rows. Even "Safe mode", which is used to find issues with different data types, does not check for different meta-data. The reason is that PDI internally keeps all the available precision and only applies the format when viewing (preview) or when writing to a file, for example. To avoid this issue, the meta-data in all the incoming streams has to be the same.
About duplicate fieldnames in a transformation
Q: Can I duplicate field names in a single row?
A: No. PDI will complain in most cases if you have duplicate field names.
About empty strings and NULL
PDI follows Oracle in its handling of empty strings and NULLs: they are considered to be the same (e.g. in the Filter step), and empty strings are written out as NULL values, e.g. to a target table. If you find a step that doesn't follow this convention, let us know, since it's probably a bug.
When you use e.g. the Add constants step or the Generate Rows step and enter an empty string, a NULL value is created.
Since PDI 4.0, you can set a kettle property, KETTLE_EMPTY_STRING_DIFFERS_FROM_NULL=Y, to change this behavior (see also PDI-2277).
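For example, to make empty strings and NULLs differ you could add the following line to the kettle.properties file in your .kettle directory (a minimal sketch; the property name is the one mentioned above, the comment is only for illustration):

      # Treat empty strings and NULL values as different
      KETTLE_EMPTY_STRING_DIFFERS_FROM_NULL=Y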
How to copy/duplicate a field in a transformation?
Q: How do you duplicate a field in a row in a transformation?
A: There are generally many ways to accomplish any task in PDI. The easiest solution is to use the Calculator step with the "Create a copy of field A" calculation.
How to do a database join with PDI?
Q: How do you do a database join with PDI?
A: If you want to join 2 tables from the same database, you can use a "Table Input" step and do the join in SQL itself, which will provide the best possible performance.
If you want to join 2 tables that are not in the same database, you can use the "Database Join" step. A query is then executed on the target database for each input row from the main stream, which results in lower performance because of the number of queries being executed.
A third option is to execute the 2 queries in separate Table Input steps, and join the rows using a "Merge Join" step. The rows must be properly sorted before being sent to the Merge Join step, and for best performance this can be done in the SQL queries via the "ORDER BY" clause.
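As a small sketch of the first and third options (the table and column names are made up), a same-database join can be done in a single Table Input step, while the Merge Join variant runs two SELECTs in separate Table Input steps, each sorted on the join key:

      -- Option 1: join done by the database, in one Table Input step
      SELECT c.customer_id, c.name, o.order_date, o.amount
      FROM customers c
      JOIN orders o ON o.customer_id = c.customer_id;

      -- Option 3: two separate Table Input steps feeding a Merge Join step,
      -- each query sorted on the join key
      SELECT customer_id, name FROM customers ORDER BY customer_id;
      SELECT customer_id, order_date, amount FROM orders ORDER BY customer_id;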
How to sequentialize transformations?
Q: How can I make it so that 1 row gets processed completely until the end before the next row is processed?
A: This is not possible. One of the basic design principles in PDI is that all of the steps in a transformation execute in parallel; making row processing strictly sequential would require architectural changes to PDI and would also result in very slow processing.
How to make a DB connection different for different Kettle installations (e.g. sandbox/staging/production)?
Q: When you create a normal database connection, you have to edit the transformation or job to connect to a different host or database. How can I make this a variable?
A: Here are the steps to make a connection based on variables and share the connection for easier reuse:
1. In Spoon, open the transformation containing the current hardcoded form of the DB connection.
2. Expand the "Database connections" section of the navigation tree.
3. Double click on the connection you currently have defined to edit it.
4. In the "server host name" textbox, change the currently hardcoded value (e.g. fubarserver) to a variable (e.g. ${DB_HOSTNAME})
5. Save and close that dialog (Click OK)
6. Right click the connection you just edited and select the option "Share", to share it.
7. Save the transformation you had open. (Shared connections don't get written out until you save something)
8. Using a file explorer, navigate to the .kettle directory inside your home directory (e.g. "C:\Users\<username>\.kettle" for Windows, "/home/<username>/.kettle" for Linux/Unix)
9. The shared connection should now be in .kettle/shared.xml. This file can be copied and pasted to any new Kettle installation.
10. Edit the kettle.properties file using a standard text editor.
11. Create a new line in it below the comments with the name of the variable you defined in step 4.
      (The new line would read as follows if you named the variable DB_HOSTNAME: DB_HOSTNAME = localhost)
12. On any new installation, you can edit that kettle.properties file and define a new value for that variable (see the example after this list).
13. Once you have completed all of the above, either restart Kettle or select the Set environment variables option in the Edit menu.
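As a sketch of what step 12 looks like in practice (the host names are made up), each environment's kettle.properties simply defines its own value for the variable:

      # kettle.properties on the sandbox machine
      DB_HOSTNAME = localhost

      # kettle.properties on the production machine
      DB_HOSTNAME = prod-db.example.com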