Beginner FAQ
Problems starting Spoon
Q: When I start Spoon.bat in a Windows environment, nothing happens. How can I analyze the problem?
A: Edit the Spoon.bat file: 1) in the last line, replace "start javaw" with just "java"; 2) add a "pause" on the next line; 3) save and try again. The console window will now stay open and show the error message, which you can analyze using the following questions.
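As an illustration, the end of an edited Spoon.bat could look like the sketch below. The exact options and main class differ between Kettle versions, so edit the line you actually find in your own file rather than copying this one:

    REM was: start javaw %OPT% org.pentaho.di.ui.spoon.Spoon %_cmdline%
    java %OPT% org.pentaho.di.ui.spoon.Spoon %_cmdline%
    pause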
Q: When I start Spoon I get the following error: "Could not find the main class. Program will exit". How do I start Spoon?
A: You can get the above message for several reasons, but the root cause is always that the Kettle jars are not on the classpath.
- Assuming you downloaded the binary version of Pentaho Data Integration: check whether you extracted the zip file maintaining the directory structure: under the main directory there should be a directory called "lib" that contains a file called kettle.jar (in v2.5.x or lower) or two jar files with names starting with "kettle" (as of v3.0). If this is not the case, re-extract the zip file, preserving the directory structure; see the sketch after this list.
- If you fetched the sources of Pentaho Data Integration and compiled them yourself, you are probably executing the spoon script from the wrong directory. The source distribution has a directory called "bin" that contains the scripts, but when you compile the proper way, the "distribution"-ready Pentaho Data Integration ends up in a directory called "distrib". You should start the spoon script from that directory.
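For the binary case, the extracted tree should look roughly as follows (a sketch; the top-level directory and jar names vary between releases):

    Kettle\
        Spoon.bat
        spoon.sh
        lib\
            kettle.jar        (v2.5.x or lower)
            kettle-*.jar      (two jars starting with "kettle" as of v3.0)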
Q: When I start Spoon I get one of the following errors or similar:
- Exception in thread "main" java.lang.NoSuchMethodError: method java.lang.Class.asSubclass with signature (Ljava.lang.Class;)Ljava.lang.Class; was not found.
- log4j:ERROR Parsing error on line 2 and column 48
- log4j:ERROR Could not parse input source
A: Since Kettle version 3 you need Java 5 (aka 1.5); download this version from http://www.javasoft.com. If the right version is not found on the path (verify with "java -version" on a command line), you can set this within the Spoon.bat file (see the "set PATH" line).
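Sketched below: checking the version from a command prompt, and pointing Spoon.bat at a Java 5 installation (the path is an example; adjust it to where Java 5 is installed on your machine):

    REM Check which Java version is first on the path:
    java -version

    REM In Spoon.bat, prepend a Java 5 installation to the path, e.g.:
    set PATH=C:\Program Files\Java\jre1.5.0\bin;%PATH%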
What's the difference between transformations and jobs?
Q: In Spoon I can make jobs and transformations; what's the difference between the two?
A: Transformations are about moving and transforming rows from source to target. Jobs are more about high-level flow control: executing transformations, sending e-mails on failure, ftp'ing files, ...
Rule on mixing row 'types' on a hop in a transformation
Q: In the manuals I read that row types may not be mixed. What does that mean?
A: Not mixing rows means that every row sent over a single hop needs to have the same structure: the same fieldnames, types, and order of fields. So if you want to do things like "add an extra field if a condition is true for a row, but not otherwise", it will not work (because you would get different types of rows depending on the condition). You can switch on "Enable safe mode" to explicitly check for this at runtime.
As of v2.5.0, checking for mixed rows is also done automatically at design/verify time, but "Enable safe mode" still needs to be switched on to check it at runtime (as this causes a slight processing overhead).
One way to look at this is that a hop is, in some respects, very similar to a database table: you cannot store different types of rows in a database table either. Conceptually, the reason is that PDI wants to be able to do uniform/consistent transformations on your data, and having variable rows would make this much more complex.
Technically, most of the steps use optimization techniques that turn column names into indexes into the row (e.g. the column with the name "sid" is column 4); having variable rows would make this impossible (and break most of the current steps).
About duplicate fieldnames in a transformation
Q: Why can't I duplicate fieldnames in a single row?
A: You can't. PDI will complain in most cases if you have duplicate fieldnames. Before PDI v2.5.0 you were able to force duplicate fieldnames, but even then only the value of the first of the duplicate fields could ever be used.
About empty strings and NULL
PDI follows Oracle in its use of empty strings and NULLs: they are considered to be the same (e.g. in the Filter step). If you find a step that doesn't follow this convention, let us know, since it's probably a bug.
When you use e.g. the "Add constants" step or the "Generate Rows" step and enter an empty string, a NULL value is created.
How to copy/duplicate a field in a transformation?
Q: How do you duplicate a field in a row in a transformation?
A: Several solutions exist:
- Use a "Select Values" step renaming a field while selecting also the original one. The result will be that the original field will be duplicated to another name. It will look as follows:
This will duplicate fieldA to fieldB and fieldC.
- Use a "Calculator" step and use e.g. the NVL(A, B) operation, as follows:
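A sketch of the Calculator grid; with both arguments pointing at fieldA, NVL(A, B) always returns a straight copy of fieldA:

    New field   Calculation   Field A   Field B
    ---------   -----------   -------   -------
    fieldB      NVL(A, B)     fieldA    fieldA
    fieldC      NVL(A, B)     fieldA    fieldA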
This will have the same effect as the first solution: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.
- Use a JavaScript step to copy the field:
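A minimal sketch of the script body (the new fields also have to be added to the fields grid at the bottom of the step so they end up in the output row):

    // Make two copies of fieldA under new names.
    var fieldB = fieldA;
    var fieldC = fieldA;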
This will have the same effect as the previous solutions: 3 fields in the output which are copies of each other: fieldA, fieldB, and fieldC.
How to do a database join with PDI?
Q: How do you do a database join with PDI?
A: If you want to join 2 tables from the same database, use a "Table Input" step and do the join in the SQL itself; within a single database this is the fastest solution.
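A sketch of what the SQL in the "Table Input" step could look like (table and column names are invented for the example):

    SELECT c.customer_id, c.name, o.order_date, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id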
If you want to join 2 tables that are not in the same database, you can use the "Database Join" step, but you have to realize that PDI will launch a query for each input row coming from the main stream. You can find more information on this in one of the weekly tips.
A third option is to launch the 2 queries against the 2 databases and join the rows using a "Merge Join" step. Make sure you sort the data on the join key in the database, as this is in all likelihood the fastest way to sort; a sketch follows below.
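A sketch of the two "Table Input" queries feeding the "Merge Join" step (names are examples; the important part is the ORDER BY on the join key in both queries):

    -- Table Input 1 (database A):
    SELECT customer_id, name FROM customers ORDER BY customer_id

    -- Table Input 2 (database B):
    SELECT customer_id, order_date, amount FROM orders ORDER BY customer_id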
How to sequentialize transformations?
Q: By default all steps in a transformation run in parallel; how can I make it so that 1 row gets processed completely until the end before the next row is processed?
A: This is not possible: one of the basic principles of PDI transformations is that all of the steps run in parallel, so you can't sequentialize them. This would require architectural changes to PDI, and sequential processing would also make processing very slow.
How to make a DB connection different for different Kettle installations (e.g. sandbox/staging/production)?
Q: When you create a normal database connection, you have to edit the transformation or job to connect to a different host or database. How can I make this a variable?
A: Here are the steps to make a connection based on variables and share the connection for easier reuse:
1. In Spoon, open the transformation containing the current hardcoded form of the DB connection.
2. Expand the "Database connections" section of the navigation tree.
3. Double click on the connection you currently have defined to edit it.
4. In the "server host name" textbox, change the currently hardcoded value (e.g. fubarserver) to a variable (e.g. ${DB_HOSTNAME})
5. Save and close that dialog (Click OK)
6. Right click the connection you just edited and select the option "Share", to share it.
7. Save the transformation you had open. (Shared connections don't get written out until you save something)
8. Using a file explorer, navigate to the .kettle directory inside your home directory (e.g. "C:\Documents and Settings\<username>\.kettle" on Windows, "/home/<username>/.kettle" on Linux/Unix).
9. The shared connection should now be in .kettle/shared.xml. This file can be copied and pasted to any new Kettle installation.
10. Edit the kettle.properties file using a standard text editor.
11. Create a new line in it, below the comments, with the name of the variable you defined in step 4.
(If you named the variable DB_HOSTNAME, the new line would read: DB_HOSTNAME = localhost)
12. On any new installation you can edit that kettle.properties file and define a new value for that variable, as sketched after this list.
13. Once you have completed all of the above, either restart Kettle or select the "Set environment variables" option in the Edit menu.
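As an illustration, the relevant kettle.properties line on two different machines could look like this (the hostnames are examples):

    # kettle.properties on the development machine:
    DB_HOSTNAME = localhost

    # kettle.properties on the production machine:
    DB_HOSTNAME = proddb01.example.com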