Performance considerations
Speed can be boosted by using a few simple JDBC connection settings (published in the blog entry "Increase MySQL output to 80K rows/second in Pentaho Data Integration" by Julien Hofstede):
useServerPrepStmts=false
rewriteBatchedStatements=true
useCompression=true
These options can be entered in PDI on the database connection: double-click the connection, go to the Options tab and set these values.
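For reference, here is a minimal sketch of the same options passed directly on a JDBC URL (host, database and credentials are hypothetical placeholders); in PDI you would normally enter them on the Options tab instead:

import java.sql.Connection;
import java.sql.DriverManager;

public class FastMySqlConnection {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, database and credentials; the three options below are
        // the ones recommended above for faster batched inserts.
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?useServerPrepStmts=false"
                + "&rewriteBatchedStatements=true"
                + "&useCompression=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected with batch-friendly options");
        }
    }
}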
Communications link failure due to underlying exception
For long-running queries on MySQL, you might get this exception:
Communications link failure due to underlying exception: ** BEGIN NESTED EXCEPTION ** java.io.EOFException STACKTRACE: java.io.EOFException at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1934) etc.
The MySQL server parameter net_write_timeout defaults to 60 seconds.
I found that increasing this to a larger value makes the exception go away.
That way we do not have to turn off result set streaming (cursor emulation), which would force the entire result set to be read into memory.
This parameter is set in my.ini. I set it to 1800, but a smaller value would probably suffice.
Please also see the chapter with the nice title "MySQL server has gone away" over here:
http://dev.mysql.com/doc/refman/5.1/en/gone-away.html
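For a quick check without editing my.ini, the variable can also be inspected, and raised for the current session, over JDBC. A minimal sketch, assuming a reachable MySQL instance and hypothetical credentials:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NetWriteTimeoutCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL and credentials.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement st = con.createStatement()) {
            // Show the current value (the default is 60 seconds).
            try (ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'net_write_timeout'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
                }
            }
            // Raise it for this session only; the permanent change belongs in my.ini.
            st.execute("SET SESSION net_write_timeout = 1800");
        }
    }
}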
Another reported solution: set the tcpKeepAlive option on the MySQL connection (available since MySQL Connector/J 5.0.7; see also the section on JDBC driver versions below).
JDBC driver versions
We have found several times that v5.x of the MySQL JDBC driver does not always play nicely with Kettle.
User vanayka found the same in this thread: http://forums.pentaho.org/showthread.php?p=184717
Using an alias
The behaviour of getColumnLabel and getColumnName changed in JDBC driver versions 4.0 and later.
This affects, for example, SQL queries that use an alias for a field name, like:
SELECT column AS aliasName from table
With JDBC driver versions 4.0 and later you may therefore not get the aliasName back.
PDI-2905 addresses this issue; see also http://bugs.mysql.com/bug.php?id=35610:
"In Connector/J 5.1 the behaviour was changed, so that instead of allowing only index
number or column label, only index number or column name are accepted."
This bug was fixed by MySQL and led to the changed JDBC behaviour in this regard.
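The sketch below illustrates the difference: it runs an aliased query and prints both getColumnName and getColumnLabel for the first column. Per the JDBC 4.0 specification the label carries the alias while the name refers to the underlying column; connection details, table and column names (some_table, some_column) are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class AliasMetaDataCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL, credentials, table and column names.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT some_column AS aliasName FROM some_table")) {
            ResultSetMetaData md = rs.getMetaData();
            // With JDBC 4.0 semantics getColumnLabel() should return the alias ("aliasName")
            // and getColumnName() the underlying column ("some_column").
            System.out.println("name  = " + md.getColumnName(1));
            System.out.println("label = " + md.getColumnLabel(1));
        }
    }
}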
Character encoding issues
Make sure that you set the correct JDBC options, for example in the case of a Unicode target database:
useUnicode=true
characterEncoding=utf8
characterSetResults=utf8
Even if your target database is not Unicode, it makes sense to set the encodings explicitly whenever the source and target encodings differ.
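As a minimal sketch (hypothetical host, database and credentials), this is how the encoding options look on a JDBC URL; in PDI they again belong on the Options tab:

import java.sql.Connection;
import java.sql.DriverManager;

public class Utf8Connection {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, database and credentials; the options force explicit UTF-8 handling.
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?useUnicode=true"
                + "&characterEncoding=utf8"
                + "&characterSetResults=utf8";
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected with explicit encoding options");
        }
    }
}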
Exception: Lock wait timeout exceeded; try restarting transaction
This is a MySQL issue; the following solutions have been reported:
Change the value of innodb_lock_wait_timeout in my.ini to a higher value (the default is 50 seconds). To find out what your current lock wait timeout is, run "show variables like 'innodb_lock_wait%';"
If you are running a MySQL cluster: the innodb_lock_wait_timeout variable has no bearing on NDB tables. This timeout is administered within the NDBD nodes. You need to update the config.ini file with a higher value for TransactionDeadlockDetectionTimeout, see http://dev.mysql.com/doc/mysql-cluster-excerpt/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-param-ndbd-definition-transactiondeadlockdetectiontimeout
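A small sketch for checking the current values from JDBC (hypothetical URL and credentials); the permanent change still belongs in my.ini:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LockWaitTimeoutCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL and credentials.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'innodb_lock_wait%'")) {
            while (rs.next()) {
                System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
            }
        }
    }
}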
Issues with MySQL tinyints
We got a report of a tinyint issue in PDI-8861. Unfortunately, we did not get confirmation whether this was solved, but we keep it here for the record:
The proposed workaround was to set the transformedBitIsBoolean or tinyInt1isBit connection options.
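For illustration only, a sketch of how these options would be passed on a connection URL. Host, database and credentials are hypothetical, and the option values are assumptions, since the report did not confirm which combination helps:

import java.sql.Connection;
import java.sql.DriverManager;

public class TinyIntOptionsExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, database and credentials. The option values are only an
        // illustration; which combination is appropriate depends on whether TINYINT(1)
        // columns should come back as integers or as booleans/bits.
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?tinyInt1isBit=false"
                + "&transformedBitIsBoolean=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected with explicit TINYINT(1) handling options");
        }
    }
}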
Getting the meta data of a table with a large amount of rows could be slow
As reported in PDI-9457, in the Table input step, using "Get SQL select statement" and confirming "Do you want to include the field names in the SQL?" can be very slow on tables with a large number of rows.
This can also happen in other scenarios when the meta data is retrieved for the first time.
After researching this issue, we found that this is a MySQL bug, reported here: http://bugs.mysql.com/bug.php?id=64621
Proposed workaround:
set connection property useServerPrepStmts = true
see http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-configuration-properties.html
Further information from the MySQL case: "As for drawbacks of using this option, there are none. As a matter of fact, this is the preferred way. We discouraged its use long ago while the code in the server was still flaky. Now that it's stable you should definitely use SS PS for your work."
This appears to be resolved in JDBC driver version 5.1.19.
Additional info: This is only an issue for the first query; subsequent meta data queries will be served from the DB cache.
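A minimal sketch of the workaround (hypothetical URL, credentials and table name); note that it flips the same property that the bulk-loading tip at the top of this page sets to false, so choose the value that matters for the workload at hand:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSetMetaData;

public class FastMetaDataLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, database, credentials and table name.
        String url = "jdbc:mysql://localhost:3306/mydb?useServerPrepStmts=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement("SELECT * FROM some_table")) {
            // Read the field layout from the prepared statement's meta data.
            ResultSetMetaData md = ps.getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                System.out.println(md.getColumnName(i) + " : " + md.getColumnTypeName(i));
            }
        }
    }
}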