PDI Logging

Since version 4, Kettle uses a central log store and a logging registry that you can interact with.

Here are some of the advantages:

Log separation between objects (transformations, jobs, ...) on the same server
Central logging store infrastructure with central memory management
Logging data lineage so we know where each log row comes from
Incremental log updates from the central log store
Error line identification to allow for color coding

The Central Log Store

The complete PDI 4.0 logging infrastructure comes in the form of a single Log4J Appender that keeps all rows in memory and keeps track of where every log row comes from.
For example, if you are running a transformation "Trans trans;" you can get the logging text like this:

Log4jBufferAppender appender = CentralLogStore.getAppender();
String logText = appender.getBuffer(trans.getLogChannelId(), false).toString();

As you can see from the example, every runnable object in Kettle has a log channel ID, so you can ask for the logging text of steps, job entries, databases and so on.

One thing to look out for: make sure to discard your log lines once you no longer need them. You can use the following construct:

CentralLogStore.discardLines(trans.getLogChannelId(), false);
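
To make the idea of a channel-aware buffer more concrete, here is a minimal, self-contained sketch of a store that keeps every line in memory, remembers which channel it came from, supports incremental reads and discards lines per channel. It is an illustrative model only and not Kettle's actual implementation: SketchLogStore, append, getText and size are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only, NOT Kettle's CentralLogStore. All names are hypothetical.
final class SketchLogStore {
    // One buffered line: a global sequence number, the originating channel and the text.
    static final class Line {
        final int nr;
        final String channelId;
        final String text;

        Line(int nr, String channelId, String text) {
            this.nr = nr;
            this.channelId = channelId;
            this.text = text;
        }
    }

    private final List<Line> lines = new ArrayList<>();
    private int lastNr = 0;

    // Stores a line for the given channel and returns its sequence number.
    synchronized int append(String channelId, String text) {
        lines.add(new Line(++lastNr, channelId, text));
        return lastNr;
    }

    // Returns the text of one channel, restricted to lines newer than afterNr.
    // Passing 0 returns everything; passing the last number seen gives an incremental update.
    synchronized String getText(String channelId, int afterNr) {
        StringBuilder sb = new StringBuilder();
        for (Line line : lines) {
            if (line.channelId.equals(channelId) && line.nr > afterNr) {
                sb.append(line.text).append('\n');
            }
        }
        return sb.toString();
    }

    // Drops all lines of one channel so they no longer occupy memory.
    synchronized void discardLines(String channelId) {
        lines.removeIf(line -> line.channelId.equals(channelId));
    }

    // Number of lines currently held for all channels together.
    synchronized int size() {
        return lines.size();
    }
}
```

In this model a caller that polls for new log text only has to remember the last sequence number it received, which is the essence of incremental log updates.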

Logging levels

Since PDI version 4 it is no longer possible to change the logging level while a transformation or job is running. That is because every object that is executed keeps its own log level for the duration of the execution. Obviously this was implemented to allow different transformations and jobs to run with different logging levels on the same Carte, DI or BI server.

That means that even though it is not yet implemented in the Spoon user interface, it is now possible to change the logging level of individual steps, job entries and database connections. For example:

TableOutput tableOutput = (TableOutput)trans.findBaseSteps("Table Output").get(0);
tableOutput.setLogLevel(LogLevel.DEBUG);

or:

TableOutputData tableOutputData = (TableOutputData) trans.findDataInterface("Table Output");
tableOutputData.db.setLogLevel(LogLevel.DEBUG);

It is expected that coming releases will add features to the UI for specifying different log levels for different objects, as this would help in identifying problems in specific steps and database connections.
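
The effect of every object keeping its own log level can be sketched with a small stand-alone model. The level names below mirror Kettle's LogLevel values, but SketchLogChannel and its methods are hypothetical and only illustrate the filtering idea; the real classes are more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only. SketchLogChannel is a hypothetical name, not a Kettle class.
final class SketchLogChannel {
    // Levels ordered from least to most verbose.
    enum Level { NOTHING, ERROR, MINIMAL, BASIC, DETAILED, DEBUG, ROWLEVEL }

    private final String name;
    private Level level;
    private final List<String> lines = new ArrayList<>();

    SketchLogChannel(String name, Level level) {
        this.name = name;
        this.level = level;
    }

    // Changes the level of this object only; other channels are unaffected.
    void setLevel(Level level) {
        this.level = level;
    }

    // Keeps a message only if it is no more verbose than this channel's own level.
    void log(Level messageLevel, String text) {
        if (messageLevel != Level.NOTHING && messageLevel.ordinal() <= level.ordinal()) {
            lines.add(name + " - " + text);
        }
    }

    List<String> lines() {
        return lines;
    }
}
```

With two channels, one at BASIC and one at DEBUG, the same DEBUG message is dropped by the first and kept by the second, so a single noisy step does not force verbose logging everywhere.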

The Logging Registry

Pentaho Data Integration doesn't only keep track of each log line, it also knows where it came from. Objects like transformations, jobs, steps, databases and so on register themselves with the logging registry when they start. That process also includes leaving a bread-crumb trail from parent to child.

For example, it is possible to ask the logging registry for all the children of a transformation:

LoggingRegistry loggingRegistry = LoggingRegistry.getInstance();
List<String> childChannelIds = loggingRegistry.getLogChannelChildren(trans.getLogChannelId());

It is this information that is logged into the "log channel" log table and it gives you complete insight into the execution lineage of transformations and jobs.
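
The bread-crumb trail from parent to child can be pictured as a simple tree of channel IDs. The sketch below is a hypothetical, stand-alone model (SketchLoggingRegistry, register, getDescendants and nameOf are invented for this example) and uses a counter for IDs purely to keep the output predictable; it makes no claim about how Kettle generates or stores its IDs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only, NOT Kettle's LoggingRegistry. All names are hypothetical.
final class SketchLoggingRegistry {
    private final Map<String, String> names = new HashMap<>();
    private final Map<String, List<String>> children = new HashMap<>();
    private int counter = 0;

    // Registers an object and returns its new channel ID.
    // A non-null parentChannelId records the parent-to-child link.
    String register(String objectName, String parentChannelId) {
        String id = "ch-" + (++counter);
        names.put(id, objectName);
        if (parentChannelId != null) {
            children.computeIfAbsent(parentChannelId, k -> new ArrayList<>()).add(id);
        }
        return id;
    }

    // Returns every channel below the given one, depth first (children, grandchildren, ...).
    List<String> getDescendants(String channelId) {
        List<String> result = new ArrayList<>();
        for (String child : children.getOrDefault(channelId, Collections.<String>emptyList())) {
            result.add(child);
            result.addAll(getDescendants(child));
        }
        return result;
    }

    // Looks up the object name that registered the channel, or null if unknown.
    String nameOf(String channelId) {
        return names.get(channelId);
    }
}
```

Registering a transformation, then a step under it, then a database connection under that step yields a chain that can be walked from the transformation down, which is the kind of lineage the log channel table records.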

Memory Management

As you can imagine, keeping log lines in memory indefinitely will cause memory to run out over time, especially if a data integration user wants to log at a very detailed logging level.
If you don't want to discard lines (for example on a DI server where you don't know when the user will be querying the log text) you can set a log line time-out and a maximum size for the central log buffer. These options are available with Pan/Kitchen/Carte-XML/DI Server and also as a number of environment variables:

Variable name	Description	Default
KETTLE_MAX_LOG_SIZE_IN_LINES	The maximum number of log lines that are kept internally by Kettle. Set to 0 to keep all rows	0 (<4.2) 5000 (>=4.2)
KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES	The maximum age (in minutes) of a log line while being kept internally by Kettle. Set to 0 to keep all rows indefinitely	0 (<4.2) 1440 (>=4.2)
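
How a maximum size and a time-out could work together is sketched below. This is a simplified, hypothetical model (SketchBoundedLog, parseSetting and trim are invented names), not the code Kettle uses; it only mirrors the semantics described in the table, where 0 means "no limit". Timestamps are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative model only. All names are hypothetical and not part of Kettle.
final class SketchBoundedLog {
    private static final class Entry {
        final long timeMs;
        final String text;

        Entry(long timeMs, String text) {
            this.timeMs = timeMs;
            this.text = text;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();
    private final int maxLines;  // 0 keeps all lines
    private final long maxAgeMs; // 0 keeps lines indefinitely

    SketchBoundedLog(int maxLines, int maxAgeMinutes) {
        this.maxLines = maxLines;
        this.maxAgeMs = maxAgeMinutes * 60_000L;
    }

    // Parses a setting such as the value of KETTLE_MAX_LOG_SIZE_IN_LINES,
    // falling back to the default when it is missing or not a number.
    static int parseSetting(String value, int defaultValue) {
        if (value == null || value.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // Adds a line stamped with the given time and then enforces both limits.
    void add(long nowMs, String text) {
        entries.addLast(new Entry(nowMs, text));
        trim(nowMs);
    }

    // Removes the oldest lines while the buffer is too large or the lines are too old.
    void trim(long nowMs) {
        while (maxLines > 0 && entries.size() > maxLines) {
            entries.removeFirst();
        }
        while (maxAgeMs > 0 && !entries.isEmpty()
                && nowMs - entries.peekFirst().timeMs > maxAgeMs) {
            entries.removeFirst();
        }
    }

    int size() {
        return entries.size();
    }
}
```

A caller could, for example, build the buffer from parseSetting(System.getenv("KETTLE_MAX_LOG_SIZE_IN_LINES"), 5000) and parseSetting(System.getenv("KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES"), 1440); that wiring is an assumption made for this sketch.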

The following options were introduced in 4.2.0-M1 to further keep memory usage under control while using repeat loops in jobs and so on:

Variable name	Description	Default
KETTLE_MAX_JOB_ENTRIES_LOGGED	The maximum number of job entry results kept in memory for logging purposes.	1000
KETTLE_MAX_JOB_TRACKER_SIZE	The maximum number of job trackers kept in memory	1000
KETTLE_MAX_LOGGING_REGISTRY_SIZE	The maximum number of logging registry entries kept in memory for logging purposes.	1000

4.2.0 also sets sane defaults on all these values to make sure that by default you don't run out of memory.