Change data capture (CDC) is used to load only new or changed data from a source system. PDI has no dedicated CDC functions, but there are many ways to achieve CDC functionality within PDI.

Source Data-Based CDC

In this case you use a timestamp or a sequenced ID to identify the last loaded rows and store this information in a status table. This can even be combined with transactions: the status table holds, for all jobs and transformations, every table that needs to be in a consistent state. For each table, the last processed keys (source and target) and the status are saved. Some tables might need compound keys, depending on the ER design. It is also possible to combine this approach with Kettle's own transformation log tables and the Dates and Dependencies functionality. There is an extended example in the Pentaho Data Integration for Database Developers (PDI2000C) course, in the module ETL patterns (Patterns: Batching, Transaction V - Status Table).

...