Unique Rows (HashSet)

PLEASE NOTE: This documentation applies to Pentaho 8.0 and earlier. For Pentaho 8.1 and later, see Unique Rows (HashSet) on the Pentaho Enterprise Edition documentation site.

Description

The Unique Rows (HashSet) transformation step tracks exact duplicate rows. The step can also remove duplicate rows and leave only unique occurrences. The Unique Rows transformation step detects duplicates only when they are consecutive, so it needs sorted input to catch all of them. The Unique Rows (HashSet) step does not require sorted input; instead, it tracks duplicates in memory.
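The difference between the two approaches can be illustrated with a minimal Python sketch. This is not Pentaho code. The function names `unique_consecutive` and `unique_hashset` are hypothetical and only mimic the behavior described above: the first compares each row with the one immediately before it, and the second remembers every row it has seen in an in-memory set.

```python
from itertools import groupby


def unique_consecutive(rows):
    """Drop a row only when it equals the row immediately before it."""
    return [key for key, _ in groupby(rows)]


def unique_hashset(rows):
    """Drop a row if it appeared anywhere earlier; keep first occurrences in order."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Unsorted input: the duplicates of ("a", 1) are not adjacent.
rows = [("a", 1), ("b", 2), ("a", 1), ("b", 2), ("b", 2)]

consecutive_result = unique_consecutive(rows)  # misses the non-adjacent duplicates
hashset_result = unique_hashset(rows)          # finds them without sorting
```

On this unsorted input, the consecutive comparison removes only the final repeated row, while the set-based version keeps one copy of each distinct row. The trade-off is memory: the set grows with the number of distinct rows.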

Option	Definition
Step name	Name of this step as it appears in the transformation workspace
Compare using stored row values	Stores values for the selected fields in memory for every record. Storing row values requires more memory, but it prevents possible false positives if there are hash collisions.
Redirect duplicate row	Processes duplicate rows as errors and redirects them to the error stream of the step. Requires you to set error handling for this step.
Error description	Sets the error handling description to display when duplicate rows are detected. Only available when Redirect duplicate row is checked.
Fields to compare table	Lists the fields to compare. If the table has no entries, the step compares the entire row.
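As a rough illustration of how these options interact, the following Python sketch models them with a hypothetical function, `split_duplicates`. It is an assumption-laden toy, not the step's actual implementation: the parameter names (`fields`, `store_values`, `hash_fn`) and the deliberately collision-prone `weak_hash` are invented for the example, and the returned `duplicates` list merely stands in for the error stream.

```python
def weak_hash(values):
    """Deliberately collision-prone hash, used only to force a collision."""
    return sum(len(str(v)) for v in values) % 8


def split_duplicates(rows, fields=None, store_values=True, hash_fn=hash):
    """Split rows (dicts) into first occurrences and duplicates.

    fields       -- field names to compare; None or empty compares the entire row
    store_values -- True keeps the compared values, False keeps only their hashes
    hash_fn      -- hash function applied to the tuple of compared values
    """
    seen_hashes = set()  # used when only hashes are kept
    seen_keys = {}       # hash -> list of stored value tuples
    unique, duplicates = [], []
    for row in rows:
        # Iterating a dict yields its keys, so an empty field list means "all fields".
        key = tuple(row[f] for f in (fields or row))
        h = hash_fn(key)
        if store_values:
            bucket = seen_keys.setdefault(h, [])
            is_duplicate = key in bucket  # compare the real values, not just hashes
            if not is_duplicate:
                bucket.append(key)
        else:
            is_duplicate = h in seen_hashes  # a hash collision looks like a duplicate
            seen_hashes.add(h)
        (duplicates if is_duplicate else unique).append(row)
    return unique, duplicates


rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Rome"},
    {"id": 3, "city": "Oslo"},
]

# Compare on "city" only, keeping the stored values: row 3 is the only duplicate.
stored_unique, stored_dups = split_duplicates(
    rows, fields=["city"], store_values=True, hash_fn=weak_hash)

# Same comparison with hashes only: "Oslo" and "Rome" collide under weak_hash,
# so row 2 is wrongly reported as a duplicate (a false positive).
hash_unique, hash_dups = split_duplicates(
    rows, fields=["city"], store_values=False, hash_fn=weak_hash)

# No fields listed: the entire row is compared, so all three rows are unique.
all_unique, all_dups = split_duplicates(rows)
```

With `weak_hash`, both city names hash to the same value, which shows why storing the row values costs more memory but avoids false positives when hashes collide.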