Group By
PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.
Description
This step allows you to calculate values over a defined group of fields. Examples of common use cases are: calculate the average sales per product or get the number of yellow shirts that we have in stock.
Note: This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly. When using the Sort step, this works fine. When you sorted the data outside of PDI, you may run into issues with the internal case sensitive/insensitive flag. Further information can be found on PDI-11440 (illustrating an issue with the Merge join step, but the issue is valid for this step as well).
See also the Memory Group by step to handle non sorted input.
Options
The following table provides a description of the options available for the Group By step:
Option | Description |
---|---|
Step name | Name of the step; this name has to be unique in a single transformation |
Include all rows? | Enable if you want all rows in the output, not just the aggregation; to differentiate between the two types of rows in the output, a flag is required in the output. You must specify the name of the flag field in that case (the type is boolean). |
Temporary files directory | The directory in which the temporary files are stored (needed when the Include all rows option is enabled and the number or grouped rows exceed 5000 rows); the default is the standard temporary directory for the system |
TMP-file prefix | Specify the file prefix used when naming temporary files |
Add line number, restart in each group | Enable to add a line number that restarts at 1 in each group |
Line number field name | Enable to add a line number that restarts at 1 in each group |
Always give back a row | If you enable this option, the Group By step will always give back a result row, even if there is no input row. |
Group fields table | Specify the fields over which you want to group. Click Get Fields to add all fields from the input stream(s). |
Aggregates table | Specify the fields that must be aggregated, the method and the name of the resulting new field.
|
Metadata Injection Support
All fields of this step support metadata injection. You can use this step with ETL Metadata Injection to pass metadata to your transformation at runtime.
Note: The metadata injection values for the aggregation type are (in respective order): SUM, AVERAGE, MEDIAN, PERCENTILE, MIN, MAX, COUNT_ALL, CONCAT_COMMA, FIRST, LAST, FIRST_INCL_NULL, LAST_INCL_NULL, CUM_SUM, CUM_AVG, STD_DEV, CONCAT_STRING, COUNT_DISTINCT, COUNT_ANY
Examples
These are the examples that are available in our distribution:
samples/transformations/Group By - Calculate standard deviation.ktr samples/transformations/Group by - include all rows and calculations .ktr samples/transformations/Group By - include all rows without a grouping.ktr