This step executes a Python script in the CPython environment from within a PDI transformation. It accepts zero or more incoming row sets, each of which is sent to Python as a named pandas data frame. Data can be sent to Python in batches, as samples, row by row, or as all available rows.
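As a rough illustration of what a script sees, the sketch below builds a stand-in data frame by hand and then runs a small aggregation over it. The frame name rows_in, its columns, and the batch size are hypothetical and chosen only for this example; in a real transformation the step supplies the frame, under whatever name is configured for the row set. The batch loop only emulates the idea of the same script body being applied to successive slices of the input and is not a description of the step's internals.

```python
import pandas as pd

# Stand-in for a frame the step would supply (name and columns are hypothetical).
rows_in = pd.DataFrame({
    "customer": ["a", "b", "a", "c"],
    "amount": [10.0, 5.5, 2.5, 8.0],
})


def script_body(frame):
    """Example user logic: total amount per customer."""
    return frame.groupby("customer", as_index=False)["amount"].sum()


# "All available rows": the script sees the whole frame once.
totals = script_body(rows_in)

# Emulated batches of two rows: the same logic runs once per slice.
batch_size = 2  # hypothetical batch size
batch_totals = [
    script_body(rows_in.iloc[start:start + batch_size])
    for start in range(0, len(rows_in), batch_size)
]
```

Note that per-batch results differ from the all-rows result whenever a group spans more than one batch, which is worth keeping in mind when choosing how rows are sent.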

Output consists of one or more variables that are set in Python after the user's script executes. A single variable can be a data frame, in which case the columns of the frame become output fields of this step. When multiple variables are retrieved from Python, they are retrieved in string form or as PNG image data; the step automatically detects whether a variable is an image. In this mode the step outputs a single row, in which each outgoing field holds the string/serializable value of one Python variable.
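The two output modes can be pictured with a small script. This is a minimal sketch under stated assumptions: the variable names (scored, row_count, mean_amount) and the input data are hypothetical, and the final dictionary only emulates the idea of one output row with one string-valued field per variable, using str() as a stand-in for whatever conversion the step performs. Image variables are not shown.

```python
import pandas as pd

# Hypothetical input data, built by hand for the example.
rows_in = pd.DataFrame({"amount": [10.0, 5.5, 2.5, 8.0]})

# Mode 1: a single data frame variable. Its columns ("amount", "is_large")
# would become the output fields of the step.
scored = rows_in.copy()
scored["is_large"] = scored["amount"] > 6.0

# Mode 2: several plain variables set by the script.
row_count = len(rows_in)
mean_amount = float(rows_in["amount"].mean())

# Emulation only: one output row, one string field per retrieved variable.
single_row = {
    "row_count": str(row_count),
    "mean_amount": str(mean_amount),
}
```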

Installation Instructions

This step requires Python 2.7 or 3.4, along with the pandas, numpy, matplotlib and sklearn (scikit-learn) packages. The python executable must be available in the user's PATH.
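One way to check these prerequisites before configuring the step is to run a short script with the same python executable that is on the PATH. The helper below is a hypothetical convenience and not part of the step; it simply tries to import each required package and reports the ones that fail.

```python
import importlib
import sys

# Import names of the packages listed above.
REQUIRED = ("pandas", "numpy", "matplotlib", "sklearn")


def missing_packages(names=REQUIRED):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Show which interpreter was found on the PATH and what is missing.
    print("Python %d.%d at %s" % (
        sys.version_info[0], sys.version_info[1], sys.executable))
    print("Missing packages: %s" % (", ".join(missing_packages()) or "none"))
```

Saved under any file name and run as python followed by that name, it should report "Missing packages: none" when everything is in place.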

See also Mark Hall's blog post: CPython Scripting in Pentaho Data Integration