Pentaho Data Integration Steps

Introduction

This page contains the index for the documentation on all the standard steps in Pentaho Data Integration.
We invite everyone to add more details, tips and samples to the step pages.

NOTE

You may not be viewing the most up-to-date documentation for these steps. View the most recent Pentaho documentation here.

Name

Category

ID

Description

Metadata Java class
opdts = org.pentaho.di.trans.steps
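
The abbreviation above can be expanded mechanically when a fully qualified class name is needed. The following is a minimal sketch; the helper name `expand_class_name` is hypothetical and is not part of any Pentaho API.

```python
# Prefix that the "opdts" shorthand stands for, as stated in the table header.
OPDTS_PACKAGE = "org.pentaho.di.trans.steps"


def expand_class_name(short_name: str) -> str:
    """Expand the 'opdts.' shorthand into a fully qualified class name.

    Names that do not use the shorthand are returned unchanged.
    """
    prefix = "opdts."
    if short_name.startswith(prefix):
        return OPDTS_PACKAGE + "." + short_name[len(prefix):]
    return short_name
```

For example, `expand_class_name("opdts.abort.AbortMeta")` yields `org.pentaho.di.trans.steps.abort.AbortMeta`, while names already given in full (such as the BA Server steps) pass through untouched.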

Abort

Flow

Abort

Abort a transformation

opdts.abort.AbortMeta

Add a checksum

Transform

CheckSum

Add a checksum column for each input row

opdts.checksum.CheckSumMeta

Add constants

Transform

Constant

Add one or more constants to the input rows

opdts.constant.ConstantMeta

Add sequence

Transform

Sequence

Get the next value from a sequence

opdts.addsequence.AddSequenceMeta

Add value fields changing sequence

Transform

FieldsChangeSequence

Add a sequence that depends on field value changes. Each time the value of at least one field changes, PDI resets the sequence.

opdts.fieldschangesequence.FieldsChangeSequenceMeta
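
The reset-on-change behaviour described for this step can be sketched as follows. This is an illustration of the idea only, not PDI's implementation: the function name, the `seq` output field, and the `start` and `increment` parameters are assumptions made for the example.

```python
def changing_sequence(rows, key_fields, start=1, increment=1):
    """Number rows, restarting at `start` whenever any key field changes.

    rows       -- list of dicts, processed in the given order
    key_fields -- names of the fields that are watched for changes
    """
    result = []
    previous_key = object()  # sentinel that never equals a real key
    counter = start
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key != previous_key:
            counter = start          # a watched value changed: reset
            previous_key = key
        else:
            counter += increment     # same values as previous row: count on
        result.append({**row, "seq": counter})
    return result
```

With the rows `x, x, y` in the watched field, the sketch assigns the sequence values 1, 2, 1.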

Add XML

Transform

AddXML

Encode several fields into an XML fragment

opdts.addxml.AddXMLMeta

Aggregate Rows

Deprecated




Analytic Query

Statistics

AnalyticQuery

Execute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST)

opdts.analyticquery.AnalyticQueryMeta
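
As a rough illustration of what LAG and LEAD mean on a sorted dataset, the sketch below shifts a list of values backwards or forwards by an offset. It shows the general concept only; the function names are hypothetical, and using `None` for rows that have no predecessor or successor is an assumption of this example rather than documented PDI behaviour.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def lag(values: Sequence[T], offset: int = 1) -> List[Optional[T]]:
    """For each position, return the value `offset` rows earlier, or None."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]


def lead(values: Sequence[T], offset: int = 1) -> List[Optional[T]]:
    """For each position, return the value `offset` rows later, or None."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]
```

Both functions assume the input is already in the desired order, which mirrors the step's requirement for a sorted dataset.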

Append streams

Flow

Append

Append 2 streams in an ordered way

opdts.append.AppendMeta

ARFF Output

Data Mining

Arff Output

Writes data in ARFF format to a file

opdts.append.arff.ArffOutputMeta

Automatic Documentation Output

Output

AutoDoc

This step automatically generates documentation based on input in the form of a list of transformations and jobs

opdts.autodoc.AutoDocMeta

Avro Input (Deprecated)

Deprecated (pre- v.8.0)

Input (v.8.0 and after)

AvroInput

Decode binary or JSON Avro data from a file or a field

opdts.avroinput.AvroInputMeta

Avro Output

Output

AvroOutput

Encode binary or JSON Avro data to a file

opdts.avrooutput.AvroOutputMeta



Block this step until steps finish

Flow

BlockUntilStepsFinish

Block this step until selected steps finish.

opdts.blockuntilstepsfinish.BlockUntilStepsFinishMeta

Blocking Step

Flow

BlockingStep

This step blocks until all incoming rows have been processed. Subsequent steps receive only the last input row of this step.

opdts.blockingstep.BlockingStepMeta

Calculator

Transform

Calculator

Create new fields by performing simple calculations

opdts.calculator.CalculatorMeta

Call DB Procedure

Lookup

DBProc

Get back information by calling a database procedure.

opdts.dbproc.DBProcMeta

Call Endpoint

BA Server

CallEndpointStep

Calls API endpoints from the BA server within a PDI transformation.

org.pentaho.di.baserver.utils.CallEndpointMeta

Change file encoding

Utility

ChangeFileEncoding

Change file encoding and create a new file

opdts.changefileencoding.ChangeFileEncodingMeta

Cassandra input

Big Data

CassandraInput

Read from a Cassandra column family

opdts.cassandrainput.CassandraInputMeta

Cassandra output

Big Data

CassandraOutput

Write to a Cassandra column family

opdts.cassandraoutput.CassandraOutputMeta

Check if a column exists

Lookup

ColumnExists

Check if a column exists in a table on a specified connection.

opdts.columnexists.ColumnExistsMeta

Check if file is locked

Lookup

FileLocked

Check if a file is locked by another process

opdts.filelocked.FileLockedMeta

Check if webservice is available

Lookup

WebServiceAvailable

Check if a webservice is available

opdts.webserviceavailable.WebServiceAvailableMeta

Clone row

Utility

CloneRow

Clone a row as many times as needed

opdts.clonerow.CloneRowMeta

Closure Generator

Transform

ClosureGenerator

This step allows you to generate a closure table using parent-child relationships.

opdts.closure.ClosureGeneratorMeta
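
To make the idea of a closure table concrete, here is a minimal sketch that derives one from a parent-child mapping. It is not the step's implementation: the function name is hypothetical, and including each node as its own ancestor at distance 0 is a common convention assumed here, not something this page confirms for PDI.

```python
def closure_table(parent_of):
    """Build (ancestor, descendant, distance) rows from a child -> parent map.

    parent_of maps each node to its parent; root nodes map to None.
    Raises ValueError if the parent chain of a node contains a cycle.
    """
    rows = []
    for node in parent_of:
        ancestor, distance = node, 0
        seen = set()
        while ancestor is not None:
            if ancestor in seen:
                raise ValueError(f"cycle detected at {ancestor!r}")
            seen.add(ancestor)
            rows.append((ancestor, node, distance))
            ancestor = parent_of.get(ancestor)  # walk one level up
            distance += 1
    return rows
```

For the chain `root -> a -> b`, the result contains one row per ancestor of each node, so `b` appears with `b` (distance 0), `a` (distance 1), and `root` (distance 2).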

Combination lookup/update

Data Warehouse

CombinationLookup

Update a junk dimension in a data warehouse. Alternatively, look up information in this dimension. The primary key of a junk dimension consists of all its fields.

opdts.combinationlookup.CombinationLookupMeta

Concat Fields

Transform

ConcatFields

The Concat Fields step is used to concatenate multiple fields into one target field. The fields can be separated by a separator and the enclosure logic is completely compatible with the Text File Output step.

opdts.concatfields.ConcatFieldsMeta

Copy rows to result

Job

RowsToResult

Use this step to write rows to the executing job. The information will then be passed to the next entry in this job.

opdts.rowstoresult.RowsToResultMeta

CouchDB Input

Big Data

CouchDbInput

Retrieves all documents from a given view in a given design document from a given database

opdts.couchdbinput.CouchDbInputMeta

Credit card validator

Validation

CreditCardValidator

The Credit card validator step helps you determine (1) whether a credit card number is valid, using the LUHN10 (MOD-10) algorithm, and (2) which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), ...)

opdts.creditcardvalidator.CreditCardValidatorMeta
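
The LUHN10 (MOD-10) check mentioned above can be sketched in a few lines. This is a generic version of the algorithm, not the step's own code; the function name is hypothetical, and silently ignoring separators such as spaces or dashes is an assumption of this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn (MOD-10) check."""
    # Keep ASCII digits only; separators such as spaces or dashes are ignored.
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0
```

The check only detects typing errors such as a single wrong digit; it says nothing about whether the card actually exists.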

CSV file input

Input

CsvInput

Simple CSV file input

opdts.csvinput.CsvInputMeta

Data Grid

Input

DataGrid

Enter rows of static data in a grid, usually for testing, reference or demo purposes

opdts.datagrid.DataGridMeta

Data Validator

Validation

Validator

Validates passing data based on a set of rules

opdts.validator.ValidatorMeta

Database join

Lookup

DBJoin

Execute a database query using stream values as parameters

opdts.databasejoin.DatabaseJoinMeta

Database lookup

Lookup

DBLookup

Look up values in a database using field values

opdts.databaselookup.DatabaseLookupMeta

De-serialize from file

Input

CubeInput

Read rows of data from a data cube.

opdts.cubeinput.CubeInputMeta

Delay row

Utility

Delay

Output each input row after a delay

opdts.delay.DelayMeta

Delete

Output

Delete

Delete data in a database table based upon keys

opdts.delete.DeleteMeta

Detect empty stream

Flow

DetectEmptyStream

This step will output one empty row if the input stream is empty (i.e. when the input stream does not contain any rows)

opdts.detectemptystream.DetectEmptyStreamMeta

Dimension lookup/update

Data Warehouse

DimensionLookup

Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension.

opdts.dimensionlookup.DimensionLookupMeta

Dummy (do nothing)

Flow

Dummy

This step type doesn't do anything. It's useful however when testing things or in certain situations where you want to split streams.

opdts.dummytrans.DummyTransMeta

Dynamic SQL row

Lookup

DynamicSQLRow

Execute a dynamic SQL statement built in a previous field

opdts.dynamicsqlrow.DynamicSQLRowMeta

Edi to XML

Utility

TypeExitEdi2XmlStep

Converts an Edifact message to XML to simplify data extraction (Available in PDI 4.4, already present in CI trunk builds)

opdts.edi2xml.Edi2XmlMeta

ElasticSearch Bulk Insert

Bulk loading

ElasticSearchBulk

Performs bulk inserts into ElasticSearch

opdts.elasticsearchbulk.ElasticSearchBulkMeta

Email messages input

Input

MailInput

Read POP3/IMAP server and retrieve messages

opdts.mailinput.MailInputMeta

ESRI Shapefile Reader

Input

ShapeFileReader

Reads shape file data from an ESRI shape file and linked DBF file

org.pentaho.di.shapefilereader.ShapeFileReaderMeta

ETL Metadata Injection

Flow

MetaInject

This step allows you to inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions.

opdts.metainject.MetaInjectMeta

Example Step (Deprecated)

Deprecated




Execute a process

Utility

ExecProcess

Execute a process and return the result

opdts.execprocess.ExecProcessMeta

Execute row SQL script

Scripting

ExecSQLRow

Execute SQL script extracted from a field created in a previous step.

opdts.execsqlrow.ExecSQLRowMeta

Execute SQL script

Scripting

ExecSQL

Execute an SQL script, optionally parameterized using input rows

opdts.sql.ExecSQLMeta

File exists

Lookup

FileExists

Check if a file exists

opdts.fileexists.FileExistsMeta

Filter Rows

Flow

FilterRows

Filter rows using simple equations

opdts.filterrows.FilterRowsMeta

Fixed file input

Input

FixedInput

Fixed file input

opdts.fixedinput.FixedInputMeta

Formula

Scripting

Formula

Calculate a formula using Pentaho's libformula

opdts.formula.FormulaMeta

Fuzzy match

Lookup

FuzzyMatch

Find approximate matches to a string using matching algorithms. Reads a field from the main stream and outputs an approximate value from the lookup stream.

opdts.fuzzymatch.FuzzyMatchMeta

Generate random credit card numbers

Input

RandomCCNumberGenerator

Generate random valid (Luhn check) credit card numbers

opdts.randomccnumber.RandomCCNumberGeneratorMeta

Generate random value

Input

RandomValue

Generate random value

opdts.randomvalue.RandomValueMeta

Generate Rows

Input

RowGenerator

Generate a number of empty or equal rows.

opdts.rowgenerator.RowGeneratorMeta

Get data from XML

Input

getXMLData

Get data from an XML file using XPath. This step also allows you to parse XML defined in a previous field.

opdts.getxmldata.GetXMLDataMeta

Get File Names

Input

GetFileNames

Get file names from the operating system and send them to the next step.

opdts.getfilenames.GetFileNamesMeta

Get files from result

Job

FilesFromResult

This step allows you to read filenames used or generated in a previous entry in a job.

opdts.filesfromresult.FilesFromResultMeta

Get Files Rows Count

Input

GetFilesRowsCount

Get Files Rows Count

opdts.getfilesrowscount.GetFilesRowsCountMeta

Get ID from slave server

Transform

GetSlaveSequence

Retrieves unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file.

opdts.getslavesequence.GetSlaveSequenceMeta

Get previous row fields

Deprecated




Get repository names

Input

GetRepositoryNames

Lists detailed information about transformations and/or jobs in a repository

opdts.getrepositorynames.GetRepositoryNamesMeta

Get rows from result

Job

RowsFromResult

This allows you to read rows from a previous entry in a job

opdts.rowsfromresult.RowsFromResultMeta

Get Session Variables

BA Server

GetSessionVariableStep

Retrieves the value of a session variable

org.pentaho.di.baserver.utils.GetSessionVariableMeta

Get SubFolder names

Input

GetSubFolders

Read a parent folder and return all subfolders

opdts.getsubfolders.GetSubFoldersMeta

Get System Info

Input

SystemInfo

Get information from the system like system date, arguments, etc.

opdts.systemdata.SystemDataMeta

Get table names

Input

GetTableNames

Get table names from database connection and send them to the next step

opdts.gettablenames.GetTableNamesMeta

Get Variables

Job

GetVariable

Determine the values of certain (environment or Kettle) variables and put them in field values.

opdts.getvariable.GetVariableMeta

Google Analytics

Input

TypeExitGoogleAnalyticsInputStep

Fetches data from a Google Analytics account

opdts.googleanalytics.GaInputStepMeta

Google Docs Input

Input




Greenplum Bulk Loader (Deprecated)

Deprecated

GPBulkLoader

Greenplum Bulk Loader

opdts.gpbulkloader.GPBulkLoaderMeta

Greenplum Load

Bulk loading

GPLoad

Greenplum Load


Group by

Statistics

GroupBy

Builds aggregates in a group-by fashion. This works only on sorted input. If the input is not sorted, only consecutive rows with identical group keys are aggregated together.

opdts.groupby.GroupByMeta
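
The sorted-input requirement can be illustrated with Python's `itertools.groupby`, which likewise merges only adjacent rows that share a key. This is an analogy for the behaviour described above, not PDI's implementation; the helper name `sum_by_key` and the `(key, value)` row layout are assumptions made for the example.

```python
from itertools import groupby


def sum_by_key(rows):
    """Sum the values of adjacent (key, value) rows that share the same key."""
    return [(key, sum(value for _, value in group))
            for key, group in groupby(rows, key=lambda row: row[0])]


unsorted_rows = [("a", 1), ("b", 2), ("a", 3)]

# Unsorted: the two "a" rows are not adjacent, so they stay separate groups.
print(sum_by_key(unsorted_rows))

# Sorted first: all "a" rows are adjacent and are aggregated together.
print(sum_by_key(sorted(unsorted_rows)))
```

In a transformation, the equivalent of the `sorted(...)` call is to sort the stream on the group fields before it reaches the Group by step.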

GZIP CSV Input

Input

ParallelGzipCsvInput

Parallel GZIP CSV file input reader

opdts.parallelgzipcsv.ParGzipCsvInputMeta

Hadoop File Input

Big Data

HadoopFileInputPlugin

Read data from a variety of different text-file types stored on a Hadoop cluster

opdts.hadoopfileinput.HadoopFileInputMeta

Hadoop File Output

Big Data

HadoopFileOutputPlugin

Write data to a variety of different text-file types stored on a Hadoop cluster

opdts.hadoopfileoutput.HadoopFileOutputMeta

HBase input

Big Data

HbaseInput

Read from an HBase column family

opdts.hbaseinput.HBaseInputMeta

HBase output

Big Data

HbaseOutput

Write to an HBase column family

opdts.hbaseoutput.HBaseOutputMeta

HBase Row Decoder

Big Data

HBaseRowDecoder

Decodes an incoming key and HBase result object according to a mapping

opdts.hbaserowdecoder.HBaseRowDecoderMeta

HL7 Input

Input

HL7Input

Read data from HL7 data streams.

opdt.hl7.plugins.hl7input

HTTP client

Lookup

HTTP

Call a web service over HTTP by supplying a base URL and allowing parameters to be set dynamically

opdts.http.HTTPMeta

HTTP Post

Lookup

HTTPPOST

Call a web service over HTTP by supplying a base URL and allowing parameters to be set dynamically

opdts.httppost.HTTPPOSTMeta

IBM Websphere MQ Consumer (Deprecated)

Deprecated

MQInput

Receive messages from any IBM Websphere MQ Server


IBM Websphere MQ Producer (Deprecated)

Deprecated

MQOutput

Send messages to any IBM Websphere MQ Server


Identify last row in a stream

Flow

DetectLastRow

Last row will be marked

opdts.detectlastrow.DetectLastRowMeta

If field value is null

Utility

IfNull

Sets a field value to a constant if it is null.

opdts.ifnull.IfNullMeta

Infobright Loader

Bulk loading

InfobrightOutput

Load data to an Infobright database table

opdts.infobrightoutput.InfobrightLoaderMeta

Ingres VectorWise Bulk Loader

Bulk loading

VectorWiseBulkLoader

This step interfaces with the Ingres VectorWise Bulk Loader "COPY TABLE" command.

opdts.ivwloader.IngresVectorwiseLoaderMeta