Regex Evaluation

PLEASE NOTE: This documentation applies to an earlier version. For the most recent documentation, visit the Pentaho Enterprise Edition documentation site.

Description

This step type matches the string value of an input field against a text pattern defined by a regular expression. Optionally, the step can extract the substrings of the input text that match particular portions of the pattern into new output fields. This is known as "capturing".

A regular expression (regex or regexp for short) is a special text string that describes a search pattern. For example, the regex equivalent of the wildcard notation *.txt, which finds all text files in a file manager, is:

.*\.txt
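
As a minimal sketch of how this pattern behaves in java.util.regex, the following Java snippet tests a few file names against it. The class name GlobAsRegex, the sample file names, and the Y/N mapping are illustrative assumptions. The snippet also assumes that the whole value must match the pattern, as with Matcher.matches().

```java
import java.util.regex.Pattern;

class GlobAsRegex {
    // Regex equivalent of the wildcard *.txt: any characters, then a literal ".txt".
    static final Pattern TXT_FILE = Pattern.compile(".*\\.txt");

    // Returns "Y" when the whole value matches and "N" otherwise
    // (assumption: full-string matching, as with Matcher.matches()).
    static String evaluate(String value) {
        return TXT_FILE.matcher(value).matches() ? "Y" : "N";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"notes.txt", "notes.txt.bak", "notestxt"}) {
            System.out.println(name + " -> " + evaluate(name));
        }
    }
}
```

Only notes.txt matches here: notes.txt.bak has extra text after ".txt", and notestxt lacks the literal dot that \. requires.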

If you enable capture groups, this step can parse a complex string of text and create several new fields from it. For instance, suppose a text field contains an author's name in quotes, followed by the number of posts that author has made:

"Author, Ann" - 53 posts

You could use the following regex with two capture groups to create two new fields in the transformation, one for the name, and one for the number of posts:

^"([^"]*)" - (\d*) posts$
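
The same extraction can be tried outside the step with plain java.util.regex. This is only a sketch: the class name AuthorPostsParser and the choice to return the two captured values as a String array are assumptions made for illustration, not a description of the step's internals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorPostsParser {
    // Group 1: the text between the quotes; group 2: the digits before " posts".
    static final Pattern LINE = Pattern.compile("^\"([^\"]*)\" - (\\d*) posts$");

    // Returns {name, posts} when the whole line matches, or null when it does not.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] fields = parse("\"Author, Ann\" - 53 posts");
        System.out.println("name=" + fields[0] + ", posts=" + fields[1]);
    }
}
```

Note that in Java source each backslash and double quote in the pattern must be escaped, so \d becomes \\d and " becomes \".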

The regex evaluation step is implemented using the java.util.regex package. The exact syntax for creating regular expressions is defined in the java.util.regex.Pattern javadoc.

IMPORTANT: Don't panic! For people new to regular expressions, the cryptic nature of the language can be a bit daunting. However, regular expressions pack a lot of punch and are well worth the time you spend learning them. Several websites and software packages can help you create and test regular expressions; a web search for "regular expression editor" will find them.

Settings Tab

Option

Description

Step name

Name of the step.

Note: This name has to be unique in a single transformation.

Field to evaluate

Name of the field from the incoming stream that is matched against the regular expression.

Result Fieldname

The name of the output field (boolean). This field is added to the output stream and indicates whether the value of the input field matched the regular expression: Y means the value matched, N means it did not.

Create fields for capture groups

Enable this if you want to create new fields based on capture groups in the regular expression. Capturing groups are those parts of the regular expression pattern that are enclosed in a pair of left and right parentheses. If this option is enabled, the substrings of the input field value corresponding to the capturing groups are extracted and stored in new output fields, and the "Capture group fields" grid must define one field for each capturing group.

Replace previous fields

This option is available only when the "Create fields for capture groups" option is enabled. When "Replace previous fields" is checked, fields created for capturing groups replace existing fields in the incoming stream that have the same name. When it is not checked, a new field is added to the output stream for each capturing group field.

Regular expression

Enter the regular expression to match. See the java.util.regex.Pattern javadoc for reference documentation of the particular regular expression syntax used by this step.

Use variable substitution

Enable this if your regular expression contains variable references. When enabled, variable references are expanded to their values before the regular expression pattern is evaluated.

Capture group fields

Here you can specify the new fields for any substrings captured by the regular expression from the input string. If the "Create fields for capture groups" option is enabled, you need to use this grid to enter a field definition corresponding to each capturing group in the regular expression. The order of the fields is the same as the order of the capturing groups in the regular expression. The columns in the grid let you convert each captured value to the required data type right away.

Note: A capture group is a part of the expression between a matching pair of left and right parentheses.

For example, suppose your input field contains a text value like the following, including the trailing period:

"Author, Ann" - 53 posts.

The following regular expression contains 4 capturing groups and can be used to parse out the different parts:

^"(([^"]+), ([^"]+))" - (\d+) posts\.$

This expression implies 4 fields, one for each capturing group:

  • Fullname: (([^"]+), ([^"]+))
  • Lastname: ([^"]+)
  • Firstname: ([^"]+)
  • Number of posts: (\d+)

In this particular example, a field definition must be present for each of these capturing groups. It is an error if the number of field definitions does not match the number of capturing groups in the regular expression.

Note that capturing groups can be nested. In the example above, the fields Lastname and Firstname correspond to capturing groups that are themselves contained inside the Fullname capturing group.
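
In java.util.regex, capturing groups are numbered by the position of their opening parenthesis, which is why the outer Fullname group comes before the Lastname and Firstname groups nested inside it. The sketch below illustrates that ordering with the pattern ^"(([^"]+), ([^"]+))" - (\d+) posts\.$ and an input value that ends in a period. The class name NestedGroupsDemo and the array-based return value are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupsDemo {
    // Four capturing groups: Fullname (outer), Lastname, Firstname, Number of posts.
    static final Pattern LINE =
        Pattern.compile("^\"(([^\"]+), ([^\"]+))\" - (\\d+) posts\\.$");

    // Returns the captured values in group order (1..groupCount), or null on no match.
    static String[] capture(String value) {
        Matcher m = LINE.matcher(value);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] names = {"Fullname", "Lastname", "Firstname", "Number of posts"};
        String[] values = capture("\"Author, Ann\" - 53 posts.");
        for (int i = 0; i < values.length; i++) {
            System.out.println(names[i] + " = " + values[i]);
        }
    }
}
```

Matcher.groupCount() reports 4 for this pattern, which is the number of field definitions the grid would need.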

Content Tab

Option

Description

Ignore differences in Unicode encodings

Check to ignore differences in Unicode encodings.

Note: This may improve performance, but be sure your data only contains US-ASCII characters.

Enables case-insensitive matching

When enabled, the pattern is matched without regard to case. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by checking the "Enable Unicode-aware case folding" option in conjunction with this one.

Note: You can also enable this via the embedded flag expression (?i).
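
The following sketch compares the default behavior with the two ways of requesting case-insensitive matching in java.util.regex. It assumes that this option corresponds to the Pattern.CASE_INSENSITIVE flag. The class name CaseInsensitiveDemo and the sample pattern are illustrative.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Default behavior: matching is case-sensitive.
    static boolean caseSensitive(String value) {
        return Pattern.compile("error").matcher(value).matches();
    }

    // Same pattern compiled with the CASE_INSENSITIVE flag
    // (assumed to be what the option sets).
    static boolean withFlag(String value) {
        return Pattern.compile("error", Pattern.CASE_INSENSITIVE).matcher(value).matches();
    }

    // Same effect using the embedded flag expression (?i).
    static boolean withEmbeddedFlag(String value) {
        return Pattern.compile("(?i)error").matcher(value).matches();
    }

    public static void main(String[] args) {
        String value = "ERROR";
        System.out.println("case-sensitive: " + caseSensitive(value));
        System.out.println("flag:           " + withFlag(value));
        System.out.println("(?i):           " + withEmbeddedFlag(value));
    }
}
```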

Permit whitespace and comments in pattern

When enabled, the step will ignore whitespace and embedded comments starting with # through the end of the line.
In this mode, you must use the \s token to match whitespace. (If this option is not enabled, any whitespace characters appearing in the regular expression are matched as-is).

Note: Comments mode can also be enabled via the embedded flag expression (?x).
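
Comments mode makes it possible to spread a pattern over several lines and annotate each part. The sketch below assumes that this option corresponds to the Pattern.COMMENTS flag. The class name CommentsModeDemo and the sample pattern are illustrative.

```java
import java.util.regex.Pattern;

class CommentsModeDemo {
    // With COMMENTS enabled, layout whitespace and "# ..." comments are ignored,
    // so the effective pattern is (\d+)\sposts.
    static final Pattern POSTS = Pattern.compile(
        "(\\d+)   # one or more digits: the number of posts\n"
      + "\\s      # one whitespace character, written explicitly as \\s\n"
      + "posts    # the literal word\n",
        Pattern.COMMENTS);

    static boolean matches(String value) {
        return POSTS.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("53 posts")); // the space is matched by \s
        System.out.println(matches("53posts"));  // no whitespace, so no match
    }
}
```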

Enable dotall mode

When enabled, the expression '.' matches any character including the line terminator. By default, this expression matches any character except line terminators.

Note: Dotall mode can also be enabled via the embedded flag expression (?s).
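
As a small illustration, the sketch below matches a value containing a line break with and without dotall mode. It assumes that this option corresponds to the Pattern.DOTALL flag. The class name DotallDemo and the sample pattern are illustrative.

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Default: '.' does not match line terminators.
    static boolean defaultMode(String value) {
        return Pattern.compile("start.*end").matcher(value).matches();
    }

    // DOTALL flag (assumed to be what the option sets): '.' matches any character.
    static boolean dotall(String value) {
        return Pattern.compile("start.*end", Pattern.DOTALL).matcher(value).matches();
    }

    // Same effect using the embedded flag expression (?s).
    static boolean embedded(String value) {
        return Pattern.compile("(?s)start.*end").matcher(value).matches();
    }

    public static void main(String[] args) {
        String value = "start\nend";
        System.out.println("default: " + defaultMode(value));
        System.out.println("DOTALL:  " + dotall(value));
        System.out.println("(?s):    " + embedded(value));
    }
}
```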

Enable multiline mode

When enabled, the expressions '^' and '$' match just after or just before, respectively, a line terminator or the end of the input sequence. By default, these expressions only match at the beginning and the end of the entire input sequence.

Note: Multiline mode can also be enabled via the embedded flag expression (?m).
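
To make the effect of the anchors visible, the sketch below counts how many numeric lines the pattern ^\d+$ finds in a multi-line value. It uses Matcher.find() purely for demonstration, and it assumes that this option corresponds to the Pattern.MULTILINE flag. The class name MultilineDemo and the sample data are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    static final String NUMBER_LINE = "^\\d+$";

    // Counts non-overlapping occurrences of the pattern in the value.
    private static int count(Pattern pattern, String value) {
        Matcher m = pattern.matcher(value);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Default: ^ and $ anchor only at the start and end of the entire input.
    static int countDefault(String value) {
        return count(Pattern.compile(NUMBER_LINE), value);
    }

    // MULTILINE flag (assumed to be what the option sets): ^ and $ anchor at each line.
    static int countMultiline(String value) {
        return count(Pattern.compile(NUMBER_LINE, Pattern.MULTILINE), value);
    }

    // Same effect using the embedded flag expression (?m).
    static int countEmbedded(String value) {
        return count(Pattern.compile("(?m)" + NUMBER_LINE), value);
    }

    public static void main(String[] args) {
        String value = "10\n20\nabc";
        System.out.println("default:   " + countDefault(value));
        System.out.println("MULTILINE: " + countMultiline(value));
        System.out.println("(?m):      " + countEmbedded(value));
    }
}
```

For the three-line sample value, the default mode finds no numeric line because the input as a whole is not a number, while multiline mode finds the first two lines.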

Enable Unicode-aware case folding

When enabled, in conjunction with the Case-insensitive flag, case-insensitive matching is done in a manner consistent with the Unicode standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

Note: Unicode-aware case folding can also be enabled via the embedded flag expression (?u).
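
The difference shows up with non-ASCII letters. The sketch below matches a lowercase e with an acute accent (U+00E9) against its uppercase form (U+00C9). It assumes that the two options correspond to the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags. The class name UnicodeCaseDemo is illustrative.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    static final String LOWER_E_ACUTE = "\u00e9"; // pattern: lowercase e with acute accent
    static final String UPPER_E_ACUTE = "\u00c9"; // input: uppercase E with acute accent

    // CASE_INSENSITIVE alone folds case for US-ASCII characters only.
    static boolean asciiOnly() {
        return Pattern.compile(LOWER_E_ACUTE, Pattern.CASE_INSENSITIVE)
                      .matcher(UPPER_E_ACUTE).matches();
    }

    // Adding UNICODE_CASE makes the case folding Unicode-aware.
    static boolean unicodeAware() {
        return Pattern.compile(LOWER_E_ACUTE, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(UPPER_E_ACUTE).matches();
    }

    // Same effect using the embedded flag expressions (?i) and (?u) together.
    static boolean embedded() {
        return Pattern.compile("(?iu)" + LOWER_E_ACUTE).matcher(UPPER_E_ACUTE).matches();
    }

    public static void main(String[] args) {
        System.out.println("CASE_INSENSITIVE only: " + asciiOnly());
        System.out.println("with UNICODE_CASE:     " + unicodeAware());
        System.out.println("(?iu):                 " + embedded());
    }
}
```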

Enables Unix lines mode

When enabled, only the '\n' line terminator is recognized in the behavior of '.', '^', and '$'.

Note: Unix lines mode can also be enabled via the embedded flag expression (?d).
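
In java.util.regex, a carriage return normally counts as a line terminator, so '.' does not match it. In Unix lines mode only '\n' is treated that way. The sketch below shows the difference, assuming that this option corresponds to the Pattern.UNIX_LINES flag. The class name UnixLinesDemo and the sample pattern are illustrative.

```java
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Default: '\r' is a line terminator, so '.' does not match it.
    static boolean defaultMode(String value) {
        return Pattern.compile("a.b").matcher(value).matches();
    }

    // UNIX_LINES flag (assumed to be what the option sets): only '\n' is a line terminator.
    static boolean unixLines(String value) {
        return Pattern.compile("a.b", Pattern.UNIX_LINES).matcher(value).matches();
    }

    // Same effect using the embedded flag expression (?d).
    static boolean embedded(String value) {
        return Pattern.compile("(?d)a.b").matcher(value).matches();
    }

    public static void main(String[] args) {
        String value = "a\rb";
        System.out.println("default:    " + defaultMode(value));
        System.out.println("UNIX_LINES: " + unixLines(value));
        System.out.println("(?d):       " + embedded(value));
    }
}
```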


Example


samples/transformations/Regex Eval - parse NCSA access log records.ktr