Regex Evaluation

Description

This step type allows you to validate an input field against a regular expression. A regular expression (regex or regexp for short) is a special text string that describes a search pattern. For example, the regex equivalent of the wildcard notation *.txt, used to find all text files in a file manager, is:

.*\.txt
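
As a rough illustration of how this pattern behaves, here is a minimal Java sketch. Java is an assumption made for these examples, and the class and method names are hypothetical; the step itself only needs the pattern.

```java
import java.util.regex.Pattern;

class WildcardSketch {
    // Hypothetical helper: true when the whole name matches the pattern .*\.txt
    static boolean isTextFile(String name) {
        // The backslash is doubled because this is a Java string literal.
        return Pattern.matches(".*\\.txt", name);
    }

    public static void main(String[] args) {
        System.out.println(isTextFile("notes.txt"));     // true
        System.out.println(isTextFile("notes.txt.bak")); // false: the name does not end in .txt
        System.out.println(isTextFile("notestxt"));      // false: \. requires a literal dot
    }
}
```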

If you enable capture groups, this step can be used to parse a complex string of text and create several new fields out of it. For instance, if you had a field of text containing an author's name in quotes and the number of posts made by them:

"Author, Ann" - 53 posts

You could use the following regex with two capture groups to create two new fields in the transformation, one for the name, and one for the number of posts:

^"([^"]*)" - (\d*) posts$
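
To see what the two capture groups yield, the following Java sketch applies the same pattern to the sample text outside of the step. The class name and the parse helper are hypothetical, and the sketch only approximates what the step does when it creates fields from capture groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupSketch {
    // Same pattern as above, escaped for a Java string literal.
    // Group 1 is the quoted name, group 2 is the number of posts.
    private static final Pattern AUTHOR_POSTS =
            Pattern.compile("^\"([^\"]*)\" - (\\d*) posts$");

    // Hypothetical helper: returns {name, posts}, or null when the input does not match.
    static String[] parse(String input) {
        Matcher m = AUTHOR_POSTS.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] fields = parse("\"Author, Ann\" - 53 posts");
        System.out.println(fields[0]);                   // Author, Ann
        System.out.println(Integer.parseInt(fields[1])); // 53
    }
}
```

Note that (\d*) also accepts an empty string, so an input such as "Author, Ann" -  posts would still match with an empty second group; (\d+) is the stricter choice if a number is always expected.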

Settings Tab

Option	Description
Step name	Name of the step. *Note*: This name has to be unique in a single transformation.
Field to evaluate	Name of the field to evaluate
Result Fieldname	The name of the return field (boolean)
Create fields for capture groups	Enable this if you want to create new fields based on capture groups in the regular expression. If this option is not enabled, the step will determine whether the row matches the regular expression or not.
Regular expression	Enter the regular expression to match.
Use variable substitution	Select this option if the regular expression contains variables that should be replaced by their values.
Capture group fields	Here you can specify the new fields you would like to capture. IMPORTANT: The order of occurrence is the same as the order of the capture groups in the regular expression. The different columns allow you to change to the required data type right away. Note: A capture group is the part of the expression between parentheses. For example, you can capture a sequence of numerical characters like this: ([0-9]+)

Content

Option	Description
Ignore differences in Unicode encodings	Check to ignore differences. *Note*: This may improve performance, but be sure your data only contains US-ASCII characters.
Enables case-insensitive matching	By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the 'Unicode-aware case...' flag in conjunction with this flag. *Note*: You can also enable this via the embedded flag expression (?i).
Permit whitespace and comments in pattern	When enabled, the step will ignore whitespace and embedded comments starting with # through the end of the line. In this mode, you must use the \s token to match whitespace. *Note*: Comments mode can also be enabled via the embedded flag expression (?x).
Enable dotall mode	When enabled, the expression '.' matches any character including the line terminator. By default, this expression does not match the line terminators. *Note*: Dotall mode can also be enabled via the flag expression (?s).
Enable multiline mode	When enabled, the expressions '^' and '$' match just after or just before, respectively, a line terminator or the end of the input sequence. By default, these expressions only match at the beginning and the end of the entire input sequence. *Note*: Multiline mode can also be enabled via the embedded flag expression (?m).
Enable Unicode-aware case folding	When enabled, in conjunction with the Case-insensitive flag, case-insensitive matching is done in a manner consistent with the Unicode standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. *Note*: Unicode-aware case folding can also be enabled via the embedded flag expression (?u).
Enables Unix lines mode	When enabled, only the '\n' line terminator is recognized in the behavior of '.', '^', and '$'. *Note*: Unix lines mode can also be enabled via the embedded flag expression (?d).
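
The embedded flag expressions listed above can be tried out in isolation. The following Java sketch assumes that the options correspond to the java.util.regex.Pattern flags of the same names, which the (?i), (?x), (?s), (?m), (?u) and (?d) notation suggests; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class FlagSketch {
    // Hypothetical helper: true when the pattern matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Hypothetical helper: true when the pattern is found anywhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // (?i) case-insensitive matching, US-ASCII only by default
        System.out.println(matches("error", "ERROR"));              // false
        System.out.println(matches("(?i)error", "ERROR"));          // true

        // (?u) together with (?i): Unicode-aware case folding
        System.out.println(matches("(?i)\u00e9", "\u00c9"));        // false
        System.out.println(matches("(?iu)\u00e9", "\u00c9"));       // true

        // (?x) whitespace and # comments in the pattern are ignored
        System.out.println(matches("(?x) \\d+  # digits", "42"));   // true

        // (?s) dotall: '.' also matches a line terminator
        System.out.println(matches("a.b", "a\nb"));                 // false
        System.out.println(matches("(?s)a.b", "a\nb"));             // true

        // (?m) multiline: '^' and '$' match at line boundaries
        System.out.println(find("^b$", "a\nb"));                    // false
        System.out.println(find("(?m)^b$", "a\nb"));                // true

        // (?d) Unix lines: only '\n' counts as a line terminator, so '.' matches '\r'
        System.out.println(matches(".", "\r"));                     // false
        System.out.println(matches("(?d).", "\r"));                 // true
    }
}
```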

Example

samples/transformations/Regex Eval - parse NCSA access log records.ktr