Description
This step type validates an input field against a regular expression. A regular expression (regex or regexp for short) is a special text string that describes a search pattern. For example, the regex equivalent of the wildcard notation *.txt, used to find all text files in a file manager, is:
.*\.txt
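As a minimal sketch of how this pattern behaves, the Java snippet below tests a few file names against it. The class and method names are hypothetical, and the sketch assumes the whole value must match the pattern (Java's `Pattern.matches`), which is not necessarily how every tool applies a regex.

```java
import java.util.regex.Pattern;

final class WildcardSketch {
    // Regex equivalent of the wildcard *.txt; the dot is escaped so it matches literally.
    static boolean isTextFile(String fileName) {
        return Pattern.matches(".*\\.txt", fileName);
    }

    public static void main(String[] args) {
        System.out.println(isTextFile("notes.txt"));     // true
        System.out.println(isTextFile("notes.txt.bak")); // false: the value must end in .txt
        System.out.println(isTextFile("notestxt"));      // false: no literal dot before txt
    }
}
```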
If you enable capture groups, this step can be used to parse a complex string of text and create several new fields out of it. For instance, if you had a field of text containing an author's name in quotes and the number of posts made by them:
"Author, Ann" - 53 posts
You could use the following regex with two capture groups to create two new fields in the transformation, one for the name, and one for the number of posts:
^"([^"]*)" - (\d*) posts$
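The two capture groups can be tried outside the transformation with a short Java sketch. The class name and the `parse` helper are hypothetical and only illustrate what each group captures; the step itself writes the captured values to the fields you configure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaptureGroupSketch {
    // Group 1: the quoted author name. Group 2: the number of posts.
    static final Pattern AUTHOR_POSTS =
            Pattern.compile("^\"([^\"]*)\" - (\\d*) posts$");

    // Returns {name, posts}, or null when the value does not match.
    static String[] parse(String value) {
        Matcher m = AUTHOR_POSTS.matcher(value);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] fields = parse("\"Author, Ann\" - 53 posts");
        System.out.println(fields[0]); // Author, Ann
        System.out.println(fields[1]); // 53
    }
}
```

Both groups are captured as strings; converting the post count to a number is a separate step.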
See also:
- Wikipedia on regular expressions
- A regular expressions tutorial
- External sample parsing Tomcat log records
IMPORTANT: Don't panic! For people new to regular expressions, the cryptic nature of the language can be a bit daunting. However, regular expressions pack a lot of punch and are well worth the time you spend learning them. Several websites and software packages can help you create and test regular expressions; a web search for "regular expression editor" will find them.
Settings Tab
Option | Description |
---|---|
Step name | Name of the step. |
Field to evaluate | Name of the field to evaluate. |
Result Fieldname | Name of the Boolean field that receives the match result. |
Create fields for capture groups | Enable this option to create new fields based on the capture groups in the regular expression. If it is not enabled, the step only determines whether the row matches the regular expression. |
Regular expression | The regular expression to match. |
Use variable substitution | If the regular expression contains variables, select this option to replace them with their values. |
Capture group fields | The new fields to create from the capture groups. |
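One way to picture the "Use variable substitution" option is as a text replacement applied to the pattern before it is compiled. The sketch below is an illustration only, not the step's implementation: it assumes variables are written as `${NAME}`, and the `substitute` helper and the `PREFIX` variable are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VariableSubstitutionSketch {
    // Assumed reference syntax: ${NAME}
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${NAME} with its value; unknown variables are left untouched.
    static String substitute(String text, Map<String, String> variables) {
        Matcher m = VARIABLE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = variables.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = Collections.singletonMap("PREFIX", "ORD");
        String regex = substitute("^${PREFIX}-\\d+$", vars);
        System.out.println(regex);                             // ^ORD-\d+$
        System.out.println(Pattern.matches(regex, "ORD-123")); // true
    }
}
```

Note that a substituted value becomes part of the pattern, so any regex metacharacters it contains are interpreted as such.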
Content
Option | Description |
---|---|
Ignore differences in Unicode encodings | Enable this option to ignore differences in Unicode encodings, so that equivalent characters encoded differently still match. |
Enable case-insensitive matching | When enabled, matching ignores letter case. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the 'Unicode-aware case...' flag in conjunction with this flag. |
Permit whitespace and comments in pattern | When enabled, the step ignores whitespace in the pattern, as well as embedded comments that start with # and run to the end of the line. |
Enable dotall mode | When enabled, the expression '.' matches any character, including a line terminator. By default, this expression does not match line terminators. |
Enable multiline mode | When enabled, the expressions '^' and '$' match just after or just before, respectively, a line terminator or the end of the input sequence. By default, these expressions only match at the beginning and the end of the entire input sequence. |
Enable Unicode-aware case folding | When enabled, in conjunction with the case-insensitive flag, case-insensitive matching is done in a manner consistent with the Unicode standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. |
Enable Unix lines mode | When enabled, only the '\n' line terminator is recognized in the behavior of '.', '^', and '$'. |
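The descriptions above closely follow the match flags of Java's `java.util.regex.Pattern`. Assuming each option corresponds to the flag of the same meaning (an assumption, not something this page states), the following sketch shows the effect of each one in isolation. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexFlagsSketch {
    // True when the whole input matches the regex under the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // True when the regex matches anywhere in the input under the given flags.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // Case-insensitive matching covers US-ASCII only by default.
        System.out.println(fullMatch("abc", 0, "ABC"));                        // false
        System.out.println(fullMatch("abc", Pattern.CASE_INSENSITIVE, "ABC")); // true
        // \u00e9 and \u00c9 are lowercase and uppercase e-acute.
        System.out.println(fullMatch("\u00e9", Pattern.CASE_INSENSITIVE, "\u00c9")); // false
        System.out.println(fullMatch("\u00e9",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, "\u00c9"));   // true

        // Whitespace and # comments in the pattern are ignored.
        System.out.println(fullMatch("\\d+  # digits", Pattern.COMMENTS, "42")); // true

        // Dotall: '.' also matches a line terminator.
        System.out.println(fullMatch("a.b", 0, "a\nb"));              // false
        System.out.println(fullMatch("a.b", Pattern.DOTALL, "a\nb")); // true

        // Multiline: '^' and '$' also match at line boundaries.
        System.out.println(found("^b$", 0, "a\nb"));                 // false
        System.out.println(found("^b$", Pattern.MULTILINE, "a\nb")); // true

        // Unix lines: only '\n' counts as a line terminator, so '.' matches '\r'.
        System.out.println(fullMatch("a.b", 0, "a\rb"));                  // false
        System.out.println(fullMatch("a.b", Pattern.UNIX_LINES, "a\rb")); // true

        // Canonical equivalence: 'a' + combining ring vs. precomposed a-ring.
        // This is the example given in the Pattern Javadoc for CANON_EQ.
        System.out.println(fullMatch("a\u030A", Pattern.CANON_EQ, "\u00E5"));
    }
}
```

Flags are combined with `|`, which mirrors enabling several options at once.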
Example
samples/transformations/Regex Eval - parse NCSA access log records.ktr