Version: v2.4 
RegexTokenizer
This node creates a new DataFrame by breaking text (such as a sentence) into individual terms (usually words) based on regular expression matching.
Type
transform
Fields
| Name | Title | Description |
|---|---|---|
| inputCol | Column | Input column containing the text to tokenize |
| outputCol | Tokenized Column | Name of the new output column holding the tokens |
| pattern | Pattern | The regex pattern used to match delimiters (or tokens, depending on Gaps) |
| gaps | Gaps | If true, the pattern matches the gaps (delimiters) between tokens; if false, the pattern matches the tokens themselves |
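The interaction between Pattern and Gaps can be sketched in plain Python with the standard `re` module (illustrative only; the node itself runs the equivalent logic on Spark, and `regex_tokenize` is a hypothetical helper, not part of the node's API):

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """Sketch of regex-tokenizer semantics: with gaps=True the pattern
    matches delimiters and tokens are the text between matches; with
    gaps=False the pattern matches the tokens themselves."""
    if gaps:
        # Split on the delimiter pattern, dropping empty strings.
        return [t for t in re.split(pattern, text) if t]
    # Extract every substring that matches the pattern.
    return re.findall(pattern, text)

print(regex_tokenize("this is a spam"))                          # split on \s+
print(regex_tokenize("i am going to work", pattern=r"\w+", gaps=False))  # match words
```

Both calls above yield the same word lists because splitting on whitespace and matching word runs are complementary views of the same text.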
Examples
Input
| label | message | id |
|---|---|---|
| DoubleType | StringType | DoubleType |
| 1.0 | this is a spam | 2.0 |
| 0.0 | i am going to work | 1.0 |
Parameters
| Name | Value |
|---|---|
| Column | message |
| Tokenized Column | token_output |
| Pattern | \s+ |
| Gaps | true |
Output
| label | message | id | token_output |
|---|---|---|---|
| DoubleType | StringType | DoubleType | ArrayType(StringType,true) |
| 1.0 | this is a spam | 2.0 | WrappedArray(this, is, a, spam) |
| 0.0 | i am going to work | 1.0 | WrappedArray(i, am, going, to, work) |