StringToWordVector
Package
weka.filters.unsupervised.attribute
Synopsis
Converts String attributes into a set of attributes representing word occurrence information (depending on the tokenizer) from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically the training data).
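To illustrate the idea in the synopsis, the following is a minimal standard-library Python sketch, not Weka's implementation. The helper names (`tokenize`, `build_dictionary`, `to_word_vector`) and the simple alphanumeric tokenizer are assumptions made for this example. It shows a dictionary being fixed by the first batch, so words that appear only in later data produce no new attributes.

```python
import re
from collections import Counter


def tokenize(text, lower_case=True):
    """Split on non-alphanumeric characters (a stand-in for a real tokenizer)."""
    if lower_case:
        text = text.lower()
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def build_dictionary(first_batch):
    """Collect the vocabulary from the first batch only, in sorted order."""
    words = set()
    for doc in first_batch:
        words.update(tokenize(doc))
    return sorted(words)


def to_word_vector(doc, dictionary, output_word_counts=False):
    """Map one document onto the fixed dictionary.

    Returns 0/1 presence flags by default, or raw counts when
    output_word_counts is True.
    """
    counts = Counter(tokenize(doc))
    if output_word_counts:
        return [counts[w] for w in dictionary]
    return [1 if counts[w] > 0 else 0 for w in dictionary]


# The first batch (training data) determines the attributes.
train = ["The cat sat", "the dog sat down"]
dictionary = build_dictionary(train)

# "chased" and "bird" are not in the dictionary, so they are ignored.
test_vector = to_word_vector(
    "The cat chased the bird", dictionary, output_word_counts=True
)
```

In this sketch `dictionary` holds five words taken from the training strings, and the later document is described only in terms of those five words.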
Options
The table below describes the options available for StringToWordVector.
Option | Description |
---|---|
IDFTransform | Sets whether the word frequencies in a document should be transformed into fij*log(num of docs / num of docs with word i), where fij is the frequency of word i in document (instance) j. |
TFTransform | Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
attributeIndices | Specifies the range of attributes to act on. This is a comma-separated list of attribute indices, with "first" and "last" as valid values. Specify an inclusive range with "-", e.g. "first-3,5,6-10,last". |
attributeNamePrefix | Prefix for the created attribute names. (default: "") |
doNotOperateOnPerClassBasis | If this is set, the maximum number of words and the minimum term frequency are not enforced on a per-class basis but are based on the documents in all the classes (even if a class attribute is set). |
invertSelection | Sets the attribute selection mode. If false, only selected attributes in the range will be worked on; if true, only non-selected attributes will be processed. |
lowerCaseTokens | If set, all word tokens are converted to lower case before being added to the dictionary. |
minTermFreq | Sets the minimum term frequency. This is enforced on a per-class basis. |
normalizeDocLength | Sets whether the word frequencies for a document (instance) should be normalized. |
outputWordCounts | Outputs word counts rather than a boolean 0 or 1 (indicating the absence or presence of a word). |
periodicPruning | Specifies the rate (x% of the input dataset) at which to periodically prune the dictionary. wordsToKeep prunes only after a full dictionary has been created, and you may not have enough memory for that approach. |
stemmer | The stemming algorithm to use on the words. |
stopwords | The file containing the stopwords (if this is a directory, the default stopwords are used). |
tokenizer | The tokenizing algorithm to use on the strings. |
useStoplist | If set to true, ignores all words that are on the stoplist. |
wordsToKeep | The number of words (per class if a class attribute is assigned) to attempt to keep. |
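The TFTransform and IDFTransform options rescale raw word frequencies. The sketch below assumes the formulas given in Weka's option help, log(1 + fij) for TF and fij * log(N / Ni) for IDF, where N is the number of documents and Ni the number of documents containing word i. The function names and the toy `counts` data are hypothetical and exist only for this illustration; this is not Weka's code.

```python
import math


def tf_transform(f_ij):
    """Dampen a raw frequency: log(1 + fij)."""
    return math.log(1 + f_ij)


def idf_transform(f_ij, num_docs, num_docs_with_word):
    """Weight a frequency by how rare the word is across documents."""
    return f_ij * math.log(num_docs / num_docs_with_word)


# Toy word counts per document (hypothetical data).
counts = [
    {"cat": 2, "sat": 1},
    {"dog": 1, "sat": 1},
    {"cat": 1},
]
num_docs = len(counts)

# Document frequency: in how many documents each word occurs.
doc_freq = {}
for doc in counts:
    for word in doc:
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Apply each transform independently to every (document, word) count.
tf = [{w: tf_transform(f) for w, f in doc.items()} for doc in counts]
idf = [
    {w: idf_transform(f, num_docs, doc_freq[w]) for w, f in doc.items()}
    for doc in counts
]
```

Under these assumptions, a word that occurs in every document receives an IDF weight of zero, while a word found in a single document keeps the largest share of its frequency.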
Capabilities
The table below describes the capabilities of StringToWordVector.
Capability | Supported |
---|---|
Class | No class, Relational class, Unary class, Binary class, Numeric class, Empty nominal class, Date class, Missing class values, Nominal class, String class |
Attributes | Relational attributes, Empty nominal attributes, Date attributes, Binary attributes, String attributes, Missing values, Nominal attributes, Unary attributes, Numeric attributes |
Min # of instances | 0 |