IteratorFactory (S-Space Package 2.0.1 API)

java.lang.Object
- edu.ucla.sspace.text.IteratorFactory

```
public class IteratorFactory
extends Object
```
A factory class for generating Iterator<String> tokenizers for streams of tokens such as BufferedReader instances. This class manages all of the internal configurations and properties for how to tokenize. SemanticSpace instances are encouraged to use this class to create iterators over the tokens in their documents rather than creating the iterators themselves, as this class may apply additional settings to which the SemanticSpace instance would not have access.
This class offers the following configurable parameters for controlling the tokenizing of streams.

Property: "edu.ucla.sspace.text.TokenizerFactory.tokenFilter"
Default: unset
This property sets a configuration of a TokenFilter that should be applied to all token streams.

Property: "edu.ucla.sspace.text.TokenizerFactory.stemmer"
Default: unset
This property enables the use of a Stemmer on all the tokens returned by iterators of this class. The property value should be the fully qualified class name of a Stemmer class implementation.

Property: "edu.ucla.sspace.text.TokenizerFactory.tokenCountLimit"
Default: unset
This property sets the maximum number of tokens returned by any iterator returned from this class. It can be used to artificially limit the total number of tokens per document.

Property: "edu.ucla.sspace.text.TokenizerFactory.compoundTokens"
Default: unset
This property sets the name of a file that contains all of the compound words (or multi-token tokens) recognized by any iterators returned by this class.
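
The properties above can be collected in a standard `java.util.Properties` object before being handed to `setProperties`. The following is a minimal sketch rather than a definitive setup: the property keys are copied from the list above, while every value (the filter specification, the stemmer class name, the limit, and the file name) is a hypothetical placeholder. The call to `IteratorFactory.setProperties` is left as a comment so that the sketch compiles without the S-Space jar.

```java
import java.util.Properties;

// Sketch only: assembles the tokenizer configuration described above.
class TokenizerConfigSketch {

    // Common prefix of the property keys listed in this documentation.
    static final String PREFIX = "edu.ucla.sspace.text.TokenizerFactory.";

    // Builds the configuration. All values are hypothetical placeholders.
    static Properties buildProperties() {
        Properties props = new Properties();
        // Placeholder: the filter configuration format is not described here.
        props.setProperty(PREFIX + "tokenFilter", "<token filter specification>");
        // Placeholder: must be the fully qualified name of a Stemmer implementation.
        props.setProperty(PREFIX + "stemmer", "com.example.text.MyStemmer");
        // Assumed to be a decimal integer; caps the tokens returned per iterator.
        props.setProperty(PREFIX + "tokenCountLimit", "1000");
        // Placeholder file name for the compound token list.
        props.setProperty(PREFIX + "compoundTokens", "compound-words.txt");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProperties();
        // With the S-Space jar on the classpath, the configuration would be applied with:
        // IteratorFactory.setProperties(props);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```

Any property left out of the object stays at its default of unset.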

Note that tokens will be combined into a compound token prior to filtering. Therefore if filtering is enabled, any compound token should also be permitted by the word filter.
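
To illustrate why the filter must also permit the compound token, here is a small stand-alone simulation of the merge-then-filter order. It does not use the library's own iterators: `mergeCompounds` and `filter` are hypothetical helpers, and joining the parts of a compound with a single space is an assumption made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: simulates "combine compounds first, then filter".
class CompoundThenFilterSketch {

    // Greedily merges adjacent token pairs that appear in the compound set.
    // Joining with a space is an assumption of this sketch.
    static List<String> mergeCompounds(List<String> tokens, Set<String> compounds) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String pair = tokens.get(i) + " " + tokens.get(i + 1);
                if (compounds.contains(pair)) {
                    merged.add(pair);
                    i += 2;
                    continue;
                }
            }
            merged.add(tokens.get(i));
            i++;
        }
        return merged;
    }

    // Keeps only the tokens that the (inclusive) word filter accepts.
    static List<String> filter(List<String> tokens, Set<String> accepted) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accepted.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "white", "house", "said");
        Set<String> compounds = Set.of("white house");

        // The filter lists only the parts, so the merged compound is dropped.
        Set<String> partsOnly = Set.of("white", "house", "said");
        System.out.println(filter(mergeCompounds(tokens, compounds), partsOnly));
        // prints [said]

        // The filter lists the compound itself, so it survives.
        Set<String> withCompound = Set.of("white house", "said");
        System.out.println(filter(mergeCompounds(tokens, compounds), withCompound));
        // prints [white house, said]
    }
}
```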
Note that this class provides two distinct ways to access the token streams if filtering is enabled. The tokenize method will filter out any tokens without any indication. This can significantly alter the positions of the remaining tokens relative to the original token stream. For applications where the original ordering needs to be preserved, the tokenizeOrdered method should be used instead. This method will return the IteratorFactory.EMPTY_TOKEN value to indicate that a token has been removed. This preserves the original token ordering without requiring applications to do the filtering themselves. Note that if filtering is disabled, the two methods will return the same tokens.
This class is thread-safe.
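
The difference between tokenize and tokenizeOrdered under filtering can be sketched without the library. In the simulation below, `PLACEHOLDER` is a visible stand-in for `IteratorFactory.EMPTY_TOKEN` (the real constant's value is not shown on this page), and both helper methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: contrasts dropping filtered tokens with replacing them.
class OrderedFilteringSketch {

    // Hypothetical stand-in for IteratorFactory.EMPTY_TOKEN.
    static final String PLACEHOLDER = "<EMPTY>";

    // Mimics tokenize: excluded tokens vanish without a trace.
    static List<String> tokenizeSketch(List<String> tokens, Set<String> excluded) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!excluded.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    // Mimics tokenizeOrdered: excluded tokens are replaced, so positions survive.
    static List<String> tokenizeOrderedSketch(List<String> tokens, Set<String> excluded) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(excluded.contains(token) ? PLACEHOLDER : token);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "sat", "on", "the", "mat");
        Set<String> excluded = Set.of("the", "on");

        List<String> dropped = tokenizeSketch(tokens, excluded);
        List<String> ordered = tokenizeOrderedSketch(tokens, excluded);

        System.out.println(dropped); // prints [cat, sat, mat]
        System.out.println(ordered); // prints [<EMPTY>, cat, sat, <EMPTY>, <EMPTY>, mat]

        // Distance between "sat" and "mat": 1 after dropping, 3 when ordered.
        System.out.println(dropped.indexOf("mat") - dropped.indexOf("sat"));
        System.out.println(ordered.indexOf("mat") - ordered.indexOf("sat"));
    }
}
```

Algorithms that depend on token distance, such as sliding co-occurrence windows, are the kind of consumer that would want the ordered form.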

See Also:
WordIterator, TokenFilter, CompoundWordIterator

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`COMPOUND_TOKENS_FILE_PROPERTY` Specifies the name of a file that contains all the recognized compound tokens
`static String`	`EMPTY_TOKEN` The signifier that stands in place of a token that has been removed from an iterator's token stream by means of a `TokenFilter`.
`static Set<String>`	`ITERATOR_FACTORY_PROPERTIES` The set of all the factory properties supported for configuration by the `IteratorFactory`.
`static String`	`STEMMER_PROPERTY` Specifies the `Stemmer` to use on tokens.
`static String`	`TOKEN_COUNT_LIMIT_PROPERTY` Specifies an upper limit on the number of tokens each iterator can return.
`static String`	`TOKEN_FILTER_PROPERTY` Specifies the `TokenFilter` to apply to all iterators generated by this factory
`static String`	`TOKEN_REPLACEMENT_FILE_PROPERTY` Specifies the name of a file which contains term replacement mappings for a `WordReplacementIterator`.

Method Summary

Methods
Modifier and Type	Method and Description
`static void`	`setProperties(Properties props)` Reconfigures the type of iterator returned by this factory based on the specified properties.
`static void`	`setResourceFinder(ResourceFinder finder)` Sets the `ResourceFinder` used by the iterator factory to locate its file-based resources when configuring the tokenization.
`static Iterator<String>`	`tokenize(BufferedReader reader)` Tokenizes the contents of the reader according to the system configuration and returns an iterator over all the tokens, excluding those that were removed by any configured `TokenFilter`.
`static Iterator<String>`	`tokenize(String str)` Tokenizes the contents of the string according to the system configuration and returns an iterator over all the tokens, excluding those that were removed by any configured `TokenFilter`.
`static Iterator<String>`	`tokenizeOrdered(BufferedReader reader)` Tokenizes the contents of the reader according to the system configuration and returns an iterator over all the tokens where any removed tokens have been replaced with the `IteratorFactory.EMPTY_TOKEN` value.
`static Iterator<String>`	`tokenizeOrdered(String str)` Tokenizes the contents of the string according to the system configuration and returns an iterator over all the tokens where any removed tokens have been replaced with the `IteratorFactory.EMPTY_TOKEN` value.
`static Iterator<String>`	`tokenizeOrderedWithReplacement(BufferedReader reader)` Wraps an iterator returned by `tokenizeOrdered` to also include term replacement of tokens.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - EMPTY_TOKEN
```
public static final String EMPTY_TOKEN
```
    The signifier that stands in place of a token that has been removed from an iterator's token stream by means of a TokenFilter. Tokens returned by tokenizeOrdered may be checked against this value to determine whether a token at that position in the stream would have been returned but was removed.
    
    See Also:
    Constant Field Values
  - TOKEN_FILTER_PROPERTY
```
public static final String TOKEN_FILTER_PROPERTY
```
    Specifies the TokenFilter to apply to all iterators generated by this factory
    
    See Also:
    Constant Field Values
  - STEMMER_PROPERTY
```
public static final String STEMMER_PROPERTY
```
    Specifies the Stemmer to use on tokens. If not set, no stemming is done.
    
    See Also:
    Constant Field Values
  - COMPOUND_TOKENS_FILE_PROPERTY
```
public static final String COMPOUND_TOKENS_FILE_PROPERTY
```
    Specifies the name of a file that contains all the recognized compound tokens
    
    See Also:
    Constant Field Values
  - TOKEN_REPLACEMENT_FILE_PROPERTY
```
public static final String TOKEN_REPLACEMENT_FILE_PROPERTY
```
    Specifies the name of a file which contains term replacement mappings for a WordReplacementIterator.
    
    See Also:
    Constant Field Values
  - TOKEN_COUNT_LIMIT_PROPERTY
```
public static final String TOKEN_COUNT_LIMIT_PROPERTY
```
    Specifies an upper limit on the number of tokens each iterator can return.
    
    See Also:
    Constant Field Values
  - ITERATOR_FACTORY_PROPERTIES
```
public static final Set<String> ITERATOR_FACTORY_PROPERTIES
```
    The set of all the factory properties supported for configuration by the IteratorFactory.
- Method Detail
  - setProperties
```
public static void setProperties(Properties props)
```
    Reconfigures the type of iterator returned by this factory based on the specified properties.
  - setResourceFinder
```
public static void setResourceFinder(ResourceFinder finder)
```
    Sets the ResourceFinder used by the iterator factory to locate its file-based resources when configuring the tokenization. This method should be called prior to calling setProperties to ensure that the resources are accessed correctly. Most applications will never need to call this method.
    
    Parameters:
    finder - the resource finder used to find and open file-based resources
  - tokenize
```
public static Iterator<String> tokenize(BufferedReader reader)
```
    Tokenizes the contents of the reader according to the system configuration and returns an iterator over all the tokens, excluding those that were removed by any configured TokenFilter.
    
    Parameters:
    reader - a reader whose contents are to be tokenized
    
    Returns:
    an iterator over all of the optionally-filtered tokens in the reader
  - tokenize
```
public static Iterator<String> tokenize(String str)
```
    Tokenizes the contents of the string according to the system configuration and returns an iterator over all the tokens, excluding those that were removed by any configured TokenFilter.
    
    Parameters:
    str - a string whose contents are to be tokenized
    
    Returns:
    an iterator over all of the optionally-filtered tokens in the string
  - tokenizeOrdered
```
public static Iterator<String> tokenizeOrdered(BufferedReader reader)
```
    Tokenizes the contents of the reader according to the system configuration and returns an iterator over all the tokens where any removed tokens have been replaced with the IteratorFactory.EMPTY_TOKEN value. Tokens returned by this method may be checked against this value to determine whether a token at that position in the stream would have been returned but was removed. In doing this, the original order and positioning is retained.
    
    Parameters:
    reader - a reader whose contents are to be tokenized
    
    Returns:
    an iterator over all of the tokens in the reader where any tokens removed due to filtering have been replaced with the IteratorFactory.EMPTY_TOKEN value
  - tokenizeOrdered
```
public static Iterator<String> tokenizeOrdered(String str)
```
    Tokenizes the contents of the string according to the system configuration and returns an iterator over all the tokens where any removed tokens have been replaced with the IteratorFactory.EMPTY_TOKEN value. Tokens returned by this method may be checked against this value to determine whether a token at that position in the stream would have been returned but was removed. In doing this, the original order and positioning is retained.
    
    Parameters:
    str - a string whose contents are to be tokenized
    
    Returns:
    an iterator over all of the tokens in the string where any tokens removed due to filtering have been replaced with the IteratorFactory.EMPTY_TOKEN value
  - tokenizeOrderedWithReplacement
```
public static Iterator<String> tokenizeOrderedWithReplacement(BufferedReader reader)
```
    Wraps an iterator returned by tokenizeOrdered to also include term replacement of tokens. Terms will be replaced based on a mapping provided through the system configuration.
    
    Parameters:
    reader - A reader whose contents are to be tokenized.
    
    Returns:
    An iterator over all the tokens in the reader where any tokens removed due to filtering have been replaced with the IteratorFactory.EMPTY_TOKEN value, and tokens may be replaced based on system configuration.
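
The wrapping described for tokenizeOrderedWithReplacement can be pictured as an iterator that decorates another iterator and substitutes tokens through a map. The class below is a hypothetical sketch of that idea using only the standard library. It is not the library's WordReplacementIterator, the replacement map is invented for illustration, and `"<EMPTY>"` stands in for `IteratorFactory.EMPTY_TOKEN`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch only: decorates a token iterator with map-based term replacement.
class ReplacementIteratorSketch implements Iterator<String> {

    private final Iterator<String> base;
    private final Map<String, String> replacements;

    ReplacementIteratorSketch(Iterator<String> base, Map<String, String> replacements) {
        this.base = base;
        this.replacements = replacements;
    }

    @Override
    public boolean hasNext() {
        return base.hasNext();
    }

    // Returns the mapped term if one exists, otherwise the token unchanged.
    @Override
    public String next() {
        String token = base.next();
        return replacements.getOrDefault(token, token);
    }

    // Collects the remaining tokens of an iterator into a list.
    static List<String> drain(Iterator<String> tokens) {
        List<String> result = new ArrayList<>();
        while (tokens.hasNext()) {
            result.add(tokens.next());
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated output of an ordered tokenization with two removed tokens.
        List<String> ordered = List.of("<EMPTY>", "colour", "<EMPTY>", "centre");
        // Hypothetical replacement mapping.
        Map<String, String> mapping = Map.of("colour", "color", "centre", "center");

        Iterator<String> wrapped = new ReplacementIteratorSketch(ordered.iterator(), mapping);
        System.out.println(drain(wrapped));
        // prints [<EMPTY>, color, <EMPTY>, center]
    }
}
```

Because the placeholder has no entry in the mapping, removed positions pass through unchanged and the original ordering is still preserved.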
