public class IteratorFactory extends Object
A factory for Iterator<String> tokenizers over streams of tokens, such as
BufferedReader instances. This class manages all of the internal
configuration and properties that control tokenization. SemanticSpace
instances are encouraged to use this class to create iterators over the
tokens in documents rather than creating the iterators themselves, because
this class may apply additional settings to which the SemanticSpace instance
would not have access.
This class offers the following configurable properties for controlling the tokenizing of streams.
"edu.ucla.sspace.text.TokenizerFactory.tokenFilter"
Sets the configuration of a TokenFilter that should be applied to all token streams.
"edu.ucla.sspace.text.TokenizerFactory.stemmer"
Specifies a Stemmer to use on all the tokens returned by iterators of this class. The property value should be the fully qualified class name of a Stemmer class implementation.
"edu.ucla.sspace.text.TokenizerFactory.tokenCountLimit"
Specifies an upper limit on the number of tokens each iterator can return.
"edu.ucla.sspace.text.TokenizerFactory.compoundTokens"
Specifies the name of a file that contains all the recognized compound tokens. Note that tokens will be combined into a compound token prior to filtering. Therefore, if filtering is enabled, any compound token should also be permitted by the TokenFilter.
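As a rough illustration of how these properties might be assembled, the sketch below builds a java.util.Properties object keyed by the property names listed above. The class name TokenizerConfigSketch and all of the values (the file name, the stemmer class name, and the limit written as a decimal string) are assumptions made for the example, not values taken from the library. The value syntax for the tokenFilter property is not described here, so it is left unset.

```java
import java.util.Properties;

// Hypothetical helper class; it uses only java.util.Properties.
class TokenizerConfigSketch {
    // Property keys copied verbatim from the list above.
    static final String TOKEN_FILTER =
        "edu.ucla.sspace.text.TokenizerFactory.tokenFilter";
    static final String STEMMER =
        "edu.ucla.sspace.text.TokenizerFactory.stemmer";
    static final String TOKEN_COUNT_LIMIT =
        "edu.ucla.sspace.text.TokenizerFactory.tokenCountLimit";
    static final String COMPOUND_TOKENS =
        "edu.ucla.sspace.text.TokenizerFactory.compoundTokens";

    static Properties buildConfig() {
        Properties props = new Properties();
        // All values below are made-up placeholders for illustration.
        props.setProperty(COMPOUND_TOKENS, "compound-tokens.txt");
        props.setProperty(TOKEN_COUNT_LIMIT, "100000");
        props.setProperty(STEMMER, "com.example.MyStemmer");
        // TOKEN_FILTER is intentionally not set: its value format is
        // not specified in this section.
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        // With the library on the classpath, the next step would be:
        //   IteratorFactory.setProperties(props);
        props.stringPropertyNames().stream().sorted()
             .forEach(k -> System.out.println(k + " = " + props.getProperty(k)));
    }
}
```

Because the compound tokens are combined before filtering, any file named by the compoundTokens property would need to list only compounds that the configured filter also accepts.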
Note that this class provides two distinct ways to access the token streams
if filtering is enabled. The tokenize method silently omits any filtered
tokens, which can significantly alter the original positions of the
remaining tokens in the stream. For applications where the original ordering
needs to be preserved, the tokenizeOrdered method should be used instead.
That method returns the IteratorFactory.EMPTY_TOKEN value in place of each
removed token, which preserves the original token ordering without requiring
applications to do the filtering themselves. If filtering is disabled, the
two methods return the same tokens.
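The difference between the two access styles can be sketched without the library. The code below is a stand-in, not the IteratorFactory implementation: the class OrderedFilterSketch, its methods, the stop-word set, and the PLACEHOLDER string are all hypothetical, and the actual value of IteratorFactory.EMPTY_TOKEN is not given in this section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in illustration of "filtered" versus "ordered" token streams.
class OrderedFilterSketch {
    // Hypothetical marker playing the role of EMPTY_TOKEN.
    static final String PLACEHOLDER = "<removed>";

    // Mimics tokenize: excluded tokens disappear from the output.
    static List<String> filtered(List<String> tokens, Set<String> excluded) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!excluded.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Mimics tokenizeOrdered: excluded tokens are replaced by the marker,
    // so every surviving token keeps its original position.
    static List<String> ordered(List<String> tokens, Set<String> excluded) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(excluded.contains(t) ? PLACEHOLDER : t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        Set<String> stop = new HashSet<>(Arrays.asList("the", "on"));
        // prints [cat, sat, mat]
        System.out.println(filtered(tokens, stop));
        // prints [<removed>, cat, sat, <removed>, <removed>, mat]
        System.out.println(ordered(tokens, stop));
    }
}
```

In the filtered output "sat" and "mat" appear adjacent, while the ordered output keeps them three positions apart, which matters for anything that depends on token distance, such as a sliding co-occurrence window.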
This class is thread-safe.
See Also: WordIterator, TokenFilter, CompoundWordIterator
Modifier and Type | Field and Description |
---|---|
static String | COMPOUND_TOKENS_FILE_PROPERTY: Specifies the name of a file that contains all the recognized compound tokens. |
static String | EMPTY_TOKEN: The signifier that stands in place of a token that has been removed from an iterator's token stream by means of a TokenFilter. |
static Set<String> | ITERATOR_FACTORY_PROPERTIES: The set of all the factory properties supported for configuration by the IteratorFactory. |
static String | STEMMER_PROPERTY: Specifies the Stemmer to use on tokens. |
static String | TOKEN_COUNT_LIMIT_PROPERTY: Specifies an upper limit on the number of tokens each iterator can return. |
static String | TOKEN_FILTER_PROPERTY: Specifies the TokenFilter to apply to all iterators generated by this factory. |
static String | TOKEN_REPLACEMENT_FILE_PROPERTY: Specifies the name of a file which contains term replacement mappings for a WordReplacementIterator. |
Modifier and Type | Method and Description |
---|---|
static void | setProperties(Properties props): Reconfigures the type of iterator returned by this factory based on the specified properties. |
static void | setResourceFinder(ResourceFinder finder): Sets the ResourceFinder used by the iterator factory to locate its file-based resources when configuring the tokenization. |
static Iterator<String> | tokenize(BufferedReader reader): Tokenizes the contents of the reader according to the system configuration and returns an iterator over all the tokens, excluding those that were removed by any configured TokenFilter. |
static Iterator<String> | tokenize(String str): Tokenizes the contents of the string according to the system configuration and returns an iterator over all the tokens, excluding those that were removed by any configured TokenFilter. |
static Iterator<String> | tokenizeOrdered(BufferedReader reader): Tokenizes the contents of the reader according to the system configuration and returns an iterator over all the tokens where any removed tokens have been replaced with the IteratorFactory.EMPTY_TOKEN value. |
static Iterator<String> | tokenizeOrdered(String str): Tokenizes the contents of the string according to the system configuration and returns an iterator over all the tokens where any removed tokens have been replaced with the IteratorFactory.EMPTY_TOKEN value. |
static Iterator<String> | tokenizeOrderedWithReplacement(BufferedReader reader): Wraps an iterator returned by tokenizeOrdered to also include term replacement of tokens. |
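The wrapping described for tokenizeOrderedWithReplacement can be pictured as a decorator over an existing token iterator. The following is a minimal sketch under stated assumptions: ReplacingIteratorSketch and its in-memory replacement map are hypothetical, and the library's own WordReplacementIterator, which reads its mappings from a file, may behave differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical decorator: passes tokens through, swapping any token that
// has an entry in the replacement map.
class ReplacingIteratorSketch implements Iterator<String> {
    private final Iterator<String> base;
    private final Map<String, String> replacements;

    ReplacingIteratorSketch(Iterator<String> base,
                            Map<String, String> replacements) {
        this.base = base;
        this.replacements = replacements;
    }

    @Override
    public boolean hasNext() {
        return base.hasNext();
    }

    @Override
    public String next() {
        String token = base.next();
        // Tokens without a mapping are returned unchanged.
        return replacements.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("colour", "color"); // made-up mapping for illustration
        Iterator<String> it = new ReplacingIteratorSketch(
            Arrays.asList("the", "colour", "red").iterator(), map);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

Since the wrapper emits exactly one token for each token it consumes, any placeholder tokens produced by an ordered iterator pass through in their original positions.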
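public static final String EMPTY_TOKEN
The signifier that stands in place of a token that has been removed from an
iterator's token stream by means of a TokenFilter. Tokens returned by
tokenizeOrdered may be checked against this value to determine whether a
token at that position in the stream would have been returned but was removed.

public static final String TOKEN_FILTER_PROPERTY
Specifies the TokenFilter to apply to all iterators generated by this factory.

public static final String STEMMER_PROPERTY
Specifies the Stemmer to use on tokens. If not set, no stemming is done.

public static final String COMPOUND_TOKENS_FILE_PROPERTY
Specifies the name of a file that contains all the recognized compound tokens.

public static final String TOKEN_REPLACEMENT_FILE_PROPERTY
Specifies the name of a file which contains term replacement mappings for a
WordReplacementIterator.

public static final String TOKEN_COUNT_LIMIT_PROPERTY
Specifies an upper limit on the number of tokens each iterator can return.

public static final Set<String> ITERATOR_FACTORY_PROPERTIES
The set of all the factory properties supported for configuration by the
IteratorFactory.

public static void setProperties(Properties props)
Reconfigures the type of iterator returned by this factory based on the
specified properties.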
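public static void setResourceFinder(ResourceFinder finder)
Sets the ResourceFinder used by the iterator factory to locate its file-based
resources when configuring the tokenization. This method should be called
prior to calling setProperties to ensure that the resources are accessed
correctly. Most applications will never need to call this method.
Parameters:
finder - the resource finder used to find and open file-based resources

public static Iterator<String> tokenize(BufferedReader reader)
Tokenizes the contents of the reader according to the system configuration
and returns an iterator over all the tokens, excluding those that were
removed by any configured TokenFilter.
Parameters:
reader - a reader whose contents are to be tokenized

public static Iterator<String> tokenize(String str)
Tokenizes the contents of the string according to the system configuration
and returns an iterator over all the tokens, excluding those that were
removed by any configured TokenFilter.
Parameters:
str - a string whose contents are to be tokenized

public static Iterator<String> tokenizeOrdered(BufferedReader reader)
Tokenizes the contents of the reader according to the system configuration
and returns an iterator over all the tokens where any removed tokens have
been replaced with the IteratorFactory.EMPTY_TOKEN value. Tokens returned by
this method may be checked against this value to determine whether a token at
that position in the stream would have been returned but was removed. In
this way, the original order and positioning are retained.
Parameters:
reader - a reader whose contents are to be tokenized
Returns:
an iterator over all the tokens where any removed tokens have been replaced
with the IteratorFactory.EMPTY_TOKEN value

public static Iterator<String> tokenizeOrdered(String str)
Tokenizes the contents of the string according to the system configuration
and returns an iterator over all the tokens where any removed tokens have
been replaced with the IteratorFactory.EMPTY_TOKEN value. Tokens returned by
this method may be checked against this value to determine whether a token at
that position in the stream would have been returned but was removed. In
this way, the original order and positioning are retained.
Parameters:
str - a string whose contents are to be tokenized
Returns:
an iterator over all the tokens where any removed tokens have been replaced
with the IteratorFactory.EMPTY_TOKEN value

public static Iterator<String> tokenizeOrderedWithReplacement(BufferedReader reader)
Wraps an iterator returned by tokenizeOrdered to also include term
replacement of tokens. Terms will be replaced based on a mapping provided
through the system configuration.
Parameters:
reader - a reader whose contents are to be tokenized
Returns:
an iterator over all the tokens where any removed tokens have been replaced
with the IteratorFactory.EMPTY_TOKEN value, and tokens may be replaced based
on system configuration

Copyright © 2012. All Rights Reserved.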