public class TokenFilter extends Object
An inclusive filter will accept only those tokens with which it was initialized. For example, an inclusive filter initialized with all of the words in the English dictionary would exclude all misspellings or foreign words in a token stream.
An exclusive filter will accept only those tokens that are not in the set with which it was initialized. An exclusive filter is often used with a list of common words that should be excluded, also known as a "stop list."
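The two modes can be illustrated with a minimal sketch. This is not the library's source: the class AcceptRuleSketch and its accept helper are hypothetical stand-ins that only mirror the behavior described above, with a plain java.util.Set standing in for the filter's token set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in, not part of the documented API.
final class AcceptRuleSketch {

    // Inclusive mode (excludeTokens == false): accept only members of the set.
    // Exclusive mode (excludeTokens == true): accept only non-members.
    static boolean accept(Set<String> tokens, boolean excludeTokens, String token) {
        return tokens.contains(token) != excludeTokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("cat", "dog"));
        Set<String> stopList = new HashSet<>(Arrays.asList("the", "of"));

        System.out.println(accept(dictionary, false, "cat"));  // inclusive, word in the set
        System.out.println(accept(dictionary, false, "catt")); // inclusive, misspelling
        System.out.println(accept(stopList, true, "the"));     // exclusive, stop word
        System.out.println(accept(stopList, true, "cat"));     // exclusive, any other word
    }
}
```

Run as written, the sketch prints true, false, false, true.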
TokenFilter instances may be combined into a linear chain of filters. This allows a highly configurable filter to be built from multiple rules. Chained filters are created in a linear order, and every filter in the chain must accept the token for the last filter to return true. If any of the earlier filters returns false, then the token is not accepted.
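The chaining rule can be sketched as follows. ChainedFilterSketch is a hypothetical stand-in written for illustration, not the TokenFilter implementation; it assumes that each filter consults its parent first and then applies its own rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in that mirrors the documented chaining behavior.
final class ChainedFilterSketch {
    private final Set<String> tokens;
    private final boolean excludeTokens;
    private final ChainedFilterSketch parent; // null for the first filter in the chain

    ChainedFilterSketch(Set<String> tokens, boolean excludeTokens, ChainedFilterSketch parent) {
        this.tokens = tokens;
        this.excludeTokens = excludeTokens;
        this.parent = parent;
    }

    // A token is accepted only if every earlier filter accepts it
    // and this filter's own rule accepts it as well.
    boolean accept(String token) {
        if (parent != null && !parent.accept(token)) {
            return false;
        }
        return tokens.contains(token) != excludeTokens;
    }

    public static void main(String[] args) {
        // First rule: include only these tokens.
        ChainedFilterSketch include = new ChainedFilterSketch(
            new HashSet<>(Arrays.asList("the", "cat", "sat")), false, null);
        // Second rule: of what the first rule accepts, drop the stop word.
        ChainedFilterSketch chain = new ChainedFilterSketch(
            new HashSet<>(Arrays.asList("the")), true, include);

        System.out.println(chain.accept("cat")); // accepted by both rules
        System.out.println(chain.accept("the")); // rejected by the exclusive rule
        System.out.println(chain.accept("dog")); // rejected by the inclusive rule
    }
}
```

Run as written, the sketch prints true, false, false: only "cat" passes both rules.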
This class also provides a static utility function, loadFromSpecification, for initializing a chain of filters from a text configuration. This is intended to facilitate command-line tools that want to provide easily configurable filters. An example configuration might look like:
include=top-tokens.txt:test-words.txt,exclude=stop-words.txt
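To make the shape of such a configuration concrete, here is a small parsing sketch. It assumes the delimiters used in the example above (a comma between behaviors, an equals sign after the behavior name, colons between file names). FilterSpecSketch and Rule are hypothetical names, and the error handling is an assumption rather than documented behavior of loadFromSpecification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for the configuration format; it performs no file I/O.
final class FilterSpecSketch {

    // One parsed rule: whether it excludes, and the files it names.
    static final class Rule {
        final boolean exclude;
        final List<String> files;

        Rule(boolean exclude, List<String> files) {
            this.exclude = exclude;
            this.files = files;
        }
    }

    // Splits the configuration into rules, preserving the order in which they appear.
    static List<Rule> parse(String configuration) {
        List<Rule> rules = new ArrayList<>();
        for (String part : configuration.split(",")) {
            String[] behaviorAndFiles = part.split("=", 2);
            if (behaviorAndFiles.length != 2) {
                throw new IllegalArgumentException("expected behavior=files but got: " + part);
            }
            String behavior = behaviorAndFiles[0].trim();
            boolean exclude;
            if (behavior.equals("exclude")) {
                exclude = true;
            } else if (behavior.equals("include")) {
                exclude = false;
            } else {
                throw new IllegalArgumentException("unknown behavior: " + behavior);
            }
            rules.add(new Rule(exclude, Arrays.asList(behaviorAndFiles[1].split(":"))));
        }
        return rules;
    }

    public static void main(String[] args) {
        String spec = "include=top-tokens.txt:test-words.txt,exclude=stop-words.txt";
        for (Rule rule : parse(spec)) {
            System.out.println((rule.exclude ? "exclude " : "include ") + rule.files);
        }
    }
}
```

Because a colon separates file names in this format, the sketch cannot represent a path that itself contains a colon.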
See Also: FilteredIterator
Constructor and Description |
---|
TokenFilter(Set<String> tokens): Constructs a filter that accepts only those tokens present in tokens. |
TokenFilter(Set<String> tokens, boolean excludeTokens): Constructs a filter using tokens that, if excludeTokens is false, will accept those in tokens, or, if excludeTokens is true, will accept those that are not in tokens. |
TokenFilter(Set<String> tokens, boolean excludeTokens, TokenFilter parent): Constructs a chained filter that accepts the subset of what the parent accepts after applying its own filter to any tokens that the parent accepts. |
Modifier and Type | Method and Description |
---|---|
boolean | accept(String token): Returns true if the token is valid according to the configuration of this filter. |
TokenFilter | combine(TokenFilter parent): Creates a chained filter by accepting the subset of whatever parent accepts, minus the tokens that this filter rejects. |
static TokenFilter | loadFromSpecification(String configuration): Loads a series of chained TokenFilter instances from the specified configuration string. |
static TokenFilter | loadFromSpecification(String configuration, ResourceFinder finder): Loads a series of chained TokenFilter instances from the specified configuration string using the provided ResourceFinder to locate the resources. |
public TokenFilter(Set<String> tokens)
Constructs a filter that accepts only those tokens present in tokens.

public TokenFilter(Set<String> tokens, boolean excludeTokens)
Constructs a filter using tokens that, if excludeTokens is false, will accept those in tokens, or, if excludeTokens is true, will accept those that are not in tokens.
Parameters:
tokens - the set of tokens to use in filtering the output
excludeTokens - true if tokens in tokens should be excluded, false if only tokens in tokens should be included

public TokenFilter(Set<String> tokens, boolean excludeTokens, TokenFilter parent)
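Constructs a chained filter that accepts the subset of what the parent accepts after applying its own filter to any tokens that the parent accepts.
Parameters:
tokens - the set of tokens to use in filtering the output
excludeTokens - true if tokens in tokens should be excluded, false if only tokens in tokens should be included
parent - a filter to be applied before determining whether a token is to be accepted

public boolean accept(String token)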
Returns true if the token is valid according to the configuration of this filter.
Parameters:
token - a token to be considered
Returns:
true if this token is valid

public TokenFilter combine(TokenFilter parent)
Creates a chained filter by accepting the subset of whatever parent accepts, minus the tokens that this filter rejects.
Parameters:
parent - a filter to be applied before determining whether a token is to be accepted
Returns:
null if one had not been assigned

public static TokenFilter loadFromSpecification(String configuration)
Loads a series of chained TokenFilter instances from the specified configuration string. This method will assume that all specified resources exist on the local file system.
A configuration lists sets of files that contain tokens to be included or excluded. The behavior, include or exclude, is specified first, followed by one or more file names, each separated by colons. Multiple behaviors may be specified one after the other, using a , character to separate them. For example, a typical configuration may look like: "include=top-tokens.txt:test-words.txt,exclude=stop-words.txt"
Note that behaviors are applied in the order in which they are presented on the command line.
Parameters:
configuration - a token filter configuration
Returns:
null if the configuration did not specify any filters
Throws:
IOError - if any error occurs when reading the word list files

public static TokenFilter loadFromSpecification(String configuration, ResourceFinder finder)
Loads a series of chained TokenFilter instances from the specified configuration string using the provided ResourceFinder to locate the resources. This method is provided for applications that need to load resources from a custom environment or file system.
A configuration lists sets of files that contain tokens to be included or excluded. The behavior, include or exclude, is specified first, followed by one or more file names, each separated by colons. Multiple behaviors may be specified one after the other, using a , character to separate them. For example, a typical configuration may look like: "include=top-tokens.txt:test-words.txt,exclude=stop-words.txt"
Note that behaviors are applied in the order in which they are presented on the command line.
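The idea of resolving resource names through a caller-supplied lookup, and applying behaviors in order, can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: SpecChainSketch is a hypothetical class, a java.util.function.Function from file name to token set stands in for ResourceFinder (whose methods are not described here), and the word lists are held in memory instead of being read from files.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in: builds a filter chain from a configuration string.
final class SpecChainSketch {
    private final Set<String> tokens;
    private final boolean excludeTokens;
    private final SpecChainSketch parent; // the filter built from the preceding behavior

    private SpecChainSketch(Set<String> tokens, boolean excludeTokens, SpecChainSketch parent) {
        this.tokens = tokens;
        this.excludeTokens = excludeTokens;
        this.parent = parent;
    }

    boolean accept(String token) {
        return (parent == null || parent.accept(token))
            && tokens.contains(token) != excludeTokens;
    }

    // Builds one filter per behavior, in the order listed, each chained to the
    // previous one. Returns null when the configuration specifies no filters.
    static SpecChainSketch load(String configuration, Function<String, Set<String>> lookup) {
        if (configuration.trim().isEmpty()) {
            return null;
        }
        SpecChainSketch chain = null;
        for (String part : configuration.split(",")) {
            String[] behaviorAndFiles = part.split("=", 2);
            if (behaviorAndFiles.length != 2) {
                throw new IllegalArgumentException("expected behavior=files but got: " + part);
            }
            String behavior = behaviorAndFiles[0].trim();
            if (!behavior.equals("include") && !behavior.equals("exclude")) {
                throw new IllegalArgumentException("unknown behavior: " + behavior);
            }
            // Merge the tokens of every file named for this behavior.
            Set<String> merged = new HashSet<>();
            for (String name : behaviorAndFiles[1].split(":")) {
                Set<String> found = lookup.apply(name);
                if (found == null) {
                    throw new IllegalArgumentException("unknown resource: " + name);
                }
                merged.addAll(found);
            }
            chain = new SpecChainSketch(merged, behavior.equals("exclude"), chain);
        }
        return chain;
    }

    public static void main(String[] args) {
        // In-memory word lists stand in for files found by a custom resource lookup.
        Map<String, Set<String>> resources = new HashMap<>();
        resources.put("top-tokens.txt", new HashSet<>(Arrays.asList("the", "cat", "sat")));
        resources.put("stop-words.txt", new HashSet<>(Arrays.asList("the")));

        SpecChainSketch filter =
            load("include=top-tokens.txt,exclude=stop-words.txt", resources::get);
        System.out.println(filter.accept("cat"));
        System.out.println(filter.accept("the"));
    }
}
```

Passing the lookup as a parameter keeps the chain-building logic independent of where the word lists live, which is the role the ResourceFinder argument plays in the method documented here.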
Parameters:
configuration - a token filter configuration
finder - the ResourceFinder used to locate the file resources specified in the configuration string
Returns:
null if the configuration did not specify any filters
Throws:
IOError - if any error occurs when reading the word list files

Copyright © 2012. All Rights Reserved.