public class StreamingWordsi extends BaseWordsi
A Wordsi implementation that uses streaming, or online,
clustering algorithms. This model immediately assigns each context vector to
one of the clusters generated for a particular focus word, or creates a new
cluster if needed. After processing is complete, the AssignmentReporter
is informed of all the data point assignments made
by the clustering algorithm for each word.
Constructor and Description |
---|
StreamingWordsi(Set<String> acceptedWords, ContextExtractor extractor, Generator<OnlineClustering<SparseDoubleVector>> clusterGenerator, AssignmentReporter reporter, int numClusters) Creates a new StreamingWordsi. |
Modifier and Type | Method and Description |
---|---|
SparseDoubleVector | getVector(String term) Returns the semantic vector for the provided word. |
Set<String> | getWords() Returns the set of words that are represented in this semantic space. |
void | handleContextVector(String focusKey, String secondaryKey, SparseDoubleVector context) Performs some operation with contextVector, which can be indexed by either primaryKey, secondaryKey, or both. |
void | processSpace(Properties props) Once all the documents have been processed, performs any post-processing steps on the data. |
Methods inherited from class BaseWordsi: acceptWord, getSpaceName, getVectorLength, processDocument
public StreamingWordsi(Set<String> acceptedWords, ContextExtractor extractor, Generator<OnlineClustering<SparseDoubleVector>> clusterGenerator, AssignmentReporter reporter, int numClusters)
Creates a new StreamingWordsi.
acceptedWords - The set of words that Wordsi should represent. This may be null or empty.
extractor - The ContextExtractor used to parse documents.
trackSecondaryKeys - If true, cluster assignments and secondary keys will be tracked. If this is false, the AssignmentReporter will not be used.
clusterGenerator - A Generator responsible for creating new instances of an OnlineClustering algorithm.
reporter - The AssignmentReporter responsible for generating a report that details the cluster assignments. This may be null. If trackSecondaryKeys is false, this is not used.
public Set<String> getWords()
public SparseDoubleVector getVector(String term)
term - a word that may be in the semantic space
Returns the Vector for the provided word, or null if the word was not in the space.
public void handleContextVector(String focusKey, String secondaryKey, SparseDoubleVector context)
Performs some operation with contextVector, which can be indexed
by either primaryKey, secondaryKey, or both. This operation will likely
assign the contextVector to some cluster immediately, or store the
contextVector so that it may be clustered with all other context vectors
generated for primaryKey.
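The immediate-assignment behavior described above can be sketched in plain Java. This is a minimal illustration, not the S-Space implementation: OnlineClusteringSketch, its cosine-similarity threshold, and the running-mean centroid update are all assumptions made for the example, and the OnlineClustering algorithm actually supplied by clusterGenerator may behave differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of online clustering; not part of the S-Space API. */
final class OnlineClusteringSketch {
    private final List<double[]> centroids = new ArrayList<>();
    private final List<Integer> counts = new ArrayList<>();
    private final int maxClusters;   // assumed upper bound on the number of clusters
    private final double threshold;  // assumed minimum cosine similarity to join a cluster

    OnlineClusteringSketch(int maxClusters, double threshold) {
        if (maxClusters < 1) {
            throw new IllegalArgumentException("maxClusters must be at least 1");
        }
        this.maxClusters = maxClusters;
        this.threshold = threshold;
    }

    /** Assigns the vector to a cluster right away and returns that cluster's id. */
    int assign(double[] v) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double sim = cosine(centroids.get(i), v);
            if (sim > bestSim) {
                bestSim = sim;
                best = i;
            }
        }
        boolean canGrow = centroids.size() < maxClusters;
        // Create a new cluster when none exists yet, or when the best match
        // is too dissimilar and there is still room for another cluster.
        if (best == -1 || (bestSim < threshold && canGrow)) {
            centroids.add(v.clone());
            counts.add(1);
            return centroids.size() - 1;
        }
        // Otherwise fold the vector into the running mean of the best cluster.
        double[] c = centroids.get(best);
        int n = counts.get(best) + 1;
        for (int d = 0; d < c.length; d++) {
            c[d] += (v[d] - c[d]) / n;
        }
        counts.set(best, n);
        return best;
    }

    int size() {
        return centroids.size();
    }

    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        OnlineClusteringSketch clusters = new OnlineClusteringSketch(2, 0.5);
        System.out.println(clusters.assign(new double[] {1, 0, 0}));     // prints 0
        System.out.println(clusters.assign(new double[] {0.9, 0.1, 0})); // prints 0
        System.out.println(clusters.assign(new double[] {0, 1, 0}));     // prints 1
        System.out.println(clusters.assign(new double[] {0, 0, 1}));     // prints 0
    }
}
```

In the sketch, the last vector matches neither cluster well, but the cluster limit has been reached, so it is folded into the most similar existing cluster instead of opening a third one.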
The secondaryKey does not need to be used, but some experiments
may require it, such as the SenseEval/SemEval evaluation or pseudo-word
disambiguation. For SenseEval/SemEval evaluations, a SenseEvalContextExtractor
should be used, which will provide the context id as the secondaryKey;
reporting should be done with a SenseEvalReporter. For pseudo-word
disambiguation/discrimination, a PseudoWordContextExtractor should be used,
which will create pseudo-words for some set of tokens. This extractor will
use the pseudo-word for the primaryKey and the original token as the
secondaryKey.
focusKey - The primary key for contextVector
context - a SparseDoubleVector that represents a single context for a word
public void processSpace(Properties props)
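Once all the documents have been processed, performs any post-processing
steps on the data. Values for any exposed parameters may be supplied through
the properties argument.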
By general contract, once this method has been called, processDocument
will not be called again.
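The contract above, where assignments are collected while documents are processed and reported only once processing is complete, could look roughly like the following. DeferredReportSketch, its Reporter interface, and the record and finish method names are hypothetical stand-ins, not S-Space types. Rejecting late calls with an exception is a design choice of the sketch; the source only states that processDocument will not be called again.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of deferred assignment reporting; not part of the S-Space API. */
final class DeferredReportSketch {

    /** Hypothetical stand-in for a reporter of cluster assignments. */
    interface Reporter {
        void report(String primaryKey, String secondaryKey, int clusterId);
    }

    /** One recorded assignment: the secondary key and the cluster it landed in. */
    private static final class Assignment {
        final String secondaryKey;
        final int clusterId;

        Assignment(String secondaryKey, int clusterId) {
            this.secondaryKey = secondaryKey;
            this.clusterId = clusterId;
        }
    }

    // Assignments grouped by primary key, kept in insertion order.
    private final Map<String, List<Assignment>> assignments = new LinkedHashMap<>();
    private final Reporter reporter;  // may be null, in which case nothing is reported
    private boolean finished = false;

    DeferredReportSketch(Reporter reporter) {
        this.reporter = reporter;
    }

    /** Records an assignment made while documents are still being processed. */
    void record(String primaryKey, String secondaryKey, int clusterId) {
        if (finished) {
            throw new IllegalStateException("processing has already finished");
        }
        assignments.computeIfAbsent(primaryKey, k -> new ArrayList<>())
                   .add(new Assignment(secondaryKey, clusterId));
    }

    /** Post-processing step: hands every recorded assignment to the reporter. */
    void finish() {
        finished = true;
        if (reporter == null) {
            return;
        }
        for (Map.Entry<String, List<Assignment>> e : assignments.entrySet()) {
            for (Assignment a : e.getValue()) {
                reporter.report(e.getKey(), a.secondaryKey, a.clusterId);
            }
        }
    }

    public static void main(String[] args) {
        DeferredReportSketch sketch = new DeferredReportSketch(
            (p, s, c) -> System.out.println(p + " " + s + " -> cluster " + c));
        sketch.record("cat", "ctx1", 0);
        sketch.record("cat", "ctx2", 1);
        sketch.finish();  // the reporter sees both assignments only now
    }
}
```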
props - a set of properties and values that may be used to
configure any exposed parameters of the algorithm.
Copyright © 2012. All Rights Reserved.