public class StreamingWordsi extends BaseWordsi
A Wordsi implementation that uses streaming, or online,
clustering algorithms. This model immediately assigns each context vector to
one of the clusters generated for a particular focus word, or creates a new
cluster if needed. After processing is complete, the AssignmentReporter is
informed of all the data point assignments made by the clustering algorithm
for each word.

| Constructor and Description |
|---|
| StreamingWordsi(Set<String> acceptedWords, ContextExtractor extractor, Generator<OnlineClustering<SparseDoubleVector>> clusterGenerator, AssignmentReporter reporter, int numClusters) Creates a new StreamingWordsi. |
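
The streaming behavior described for this class (assign each context vector to an existing cluster at once, or open a new cluster if needed) can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `StreamingClustererSketch`, the cosine similarity measure, and the `threshold` parameter are hypothetical and are not the OnlineClustering API; `maxClusters` only plays the role that `numClusters` appears to play in the constructor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an online clustering algorithm; NOT the
// OnlineClustering interface. Sparse vectors are index -> value maps.
class StreamingClustererSketch {
    private final int maxClusters;   // plays the role of numClusters
    private final double threshold;  // assumed similarity cutoff
    private final List<Map<Integer, Double>> centroids = new ArrayList<>();

    StreamingClustererSketch(int maxClusters, double threshold) {
        if (maxClusters < 1) {
            throw new IllegalArgumentException("maxClusters must be >= 1");
        }
        this.maxClusters = maxClusters;
        this.threshold = threshold;
    }

    // Assigns the vector immediately and returns the id of its cluster.
    int assign(Map<Integer, Double> v) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double sim = cosine(centroids.get(i), v);
            if (sim > bestSim) {
                bestSim = sim;
                best = i;
            }
        }
        // Open a new cluster when nothing is similar enough and there is room.
        if ((best == -1 || bestSim < threshold) && centroids.size() < maxClusters) {
            centroids.add(new HashMap<>(v));
            return centroids.size() - 1;
        }
        // Otherwise fold the vector into the closest centroid (running sum).
        Map<Integer, Double> c = centroids.get(best);
        for (Map.Entry<Integer, Double> e : v.entrySet()) {
            c.merge(e.getKey(), e.getValue(), Double::sum);
        }
        return best;
    }

    // Cosine similarity; defined as 0 when either vector has zero length.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double x : b.values()) {
            nb += x * x;
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        StreamingClustererSketch c = new StreamingClustererSketch(2, 0.5);
        System.out.println(c.assign(Map.of(0, 1.0, 1, 1.0))); // opens the first cluster
        System.out.println(c.assign(Map.of(5, 2.0)));         // dissimilar, opens a second
        System.out.println(c.assign(Map.of(0, 3.0)));         // joins the closest cluster
    }
}
```

Once the cluster limit is reached, this sketch always merges into the closest centroid, even for a dissimilar vector. A real online algorithm may handle that case differently.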
| Modifier and Type | Method and Description |
|---|---|
| SparseDoubleVector | getVector(String term) Returns the semantic vector for the provided word. |
| Set<String> | getWords() Returns the set of words that are represented in this semantic space. |
| void | handleContextVector(String focusKey, String secondaryKey, SparseDoubleVector context) Performs some operation with contextVector, which can be indexed by either primaryKey, secondaryKey, or both. |
| void | processSpace(Properties props) Once all the documents have been processed, performs any post-processing steps on the data. |
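
The way these methods fit together (one clusterer per focus word, immediate assignment in handleContextVector, and reporting once processing is complete) can be sketched as follows. This is only an illustration under stated assumptions: `StreamingFlowSketch`, its nested `Clusterer` and `Reporter` interfaces, and the use of `double[]` for a context are hypothetical stand-ins, not the actual AssignmentReporter, OnlineClustering, or SparseDoubleVector types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the streaming flow. The nested interfaces are
// illustrative stand-ins, NOT the real AssignmentReporter or
// OnlineClustering types.
class StreamingFlowSketch {
    interface Clusterer { int assign(double[] context); }
    interface Reporter { void report(String focusKey, String secondaryKey, int clusterId); }

    // One recorded assignment, kept until processSpace is called.
    private static final class Assignment {
        final String focusKey;
        final String secondaryKey;
        final int clusterId;

        Assignment(String focusKey, String secondaryKey, int clusterId) {
            this.focusKey = focusKey;
            this.secondaryKey = secondaryKey;
            this.clusterId = clusterId;
        }
    }

    private final Supplier<Clusterer> clusterGenerator;
    private final Reporter reporter; // may be null; then nothing is tracked
    private final Map<String, Clusterer> clusterers = new HashMap<>();
    private final List<Assignment> assignments = new ArrayList<>();

    StreamingFlowSketch(Supplier<Clusterer> clusterGenerator, Reporter reporter) {
        this.clusterGenerator = clusterGenerator;
        this.reporter = reporter;
    }

    // Clusters the context right away, using one clusterer per focus word.
    void handleContextVector(String focusKey, String secondaryKey, double[] context) {
        Clusterer c = clusterers.computeIfAbsent(focusKey, k -> clusterGenerator.get());
        int id = c.assign(context);
        if (reporter != null) {
            assignments.add(new Assignment(focusKey, secondaryKey, id));
        }
    }

    // Hands every recorded assignment to the reporter, in arrival order.
    void processSpace() {
        if (reporter == null) {
            return;
        }
        for (Assignment a : assignments) {
            reporter.report(a.focusKey, a.secondaryKey, a.clusterId);
        }
    }

    public static void main(String[] args) {
        // Toy clusterer: cluster 0 for a non-negative first component, else 1.
        StreamingFlowSketch model = new StreamingFlowSketch(
            () -> context -> context[0] >= 0 ? 0 : 1,
            (focus, secondary, id) ->
                System.out.println(focus + " " + secondary + " -> " + id));
        model.handleContextVector("bank", "ctx-1", new double[] {1.0, 0.0});
        model.handleContextVector("bank", "ctx-2", new double[] {-1.0, 0.5});
        model.processSpace();
    }
}
```

Deferring the report until processSpace mirrors the class description, in which the reporter is informed only after processing is complete. Whether the real implementation buffers assignments in this way is an assumption.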
Methods inherited from BaseWordsi: acceptWord, getSpaceName, getVectorLength, processDocument

public StreamingWordsi(Set<String> acceptedWords, ContextExtractor extractor, Generator<OnlineClustering<SparseDoubleVector>> clusterGenerator, AssignmentReporter reporter, int numClusters)
Creates a new StreamingWordsi.
Parameters:
acceptedWords - The set of words that Wordsi should
represent. This may be null or empty.
extractor - The ContextExtractor used to parse documents.
trackSecondaryKeys - If true, cluster assignments and secondary keys
will be tracked. If this is false, the AssignmentReporter
will not be used.
clusterGenerator - A Generator responsible for creating new
instances of an OnlineClustering algorithm.
reporter - The AssignmentReporter responsible for generating
a report that details the cluster assignments. This may be null. If trackSecondaryKeys is false, this is not used.

public Set<String> getWords()
Returns the set of words that are represented in this semantic space.

public SparseDoubleVector getVector(String term)
Returns the semantic vector for the provided word.
Parameters:
term - a word that may be in the semantic space
Returns:
The Vector for the provided word, or null if the
word was not in the space.

public void handleContextVector(String focusKey, String secondaryKey, SparseDoubleVector context)
Performs some operation with contextVector, which can be indexed
by either primaryKey, secondaryKey, or both. This
operation will likely assign the contextVector to some cluster
immediately or store the contextVector so that it may be
clustered with all the other context vectors generated for primaryKey.
In this method's signature, primaryKey corresponds to the focusKey
parameter and contextVector to the context parameter.

The secondaryKey does not need to be used, but some experiments
may require it, such as the SenseEval/SemEval evaluation or pseudo-word
disambiguation. For SenseEval/SemEval evaluations, a SenseEvalContextExtractor
should be used, which will provide the context id as the secondaryKey;
reporting should be done with a SenseEvalReporter. For pseudo-word
disambiguation/discrimination, a PseudoWordContextExtractor should be used,
which will create pseudo-words for some set of tokens. This extractor will
use the pseudo-word for the primaryKey and the original token as the
secondaryKey.
Parameters:
focusKey - The primary key for contextVector
context - a SparseDoubleVector that represents a
single context for a word

public void processSpace(Properties props)
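Once all the documents have been processed, performs any post-processing
steps on the data. Values for any exposed parameters may be specified using
the properties argument.
By general contract, once this method has been called, processDocument will not be called again.
Parameters:
props - a set of properties and values that may be used to
configure any exposed parameters of the algorithm.

Copyright © 2012. All Rights Reserved.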