public interface Wordsi
A Wordsi implementation will likely contain four parts: a ContextExtractor, a clustering method, a ContextAssignmentMap, and an AssignmentReporter. The extractor will generate context vectors for a set of words within a given BufferedReader and call handleContextVector
for each context vector that is generated. Each context
vector can be indexed by two keys: the primary key, which is generally the focus word for the context vector, and the secondary key, which is either the same as the focus word or an additional value such as a SenseEval/SemEval instance identifier. The ContextAssignmentMap is responsible for recording which secondary keys and context ids are assigned to each focus term. In many cases this is not necessary, but if the exact clustering for each context is required, one should use a ContextAssignmentMap. The
clustering method will assign the context vector to some cluster, either
immediately or by storing the context vectors and performing a batch
clustering. The AssignmentReporter is responsible for reporting which context vectors were assigned to which clusters. The major components of Wordsi are separated so that various context extraction algorithms can be combined with various clustering algorithms and reporting methods.
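The flow described above, in which an extractor reads a BufferedReader and calls handleContextVector once per generated context, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's code: MiniWordsi, WindowExtractor, CollectingWordsi, and the processDocument method are hypothetical names, and a Map<Integer, Double> stands in for SparseDoubleVector so that the example compiles without the library on the classpath.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the Wordsi interface; Map<Integer, Double>
// replaces SparseDoubleVector (an assumption made for self-containment).
interface MiniWordsi {
    boolean acceptWord(String word);
    void handleContextVector(String primaryKey, String secondaryKey,
                             Map<Integer, Double> contextVector);
}

// Hypothetical extractor: treats each line as a document and builds a
// bag-of-words context from the tokens within `window` positions of each
// accepted focus word.
class WindowExtractor {
    private final Map<String, Integer> featureIndex = new HashMap<>();
    private final int window;

    WindowExtractor(int window) { this.window = window; }

    void processDocument(BufferedReader reader, MiniWordsi wordsi) {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                for (int i = 0; i < tokens.length; i++) {
                    if (tokens[i].isEmpty() || !wordsi.acceptWord(tokens[i])) continue;
                    Map<Integer, Double> vector = new HashMap<>();
                    int start = Math.max(0, i - window);
                    int end = Math.min(tokens.length - 1, i + window);
                    for (int j = start; j <= end; j++) {
                        if (j == i) continue; // skip the focus word itself
                        Integer dim = featureIndex.get(tokens[j]);
                        if (dim == null) {
                            dim = featureIndex.size();
                            featureIndex.put(tokens[j], dim);
                        }
                        vector.merge(dim, 1.0, Double::sum);
                    }
                    // Here the focus word serves as both keys.
                    wordsi.handleContextVector(tokens[i], tokens[i], vector);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical implementation that stores every context for later batch clustering.
class CollectingWordsi implements MiniWordsi {
    private final Set<String> focusWords;
    final Map<String, List<Map<Integer, Double>>> contexts = new HashMap<>();

    CollectingWordsi(Set<String> focusWords) { this.focusWords = focusWords; }

    public boolean acceptWord(String word) { return focusWords.contains(word); }

    public void handleContextVector(String primaryKey, String secondaryKey,
                                    Map<Integer, Double> contextVector) {
        contexts.computeIfAbsent(primaryKey, k -> new ArrayList<>()).add(contextVector);
    }
}

class WordsiPipelineDemo {
    public static void main(String[] args) {
        CollectingWordsi wordsi = new CollectingWordsi(Collections.singleton("bank"));
        String corpus = "the river bank was muddy\nshe went to the bank today";
        new WindowExtractor(2).processDocument(new BufferedReader(new StringReader(corpus)), wordsi);
        System.out.println(wordsi.contexts.get("bank").size() + " contexts collected for 'bank'");
    }
}
```

Because the extractor only depends on the two interface methods, a different extraction strategy or a different clustering back end could be swapped in without touching the other side.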
Implementations are encouraged to subclass BaseWordsi, since it provides some methods for accepting and rejecting terms and dispatching the ContextExtractor.
See Also: ContextExtractor, AssignmentReporter
Modifier and Type | Method and Description |
---|---|
boolean | acceptWord(String word) Returns true if this Wordsi implementation should generate a semantic vector for word. |
void | handleContextVector(String primaryKey, String secondaryKey, SparseDoubleVector contextVector) Performs some operation with contextVector, which can be indexed by either primaryKey, secondaryKey, or both. |
boolean acceptWord(String word)
Returns true if this Wordsi implementation should generate a semantic vector for word.
void handleContextVector(String primaryKey, String secondaryKey, SparseDoubleVector contextVector)
Performs some operation with contextVector, which can be indexed by either primaryKey, secondaryKey, or both. This operation will likely assign the contextVector to some cluster immediately or store the contextVector so that it may be clustered with all other context vectors generated for primaryKey.
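As an illustration of the "assign immediately" option, the sketch below keeps a separate set of clusters per primaryKey and places each incoming vector in the most similar existing cluster, or opens a new one. Everything here is an assumption made for the example: OnlineHandlerSketch, assignmentsFor, the cosine-similarity threshold rule, and the use of Map<Integer, Double> in place of SparseDoubleVector are hypothetical and are not taken from the library, which may cluster quite differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an immediate-assignment handler.
// Map<Integer, Double> stands in for SparseDoubleVector.
class OnlineHandlerSketch {
    // Minimum cosine similarity needed to join an existing cluster (assumed rule).
    private final double threshold;
    // primaryKey -> one running-sum centroid per cluster
    private final Map<String, List<Map<Integer, Double>>> centroids = new HashMap<>();
    // primaryKey -> "secondaryKey -> clusterId" records, in arrival order
    private final Map<String, List<String>> assignments = new HashMap<>();

    OnlineHandlerSketch(double threshold) { this.threshold = threshold; }

    void handleContextVector(String primaryKey, String secondaryKey,
                             Map<Integer, Double> contextVector) {
        List<Map<Integer, Double>> clusters =
            centroids.computeIfAbsent(primaryKey, k -> new ArrayList<>());

        // Find the most similar existing cluster for this focus word.
        int best = -1;
        double bestSim = -1.0;
        for (int c = 0; c < clusters.size(); c++) {
            double sim = cosine(contextVector, clusters.get(c));
            if (sim > bestSim) { best = c; bestSim = sim; }
        }
        // Open a new cluster when nothing is similar enough.
        if (best < 0 || bestSim < threshold) {
            clusters.add(new HashMap<>());
            best = clusters.size() - 1;
        }
        // Fold the vector into the chosen centroid.
        Map<Integer, Double> centroid = clusters.get(best);
        for (Map.Entry<Integer, Double> e : contextVector.entrySet()) {
            centroid.merge(e.getKey(), e.getValue(), Double::sum);
        }
        // Record which secondary key landed in which cluster.
        assignments.computeIfAbsent(primaryKey, k -> new ArrayList<>())
                   .add(secondaryKey + " -> " + best);
    }

    List<String> assignmentsFor(String primaryKey) {
        List<String> found = assignments.get(primaryKey);
        return found == null ? Collections.<String>emptyList() : found;
    }

    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```

Recording the secondaryKey alongside the cluster index is what lets a later reporting step say which individual context went where; an implementation that never needs that detail could drop the assignments map entirely.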
The secondaryKey does not need to be used, but some experiments may require it, such as the SenseEval/SemEval evaluation or pseudo-word disambiguation. For SenseEval/SemEval evaluations, a SenseEvalContextExtractor should be used, which will provide the context id as the secondaryKey; reporting should be done with a SenseEvalReporter. For pseudo-word disambiguation/discrimination, a PseudoWordContextExtractor should be used, which will create pseudo-words for some set of tokens. This extractor will use the pseudo-word for the primaryKey and the original token as the secondaryKey.
Parameters:
primaryKey - The primary key for contextVector
secondaryKey - A secondary key for contextVector
contextVector - a SparseDoubleVector that represents a single context for a word
Copyright © 2012. All Rights Reserved.