public abstract class GenericWordsiMain extends GenericMain
A base class for Wordsi executables. This class provides base arguments that
nearly all Wordsi executables will require, along with basic processing for
those arguments.
This class provides access to three different word sense modes: online
clustering, offline clustering, and an evaluation mode. For the two
clustering modes, word senses are generated by clustering individual context
vectors. The first mode uses StreamingWordsi and the second mode uses
WaitingWordsi. The third mode assumes that the word senses have already been
learned and are fixed. Individual contexts are labeled with the most similar
word sense.
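As an illustration of the third mode, the following minimal sketch labels a context vector with the index of the most similar fixed sense vector. It does not use the Wordsi classes. The class name `NearestSenseSketch`, the dense `double[]` representation, and the choice of cosine similarity are all assumptions made for the example; the description above only says "most similar".

```java
import java.util.Arrays;
import java.util.List;

class NearestSenseSketch {
    /** Cosine similarity between two dense vectors of equal length (assumed measure). */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // A zero vector has no direction, so treat it as dissimilar to everything.
        if (normA == 0 || normB == 0)
            return 0;
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the index of the sense vector most similar to the context, or -1 if there are no senses. */
    static int labelContext(double[] context, List<double[]> senses) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < senses.size(); s++) {
            double sim = cosine(context, senses.get(s));
            if (sim > bestSim) {
                bestSim = sim;
                best = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two fixed, already learned senses (hypothetical values).
        List<double[]> senses = Arrays.asList(
            new double[] {1, 0, 0},
            new double[] {0, 1, 1});
        System.out.println(labelContext(new double[] {0.1, 0.9, 0.7}, senses)); // prints 1
    }
}
```

The sense vectors are never modified here, which mirrors the statement that the senses are fixed in this mode.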
This class provides access to two evaluation modes: Pseudo Word
Discrimination and the SenseEval/SemEval evaluation. When training a Wordsi
model for a pseudo word task, the -e option should be set with the
"pseudoWord" argument. The -P option should be set so that Wordsi knows which
words form pseudo words. Wordsi will generate a report that specifies how many
times each core word in a pseudo word was assigned to a word sense for the
pseudo word. When running in evaluation mode, the -e option must be set.
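The kind of tally such a report contains can be sketched as follows. This is not the PseudoWordReporter implementation. The class name `PseudoWordReportSketch`, the sense labels such as `bananadoor-0`, and the output layout are hypothetical and only illustrate counting how often each core word lands in each sense of its pseudo word.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class PseudoWordReportSketch {
    // Maps each core (raw) word to its pseudo word, as the -P mapping does.
    private final Map<String, String> pseudoWordMap;
    // pseudo word -> sense label -> core word -> number of assignments
    private final Map<String, Map<String, Map<String, Integer>>> counts = new TreeMap<>();

    PseudoWordReportSketch(Map<String, String> pseudoWordMap) {
        this.pseudoWordMap = pseudoWordMap;
    }

    /** Records that one occurrence of coreWord was assigned to senseLabel. */
    void record(String coreWord, String senseLabel) {
        String pseudoWord = pseudoWordMap.get(coreWord);
        if (pseudoWord == null)
            return; // words outside the mapping are not part of the report
        counts.computeIfAbsent(pseudoWord, k -> new TreeMap<>())
              .computeIfAbsent(senseLabel, k -> new TreeMap<>())
              .merge(coreWord, 1, Integer::sum);
    }

    /** Returns how many times coreWord was assigned to senseLabel. */
    int count(String coreWord, String senseLabel) {
        String pseudoWord = pseudoWordMap.get(coreWord);
        if (pseudoWord == null || !counts.containsKey(pseudoWord))
            return 0;
        Map<String, Integer> perWord = counts.get(pseudoWord).get(senseLabel);
        return perWord == null ? 0 : perWord.getOrDefault(coreWord, 0);
    }

    /** Formats one line per (pseudo word, sense, core word) triple. */
    String report() {
        StringBuilder sb = new StringBuilder();
        counts.forEach((pseudo, senses) ->
            senses.forEach((sense, words) ->
                words.forEach((word, n) ->
                    sb.append(pseudo).append(' ').append(sense).append(' ')
                      .append(word).append(' ').append(n).append('\n'))));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("banana", "bananadoor");
        mapping.put("door", "bananadoor");
        PseudoWordReportSketch reporter = new PseudoWordReportSketch(mapping);
        reporter.record("banana", "bananadoor-0");
        reporter.record("banana", "bananadoor-0");
        reporter.record("door", "bananadoor-1");
        reporter.record("door", "bananadoor-0");
        System.out.print(reporter.report());
    }
}
```

A clean discrimination would put nearly all occurrences of each core word into a different sense; the single `door` occurrence in `bananadoor-0` above is the kind of confusion the report is meant to expose.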
Since Wordsi instances will need to reuse features during training and
testing, the --save and --load options are provided. --save will store any
data structures that are required for generating context vectors. --load will
load these same data structures from disk and re-use them. In general, --save
should be used during training and --load should be used during testing.
Different Wordsi executables will serialize different data structures, but
these will generally be a mapping from strings to some other data type.
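The round trip described above can be sketched with standard Java serialization. The class name `SaveLoadSketch`, the helper names `save` and `load`, and the feature index contents are hypothetical. The sketch only shows a string-keyed map being written during "training" and read back during "testing", not what any particular Wordsi executable actually serializes.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;

class SaveLoadSketch {
    /** Serializes obj to the given file. */
    static void save(File file, Object obj) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(obj);
        }
    }

    /** Deserializes one object from the file and casts it to the caller's expected type. */
    @SuppressWarnings("unchecked")
    static <T> T load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Training: build a mapping from feature strings to vector dimensions.
        Map<String, Integer> featureIndex = new HashMap<>();
        featureIndex.put("cat", 0);
        featureIndex.put("dog", 1);

        File file = File.createTempFile("wordsi-features", ".ser");
        file.deleteOnExit();
        save(file, featureIndex);

        // Testing: reload the same mapping so features keep the same dimensions.
        Map<String, Integer> restored = load(file);
        System.out.println(restored.equals(featureIndex)); // prints true
    }
}
```

Reusing the saved mapping matters because a context vector built at test time is only comparable to the learned senses if each feature occupies the same dimension it had during training.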
GenericMain provides the core options used by this base executable.
This class provides the following additional options:
-s, --streamingClustering=CLASSNAME
    Specifies the streaming clustering algorithm to use for forming word
    senses.
-b, --batchClustering=CLASSNAME
    Specifies the batch clustering algorithm to use for forming word senses.
-e, --evaluationClustering=FILE
    Specifies a trained Wordsi semantic space to be used for evaluation. When
    set, one of the Evaluation Type arguments must be set.
-P, --pseudoWordEvaluation=FILENAME
    Specifies a mapping from raw tokens to their pseudo word token. Only the
    raw tokens in this mapping will be represented in the Wordsi space. A
    PseudoWordReporter will be generated for these pseudo words.
-E, --semEvalEvaluation=STRING
    Signifies that the data files are in the SemEval format and that only test
    instance words should be represented in the Wordsi space. Each line must
    correspond to an instance context and the focus word must be preceded by
    the token given as the argument to this option.
-a, --acceptedWords=FILENAME
    Specifies the set of words which should be represented by Wordsi.
    (Default: all words).
-c, --clusters
    Specifies the desired number of clusters, or word senses. (Default: 0).
-w, --windowSize
    Specifies the number of words, in one direction, that form a valid
    context. For example, a window size of 5 means that up to 5 words before
    and after a focus word are used to form the context. (Default: 5).
-S, --save
    Specifies a file to which all files needed to generate context vectors
    will be serialized.
-L, --load
    Specifies a file from which all files needed to generate context vectors
    will be deserialized.
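To make the window size and accepted words options concrete, the following sketch collects the words around a focus word and applies the "no set means all words" rule. It does not use the Wordsi context generators. The class name `ContextWindowSketch` and both helper methods are hypothetical, and a pre-tokenized list of strings is assumed as input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class ContextWindowSketch {
    /**
     * Returns up to windowSize tokens on each side of the focus position,
     * excluding the focus word itself and clipping at the sentence boundaries.
     */
    static List<String> window(List<String> tokens, int focus, int windowSize) {
        int start = Math.max(0, focus - windowSize);
        int end = Math.min(tokens.size(), focus + windowSize + 1);
        List<String> context = new ArrayList<>();
        for (int i = start; i < end; i++) {
            if (i != focus)
                context.add(tokens.get(i));
        }
        return context;
    }

    /** A null accepted set means every word is represented. */
    static boolean isAccepted(String word, Set<String> acceptedWords) {
        return acceptedWords == null || acceptedWords.contains(word);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog");
        Set<String> accepted = Collections.singleton("fox");

        for (int i = 0; i < tokens.size(); i++) {
            if (isAccepted(tokens.get(i), accepted))
                System.out.println(tokens.get(i) + " -> " + window(tokens, i, 2));
        }
        // prints: fox -> [quick, brown, jumps, over]
    }
}
```

Near the start or end of the token list the window is simply shorter, which matches the "up to 5 words" wording of the window size option.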
Fields inherited from class GenericMain: argOptions, EXT, isMultiThreaded, verbose
Constructor and Description |
---|
GenericWordsiMain() |
Modifier and Type | Method and Description |
---|---|
protected void | addExtraOptions(ArgOptions options) Adds options to the provided ArgOptions instance, which will be used to parse the command line. |
protected ContextExtractor | contextExtractorFromGenerator(ContextGenerator generator) Returns a ContextExtractor that uses the given ContextGenerator, which will process the corpus in the format specified by the command line. |
protected Set<String> | getAcceptedWords() Returns a set of strings that the Wordsi implementations should represent, or null, which signifies that all words should be represented. |
protected Iterator<Document> | getDocumentIterator() Returns the iterator for all of the documents specified on the command line or throws an Error if no documents are specified. |
protected abstract ContextExtractor | getExtractor() Returns a ContextExtractor, which will be responsible for creating context vectors for documents. |
protected Map<String,String> | getPseudoWordMap() Returns a mapping from real tokens to their pseudo word tokens, or null if the -P option is not specified. |
protected SemanticSpace | getSpace() Returns the SemanticSpace that will be used for processing. |
protected <T> T | loadObject(ObjectInputStream inStream) Returns an object of type T from the provided ObjectInputStream. |
protected ObjectInputStream | openLoadFile() Returns an ObjectInputStream for the file referred to by the --load option or null if the option was not used. |
protected ObjectOutputStream | openSaveFile() Returns an ObjectOutputStream for the file referred to by the --save option or null if the option was not used. |
protected void | saveObject(ObjectOutputStream outStream, Object obj) Writes the obj to the given ObjectOutputStream. |
protected int | windowSize() Returns the window size used in a sliding context window. |
Methods inherited from class GenericMain: addCorpusReaderIterators, addDocIterators, addFileIterators, getAlgorithmSpecifics, getSpaceFormat, handleExtraOptions, loadValidTermSet, parseDocumentsMultiThreaded, parseDocumentsSingleThreaded, postProcessing, processDocumentsAndSpace, run, saveSSpace, setupOptions, setupProperties, usage, verbose, verbose
protected void addExtraOptions(ArgOptions options)
Adds options to the provided ArgOptions instance, which will be used to parse
the command line. This method allows subclasses to add extra command line
options.
Overrides: addExtraOptions in class GenericMain
Parameters: options - the ArgOptions object to which more main-specific options can be added.
See Also: GenericMain.handleExtraOptions()
protected abstract ContextExtractor getExtractor()
Returns a ContextExtractor, which will be responsible for creating context
vectors for documents.
protected Set<String> getAcceptedWords()
Returns a set of strings that the Wordsi implementations should represent, or
null, which signifies that all words should be represented.
protected Map<String,String> getPseudoWordMap()
Returns a mapping from real tokens to their pseudo word tokens, or null if the
-P option is not specified.
protected ContextExtractor contextExtractorFromGenerator(ContextGenerator generator)
Returns a ContextExtractor that uses the given ContextGenerator, which will
process the corpus in the format specified by the command line. This is just
a helper function for subclasses implementing getExtractor().
protected int windowSize()
Returns the window size used in a sliding context window.
protected Iterator<Document> getDocumentIterator() throws IOException
Description copied from class: GenericMain
Returns the iterator for all of the documents specified on the command line or
throws an Error if no documents are specified. Subclasses should override
either GenericMain.addFileIterators(Collection<Iterator<Document>>, String[])
or GenericMain.addDocIterators(Collection<Iterator<Document>>, String[]) if
they use a different file format. Alternatively, one can implement a
CorpusReader and use the -R option.
Overrides: getDocumentIterator in class GenericMain
Throws: IOException
protected SemanticSpace getSpace()
Returns the SemanticSpace that will be used for processing. This method is
guaranteed to be called after the command line arguments have been parsed, so
the contents of GenericMain.argOptions are valid.
Overrides: getSpace in class GenericMain
protected ObjectOutputStream openSaveFile()
Returns an ObjectOutputStream for the file referred to by the --save option or
null if the option was not used.
protected ObjectInputStream openLoadFile()
Returns an ObjectInputStream for the file referred to by the --load option or
null if the option was not used.
protected void saveObject(ObjectOutputStream outStream, Object obj)
Writes the obj to the given ObjectOutputStream.
protected <T> T loadObject(ObjectInputStream inStream)
Returns an object of type T from the provided ObjectInputStream. This method
does the casting, so the result should be assigned directly to a variable and
not passed through a ternary operator; otherwise the cast will need to be done
a second time.
Copyright © 2012. All Rights Reserved.