public class LSAMain extends GenericMain
An executable class for running LatentSemanticAnalysis (LSA) from the command line. This class accepts the following command line arguments.
-d, --docFile=FILE[,FILE...]
a file where each line is a document. This is the preferred input format for large corpora.
-f, --fileList=FILE[,FILE...]
a list of document files, where each file is specified on its own line.
--dimensions=<int>
how many dimensions to use for the LSA vectors. See LatentSemanticAnalysis for the default value.
--preprocess=<class name>
specifies an instance of edu.ucla.sspace.lsa.MatrixTransformer to use in preprocessing the word-document matrix compiled by LSA prior to computing the SVD. See LatentSemanticAnalysis for the default value.
-F, --tokenFilter=FILE[include|exclude][,FILE...]
specifies a list of one or more files to use for filtering the documents. An option flag may be added to each file to specify how the words in the filter file should be used: include if only the words in the filter file should be retained in the document; exclude if only the words not in the filter file should be retained in the document.
-S, --svdAlgorithm=SVD.Algorithm
specifies a particular SVD.Algorithm method to use when reducing the dimensionality in LSA. In general, users should not need to specify this option, as the default setting will choose the fastest algorithm available on the system. This is only provided as an advanced option for users who want to compare the algorithms' performance or any variations between the SVD results.
-o, --outputFormat={text|binary}
specifies the output formatting to use when generating the semantic space (.sspace) file. See SemanticSpaceUtils for format details.
-t, --threads=INT
how many threads to use when processing the documents. The default is one per core.
-w, --overwrite=BOOL
specifies whether to overwrite the existing output files. The default is true. If set to false, a unique integer is inserted into the file name.
-v, --verbose
specifies whether to print runtime information to standard out.
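The include and exclude modes of --tokenFilter can be illustrated with a small, self-contained sketch. This is not the library's implementation: the class name TokenFilterSketch, the filter method, and the use of an in-memory word set in place of a filter file are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of include/exclude token filtering.
// Not part of the documented API; names are assumptions.
public class TokenFilterSketch {

    /**
     * Returns the tokens that survive filtering.
     *
     * @param tokens      the words of one document, in order
     * @param filterWords the words listed in the (hypothetical) filter file
     * @param include     true keeps only listed words; false keeps only unlisted words
     */
    public static List<String> filter(List<String> tokens,
                                      Set<String> filterWords,
                                      boolean include) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // A token is retained when its membership matches the mode.
            if (filterWords.contains(token) == include) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("the", "cat", "sat", "on", "the", "mat");
        Set<String> words = Set.of("the", "on");
        System.out.println(filter(doc, words, true));   // prints [the, on, the]
        System.out.println(filter(doc, words, false));  // prints [cat, sat, mat]
    }
}
```

With exclude, the filter file behaves like a stop-word list; with include, it behaves like a fixed vocabulary.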
An invocation will produce one file as output, lsa-semantic-space.sspace. If overwrite was set to true, this file will be replaced for each new semantic space. Otherwise, a new output file of the format lsa-semantic-space<number>.sspace will be created, where <number> is a unique identifier for that program's invocation. The output file will be placed in the directory specified on the command line.
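The output naming rule can be sketched as follows. The documentation does not say how the unique <number> is chosen, so this sketch assumes the smallest non-negative integer that does not collide with an existing file; the class OutputNameSketch and the method chooseOutputFile are hypothetical and not part of the library.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical illustration of the output file naming rule.
// The numbering strategy below is an assumption, not the library's behavior.
public class OutputNameSketch {

    static final String BASE_NAME = "lsa-semantic-space";
    static final String EXTENSION = ".sspace";

    /**
     * Picks the output file inside outputDir.
     * With overwrite, the fixed name is always used; otherwise the first
     * unused name of the form lsa-semantic-space<number>.sspace is returned.
     */
    public static File chooseOutputFile(File outputDir, boolean overwrite) {
        if (overwrite) {
            return new File(outputDir, BASE_NAME + EXTENSION);
        }
        // Assumption: count upward from 0 until a free name is found.
        for (int n = 0; ; n++) {
            File candidate = new File(outputDir, BASE_NAME + n + EXTENSION);
            if (!candidate.exists()) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("sspace-demo").toFile();
        System.out.println(chooseOutputFile(dir, true).getName());

        File first = chooseOutputFile(dir, false);
        first.createNewFile();  // simulate a previous run's output
        System.out.println(first.getName());
        System.out.println(chooseOutputFile(dir, false).getName());
    }
}
```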
This class is designed to run multi-threaded and performs well with one thread per core, which is the default setting.
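A minimal sketch of the one-thread-per-core default, assuming the thread count falls back to Runtime.availableProcessors() when --threads is absent. The class ThreadCountSketch and its methods are hypothetical; the per-document work here is a simple token count that stands in for real document processing.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a one-thread-per-core default.
// Not the library's implementation; names are assumptions.
public class ThreadCountSketch {

    /** Returns the requested thread count, or one per core when none was given. */
    public static int resolveThreads(Integer requested) {
        if (requested == null) {
            return Runtime.getRuntime().availableProcessors();
        }
        if (requested < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        return requested;
    }

    /** Counts whitespace-separated tokens across documents using a fixed pool. */
    public static int countTokens(List<String> documents, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger total = new AtomicInteger();
        for (String doc : documents) {
            // Each document is an independent task, so workers never share a document.
            pool.execute(() -> total.addAndGet(doc.trim().split("\\s+").length));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = resolveThreads(null);  // no --threads value supplied
        List<String> docs = List.of("the cat sat", "on the mat");
        System.out.println(countTokens(docs, threads));  // prints 6
    }
}
```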
See Also: LatentSemanticAnalysis, Transform
Fields inherited from class GenericMain: argOptions, EXT, isMultiThreaded, verbose
Modifier and Type | Method and Description |
---|---|
protected void | addExtraOptions(ArgOptions options) Adds all of the options to the ArgOptions. |
protected String | getAlgorithmSpecifics() Returns a string describing algorithm-specific options and behaviors. |
protected SemanticSpace | getSpace() Returns the SemanticSpace that will be used for processing. |
protected SemanticSpaceIO.SSpaceFormat | getSpaceFormat() Returns the default format of a LatentSemanticAnalysis space. |
static void | main(String[] args) |
protected void | postProcessing() Allows subclasses to interact with the SemanticSpace after the space has finished processing all of the text. |
Methods inherited from class GenericMain: addCorpusReaderIterators, addDocIterators, addFileIterators, getDocumentIterator, handleExtraOptions, loadValidTermSet, parseDocumentsMultiThreaded, parseDocumentsSingleThreaded, processDocumentsAndSpace, run, saveSSpace, setupOptions, setupProperties, usage, verbose, verbose
protected void addExtraOptions(ArgOptions options)
Adds all of the options to the ArgOptions.
Overrides: addExtraOptions in class GenericMain
Parameters: options - the ArgOptions object to which more main-specific options can be added.
See Also: GenericMain.handleExtraOptions()
protected SemanticSpace getSpace()
Description copied from class: GenericMain
Returns the SemanticSpace that will be used for processing. This method is guaranteed to be called after the command line arguments have been parsed, so the contents of GenericMain.argOptions are valid.
Specified by: getSpace in class GenericMain
protected SemanticSpaceIO.SSpaceFormat getSpaceFormat()
Returns the default format of a LatentSemanticAnalysis space.
Overrides: getSpaceFormat in class GenericMain
protected void postProcessing()
Description copied from class: GenericMain
Allows subclasses to interact with the SemanticSpace after the space has finished processing all of the text.
Overrides: postProcessing in class GenericMain
protected String getAlgorithmSpecifics()
Returns a string describing algorithm-specific options and behaviors.
Overrides: getAlgorithmSpecifics in class GenericMain
Copyright © 2012. All Rights Reserved.