public abstract class GenericMain extends Object
A base class for running SemanticSpace algorithms. All derived main
classes must implement the abstract functions. Derived classes have the
option of adding more command line options, which can then be handled
independently by the derived class to build the SemanticSpace correctly, or
produce the Properties object required for processing the space.
All mains which inherit from this class will automatically have the ability
to process the documents in parallel, and from a variety of file sources.
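The division of labor between the base class and a derived main can be pictured with a small, self-contained sketch. It does not use the S-Space classes: MiniMain, WordCountMain, and the -m flag are hypothetical stand-ins, and the order of the hook calls is an assumption based only on the guarantees stated on this page (the command line is parsed before getSpace is called, and setupProperties is called before getSpace).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in for GenericMain; not the S-Space implementation. */
abstract class MiniMain {
    /** Records the order in which the hooks run, for illustration only. */
    public final List<String> calls = new ArrayList<>();
    protected String[] parsedArgs = new String[0];

    // Optional hooks with default behavior, mirroring the optional overrides.
    protected void addExtraOptions(List<String> options) { calls.add("addExtraOptions"); }
    protected void handleExtraOptions() { calls.add("handleExtraOptions"); }
    protected Properties setupProperties() { calls.add("setupProperties"); return new Properties(); }
    protected void postProcessing() { calls.add("postProcessing"); }

    /** The hook every derived main must implement. */
    protected abstract Object getSpace();

    /** Assumed call order; see the caveats in the paragraph above. */
    public final void run(String[] args) {
        List<String> options = new ArrayList<>(Arrays.asList("-d", "-f", "-o", "-t", "-w", "-v"));
        addExtraOptions(options);
        parsedArgs = args.clone();   // stands in for real command line parsing
        handleExtraOptions();
        Properties props = setupProperties();
        Object space = getSpace();
        calls.add("process " + space + " with " + props.size() + " properties");
        postProcessing();
    }
}

/** Hypothetical derived main that adds one option of its own. */
final class WordCountMain extends MiniMain {
    private int minCount = 1;

    @Override protected void addExtraOptions(List<String> options) {
        super.addExtraOptions(options);
        options.add("-m");           // hypothetical option, not part of S-Space
    }

    @Override protected void handleExtraOptions() {
        super.handleExtraOptions();
        for (int i = 0; i + 1 < parsedArgs.length; i++) {
            if (parsedArgs[i].equals("-m")) {
                minCount = Integer.parseInt(parsedArgs[i + 1]);
            }
        }
    }

    @Override protected Object getSpace() {
        calls.add("getSpace");
        return "WordCountSpace(minCount=" + minCount + ")";
    }

    public static void main(String[] args) {
        WordCountMain m = new WordCountMain();
        m.run(new String[] {"-m", "3"});
        m.calls.forEach(System.out::println);
    }
}
```

The derived class only supplies what is specific to its algorithm; everything else stays in the base class's run method.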
The provided command line arguments are as follows:
-d, --docFile=FILE[,FILE...]: a file containing a list of file names, each of which is treated as a separate document.
-f, --fileList=FILE[,FILE...]: a file where each line is treated as a separate document. This is the preferred option when working with large corpora due to reduced I/O demands for multiple files.
-o, --outputFormat={text|binary}: specifies the output formatting to use when generating the semantic space (.sspace) file. See SemanticSpaceIO for format details.
-t, --threads=INT: how many threads to use when processing the documents. The default is one per core.
-w, --overwrite=BOOL: specifies whether to overwrite the existing output files. The default is true. If set to false, a unique integer is inserted into the file name.
-v, --verbose: specifies whether to print runtime information to standard out.
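Two of the defaults above are easy to picture in code. The sketch below is not taken from the S-Space source: OptionDefaults, resolveThreads, and resolveOutput are hypothetical names, and generating the unique name with File.createTempFile is an assumption, since the text says only that a unique integer is inserted into the file name.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

final class OptionDefaults {
    /** Extension named in the option description above. */
    static final String EXT = ".sspace";

    /** Uses the requested thread count, or one thread per core when none is given. */
    static int resolveThreads(Integer requested) {
        return (requested != null) ? requested : Runtime.getRuntime().availableProcessors();
    }

    /**
     * Returns dir/spaceName.sspace when overwriting is allowed. Otherwise asks
     * the JDK for a fresh file whose name contains a generated number
     * (assumed mechanism, see the paragraph above).
     */
    static File resolveOutput(File dir, String spaceName, boolean overwrite) {
        if (overwrite) {
            return new File(dir, spaceName + EXT);
        }
        try {
            // createTempFile requires a prefix of at least three characters
            // and creates an empty file with a unique name in dir.
            return File.createTempFile(spaceName, EXT, dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(resolveThreads(null) + " threads by default");
        System.out.println(resolveOutput(dir, "demo-space", true).getName());
        System.out.println(resolveOutput(dir, "demo-space", false).getName());
    }
}
```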
Modifier and Type | Field and Description |
---|---|
protected ArgOptions | argOptions: The processed argument options available to the main classes. |
static String | EXT: Extension used for all saved semantic space files. |
protected boolean | isMultiThreaded: Whether the SemanticSpace class is capable of running with multiple threads. |
protected boolean | verbose: Whether to emit messages to stdout when the verbose methods are used. |
Constructor and Description |
---|
GenericMain() |
GenericMain(boolean isMultiThreaded) |
Modifier and Type | Method and Description |
---|---|
protected void | addCorpusReaderIterators(Collection<Iterator<Document>> docIters, String[] fileNames): Adds a corpus reader for each file listed. |
protected void | addDocIterators(Collection<Iterator<Document>> docIters, String[] fileNames): Adds a OneLinePerDocumentIterator to docIters for each file name provided. |
protected void | addExtraOptions(ArgOptions options): Adds options to the provided ArgOptions instance, which will be used to parse the command line. |
protected void | addFileIterators(Collection<Iterator<Document>> docIters, String[] fileNames): Adds a FileListDocumentIterator to docIters for each file name provided. |
protected String | getAlgorithmSpecifics(): Returns a string describing algorithm-specific options and behaviors. |
protected Iterator<Document> | getDocumentIterator(): Returns the iterator for all of the documents specified on the command line or throws an Error if no documents are specified. |
protected abstract SemanticSpace | getSpace(): Returns the SemanticSpace that will be used for processing. |
protected SemanticSpaceIO.SSpaceFormat | getSpaceFormat(): Returns the format in which the finished SemanticSpace should be saved. |
protected void | handleExtraOptions(): Once the command line has been parsed, allows the subclasses to perform additional steps based on class-specific options. |
protected static Set<String> | loadValidTermSet(String validTermsFileName): Returns a set of terms based on the contents of the provided file. |
protected void | parseDocumentsMultiThreaded(SemanticSpace sspace, Iterator<Document> docIter, int numThreads): Calls processDocument once for every document in docIter using the specified number of threads to call processSpace on the SemanticSpace instance. |
protected void | parseDocumentsSingleThreaded(SemanticSpace sspace, Iterator<Document> docIter): Calls processDocument once for every document in docIter using a single thread to interact with the SemanticSpace instance. |
protected void | postProcessing(): Allows subclasses to interact with the SemanticSpace after the space has finished processing all of the text. |
protected void | processDocumentsAndSpace(SemanticSpace space, Iterator<Document> docIter, int numThreads, Properties props): Processes all the documents held by the iterator and processes the space. |
void | run(String[] args): Processes the arguments and begins processing the documents using the SemanticSpace returned by getSpace. |
protected void | saveSSpace(SemanticSpace sspace, File outputFile): Serializes the SemanticSpace object to outputFile. |
protected ArgOptions | setupOptions(): Adds the default options for running semantic space algorithms from the command line. |
protected Properties | setupProperties(): Returns the Properties object that will be used when calling SemanticSpace.processSpace(Properties). |
protected void | usage(): Prints out information on how to run the program to stdout using the option descriptions for compound words, tokenization, .sspace formats and help. |
protected void | verbose(String msg) |
protected void | verbose(String format, Object... args) |
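public static final String EXT
Extension used for all saved semantic space files.
protected boolean verbose
Whether to emit messages to stdout when the verbose methods are used.
protected final ArgOptions argOptions
The processed argument options available to the main classes.
protected final boolean isMultiThreaded
Whether the SemanticSpace class is capable of running with multiple threads.
public GenericMain()
public GenericMain(boolean isMultiThreaded)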
protected abstract SemanticSpace getSpace()
Returns the SemanticSpace that will be used for processing. This method is guaranteed to be called after the command line arguments have been parsed, so the contents of argOptions are valid.
protected String getAlgorithmSpecifics()
Returns a string describing algorithm-specific options and behaviors.
protected void usage()
Prints out information on how to run the program to stdout using the option descriptions for compound words, tokenization, .sspace formats and help.
protected SemanticSpaceIO.SSpaceFormat getSpaceFormat()
Returns the format in which the finished SemanticSpace should be saved. Subclasses should override this function if they want to specify the format best suited to their space when the user does not specify one manually.
protected void addExtraOptions(ArgOptions options)
Adds options to the provided ArgOptions instance, which will be used to parse the command line. This method allows subclasses to add extra command line options.
Parameters:
options - the ArgOptions object to which more main-specific options can be added.
See Also:
handleExtraOptions()
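protected void handleExtraOptions()
Once the command line has been parsed, allows the subclasses to perform additional steps based on class-specific options.
See Also:
getSpace(), addExtraOptions(ArgOptions)
protected void postProcessing()
Allows subclasses to interact with the SemanticSpace after the space has finished processing all of the text.
protected Properties setupProperties()
Returns the Properties object that will be used when calling SemanticSpace.processSpace(Properties). Subclasses should override this method if they need to specify additional properties for the space. This method will be called once before getSpace().
Returns:
the Properties used for processing the semantic space.
protected ArgOptions setupOptions()
Adds the default options for running semantic space algorithms from the command line.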
protected Iterator<Document> getDocumentIterator() throws IOException
Returns the iterator for all of the documents specified on the command line or throws an Error if no documents are specified. Subclasses should override either addFileIterators(Collection<Iterator<Document>>, String[]) or addDocIterators(Collection<Iterator<Document>>, String[]) if they use a different file format. Alternatively, one can implement a CorpusReader and use the -R option.
Throws:
Error - if no document source is specified
IOException
protected void addCorpusReaderIterators(Collection<Iterator<Document>> docIters, String[] fileNames) throws IOException
Adds a corpus reader for each file listed. fileNames is expected to be the class type of the corpus reader.
Throws:
IOException
protected void addFileIterators(Collection<Iterator<Document>> docIters, String[] fileNames) throws IOException
Adds a FileListDocumentIterator to docIters for each file name provided.
Throws:
IOException
protected void addDocIterators(Collection<Iterator<Document>> docIters, String[] fileNames) throws IOException
Adds a OneLinePerDocumentIterator to docIters for each file name provided.
Throws:
IOException
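The idea of collecting one iterator per source into docIters and then reading them as a single stream can be sketched without the library. In the sketch below documents are plain String values rather than Document objects, LineDocuments and its two methods are hypothetical names, and the line-per-document reader only imitates what the name OneLinePerDocumentIterator suggests; the real class may behave differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class LineDocuments {
    /** Treats every line of the reader as one document (hypothetical stand-in). */
    static Iterator<String> oneLinePerDocument(BufferedReader reader) {
        return new Iterator<String>() {
            private String next = advance();

            private String advance() {
                try {
                    return reader.readLine();   // null at end of input
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override public boolean hasNext() { return next != null; }

            @Override public String next() {
                if (next == null) throw new NoSuchElementException();
                String current = next;
                next = advance();
                return current;
            }
        };
    }

    /** Presents several document iterators as a single iterator, in order. */
    static <T> Iterator<T> chain(Collection<Iterator<T>> iterators) {
        Iterator<Iterator<T>> outer = iterators.iterator();
        return new Iterator<T>() {
            private Iterator<T> current = null;

            @Override public boolean hasNext() {
                // Skip exhausted (or empty) sources until one has a document.
                while ((current == null || !current.hasNext()) && outer.hasNext()) {
                    current = outer.next();
                }
                return current != null && current.hasNext();
            }

            @Override public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        List<Iterator<String>> docIters = new ArrayList<>();
        docIters.add(oneLinePerDocument(new BufferedReader(new StringReader("first doc\nsecond doc"))));
        docIters.add(oneLinePerDocument(new BufferedReader(new StringReader("third doc"))));
        Iterator<String> all = chain(docIters);
        while (all.hasNext()) {
            System.out.println(all.next());
        }
    }
}
```

A subclass with its own file format would, in the same spirit, add a different iterator to the collection and leave the chaining untouched.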
public void run(String[] args) throws Exception
Processes the arguments and begins processing the documents using the SemanticSpace returned by getSpace.
Parameters:
args - arguments used to configure this program and the SemanticSpace
Throws:
Exception
protected void saveSSpace(SemanticSpace sspace, File outputFile) throws IOException
Serializes the SemanticSpace object to outputFile. This uses outputFormat if set by the command line. If not, this uses the SemanticSpaceIO.SSpaceFormat returned by getSpaceFormat().
Throws:
IOException
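The format fallback described for saveSSpace amounts to a two-step lookup. The following sketch shows one way to express it; FormatChoice and its two-value Format enum are hypothetical stand-ins for SemanticSpaceIO.SSpaceFormat, and parsing the option value by upper-casing it is an assumption.

```java
import java.util.Locale;

final class FormatChoice {
    /** Hypothetical two-value stand-in for SemanticSpaceIO.SSpaceFormat. */
    enum Format { TEXT, BINARY }

    /** Returns the user's choice when one was given, else the space's preferred format. */
    static Format resolve(String commandLineValue, Format preferredBySpace) {
        if (commandLineValue == null) {
            return preferredBySpace;
        }
        return Format.valueOf(commandLineValue.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, Format.BINARY));    // prints BINARY
        System.out.println(resolve("text", Format.BINARY));  // prints TEXT
    }
}
```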
protected void processDocumentsAndSpace(SemanticSpace space, Iterator<Document> docIter, int numThreads, Properties props) throws Exception
Processes all the documents held by the iterator and processes the space.
Throws:
Exception
protected void parseDocumentsSingleThreaded(SemanticSpace sspace, Iterator<Document> docIter) throws IOException
Calls processDocument once for every document in docIter using a single thread to interact with the SemanticSpace instance.
Parameters:
sspace - the space to build
docIter - an iterator over all the documents to process
Throws:
IOException
protected void parseDocumentsMultiThreaded(SemanticSpace sspace, Iterator<Document> docIter, int numThreads) throws IOException, InterruptedException
Calls processDocument once for every document in docIter using the specified number of threads to call processSpace on the SemanticSpace instance.
Parameters:
sspace - the space to build
docIter - an iterator over all the documents to process
numThreads - the number of threads to use
Throws:
IOException
InterruptedException
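One common way to have several threads consume a single document iterator is shown below. It is a sketch of the general technique, not the S-Space implementation: ParallelDocs and CountingSpace are hypothetical, documents are plain strings, and the stand-in processDocument is assumed to be thread safe.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class ParallelDocs {
    /** Hypothetical stand-in for a SemanticSpace with a thread-safe processDocument. */
    static final class CountingSpace {
        final ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

        void processDocument(String doc) {
            for (String token : doc.split("\\s+")) {
                if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Starts numThreads workers that drain one shared iterator, then waits for all of them. */
    static void parseMultiThreaded(CountingSpace space, Iterator<String> docIter, int numThreads)
            throws InterruptedException {
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                while (true) {
                    String doc;
                    // Iterators are generally not thread safe, so hasNext and
                    // next are called together under one lock.
                    synchronized (docIter) {
                        if (!docIter.hasNext()) return;
                        doc = docIter.next();
                    }
                    // The expensive work happens outside the lock.
                    space.processDocument(doc);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountingSpace space = new CountingSpace();
        Iterator<String> docs = Arrays.asList("a b", "b c", "c a", "a").iterator();
        parseMultiThreaded(space, docs, 3);
        System.out.println(space.counts);
    }
}
```

Each document is handed to exactly one worker, so the per-token totals match a single-threaded run regardless of how the threads interleave.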
protected static Set<String> loadValidTermSet(String validTermsFileName) throws IOException
Returns a set of terms based on the contents of the provided file.
Throws:
IOException
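The file format read by loadValidTermSet is not described here. As a minimal sketch, assuming one term per line with surrounding whitespace ignored and blank lines skipped, a loader could look like the following. ValidTerms.load is a hypothetical name, and it takes a Reader rather than a file name so that the example runs without a file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

final class ValidTerms {
    /** Reads one term per line (assumed format), dropping blanks and duplicates. */
    static Set<String> load(Reader source) throws IOException {
        Set<String> terms = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(source)) {
            for (String line; (line = br.readLine()) != null; ) {
                String term = line.trim();
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(load(new StringReader("cat\ndog\n\ncat\n")));  // prints [cat, dog]
    }
}
```

To read from disk, a caller would pass a java.io.FileReader opened on the term file.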
protected void verbose(String msg)
Copyright © 2012. All Rights Reserved.