public abstract class GenericMain extends Object
A base class for running SemanticSpace algorithms. All derived main
classes must implement the abstract functions. Derived classes have the
option of adding more command line options, which can then be handled
independently by the derived class to build the SemanticSpace correctly, or
produce the Properties object required for processing the space.
All mains which inherit from this class will automatically have the ability
to process the documents in parallel, and from a variety of file sources.
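As a rough illustration of what processing documents in parallel involves, the sketch below has several worker threads draw from one shared iterator and hand each document to a processing callback exactly once. It is not the library's implementation: ParallelDocsSketch, parseMultiThreaded and the Consumer callback are stand-ins invented for this example, and documents are modeled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical stand-in; not part of the documented API.
final class ParallelDocsSketch {

    /**
     * Starts numThreads workers that drain docIter, passing each document to
     * processDocument exactly once. The iterator is not assumed to be
     * thread-safe, so hasNext()/next() are guarded by a shared lock.
     */
    static <D> void parseMultiThreaded(Iterator<D> docIter, int numThreads,
                                       Consumer<D> processDocument)
            throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < numThreads; i++) {
            Thread worker = new Thread(() -> {
                while (true) {
                    D doc;
                    synchronized (docIter) {
                        if (!docIter.hasNext()) {
                            return; // nothing left for this worker
                        }
                        doc = docIter.next();
                    }
                    // Runs outside the lock; the callback must be thread-safe.
                    processDocument.accept(doc);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            worker.join(); // wait until every document has been handled
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> docs = Arrays.asList("the cat sat", "on the mat", "again");
        AtomicInteger tokens = new AtomicInteger();
        // One worker per core, mirroring the documented --threads default.
        int threads = Runtime.getRuntime().availableProcessors();
        parseMultiThreaded(docs.iterator(), threads,
                doc -> tokens.addAndGet(doc.split("\\s+").length));
        System.out.println("tokens processed: " + tokens.get());
    }
}
```

Only the hand-off of the next document is serialized here, so the per-document work can overlap across threads.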
The provided command line arguments are as follows:
-d, --docFile=FILE[,FILE...] a file where each line
is treated as a separate document. This is the preferred option when
working with large corpora due to reduced I/O demands for multiple
files.
-f, --fileList=FILE[,FILE...] a file containing a list
of file names, each of which is treated as a separate document.
-o, --outputFormat=text|binary specifies the
output format to use when generating the semantic space (.sspace) file. See SemanticSpaceIO for format details.
-t, --threads=INT how many threads to use when
processing the documents. The default is one per core.
-w, --overwrite=BOOL specifies whether to overwrite
the existing output files. The default is true. If set to
false, a unique integer is inserted into the file name.
-v, --verbose specifies whether to print runtime
information to standard out.
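The --overwrite behavior can be pictured with a short sketch. The source only says that a unique integer is inserted into the file name when the option is false; it does not say how. The version below assumes the standard File.createTempFile call is one acceptable way to do that, and UniqueNameSketch, chooseOutput and the ".sspace" value given to EXT are names and values chosen for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical stand-in; not part of the documented API.
final class UniqueNameSketch {

    /** Assumed extension value, based on the ".sspace" file mentioned above. */
    static final String EXT = ".sspace";

    /**
     * Picks the output file. With overwrite set, the plain name is returned
     * and any existing file would be replaced when written. Otherwise
     * createTempFile creates a new, empty file whose name has generated
     * characters between the name and the extension.
     * Note: createTempFile requires a prefix of at least three characters.
     */
    static File chooseOutput(File dir, String name, boolean overwrite)
            throws IOException {
        if (overwrite) {
            return new File(dir, name + EXT);
        }
        return File.createTempFile(name, EXT, dir);
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("sspace-sketch").toFile();
        System.out.println(chooseOutput(dir, "lsa-space", true).getName());
        // Each of these calls yields a different, previously unused name.
        System.out.println(chooseOutput(dir, "lsa-space", false).getName());
        System.out.println(chooseOutput(dir, "lsa-space", false).getName());
    }
}
```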
| Modifier and Type | Field and Description |
|---|---|
| protected ArgOptions | argOptions: The processed argument options available to the main classes. |
| static String | EXT: Extension used for all saved semantic space files. |
| protected boolean | isMultiThreaded: Whether the SemanticSpace class is capable of running with multiple threads. |
| protected boolean | verbose: Whether to emit messages to stdout when the verbose methods are used. |
| Constructor and Description |
|---|
| GenericMain() |
| GenericMain(boolean isMultiThreaded) |
| Modifier and Type | Method and Description |
|---|---|
| protected void | addCorpusReaderIterators(Collection<Iterator<Document>> docIters, String[] fileNames): Adds a corpus reader for each file listed. |
| protected void | addDocIterators(Collection<Iterator<Document>> docIters, String[] fileNames): Adds a OneLinePerDocumentIterator to docIters for each file name provided. |
| protected void | addExtraOptions(ArgOptions options): Adds options to the provided ArgOptions instance, which will be used to parse the command line. |
| protected void | addFileIterators(Collection<Iterator<Document>> docIters, String[] fileNames): Adds a FileListDocumentIterator to docIters for each file name provided. |
| protected String | getAlgorithmSpecifics(): Returns a string describing algorithm-specific options and behaviors. |
| protected Iterator<Document> | getDocumentIterator(): Returns the iterator for all of the documents specified on the command line or throws an Error if no documents are specified. |
| protected abstract SemanticSpace | getSpace(): Returns the SemanticSpace that will be used for processing. |
| protected SemanticSpaceIO.SSpaceFormat | getSpaceFormat(): Returns the format in which the finished SemanticSpace should be saved. |
| protected void | handleExtraOptions(): Once the command line has been parsed, allows the subclasses to perform additional steps based on class-specific options. |
| protected static Set<String> | loadValidTermSet(String validTermsFileName): Returns a set of terms based on the contents of the provided file. |
| protected void | parseDocumentsMultiThreaded(SemanticSpace sspace, Iterator<Document> docIter, int numThreads): Calls processDocument once for every document in docIter, using the specified number of threads to interact with the SemanticSpace instance. |
| protected void | parseDocumentsSingleThreaded(SemanticSpace sspace, Iterator<Document> docIter): Calls processDocument once for every document in docIter using a single thread to interact with the SemanticSpace instance. |
| protected void | postProcessing(): Allows subclasses to interact with the SemanticSpace after the space has finished processing all of the text. |
| protected void | processDocumentsAndSpace(SemanticSpace space, Iterator<Document> docIter, int numThreads, Properties props): Processes all the documents held by the iterator and then processes the space. |
| void | run(String[] args): Processes the arguments and begins processing the documents using the SemanticSpace returned by getSpace. |
| protected void | saveSSpace(SemanticSpace sspace, File outputFile): Serializes the SemanticSpace object to outputFile. |
| protected ArgOptions | setupOptions(): Adds the default options for running semantic space algorithms from the command line. |
| protected Properties | setupProperties(): Returns the Properties object that will be used when calling SemanticSpace.processSpace(Properties). |
| protected void | usage(): Prints out information on how to run the program to stdout using the option descriptions for compound words, tokenization, .sspace formats and help. |
| protected void | verbose(String msg) |
| protected void | verbose(String format, Object... args) |
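The summary above implies that the add...Iterators methods each contribute iterators to a shared collection, and that getDocumentIterator exposes them as a single iterator. A minimal sketch of that idea follows, assuming simple back-to-back chaining. ChainedDocsSketch and chain are names made up for this example, documents are modeled as strings, and the error message is invented; only the "throws an Error if no documents are specified" contract comes from the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in; not part of the documented API.
final class ChainedDocsSketch {

    /** Presents several iterators as one, exhausting each in turn. */
    static <T> Iterator<T> chain(Collection<Iterator<T>> iters) {
        final Iterator<Iterator<T>> sources =
                new ArrayList<Iterator<T>>(iters).iterator();
        return new Iterator<T>() {
            private Iterator<T> current = null;

            @Override
            public boolean hasNext() {
                // Advance past exhausted or empty sources.
                while ((current == null || !current.hasNext())
                        && sources.hasNext()) {
                    current = sources.next();
                }
                return current != null && current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    /** Fails with an Error when no document source was supplied. */
    static <T> Iterator<T> getDocumentIterator(Collection<Iterator<T>> docIters) {
        if (docIters.isEmpty()) {
            throw new Error("no document sources were specified");
        }
        return chain(docIters);
    }

    public static void main(String[] args) {
        Collection<Iterator<String>> docIters = new ArrayList<>();
        docIters.add(Arrays.asList("doc one", "doc two").iterator()); // first source
        docIters.add(Arrays.asList("doc three").iterator());          // second source
        Iterator<String> all = getDocumentIterator(docIters);
        while (all.hasNext()) {
            System.out.println(all.next());
        }
    }
}
```

Under this arrangement a subclass that reads a different file format only has to add its own iterator to the collection; the code that consumes documents stays unchanged.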
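public static final String EXT
Extension used for all saved semantic space files.

protected boolean verbose
Whether to emit messages to stdout when the verbose methods are used.

protected final ArgOptions argOptions
The processed argument options available to the main classes.

protected final boolean isMultiThreaded
Whether the SemanticSpace class is capable of running with multiple threads.

public GenericMain()

public GenericMain(boolean isMultiThreaded)
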
protected abstract SemanticSpace getSpace()
Returns the SemanticSpace that will be used for processing. This method is guaranteed to be called after the command line arguments have been parsed, so the contents of argOptions are valid.

protected String getAlgorithmSpecifics()
Returns a string describing algorithm-specific options and behaviors.

protected void usage()
Prints out information on how to run the program to stdout using the option descriptions for compound words, tokenization, .sspace formats and help.

protected SemanticSpaceIO.SSpaceFormat getSpaceFormat()
Returns the format in which the finished SemanticSpace should be saved. Subclasses should override this method if they want to specify the format best suited to their space, for use when the user does not specify one manually.

protected void addExtraOptions(ArgOptions options)
Adds options to the provided ArgOptions instance, which will be used to parse the command line. This method allows subclasses to add extra command line options.
Parameters:
options - the ArgOptions object to which main-specific options can be added.
See Also:
handleExtraOptions()

protected void handleExtraOptions()
Once the command line has been parsed, allows the subclasses to perform additional steps based on class-specific options. This method will be called before getSpace.
See Also:
addExtraOptions(ArgOptions)

protected void postProcessing()
Allows subclasses to interact with the SemanticSpace after the space has finished processing all of the text.

protected Properties setupProperties()
Returns the Properties object that will be used when calling SemanticSpace.processSpace(Properties). Subclasses should override this method if they need to specify additional properties for the space. This method will be called once before getSpace().
Returns:
the Properties used for processing the semantic space.

protected ArgOptions setupOptions()
Adds the default options for running semantic space algorithms from the command line.

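The hooks described above follow a template-method arrangement: the base class drives the run and calls into the subclass at fixed points. The sketch below models only that ordering, with stand-in types, and is not the library's code. MainSketch and DemoMain are invented names, options are modeled as a list of strings, the space as a plain Object, and the --dimensions option is hypothetical. The text states that getSpace runs after parsing, that setupProperties runs before getSpace, and that postProcessing runs after all text is processed; the relative order of handleExtraOptions and setupProperties, and their exact position before getSpace, is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical stand-in for the base class; models hook ordering only. */
abstract class MainSketch {
    /** Records which hooks ran, in order, so the sequence can be inspected. */
    protected final List<String> calls = new ArrayList<>();

    // Optional hooks with do-nothing defaults.
    protected void addExtraOptions(List<String> options) { }
    protected void handleExtraOptions() { }
    protected Properties setupProperties() { return new Properties(); }
    protected void postProcessing() { }

    // The one hook every subclass must supply.
    protected abstract Object getSpace();

    public final void run(String[] args) {
        List<String> options = new ArrayList<>();
        addExtraOptions(options);               // subclass registers its options
        // ... the command line would be parsed against options here ...
        handleExtraOptions();                   // after parsing (assumed position)
        Properties props = setupProperties();   // once, before getSpace
        Object space = getSpace();              // argument values are available
        // ... documents would be processed, then the space with props ...
        postProcessing();                       // after all text is processed
    }
}

/** Hypothetical subclass that records each hook as it is invoked. */
final class DemoMain extends MainSketch {
    @Override protected void addExtraOptions(List<String> options) {
        options.add("--dimensions");            // hypothetical extra option
        calls.add("addExtraOptions");
    }
    @Override protected void handleExtraOptions() { calls.add("handleExtraOptions"); }
    @Override protected Properties setupProperties() {
        calls.add("setupProperties");
        return new Properties();
    }
    @Override protected Object getSpace() {
        calls.add("getSpace");
        return new Object();
    }
    @Override protected void postProcessing() { calls.add("postProcessing"); }

    public static void main(String[] args) {
        DemoMain main = new DemoMain();
        main.run(args);
        System.out.println(main.calls);
    }
}
```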
protected Iterator<Document> getDocumentIterator() throws IOException
Returns the iterator for all of the documents specified on the command line or throws an Error if no documents are specified. Subclasses should override either addFileIterators(Collection<Iterator<Document>>, String[]) or addDocIterators(Collection<Iterator<Document>>, String[]) if they use a different file format. Alternatively, one can implement a CorpusReader and use the -R option.
Throws:
Error - if no document source is specified
IOException

protected void addCorpusReaderIterators(Collection<Iterator<Document>> docIters, String[] fileNames) throws IOException
Adds a corpus reader for each file listed. The first value in fileNames is expected to be the class type of the corpus reader.
Throws:
IOException

protected void addFileIterators(Collection<Iterator<Document>> docIters, String[] fileNames) throws IOException
Adds a FileListDocumentIterator to docIters for each file name provided.
Throws:
IOException

protected void addDocIterators(Collection<Iterator<Document>> docIters, String[] fileNames) throws IOException
Adds a OneLinePerDocumentIterator to docIters for each file name provided.
Throws:
IOException

public void run(String[] args) throws Exception
Processes the arguments and begins processing the documents using the SemanticSpace returned by getSpace.
Parameters:
args - arguments used to configure this program and the SemanticSpace
Throws:
Exception

protected void saveSSpace(SemanticSpace sspace, File outputFile) throws IOException
Serializes the SemanticSpace object to outputFile. This uses the outputFormat option if it was set on the command line; otherwise, it uses the SemanticSpaceIO.SSpaceFormat returned by getSpaceFormat().
Throws:
IOException

protected void processDocumentsAndSpace(SemanticSpace space, Iterator<Document> docIter, int numThreads, Properties props) throws Exception
Processes all the documents held by the iterator and then processes the space.
Throws:
Exception

protected void parseDocumentsSingleThreaded(SemanticSpace sspace, Iterator<Document> docIter) throws IOException
Calls processDocument once for every document in docIter using a single thread to interact with the SemanticSpace instance.
Parameters:
sspace - the space to build
docIter - an iterator over all the documents to process
Throws:
IOException

protected void parseDocumentsMultiThreaded(SemanticSpace sspace, Iterator<Document> docIter, int numThreads) throws IOException, InterruptedException
Calls processDocument once for every document in docIter, using the specified number of threads to interact with the SemanticSpace instance.
Parameters:
sspace - the space to build
docIter - an iterator over all the documents to process
numThreads - the number of threads to use
Throws:
IOException
InterruptedException

protected static Set<String> loadValidTermSet(String validTermsFileName) throws IOException
Returns a set of terms based on the contents of the provided file.
Throws:
IOException

protected void verbose(String msg)

protected void verbose(String format, Object... args)

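The two verbose overloads are listed without descriptions. Going by the verbose field's description, one plausible reading is that they print only when the flag is set, with the second overload taking a format string in the style of String.format. The sketch below shows that reading; VerboseSketch, its constructor, the injected PrintStream and the trailing newline are assumptions made for this example rather than documented behavior.

```java
import java.io.PrintStream;

// Hypothetical stand-in; not part of the documented API.
final class VerboseSketch {
    private final boolean verbose;
    private final PrintStream out;

    VerboseSketch(boolean verbose, PrintStream out) {
        this.verbose = verbose;
        this.out = out;
    }

    /** Prints msg only when verbose output was requested. */
    void verbose(String msg) {
        if (verbose) {
            out.println(msg);
        }
    }

    /** Formats lazily: String.format runs only when the flag is set. */
    void verbose(String format, Object... args) {
        if (verbose) {
            out.println(String.format(format, args));
        }
    }

    public static void main(String[] args) {
        VerboseSketch quiet = new VerboseSketch(false, System.out);
        VerboseSketch chatty = new VerboseSketch(true, System.out);
        quiet.verbose("never shown");
        chatty.verbose("processed %d documents using %d threads", 1200, 4);
    }
}
```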
Copyright © 2012. All Rights Reserved.