public class VectorSpaceModel extends GenericTermDocumentVectorSpace
The VSM first processes documents into a word-document matrix in which each
unique word is assigned a row and each column represents a document. The
values of this matrix are the number of times the row's word occurs in the
column's document. Optionally, after the matrix has been completed, its
values may be transformed. This is frequently done using the Tf-Idf Transform.
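The matrix construction described above can be sketched in a few lines of self-contained Java. This is only an illustration, not the library's implementation: the class name `TermDocumentSketch`, the whitespace tokenizer, and the weighting `count * log(N / df)` are all assumptions, and the Tf-Idf Transform mentioned above may use a different formula.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a word-document matrix; not part of the library. */
final class TermDocumentSketch {

    /**
     * Builds the count matrix: one row per unique word, one column per document.
     * Row indices are recorded in rowOf in order of first appearance.
     */
    static int[][] countMatrix(List<String> docs, Map<String, Integer> rowOf) {
        // First pass: assign a row to every unique word.
        for (String doc : docs) {
            for (String w : doc.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) rowOf.putIfAbsent(w, rowOf.size());
            }
        }
        // Second pass: count occurrences of each word in each document.
        int[][] counts = new int[rowOf.size()][docs.size()];
        for (int col = 0; col < docs.size(); col++) {
            for (String w : docs.get(col).toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) counts[rowOf.get(w)][col]++;
            }
        }
        return counts;
    }

    /** Assumed weighting: raw count times log(number of documents / document frequency). */
    static double[][] tfIdf(int[][] counts) {
        int rows = counts.length;
        int cols = rows == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            int df = 0; // number of documents containing this row's word
            for (int c = 0; c < cols; c++) {
                if (counts[r][c] > 0) df++;
            }
            double idf = df == 0 ? 0.0 : Math.log((double) cols / df);
            for (int c = 0; c < cols; c++) {
                weighted[r][c] = counts[r][c] * idf;
            }
        }
        return weighted;
    }
}
```

Under this assumed weighting, a word that occurs in every document receives a weight of zero, which is the usual motivation for applying such a transform to raw counts.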
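This class offers one configurable parameter.
Property: "edu.ucla.sspace.vsm.VectorSpaceModel.transform"
The value names the Transform class to be used when processing the space
after all the documents have been seen. The class should be public, not
abstract, and should provide a public no-arg constructor.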
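The requirement for a public, non-abstract class with a public no-arg constructor is what reflective instantiation from a class name needs. The following sketch shows that pattern with `java.util.Properties`. Only the property key comes from this page; `TransformPropertySketch`, `DoublingTransform`, `loadTransform`, and the use of `UnaryOperator<double[][]>` in place of the library's Transform type are hypothetical stand-ins, and treating an unset property as "no transform" is an assumption.

```java
import java.util.Properties;
import java.util.function.UnaryOperator;

/** Hypothetical illustration of loading a transform class by name. */
final class TransformPropertySketch {

    // Property key as documented for this class.
    static final String MATRIX_TRANSFORM_PROPERTY =
        "edu.ucla.sspace.vsm.VectorSpaceModel.transform";

    /** Stand-in transform: public, concrete, with a public no-arg constructor. */
    public static class DoublingTransform implements UnaryOperator<double[][]> {
        public DoublingTransform() {}

        @Override
        public double[][] apply(double[][] m) {
            double[][] out = new double[m.length][];
            for (int r = 0; r < m.length; r++) {
                out[r] = new double[m[r].length];
                for (int c = 0; c < m[r].length; c++) out[r][c] = 2 * m[r][c];
            }
            return out;
        }
    }

    /**
     * Instantiates the class named by the property, or returns null when the
     * property is unset (assumed here to mean "leave the matrix untransformed").
     */
    @SuppressWarnings("unchecked")
    static UnaryOperator<double[][]> loadTransform(Properties props) {
        String className = props.getProperty(MATRIX_TRANSFORM_PROPERTY);
        if (className == null) return null;
        try {
            // Fails for abstract classes or classes without a no-arg constructor.
            return (UnaryOperator<double[][]>) Class.forName(className)
                .asSubclass(UnaryOperator.class)
                .getDeclaredConstructor()
                .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }
}
```

With the real class, the same `Properties` object would presumably be passed to processSpace, with the property value set to the fully qualified name of the chosen Transform implementation.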
This class is thread-safe for concurrent calls of processDocument. Once
processSpace has been called, no further calls to processDocument should be
made. This implementation does not support access to the semantic vectors
until after processSpace has been called.
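The calling pattern this contract implies (many concurrent processDocument calls, then a single processSpace call, then reads) might look like the sketch below. It does not use the library: `ToySpace` is a hypothetical stand-in that only mimics the documented lifecycle, and the `processDocument(BufferedReader)` signature is an assumption, since this page does not show it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of the process-then-finalize lifecycle. */
final class LifecycleSketch {

    /** Stand-in space that enforces the documented call order. */
    static final class ToySpace {
        private final ConcurrentHashMap<String, AtomicInteger> totals = new ConcurrentHashMap<>();
        private final AtomicBoolean finished = new AtomicBoolean(false);

        // Safe to call from many threads, but only before processSpace.
        void processDocument(BufferedReader doc) throws IOException {
            if (finished.get()) throw new IllegalStateException("processSpace already called");
            String line;
            while ((line = doc.readLine()) != null) {
                for (String w : line.split("\\s+")) {
                    if (!w.isEmpty()) {
                        totals.computeIfAbsent(w, k -> new AtomicInteger()).incrementAndGet();
                    }
                }
            }
        }

        // Called exactly once, after every document has been processed.
        void processSpace(Properties props) {
            finished.set(true);
        }

        // Reads are refused until processSpace has run.
        int total(String word) {
            if (!finished.get()) throw new IllegalStateException("call processSpace first");
            AtomicInteger n = totals.get(word);
            return n == null ? 0 : n.get();
        }
    }

    /** Processes all documents on a thread pool, then finalizes the space. */
    static ToySpace index(List<String> docs, int threads) {
        ToySpace space = new ToySpace();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String text : docs) {
            pool.execute(() -> {
                try (BufferedReader r = new BufferedReader(new StringReader(text))) {
                    space.processDocument(r);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for every processDocument call to finish before finalizing.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("timed out waiting for documents");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        space.processSpace(new Properties());
        return space;
    }
}
```

The important design point is the barrier between the two phases: the pool is shut down and awaited so that no processDocument call can overlap with or follow processSpace.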
See Also: Transform
Modifier and Type | Field and Description |
---|---|
static String | MATRIX_TRANSFORM_PROPERTY: The property that defines the Transform class to be used when processing the space after all the documents have been seen. |
Fields inherited from class GenericTermDocumentVectorSpace: documentCounter, LOG, wordSpace
Constructor and Description |
---|
VectorSpaceModel(): Constructs the VectorSpaceModel using the system properties for configuration. |
VectorSpaceModel(boolean readHeaderToken, BasisMapping<String,String> termToIndex, MatrixBuilder termDocumentMatrixBuilder): Constructs a new VectorSpaceModel using the provided objects for processing. |
Modifier and Type | Method and Description |
---|---|
String | getSpaceName(): Returns a unique string describing the name and configuration of this algorithm. |
void | processSpace(Properties properties): Once all the documents have been processed, performs any post-processing steps on the data. |
Methods inherited from class GenericTermDocumentVectorSpace: getVector, getVectorLength, getWords, handleDocumentHeader, processDocument, processSpace
public static final String MATRIX_TRANSFORM_PROPERTY
The property that defines the Transform class to be used when processing the
space after all the documents have been seen.
public VectorSpaceModel() throws IOException
Constructs the VectorSpaceModel using the system properties for configuration.
Throws:
IOException - if this instance encounters any errors when creating the
backing array files required for processing
public VectorSpaceModel(boolean readHeaderToken, BasisMapping<String,String> termToIndex, MatrixBuilder termDocumentMatrixBuilder) throws IOException
Constructs a new VectorSpaceModel using the provided objects for processing.
Parameters:
readHeaderToken - If true, the first token of each document will be read and
passed to handleDocumentHeader, which discards the header.
termToIndex - The BasisMapping used to map strings to indices.
termDocumentMatrixBuilder - The MatrixBuilder used to write document vectors
to disk, which are later processed in processSpace.
Throws:
IOException - if this instance encounters any errors when creating the
backing array files required for processing
public String getSpaceName()
Returns a unique string describing the name and configuration of this
algorithm.
public void processSpace(Properties properties)
Once all the documents have been processed, performs any post-processing
steps on the data. Exposed parameters may be configured through the
properties argument.
By general contract, once this method has been called, processDocument will
not be called again.
Parameters:
properties - a set of properties and values that may be used to configure any
exposed parameters of the algorithm. See this class's javadoc for the full
list of supported properties.
Copyright © 2012. All Rights Reserved.