public class LatentSemanticAnalysis extends GenericTermDocumentVectorSpace
LSA first processes documents into a word-document matrix in which each unique word is assigned a row and each column represents a document. The values of this matrix correspond to the number of times the row's word occurs in the column's document. After the matrix has been built, the Singular Value Decomposition (SVD) is used to reduce the dimensionality of the original word-document matrix, denoted as A. The SVD factors any matrix A into three matrices U Σ V^T such that Σ is a diagonal matrix containing the singular values of A. The singular values in Σ are ordered from largest to smallest, that is, by how much of the variance in the values of A each one accounts for. The original matrix may be approximated by recomputing the matrix with only the k largest singular values and setting the rest to 0. The approximated matrix Â = U_k Σ_k V_k^T is the least-squares best-fit rank-k approximation of A. LSA reduces the dimensions by keeping only the first k dimensions from the row vectors of U. These vectors form the semantic space of the words.
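The first step described above, building the word-document count matrix, can be sketched in plain Java. This is only an illustration under stated assumptions: the class `TermDocumentCounts`, its methods, and the whitespace tokenizer are hypothetical and are not part of this library, which builds its matrix internally and applies a Transform before the SVD.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the S-Space API.
final class TermDocumentCounts {

    /** Assigns each distinct word a row index in order of first appearance. */
    static Map<String, Integer> indexWords(List<String> documents) {
        Map<String, Integer> rows = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String word : tokenize(doc)) {
                rows.putIfAbsent(word, rows.size());
            }
        }
        return rows;
    }

    /**
     * Builds counts[row][column]: the number of times the row's word occurs
     * in the column's document. Every word must already be present in rows.
     */
    static int[][] build(List<String> documents, Map<String, Integer> rows) {
        int[][] counts = new int[rows.size()][documents.size()];
        for (int col = 0; col < documents.size(); col++) {
            for (String word : tokenize(documents.get(col))) {
                counts[rows.get(word)][col]++;
            }
        }
        return counts;
    }

    /** Naive tokenizer (an assumption): lowercases and splits on whitespace. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the cat saw the dog");
        Map<String, Integer> rows = indexWords(docs);
        int[][] counts = build(docs, rows);
        // Print one matrix row per word.
        for (Map.Entry<String, Integer> entry : rows.entrySet()) {
            System.out.println(entry.getKey() + " " + Arrays.toString(counts[entry.getValue()]));
        }
    }
}
```

In this toy corpus the row for "the" holds the counts 1 and 2, one entry per document. The SVD step is then applied to a matrix of this shape.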
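This class offers configurable preprocessing and dimensionality reduction through the properties listed below. These properties should be specified in the Properties object passed to the processSpace method.

| Property | Default | Description |
|---|---|---|
| "edu.ucla.sspace.lsa.LatentSemanticAnalysis.transform" | LogEntropyTransform | The Transform class to be used when processing the space after all the documents have been seen. The class should be public, not abstract, and should provide a public no-arg constructor. |
| "edu.ucla.sspace.lsa.LatentSemanticAnalysis.dimensions" | 300 | The number of dimensions to which the space should be reduced using the SVD. |
| "edu.ucla.sspace.lsa.LatentSemanticAnalysis.svd.algorithm" | SVD.Algorithm.ANY | The specific SVD algorithm used by an instance during processSpace. |
| "edu.ucla.sspace.lsa.LatentSemanticAnalysis.retainDocSpace" | false | Whether the document space should be retained after processSpace. Setting this property to true will enable the getDocumentVector method. |
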
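As a minimal sketch of how these properties might be assembled, the following uses only `java.util.Properties` and the key strings listed above. The class `LsaPropertiesExample` and its helper method are hypothetical, and passing the values as strings is an assumption based on how `Properties` is normally used. The transform and SVD algorithm are left unset so that the documented defaults apply.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the S-Space API.
final class LsaPropertiesExample {

    // Common prefix of the property keys documented for this class.
    static final String PREFIX = "edu.ucla.sspace.lsa.LatentSemanticAnalysis.";

    /** Builds a Properties object that overrides two of the documented defaults. */
    static Properties lsaProperties(int dimensions, boolean retainDocSpace) {
        Properties props = new Properties();
        // Assumption: values are supplied as strings and parsed by the class.
        props.setProperty(PREFIX + "dimensions", Integer.toString(dimensions));
        props.setProperty(PREFIX + "retainDocSpace", Boolean.toString(retainDocSpace));
        return props;
    }

    public static void main(String[] args) {
        Properties props = lsaProperties(100, true);
        props.list(System.out);
        // With the library on the classpath, this object would be passed to
        // processSpace once every document has been processed, e.g.:
        // lsa.processSpace(props);
    }
}
```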
This class is thread-safe for concurrent calls of processDocument. Once processSpace has been called, no further calls to
processDocument should be made. This implementation does not support
access to the semantic vectors until after processSpace has been
called.
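The call order described above (concurrent processDocument calls, then a single processSpace call, then vector access) can be sketched as follows. `StandInSpace` is a hypothetical stand-in that only mimics the contract so the example runs without the library. In particular, the exception it throws on a late processDocument call is an illustration of the rule, not documented behavior of LatentSemanticAnalysis, and its method signatures are simplified.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the call order; not part of the S-Space API.
final class LifecycleSketch {

    /** Stand-in that mimics only the ordering contract, not the LSA computation. */
    static final class StandInSpace {
        private final AtomicInteger documentCounter = new AtomicInteger();
        private final AtomicBoolean processed = new AtomicBoolean();

        /** Safe to call from several threads until processSpace has run. */
        void processDocument(String text) {
            if (processed.get()) {
                throw new IllegalStateException("processSpace has already been called");
            }
            documentCounter.incrementAndGet();
        }

        /** Marks the end of document processing. */
        void processSpace() {
            processed.set(true);
        }

        int documentCount() {
            return documentCounter.get();
        }
    }

    /** Processes the corpus on a thread pool, then finalizes the space exactly once. */
    static StandInSpace run(List<String> corpus, int threads) {
        StandInSpace space = new StandInSpace();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : corpus) {
            pool.execute(() -> space.processDocument(doc));
        }
        pool.shutdown();
        try {
            // Wait for every document before finalizing.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("documents are still being processed");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for documents", e);
        }
        space.processSpace();
        return space;
    }

    public static void main(String[] args) {
        StandInSpace space = run(Arrays.asList("first document", "second document"), 2);
        System.out.println("documents processed: " + space.documentCount());
    }
}
```

Waiting for the pool to terminate before calling processSpace is the important design point: it ensures no processDocument call is still in flight when the space is finalized.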
| Modifier and Type | Field and Description |
|---|---|
| static String | LSA_DIMENSIONS_PROPERTY: The property to set the number of dimensions to which the space should be reduced using the SVD. |
| static String | LSA_SVD_ALGORITHM_PROPERTY: The property to set the specific SVD algorithm used by an instance during processSpace. |
| static String | MATRIX_TRANSFORM_PROPERTY: The property to define the Transform class to be used when processing the space after all the documents have been seen. |
| static String | RETAIN_DOCUMENT_SPACE_PROPERTY: The property whose boolean value indicates whether the document space should be retained after processSpace. |

Fields inherited from class GenericTermDocumentVectorSpace: documentCounter, LOG, wordSpace

| Constructor and Description |
|---|
| LatentSemanticAnalysis(): Creates a new LatentSemanticAnalysis instance. |
| LatentSemanticAnalysis(boolean retainDocumentSpace, int dimensions, Transform transform, MatrixFactorization reducer, boolean readHeaderToken, BasisMapping<String,String> termToIndex): Constructs a new LatentSemanticAnalysis using the provided objects for processing. |

| Modifier and Type | Method and Description |
|---|---|
| int | documentSpaceSize(): Returns the number of documents processed by LatentSemanticAnalysis if the document space has been retained. |
| DoubleVector | getDocumentVector(int documentNumber): Returns the semantics of the document as represented by a numeric vector. |
| String | getSpaceName(): Returns a unique string describing the name and configuration of this algorithm. |
| void | processSpace(Properties properties): Once all the documents have been processed, performs any post-processing steps on the data. |

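Methods inherited from class GenericTermDocumentVectorSpace: getVector, getVectorLength, getWords, handleDocumentHeader, processDocument, processSpace

public static final String MATRIX_TRANSFORM_PROPERTY
The property to define the Transform class to be used when processing the space after all the documents have been seen.

public static final String LSA_DIMENSIONS_PROPERTY
The property to set the number of dimensions to which the space should be reduced using the SVD.
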
public static final String LSA_SVD_ALGORITHM_PROPERTY
The property to set the specific SVD algorithm used by an instance during processSpace. The value should be the name of an SVD.Algorithm. If this property is unset, any available algorithm will be used according to the ordering defined in SVD.

public static final String RETAIN_DOCUMENT_SPACE_PROPERTY
The property whose boolean value indicates whether the document space should be retained after processSpace. Setting this property to true will enable the getDocumentVector method.

public LatentSemanticAnalysis()
throws IOException
Creates a new LatentSemanticAnalysis instance. This initializes the instance with the default parameters set in the original paper.
Throws:
IOException

public LatentSemanticAnalysis(boolean retainDocumentSpace,
int dimensions,
Transform transform,
MatrixFactorization reducer,
boolean readHeaderToken,
BasisMapping<String,String> termToIndex)
throws IOException
Constructs a new LatentSemanticAnalysis using the provided objects for processing.
Parameters:
retainDocumentSpace - If true, the document space will be made accessible
dimensions - The number of dimensions to retain in the reduced space
transform - The Transform to apply before reduction
reducer - The MatrixFactorization algorithm to apply to reduce the transformed term-document matrix
readHeaderToken - If true, the first token of each document will be read and passed to handleDocumentHeader, which discards the header
termToIndex - The BasisMapping used to map strings to indices
Throws:
IOException - if this instance encounters any errors when creating the backing array files required for processing

public String getSpaceName()
Returns a unique string describing the name and configuration of this algorithm.

public DoubleVector getDocumentVector(int documentNumber)
Returns the semantics of the document as represented by a numeric vector. Like getVector, this method is only to be used after processSpace has been called. By default, the document space is not retained unless retainDocumentSpace is set to true.
Implementation note: If a specific document ordering is needed, use caution when using this class in a multi-threaded environment. Because a document's number is based on the order in which it was processed, no guarantee is made that it will correspond with the original document ordering as it exists in the corpus files. However, in a single-threaded environment, the ordering will be preserved.
Parameters:
documentNumber - the number of the document according to when it was processed
Throws:
IllegalArgumentException - If the document space was not retained or the document number is out of range.

public int documentSpaceSize()
Returns the number of documents processed by LatentSemanticAnalysis if the document space has been retained.
Throws:
IllegalArgumentException - If the document space has not been retained.

public void processSpace(Properties properties)
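Once all the documents have been processed, performs any post-processing steps on the data. Exposed parameters of the algorithm may be configured through the properties argument.
By general contract, once this method has been called, processDocument will not be called again.
Parameters:
properties - a set of properties and values that may be used to configure any exposed parameters of the algorithm. See this class's javadoc for the full list of supported properties.

Copyright © 2012. All Rights Reserved.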