public class LatentSemanticAnalysis extends GenericTermDocumentVectorSpace
LSA first processes documents into a word-document matrix in which each unique word is assigned a row and each column represents a document. The values of this matrix are the number of times the row's word occurs in the column's document. After the matrix has been built, the Singular Value Decomposition (SVD) is used to reduce the dimensionality of the original word-document matrix, denoted A. The SVD factors any matrix A into three matrices, A = U Σ V^T, such that Σ is a diagonal matrix containing the singular values of A. The singular values in Σ are ordered by how much of the variance in the values of A each accounts for. The original matrix may be approximated by recomputing it with only the k largest of these singular values and setting the rest to 0. The approximated matrix Â = U_k Σ_k V_k^T is the least-squares best-fit rank-k approximation of A. LSA reduces the dimensionality by keeping only the first k dimensions of the row vectors of U. These vectors form the semantic space of the words.
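The counting step described above can be sketched in plain Java. This is an illustrative sketch only, not the library's implementation: the class name TermDocumentCounts, the method build, and the naive lowercase/split tokenization are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the S-Space API.
class TermDocumentCounts {

    /**
     * Builds a word-document count matrix. Each map entry is one row
     * (a unique word, in first-seen order); index j of the row holds the
     * number of times that word occurs in document j.
     */
    static Map<String, int[]> build(List<String> documents) {
        Map<String, int[]> rows = new LinkedHashMap<>();
        for (int col = 0; col < documents.size(); col++) {
            // Assumed tokenization: lowercase, split on runs of non-letters.
            String[] tokens = documents.get(col).toLowerCase().split("[^a-z]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split() can yield a leading empty token
                }
                int[] row = rows.computeIfAbsent(
                        token, w -> new int[documents.size()]);
                row[col]++;
            }
        }
        return rows;
    }
}
```

Each row of the resulting map corresponds to one row of A before any transform or SVD is applied; the reduction step itself is left to the library.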
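This class offers configurable preprocessing and dimensionality reduction through the following properties. These properties should be specified in the Properties object passed to the processSpace method.
Property | Default | Description |
---|---|---|
"edu.ucla.sspace.lsa.LatentSemanticAnalysis.transform" | LogEntropyTransform | The Transform class to be used when processing the space after all the documents have been seen. The class should be public, not abstract, and should provide a public no-arg constructor. |
"edu.ucla.sspace.lsa.LatentSemanticAnalysis.dimensions" | 300 | The number of dimensions to which the space should be reduced using the SVD. |
"edu.ucla.sspace.lsa.LatentSemanticAnalysis.svd.algorithm" | SVD.Algorithm.ANY | The specific SVD algorithm used during processSpace. |
"edu.ucla.sspace.lsa.LatentSemanticAnalysis.retainDocSpace" | false | Whether the document space should be retained after processSpace. Setting this property to true will enable the getDocumentVector method. |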
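As a minimal sketch of how such a configuration might be assembled, the following builds a java.util.Properties object using the property keys listed above. The class name LsaPropertiesSketch and the method build are hypothetical; with the library on the classpath, the constants LSA_DIMENSIONS_PROPERTY, LSA_SVD_ALGORITHM_PROPERTY, and RETAIN_DOCUMENT_SPACE_PROPERTY would normally be used in place of the string literals.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the S-Space API.
class LsaPropertiesSketch {
    // Keys copied from the property list in this class's documentation.
    static final String DIMENSIONS =
            "edu.ucla.sspace.lsa.LatentSemanticAnalysis.dimensions";
    static final String SVD_ALGORITHM =
            "edu.ucla.sspace.lsa.LatentSemanticAnalysis.svd.algorithm";
    static final String RETAIN_DOC_SPACE =
            "edu.ucla.sspace.lsa.LatentSemanticAnalysis.retainDocSpace";

    /** Returns properties intended to be passed to processSpace. */
    static Properties build(int dimensions, boolean retainDocSpace) {
        Properties props = new Properties();
        // Property values are strings, so numbers and booleans are converted.
        props.setProperty(DIMENSIONS, Integer.toString(dimensions));
        // "ANY" is the name of the documented default, SVD.Algorithm.ANY.
        props.setProperty(SVD_ALGORITHM, "ANY");
        props.setProperty(RETAIN_DOC_SPACE, Boolean.toString(retainDocSpace));
        return props;
    }
}
```

The returned object would then be handed to processSpace(Properties) once all documents have been processed. Properties left unset fall back to the defaults listed above.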
This class is thread-safe for concurrent calls of processDocument. Once processSpace has been called, no further calls to processDocument should be made. This implementation does not support access to the semantic vectors until after processSpace has been called.
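The calling sequence implied by this contract can be sketched with a stand-in class. StubSpace below is hypothetical and is not the S-Space implementation; it only mimics the two method names so that the ordering (all concurrent processDocument calls finish first, then a single processSpace call) can be shown in self-contained code.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class LifecycleSketch {

    /** Hypothetical stand-in for a semantic space; NOT the S-Space class. */
    static class StubSpace {
        private final AtomicInteger documentCounter = new AtomicInteger();
        private volatile boolean processed = false;

        // Safe to call from many threads: the only shared state is atomic.
        void processDocument(String text) {
            documentCounter.incrementAndGet();
        }

        // Intended to be called once, after all documents are processed.
        void processSpace() {
            processed = true;
        }

        boolean isProcessed() { return processed; }
        int documentCount() { return documentCounter.get(); }
    }

    /** Processes documents concurrently, then finalizes the space once. */
    static StubSpace run(List<String> documents, int threads) {
        StubSpace space = new StubSpace();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : documents) {
            pool.execute(() -> space.processDocument(doc));
        }
        pool.shutdown();
        try {
            // Wait until every processDocument call has returned.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("document processing timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        }
        // Only now is it safe to finalize; no processDocument calls follow.
        space.processSpace();
        return space;
    }
}
```

Waiting on awaitTermination before calling processSpace is the design point here: it ensures no document is still being processed when post-processing begins.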
Modifier and Type | Field and Description |
---|---|
static String | LSA_DIMENSIONS_PROPERTY: The property to set the number of dimensions to which the space should be reduced using the SVD. |
static String | LSA_SVD_ALGORITHM_PROPERTY: The property to set the specific SVD algorithm used by an instance during processSpace. |
static String | MATRIX_TRANSFORM_PROPERTY: The property to define the Transform class to be used when processing the space after all the documents have been seen. |
static String | RETAIN_DOCUMENT_SPACE_PROPERTY: The property whose boolean value indicates whether the document space should be retained after processSpace. |
Fields inherited from class GenericTermDocumentVectorSpace: documentCounter, LOG, wordSpace
Constructor and Description |
---|
LatentSemanticAnalysis(): Creates a new LatentSemanticAnalysis instance. |
LatentSemanticAnalysis(boolean retainDocumentSpace, int dimensions, Transform transform, MatrixFactorization reducer, boolean readHeaderToken, BasisMapping<String,String> termToIndex): Constructs a new LatentSemanticAnalysis using the provided objects for processing. |
Modifier and Type | Method and Description |
---|---|
int | documentSpaceSize(): Returns the number of documents processed by LatentSemanticAnalysis if the document space has been retained. |
DoubleVector | getDocumentVector(int documentNumber): Returns the semantics of the document as represented by a numeric vector. |
String | getSpaceName(): Returns a unique string describing the name and configuration of this algorithm. |
void | processSpace(Properties properties): Once all the documents have been processed, performs any post-processing steps on the data. |
Methods inherited from class GenericTermDocumentVectorSpace: getVector, getVectorLength, getWords, handleDocumentHeader, processDocument, processSpace
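public static final String MATRIX_TRANSFORM_PROPERTY
The property to define the Transform class to be used when processing the space after all the documents have been seen.
public static final String LSA_DIMENSIONS_PROPERTY
The property to set the number of dimensions to which the space should be reduced using the SVD.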
public static final String LSA_SVD_ALGORITHM_PROPERTY
The property to set the specific SVD algorithm used by an instance during processSpace. The value should be the name of an SVD.Algorithm. If this property is unset, any available algorithm will be used according to the ordering defined in SVD.
public static final String RETAIN_DOCUMENT_SPACE_PROPERTY
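The property whose boolean value indicates whether the document space should be retained after processSpace. Setting this property to true will enable the getDocumentVector method.
public LatentSemanticAnalysis() throws IOException
Creates a new LatentSemanticAnalysis instance. This initializes the instance with the default parameters set in the original paper.
Throws:
IOException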
public LatentSemanticAnalysis(boolean retainDocumentSpace, int dimensions, Transform transform, MatrixFactorization reducer, boolean readHeaderToken, BasisMapping<String,String> termToIndex) throws IOException
Constructs a new LatentSemanticAnalysis using the provided objects for processing.
Parameters:
retainDocumentSpace - If true, the document space will be made accessible
dimensions - The number of dimensions to retain in the reduced space
transform - The Transform to apply before reduction
reducer - The MatrixFactorization algorithm to apply to reduce the transformed term-document matrix
readHeaderToken - If true, the first token of each document will be read and passed to handleDocumentHeader, which discards the header
termToIndex - The BasisMapping used to map strings to indices
Throws:
IOException - if this instance encounters any errors when creating the backing array files required for processing
public String getSpaceName()
Returns a unique string describing the name and configuration of this algorithm.
public DoubleVector getDocumentVector(int documentNumber)
Returns the semantics of the document as represented by a numeric vector. As with getVector, this method is only to be used after processSpace has been called. By default, the document space is not retained unless retainDocumentSpace is set to true.
Implementation note: If a specific document ordering is needed, caution should be used when using this class in a multi-threaded environment. Because the document number is based on the order in which the document was processed, no guarantee is made that it will correspond to the original document ordering as it exists in the corpus files. However, in a single-threaded environment, the ordering will be preserved.
Parameters:
documentNumber - the number of the document according to when it was processed
Throws:
IllegalArgumentException - If the document space was not retained or the document number is out of range.
public int documentSpaceSize()
Returns the number of documents processed by LatentSemanticAnalysis if the document space has been retained.
Throws:
IllegalArgumentException - If the document space has not been retained.
public void processSpace(Properties properties)
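Once all the documents have been processed, performs any post-processing steps on the data. Exposed parameters of the algorithm may be configured through the properties argument.
By general contract, once this method has been called, processDocument will not be called again.
Parameters:
properties - a set of properties and values that may be used to configure any exposed parameters of the algorithm. See this class's javadoc for the full list of supported properties.
Copyright © 2012. All Rights Reserved.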