GenericTermDocumentVectorSpace (S-Space Package 2.0.1 API)

java.lang.Object
- edu.ucla.sspace.common.GenericTermDocumentVectorSpace

All Implemented Interfaces:

SemanticSpace

Direct Known Subclasses:

ExplicitSemanticAnalysis, LatentSemanticAnalysis, LocalityPreservingSemanticAnalysis, VectorSpaceModel
```
public abstract class GenericTermDocumentVectorSpace
extends Object
implements SemanticSpace
```
This base class centralizes much of the common text processing needed for term-document based SemanticSpaces. It processes a document by tokenizing all of the provided text and counting the term occurrences within the document. Each column in these spaces represents a document, and the column values initially represent the number of occurrences of each word. After all documents are processed, the word space can be modified with one of the many Matrix Transform classes. The transform, if provided, is used to rescore each term-document occurrence count. Typically, this reweighting increases the score of important and distinguishing terms, while less salient terms, such as stop words, are given a lower score. After calling processSpace, subclasses should assign their final data matrix to wordSpace. This final matrix should maintain the same row ordering, but the column ordering and dimensionality can be modified in any way.
This class is thread-safe for concurrent calls of processDocument. Once processSpace has been called, no further calls to processDocument should be made. This implementation does not support access to the semantic vectors until after processSpace has been called.
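The reweighting step described above can be illustrated with a small, self-contained sketch. It uses tf-idf, one common weighting scheme, on a dense count matrix whose rows are terms and whose columns are documents. The class name `TfIdfSketch` and its method are hypothetical and do not use the S-Space `Transform` or `Matrix` types; the formula an actual S-Space transform applies may differ.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the S-Space API.
class TfIdfSketch {

    /**
     * Reweights a dense term-document count matrix
     * (rows = terms, columns = documents) with tf-idf.
     */
    static double[][] tfIdf(int[][] counts) {
        int numTerms = counts.length;
        int numDocs = numTerms == 0 ? 0 : counts[0].length;

        // Total number of tokens in each document (column sums).
        int[] docLengths = new int[numDocs];
        // Number of documents in which each term occurs at least once.
        int[] docFrequency = new int[numTerms];
        for (int t = 0; t < numTerms; t++) {
            for (int d = 0; d < numDocs; d++) {
                docLengths[d] += counts[t][d];
                if (counts[t][d] > 0) {
                    docFrequency[t]++;
                }
            }
        }

        double[][] weighted = new double[numTerms][numDocs];
        for (int t = 0; t < numTerms; t++) {
            if (docFrequency[t] == 0) {
                continue; // term never occurs; leave the row at zero
            }
            // Terms that occur in every document get idf = log(1) = 0.
            double idf = Math.log((double) numDocs / docFrequency[t]);
            for (int d = 0; d < numDocs; d++) {
                if (counts[t][d] == 0) {
                    continue;
                }
                double tf = (double) counts[t][d] / docLengths[d];
                weighted[t][d] = tf * idf;
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        int[][] counts = {
            {2, 1, 3}, // a stop-word-like term that occurs in every document
            {1, 0, 0}, // a distinguishing term that occurs only in document 0
        };
        System.out.println(Arrays.deepToString(tfIdf(counts)));
    }
}
```

In this sketch the term that appears in every document is scored zero, while the term confined to one document keeps a positive score, which matches the intent of lowering the weight of stop words.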

Author:

Keith Stevens

See Also:
Transform, SVD

Field Summary

Fields
Modifier and Type	Field and Description
`protected AtomicInteger`	`documentCounter` The counter for recording the current number of documents observed.
`protected static Logger`	`LOG`
`protected Matrix`	`wordSpace` The word space of the term document based word space model.

Constructor Summary

Constructors
Constructor and Description
`GenericTermDocumentVectorSpace()` Constructs the `GenericTermDocumentVectorSpace`.
`GenericTermDocumentVectorSpace(boolean readHeaderToken, BasisMapping<String,String> termToIndex, MatrixBuilder termDocumentMatrixBuilder)` Constructs the `GenericTermDocumentVectorSpace` using the provided objects for processing.

Method Summary

Methods
Modifier and Type	Method and Description
`Vector`	`getVector(String word)` Returns the semantic vector for the provided word.
`int`	`getVectorLength()` Returns the length of vectors in this semantic space.
`Set<String>`	`getWords()` Returns the set of words that are represented in this semantic space.
`protected void`	`handleDocumentHeader(int docIndex, String header)` Subclasses should override this method if they need to utilize a header token for each document.
`void`	`processDocument(BufferedReader document)` Tokenizes the document using the `IteratorFactory` and updates the term-document frequency counts.
`protected MatrixFile`	`processSpace(Transform transform)` Processes the `GenericTermDocumentVectorSpace` with the provided `Transform`, if it is not `null`, and returns the result as a `MatrixFile`.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface edu.ucla.sspace.common.SemanticSpace
getSpaceName, processSpace

- Field Detail
  - LOG
```
protected static final Logger LOG
```
  - documentCounter
```
protected final AtomicInteger documentCounter
```
    The counter for recording the current number of documents observed. Subclasses can use this for any reporting.
  - wordSpace
```
protected Matrix wordSpace
```
    The word space of the term document based word space model. If the word space is reduced, it is the left factor matrix of the SVD of the word-document matrix. This matrix is only available after the processSpace method has been called.
- Constructor Detail
  - GenericTermDocumentVectorSpace
```
public GenericTermDocumentVectorSpace()
                               throws IOException
```
    Constructs the GenericTermDocumentVectorSpace.
    
    Throws:
    
    IOException - if this instance encounters any errors when creating the backing array files required for processing
  - GenericTermDocumentVectorSpace
```
public GenericTermDocumentVectorSpace(boolean readHeaderToken,
                              BasisMapping<String,String> termToIndex,
                              MatrixBuilder termDocumentMatrixBuilder)
                               throws IOException
```
    Constructs the GenericTermDocumentVectorSpace using the provided objects for processing.
    
    Parameters:
    readHeaderToken - If true, the first token of each document will be read and passed to handleDocumentHeader, which by default discards the header.
    termToIndex - The BasisMapping used to map strings to indices.
    termDocumentMatrixBuilder - The MatrixBuilder used to write document vectors to disk which later get processed in processSpace.
    
    Throws:
    
    IOException - if this instance encounters any errors when creating the backing array files required for processing
- Method Detail
  - processDocument
```
public void processDocument(BufferedReader document)
                     throws IOException
```
    Tokenizes the document using the IteratorFactory and updates the term-document frequency counts.
    This method is thread-safe and may be called in parallel with separate documents to speed up overall processing time.
    
    Specified by:
    
    processDocument in interface SemanticSpace
    
    Parameters:
    document - a reader that allows access to the text of the document
    
    Throws:
    
    IOException - if any error occurs while reading the document
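    The parallel usage described above can be sketched without the S-Space classes. In this minimal stand-in, an `AtomicInteger` plays the role of `documentCounter` by handing each document a unique column index, and concurrent maps hold the counts. The class `ConcurrentCountSketch`, its whitespace tokenization, and its storage layout are assumptions for illustration only; the real class tokenizes with the `IteratorFactory` and writes through a `MatrixBuilder`.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in; not S-Space code.
class ConcurrentCountSketch {
    // Hands out a unique column index per document, even under concurrency.
    final AtomicInteger documentCounter = new AtomicInteger();
    // term -> (document index -> occurrence count)
    final Map<String, Map<Integer, Integer>> counts = new ConcurrentHashMap<>();

    // Safe to call from several threads, one document per call.
    void processDocument(String text) {
        int docIndex = documentCounter.getAndIncrement();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.computeIfAbsent(token, k -> new ConcurrentHashMap<>())
                  .merge(docIndex, 1, Integer::sum);
        }
    }

    // Sums a term's occurrences over all documents.
    int totalCount(String term) {
        int total = 0;
        Map<Integer, Integer> row =
            counts.getOrDefault(term, Collections.<Integer, Integer>emptyMap());
        for (int c : row.values()) {
            total += c;
        }
        return total;
    }

    // Processes every document on a small thread pool and waits for completion.
    static ConcurrentCountSketch countAll(List<String> documents) {
        ConcurrentCountSketch space = new ConcurrentCountSketch();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String doc : documents) {
            pool.submit(() -> space.processDocument(doc));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("document processing timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return space;
    }
}
```

    Waiting for the pool to terminate before reading the counts corresponds to the rule that no further processDocument calls are made once processSpace has been called.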
  - getWords
```
public Set<String> getWords()
```
    Returns the set of words that are represented in this semantic space.
    
    Specified by:
    
    getWords in interface SemanticSpace
    
    Returns:
    the set of words that are represented in this semantic space.
  - getVector
```
public Vector getVector(String word)
```
    Returns the semantic vector for the provided word.
    
    Specified by:
    
    getVector in interface SemanticSpace
    
    Parameters:
    word - a word that may be in the semantic space
    
    Returns:
    The Vector for the provided word or null if the word was not in the space.
  - getVectorLength
```
public int getVectorLength()
```
    Returns the length of vectors in this semantic space. Implementations are left free to define whether the returned value is valid before processSpace is called.
    
    Specified by:
    
    getVectorLength in interface SemanticSpace
  - processSpace
```
protected MatrixFile processSpace(Transform transform)
```
    Processes the GenericTermDocumentVectorSpace with the provided Transform, if it is not null, and returns the result as a MatrixFile. If the transform is null, the raw term-document counts are returned. Subclasses must call this in order to access the term-document counts before doing any other processing.
    
    Parameters:
    transform - A matrix transform used to rescale the original raw document counts. If null, no transform is applied.
  - handleDocumentHeader
```
protected void handleDocumentHeader(int docIndex,
                        String header)
```
    Subclasses should override this method if they need to utilize a header token for each document. Implementations of this method must be thread safe. The default action is a no-op.
    
    Parameters:
    docIndex - The document id assigned to the current document
    header - The header token of the current document, for example the document's name.
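
    As a rough illustration of the header hook, the sketch below splits the first whitespace-delimited token from a document and records it under the document's index in a concurrent map, so that the hook stays thread safe. The class `HeaderTokenSketch` and the helper `consumeHeader` are hypothetical; how the real class extracts the header token from its token stream is not shown in this page.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in; not S-Space code.
class HeaderTokenSketch {
    // Concurrent map so the hook is safe to call from several threads.
    private final Map<Integer, String> documentNames = new ConcurrentHashMap<>();

    // Plays the role of an overridden handleDocumentHeader.
    void handleDocumentHeader(int docIndex, String header) {
        documentNames.put(docIndex, header);
    }

    // Splits off the first token, passes it to the hook, and
    // returns the remaining text that would be tokenized and counted.
    String consumeHeader(int docIndex, String document) {
        String trimmed = document.trim();
        if (trimmed.isEmpty()) {
            return ""; // nothing to treat as a header
        }
        String[] parts = trimmed.split("\\s+", 2);
        handleDocumentHeader(docIndex, parts[0]);
        return parts.length > 1 ? parts[1] : "";
    }

    // Returns the recorded header for a document, or null if none was seen.
    String nameOf(int docIndex) {
        return documentNames.get(docIndex);
    }
}
```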
