public abstract class GenericTermDocumentVectorSpace extends Object implements SemanticSpace
A base class for term-document based SemanticSpaces. It processes a document by
tokenizing all of the provided text and counting the term occurrences within
the document. Each column in these spaces represents a document, and the
column values initially represent the number of occurrences for each word.
After all documents are processed, the word space can be modified with one of
the many Matrix Transform classes. The transform, if
provided, will be used to rescore each term document occurrence count.
Typically, this reweighting is done to increase the score for
important and distinguishing terms while less salient terms, such as stop
words, are given a lower score. After calling processSpace, subclasses should assign their
final data matrix to wordSpace. This final matrix should maintain
the same row ordering, but the column ordering and dimensionality can be
modified in any way.
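The flow described above (count terms per document, optionally reweight the counts, keep the row ordering intact) can be sketched in plain Java. This is a standalone illustration, not this class's actual implementation: the class TermDocumentSketch, its methods, the whitespace tokenizer, and the particular TF-IDF variant used as the transform are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a term-document space; not part of any library. */
public class TermDocumentSketch {
    // Maps each term to its row index, assigned in first-seen order.
    private final Map<String, Integer> termToIndex = new LinkedHashMap<>();
    // One sparse column (row index -> raw count) per processed document.
    private final List<Map<Integer, Integer>> columns = new ArrayList<>();

    /** Tokenizes on whitespace (an assumption) and stores the counts as a new column. */
    public void processDocument(String text) {
        Map<Integer, Integer> column = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer row = termToIndex.get(token);
            if (row == null) {
                row = termToIndex.size();
                termToIndex.put(token, row);
            }
            column.merge(row, 1, Integer::sum);
        }
        columns.add(column);
    }

    /** Builds the dense terms-by-documents matrix of raw occurrence counts. */
    public double[][] rawCounts() {
        double[][] matrix = new double[termToIndex.size()][columns.size()];
        for (int doc = 0; doc < columns.size(); doc++) {
            for (Map.Entry<Integer, Integer> e : columns.get(doc).entrySet()) {
                matrix[e.getKey()][doc] = e.getValue();
            }
        }
        return matrix;
    }

    /** One common TF-IDF variant: count * ln(numDocs / docsContainingTerm). */
    public static double[][] tfIdf(double[][] counts) {
        int rows = counts.length;
        int cols = rows == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            int docsWithTerm = 0;
            for (int c = 0; c < cols; c++) {
                if (counts[r][c] > 0) {
                    docsWithTerm++;
                }
            }
            double idf = docsWithTerm == 0 ? 0.0 : Math.log((double) cols / docsWithTerm);
            for (int c = 0; c < cols; c++) {
                weighted[r][c] = counts[r][c] * idf;
            }
        }
        return weighted;
    }

    /** Returns the row index of a term, or -1 if the term was never seen. */
    public int rowOf(String term) {
        Integer row = termToIndex.get(term);
        return row == null ? -1 : row;
    }

    public static void main(String[] args) {
        TermDocumentSketch space = new TermDocumentSketch();
        space.processDocument("the cat sat");
        space.processDocument("the dog sat on the mat");
        double[][] weighted = tfIdf(space.rawCounts());
        System.out.println("weight of 'cat' in document 0: " + weighted[space.rowOf("cat")][0]);
    }
}
```

In this sketch the transform changes only the cell values; the term-to-row mapping is untouched, which mirrors the requirement that the final matrix keep the same row ordering.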
This class is thread-safe for concurrent calls of processDocument. Once processSpace has been called, no further calls to
processDocument should be made.
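One way such thread safety can be achieved is sketched below in plain Java. The class ConcurrentCountingSketch and all of its methods are hypothetical and are not taken from this class's source; the sketch only shows how an AtomicInteger document counter combined with concurrent maps lets several threads process separate documents at once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of concurrent document counting; not library code. */
public class ConcurrentCountingSketch {
    // Hands out a unique index to each document, even under contention.
    private final AtomicInteger documentCounter = new AtomicInteger();
    // term -> (document index -> occurrence count)
    private final Map<String, Map<Integer, Integer>> counts = new ConcurrentHashMap<>();

    /** Safe to call from many threads, each with its own document. */
    public int processDocument(String text) {
        int docIndex = documentCounter.getAndIncrement();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            counts.computeIfAbsent(token, t -> new ConcurrentHashMap<>())
                  .merge(docIndex, 1, Integer::sum);
        }
        return docIndex;
    }

    /** Number of documents processed so far. */
    public int documentCount() {
        return documentCounter.get();
    }

    /** Total occurrences of a term across all processed documents. */
    public int totalCount(String term) {
        Map<Integer, Integer> perDocument = counts.get(term);
        if (perDocument == null) {
            return 0;
        }
        int total = 0;
        for (int count : perDocument.values()) {
            total += count;
        }
        return total;
    }

    /** Processes every document on a fixed-size pool and waits for completion. */
    public static ConcurrentCountingSketch processAll(List<String> documents, int threads) {
        ConcurrentCountingSketch space = new ConcurrentCountingSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String document : documents) {
            pool.execute(() -> space.processDocument(document));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("document processing timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing documents", e);
        }
        return space;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("a b a", "b c", "a");
        ConcurrentCountingSketch space = processAll(documents, 3);
        System.out.println("documents processed: " + space.documentCount());
    }
}
```

Waiting for the pool to terminate before reading any counts plays the same role as the rule above: all processDocument calls finish before the space is finalized.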
This implementation does not support access to the semantic vectors until
after processSpace has been called.

| Modifier and Type | Field and Description |
|---|---|
| protected AtomicInteger | documentCounter: The counter for recording the current number of documents observed. |
| protected static Logger | LOG |
| protected Matrix | wordSpace: The word space of the term document based word space model. |

| Constructor and Description |
|---|
| GenericTermDocumentVectorSpace(): Constructs the GenericTermDocumentVectorSpace. |
| GenericTermDocumentVectorSpace(boolean readHeaderToken, BasisMapping<String,String> termToIndex, MatrixBuilder termDocumentMatrixBuilder): Constructs the GenericTermDocumentVectorSpace using the provided objects for processing. |

| Modifier and Type | Method and Description |
|---|---|
| Vector | getVector(String word): Returns the semantic vector for the provided word. |
| int | getVectorLength(): Returns the length of vectors in this semantic space. |
| Set<String> | getWords(): Returns the set of words that are represented in this semantic space. |
| protected void | handleDocumentHeader(int docIndex, String header): Subclasses should override this method if they need to utilize a header token for each document. |
| void | processDocument(BufferedReader document): Tokenizes the document using the IteratorFactory and updates the term-document frequency counts. |
| protected MatrixFile | processSpace(Transform transform): Processes the GenericTermDocumentVectorSpace with the provided Transform, if it is not null, and returns the result as a MatrixFile. |
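
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface SemanticSpace: getSpaceName, processSpace

protected static final Logger LOG

protected final AtomicInteger documentCounter
The counter for recording the current number of documents observed.

protected Matrix wordSpace
The word space of the term document based word space model. This matrix is available only after the processSpace method has been called.

public GenericTermDocumentVectorSpace()
throws IOException
Constructs the GenericTermDocumentVectorSpace.
Throws:
IOException - if this instance encounters any errors when creating the backing array files required for processing

public GenericTermDocumentVectorSpace(boolean readHeaderToken,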
BasisMapping<String,String> termToIndex,
MatrixBuilder termDocumentMatrixBuilder)
throws IOException
Constructs the GenericTermDocumentVectorSpace using the provided objects for processing.
Parameters:
readHeaderToken - If true, the first token of each document will be read and passed to handleDocumentHeader, which by default discards the header.
termToIndex - The BasisMapping used to map strings to indices.
termDocumentMatrixBuilder - The MatrixBuilder used to write document vectors to disk, which are later processed in processSpace.
Throws:
IOException - if this instance encounters any errors when creating the backing array files required for processing

public void processDocument(BufferedReader document) throws IOException
Tokenizes the document using the IteratorFactory and updates the term-document frequency counts.
This method is thread-safe and may be called in parallel with separate documents to speed up overall processing time.
Specified by:
processDocument in interface SemanticSpace
Parameters:
document - a reader that allows access to the text of the document
Throws:
IOException - if any error occurs while reading the document

public Set<String> getWords()
Returns the set of words that are represented in this semantic space.
Specified by:
getWords in interface SemanticSpace

public Vector getVector(String word)
Returns the semantic vector for the provided word.
Specified by:
getVector in interface SemanticSpace
Parameters:
word - a word that may be in the semantic space
Returns:
the Vector for the provided word, or null if the word was not in the space.

public int getVectorLength()
Returns the length of vectors in this semantic space. This value is available only after processSpace is called.
Specified by:
getVectorLength in interface SemanticSpace

protected MatrixFile processSpace(Transform transform)
Processes the GenericTermDocumentVectorSpace with the provided Transform, if it is not null, and returns the result as a MatrixFile. Otherwise, the raw term document counts are returned. Subclasses must call this in order to access the term document counts before doing any other processing.
Parameters:
transform - A matrix transform used to rescale the original raw document counts. If null, no transform is done.

protected void handleDocumentHeader(int docIndex,
String header)
Subclasses should override this method if they need to utilize a header token for each document.
Parameters:
docIndex - The document id assigned to the current document
header - The name of the current document.

Copyright © 2012. All Rights Reserved.