public abstract class GenericTermDocumentVectorSpace extends Object implements SemanticSpace
A base class for term-document based SemanticSpaces. It processes a document
by tokenizing all of the provided text and counting the term occurrences
within the document. Each column in these spaces represents a document, and
the column values initially represent the number of occurrences of each word.
After all documents are processed, the word space can be modified with one of
the many Matrix Transform classes. The transform, if provided, will be used
to rescore each term-document occurrence count. Typically, this reweighting
is done to increase the score of important and distinguishing terms, while
less salient terms, such as stop words, are given a lower score. After
calling processSpace, subclasses should assign their final data matrix to
wordSpace. This final matrix should maintain the same row ordering, but the
column ordering and dimensionality can be modified in any way.
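
The counting and reweighting steps described above can be sketched in plain Java. This is a minimal illustration, not the library's implementation: the class TermDocumentSketch, its methods, the whitespace tokenizer, and the choice of a tf * ln(N / df) weighting (one common TF-IDF variant) are all assumptions made for the example. The library's own tokenizer and Transform classes may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these names come from the documented library. */
final class TermDocumentSketch {

    /** Maps each term to a row index, in order of first appearance. */
    static Map<String, Integer> indexTerms(List<String> documents) {
        Map<String, Integer> termToRow = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    termToRow.putIfAbsent(token, termToRow.size());
                }
            }
        }
        return termToRow;
    }

    /** Builds the raw counts: rows are terms, columns are documents. */
    static double[][] countTerms(List<String> documents, Map<String, Integer> termToRow) {
        double[][] counts = new double[termToRow.size()][documents.size()];
        for (int col = 0; col < documents.size(); col++) {
            for (String token : documents.get(col).toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts[termToRow.get(token)][col]++;
                }
            }
        }
        return counts;
    }

    /** Rescores each count as tf * ln(N / df); row order is preserved. */
    static double[][] tfIdf(double[][] counts) {
        int numDocs = counts.length == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[counts.length][numDocs];
        for (int row = 0; row < counts.length; row++) {
            int docFrequency = 0;
            for (double c : counts[row]) {
                if (c > 0) {
                    docFrequency++;
                }
            }
            for (int col = 0; col < numDocs; col++) {
                weighted[row][col] = docFrequency == 0
                    ? 0.0
                    : counts[row][col] * Math.log((double) numDocs / docFrequency);
            }
        }
        return weighted;
    }
}
```

Under this weighting, a term that occurs in every document (a typical stop word) receives a score of zero in every column, while a term confined to one document keeps a positive score.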
This class is thread-safe for concurrent calls of processDocument. Once
processSpace has been called, no further calls to processDocument should be
made.
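
One way such a contract could be met is sketched below, assuming an AtomicInteger hands out a unique column index per document and a concurrent map holds the per-document counts. The class ConcurrentCountingSketch and all of its members are hypothetical. In particular, throwing an IllegalStateException on a late processDocument call is a choice made for this sketch; the documentation above only says such calls should not be made.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these names come from the documented library. */
final class ConcurrentCountingSketch {
    private final AtomicInteger documentCounter = new AtomicInteger();
    // Document index -> (term -> occurrence count).
    private final Map<Integer, Map<String, Integer>> columns = new ConcurrentHashMap<>();
    private volatile boolean finished = false;

    /** Safe to call from many threads; each call claims a unique document index. */
    int processDocument(String text) {
        if (finished) {
            throw new IllegalStateException("processSpace has already been called");
        }
        int docIndex = documentCounter.getAndIncrement();
        // The per-document map is confined to this thread until it is published.
        Map<String, Integer> termCounts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                termCounts.merge(token, 1, Integer::sum);
            }
        }
        columns.put(docIndex, termCounts);
        return docIndex;
    }

    /** Ends document processing and exposes the collected counts. */
    Map<Integer, Map<String, Integer>> processSpace() {
        finished = true;
        return columns;
    }

    int documentCount() {
        return documentCounter.get();
    }

    /** Feeds every document to a fixed-size thread pool and waits for completion. */
    static ConcurrentCountingSketch countInParallel(List<String> documents, int threads) {
        ConcurrentCountingSketch space = new ConcurrentCountingSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : documents) {
            pool.execute(() -> space.processDocument(doc));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("timed out waiting for documents");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return space;
    }
}
```

Because the index comes from getAndIncrement, the order in which documents receive column indices depends on thread scheduling, so column order is not deterministic across runs.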
This implementation does not support access to the semantic vectors until
after processSpace has been called.
Modifier and Type | Field and Description |
---|---|
protected AtomicInteger | documentCounter: The counter for recording the current number of documents observed. |
protected static Logger | LOG |
protected Matrix | wordSpace: The word space of the term document based word space model. |
Constructor and Description |
---|
GenericTermDocumentVectorSpace(): Constructs the GenericTermDocumentVectorSpace. |
GenericTermDocumentVectorSpace(boolean readHeaderToken, BasisMapping<String,String> termToIndex, MatrixBuilder termDocumentMatrixBuilder): Constructs the GenericTermDocumentVectorSpace using the provided objects for processing. |
Modifier and Type | Method and Description |
---|---|
Vector | getVector(String word): Returns the semantic vector for the provided word. |
int | getVectorLength(): Returns the length of vectors in this semantic space. |
Set<String> | getWords(): Returns the set of words that are represented in this semantic space. |
protected void | handleDocumentHeader(int docIndex, String header): Subclasses should override this method if they need to utilize a header token for each document. |
void | processDocument(BufferedReader document): Tokenizes the document using the IteratorFactory and updates the term-document frequency counts. |
protected MatrixFile | processSpace(Transform transform): Processes the GenericTermDocumentVectorSpace with the provided Transform if it is not null as a MatrixFile. |
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface SemanticSpace: getSpaceName, processSpace

protected static final Logger LOG

protected final AtomicInteger documentCounter
The counter for recording the current number of documents observed.

protected Matrix wordSpace
The word space of the term document based word space model. This matrix is
only available after the processSpace method has been called.

public GenericTermDocumentVectorSpace() throws IOException
Constructs the GenericTermDocumentVectorSpace.
Throws:
IOException - if this instance encounters any errors when creating the
backing array files required for processing

public GenericTermDocumentVectorSpace(boolean readHeaderToken, BasisMapping<String,String> termToIndex, MatrixBuilder termDocumentMatrixBuilder) throws IOException
Constructs the GenericTermDocumentVectorSpace using the provided objects for
processing.
Parameters:
readHeaderToken - If true, the first token of each document will be read and
passed to handleDocumentHeader, which by default discards the header.
termToIndex - The BasisMapping used to map strings to indices.
termDocumentMatrixBuilder - The MatrixBuilder used to write document vectors
to disk, which later get processed in processSpace.
Throws:
IOException - if this instance encounters any errors when creating the
backing array files required for processing

public void processDocument(BufferedReader document) throws IOException
Tokenizes the document using the IteratorFactory and updates the
term-document frequency counts.
This method is thread-safe and may be called in parallel with separate
documents to speed up overall processing time.
Specified by:
processDocument in interface SemanticSpace
Parameters:
document - a reader that allows access to the text of the document
Throws:
IOException - if any error occurs while reading the document

public Set<String> getWords()
Returns the set of words that are represented in this semantic space.
Specified by:
getWords in interface SemanticSpace
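
The interaction between readHeaderToken, handleDocumentHeader, and processDocument can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: the class HeaderTokenSketch, the whitespace tokenizer (the real class uses IteratorFactory), the in-memory count maps, and the choice to store headers in a map rather than discard them. Unlike the documented class, this sketch is not thread-safe.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, single-threaded sketch; not the library's implementation. */
final class HeaderTokenSketch {
    final Map<Integer, String> headers = new HashMap<>();
    final List<Map<String, Integer>> columns = new ArrayList<>();
    private final boolean readHeaderToken;

    HeaderTokenSketch(boolean readHeaderToken) {
        this.readHeaderToken = readHeaderToken;
    }

    /** Plays the role of an overridden header hook: keeps the header instead of discarding it. */
    void handleDocumentHeader(int docIndex, String header) {
        headers.put(docIndex, header);
    }

    /** Counts terms in one document, treating the first token as a header if requested. */
    int processDocument(BufferedReader document) throws IOException {
        int docIndex = columns.size();
        Map<String, Integer> counts = new HashMap<>();
        boolean headerPending = readHeaderToken;
        String line;
        while ((line = document.readLine()) != null) {
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                if (headerPending) {
                    // The header token is consumed and never counted as a term.
                    handleDocumentHeader(docIndex, token);
                    headerPending = false;
                } else {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        columns.add(counts);
        return docIndex;
    }

    /** Convenience wrapper for in-memory text. */
    int processText(String text) {
        try {
            return processDocument(new BufferedReader(new StringReader(text)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With readHeaderToken set to true, a document such as "doc-42 the cat saw the dog" contributes the header "doc-42" for its document index and term counts for the remaining five tokens only.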
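
public Vector getVector(String word)
Returns the semantic vector for the provided word.
Specified by:
getVector in interface SemanticSpace
Parameters:
word - a word that may be in the semantic space
Returns:
The Vector for the provided word or null if the word was not in the space.

public int getVectorLength()
Returns the length of vectors in this semantic space. Implementations are
left free to define whether the returned value is valid before processSpace
is called.
Specified by:
getVectorLength in interface SemanticSpace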
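
The lookup contract of getVector and getVectorLength can be mirrored in a few lines, assuming the word space is a dense array with one row per term. The class VectorLookupSketch is hypothetical, and it returns a plain double[] copy rather than the library's Vector type.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a row lookup; not the library's implementation. */
final class VectorLookupSketch {
    private final Map<String, Integer> termToRow;
    private final double[][] wordSpace;

    VectorLookupSketch(Map<String, Integer> termToRow, double[][] wordSpace) {
        this.termToRow = new HashMap<>(termToRow);
        this.wordSpace = wordSpace;
    }

    /** Returns a copy of the word's row, or null if the word is not in the space. */
    double[] getVector(String word) {
        Integer row = termToRow.get(word);
        return row == null ? null : wordSpace[row].clone();
    }

    /** Every row has the same length: the number of columns in the word space. */
    int getVectorLength() {
        return wordSpace.length == 0 ? 0 : wordSpace[0].length;
    }
}
```

Callers should check the result of getVector for null before using it, since an unknown word is reported by a null return rather than by an exception.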

protected MatrixFile processSpace(Transform transform)
Processes the GenericTermDocumentVectorSpace with the provided Transform, if
it is not null, and returns the result as a MatrixFile. Otherwise, the raw
term document counts are returned. Subclasses must call this in order to
access the term document counts before doing any other processing.
Parameters:
transform - A matrix transform used to rescale the original raw document
counts. If null, no transform is done.

protected void handleDocumentHeader(int docIndex, String header)
Subclasses should override this method if they need to utilize a header
token for each document.
Parameters:
docIndex - The document id assigned to the current document
header - The name of the current document.

Copyright © 2012. All Rights Reserved.