OrderedTemporalRandomIndexing (S-Space Package 2.0.1 API)

java.lang.Object
- edu.ucla.sspace.tri.OrderedTemporalRandomIndexing

All Implemented Interfaces:

Filterable, SemanticSpace, TemporalSemanticSpace

Direct Known Subclasses:

FixedDurationTemporalRandomIndexing
```
public abstract class OrderedTemporalRandomIndexing
extends Object
implements TemporalSemanticSpace, Filterable
```
A simplified version of TemporalRandomIndexing that imposes restrictions on the document input ordering to improve efficiency at the cost of functionality. Specifically, this class assumes:
1. Documents will be processed in an on-line manner such that all documents that comprise a semantic slice will be contiguous.
2. After a semantic slice has been built and processed, it does not need to be referenced any longer and may be discarded.
The first property requires that the initial data be sorted according to some predetermined ordering. The second property limits the semantics that are retained at any given time period.
Because each slice is calculated and then discarded, this class provides a way for users to be notified when a semantic slice has been completed. Users may add a Runnable via the addPartitionHook(Runnable) method. When the input stream of documents partitions the current semantic slice from the next (i.e. the slice is complete), each runnable will be invoked. This allows users to perform any operations on the slice as necessary, such as saving it to disk or computing various statistics.
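The hook contract described above can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions: `PartitionHookSketch`, its list-backed slice, and its fixed-duration boundary test are hypothetical stand-ins, not the S-Space implementation. It shows the one point that matters for hook authors, namely that hooks run while the completed slice is still intact and before it is cleared.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the documented class; it stores raw document
// strings instead of semantic vectors so that the example is self-contained.
final class PartitionHookSketch {
    private final List<Runnable> partitionHooks = new ArrayList<>();
    private final List<String> currentSlice = new ArrayList<>();
    private final long sliceDuration; // assumed boundary rule: fixed duration
    private Long startTime = null;    // earliest time stamp in the current slice

    PartitionHookSketch(long sliceDuration) {
        this.sliceDuration = sliceDuration;
    }

    public void addPartitionHook(Runnable hook) {
        partitionHooks.add(hook);
    }

    public int sliceSize() {
        return currentSlice.size();
    }

    public void processDocument(String text, long timeStamp) {
        // If the next document falls outside the current slice, run every hook
        // first, then discard the slice contents.
        if (startTime != null && timeStamp - startTime >= sliceDuration) {
            for (Runnable hook : partitionHooks) {
                hook.run();
            }
            currentSlice.clear();
            startTime = null;
        }
        if (startTime == null) {
            startTime = timeStamp;
        }
        currentSlice.add(text);
    }

    // Feeds six time-ordered documents and records the size of each completed slice.
    public static List<Integer> demo() {
        PartitionHookSketch space = new PartitionHookSketch(10);
        List<Integer> completedSizes = new ArrayList<>();
        space.addPartitionHook(() -> completedSizes.add(space.sliceSize()));
        long[] stamps = {0, 3, 9, 10, 12, 25};
        for (long t : stamps) {
            space.processDocument("doc@" + t, t);
        }
        return completedSizes;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this simulation the last slice is never reported, because a hook only fires when a later document arrives and closes the slice. A caller that needs the final slice would have to read it out explicitly after the input is exhausted.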
This class implements Filterable, which allows for fine-grained control of which semantics are retained. The setSemanticFilter(Set) method can be used to specify which words should have their semantics retained. Note that the words that are filtered out will still be used in computing the semantics of other words. This behavior is intended for use with large corpora where retaining the semantics of all words in memory is infeasible.
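To make the filtering behavior concrete, the following sketch substitutes plain co-occurrence counts for random-indexing vectors. `SemanticFilterSketch` and its `cooccurrences` method are hypothetical names introduced for illustration and are not part of the library. The example shows that a word excluded from the filter gets no entry of its own, yet still contributes to the representation of a retained word that it appears next to.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Simplified illustration: co-occurrence counts replace summed index vectors.
final class SemanticFilterSketch {

    // Counts neighbors within a symmetric window, keeping results only for
    // focus words that are in semanticsToRetain.
    public static Map<String, Map<String, Integer>> cooccurrences(
            List<String> tokens, int windowSize, Set<String> semanticsToRetain) {
        Map<String, Map<String, Integer>> semantics = new TreeMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String focus = tokens.get(i);
            if (!semanticsToRetain.contains(focus)) {
                continue; // filtered out: no semantics are stored for this word
            }
            Map<String, Integer> counts =
                semantics.computeIfAbsent(focus, k -> new TreeMap<>());
            int from = Math.max(0, i - windowSize);
            int to = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    // Neighbors are counted even if they are not retained themselves.
                    counts.merge(tokens.get(j), 1, Integer::sum);
                }
            }
        }
        return semantics;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        System.out.println(cooccurrences(tokens, 2, Collections.singleton("cat")));
    }
}
```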
This base class defines the following configurable properties:

Property: "edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.windowSize"
Default: 4
This variable sets the number of words before and after that are counted as co-occurring. With the default value, 4 words are counted before and 4 words are counted after. This class always uses a symmetric window.

Property: "edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.vectorLength"
Default: 10000
This variable sets the number of dimensions to be used for the index and semantic vectors.

Property: "edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.sparseSemantics"
Default: true
This property specifies whether to use a sparse encoding for each word's semantics. Using a sparse encoding can result in a large saving in memory, while requiring more time to process each document.
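As an illustration of how these settings might be supplied, the sketch below builds a `java.util.Properties` object with the three property names listed above. The class name `TriConfigSketch`, the helper `buildProperties`, and the chosen values are assumptions for the example only.

```java
import java.util.Properties;

// Builds a Properties object using the property names documented above.
final class TriConfigSketch {
    static final String PREFIX = "edu.ucla.sspace.tri.OrderedTemporalRandomIndexing.";

    public static Properties buildProperties(int windowSize, int vectorLength,
                                             boolean sparseSemantics) {
        Properties props = new Properties();
        // Properties values are strings, so numeric and boolean settings are
        // converted before being stored.
        props.setProperty(PREFIX + "windowSize", Integer.toString(windowSize));
        props.setProperty(PREFIX + "vectorLength", Integer.toString(vectorLength));
        props.setProperty(PREFIX + "sparseSemantics", Boolean.toString(sparseSemantics));
        return props;
    }

    public static void main(String[] args) {
        buildProperties(3, 2048, true).list(System.out);
    }
}
```

The resulting object would then be passed to the `Properties` constructor of a concrete subclass. Setting the same keys with `System.setProperty` should have a comparable effect on the no-argument constructor, since that constructor reads the system properties.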

Due to the ordered nature of its processing, great care must be used when invoking processDocument from multiple threads. Multiple threads may order the documents such that the time stamps at semantic slice boundaries overlap. This may cause the shouldPartitionSpace(long) method to return true for slices with only a single document. Subclasses must make it clear whether any such multithreading behavior is permissible and how to correctly invoke it to avoid triggering semantic slice boundary edge cases.
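One conservative way to sidestep the boundary problem is to parallelize only the per-document preparation and to hand documents to the space from a single thread in time-stamp order. The sketch below shows that pattern with the standard `ExecutorService`. It is an assumption about a workable calling convention, not behavior prescribed by this class, and `OrderedFeedSketch` and `feedInOrder` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Prepares documents in parallel but consumes them sequentially, in input order.
final class OrderedFeedSketch {

    public static List<Long> feedInOrder(long[] sortedTimeStamps, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Long> processedOrder = new ArrayList<>();
        try {
            List<Future<Long>> prepared = new ArrayList<>();
            for (final long timeStamp : sortedTimeStamps) {
                // Stand-in for expensive work such as reading or tokenizing a file.
                Callable<Long> task = () -> timeStamp;
                prepared.add(pool.submit(task));
            }
            // Futures are read back in submission order on the calling thread,
            // so the space would never see time stamps out of order.
            for (Future<Long> result : prepared) {
                // A real caller would invoke processDocument(reader, timeStamp) here.
                processedOrder.add(result.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while feeding documents", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("document preparation failed", e);
        } finally {
            pool.shutdown();
        }
        return processedOrder;
    }

    public static void main(String[] args) {
        System.out.println(feedInOrder(new long[] {1, 2, 3, 50, 51}, 4));
    }
}
```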
In its base behavior, instances of this class do not support the optional getTimeSteps, getVectorAfter, getVectorBefore and getVectorBetween methods. However, subclasses may add this functionality.
Author:

David Jurgens

See Also:
RandomIndexing, TemporalRandomIndexing, TemporalSemanticSpace

Field Summary

Fields
Modifier and Type	Field and Description
`protected RandomIndexing`	`currentSlice` The current semantic slice, which is updated as new documents are processed and has its semantics cleared when `shouldPartitionSpace(long)` returns `true`.
`static int`	`DEFAULT_VECTOR_LENGTH` The default number of dimensions to be used by the index and semantic vectors.
`static int`	`DEFAULT_WINDOW_SIZE` The default number of words to view before and after each word in focus.
`protected Long`	`endTime` The most recent time stamp seen during the current semantic slice
`protected Collection<Runnable>`	`partitionHooks` The collection of hooks that are to be run each time this instance partitions its semantic space.
`static String`	`PERMUTATION_FUNCTION_PROPERTY` The property to specify the fully qualified name of a `edu.ucla.sspace.ri.PermutationFunction` if using permutations is enabled.
`protected Long`	`startTime` The least recent time stamp seen during the current semantic slice
`static String`	`USE_PERMUTATIONS_PROPERTY` The property to specify whether the index vectors for co-occurrent words should be permuted based on their relative position.
`static String`	`USE_SPARSE_SEMANTICS_PROPERTY` Specifies whether to use a sparse encoding for each word's semantics, which saves space but requires more computation.
`static String`	`VECTOR_LENGTH_PROPERTY` The property to specify the number of dimensions to be used by the index and semantic vectors.
`static String`	`WINDOW_SIZE_PROPERTY` The property to specify the number of words to view before and after each word in focus.

Constructor Summary

Constructors
Constructor and Description
`OrderedTemporalRandomIndexing()` Creates an instance of `OrderedTemporalRandomIndexing` using the system properties to configure the behavior.
`OrderedTemporalRandomIndexing(Properties props)` Creates an instance of `OrderedTemporalRandomIndexing` using the provided properties to configure the behavior.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`addPartitionHook(Runnable hook)` Adds the provided `Runnable` to the list of hooks that will be invoked immediately prior to the partitioning of this space.
`protected void`	`clear()` Clears the semantic content of this space as a part of the partitioning process.
`Long`	`endTime()` Returns the time for the latest semantics contained within this space.
`abstract String`	`getSpaceName()` Returns a unique string describing the name and configuration of this algorithm.
`SortedSet<Long>`	`getTimeSteps(String word)` Not supported
`Vector`	`getVector(String word)` Returns the provided word's semantic vector based on all temporal occurrences.
`Vector`	`getVectorAfter(String word, long startTime)` Not supported
`Vector`	`getVectorBefore(String word, long endTime)` Not supported
`Vector`	`getVectorBetween(String word, long startTime, long endTime)` Not supported
`int`	`getVectorLength()` Returns the length of vectors in this semantic space.
`Set<String>`	`getWords()` Returns the set of words that are represented in this semantic space.
`Map<String,TernaryVector>`	`getWordToIndexVector()` Returns an unmodifiable view on the token to `TernaryVector` mapping used by this instance.
`void`	`processDocument(BufferedReader document)` Processes the contents of the provided reader as a document, using the current time as the timestamp.
`void`	`processDocument(BufferedReader document, long timeStamp)` Processes the contents of the provided buffer as a document, using the provided timestamp as the date when the document was written.
`void`	`processSpace(Properties props)` Does nothing.
`void`	`setSemanticFilter(Set<String> semanticsToRetain)` Sets a filter such that only words that are in the set have their semantics retained by this instance.
`void`	`setWordToIndexVector(Map<String,TernaryVector> m)` Assigns the token to `TernaryVector` mapping to be used by this instance.
`protected abstract boolean`	`shouldPartitionSpace(long nextTimeStamp)` Returns `true` if the current contents of this semantic space should be partitioned and discarded prior to processing the next document with the specified time stamp.
`Long`	`startTime()` Returns the time for the earliest semantics contained within this space.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PERMUTATION_FUNCTION_PROPERTY
```
public static final String PERMUTATION_FUNCTION_PROPERTY
```
    The property to specify the fully qualified name of a edu.ucla.sspace.ri.PermutationFunction if using permutations is enabled.
    
    See Also:
    Constant Field Values
  - USE_PERMUTATIONS_PROPERTY
```
public static final String USE_PERMUTATIONS_PROPERTY
```
    The property to specify whether the index vectors for co-occurrent words should be permuted based on their relative position.
    
    See Also:
    Constant Field Values
  - USE_SPARSE_SEMANTICS_PROPERTY
```
public static final String USE_SPARSE_SEMANTICS_PROPERTY
```
    Specifies whether to use a sparse encoding for each word's semantics, which saves space but requires more computation.
    
    See Also:
    Constant Field Values
  - VECTOR_LENGTH_PROPERTY
```
public static final String VECTOR_LENGTH_PROPERTY
```
    The property to specify the number of dimensions to be used by the index and semantic vectors.
    
    See Also:
    Constant Field Values
  - WINDOW_SIZE_PROPERTY
```
public static final String WINDOW_SIZE_PROPERTY
```
    The property to specify the number of words to view before and after each word in focus.
    
    See Also:
    Constant Field Values
  - DEFAULT_VECTOR_LENGTH
```
public static final int DEFAULT_VECTOR_LENGTH
```
    The default number of dimensions to be used by the index and semantic vectors.
    
    See Also:
    Constant Field Values
  - DEFAULT_WINDOW_SIZE
```
public static final int DEFAULT_WINDOW_SIZE
```
    The default number of words to view before and after each word in focus.
    
    See Also:
    Constant Field Values
  - partitionHooks
```
protected final Collection<Runnable> partitionHooks
```
    The collection of hooks that are to be run each time this instance partitions its semantic space.
  - currentSlice
```
protected final RandomIndexing currentSlice
```
    The current semantic slice, which is updated as new documents are processed and has its semantics cleared when shouldPartitionSpace(long) returns true.
  - endTime
```
protected Long endTime
```
    The most recent time stamp seen during the current semantic slice
  - startTime
```
protected Long startTime
```
    The least recent time stamp seen during the current semantic slice
- Constructor Detail
  - OrderedTemporalRandomIndexing
```
public OrderedTemporalRandomIndexing()
```
    Creates an instance of OrderedTemporalRandomIndexing using the system properties to configure the behavior.
  - OrderedTemporalRandomIndexing
```
public OrderedTemporalRandomIndexing(Properties props)
```
    Creates an instance of OrderedTemporalRandomIndexing using the provided properties to configure the behavior.
    
    Parameters:
    props - the properties used to configure this instance
- Method Detail
  - addPartitionHook
```
public void addPartitionHook(Runnable hook)
```
    Adds the provided Runnable to the list of hooks that will be invoked immediately prior to the partitioning of this space. This method provides a mechanism for users to perform additional processing on the current semantic slice of this space before it is discarded.
    
    Parameters:
    hook - a runnable to be invoked.
  - clear
```
protected void clear()
```
    Clears the semantic content of this space as a part of the partitioning process.
  - processDocument
```
public void processDocument(BufferedReader document)
                     throws IOException
```
    Processes the contents of the provided reader as a document, using the current time as the timestamp.
    
    Specified by:
    
    processDocument in interface SemanticSpace
    
    Specified by:
    
    processDocument in interface TemporalSemanticSpace
    
    Parameters:
    document - a reader that allows access to the text of the document
    
    Throws:
    
    IOException - if any error occurs while reading the document
  - processDocument
```
public void processDocument(BufferedReader document,
                   long timeStamp)
                     throws IOException
```
    Processes the contents of the provided buffer as a document, using the provided timestamp as the date when the document was written.
    
    Specified by:
    
    processDocument in interface TemporalSemanticSpace
    
    Parameters:
    document - a reader that allows access to the text of the document
    timeStamp - the time at which the document was written
    
    Throws:
    
    IOException - if any error occurs while reading the document
  - setSemanticFilter
```
public void setSemanticFilter(Set<String> semanticsToRetain)
```
    Sets a filter such that only words that are in the set have their semantics retained by this instance. Note that all words will still have an index vector assigned to them, which is necessary to properly compute the semantics.
    
    Specified by:
    
    setSemanticFilter in interface Filterable
    
    Parameters:
    semanticsToRetain - the set of words for which semantics should be computed.
  - shouldPartitionSpace
```
protected abstract boolean shouldPartitionSpace(long nextTimeStamp)
```
    Returns true if the current contents of this semantic space should be partitioned and discarded prior to processing the next document with the specified time stamp. Subclasses should use this method to specify the conditions under which the temporal semantics are to be divided.
    
    Parameters:
    nextTimeStamp - the time stamp of the next document that has yet to be processed
    
    Returns:
    true if the current contents of this space should be partitioned and discarded before processing the next document
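As one possible partitioning rule, a subclass could start a new slice whenever the next document falls on a different calendar day than the slice's first document. The sketch below assumes time stamps are milliseconds since the UNIX epoch and uses UTC days. `SliceBoundary` is a hypothetical stand-in for the abstract method and `startTime` field documented here, so the example compiles without the library.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

final class DayBoundarySketch {

    // Hypothetical stand-in mirroring the documented abstract method and field.
    abstract static class SliceBoundary {
        protected Long startTime; // earliest time stamp in the slice, null if empty
        protected abstract boolean shouldPartitionSpace(long nextTimeStamp);
    }

    // Partitions whenever the next document belongs to a different UTC day.
    static final class UtcDayBoundary extends SliceBoundary {
        @Override
        protected boolean shouldPartitionSpace(long nextTimeStamp) {
            if (startTime == null) {
                return false; // an empty slice has nothing to discard
            }
            return !utcDay(startTime).equals(utcDay(nextTimeStamp));
        }
    }

    // Assumes the time stamp is in milliseconds since the epoch.
    static LocalDate utcDay(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static boolean crossesDay(long sliceStartMillis, long nextMillis) {
        UtcDayBoundary boundary = new UtcDayBoundary();
        boundary.startTime = sliceStartMillis;
        return boundary.shouldPartitionSpace(nextMillis);
    }

    public static void main(String[] args) {
        System.out.println(crossesDay(0L, 86_399_999L));
        System.out.println(crossesDay(0L, 86_400_000L));
    }
}
```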
  - startTime
```
public Long startTime()
```
    Returns the time for the earliest semantics contained within this space.
    
    Specified by:
    
    startTime in interface TemporalSemanticSpace
  - endTime
```
public Long endTime()
```
    Returns the time for the latest semantics contained within this space.
    
    Specified by:
    
    endTime in interface TemporalSemanticSpace
  - getSpaceName
```
public abstract String getSpaceName()
```
    Returns a unique string describing the name and configuration of this algorithm. Any configurable parameters that would affect the resulting semantic space should be expressed as a part of this name.
    
    Specified by:
    
    getSpaceName in interface SemanticSpace
  - getTimeSteps
```
public SortedSet<Long> getTimeSteps(String word)
```
    Not supported
    
    Specified by:
    
    getTimeSteps in interface TemporalSemanticSpace
    
    Parameters:
    word -
    
    Throws:
    
    UnsupportedOperationException - if called
  - getVectorAfter
```
public Vector getVectorAfter(String word,
                    long startTime)
```
    Not supported
    
    Specified by:
    
    getVectorAfter in interface TemporalSemanticSpace
    
    Parameters:
    word - a word in the semantic space
    startTime - a UNIX timestamp that denotes the time after which all occurrences of the provided word should be counted.
    
    Returns:
    the semantic vector for the word after the provided time or null if the word was not in the space.
    
    Throws:
    
    UnsupportedOperationException - if called
  - getVectorBefore
```
public Vector getVectorBefore(String word,
                     long endTime)
```
    Not supported
    
    Specified by:
    
    getVectorBefore in interface TemporalSemanticSpace
    
    Parameters:
    word - a word in the semantic space
    endTime - a UNIX timestamp that denotes the time before which all occurrences of the provided word should be counted.
    
    Returns:
    the semantic vector for the word before the provided time or null if the word was not in the space.
    
    Throws:
    
    UnsupportedOperationException - if called
  - getVectorBetween
```
public Vector getVectorBetween(String word,
                      long startTime,
                      long endTime)
```
    Not supported
    
    Specified by:
    
    getVectorBetween in interface TemporalSemanticSpace
    
    Parameters:
    word - a word in the semantic space
    startTime - a UNIX timestamp that denotes the time before which no occurrences of the word should be counted.
    endTime - a UNIX timestamp that denotes the time after which no occurrences of the word should be counted.
    
    Returns:
    the semantic vector for the word between the provided times or null if the word was not in the space.
    
    Throws:
    
    UnsupportedOperationException - if called
  - getVector
```
public Vector getVector(String word)
```
    Returns the provided word's semantic vector based on all temporal occurrences.
    
    Specified by:
    
    getVector in interface SemanticSpace
    
    Specified by:
    
    getVector in interface TemporalSemanticSpace
    
    Parameters:
    word - a word that may be in the semantic space
    
    Returns:
    The Vector for the provided word or null if the word was not in the space.
  - getVectorLength
```
public int getVectorLength()
```
    Returns the length of vectors in this semantic space. Implementations are left free to define whether the returned value is valid before processSpace is called.
    
    Specified by:
    
    getVectorLength in interface SemanticSpace
  - getWords
```
public Set<String> getWords()
```
    Returns the set of words that are represented in this semantic space. Note that this set only includes the words that are present in the current semantic slice, which may be a subset of all the words seen in all semantic slices.
    
    Specified by:
    
    getWords in interface SemanticSpace
    
    Returns:
    the set of words that are represented in this semantic space.
  - getWordToIndexVector
```
public Map<String,TernaryVector> getWordToIndexVector()
```
    Returns an unmodifiable view on the token to TernaryVector mapping used by this instance. Any further changes made by this instance to its token to TernaryVector mapping will be reflected in the returned map.
    
    Returns:
    a mapping from the current set of tokens to the index vector used to represent them
  - processSpace
```
public void processSpace(Properties props)
```
    Does nothing.
    
    Specified by:
    
    processSpace in interface SemanticSpace
    
    Parameters:
    props - a set of properties and values that may be used to configure any exposed parameters of the algorithm.
  - setWordToIndexVector
```
public void setWordToIndexVector(Map<String,TernaryVector> m)
```
    Assigns the token to TernaryVector mapping to be used by this instance. The contents of the map are copied, so any additions of new index words by this instance will not be reflected in the parameter's mapping.
    
    Parameters:
    m - a mapping from token to the TernaryVector that should be used to represent it when calculating other words' semantics
