RandomIndexing (S-Space Package 2.0.1 API)

java.lang.Object
- edu.ucla.sspace.ri.RandomIndexing

All Implemented Interfaces:

Filterable, SemanticSpace
```
public class RandomIndexing
extends Object
implements SemanticSpace, Filterable
```
A co-occurrence based approach to statistical semantics that uses a randomized projection of a full co-occurrence matrix to perform dimensionality reduction. This implementation is based on three papers:
- M. Sahlgren, "Vector-based semantic analysis: Representing word meanings based on random labels," in Proceedings of the ESSLLI 2001 Workshop on Semantic Knowledge Acquisition and Categorisation, Helsinki, Finland, 2001.
- M. Sahlgren, "An introduction to random indexing," in Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, 2005.
- M. Sahlgren, A. Holst, and P. Kanerva, "Permutations as a means to encode order in word space," in Proceedings of the 30th Annual Meeting of the Cognitive Science Society (CogSci’08), 2008.
Random Indexing (RI) is an efficient way of capturing word co-occurrence. In most co-occurrence models, a word-by-word matrix is constructed, where the values denote how many times the column's word occurred in the context of the row's word. RI instead represents co-occurrence through index vectors. Each word is assigned a high-dimensional, random vector that is known as its index vector. These index vectors are very sparse, typically 7 ± 2 non-zero values for a vector of length 2048, which ensures that the chance of any two arbitrary index vectors having an overlapping meaning (i.e., a non-zero cosine similarity) is very low. The semantics of each word are calculated by keeping a running sum of the index vectors of all the words that co-occur with it.
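The running-sum idea described above can be sketched in a few lines of plain Java. This is an illustrative sketch only, not the S-Space implementation: the class name `RandomIndexingSketch`, the dense `int[]` representation, the fixed seed, and the choice of 8 non-zero values per index vector are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Minimal Random Indexing sketch (hypothetical class, not part of S-Space). */
class RandomIndexingSketch {
    static final int VECTOR_LENGTH = 2048; // illustrative length from the text above
    static final int NON_ZERO = 8;         // assumed number of non-zero values

    private final Map<String, int[]> indexVectors = new HashMap<>();
    private final Map<String, int[]> semantics = new HashMap<>();
    private final Random random = new Random(42); // fixed seed for repeatability

    /** Returns the word's sparse ternary index vector, creating it on first use. */
    int[] indexVector(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            int[] v = new int[VECTOR_LENGTH];
            int placed = 0;
            while (placed < NON_ZERO) {
                int pos = random.nextInt(VECTOR_LENGTH);
                if (v[pos] == 0) {
                    v[pos] = (placed % 2 == 0) ? 1 : -1; // alternate +1 and -1
                    placed++;
                }
            }
            return v;
        });
    }

    /** Adds the index vectors of all words in the window to each focus word's semantics. */
    void process(List<String> tokens, int windowSize) {
        for (int i = 0; i < tokens.size(); i++) {
            int[] sem = semantics.computeIfAbsent(tokens.get(i), w -> new int[VECTOR_LENGTH]);
            int start = Math.max(0, i - windowSize);
            int end = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = start; j <= end; j++) {
                if (j == i) continue; // skip the focus word itself
                int[] idx = indexVector(tokens.get(j));
                for (int d = 0; d < VECTOR_LENGTH; d++) sem[d] += idx[d];
            }
        }
    }

    /** Returns the accumulated semantic vector, or null for an unseen word. */
    int[] semanticVector(String word) {
        return semantics.get(word);
    }

    static int countNonZero(int[] v) {
        int n = 0;
        for (int x : v) if (x != 0) n++;
        return n;
    }

    public static void main(String[] args) {
        RandomIndexingSketch ri = new RandomIndexingSketch();
        ri.process(Arrays.asList("the", "quick", "brown", "fox", "jumps"), 2);
        System.out.println("non-zero values in index(fox): " + countNonZero(ri.indexVector("fox")));
        System.out.println("non-zero values in semantics(fox): " + countNonZero(ri.semanticVector("fox")));
    }
}
```

Because each index vector touches only a handful of dimensions, the per-occurrence update is cheap, and no word-by-word matrix is ever materialized.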
Sahlgren et al. (2008) introduced another variation on RI, where the semantics also capture word order by using a permutation function. For each occurrence of a word, rather than summing the index vectors of the co-occurring words directly, the permutation function is used to transform the index vector of each co-occurring word based on its position relative to the focus word. For example, consider the sentence, "the quick brown fox jumps over the lazy dog." With a window size of 2, the semantic vector for "fox" is incremented by Π^-2(quick_index) + Π^-1(brown_index) + Π^1(jumps_index) + Π^2(over_index), where Π^k denotes the k^th permutation of the specified index vector.
This class defines the following configurable properties, which may be set either through the System properties or through the RandomIndexing(Properties) constructor.
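As a rough illustration of the position-dependent update, the sketch below uses a cyclic rotation as the permutation, so that Π^k rotates a vector by k positions and Π^-k undoes it. The rotation is an assumption chosen for simplicity; it is not necessarily what `DefaultPermutationFunction` does, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Order-aware context sketch using rotation as the permutation (hypothetical). */
class PermutationSketch {

    /** Rotates v by k positions: to the right for positive k, to the left for negative k. */
    static int[] permute(int[] v, int k) {
        int n = v.length;
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[Math.floorMod(i + k, n)] = v[i];
        }
        return out;
    }

    /**
     * Sums the permuted index vectors of the neighbours of the focus position,
     * applying the permutation once per step of relative offset.
     */
    static int[] orderedContext(int[][] indexVectors, int focus, int windowSize) {
        int n = indexVectors[focus].length;
        int[] sem = new int[n];
        for (int offset = -windowSize; offset <= windowSize; offset++) {
            int j = focus + offset;
            if (offset == 0 || j < 0 || j >= indexVectors.length) continue;
            int[] permuted = permute(indexVectors[j], offset);
            for (int d = 0; d < n; d++) sem[d] += permuted[d];
        }
        return sem;
    }

    public static void main(String[] args) {
        // Toy index vectors for the tokens: quick, brown, fox, jumps, over
        int[][] index = {
            {1, 0, 0, 0, -1, 0},
            {0, 1, 0, -1, 0, 0},
            {0, 0, 1, 0, 0, -1},
            {-1, 0, 0, 1, 0, 0},
            {0, -1, 0, 0, 1, 0},
        };
        // Context of "fox" (position 2) with a window size of 2
        System.out.println(Arrays.toString(orderedContext(index, 2, 2)));
    }
}
```

Since the same word contributes a different permuted vector at each offset, "quick" two words before "fox" and "quick" one word after it leave distinguishable traces in the semantic vector.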

Property: "edu.ucla.sspace.ri.RandomIndexing.windowSize"
Default: 2
This property sets the number of words before and after the focus word that are counted as co-occurring. With the default value, 2 words are counted before and 2 words are counted after. This class always uses a symmetric window.

Property: "edu.ucla.sspace.ri.RandomIndexing.vectorLength"
Default: 4000
This property sets the number of dimensions to be used for the index and semantic vectors.

Property: "edu.ucla.sspace.ri.RandomIndexing.usePermutations"
Default: false
This property specifies whether to enable permuting the index vectors of co-occurring words. Enabling this option will cause the word semantics to include word-ordering information. However, this option is best used with a larger corpus.

Property: "edu.ucla.sspace.ri.RandomIndexing.permutationFunction"
Default: DefaultPermutationFunction
This property specifies the fully qualified class name of a PermutationFunction instance that will be used to permute index vectors. If "edu.ucla.sspace.ri.RandomIndexing.usePermutations" is set to false, the value of this property has no effect.

Property: "edu.ucla.sspace.ri.RandomIndexing.sparseSemantics"
Default: true
This property specifies whether to use a sparse encoding for each word's semantics. Using a sparse encoding can result in a large saving in memory, while requiring more time to process each document.
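The properties listed above can be collected in a `java.util.Properties` object before an instance is constructed. The sketch below only builds that object, using the property names documented here; the helper class `RandomIndexingConfigSketch` is hypothetical, and it assumes that boolean and integer values are supplied in their usual string forms. With the S-Space jar on the classpath, the result would be passed to the `RandomIndexing(Properties)` constructor.

```java
import java.util.Properties;

/** Builds a configuration for RandomIndexing (hypothetical helper class). */
class RandomIndexingConfigSketch {
    static final String PREFIX = "edu.ucla.sspace.ri.RandomIndexing.";

    static Properties buildConfig(int windowSize, int vectorLength,
                                  boolean usePermutations, boolean sparseSemantics) {
        Properties props = new Properties();
        // Property values are strings; numeric and boolean forms are assumed to be parsed by the class
        props.setProperty(PREFIX + "windowSize", Integer.toString(windowSize));
        props.setProperty(PREFIX + "vectorLength", Integer.toString(vectorLength));
        props.setProperty(PREFIX + "usePermutations", Boolean.toString(usePermutations));
        props.setProperty(PREFIX + "sparseSemantics", Boolean.toString(sparseSemantics));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig(3, 2048, true, true);
        props.list(System.out);
        // With S-Space available, the configuration would be applied like this:
        //   RandomIndexing ri = new RandomIndexing(props);
    }
}
```

Any property left unset falls back to the default listed for it above.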

This class implements Filterable, which allows for fine-grained control of which semantics are retained. The setSemanticFilter(Set) method can be used to specify which words should have their semantics retained. Note that the words that are filtered out will still be used in computing the semantics of other words. This behavior is intended for use with large corpora where retaining the semantics of all words in memory is infeasible.
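The filtering behavior described above can be illustrated with a small self-contained sketch: every word receives an index vector, but only words in the filter accumulate a semantic vector. The class `SemanticFilterSketch`, its tiny vector length, and the rule that an empty filter retains everything are assumptions made for this example, not details taken from the S-Space source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

/** Semantic filtering sketch (hypothetical class, not part of S-Space). */
class SemanticFilterSketch {
    static final int LENGTH = 64; // small length, for illustration only

    final Map<String, int[]> indexVectors = new HashMap<>();
    final Map<String, int[]> semantics = new HashMap<>();
    final Set<String> retain = new HashSet<>(); // assumed: empty means retain all words

    void setSemanticFilter(Set<String> semanticsToRetain) {
        retain.clear();
        retain.addAll(semanticsToRetain);
    }

    /** Every word gets an index vector, whether or not it passes the filter. */
    int[] indexVector(String word) {
        return indexVectors.computeIfAbsent(word, w -> {
            Random r = new Random(w.hashCode()); // deterministic per word, sketch only
            int[] v = new int[LENGTH];
            int plus = r.nextInt(LENGTH);
            int minus = (plus + 1 + r.nextInt(LENGTH - 1)) % LENGTH; // always differs from plus
            v[plus] = 1;
            v[minus] = -1;
            return v;
        });
    }

    void process(List<String> tokens, int windowSize) {
        for (int i = 0; i < tokens.size(); i++) {
            String focus = tokens.get(i);
            indexVector(focus); // filtered words still take part as context
            if (!retain.isEmpty() && !retain.contains(focus)) continue;
            int[] sem = semantics.computeIfAbsent(focus, w -> new int[LENGTH]);
            int start = Math.max(0, i - windowSize);
            int end = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = start; j <= end; j++) {
                if (j == i) continue;
                int[] idx = indexVector(tokens.get(j));
                for (int d = 0; d < LENGTH; d++) sem[d] += idx[d];
            }
        }
    }

    public static void main(String[] args) {
        SemanticFilterSketch space = new SemanticFilterSketch();
        space.setSemanticFilter(new HashSet<>(Arrays.asList("fox")));
        space.process(Arrays.asList("quick", "brown", "fox", "jumps", "over"), 2);
        System.out.println("words with semantics: " + space.semantics.keySet());
        System.out.println("words with index vectors: " + space.indexVectors.keySet());
    }
}
```

Memory use for semantic vectors then grows with the size of the filter rather than with the size of the vocabulary.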
This class is thread-safe for concurrent calls of processDocument. At any given point in processing, the getVector method may be used to access the current semantics of a word. This allows callers to track incremental changes to the semantics as the corpus is processed.
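One way to take advantage of a thread-safe `processDocument` is to feed documents from a thread pool. The sketch below shows that pattern against a hypothetical `DocumentSink` interface with the same method signature; the `TokenCounter` implementation is a toy stand-in that merely counts tokens, and a `RandomIndexing` instance would take its place when S-Space is available.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Concurrent document processing sketch (all names here are hypothetical). */
class ConcurrentProcessingSketch {

    /** Stand-in for any object with a thread-safe processDocument method. */
    interface DocumentSink {
        void processDocument(BufferedReader document) throws IOException;
    }

    /** Toy thread-safe sink that counts tokens instead of updating semantics. */
    static class TokenCounter implements DocumentSink {
        final ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();

        @Override
        public void processDocument(BufferedReader document) throws IOException {
            String line;
            while ((line = document.readLine()) != null) {
                for (String token : line.trim().split("\\s+")) {
                    if (!token.isEmpty()) {
                        counts.computeIfAbsent(token, t -> new AtomicInteger()).incrementAndGet();
                    }
                }
            }
        }
    }

    /** Submits one task per document and waits; returns true if all tasks finished in time. */
    static boolean processAll(DocumentSink sink, List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String text : documents) {
            pool.execute(() -> {
                try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
                    sink.processDocument(reader);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        pool.shutdown();
        try {
            return pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        TokenCounter counter = new TokenCounter();
        boolean done = processAll(counter, Arrays.asList("the quick fox", "the lazy dog"), 2);
        System.out.println("finished: " + done + ", count(the) = " + counter.counts.get("the"));
    }
}
```

Since the semantics can be read at any point, a monitoring thread could also query vectors while the pool is still running.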
The processSpace method does nothing for this class, and calls to it will not affect the results of getVector.
Author:

David Jurgens

See Also:
PermutationFunction, IndexVectorGenerator

Field Summary

Fields
Modifier and Type	Field and Description
`static int`	`DEFAULT_VECTOR_LENGTH` The default number of dimensions to be used by the index and semantic vectors.
`static int`	`DEFAULT_WINDOW_SIZE` The default number of words to view before and after each word in focus.
`static String`	`PERMUTATION_FUNCTION_PROPERTY` The property to specify the fully qualified name of a `PermutationFunction` if using permutations is enabled.
`static String`	`RI_SSPACE_NAME`
`static String`	`USE_PERMUTATIONS_PROPERTY` The property to specify whether the index vectors for co-occurrent words should be permuted based on their relative position.
`static String`	`USE_SPARSE_SEMANTICS_PROPERTY` Specifies whether to use a sparse encoding for each word's semantics, which saves space but requires more computation.
`static String`	`VECTOR_LENGTH_PROPERTY` The property to specify the number of dimensions to be used by the index and semantic vectors.
`static String`	`WINDOW_SIZE_PROPERTY` The property to specify the number of words to view before and after each word in focus.

Constructor Summary

Constructors
Constructor and Description
`RandomIndexing()` Creates a new `RandomIndexing` instance using the current `System` properties for configuration.
`RandomIndexing(Properties properties)` Creates a new `RandomIndexing` instance using the provided properties for configuration.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`clearSemantics()` Removes all associations between word and semantics while still retaining the word to index vector mapping.
`String`	`getSpaceName()` Returns a unique string describing the name and configuration of this algorithm.
`Vector`	`getVector(String word)` Returns the semantic vector for the provided word.
`int`	`getVectorLength()` Returns the length of vectors in this semantic space.
`Set<String>`	`getWords()` Returns the set of words that are represented in this semantic space.
`Map<String,TernaryVector>`	`getWordToIndexVector()` Returns an unmodifiable view on the token to `IntegerVector` mapping used by this instance.
`void`	`processDocument(BufferedReader document)` Updates the semantic vectors based on the words in the document.
`void`	`processSpace(Properties properties)` Does nothing.
`void`	`setSemanticFilter(Set<String> semanticsToRetain)` Specifies the set of words whose semantics should be retained; the semantics of all other words are not retained.
`void`	`setWordToIndexVector(Map<String,TernaryVector> m)` Assigns the token to `IntegerVector` mapping to be used by this instance.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - RI_SSPACE_NAME
```
public static final String RI_SSPACE_NAME
```
    See Also:
    Constant Field Values
  - VECTOR_LENGTH_PROPERTY
```
public static final String VECTOR_LENGTH_PROPERTY
```
    The property to specify the number of dimensions to be used by the index and semantic vectors.
    
    See Also:
    Constant Field Values
  - WINDOW_SIZE_PROPERTY
```
public static final String WINDOW_SIZE_PROPERTY
```
    The property to specify the number of words to view before and after each word in focus.
    
    See Also:
    Constant Field Values
  - USE_PERMUTATIONS_PROPERTY
```
public static final String USE_PERMUTATIONS_PROPERTY
```
    The property to specify whether the index vectors for co-occurrent words should be permuted based on their relative position.
    
    See Also:
    Constant Field Values
  - PERMUTATION_FUNCTION_PROPERTY
```
public static final String PERMUTATION_FUNCTION_PROPERTY
```
    The property to specify the fully qualified name of a PermutationFunction if using permutations is enabled.
    
    See Also:
    Constant Field Values
  - USE_SPARSE_SEMANTICS_PROPERTY
```
public static final String USE_SPARSE_SEMANTICS_PROPERTY
```
    Specifies whether to use a sparse encoding for each word's semantics, which saves space but requires more computation.
    
    See Also:
    Constant Field Values
  - DEFAULT_WINDOW_SIZE
```
public static final int DEFAULT_WINDOW_SIZE
```
    The default number of words to view before and after each word in focus.
    
    See Also:
    Constant Field Values
  - DEFAULT_VECTOR_LENGTH
```
public static final int DEFAULT_VECTOR_LENGTH
```
    The default number of dimensions to be used by the index and semantic vectors.
    
    See Also:
    Constant Field Values
- Constructor Detail
  - RandomIndexing
```
public RandomIndexing()
```
    Creates a new RandomIndexing instance using the current System properties for configuration.
  - RandomIndexing
```
public RandomIndexing(Properties properties)
```
    Creates a new RandomIndexing instance using the provided properties for configuration.
- Method Detail
  - clearSemantics
```
public void clearSemantics()
```
    Removes all associations between word and semantics while still retaining the word to index vector mapping. This method can be used to re-use the same instance of a RandomIndexing on multiple corpora while keeping the same semantic space.
  - getVector
```
public Vector getVector(String word)
```
    Returns the semantic vector for the provided word.
    
    Specified by:
    
    getVector in interface SemanticSpace
    
    Parameters:
    word - a word that may be in the semantic space
    
    Returns:
    The Vector for the provided word or null if the word was not in the space.
  - getSpaceName
```
public String getSpaceName()
```
    Returns a unique string describing the name and configuration of this algorithm. Any configurable parameters that would affect the resulting semantic space should be expressed as a part of this name.
    
    Specified by:
    
    getSpaceName in interface SemanticSpace
  - getVectorLength
```
public int getVectorLength()
```
    Returns the length of vectors in this semantic space. Implementations are left free to define whether the returned value is valid before processSpace is called.
    
    Specified by:
    
    getVectorLength in interface SemanticSpace
  - getWords
```
public Set<String> getWords()
```
    Returns the set of words that are represented in this semantic space.
    
    Specified by:
    
    getWords in interface SemanticSpace
    
    Returns:
    the set of words that are represented in this semantic space.
  - getWordToIndexVector
```
public Map<String,TernaryVector> getWordToIndexVector()
```
    Returns an unmodifiable view on the token to IntegerVector mapping used by this instance. Any further changes made by this instance to its token to IntegerVector mapping will be reflected in the returned map.
    
    Returns:
    a mapping from the current set of tokens to the index vector used to represent them
  - processDocument
```
public void processDocument(BufferedReader document)
                     throws IOException
```
    Updates the semantic vectors based on the words in the document.
    
    Specified by:
    
    processDocument in interface SemanticSpace
    
    Parameters:
    document - a reader that allows access to the text of the document
    
    Throws:
    
    IOException - if any error occurs while reading the document
  - processSpace
```
public void processSpace(Properties properties)
```
    Does nothing.
    
    Specified by:
    
    processSpace in interface SemanticSpace
    
    Parameters:
    properties - a set of properties and values that may be used to configure any exposed parameters of the algorithm.
  - setWordToIndexVector
```
public void setWordToIndexVector(Map<String,TernaryVector> m)
```
    Assigns the token to IntegerVector mapping to be used by this instance. The contents of the map are copied, so any additions of new index words by this instance will not be reflected in the parameter's mapping.
    
    Parameters:
    m - a mapping from token to the IntegerVector that should be used to represent it when calculating other words' semantics
  - setSemanticFilter
```
public void setSemanticFilter(Set<String> semanticsToRetain)
```
    Specifies the set of words whose semantics should be retained; the semantics of all other words are not retained. Note that all words will still have an index vector assigned to them, which is necessary to properly compute the semantics.
    
    Specified by:
    
    setSemanticFilter in interface Filterable
    
    Parameters:
    semanticsToRetain - the set of words for which semantics should be computed.
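The view and copy semantics documented for getWordToIndexVector and setWordToIndexVector above can be illustrated with standard collections. This is a sketch under assumptions, not the class's actual code: `IndexVectorMapSketch` is hypothetical, it uses `int[]` in place of `TernaryVector`, and it assumes the setter replaces the existing mapping rather than merging into it.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Live read-only view versus copy-on-set (hypothetical class, not part of S-Space). */
class IndexVectorMapSketch {
    private final Map<String, int[]> wordToIndexVector = new HashMap<>();

    /** Returns a read-only view that reflects later changes made by this instance. */
    Map<String, int[]> getWordToIndexVector() {
        return Collections.unmodifiableMap(wordToIndexVector);
    }

    /** Copies the entries, so later additions here do not reach the caller's map. */
    void setWordToIndexVector(Map<String, int[]> m) {
        wordToIndexVector.clear(); // assumed: the old mapping is replaced
        wordToIndexVector.putAll(m);
    }

    /** Simulates this instance assigning an index vector to a newly seen word. */
    void addWord(String word, int[] indexVector) {
        wordToIndexVector.put(word, indexVector);
    }

    public static void main(String[] args) {
        Map<String, int[]> supplied = new HashMap<>();
        supplied.put("fox", new int[] {1, 0, -1});

        IndexVectorMapSketch sketch = new IndexVectorMapSketch();
        sketch.setWordToIndexVector(supplied);
        Map<String, int[]> view = sketch.getWordToIndexVector();

        sketch.addWord("dog", new int[] {0, 1, -1});
        System.out.println("view size: " + view.size());         // includes the new word
        System.out.println("supplied size: " + supplied.size()); // caller's map is unchanged

        try {
            view.put("cat", new int[3]);
        } catch (UnsupportedOperationException e) {
            System.out.println("the view cannot be modified");
        }
    }
}
```

Sharing one word-to-index-vector mapping across instances in this way keeps their semantic vectors comparable, since the same word is always represented by the same index vector.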
