HierarchicalAgglomerativeClustering (S-Space Package 2.0.1 API)

java.lang.Object
- edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering

All Implemented Interfaces:

Clustering
```
public class HierarchicalAgglomerativeClustering
extends Object
implements Clustering
```
A utility class for performing Hierarchical Agglomerative Clustering on matrix data in a file.
This class provides static accessors to several variations of agglomerative clustering and conforms to the Clustering interface, which allows this class to be used in place of other clustering algorithms.
In addition to clustering, this implementation also exposes the ability to view the iterative bottom-up merge through the buildDendogram methods. These methods return a series of Merge operations that can be used to construct a dendrogram and see the partial clustering at any point during the agglomerative merging process. For example, to view the clustering solution after four steps, the following code might be used:
```
   Matrix matrix;  // the matrix whose rows are to be clustered
   List<Merge> merges = buildDendogram(matrix, ...);
   List<Merge> fourMergeSteps = merges.subList(0, 4);
   MultiMap<Integer,Integer> clusterToRows = new HashMultiMap<Integer,Integer>();
   // Each row starts in its own cluster, whose id is the row's index
   for (int i = 0; i < matrix.rows(); ++i)
       clusterToRows.put(i, i);

   // Apply each merge by moving the merged cluster's rows into the remaining cluster
   for (Merge m : fourMergeSteps) {
       clusterToRows.putMulti(m.remainingCluster(), 
           clusterToRows.remove(m.mergedCluster()));
   }
```
The resulting MultiMap clusterToRows contains the mapping from each cluster to the rows that are a part of it.
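The same replay can be sketched with JDK collections only, which makes it easy to run in isolation. This is a minimal sketch, not the library's code: MergeStep is a hypothetical stand-in for Merge that carries just the two cluster ids, and the class and method names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Replays the first few merge steps of a dendrogram using only JDK collections. */
final class MergeReplay {

    /** Hypothetical stand-in for the library's Merge: one cluster absorbs another. */
    static final class MergeStep {
        final int remainingCluster;
        final int mergedCluster;

        MergeStep(int remainingCluster, int mergedCluster) {
            this.remainingCluster = remainingCluster;
            this.mergedCluster = mergedCluster;
        }
    }

    /** Returns cluster id -> member rows after applying the first {@code steps} merges. */
    static Map<Integer, Set<Integer>> clustersAfter(int rows, List<MergeStep> merges, int steps) {
        Map<Integer, Set<Integer>> clusterToRows = new HashMap<>();
        // Each row starts in its own cluster, whose id is the row index.
        for (int i = 0; i < rows; ++i) {
            Set<Integer> members = new TreeSet<>();
            members.add(i);
            clusterToRows.put(i, members);
        }
        for (MergeStep m : merges.subList(0, steps)) {
            // Move every row of the merged cluster into the remaining cluster.
            Set<Integer> absorbed = clusterToRows.remove(m.mergedCluster);
            clusterToRows.get(m.remainingCluster).addAll(absorbed);
        }
        return clusterToRows;
    }
}
```

Passing a smaller or larger value for steps yields the partial clustering at that point of the merge sequence.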
Implementation Note: The current version runs in O(n³) worst case time for the number of rows in the matrix. While O(n² * log(n)) methods exist, these require storing similarity comparisons in a priority queue, which has a substantially higher memory overhead. Therefore, this implementation has opted for a more expensive running time in order to be able to process larger matrices.
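To illustrate where a cubic bound of this kind comes from, the following is a minimal sketch of a naive complete-linkage procedure over a precomputed similarity matrix: each of the up to n - 1 merges rescans all O(n²) cluster pairs, and no priority queue is kept. The class NaiveHac and its method are assumptions for illustration and are not taken from the library's source; the sketch does not claim to reproduce the library's memory behavior.

```java
import java.util.Arrays;

/** Naive complete-linkage agglomerative clustering over a similarity matrix. */
final class NaiveHac {

    /**
     * Merges clusters until {@code numClusters} remain and returns, for each
     * row, the id of its cluster (ids are compacted to 0..numClusters-1).
     */
    static int[] partition(double[][] sim, int numClusters) {
        int n = sim.length;
        if (numClusters < 1 || numClusters > n)
            throw new IllegalArgumentException("numClusters must be in [1, n]");
        double[][] s = new double[n][];
        for (int i = 0; i < n; ++i)
            s[i] = sim[i].clone();          // cluster-to-cluster similarities
        int[] owner = new int[n];           // row -> current cluster id
        boolean[] active = new boolean[n];  // which cluster ids still exist
        for (int i = 0; i < n; ++i) {
            owner[i] = i;
            active[i] = true;
        }
        for (int remaining = n; remaining > numClusters; --remaining) {
            // O(n^2) scan for the most similar pair of active clusters.
            int bestA = -1, bestB = -1;
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < n; ++a) {
                if (!active[a]) continue;
                for (int b = a + 1; b < n; ++b) {
                    if (active[b] && s[a][b] > best) {
                        best = s[a][b];
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // Merge bestB into bestA. Under complete linkage, the similarity of
            // the merged cluster to any other cluster is the minimum of the two.
            for (int c = 0; c < n; ++c) {
                if (!active[c] || c == bestA || c == bestB) continue;
                double m = Math.min(s[bestA][c], s[bestB][c]);
                s[bestA][c] = m;
                s[c][bestA] = m;
            }
            active[bestB] = false;
            for (int r = 0; r < n; ++r)
                if (owner[r] == bestB) owner[r] = bestA;
        }
        // Compact surviving cluster ids to 0, 1, 2, ... in order of first appearance.
        int[] compact = new int[n];
        Arrays.fill(compact, -1);
        int next = 0;
        int[] result = new int[n];
        for (int r = 0; r < n; ++r) {
            if (compact[owner[r]] < 0) compact[owner[r]] = next++;
            result[r] = compact[owner[r]];
        }
        return result;
    }
}
```

The pair scan dominates the cost: it runs once per merge, so the total work grows with the cube of the number of rows.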
When using the Clustering.cluster(Matrix,Properties) interface, this class supports the following properties for controlling the clustering.

Property: "edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.clusterThreshold"
Default: unset
This property specifies the cluster similarity threshold at which two clusters are merged together. Merging will continue until either all clusters have similarities below this threshold or the number of desired clusters has been reached. This property provides an alternative to the number-of-clusters property for deciding when to stop agglomeratively merging clusters. The two properties cannot both be specified at the same time.

Property: "edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.clusterLinkage"
Default: "COMPLETE_LINKAGE"
This property specifies the HierarchicalAgglomerativeClustering.ClusterLinkage to use when computing cluster similarity.

Property: "edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.simFunc"
Default: "COSINE"
This property specifies the name of Similarity.SimType to use when computing the similarity of two data points.

Property: "edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.numClusters"
Default: unset
This property specifies the number of clusters to generate from the data. Clusters are agglomeratively merged until the specified number of clusters is reached. This property provides an alternative to the cluster similarity threshold property for deciding when to stop agglomeratively merging clusters. The two properties cannot both be specified at the same time.
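The properties above can be assembled with java.util.Properties before calling cluster(Matrix,Properties). The following is a minimal sketch: the property names and default values are those listed above, while the helper class HacProperties and the conflict check are hypothetical and only mirror the documented rule that the two stopping criteria cannot both be set.

```java
import java.util.Properties;

/** Builds and sanity-checks the properties documented for this class. */
final class HacProperties {

    static final String PREFIX =
        "edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering";

    /** Builds properties that stop merging at a fixed number of clusters. */
    static Properties forNumClusters(int numClusters) {
        Properties props = new Properties();
        props.setProperty(PREFIX + ".numClusters", Integer.toString(numClusters));
        // The next two values repeat the documented defaults explicitly.
        props.setProperty(PREFIX + ".clusterLinkage", "COMPLETE_LINKAGE");
        props.setProperty(PREFIX + ".simFunc", "COSINE");
        return props;
    }

    /** Hypothetical pre-flight check: true if both stopping criteria are set. */
    static boolean hasConflictingStoppingCriteria(Properties props) {
        boolean hasThreshold = props.getProperty(PREFIX + ".clusterThreshold") != null;
        boolean hasNumClusters = props.getProperty(PREFIX + ".numClusters") != null;
        return hasThreshold && hasNumClusters;
    }
}
```

The resulting Properties object would then be passed as the props argument of cluster(matrix, props).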
Author:

David Jurgens

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`HierarchicalAgglomerativeClustering.ClusterLinkage` The method to use when comparing the similarity of two clusters.
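
As a rough illustration of what a linkage method decides, the sketch below computes the similarity of two clusters from pairwise row similarities under three textbook criteria. The Linkage enum here is hypothetical and is not the library's ClusterLinkage; only COMPLETE_LINKAGE is named on this page, so the other two criteria are shown as general background rather than as a list of supported constants.

```java
import java.util.List;

/** Textbook linkage criteria expressed over pairwise similarities (not distances). */
final class LinkageSketch {

    /** Hypothetical enum for this sketch; not the library's ClusterLinkage. */
    enum Linkage { SINGLE, COMPLETE, MEAN }

    /** Similarity of clusters a and b, given row-to-row similarities in sim. */
    static double clusterSimilarity(double[][] sim, List<Integer> a,
                                    List<Integer> b, Linkage linkage) {
        if (a.isEmpty() || b.isEmpty())
            throw new IllegalArgumentException("clusters must be non-empty");
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        double sum = 0;
        for (int i : a) {
            for (int j : b) {
                double s = sim[i][j];
                max = Math.max(max, s);
                min = Math.min(min, s);
                sum += s;
            }
        }
        switch (linkage) {
            case SINGLE:   return max;  // the most similar pair of rows
            case COMPLETE: return min;  // the least similar pair of rows
            default:       return sum / (a.size() * b.size());  // average over all pairs
        }
    }
}
```

Because these are similarities rather than distances, complete linkage takes the minimum over pairs and single linkage takes the maximum.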

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`CLUSTER_LINKAGE_PROPERTY` The property for specifying the cluster linkage to use.
`static String`	`DEFAULT_CLUSTER_LINKAGE_PROPERTY` The default linkage method to use.
`static String`	`MIN_CLUSTER_SIMILARITY_PROPERTY` The property for specifying the cluster similarity threshold.
`static String`	`NUM_CLUSTERS_PROPERTY` The property for specifying the number of clusters to generate.
`static String`	`PROPERTY_PREFIX` A prefix for specifying properties.
`static String`	`SIMILARITY_FUNCTION_PROPERTY` The property for specifying the similarity function to use.

Constructor Summary

Constructors
Constructor and Description
`HierarchicalAgglomerativeClustering()`

Method Summary

Methods
Modifier and Type	Method and Description
`List<Merge>`	`buildDendogram(Matrix m, HierarchicalAgglomerativeClustering.ClusterLinkage linkage, Similarity.SimType similarityFunction)` Builds a dendrogram of the rows of similarity matrix by iteratively linking each row according to the linkage policy in a bottom up manner.
`List<Merge>`	`buildDendrogram(Matrix similarityMatrix, HierarchicalAgglomerativeClustering.ClusterLinkage linkage)` Builds a dendrogram of the rows of similarity matrix by iteratively linking each row according to the linkage policy in a bottom up manner.
`Assignments`	`cluster(Matrix m, int numClusters, Properties props)` Clusters the set of rows in the given `Matrix` into the specified number of clusters.
`Assignments`	`cluster(Matrix matrix, Properties props)` Clusters the set of rows in the given `Matrix` without a specified number of clusters (optional operation).
`static int[]`	`clusterRows(Matrix m, double clusterSimilarityThreshold, HierarchicalAgglomerativeClustering.ClusterLinkage linkage, Similarity.SimType similarityFunction)` Clusters all rows in the matrix using the specified cluster similarity measure for comparison and threshold for when to stop clustering.
`static int[]`	`partitionRows(Matrix m, int numClusters, HierarchicalAgglomerativeClustering.ClusterLinkage linkage, Similarity.SimType similarityFunction)` Clusters all rows in the matrix using the specified cluster similarity measure for comparison and stopping when the number of clusters is equal to the specified number.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - PROPERTY_PREFIX
```
public static final String PROPERTY_PREFIX
```
    A prefix for specifying properties.
    
    See Also:
    Constant Field Values
  - MIN_CLUSTER_SIMILARITY_PROPERTY
```
public static final String MIN_CLUSTER_SIMILARITY_PROPERTY
```
    The property for specifying the cluster similarity threshold.
    
    See Also:
    Constant Field Values
  - CLUSTER_LINKAGE_PROPERTY
```
public static final String CLUSTER_LINKAGE_PROPERTY
```
    The property for specifying the cluster linkage to use.
    
    See Also:
    Constant Field Values
  - SIMILARITY_FUNCTION_PROPERTY
```
public static final String SIMILARITY_FUNCTION_PROPERTY
```
    The property for specifying the similarity function to use.
    
    See Also:
    Constant Field Values
  - NUM_CLUSTERS_PROPERTY
```
public static final String NUM_CLUSTERS_PROPERTY
```
    The property for specifying the number of clusters to generate.
    
    See Also:
    Constant Field Values
  - DEFAULT_CLUSTER_LINKAGE_PROPERTY
```
public static final String DEFAULT_CLUSTER_LINKAGE_PROPERTY
```
    The default linkage method to use.
    
    See Also:
    Constant Field Values
- Constructor Detail
  - HierarchicalAgglomerativeClustering
```
public HierarchicalAgglomerativeClustering()
```
- Method Detail
  - cluster
```
public Assignments cluster(Matrix matrix,
                  Properties props)
```
    Clusters the set of rows in the given Matrix without a specified number of clusters (optional operation). The cluster assignment for each row in the matrix is returned.
    
    Specified by:
    
    cluster in interface Clustering
    
    Parameters:
    matrix - the Matrix whose row data points are to be clustered
    props - the properties to use for any parameters each clustering algorithm may need
    
    Returns:
    an array of Assignment instances that indicate zero or more clusters to which each row belongs.
  - cluster
```
public Assignments cluster(Matrix m,
                  int numClusters,
                  Properties props)
```
    Clusters the set of rows in the given Matrix into the specified number of clusters. The cluster assignment for each row in the matrix is returned. The value of the numClusters parameter will override the "edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.numClusters" property if it was specified.
    
    Specified by:
    
    cluster in interface Clustering
    
    Parameters:
    m - the Matrix whose row data points are to be clustered
    numClusters - the number of clusters to generate
    props - the properties to use for any parameters each clustering algorithm may need
    
    Returns:
    an array of Assignment instances that indicate zero or more clusters to which each row belongs.
  - partitionRows
```
public static int[] partitionRows(Matrix m,
                  int numClusters,
                  HierarchicalAgglomerativeClustering.ClusterLinkage linkage,
                  Similarity.SimType similarityFunction)
```
    Clusters all rows in the matrix using the specified cluster similarity measure for comparison and stopping when the number of clusters is equal to the specified number.
    
    Parameters:
    m - a matrix whose rows are to be clustered
    numClusters - the number of clusters into which the matrix should be divided
    linkage - the method to use for computing the similarity of two clusters
    similarityFunction - how to compare two rows of a matrix for similarity
    
    Returns:
    an array where each element corresponds to a row and the value is the cluster number to which that row was assigned. Cluster numbers will start at 0 and increase.
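    
    The returned array maps rows to clusters. A minimal sketch of inverting it into a cluster-to-rows view follows, using only JDK collections; the class AssignmentGroups is a hypothetical helper and not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups row indices by the cluster number assigned to each row. */
final class AssignmentGroups {

    /** Inverts a row -> cluster array into cluster -> list of rows in ascending order. */
    static Map<Integer, List<Integer>> rowsByCluster(int[] assignments) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int row = 0; row < assignments.length; ++row) {
            // Create the cluster's list on first use, then record the row.
            groups.computeIfAbsent(assignments[row], k -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

    The same helper applies to the array returned by clusterRows, since both methods use the same row-indexed layout.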
  - clusterRows
```
public static int[] clusterRows(Matrix m,
                double clusterSimilarityThreshold,
                HierarchicalAgglomerativeClustering.ClusterLinkage linkage,
                Similarity.SimType similarityFunction)
```
    Clusters all rows in the matrix using the specified cluster similarity measure for comparison and threshold for when to stop clustering. Clusters will be repeatedly merged until the highest cluster similarity is below the threshold.
    
    Parameters:
    m - a matrix whose rows are to be clustered
    clusterSimilarityThreshold - the threshold to use when deciding whether two clusters should be merged. If the similarity of the clusters is below this threshold, they will not be merged and the clustering process will be stopped.
    linkage - the method to use for computing the similarity of two clusters
    similarityFunction - how to compare two rows of a matrix for similarity
    
    Returns:
    an array where each element corresponds to a row and the value is the cluster number to which that row was assigned. Cluster numbers will start at 0 and increase.
  - buildDendogram
```
public List<Merge> buildDendogram(Matrix m,
                         HierarchicalAgglomerativeClustering.ClusterLinkage linkage,
                         Similarity.SimType similarityFunction)
```
    Builds a dendrogram of the rows of similarity matrix by iteratively linking each row according to the linkage policy in a bottom up manner. The dendrogram is represented as a series of merge steps for the rows of the similarity matrix, where each row is initially assigned to its own cluster. By following a sequence of merge operations, a particular partitioning of the rows of m can be determined. For example, to find the partitioning after 4 merge operations, one might do the following:
```
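   Matrix matrix;  // the matrix whose rows are to be clustered
   List<Merge> merges = buildDendogram(matrix, ...);
   List<Merge> fourMergeSteps = merges.subList(0, 4);
   MultiMap<Integer,Integer> clusterToRows = new HashMultiMap<Integer,Integer>();
   // Each row starts in its own cluster, whose id is the row's index
   for (int i = 0; i < matrix.rows(); ++i)
       clusterToRows.put(i, i);

   // Apply each merge by moving the merged cluster's rows into the remaining cluster
   for (Merge m : fourMergeSteps) {
       clusterToRows.putMulti(m.remainingCluster(), 
           clusterToRows.remove(m.mergedCluster()));
   }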
```
    The resulting MultiMap clusterToRows contains the mapping from each cluster to the rows that are a part of it.
    Parameters:
    m - a matrix whose rows are to be compared and agglomeratively merged into clusters
    linkage - how two clusters should be compared for similarity when deciding which clusters to merge together
    similarityFunction - how to compare two rows of a matrix for similarity
    
    Returns:
    a dendrogram corresponding to the merge steps for each cluster, where each row is initially assigned to its own cluster whose id is the same as its row's index
  - buildDendrogram
```
public List<Merge> buildDendrogram(Matrix similarityMatrix,
                          HierarchicalAgglomerativeClustering.ClusterLinkage linkage)
```
    Builds a dendrogram of the rows of similarity matrix by iteratively linking each row according to the linkage policy in a bottom up manner. The dendrogram is represented as a series of merge steps for the rows of the similarity matrix, where each row is initially assigned to its own cluster.
    
    Parameters:
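    similarityMatrix - a square matrix whose (i, j) values denote the similarity of row i to row j.
    linkage - how two clusters should be compared for similarity when deciding which clusters to merge together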
    
    Returns:
    a dendrogram corresponding to the merge steps for each cluster, where each row is initially assigned to its own cluster whose id is the same as its row's index
    
    Throws:
    
    IllegalArgumentException - if similarityMatrix is not a square matrix
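    
    A square similarity matrix of the kind this method expects could be prepared as in the following minimal sketch, which computes pairwise cosine similarities over plain double[][] rows and checks squareness. The class SimilarityMatrices and its methods are assumptions for illustration; wrapping the resulting array in a library Matrix is not shown.

```java
/** Builds and validates a square cosine-similarity matrix from row vectors. */
final class SimilarityMatrices {

    /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("vectors differ in length");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; ++i) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the symmetric matrix whose (i, j) entry is the similarity of row i to row j. */
    static double[][] cosineMatrix(double[][] rows) {
        int n = rows.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; ++i) {
            for (int j = i; j < n; ++j) {
                // Cosine similarity is symmetric, so fill both halves at once.
                sim[i][j] = sim[j][i] = cosine(rows[i], rows[j]);
            }
        }
        return sim;
    }

    /** Throws IllegalArgumentException unless every row has as many columns as there are rows. */
    static void requireSquare(double[][] m) {
        for (double[] row : m)
            if (row.length != m.length)
                throw new IllegalArgumentException("matrix is not square");
    }
}
```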
