public class HierarchicalAgglomerativeClustering extends Object implements Clustering
Clustering
interface,
which allows this method to be used in place of other clustering algorithms.
In addition to clustering, this implementation also exposes the ability
to view the iterative bottom-up merge through the buildDendogram
methods.
These methods return a series of Merge
operations that can be used to
construct a dendrogram
and see the partial clustering at any point during the agglomerative merging
process. For example, to view the clustering solution after four steps, the
following code might be used:
Matrix matrix; List<Merge> merges = buildDendogram(matrix, ...); List<Merge> fourMergeSteps = merges.subList(0, 4); MultiMap<Integer,Integer> clusterToRows = new HashMultiMap<Integer,Integer>(); for (int i = 0; i < matrix.rows(); ++i) clusterToElements.put(i, i); for (Merge m : fourMergeSteps) { clusterToElements.putMulti(m.remainingCluster(), clusterToElements.remove(m.mergedCluster())); }The resulting
MultiMap
clusterToRows
contains the mapping from each cluster to the rows that are a part of it.
Implementation Note: The current version runs in O(n3) worst case time for the number of rows in the matrix. While O(n2 * log(n)) methods exist, these require storing similarity comparisons in a priority queue, which has a substantially higher memory overhead. Therefore, this implementation has opted for a more expensive running time in order to be able to process larger matrices.
When using the Clustering.cluster(Matrix,Properties)
interface,
this class supports the following properties for controlling the clustering.
"edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.clusterThreshold"
"edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.clusterLinkage"
HierarchicalAgglomerativeClustering.ClusterLinkage
to use when computing cluster similarity.
"edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.simFunc"
COSINE
Similarity.SimType
to use when computing the similarity of two data points.
"edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.numClusters"
Modifier and Type | Class and Description |
---|---|
static class |
HierarchicalAgglomerativeClustering.ClusterLinkage
The method to use when comparing the similarity of two clusters.
|
Modifier and Type | Field and Description |
---|---|
static String |
CLUSTER_LINKAGE_PROPERTY
The property for specifying the cluster linkage to use.
|
static String |
DEFAULT_CLUSTER_LINKAGE_PROPERTY
The default linkage method to use.
|
static String |
MIN_CLUSTER_SIMILARITY_PROPERTY
The property for specifying the cluster similarity threshold.
|
static String |
NUM_CLUSTERS_PROPERTY
The property for specifying the similarity function to use.
|
static String |
PROPERTY_PREFIX
A prefix for specifying properties.
|
static String |
SIMILARITY_FUNCTION_PROPERTY
The property for specifying the similarity function to use.
|
Constructor and Description |
---|
HierarchicalAgglomerativeClustering() |
Modifier and Type | Method and Description |
---|---|
List<Merge> |
buildDendogram(Matrix m,
HierarchicalAgglomerativeClustering.ClusterLinkage linkage,
Similarity.SimType similarityFunction)
Builds a dendrogram of the rows of similarity matrix by iteratelyve
linking each row according to the linkage policy in a bottom up manner.
|
List<Merge> |
buildDendrogram(Matrix similarityMatrix,
HierarchicalAgglomerativeClustering.ClusterLinkage linkage)
Builds a dendrogram of the rows of similarity matrix by iteratively
linking each row according to the linkage policy in a bottom up manner.
|
Assignments |
cluster(Matrix m,
int numClusters,
Properties props)
Clusters the set of rows in the given
Matrix into the specified
number of clusters. |
Assignments |
cluster(Matrix matrix,
Properties props)
Clusters the set of rows in the given
Matrix without a specified
number of clusters (optional operation). |
static int[] |
clusterRows(Matrix m,
double clusterSimilarityThreshold,
HierarchicalAgglomerativeClustering.ClusterLinkage linkage,
Similarity.SimType similarityFunction)
Clusters all rows in the matrix using the specified cluster similarity
measure for comparison and threshold for when to stop clustering.
|
static int[] |
partitionRows(Matrix m,
int numClusters,
HierarchicalAgglomerativeClustering.ClusterLinkage linkage,
Similarity.SimType similarityFunction)
Clusters all rows in the matrix using the specified cluster similarity
measure for comparison and stopping when the number of clusters is equal
to the specified number.
|
public static final String PROPERTY_PREFIX
public static final String MIN_CLUSTER_SIMILARITY_PROPERTY
public static final String CLUSTER_LINKAGE_PROPERTY
public static final String SIMILARITY_FUNCTION_PROPERTY
public static final String NUM_CLUSTERS_PROPERTY
public static final String DEFAULT_CLUSTER_LINKAGE_PROPERTY
public Assignments cluster(Matrix matrix, Properties props)
Matrix
without a specified
number of clusters (optional operation). The set of cluster assignments
are returned for each row in the matrix.cluster
in interface Clustering
matrix
- the Matrix
whose row data points are to be
clusteredprops
- the properties to use for any parameters each clustering
algorithm may needAssignment
instances that indicate zero or
more clusters to which each row belongs.public Assignments cluster(Matrix m, int numClusters, Properties props)
Matrix
into the specified
number of clusters. The set of cluster assignments are returned for each
row in the matrix. The value of the numClusters
parameter will
override the "edu.ucla.sspace.clustering.HierarchicalAgglomerativeClustering.numClusters" if it was specified.cluster
in interface Clustering
m
- the Matrix
whose row data points are to be
clusterednumClusters
- the number of clusters to generateprops
- the properties to use for any parameters each clustering
algorithm may needAssignment
instances that indicate zero or
more clusters to which each row belongs.public static int[] partitionRows(Matrix m, int numClusters, HierarchicalAgglomerativeClustering.ClusterLinkage linkage, Similarity.SimType similarityFunction)
m
- a matrix whose rows are to be clusterednumClusters
- the number of clusters into which the matrix should
dividedlinkage
- the method to use for computing the similarity of two
clusterspublic static int[] clusterRows(Matrix m, double clusterSimilarityThreshold, HierarchicalAgglomerativeClustering.ClusterLinkage linkage, Similarity.SimType similarityFunction)
m
- a matrix whose rows are to be clusteredclusterSimilarityThreshold
- the threshold to use when deciding
whether two clusters should be merged. If the similarity of the
clusters is below this threshold, they will not be merged and the
clustering process will be stopped.linkage
- the method to use for computing the similarity of two
clusterspublic List<Merge> buildDendogram(Matrix m, HierarchicalAgglomerativeClustering.ClusterLinkage linkage, Similarity.SimType similarityFunction)
m
can be determined. For example, to
find the partitioning after 4 merge operations, one might do the
following:
Matrix matrix; ListThe resultingmerges = buildDendogram(matrix, ...); List fourMergeSteps = merges.subList(0, 4); MultiMap clusterToRows = new HashMultiMap (); for (int i = 0; i < matrix.rows(); ++i) clusterToElements.put(i, i); for (Merge m : fourMergeSteps) { clusterToElements.putMulti(m.remainingCluster(), clusterToElements.remove(m.mergedCluster())); }
MultiMap
clusterToRows
contains the mapping from each cluster to the rows that are a part of it.m
- a matrix whose rows are to be compared and agglomeratively
merged into clusterslinkage
- how two clusters should be compared for similarity when
deciding which clusters to merge togethersimilarityFunction
- how to compare two rows of a matrix for
similaritypublic List<Merge> buildDendrogram(Matrix similarityMatrix, HierarchicalAgglomerativeClustering.ClusterLinkage linkage)
similarityMatrix
- a square matrix whose (i, j) values denote the
similarity of row i to row j.IllegalArgumentException
- if similarityMatrix
is not a
square matrixCopyright © 2012. All Rights Reserved.