K-Means

K-Means Parameters

class cuvs.cluster.kmeans.KMeansParams(metric=None, *, n_clusters=None, init_method=None, max_iter=None, tol=None, n_init=None, oversampling_factor=None, hierarchical=None, hierarchical_n_iters=None)

Hyper-parameters for the kmeans algorithm

Parameters:

metric (str): String denoting the metric type.
n_clusters (int): The number of clusters to form, which is also the number of centroids to generate.
init_method (str): Method for initializing clusters. One of:
    "KMeansPlusPlus": Use the scalable k-means++ algorithm to select initial cluster centers.
    "Random": Choose n_clusters observations at random from the input data.
    "Array": Use the supplied centroids as initial cluster centers.
max_iter (int): Maximum number of iterations of the k-means algorithm for a single run.
tol (float): Relative tolerance with regard to inertia to declare convergence.
n_init (int): Number of times the k-means algorithm will be run with different seeds.
oversampling_factor (double): Oversampling factor for use in the k-means|| algorithm.
hierarchical (bool): Whether to use hierarchical (balanced) k-means.
hierarchical_n_iters (int): For hierarchical k-means, the number of training iterations.

Attributes:

hierarchical
hierarchical_n_iters
init_method
max_iter
metric
n_clusters
n_init
oversampling_factor
tol
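
To make the roles of max_iter and tol concrete, the following is a minimal pure-Python sketch of Lloyd-style k-means iterations. It is not the cuVS implementation: the function names (squared_distance, lloyd_kmeans), the default values, and the exact stopping rule (relative change in inertia between consecutive iterations) are assumptions chosen for illustration only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, centroids, max_iter=300, tol=1e-4):
    """Illustrative k-means loop (hypothetical helper, not the cuVS code).

    Stops after max_iter iterations, or earlier once the relative change
    in inertia between two consecutive iterations is at most tol.
    """
    centroids = [list(c) for c in centroids]
    prev_inertia = None
    inertia = 0.0
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Inertia of this assignment, measured against the current centroids.
        inertia = sum(
            squared_distance(p, centroids[label])
            for p, label in zip(points, labels)
        )
        # Update step: move each non-empty cluster's centroid to its mean.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Assumed convergence rule: relative inertia change at most tol.
        if prev_inertia is not None and \
                abs(prev_inertia - inertia) <= tol * prev_inertia:
            break
        prev_inertia = inertia
    return centroids, inertia, n_iter
```

In this sketch a smaller tol or a larger max_iter allows more iterations before stopping; the three returned values mirror the (centroids, inertia, n_iter) triple that fit returns.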

K-Means Fit

cuvs.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, resources=None)

Find clusters with the k-means algorithm

Parameters:

params (KMeansParams): Parameters to use to fit the KMeans model.
X: Input CUDA array interface compliant matrix, shape (m, k).
centroids: Optional writable CUDA array interface compliant matrix, shape (n_clusters, k).
sample_weights: Optional input CUDA array interface compliant matrix holding one weight per row of X, shape (m, 1). Default: None.
resources: Optional cuVS Resources handle for reusing CUDA resources. If resources are not supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you must synchronize explicitly by calling resources.sync() before accessing the output.

Returns:

centroids (raft.device_ndarray): The computed centroids for each cluster.
inertia (float): Sum of squared distances of samples to their closest cluster center.
n_iter (int): The number of iterations used to fit the model.

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)

>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)

K-Means Predict

cuvs.cluster.kmeans.predict(KMeansParams params, X, centroids, sample_weights=None, labels=None, normalize_weight=True, resources=None)

Predict clusters with the k-means algorithm

Parameters:

params (KMeansParams): Parameters used to fit the KMeans model.
X: Input CUDA array interface compliant matrix, shape (m, k).
centroids: CUDA array interface compliant matrix calculated by fit, shape (n_clusters, k).
sample_weights: Optional input CUDA array interface compliant matrix holding one weight per row of X, shape (m, 1). Default: None.
labels: Optional preallocated CUDA array interface matrix, shape (m, 1), to hold the output.
normalize_weight (bool): True if the weights should be normalized.
resources: Optional cuVS Resources handle for reusing CUDA resources. If resources are not supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you must synchronize explicitly by calling resources.sync() before accessing the output.

Returns:

labels (raft.device_ndarray): The label for each datapoint in X.
inertia (float): Sum of squared distances of samples to their closest cluster center.

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, predict, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)

>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
>>>
>>> labels, inertia = predict(params, X, centroids)
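
For intuition about the labels output, the sketch below assigns each row to its nearest centroid in plain Python. It assumes squared Euclidean distance and ignores sample_weights and normalize_weight; predict_labels is a hypothetical helper for illustration, not part of cuVS.

```python
def predict_labels(points, centroids):
    """Return, for each point, the index of its closest centroid.

    Hypothetical reference helper: squared Euclidean distance, no weights.
    Ties are resolved in favor of the lowest centroid index.
    """
    labels = []
    for p in points:
        # Squared distance from this point to every centroid.
        dists = [
            sum((x - c) ** 2 for x, c in zip(p, centroid))
            for centroid in centroids
        ]
        labels.append(dists.index(min(dists)))
    return labels
```

On small inputs copied to the host, a helper like this can serve as a sanity check against the labels returned by predict, assuming the default metric behaves like squared Euclidean distance.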

K-Means Cluster Cost

cuvs.cluster.kmeans.cluster_cost(X, centroids, resources=None)

Compute cluster cost given an input matrix and existing centroids

Parameters:

X: Input CUDA array interface compliant matrix, shape (m, k).
centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k).
resources: Optional cuVS Resources handle for reusing CUDA resources. If resources are not supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you must synchronize explicitly by calling resources.sync() before accessing the output.

Returns:

inertia (float): The cluster cost between the input matrix and existing centroids.

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import cluster_cost
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)

>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)

>>> inertia = cluster_cost(X, centroids)
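
As a reference for what the returned value represents, the sketch below computes a cluster cost in plain Python. It assumes the cost is the sum, over all rows of X, of the squared Euclidean distance to the closest centroid, matching the inertia definition given for fit. cluster_cost_reference is a hypothetical name and not a cuVS function.

```python
def cluster_cost_reference(points, centroids):
    """Sum of squared distances from each point to its closest centroid.

    Hypothetical host-side reference, assuming squared Euclidean distance.
    """
    return sum(
        min(
            sum((x - c) ** 2 for x, c in zip(p, centroid))
            for centroid in centroids
        )
        for p in points
    )
```

For example, with points [[0, 0], [0, 2], [10, 10]] and centroids [[0, 1], [10, 10]], the first two points each contribute 1 and the last contributes 0, so the cost is 2. When comparing against cluster_cost on float32 GPU data, expect small rounding differences.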