K-Means

K-Means Parameters

class cuvs.cluster.kmeans.KMeansParams(
metric=None,
*,
n_clusters=None,
init_method=None,
max_iter=None,
tol=None,
n_init=None,
oversampling_factor=None,
hierarchical=None,
hierarchical_n_iters=None,
)

Hyperparameters for the k-means algorithm.

Parameters:
metric : str

String denoting the metric type.

n_clusters : int

The number of clusters to form, which is also the number of centroids to generate.

init_method : str

Method for initializing clusters. One of:

“KMeansPlusPlus” : Use the scalable k-means++ algorithm to select initial cluster centers.
“Random” : Choose ‘n_clusters’ observations at random from the input data.
“Array” : Use the supplied centroids as the initial cluster centers.

max_iter : int

Maximum number of iterations of the k-means algorithm for a single run.

tol : float

Relative tolerance with regard to inertia to declare convergence.

n_init : int

Number of times the k-means algorithm will be run with different seeds.

oversampling_factor : double

Oversampling factor for use in the k-means|| algorithm.

hierarchical : bool

Whether to use hierarchical (balanced) k-means.

hierarchical_n_iters : int

For hierarchical k-means, defines the number of training iterations.

Attributes:
hierarchical
hierarchical_n_iters
init_method
max_iter
metric
n_clusters
n_init
oversampling_factor
tol
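
To give some intuition for the “KMeansPlusPlus” option of init_method, here is a minimal CPU sketch of classic sequential k-means++ seeding in plain Python. It is not the cuVS implementation: cuVS documents a scalable variant (k-means||, which is where oversampling_factor applies), and the helper names `squared_distance` and `kmeans_plus_plus_init` are hypothetical. The sketch also assumes squared Euclidean distance.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plus_plus_init(points, n_clusters, seed=0):
    """Pick n_clusters initial centers from points using D^2 weighting.

    Illustrative only: classic sequential k-means++, not the scalable
    k-means|| variant, and not cuVS code.
    """
    rng = random.Random(seed)
    # The first center is drawn uniformly at random.
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < n_clusters:
        # Weight each point by its squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        # Already-chosen points have weight 0, so they cannot be drawn again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers = kmeans_plus_plus_init(points, n_clusters=3)
```

Because far-away points receive larger weights, the chosen centers tend to spread across the data, which is the motivation for this style of initialization over purely random selection.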

K-Means Fit

cuvs.cluster.kmeans.fit(
KMeansParams params,
X,
centroids=None,
sample_weights=None,
resources=None,
)

Find clusters with the k-means algorithm

Parameters:
params : KMeansParams

Parameters to use to fit the KMeans model

X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : Optional writable CUDA array interface compliant matrix, shape (n_clusters, k)

sample_weights : Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None

resources : Optional cuVS Resource handle for reusing CUDA resources.

If resources aren’t supplied, CUDA resources are allocated inside this function and synchronized before the function exits. If resources are supplied, you must synchronize explicitly by calling resources.sync() before accessing the output.

Returns:
centroids : raft.device_ndarray

The computed centroids for each cluster

inertia : float

Sum of squared distances of samples to their closest cluster center

n_iter : int

The number of iterations used to fit the model

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
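
The explicit-synchronization rule described for the resources parameter can be sketched as below. This assumes the handle type is `cuvs.common.Resources`, which this page does not name; the wrapper function `fit_with_shared_resources` is hypothetical. The cuVS imports are deferred into the function body so that merely defining it does not require a GPU.

```python
def fit_with_shared_resources(X, n_clusters):
    """Fit k-means on a CUDA array X while reusing one resources handle.

    Assumption: cuvs.common.Resources is the handle accepted by `resources`.
    """
    # Deferred imports: only needed when the function is actually called.
    from cuvs.common import Resources
    from cuvs.cluster.kmeans import KMeansParams, fit

    resources = Resources()
    params = KMeansParams(n_clusters=n_clusters)
    centroids, inertia, n_iter = fit(params, X, resources=resources)
    # A handle was supplied, so synchronize before reading any output.
    resources.sync()
    return centroids, inertia, n_iter
```

Passing the same handle to several calls is the case where the explicit `sync()` matters: work may still be in flight when a call returns, so outputs should only be read after synchronizing.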

K-Means Predict

cuvs.cluster.kmeans.predict(
KMeansParams params,
X,
centroids,
sample_weights=None,
labels=None,
normalize_weight=True,
resources=None,
)

Predict clusters with the k-means algorithm

Parameters:
params : KMeansParams

Parameters used to fit the KMeans model

X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : CUDA array interface compliant matrix, calculated by fit, shape (n_clusters, k)

sample_weights : Optional input CUDA array interface compliant matrix, shape (n_clusters, 1). Default: None

labels : Optional preallocated CUDA array interface matrix, shape (m, 1), to hold the output

normalize_weight : bool

True if the weights should be normalized

resources : Optional cuVS Resource handle for reusing CUDA resources.

If resources aren’t supplied, CUDA resources are allocated inside this function and synchronized before the function exits. If resources are supplied, you must synchronize explicitly by calling resources.sync() before accessing the output.

Returns:
labels : raft.device_ndarray

The label for each datapoint in X

inertia : float

Sum of squared distances of samples to their closest cluster center

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, predict, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
>>>
>>> labels, inertia = predict(params, X, centroids)
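
Conceptually, the labels returned by predict assign each row of X to its closest centroid. The following plain-Python CPU sketch shows that assignment under the assumption of squared Euclidean distance and unweighted samples. It is a reference illustration, not the cuVS implementation, and `assign_labels` is a hypothetical helper name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(X, centroids):
    """Return, for each row of X, the index of its nearest centroid."""
    labels = []
    for row in X:
        distances = [squared_distance(row, c) for c in centroids]
        # The index of the smallest distance is the cluster label.
        labels.append(distances.index(min(distances)))
    return labels


X = [[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [3.9, 4.2]]
centroids = [[0.0, 0.0], [4.0, 4.0]]
labels = assign_labels(X, centroids)
print(labels)  # [0, 0, 1, 1]
```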

K-Means Cluster Cost

cuvs.cluster.kmeans.cluster_cost(X, centroids, resources=None)

Compute cluster cost given an input matrix and existing centroids

Parameters:
X : Input CUDA array interface compliant matrix, shape (m, k)

centroids : Input CUDA array interface compliant matrix, shape (n_clusters, k)

resources : Optional cuVS Resource handle for reusing CUDA resources.

If resources aren’t supplied, CUDA resources are allocated inside this function and synchronized before the function exits. If resources are supplied, you must synchronize explicitly by calling resources.sync() before accessing the output.

Returns:
inertia : float

The cluster cost between the input matrix and existing centroids

Examples

>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import cluster_cost
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                      dtype=cp.float32)
>>> inertia = cluster_cost(X, centroids)
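
As a worked illustration of the quantity involved, inertia is described on this page as the sum of squared distances of samples to their closest cluster center. The plain-Python sketch below computes that sum on a tiny hand-checkable input, assuming squared Euclidean distance. It is a CPU reference, not the cuVS implementation, and `cluster_cost_reference` is a hypothetical name.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_cost_reference(X, centroids):
    """Sum over rows of X of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(row, c) for c in centroids) for row in X)


X = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [4.0, 6.0]]
centroids = [[0.5, 0.0], [4.0, 5.0]]
# Per-row nearest squared distances: 0.25, 0.25, 1.0, 1.0
cost = cluster_cost_reference(X, centroids)
print(cost)  # 2.5
```

A lower cost for the same data and number of clusters indicates centroids that sit closer to the points they represent.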