K-Means
K-Means Parameters
- class cuvs.cluster.kmeans.KMeansParams(metric=None, *, n_clusters=None, init_method=None, max_iter=None, tol=None, n_init=None, oversampling_factor=None, hierarchical=None, hierarchical_n_iters=None)
Hyper-parameters for the kmeans algorithm
- Parameters:
- metric: str
String denoting the metric type.
- n_clusters: int
The number of clusters to form, which is also the number of centroids to generate.
- init_method: str
Method for initializing clusters. One of:
"KMeansPlusPlus": use the scalable k-means++ algorithm to select initial cluster centers
"Random": choose n_clusters observations at random from the input data
"Array": use the supplied centroids as initial cluster centers
- max_iter: int
Maximum number of iterations of the k-means algorithm for a single run.
- tol: float
Relative tolerance with regard to inertia used to declare convergence.
- n_init: int
Number of times the k-means algorithm will be run with different seeds.
- oversampling_factor: double
Oversampling factor for use in the k-means|| algorithm.
- hierarchical: bool
Whether to use hierarchical (balanced) k-means.
- hierarchical_n_iters: int
For hierarchical k-means, the number of training iterations.
- Attributes:
- hierarchical
- hierarchical_n_iters
- init_method
- max_iter
- metric
- n_clusters
- n_init
- oversampling_factor
- tol
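The interaction between tol and inertia can be sketched in plain Python. This is only one plausible reading of "relative tolerance with regard to inertia"; the helper name has_converged and the exact formula are assumptions for illustration, and the library's internal stopping rule may differ.

```python
def has_converged(prev_inertia, inertia, tol):
    """Return True when the relative drop in inertia falls below tol.

    Hypothetical helper: an assumed interpretation of `tol`, not cuVS code.
    """
    if prev_inertia == 0.0:
        # Nothing left to improve: every sample already sits on a centroid.
        return True
    relative_change = (prev_inertia - inertia) / prev_inertia
    return relative_change < tol


# Made-up inertia values from four successive iterations of a single run.
history = [120.0, 60.0, 59.5, 59.499]
for prev, cur in zip(history, history[1:]):
    print(prev, cur, has_converged(prev, cur, tol=1e-4))
```

Under this reading, a run stops either when the check passes or when max_iter iterations have been spent, whichever comes first.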
K-Means Fit
- cuvs.cluster.kmeans.fit(KMeansParams params, X, centroids=None, sample_weights=None, resources=None)
Find clusters with the k-means algorithm
- Parameters:
- params: KMeansParams
Parameters to use to fit the KMeans model
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: Optional writable CUDA array interface compliant matrix, shape (n_clusters, k)
- sample_weights: Optional input CUDA array interface compliant array of per-sample weights, one weight per row of X (m entries). Default: None
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If resources aren't supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- centroids: raft.device_ndarray
The computed centroids for each cluster
- inertia: float
Sum of squared distances of samples to their closest cluster center
- n_iter: int
The number of iterations used to fit the model
Examples
>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>>
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
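To make the three return values concrete, here is a minimal CPU-only sketch of plain Lloyd iterations that returns the same kind of tuple (centroids, inertia, n_iter). It is not the cuVS implementation: the function lloyd_fit, its arguments, and the stopping rule are assumptions chosen for illustration, and it starts from explicitly supplied centroids (comparable in spirit to the "Array" init method).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_fit(X, init_centroids, max_iter=300, tol=1e-4):
    """Hypothetical reference k-means; returns (centroids, inertia, n_iter)."""
    centroids = [list(c) for c in init_centroids]
    prev_inertia = None
    inertia = 0.0
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Assignment step: attach each sample to its nearest centroid and
        # accumulate per-cluster sums for the update step.
        sums = [[0.0] * len(c) for c in centroids]
        counts = [0] * len(centroids)
        inertia = 0.0
        for x in X:
            dists = [squared_distance(x, c) for c in centroids]
            label = dists.index(min(dists))
            inertia += dists[label]
            counts[label] += 1
            for j, value in enumerate(x):
                sums[label][j] += value
        # Update step: move each non-empty cluster's centroid to its mean.
        for k, count in enumerate(counts):
            if count:
                centroids[k] = [s / count for s in sums[k]]
        # Assumed stopping rule: relative drop in inertia below tol.
        if prev_inertia is not None and (
            prev_inertia == 0.0
            or (prev_inertia - inertia) / prev_inertia < tol
        ):
            break
        prev_inertia = inertia
    return centroids, inertia, n_iter


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids, inertia, n_iter = lloyd_fit(X, [[1.0, 1.0], [9.0, 1.0]])
print(centroids, inertia, n_iter)
```

In this sketch the reported inertia is measured during the final assignment step, so it reflects the centroids as they stood at the start of that iteration.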
K-Means Predict
- cuvs.cluster.kmeans.predict(KMeansParams params, X, centroids, sample_weights=None, labels=None, normalize_weight=True, resources=None)
Predict clusters with the k-means algorithm
- Parameters:
- params: KMeansParams
Parameters used in fitting the KMeans model
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: CUDA array interface compliant matrix, calculated by fit, shape (n_clusters, k)
- sample_weights: Optional input CUDA array interface compliant array of per-sample weights, one weight per row of X (m entries). Default: None
- labels: Optional preallocated CUDA array interface matrix, shape (m, 1), to hold the output
- normalize_weight: bool
True if the weights should be normalized
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If resources aren't supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- labels: raft.device_ndarray
The label for each datapoint in X
- inertia: float
Sum of squared distances of samples to their closest cluster center
Examples
>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import fit, predict, KMeansParams
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>>
>>> params = KMeansParams(n_clusters=n_clusters)
>>> centroids, inertia, n_iter = fit(params, X)
>>>
>>> labels, inertia = predict(params, X, centroids)
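The meaning of the two outputs can be checked on a tiny dataset with a CPU-only sketch. The function assign_labels below is a hypothetical reference, not part of cuVS; it assumes squared Euclidean distance and ignores sample_weights and normalize_weight.

```python
def assign_labels(X, centroids):
    """Return (labels, inertia) for samples X against fixed centroids.

    Hypothetical reference helper; unweighted, squared Euclidean distance.
    """
    labels = []
    inertia = 0.0
    for x in X:
        # Squared distance from this sample to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        # Index of the closest centroid (first one wins on ties).
        best = min(range(len(centroids)), key=dists.__getitem__)
        labels.append(best)
        inertia += dists[best]
    return labels, inertia


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids = [[0.0, 1.0], [10.0, 1.0]]
labels, inertia = assign_labels(X, centroids)
print(labels, inertia)  # [0, 0, 1, 1] 4.0
```

Each sample here lies exactly one unit from its nearest centroid, so every sample contributes 1.0 to the inertia.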
K-Means Cluster Cost
- cuvs.cluster.kmeans.cluster_cost(X, centroids, resources=None)
Compute cluster cost given an input matrix and existing centroids
- Parameters:
- X: Input CUDA array interface compliant matrix, shape (m, k)
- centroids: Input CUDA array interface compliant matrix, shape (n_clusters, k)
- resources: Optional cuVS Resource handle for reusing CUDA resources.
If resources aren't supplied, CUDA resources will be allocated inside this function and synchronized before the function exits. If resources are supplied, you will need to explicitly synchronize yourself by calling resources.sync() before accessing the output.
- Returns:
- inertia: float
The cluster cost between the input matrix and existing centroids
Examples
>>> import cupy as cp
>>>
>>> from cuvs.cluster.kmeans import cluster_cost
>>>
>>> n_samples = 5000
>>> n_features = 50
>>> n_clusters = 3
>>>
>>> X = cp.random.random_sample((n_samples, n_features),
...                             dtype=cp.float32)
>>>
>>> centroids = cp.random.random_sample((n_clusters, n_features),
...                                     dtype=cp.float32)
>>>
>>> inertia = cluster_cost(X, centroids)
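One use for a cost function like this is comparing candidate centroid sets on the same data. The sketch below does that on the CPU with a hypothetical reference_cluster_cost helper, assuming the cost is the sum of squared Euclidean distances from each sample to its nearest centroid; it does not call cuVS.

```python
def reference_cluster_cost(X, centroids):
    """Sum over samples of the squared distance to the nearest centroid.

    Hypothetical reference helper, assumed definition of the cost.
    """
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids)
        for x in X
    )


X = [[0.0], [1.0], [9.0], [10.0]]
well_placed = [[0.5], [9.5]]    # one centroid per natural group
badly_placed = [[0.0], [1.0]]   # both centroids in the left group

print(reference_cluster_cost(X, well_placed))   # 1.0
print(reference_cluster_cost(X, badly_placed))  # 145.0
```

A lower cost means the centroids sit closer to the data they represent, so the first candidate set is the better fit for this toy input.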