K-Means#
Parameters#
#include <cuvs/cluster/kmeans.h>
-
enum cuvsKMeansInitMethod#
Values:
-
enumerator KMeansPlusPlus#
Sample the centroids using the kmeans++ strategy
-
enumerator Random#
Sample the centroids uniformly at random
-
enumerator Array#
User provides the array of initial centroids
-
enum cuvsKMeansType#
Type of k-means algorithm.
Values:
-
enumerator CUVS_KMEANS_TYPE_KMEANS#
-
enumerator CUVS_KMEANS_TYPE_KMEANS_BALANCED#
-
typedef struct cuvsKMeansParams *cuvsKMeansParams_t#
-
typedef struct cuvsKMeansParams_v2 *cuvsKMeansParams_v2_t#
-
cuvsError_t cuvsKMeansParamsCreate(cuvsKMeansParams_t *params)#
Allocate KMeans params, and populate with default values.
Note
In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansParamsCreate_v2.
- Parameters:
params – [in] cuvsKMeansParams_t to allocate
- Returns:
cuvsError_t
-
cuvsError_t cuvsKMeansParamsDestroy(cuvsKMeansParams_t params)#
De-allocate KMeans params.
Note
In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansParamsDestroy_v2.
- Parameters:
params – [in] cuvsKMeansParams_t to de-allocate
- Returns:
cuvsError_t
-
cuvsError_t cuvsKMeansParamsCreate_v2(cuvsKMeansParams_v2_t *params)#
Allocate KMeans params.
Mirrors cuvsKMeansParamsCreate but operates on cuvsKMeansParams_v2. Will become the unsuffixed cuvsKMeansParamsCreate in cuVS 26.08.
- Parameters:
params – [in] cuvsKMeansParams_v2_t to allocate
- Returns:
cuvsError_t
-
cuvsError_t cuvsKMeansParamsDestroy_v2(cuvsKMeansParams_v2_t params)#
De-allocate KMeans params allocated by cuvsKMeansParamsCreate_v2.
- Parameters:
params – [in] cuvsKMeansParams_v2_t to de-allocate
- Returns:
cuvsError_t
-
struct cuvsKMeansParams#
- #include <kmeans.h>
Hyper-parameters for the k-means algorithm.
Note: the inertia_check field is kept only for ABI compatibility and is removed in cuvsKMeansParams_v2. The replacement is planned for cuVS 26.08.
Public Members
-
int n_clusters#
The number of clusters to form as well as the number of centroids to generate (default: 8).
-
cuvsKMeansInitMethod init#
Method for initialization, defaults to k-means++:
cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm to select the initial cluster centers.
cuvsKMeansInitMethod::Random (random): Choose ‘n_clusters’ observations (rows) at random from the input data for the initial centroids.
cuvsKMeansInitMethod::Array (ndarray): Use ‘centroids’ as initial cluster centers.
-
int max_iter#
Maximum number of iterations of the k-means algorithm for a single run.
-
double tol#
Relative tolerance with regard to inertia to declare convergence.
-
int n_init#
Number of times the k-means algorithm will be run with different seeds.
-
double oversampling_factor#
Oversampling factor for use in the k-means|| algorithm.
-
int batch_samples#
batch_samples and batch_centroids tile the 1-nearest-neighbor (1NN) computation, which helps control the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0 the centroids are not tiled.
-
int batch_centroids#
If 0, batch_centroids is set to n_clusters.
-
bool inertia_check#
Deprecated, ignored. Kept for ABI compatibility.
-
bool hierarchical#
Whether to use hierarchical (balanced) kmeans or not
-
int hierarchical_n_iters#
For hierarchical k-means, defines the number of training iterations.
-
int64_t streaming_batch_size#
Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).
-
int64_t init_size#
Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data.
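Several members above treat 0 as "use the default". The rules can be sketched in plain C as follows. The helper names are hypothetical and are not part of cuVS; they only restate the documented defaults for batch_centroids, streaming_batch_size, and init_size.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers (not part of cuVS) that restate the documented
 * "0 means default" rules. */

/* batch_centroids == 0 -> use n_clusters (centroids are not tiled). */
static int resolve_batch_centroids(int batch_centroids, int n_clusters)
{
  return batch_centroids == 0 ? n_clusters : batch_centroids;
}

/* streaming_batch_size == 0 -> process all n_samples at once. */
static int64_t resolve_streaming_batch_size(int64_t streaming_batch_size,
                                            int64_t n_samples)
{
  return streaming_batch_size == 0 ? n_samples : streaming_batch_size;
}

/* init_size == 0 -> min(3 * n_clusters, n_samples) for host data,
 * n_samples for device data. */
static int64_t resolve_init_size(int64_t init_size, int n_clusters,
                                 int64_t n_samples, bool x_on_host)
{
  if (init_size != 0) return init_size;
  if (!x_on_host) return n_samples;
  int64_t heuristic = 3 * (int64_t)n_clusters;
  return heuristic < n_samples ? heuristic : n_samples;
}
```

For example, with n_clusters = 8 and 1000 host samples, an init_size of 0 resolves to 24 under this reading.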
-
struct cuvsKMeansParams_v2#
- #include <kmeans.h>
Hyper-parameters for the k-means algorithm.
TODO: remove this struct after cuvsKMeansParams is replaced in ABI 2.0.
Public Members
-
int n_clusters#
The number of clusters to form as well as the number of centroids to generate (default: 8).
-
cuvsKMeansInitMethod init#
Method for initialization, defaults to k-means++:
cuvsKMeansInitMethod::KMeansPlusPlus (k-means++): Use scalable k-means++ algorithm to select the initial cluster centers.
cuvsKMeansInitMethod::Random (random): Choose ‘n_clusters’ observations (rows) at random from the input data for the initial centroids.
cuvsKMeansInitMethod::Array (ndarray): Use ‘centroids’ as initial cluster centers.
-
int max_iter#
Maximum number of iterations of the k-means algorithm for a single run.
-
double tol#
Relative tolerance with regard to inertia to declare convergence.
-
int n_init#
Number of times the k-means algorithm will be run with different seeds.
-
double oversampling_factor#
Oversampling factor for use in the k-means|| algorithm.
-
int batch_samples#
batch_samples and batch_centroids tile the 1-nearest-neighbor (1NN) computation, which helps control the memory footprint. The default tile is [batch_samples x n_clusters]; that is, when batch_centroids is 0 the centroids are not tiled.
-
int batch_centroids#
If 0, batch_centroids is set to n_clusters.
-
bool hierarchical#
Whether to use hierarchical (balanced) kmeans or not
-
int hierarchical_n_iters#
For hierarchical k-means, defines the number of training iterations.
-
int64_t streaming_batch_size#
Number of samples to process per GPU batch for the batched (host-data) API. When set to 0, defaults to n_samples (process all at once).
-
int64_t init_size#
Number of samples to draw for KMeansPlusPlus initialization. When set to 0, uses heuristic min(3 * n_clusters, n_samples) for host data, or n_samples for device data.
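To get a feel for how batch_samples and batch_centroids bound memory, the sketch below counts the tiles needed to cover an [n_samples x n_clusters] distance matrix and the number of entries in one tile. It is a back-of-the-envelope model with hypothetical helper names, not the library's implementation; it assumes batch_samples is positive and does not model any clamping the library may apply.

```c
#include <stdint.h>

/* Ceiling division for positive operands. */
static int64_t ceil_div(int64_t a, int64_t b) { return (a + b - 1) / b; }

/* Number of [batch_samples x batch_centroids] tiles needed to cover an
 * [n_samples x n_clusters] distance matrix. A batch_centroids of 0 is
 * treated as n_clusters, as documented. Assumes batch_samples > 0. */
static int64_t count_tiles(int64_t n_samples, int n_clusters,
                           int batch_samples, int batch_centroids)
{
  int bc = batch_centroids == 0 ? n_clusters : batch_centroids;
  return ceil_div(n_samples, batch_samples) * ceil_div(n_clusters, bc);
}

/* Upper bound on the number of distance entries held for one tile. */
static int64_t tile_elements(int n_clusters, int batch_samples,
                             int batch_centroids)
{
  int bc = batch_centroids == 0 ? n_clusters : batch_centroids;
  return (int64_t)batch_samples * bc;
}
```

Smaller tiles lower the per-tile footprint at the cost of more tiles to process; the two helpers make that trade-off easy to tabulate for a given dataset shape.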
Functions#
#include <cuvs/cluster/kmeans.h>
-
cuvsError_t cuvsKMeansFit(cuvsResources_t res, cuvsKMeansParams_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, double *inertia, int *n_iter)#
Find clusters with k-means algorithm.
Initial centroids are chosen with the k-means++ algorithm. Empty clusters are reinitialized by choosing new centroids with the k-means++ algorithm.
X may reside on either host (CPU) or device (GPU) memory. When X is on the host the data is streamed to the GPU in batches controlled by params->streaming_batch_size.
Note
In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansFit_v2.
- Parameters:
res – [in] opaque C handle
params – [in] Parameters for KMeans model.
X – [in] Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]
centroids – [inout] On input, when init is cuvsKMeansInitMethod::Array, the initial cluster centers. On output, the centroids computed by the k-means algorithm. Must be on device. [dim = n_clusters x n_features]
inertia – [out] Sum of squared distances of samples to their closest cluster center.
n_iter – [out] Number of iterations run.
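The inertia output can be cross-checked on small inputs with a CPU reference. The sketch below follows the definition given above (sum of squared distances of samples to their closest cluster center) for row-major X and centroids. It is an unweighted illustration that assumes float data and squared Euclidean distance; the function names are hypothetical and it is not how the library computes the value.

```c
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two vectors of length n_features. */
static double squared_distance(const float *a, const float *b,
                               size_t n_features)
{
  double sum = 0.0;
  for (size_t j = 0; j < n_features; ++j) {
    double d = (double)a[j] - (double)b[j];
    sum += d * d;
  }
  return sum;
}

/* CPU reference for inertia. X is row-major [n_samples x n_features] and
 * centroids is row-major [n_clusters x n_features]. Sample weights are
 * ignored in this sketch. */
static double reference_inertia(const float *X, size_t n_samples,
                                const float *centroids, size_t n_clusters,
                                size_t n_features)
{
  double inertia = 0.0;
  for (size_t i = 0; i < n_samples; ++i) {
    double best = DBL_MAX;
    for (size_t c = 0; c < n_clusters; ++c) {
      double d = squared_distance(&X[i * n_features],
                                  &centroids[c * n_features], n_features);
      if (d < best) best = d;
    }
    inertia += best;
  }
  return inertia;
}
```

Row i of X starts at X[i * n_features], which is what "row-major" means for the [n_samples x n_features] layout required here.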
-
cuvsError_t cuvsKMeansFit_v2(cuvsResources_t res, cuvsKMeansParams_v2_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, double *inertia, int *n_iter)#
Find clusters with k-means algorithm (v2 params layout).
Mirrors cuvsKMeansFit but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansFit in cuVS 26.08.
- Parameters:
res – [in] opaque C handle
params – [in] Parameters for KMeans model (v2 layout).
X – [in] Training instances to cluster. The data must be in row-major format. May be on host or device memory. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. Must be on the same memory space as X. [len = n_samples]
centroids – [inout] On input, when init is cuvsKMeansInitMethod::Array, the initial cluster centers. On output, the centroids computed by the k-means algorithm. Must be on device. [dim = n_clusters x n_features]
inertia – [out] Sum of squared distances of samples to their closest cluster center.
n_iter – [out] Number of iterations run.
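When X is on the host, the fit functions stream it to the GPU in batches of streaming_batch_size rows. The sketch below shows one plausible way to split n_samples rows into such batches, with the last batch possibly shorter. The batch_range type and both helpers are hypothetical; how the library actually partitions rows is an assumption here.

```c
#include <stdint.h>

/* Hypothetical description of one batch: rows [offset, offset + n_rows). */
typedef struct {
  int64_t offset;
  int64_t n_rows;
} batch_range;

/* Number of batches needed to cover n_samples rows. A
 * streaming_batch_size of 0 means "all rows in a single batch". */
static int64_t num_batches(int64_t n_samples, int64_t streaming_batch_size)
{
  int64_t bs = streaming_batch_size == 0 ? n_samples : streaming_batch_size;
  if (n_samples <= 0 || bs <= 0) return 0;
  return (n_samples + bs - 1) / bs;
}

/* Row range of batch `batch_index` (0-based, must be < num_batches). */
static batch_range get_batch(int64_t n_samples, int64_t streaming_batch_size,
                             int64_t batch_index)
{
  int64_t bs = streaming_batch_size == 0 ? n_samples : streaming_batch_size;
  batch_range r;
  r.offset = batch_index * bs;
  int64_t remaining = n_samples - r.offset;
  r.n_rows = remaining < bs ? remaining : bs;
  return r;
}
```

With 1000 rows and a batch size of 256, this yields four batches, the last covering rows 768 through 999.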
-
cuvsError_t cuvsKMeansPredict(cuvsResources_t res, cuvsKMeansParams_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, DLManagedTensor *labels, bool normalize_weight, double *inertia)#
Predict the closest cluster each sample in X belongs to.
Note
In cuVS 26.08 (next ABI major version) this signature will be replaced by cuvsKMeansPredict_v2.
- Parameters:
res – [in] opaque C handle
params – [in] Parameters for KMeans model.
X – [in] New data to predict. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. [len = n_samples]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
normalize_weight – [in] True if the weights should be normalized
labels – [out] Index of the cluster each sample in X belongs to. [len = n_samples]
inertia – [out] Sum of squared distances of samples to their closest cluster center.
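The labels output can be checked against a CPU reference on small inputs. The sketch below assigns each row of X to the index of its nearest centroid. It assumes float data, squared Euclidean distance, and int labels, and it breaks ties toward the lower index; those choices and the function name are assumptions made for illustration, not the library's documented behavior.

```c
#include <float.h>
#include <stddef.h>

/* CPU reference for prediction: labels[i] is the index of the centroid
 * closest to row i of X. X is row-major [n_samples x n_features] and
 * centroids is row-major [n_clusters x n_features]. */
static void reference_predict(const float *X, size_t n_samples,
                              const float *centroids, size_t n_clusters,
                              size_t n_features, int *labels)
{
  for (size_t i = 0; i < n_samples; ++i) {
    double best = DBL_MAX;
    size_t best_c = 0;
    for (size_t c = 0; c < n_clusters; ++c) {
      double dist = 0.0;
      for (size_t j = 0; j < n_features; ++j) {
        double d = (double)X[i * n_features + j] -
                   (double)centroids[c * n_features + j];
        dist += d * d;
      }
      /* Strict comparison keeps the lower index on ties. */
      if (dist < best) {
        best = dist;
        best_c = c;
      }
    }
    labels[i] = (int)best_c;
  }
}
```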
-
cuvsError_t cuvsKMeansPredict_v2(cuvsResources_t res, cuvsKMeansParams_v2_t params, DLManagedTensor *X, DLManagedTensor *sample_weight, DLManagedTensor *centroids, DLManagedTensor *labels, bool normalize_weight, double *inertia)#
Predict the closest cluster each sample in X belongs to (v2 params layout).
Mirrors cuvsKMeansPredict but takes cuvsKMeansParams_v2_t. Will become the unsuffixed cuvsKMeansPredict in cuVS 26.08.
- Parameters:
res – [in] opaque C handle
params – [in] Parameters for KMeans model (v2 layout).
X – [in] New data to predict. [dim = n_samples x n_features]
sample_weight – [in] Optional weights for each observation in X. [len = n_samples]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
normalize_weight – [in] True if the weights should be normalized
labels – [out] Index of the cluster each sample in X belongs to. [len = n_samples]
inertia – [out] Sum of squared distances of samples to their closest cluster center.
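The reference above does not spell out what normalize_weight does to sample_weight. One common convention, shown here purely as an assumption, is to rescale the weights so that they sum to n_samples, which leaves uniform weights of 1 unchanged. The helper name is hypothetical and the library's actual normalization may differ.

```c
#include <stddef.h>

/* Assumed convention (not confirmed by this reference): rescale weights
 * in place so that they sum to n_samples. Weights whose sum is not
 * positive are left untouched. */
static void normalize_weights(float *weights, size_t n_samples)
{
  double sum = 0.0;
  for (size_t i = 0; i < n_samples; ++i) sum += weights[i];
  if (sum <= 0.0) return;

  double scale = (double)n_samples / sum;
  for (size_t i = 0; i < n_samples; ++i) {
    weights[i] = (float)(weights[i] * scale);
  }
}
```

Under this convention, weights {1, 1, 2} become {0.75, 0.75, 1.5}: relative importance is preserved while the total matches the sample count.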
-
cuvsError_t cuvsKMeansClusterCost(cuvsResources_t res, DLManagedTensor *X, DLManagedTensor *centroids, double *cost)#
Compute cluster cost.
- Parameters:
res – [in] opaque C handle
X – [in] Training instances to cluster. The data must be in row-major format. [dim = n_samples x n_features]
centroids – [in] Cluster centroids. The data must be in row-major format. [dim = n_clusters x n_features]
cost – [out] Resulting cluster cost