make_blobs

cuml.dask.datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, n_parts=None, center_box=(-10, 10), shuffle=True, random_state=None, return_centers=False, verbose=False, order='F', dtype='float32', client=None, workers=None)

Generates labeled Dask arrays, backed by CuPy, containing blobs around a randomly generated set of centroids.

This function calls make_blobs from cuml.datasets on each Dask worker and aggregates the results into a single Dask array.

For more information, see the documentation for Scikit-learn’s make_blobs.
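As a rough illustration of the per-worker generation described above, the sketch below divides n_samples into n_parts row counts and assigns each partition to a worker in round-robin order. The helper name split_rows, the worker names, and the policy itself (an even split with the remainder added to the last partition) are assumptions made for illustration only; the scheme cuML actually uses may differ.

```python
def split_rows(n_samples, n_parts, workers):
    """Hypothetical helper: plan how many rows each partition would hold.

    Assumes an even split, with any remainder added to the last partition,
    and round-robin assignment of partitions to workers.
    """
    base, remainder = divmod(n_samples, n_parts)
    sizes = [base] * n_parts
    sizes[-1] += remainder
    # Pair each partition size with a worker; workers repeat when
    # n_parts is greater than the number of workers.
    return [(workers[i % len(workers)], size) for i, size in enumerate(sizes)]


# Three partitions spread over two (made-up) worker names.
plan = split_rows(1000, 3, ["worker-a", "worker-b"])
for worker, rows in plan:
    print(worker, rows)
```

The sketch also shows why n_parts may exceed the number of workers: a worker can simply hold more than one partition.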

Parameters:
n_samples : int (default = 100)

Number of rows (samples) to generate.

n_features : int (default = 2)

Number of features (columns) per sample.

centers : int or array of shape [n_centers, n_features], optional (default=None)

The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_std : float (default = 1.0)

Standard deviation of the points around each centroid.

n_parts : int (default = None)

Number of partitions to generate (this can be greater than the number of workers).

center_box : tuple (int, int) (default = (-10, 10))

The bounding box that constrains all the centroids.

random_state : int (default = None)

Sets the random seed (use None to reinitialize each time).

return_centers : bool, optional (default=False)

If True, also return the centers of each cluster.

verbose : int or boolean (default = False)

Logging level.

shuffle : bool (default=True)

Shuffles the samples on each worker.

order : str, optional (default='F')

The memory layout of the generated samples ('F' for column-major, 'C' for row-major).

dtype : str, optional (default='float32')

Dtype of the generated samples.

client : dask.distributed.Client, optional

Dask client to use.

workers : list of strings, optional

Dask addresses of the workers to use for computation. If None, all available Dask workers are used. Example: workers = list(client.scheduler_info()['workers'].keys())
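To make the roles of centers, cluster_std, center_box, and random_state concrete, here is a minimal CPU-only sketch of blob generation in plain Python. It is not cuML's implementation: the function name make_blobs_sketch is hypothetical, it returns nested lists rather than Dask arrays, and it assigns samples to centroids in simple round-robin order, which is an assumption made to keep the example short.

```python
import random


def make_blobs_sketch(n_samples=100, n_features=2, centers=3,
                      cluster_std=1.0, center_box=(-10, 10),
                      random_state=None):
    """Hypothetical, CPU-only illustration of the parameters above."""
    rng = random.Random(random_state)
    low, high = center_box
    # Draw each centroid uniformly inside the bounding box.
    centroids = [[rng.uniform(low, high) for _ in range(n_features)]
                 for _ in range(centers)]
    X, y = [], []
    for i in range(n_samples):
        label = i % centers  # assumption: spread samples evenly over centroids
        # Add isotropic Gaussian noise around the chosen centroid.
        X.append([rng.gauss(c, cluster_std) for c in centroids[label]])
        y.append(label)
    return X, y, centroids


X, y, centroids = make_blobs_sketch(n_samples=12, n_features=2, centers=3,
                                    cluster_std=0.1, random_state=0)
print(len(X), len(X[0]), sorted(set(y)))
```

A small cluster_std keeps points tight around their centroid, and a fixed random_state makes repeated calls return the same data.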

Returns:
X : dask.array backed by CuPy array of shape [n_samples, n_features]

The generated samples.

y : dask.array backed by CuPy array of shape [n_samples]

The cluster label of each sample.

centers : dask.array backed by CuPy array of shape [n_centers, n_features], optional

The centers of the underlying blobs. Returned only if return_centers is True.

Examples

>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> from cuml.dask.datasets import make_blobs

>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)

>>> workers = list(client.scheduler_info()['workers'].keys())
>>> X, y = make_blobs(1000, 10, centers=42, cluster_std=0.1,
...                   workers=workers)

>>> client.close()
>>> cluster.close()