make_blobs
- cuml.dask.datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, n_parts=None, center_box=(-10, 10), shuffle=True, random_state=None, return_centers=False, verbose=False, order='F', dtype='float32', client=None, workers=None)
Makes labeled Dask-CuPy arrays containing blobs for a randomly generated set of centroids.
This function calls make_blobs from cuml.datasets on each Dask worker and aggregates the results into a single Dask array.
For more information, see Scikit-learn's make_blobs.
- Parameters:
- n_samples : int
Number of rows.
- n_features : int
Number of features.
- centers : int or array of shape [n_centers, n_features], optional (default=None)
The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.
- cluster_std : float (default = 1.0)
Standard deviation of points around each centroid.
- n_parts : int (default = None)
Number of partitions to generate (this can be greater than the number of workers).
- center_box : tuple (int, int) (default = (-10, 10))
The bounding box that constrains all the centroids.
- random_state : int (default = None)
Sets the random seed (or use None to reinitialize each time).
- return_centers : bool, optional (default=False)
If True, return the centers of each cluster.
- verbose : int or boolean (default = False)
Logging level.
- shuffle : bool (default=True)
Shuffles the samples on each worker.
- order : str, optional (default='F')
The order of the generated samples.
- dtype : str, optional (default='float32')
Dtype of the generated samples.
- client : dask.distributed.Client (optional)
Dask client to use.
- workers : list of strings, optional
Dask addresses of workers to use for computation. If None, all available Dask workers are used (e.g. workers = list(client.scheduler_info()['workers'].keys())).
- Returns:
- X : dask.array backed by CuPy array of shape [n_samples, n_features]
The input samples.
- y : dask.array backed by CuPy array of shape [n_samples]
The output values.
- centers : dask.array backed by CuPy array of shape [n_centers, n_features], optional
The centers of the underlying blobs. It is returned only if return_centers is True.
Examples
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> from cuml.dask.datasets import make_blobs
>>> cluster = LocalCUDACluster(threads_per_worker=1)
>>> client = Client(cluster)
>>> workers = list(client.scheduler_info()['workers'].keys())
>>> X, y = make_blobs(1000, 10, centers=42, cluster_std=0.1,
...                   workers=workers)
>>> client.close()
>>> cluster.close()
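To make the roles of centers, center_box and cluster_std concrete, the following is a minimal CPU-only sketch in NumPy. It is not cuML's implementation: the helper name sketch_blobs, the round-robin label assignment and the seed argument are all assumptions made for illustration, and the real function generates the data on GPU workers and returns Dask arrays.

```python
import numpy as np


def sketch_blobs(n_samples=100, n_features=2, centers=3, cluster_std=1.0,
                 center_box=(-10.0, 10.0), seed=0):
    """Hypothetical CPU illustration of the blob parameters (not cuML code)."""
    rng = np.random.default_rng(seed)

    # Draw every centroid uniformly inside the bounding box.
    low, high = center_box
    centroids = rng.uniform(low, high, size=(centers, n_features))

    # Assumption: spread the samples over the centers round-robin.
    y = np.arange(n_samples) % centers

    # Scatter each sample around its centroid with isotropic Gaussian noise.
    noise = rng.normal(0.0, cluster_std, size=(n_samples, n_features))
    X = (centroids[y] + noise).astype("float32")
    return X, y, centroids


# Same sizes as the example above: 1000 rows, 10 features, 42 centers.
X, y, centroids = sketch_blobs(n_samples=1000, n_features=10,
                               centers=42, cluster_std=0.1)
print(X.shape, y.shape, centroids.shape)  # (1000, 10) (1000,) (42, 10)
```

The shapes match the Returns section: X is [n_samples, n_features], y is [n_samples], and the centers are [n_centers, n_features].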