esda.adbscan.ADBSCAN

class esda.adbscan.ADBSCAN(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]

A-DBSCAN, as introduced in [].

A-DBSCAN is an extension of the original DBSCAN algorithm that creates an ensemble of solutions, each generated by running DBSCAN on a random subset of the data and “extending” that solution to the rest of the sample through nearest-neighbor regression.

See the original reference ([]) for more details, or the notebook guide for an illustration.
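
To illustrate the mechanics, below is a minimal, hypothetical sketch of a single draw using scikit-learn's DBSCAN and a 1-nearest-neighbor classifier as a stand-in for the extension step; it is not the esda implementation, which additionally relabels and aggregates the solutions across draws:

>>> import numpy as np
>>> from sklearn.cluster import DBSCAN
>>> from sklearn.neighbors import KNeighborsClassifier
>>> rng = np.random.default_rng(10)
>>> xy = rng.random((200, 2))
>>> # Run exact DBSCAN on a 10% random subset of the points...
>>> sample = rng.choice(len(xy), size=20, replace=False)
>>> sub = DBSCAN(eps=0.1, min_samples=3).fit(xy[sample])
>>> # ...and extend its labels to the full sample via the nearest sampled point
>>> extended = (
...     KNeighborsClassifier(n_neighbors=1)
...     .fit(xy[sample], sub.labels_)
...     .predict(xy)
... )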

Parameters:
eps : float

The maximum distance between two samples for them to be considered as in the same neighborhood.

min_samples : int

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See the NearestNeighbors module documentation for details.

n_jobs : int

[Optional. Default=1] The number of parallel jobs to run. If -1, the number of jobs is set to the number of CPU cores.

pct_exact : float

[Optional. Default=0.1] Proportion of the entire dataset used to run DBSCAN in each draw.

reps : int

[Optional. Default=100] Number of random samples to draw in order to build the final solution.

keep_solus : bool

[Optional. Default=False] If True, the solus and solus_relabelled objects are kept; otherwise they are deleted to save memory.

pct_thr : float

[Optional. Default=0.9] Minimum proportion of replications in which a non-noise label needs to be assigned to an observation for that observation to receive that label.

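As a rough illustration of how these options interact, a heavier, hypothetical configuration on a multi-core machine might look as follows (the values are arbitrary, not recommendations):

>>> from esda.adbscan import ADBSCAN
>>> clusterer = ADBSCAN(
...     eps=0.03,
...     min_samples=3,
...     n_jobs=-1,      # use all available CPU cores
...     reps=500,       # more draws, more stable ensemble
...     pct_exact=0.2,  # run exact DBSCAN on 20% of the data per draw
...     pct_thr=0.8,    # require an 80% majority for a non-noise label
... )
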
Examples

>>> import pandas
>>> from esda.adbscan import ADBSCAN
>>> import numpy as np
>>> np.random.seed(10)
>>> db = pandas.DataFrame({'X': np.random.random(25),
...                        'Y': np.random.random(25)})

ADBSCAN can be run following scikit-learn like API as:

>>> np.random.seed(10)
>>> clusterer = ADBSCAN(0.03, 3, reps=10, keep_solus=True)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['-1', '-1', '-1', '0', '-1', '-1', '-1', '0', '-1', '-1', '-1',
       '-1', '-1', '-1', '0', '0', '0', '-1', '0', '-1', '0', '-1', '-1',
       '-1', '-1'], dtype=object)

We can inspect the winning label for each observation, as well as the proportion of votes:

>>> print(clusterer.votes.head().to_string())
  lbls  pct
0   -1  0.7
1   -1  0.5
2   -1  0.7
3    0  1.0
4   -1  0.7

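Per the labels_ description below, the final labels correspond to the winning label in votes whenever its share of the draws reaches pct_thr, and -1 otherwise. A hedged sketch of that relationship, assuming the fitted object keeps pct_thr as an attribute:

>>> derived = clusterer.votes["lbls"].where(
...     clusterer.votes["pct"] >= clusterer.pct_thr, "-1"
... )
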
If you have set the option to keep them, you can even inspect each solution that makes up the ensemble:

>>> print(clusterer.solus.head().to_string())
  rep-00 rep-01 rep-02 rep-03 rep-04 rep-05 rep-06 rep-07 rep-08 rep-09
0      0      1      1      0      1      0      0      0      1      0
1      1      1      1      1      0      1      0      1      1      1
2      0      1      1      0      0      1      0      0      1      0
3      0      1      1      0      0      1      1      1      0      0
4      0      1      1      1      0      1      0      1      0      1

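The vote shares are computed over the replications once they have been relabelled to be mutually consistent (see solus_relabelled in the Attributes section); a rough, hypothetical way to reproduce them by hand is:

>>> shares = clusterer.solus_relabelled.apply(
...     lambda row: row.value_counts(normalize=True).max(), axis=1
... )
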
If we select a single replication and set the proportion of the dataset sampled in each draw (pct_exact) to 100%, we obtain a traditional DBSCAN:

>>> clusterer = ADBSCAN(0.2, 5, reps=1, pct_exact=1)
>>> np.random.seed(10)
>>> _ = clusterer.fit(db)
>>> clusterer.labels_
array(['0', '-1', '0', '0', '0', '-1', '-1', '0', '-1', '-1', '0', '-1',
       '-1', '-1', '0', '0', '0', '-1', '0', '0', '0', '-1', '-1', '0',
       '-1'], dtype=object)
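
As a hedged cross-check, and assuming the cluster IDs happen to coincide, the same partition should be recoverable with scikit-learn's plain DBSCAN (the comparison below is illustrative, not part of the documented API):

>>> from sklearn.cluster import DBSCAN
>>> sk = DBSCAN(eps=0.2, min_samples=5).fit(db[["X", "Y"]])
>>> match = (sk.labels_.astype(str) == clusterer.labels_).all()
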
Attributes:
labels_ : array

[Only available after fit] Cluster labels for each point in the dataset given to fit(). Noisy samples (those for which the proportion of the most common label is < pct_thr) are given the label -1.

votes : DataFrame

[Only available after fit] Table indexed on X.index, with labels_ under the lbls column and the frequency of that label across draws under pct.

solus : DataFrame, shape = [n, reps]

[Only available after fit] Each solution of labels for every draw.

solus_relabelled : DataFrame, shape = [n, reps]

[Only available after fit] Each solution of labels for every draw, relabelled to be consistent across solutions.

__init__(eps, min_samples, algorithm='auto', n_jobs=1, pct_exact=0.1, reps=100, keep_solus=False, pct_thr=0.9)[source]

Methods

__init__(eps, min_samples[, algorithm, ...])

fit(X[, y, sample_weight, xy])

Perform ADBSCAN clustering from features.

fit_predict(X[, y])

Perform clustering on X and return cluster labels.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.
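
For instance, assuming the scikit-learn-style fit_predict listed above returns the same labels as fit() followed by labels_, fitting and labelling can be combined in one call:

>>> labels = ADBSCAN(0.03, 3, reps=10).fit_predict(db)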