menelaus.partitioners

Partitioner algorithms are typically an early data-processing component for drift detection methods. Data may be partitioned into some new structure (e.g., trees, histograms, or other distributional representations). Separate structures (representing separate batches of data) may then be more efficiently compared by drift detection algorithms in order to raise alarms.

menelaus.partitioners.KDQTreePartitioner

class menelaus.partitioners.KDQTreePartitioner.KDQTreeNode(num_samples_in_compared_subtrees, axis, midpoint_at_axis, left, right)[source]

Bases: object

This class represents a node in the kdqTree data structure (see KDQTreePartitioner below). Its static methods provide the engine for recursively building, filling, and displaying trees.

Built and filled trees are identified by a tree ID. During initialization, a node typically takes num_samples_in_compared_subtrees (among other values needed for building), a dict containing counts keyed by tree ID.

Each node stores the number of data points contained within itself and its subtrees as a value in this dict. A tree_id key identifies which build those counts are associated with (e.g., the first call to build() stores counts under node.num_samples_in_compared_subtrees['build'] by default). Subsequent calls to fill() with new data store additional counts for the node, under the ID the user passes in.
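
As a rough illustration of how these per-tree counts are stored (the numeric values and the 'test' tree ID below are made up, and in practice leaf nodes are produced by build() rather than constructed by hand):

    from menelaus.partitioners.KDQTreePartitioner import KDQTreeNode

    # Illustrative leaf-like node: counts from the original 'build' pass plus a
    # later fill() under tree_id='test'. The values 120 and 98 are invented.
    node = KDQTreeNode(
        num_samples_in_compared_subtrees={"build": 120, "test": 98},
        axis=None,
        midpoint_at_axis=None,
        left=None,
        right=None,
    )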

num_samples_in_compared_subtrees

number of data points, by tree ID, contained within node and its children

Type

dict

axis

axis at which to build/fill (for recursive construction)

Type

int

midpoint_at_axis

midpoint at provided axis

Type

float

left

left child

Type

KDQTreeNode

right

right child

Type

KDQTreeNode

__init__(num_samples_in_compared_subtrees, axis, midpoint_at_axis, left, right)[source]
Parameters
  • num_samples_in_compared_subtrees (dict) – number of data points, by tree ID, contained within node and its children

  • axis (int) – axis at which to build/fill (for recursive construction)

  • midpoint_at_axis (float) – midpoint at provided axis

  • left (KDQTreeNode) – left child

  • right (KDQTreeNode) – right child

static as_flattened_array(node, tree_id1, tree_id2, output=[], name='kdqTree', parent_idx=None, depth=0, input_cols=None)[source]

Generates a list containing dicts with information about each node’s structure for the tree rooted at node.

Parameters
  • node (KDQTreeNode) – root node of desired subtree

  • tree_id1 (str) – identifier for reference tree

  • tree_id2 (str) – identifier for test tree

  • output (list, optional) – list of dictionaries containing information about each node. Default [].

  • name (str) – name of root node. Default 'kdqTree'.

  • parent_idx (int, optional) – unique ID for parent of current node. Default None.

  • depth (int, optional) – depth of current subtree. Default 0.

  • input_cols (list, optional) – list of column names for the input data. Default None.

static as_text(node, tree_id='build')[source]

Produce a text representation of the tree structure and its counts.

Parameters
  • node (KDQTreeNode) – root node of desired subtree

  • tree_id (str, optional) – identifier for desired subtree. Default 'build'.

static build(data, count_ubound, min_cutpoint_sizes, leaves, depth=0)[source]

Creates a new kdqTree by partitioning the data into square nodes in the feature-space.

Parameters
  • data (numpy.array) – data to be partitioned

  • count_ubound (int) – upper bound on the count of points to be put into any one cell

  • min_cutpoint_sizes (list) – minimum sizes of features that a leaf can have

  • leaves (list) – list of leaf nodes

  • depth (int, optional) – current depth of tree. Default 0.

Returns

root node of tree

Return type

KDQTreeNode

static fill(data, node, count_ubound, tree_id, reset=False)[source]

For a new sample, partition it according to the pre-existing subtree rooted at node. Its counts will be added to those for the tree_id index.

Parameters
  • data (numpy.array) – new data to be partitioned into existing tree

  • node (KDQTreeNode) – current node

  • count_ubound (int) – upper bound of points in any one cell

  • tree_id (str) – identifier for new data counts to be stored

  • reset (bool, optional) – if True, counts for this tree_id will be overwritten with ones calculated from current sample, else added. Default False.

static reset(node, value, tree_id)[source]

For the subtree rooted at node, set the counts for tree_id equal to value.

Parameters
  • node (KDQTreeNode) – root of subtree

  • value (int) – value to set counts to

  • tree_id (str) – identifier for tree

class menelaus.partitioners.KDQTreePartitioner.KDQTreePartitioner(count_ubound=200, cutpoint_proportion_lbound=0.25)[source]

Bases: object

This class encapsulates a kdqTree for partitioning data that is run through drift detection algorithms, as described by Dasu et al. (2006).

The general algorithm to build a tree proceeds recursively until a stopping condition is met:

  • Identify axis to split upon.

  • Identify midpoint at axis.

  • Split into lower/upper halves using midpoint.

  • Recursively split sub-trees.

This class can be used to build a tree, ‘fill’ a tree (send new data to be forcibly partitioned according to an existing build), access leaf nodes, and perform visualization / printing. There are also convenience functions to calculate distances between a built and filled tree.
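
A minimal end-to-end sketch of this workflow follows; the random data, parameter values, and the 'test' tree ID are illustrative only:

    import numpy as np
    from menelaus.partitioners.KDQTreePartitioner import KDQTreePartitioner

    # Illustrative reference and test batches with the same number of features.
    reference = np.random.uniform(0, 1, (500, 3))
    test = np.random.uniform(0.2, 1.2, (500, 3))

    kp = KDQTreePartitioner(count_ubound=50, cutpoint_proportion_lbound=0.01)
    kp.build(reference)            # counts stored under the default tree ID 'build'
    kp.fill(test, tree_id="test")  # partition new data into the existing cells

    ref_counts = kp.leaf_counts("build")
    test_counts = kp.leaf_counts("test")
    divergence = kp.kl_distance("build", "test")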

count_ubound

upper bound of count of data points to be partitioned into any one cell

Type

int

cutpoint_proportion_lbound

min. proportion of all features’ range values that any constructed cell size must satisfy

Type

float

node

reference to root node of build tree

Type

KDQTreeNode

leaves

list of leaf nodes of build tree

Type

list(KDQTreeNode)

__init__(count_ubound=200, cutpoint_proportion_lbound=0.25)[source]
Parameters
  • count_ubound (int, optional) – upper bound of count of data points to be partitioned into any one cell. Default 200.

  • cutpoint_proportion_lbound (float, optional) – min. proportion of all features’ range values that any constructed cell size must satisfy. Default 0.25.

build(data)[source]

Creates a new kdqTree by partitioning the data into square nodes in the feature-space.

Parameters

data (numpy.array) – data to be partitioned

Returns

root node of constructed KDQ-Tree

Return type

KDQTreeNode

fill(data, tree_id, reset=False)[source]

For a new sample of data, partition it according to the pre-existing tree structure. Its counts will be added to those stored under the tree_id index.

Parameters
  • data (numpy.array) – new data to be partitioned

  • tree_id (str) – identifier for new data counts to be stored

  • reset (bool, optional) – if True, counts for this tree_id will be overwritten with counts calculated from the current sample, else added. Default False.

Returns

root node of KDQ-Tree

Return type

KDQTreeNode

kl_distance(tree_id1, tree_id2)[source]

For the empirical distributions defined by the counts at tree_id1 and tree_id2, calculate the Kullback-Leibler divergence from tree_id2 to tree_id1.

Parameters
  • tree_id1 (str) – identifier for reference tree

  • tree_id2 (str) – identifier for comparison tree

Returns

Kullback-Leibler divergence between the two trees

Return type

float
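
Under the usual convention, with p the normalized leaf counts stored at tree_id1 and q those stored at tree_id2, the returned statistic is the standard divergence below; how zero-count leaves are smoothed is an implementation detail not specified here.

    D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}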

leaf_counts(tree_id)[source]

Return the counts for each leaf of the tree at tree_id.

Parameters

tree_id (str) – identifier of tree for which to return counts

Returns

list of counts at leaves of KDQ-Tree

Return type

list

reset(value=0, tree_id='build')[source]

Sets the counts for the given tree_id to the given value.

Parameters
  • value (int, optional) – value to be written for each node. Default 0.

  • tree_id (str, optional) – identifier for set of counts to be overwritten. Default 'build'.

to_plotly_dataframe(tree_id1='build', tree_id2=None, max_depth=None, input_cols=None)[source]

Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly. DataFrame columns capture:

  • name, a label corresponding to which feature this split is on.

  • idx, a unique ID for the node, to pass to plotly.express.treemap’s id argument.

  • parent_idx, the ID of the node’s parent.

  • cell_count, how many samples are in this node in the reference tree.

  • depth, how deep the node is in the tree.

  • count_diff, if tree_id2 is specified, the change in counts from the reference tree.

  • kss, the Kulldorff Spatial Scan Statistic for this node, defined as the KL-divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.

Parameters
  • tree_id1 (str) – identifier for reference tree. Default 'build'.

  • tree_id2 (str) – identifier for test tree. Default None.

  • max_depth (int, optional) – tree depth up to which the method recurses. Default None.

  • input_cols (list, optional) – list of column names for the input data. Default None.

Returns

DataFrame in which each row represents one node of the tree

Return type

pandas.DataFrame
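
Because idx and parent_idx align with plotly.express.treemap’s ids and parents arguments, a plotting sketch might look like the following, assuming kp is the partitioner built and filled in the earlier sketch; coloring by kss is one possible choice:

    import plotly.express as px

    # Each row of df is one node of the kdqTree; kss colors nodes by how much
    # their local distribution shifted between the 'build' and 'test' trees.
    df = kp.to_plotly_dataframe(tree_id1="build", tree_id2="test")
    fig = px.treemap(
        data_frame=df,
        names="name",
        ids="idx",
        parents="parent_idx",
        values="cell_count",
        color="kss",
    )
    fig.show()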

menelaus.partitioners.NNSpacePartitioner

class menelaus.partitioners.NNSpacePartitioner.NNSpacePartitioner(k: int)[source]

Bases: object

This class encodes the Nearest Neighbors Space Partitioning scheme (NNSP) for use in the Nearest Neighbors Density Variation Identification (NN-DVI) drift detection algorithm, both of which are introduced in Liu et al. (2018).

Broadly, NNSP combines two input data samples, finds the adjacency matrix using nearest-neighbor search, and transforms the data points into a set of shared subspaces; a distance function then estimates the difference in density between the two samples over these subspaces.

k

the ‘k’ in k-Nearest-Neighbor (k-NN) search

Type

int

D

combined data from two samples

Type

numpy.array

v1

indices of first sample data within D

Type

numpy.array

v2

indices of second sample data within D

Type

numpy.array

nnps_matrix

NNSP shared-subspace representation of D and its adjacency matrix

Type

numpy.array

adjacency_matrix

result of k-NN search on D

Type

numpy.array

__init__(k: int)[source]
Parameters

k (int) – the ‘k’ in k-NN search, describing size of searched neighborhood

build(sample1: array, sample2: array)[source]

Builds an NNSP representation matrix given two samples of data. Internally stores computed union set, adjacency matrix, index arrays for the two samples, and NNSP representation matrix.

Parameters
  • sample1 (numpy.array) – first (possibly the ‘reference’) sample set

  • sample2 (numpy.array) – second (possibly the ‘test’) sample set

static compute_nnps_distance(nnps_matrix, v1, v2)[source]

Breaks the NNSP representation matrix into per-sample matrices using the index arrays, then computes the difference in shared-subspace densities between the two samples.

Parameters
  • nnps_matrix (numpy.array) – NNSP representation matrix

  • v1 (numpy.array) – indices of first sample in D, nnps_matrix, in one-hot encoding format

  • v2 (numpy.array) – indices of second sample in D, nnps_matrix, in one-hot encoding format

Returns

distance value between samples, representing the difference in shared-subspace densities

Return type

float
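
Putting these pieces together, a minimal sketch of the NNSP workflow might look like the following; the sample data and the choice of k are illustrative only:

    import numpy as np
    from menelaus.partitioners.NNSpacePartitioner import NNSpacePartitioner

    # Illustrative reference and test samples with the same number of features.
    sample1 = np.random.normal(0.0, 1.0, (200, 3))
    sample2 = np.random.normal(0.5, 1.0, (200, 3))

    nnsp = NNSpacePartitioner(k=30)
    nnsp.build(sample1, sample2)  # stores D, v1, v2, adjacency_matrix, nnps_matrix

    # Distance reflecting the difference in shared-subspace densities.
    dist = NNSpacePartitioner.compute_nnps_distance(
        nnsp.nnps_matrix, nnsp.v1, nnsp.v2
    )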