menelaus.partitioners
Partitioner algorithms are typically an early data-processing component for drift detection methods. Data may be partitioned into some new structure (e.g., trees, histograms, or other distributional representations). Separate structures (representing separate batches of data) may then be more efficiently compared by drift detection algorithms in order to raise alarms.
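As a toy illustration of this partition-then-compare pattern, the sketch below bins two one-dimensional batches into a histogram and scores their difference with a smoothed KL divergence. The function names, bin edges, and smoothing constant are illustrative only, not part of this package.

```python
import math
import random

def histogram_counts(batch, edges):
    """Partition a 1-D batch into histogram bins defined by edges."""
    counts = [0] * (len(edges) - 1)
    for x in batch:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def kl_divergence(ref_counts, test_counts, eps=1e-9):
    """Smoothed KL divergence between two empirical distributions."""
    ref_total = sum(ref_counts)
    test_total = sum(test_counts)
    kl = 0.0
    for r, t in zip(ref_counts, test_counts):
        p = r / ref_total + eps
        q = t / test_total + eps
        kl += p * math.log(p / q)
    return kl

random.seed(0)
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [random.random() for _ in range(500)]      # stationary batch
drifted = [random.random() ** 2 for _ in range(500)]   # batch skewed toward 0
same = kl_divergence(histogram_counts(reference, edges),
                     histogram_counts(reference, edges))
diff = kl_divergence(histogram_counts(reference, edges),
                     histogram_counts(drifted, edges))
assert diff > same  # the drifted batch diverges more from the reference
```

A drift detector would raise an alarm when a divergence like `diff` exceeds some threshold; the tree-based partitioners below generalize the fixed bins to data-driven cells.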
menelaus.partitioners.KDQTreePartitioner
- class menelaus.partitioners.KDQTreePartitioner.KDQTreeNode(num_samples_in_compared_subtrees, axis, midpoint_at_axis, left, right)[source]
Bases:
object
This class represents a node in the KDQ Tree data structure described above. Its static methods provide the engine for recursively building, filling, and displaying trees.
Build / fill trees are identified by a tree ID. During initialization, a node typically takes num_samples_in_compared_subtrees (among other values needed for building), which is a dict containing counts for each tree ID. Each node stores the number of data points contained within itself and its subtrees as a value in this dict. A tree_id key identifies which build those counts are associated with (e.g., the first call to build() stores counts in node.num_samples_in_compared_subtrees['build'] by default). Subsequent calls to fill() with new data store additional counts for the node, according to the ID the user passes in.
- num_samples_in_compared_subtrees
number of data points, by tree ID, contained within node and its children
- Type
dict
- axis
axis at which to build/fill (for recursive construction)
- Type
int
- midpoint_at_axis
midpoint at provided axis
- Type
float
- left
left child
- Type
KDQTreeNode
- right
right child
- Type
KDQTreeNode
- __init__(num_samples_in_compared_subtrees, axis, midpoint_at_axis, left, right)[source]
- Parameters
num_samples_in_compared_subtrees (dict) – number of data points, by tree ID, contained within node and its children
axis (int) – axis at which to build/fill (for recursive construction)
midpoint_at_axis (float) – midpoint at provided axis
left (KDQTreeNode) – left child
right (KDQTreeNode) – right child
- static as_flattened_array(node, tree_id1, tree_id2, output=[], name='kdqTree', parent_idx=None, depth=0, input_cols=None)[source]
Generates a list containing dicts with information about each node’s structure for the tree rooted at node.
- Parameters
node (KDQTreeNode) – root node of desired subtree
tree_id1 (str) – identifier for reference tree
tree_id2 (str) – identifier for test tree
output (list, optional) – list of dictionaries containing information about each node. Default [].
name (str) – name of root node. Default 'kdqTree'.
parent_idx (int, optional) – unique ID for parent of current node. Default None.
depth (int, optional) – depth of current subtree. Default 0.
input_cols (list, optional) – list of column names for the input data. Default None.
- static as_text(node, tree_id='build')[source]
Produce a text representation of the tree structure and its counts.
- Parameters
node (KDQTreeNode) – root node of desired subtree
tree_id (str, optional) – identifier for desired subtree. Default 'build'.
- static build(data, count_ubound, min_cutpoint_sizes, leaves, depth=0)[source]
Creates a new kdqTree by partitioning the data into square nodes in the feature-space.
- Parameters
data (numpy.array) – data to be partitioned
count_ubound (int) – upper bound of count of points to be put into any one cell
min_cutpoint_sizes (list) – minimum sizes of features that a leaf can have
leaves (list) – list of leaf nodes
depth (int, optional) – current depth of tree. Default 0.
- Returns
root node of tree
- Return type
KDQTreeNode
- static fill(data, node, count_ubound, tree_id, reset=False)[source]
For a new sample, partition it according to the pre-existing subtree rooted at node. Its counts will be added to those for the tree_id index.
- Parameters
data (numpy.array) – new data to be partitioned into existing tree
node (KDQTreeNode) – current node
count_ubound (int) – upper bound of points in any one cell
tree_id (str) – identifier for new data counts to be stored
reset (bool, optional) – if True, counts for this tree_id will be overwritten with ones calculated from the current sample, else added. Default False.
- static reset(node, value, tree_id)[source]
For the subtree rooted at node, set the counts for tree_id equal to value.
- Parameters
node (KDQTreeNode) – root of subtree
value (int) – value to set counts to
tree_id (str) – identifier for tree
- class menelaus.partitioners.KDQTreePartitioner.KDQTreePartitioner(count_ubound=200, cutpoint_proportion_lbound=0.25)[source]
Bases:
object
This class encapsulates a KDQ Tree for partitioning data that runs through drift detection algorithms, as described by Dasu et al. (2006).
The general algorithm builds the tree by repeating the following steps until a stop condition is met:
1. Identify the axis to split upon.
2. Identify the midpoint at that axis.
3. Split into lower/upper halves using the midpoint.
4. Recursively split the sub-trees.
This class can be used to build a tree, ‘fill’ a tree (send new data to be forcibly partitioned according to an existing build), access leaf nodes, and perform visualization / printing. There are also convenience functions to calculate distances between a built and filled tree.
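The four build steps above can be sketched in plain Python. The dict-based nodes, stop conditions, and names here are illustrative only, not the library's KDQTreeNode implementation:

```python
def build(points, count_ubound=2, depth=0):
    """Recursively partition points (a list of tuples) into a kdq-style tree."""
    n_features = len(points[0])
    axis = depth % n_features                    # 1. identify axis to split upon
    lo = min(p[axis] for p in points)
    hi = max(p[axis] for p in points)
    midpoint = (lo + hi) / 2                     # 2. identify midpoint at axis
    # Stop when the cell is small enough or cannot be split further.
    if len(points) <= count_ubound or lo == hi:
        return {"count": len(points)}            # leaf cell
    lower = [p for p in points if p[axis] <= midpoint]   # 3. split into halves
    upper = [p for p in points if p[axis] > midpoint]
    if not lower or not upper:
        return {"count": len(points)}
    return {                                     # 4. recursively split sub-trees
        "axis": axis,
        "midpoint": midpoint,
        "count": len(points),
        "left": build(lower, count_ubound, depth + 1),
        "right": build(upper, count_ubound, depth + 1),
    }

tree = build([(0.1, 0.2), (0.9, 0.8), (0.2, 0.9), (0.8, 0.1), (0.5, 0.5)])
assert tree["count"] == 5
```

Each leaf's count plays the role of a histogram bin; "filling" a tree reuses the stored axes and midpoints to route new data into the same cells.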
- count_ubound
upper bound of count of data points to be partitioned into any one cell
- Type
int
- cutpoint_proportion_lbound
min. proportion of all features’ range values that any constructed cell size must satisfy
- Type
float
- node
reference to root node of build tree
- Type
KDQTreeNode
- leaves
list of leaf nodes of build tree
- Type
list(KDQTreeNode)
- __init__(count_ubound=200, cutpoint_proportion_lbound=0.25)[source]
- Parameters
count_ubound (int, optional) – upper bound of count of data points to be partitioned into any one cell. Default 200.
cutpoint_proportion_lbound (float, optional) – min. proportion of all features’ range values that any constructed cell size must satisfy. Default 0.25.
- build(data)[source]
Creates a new kdqTree by partitioning the data into square nodes in the feature-space.
- Parameters
data (numpy.array) – data to be partitioned
- Returns
root node of constructed KDQ-Tree
- Return type
KDQTreeNode
- fill(data, tree_id, reset=False)[source]
For a new sample of data, partition it according to the pre-existing tree structure. Its counts will be added to those for the tree_id index.
- Parameters
data (numpy.array) – new data to be partitioned
tree_id (str) – identifier for new data counts to be stored
reset (bool, optional) – if True, counts for this tree_id will be overwritten with ones calculated from the current sample, else added. Default False.
- Returns
root node of KDQ-Tree
- Return type
KDQTreeNode
- kl_distance(tree_id1, tree_id2)[source]
For the empirical distributions defined by the counts at tree_id1 and tree_id2, calculate the Kullback-Leibler divergence from tree_id2 to tree_id1.
- Parameters
tree_id1 (str) – identifier for reference tree
tree_id2 (str) – identifier for comparison tree
- Returns
Kullback-Leibler divergence among trees
- Return type
float
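To make the direction of the divergence concrete: KL is asymmetric, so swapping the reference and comparison trees generally changes the value. The smoothed-KL helper and leaf counts below are hypothetical illustrations, not the library's code:

```python
import math

def kl(p_counts, q_counts, eps=1e-9):
    """Smoothed KL divergence KL(P || Q) over matched histogram bins."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    total = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total + eps
        q = qc / q_total + eps
        total += p * math.log(p / q)
    return total

ref = [40, 30, 20, 10]   # hypothetical leaf counts for the reference tree
test = [25, 25, 25, 25]  # hypothetical leaf counts for the comparison tree
assert abs(kl(ref, ref)) < 1e-6        # identical trees diverge by ~0
assert kl(ref, test) != kl(test, ref)  # KL is asymmetric: direction matters
```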
- leaf_counts(tree_id)[source]
Return the counts for each leaf of the tree at tree_id.
- Parameters
tree_id (str) – identifier of tree for which to return counts
- Returns
list of counts at leaves of KDQ-Tree
- Return type
list
- reset(value=0, tree_id='build')[source]
Sets the counts for the given tree_id to the given value.
- Parameters
value (int, optional) – value to be written for each node. Default 0.
tree_id (str, optional) – identifier for set of counts to be overwritten. Default 'build'.
- to_plotly_dataframe(tree_id1='build', tree_id2=None, max_depth=None, input_cols=None)[source]
Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly. DataFrame columns capture:
- name, a label corresponding to which feature this split is on.
- idx, a unique ID for the node, to pass to plotly.express.treemap’s id argument.
- parent_idx, the ID of the node’s parent.
- cell_count, how many samples are in this node in the reference tree.
- depth, how deep the node is in the tree.
- count_diff, if tree_id2 is specified, the change in counts from the reference tree.
- kss, the Kulldorff Spatial Scan Statistic for this node, defined as the KL-divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.
- Parameters
tree_id1 (str) – identifier for reference tree. Default 'build'.
tree_id2 (str) – identifier for test tree. Default None.
max_depth (int, optional) – tree depth up to which the method recurses. Default None.
input_cols (list, optional) – list of column names for the input data. Default None.
- Returns
dataframe where each row represents a node
- Return type
pandas.DataFrame
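The kss column's two-bin construction can be illustrated standalone: treat one node's count as the first bin and all remaining counts combined as the second, then take the KL divergence between the reference and test trees over those two bins. This is a hedged sketch of that statistic, not the library's implementation:

```python
import math

def kss(node_ref, total_ref, node_test, total_test, eps=1e-9):
    """Two-bin KL divergence for one node: bin 1 is the node's own count,
    bin 2 is the count of all other nodes combined."""
    # Reference distribution over the two bins
    p_in = node_ref / total_ref + eps
    p_out = (total_ref - node_ref) / total_ref + eps
    # Test distribution over the two bins
    q_in = node_test / total_test + eps
    q_out = (total_test - node_test) / total_test + eps
    return p_in * math.log(p_in / q_in) + p_out * math.log(p_out / q_out)

# A node holding 10% of reference samples but 30% of test samples
changed = kss(node_ref=10, total_ref=100, node_test=30, total_test=100)
# A node whose share is unchanged between the trees
stable = kss(node_ref=10, total_ref=100, node_test=10, total_test=100)
assert changed > stable
```

Nodes with a large kss are those whose local density shifted most between the two batches, which is why the statistic is useful for coloring a treemap.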
menelaus.partitioners.NNSpacePartitioner
- class menelaus.partitioners.NNSpacePartitioner.NNSpacePartitioner(k: int)[source]
Bases:
object
This class encodes the Nearest Neighbors Space Partitioning scheme (NNSP) for use in the Nearest Neighbors Density Variation Identification (NN-DVI) drift detection algorithm, both of which are introduced in Liu et al. (2018).
Broadly, NNSP combines two input data samples, finds the adjacency matrix using Nearest Neighbor search, and transforms the data points into a set of shared subspaces for estimating density and changes in density between the two samples, via a distance function.
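The first half of that pipeline, merging the two samples and building an adjacency matrix via k-NN search, can be sketched in plain Python with Euclidean distances. The helper and variable names below are illustrative, not the library's implementation:

```python
def knn_adjacency(D, k):
    """Adjacency matrix where row i flags point i and its k nearest neighbors."""
    n = len(D)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        # Squared Euclidean distances from point i to every other point
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(D[i], D[j])), j)
            for j in range(n) if j != i
        )
        adjacency[i][i] = 1          # each point belongs to its own neighborhood
        for _, j in dists[:k]:
            adjacency[i][j] = 1
    return adjacency

sample1 = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
sample2 = [(1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
D = sample1 + sample2                # union of the two samples
v1 = [1, 1, 1, 0, 0, 0]              # one-hot membership in sample1
v2 = [0, 0, 0, 1, 1, 1]              # one-hot membership in sample2
A = knn_adjacency(D, k=2)
# Each tight cluster's points neighbor each other, not the far cluster
assert A[0][1] == 1 and A[0][3] == 0
```

Each row of the adjacency matrix defines one shared subspace (a point's neighborhood); densities within those subspaces are what the distance function compares.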
- k
the ‘k’ in k-Nearest-Neighbor (k-NN) search
- Type
int
- D
combined data from two samples
- Type
numpy.array
- v1
indices of first sample data within D
- Type
numpy.array
- v2
indices of second sample data within D
- Type
numpy.array
- nnps_matrix
NNSP shared-subspace representation of D and its adjacency matrix
- Type
numpy.array
- adjacency_matrix
result of k-NN search on D
- Type
numpy.array
- __init__(k: int)[source]
- Parameters
k (int) – the ‘k’ in k-NN search, describing size of searched neighborhood
- build(sample1: array, sample2: array)[source]
Builds an NNSP representation matrix given two samples of data. Internally stores computed union set, adjacency matrix, index arrays for the two samples, and NNSP representation matrix.
- Parameters
sample1 (numpy.array) – first (possibly the ‘reference’) sample set
sample2 (numpy.array) – second (possibly the ‘test’) sample set
- static compute_nnps_distance(nnps_matrix, v1, v2)[source]
Breaks the NNSP representation matrix into per-sample NNSP matrices using the index arrays, then computes the difference in densities of shared subspaces between the samples.
- Parameters
nnps_matrix (numpy.array) – NNSP representation matrix
v1 (numpy.array) – indices of first sample in D / nnps_matrix, in one-hot encoding format
v2 (numpy.array) – indices of second sample in D / nnps_matrix, in one-hot encoding format
- Returns
distance value between samples, representing difference in shared subspace densities
- Return type
float
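The density comparison can be illustrated with a simplified stand-in: for each shared subspace (row), compare the mass contributed by each sample and average the normalized differences. This is only a sketch of the idea; the exact NN-DVI distance in Liu et al. (2018) differs:

```python
def nnps_density_distance(nnps_matrix, v1, v2):
    """Mean normalized density difference between two samples across
    shared subspaces (rows). Illustrative stand-in, not the NN-DVI formula."""
    diffs = []
    for row in nnps_matrix:
        mass1 = sum(w for w, in1 in zip(row, v1) if in1)  # sample-1 mass in subspace
        mass2 = sum(w for w, in2 in zip(row, v2) if in2)  # sample-2 mass in subspace
        total = mass1 + mass2
        if total:
            diffs.append(abs(mass1 - mass2) / total)
    return sum(diffs) / len(diffs)

# Rows = shared subspaces; columns = points (first three from sample 1)
v1 = [1, 1, 1, 0, 0, 0]
v2 = [0, 0, 0, 1, 1, 1]
balanced = [[1, 1, 0, 1, 1, 0]] * 6   # every subspace mixes both samples evenly
separated = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3
assert nnps_density_distance(balanced, v1, v2) == 0.0   # no density shift
assert nnps_density_distance(separated, v1, v2) == 1.0  # maximal density shift
```

Intuitively, if the two samples come from the same distribution, each neighborhood contains a balanced mix of both, and the distance stays near zero; drift concentrates one sample's points in their own neighborhoods and pushes the distance up.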