menelaus.data_drift
Data drift detection algorithms focus on detecting changes in the distribution of the variables within datasets. This could include shifts in univariate statistics, such as the range, mean, or standard deviation, or shifts in multivariate relationships between variables, such as shifts in correlations or joint distributions.
Data drift detection algorithms are ideal for researchers seeking to better understand the change of their data over time or for the maintenance of deployed models in situations where labels are unavailable. Labels may not be readily available if obtaining them is computationally expensive or if, due to the nature of the use case, a significant time lag exists between when the models are applied and when the results are verified. Data drift detection is also applicable in unsupervised learning settings.
menelaus.data_drift.cdbd
- class menelaus.data_drift.cdbd.CDBD(divergence='KL', detect_batch=1, statistic='tstat', significance=0.05, subsets=5)[source]
Bases: HistogramDensityMethod
The Confidence Distribution Batch Detection (CDBD) algorithm is a statistical test that seeks to detect concept drift in classifiers without the use of labeled data. It is intended to monitor a classifier’s confidence scores, but it could be applied to any univariate, performance-related statistic obtained from a learner, e.g. posterior probabilities.
This method relies upon three statistics:
KL divergence: the Kullback-Leibler Divergence (KLD) measure
Epsilon: the differences in divergence values between sets of reference and test batches.
Beta: the adaptive threshold, recomputed at each time stamp. It is the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter: either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").
CDBD operates by:
Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.
Computing the distance between reference and test distributions. The KL divergence metric is used to calculate the distance between univariate histograms.
Computing Epsilon.
Computing the adaptive threshold Beta.
Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.
Ref. Lindstrom et al. [2013]
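The Epsilon-versus-Beta decision above can be sketched in a few lines. This is a simplified illustration, not menelaus internals; it uses the "stdev" option, where the scale is simply a chosen number of standard deviations (the "tstat" option would substitute a t critical value for the fixed scale):

```python
from statistics import mean, stdev

def beta_threshold(epsilons, scale=3.0):
    """Adaptive threshold Beta: the mean of past Epsilon values plus a
    scaled standard deviation (the "stdev" variant of the statistic)."""
    return mean(epsilons) + scale * stdev(epsilons)

def drift_detected(epsilons, current_epsilon, scale=3.0):
    """Drift is alarmed when the current Epsilon exceeds Beta."""
    return current_epsilon > beta_threshold(epsilons, scale)

past = [0.10, 0.12, 0.11, 0.13]   # Epsilon values since the last drift
print(drift_detected(past, 0.60))  # large jump in divergence: True
print(drift_detected(past, 0.12))  # within normal variation: False
```

On detection, CDBD would then reset these statistics and adopt the current test batch as the new reference, per the steps above.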
- Epsilon
stores Epsilon values since the last drift detection.
- Type
list
- reference_n
number of samples in reference batch.
- Type
int
- total_epsilon
stores running sum of Epsilon values until drift is detected, initialized to 0.
- Type
int
- bins
number of bins in histograms, equivalent to the square root of the number of samples in the reference batch.
- Type
int
- num_feat
number of features in reference batch.
- Type
int
- lambda
batch number on which last drift was detected.
- Type
int
- distances
For each batch seen (key), stores the distance between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- epsilon_values
For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.
- Type
dict
- thresholds
For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- __init__(divergence='KL', detect_batch=1, statistic='tstat', significance=0.05, subsets=5)[source]
- Parameters
divergence (str) –
divergence measure used to compute distance between histograms. Defaults to "KL" for CDBD.
"H" - Hellinger distance, original use is for HDDDM
"KL" - Kullback-Leibler Divergence, original use is for CDBD
The user can pass in a custom divergence function. Its input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test, and it must return the distance value between the histograms. To be a valid distance metric, it must satisfy non-negativity, identity, symmetry, and the triangle inequality; see, e.g., the one in examples/cdbd_example.py.
detect_batch (int) –
the test batch on which drift will be detected. See the class docstring for more information on this modification. Defaults to 1.
if detect_batch == 1 - CDBD can detect drift on the first test batch passed to the update method. Total batches and batches since reset will be the number of batches passed to HDM plus 1, due to the splitting of the reference batch.
if detect_batch == 2 - CDBD can detect drift on the second test batch passed to the update method.
if detect_batch == 3 - CDBD can detect drift on the third test batch passed to the update method.
statistic (str) –
statistical method used to compute the adaptive threshold. Defaults to "tstat".
"tstat" - t-statistic with the desired significance level and degrees of freedom = 2, for hypothesis testing on two populations.
"stdev" - uses the number of standard deviations deemed significant to compute the threshold.
significance (float) –
statistical significance used to identify the adaptive threshold. Defaults to 0.05.
if statistic == "tstat" - statistical significance of the t-statistic, e.g. 0.05 for a 95% significance level.
if statistic == "stdev" - the number of standard deviations of change around the mean accepted.
subsets (int) –
the number of subsets of reference data to take to compute the initial estimate of Epsilon.
if too small - the initial Epsilon value will be too small, increasing the risk of missing drift.
if too large - the initial Epsilon value will be too large, increasing the risk of false alarms.
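The role of subsets can be illustrated roughly as follows. This is a hypothetical sketch of the idea, not menelaus code: the reference batch is split into random subsets, distances between consecutive subsets' histograms are computed, and the differences between those distances seed the Epsilon statistics. All names here are illustrative:

```python
import numpy as np

def initial_epsilon_stats(reference, subsets=5, bins=None, rng=None):
    """Hypothetical sketch: estimate the mean and standard deviation of
    Epsilon by splitting the reference batch into `subsets` random
    subsets, histogramming each over a shared range, and differencing
    the Hellinger distances between consecutive subsets."""
    rng = np.random.default_rng(rng)
    data = np.asarray(reference, dtype=float)
    n_bins = bins or int(np.sqrt(len(data)))  # sqrt rule from the docstring
    parts = np.array_split(rng.permutation(data), subsets)
    lo, hi = data.min(), data.max()  # aligned bins over the full range
    hists = [np.histogram(p, bins=n_bins, range=(lo, hi))[0] for p in parts]

    def hellinger(a, b):
        a = a / a.sum()
        b = b / b.sum()
        return np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

    dists = [hellinger(hists[i], hists[i + 1]) for i in range(len(hists) - 1)]
    epsilons = np.abs(np.diff(dists))  # Epsilon: change in distance over time
    return epsilons.mean(), epsilons.std()
```

With fewer subsets, fewer Epsilon samples are available, which is why too small a value risks a poor (too-small) initial estimate, as noted above.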
- reset()
Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)[source]
Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.
- Parameters
X (pandas.DataFrame) – initial baseline dataset
y_true (numpy.array) – true labels for dataset - not used by CDBD
y_pred (numpy.array) – predicted labels for dataset - not used by CDBD
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.
- Parameters
X (DataFrame) – next batch of data to detect drift on.
y_true (numpy.ndarray) – true labels of next batch - not used in CDBD
y_pred (numpy.ndarray) – predicted labels of next batch - not used in CDBD
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set the detector’s drift state to "drift", "warning", or None.
- input_type = 'batch'
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.hdddm
- class menelaus.data_drift.hdddm.HDDDM(detect_batch=1, divergence='H', statistic='tstat', significance=0.05, subsets=5)[source]
Bases: HistogramDensityMethod
HDDDM is a batch-based, unsupervised drift detection algorithm that detects changes in feature distributions. It uses the Hellinger distance metric to compare test and reference batches and is capable of detecting gradual or abrupt changes in data.
This method relies upon three statistics:
Hellinger distance: the sum of the normalized, squared differences in frequency counts for each bin between reference and test datasets, averaged across all features.
Epsilon: the differences in Hellinger distances between sets of reference and test batches.
Beta: the adaptive threshold, recomputed at each time stamp. It is the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter: either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").
HDDDM operates by:
Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.
Computing the distance between reference and test distributions. The Hellinger distance is first calculated between each feature in the reference and test batches. Then, the final Hellinger statistic used is the average of each feature’s distance.
Computing Epsilon.
Computing the adaptive threshold Beta.
Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.
Two key modifications were added to Ditzler and Polikar’s presentation of HDDDM:
To answer the research question of “where is drift occurring?”, it stores the distance values and Epsilon values for each feature. These statistics can be used to identify and visualize the features containing the most significant drifts.
The Hellinger distance values are calculated for each feature in the test batch. These values can be accessed when drift occurs using the self.feature_info dictionary.
The Epsilon values for each feature are stored for each set of reference and test batches. For each feature, these values represent the difference between the Hellinger distances within the test and reference batch at time t and those at time t-1. They can be accessed on each update call via the self.feature_epsilons variable, and when drift occurs via the self.feature_info dictionary.
The original algorithm cannot detect drift until it is updated with the third test batch after either a) initialization or b) a reset upon drift, because the threshold for drift detection is defined from the difference Epsilon. To have sufficient values to define this threshold, three batches are needed. The detect_batch parameter can be set such that bootstrapping is used to define this threshold earlier than the third test batch.
if detect_batch == 3, HDDDM will operate as described in Ditzler and Polikar [2011].
if detect_batch == 2, HDDDM will detect drift on the second test batch. On the second test batch, HDDDM uses bootstrapped samples from the reference batch to estimate the mean and standard deviation of Epsilon; this is used to calculate the necessary threshold. On the third test batch, this value is removed from all subsequent calculations.
if detect_batch == 1, HDDDM will detect drift on the first test batch. The initial reference batch is split randomly into two halves. The first half will serve as the original reference batch. The second half will serve as a proxy for the first test batch, allowing us to calculate the distance statistic. When HDDDM is updated with the first actual test batch, it will perform the method for bootstrapping Epsilon described in the above bullet for detect_batch == 2. This allows a Beta threshold to be calculated using the first test batch, allowing for detection of drift on this batch.
Ref. Ditzler and Polikar [2011]
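The Hellinger statistic described above can be sketched as follows. This is a simplified illustration, not the menelaus implementation (which also handles bin alignment across batches): the distance is computed per feature from normalized histogram counts, then averaged across features.

```python
import numpy as np

def hellinger(ref_counts, test_counts):
    """Hellinger distance between two aligned histograms: the square root
    of the sum of squared differences of the square-rooted bin densities."""
    p = np.asarray(ref_counts, dtype=float)
    q = np.asarray(test_counts, dtype=float)
    p /= p.sum()  # normalize frequency counts to densities
    q /= q.sum()
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hellinger_multivariate(ref_hists, test_hists):
    """Final statistic: the per-feature distances, averaged."""
    return np.mean([hellinger(r, t) for r, t in zip(ref_hists, test_hists)])

identical = hellinger([10, 20, 30], [10, 20, 30])  # 0.0: same distribution
disjoint = hellinger([30, 0, 0], [0, 0, 30])       # sqrt(2): no overlap
```

The distance is 0 for identical histograms and attains its maximum, sqrt(2), when the two histograms share no occupied bins.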
- Epsilon
stores Epsilon values since the last drift detection.
- Type
list
- reference_n
number of samples in reference batch.
- Type
int
- total_epsilon
stores running sum of Epsilon values until drift is detected, initialized to 0.
- Type
int
- bins
number of bins in histograms, equivalent to square root of number of samples in reference batch.
- Type
int
- num_feat
number of features in reference batch.
- Type
int
- lambda
batch number on which last drift was detected.
- Type
int
- distances
For each batch seen (key), stores the Hellinger distance between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- epsilon_values
For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.
- Type
dict
- thresholds
For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- __init__(detect_batch=1, divergence='H', statistic='tstat', significance=0.05, subsets=5)[source]
- Parameters
divergence (str) –
divergence measure used to compute distance between histograms. Defaults to "H".
"H" - Hellinger distance, original use is for HDDDM
"KL" - Kullback-Leibler Divergence, original use is for CDBD
The user can pass in a custom divergence function. Its input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test, and it must return the distance value between the histograms. To be a valid distance metric, it must satisfy non-negativity, identity, symmetry, and the triangle inequality; see, e.g., the one in examples/hdddm_example.py.
detect_batch (int) –
the test batch on which drift will be detected. See the class docstring for more information on this modification. Defaults to 1.
if detect_batch == 1 - HDDDM can detect drift on the first test batch passed to the update method. Total batches and batches since reset will be the number of batches passed to HDM plus 1, due to the splitting of the reference batch.
if detect_batch == 2 - HDDDM can detect drift on the second test batch passed to the update method.
if detect_batch == 3 - HDDDM can detect drift on the third test batch passed to the update method.
statistic (str) –
statistical method used to compute the adaptive threshold. Defaults to "tstat".
"tstat" - t-statistic with the desired significance level and degrees of freedom = 2, for hypothesis testing on two populations.
"stdev" - uses the number of standard deviations deemed significant to compute the threshold.
significance (float) –
statistical significance used to identify the adaptive threshold. Defaults to 0.05.
if statistic == "tstat" - statistical significance of the t-statistic, e.g. 0.05 for a 95% significance level.
if statistic == "stdev" - the number of standard deviations of change around the mean accepted.
subsets (int) –
the number of subsets of reference data to take to compute the initial estimate of Epsilon.
if too small - the initial Epsilon value will be too small, increasing the risk of missing drift.
if too large - the initial Epsilon value will be too large, increasing the risk of false alarms.
- reset()
Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)
Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.
- Parameters
X (pandas.DataFrame) – initial baseline dataset
y_true (numpy.array) – true labels for dataset - not used by HDM
y_pred (numpy.array) – predicted labels for dataset - not used by HDM
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.
- Parameters
X (DataFrame) – next batch of data to detect drift on.
y_true (numpy.ndarray) – true labels of next batch - not used in HDDDM
y_pred (numpy.ndarray) – predicted labels of next batch - not used in HDDDM
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set the detector’s drift state to "drift", "warning", or None.
- input_type = 'batch'
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.histogram_density_method
- class menelaus.data_drift.histogram_density_method.HistogramDensityMethod(divergence, detect_batch, statistic, significance, subsets)[source]
Bases: BatchDetector
The Histogram Density Method (HDM) is the base class for both HDDDM and CDBD. HDDDM differs from CDBD by relying upon the Hellinger distance measure while CDBD uses KL divergence.
This method relies upon three statistics:
Distance metric:
Hellinger distance (default if called via HDDDM): the sum of the normalized, squared differences in frequency counts for each bin between reference and test datasets, averaged across all features.
KL divergence (default if called via CDBD): implemented as the Jensen-Shannon distance, a symmetric and bounded measure based upon the Kullback-Leibler Divergence
Optional user-defined distance metric
Epsilon: the differences in distance values between sets of reference and test batches.
Beta: the adaptive threshold, recomputed at each time stamp. It is the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter: either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").
HDM operates by:
Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.
Computing the distance between reference and test distributions. In HDDDM, the Hellinger distance is first calculated between each feature in the reference and test batches. Then, the final Hellinger statistic used is the average of each feature’s distance. In CDBD, the KL divergence metric is used to calculate the distance between univariate histograms.
Computing Epsilon.
Computing the adaptive threshold Beta.
Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.
Two key modifications were added to Ditzler and Polikar’s presentation:
For HDDDM, to answer the question of “where is drift occurring?”, it stores the distance values and Epsilon values for each feature. These statistics can be used to identify and visualize the features containing the most significant drifts.
The Hellinger distance values are calculated for each feature in the test batch. These values can be accessed when drift occurs using the self.feature_info dictionary.
The Epsilon values for each feature are stored for each set of reference and test batches. For each feature, these values represent the difference between the Hellinger distances within the test and reference batch at time t and those at time t-1. They can be accessed on each update call via the self.feature_epsilons variable, and when drift occurs via the self.feature_info dictionary.
The original algorithm cannot detect drift until it is updated with the third test batch after either a) initialization or b) a reset upon drift, because the threshold for drift detection is defined from the difference Epsilon. To have sufficient values to define this threshold, three batches are needed. The detect_batch parameter can be set such that bootstrapping is used to define this threshold earlier than the third test batch.
if detect_batch == 3, HDM will operate as described in Ditzler and Polikar [2011].
if detect_batch == 2, HDM will detect drift on the second test batch. On the second test batch, HDM uses bootstrapped samples from the reference batch to estimate the mean and standard deviation of Epsilon; this is used to calculate the necessary threshold. On the third test batch, this value is removed from all subsequent calculations.
if detect_batch == 1, HDM will detect drift on the first test batch. The initial reference batch is split randomly into two halves. The first half will serve as the original reference batch. The second half will serve as a proxy for the first test batch, allowing us to calculate the distance statistic. When HDM is updated with the first actual test batch, it will perform the method for bootstrapping Epsilon described in the above bullet for detect_batch == 2. This allows a Beta threshold to be calculated using the first test batch, allowing for detection of drift on this batch.
Ref. Lindstrom et al. [2013] and Ditzler and Polikar [2011]
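As noted above, the "KL" option is implemented as the Jensen-Shannon distance, which symmetrizes and bounds the raw Kullback-Leibler divergence by measuring each distribution against their mixture. A minimal sketch of that measure (illustrative only; scipy.spatial.distance.jensenshannon offers an equivalent):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-mass bins of p."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jensen_shannon_distance(ref_counts, test_counts):
    """Square root of the average KL divergence of each distribution
    to their 50/50 mixture; symmetric and bounded, unlike raw KL."""
    p = np.asarray(ref_counts, dtype=float)
    q = np.asarray(test_counts, dtype=float)
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)  # the mixture is nonzero wherever p or q is
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

same = jensen_shannon_distance([5, 5], [5, 5])    # 0.0
far = jensen_shannon_distance([10, 0], [0, 10])   # sqrt(ln 2), the maximum
```

Because the mixture m is positive wherever either histogram has mass, the measure stays finite even when the raw KL divergence would be undefined for disjoint histograms.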
- Epsilon
stores Epsilon values since the last drift detection.
- Type
list
- reference_n
number of samples in reference batch.
- Type
int
- total_epsilon
stores running sum of Epsilon values until drift is detected, initialized to 0.
- Type
int
- distances
For each batch seen (key), stores the distance between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- epsilon_values
For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.
- Type
dict
- thresholds
For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- __init__(divergence, detect_batch, statistic, significance, subsets)[source]
- Parameters
divergence (str or function) –
divergence measure used to compute distance between histograms. Defaults to "H".
"H" - Hellinger distance, original use is for HDDDM
"KL" - Kullback-Leibler Divergence, original use is for CDBD
The user can pass in a custom divergence function. Its input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test, and it must return the distance value between the histograms. To be a valid distance metric, it must satisfy non-negativity, identity, symmetry, and the triangle inequality; see, e.g., those in examples/cdbd_example.py or examples/hdddm_example.py.
detect_batch (int) –
the test batch on which drift will be detected. See the class docstring for more information on this modification. Defaults to 1.
if detect_batch == 1 - HDM can detect drift on the first test batch passed to the update method
if detect_batch == 2 - HDM can detect drift on the second test batch passed to the update method
if detect_batch == 3 - HDM can detect drift on the third test batch passed to the update method
statistic (str) –
statistical method used to compute the adaptive threshold. Defaults to "tstat".
"tstat" - t-statistic with the desired significance level and degrees of freedom = 2, for hypothesis testing on two populations
"stdev" - uses the number of standard deviations deemed significant to compute the threshold
significance (float) –
statistical significance used to identify the adaptive threshold. Defaults to 0.05.
if statistic == "tstat" - statistical significance of the t-statistic, e.g. 0.05 for a 95% significance level
if statistic == "stdev" - the number of standard deviations of change around the mean accepted
subsets (int) –
the number of subsets of reference data to take to compute the initial estimate of Epsilon.
if too small - the initial Epsilon value will be too small, increasing the risk of missing drift.
if too large - the initial Epsilon value will be too large, increasing the risk of false alarms.
- reset()[source]
Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)[source]
Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.
- Parameters
X (pandas.DataFrame) – initial baseline dataset
y_true (numpy.array) – true labels for dataset - not used by HDM
y_pred (numpy.array) – predicted labels for dataset - not used by HDM
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.
- Parameters
X (DataFrame) – next batch of data to detect drift on.
y_true (numpy.ndarray) – true labels of next batch - not used in HDM
y_pred (numpy.ndarray) – predicted labels of next batch - not used in HDM
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set the detector’s drift state to "drift", "warning", or None.
- input_type = 'batch'
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.kdq_tree
- class menelaus.data_drift.kdq_tree.KdqTreeBatch(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Bases: KdqTreeDetector, BatchDetector
Implements the kdqTree drift detection algorithm in a batch data context. Inherits from KdqTreeDetector and BatchDetector (see docs).
kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space by constructing a k-d-quad-tree (kdq-tree). A reference window of initial data is compared to a test window of later data. The Kullback-Leibler divergence between the empirical distributions of the reference and test windows is calculated, and drift is alarmed when a threshold is reached. A kdq-tree is a combination of k-d trees and quad-trees; it is a binary tree (k-d) whose nodes contain square cells (quad) created via sequential splits along each dimension. This structure allows the calculation of the K-L divergence for continuous distributions, as the K-L divergence is defined on probability mass functions. The number of samples in each leaf of the tree forms an empirical distribution for either dataset; this allows us to calculate the K-L divergence.
If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.
The threshold for drift is determined using the desired alpha level by a bootstrap estimate of the critical value of the K-L divergence, drawing a sample of bootstrap_samples repeatedly, 2 * window_size times, from the reference window.
Additionally, the Kulldorff spatial scan statistic (KSS), a special case of the K-L divergence, can be calculated at each node of the kdq-tree; this gives a measure of which regions of the data space have the greatest divergence between the reference and test windows, and can be used to visualize where drift is concentrated. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.
Ref. Dasu et al. [2006]
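The leaf-count comparison at the heart of kdqTree can be illustrated with a deliberately simplified stand-in: here the tree's leaves are replaced by a fixed set of cells, and Laplace smoothing stands in for the algorithm's handling of sparse leaves. This is illustrative only, not menelaus internals:

```python
import numpy as np

def leaf_divergence(ref_leaf_counts, test_leaf_counts, smoothing=0.5):
    """K-L divergence between the empirical distributions induced by
    per-leaf sample counts of the reference and test windows. Laplace
    smoothing keeps the ratio finite when a leaf is empty in one window."""
    p = np.asarray(ref_leaf_counts, dtype=float) + smoothing
    q = np.asarray(test_leaf_counts, dtype=float) + smoothing
    p /= p.sum()  # counts per leaf -> empirical probability mass function
    q /= q.sum()
    return np.sum(p * np.log(p / q))

# Same occupancy pattern -> divergence 0; shifted mass -> large divergence.
low = leaf_divergence([40, 30, 20, 10], [40, 30, 20, 10])
high = leaf_divergence([40, 30, 20, 10], [5, 5, 10, 80])
```

A bootstrap over the reference window would then supply the critical value against which such a divergence is compared, as described above.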
- __init__(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
- Parameters
alpha (float, optional) – Achievable significance level. Defaults to 0.01.
bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.
count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.
cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after
drift_state == "drift"
or initialization.
- set_reference(X, y_true=None, y_pred=None)[source]
Initialize detector with a reference batch. The user may specify an alternate reference batch than the one maintained by kdq-Tree. This will reset the detector.
- Parameters
X (pandas.DataFrame or numpy.array) – baseline dataset
y_true (numpy.array) – actual labels of dataset - not used in KdqTree
y_pred (numpy.array) – predicted labels of dataset - not used in KdqTree
- to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)
Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.
- Parameters
tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, the only tree described. Defaults to "build".
tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".
max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.
input_cols (list, optional) – List of column names for the input data. Defaults to None.
- Returns
A dataframe where each row corresponds to a node, and each column contains some information:
name: a label corresponding to which feature this split is on
idx: a unique ID for the node, to pass to plotly.express.treemap’s id argument
parent_idx: the ID of the node’s parent
cell_count: how many samples are in this node in the reference tree
depth: how deep the node is in the tree
count_diff: if tree_id2 is specified, the change in counts from the reference tree
kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions
- Return type
pd.DataFrame
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new batch. Constructs the reference data’s kdqtree; then, when sufficient samples have been received, puts the test data into the same tree; then, checks divergence between the reference and test data.
The initial batch will be used as the reference at each update step, regardless of drift state. If the user wishes to change the reference batch, use the set_reference method and then continue passing new batches to update.
- Parameters
X (pandas.DataFrame or numpy.array) – if the detector has just been reset or initialized, the reference data; otherwise, a new batch of data to be compared to the reference window.
y_true (numpy.ndarray) – true labels of input data - not used in KdqTree
y_pred (numpy.ndarray) – predicted labels of input data - not used in KdqTree
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
- class menelaus.data_drift.kdq_tree.KdqTreeDetector(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Bases:
object
Parent class for kdqTree-based drift detector classes. Whether reliant on streaming or batch data, kdqTree detectors have some common attributes, logic, and functions.
kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space via constructing a k-d-quad-tree (kdq-tree). A reference window of initial data is compared to a test window of later data. The Kullback-Leibler divergence between the empirical distributions of the reference and test windows is calculated, and drift is alarmed when a threshold is reached. A kdqtree is a combination of k-d trees and quad-trees; it is a binary tree (k-d) whose nodes contain square cells (quad) which are created via sequential splits along each dimension. This structure allows the calculation of the K-L divergence for continuous distributions, as the K-L divergence is defined on probability mass functions. The number of samples in each leaf of the tree is an empirical distribution for either dataset; this allows us to calculate the K-L divergence.
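The leaf-count comparison in the last sentence can be sketched in plain Python. This is a toy illustration rather than the library's implementation; the smoothing constant is an assumption added here to avoid division by zero for empty leaves:

```python
import math

def kl_divergence(ref_counts, test_counts, smooth=1e-6):
    """K-L divergence between two empirical distributions given as
    per-leaf sample counts over the same kdq-tree partition."""
    ref_total = sum(ref_counts) + smooth * len(ref_counts)
    test_total = sum(test_counts) + smooth * len(test_counts)
    div = 0.0
    for r, t in zip(ref_counts, test_counts):
        p = (r + smooth) / ref_total   # reference leaf probability
        q = (t + smooth) / test_total  # test leaf probability
        div += p * math.log(p / q)
    return div

# Identical leaf counts give zero divergence; a reshuffled test
# distribution gives a positive value.
kl_divergence([10, 20, 30], [10, 20, 30])  # 0.0
kl_divergence([10, 20, 30], [30, 20, 10])  # > 0
```

Because both windows are filed into the same tree, the two count vectors are aligned bin-for-bin, which is what makes this comparison well-defined.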
If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.
The threshold for drift is determined using the desired alpha level by a bootstrap estimate for the critical value of the K-L divergence, repeatedly drawing bootstrap_samples samples, each of size 2 * window_size, from the reference window.
Additionally, the Kulldorff spatial scan statistic, which is a special case of the KL-divergence, can be calculated at each node of the kdq-tree, which gives a measure of the regions of the data space which have the greatest divergence between the reference and test windows. This can be used to visualize which regions of data space have the greatest drift. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.
Note that this algorithm could be used with other types of trees; the reference paper and this implementation use kdq-trees.
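The bootstrap threshold can be sketched as follows. This is a simplified, hypothetical illustration: the real detector bootstraps over kdq-tree leaf counts, whereas here the divergence is passed in as a stand-in function:

```python
import random

def bootstrap_critical_value(reference, window_size, divergence,
                             n_boot=500, alpha=0.01):
    """Estimate the critical divergence under 'no drift' by repeatedly
    splitting a resampled pool into two pseudo-windows and measuring
    their divergence; the (1 - alpha) quantile becomes the threshold."""
    divergences = []
    for _ in range(n_boot):
        # Draw 2 * window_size points with replacement from the reference data
        sample = random.choices(reference, k=2 * window_size)
        pseudo_ref, pseudo_test = sample[:window_size], sample[window_size:]
        divergences.append(divergence(pseudo_ref, pseudo_test))
    divergences.sort()
    return divergences[int((1 - alpha) * (n_boot - 1))]
```

In the detector itself, the divergence argument corresponds to the K-L divergence over the tree's leaf counts; any scalar two-sample statistic works for the sketch.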
Note that the current implementation does not explicitly handle categorical data.
Ref. Dasu et al. [2006]
- __init__(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
- Parameters
alpha (float, optional) – Achievable significance level. Defaults to 0.01.
bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.
count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.
cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == "drift" or initialization.
- to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)[source]
Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.
- Parameters
tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, the only tree described. Defaults to "build".
tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".
max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.
input_cols (list, optional) – List of column names for the input data. Defaults to None.
- Returns
A dataframe where each row corresponds to a node, and each column contains some information:
name: a label corresponding to which feature this split is on
idx: a unique ID for the node, to pass to plotly.express.treemap’s id argument
parent_idx: the ID of the node’s parent
cell_count: how many samples are in this node in the reference tree
depth: how deep the node is in the tree
count_diff: if tree_id2 is specified, the change in counts from the reference tree
kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.
- Return type
pd.DataFrame
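The returned columns line up with plotly.express.treemap's arguments. The snippet below sketches the mapping with a hand-built toy table; the rows and the df variable are hypothetical, and the commented-out plotly call assumes plotly is installed:

```python
# Toy stand-in for the dataframe returned by to_plotly_dataframe,
# using the documented column names.
rows = [
    {"name": "root", "idx": "0", "parent_idx": "", "cell_count": 100, "depth": 0},
    {"name": "x1 split", "idx": "1", "parent_idx": "0", "cell_count": 60, "depth": 1},
    {"name": "x1 split", "idx": "2", "parent_idx": "0", "cell_count": 40, "depth": 1},
]

# With plotly installed, idx/parent_idx feed treemap's ids/parents
# arguments (hypothetical usage, assuming the result is in a pandas
# DataFrame df):
#   import plotly.express as px
#   fig = px.treemap(df, ids="idx", parents="parent_idx", names="name",
#                    values="cell_count", color="kss")
#   fig.show()

# Sanity check: every non-root parent_idx refers to an existing idx.
ids = {r["idx"] for r in rows}
assert all(r["parent_idx"] in ids for r in rows if r["parent_idx"])
```

Coloring by kss highlights the regions of data space with the greatest divergence between the reference and test trees.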
- class menelaus.data_drift.kdq_tree.KdqTreeStreaming(window_size, persistence=0.05, alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Bases:
KdqTreeDetector, StreamingDetector
Implements the kdqTree drift detection algorithm in a streaming data context. Inherits from KdqTreeDetector and StreamingDetector (see docs).
kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space via constructing a k-d-quad-tree (kdq-tree).
If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.
The threshold for drift is determined using the desired alpha level by a bootstrap estimate for the critical value of the K-L divergence, repeatedly drawing bootstrap_samples samples, each of size 2 * window_size, from the reference window.
Additionally, the Kulldorff spatial scan statistic, which is a special case of the KL-divergence, can be calculated at each node of the kdq-tree, which gives a measure of the regions of the data space which have the greatest divergence between the reference and test windows. This can be used to visualize which regions of data space have the greatest drift. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.
Ref. Dasu et al. [2006]
- __init__(window_size, persistence=0.05, alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
- Parameters
window_size (int) – The minimum number of samples required to test whether drift has occurred.
persistence (float, optional) – Persistence factor: how many samples in a row, as a proportion of the window size, must be in the “drift region” of K-L divergence, in order for kdqTree to alarm and reset. Defaults to 0.05.
alpha (float, optional) – Achievable significance level. Defaults to 0.01.
bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.
count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.
cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.
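The role of the persistence parameter can be sketched as a consecutive-exceedance counter. This is an illustrative simplification of the streaming alarm logic, not the library's exact code:

```python
def drift_alarm(divergences, critical_value, window_size, persistence=0.05):
    """Alarm only after the divergence has exceeded the bootstrap
    critical value for persistence * window_size consecutive samples."""
    needed = persistence * window_size
    consecutive = 0
    for d in divergences:
        if d > critical_value:
            consecutive += 1
            if consecutive >= needed:
                return True
        else:
            consecutive = 0  # streak broken; start over
    return False
```

Requiring a sustained run of high divergences, rather than a single spike, makes the streaming detector less sensitive to momentary noise.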
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == "drift" or initialization.
- to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)
Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.
- Parameters
tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, the only tree described. Defaults to "build".
tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".
max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.
input_cols (list, optional) – List of column names for the input data. Defaults to None.
- Returns
A dataframe where each row corresponds to a node, and each column contains some information:
name: a label corresponding to which feature this split is on
idx: a unique ID for the node, to pass to plotly.express.treemap’s id argument
parent_idx: the ID of the node’s parent
cell_count: how many samples are in this node in the reference tree
depth: how deep the node is in the tree
count_diff: if tree_id2 is specified, the change in counts from the reference tree
kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.
- Return type
pd.DataFrame
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new sample point. Constructs the reference data’s kdqtree; then, when sufficient samples have been received, puts the test data into the same tree; then, checks divergence between the reference and test data.
The reference window is maintained as the initial window until drift. Upon drift, the user may continue passing data to update and new reference windows will be constructed once sufficient samples are received.
- Parameters
X (pandas.DataFrame or numpy array) – If just reset/initialized, the reference data. Otherwise, a new sample to put into the test window.
y_true (numpy.ndarray) – true labels of input data - not used in KdqTree
y_pred (numpy.ndarray) – predicted labels of input data - not used in KdqTree
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- property samples_since_reset
Number of samples since last drift detection.
- Returns
int
- property total_samples
Total number of samples the drift detector has been updated with.
- Returns
int
menelaus.data_drift.nndvi
- class menelaus.data_drift.nndvi.NNDVI(k_nn: int = 30, sampling_times: int = 500, alpha: float = 0.01)[source]
Bases:
BatchDetector
This class encodes the Nearest Neighbors Density Variation Identification (NN-DVI) drift detection algorithm, introduced in Liu et al. (2018). Note that this implementation is intended for batch datasets, rather than the streaming context.
Broadly, NN-DVI combines a reference and test data batch, creates a normalized version of the subsequent adjacency matrix (after a k-NN search), and then analyzes distance changes in the reference and test sections of the combined adjacency matrix. Those changes are compared against a threshold distance value, which is found by randomly sampling new reference and test sections, then fitting a Gaussian distribution to distance changes for those trials.
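A rough, self-contained sketch of these steps on 1-D data follows. It is a simplification for illustration only: the function names, the distance formula, and the use of scalar points are assumptions here, not the paper's exact construction:

```python
import random

def knn_adjacency(points, k):
    """0/1 adjacency matrix of a k-NN search on scalar points:
    each point is linked to its k nearest neighbors (plus itself)."""
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i, p in enumerate(points):
        order = sorted(range(n), key=lambda j: abs(points[j] - p))
        for j in order[: k + 1]:  # order[0] is i itself, at distance 0
            adj[i][j] = 1
    return adj

def nndvi_distance(points, ref_idx, test_idx, k):
    """Simplified NN-DVI-style distance: 1 minus the fraction of
    reference-section neighborhood links that land in the test section."""
    ref_idx, test_idx = list(ref_idx), set(test_idx)
    adj = knn_adjacency(points, k)
    ref_links = sum(adj[i][j] for i in ref_idx for j in range(len(points)))
    cross_links = sum(adj[i][j] for i in ref_idx for j in test_idx)
    return 1 - cross_links / ref_links

def estimate_threshold(points, n_ref, k, trials=100, z=2.33):
    """Threshold by random re-partitioning: fit a Gaussian to distances
    from random splits and take mean + z * std (z ~ a 1 - alpha quantile)."""
    idx, dists = list(range(len(points))), []
    for _ in range(trials):
        random.shuffle(idx)
        dists.append(nndvi_distance(points, idx[:n_ref], idx[n_ref:], k))
    mean = sum(dists) / len(dists)
    std = (sum((d - mean) ** 2 for d in dists) / len(dists)) ** 0.5
    return mean + z * std
```

When the test batch is drawn from a shifted distribution, reference points keep their neighbors within the reference section, so the distance stays near its maximum and exceeds the randomly-sampled threshold.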
- total_samples
number of batches the drift detector has ever been updated with.
- Type
int
- samples_since_reset
number of batches since the last drift detection.
- Type
int
- drift_state
detector’s current drift state. Can take values "drift", "warning", or None.
- Type
str
- k_nn
the ‘k’ in k-Nearest-Neighbor (k-NN) search
- Type
int
- reference_batch
initial batch of data
- Type
numpy.array
- sampling_times
number of times to perform sampling for threshold estimation
- Type
int
- alpha
significance level for detecting drift
- Type
float
- __init__(k_nn: int = 30, sampling_times: int = 500, alpha: float = 0.01)[source]
- k_nn
the ‘k’ in k-Nearest-Neighbor (k-NN) search. Default 30.
- Type
int, optional
- sampling_times
number of times to perform sampling for threshold estimation. Default 500.
- Type
int, optional
- alpha
significance level for detecting drift. Default 0.01.
- Type
float, optional
- reset()[source]
Initialize relevant attributes to original values, to ensure information is only stored from samples_since_reset onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)[source]
Set the detector’s reference batch to an updated value; typically used in update.
- X
updated reference batch
- Type
numpy.array
- y_true
true labels, not used in NNDVI
- Type
numpy.array
- y_pred
predicted labels, not used in NNDVI
- Type
numpy.array
- update(X: array, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch.
- Parameters
X (numpy.array) – next batch of data to detect drift on.
y_true (numpy.array) – true labels, not used in NN-DVI
y_pred (numpy.array) – predicted labels, not used in NN-DVI
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.pca_cd
- class menelaus.data_drift.pca_cd.PCACD(window_size, ev_threshold=0.99, delta=0.1, divergence_metric='kl', sample_period=0.05, online_scaling=True)[source]
Bases:
StreamingDetector
Principal Component Analysis Change Detection (PCA-CD) is a drift detection algorithm which checks for change in the distribution of the given data using one of several divergence metrics calculated on the data’s principal components.
First, principal components are built from the reference window - the initial window_size samples. New samples from the test window, of the same width, are projected onto these principal components. The divergence metric is calculated on these scores for the reference and test windows; if this metric diverges enough, then we consider drift to have occurred. This threshold is determined dynamically through the use of the Page-Hinkley test.
Once drift is detected, the reference window is replaced with the current test window, and the test window is initialized.
Ref. Qahtan et al. [2015]
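The Page-Hinkley test that sets this threshold dynamically can be sketched in its standard one-sided form. This illustrates the test itself, not the library's internal implementation; the delta and threshold values below are assumptions:

```python
def page_hinkley(values, delta=0.1, threshold=10.0):
    """Page-Hinkley test for an upward change in the mean of a stream.
    Returns the index at which change is flagged, or None."""
    mean = 0.0
    cumulative = 0.0   # PH statistic: cumulative deviation from running mean
    minimum = 0.0      # smallest value the statistic has taken so far
    for t, x in enumerate(values, start=1):
        mean += (x - mean) / t          # running mean of the stream
        cumulative += x - mean - delta  # deviation, penalized by delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return t - 1                # index where the alarm fires
    return None
```

The statistic accumulates each observation's deviation from the running mean, discounted by delta; an alarm fires once it rises more than threshold above its historical minimum, which is what makes small drifts detectable without a fixed cutoff on the divergence itself.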
- step
how frequently (by number of samples) to detect drift. This is either 100 samples or sample_period * window_size, whichever is smaller.
- Type
int
- ph_threshold
threshold parameter for the internal Page-Hinkley detector. Takes the value of .01 * window_size.
- Type
float
- num_pcs
the number of principal components being used to meet the specified ev_threshold parameter.
- Type
int
- __init__(window_size, ev_threshold=0.99, delta=0.1, divergence_metric='kl', sample_period=0.05, online_scaling=True)[source]
- Parameters
window_size (int) – size of the reference window. Note that PCA_CD will only try to detect drift periodically, either every 100 observations or 5% of the window_size, whichever is smaller.
ev_threshold (float, optional) – Threshold for percent explained variance required when selecting number of principal components. Defaults to 0.99.
delta (float, optional) – Parameter for Page Hinkley test. Minimum amplitude of change in data needed to sound alarm. Defaults to 0.1.
divergence_metric (str, optional) – divergence metric to use when comparing the two distributions to detect drift. Defaults to “kl”.
“kl” - Jensen-Shannon distance, a symmetric, bounded form of Kullback-Leibler divergence; uses kernel density estimation with an Epanechnikov kernel.
“intersection” - intersection area under the curves for the estimated density functions; uses histograms to estimate densities of windows. A discontinuous, less accurate estimate that should only be used when efficiency is of concern.
sample_period (float, optional) – how often to check for drift. This is 100 samples or sample_period * window_size, whichever is smaller. Default .05, or 5% of the window size.
online_scaling (bool, optional) – whether to standardize the data as it comes in, using the reference window, before applying PCA. Defaults to True.
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == 'drift'.
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new observation.
- Parameters
X (numpy.ndarray) – next observation
y_true (numpy.ndarray) – true label of observation - not used in PCACD
y_pred (numpy.ndarray) – predicted label of observation - not used in PCACD
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- input_type = 'streaming'
- property samples_since_reset
Number of samples since last drift detection.
- Returns
int
- property total_samples
Total number of samples the drift detector has been updated with.
- Returns
int