menelaus.data_drift

Data drift detection algorithms focus on detecting changes in the distribution of the variables within datasets. This could include shifts in univariate statistics, such as the range, mean, or standard deviation, or shifts in multivariate relationships between variables, such as shifts in correlations or joint distributions.

Data drift detection algorithms are ideal for researchers seeking to better understand how their data change over time, or for maintaining deployed models in situations where labels are unavailable. Labels may not be readily available if obtaining them is computationally expensive or if, due to the nature of the use case, a significant time lag exists between when the models are applied and when the results are verified. Data drift detection is also applicable in unsupervised learning settings.

menelaus.data_drift.cdbd

class menelaus.data_drift.cdbd.CDBD(divergence='KL', detect_batch=1, statistic='tstat', significance=0.05, subsets=5)[source]

Bases: HistogramDensityMethod

The Confidence Distribution Batch Detection (CDBD) algorithm is a statistical test that seeks to detect concept drift in classifiers without the use of labeled data. It is intended to monitor a classifier’s confidence scores, but it can be applied to any univariate, performance-related statistic obtained from a learner, e.g. posterior probabilities.

This method relies upon three statistics:

  • KL divergence: the Kullback-Leibler Divergence (KLD) measure

  • Epsilon: the differences in Hellinger distances between sets of reference and test batches.

  • Beta: the adaptive threshold adapted at each time stamp. It is based on the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter. It is either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").

CDBD operates by:

  1. Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.

  2. Computing the distance between reference and test distributions. The KL divergence metric is used to calculate the distance between univariate histograms.

  3. Computing Epsilon.

  4. Computing the adaptive threshold Beta.

  5. Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.

Ref. Lindstrom et al. [2013]
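
As a hedged sketch of steps 1 and 2 above (not the menelaus implementation; names such as kl_divergence are our own), the histogram construction and distance computation for a univariate stream of confidence scores might look like:

```python
import numpy as np

def kl_divergence(ref_hist, test_hist, smooth=1e-10):
    """K-L divergence between two histogram density estimates,
    smoothed to avoid division by zero in empty bins."""
    p = ref_hist.astype(float) + smooth
    q = test_hist.astype(float) + smooth
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
ref_scores = rng.beta(8, 2, size=400)   # stand-in for classifier confidence scores
test_scores = rng.beta(4, 4, size=400)  # confidence distribution after a drift

# Step 1: histograms with sqrt(n) bins, aligned over a shared range
n_bins = int(np.sqrt(len(ref_scores)))
lo = min(ref_scores.min(), test_scores.min())
hi = max(ref_scores.max(), test_scores.max())
ref_hist, _ = np.histogram(ref_scores, bins=n_bins, range=(lo, hi))
test_hist, _ = np.histogram(test_scores, bins=n_bins, range=(lo, hi))

# Step 2: distance between the reference and test distributions
distance = kl_divergence(ref_hist, test_hist)
```

The resulting distances feed into the Epsilon and Beta statistics described in steps 3-5.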

Epsilon

stores Epsilon values since the last drift detection.

Type

list

reference_n

number of samples in reference batch.

Type

int

total_epsilon

stores running sum of Epsilon values until drift is detected, initialized to 0.

Type

int

bins

number of bins in histograms, equivalent to square root of number of samples in reference batch.

Type

int

num_feat

number of features in reference batch.

Type

int

lambda

batch number on which last drift was detected.

Type

int

distances

For each batch seen (key), stores the distance between test and reference batch (value). Useful for visualizing drift detection statistics.

Type

dict

epsilon_values

For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.

Type

dict

thresholds

For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.

Type

dict

__init__(divergence='KL', detect_batch=1, statistic='tstat', significance=0.05, subsets=5)[source]
Parameters
  • divergence (str) –

    divergence measure used to compute distance between histograms. Default is "KL".

    • ”H” - Hellinger distance, original use is for HDDDM

    • ”KL” - Kullback-Leibler Divergence, original use is for CDBD

    • User can pass in custom divergence function. Input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test. It must return the distance value between histograms. To be a valid distance metric, it must satisfy the following properties: non-negativity, identity, symmetry, and triangle inequality, e.g. the one in examples/cdbd_example.py.

  • detect_batch (int) –

    the test batch on which drift will be detected. See class docstrings for more information on this modification. Defaults to 1.

    • if detect_batch == 1 - CDBD can detect drift on the first test batch passed to the update method. Total batches and batches since reset will be the number of batches passed to HDM plus 1, due to splitting of the reference batch.

    • if detect_batch == 2 - CDBD can detect drift on the second test batch passed to the update method.

    • if detect_batch == 3 - CDBD can detect drift on the third test batch passed to the update method.

  • statistic (str) –

    statistical method used to compute adaptive threshold. Defaults to "tstat"

    • "tstat" - t-statistic with desired significance level and degrees of freedom = 2 for hypothesis testing on two populations.

    • "stdev" - uses number of standard deviations deemed significant to compute threshold.

  • significance (float) –

    statistical significance used to identify adaptive threshold. Defaults to 0.05.

    • if statistic = "tstat" - statistical significance of t-statistic, e.g. .05 for 95% significance level.

    • if statistic = "stdev" - number of standard deviations of change around the mean accepted.

  • subsets (int) –

    the number of subsets of reference data to take to compute initial estimate of Epsilon.

    • if too small - initial Epsilon value will be too small. Increases risk of missing drift.

    • if too high - initial Epsilon value will be too large. Increases risk of false alarms.
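
As a hedged sketch of the custom divergence option described above, the total variation distance satisfies all four required metric properties and matches the described interface (histogram estimates in, scalar distance out); the function name is our own:

```python
import numpy as np

def total_variation(ref_hist, test_hist):
    """Custom divergence: total variation distance between two histogram
    density estimates, i.e. half the L1 distance between the normalized
    bin frequencies. Non-negative, zero iff identical, symmetric, and
    satisfies the triangle inequality."""
    p = np.asarray(ref_hist, dtype=float)
    q = np.asarray(test_hist, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())
```

Per the docstring above, a callable like this could then be passed as the divergence argument in place of "H" or "KL".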

reset()

Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.

set_reference(X, y_true=None, y_pred=None)[source]

Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.

Parameters
  • X (pandas.DataFrame) – initial baseline dataset

  • y_true (numpy.array) – true labels for dataset - not used by CDBD

  • y_pred (numpy.array) – predicted labels for dataset - not used by CDBD

update(X, y_true=None, y_pred=None)[source]

Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.

Parameters
  • X (DataFrame) – next batch of data to detect drift on.

  • y_true (numpy.ndarray) – true labels of next batch - not used in CDBD

  • y_pred (numpy.ndarray) – predicted labels of next batch - not used in CDBD

property batches_since_reset

Number of batches since last drift detection.

Returns

int

property drift_state

Set detector’s drift state to "drift", "warning", or None.

input_type = 'batch'
property total_batches

Total number of batches the drift detector has been updated with.

Returns

int

menelaus.data_drift.hdddm

class menelaus.data_drift.hdddm.HDDDM(detect_batch=1, divergence='H', statistic='tstat', significance=0.05, subsets=5)[source]

Bases: HistogramDensityMethod

HDDDM is a batch-based, unsupervised drift detection algorithm that detects changes in feature distributions. It uses the Hellinger distance metric to compare test and reference batches and is capable of detecting gradual or abrupt changes in data.

This method relies upon three statistics:

  • Hellinger distance: the sum of the normalized, squared differences in frequency counts for each bin between reference and test datasets, averaged across all features.

  • Epsilon: the differences in Hellinger distances between sets of reference and test batches.

  • Beta: the adaptive threshold adapted at each time stamp. It is based on the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter. It is either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").

HDDDM operates by:

  1. Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.

  2. Computing the distance between reference and test distributions. The Hellinger distance is first calculated between each feature in the reference and test batches. Then, the final Hellinger statistic used is the average of each feature’s distance.

  3. Computing Epsilon.

  4. Computing the adaptive threshold Beta.

  5. Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.

Two key modifications were added to Ditzler and Polikar’s presentation of HDDDM:

  • To answer the research question of “where is drift occurring?”, it stores the distance values and Epsilon values for each feature. These statistics can be used to identify and visualize the features containing the most significant drifts.

    • The Hellinger distance values are calculated for each feature in the test batch. These values can be accessed when drift occurs using the self.feature_info dictionary.

    • The Epsilon values for each feature are stored for each set of reference and test batches. For each feature, these values represent the difference between the Hellinger distances within the test and reference batch at time t and those at time t-1. These can be accessed with each update call using the self.feature_epsilons variable. They can also be accessed when drift occurs using the self.feature_info dictionary.

  • The original algorithm cannot detect drift until it is updated with the third test batch after either a) initialization or b) reset upon drift, because the threshold for drift detection is defined from the difference Epsilon. To have sufficient values to define this threshold, three batches are needed. The detect_batch parameter can be set such that bootstrapping is used to define this threshold earlier than the third test batch.

    • if detect_batch == 3, HDDDM will operate as described in Ditzler and Polikar [2011].

    • if detect_batch == 2, HDDDM will detect drift on the second test batch. On the second test batch, HDDDM uses bootstrapped samples from the reference batch to estimate the mean and standard deviation of Epsilon; this is used to calculate the necessary threshold. On the third test batch, this value is removed from all subsequent calculations.

    • if detect_batch == 1, HDDDM will detect drift on the first test batch. The initial reference batch is split randomly into two halves. The first half will serve as the original reference batch. The second half will serve as a proxy for the first test batch, allowing us to calculate the distance statistic. When HDDDM is updated with the first actual test batch, HDDDM will perform the method for bootstrapping Epsilon, as described in the above bullet for detect_batch == 2. This will allow a Beta threshold to be calculated using the first test batch, allowing for detection of drift on this batch.

Ref. Ditzler and Polikar [2011]
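
Steps 1 and 2 above can be sketched as follows (a simplified illustration with hypothetical function names, not the menelaus implementation):

```python
import numpy as np

def hellinger(ref_hist, test_hist):
    """Hellinger distance between two normalized histograms."""
    p = ref_hist / ref_hist.sum()
    q = test_hist / test_hist.sum()
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def hdddm_distance(ref_batch, test_batch):
    """Average per-feature Hellinger distance between aligned histograms."""
    ref = np.asarray(ref_batch, dtype=float)
    test = np.asarray(test_batch, dtype=float)
    n_bins = int(np.sqrt(ref.shape[0]))  # sqrt of reference batch length
    dists = []
    for j in range(ref.shape[1]):
        # align bins over the combined min/max of each feature
        lo = min(ref[:, j].min(), test[:, j].min())
        hi = max(ref[:, j].max(), test[:, j].max())
        rh, _ = np.histogram(ref[:, j], bins=n_bins, range=(lo, hi))
        th, _ = np.histogram(test[:, j], bins=n_bins, range=(lo, hi))
        dists.append(hellinger(rh, th))
    return float(np.mean(dists))
```

The averaged distance is then tracked over batches to form Epsilon, as in steps 3-5.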

Epsilon

stores Epsilon values since the last drift detection.

Type

list

reference_n

number of samples in reference batch.

Type

int

total_epsilon

stores running sum of Epsilon values until drift is detected, initialized to 0.

Type

int

bins

number of bins in histograms, equivalent to square root of number of samples in reference batch.

Type

int

num_feat

number of features in reference batch.

Type

int

lambda

batch number on which last drift was detected.

Type

int

distances

For each batch seen (key), stores the Hellinger distance between test and reference batch (value). Useful for visualizing drift detection statistics.

Type

dict

epsilon_values

For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.

Type

dict

thresholds

For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.

Type

dict

__init__(detect_batch=1, divergence='H', statistic='tstat', significance=0.05, subsets=5)[source]
Parameters
  • divergence (str) –

    divergence measure used to compute distance between histograms. Default is “H”.

    • ”H” - Hellinger distance, original use is for HDDDM

    • ”KL” - Kullback-Leibler Divergence, original use is for CDBD

    • User can pass in custom divergence function. Input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test. It must return the distance value between histograms. To be a valid distance metric, it must satisfy the following properties: non-negativity, identity, symmetry, and triangle inequality, e.g. that in examples/hdddm_example.py.

  • detect_batch (int) –

    the test batch on which drift will be detected. See class docstrings for more information on this modification. Defaults to 1.

    • if detect_batch == 1 - HDDDM can detect drift on the first test batch passed to the update method. Total batches and batches since reset will be number of batches passed to HDM plus 1, due to splitting of reference batch

    • if detect_batch == 2 - HDDDM can detect drift on the second test batch passed to the update method.

    • if detect_batch == 3 - HDDDM can detect drift on the third test batch passed to the update method.

  • statistic (str) –

    statistical method used to compute adaptive threshold. Defaults to "tstat".

    • "tstat" - t-statistic with desired significance level and degrees of freedom = 2 for hypothesis testing on two populations.

    • "stdev" - uses number of standard deviations deemed significant to compute threhsold.

  • significance (float) –

    statistical significance used to identify adaptive threshold. Defaults to 0.05.

    • if statistic = "tstat" - statistical significance of t-statistic, e.g. .05 for 95% significance level.

    • if statistic = "stdev" - number of standard deviations of change around the mean accepted.

  • subsets (int) –

    the number of subsets of reference data to take to compute initial estimate of Epsilon.

    • if too small - initial Epsilon value will be too small. Increases risk of missing drift.

    • if too high - initial Epsilon value will be too large. Increases risk of false alarms.
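
For the "stdev" option described above, the adaptive threshold Beta reduces to a one-liner; the sketch below is illustrative (the names are ours, and the "tstat" variant would instead scale by the t critical value at the chosen significance):

```python
import numpy as np

def beta_threshold_stdev(epsilons, significance):
    """Adaptive threshold Beta for statistic="stdev": the mean of prior
    Epsilon values plus `significance` standard deviations."""
    eps = np.asarray(epsilons, dtype=float)
    return float(eps.mean() + significance * eps.std())

def drift_detected(epsilon_t, past_epsilons, significance=2.0):
    """Step 5 of the algorithm: drift iff current Epsilon exceeds Beta."""
    return epsilon_t > beta_threshold_stdev(past_epsilons, significance)
```

Here significance plays the role described above: the number of standard deviations of change around the mean that is accepted before drift is alarmed.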

reset()

Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.

set_reference(X, y_true=None, y_pred=None)

Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.

Parameters
  • X (pandas.DataFrame) – initial baseline dataset

  • y_true (numpy.array) – true labels for dataset - not used by HDM

  • y_pred (numpy.array) – predicted labels for dataset - not used by HDM

update(X, y_true=None, y_pred=None)[source]

Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.

Parameters
  • X (DataFrame) – next batch of data to detect drift on.

  • y_true (numpy.ndarray) – true labels of next batch - not used in HDDDM

  • y_pred (numpy.ndarray) – predicted labels of next batch - not used in HDDDM

property batches_since_reset

Number of batches since last drift detection.

Returns

int

property drift_state

Set detector’s drift state to "drift", "warning", or None.

input_type = 'batch'
property total_batches

Total number of batches the drift detector has been updated with.

Returns

int

menelaus.data_drift.histogram_density_method

class menelaus.data_drift.histogram_density_method.HistogramDensityMethod(divergence, detect_batch, statistic, significance, subsets)[source]

Bases: BatchDetector

The Histogram Density Method (HDM) is the base class for both HDDDM and CDBD. HDDDM differs from CDBD by relying upon the Hellinger distance measure while CDBD uses KL divergence.

This method relies upon three statistics:

  • Distance metric:

    • Hellinger distance (default if called via HDDDM): the sum of the normalized, squared differences in frequency counts for each bin between reference and test datasets, averaged across all features.

    • KL divergence (default if called via CDBD): the Jensen-Shannon distance, a symmetric and bounded measure based upon the Kullback-Leibler divergence

    • Optional user-defined distance metric

  • Epsilon: the differences in Hellinger distances between sets of reference and test batches.

  • Beta: the adaptive threshold adapted at each time stamp. It is based on the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter. It is either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").

HDM operates by:

  1. Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.

  2. Computing the distance between reference and test distributions. In HDDDM, the Hellinger distance is first calculated between each feature in the reference and test batches. Then, the final Hellinger statistic used is the average of each feature’s distance. In CDBD, the KL divergence metric is used to calculate the distance between univariate histograms.

  3. Computing Epsilon.

  4. Computing the adaptive threshold Beta.

  5. Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.

Two key modifications were added to Ditzler and Polikar’s presentation:

  • For HDDDM, to answer the question of “where is drift occurring?”, it stores the distance values and Epsilon values for each feature. These statistics can be used to identify and visualize the features containing the most significant drifts.

    • The Hellinger distance values are calculated for each feature in the test batch. These values can be accessed when drift occurs using the self.feature_info dictionary.

    • The Epsilon values for each feature are stored for each set of reference and test batches. For each feature, these values represent the difference between the Hellinger distances within the test and reference batch at time t and those at time t-1. These can be accessed with each update call using the self.feature_epsilons variable. They can also be accessed when drift occurs using the self.feature_info dictionary.

  • The original algorithm cannot detect drift until it is updated with the third test batch after either a) initialization or b) reset upon drift, because the threshold for drift detection is defined from the difference Epsilon. To have sufficient values to define this threshold, three batches are needed. The detect_batch parameter can be set such that bootstrapping is used to define this threshold earlier than the third test batch.

    • if detect_batch == 3, HDM will operate as described in Ditzler and Polikar [2011].

    • if detect_batch == 2, HDM will detect drift on the second test batch. On the second test batch, HDM uses bootstrapped samples from the reference batch to estimate the mean and standard deviation of Epsilon; this is used to calculate the necessary threshold. On the third test batch, this value is removed from all subsequent calculations.

    • if detect_batch == 1, HDM will detect drift on the first test batch. The initial reference batch is split randomly into two halves. The first half will serve as the original reference batch. The second half will serve as a proxy for the first test batch, allowing us to calculate the distance statistic. When HDM is updated with the first actual test batch, HDM will perform the method for bootstrapping Epsilon, as described in the above bullet for detect_batch == 2. This will allow a Beta threshold to be calculated using the first test batch, allowing for detection of drift on this batch.

Ref. Lindstrom et al. [2013] and Ditzler and Polikar [2011]
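
The five steps above can be sketched as a minimal batch loop. This is a hedged illustration, not the menelaus implementation: it assumes the "stdev" statistic with a pluggable distance function, the names run_hdm and distance_fn are our own, and details such as bootstrapping of the initial Epsilon are omitted.

```python
import numpy as np

def run_hdm(batches, distance_fn, significance=2.0):
    """Minimal HDM loop: track Epsilon (the change in distance), compare
    it to the adaptive threshold Beta, and reset or grow the reference
    batch. Returns indices of batches on which drift was detected."""
    reference = np.asarray(batches[0], dtype=float)
    epsilons, prev_dist, drift_on = [], None, []
    for t, test in enumerate(batches[1:], start=1):
        test = np.asarray(test, dtype=float)
        dist = distance_fn(reference, test)             # step 2
        if prev_dist is not None:
            epsilons.append(abs(dist - prev_dist))      # step 3: Epsilon
        if len(epsilons) > 1:
            prior = np.asarray(epsilons[:-1])
            beta = prior.mean() + significance * prior.std()  # step 4
            if epsilons[-1] > beta:                     # step 5
                drift_on.append(t)
                reference = test        # drift: reference <- test batch
                epsilons, prev_dist = [], None
                continue
        reference = np.vstack([reference, test])        # no drift: grow
        prev_dist = dist
    return drift_on
```

Any divergence with the metric properties described below (e.g. Hellinger or KL over aligned histograms) can be plugged in as distance_fn.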

Epsilon

stores Epsilon values since the last drift detection.

Type

list

reference_n

number of samples in reference batch.

Type

int

total_epsilon

stores running sum of Epsilon values until drift is detected, initialized to 0.

Type

int

distances

For each batch seen (key), stores the distance between test and reference batch (value). Useful for visualizing drift detection statistics.

Type

dict

epsilon_values

For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.

Type

dict

thresholds

For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.

Type

dict

__init__(divergence, detect_batch, statistic, significance, subsets)[source]
Parameters
  • divergence (str or function) –

    divergence measure used to compute distance between histograms. Default is “H”.

    • ”H” - Hellinger distance, original use is for HDDDM

    • ”KL” - Kullback-Leibler Divergence, original use is for CDBD

    • User can pass in custom divergence function. Input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test. It must return the distance value between histograms. To be a valid distance metric, it must satisfy the following properties: non-negativity, identity, symmetry, and triangle inequality, e.g. that in examples/cdbd_example.py or examples/hdddm_example.py.

  • detect_batch (int) –

    the test batch on which drift will be detected. See class docstrings for more information on this modification. Defaults to 1.

    • if detect_batch == 1 - HDM can detect drift on the first test batch passed to the update method

    • if detect_batch == 2 - HDM can detect drift on the second test batch passed to the update method

    • if detect_batch == 3 - HDM can detect drift on the third test batch passed to the update method

  • statistic (str) –

    statistical method used to compute adaptive threshold. Defaults to "tstat".

    • "tstat" - t-statistic with desired significance level and degrees of freedom = 2 for hypothesis testing on two populations

    • "stdev" - uses number of standard deviations deemed significant to compute threhsold

  • significance (float) –

    statistical significance used to identify adaptive threshold. Defaults to 0.05.

    • if statistic = "tstat" - statistical significance of t-statistic, e.g. .05 for 95% significance level

    • if statistic = "stdev" - number of standard deviations of change around the mean accepted

  • subsets (int) –

    the number of subsets of reference data to take to compute initial estimate of Epsilon.

    • if too small - initial Epsilon value will be too small. Increases risk of missing drift

    • if too high - initial Epsilon value will be too large. Increases risk of false alarms.

reset()[source]

Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.

set_reference(X, y_true=None, y_pred=None)[source]

Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.

Parameters
  • X (pandas.DataFrame) – initial baseline dataset

  • y_true (numpy.array) – true labels for dataset - not used by HDM

  • y_pred (numpy.array) – predicted labels for dataset - not used by HDM

update(X, y_true=None, y_pred=None)[source]

Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.

Parameters
  • X (DataFrame) – next batch of data to detect drift on.

  • y_true (numpy.ndarray) – true labels of next batch - not used in HDM

  • y_pred (numpy.ndarray) – predicted labels of next batch - not used in HDM

property batches_since_reset

Number of batches since last drift detection.

Returns

int

property drift_state

Set detector’s drift state to "drift", "warning", or None.

input_type = 'batch'
property total_batches

Total number of batches the drift detector has been updated with.

Returns

int

menelaus.data_drift.kdq_tree

class menelaus.data_drift.kdq_tree.KdqTreeBatch(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]

Bases: KdqTreeDetector, BatchDetector

Implements the kdqTree drift detection algorithm in a batch data context. Inherits from KdqTreeDetector and BatchDetector (see docs).

kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space by constructing a k-d-quad-tree (kdq-tree). A reference window of initial data is compared to a test window of later data. The Kullback-Leibler divergence between the empirical distributions of the reference and test windows is calculated, and an alarm is raised when it exceeds a threshold. A kdq-tree is a combination of k-d trees and quad-trees: a binary tree (k-d) whose nodes contain square cells (quad) created via sequential splits along each dimension. This structure allows the calculation of the K-L divergence for continuous distributions, since the K-L divergence is defined on probability mass functions: the number of samples in each leaf of the tree forms an empirical distribution for either dataset, from which the K-L divergence is calculated.

If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.

The threshold for drift is determined using the desired alpha level by a bootstrap estimate for the critical value of the K-L divergence, drawing a sample of bootstrap_samples repeatedly, 2 * window_size times, from the reference window.
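
The bootstrap idea can be sketched as follows, shown univariate for brevity (a hedged approximation; the exact resampling scheme in menelaus may differ, and the helper names are ours):

```python
import numpy as np

def kl_from_counts(pa, pb, smooth=1e-10):
    """Smoothed K-L divergence between two histograms of counts."""
    p = (pa + smooth) / (pa + smooth).sum()
    q = (pb + smooth) / (pb + smooth).sum()
    return float(np.sum(p * np.log(p / q)))

def bootstrap_kl_threshold(reference, alpha=0.01, n_samples=500,
                           window_size=100, n_bins=16, seed=0):
    """Estimate the critical K-L value: repeatedly draw two window-sized
    resamples from the reference window, record their divergence, and
    take the (1 - alpha) quantile as the drift-alarm threshold."""
    rng = np.random.default_rng(seed)
    reference = np.asarray(reference, dtype=float)
    lo, hi = reference.min(), reference.max()
    divs = []
    for _ in range(n_samples):
        a = rng.choice(reference, size=window_size, replace=True)
        b = rng.choice(reference, size=window_size, replace=True)
        ha, _ = np.histogram(a, bins=n_bins, range=(lo, hi))
        hb, _ = np.histogram(b, bins=n_bins, range=(lo, hi))
        divs.append(kl_from_counts(ha, hb))
    return float(np.quantile(divs, 1.0 - alpha))
```

Because both resamples come from the reference data, the quantile approximates how large the divergence can get under "no drift", which is what the alpha parameter controls.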

Additionally, the Kulldorff spatial scan statistic, which is a special case of the KL-divergence, can be calculated at each node of the kdq-tree, which gives a measure of the regions of the data space which have the greatest divergence between the reference and test windows. This can be used to visualize which regions of data space have the greatest drift. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.

Ref. Dasu et al. [2006]

__init__(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Parameters
  • alpha (float, optional) – Achievable significance level. Defaults to 0.01.

  • bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.

  • count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.

  • cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.

reset()[source]

Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == "drift" or initialization.

set_reference(X, y_true=None, y_pred=None)[source]

Initialize detector with a reference batch. The user may specify an alternate reference batch than the one maintained by kdq-Tree. This will reset the detector.

Parameters
  • X (pandas.DataFrame or numpy.array) – baseline dataset

  • y_true (numpy.array) – actual labels of dataset - not used in KdqTree

  • y_pred (numpy.array) – predicted labels of dataset - not used in KdqTree

to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)

Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.

Parameters
  • tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, the only tree described. Defaults to "build".

  • tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".

  • max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.

  • input_cols (list, optional) – List of column names for the input data. Defaults to None.

Returns

A dataframe where each row corresponds to a node, and each column contains some information:

  • name: a label corresponding to which feature this split is on

  • idx: a unique ID for the node, to pass plotly.express.treemap’s id argument

  • parent_idx: the ID of the node’s parent

  • cell_count: how many samples are in this node in the reference tree.

  • depth: how deep the node is in the tree

  • count_diff: if tree_id2 is specified, the change in counts from the reference tree.

  • kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.

Return type

pd.DataFrame
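
The kss column's definition above can be sketched for a single node as follows (a hedged illustration; node_kss is our own name, not a menelaus function):

```python
import numpy as np

def node_kss(ref_count, test_count, ref_total, test_total, smooth=1e-10):
    """Kulldorff spatial scan statistic for one node: the K-L divergence
    between reference and test over two bins -- this node's count versus
    the combined count of all other nodes."""
    p = np.array([ref_count, ref_total - ref_count], dtype=float) + smooth
    q = np.array([test_count, test_total - test_count], dtype=float) + smooth
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

Nodes whose share of samples changes most between the reference and test trees receive the largest KSS values, which is what the treemap visualization highlights.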

update(X, y_true=None, y_pred=None)[source]

Update the detector with a new batch. Constructs the reference data’s kdqtree; then, when sufficient samples have been received, puts the test data into the same tree; then, checks divergence between the reference and test data.

The initial batch will be used as the reference at each update step, regardless of drift state. If the user wishes to change reference batch, use the set_reference method and then continue passing new batches to update.

Parameters
  • X (pandas.DataFrame or numpy array) – If just reset/initialized, the reference data. Otherwise, a new batch of data to be compared to the reference window.

  • y_true (numpy.ndarray) – true labels of input data - not used in KdqTree

  • y_pred (numpy.ndarray) – predicted labels of input data - not used in KdqTree

property batches_since_reset

Number of batches since last drift detection.

Returns

int

property drift_state

Get or set the detector’s drift state: "drift", "warning", or None.

property total_batches

Total number of batches the drift detector has been updated with.

Returns

int

class menelaus.data_drift.kdq_tree.KdqTreeDetector(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]

Bases: object

Parent class for kdqTree-based drift detector classes. Whether reliant on streaming or batch data, kdqTree detectors have some common attributes, logic, and functions.

kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space via constructing a k-d-quad-tree (kdq-tree). A reference window of initial data is compared to a test window of later data. The Kullback-Leibler divergence between the empirical distributions of the reference and test windows is calculated, and drift is alarmed when a threshold is reached. A kdqtree is a combination of k-d trees and quad-trees; it is a binary tree (k-d) whose nodes contain square cells (quad) which are created via sequential splits along each dimension. This structure allows the calculation of the K-L divergence for continuous distributions, as the K-L divergence is defined on probability mass functions. The number of samples in each leaf of the tree is an empirical distribution for either dataset; this allows us to calculate the K-L divergence.

If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.

The threshold for drift is determined from the desired alpha level via a bootstrap estimate of the critical value of the K-L divergence, drawing bootstrap_samples samples, each of size 2 * window_size, from the reference window.

Additionally, the Kulldorff spatial scan statistic, which is a special case of the KL-divergence, can be calculated at each node of the kdq-tree, which gives a measure of the regions of the data space which have the greatest divergence between the reference and test windows. This can be used to visualize which regions of data space have the greatest drift. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.

Note that this algorithm could be used with other types of trees; the reference paper and this implementation use kdq-trees.

Note that the current implementation does not explicitly handle categorical data.

Ref. Dasu et al. [2006]
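The K-L computation over the tree's empirical distributions can be illustrated with a simplified numpy sketch, in which fixed histogram bins stand in for the kdq-tree leaves. This is a hedged toy example, not the detector's actual partitioning or thresholding:

```python
import numpy as np

def kl_from_counts(ref_counts, test_counts, eps=1e-8):
    """K-L divergence between two empirical distributions given bin counts.

    A small epsilon keeps empty bins from producing infinite divergence;
    the real kdqTree implementation handles empty leaves differently.
    """
    p = np.clip(ref_counts / ref_counts.sum(), eps, None)
    q = np.clip(test_counts / test_counts.sum(), eps, None)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
bins = np.linspace(-4, 4, 17)  # 16 fixed bins standing in for kdq-tree leaves
ref = rng.normal(0, 1, 500)
test_same = rng.normal(0, 1, 500)
test_drift = rng.normal(1.5, 1, 500)  # shifted mean: a drifted window

kl_same = kl_from_counts(np.histogram(ref, bins)[0],
                         np.histogram(test_same, bins)[0])
kl_drift = kl_from_counts(np.histogram(ref, bins)[0],
                          np.histogram(test_drift, bins)[0])
# divergence is much larger for the drifted window
```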

__init__(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Parameters
  • alpha (float, optional) – Achievable significance level. Defaults to 0.01.

  • bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.

  • count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.

  • cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.

reset()[source]

Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == "drift" or initialization.

to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)[source]

Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.

Parameters
  • tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, this is the only tree described. Defaults to "build".

  • tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".

  • max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.

  • input_cols (list, optional) – List of column names for the input data. Defaults to None.

Returns

A dataframe where each row corresponds to a node, and each column contains some information:

  • name: a label corresponding to which feature this split is on

  • idx: a unique ID for the node, to pass plotly.express.treemap’s id argument

  • parent_idx: the ID of the node’s parent

  • cell_count: how many samples are in this node in the reference tree.

  • depth: how deep the node is in the tree

  • count_diff: if tree_id2 is specified, the change in counts from the reference tree.

  • kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.

Return type

pd.DataFrame

class menelaus.data_drift.kdq_tree.KdqTreeStreaming(window_size, persistence=0.05, alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]

Bases: KdqTreeDetector, StreamingDetector

Implements the kdqTree drift detection algorithm in a streaming data context. Inherits from KdqTreeDetector and StreamingDetector (see docs).

kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space via constructing a k-d-quad-tree (kdq-tree).

If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.

The threshold for drift is determined from the desired alpha level via a bootstrap estimate of the critical value of the K-L divergence, drawing bootstrap_samples samples, each of size 2 * window_size, from the reference window.

Additionally, the Kulldorff spatial scan statistic, which is a special case of the KL-divergence, can be calculated at each node of the kdq-tree, which gives a measure of the regions of the data space which have the greatest divergence between the reference and test windows. This can be used to visualize which regions of data space have the greatest drift. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.

Ref. Dasu et al. [2006]
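The bootstrap threshold idea can be sketched as follows, again with fixed histogram bins standing in for the kdq-tree and with an invented helper name; the actual detector bootstraps over the tree's leaves:

```python
import numpy as np

def bootstrap_kl_threshold(reference, n_bootstrap=500, alpha=0.01,
                           n_bins=16, seed=0):
    """Estimate a critical K-L divergence by resampling the reference window.

    Repeatedly split a resampled pool of 2n points into two halves, compute
    their K-L divergence over fixed bins, and take the (1 - alpha) quantile.
    A simplified stand-in for the kdq-tree-based procedure in the detector.
    """
    rng = np.random.default_rng(seed)
    n = len(reference)
    bins = np.linspace(reference.min(), reference.max(), n_bins + 1)
    eps = 1e-8
    divs = []
    for _ in range(n_bootstrap):
        pool = rng.choice(reference, size=2 * n, replace=True)
        a, b = pool[:n], pool[n:]
        p = np.clip(np.histogram(a, bins)[0] / n, eps, None)
        q = np.clip(np.histogram(b, bins)[0] / n, eps, None)
        divs.append(np.sum(p * np.log(p / q)))
    return float(np.quantile(divs, 1 - alpha))

ref = np.random.default_rng(1).normal(0, 1, 200)
threshold = bootstrap_kl_threshold(ref)
# divergences above this value would be alarmed as drift at the alpha level
```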

__init__(window_size, persistence=0.05, alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Parameters
  • window_size (int) – The minimum number of samples required to test whether drift has occurred.

  • persistence (float, optional) – Persistence factor: how many samples in a row, as a proportion of the window size, must be in the “drift region” of K-L divergence, in order for kdqTree to alarm and reset. Defaults to 0.05.

  • alpha (float, optional) – Achievable significance level. Defaults to 0.01.

  • bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.

  • count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.

  • cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.

reset()[source]

Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == "drift" or initialization.

to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)

Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.

Parameters
  • tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, this is the only tree described. Defaults to "build".

  • tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".

  • max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.

  • input_cols (list, optional) – List of column names for the input data. Defaults to None.

Returns

A dataframe where each row corresponds to a node, and each column contains some information:

  • name: a label corresponding to which feature this split is on

  • idx: a unique ID for the node, to pass plotly.express.treemap’s id argument

  • parent_idx: the ID of the node’s parent

  • cell_count: how many samples are in this node in the reference tree.

  • depth: how deep the node is in the tree

  • count_diff: if tree_id2 is specified, the change in counts from the reference tree.

  • kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.

Return type

pd.DataFrame

update(X, y_true=None, y_pred=None)[source]

Update the detector with a new sample point. Constructs the reference data’s kdqtree; then, when sufficient samples have been received, puts the test data into the same tree; then, checks divergence between the reference and test data.

The reference window is maintained as the initial window until drift. Upon drift, the user may continue passing data to update and new reference windows will be constructed once sufficient samples are received.

Parameters
  • X (pandas.DataFrame or numpy array) – If just reset/initialized, the reference data. Otherwise, a new sample to put into the test window.

  • y_true (numpy.ndarray) – true labels of input data - not used in KdqTree

  • y_pred (numpy.ndarray) – predicted labels of input data - not used in KdqTree

property drift_state

Get or set the detector’s drift state: "drift", "warning", or None.

property samples_since_reset

Number of samples since last drift detection.

Returns

int

property total_samples

Total number of samples the drift detector has been updated with.

Returns

int

menelaus.data_drift.nndvi

class menelaus.data_drift.nndvi.NNDVI(k_nn: int = 30, sampling_times: int = 500, alpha: float = 0.01)[source]

Bases: BatchDetector

This class encodes the Nearest Neighbors Density Variation Identification (NN-DVI) drift detection algorithm, introduced in Liu et al. (2018). Note that this implementation is intended for batch datasets, rather than the streaming context.

Broadly, NN-DVI combines a reference and test data batch, creates a normalized version of the subsequent adjacency matrix (after a k-NN search), and then analyzes distance changes in the reference and test sections of the combined adjacency matrix. Those changes are compared against a threshold distance value, which is found by randomly sampling new reference and test sections, then fitting a Gaussian distribution to distance changes for those trials.
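The first step, building a k-NN adjacency matrix over the combined batches, can be sketched with brute-force numpy. The normalization and Gaussian threshold fitting are omitted here, and the own-batch-neighbor summary below is only an intuition aid, not the NN-DVI distance itself:

```python
import numpy as np

def knn_adjacency(data, k):
    """Boolean adjacency matrix from a k-NN search (self-inclusive), built
    with brute-force pairwise distances; a sketch of NN-DVI's first step."""
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    adj = np.zeros_like(dists, dtype=bool)
    rows = np.repeat(np.arange(len(data)), k)
    adj[rows, nearest.ravel()] = True
    return adj

rng = np.random.default_rng(2)
ref = rng.normal(0, 1, (40, 2))
test = rng.normal(2, 1, (40, 2))  # shifted batch
combined = np.vstack([ref, test])
adj = knn_adjacency(combined, k=5)

# Fraction of each point's neighbors that fall in its own batch; with no
# drift this hovers near 0.5, while well-separated batches push it toward 1.
labels = np.array([0] * 40 + [1] * 40)
own = np.zeros(len(combined))
for i in range(len(combined)):
    nbrs = np.where(adj[i])[0]
    own[i] = np.mean(labels[nbrs] == labels[i])
mean_own = own.mean()
```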

total_samples

number of batches the drift detector has ever been updated with.

Type

int

samples_since_reset

number of batches since the last drift detection.

Type

int

drift_state

detector’s current drift state. Can take values "drift", "warning", or None.

Type

str

k_nn

the ‘k’ in k-Nearest-Neighbor (k-NN) search

Type

int

reference_batch

initial batch of data

Type

numpy.array

sampling_times

number of times to perform sampling for threshold estimation

Type

int

alpha

significance level for detecting drift

Type

float

__init__(k_nn: int = 30, sampling_times: int = 500, alpha: float = 0.01)[source]
Parameters
  • k_nn (int, optional) – the ‘k’ in k-Nearest-Neighbor (k-NN) search. Default 30.

  • sampling_times (int, optional) – number of times to perform sampling for threshold estimation. Default 500.

  • alpha (float, optional) – significance level for detecting drift. Default 0.01.

reset()[source]

Initialize relevant attributes to their original values, to ensure that information is only stored from samples_since_reset onwards. Intended for use after drift_state == 'drift'.

set_reference(X, y_true=None, y_pred=None)[source]

Set the detector’s reference batch to an updated value; typically used in update.

Parameters
  • X (numpy.array) – updated reference batch

  • y_true (numpy.array) – true labels, not used in NNDVI

  • y_pred (numpy.array) – predicted labels, not used in NNDVI

update(X: array, y_true=None, y_pred=None)[source]

Update the detector with a new test batch. If drift is detected, the most recent test batch becomes the new reference batch.

Parameters
  • X (numpy.array) – next batch of data to detect drift on.

  • y_true (numpy.array) – true labels, not used in NN-DVI

  • y_pred (numpy.array) – predicted labels, not used in NN-DVI

property batches_since_reset

Number of batches since last drift detection.

Returns

int

property drift_state

Get or set the detector’s drift state: "drift", "warning", or None.

property total_batches

Total number of batches the drift detector has been updated with.

Returns

int

menelaus.data_drift.pca_cd

class menelaus.data_drift.pca_cd.PCACD(window_size, ev_threshold=0.99, delta=0.1, divergence_metric='kl', sample_period=0.05, online_scaling=True)[source]

Bases: StreamingDetector

Principal Component Analysis Change Detection (PCA-CD) is a drift detection algorithm which checks for change in the distribution of the given data using one of several divergence metrics calculated on the data’s principal components.

First, principal components are built from the reference window - the initial window_size samples. New samples from the test window, of the same width, are projected onto these principal components. The divergence metric is calculated on these scores for the reference and test windows; if this metric diverges enough, then we consider drift to have occurred. This threshold is determined dynamically through the use of the Page-Hinkley test.

Once drift is detected, the reference window is replaced with the current test window, and the test window is initialized.

Ref. Qahtan et al. [2015]
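A compressed numpy sketch of the idea (reference PCs via SVD, project both windows, compare score densities): this uses a crude histogram intersection in place of the library's divergence metrics, and all names below are illustrative, not the PCACD implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
window_size = 300
ref = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], window_size)
test = rng.multivariate_normal([0, 0], [[1, -0.8], [-0.8, 1]], window_size)

# Principal components from the reference window (standardize, then SVD).
mu, sigma = ref.mean(axis=0), ref.std(axis=0)
_, _, vt = np.linalg.svd((ref - mu) / sigma, full_matrices=False)

# Project both windows onto the first reference PC and compare the score
# distributions with a histogram-intersection statistic (the library offers
# KL/JS and intersection metrics; this shows only the intersection idea).
scores_ref = ((ref - mu) / sigma) @ vt[0]
scores_test = ((test - mu) / sigma) @ vt[0]
bins = np.linspace(min(scores_ref.min(), scores_test.min()),
                   max(scores_ref.max(), scores_test.max()), 21)
p = np.histogram(scores_ref, bins, density=True)[0]
q = np.histogram(scores_test, bins, density=True)[0]
widths = np.diff(bins)
intersection = np.sum(np.minimum(p, q) * widths)  # 1.0 = identical densities
divergence = 1.0 - intersection  # large when the correlation structure flips
```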

step

how frequently (by number of samples) to detect drift. This is either 100 samples or sample_period * window_size, whichever is smaller.

Type

int

ph_threshold

threshold parameter for the internal Page-Hinkley detector. Takes the value of .01 * window_size.

Type

float

num_pcs

the number of principal components being used to meet the specified ev_threshold parameter.

Type

int

__init__(window_size, ev_threshold=0.99, delta=0.1, divergence_metric='kl', sample_period=0.05, online_scaling=True)[source]
Parameters
  • window_size (int) – size of the reference window. Note that PCA_CD will only try to detect drift periodically, either every 100 observations or 5% of the window_size, whichever is smaller.

  • ev_threshold (float, optional) – Threshold for percent explained variance required when selecting number of principal components. Defaults to 0.99.

  • delta (float, optional) – Parameter for Page Hinkley test. Minimum amplitude of change in data needed to sound alarm. Defaults to 0.1.

  • divergence_metric (str, optional) –

    divergence metric used to compare the two distributions when detecting drift. Defaults to "kl".

    • "kl" - Jensen-Shannon distance, a symmetric, bounded form of Kullback-Leibler divergence; uses kernel density estimation with an Epanechnikov kernel.

    • "intersection" - intersection area under the curves for the estimated density functions; uses histograms to estimate densities of the windows. A discontinuous, less accurate estimate that should only be used when efficiency is of concern.

  • sample_period (float, optional) – how often to check for drift. This is 100 samples or sample_period * window_size, whichever is smaller. Default .05, or 5% of the window size.

  • online_scaling (bool, optional) – whether to standardize the data as it comes in, using the reference window, before applying PCA. Defaults to True.
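The Page-Hinkley mechanism behind the dynamic threshold can be sketched minimally in pure Python; this simplified one-sided test (an invented helper with toy parameters) only illustrates how a sustained upward shift in a divergence stream triggers an alarm:

```python
def page_hinkley(values, delta=0.1, threshold=3.0):
    """Minimal one-sided Page-Hinkley test: alarm when the cumulative
    positive deviation of the stream from its running mean exceeds a
    threshold. Simplified relative to PCA-CD's internal detector."""
    mean, cum, min_cum = 0.0, 0.0, 0.0
    for t, x in enumerate(values, start=1):
        mean += (x - mean) / t       # running mean of the stream
        cum += x - mean - delta      # cumulative deviation above tolerance
        min_cum = min(min_cum, cum)
        if cum - min_cum > threshold:
            return t                 # sample index at which drift is alarmed
    return None

# A divergence stream that jumps upward halfway through.
stream = [0.05] * 50 + [0.8] * 50
alarm_at = page_hinkley(stream)  # alarms shortly after the jump at sample 50
```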

reset()[source]

Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == 'drift'.

update(X, y_true=None, y_pred=None)[source]

Update the detector with a new observation.

Parameters
  • X (numpy.ndarray) – next observation

  • y_true (numpy.ndarray) – true label of observation - not used in PCACD

  • y_pred (numpy.ndarray) – predicted label of observation - not used in PCACD

property drift_state

Get or set the detector’s drift state: "drift", "warning", or None.

input_type = 'streaming'

property samples_since_reset

Number of samples since last drift detection.

Returns

int

property total_samples

Total number of samples the drift detector has been updated with.

Returns

int