menelaus.data_drift
Data drift detection algorithms focus on detecting changes in the distribution of the variables within datasets. This could include shifts in univariate statistics, such as the range, mean, or standard deviation, or shifts in multivariate relationships between variables, such as shifts in correlations or joint distributions.
Data drift detection algorithms are ideal for researchers seeking to better understand the change of their data over time or for the maintenance of deployed models in situations where labels are unavailable. Labels may not be readily available if obtaining them is computationally expensive or if, due to the nature of the use case, a significant time lag exists between when the models are applied and when the results are verified. Data drift detection is also applicable in unsupervised learning settings.
menelaus.data_drift.cdbd
- class menelaus.data_drift.cdbd.CDBD(divergence='KL', detect_batch=1, statistic='tstat', significance=0.05, subsets=5)[source]
Bases: HistogramDensityMethod
The Confidence Distribution Batch Detection (CDBD) algorithm is a statistical test that seeks to detect concept drift in classifiers without the use of labeled data. It is intended to monitor a classifier’s confidence scores, but it could be applied to any univariate, performance-related statistic obtained from a learner, e.g. posterior probabilities.
This method relies upon three statistics:
KL divergence: the Kullback-Leibler Divergence (KLD) measure
Epsilon: the differences in divergence values between sets of reference and test batches.
Beta: the adaptive threshold, recomputed at each time stamp. It is the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter: either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").
CDBD operates by:
Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.
Computing the distance between reference and test distributions. The KL divergence metric is used to calculate the distance between univariate histograms.
Computing Epsilon.
Computing the adaptive threshold Beta.
Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.
Ref. Lindstrom et al. [2013]
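The Epsilon-versus-Beta decision above can be sketched in a few lines. This is a simplified illustration, not menelaus internals; it uses the "stdev" option, where the scale is simply a chosen number of standard deviations (the "tstat" option would substitute a t critical value for the fixed scale):

```python
from statistics import mean, stdev

def beta_threshold(epsilons, scale=3.0):
    """Adaptive threshold Beta: the mean of past Epsilon values plus a
    scaled standard deviation (the "stdev" variant of the statistic)."""
    return mean(epsilons) + scale * stdev(epsilons)

def drift_detected(epsilons, current_epsilon, scale=3.0):
    """Drift is alarmed when the current Epsilon exceeds Beta."""
    return current_epsilon > beta_threshold(epsilons, scale)

past = [0.10, 0.12, 0.11, 0.13]   # Epsilon values since the last drift
print(drift_detected(past, 0.60))  # large jump in divergence: True
print(drift_detected(past, 0.12))  # within normal variation: False
```

On detection, CDBD would then reset these statistics and adopt the current test batch as the new reference, per the steps above.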
- Epsilon
stores Epsilon values since the last drift detection.
- Type
list
- reference_n
number of samples in reference batch.
- Type
int
- total_epsilon
stores running sum of Epsilon values until drift is detected, initialized to 0.
- Type
int
- bins
number of bins in histograms, equivalent to the square root of the number of samples in the reference batch.
- Type
int
- num_feat
number of features in reference batch.
- Type
int
- lambda
batch number on which last drift was detected.
- Type
int
- distances
For each batch seen (key), stores the distance between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- epsilon_values
For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.
- Type
dict
- thresholds
For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- __init__(divergence='KL', detect_batch=1, statistic='tstat', significance=0.05, subsets=5)[source]
- Parameters
divergence (str) –
divergence measure used to compute distance between histograms. Defaults to "KL" for CDBD.
"H" - Hellinger distance, original use is for HDDDM
"KL" - Kullback-Leibler Divergence, original use is for CDBD
The user can pass in a custom divergence function. Its input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test, and it must return the distance value between the histograms. To be a valid distance metric, it must satisfy non-negativity, identity, symmetry, and the triangle inequality; see, e.g., the one in examples/cdbd_example.py.
detect_batch (int) –
the test batch on which drift will be detected. See the class docstring for more information on this modification. Defaults to 1.
if detect_batch == 1 - CDBD can detect drift on the first test batch passed to the update method. Total batches and batches since reset will be the number of batches passed to HDM plus 1, due to the splitting of the reference batch.
if detect_batch == 2 - CDBD can detect drift on the second test batch passed to the update method.
if detect_batch == 3 - CDBD can detect drift on the third test batch passed to the update method.
statistic (str) –
statistical method used to compute the adaptive threshold. Defaults to "tstat".
"tstat" - t-statistic with the desired significance level and degrees of freedom = 2, for hypothesis testing on two populations.
"stdev" - uses the number of standard deviations deemed significant to compute the threshold.
significance (float) –
statistical significance used to identify the adaptive threshold. Defaults to 0.05.
if statistic == "tstat" - statistical significance of the t-statistic, e.g. 0.05 for a 95% significance level.
if statistic == "stdev" - the number of standard deviations of change around the mean accepted.
subsets (int) –
the number of subsets of reference data to take to compute the initial estimate of Epsilon.
if too small - the initial Epsilon value will be too small, increasing the risk of missing drift.
if too large - the initial Epsilon value will be too large, increasing the risk of false alarms.
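The role of subsets can be illustrated roughly as follows. This is a hypothetical sketch of the idea, not menelaus code: the reference batch is split into random subsets, distances between consecutive subsets' histograms are computed, and the differences between those distances seed the Epsilon statistics. All names here are illustrative:

```python
import numpy as np

def initial_epsilon_stats(reference, subsets=5, bins=None, rng=None):
    """Hypothetical sketch: estimate the mean and standard deviation of
    Epsilon by splitting the reference batch into `subsets` random
    subsets, histogramming each over a shared range, and differencing
    the Hellinger distances between consecutive subsets."""
    rng = np.random.default_rng(rng)
    data = np.asarray(reference, dtype=float)
    n_bins = bins or int(np.sqrt(len(data)))  # sqrt rule from the docstring
    parts = np.array_split(rng.permutation(data), subsets)
    lo, hi = data.min(), data.max()  # aligned bins over the full range
    hists = [np.histogram(p, bins=n_bins, range=(lo, hi))[0] for p in parts]

    def hellinger(a, b):
        a = a / a.sum()
        b = b / b.sum()
        return np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

    dists = [hellinger(hists[i], hists[i + 1]) for i in range(len(hists) - 1)]
    epsilons = np.abs(np.diff(dists))  # Epsilon: change in distance over time
    return epsilons.mean(), epsilons.std()
```

With fewer subsets, fewer Epsilon samples are available, which is why too small a value risks a poor (too-small) initial estimate, as noted above.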
- reset()
Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)[source]
Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.
- Parameters
X (pandas.DataFrame) – initial baseline dataset
y_true (numpy.array) – true labels for dataset - not used by CDBD
y_pred (numpy.array) – predicted labels for dataset - not used by CDBD
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.
- Parameters
X (DataFrame) – next batch of data to detect drift on.
y_true (numpy.ndarray) – true labels of next batch - not used in CDBD
y_pred (numpy.ndarray) – predicted labels of next batch - not used in CDBD
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set the detector’s drift state to "drift", "warning", or None.
- input_type = 'batch'
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.hdddm
- class menelaus.data_drift.hdddm.HDDDM(detect_batch=1, divergence='H', statistic='tstat', significance=0.05, subsets=5)[source]
Bases: HistogramDensityMethod
HDDDM is a batch-based, unsupervised drift detection algorithm that detects changes in feature distributions. It uses the Hellinger distance metric to compare test and reference batches and is capable of detecting gradual or abrupt changes in data.
This method relies upon three statistics:
Hellinger distance: the sum of the normalized, squared differences in frequency counts for each bin between reference and test datasets, averaged across all features.
Epsilon: the differences in Hellinger distances between sets of reference and test batches.
Beta: the adaptive threshold, recomputed at each time stamp. It is the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter: either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").
HDDDM operates by:
Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.
Computing the distance between reference and test distributions. The Hellinger distance is first calculated between each feature in the reference and test batches. Then, the final Hellinger statistic used is the average of each feature’s distance.
Computing Epsilon.
Computing the adaptive threshold Beta.
Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.
Two key modifications were added to Ditzler and Polikar’s presentation of HDDDM:
To answer the research question of “where is drift occurring?”, it stores the distance values and Epsilon values for each feature. These statistics can be used to identify and visualize the features containing the most significant drifts.
The Hellinger distance values are calculated for each feature in the test batch. These values can be accessed when drift occurs using the self.feature_info dictionary.
The Epsilon values for each feature are stored for each set of reference and test batches. For each feature, these values represent the difference between the Hellinger distances within the test and reference batch at time t and those at time t-1. They can be accessed on each update call via the self.feature_epsilons variable, and when drift occurs via the self.feature_info dictionary.
The original algorithm cannot detect drift until it is updated with the third test batch after either a) initialization or b) a reset upon drift, because the threshold for drift detection is defined from the difference Epsilon. To have sufficient values to define this threshold, three batches are needed. The detect_batch parameter can be set such that bootstrapping is used to define this threshold earlier than the third test batch.
if detect_batch == 3, HDDDM will operate as described in Ditzler and Polikar [2011].
if detect_batch == 2, HDDDM will detect drift on the second test batch. On the second test batch, HDDDM uses bootstrapped samples from the reference batch to estimate the mean and standard deviation of Epsilon; this is used to calculate the necessary threshold. On the third test batch, this value is removed from all subsequent calculations.
if detect_batch == 1, HDDDM will detect drift on the first test batch. The initial reference batch is split randomly into two halves. The first half will serve as the original reference batch. The second half will serve as a proxy for the first test batch, allowing us to calculate the distance statistic. When HDDDM is updated with the first actual test batch, it will perform the method for bootstrapping Epsilon described in the above bullet for detect_batch == 2. This allows a Beta threshold to be calculated using the first test batch, allowing for detection of drift on this batch.
Ref. Ditzler and Polikar [2011]
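The Hellinger statistic described above can be sketched as follows. This is a simplified illustration, not the menelaus implementation (which also handles bin alignment across batches): the distance is computed per feature from normalized histogram counts, then averaged across features.

```python
import numpy as np

def hellinger(ref_counts, test_counts):
    """Hellinger distance between two aligned histograms: the square root
    of the sum of squared differences of the square-rooted bin densities."""
    p = np.asarray(ref_counts, dtype=float)
    q = np.asarray(test_counts, dtype=float)
    p /= p.sum()  # normalize frequency counts to densities
    q /= q.sum()
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hellinger_multivariate(ref_hists, test_hists):
    """Final statistic: the per-feature distances, averaged."""
    return np.mean([hellinger(r, t) for r, t in zip(ref_hists, test_hists)])

identical = hellinger([10, 20, 30], [10, 20, 30])  # 0.0: same distribution
disjoint = hellinger([30, 0, 0], [0, 0, 30])       # sqrt(2): no overlap
```

The distance is 0 for identical histograms and attains its maximum, sqrt(2), when the two histograms share no occupied bins.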
- Epsilon
stores Epsilon values since the last drift detection.
- Type
list
- reference_n
number of samples in reference batch.
- Type
int
- total_epsilon
stores running sum of Epsilon values until drift is detected, initialized to 0.
- Type
int
- bins
number of bins in histograms, equivalent to square root of number of samples in reference batch.
- Type
int
- num_feat
number of features in reference batch.
- Type
int
- lambda
batch number on which last drift was detected.
- Type
int
- distances
For each batch seen (key), stores the Hellinger distance between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- epsilon_values
For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.
- Type
dict
- thresholds
For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- __init__(detect_batch=1, divergence='H', statistic='tstat', significance=0.05, subsets=5)[source]
- Parameters
divergence (str) –
divergence measure used to compute distance between histograms. Defaults to "H".
"H" - Hellinger distance, original use is for HDDDM
"KL" - Kullback-Leibler Divergence, original use is for CDBD
The user can pass in a custom divergence function. Its input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test, and it must return the distance value between the histograms. To be a valid distance metric, it must satisfy non-negativity, identity, symmetry, and the triangle inequality; see, e.g., the one in examples/hdddm_example.py.
detect_batch (int) –
the test batch on which drift will be detected. See the class docstring for more information on this modification. Defaults to 1.
if detect_batch == 1 - HDDDM can detect drift on the first test batch passed to the update method. Total batches and batches since reset will be the number of batches passed to HDM plus 1, due to the splitting of the reference batch.
if detect_batch == 2 - HDDDM can detect drift on the second test batch passed to the update method.
if detect_batch == 3 - HDDDM can detect drift on the third test batch passed to the update method.
statistic (str) –
statistical method used to compute the adaptive threshold. Defaults to "tstat".
"tstat" - t-statistic with the desired significance level and degrees of freedom = 2, for hypothesis testing on two populations.
"stdev" - uses the number of standard deviations deemed significant to compute the threshold.
significance (float) –
statistical significance used to identify the adaptive threshold. Defaults to 0.05.
if statistic == "tstat" - statistical significance of the t-statistic, e.g. 0.05 for a 95% significance level.
if statistic == "stdev" - the number of standard deviations of change around the mean accepted.
subsets (int) –
the number of subsets of reference data to take to compute the initial estimate of Epsilon.
if too small - the initial Epsilon value will be too small, increasing the risk of missing drift.
if too large - the initial Epsilon value will be too large, increasing the risk of false alarms.
- reset()
Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)
Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.
- Parameters
X (pandas.DataFrame) – initial baseline dataset
y_true (numpy.array) – true labels for dataset - not used by HDM
y_pred (numpy.array) – predicted labels for dataset - not used by HDM
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.
- Parameters
X (DataFrame) – next batch of data to detect drift on.
y_true (numpy.ndarray) – true labels of next batch - not used in HDDDM
y_pred (numpy.ndarray) – predicted labels of next batch - not used in HDDDM
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set the detector’s drift state to "drift", "warning", or None.
- input_type = 'batch'
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.histogram_density_method
- class menelaus.data_drift.histogram_density_method.HistogramDensityMethod(divergence, detect_batch, statistic, significance, subsets)[source]
Bases: BatchDetector
The Histogram Density Method (HDM) is the base class for both HDDDM and CDBD. HDDDM differs from CDBD by relying upon the Hellinger distance measure while CDBD uses KL divergence.
This method relies upon three statistics:
Distance metric:
Hellinger distance (default if called via HDDDM): the sum of the normalized, squared differences in frequency counts for each bin between reference and test datasets, averaged across all features.
KL divergence (default if called via CDBD): implemented as the Jensen-Shannon distance, a symmetric and bounded measure based upon the Kullback-Leibler Divergence
Optional user-defined distance metric
Epsilon: the differences in distance values between sets of reference and test batches.
Beta: the adaptive threshold, recomputed at each time stamp. It is the mean of Epsilon plus the scaled standard deviation of Epsilon. The scale applied to the standard deviation is determined by the statistic parameter: either the number of standard deviations deemed significant ("stdev") or the t-statistic ("tstat").
HDM operates by:
Estimating density functions of reference and test data using histograms. The number of bins in each histogram equals the square root of the length of the reference window. Bins are aligned by computing the minimum and maximum value for each feature from both the test and reference windows.
Computing the distance between reference and test distributions. In HDDDM, the Hellinger distance is first calculated between each feature in the reference and test batches. Then, the final Hellinger statistic used is the average of each feature’s distance. In CDBD, the KL divergence metric is used to calculate the distance between univariate histograms.
Computing Epsilon.
Computing the adaptive threshold Beta.
Comparing current Epsilon to Beta. If Epsilon > Beta, drift is detected. The new reference batch is now the test batch on which drift was detected. All statistics are reset. If Epsilon <= Beta, drift is not detected. The reference batch is updated to include this most recent test batch. All statistics are maintained.
Two key modifications were added to Ditzler and Polikar’s presentation:
For HDDDM, to answer the question of “where is drift occurring?”, it stores the distance values and Epsilon values for each feature. These statistics can be used to identify and visualize the features containing the most significant drifts.
The Hellinger distance values are calculated for each feature in the test batch. These values can be accessed when drift occurs using the self.feature_info dictionary.
The Epsilon values for each feature are stored for each set of reference and test batches. For each feature, these values represent the difference between the Hellinger distances within the test and reference batch at time t and those at time t-1. They can be accessed on each update call via the self.feature_epsilons variable, and when drift occurs via the self.feature_info dictionary.
The original algorithm cannot detect drift until it is updated with the third test batch after either a) initialization or b) a reset upon drift, because the threshold for drift detection is defined from the difference Epsilon. To have sufficient values to define this threshold, three batches are needed. The detect_batch parameter can be set such that bootstrapping is used to define this threshold earlier than the third test batch.
if detect_batch == 3, HDM will operate as described in Ditzler and Polikar [2011].
if detect_batch == 2, HDM will detect drift on the second test batch. On the second test batch, HDM uses bootstrapped samples from the reference batch to estimate the mean and standard deviation of Epsilon; this is used to calculate the necessary threshold. On the third test batch, this value is removed from all subsequent calculations.
if detect_batch == 1, HDM will detect drift on the first test batch. The initial reference batch is split randomly into two halves. The first half will serve as the original reference batch. The second half will serve as a proxy for the first test batch, allowing us to calculate the distance statistic. When HDM is updated with the first actual test batch, it will perform the method for bootstrapping Epsilon described in the above bullet for detect_batch == 2. This allows a Beta threshold to be calculated using the first test batch, allowing for detection of drift on this batch.
Ref. Lindstrom et al. [2013] and Ditzler and Polikar [2011]
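As noted above, the "KL" option is implemented as the Jensen-Shannon distance, which symmetrizes and bounds the raw Kullback-Leibler divergence by measuring each distribution against their mixture. A minimal sketch of that measure (illustrative only; scipy.spatial.distance.jensenshannon offers an equivalent):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-mass bins of p."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jensen_shannon_distance(ref_counts, test_counts):
    """Square root of the average KL divergence of each distribution
    to their 50/50 mixture; symmetric and bounded, unlike raw KL."""
    p = np.asarray(ref_counts, dtype=float)
    q = np.asarray(test_counts, dtype=float)
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)  # the mixture is nonzero wherever p or q is
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

same = jensen_shannon_distance([5, 5], [5, 5])    # 0.0
far = jensen_shannon_distance([10, 0], [0, 10])   # sqrt(ln 2), the maximum
```

Because the mixture m is positive wherever either histogram has mass, the measure stays finite even when the raw KL divergence would be undefined for disjoint histograms.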
- Epsilon
stores Epsilon values since the last drift detection.
- Type
list
- reference_n
number of samples in reference batch.
- Type
int
- total_epsilon
stores running sum of Epsilon values until drift is detected, initialized to 0.
- Type
int
- distances
For each batch seen (key), stores the distance between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- epsilon_values
For each batch seen (key), stores the Epsilon value between the current and previous test and reference batches (value). Useful for visualizing drift detection statistics. Does not store the bootstrapped estimate of Epsilon, if used.
- Type
dict
- thresholds
For each batch seen (key), stores the Beta thresholds between test and reference batch (value). Useful for visualizing drift detection statistics.
- Type
dict
- __init__(divergence, detect_batch, statistic, significance, subsets)[source]
- Parameters
divergence (str or function) –
divergence measure used to compute distance between histograms. Defaults to "H".
"H" - Hellinger distance, original use is for HDDDM
"KL" - Kullback-Leibler Divergence, original use is for CDBD
The user can pass in a custom divergence function. Its input is two two-dimensional arrays containing univariate histogram estimates of density, one for reference, one for test, and it must return the distance value between the histograms. To be a valid distance metric, it must satisfy non-negativity, identity, symmetry, and the triangle inequality; see, e.g., those in examples/cdbd_example.py or examples/hdddm_example.py.
detect_batch (int) –
the test batch on which drift will be detected. See the class docstring for more information on this modification. Defaults to 1.
if detect_batch == 1 - HDM can detect drift on the first test batch passed to the update method
if detect_batch == 2 - HDM can detect drift on the second test batch passed to the update method
if detect_batch == 3 - HDM can detect drift on the third test batch passed to the update method
statistic (str) –
statistical method used to compute the adaptive threshold. Defaults to "tstat".
"tstat" - t-statistic with the desired significance level and degrees of freedom = 2, for hypothesis testing on two populations
"stdev" - uses the number of standard deviations deemed significant to compute the threshold
significance (float) –
statistical significance used to identify the adaptive threshold. Defaults to 0.05.
if statistic == "tstat" - statistical significance of the t-statistic, e.g. 0.05 for a 95% significance level
if statistic == "stdev" - the number of standard deviations of change around the mean accepted
subsets (int) –
the number of subsets of reference data to take to compute the initial estimate of Epsilon.
if too small - the initial Epsilon value will be too small, increasing the risk of missing drift.
if too large - the initial Epsilon value will be too large, increasing the risk of false alarms.
- reset()[source]
Initialize relevant attributes to original values, to ensure information is only stored from batches_since_reset (lambda) onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)[source]
Initialize detector with a reference batch. After drift, reference batch is automatically set to most recent test batch. Option for user to specify alternative reference batch using this method.
- Parameters
X (pandas.DataFrame) – initial baseline dataset
y_true (numpy.array) – true labels for dataset - not used by HDM
y_pred (numpy.array) – predicted labels for dataset - not used by HDM
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch. If drift is not detected, reference batch is updated to include most recent test batch.
- Parameters
X (DataFrame) – next batch of data to detect drift on.
y_true (numpy.ndarray) – true labels of next batch - not used in HDM
y_pred (numpy.ndarray) – predicted labels of next batch - not used in HDM
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set the detector’s drift state to "drift", "warning", or None.
- input_type = 'batch'
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.kdq_tree
- class menelaus.data_drift.kdq_tree.KdqTreeBatch(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Bases: KdqTreeDetector, BatchDetector
Implements the kdqTree drift detection algorithm in a batch data context. Inherits from KdqTreeDetector and BatchDetector (see docs).
kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space by constructing a k-d-quad-tree (kdq-tree). A reference window of initial data is compared to a test window of later data. The Kullback-Leibler divergence between the empirical distributions of the reference and test windows is calculated, and drift is alarmed when a threshold is reached. A kdq-tree is a combination of k-d trees and quad-trees; it is a binary tree (k-d) whose nodes contain square cells (quad) created via sequential splits along each dimension. This structure allows the calculation of the K-L divergence for continuous distributions, as the K-L divergence is defined on probability mass functions. The number of samples in each leaf of the tree forms an empirical distribution for either dataset; this allows us to calculate the K-L divergence.
If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.
The threshold for drift is determined using the desired alpha level by a bootstrap estimate of the critical value of the K-L divergence, drawing a sample of bootstrap_samples repeatedly, 2 * window_size times, from the reference window.
Additionally, the Kulldorff spatial scan statistic (KSS), a special case of the K-L divergence, can be calculated at each node of the kdq-tree; this gives a measure of which regions of the data space have the greatest divergence between the reference and test windows, and can be used to visualize where drift is concentrated. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.
Ref. Dasu et al. [2006]
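The leaf-count comparison at the heart of kdqTree can be illustrated with a deliberately simplified stand-in: here the tree's leaves are replaced by a fixed set of cells, and Laplace smoothing stands in for the algorithm's handling of sparse leaves. This is illustrative only, not menelaus internals:

```python
import numpy as np

def leaf_divergence(ref_leaf_counts, test_leaf_counts, smoothing=0.5):
    """K-L divergence between the empirical distributions induced by
    per-leaf sample counts of the reference and test windows. Laplace
    smoothing keeps the ratio finite when a leaf is empty in one window."""
    p = np.asarray(ref_leaf_counts, dtype=float) + smoothing
    q = np.asarray(test_leaf_counts, dtype=float) + smoothing
    p /= p.sum()  # counts per leaf -> empirical probability mass function
    q /= q.sum()
    return np.sum(p * np.log(p / q))

# Same occupancy pattern -> divergence 0; shifted mass -> large divergence.
low = leaf_divergence([40, 30, 20, 10], [40, 30, 20, 10])
high = leaf_divergence([40, 30, 20, 10], [5, 5, 10, 80])
```

A bootstrap over the reference window would then supply the critical value against which such a divergence is compared, as described above.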
- __init__(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
- Parameters
alpha (float, optional) – Achievable significance level. Defaults to 0.01.
bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.
count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.
cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after
drift_state == "drift"
or initialization.
- set_reference(X, y_true=None, y_pred=None)[source]
Initialize detector with a reference batch. The user may specify an alternate reference batch than the one maintained by kdq-Tree. This will reset the detector.
- Parameters
X (pandas.DataFrame or numpy.array) – baseline dataset
y_true (numpy.array) – actual labels of dataset - not used in KdqTree
y_pred (numpy.array) – predicted labels of dataset - not used in KdqTree
- to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)
Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.
- Parameters
tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, the only tree described. Defaults to "build".
tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".
max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.
input_cols (list, optional) – List of column names for the input data. Defaults to None.
- Returns
A dataframe where each row corresponds to a node, and each column contains some information:
name: a label corresponding to which feature this split is on
idx: a unique ID for the node, to pass to plotly.express.treemap’s id argument
parent_idx: the ID of the node’s parent
cell_count: how many samples are in this node in the reference tree
depth: how deep the node is in the tree
count_diff: if tree_id2 is specified, the change in counts from the reference tree
kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions
- Return type
pd.DataFrame
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new batch. Constructs the reference data’s kdqtree; then, when sufficient samples have been received, puts the test data into the same tree; then, checks divergence between the reference and test data.
The initial batch will be used as the reference at each update step, regardless of drift state. If the user wishes to change the reference batch, use the set_reference method and then continue passing new batches to update.
- Parameters
X (pandas.DataFrame or numpy.array) – if the detector has just been reset or initialized, the reference data; otherwise, a new batch of data to be compared to the reference window.
y_true (numpy.ndarray) – true labels of input data - not used in KdqTree
y_pred (numpy.ndarray) – predicted labels of input data - not used in KdqTree
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
- class menelaus.data_drift.kdq_tree.KdqTreeDetector(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Bases:
object
Parent class for kdqTree-based drift detector classes. Whether reliant on streaming or batch data, kdqTree detectors have some common attributes, logic, and functions.
kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space via constructing a k-d-quad-tree (kdq-tree). A reference window of initial data is compared to a test window of later data. The Kullback-Leibler divergence between the empirical distributions of the reference and test windows is calculated, and drift is alarmed when a threshold is reached. A kdqtree is a combination of k-d trees and quad-trees; it is a binary tree (k-d) whose nodes contain square cells (quad) which are created via sequential splits along each dimension. This structure allows the calculation of the K-L divergence for continuous distributions, as the K-L divergence is defined on probability mass functions. The number of samples in each leaf of the tree is an empirical distribution for either dataset; this allows us to calculate the K-L divergence.
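The leaf-count comparison in the last sentence can be sketched in plain Python. This is a toy illustration rather than the library's implementation; the smoothing constant is an assumption added here to avoid division by zero for empty leaves:

```python
import math

def kl_divergence(ref_counts, test_counts, smooth=1e-6):
    """K-L divergence between two empirical distributions given as
    per-leaf sample counts over the same kdq-tree partition."""
    ref_total = sum(ref_counts) + smooth * len(ref_counts)
    test_total = sum(test_counts) + smooth * len(test_counts)
    div = 0.0
    for r, t in zip(ref_counts, test_counts):
        p = (r + smooth) / ref_total   # reference leaf probability
        q = (t + smooth) / test_total  # test leaf probability
        div += p * math.log(p / q)
    return div

# Identical leaf counts give zero divergence; a reshuffled test
# distribution gives a positive value.
kl_divergence([10, 20, 30], [10, 20, 30])  # 0.0
kl_divergence([10, 20, 30], [30, 20, 10])  # > 0
```

Because both windows are filed into the same tree, the two count vectors are aligned bin-for-bin, which is what makes this comparison well-defined.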
If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.
The threshold for drift is determined using the desired alpha level by a bootstrap estimate for the critical value of the K-L divergence, repeatedly drawing bootstrap_samples samples, each of size 2 * window_size, from the reference window.
Additionally, the Kulldorff spatial scan statistic, which is a special case of the KL-divergence, can be calculated at each node of the kdq-tree, which gives a measure of the regions of the data space which have the greatest divergence between the reference and test windows. This can be used to visualize which regions of data space have the greatest drift. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.
Note that this algorithm could be used with other types of trees; the reference paper and this implementation use kdq-trees.
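The bootstrap threshold can be sketched as follows. This is a simplified, hypothetical illustration: the real detector bootstraps over kdq-tree leaf counts, whereas here the divergence is passed in as a stand-in function:

```python
import random

def bootstrap_critical_value(reference, window_size, divergence,
                             n_boot=500, alpha=0.01):
    """Estimate the critical divergence under 'no drift' by repeatedly
    splitting a resampled pool into two pseudo-windows and measuring
    their divergence; the (1 - alpha) quantile becomes the threshold."""
    divergences = []
    for _ in range(n_boot):
        # Draw 2 * window_size points with replacement from the reference data
        sample = random.choices(reference, k=2 * window_size)
        pseudo_ref, pseudo_test = sample[:window_size], sample[window_size:]
        divergences.append(divergence(pseudo_ref, pseudo_test))
    divergences.sort()
    return divergences[int((1 - alpha) * (n_boot - 1))]
```

In the detector itself, the divergence argument corresponds to the K-L divergence over the tree's leaf counts; any scalar two-sample statistic works for the sketch.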
Note that the current implementation does not explicitly handle categorical data.
Ref. Dasu et al. [2006]
- __init__(alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
- Parameters
alpha (float, optional) – Achievable significance level. Defaults to 0.01.
bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.
count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.
cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == "drift" or initialization.
- to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)[source]
Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.
- Parameters
tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, the only tree described. Defaults to "build".
tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".
max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.
input_cols (list, optional) – List of column names for the input data. Defaults to None.
- Returns
A dataframe where each row corresponds to a node, and each column contains some information:
name: a label corresponding to which feature this split is on
idx: a unique ID for the node, to pass to plotly.express.treemap’s id argument
parent_idx: the ID of the node’s parent
cell_count: how many samples are in this node in the reference tree
depth: how deep the node is in the tree
count_diff: if tree_id2 is specified, the change in counts from the reference tree
kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.
- Return type
pd.DataFrame
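The returned columns line up with plotly.express.treemap's arguments. The snippet below sketches the mapping with a hand-built toy table; the rows and the df variable are hypothetical, and the commented-out plotly call assumes plotly is installed:

```python
# Toy stand-in for the dataframe returned by to_plotly_dataframe,
# using the documented column names.
rows = [
    {"name": "root", "idx": "0", "parent_idx": "", "cell_count": 100, "depth": 0},
    {"name": "x1 split", "idx": "1", "parent_idx": "0", "cell_count": 60, "depth": 1},
    {"name": "x1 split", "idx": "2", "parent_idx": "0", "cell_count": 40, "depth": 1},
]

# With plotly installed, idx/parent_idx feed treemap's ids/parents
# arguments (hypothetical usage, assuming the result is in a pandas
# DataFrame df):
#   import plotly.express as px
#   fig = px.treemap(df, ids="idx", parents="parent_idx", names="name",
#                    values="cell_count", color="kss")
#   fig.show()

# Sanity check: every non-root parent_idx refers to an existing idx.
ids = {r["idx"] for r in rows}
assert all(r["parent_idx"] in ids for r in rows if r["parent_idx"])
```

Coloring by kss highlights the regions of data space with the greatest divergence between the reference and test trees.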
- class menelaus.data_drift.kdq_tree.KdqTreeStreaming(window_size, persistence=0.05, alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
Bases:
KdqTreeDetector, StreamingDetector
Implements the kdqTree drift detection algorithm in a streaming data context. Inherits from KdqTreeDetector and StreamingDetector (see docs).
kdqTree is a drift detection algorithm which detects drift via the Kullback-Leibler divergence, calculated after partitioning the data space via constructing a k-d-quad-tree (kdq-tree).
If used in a streaming data setting, the reference window is used to construct a kdq-tree, and the data in both the reference and test window are filed into it. If used in a batch data setting, the reference window - the first batch passed in - is used to construct a kdq-tree, and data in test batches are compared to it. When drift is detected on a test batch, that test batch is set to be the new reference window - unless the user specifies a reference window using the set_reference method.
The threshold for drift is determined using the desired alpha level by a bootstrap estimate for the critical value of the K-L divergence, repeatedly drawing bootstrap_samples samples, each of size 2 * window_size, from the reference window.
Additionally, the Kulldorff spatial scan statistic, which is a special case of the KL-divergence, can be calculated at each node of the kdq-tree, which gives a measure of the regions of the data space which have the greatest divergence between the reference and test windows. This can be used to visualize which regions of data space have the greatest drift. Note that these statistics are specific to the partitions of the data space by the kdq-tree, rather than (necessarily) the maximally different region in general. KSSs are made available via to_plotly_dataframe, which produces output structured for use with plotly.express.treemap.
Ref. Dasu et al. [2006]
- __init__(window_size, persistence=0.05, alpha=0.01, bootstrap_samples=500, count_ubound=100, cutpoint_proportion_lbound=2e-10)[source]
- Parameters
window_size (int) – The minimum number of samples required to test whether drift has occurred.
persistence (float, optional) – Persistence factor: how many samples in a row, as a proportion of the window size, must be in the “drift region” of K-L divergence, in order for kdqTree to alarm and reset. Defaults to 0.05.
alpha (float, optional) – Achievable significance level. Defaults to 0.01.
bootstrap_samples (int, optional) – The number of bootstrap samples to use to approximate the empirical distributions. Equivalent to kappa in Dasu (2006), which recommends 500-1000 samples. Defaults to 500.
count_ubound (int, optional) – An upper bound for the number of samples stored in a leaf node of the kdqTree. No leaf shall contain more samples than this value, unless further divisions violate the cutpoint_proportion_lbound restriction. Default 100.
cutpoint_proportion_lbound (float, optional) – A lower bound for the size of the leaf nodes. No node shall have a size length smaller than this proportion, relative to the original feature length. Defaults to 2e-10.
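The role of the persistence parameter can be sketched as a consecutive-exceedance counter. This is an illustrative simplification of the streaming alarm logic, not the library's exact code:

```python
def drift_alarm(divergences, critical_value, window_size, persistence=0.05):
    """Alarm only after the divergence has exceeded the bootstrap
    critical value for persistence * window_size consecutive samples."""
    needed = persistence * window_size
    consecutive = 0
    for d in divergences:
        if d > critical_value:
            consecutive += 1
            if consecutive >= needed:
                return True
        else:
            consecutive = 0  # streak broken; start over
    return False
```

Requiring a sustained run of high divergences, rather than a single spike, makes the streaming detector less sensitive to momentary noise.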
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == "drift" or initialization.
- to_plotly_dataframe(tree_id1='build', tree_id2='test', max_depth=None, input_cols=None)
Generates a dataframe containing information about the kdqTree’s structure and some node characteristics, intended for use with plotly.
- Parameters
tree_id1 (str, optional) – Reference tree. If tree_id2 is not specified, the only tree described. Defaults to "build".
tree_id2 (str, optional) – Test tree. If this is specified, the dataframe will also contain information about the difference between counts in each node for the reference vs. the test tree. Defaults to "test".
max_depth (int, optional) – Depth in the tree to which to recurse. Defaults to None.
input_cols (list, optional) – List of column names for the input data. Defaults to None.
- Returns
A dataframe where each row corresponds to a node, and each column contains some information:
name: a label corresponding to which feature this split is on
idx: a unique ID for the node, to pass to plotly.express.treemap’s id argument
parent_idx: the ID of the node’s parent
cell_count: how many samples are in this node in the reference tree
depth: how deep the node is in the tree
count_diff: if tree_id2 is specified, the change in counts from the reference tree
kss: the Kulldorff Spatial Scan Statistic for this node, defined as the Kullback-Leibler divergence for this node between the reference and test trees, using the individual node and all other nodes combined as the bins for the distributions.
- Return type
pd.DataFrame
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new sample point. Constructs the reference data’s kdqtree; then, when sufficient samples have been received, puts the test data into the same tree; then, checks divergence between the reference and test data.
The reference window is maintained as the initial window until drift. Upon drift, the user may continue passing data to update and new reference windows will be constructed once sufficient samples are received.
- Parameters
X (pandas.DataFrame or numpy array) – If just reset/initialized, the reference data. Otherwise, a new sample to put into the test window.
y_true (numpy.ndarray) – true labels of input data - not used in KdqTree
y_pred (numpy.ndarray) – predicted labels of input data - not used in KdqTree
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- property samples_since_reset
Number of samples since last drift detection.
- Returns
int
- property total_samples
Total number of samples the drift detector has been updated with.
- Returns
int
menelaus.data_drift.nndvi
- class menelaus.data_drift.nndvi.NNDVI(k_nn: int = 30, sampling_times: int = 500, alpha: float = 0.01)[source]
Bases:
BatchDetector
This class encodes the Nearest Neighbors Density Variation Identification (NN-DVI) drift detection algorithm, introduced in Liu et al. (2018). Note that this implementation is intended for batch datasets, rather than the streaming context.
Broadly, NN-DVI combines a reference and test data batch, creates a normalized version of the subsequent adjacency matrix (after a k-NN search), and then analyzes distance changes in the reference and test sections of the combined adjacency matrix. Those changes are compared against a threshold distance value, which is found by randomly sampling new reference and test sections, then fitting a Gaussian distribution to distance changes for those trials.
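A rough, self-contained sketch of these steps on 1-D data follows. It is a simplification for illustration only: the function names, the distance formula, and the use of scalar points are assumptions here, not the paper's exact construction:

```python
import random

def knn_adjacency(points, k):
    """0/1 adjacency matrix of a k-NN search on scalar points:
    each point is linked to its k nearest neighbors (plus itself)."""
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i, p in enumerate(points):
        order = sorted(range(n), key=lambda j: abs(points[j] - p))
        for j in order[: k + 1]:  # order[0] is i itself, at distance 0
            adj[i][j] = 1
    return adj

def nndvi_distance(points, ref_idx, test_idx, k):
    """Simplified NN-DVI-style distance: 1 minus the fraction of
    reference-section neighborhood links that land in the test section."""
    ref_idx, test_idx = list(ref_idx), set(test_idx)
    adj = knn_adjacency(points, k)
    ref_links = sum(adj[i][j] for i in ref_idx for j in range(len(points)))
    cross_links = sum(adj[i][j] for i in ref_idx for j in test_idx)
    return 1 - cross_links / ref_links

def estimate_threshold(points, n_ref, k, trials=100, z=2.33):
    """Threshold by random re-partitioning: fit a Gaussian to distances
    from random splits and take mean + z * std (z ~ a 1 - alpha quantile)."""
    idx, dists = list(range(len(points))), []
    for _ in range(trials):
        random.shuffle(idx)
        dists.append(nndvi_distance(points, idx[:n_ref], idx[n_ref:], k))
    mean = sum(dists) / len(dists)
    std = (sum((d - mean) ** 2 for d in dists) / len(dists)) ** 0.5
    return mean + z * std
```

When the test batch is drawn from a shifted distribution, reference points keep their neighbors within the reference section, so the distance stays near its maximum and exceeds the randomly-sampled threshold.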
- total_samples
number of batches the drift detector has ever been updated with.
- Type
int
- samples_since_reset
number of batches since the last drift detection.
- Type
int
- drift_state
detector’s current drift state. Can take values "drift", "warning", or None.
- Type
str
- k_nn
the ‘k’ in k-Nearest-Neighbor (k-NN) search
- Type
int
- reference_batch
initial batch of data
- Type
numpy.array
- sampling_times
number of times to perform sampling for threshold estimation
- Type
int
- alpha
significance level for detecting drift
- Type
float
- __init__(k_nn: int = 30, sampling_times: int = 500, alpha: float = 0.01)[source]
- k_nn
the ‘k’ in k-Nearest-Neighbor (k-NN) search. Default 30.
- Type
int, optional
- sampling_times
number of times to perform sampling for threshold estimation. Default 500.
- Type
int, optional
- alpha
significance level for detecting drift. Default 0.01.
- Type
float, optional
- reset()[source]
Initialize relevant attributes to original values, to ensure information is only stored from samples_since_reset onwards. Intended for use after drift_state == 'drift'.
- set_reference(X, y_true=None, y_pred=None)[source]
Set the detector’s reference batch to an updated value; typically used in update.
- X
updated reference batch
- Type
numpy.array
- y_true
true labels, not used in NNDVI
- Type
numpy.array
- y_pred
predicted labels, not used in NNDVI
- Type
numpy.array
- update(X: array, y_true=None, y_pred=None)[source]
Update the detector with a new test batch. If drift is detected, new reference batch becomes most recent test batch.
- Parameters
X (numpy.array) – next batch of data to detect drift on.
y_true (numpy.array) – true labels, not used in NN-DVI
y_pred (numpy.array) – predicted labels, not used in NN-DVI
- property batches_since_reset
Number of batches since last drift detection.
- Returns
int
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- property total_batches
Total number of batches the drift detector has been updated with.
- Returns
int
menelaus.data_drift.pca_cd
- class menelaus.data_drift.pca_cd.PCACD(window_size, ev_threshold=0.99, delta=0.1, divergence_metric='kl', sample_period=0.05, online_scaling=True)[source]
Bases:
StreamingDetector
Principal Component Analysis Change Detection (PCA-CD) is a drift detection algorithm which checks for change in the distribution of the given data using one of several divergence metrics calculated on the data’s principal components.
First, principal components are built from the reference window - the initial window_size samples. New samples from the test window, of the same width, are projected onto these principal components. The divergence metric is calculated on these scores for the reference and test windows; if this metric diverges enough, then we consider drift to have occurred. This threshold is determined dynamically through the use of the Page-Hinkley test.
Once drift is detected, the reference window is replaced with the current test window, and the test window is initialized.
Ref. Qahtan et al. [2015]
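The Page-Hinkley test that sets this threshold dynamically can be sketched in its standard one-sided form. This illustrates the test itself, not the library's internal implementation; the delta and threshold values below are assumptions:

```python
def page_hinkley(values, delta=0.1, threshold=10.0):
    """Page-Hinkley test for an upward change in the mean of a stream.
    Returns the index at which change is flagged, or None."""
    mean = 0.0
    cumulative = 0.0   # PH statistic: cumulative deviation from running mean
    minimum = 0.0      # smallest value the statistic has taken so far
    for t, x in enumerate(values, start=1):
        mean += (x - mean) / t          # running mean of the stream
        cumulative += x - mean - delta  # deviation, penalized by delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return t - 1                # index where the alarm fires
    return None
```

The statistic accumulates each observation's deviation from the running mean, discounted by delta; an alarm fires once it rises more than threshold above its historical minimum, which is what makes small drifts detectable without a fixed cutoff on the divergence itself.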
- step
how frequently (by number of samples) to detect drift. This is either 100 samples or sample_period * window_size, whichever is smaller.
- Type
int
- ph_threshold
threshold parameter for the internal Page-Hinkley detector. Takes the value of .01 * window_size.
- Type
float
- num_pcs
the number of principal components being used to meet the specified ev_threshold parameter.
- Type
int
- __init__(window_size, ev_threshold=0.99, delta=0.1, divergence_metric='kl', sample_period=0.05, online_scaling=True)[source]
- Parameters
window_size (int) – size of the reference window. Note that PCA_CD will only try to detect drift periodically, either every 100 observations or 5% of the window_size, whichever is smaller.
ev_threshold (float, optional) – Threshold for percent explained variance required when selecting number of principal components. Defaults to 0.99.
delta (float, optional) – Parameter for Page Hinkley test. Minimum amplitude of change in data needed to sound alarm. Defaults to 0.1.
divergence_metric (str, optional) – divergence metric to use when comparing the two distributions to detect drift. Defaults to “kl”.
“kl” - Jensen-Shannon distance, a symmetric, bounded form of Kullback-Leibler divergence; uses kernel density estimation with an Epanechnikov kernel.
“intersection” - intersection area under the curves for the estimated density functions; uses histograms to estimate densities of windows. A discontinuous, less accurate estimate that should only be used when efficiency is of concern.
sample_period (float, optional) – how often to check for drift. This is 100 samples or sample_period * window_size, whichever is smaller. Default .05, or 5% of the window size.
online_scaling (bool, optional) – whether to standardize the data as it comes in, using the reference window, before applying PCA. Defaults to True.
- reset()[source]
Initialize the detector’s drift state and other relevant attributes. Intended for use after drift_state == 'drift'.
- update(X, y_true=None, y_pred=None)[source]
Update the detector with a new observation.
- Parameters
X (numpy.ndarray) – next observation
y_true (numpy.ndarray) – true label of observation - not used in PCACD
y_pred (numpy.ndarray) – predicted label of observation - not used in PCACD
- property drift_state
Set detector’s drift state to "drift", "warning", or None.
- input_type = 'streaming'
- property samples_since_reset
Number of samples since last drift detection.
- Returns
int
- property total_samples
Total number of samples the drift detector has been updated with.
- Returns
int