menelaus.datasets

This module contains functions and classes to generate data for testing detectors.

menelaus.datasets.generator

This submodule is not yet implemented.

menelaus.datasets.make_example_data

Functions to generate example data according to a fixed scheme.

menelaus.datasets.make_example_data.fetch_circle_data()[source]

Retrieve the Circle data from the datasets directory. Circle is a synthetic dataset containing drift due to both a change in the feature distribution and a change in the conditional target distribution. Drift occurs from index 1000 to index 1250 and affects 66% of the data points.

Ref. Minku [2010]

Returns: A dataframe containing the Circle dataset.
Return type: pd.DataFrame
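
As a rough illustration of how the stated drift window might be used, the sketch below splits a sequence of row positions into the segments before, during, and after drift. The helper name split_by_drift_window, the inclusive treatment of index 1250, and the 2,000-element stand-in list are assumptions made for illustration; none of them come from menelaus.

```python
def split_by_drift_window(rows, start=1000, end=1250):
    """Split rows into (before, during, after) segments by position.

    Assumption: the window includes both endpoints. The documentation
    only states that drift occurs "from index 1000-1250".
    """
    before = rows[:start]
    during = rows[start:end + 1]
    after = rows[end + 1:]
    return before, during, after


# Stand-in for row positions; the real dataset length is not assumed here.
rows = list(range(2000))
before, during, after = split_by_drift_window(rows)
print(len(before), len(during), len(after))
```

With the returned dataframe, the same three segments could be taken by positional slicing, for example with pandas' iloc indexer.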

menelaus.datasets.make_example_data.fetch_rainfall_data()[source]

Retrieve the Rainfall data from the datasets directory. The National Oceanic and Atmospheric Administration (NOAA) rainfall data contains weather measurements collected over a 50-year period at a site in Bellevue, Nebraska. It contains eight features: temperature, dew point, sea-level pressure, visibility, average wind speed, maximum sustained wind speed, minimum temperature, and maximum temperature. The dependent variable is rain. Concept and data drift starts at index 12,000 and persists through the rest of the dataset.

Ref. Souza et al. [2020]

Returns: A dataframe containing the Rainfall dataset.
Return type: pd.DataFrame

menelaus.datasets.make_example_data.make_example_batch_data()[source]

This function returns a dataframe containing synthetic batch data for use with the repo’s examples. The dataframe’s columns are "year", "a", "b", ... "j", "cat", "confidence", "drift".

"year" covers 2007-2021, with 20,000 observations per year.

Features "b", "e", "f" are normally distributed.

Features "a", "c", "d", "g", "h", "i", "j" have a gamma distribution.

The "cat" feature contains categorical values from 1 to 7, sampled with varying probability.

"confidence" contains values on [0, 0.6] through 2018, then values on [0.4, 1] from 2019 onward.

Drift occurs as follows:

The mean of column "b" changes in 2009, then reverts to the original distribution in 2010.

The variance of columns "c" and "d" changes in 2012, when some samples are replaced with the mean, then reverts to the original distribution in 2013.

The correlation of columns "e" and "f" increases in 2015 (from 0 correlation to 0.5 correlation).

The mean and variance of column "h" change in 2019, and the new distribution is maintained going forward. The range of the "confidence" column also changes from 2019 onward.

The mean and variance of column "j" change in 2021.

Returns: A dataframe containing a synthetic batch dataset.
Return type: pd.DataFrame
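
To make the drift scheme above more concrete, the following sketch reproduces three of the described drift types (a mean shift, a variance reduction by overwriting samples with the mean, and an induced correlation of 0.5) using only the Python standard library. It is not the menelaus implementation: the helper names, distribution parameters, shift size, replacement fraction, and the reduced 2,000 rows per year are all assumptions chosen for illustration.

```python
import random
import statistics


def make_year(year, n, rng):
    """Generate n baseline rows for one year (hypothetical helper).

    "b", "e", "f" are normal and "c" is gamma, as in the description;
    the specific parameters are illustrative assumptions.
    """
    return [
        {
            "year": year,
            "b": rng.gauss(0.0, 1.0),
            "c": rng.gammavariate(2.0, 2.0),
            "e": rng.gauss(0.0, 1.0),
            "f": rng.gauss(0.0, 1.0),
        }
        for _ in range(n)
    ]


def inject_drift(rows, year, rng):
    """Apply drift to a single year's rows in place, then return them."""
    if year == 2009:
        # Mean shift in "b" for this year only (shift size is assumed).
        for r in rows:
            r["b"] += 3.0
    elif year == 2012:
        # Shrink the variance of "c" by overwriting roughly half of the
        # samples with the year's mean (the fraction is assumed).
        c_mean = statistics.fmean(r["c"] for r in rows)
        for r in rows:
            if rng.random() < 0.5:
                r["c"] = c_mean
    elif year == 2015:
        # Induce correlation rho between "e" and "f" while keeping
        # unit variance: f = rho * e + sqrt(1 - rho^2) * noise.
        rho = 0.5
        for r in rows:
            r["f"] = rho * r["e"] + (1 - rho**2) ** 0.5 * rng.gauss(0.0, 1.0)
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = {
    year: inject_drift(make_year(year, 2000, rng), year, rng)
    for year in range(2007, 2022)
}


def col(year, name):
    """Extract one column for one year as a list."""
    return [r[name] for r in data[year]]


mean_b_2008 = statistics.fmean(col(2008, "b"))
mean_b_2009 = statistics.fmean(col(2009, "b"))
var_c_2011 = statistics.pvariance(col(2011, "c"))
var_c_2012 = statistics.pvariance(col(2012, "c"))
corr_2014 = pearson(col(2014, "e"), col(2014, "f"))
corr_2015 = pearson(col(2015, "e"), col(2015, "f"))

print("mean of b, 2008 vs 2009:", mean_b_2008, mean_b_2009)
print("variance of c, 2011 vs 2012:", var_c_2011, var_c_2012)
print("corr(e, f), 2014 vs 2015:", corr_2014, corr_2015)
```

Because each year is generated independently and only the drifted year is modified, the reversion to the original distribution in the following year comes for free in this sketch.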