menelaus.datasets

This module contains functions and classes to generate data for testing detectors.

menelaus.datasets.generator

This submodule is not yet implemented.

menelaus.datasets.make_example_data

Functions to generate example data according to a fixed scheme.

menelaus.datasets.make_example_data.fetch_circle_data()

Retrieve the Circle data from the datasets directory. Circle is a synthetic dataset containing drift due to both a change in the feature distribution and a change in the conditional target distribution. Drift occurs from index 1000 to index 1250 and affects 66% of the data points.

Ref. Minku [2010]

Returns

A dataframe containing the Circle dataset.

Return type

pd.DataFrame
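The drift window described above can be used to cut the returned frame into segments. The following is a minimal sketch, not part of menelaus: the helper name `split_by_drift_window` is hypothetical, and it assumes the stated indices are row positions, with the start inclusive and the end exclusive.

```python
import pandas as pd


def split_by_drift_window(df: pd.DataFrame, start: int, end: int):
    """Split a frame into pre-drift, drift, and post-drift segments.

    Assumption: `start` and `end` are row positions, with `start`
    inclusive and `end` exclusive.
    """
    before = df.iloc[:start]
    during = df.iloc[start:end]
    after = df.iloc[end:]
    return before, during, after


# Usage with the real data (requires menelaus to be installed):
# from menelaus.datasets.make_example_data import fetch_circle_data
# before, during, after = split_by_drift_window(fetch_circle_data(), 1000, 1250)
```

The pre-drift segment can then serve as a reference window when initializing a detector, with the remaining rows passed in afterwards.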

menelaus.datasets.make_example_data.fetch_rainfall_data()

Retrieve the Rainfall data from the datasets directory. The National Oceanic and Atmospheric Administration (NOAA) rainfall data contains weather measurements collected over a 50-year period at a site in Bellevue, Nebraska. It contains eight features: temperature, dew point, sea-level pressure, visibility, average wind speed, maximum sustained wind speed, minimum temperature, and maximum temperature. The dependent variable is rain. Concept and data drift starts at index 12,000 and persists through the rest of the dataset.

Ref. Souza et al. [2020]

Returns

A dataframe containing the Rainfall dataset.

Return type

pd.DataFrame
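Because drift here begins at a single index and persists, a quick check is to compare column means on either side of that index. The sketch below is illustrative only: `mean_shift_at` is a hypothetical helper, not a menelaus function, and it assumes the index 12,000 refers to a row position. It makes no assumption about the column names in the returned frame, so any numeric target column is summarized along with the features.

```python
import pandas as pd


def mean_shift_at(df: pd.DataFrame, split: int) -> pd.DataFrame:
    """Compare the mean of each numeric column before and after a row position."""
    numeric = df.select_dtypes(include="number")
    before = numeric.iloc[:split].mean()
    after = numeric.iloc[split:].mean()
    return pd.DataFrame({"before": before, "after": after, "diff": after - before})


# Usage with the real data (requires menelaus to be installed):
# from menelaus.datasets.make_example_data import fetch_rainfall_data
# print(mean_shift_at(fetch_rainfall_data(), 12000))
```

A difference in means is only a coarse indicator. It will not reveal changes in variance or in the relationship between features and the target.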

menelaus.datasets.make_example_data.make_example_batch_data()

This function returns a dataframe containing synthetic batch data for use with the repo’s examples. The dataframe’s columns are "year", "a", "b", ... "j", "cat", "confidence", "drift".

  • "year" covers 2007-2021, with 20,000 observations per year.

  • Features "b", "e", "f" are normally distributed.

  • Features "a", "c", "d", "g", "h", "i", "j" have a gamma distribution.

  • The "cat" feature is categorical, with values from 1 to 7 sampled with varying probability.

  • "confidence" contains values on [0, 0.6] through 2018, then values on [0.4, 1].

Drift occurs as follows:

  • Change the mean of column "b" in 2009. Reverts to original distribution in 2010.

  • Change the variance of columns "c" and "d" in 2012 by replacing some samples with the mean. Reverts to original distribution in 2013.

  • Increase the correlation of columns "e" and "f" in 2015 (from 0 to 0.5).

  • Change the mean and variance of column "h" in 2019, and maintain this new distribution going forward. Change the range of the "confidence" column from [0, 0.6] to [0.4, 1] going forward.

  • Change the mean and variance of column "j" in 2021.

Returns

A dataframe containing a synthetic batch dataset.

Return type

pd.DataFrame
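The drift scheme listed above can be inspected by summarizing the returned frame year by year. The following sketch assumes only the column names given in the description ("year", "b", "e", "f", and so on). The helpers `yearly_summary` and `yearly_correlation` are hypothetical names introduced for illustration and are not part of menelaus.

```python
import pandas as pd


def yearly_summary(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Mean and sample variance of one column for each year."""
    return df.groupby("year")[column].agg(["mean", "var"])


def yearly_correlation(df: pd.DataFrame, left: str, right: str) -> pd.Series:
    """Pearson correlation between two columns for each year."""
    return pd.Series(
        {year: group[left].corr(group[right]) for year, group in df.groupby("year")},
        name=f"corr({left}, {right})",
    )


# Usage with the real data (requires menelaus to be installed):
# from menelaus.datasets.make_example_data import make_example_batch_data
# df = make_example_batch_data()
# print(yearly_summary(df, "b"))           # mean of "b" should differ in 2009
# print(yearly_correlation(df, "e", "f"))  # correlation should rise in 2015
```

Grouping by "year" mirrors how the batch examples treat each year as one batch, so a shift that appears in a single row of these summaries corresponds to drift in a single batch.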