Anonymization

This section demonstrates how to use the Anonymization module in ydata-sdk.

Don't forget to set up your license key

    import os

    os.environ['YDATA_LICENSE_KEY'] = '{add-your-key}'

Example Code

import numpy as np
import pandas as pd

from ydata.dataset import Dataset
from ydata.metadata import Metadata
from ydata.preprocessors.methods.anonymization import AnonymizerConfigurationBuilder, AnonymizerType
from ydata.synthesizers.regular import RegularSynthesizer


def create_dummy_dataset(size: int = 100) -> Dataset:
    df = pd.DataFrame(
        {
            "constant": np.ones(size),
            "low_cardinality": np.random.randint(2, size=size),
            "ascending": np.arange(size),
            "negatives": -1 * np.arange(size),
            "missings": [np.nan] * size,
        }
    )
    df = df.astype(str)
    return Dataset(df)


if __name__ == "__main__":
    dataset = create_dummy_dataset()
    meta = Metadata(dataset)

    """
    The anonymizer supports several configuration formats to make it easy to use.
    In most cases, the configuration is a dictionary whose keys are the columns in the dataset and whose
    values are the anonymizer types to use.

    The anonymizer types are members of an Enum, each of which is internally mapped to the function that performs the anonymization.

    The anonymizer can be applied to any column, but the output data type will be CATEGORICAL/STR.
    At the moment it is not possible to configure the output data type, and doing so would not make much sense in most cases.
    """

    """
    The values in the column `categorical_str` will be replaced by city names. It is also possible to pass the name of the
    anonymizer as a string rather than a member of the Enum `AnonymizerType`.
    """
    config = {
        'categorical_str': AnonymizerType.CITY,
        'categorical_int': 'city'
    }

    """
    Some anonymizers accept parameters for further configuration. For instance, the hostname anonymizer can generate hostnames
    with a different number of levels. The anonymizer and its parameters are specified together in a dictionary, with the
    anonymizer given under the `type` key, as shown below:
    """
    config = {
        'categorical_str': {
            "type": AnonymizerType.HOSTNAME,
            "levels": 3,  # hostname anonymizer's specific optional parameter
        },
        # Alternatively, string name still works
        'categorical_int': {
            "type": "hostname",
            "levels": 3,  # hostname anonymizer's specific optional parameter
        },
    }

    """
    AnonymizerType.REGEX is a special case because it is very common. The regular expression used to anonymize can be
    specified directly as a string. As long as it is a valid regular expression, it will be deduced as such.
    """
    config = {
        # Regex as a string is deduced automatically as AnonymizerType.REGEX
        'categorical_int': {
            "type": "regex",
            # regex anonymizer's specific required parameter
            "regex": r'[0-9]{4}-[A-Z]{5}',
        }
    }

    """
    Sometimes two columns in a dataset refer to the same entity. For instance, a list of transactions between customers might
    include one column for the sender and one for the receiver. By default, the anonymizer operates on each column independently,
    so the same customer might be encoded by a different value in each column, breaking the relationship.
    It is possible to specify that columns must be anonymized jointly, as in the example below. Notice that in this case the key
    in the dictionary does not have to refer to a column.
    """
    config = {
        'joint_columns': {
            'cols': ['categorical_str', 'categorical_str_2'],
            'type': AnonymizerType.HOSTNAME
        }
    }

    """
    A configuration builder is also available to help create the anonymizer configuration.
    """
    builder = AnonymizerConfigurationBuilder(config)
    builder.add_config(
        {
            "categorical_int": {
                "type": "regex",
                "regex": r'[0-9]{4}-[A-Z]{5}',
            },
            "negatives": AnonymizerType.NAME,
            "ascending": {
                "type": AnonymizerType.CITY,
                "locale": "ja_JP",
            }
        }
    )

    """
    The builder can also save configurations and load previously saved ones.
    """
    builder.save("anonymizer_config.pkl")
    builder = AnonymizerConfigurationBuilder.load("anonymizer_config.pkl")

    """
    Once defined, the anonymizer configuration should be passed directly to the Synthesizer.
    """
    config = {
        # Regex as a string is deduced automatically as AnonymizerType.REGEX
        'ascending': {
            "type": "regex",
            "regex": r'[0-9]{4}-[A-Z]{5}'
        },
        'low_cardinality': AnonymizerType.CITY
    }

    synth = RegularSynthesizer()
    synth.fit(dataset,
              metadata=meta,
              anonymize=config)

    # Column `ascending` will be replaced by a string of the form 1234-ABCDE and `low_cardinality` by city names
    synth_sample = synth.sample(len(dataset))

    """
    The AnonymizerConfigurationBuilder can also be passed directly to the synthesizer.
    """
    builder = AnonymizerConfigurationBuilder(config)
    synth = RegularSynthesizer()
    synth.fit(dataset,
              metadata=meta,
              anonymize=builder)

    synth_sample = synth.sample(len(dataset))

    """
    Alternatively, it is possible to use the AnonymizerEngine outside of the Synthesizer as a regular object as follows:
    """
    from ydata.preprocessors.preprocess_methods import AnonymizerEngine
    anonymizer = AnonymizerEngine()

    # The engine accepts either an AnonymizerConfigurationBuilder or a dictionary as config
    anonymized = anonymizer.fit_transform(X=dataset, config=config, metadata=meta)
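
The difference between column-wise and joint anonymization described above can be illustrated without the SDK. The following is a minimal standard-library sketch, not the ydata-sdk implementation: `fake_code` and `build_mapping` are hypothetical helpers introduced only for illustration, and the sender/receiver values are made up. It generates replacements shaped like the pattern `[0-9]{4}-[A-Z]{5}` used in the example and compares one mapping per column against one mapping shared by both columns.

```python
import random
import re
import string

# Same shape as the regex used in the example configuration.
PATTERN = re.compile(r"[0-9]{4}-[A-Z]{5}")


def fake_code(rng: random.Random) -> str:
    """Hypothetical helper: return a random string matching PATTERN."""
    digits = "".join(rng.choices(string.digits, k=4))
    letters = "".join(rng.choices(string.ascii_uppercase, k=5))
    return f"{digits}-{letters}"


def build_mapping(values, rng: random.Random) -> dict:
    """Hypothetical helper: map each distinct value to one distinct replacement."""
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue
        replacement = fake_code(rng)
        while replacement in used:  # keep replacements unique within a mapping
            replacement = fake_code(rng)
        used.add(replacement)
        mapping[value] = replacement
    return mapping


rng = random.Random(0)
senders = ["alice", "bob", "alice"]
receivers = ["bob", "carol", "carol"]

# Column-wise: each column gets its own mapping, so "bob" generally
# receives a different replacement as a sender than as a receiver.
sender_map = build_mapping(senders, rng)
receiver_map = build_mapping(receivers, rng)
print("column-wise:", sender_map["bob"], "vs", receiver_map["bob"])

# Joint: a single mapping is built over both columns, so "bob" is
# encoded by the same value wherever it appears.
joint_map = build_mapping(senders + receivers, rng)
joint_senders = [joint_map[v] for v in senders]
joint_receivers = [joint_map[v] for v in receivers]
print("joint senders:  ", joint_senders)
print("joint receivers:", joint_receivers)
```

With the shared mapping, the second sender and the first receiver (both "bob") carry the same replacement, which is the relationship the `cols` entry in the configuration is meant to preserve.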