Anonymization

YData Synthesizers can anonymize sensitive information so that the original values never appear in the synthetic data and are replaced by fake values instead.

Does the model retain the original values?

No! Anonymization is performed before model training, so the model never sees the original values.
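The ordering described above can be illustrated with a minimal sketch in plain Python. It is not the SDK's implementation: `anonymize_column`, `train`, and the sample rows are hypothetical stand-ins, used only to show that when sensitive values are replaced before training, the training step has no access to them.

```python
def anonymize_column(rows, column, make_fake):
    """Return new rows in which `column` holds fake values (hypothetical helper)."""
    return [{**row, column: make_fake(i)} for i, row in enumerate(rows)]


def train(rows):
    """Stand-in for model training: it only records every value it is shown."""
    return {value for row in rows for value in row.values()}


# Hypothetical sample data
rows = [
    {"Name": "Alice Smith", "Age": 29},
    {"Name": "Bob Jones", "Age": 41},
]

# Anonymize first, then train on the anonymized copy only
anonymized = anonymize_column(rows, "Name", lambda i: f"Person {i}")
seen = train(anonymized)

print("Alice Smith" in seen)  # the original name never reached the training step
```

The non-sensitive column (`Age`) is passed through unchanged, and the original rows are left untouched because the helper builds new dictionaries.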

Anonymization is configured by specifying which columns need to be anonymized and how to perform the anonymization. The anonymization rules are defined as a dictionary with the following format:

{column_name: anonymization_rule}

While there are some predefined anonymization rules such as name, email, and company, it is also possible to create a rule using a regular expression. The anonymization rules have to be passed to a synthesizer in its fit method using the parameter anonymize.
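As a rough illustration of the rule format, the sketch below builds a rules dictionary and checks it locally before it would be passed to fit. The column names, the `check_rules` helper, and the `fake_ticket` generator are assumptions made for this example and are not part of the SDK. The set of predefined rules contains only the three named in the text, and the SDK may support others.

```python
import random
import re
import string

# Rule names listed in the text above; the SDK may offer more
PREDEFINED = {"name", "email", "company"}

# Hypothetical columns mapped to anonymization rules
rules = {
    "Name": "name",
    "Email": "email",
    "Ticket": "[A-Z]{2}-[A-Z]{4}",
}


def check_rules(rules):
    """Label each rule as predefined or regex (hypothetical helper).

    Raises re.error if a non-predefined rule is not a valid pattern.
    """
    kinds = {}
    for column, rule in rules.items():
        if rule in PREDEFINED:
            kinds[column] = "predefined"
        else:
            re.compile(rule)
            kinds[column] = "regex"
    return kinds


def fake_ticket(rng):
    """Hand-written generator for the single pattern [A-Z]{2}-[A-Z]{4}."""
    letters = string.ascii_uppercase
    return "".join(rng.choices(letters, k=2)) + "-" + "".join(rng.choices(letters, k=4))


kinds = check_rules(rules)
ticket = fake_ticket(random.Random(0))
print(kinds)
print(ticket)
```

The generated ticket shows the shape of value a regular-expression rule like this describes: two uppercase letters, a hyphen, then four uppercase letters. How the SDK itself generates values from a pattern is not specified here.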

What is the difference between anonymization and privacy?

Anonymization makes sure sensitive information is hidden from the data. Privacy makes sure it is not possible to infer the original data points from the synthetic data points via statistical attacks.

Therefore, for data sharing, anonymization and privacy controls are complementary.

The example below demonstrates how to anonymize the column Name with fake names and the column Ticket with values generated from a regular expression:

import os

from ydata.sdk.dataset import get_dataset
from ydata.sdk.synthesizers import RegularSynthesizer

# Do not forget to add your token as env variables
os.environ["YDATA_TOKEN"] = '<TOKEN>'  # Remove if already defined


def main():
    """In this example, we demonstrate how to anonymize columns while training
    a synthesizer.

    After training a Regular Synthesizer with anonymization rules, we request
    a sample and print the anonymized columns.
    """
    X = get_dataset('titanic')

    # We initialize a regular synthesizer
    # Until `fit` is called, the synthesizer exists only locally
    synth = RegularSynthesizer()

    # We define the anonymization rules as a dictionary with the format:
    # {column_name: anonymization_rule, ...}
    # While there are some predefined anonymization rules like name, email, company,
    # it is also possible to create a rule using a regular expression
    rules = {
        "Name": "name",
        "Ticket": "[A-Z]{2}-[A-Z]{4}"
    }

    # We train the synthesizer on our dataset
    synth.fit(
        X,
        name="titanic_synthesizer",
        anonymize=rules
    )

    # We request a synthetic dataset with 50 rows
    sample = synth.sample(n_samples=50)

    print(sample[["Name", "Ticket"]].head(3))


if __name__ == "__main__":
    main()