Privacy control

YData Synthesizers offers three levels of privacy:

  1. high privacy: the model is optimized to maximize the privacy of the synthetic data,
  2. high fidelity (default): the model is optimized to maximize the fidelity of the synthetic data to the original data,
  3. balanced: a tradeoff between privacy and fidelity.

The default privacy level is high fidelity. The privacy level can be changed at the moment a synthesizer is trained via the privacy_level parameter, which expects a PrivacyLevel value.
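
For reference, the minimal sketch below lists the values that could be passed as privacy_level. Only PrivacyLevel.HIGH_PRIVACY is confirmed by the full example further down; the HIGH_FIDELITY and BALANCED_PRIVACY_FIDELITY member names are assumptions based on the level names above and may differ in your SDK version.

from ydata.sdk.synthesizers import PrivacyLevel

# The three values accepted by privacy_level
# (HIGH_PRIVACY is confirmed by the example below; the other two
# member names are assumptions and may differ in your SDK version):
print(PrivacyLevel.HIGH_PRIVACY)               # optimized for privacy
print(PrivacyLevel.HIGH_FIDELITY)              # default, optimized for fidelity (assumed name)
print(PrivacyLevel.BALANCED_PRIVACY_FIDELITY)  # privacy/fidelity tradeoff (assumed name)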

What is the difference between anonymization and privacy?

Anonymization ensures that sensitive information is hidden from the data. Privacy ensures that the original data points cannot be inferred from the synthetic data points via statistical attacks.

Therefore, for data sharing, anonymization and privacy controls are complementary.
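
As an illustration of combining both controls, the sketch below passes an anonymization configuration together with a privacy level at training time. The anonymize parameter and its column-to-strategy mapping are assumptions made for illustration only; refer to the SDK's anonymization documentation for the exact format.

from ydata.sdk.dataset import get_dataset
from ydata.sdk.synthesizers import PrivacyLevel, RegularSynthesizer

# Assumes YDATA_TOKEN is already set in the environment
X = get_dataset('titanic')

synth = RegularSynthesizer()
synth.fit(
    X,
    name="titanic_private_and_anonymized",
    privacy_level=PrivacyLevel.HIGH_PRIVACY,  # documented privacy control
    anonymize={"Name": "name"},               # assumption: replace the 'Name' column with generated names
)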

The example below demonstrates how to train a synthesizer configured for high privacy:

import os

from ydata.sdk.dataset import get_dataset
from ydata.sdk.synthesizers import PrivacyLevel, RegularSynthesizer

# Do not forget to add your token as an environment variable
os.environ["YDATA_TOKEN"] = '<TOKEN>'  # Remove if already defined


def main():
    """In this example, we demonstrate how to train a synthesizer
    with a high-privacy setting from a pandas DataFrame.
    After training a Regular Synthesizer, we request a sample.
    """
    X = get_dataset('titanic')

    # We initialize a regular synthesizer
    # As long as `fit` has not been called, the synthesizer exists only locally
    synth = RegularSynthesizer()

    # We train the synthesizer on our dataset, setting the privacy level to high privacy
    synth.fit(
        X,
        name="titanic_synthesizer",
        privacy_level=PrivacyLevel.HIGH_PRIVACY
    )

    # We request a synthetic dataset with 50 rows
    sample = synth.sample(n_samples=50)
    print(sample)


if __name__ == "__main__":
    main()