Conditional Sampling for Synthetic Data Generation

Overview

Conditional sampling is a powerful technique in synthetic data generation that allows users to produce data samples tailored to specific conditions or feature values. This approach is particularly beneficial when addressing challenges such as class imbalance, underrepresentation, or the need for data augmentation in machine learning models.

Benefits of Conditional Sampling

Addressing Class Imbalance: Generating synthetic samples for underrepresented classes helps balance datasets, which can improve model performance on those classes.
Data Augmentation: Producing samples that meet specific criteria increases the diversity of training data, which can make models more robust.
Bias Mitigation: Makes it possible to build datasets that better represent diverse populations, which can reduce bias in model predictions.
Controlled Data Generation: Lets you generate data with chosen characteristics so that the synthetic data fits a specific analytical need.

Implementing Conditional Sampling with ydata-sdk

The ydata-sdk provides tools to implement conditional sampling seamlessly. Below is a step-by-step guide to using the RegularSynthesizer for conditional synthetic data generation.

Authenticate with your ydata-sdk account

    import os

    # Replace the placeholder string with your actual license key
    os.environ["YDATA_LICENSE_KEY"] = "YDATA_LICENSE_KEY"

Load your dataset

You can load your data either with Pandas or with one of ydata-sdk's available connectors. In this example, we will be using Pandas.

    import pandas as pd
    from ydata.dataset import Dataset

    # Load your data
    df = pd.read_csv('your_dataset.csv')

    # Create a dataset
    dataset = Dataset(df)

Compute the dataset Metadata

Before configuring your synthetic data generator, compute your dataset's Metadata. Metadata is used not only to optimize the fitting of your dataset but also to help you understand its structure, data quality, and properties before synthesis.

    from ydata.metadata import Metadata

    metadata = Metadata(dataset)
    print(metadata)

Initialize and Train the Synthesizer

Specify the feature(s) you want to condition on during the training phase.

    from ydata.synthesizers import RegularSynthesizer

    synthesizer = RegularSynthesizer()
    # condition_on can be defined as a single column or as a list of columns
    synthesizer.fit(dataset, condition_on=['col_name1', 'col_name2'])

Replace col_name1 and col_name2 with the name of the feature or features you wish to condition on.

Generate Conditional Synthetic Samples

Define the conditions and the desired distribution for the synthetic data.

    synthetic_data = synthesizer.sample(
        n_samples=1000,
        condition_on={
            # Categorical variable condition
            "col_name1": {
                "categories": [{
                    "category": 'cat_value',
                    "percentage": 0.7
                }]
            },
            # Numerical variable condition
            "col_name2": {
                "minimum": 55,
                "maximum": 60
            }
        }
    )
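
A malformed condition dictionary is easier to catch before calling sample than after. The sketch below checks the dictionary layout used in the example above. Note that validate_conditions is a hypothetical helper, not part of ydata-sdk. It assumes that percentages are fractions between 0 and 1 (as the 0.7 above suggests) and that the shares given for one column should not add up to more than 1; the SDK itself may apply different rules.

```python
from numbers import Real


def validate_conditions(condition_on, trained_columns):
    """Return a list of problems found in a condition dictionary.

    Hypothetical helper (not part of ydata-sdk). It only inspects the
    'categories' / 'minimum' / 'maximum' layout shown in this guide.
    """
    problems = []
    for column, spec in condition_on.items():
        # Assumption: only columns passed to fit(condition_on=...) can be used.
        if column not in trained_columns:
            problems.append(f"{column}: not among the columns passed at fit time")

        if "categories" in spec:
            total = 0.0
            for entry in spec["categories"]:
                share = entry.get("percentage")
                if not isinstance(share, Real) or not 0 <= share <= 1:
                    problems.append(
                        f"{column}: percentage for {entry.get('category')!r} "
                        "must be a number between 0 and 1"
                    )
                else:
                    total += share
            # Small tolerance so that shares like 0.1 + 0.2 + 0.7 pass.
            if total > 1 + 1e-9:
                problems.append(f"{column}: percentages sum to {total:.2f}, above 1")
        elif "minimum" in spec or "maximum" in spec:
            low, high = spec.get("minimum"), spec.get("maximum")
            if low is not None and high is not None and low > high:
                problems.append(f"{column}: minimum {low} is greater than maximum {high}")
        else:
            problems.append(f"{column}: no recognised condition keys")
    return problems


conditions = {
    "col_name1": {"categories": [{"category": "cat_value", "percentage": 0.7}]},
    "col_name2": {"minimum": 55, "maximum": 60},
}
for problem in validate_conditions(conditions, ["col_name1", "col_name2"]):
    print(problem)
```

An empty result means the dictionary is consistent with these assumed rules; it does not guarantee that the synthesizer will accept it.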

Adjust n_samples, the category percentages, and the numerical bounds to fit your requirements. If you would rather simply balance the representation of the conditioned columns, you can use the balancing parameter, as in the example below.

    synthetic_data = synthesizer.sample(
        n_samples=1000,
        balancing=True
    )
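
How balancing=True distributes samples internally is not described in this guide. To illustrate what an even representation means, the sketch below measures the observed category shares in a column and builds a categories condition, in the format shown earlier, that gives every category the same share. Both functions are hypothetical helpers, not part of ydata-sdk, and the result is not claimed to match what balancing=True produces.

```python
from collections import Counter


def observed_shares(values):
    """Return the fraction of rows that fall into each category."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("values must not be empty")
    return {category: count / total for category, count in counts.items()}


def equal_share_condition(values):
    """Build a 'categories' condition with the same share for every category.

    Hypothetical helper: it mirrors the dictionary layout used in this guide.
    """
    # dict.fromkeys keeps the first-seen order of the distinct categories.
    categories = list(dict.fromkeys(values))
    if not categories:
        raise ValueError("values must not be empty")
    share = 1 / len(categories)
    return {
        "categories": [
            {"category": category, "percentage": share} for category in categories
        ]
    }


# Toy, made-up column with an 80/20 imbalance.
labels = ["a"] * 8 + ["b"] * 2
print(observed_shares(labels))
print(equal_share_condition(labels))
```

The resulting dictionary could be placed under a column name in condition_on, assuming that column was also passed to fit.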

Best Practices

Feature Selection: Choose meaningful features for conditioning that have a significant impact on your analysis or model performance.
Data Quality: Ensure the original dataset is well-structured to train an effective synthesizer.
Evaluation: After generating synthetic data, evaluate its quality and impact on model performance to ensure it meets your objectives.
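
One simple evaluation is to check that the generated rows actually reflect the conditions you requested. The sketch below works on a plain list of dictionaries, such as the one pandas returns from DataFrame.to_dict("records"). The two functions are hypothetical helpers, not part of ydata-sdk, and the toy records are made up to stand in for real synthesizer output.

```python
def category_share(records, column, category):
    """Fraction of rows whose value in `column` equals `category`."""
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(1 for row in records if row[column] == category)
    return hits / len(records)


def share_within_range(records, column, minimum, maximum):
    """Fraction of rows whose value in `column` lies in [minimum, maximum]."""
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(1 for row in records if minimum <= row[column] <= maximum)
    return hits / len(records)


# Toy stand-in for generated data: 7 rows of 'cat_value', 3 rows of 'other'.
records = (
    [{"col_name1": "cat_value", "col_name2": 57}] * 7
    + [{"col_name1": "other", "col_name2": 58}] * 3
)

print(category_share(records, "col_name1", "cat_value"))
print(share_within_range(records, "col_name2", 55, 60))
```

Comparing these fractions with the requested percentage and bounds gives a quick sanity check; it complements, rather than replaces, an evaluation of downstream model performance.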

Conclusion

Conditional sampling enhances the capabilities of synthetic data generation by allowing for targeted and controlled data creation. Leveraging tools like the ydata-sdk's RegularSynthesizer facilitates this process, enabling users to address specific data challenges effectively.

For more details and a full example, please check the examples to get you started.