
Tabular Synthesizer

ydata.synthesizers.RegularSynthesizer

Bases: BaseModel

The RegularSynthesizer is designed to learn patterns from real datasets and generate synthetic data that maintains statistical properties while ensuring privacy and security. It provides a simple API for training, sampling, saving, and loading models.

Key Features
  • fit: Learn from real data to create a generative model.
  • sample: Produce high-quality synthetic data based on the trained model.
  • save: Store a trained synthesizer for future use.
  • load: Restore a previously trained synthesizer.
Usage Example
from ydata.synthesizers import RegularSynthesizer

# Step 1: Train the model
# real_data is a Dataset; metadata is the Metadata object describing it
synth = RegularSynthesizer()
synth.fit(real_data, metadata)

# Step 2: Generate synthetic data
synthetic_data = synth.sample(n_samples=1000)

# Step 3: Save the trained model
synth.save("model.pkl")

# Step 4: Load the trained model later
loaded_synth = RegularSynthesizer.load("model.pkl")

fit(X, metadata, *, condition_on=None, privacy_level=PrivacyLevel.HIGH_FIDELITY, calculated_features=None, anonymize=None, anonymize_ids=False, segment_by='auto', holdout_size=0.2, random_state=None)

Train the RegularSynthesizer on real tabular data.

This method learns patterns from the provided dataset (X) to build a generative model capable of producing high-quality synthetic data. It allows for feature extraction, handling missing values, and applying privacy controls.

  • Handles missing values and applies anonymization if required.
  • Supports conditional synthesis by segmenting data into meaningful groups.
  • Integrates business rules through calculated features and holds out part of the data to evaluate model performance.
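
To illustrate the business-rule idea above, the sketch below recomputes a derived column from its input columns using a list of specifications. It works on plain Python dictionaries rather than a Dataset, and the key names (`name`, `function`, `inputs`) and the helper `apply_calculated_features` are hypothetical; this page does not state the exact keys that `calculated_features` expects.

```python
def total_price(quantity, unit_price):
    """Business rule: the total is always quantity times unit price."""
    return quantity * unit_price


def apply_calculated_features(rows, specs):
    """Return copies of the rows with each derived column recomputed.

    The spec keys used here are assumptions made for illustration only.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for spec in specs:
            args = [new_row[col] for col in spec["inputs"]]
            new_row[spec["name"]] = spec["function"](*args)
        result.append(new_row)
    return result


specs = [
    {"name": "total", "function": total_price, "inputs": ["quantity", "unit_price"]},
]
rows = [
    {"quantity": 2, "unit_price": 3.5},
    {"quantity": 1, "unit_price": 10.0},
]
derived = apply_calculated_features(rows, specs)
```

Declaring the rule as a function plus its input columns keeps the derived column consistent with its sources, which is the kind of constraint a calculated feature is meant to express.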

Parameters:

Name Type Description Default
X Dataset

The real dataset used to train the synthesizer.

required
metadata Metadata

Metadata object describing the dataset, including feature types and relationships.

required
calculated_features list[dict[str, str | Callable | List[str]]] | None

List of computed features to derive before training, if provided.

None
anonymize dict | AnonymizerConfigurationBuilder | None

Configuration for anonymization strategies, such as hashing or generalization, if provided.

None
privacy_level PrivacyLevel | str

Defines the trade-off between privacy and data fidelity. Options: "HIGH_FIDELITY", "BALANCED_PRIVACY_FIDELITY", "HIGH_PRIVACY". Defaults to "HIGH_FIDELITY".

HIGH_FIDELITY
condition_on Union[str, list[str]] | None

Enables conditional data generation by specifying key features to condition the model on.

None
anonymize_ids bool

If True, automatically anonymizes columns of type ID. Defaults to False.

False
segment_by SegmentByType

Defines how data should be segmented while training, based on a column or an automated decision. Options: "auto" (default).

'auto'
holdout_size float

Fraction of the data to hold out for model evaluation. Defaults to 0.2 (20%).

0.2
random_state RandomSeed

Seed for reproducibility. If None, no seed is set and results vary between runs.

None
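
As a rough illustration of how `holdout_size` and `random_state` interact, the sketch below shuffles rows with an optional seed and sets aside a fraction of them. It is not the library's internal split; the helper `split_holdout` is hypothetical and uses only the standard library.

```python
import random


def split_holdout(rows, holdout_size=0.2, random_state=None):
    """Shuffle rows with an optional seed and split off a holdout fraction."""
    if not 0.0 <= holdout_size < 1.0:
        raise ValueError("holdout_size must be in [0, 1)")
    rng = random.Random(random_state)  # None seeds from system entropy
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_size))
    # First n_holdout shuffled rows are held out; the rest are used for training.
    return shuffled[n_holdout:], shuffled[:n_holdout]


rows = list(range(100))
train_a, holdout_a = split_holdout(rows, holdout_size=0.2, random_state=42)
train_b, holdout_b = split_holdout(rows, holdout_size=0.2, random_state=42)
```

With the same seed, both calls produce the same split; with `random_state=None`, the split generally differs from run to run.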

Returns:

Name Type Description
None

Trains the synthesizer in place.

sample(n_samples=1, condition_on=None, balancing=False, random_state=None, connector=None, **kwargs)

Generate synthetic tabular data using the trained RegularSynthesizer.

This method generates new synthetic records that mimic the statistical properties of the original dataset. Users can optionally condition on specific features, apply balancing strategies, and define an output storage connector for direct integration with databases or cloud storage.

Parameters:

Name Type Description Default
n_samples int

Number of synthetic records/rows to generate. Default is 1.

1
condition_on list[ConditionalFeature] | dict | DataFrame | None

Condition the generator on specific feature values to create data with controlled distributions.

None
balancing bool

If True, balances sampling across the defined conditional features. Default is False.

False
random_state RandomSeed

Seed for reproducibility. Default is None (no seed is set, so output varies between runs).

None
connector BigQueryConnector | ObjectStorageConnector | RDBMSConnector | None

If provided, the generated synthetic data is automatically written to the connected data warehouse, object storage, or database.

None
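
This page does not describe how `balancing=True` allocates rows. One plausible reading is that each category of a conditional feature receives an equal share of `n_samples`; the sketch below shows that arithmetic with a hypothetical helper, `balanced_counts`, and should not be taken as the library's actual behavior.

```python
def balanced_counts(categories, n_samples):
    """Split n_samples as evenly as possible across the given categories."""
    if not categories:
        raise ValueError("at least one category is required")
    base, remainder = divmod(n_samples, len(categories))
    # The first `remainder` categories each receive one extra row.
    return {
        cat: base + (1 if i < remainder else 0)
        for i, cat in enumerate(categories)
    }


counts = balanced_counts(["A", "B", "C"], n_samples=1000)
```

Under this assumption, a request for 1000 rows over three categories yields counts that differ by at most one and sum to `n_samples`.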

Returns:

Name Type Description
sample Dataset

A Dataset object containing the synthetic samples.