Tabular Synthesizer

`ydata.synthesizers.RegularSynthesizer`

Bases: BaseModel

The RegularSynthesizer is designed to learn patterns from real datasets and generate synthetic data that maintains statistical properties while ensuring privacy and security. It provides a simple API for training, sampling, saving, and loading models.

Key Features

*fit*: Learn from real data to create a generative model.
*sample*: Produce high-quality synthetic data based on the trained model.
*save*: Store a trained synthesizer for future use.
*load*: Restore a previously trained synthesizer.

Usage Example

from ydata.synthesizers import RegularSynthesizer

# Step 1: Train the model
# real_data is a Dataset and metadata is a Metadata object (see the fit parameters)
synth = RegularSynthesizer()
synth.fit(real_data, metadata)

# Step 2: Generate synthetic data
synthetic_data = synth.sample(n_samples=1000)

# Step 3: Save the trained model
synth.save("model.pkl")

# Step 4: Load the trained model later
loaded_synth = RegularSynthesizer.load("model.pkl")

`fit(X, metadata, *, condition_on=None, privacy_level=PrivacyLevel.HIGH_FIDELITY, calculated_features=None, anonymize=None, anonymize_ids=False, segment_by='auto', holdout_size=0.2, random_state=None)`

Train the RegularSynthesizer on real tabular data.

This method learns patterns from the provided dataset (X) to build a generative model capable of producing high-quality synthetic data. It allows for feature extraction, handling missing values, and applying privacy controls.

Handles missing values and applies anonymization if required.
Supports conditional synthesis by segmenting data into meaningful groups.
Integrates business rules through calculated features and holds out part of the data to evaluate model performance.
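
As a rough sketch of how these options come together in a single call, the helper below passes a few of the keyword-only arguments from the signature above. The option values are illustrative choices, and `_RecordingSynth` is a hypothetical stand-in (not part of ydata) so that the snippet runs without the library installed. In practice `synth` would be a `RegularSynthesizer`, and `dataset` and `metadata` would be `Dataset` and `Metadata` objects.

```python
def fit_with_options(synth, dataset, metadata):
    """Train `synth` using keyword names taken from the documented signature."""
    synth.fit(
        dataset,
        metadata,
        privacy_level="BALANCED_PRIVACY_FIDELITY",  # one of the three documented options
        anonymize_ids=True,   # anonymize ID-typed columns
        holdout_size=0.2,     # hold out 20% of the data for evaluation
        random_state=42,      # assumption: an integer seed is accepted
    )
    return synth


class _RecordingSynth:
    """Hypothetical stand-in that only records the options it receives."""

    def fit(self, X, metadata, **options):
        self.options = options


# Dry run with the stand-in; no real training happens here.
recorder = fit_with_options(_RecordingSynth(), dataset=None, metadata=None)
print(sorted(recorder.options))
```

Everything after `metadata` is keyword-only in the signature, so each option has to be passed by name as shown.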

Parameters:

Name	Type	Description	Default
`X`	`Dataset`	The real dataset used to train the synthesizer.	required
`metadata`	`Metadata`	Metadata object describing the dataset, including feature types and relationships.	required
`calculated_features`	`list[dict[str, str \| Callable \| List[str]]] \| None`	List of computed features to derive before training.	`None`
`anonymize`	`dict \| AnonymizerConfigurationBuilder \| None`	Configuration for anonymization strategies, such as hashing or generalization.	`None`
`privacy_level`	`PrivacyLevel \| str`	Defines the trade-off between privacy and data fidelity. Options: `"HIGH_FIDELITY"`, `"BALANCED_PRIVACY_FIDELITY"`, `"HIGH_PRIVACY"`.	`HIGH_FIDELITY`
`condition_on`	`Union[str, list[str]] \| None`	Features to condition the model on, enabling conditional data generation.	`None`
`anonymize_ids`	`bool`	If `True`, automatically anonymizes columns of type ID.	`False`
`segment_by`	`SegmentByType`	Defines how data is segmented during training: by a column or, with `"auto"`, by an automated decision.	`'auto'`
`holdout_size`	`float`	Fraction of the data held out for model evaluation, e.g. `0.2` for 20%.	`0.2`
`random_state`	`RandomSeed`	Seed for reproducibility. If `None`, results vary between runs.	`None`
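
To make `holdout_size` and `random_state` concrete, here is a minimal plain-Python sketch of a seeded holdout split. It only illustrates the arithmetic (a fraction of 0.2 on 1000 rows leaves 800 for training and 200 for evaluation) and why a fixed seed gives a repeatable split. The function `holdout_split` is hypothetical, and how the synthesizer actually partitions the data internally is not described here.

```python
import random


def holdout_split(row_ids, holdout_size=0.2, seed=None):
    """Shuffle row ids and split them into (train, holdout) lists."""
    rng = random.Random(seed)      # seeded generator gives a repeatable shuffle
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_size)
    return shuffled[n_holdout:], shuffled[:n_holdout]


train, holdout = holdout_split(range(1000), holdout_size=0.2, seed=7)
print(len(train), len(holdout))  # 800 200
```

Calling the function again with the same seed returns the same two lists, while `seed=None` produces a different split on each run.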

Returns:

Name	Type	Description
`None`		Trains the synthesizer in place.

`sample(n_samples=1, condition_on=None, balancing=False, random_state=None, connector=None, **kwargs)`

Generate synthetic tabular data using the trained RegularSynthesizer.

This method generates new synthetic records that mimic the statistical properties of the original dataset. Users can optionally condition on specific features, apply balancing strategies, and define an output storage connector for direct integration with databases or cloud storage.

Parameters:

Name	Type	Description	Default
`n_samples`	`int`	Number of synthetic records (rows) to generate.	`1`
`condition_on`	`list[ConditionalFeature] \| dict \| DataFrame \| None`	Condition the generator on specific feature values to create data with controlled distributions.	`None`
`balancing`	`bool`	If `True`, balances sampling across the defined conditional features.	`False`
`random_state`	`RandomSeed`	Random seed for reproducibility. If `None`, output varies between calls.	`None`
`connector`	`BigQueryConnector \| ObjectStorageConnector \| RDBMSConnector \| None`	If provided, the generated synthetic data is automatically written to the connected data warehouse, object storage, or database.	`None`

Returns:

Name	Type	Description
`sample`	`Dataset`	A Dataset object containing the synthetic samples.