Tabular Synthesizer
ydata.synthesizers.RegularSynthesizer
Bases: BaseModel
The `RegularSynthesizer` is designed to learn patterns from real datasets
and generate synthetic data that maintains statistical properties while ensuring
privacy and security. It provides a simple API for training, sampling, saving,
and loading models.
Key Features
- *fit*: Learn from real data to create a generative model.
- *sample*: Produce high-quality synthetic data based on the trained model.
- *save*: Store a trained synthesizer for future use.
- *load*: Restore a previously trained synthesizer.
Usage Example
from ydata.synthesizers import RegularSynthesizer
# real_data is the Dataset to learn from; metadata is the Metadata object describing it
# Step 1: Train the model
synth = RegularSynthesizer()
synth.fit(real_data, metadata)
# Step 2: Generate synthetic data
synthetic_data = synth.sample(n_samples=1000)
# Step 3: Save the trained model
synth.save("model.pkl")
# Step 4: Load the trained model later
loaded_synth = RegularSynthesizer.load("model.pkl")
fit(X, metadata, *, condition_on=None, privacy_level=PrivacyLevel.HIGH_FIDELITY, calculated_features=None, anonymize=None, anonymize_ids=False, segment_by='auto', holdout_size=0.2, random_state=None)
Train the `RegularSynthesizer` on real tabular data.
This method learns patterns from the provided dataset (`X`) to build a generative
model capable of producing high-quality synthetic data. It allows for feature
extraction, handling missing values, and applying privacy controls.
- Handles missing values and applies anonymization if required.
- Supports conditional synthesis by segmenting data into meaningful groups.
- Integrates business rules through the calculated features to evaluate model performance.
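The bare `*` in the `fit` signature means that every option after `metadata` has to be passed by keyword. The sketch below illustrates only that calling convention. `StubSynthesizer` is a hypothetical stand-in that copies part of the parameter layout shown in the signature and performs no training; it is not part of ydata.

```python
# Hypothetical stand-in: mirrors part of the fit() parameter layout only.
class StubSynthesizer:
    def fit(self, X, metadata, *, condition_on=None, holdout_size=0.2,
            random_state=None):
        # Record the options instead of training anything.
        self.fit_options = {
            "condition_on": condition_on,
            "holdout_size": holdout_size,
            "random_state": random_state,
        }


synth = StubSynthesizer()

# Options are accepted when named explicitly.
synth.fit([[1, 2]], {"columns": ["a", "b"]}, holdout_size=0.1, random_state=42)

# A third positional argument is rejected because of the bare `*`.
try:
    synth.fit([[1, 2]], {"columns": ["a", "b"]}, "a")
    positional_accepted = True
except TypeError:
    positional_accepted = False
```

With the real class the same rule should apply: write `synth.fit(real_data, metadata, holdout_size=0.1)` rather than passing options by position.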
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`X` | `Dataset` | The real dataset used to train the synthesizer. | required |
`metadata` | `Metadata` | Object describing the dataset, including feature types and relationships. | required |
`calculated_features` | `list[dict[str, str \| Callable \| List[str]]] \| None` | List of computed features that should be derived before training, if provided. | `None` |
`anonymize` | `dict \| AnonymizerConfigurationBuilder \| None` | Configuration for anonymization strategies, such as hashing or generalization, if provided. | `None` |
`privacy_level` | `PrivacyLevel \| str` | Defines the trade-off between privacy and data fidelity. Options: | `HIGH_FIDELITY` |
`condition_on` | `Union[str, list[str]] \| None` | Enables conditional data generation by specifying key features to condition the model on. | `None` |
`anonymize_ids` | `bool` | If | `False` |
`segment_by` | `SegmentByType` | Defines how data should be segmented while training, based on a column or an automated decision. Options: | `'auto'` |
`holdout_size` | `float` | Percentage of data to hold out for model evaluation. Default is 0.2. | `0.2` |
`random_state` | `RandomSeed` | Set a seed for reproducibility. If | `None` |
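The default `holdout_size` of 0.2 is expressed as a fraction of the rows, not as a whole-number percentage. As a rough illustration of what that means for a dataset of a given size, the helper below computes the resulting row counts. `holdout_split_sizes` is a hypothetical name, and rounding down is an assumption; the documentation above does not say how the library rounds or selects the held-out rows.

```python
import math


def holdout_split_sizes(n_rows: int, holdout_size: float = 0.2) -> tuple[int, int]:
    """Return (training rows, held-out rows) for a fractional holdout size.

    Assumption: the held-out count is rounded down.
    """
    if not 0.0 <= holdout_size < 1.0:
        raise ValueError("holdout_size must be a fraction in [0, 1)")
    n_holdout = math.floor(n_rows * holdout_size)
    return n_rows - n_holdout, n_holdout


train_rows, holdout_rows = holdout_split_sizes(1000, 0.2)
```

Under these assumptions, a 1000-row dataset with the default setting leaves 800 rows for training and 200 for evaluation.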
Returns:
Type | Description |
---|---|
`None` | Trains the synthesizer in place. |
sample(n_samples=1, condition_on=None, balancing=False, random_state=None, connector=None, **kwargs)
Generate synthetic tabular data using the trained `RegularSynthesizer`.
This method generates new synthetic records that mimic the statistical properties of the original dataset. Users can optionally condition on specific features, apply balancing strategies, and define an output storage connector for direct integration with databases or cloud storage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`n_samples` | `int` | Number of synthetic records/rows to generate. Default is 1. | `1` |
`condition_on` | `list[ConditionalFeature] \| dict \| DataFrame \| None` | Condition the generator on specific feature values to create data with controlled distributions. | `None` |
`balancing` | `bool` | If | `False` |
`random_state` | `RandomSeed` | Set a random seed for reproducibility. Default is None. | `None` |
`connector` | `BigQueryConnector \| ObjectStorageConnector \| RDBMSConnector \| None` | If provided, the generated synthetic data is automatically stored in a cloud-based data warehouse or database. | `None` |
Returns:
Name | Type | Description |
---|---|---|
`sample` | `Dataset` | A Dataset object containing the synthetic samples. |
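A common pattern with `sample` is to fix `random_state` so that a run can be repeated. The sketch below shows that pattern with a hypothetical `StubSampler`, which accepts only the `n_samples` and `random_state` arguments from the signature above and returns plain dictionaries instead of a `Dataset`. The helper `sample_reproducibly` is also illustrative. Whether the real synthesizer returns identical rows for identical seeds is the stated purpose of `random_state`, but it is not verified here.

```python
import random


class StubSampler:
    """Hypothetical stand-in exposing sample(n_samples, random_state) only."""

    def sample(self, n_samples=1, random_state=None):
        # A seeded generator makes the fake rows repeatable.
        rng = random.Random(random_state)
        return [{"age": rng.randint(18, 90)} for _ in range(n_samples)]


def sample_reproducibly(synth, n_samples, seed):
    """Call synth.sample with an explicit row count and a fixed seed."""
    return synth.sample(n_samples=n_samples, random_state=seed)


stub = StubSampler()
first = sample_reproducibly(stub, 5, seed=7)
second = sample_reproducibly(stub, 5, seed=7)
```

Because the helper only relies on the `sample(n_samples=..., random_state=...)` keywords, a trained `RegularSynthesizer` could be passed in place of the stub.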