Faker Synthesizer
ydata.synthesizers.FakerSynthesizer
A synthesizer for generating synthetic data based on user-defined configurations.
The FakerSynthesizer
allows users to create synthetic tabular data without needing
an existing dataset. Instead, it generates data based on user-provided metadata
or manually defined column configurations. This approach is useful for:
- Creating mock datasets for testing and development.
- Generating data prototypes before real data is available.
- Ensuring privacy-preserving synthetic data without reference to actual records.
Key Features:
- Metadata-Driven Generation: Generates synthetic data based on predefined Metadata.
- Customizable Column Types: Supports user-defined column structures.
- Multi-Language Support: Uses locale settings to generate realistic names, addresses, etc.
Example Usage:
from ydata.synthesizers import FakerSynthesizer
# Initialize the synthesizer with a specific locale
faker_synth = FakerSynthesizer(locale="en") # English data generation
faker_synth.fit(metadata)
# Generate synthetic data
synthetic_data = faker_synth.sample(n_samples=1000)
fit(metadata)
Configure the FakerSynthesizer
using provided metadata.
This method sets up the synthesizer by defining the structure of the synthetic
dataset based on the given Metadata
. The metadata can either be:
- Computed: Automatically extracted from an existing dataset.
- User-Defined: Manually constructed to specify custom column types and distributions.
Once fit()
is called, the synthesizer will use this metadata to generate
structured synthetic data that adheres to the defined schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
metadata
|
Metadata
|
A metadata object describing the structure of the synthetic dataset, including: - Column names and data types. - Faker-based data generators (e.g., names, addresses, emails). - Value constraints (e.g., numeric ranges, categorical options). |
required |
sample(sample_size=1000)
Generate a synthetic dataset based on the configured metadata.
This method produces synthetic data according to the schema defined in the
fit()
step. The generated data adheres to the column types, constraints,
and distributions specified in the provided Metadata
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sample_size
|
int
|
The number of synthetic records/rows to generate. Defaults to |
1000
|
Returns:
Name | Type | Description |
---|---|---|
dataset |
Dataset
|
A Dataset object with the generated synthetic records/rows. |
save(path)
Saves the SYNTHESIZER and the models fitted per variable.