
Time-Series Synthesizer

ydata.synthesizers.TimeSeriesSynthesizer

Bases: BaseModel

Unlike the RegularSynthesizer, the TimeSeriesSynthesizer is designed to capture and replicate temporal relationships within entities over time. It learns from sequential patterns in the data and generates synthetic time-series records that preserve trends, seasonality, and correlations per entity.

Additionally, this synthesizer can augment datasets by increasing the number of unique entities while maintaining realistic temporal behavior.

Key Features
  • Time-Aware Training (fit): Learns entity-level sequential dependencies and trends over time.
  • Pattern-Preserving Sampling (sample): Generates synthetic time-series data that mimics real-world time progression.
  • Entity Augmentation: Expands the dataset by generating additional synthetic entities with realistic time patterns.
  • Time Window Processing: Operates on an N-entity time window to model time dependencies effectively.
  • Model Persistence (save & load): Store and restore trained synthesizers for future use.

To define a single-entity time series, the following Metadata configuration is required:

    dataset_attrs = {
        "sortbykey": "date",
    }

    metadata = Metadata(dataset, dataset_type=DatasetType.TIMESERIES, dataset_attrs=dataset_attrs)
A multi-entity time series additionally requires the metadata dataset attributes to specify at least one column as an entity ID. For instance, the following example specifies two entity ID columns:

    dataset_attrs = {
        "sortbykey": "date",
        "entities": ["entity", "entity_2"],
    }

    metadata = Metadata(dataset, dataset_type=DatasetType.TIMESERIES, dataset_attrs=dataset_attrs)
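The dataset behind such a configuration must contain the sort-key column and every entity ID column named in dataset_attrs. A minimal sketch of a matching table, assuming a pandas DataFrame as the underlying data and purely illustrative column names:

```python
import pandas as pd

# Two entities (keyed by the pair entity/entity_2), each observed on the same dates.
data = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"] * 2),
    "entity": ["A", "A", "B", "B"],
    "entity_2": ["x", "x", "y", "y"],
    "value": [1.0, 1.5, 2.0, 2.4],
})

dataset_attrs = {
    "sortbykey": "date",
    "entities": ["entity", "entity_2"],
}

# Every column referenced in dataset_attrs must exist in the data.
referenced = {dataset_attrs["sortbykey"], *dataset_attrs["entities"]}
assert referenced <= set(data.columns)
```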

Usage Example
from ydata.synthesizers import TimeSeriesSynthesizer

# Step 1: Train the model with time-series data
synth = TimeSeriesSynthesizer()
synth.fit(data, metadata)

# Step 2: Generate synthetic time-series data
synthetic_data = synth.sample(n_entities=10)

# Step 3: Save the trained model
synth.save("timeseries_model.pkl")

# Step 4: Load the trained model later
loaded_synth = TimeSeriesSynthesizer.load("timeseries_model.pkl")

fit(X, metadata, extracted_cols=None, calculated_features=None, anonymize=None, privacy_level=PrivacyLevel.HIGH_FIDELITY, condition_on=None, anonymize_ids=False, segment_by='auto', random_state=None)

Train the TimeSeriesSynthesizer on real time-series data.

This method learns patterns, dependencies, and sequential behaviors from the input dataset (X) while preserving the relationships between entities over time. The synthesizer processes time-dependent features and constructs a generative model capable of producing realistic time-series data.

Parameters:

  • X (Dataset, required): Input dataset.
  • metadata (Metadata, required): Metadata instance.
  • extracted_cols (list[str], default None): List of columns to extract data from.
  • calculated_features (list[dict[str, str | …]], default None): Additional business rules to be enforced in the generated synthetic dataset.
  • anonymize (Optional[dict | AnonymizerConfigurationBuilder], default None): Anonymization strategies for sensitive fields, leveraging ydata's AnonymizerEngine.
  • privacy_level (str | PrivacyLevel, default HIGH_FIDELITY): Trade-off between privacy and data fidelity. Options: "HIGH_FIDELITY", "BALANCED_PRIVACY_FIDELITY", "HIGH_PRIVACY".
  • condition_on (Union[str, list[str]], default None): Enables conditional data generation by specifying the key features to condition the model on.
  • anonymize_ids (bool, default False): If True, automatically anonymizes columns of type ID.
  • segment_by (str | list | "auto", default "auto"): Defines how the data is segmented during training, based on a column or an automated decision.
  • random_state (Optional, default None): Seed for reproducibility. If None, randomness is used.

sample(n_entities=None, smoothing=False, fidelity=None, sort_result=True, condition_on=None, balancing=False, random_state=None, connector=None, **kwargs)

Generate a time series.

This method generates a new time series. The instance must be trained via the fit method before sample is called. The generated time series has the same length as the training data; for multi-entity time series, however, the number of entities can be increased via the n_entities parameter.

For a multi-entity sample, there are two major arguments that can be used to modify the results: fidelity and smoothing.

  1. Fidelity: Defines how close the new entities should be to the original ones. When given as a float, it represents the behavioral noise added to each entity, expressed as a percentage of its variance. See ydata.synthesizer.entity_augmenter.FidelityConfig for more details.
  2. Smoothing: Defines whether and how the trajectories of the new entities are smoothed. See ydata.synthesizer.entity_augmenter.SmoothingConfig for more details.
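The two knobs can be illustrated with plain NumPy. This is only a conceptual sketch of the documented behavior (noise proportional to an entity's variance, followed by optional smoothing of the trajectory), not the library's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.cumsum(rng.normal(size=200))  # one entity's trajectory

# Fidelity as a float: behavioral noise expressed as a fraction of the
# entity's variance (a smaller value keeps entities closer to the original).
fidelity = 0.1
noise = rng.normal(scale=np.sqrt(fidelity * original.var()), size=original.shape)
new_entity = original + noise

# Smoothing: damp the new trajectory, here with a simple moving average.
window = 5
kernel = np.ones(window) / window
smoothed = np.convolve(new_entity, kernel, mode="same")
```

Passing a dict or a FidelityConfig/SmoothingConfig instance instead of a float/bool allows finer control over these behaviors.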

Parameters:

  • n_entities (Optional[int], default None): Number of entities to sample. If None, generates as many entities as in the training data.
  • smoothing (Union[bool, dict, SmoothingConfig], default False): Defines whether and how the new trajectories are smoothed. True uses the automatic configuration.
  • fidelity (Optional[Union[float, dict, FidelityConfig]], default None): Defines the fidelity policy.
  • sort_result (bool, default True): If True, the sample is sorted by the sortbykey column.
  • condition_on (list[ConditionalFeature] | dict | DataFrame | None, default None): Conditional rules to be applied.
  • balancing (bool, default False): If True, the categorical features included in the conditional rules are generated with equally distributed frequencies.
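The effect of balancing can be sketched in plain Python: instead of drawing a conditioned categorical feature according to its skewed empirical distribution, each category is drawn with equal probability. A conceptual sketch only, with hypothetical category names, not ydata's implementation:

```python
import random
from collections import Counter

random.seed(0)
categories = ["churned", "active"]  # a hypothetical conditioned feature

# balancing=False: sample according to the (skewed) training distribution.
training_dist = {"churned": 0.1, "active": 0.9}
skewed = random.choices(
    categories, weights=[training_dist[c] for c in categories], k=10_000
)

# balancing=True: each category appears with (roughly) equal frequency.
balanced = random.choices(categories, k=10_000)

print(Counter(skewed))    # heavily tilted toward "active"
print(Counter(balanced))  # roughly 50/50
```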

Returns:

  • Dataset: The generated synthetic time-series dataset.