Synthesize time-series data
Use YData's TimeSeriesSynthesizer to generate time-series synthetic data
Tabular data is the most common type of data we encounter in data problems.
When thinking about tabular data, we assume independence between different records, but this does not happen in reality. Suppose we check events from our day-to-day life, such as room temperature changes, bank account transactions, stock price fluctuations, and air quality measurements in our neighborhood. In that case, we might end up with datasets where measures and records evolve and are related through time. This type of data is known to be sequential or time-series data.
Thus, sequential or time-series data refers to any data containing elements ordered into sequences in a structured format. Dissecting any time-series dataset, we see differences in variables' behavior that need to be understood for an effective generation of synthetic data. Typically any time-series dataset is composed of the following:
- Variables that define the order of time (these can be simple with one variable or composed)
- Time-variant variables
- Variables that refer to entities (single or multiple entities)
- Variables that are attributes (those that don't depend on time but rather on the entity)
Below find an example:
from ydata.sdk.dataset import get_dataset
from ydata.sdk.synthesizers import TimeSeriesSynthesizer
# Do not forget to add your token as env variable
os.environ["YDATA_TOKEN"] = '<TOKEN>'
X = get_dataset('occupancy')
# We initialize a time series synthesizer
# As long as the synthesizer does not call `fit`, it exists only locally
synth = TimeSeriesSynthesizer()
# We train the synthesizer on our dataset
# sortbykey -> variable that define the time order for the sequence
# By default it is requested a synthetic sample with the same length as the original data
# The TimeSeriesSynthesizer is designed to replicate temporal series and therefore the original time-horizon is respected
sample = synth.sample(n_entities=1)