# Quickstart

YData SDK allows you to with an easy and familiar interface, to adopt a Data-Centric AI approach for the development of Machine Learning solutions. YData SDK features were designed to support structure data, including tabular data, time-series and transactional data.

To start leveraging the package features you should consume your data either through the Connectors or pandas.Dataframe. The list of available connectors can be found here [add a link].

    # Example for a Google Cloud Storage Connector
credentials = "{insert-credentials-file-path}"

# We create a new connector for Google Cloud Storage
connector = Connector(connector_type='gcs', credentials=credentials)

# Create a Datasource from the connector
# Note that a connector can be re-used for several datasources
X = DataSource(connector=connector, path='gs://<my_bucket>.csv')

    # Load a small dataset

# Init a synthesizer
synth = RegularSynthesizer()

# Train the synthesizer with the pandas Dataframe as input
# The data is then sent to the cluster for processing
synth.fit(X)


The synthesis process returns a pandas.DataFrame object. Note that if you are using the ydata-sdk free version, all of your data is sent to a remote cluster on YData's infrastructure.

## Data synthesis flow

The process of data synthesis can be described into the following steps:

stateDiagram-v2
init_synth --> train_synth
train_synth --> generate_samples
generate_samples --> [*]

The code snippet below shows how easy can be to start generating new synthetic data. The package includes a set of examples datasets for a quickstart.

    from ydata.sdk.dataset import get_dataset

X = get_dataset('census')

# Init a synthesizer
synth = RegularSynthesizer()

# Fit the synthesizer to the input data
synth.fit(X)

# Sample new synthetic data. The below request ask for new 1000 synthetic rows
synth.sample(n_samples=1000)


Do I need to prepare my data before synthesis?

The sdk ensures that the original behaviour is replicated. For that reason, there is no need to preprocess outlier observations or missing data.

By default all the missing data is replicated as NaN.