Welcome to the YData SDK documentation

YData SDK is a Python package for Data & AI that provides an ecosystem of methods for data-centric development, an approach focused on improving the quality of the data. The library includes a set of integrated components for data ingestion, standardized data quality evaluation, data improvement, and synthetic data generation, supporting iterative improvement of the data used in high-impact business applications.

YData SDK for improved data quality and synthetic data!

Get your license key at ydata.ai/register

Benefits

The YData SDK interface lets you integrate data quality tooling with other platforms, offering several benefits for data science development and data management:

  • Next-gen features: YData SDK provides state-of-the-art tooling for advanced data quality profiling, metadata management and manipulation, and synthetic data generation.
  • Collaboration: easy integration with a multitude of tools and services reduces the need to reinvent the wheel and fosters a collaborative environment for all developers (data scientists, data engineers, software developers, etc.).
  • Improved usage experience: YData SDK provides a well-integrated software solution that allows a seamless transition between different tools or platforms without compatibility issues.
  • Interoperability: seamless integration with other data platforms and systems, such as Databricks and Snowflake. This helps your software work cohesively with the other elements of your data architecture.

Current functionality

YData SDK is currently composed of the following main modules:

  • Connectors

    • The YData SDK includes several connectors for easy integration with existing data sources. It supports several storage types, such as filesystems and RDBMS. Check the list of connectors.
    • The SDK’s Datasources run on top of Dask, which lets them handle not only small workloads but also larger volumes of data.
  • Profiling

    • The comprehensive profile report includes a set of metrics and algorithms that summarize dataset quality in three main dimensions: warnings, univariate analysis, and a multivariate perspective.
  • Synthetic Data

    • Simplified Training Interface: Easily train generative models to learn and replicate the behavior, patterns, and distribution of your original dataset. Tailor your model to prioritize either privacy or utility, depending on your specific use case.
    • On-Demand Data Generation: Once your synthetic data generator is trained, you can produce synthetic samples as needed. Customize the output by specifying the exact number of records required.
    • Privacy Assurance: Built-in anonymization and privacy-preserving features ensure that synthetic datasets are free from Personally Identifiable Information (PII), making them safe to share and use.
    • Conditional Sampling: Apply constraints to control the domain and values of specific features in the generated data, enabling more targeted and relevant synthetic datasets.
  • Synthetic data quality report

    • An extensive synthetic data quality report that measures three dimensions of the generated data: privacy, utility, and fidelity. The report can be downloaded as a PDF for ease of sharing and compliance purposes, or as JSON for integration into data flows.
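
The train-then-sample workflow and the conditional sampling described in the Synthetic Data items above can be illustrated with a toy sketch that uses only the Python standard library. The names `ToyColumnSampler`, `fit`, `sample`, and `condition_on` are hypothetical and chosen for illustration; they are not the YData SDK API. The per-column frequency model is a deliberately simple stand-in for a real generative model.

```python
import random
from collections import Counter


class ToyColumnSampler:
    """Hypothetical stand-in for a synthetic data generator (not the YData SDK API).

    It learns each column's empirical value frequencies independently, so it
    ignores the cross-column correlations a real generative model would capture.
    """

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._columns = {}  # column name -> (values, weights)

    def fit(self, records):
        """Learn per-column value frequencies from a list of dict records."""
        if not records:
            raise ValueError("records must not be empty")
        for name in records[0]:
            counts = Counter(row[name] for row in records)
            self._columns[name] = (list(counts.keys()), list(counts.values()))
        return self

    def sample(self, n_samples, condition_on=None):
        """Draw n_samples records; condition_on maps a column to its allowed values."""
        condition_on = condition_on or {}
        rows = []
        for _ in range(n_samples):
            row = {}
            for name, (values, weights) in self._columns.items():
                allowed = condition_on.get(name)
                if allowed is not None:
                    # Keep only learned values that satisfy the constraint.
                    pairs = [(v, w) for v, w in zip(values, weights) if v in allowed]
                    if not pairs:
                        raise ValueError(f"no learned values of {name!r} satisfy the condition")
                    vals, wts = zip(*pairs)
                else:
                    vals, wts = values, weights
                row[name] = self._rng.choices(vals, weights=wts, k=1)[0]
            rows.append(row)
        return rows


# Illustrative data only; not taken from any real dataset.
original = [
    {"country": "PT", "plan": "free"},
    {"country": "PT", "plan": "pro"},
    {"country": "US", "plan": "free"},
    {"country": "US", "plan": "free"},
]

generator = ToyColumnSampler(seed=0).fit(original)                       # train once
synthetic = generator.sample(n_samples=5)                                # generate on demand
only_pt = generator.sample(n_samples=3, condition_on={"country": {"PT"}})  # constrain a feature
```

Training happens once in `fit`, after which `sample` can be called repeatedly with any record count. Applying the constraint at sampling time, rather than at training time, is what lets one trained generator serve several targeted requests.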

Supported data formats

  • Tabular data (synthetic data generator): The RegularSynthesizer is well suited to synthesizing high-dimensional, time-independent data with high-quality results.
  • Time-series data (synthetic data generator): The TimeSeriesSynthesizer is well suited to synthesizing both regularly and irregularly spaced time series, from smart-sensor readings to stock data. It also supports transactional data, which is known to have highly irregular time intervals between records and directional relations between entities.
  • Relational databases (synthetic data generator): The MultiTableSynthesizer learns to replicate the data within a relational database schema.