Welcome to YData SDK Documentation

Overview

Get Started with YData SDK

Get your license key at ydata.ai/register

YData SDK is the leading Python package for Data & AI, providing an ecosystem of methods that enables data professionals to adopt a data-centric development approach focused on improving data quality. The library includes integrated components for:

- Data Ingestion: Connect to various data sources seamlessly
- Data Quality Evaluation: Standardized metrics and assessments
- Data Improvement: Tools for enhancing dataset quality
- Synthetic Data Generation: Create high-quality synthetic datasets

🚀 What’s New in the Latest Release

We’re excited to introduce support for text and unstructured data, unlocking new possibilities for working with Large Language Models (LLMs) and foundation models. This major release includes:

- QAGenerator: Automatically generate high-quality question-answer pairs from documents for evaluation, benchmarking, or RAG pipelines.
- DocumentGenerator: Generate synthetic internal documents (PDF, DOCX, HTML) for use in AI workflows, data anonymization, or compliance testing.

Whether you're working on LLM evaluation, red-teaming, or training models in regulated environments, YData SDK is now your go-to platform for generating synthetic documents and question-answer pairs from text data.

📦 To get started with all features, run:

pip install "ydata-sdk[text,docx]"
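
After installing, it can be useful to confirm that the package is visible to the interpreter you plan to use. The snippet below is a minimal sketch that relies only on the Python standard library; the helper name `installed_version` is hypothetical and is not part of YData SDK. It reports the installed version of the `ydata-sdk` distribution but does not verify that the `text` and `docx` extras were installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "ydata-sdk") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is a hypothetical helper for illustration only.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print('ydata-sdk is not installed; run: pip install "ydata-sdk[text,docx]"')
    else:
        print(f"ydata-sdk {found} is installed")
```

Running the check in the same virtual environment as your project catches the common case where `pip` installed the package into a different interpreter.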

Key Benefits

YData SDK offers several advantages for AI, data science development and data management:

Next-Gen Features
- State-of-the-art data quality profiling
- Advanced metadata management
- Leading synthetic data generation technology for structured and unstructured data

Enhanced Collaboration
- Seamless integration with multiple tools and services
- Unified environment for all developers
- Reduced development overhead

Improved Developer Experience
- Well-integrated software solution
- Seamless transitions between tools
- Consistent compatibility

Enterprise Interoperability
- Native integration with major platforms (Databricks, Snowflake)
- Cohesive data architecture support
- Enterprise-grade reliability

Core Functionality

1. Connectors

2. Metadata

3. Data Profiling

4. Synthetic Data

5. Data Anonymization

Supported Data Formats

- Tabular Data
- Time-Series Data
- Relational Databases
- Text Corpus
- Documents

Tabular data (synthetic data generator): The RegularSynthesizer is designed for high-dimensional, time-independent data synthesis with high-quality results.

Time-series data (synthetic data generator): The TimeSeriesSynthesizer handles both regular and irregular time-series data, from smart sensors to stock market data, including transactional data recorded at irregular intervals.

Relational databases (synthetic data generator): The MultiTableSynthesizer replicates complex relational database schemas while maintaining data integrity and relationships.

Text corpus: The TextSynthesizer and QASynthesizer excel at generating privacy-preserving text corpora and question-answer pairs for LLM fine-tuning and evaluation.

Documents: The DocumentSynthesizer excels at replicating complex custom internal documents while maintaining data consistency and content relevance.
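
For relational data, "maintaining data integrity and relationships" includes referential integrity: every foreign key in a synthetic child table should point at a row that exists in the synthetic parent table. The sketch below shows one way to check that property on generated output using plain Python. It does not use or represent the YData SDK API; the function `orphaned_keys`, the table names, and the sample rows are all made up for illustration.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def orphaned_keys(
    parent_rows: List[Row],
    child_rows: List[Row],
    parent_key: str,
    foreign_key: str,
) -> Set[Any]:
    """Return foreign-key values in child_rows that have no matching parent row.

    Hypothetical helper for illustration; not part of YData SDK.
    """
    parent_ids = {row[parent_key] for row in parent_rows}
    return {
        row[foreign_key]
        for row in child_rows
        if row[foreign_key] not in parent_ids
    }


# Toy stand-ins for two synthetic tables (illustrative values only).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # no such customer: an orphaned reference
]

print(orphaned_keys(customers, orders, "customer_id", "customer_id"))  # {3}
```

An empty result means every child row references an existing parent, which is the behaviour a relational synthesizer is expected to preserve.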