Getting Started with YData SDK
The ydata-sdk is a Python package for data access, processing, and synthetic data generation within the YData ecosystem. It lets users manage datasets, run profiling, and generate synthetic data for analytics, machine learning, and data privacy applications.
Core Capabilities
The SDK is structured into six key areas, each designed to address specific data management needs:
1. Connectors
- Data Source Integration
- Connect to various databases (SQL, data warehouses, lakehouses)
- Access cloud storage (S3, Azure, GCP)
- Handle local file systems
- Streamlined Data Access
- Unified interface for all data sources
- Optimized data loading
- Efficient memory management
2. Metadata
- Data Understanding
- Extract comprehensive dataset metadata
- Analyze data quality metrics
- Track data lineage
- Enhanced Management
- Automated metadata collection
- Version control for datasets
- Quality monitoring
3. Profiling
- Comprehensive Analysis
- Statistical profiling and analysis
- Data quality assessment
- Pattern and anomaly detection
- Visualization
- Interactive data visualizations
- Distribution analysis
- Correlation insights
- Automated Reporting
- Quality score generation
- Data drift monitoring
- Actionable recommendations
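To make "statistical profiling" and "data quality assessment" concrete, the following is a minimal conceptual sketch in plain Python. It does not use the ydata-sdk API; the function name `profile_column` and the sample rows are hypothetical and exist only to illustrate the kind of per-column summary a profiling step typically produces.

```python
import statistics


def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, basic stats."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Only compute numeric statistics when every present value is a number.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(present):
        summary["mean"] = statistics.fmean(numeric)
        summary["min"] = min(numeric)
        summary["max"] = max(numeric)
    return summary


# Hypothetical sample data, used only for illustration.
rows = [
    {"age": 34, "city": "Lisbon"},
    {"age": None, "city": "Porto"},
    {"age": 52, "city": "Lisbon"},
]

# Pivot the rows into columns, then profile each column independently.
columns = {name: [row[name] for row in rows] for name in rows[0]}
profile = {name: profile_column(vals) for name, vals in columns.items()}
```

A real profiling run would add distribution, correlation, and drift metrics on top of these basics; the sketch only shows the shape of the output.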
4. Anonymization
- Privacy Protection
- PII detection and masking
- Sensitive data handling
- Compliance validation
- Advanced Methods
- Multiple anonymization techniques
- Privacy metrics calculation
- Utility preservation
- Custom Rules
- Configurable privacy rules
- Business-specific requirements
- Regulatory compliance
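As a rough illustration of "PII detection and masking" with a configurable rule, here is a minimal sketch using only the Python standard library. It is not the ydata-sdk API: `EMAIL_PATTERN`, `mask_emails`, and `mask_record` are hypothetical names, and the regular expression is a simplified email matcher, not a complete PII detector.

```python
import re

# Simplified email pattern (an assumption for illustration; real PII detection
# covers many more entity types and edge cases).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text, placeholder="[EMAIL]"):
    """Replace every email-like substring in text with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


def mask_record(record, fields):
    """Mask emails only in the listed string fields; leave other fields untouched."""
    return {
        key: mask_emails(value) if key in fields and isinstance(value, str) else value
        for key, value in record.items()
    }


masked_text = mask_emails("Contact jane.doe@example.com today")
masked_record = mask_record({"id": 7, "note": "mail bob@corp.io"}, {"note"})
```

Restricting the rule to named fields mirrors the idea of configurable privacy rules: the same masking logic can be applied selectively depending on business or regulatory requirements.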
5. Synthetic Data
- Tabular & Relational
- Create high-fidelity synthetic datasets (single table, time-series, multi-table)
- Preserve data distributions and relationships
- Ensure privacy compliance
- Text & Documents (LLM-powered)
- Document Generator: generate PDF, DOCX, and HTML documents with configurable type, tone, and style; supports scanned document simulation and batch variation via DatasetConfig
- LLM Synthesizer: generate single-table or multi-table datasets from natural language prompts; no source data required
- Q&A Generator: produce question-answer pairs from existing documents for RAG evaluation, fine-tuning, and benchmarking
- Supports multiple LLM backends: Workbench, OpenAI, Anthropic, Gemini
- Use Cases
- Analytics and reporting
- Machine learning and AI model training
- Privacy-preserving data sharing and applications
- OCR and document AI training data
- LLM evaluation and fine-tuning datasets
6. Report
- Automated Reporting
- Generate comprehensive data quality reports
- Create profiling insights
- Perform integrity checks
- Output Formats
- Interactive dashboards
- PDF reports
- JSON exports