MultiDataset
The MultiDataset object in ydata-sdk is a high-level abstraction designed to manage relational datasets consisting of multiple interrelated tables.
It supports both in-memory usage (e.g., CSV or Parquet files) and deferred loading from relational databases (via connectors), enabling scalable data workflows across diverse environments.
MultiDataset preserves schema relationships, including primary and foreign keys, through a centralized metadata object, ensuring data integrity across table operations. It integrates tightly with the SDK’s profiling, synthesis, and anonymization modules, and enables advanced multi-table workflows, including referentially consistent synthetic data generation.
Whether you're analyzing structured CSVs, working with normalized data in a Postgres database, or composing a dataset programmatically, MultiDataset offers a consistent and extensible interface.
Key Features
- Relational Dataset Support: Designed for multi-table datasets with defined primary/foreign key relationships.
- Flexible Loading Modes: Supports both eager (in-memory) and lazy (on-demand from RDBMS) loading patterns.
- Schema-Aware Operations: Maintains explicit schema definitions and validates table relationships.
- Lazy Evaluation with Connectors: Tables are loaded only when accessed, ideal for large-scale databases.
- Modular Integration: Compatible with MultiMetadata, Synthesizer, and downstream AI-ready workflows.
- Query-Like Interface: Select and operate on individual tables or subsets using familiar dictionary semantics.
ydata.dataset.multidataset.MultiDataset
schema (property)
Returns the schema associated with the MultiDataset.
Returns:
Type | Description |
---|---|
MultiTableSchema | The object defining table structures and relationships. |
add_foreign_key(table, column, parent_table, parent_column, relation_type=RelationType.MANY_TO_MANY)
Adds a foreign key relationship to the schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | str | Name of the child table. | required |
column | str | Foreign key column in the child table. | required |
parent_table | str | Name of the parent table. | required |
parent_column | str | Primary key column in the parent table. | required |
relation_type | str or RelationType | Type of relationship (e.g., MANY_TO_MANY). | MANY_TO_MANY |
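To illustrate what registering a foreign key involves, here is a minimal, self-contained sketch. It does not use ydata-sdk: `SchemaSketch` and `ForeignKey` are hypothetical stand-ins that only mirror the parameter names documented above, and the validation rule (both columns must already exist) is an assumption, not the SDK's confirmed behavior.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ForeignKey:
    """One child-to-parent link, mirroring the documented parameter names."""
    table: str
    column: str
    parent_table: str
    parent_column: str
    relation_type: str = "MANY_TO_MANY"


@dataclass
class SchemaSketch:
    """Hypothetical stand-in for a multi-table schema (not the SDK class)."""
    columns: dict[str, set[str]]  # table name -> known column names
    foreign_keys: list[ForeignKey] = field(default_factory=list)

    def add_foreign_key(self, table, column, parent_table, parent_column,
                        relation_type="MANY_TO_MANY"):
        # Assumed rule: reject links that reference unknown tables or columns.
        for tbl, col in ((table, column), (parent_table, parent_column)):
            if col not in self.columns.get(tbl, set()):
                raise KeyError(f"unknown column {tbl}.{col}")
        fk = ForeignKey(table, column, parent_table, parent_column, relation_type)
        self.foreign_keys.append(fk)
        return fk


schema = SchemaSketch(columns={
    "customers": {"customer_id", "name"},
    "orders": {"order_id", "customer_id", "amount"},
})
fk = schema.add_foreign_key("orders", "customer_id", "customers", "customer_id")
```

Here `orders.customer_id` is the child column and `customers.customer_id` is the parent key, matching the roles of `table`/`column` and `parent_table`/`parent_column` in the parameter table.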
add_observer_for_new_tables(func)
Registers an observer function to be notified when new tables are loaded into the MultiDataset.
Typically used by MultiMetadata to receive updates when deferred tables are materialized.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | Callable | A callback function that accepts (table_name, Dataset). | required |
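The observer mechanism can be pictured with a small stand-alone sketch. `ObservableTables` and its `_on_table_loaded` hook are hypothetical names introduced for illustration; only the callback shape `(table_name, dataset)` comes from the description above, and the SDK's internals may differ.

```python
from typing import Any, Callable

Observer = Callable[[str, Any], None]


class ObservableTables:
    """Hypothetical container that announces newly loaded tables."""

    def __init__(self) -> None:
        self._tables: dict[str, Any] = {}
        self._observers: list[Observer] = []

    def add_observer_for_new_tables(self, func: Observer) -> None:
        # Keep the callback so it can be invoked on later table loads.
        self._observers.append(func)

    def _on_table_loaded(self, name: str, dataset: Any) -> None:
        # Store the table, then notify every observer in registration order.
        self._tables[name] = dataset
        for func in self._observers:
            func(name, dataset)


seen: list[str] = []
tables = ObservableTables()
tables.add_observer_for_new_tables(lambda name, dataset: seen.append(name))
tables._on_table_loaded("orders", [{"order_id": 1}])  # simulate a deferred load
```

A metadata object could register such a callback to update its own per-table state whenever a deferred table becomes available.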
add_primary_key(table, column)
compute()
Materializes all deferred tables in the dataset by fetching them via the connector, if available.
Returns:
Type | Description |
---|---|
MultiDataset | The same object with all tables loaded into memory. |
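The difference between on-demand access and a full `compute()` can be sketched without the SDK. `LazyTables` and the `fetch` helper below are hypothetical; they assume each table has a zero-argument loader and that loaded tables are cached, which is one plausible reading of "deferred" rather than a description of the actual implementation.

```python
from typing import Any, Callable


class LazyTables:
    """Hypothetical deferred-loading container (not the SDK implementation)."""

    def __init__(self, loaders: dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders            # table name -> zero-argument fetch
        self._loaded: dict[str, Any] = {}

    def __getitem__(self, name: str) -> Any:
        # Fetch on first access only; later reads reuse the cached table.
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def compute(self) -> "LazyTables":
        # Materialize every table that is still deferred, then return self.
        for name in self._loaders:
            self[name]
        return self


calls: list[str] = []


def fetch(name: str) -> Callable[[], list]:
    def _load() -> list:
        calls.append(name)  # record each simulated round trip
        return [name]
    return _load


lazy = LazyTables({"customers": fetch("customers"), "orders": fetch("orders")})
first = lazy["orders"]   # only "orders" is fetched here
result = lazy.compute()  # fetches "customers"; "orders" is already cached
```

Returning the same object from `compute()` keeps call chains simple and matches the documented return value.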
from_files(folder_path, schema_path, sep=',', file_type=FileType.CSV) (classmethod)
Loads a MultiDataset from a folder of CSV or Parquet files, using an optional schema.yaml file to define table relationships.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folder_path | str | Path to the folder containing .csv/.parquet files and a schema.yaml file. | required |
Returns:
Type | Description |
---|---|
MultiDataset | An initialized MultiDataset object. |
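As a rough picture of folder-based loading, the sketch below reads every CSV file in a folder into a dictionary keyed by file name. It uses only the Python standard library; `load_csv_folder` is a hypothetical helper, it ignores Parquet files and the schema file, and it is not how `from_files` is implemented.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_folder(folder_path: str, sep: str = ",") -> dict[str, list[dict[str, str]]]:
    """Read every .csv file in a folder into a dict keyed by file stem."""
    tables = {}
    for path in sorted(Path(folder_path).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.DictReader(handle, delimiter=sep))
    return tables


# Build a throwaway folder with two small tables to exercise the loader.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "customers.csv").write_text("customer_id,name\n1,Ada\n", encoding="utf-8")
    Path(folder, "orders.csv").write_text("order_id,customer_id\n10,1\n", encoding="utf-8")
    loaded = load_csv_folder(folder)
```

Each file becomes one table named after the file, which is the same one-table-per-file layout the `folder_path` description implies.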
items()
Standard dictionary-like methods for iterating over tables in the MultiDataset.
keys()
Standard dictionary-like methods for iterating over tables in the MultiDataset.
select_tables(tables)
values()
Standard dictionary-like methods for iterating over tables in the MultiDataset.
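The dictionary-style access pattern can be sketched with `collections.abc.Mapping`, which supplies `keys()`, `items()`, and `values()` once three core methods are defined. `TableMapping` is a hypothetical class, and the behavior shown for `select_tables` (returning a new container restricted to the named tables) is an assumption, since the method is not described here.

```python
from collections.abc import Mapping
from typing import Any, Iterator


class TableMapping(Mapping):
    """Hypothetical read-only mapping of table name to table."""

    def __init__(self, tables: dict[str, Any]) -> None:
        self._tables = dict(tables)

    def __getitem__(self, name: str) -> Any:
        return self._tables[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._tables)

    def __len__(self) -> int:
        return len(self._tables)

    def select_tables(self, tables: list[str]) -> "TableMapping":
        # Assumed behavior: a new mapping restricted to the requested names.
        return TableMapping({name: self._tables[name] for name in tables})


data = TableMapping({"customers": [1, 2], "orders": [10], "items": []})
subset = data.select_tables(["customers", "orders"])
names = list(subset.keys())
sizes = {name: len(rows) for name, rows in subset.items()}
```

Iterating with `items()` yields `(table_name, table)` pairs, so per-table work such as counting rows fits in a single comprehension.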