MultiDataset

The MultiDataset object in ydata-sdk is a high-level abstraction for managing relational datasets made up of multiple interrelated tables. It supports both in-memory usage (e.g., data loaded from CSV or Parquet files) and deferred loading from relational databases (via connectors), so the same workflow can scale from local files to large databases.

MultiDataset preserves schema relationships, including primary and foreign keys, through a centralized metadata object, which keeps tables consistent with one another across operations. It integrates with the SDK's profiling, synthesis, and anonymization modules and supports multi-table workflows such as referentially consistent synthetic data generation.

Whether you're analyzing structured CSVs, working with normalized data in a Postgres database, or composing a dataset programmatically, MultiDataset offers a consistent and extensible interface.

Key Features

Relational Dataset Support: Designed for multi-table datasets with defined primary/foreign key relationships.
Flexible Loading Modes: Supports both eager (in-memory) and lazy (on-demand from RDBMS) loading patterns.
Schema-Aware Operations: Maintains explicit schema definitions and validates table relationships.
Lazy Evaluation with Connectors: Tables are loaded only when accessed, ideal for large-scale databases.
Modular Integration: Compatible with MultiMetadata, Synthesizer, and downstream AI-ready workflows.
Query-Like Interface: Select and operate on individual tables or subsets using familiar dictionary semantics.

`ydata.dataset.multidataset.MultiDataset`

`schema` `property`

Returns the schema associated with the MultiDataset.

Returns:

Type	Description
`MultiTableSchema`	The object defining table structures and relationships.

`add_foreign_key(table, column, parent_table, parent_column, relation_type=RelationType.MANY_TO_MANY)`

Adds a foreign key relationship to the schema.

Parameters:

Name	Type	Description	Default
`table`	`str`	Name of the child table.	required
`column`	`str`	Foreign key column in the child table.	required
`parent_table`	`str`	Name of the parent table.	required
`parent_column`	`str`	Primary key column in the parent table.	required
`relation_type`	`str | RelationType`	Type of relationship between the child and parent tables.	`RelationType.MANY_TO_MANY`

`add_observer_for_new_tables(func)`

Registers an observer function to be notified when new tables are loaded into the MultiDataset.

Typically used by MultiMetadata to receive updates when deferred tables are materialized.

Parameters:

Name	Type	Description	Default
`func`	`Callable`	A callback function that accepts (table_name, Dataset).	required
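The observer mechanism described above can be sketched with a small stand-in class. This is an illustrative model only, not the SDK's implementation: `DeferredTables`, its loader functions, and the assumption that observers are notified once per table on first load are all hypothetical. The only detail taken from this page is the callback shape, `(table_name, Dataset)`.

```python
from typing import Callable, Dict, List


class DeferredTables:
    """Illustrative stand-in (not the SDK class): loads tables on first access."""

    def __init__(self, loaders: Dict[str, Callable[[], list]]):
        self._loaders = loaders                # table name -> function that fetches rows
        self._loaded: Dict[str, list] = {}     # tables already materialized
        self._observers: List[Callable[[str, list], None]] = []

    def add_observer_for_new_tables(self, func: Callable[[str, list], None]) -> None:
        self._observers.append(func)

    def __getitem__(self, name: str) -> list:
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name]()   # fetch only once
            for notify in self._observers:
                notify(name, self._loaded[name])
        return self._loaded[name]

    def compute(self) -> "DeferredTables":
        # Materialize every table that has not been accessed yet.
        for name in self._loaders:
            self[name]
        return self


seen: List[str] = []


def on_table_loaded(table_name: str, dataset: list) -> None:
    # Callback with the (table_name, Dataset) shape described above.
    seen.append(table_name)


tables = DeferredTables({
    "customers": lambda: [{"customer_id": 1}],
    "orders": lambda: [{"order_id": 10, "customer_id": 1}],
})
tables.add_observer_for_new_tables(on_table_loaded)

tables["orders"]    # first access materializes "orders" and notifies the observer
tables["orders"]    # already loaded, so no second notification
tables.compute()    # materializes the remaining table
print(seen)         # ['orders', 'customers']
```

In this model a metadata object could register such a callback to update its statistics only for tables that have actually been fetched.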

`add_primary_key(table, column)`

Adds a primary key column to a specific table in the schema.

Parameters:

Name	Type	Description	Default
`table`	`str`	Table name.	required
`column`	`str`	Column name to mark as the primary key.	required
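As a rough sketch, the two key-declaration methods might be combined as follows. The table and column names (`customers`, `orders`, `customer_id`, `order_id`) are hypothetical, `relation_type` is left at its documented default, and declaring primary keys before the foreign key is a choice made for this example rather than a documented requirement. `_CallRecorder` is a stand-in so the snippet runs without the SDK; a real MultiDataset would be passed instead.

```python
def declare_keys(dataset) -> None:
    """Declare one parent/child relationship on a MultiDataset-like object."""
    # Primary keys for both tables (hypothetical names).
    dataset.add_primary_key("customers", "customer_id")
    dataset.add_primary_key("orders", "order_id")
    # orders.customer_id references customers.customer_id.
    dataset.add_foreign_key(
        table="orders",
        column="customer_id",
        parent_table="customers",
        parent_column="customer_id",
    )


class _CallRecorder:
    """Stand-in that only records calls; it is not part of the SDK."""

    def __init__(self):
        self.calls = []

    def add_primary_key(self, table, column):
        self.calls.append(("pk", table, column))

    def add_foreign_key(self, table, column, parent_table, parent_column):
        self.calls.append(("fk", table, column, parent_table, parent_column))


recorder = _CallRecorder()
declare_keys(recorder)
for call in recorder.calls:
    print(call)
```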

`compute()`

Materializes all deferred tables in the dataset by fetching them via the connector, if available.

Returns:

Type	Description
`MultiDataset`	The same object with all tables loaded into memory.

`from_files(folder_path, schema_path, sep=',', file_type=FileType.CSV)` `classmethod`

Loads a MultiDataset from a folder of CSV or Parquet files, using an optional schema.yaml file to define table relationships.

Parameters:

Name	Type	Description	Default
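`folder_path`	`str`	Path to the folder containing .csv/.parquet files and a schema.yaml file.	required
`schema_path`	`str`	Path to the schema.yaml file that defines table relationships.	required
`sep`	`str`	Column separator used when reading CSV files.	`','`
`file_type`	`FileType`	Format of the files in the folder.	`FileType.CSV`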

Returns:

Type	Description
`MultiDataset`	An initialized MultiDataset object.
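A minimal sketch of calling `from_files`, assuming the import path matches the qualified name shown on this page. The wrapper name `load_from_folder` and the example paths are hypothetical, and `file_type` is left at its default because the import location of `FileType` is not given here.

```python
def load_from_folder(folder_path: str, schema_path: str, sep: str = ","):
    """Load every table in folder_path into a MultiDataset (sketch)."""
    # Imported inside the function so this module can be imported
    # even where ydata-sdk is not installed.
    from ydata.dataset.multidataset import MultiDataset

    return MultiDataset.from_files(folder_path, schema_path, sep=sep)


# Usage (requires ydata-sdk and the files on disk; paths are placeholders):
# dataset = load_from_folder("data/retail", "data/retail/schema.yaml")
# print(list(dataset.keys()))
```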

`items()`

Returns the (table name, table) pairs in the MultiDataset, following standard dictionary semantics.

`keys()`

Returns the names of the tables in the MultiDataset, following standard dictionary semantics.

`select_tables(tables)`

Selects a subset of tables from the MultiDataset.

Parameters:

Name	Type	Description	Default
`tables`	`Iterable[str]`	Names of the tables to include.	required

Returns:

Type	Description
`MultiDataset`	A new MultiDataset instance containing only the selected tables.
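The sketch below pairs `keys()` with `select_tables` to guard against misspelled table names before selecting. How the SDK itself handles unknown names is not stated on this page, so the guard is an assumption made for the example. `select_existing` is a hypothetical helper, and `_Tables` is a dictionary-based stand-in (not the SDK class) so the snippet runs on its own.

```python
from typing import Iterable


def select_existing(dataset, wanted: Iterable[str]):
    """Select tables after checking that every requested name exists."""
    wanted = list(wanted)
    available = set(dataset.keys())
    missing = [name for name in wanted if name not in available]
    if missing:
        raise KeyError(f"Unknown tables: {missing}")
    return dataset.select_tables(wanted)


class _Tables(dict):
    """Stand-in with dict semantics plus select_tables; not the SDK class."""

    def select_tables(self, tables):
        return _Tables({name: self[name] for name in tables})


demo = _Tables(
    customers=[{"customer_id": 1}],
    orders=[{"order_id": 10, "customer_id": 1}],
    payments=[{"payment_id": 100, "order_id": 10}],
)
subset = select_existing(demo, ["customers", "orders"])
print(list(subset.keys()))   # ['customers', 'orders']
```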

`values()`

Returns the tables in the MultiDataset, following standard dictionary semantics.
