MultiDataset

The MultiDataset object in ydata-sdk is a high-level abstraction for managing relational datasets made up of multiple interrelated tables. It supports both eager in-memory loading (e.g., from CSV or Parquet files) and deferred loading from relational databases through connectors, so the same workflow applies to local files and to large databases.

MultiDataset preserves schema relationships, including primary and foreign keys, through a centralized metadata object, which keeps tables consistent with one another across operations. It integrates with the SDK’s profiling, synthesis, and anonymization modules and supports multi-table workflows such as referentially consistent synthetic data generation.

Whether you're analyzing structured CSVs, working with normalized data in a Postgres database, or composing a dataset programmatically, MultiDataset offers a consistent and extensible interface.

Key Features

  • Relational Dataset Support: Designed for multi-table datasets with defined primary/foreign key relationships.
  • Flexible Loading Modes: Supports both eager (in-memory) and lazy (on-demand from RDBMS) loading patterns.
  • Schema-Aware Operations: Maintains explicit schema definitions and validates table relationships.
  • Lazy Evaluation with Connectors: Tables are loaded only when accessed, ideal for large-scale databases.
  • Modular Integration: Compatible with MultiMetadata, Synthesizer, and downstream AI-ready workflows.
  • Query-Like Interface: Select and operate on individual tables or subsets using familiar dictionary semantics.
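
The lazy-loading and dictionary-style access listed above can be illustrated with a small stand-alone sketch. `LazyTables` and its `loaders` argument are hypothetical names invented for this illustration. This is not how ydata-sdk implements MultiDataset, only a toy model of the pattern: a table is fetched the first time it is accessed and cached afterwards.

```python
from collections.abc import Mapping
from typing import Any, Callable, Dict, Iterator, List


class LazyTables(Mapping):
    """Toy read-only mapping that loads each table on first access.

    Hypothetical illustration only; not part of ydata-sdk.
    """

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders           # table name -> zero-argument loader
        self._cache: Dict[str, Any] = {}  # tables that are already materialized

    def __getitem__(self, name: str) -> Any:
        if name not in self._loaders:
            raise KeyError(name)
        if name not in self._cache:
            # First access: run the loader (e.g., a database query) and cache it.
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._loaders)

    def __len__(self) -> int:
        return len(self._loaders)

    def loaded(self) -> List[str]:
        """Names of the tables fetched so far."""
        return sorted(self._cache)


tables = LazyTables({
    "customers": lambda: [{"id": 1}, {"id": 2}],
    "orders": lambda: [{"id": 10, "customer_id": 1}],
})
names = list(tables.keys())  # listing names does not trigger any load
first = tables["orders"]     # only "orders" is fetched here
```

In this toy model, `tables.loaded()` reports only the tables that were actually accessed, which is the property that makes deferred loading attractive for large databases.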

ydata.dataset.multidataset.MultiDataset

schema property

Returns the schema associated with the MultiDataset.

Returns:

Name Type Description
MultiTableSchema

The object defining table structures and relationships.

add_foreign_key(table, column, parent_table, parent_column, relation_type=RelationType.MANY_TO_MANY)

Adds a foreign key relationship to the schema.

Parameters:

Name Type Description Default
table str

Name of the child table.

required
column str

Foreign key column in the child table.

required
parent_table str

Name of the parent table.

required
parent_column str

Primary key column in the parent table.

required
relation_type str | RelationType

Type of relationship between the child and parent tables (e.g., MANY_TO_MANY).

MANY_TO_MANY
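
As a rough illustration of what a foreign key declaration asserts about the data, the stand-alone sketch below checks that every value in a child column also appears in the parent column. The function `find_orphans` and the sample rows are hypothetical and use plain Python; ydata-sdk's own validation may work differently.

```python
from typing import Any, Dict, List


def find_orphans(
    child_rows: List[Dict[str, Any]],
    column: str,
    parent_rows: List[Dict[str, Any]],
    parent_column: str,
) -> List[Any]:
    """Return child key values that have no matching parent key."""
    parent_keys = {row[parent_column] for row in parent_rows}
    return [row[column] for row in child_rows if row[column] not in parent_keys]


customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # no customer with id 3
]
orphans = find_orphans(orders, "customer_id", customers, "id")
```

An empty result means the child column is referentially consistent with the parent column.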

add_observer_for_new_tables(func)

Registers an observer function to be notified when new tables are loaded into the MultiDataset.

Typically used by MultiMetadata to receive updates when deferred tables are materialized.

Parameters:

Name Type Description Default
func Callable

A callback function that accepts (table_name, Dataset).

required
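
The callback contract described above follows the classic observer pattern. The sketch below shows that pattern in isolation; `ObservableTables`, `add_observer`, and `load` are hypothetical names for this illustration and are not part of ydata-sdk. Only the `(table_name, table)` callback shape is taken from the description above.

```python
from typing import Any, Callable, Dict, List

Observer = Callable[[str, Any], None]


class ObservableTables:
    """Toy container that notifies observers when a table is loaded.

    Hypothetical illustration only; not part of ydata-sdk.
    """

    def __init__(self) -> None:
        self._tables: Dict[str, Any] = {}
        self._observers: List[Observer] = []

    def add_observer(self, func: Observer) -> None:
        self._observers.append(func)

    def load(self, table_name: str, data: Any) -> None:
        self._tables[table_name] = data
        for func in self._observers:
            # Each observer receives the table name and the loaded table.
            func(table_name, data)


seen: List[str] = []
store = ObservableTables()
store.add_observer(lambda name, table: seen.append(name))
store.load("orders", [{"id": 10}])
```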

add_primary_key(table, column)

Adds a primary key column to a specific table in the schema.

Parameters:

Name Type Description Default
table str

Table name.

required
column str

Column name to mark as the primary key.

required
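
A minimal usage sketch of the two key-declaration methods, using only the parameter names documented on this page. The table and column names are hypothetical, and `CallRecorder` is a stand-in that records calls so the example runs without ydata-sdk installed; with the SDK, a real MultiDataset would be passed to `declare_keys` instead.

```python
from typing import List, Tuple


class CallRecorder:
    """Stand-in that records calls instead of editing a real schema."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def add_primary_key(self, table: str, column: str) -> None:
        self.calls.append(("pk", table, column))

    def add_foreign_key(
        self, table: str, column: str, parent_table: str, parent_column: str
    ) -> None:
        self.calls.append(("fk", table, column, parent_table, parent_column))


def declare_keys(multi_dataset) -> None:
    """Declare a customers -> orders relationship on a MultiDataset-like object."""
    multi_dataset.add_primary_key(table="customers", column="id")
    multi_dataset.add_primary_key(table="orders", column="order_id")
    # relation_type is left at its documented default.
    multi_dataset.add_foreign_key(
        table="orders",
        column="customer_id",
        parent_table="customers",
        parent_column="id",
    )


recorder = CallRecorder()
declare_keys(recorder)
```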

compute()

Materializes all deferred tables in the dataset by fetching them through the connector, if one is configured.

Returns:

Name Type Description
MultiDataset

The same object with all tables loaded into memory.

from_files(folder_path, schema_path, sep=',', file_type=FileType.CSV) classmethod

Loads a MultiDataset from a folder of CSV or Parquet files, using an optional schema.yaml file to define table relationships.

Parameters:

Name Type Description Default
folder_path str

Path to the folder containing .csv/.parquet files and a schema.yaml file.

required

Returns:

Name Type Description
MultiDataset

An initialized MultiDataset object.
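
A hedged usage sketch based only on the signature and module path shown on this page. The wrapper name `load_multidataset` and the paths in the comment are hypothetical, and the call relies on the documented defaults for `sep` and `file_type`. The import is placed inside the function so the snippet can be defined without ydata-sdk installed; calling it requires the SDK.

```python
def load_multidataset(folder_path: str, schema_path: str):
    """Load a MultiDataset from a folder of CSV files using the documented defaults."""
    # Imported lazily so this module can be imported without ydata-sdk installed.
    from ydata.dataset.multidataset import MultiDataset

    return MultiDataset.from_files(folder_path=folder_path, schema_path=schema_path)


# Example call (hypothetical paths, requires ydata-sdk):
# multi_dataset = load_multidataset("data/shop", "data/shop/schema.yaml")
```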

items()

Dictionary-style method that iterates over (table name, table) pairs in the MultiDataset.

keys()

Dictionary-style method that iterates over the table names in the MultiDataset.

select_tables(tables)

Selects a subset of tables from the MultiDataset.

Parameters:

Name Type Description Default
tables Iterable[str]

Names of the tables to include.

required

Returns:

Name Type Description
MultiDataset

A new MultiDataset instance containing only the selected tables.

values()

Dictionary-style method that iterates over the tables in the MultiDataset.
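
Because `keys()`, `items()`, and `values()` follow dictionary conventions, code written against a mapping should carry over. In the sketch below a plain dict stands in for a MultiDataset, which is an assumption made so the example runs without the SDK; `table_types` is a hypothetical helper.

```python
from typing import Any, Dict, Mapping


def table_types(multi_dataset: Mapping[str, Any]) -> Dict[str, str]:
    """Map each table name to the type name of its table object."""
    return {name: type(table).__name__ for name, table in multi_dataset.items()}


# A plain dict stands in for a MultiDataset here (assumption for illustration).
stand_in = {"customers": [{"id": 1}], "orders": ({"id": 10},)}
summary = table_types(stand_in)
table_names = list(stand_in.keys())
table_count = len(list(stand_in.values()))
```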