Holdout

The Holdout object in ydata-sdk is a utility designed to create, store, and reproduce consistent train-test splits from a given Dataset. It ensures repeatability in experiments by managing the exact rows allocated to the holdout (test) set, based on a fixed sampling strategy and split fraction.

By storing the partition metadata and dataset divisions that define the split, the Holdout object lets the same split be reloaded in a different environment or workflow stage, so that models and synthetic data are always evaluated against the same rows.

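from ydata.dataset.holdout import Holdout

# `dataset` (a Dataset) and `metadata` (its Metadata) are assumed to exist already.
# Reserve 30% of the rows for the hold-out set.
holdout_config = Holdout(fraction=0.3)
train, holdout = holdout_config.get_split(X=dataset, metadata=metadata, strategy='random')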

A notebook example is available at YData Academy.

ydata.dataset.holdout.Holdout

Split a Dataset into train and hold-out (test) partitions and store enough information to reproduce that exact split later.

A holdout (sometimes called a test or validation split) is created once, published, and then loaded again elsewhere to guarantee that everyone evaluating a model sees the same rows.

Parameters

fraction : float, default 0.2
    Fraction of the original dataset to place in the hold-out split. Must be strictly between 0 and 1.

Attributes

uuid : str
    Random identifier that uniquely tags this split instance.

holdout_def property

tuple | None: Cached (divisions, npartitions) needed to reload the hold-out split when the original dataset is re-opened.

None until get_split has been called.

get_split(X, metadata=None, random_state=None, strategy='random')

Generate and cache a train / hold-out split of X.

Parameters

X : Dataset
    The full dataset to split.
metadata : Metadata, optional
    Column-level metadata used by stratified sampling, if requested.
random_state : RandomSeed, optional
    Seed or NumPy/random-like RNG for deterministic splits.
strategy : {'random', 'stratified'}, default 'random'
    * 'random': simple random sampling.
    * 'stratified': deprecated. See Raises.

Returns

train : Dataset
    The rows that belong to the training split.
holdout : Dataset
    The rows that belong to the hold-out split.

Raises

NotImplementedError
    If strategy == 'stratified'. The existing implementation is no longer compatible with datasets whose index resets inside each partition (for example, when reset_index is applied to a distributed dataframe).
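To make the idea of a reproducible random split concrete, here is a minimal sketch in plain Python. It is not the ydata-sdk implementation: the helper split_indices and its parameters are hypothetical names introduced for illustration, it operates on row positions rather than on a Dataset, and the standard-library random module stands in for whatever RNG the library uses internally.

```python
import random


def split_indices(n_rows, fraction=0.2, seed=None):
    """Return (train_idx, holdout_idx) for a simple random split.

    Hypothetical helper for illustration only; not part of ydata-sdk.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")

    # A private RNG keeps the split independent of global random state.
    rng = random.Random(seed)

    # Number of rows assigned to the hold-out partition.
    holdout_size = round(n_rows * fraction)

    # Sample hold-out row positions without replacement.
    holdout_idx = sorted(rng.sample(range(n_rows), holdout_size))
    holdout_set = set(holdout_idx)

    # Every remaining row goes to the training partition.
    train_idx = [i for i in range(n_rows) if i not in holdout_set]
    return train_idx, holdout_idx


# The same seed reproduces the same partition.
train_idx, holdout_idx = split_indices(10, fraction=0.3, seed=42)
again_train, again_holdout = split_indices(10, fraction=0.3, seed=42)
```

Because the seed fixes which positions are sampled, calling the helper twice with the same arguments yields the same partition, which is the kind of determinism random_state is documented to provide.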

load(path) classmethod

Reload a Holdout instance from a pickle file.

Parameters

path : str
    File that was previously created with save.

Returns

Holdout
    The deserialized object.

Raises

AssertionError
    If the unpickled object is not a Holdout.

save(path)

Persist this Holdout object to disk with pickle.

Parameters

path : str
    Destination file name (e.g. "holdout.pkl").
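The save and load round trip described above can be sketched with the standard pickle module. This is an illustration under stated assumptions, not the library's code: save_object, load_checked and the split_state dictionary are hypothetical stand-ins, and a plain dict is used in place of a Holdout instance so the example runs without ydata-sdk installed.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to path with pickle (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_checked(path, expected_type):
    """Unpickle path and verify the result has the expected type."""
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    if not isinstance(obj, expected_type):
        raise AssertionError(
            f"expected {expected_type.__name__}, got {type(obj).__name__}"
        )
    return obj


# Stand-in for the state a split object might carry (hypothetical fields).
split_state = {"fraction": 0.3, "uuid": "example-id"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "holdout.pkl")
    save_object(split_state, path)

    # Loading with the right type returns an equal object.
    restored = load_checked(path, dict)

    # Loading with the wrong type is rejected.
    try:
        load_checked(path, list)
        wrong_type_rejected = False
    except AssertionError:
        wrong_type_rejected = True
```

As with any pickle file, load holdout files only from sources you trust, since unpickling can execute arbitrary code.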