Holdout
The Holdout object in ydata-sdk is a utility designed to create, store, and reproduce consistent train-test splits from a given Dataset. It ensures repeatability in experiments by managing the exact rows allocated to the holdout (test) set, based on a fixed sampling strategy and split fraction.
By storing essential information such as partition metadata and dataset divisions, the Holdout object supports reliable reloading of the same split configuration across different environments or workflow stages, which enables accurate evaluation of models and synthetic data.
from ydata.dataset.holdout import Holdout
holdout_config = Holdout(fraction=0.3)
train, holdout = holdout_config.get_split(X=dataset, metadata=metadata, strategy='random')
A notebook example can be found at YData Academy.
ydata.dataset.holdout.Holdout
Split a :class:Dataset into a train and hold-out (test) partition and store enough information to reproduce that exact split later.
A holdout (sometimes called a test or validation split) is created once, published, and then loaded again elsewhere to guarantee that everyone evaluating a model sees the same rows.
Parameters
fraction : float, default 0.2
Fraction of the original dataset to place in the hold-out split.
Must be strictly between 0 and 1.
Attributes
uuid : str
Random identifier that uniquely tags this split instance.
holdout_def
property
tuple | None: Cached (divisions, npartitions) needed to reload the hold-out split when the original dataset is re-opened. None until :meth:get_split has been called.
get_split(X, metadata=None, random_state=None, strategy='random')
Generate and cache a train / hold-out split of X.
Parameters
X : Dataset
The full dataset to split.
metadata : Metadata, optional
Column-level metadata used by stratified sampling, if requested.
random_state : RandomSeed, optional
Seed or NumPy/random-like RNG for deterministic splits.
strategy : {'random', 'stratified'}, default 'random'
* 'random' – simple random sampling.
* 'stratified' – Deprecated and currently unsupported. See Raises.
Returns
train : Dataset
The rows that belong to the training split.
holdout : Dataset
The rows that belong to the hold-out split.
Raises
NotImplementedError
If strategy == 'stratified'. The existing implementation is no longer compatible with datasets whose index resets inside each partition (for example, when reset_index is applied to distributed dataframes).
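The contract described for get_split can be illustrated with a small standard-library sketch. This is not the ydata-sdk implementation: split_rows is a hypothetical helper that works on a plain list instead of a Dataset, and the ValueError checks are assumptions. It only mirrors the documented behavior: a fraction strictly between 0 and 1, a seed that makes the split deterministic, and a NotImplementedError for 'stratified'.

```python
import random


def split_rows(rows, fraction=0.2, random_state=None, strategy="random"):
    """Hypothetical stand-in for a seeded random train / hold-out split."""
    # Assumption: invalid fractions are rejected with ValueError.
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    if strategy == "stratified":
        # Mirrors the documented behavior: stratified sampling is rejected.
        raise NotImplementedError("stratified sampling is not supported")
    if strategy != "random":
        raise ValueError(f"unknown strategy: {strategy!r}")

    rng = random.Random(random_state)  # the same seed yields the same sample
    n_holdout = round(len(rows) * fraction)
    holdout_idx = set(rng.sample(range(len(rows)), n_holdout))

    # Preserve the original row order inside each partition.
    train = [row for i, row in enumerate(rows) if i not in holdout_idx]
    holdout = [row for i, row in enumerate(rows) if i in holdout_idx]
    return train, holdout


rows = list(range(100))
train_a, holdout_a = split_rows(rows, fraction=0.3, random_state=42)
train_b, holdout_b = split_rows(rows, fraction=0.3, random_state=42)
```

Calling the helper twice with the same seed returns identical partitions, which is the property that makes a stored split reproducible.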
load(path)
classmethod
Reload a :class:Holdout instance from a pickle file.
Parameters
path : str
File that was previously created with :meth:save.
Returns
Holdout
The deserialized object.
Raises
AssertionError
If the unpickled object is not a :class:Holdout.
save(path)
Persist this :class:Holdout object to disk with :pymod:pickle.
Parameters
path : str
Destination file name (e.g. "holdout.pkl").
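The save and load round trip can be sketched with the standard library alone. HoldoutSketch below is a hypothetical stand-in, not the ydata-sdk Holdout class. It assumes only what the reference states: the object is pickled to a file, and loading asserts that the unpickled object has the expected type.

```python
import os
import pickle
import tempfile


class HoldoutSketch:
    """Hypothetical stand-in; not the ydata-sdk Holdout class."""

    def __init__(self, fraction=0.2):
        self.fraction = fraction

    def save(self, path):
        # Persist the whole object with pickle.
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            obj = pickle.load(f)
        # Mirrors the documented check: reject files holding another type.
        assert isinstance(obj, cls), "file does not contain a HoldoutSketch"
        return obj


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "holdout.pkl")
    HoldoutSketch(fraction=0.3).save(path)
    restored = HoldoutSketch.load(path)
```

Because unpickling can execute arbitrary code, a pickle file of this kind should only be loaded when it comes from a trusted source.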