Constraints

The constraint engine lets you define data quality rules that can be validated against any Dataset and used to filter out non-compliant rows before synthesis or downstream processing.

How it works

Every constraint implements a validate(dataset) method that returns a boolean mask — True where the rule is satisfied, False where it is violated. The ConstraintEngine collects those masks and aggregates them into a summary report.

Dataset ──► ConstraintEngine.validate() ──► summary()  (what broke?)
                     └──► filter()  (remove offending rows)
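The flow above can be sketched in plain pandas. This is only an illustration of the mask-then-aggregate idea, not the library's implementation: `MinimalNotNull` and `MinimalEngine` are hypothetical names, and the real engine operates on a Dataset rather than a raw DataFrame.

```python
import pandas as pd


class MinimalNotNull:
    """Hypothetical row constraint: True where a value is present."""

    def __init__(self, columns):
        self.columns = columns

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        # One boolean per cell of the checked columns.
        return df[self.columns].notna()


class MinimalEngine:
    """Hypothetical engine: collects masks, then reports or filters."""

    def __init__(self, constraints):
        self.constraints = constraints
        self.masks = {}

    def validate(self, df: pd.DataFrame) -> "MinimalEngine":
        # Store one mask per constraint, keyed by class name and position.
        self.masks = {
            f"{type(c).__name__}#{i}": c.validate(df)
            for i, c in enumerate(self.constraints)
        }
        return self

    def summary(self) -> pd.DataFrame:
        # Count violated cells (False entries) per constraint.
        rows = [
            {"constraint": name, "violations": int((~mask).to_numpy().sum())}
            for name, mask in self.masks.items()
        ]
        return pd.DataFrame(rows)

    def filter(self, df: pd.DataFrame) -> pd.DataFrame:
        # Keep a row only if every mask is True across that row.
        keep = pd.Series(True, index=df.index)
        for mask in self.masks.values():
            keep &= mask.all(axis=1)
        return df[keep]


df = pd.DataFrame({"age": [31, None, 45], "income": [52000.0, 61000.0, None]})
engine = MinimalEngine([MinimalNotNull(["age", "income"])])
engine.validate(df)
report = engine.summary()   # what broke?
clean = engine.filter(df)   # offending rows removed
```

Keeping the masks around after `validate()` is what lets the same validation pass feed both the report and the filter.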

Two kinds of constraints

| Kind | What it checks | Output shape | Used by filter()? |
|---|---|---|---|
| Row constraint | Each individual row (value comparisons, nulls, regex…) | n_rows × n_cols boolean mask | ✅ Yes |
| Column constraint | An aggregate statistic of the whole column (mean, std, max…) | 1 × n_cols boolean | ❌ No (reported only) |
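To make the two output shapes concrete, here is a small pandas sketch. It does not use the library's classes. The data and bounds are made up for illustration, and the checks only mimic what a "greater than" row constraint and a "mean between" column constraint might compute.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 40, -3, 58],
    "income": [30_000, 52_000, 41_000, 75_000],
})

# Row-style check: every value must be greater than 0.
# Result has one boolean per cell: shape (n_rows, n_cols) = (4, 2).
row_mask = df[["age", "income"]] > 0

# Column-style check: the column mean must fall inside a range.
# Result has one boolean per column: shape (1, n_cols) = (1, 2).
bounds = {"age": (35, 65), "income": (10_000, 100_000)}  # illustrative bounds
col_result = pd.DataFrame(
    {col: [bool(lo <= df[col].mean() <= hi)] for col, (lo, hi) in bounds.items()}
)

# Only the row mask can drive filtering: drop rows with any violated cell.
filtered = df[row_mask.all(axis=1)]
```

Here the mean of `age` falls outside the illustrative range, so the column check fails, yet it does not single out any row. That is one plausible reason a column constraint would be reported rather than used for filtering.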

Quick start

from ydata.constraints import (
    ConstraintEngine,
    NotNull, Unique, GreaterThan, NotIncludedIn,  # row
    MeanBetween, NullRateLowerThan, MaxBetween,   # column
    CustomConstraint,                              # bring your own logic
)
from ydata.dataset import Dataset

engine = ConstraintEngine([
    # ── Row constraints ──
    NotNull(columns=["age", "income"]),
    Unique(columns=["customer_id"]),
    GreaterThan(columns=["age"], value=0),
    NotIncludedIn(column="status", values=["banned", "deleted"]),

    # ── Column constraints ──
    MeanBetween(lower_bound=20, upper_bound=65, columns=["age"]),
    NullRateLowerThan(value=0.05, columns=["income"]),
    MaxBetween(lower_bound=0, upper_bound=10_000, columns=["tx_amount"]),

    # ── Custom logic ──
    CustomConstraint(
        lambda df: df["end_date"] >= df["start_date"],
        columns=["end_date"], available_columns=["start_date"],
        axis="row",
    ),
])

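# `dataset` is assumed to be a Dataset you have already loaded.
engine.validate(dataset)
print(engine.summary())  # what broke?

# Remove rows that violate a row constraint; column constraints are reported only.
clean_dataset = engine.filter(dataset)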
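The callable passed to CustomConstraint in the quick start is an ordinary function that returns a boolean mask. The sketch below runs the same lambda on a plain pandas DataFrame to show what it produces. Whether the engine hands the callable a pandas DataFrame or another frame type is not stated here, so treat that as an assumption. The sample dates are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-10", "2024-05-05"]),
    "end_date": pd.to_datetime(["2024-01-31", "2024-03-01", "2024-05-05"]),
})

# Same rule as in the quick start: a period may not end before it starts.
check = lambda df: df["end_date"] >= df["start_date"]

mask = check(df)       # one boolean per row
valid_rows = df[mask]  # rows that satisfy the rule
```

Developing the rule against a small frame like this is a cheap way to confirm it returns one boolean per row before wrapping it in a constraint.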

Sections