Constraint Engine

The ConstraintEngine is the orchestrator: it holds a collection of constraints, runs them against a Dataset, aggregates the results, and can filter out non-compliant rows.

from ydata.constraints import ConstraintEngine

Creating an engine

Pass constraints at construction time, or add them incrementally:

from ydata.constraints import ConstraintEngine, NotNull, MeanBetween

# At construction
engine = ConstraintEngine([
    NotNull(columns=["age", "income"]),
    MeanBetween(lower_bound=20, upper_bound=60, columns=["age"]),
])

# Incrementally
engine = ConstraintEngine()
engine.add_constraint(NotNull(columns=["age"]))
engine.add_constraints([MeanBetween(lower_bound=20, upper_bound=60, columns=["age"])])

validate()

Runs all constraints against the dataset and stores the result internally.

engine.validate(dataset)

The engine caches the result, so calling validate() again on the same engine is a no-op. Adding or removing a constraint between calls resets the cache automatically, and the next validate() call runs the constraints again.
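The caching rule described above can be illustrated with a small stand-alone sketch. It does not use ydata and is not the library's implementation: TinyEngine, add_check, and the runs counter are hypothetical names, chosen only to show a cache that is cleared whenever the rule set changes.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Check = Callable[[Row], bool]  # returns True when the row passes


class TinyEngine:
    """Hypothetical stand-in for an engine with a cached validation result."""

    def __init__(self) -> None:
        self._checks: List[Check] = []
        self._result: Optional[List[bool]] = None  # cached per-row pass/fail
        self.runs = 0  # how many times the checks were actually executed

    def add_check(self, check: Check) -> None:
        self._checks.append(check)
        self._result = None  # changing the rule set invalidates the cache

    def validate(self, rows: List[Row]) -> List[bool]:
        if self._result is None:
            self.runs += 1
            self._result = [all(c(r) for c in self._checks) for r in rows]
        return self._result


rows = [{"age": 31}, {"age": None}]
tiny = TinyEngine()
tiny.add_check(lambda r: r["age"] is not None)

tiny.validate(rows)
tiny.validate(rows)            # served from the cache, checks are not re-run
runs_before_change = tiny.runs

tiny.add_check(lambda r: r["age"] is None or r["age"] >= 18)
tiny.validate(rows)            # cache was reset, so the checks run again
runs_after_change = tiny.runs
```

In this sketch the cache belongs to the engine, not to the data, so passing different rows to a second validate() call would still return the first result. Whether ConstraintEngine behaves the same way is not stated here; creating a fresh engine per dataset avoids the question.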


summary()

Returns a nested dict with violation counts, ratios, and per-constraint breakdowns.

summary = engine.summary()

print(summary["rows_violation_count"])   # total rows that violated at least one rule
print(summary["rows_violation_ratio"])   # as a fraction of total rows

for name, detail in summary["violation_per_constraint"].items():
    print(name, detail)

Pass include_rows=True to add the raw per-row boolean mask to each constraint's report:

summary = engine.summary(include_rows=True)

Note

summary() reports the result stored by validate(), so call validate() before summary().
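To make the summary fields concrete, here is a minimal sketch of how row-level counts and ratios could be derived from per-constraint boolean masks. It is plain Python, not the engine's code. The mask polarity (True marks a violation) and the inner "count" and "ratio" keys are assumptions made for illustration; only the three top-level keys come from the text above.

```python
from typing import Dict, List

# Hypothetical per-constraint masks: True marks a row that violates the rule.
violations: Dict[str, List[bool]] = {
    "income_not_null": [False, True, False, False],
    "positive_age":    [False, True, True, False],
}

n_rows = 4

# A row is counted once, however many rules it breaks.
any_violation = [any(flags) for flags in zip(*violations.values())]

summary = {
    "rows_violation_count": sum(any_violation),
    "rows_violation_ratio": sum(any_violation) / n_rows,
    "violation_per_constraint": {
        # "count" and "ratio" are assumed field names for this sketch.
        name: {"count": sum(mask), "ratio": sum(mask) / n_rows}
        for name, mask in violations.items()
    },
}
```

The second row breaks both rules but contributes a single unit to rows_violation_count, which is why the row-level total can be smaller than the sum of the per-constraint counts.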


filter()

Returns a new Dataset containing only the rows that satisfy all row constraints. Column constraints are not used for filtering; they report aggregate violations only.

clean_dataset = engine.filter(dataset)

If the engine contains column constraints, a UserWarning is emitted to remind you they are being skipped during row removal.

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean = engine.filter(dataset)

# caught will contain the column-constraint warning if applicable
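The split between row and column constraints during filtering can be sketched without the library. The example below is an illustration under assumptions, not ConstraintEngine's implementation: the rules list, filter_rows, and the warning text are all hypothetical.

```python
import warnings
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def positive_income(row: Row) -> bool:
    return row["income"] > 0


def mean_age_in_range(rows: List[Row]) -> bool:
    mean = sum(r["age"] for r in rows) / len(rows)
    return 20 <= mean <= 60


# Hypothetical rule registry: (axis, name, check).
# Row checks take one row; column checks take the full list of rows.
rules: List[Tuple[str, str, Callable]] = [
    ("row", "positive_income", positive_income),
    ("column", "age_mean", mean_age_in_range),
]


def filter_rows(rows: List[Row]) -> List[Row]:
    row_checks = [check for axis, _, check in rules if axis == "row"]
    skipped = [name for axis, name, _ in rules if axis == "column"]
    if skipped:
        # Column rules describe the dataset as a whole, so they cannot
        # single out rows to drop; warn instead of silently ignoring them.
        warnings.warn(
            f"Column constraints are not used for filtering: {skipped}",
            UserWarning,
        )
    return [row for row in rows if all(check(row) for check in row_checks)]


data = [
    {"age": 25.0, "income": 40_000.0},
    {"age": 90.0, "income": -5.0},
    {"age": 95.0, "income": 10_000.0},
]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = filter_rows(data)
```

Here the mean age of the input is 70, outside the 20 to 60 range, yet only the row with negative income is removed. The aggregate rule still fails after filtering, which is the situation the warning is meant to flag.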

explain_constraints()

Returns a human-readable dict mapping each constraint's key to its string representation:

engine.explain_constraints()
# {'NotNull on columns [age, income]': 'NotNull on columns [age, income]', ...}

Fault isolation

If any individual constraint raises an exception during validate(), the engine logs a warning and skips that constraint rather than crashing the entire validation. This means partial results are always available even when some constraints are misconfigured.

import logging
logging.basicConfig(level=logging.WARNING)

engine.validate(dataset)  # broken constraints are warned and skipped
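The isolation pattern described above amounts to wrapping each constraint in its own try/except. The following is a minimal sketch of that idea in plain Python; run_all, the logger name, and the example checks are hypothetical and do not reflect the engine's internals.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("constraint_sketch")  # hypothetical logger name

Row = Dict[str, object]


def run_all(
    checks: Dict[str, Callable[[Row], bool]],
    rows: List[Row],
) -> Dict[str, List[bool]]:
    """Run each check independently; a failing check is logged and skipped."""
    results: Dict[str, List[bool]] = {}
    for name, check in checks.items():
        try:
            results[name] = [check(row) for row in rows]
        except Exception as exc:  # isolate the fault to this one check
            logger.warning("Skipping constraint %r: %s", name, exc)
    return results


rows = [{"age": 30}, {"age": 17}]
checks = {
    "adult": lambda row: row["age"] >= 18,
    "typo_in_column": lambda row: row["agee"] >= 18,  # raises KeyError
}

results = run_all(checks, rows)
```

A skipped check leaves no entry in the results, so in a design like this a misconfigured constraint looks the same as one that was never added. Reviewing the warning log after validation is the only way to tell the two apart.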

Naming constraints

Give constraints a name to make summaries and engine output easier to read:

from ydata.constraints import ConstraintEngine, NotNull, MeanBetween

engine = ConstraintEngine([
    NotNull(columns=["income"], name="income_not_null"),
    MeanBetween(lower_bound=20, upper_bound=60, columns=["age"], name="age_mean_range"),
])
engine.validate(dataset)
summary = engine.summary()
# Keys in summary["violation_per_constraint"] will be the names above

Full example

import pandas as pd
import numpy as np
from ydata.dataset import Dataset
from ydata.constraints import (
    ConstraintEngine,
    NotNull, Unique, GreaterThan, NotIncludedIn, StringLength, Monotonic,
    MeanBetween, MinBetween, MaxBetween, NullRateLowerThan,
    CustomConstraint,
)

df = pd.DataFrame({
    "customer_id": range(1_000),
    "age":         np.random.randint(18, 80, 1_000),
    "income":      np.random.normal(50_000, 15_000, 1_000),
    "tx_amount":   np.random.exponential(200, 1_000),
    "status":      np.random.choice(["active", "pending"], 1_000),
    "timestamp":   pd.date_range("2024-01-01", periods=1_000, freq="h"),
})
dataset = Dataset(df)

engine = ConstraintEngine([
    # Row constraints
    NotNull(columns=["age", "income"],          name="no_nulls"),
    Unique(columns=["customer_id"],             name="unique_id"),
    GreaterThan(columns=["age"], value=0,       name="positive_age"),
    NotIncludedIn(column="status",
                  values=["banned", "deleted"], name="valid_status"),
    StringLength(columns=["status"],
                 min_length=4, max_length=10,   name="status_length"),
    Monotonic(columns=["timestamp"],            name="timestamps_ordered"),

    # Column constraints
    MeanBetween(20, 65, columns=["age"],         name="age_mean"),
    MinBetween(0, 18, columns=["age"],           name="age_min"),
    MaxBetween(0, 10_000, columns=["tx_amount"], name="tx_max"),
    NullRateLowerThan(0.05, columns=["income"],  name="income_nulls"),

    # Custom row-level rule (this one checks a single column)
    CustomConstraint(
        lambda df: df["income"] > 0,
        columns=["income"], axis="row", name="positive_income",
    ),
])

engine.validate(dataset)

summary = engine.summary()
print(f"Rows violated: {summary['rows_violation_count']} "
      f"({summary['rows_violation_ratio']:.1%})")

clean = engine.filter(dataset)
# Note: _data is a private attribute and may change between versions
print(f"Clean dataset: {len(clean._data)} rows")