Constraint Engine
The ConstraintEngine is the orchestrator: it holds a collection of constraints, runs them against a Dataset, aggregates the results, and can filter out non-compliant rows.
Creating an engine
Pass constraints at construction time, or add them incrementally:
from ydata.constraints import ConstraintEngine, NotNull, MeanBetween
# At construction
engine = ConstraintEngine([
    NotNull(columns=["age", "income"]),
    MeanBetween(lower_bound=20, upper_bound=60, columns=["age"]),
])
# Incrementally
engine = ConstraintEngine()
engine.add_constraint(NotNull(columns=["age"]))
engine.add_constraints([MeanBetween(lower_bound=20, upper_bound=60, columns=["age"])])
validate()
Runs all constraints against the dataset and stores the result internally.
The engine caches the result, so calling validate() again on the same engine is a no-op unless you add or remove a constraint between calls. Adding or removing a constraint resets the cache automatically.
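The caching behaviour described above can be pictured with a small stand-alone sketch. It is not the library's implementation: the class CachingValidator, its add_check method, and the runs counter are hypothetical names used only to show the "cache until the rule set changes" pattern.

```python
class CachingValidator:
    """Toy stand-in for an engine that caches its last validation result."""

    def __init__(self, checks=None):
        self._checks = list(checks or [])
        self._cached = None  # None means nothing is cached yet
        self.runs = 0        # how many times the checks were really executed

    def add_check(self, check):
        self._checks.append(check)
        self._cached = None  # changing the rule set invalidates the cache

    def validate(self, rows):
        if self._cached is None:
            self.runs += 1
            self._cached = [
                all(check(row) for check in self._checks) for row in rows
            ]
        return self._cached


rows = [{"age": 30}, {"age": None}]
validator = CachingValidator([lambda row: row["age"] is not None])

first = validator.validate(rows)
second = validator.validate(rows)  # served from the cache, nothing recomputed

validator.add_check(lambda row: row["age"] is None or row["age"] < 18)
third = validator.validate(rows)   # cache was reset, so the checks run again
```

In this sketch the second call returns the very same list object as the first, and the checks only run twice in total.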
summary()
Returns a nested dict with violation counts, ratios, and per-constraint breakdowns.
summary = engine.summary()
print(summary["rows_violation_count"]) # total rows that violated at least one rule
print(summary["rows_violation_ratio"]) # as a fraction of total rows
for name, detail in summary["violation_per_constraint"].items():
    print(name, detail)
Passing include_rows=True adds the raw boolean mask to each constraint's report.
Note
Call validate() before summary().
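To make the relationship between the row-level totals and the per-constraint breakdown concrete, here is a minimal sketch in plain Python. The top-level key names mirror the ones shown above. The summarize function is hypothetical, and reducing each per-constraint detail to a simple count is an assumption about its shape, not the engine's actual output.

```python
def summarize(masks):
    """masks maps a constraint name to a list of booleans (True = row violates)."""
    n_rows = len(next(iter(masks.values()))) if masks else 0
    # A row counts once, no matter how many constraints it violates.
    any_violation = [
        any(mask[i] for mask in masks.values()) for i in range(n_rows)
    ]
    violated_rows = sum(any_violation)
    return {
        "rows_violation_count": violated_rows,
        "rows_violation_ratio": violated_rows / n_rows if n_rows else 0.0,
        "violation_per_constraint": {
            name: sum(mask) for name, mask in masks.items()
        },
    }


masks = {
    "income_not_null": [False, True, False, False],
    "positive_age": [False, True, True, False],
}
report = summarize(masks)
print(report["rows_violation_count"], report["rows_violation_ratio"])
```

The second row violates both constraints but is counted only once in the row total, which is why the per-constraint counts can add up to more than rows_violation_count.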
filter()
Returns a new Dataset with only the rows that satisfy all row constraints. Column constraints are not used for filtering; they report aggregate violations only.
If the engine contains column constraints, a UserWarning is emitted to remind you they are being skipped during row removal.
import warnings
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean = engine.filter(dataset)
# caught will contain the column-constraint warning if applicable
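The split between row and column constraints during filtering can be illustrated without the library. The sketch below is an assumption-laden toy: filter_rows, not_null, and mean_in_range are hypothetical helpers, and the warning text is invented. Only the general pattern (keep rows that pass every row check, warn that column checks are ignored) follows the description above.

```python
import warnings


def filter_rows(rows, row_checks, column_checks=()):
    """Keep rows passing every row check; column checks are only warned about."""
    if column_checks:
        warnings.warn(
            "column-level checks are ignored when filtering rows",
            UserWarning,
            stacklevel=2,
        )
    return [row for row in rows if all(check(row) for check in row_checks)]


def not_null(row):
    return row["age"] is not None


def mean_in_range(all_rows):
    # Aggregate check: it describes the whole column, not a single row.
    ages = [r["age"] for r in all_rows if r["age"] is not None]
    return 20 <= sum(ages) / len(ages) <= 60


rows = [{"age": 25}, {"age": None}, {"age": 41}]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clean = filter_rows(rows, [not_null], column_checks=[mean_in_range])

print(len(clean), [str(w.message) for w in caught])
```

An aggregate check has no per-row verdict, so there is nothing it could contribute to deciding which individual rows to drop. That is the reason for warning rather than silently ignoring it.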
explain_constraints()
Returns a human-readable dict mapping each constraint's key to its string representation:
engine.explain_constraints()
# {'NotNull on columns [age, income]': 'NotNull on columns [age, income]', ...}
Fault isolation
If any individual constraint raises an exception during validate(), the engine logs a warning and skips that constraint rather than crashing the entire validation. Results from the remaining constraints are therefore still available even when some constraints are misconfigured.
import logging
logging.basicConfig(level=logging.WARNING)
engine.validate(dataset) # broken constraints are warned and skipped
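The skip-and-warn pattern behind this behaviour looks roughly like the following sketch. It is illustrative only: run_checks, the logger name, and the log message are hypothetical and do not reflect how the engine is written internally.

```python
import logging

logger = logging.getLogger("constraint_sketch")


def run_checks(checks, rows):
    """Run each named check; log and skip any check that raises."""
    results = {}
    skipped = []
    for name, check in checks.items():
        try:
            results[name] = [check(row) for row in rows]
        except Exception as exc:  # isolate the failure to this one check
            logger.warning("Skipping constraint %r: %r", name, exc)
            skipped.append(name)
    return results, skipped


rows = [{"age": 30}, {"age": 17}]
checks = {
    "adult": lambda row: row["age"] >= 18,
    "typo_in_column": lambda row: row["agee"] > 0,  # raises KeyError
}
results, skipped = run_checks(checks, rows)
print(results, skipped)
```

The trade-off of this design is that a misconfigured constraint fails quietly unless warnings are being logged, so it is worth enabling WARNING-level logging while developing a rule set.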
Naming constraints
Give constraints a name to make summaries and engine output easier to read:
from ydata.constraints import ConstraintEngine, NotNull, MeanBetween
engine = ConstraintEngine([
    NotNull(columns=["income"], name="income_not_null"),
    MeanBetween(lower_bound=20, upper_bound=60, columns=["age"], name="age_mean_range"),
])
engine.validate(dataset)
summary = engine.summary()
# Keys in summary["violation_per_constraint"] will be the names above
Full example
import pandas as pd
import numpy as np
from ydata.dataset import Dataset
from ydata.constraints import (
    ConstraintEngine,
    NotNull, Unique, GreaterThan, NotIncludedIn, StringLength, Monotonic,
    MeanBetween, MinBetween, MaxBetween, NullRateLowerThan,
    CustomConstraint,
)
df = pd.DataFrame({
    "customer_id": range(1_000),
    "age": np.random.randint(18, 80, 1_000),
    "income": np.random.normal(50_000, 15_000, 1_000),
    "tx_amount": np.random.exponential(200, 1_000),
    "status": np.random.choice(["active", "pending"], 1_000),
    "timestamp": pd.date_range("2024-01-01", periods=1_000, freq="h"),
})
dataset = Dataset(df)
engine = ConstraintEngine([
    # Row constraints
    NotNull(columns=["age", "income"], name="no_nulls"),
    Unique(columns=["customer_id"], name="unique_id"),
    GreaterThan(columns=["age"], value=0, name="positive_age"),
    NotIncludedIn(column="status",
                  values=["banned", "deleted"], name="valid_status"),
    StringLength(columns=["status"],
                 min_length=4, max_length=10, name="status_length"),
    Monotonic(columns=["timestamp"], name="timestamps_ordered"),
    # Column constraints
    MeanBetween(20, 65, columns=["age"], name="age_mean"),
    MinBetween(0, 18, columns=["age"], name="age_min"),
    MaxBetween(0, 10_000, columns=["tx_amount"], name="tx_max"),
    NullRateLowerThan(0.05, columns=["income"], name="income_nulls"),
    # Custom row-level rule
    CustomConstraint(
        lambda df: df["income"] > 0,
        columns=["income"], axis="row", name="positive_income",
    ),
])
engine.validate(dataset)
summary = engine.summary()
print(f"Rows violated: {summary['rows_violation_count']} "
      f"({summary['rows_violation_ratio']:.1%})")
clean = engine.filter(dataset)
print(f"Clean dataset: {len(clean._data)} rows")