Column Constraints
Column constraints check an aggregate statistic of a column (mean, standard deviation, min, max, null rate, and so on) and report whether it falls within an acceptable range. They do not remove rows; they flag the column as a whole.
All column constraints can be imported from ydata.constraints:
from ydata.constraints import (
    MeanBetween, StandardDeviationBetween, QuantileBetween,
    MinBetween, MaxBetween,
    NullRateLowerThan, NullValuesCountLowerThan,
    UniqueValuesBetween, Constant, SumLowerThan,
    CustomConstraint,  # with axis="column"
)
Column constraints and filter()
ConstraintEngine.filter() operates on row constraints only. If you add column constraints to an engine and then call filter(), the engine emits a UserWarning and ignores those constraints during row removal. Use validate() + summary() to inspect column-level violations.
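To see why filter() has nothing to act on, it can help to compare the shape of the two kinds of result. The following is a plain pandas sketch, not the ydata API: a row-level rule yields one boolean per row, which can serve as a mask, while a column-level rule yields a single boolean for the whole column. The DataFrame and the thresholds are made up for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"age": [25, 31, 47, 52, 120]})

# Row-level rule: one boolean per row, so it can be used as a filter mask.
row_mask = df["age"].between(0, 100)
filtered = df[row_mask]

# Column-level rule: one boolean for the whole column, nothing to filter on.
mean_age = df["age"].mean()
column_ok = bool(20 <= mean_age <= 60)

print(len(filtered))        # number of rows that survive the row-level rule
print(mean_age, column_ok)  # a single statistic and a single verdict
```

Note that the out-of-range row (120) pulls the mean up but does not make the column-level check fail here, which is one reason the two kinds of constraint are reported separately.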
MeanBetween
Checks that the column mean falls inside [lower_bound, upper_bound].
from ydata.constraints import MeanBetween
c = MeanBetween(lower_bound=20, upper_bound=60, columns=["age"])
StandardDeviationBetween
Checks that the column standard deviation falls inside [lower_bound, upper_bound].
from ydata.constraints import StandardDeviationBetween
c = StandardDeviationBetween(lower_bound=0, upper_bound=15, columns=["age"])
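The text above does not say whether the check uses the sample or the population standard deviation. The two can differ noticeably on short columns, so it may be worth computing both when choosing bounds. A pandas sketch with made-up values follows; pandas' Series.std defaults to the sample estimate (ddof=1), which may or may not match what StandardDeviationBetween computes.

```python
import pandas as pd

# Hypothetical column, for illustration only.
ages = pd.Series([20, 30, 40, 50, 60])

sample_std = ages.std()            # ddof=1, the pandas default
population_std = ages.std(ddof=0)  # divides by n instead of n - 1

lower, upper = 0, 15
sample_ok = bool(lower <= sample_std <= upper)
population_ok = bool(lower <= population_std <= upper)

print(round(sample_std, 3), sample_ok)
print(round(population_std, 3), population_ok)
```

With these five values the sample estimate lands just above 15 and the population estimate just below it, so the same bounds would give opposite verdicts depending on the definition.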
QuantileBetween
Checks that a given quantile of the column falls inside [lower_bound, upper_bound].
from ydata.constraints import QuantileBetween
# 90th percentile of transaction amount must be below 800
c = QuantileBetween(quantile=0.90, lower_bound=0, upper_bound=800, columns=["tx_amount"])
| Parameter | Type | Description |
|---|---|---|
| quantile | float | Quantile to compute, in [0, 1] |
| lower_bound | float | Lower bound (right-open by default) |
| upper_bound | float | Upper bound |
| columns | str \| list[str] \| None | Column(s) to check |
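When picking bounds for a quantile check, it can be useful to compute the quantile directly first. The sketch below uses pandas only; Series.quantile interpolates linearly between neighbouring values by default, and whether QuantileBetween uses the same interpolation is an assumption here. The data is made up, and the bounds are treated as inclusive for simplicity.

```python
import pandas as pd

# Hypothetical transaction amounts, for illustration only.
tx_amount = pd.Series([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])

# 90th percentile with pandas' default linear interpolation.
q90 = tx_amount.quantile(0.90)

# Same bounds as the example above: 0 to 800.
passes = bool(0 <= q90 <= 800)

print(q90, passes)
```

Here the 90th percentile falls between 900 and 1000, so the column would be flagged even though most individual values are well below 800.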
MinBetween / MaxBetween
Checks that the column minimum or maximum falls inside [lower_bound, upper_bound].
| Parameter | Type | Description |
|---|---|---|
| lower_bound | float | Minimum allowed value for the column's min/max |
| upper_bound | float | Maximum allowed value for the column's min/max |
| columns | str \| list[str] \| None | Column(s) to check |
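The statistic behind these two checks can be reproduced in plain pandas. The helper below, stat_between, is hypothetical and not part of ydata; it assumes a closed interval, as the description above suggests.

```python
import pandas as pd

def stat_between(col: pd.Series, stat: str, lower: float, upper: float) -> bool:
    """Return True if the column's min (or max) lies in [lower, upper]."""
    value = col.min() if stat == "min" else col.max()
    return bool(lower <= value <= upper)

# Hypothetical column, for illustration only.
age = pd.Series([18, 25, 40, 67, 93])

min_ok = stat_between(age, "min", 18, 21)  # the minimum is 18
max_ok = stat_between(age, "max", 60, 90)  # the maximum is 93

print(min_ok, max_ok)
```

Because min and max are each determined by a single value, one extreme row is enough to make the whole column fail.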
NullRateLowerThan
Checks that the proportion of null values is strictly below value. Prefer this over NullValuesCountLowerThan — it works regardless of dataset size.
from ydata.constraints import NullRateLowerThan
# Less than 5% of values may be null
c = NullRateLowerThan(value=0.05, columns=["income", "age"])
| Parameter | Type | Description |
|---|---|---|
| value | float | Maximum null rate (exclusive), in [0, 1] |
| columns | str \| list[str] \| None | Column(s) to check |
NullValuesCountLowerThan
Checks that the absolute count of null values is below value.
from ydata.constraints import NullValuesCountLowerThan
c = NullValuesCountLowerThan(value=10, columns=["income"])
Tip
For most use cases, prefer NullRateLowerThan — it doesn't require you to know the dataset size in advance.
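The difference between the two null checks shows up when the same data arrives in different volumes. The pandas sketch below uses a hypothetical helper, null_stats, and made-up columns: the null count grows with the number of rows, while the null rate stays the same.

```python
import numpy as np
import pandas as pd

def null_stats(col: pd.Series) -> tuple[int, float]:
    """Return (null count, null rate) for a column."""
    nulls = col.isna()
    return int(nulls.sum()), float(nulls.mean())

# Same pattern of missing values at two dataset sizes.
small = pd.Series([1.0, np.nan, 3.0, 4.0] * 25)    # 100 rows
large = pd.Series([1.0, np.nan, 3.0, 4.0] * 2500)  # 10,000 rows

print(null_stats(small))
print(null_stats(large))
```

A count threshold tuned for the small column would be far too strict for the large one, whereas a single rate threshold applies to both.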
UniqueValuesBetween / Constant
Checks the number of unique values in a column.
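The underlying statistic can be inspected with pandas before choosing bounds. This is a sketch only: it assumes that Constant corresponds to a column with exactly one distinct value, and it does not know whether the library counts nulls as a distinct value (pandas' Series.nunique ignores them by default).

```python
import pandas as pd

# Hypothetical columns, for illustration only.
category = pd.Series(["a", "b", "a", None, "c"])
flag = pd.Series([1, 1, 1, 1])

n_unique = category.nunique()                        # nulls are ignored
n_unique_with_null = category.nunique(dropna=False)  # nulls count as one value
is_constant = bool(flag.nunique() == 1)

print(n_unique, n_unique_with_null, is_constant)
```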
SumLowerThan
Checks that the column sum is below value.
from ydata.constraints import SumLowerThan
c = SumLowerThan(value=1_000_000, columns=["daily_spend"])
CustomConstraint (column)
For any aggregate logic not covered above, use CustomConstraint with axis="column". The callable receives a pd.Series (one full column at a time) and must return a boolean scalar — True if the column passes, False if it violates.
from ydata.constraints import CustomConstraint
# Less than 5% of values below 18
c = CustomConstraint(
    lambda col: (col < 18).mean() < 0.05,
    columns=["age"],
    name="underage_rate",
    axis="column",
)
# All values must be within 3 standard deviations of the mean
c = CustomConstraint(
    lambda col: ((col - col.mean()).abs() / col.std() < 3).all(),
    columns=["income"],
    name="no_outliers",
    axis="column",
)
| Parameter | Type | Description |
|---|---|---|
| check | Callable | Receives a pd.Series, returns a bool |
| columns | str \| list[str] \| None | Column(s) to evaluate (each independently) |
| name | str \| None | Optional label |
| axis | str | Must be "column" (or "columns") |
Generic: Interval / GreaterThan / LowerThan / Equal
These lower-level classes accept any callable as the check function, making them fully composable.
from ydata.constraints import ColumnGreaterThan, ColumnLowerThan
from ydata.constraints.columns import Interval, Equal
def nunique(col):
    return col.nunique()
# Cardinality must be > 1 (not constant)
c = ColumnGreaterThan(check=nunique, columns=["category"], value=1)
# Mean must equal 0 ± 0.1
c = Equal(check=lambda col: col.mean(), columns=["residuals"], value=0, tolerance=0.1)
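The composition these classes rely on (apply a callable to the column, then compare the resulting scalar) can be sketched without the library. The helper equal_within below is hypothetical; in particular, it assumes that tolerance is an absolute, inclusive margin around value, which the text above does not confirm.

```python
import pandas as pd

def equal_within(col: pd.Series, check, value: float, tolerance: float = 0.0) -> bool:
    """Apply `check` to the column, then compare the result to `value`."""
    return bool(abs(check(col) - value) <= tolerance)

# Hypothetical residuals, for illustration only.
residuals = pd.Series([-0.3, 0.1, 0.2, 0.1, -0.05])

mean_ok = equal_within(residuals, lambda col: col.mean(), value=0, tolerance=0.1)
single_valued = equal_within(residuals, lambda col: col.nunique(), value=1)

print(mean_ok, single_valued)
```

Any function that maps a pd.Series to a number can be passed as the check in this sketch, which is the sense in which such classes are composable.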