Column Constraints

Column constraints check aggregate statistics of a column — mean, std, min, max, null rate, etc. — and report whether that statistic falls within an acceptable range. They do not remove rows; they flag the column as a whole.
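
To make the row/column distinction concrete, here is a minimal plain-Python sketch. It does not use ydata, and the names `ages`, `surviving_rows`, and `mean_in_range` are made up for illustration. A row-level rule is evaluated value by value and can drop rows; a column-level rule reduces the whole column to one statistic and returns a single verdict.

```python
from statistics import mean

ages = [17, 25, 40, 63, 71]

# Row-level rule: evaluated per value, so it can be used to drop rows.
surviving_rows = [age for age in ages if age >= 18]

# Column-level rule: one aggregate, one pass/fail verdict, no rows removed.
column_mean = mean(ages)
mean_in_range = 20 <= column_mean <= 60

print(surviving_rows)   # the rows kept by the row-level rule
print(mean_in_range)    # a single verdict for the whole column
```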

All column constraints can be imported from ydata.constraints:

from ydata.constraints import (
    MeanBetween, StandardDeviationBetween, QuantileBetween,
    MinBetween, MaxBetween,
    NullRateLowerThan, NullValuesCountLowerThan,
    UniqueValuesBetween, Constant, SumLowerThan,
    CustomConstraint,  # with axis="column"
)

Column constraints and filter()

ConstraintEngine.filter() operates on row constraints only. If you add column constraints to an engine and then call filter(), the engine emits a UserWarning and ignores those constraints during row removal. Use validate() + summary() to inspect column-level violations.
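
If a silently ignored constraint should stop a pipeline rather than scroll past in the logs, Python's standard `warnings` module can escalate a `UserWarning` to an exception. The sketch below uses a hypothetical stand-in function, `run_filter`, with an invented message; it is not the engine itself and only shows the escalation pattern.

```python
import warnings

def run_filter():
    """Hypothetical stand-in for a call that emits a UserWarning."""
    warnings.warn("column constraints are ignored during row removal", UserWarning)
    return "filtered"

escalated = False
with warnings.catch_warnings():
    # Turn UserWarning into an exception inside this block only.
    warnings.simplefilter("error", UserWarning)
    try:
        run_filter()
    except UserWarning:
        escalated = True

print(escalated)
```

Outside the `with` block the previous warning filters are restored, so the stricter behaviour stays local to the call being guarded.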


MeanBetween

Checks that the column mean falls inside [lower_bound, upper_bound].

from ydata.constraints import MeanBetween

c = MeanBetween(lower_bound=20, upper_bound=60, columns=["age"])

StandardDeviationBetween

Checks that the column standard deviation falls inside [lower_bound, upper_bound].

from ydata.constraints import StandardDeviationBetween

c = StandardDeviationBetween(lower_bound=0, upper_bound=15, columns=["age"])
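
When choosing bounds for a standard deviation, it helps to know which estimator is behind the number. This page does not say whether the sample or the population formula is used; pandas' `Series.std` defaults to the sample formula (`ddof=1`). The plain-Python sketch below, with made-up data and an illustrative upper bound of 8.5, shows that the two can land on opposite sides of a tight bound.

```python
from statistics import pstdev, stdev

ages = [22, 25, 31, 38, 44]

sample_std = stdev(ages)        # divides by n - 1
population_std = pstdev(ages)   # divides by n

lower_bound, upper_bound = 0, 8.5   # illustrative bounds, not from the docs
sample_ok = lower_bound <= sample_std <= upper_bound
population_ok = lower_bound <= population_std <= upper_bound

print(round(sample_std, 3), sample_ok)
print(round(population_std, 3), population_ok)
```

On small columns the gap between the two estimators is largest, so it is worth confirming the bound against a statistic computed the same way as in validation.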

QuantileBetween

Checks that a given quantile of the column falls inside [lower_bound, upper_bound].

from ydata.constraints import QuantileBetween

# 90th percentile of transaction amount must be below 800
c = QuantileBetween(quantile=0.90, lower_bound=0, upper_bound=800, columns=["tx_amount"])

| Parameter | Type | Description |
|---|---|---|
| quantile | float | Quantile to compute, in [0, 1] |
| lower_bound | float | Lower bound (right-open by default) |
| upper_bound | float | Upper bound |
| columns | str \| list[str] \| None | Column(s) to check |
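
A quantile check tolerates a heavy tail that a maximum check would reject. The sketch below is plain Python with a hand-written `quantile` helper that interpolates linearly between the two nearest ranks; that interpolation rule is an assumption, since this page does not state which one the library applies, and the data is invented.

```python
import math

def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = math.floor(position)
    high = math.ceil(position)
    fraction = position - low
    return ordered[low] + (ordered[high] - ordered[low]) * fraction

tx_amount = [12, 40, 55, 80, 120, 150, 300, 420, 610, 790, 2500]

p90 = quantile(tx_amount, 0.90)
p90_ok = 0 <= p90 <= 800
max_ok = 0 <= max(tx_amount) <= 800

print(p90_ok)   # the 90th percentile sits inside the bounds
print(max_ok)   # the single large value would fail a maximum check
```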

MinBetween / MaxBetween

Checks that the column minimum or maximum falls inside [lower_bound, upper_bound].

from ydata.constraints import MinBetween

# Column minimum must be >= 0 and <= 18 (sanity check)
c = MinBetween(lower_bound=0, upper_bound=18, columns=["age"])

from ydata.constraints import MaxBetween

# Column maximum must not exceed 10 000
c = MaxBetween(lower_bound=0, upper_bound=10_000, columns=["tx_amount"])

| Parameter | Type | Description |
|---|---|---|
| lower_bound | float | Minimum allowed value for the column's min/max |
| upper_bound | float | Maximum allowed value for the column's min/max |
| columns | str \| list[str] \| None | Column(s) to check |
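
Minimum and maximum checks are a cheap way to catch sentinel values that leak into numeric columns. The plain-Python sketch below uses invented data in which 999 stands for a hypothetical "unknown" code, and an illustrative upper bound of 120 for the maximum.

```python
ages = [0, 4, 17, 35, 52, 999]   # 999 is a hypothetical "unknown" sentinel

column_min = min(ages)
column_max = max(ages)

min_ok = 0 <= column_min <= 18    # same bounds as the MinBetween example
max_ok = 0 <= column_max <= 120   # illustrative bound for the maximum

print(min_ok, max_ok)
```

The minimum looks healthy while the maximum exposes the sentinel, which is why the two checks are often used together.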

NullRateLowerThan

Checks that the proportion of null values is strictly below value. Prefer this over NullValuesCountLowerThan — it works regardless of dataset size.

from ydata.constraints import NullRateLowerThan

# Less than 5% of values may be null
c = NullRateLowerThan(value=0.05, columns=["income", "age"])

| Parameter | Type | Description |
|---|---|---|
| value | float | Maximum null rate (exclusive), in [0, 1] |
| columns | str \| list[str] \| None | Column(s) to check |
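
Because the threshold is exclusive, a column whose null rate equals `value` exactly does not pass. The plain-Python sketch below illustrates that boundary; it treats `None` as the null marker and uses a hypothetical `null_rate` helper rather than the library.

```python
def null_rate(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)

income = [None] * 5 + [42_000] * 95   # exactly 5% nulls

rate = null_rate(income)
passes = rate < 0.05   # strict comparison, matching "strictly below"

print(rate, passes)
```

If a rate of exactly 5% should be acceptable, the threshold needs to sit slightly above 0.05.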

NullValuesCountLowerThan

Checks that the absolute count of null values is below value.

from ydata.constraints import NullValuesCountLowerThan

c = NullValuesCountLowerThan(value=10, columns=["income"])

Tip

For most use cases, prefer NullRateLowerThan — it doesn't require you to know the dataset size in advance.
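
The difference shows up as soon as the same threshold is applied to datasets of different sizes. In this plain-Python sketch (invented data, `None` as the null marker, hypothetical `null_count` helper), both columns hold nine nulls, so a count threshold of 10 passes both, while a 5% rate threshold separates them.

```python
def null_count(values):
    """Number of entries that are None."""
    return sum(v is None for v in values)

small = [None] * 9 + [1] * 91        # 100 rows, 9 nulls
large = [None] * 9 + [1] * 99_991    # 100,000 rows, 9 nulls

results = {}
for name, column in (("small", small), ("large", large)):
    count = null_count(column)
    rate = count / len(column)
    results[name] = (count < 10, rate < 0.05)

print(results)
```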


UniqueValuesBetween / Constant

Checks the number of unique values in a column.

from ydata.constraints import UniqueValuesBetween

# Column must have between 2 and 10 distinct values
c = UniqueValuesBetween(lower_bound=2, upper_bound=10, columns=["category"])

from ydata.constraints import Constant

# Column must have exactly 1 unique value (i.e. be constant)
c = Constant(columns=["version"])
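
One detail to settle before picking bounds is whether nulls count as a distinct value. This page does not say how the library treats them; pandas' `Series.nunique` excludes them by default (`dropna=True`). The plain-Python sketch below, with invented data, shows that the two conventions differ by one.

```python
category = ["a", "b", "b", None, "c", "a"]

distinct_with_nulls = len(set(category))
distinct_without_nulls = len({v for v in category if v is not None})

# With bounds of 2 to 3, only the null-excluding count passes.
with_nulls_ok = 2 <= distinct_with_nulls <= 3
without_nulls_ok = 2 <= distinct_without_nulls <= 3

print(distinct_with_nulls, with_nulls_ok)
print(distinct_without_nulls, without_nulls_ok)
```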

SumLowerThan

Checks that the column sum is below value.

from ydata.constraints import SumLowerThan

c = SumLowerThan(value=1_000_000, columns=["daily_spend"])

CustomConstraint (column)

For any aggregate logic not covered above, use CustomConstraint with axis="column". The callable receives a pd.Series (one full column at a time) and must return a boolean scalar: True if the column passes, False if it violates.

from ydata.constraints import CustomConstraint

# Less than 5% of values below 18
c = CustomConstraint(
    lambda col: (col < 18).mean() < 0.05,
    columns=["age"],
    name="underage_rate",
    axis="column",
)

# All values must be within 3 standard deviations of the mean
c = CustomConstraint(
    lambda col: ((col - col.mean()).abs() / col.std() < 3).all(),
    columns=["income"],
    name="no_outliers",
    axis="column",
)

| Parameter | Type | Description |
|---|---|---|
| check | Callable | Receives a pd.Series, returns a bool |
| columns | str \| list[str] \| None | Column(s) to evaluate (each independently) |
| name | str \| None | Optional label |
| axis | str | Must be "column" (or "columns") |
Generic: Interval / GreaterThan / LowerThan / Equal

These lower-level classes accept any callable as the check function, making them fully composable.

from ydata.constraints import ColumnGreaterThan, ColumnLowerThan
from ydata.constraints.columns import Interval, Equal

def nunique(col):
    return col.nunique()

# Cardinality must be > 1 (not constant)
c = ColumnGreaterThan(check=nunique, columns=["category"], value=1)

# Mean must equal 0 ± 0.1
c = Equal(check=lambda col: col.mean(), columns=["residuals"], value=0, tolerance=0.1)
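
The pattern behind these classes is "compute any statistic, then compare it". The plain-Python sketch below mimics that idea with two hypothetical helper functions, `greater_than` and `equal`; they are not the ydata classes, and the strict comparison and inclusive tolerance are assumptions chosen for the illustration.

```python
from statistics import mean

def greater_than(check, column, value):
    """Pass if the statistic returned by `check` is strictly above `value`."""
    return check(column) > value

def equal(check, column, value, tolerance=0.0):
    """Pass if the statistic is within `tolerance` of `value` (inclusive here)."""
    return abs(check(column) - value) <= tolerance

def nunique(column):
    """Number of distinct values in the column."""
    return len(set(column))

category = ["a", "b", "a", "c"]
residuals = [-0.3, 0.1, 0.25, 0.05]

not_constant = greater_than(nunique, category, 1)
centered = equal(mean, residuals, 0, tolerance=0.1)

print(not_constant, centered)
```

Keeping the statistic and the comparison separate means one comparison helper can be reused with any aggregate function.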