
Metadata

ydata.metadata.Metadata

Core metadata class for analyzing Datasets.

The Metadata class is responsible for extracting statistical summaries and data characteristics from a given Dataset. It plays a central role in both data profiling and synthetic data generation, providing insights into feature distributions, unique values, correlations, and other dataset properties.

Key Features:

  • Schema Inference: Identifies feature types (DataTypes) based on data characteristics.
  • Descriptive Statistics: Computes uniques, skewness, correlations, distributions, among other metrics.
  • Profiling Support: Helps analyze dataset structure, feature importance, and warnings.
  • Synthetic Data Generation Support: Assists in learning data characteristics and identification of potential PII data.
  • Configurable Computation: Supports partitioning and configurable metrics for large datasets.
Properties

  • columns (List[str]): list of feature names in the dataset.
  • ncols (int): number of features/columns.
  • shape (Tuple[int, int]): tuple of (nrows, ncols).
  • uniques (Dict[str, int]): number of unique values per feature.
  • skewness (Dict[str, float]): skewness metric per continuous feature.
  • schema (Dict[str, str]): feature type (VariableTypes), based on data types.

Example Usage:

```python
import pandas as pd

from ydata.metadata import Dataset, Metadata

# Create a dataset
df = pd.read_csv('data.csv')
dataset = Dataset(df)

# Generate metadata for Dataset analysis
metadata = Metadata(dataset=dataset)

# Access dataset insights
print(metadata.shape)      # (10000, 12)
print(metadata.schema)     # {'age': 'int', 'salary': 'float', 'category': 'string'}
print(metadata.uniques)    # {'age': 50, 'salary': 2000, 'category': 5}
```

cardinality property

A property that returns a tuple with a dictionary mapping each categorical variable to its approximate cardinality, together with the sum of the total cardinality.

Returns:

| Name | Type | Description |
|------|------|-------------|
| cardinality | dict | A dictionary with the approximate cardinality of each categorical variable. |
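A minimal sketch of reading this property, assuming a `metadata` instance built as in the example above; the unpacking into a per-column dictionary and a total follows the description of the returned tuple, and the printed values are illustrative only:

```python
# Unpack the per-column approximate cardinality and the summed total
per_column_cardinality, total_cardinality = metadata.cardinality

print(per_column_cardinality)   # e.g. {'category': 5, 'city': 120} (illustrative values)
print(total_cardinality)        # e.g. 125
```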

categorical_vars property

Return the list of categorical columns in the dataset. Returns: categorical_cols (list): A list with the names of the categorical columns in the dataset.

columns property

Get the column metadata for the dataset.

This property returns a dictionary containing metadata about the dataset's columns, including feature names and their associated characteristics. It is primarily used to provide insights into the structure of the dataset.

Returns:

| Name | Type | Description |
|------|------|-------------|
| columns | dict | Metadata dictionary with the mapping of the columns along with their variable and data types. Returns a Column object for each column. |
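A short sketch of inspecting the column metadata, assuming a `metadata` instance as above; since the attributes of each Column object are not enumerated here, the loop only prints the objects themselves:

```python
# Each value in metadata.columns is a Column object describing one feature
for name, column in metadata.columns.items():
    print(name, column)
```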

dataset_attrs property writable

A property that returns a dictionary with the defined dataset attributes. Returns: dataset_attrs (dict): A dictionary with the defined dataset attributes.

date_vars property

Return the list of date columns in the dataset. Returns: date_cols (list): A list with the name of date columns in the dataset.

id_vars property

Return the list of ID columns in the dataset. Returns: id_cols (list): A list with the names of the ID columns in the dataset.

isconstant property

Returns a list with the name of the columns that are constant throughout the dataset, i.e., always assume the same value.

A column is considered constant only when the whole column assumes the same value. This definition, which also accounts for missing values, improves the replication of missing-value distributions.

Returns:

| Name | Type | Description |
|------|------|-------------|
| isconstant | list | A list of columns that are constant. |
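A minimal sketch of acting on this property with pandas, assuming `df` is the DataFrame the Dataset was created from (as in the example at the top of this page):

```python
# Drop columns flagged as constant before further analysis
constant_cols = metadata.isconstant
df_reduced = df.drop(columns=constant_cols)
print(f"Dropped {len(constant_cols)} constant column(s)")
```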

longtext_vars property

Return the list of longtext columns in the dataset. Returns: longtext_cols (list): A list with the names of the longtext columns in the dataset.

ncols property

Get the number of columns in the dataset and/or the ConfigurationBuilder.

Returns:

| Name | Type | Description |
|------|------|-------------|
| ncols | int | The number of columns in the dataset. |

numerical_vars property

Return the list of numerical columns in the dataset. Returns: numerical_cols (list): A list with the name of numerical columns in the dataset.

shape property

Get the shape of the dataset that was fed into the Metadata.

Returns:

| Name | Type | Description |
|------|------|-------------|
| shape | tuple | A tuple (nrows, ncols) with the shape of the dataset that was fed into the Metadata. Only available if a dataset was provided. |

string_vars property

Return the list of string columns in the dataset. Returns: string_cols (list): A list with the names of the string columns in the dataset.
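A small sketch combining the type-specific column lists documented above; the output naturally depends on the dataset:

```python
# Group the dataset's columns by their inferred variable type
print("Numerical:", metadata.numerical_vars)
print("Categorical:", metadata.categorical_vars)
print("Dates:", metadata.date_vars)
print("IDs:", metadata.id_vars)
print("Strings:", metadata.string_vars)
print("Long text:", metadata.longtext_vars)
```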

summary property

Get a comprehensive summary of the dataset's metadata.

This property provides a structured summary containing key dataset metrics, column details, computed statistics, and detected warnings. It is useful for profiling, data validation, and integration with other libraries.

Returns:

| Name | Type | Description |
|------|------|-------------|
| summary | dict | A dictionary containing summary statistics about the dataset. |

It includes all the calculated metrics, such as:

  • Dataset Type: "TABULAR", "TIME-SERIES", etc.
  • Number of Columns: Total feature count.
  • Duplicate Rows: Number of duplicate records detected.
  • Target Column: Identifies a target variable if applicable.
  • Column Details: Data type, variable type, and characteristics for each feature.
  • Warnings: Potential data quality issues such as skewness, cardinality, and imbalances.
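A minimal sketch of exploring the summary; because the exact key names are not enumerated here, the example only lists the top-level keys rather than assuming any of them:

```python
# Inspect which metrics the summary exposes for this dataset
summary = metadata.summary
for key in summary:
    print(key)
```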

target property writable

Get the target column in the dataset. Returns: target (str): The target column in the dataset.
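Because the property is writable, a target column can be assigned directly. A sketch, assuming the setter accepts a plain column name (as suggested by the str return type); the column name 'churn' is purely illustrative:

```python
# Declare a target column for downstream profiling or synthesis
metadata.target = "churn"   # illustrative column name
print(metadata.target)
```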

warnings property

Get dataset warnings based on statistical and structural analyses.

This property returns a dictionary of warnings that highlight potential data quality issues, such as skewness, cardinality, constant values, correlations, imbalances, and constant-length features. These warnings are useful for profiling, preprocessing, and synthetic data generation.

Returns:

| Name | Type | Description |
|------|------|-------------|
| warnings | dict | A dictionary of warnings that highlight potential issues with the dataset variables or columns. |
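A small sketch of iterating over the warnings; the exact structure of each value is not specified here, so the loop simply prints whatever details are attached to each warning type:

```python
# List the warning types detected and the details attached to each
for warning_type, warning_details in metadata.warnings.items():
    print(warning_type, warning_details)
```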

add_characteristic(column, characteristic)

Add new characteristic to a column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | str | Column name. | required |
| characteristic | ColumnCharacteristic | Characteristic to add. | required |

add_characteristics(characteristics)

Add characteristics to the specified columns.

The characteristics argument is a dictionary indexed by column name, and each value accepts two syntaxes: 1. a single characteristic, or 2. a list of characteristics.

Example:

```python
characteristics = {
    'col1': 'phone',
    'col2': ['uuid', 'name']
}
metadata.add_characteristics(characteristics)

```

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| characteristics | dict[str, list[ColumnCharacteristic \| str] \| ColumnCharacteristic \| str] | Characteristics to add. | required |

compute_characteristics(dataset, columns=None, deferred=False)

Compute the dataset's characteristics.

The method returns the characteristics and updates the metadata instance's summary accordingly.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset | Dataset | Dataset corresponding to the Metadata instance. | required |
| columns | dict \| None | Columns dictionary. | None |
| deferred | bool | Defer the computation if True, else compute now. | False |

Returns:

| Type | Description |
|------|-------------|
| dict \| Future | dict if deferred is False, Future otherwise. |
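A sketch of both computation modes, assuming the returned Future exposes a standard result() method (an assumption, not stated above):

```python
# Eager mode: returns the characteristics dictionary immediately
characteristics = metadata.compute_characteristics(dataset)

# Deferred mode: returns a Future to be resolved later
future = metadata.compute_characteristics(dataset, deferred=True)
# characteristics = future.result()   # assumes the Future exposes result()
```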

compute_correlation(dataset, columns=None, deferred=False)

Compute the dataset's correlation matrix.

The method returns the correlation matrix and updates the metadata instance's summary accordingly.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset | Dataset | Dataset corresponding to the Metadata instance. | required |
| columns | dict \| None | Columns dictionary. | None |
| deferred | bool | Defer the computation if True, else compute now. | False |

Returns:

| Type | Description |
|------|-------------|
| DataFrame \| Future | Pandas DataFrame if deferred is False, Future otherwise. |
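A minimal sketch of computing the correlation matrix eagerly; the column names in the commented lookup are illustrative only:

```python
# Compute the correlation matrix as a pandas DataFrame
correlation = metadata.compute_correlation(dataset)
print(correlation.shape)
# print(correlation.loc['age', 'salary'])   # illustrative column names
```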

get_characteristics()

Get the characteristics for all columns.

Returns:

| Type | Description |
|------|-------------|
| dict[str, list[ColumnCharacteristic]] | Characteristics dictionary. |

get_possible_targets()

Identify valid target columns for predictive modeling or synthetic data generation.

This method evaluates the dataset and determines which columns are suitable as target variables. Columns are excluded from consideration if they fall into any of the following categories:

  • Invalid data types (e.g., long text, string, date).
  • Constant values (columns with only one unique value).
  • ID-like columns (unique identifiers that do not hold predictive value).
  • Columns with missing values (to ensure data integrity).
  • Columns with defined characteristics.

Returns:

| Name | Type | Description |
|------|------|-------------|
| targets | tuple | A tuple with the names of the columns that are potential target variables and their details as a dictionary. |
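A sketch of reading the result; unpacking the tuple into (names, details) is an assumption based on the description above:

```python
# Inspect which columns could serve as a prediction target
possible_targets, target_details = metadata.get_possible_targets()  # assumed unpacking
print(possible_targets)
```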

load(path) staticmethod

Load a Metadata object from a saved file.

This method restores a previously saved Metadata object from a pickle (.pkl) file. It allows users to reload metadata without needing to reprocess the dataset.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | The path to load the metadata from. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| metadata | Metadata | A loaded Metadata object. |

Example Usage:

```python
from ydata.metadata import Metadata

# Load metadata from a saved file
metadata = Metadata.load("metadata.pkl")

# Access dataset insights from loaded metadata
print(metadata.shape)
print(metadata.schema)
print(metadata.summary)
```

remove_characteristic(column, characteristic)

Remove a characteristic from a column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | str | Column name. | required |
| characteristic | ColumnCharacteristic | Characteristic to remove. | required |

remove_characteristics(characteristics)

Remove characteristics from the specified columns.

The characteristics argument is a dictionary indexed by column name, and each value accepts two syntaxes: 1. a single characteristic, or 2. a list of characteristics.

Example:

```python
characteristics = {
    'col1': 'phone',
    'col2': ['uuid', 'name']
}
metadata.remove_characteristics(characteristics)

```

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| characteristics | dict[str, list[ColumnCharacteristic \| str] \| ColumnCharacteristic \| str] | Characteristics to remove. | required |

save(path)

Save the Metadata object to a pickle file.

This method serializes the Metadata object and saves it as a pickle (.pkl) file at the specified path. The saved file can later be loaded to restore the metadata without reprocessing the dataset.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | The path to save the metadata to. The file extension should be .pkl to ensure proper deserialization. | required |

Returns:

| Type | Description |
|------|-------------|
| None | The metadata object is stored in the specified file location. |

Example Usage:

```python
from ydata.metadata import Metadata

# Load dataset metadata
metadata = Metadata(dataset=my_dataset)

# Save metadata to a file
metadata.save("metadata.pkl")
```

set_characteristics(characteristics)

Define the characteristics for all columns.

Note: this will overwrite any previously defined characteristics.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| characteristics | dict[str, list[ColumnCharacteristic]] | The new set of characteristics. | required |

set_dataset_attrs(sortby, entities=None)

Update dataset attributes.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| sortby | str \| List[str] | Column(s) that express the temporal component. | required |
| entities | str \| List[str] \| None | Column(s) that identify the entities. Defaults to None. | None |
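A minimal sketch for a time-series dataset; the column names 'timestamp' and 'device_id' are illustrative only:

```python
# Declare the temporal ordering column and the entity identifier
metadata.set_dataset_attrs(sortby="timestamp", entities=["device_id"])
print(metadata.dataset_attrs)
```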

set_dataset_type(dataset_type, dataset_attrs=None)

Update the dataset type and optionally set dataset attributes.

This method updates the dataset type and, if provided, initializes the dataset attributes (dataset_attrs). It is particularly useful when working with time-series datasets, where additional metadata is required.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset_type | DatasetType \| str | New dataset type. | required |
| dataset_attrs | dict \| None | Dataset attrs for TIMESERIES datasets. Defaults to None. | None |
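A sketch of switching a Metadata instance to a time-series dataset. The exact string accepted for the type and the keys expected in dataset_attrs are assumptions (mirroring set_dataset_attrs above), and the column names are illustrative:

```python
# Mark the dataset as a time series and provide the required attributes
metadata.set_dataset_type(
    "timeseries",  # assumed string form of DatasetType
    dataset_attrs={"sortby": "timestamp", "entities": ["device_id"]},  # assumed keys
)
```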

update_datatypes(value, dataset=None)

Method to update the data types set during the Metadata automatic data type inference.

Valid data types to update the columns are: "longtext", "categorical", "numerical", "date" and "id".

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| value | dict | A dictionary with name: datatype pairs to be assigned to the columns. Provide only the names of the columns that need a data type update. | required |
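A minimal sketch, assuming `dataset` is the same Dataset the metadata was computed from and using illustrative column names:

```python
# Override the inferred data types for two columns
metadata.update_datatypes(
    {"zip_code": "categorical", "signup_date": "date"},  # illustrative column names
    dataset=dataset,
)
```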