
Metadata

ydata.metadata.Metadata

Core metadata class for analyzing Datasets.

The Metadata class is responsible for extracting statistical summaries and data characteristics from a given Dataset. It plays a central role in both data profiling and synthetic data generation, providing insights into feature distributions, unique values, correlations, and other dataset properties.

Key Features:

  • Schema Inference: Identifies feature types (DataTypes) based on data characteristics.
  • Descriptive Statistics: Computes uniques, skewness, correlations, distributions, among other metrics.
  • Profiling Support: Helps analyze dataset structure, feature importance, and warnings.
  • Synthetic Data Generation Support: Assists in learning data characteristics and identification of potential PII data.
  • Configurable Computation: Supports partitioning and configurable metrics for large datasets.
Properties

  • columns (List[str]): list of feature names in the dataset.
  • ncols (int): number of features/columns.
  • shape (Tuple[int, int]): tuple of (nrows, ncols).
  • uniques (Dict[str, int]): number of unique values per feature.
  • skewness (Dict[str, float]): skewness metric per continuous feature.
  • schema (Dict[str, str]): feature type (VariableTypes), based on data types.

Example Usage:

```python
import pandas as pd

from ydata.metadata import Dataset, Metadata

# Create a dataset
df = pd.read_csv('data.csv')
dataset = Dataset(df)

# Generate metadata for Dataset analysis
metadata = Metadata(dataset=dataset)

# Access dataset insights
print(metadata.shape)      # (10000, 12)
print(metadata.schema)     # {'age': 'int', 'salary': 'float', 'category': 'string'}
print(metadata.uniques)    # {'age': 50, 'salary': 2000, 'category': 5}
```

cardinality property

A property that returns a tuple with a dictionary mapping each categorical variable to its approximate cardinality, together with the sum of the total cardinality.

Returns:

| Name | Type | Description |
|------|------|-------------|
| cardinality | dict | A dictionary with the approximate cardinality of each categorical variable. |
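A minimal sketch of reading this property, assuming a `metadata` instance built as in the example above; the unpacking into a per-column dictionary and a total follows the description of the returned tuple, and the printed values are illustrative only:

```python
# Unpack the per-column approximate cardinality and the summed total
per_column_cardinality, total_cardinality = metadata.cardinality

print(per_column_cardinality)   # e.g. {'category': 5, 'city': 120} (illustrative values)
print(total_cardinality)        # e.g. 125
```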

categorical_vars property

Return the list of categorical columns in the dataset. Returns: categorical_cols (list): A list with the names of the categorical columns in the dataset.

columns property

Get the column metadata for the dataset.

This property returns a dictionary containing metadata about the dataset's columns, including feature names and their associated characteristics. It is primarily used to provide insights into the structure of the dataset.

Returns:

| Name | Type | Description |
|------|------|-------------|
| columns | dict | Metadata dictionary with the mapping of the columns along with their variable and data types. Returns a Column object for each column. |
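A short sketch of inspecting the column metadata, assuming a `metadata` instance as above; since the attributes of each Column object are not enumerated here, the loop only prints the objects themselves:

```python
# Each value in metadata.columns is a Column object describing one feature
for name, column in metadata.columns.items():
    print(name, column)
```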

dataset_attrs property writable

A property that returns a dictionary with the defined dataset attributes. Returns: dataset_attrs (dict): A dictionary with the defined dataset attributes.

date_vars property

Return the list of date columns in the dataset. Returns: date_cols (list): A list with the name of date columns in the dataset.

id_vars property

Return the list of ID columns in the dataset. Returns: id_cols (list): A list with the names of the ID columns in the dataset.

isconstant property

Returns a list with the name of the columns that are constant throughout the dataset, i.e., always assume the same value.

A column is considered constant only when the whole column assumes the same value. This definition, which also accounts for missing values, improves the replication of missing-value distributions.

Returns:

| Name | Type | Description |
|------|------|-------------|
| isconstant | list | A list of columns that are constant. |
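A minimal sketch of acting on this property with pandas, assuming `df` is the DataFrame the Dataset was created from (as in the example at the top of this page):

```python
# Drop columns flagged as constant before further analysis
constant_cols = metadata.isconstant
df_reduced = df.drop(columns=constant_cols)
print(f"Dropped {len(constant_cols)} constant column(s)")
```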

longtext_vars property

Return the list of longtext columns in the dataset. Returns: longtext_cols (list): A list with the names of the longtext columns in the dataset.

ncols property

Get the number of columns in the dataset and/or the ConfigurationBuilder.

Returns:

| Name | Type | Description |
|------|------|-------------|
| ncols | int | The number of columns in the dataset. |

numerical_vars property

Return the list of numerical columns in the dataset. Returns: numerical_cols (list): A list with the name of numerical columns in the dataset.

shape property

Get the shape of the dataset that was fed into the Metadata.

Returns:

| Name | Type | Description |
|------|------|-------------|
| shape | tuple | A tuple (nrows, ncols) with the shape of the dataset that was fed into the Metadata. Only available if a dataset was provided. |

string_vars property

Return the list of string columns in the dataset. Returns: string_cols (list): A list with the names of the string columns in the dataset.
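A small sketch combining the type-specific column lists documented above; the output naturally depends on the dataset:

```python
# Group the dataset's columns by their inferred variable type
print("Numerical:", metadata.numerical_vars)
print("Categorical:", metadata.categorical_vars)
print("Dates:", metadata.date_vars)
print("IDs:", metadata.id_vars)
print("Strings:", metadata.string_vars)
print("Long text:", metadata.longtext_vars)
```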

summary property

Get a comprehensive summary of the dataset's metadata.

This property provides a structured summary containing key dataset metrics, column details, computed statistics, and detected warnings. It is useful for profiling, data validation, and integration with other libraries.

Returns:

| Name | Type | Description |
|------|------|-------------|
| summary | dict | A dictionary containing summary statistics about the dataset. |

It includes all the calculated metrics, such as:

  • Dataset Type: "TABULAR", "TIME-SERIES", etc.
  • Number of Columns: Total feature count.
  • Duplicate Rows: Number of duplicate records detected.
  • Target Column: Identifies a target variable if applicable.
  • Column Details: Data type, variable type, and characteristics for each feature.
  • Warnings: Potential data quality issues such as skewness, cardinality, and imbalances.
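A minimal sketch of exploring the summary; because the exact key names are not enumerated here, the example only lists the top-level keys rather than assuming any of them:

```python
# Inspect which metrics the summary exposes for this dataset
summary = metadata.summary
for key in summary:
    print(key)
```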

target property writable

Get the target column in the dataset. Returns: target (str): The target column in the dataset.
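Because the property is writable, a target column can be assigned directly. A sketch, assuming the setter accepts a plain column name (as suggested by the str return type); the column name 'churn' is purely illustrative:

```python
# Declare a target column for downstream profiling or synthesis
metadata.target = "churn"   # illustrative column name
print(metadata.target)
```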

warnings property

Get dataset warnings based on statistical and structural analyses.

This property returns a dictionary of warnings that highlight potential data quality issues, such as skewness, cardinality, constant values, correlations, imbalances, and constant-length features. These warnings are useful for profiling, preprocessing, and synthetic data generation.

Returns:

| Name | Type | Description |
|------|------|-------------|
| warnings | dict | A dictionary of warnings that highlight potential issues with the dataset variables or columns. |
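A small sketch of iterating over the warnings; the exact structure of each value is not specified here, so the loop simply prints whatever details are attached to each warning type:

```python
# List the warning types detected and the details attached to each
for warning_type, warning_details in metadata.warnings.items():
    print(warning_type, warning_details)
```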

add_characteristic(column, characteristic)

Add new characteristic to a column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | str | Column name. | required |
| characteristic | ColumnCharacteristic | Characteristic to add. | required |

add_characteristics(characteristics)

Add characteristics to the specified columns.

The characteristics argument is a dictionary indexed by column name, and each value accepts two syntaxes: 1. a single characteristic, or 2. a list of characteristics.

Example:

```python
characteristics = {
    'col1': 'phone',
    'col2': ['uuid', 'name']
}
metadata.add_characteristics(characteristics)

```

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| characteristics | dict[str, list[ColumnCharacteristic \| str] \| ColumnCharacteristic \| str] | Characteristics to add. | required |

compute_characteristics(dataset, columns=None, deferred=False)

Compute the dataset's characteristics.

The method returns the characteristics and updates the metadata instance's summary accordingly.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset | Dataset | Dataset corresponding to the Metadata instance. | required |
| columns | dict \| None | Columns dictionary. | None |
| deferred | bool | Defer the computation if True, else compute now. | False |

Returns:

| Type | Description |
|------|-------------|
| dict \| Future | dict if deferred is False, Future otherwise. |
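A sketch of both computation modes, assuming the returned Future exposes a standard result() method (an assumption, not stated above):

```python
# Eager mode: returns the characteristics dictionary immediately
characteristics = metadata.compute_characteristics(dataset)

# Deferred mode: returns a Future to be resolved later
future = metadata.compute_characteristics(dataset, deferred=True)
# characteristics = future.result()   # assumes the Future exposes result()
```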

compute_correlation(dataset, columns=None, deferred=False)

Compute the dataset's correlation matrix.

The method returns the correlation matrix and updates the metadata instance's summary accordingly.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset | Dataset | Dataset corresponding to the Metadata instance. | required |
| columns | dict \| None | Columns dictionary. | None |
| deferred | bool | Defer the computation if True, else compute now. | False |

Returns:

| Type | Description |
|------|-------------|
| DataFrame \| Future | Pandas DataFrame if deferred is False, Future otherwise. |
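A minimal sketch of computing the correlation matrix eagerly; the column names in the commented lookup are illustrative only:

```python
# Compute the correlation matrix as a pandas DataFrame
correlation = metadata.compute_correlation(dataset)
print(correlation.shape)
# print(correlation.loc['age', 'salary'])   # illustrative column names
```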

get_characteristics()

Get the characteristics for all columns.

Returns:

| Type | Description |
|------|-------------|
| dict[str, list[ColumnCharacteristic]] | Characteristics dictionary. |

get_possible_targets()

Identify valid target columns for predictive modeling or synthetic data generation.

This method evaluates the dataset and determines which columns are suitable as target variables. Columns are excluded from consideration if they fall into any of the following categories:

  • Invalid data types (e.g., long text, string, date).
  • Constant values (columns with only one unique value).
  • ID-like columns (unique identifiers that do not hold predictive value).
  • Columns with missing values (to ensure data integrity).
  • Columns with defined characteristics.

Returns:

| Name | Type | Description |
|------|------|-------------|
| targets | tuple | A tuple with the names of the columns that are potential target variables and their details as a dictionary. |
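A sketch of reading the result; unpacking the tuple into (names, details) is an assumption based on the description above:

```python
# Inspect which columns could serve as a prediction target
possible_targets, target_details = metadata.get_possible_targets()  # assumed unpacking
print(possible_targets)
```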

load(path) staticmethod

Load a Metadata object from a saved file.

This method restores a previously saved Metadata object from a pickle (.pkl) file. It allows users to reload metadata without needing to reprocess the dataset.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | The path to load the metadata from. | required |

Returns:

| Name | Type | Description |
|------|------|-------------|
| metadata | Metadata | A loaded Metadata object. |

Example Usage:

```python
from ydata.metadata import Metadata

# Load metadata from a saved file
metadata = Metadata.load("metadata.pkl")

# Access dataset insights from loaded metadata
print(metadata.shape)
print(metadata.schema)
print(metadata.summary)
```

remove_characteristic(column, characteristic)

Remove a characteristic from a column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | str | Column name. | required |
| characteristic | ColumnCharacteristic | Characteristic to remove. | required |

remove_characteristics(characteristics)

Remove characteristics from the specified columns.

The characteristics argument is a dictionary indexed by column name, and each value accepts two syntaxes: 1. a single characteristic, or 2. a list of characteristics.

Example:

```python
characteristics = {
    'col1': 'phone',
    'col2': ['uuid', 'name']
}
metadata.remove_characteristics(characteristics)

```

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| characteristics | dict[str, list[ColumnCharacteristic \| str] \| ColumnCharacteristic \| str] | Characteristics to remove. | required |

save(path)

Save the Metadata object to a pickle file.

This method serializes the Metadata object and saves it as a pickle (.pkl) file at the specified path. The saved file can later be loaded to restore the metadata without reprocessing the dataset.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | The path to save the metadata to. The file extension should be .pkl to ensure proper deserialization. | required |

Returns:

| Type | Description |
|------|-------------|
| None | The metadata object is stored in the specified file location. |

Example Usage:

```python
from ydata.metadata import Metadata

# Load dataset metadata
metadata = Metadata(dataset=my_dataset)

# Save metadata to a file
metadata.save("metadata.pkl")
```

set_characteristics(characteristics)

Define the characteristics for all columns.

Note: this will overwrite any previously defined characteristics.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| characteristics | dict[str, list[ColumnCharacteristic]] | The new set of characteristics. | required |

set_dataset_attrs(sortby, entities=None)

Update dataset attributes.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| sortby | str \| List[str] | Column(s) that express the temporal component. | required |
| entities | str \| List[str] \| None | Column(s) that identify the entities. Defaults to None. | None |
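A minimal sketch for a time-series dataset; the column names 'timestamp' and 'device_id' are illustrative only:

```python
# Declare the temporal ordering column and the entity identifier
metadata.set_dataset_attrs(sortby="timestamp", entities=["device_id"])
print(metadata.dataset_attrs)
```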

set_dataset_type(dataset_type, dataset_attrs=None)

Update the dataset type and optionally set dataset attributes.

This method updates the dataset type and, if provided, initializes the dataset attributes (dataset_attrs). It is particularly useful when working with time-series datasets, where additional metadata is required.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| dataset_type | DatasetType \| str | New dataset type. | required |
| dataset_attrs | dict \| None | Dataset attrs for TIMESERIES datasets. Defaults to None. | None |
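A sketch of switching a Metadata instance to a time-series dataset. The exact string accepted for the type and the keys expected in dataset_attrs are assumptions (mirroring set_dataset_attrs above), and the column names are illustrative:

```python
# Mark the dataset as a time series and provide the required attributes
metadata.set_dataset_type(
    "timeseries",  # assumed string form of DatasetType
    dataset_attrs={"sortby": "timestamp", "entities": ["device_id"]},  # assumed keys
)
```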

update_datatypes(value, dataset=None)

Method to update the data types set during the Metadata automatic data type inference.

Valid data types to update the columns are: "longtext", "categorical", "numerical", "date" and "id".

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| value | dict | A dictionary with name: datatype pairs to be assigned to the columns. Provide only the names of the columns that need a data type update. | required |
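A minimal sketch, assuming `dataset` is the same Dataset the metadata was computed from and using illustrative column names:

```python
# Override the inferred data types for two columns
metadata.update_datatypes(
    {"zip_code": "categorical", "signup_date": "date"},  # illustrative column names
    dataset=dataset,
)
```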