Metadata
ydata.metadata.Metadata
Core metadata class for analyzing Datasets.
The Metadata class is responsible for extracting statistical summaries and data characteristics from a given Dataset. It plays a central role in both data profiling and synthetic data generation, providing insights into feature distributions, unique values, correlations, and other dataset properties.
Key Features:
- Schema Inference: Identifies feature types (DataTypes) based on data characteristics.
- Descriptive Statistics: Computes uniques, skewness, correlations, and distributions, among other metrics.
- Profiling Support: Helps analyze dataset structure, feature importance, and warnings.
- Synthetic Data Generation Support: Assists in learning data characteristics and identification of potential PII data.
- Configurable Computation: Supports partitioning and configurable metrics for large datasets.
Properties
- columns (List[str]): list of feature names in the dataset.
- ncols (int): number of features/columns.
- shape (Tuple[int, int]): tuple of (nrows, ncols).
- uniques (Dict[str, int]): number of unique values per feature.
- skewness (Dict[str, float]): skewness metric per continuous feature.
- schema (Dict[str, str]): feature type (VariableTypes), based on data types.
Example Usage:
import pandas as pd
from ydata.dataset import Dataset
from ydata.metadata import Metadata
# Create a dataset
df = pd.read_csv('data.csv')
dataset = Dataset(df)
# Generate metadata for Dataset analysis
metadata = Metadata(dataset=dataset)
# Access dataset insights
print(metadata.shape) # (10000, 12)
print(metadata.schema) # {'age': 'int', 'salary': 'float', 'category': 'string'}
print(metadata.uniques) # {'age': 50, 'salary': 2000, 'category': 5}
cardinality
property
Return a tuple with a dictionary of the approximate cardinality of each categorical variable and the sum of all cardinalities.
Returns:
Name | Type | Description |
---|---|---|
cardinality | dict | A dictionary with the approximate cardinality of each categorical variable. |
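As an illustration of the shape of this result, the sketch below pairs a per-column cardinality dictionary with the sum of its values. It is a simplified stand-in and not the library's implementation: the helper name cardinality_summary is hypothetical, and it counts distinct values exactly, whereas the property reports approximate cardinalities.

```python
from typing import Any


def cardinality_summary(columns: dict[str, list[Any]]) -> tuple[dict[str, int], int]:
    """Return per-column distinct counts and their sum (exact, not approximated)."""
    per_column = {name: len(set(values)) for name, values in columns.items()}
    return per_column, sum(per_column.values())


# Toy categorical columns, given as plain lists of values.
categorical = {
    "category": ["a", "b", "a", "c"],
    "region": ["north", "south", "north", "north"],
}

per_column, total = cardinality_summary(categorical)
print(per_column)  # {'category': 3, 'region': 2}
print(total)       # 5
```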
categorical_vars
property
Return the list of categorical columns in the dataset.
Returns: categorical_cols (list): A list with the names of the categorical columns in the dataset.
columns
property
Get the column metadata for the dataset.
This property returns a dictionary containing metadata about the dataset's columns, including feature names and their associated characteristics. It is primarily used to provide insights into the structure of the dataset.
Returns:
Name | Type | Description |
---|---|---|
columns | dict | Metadata dictionary mapping each column name to a Column object that holds its variable type and data type. |
dataset_attrs
property
writable
A property that returns a dictionary with the defined dataset attributes.
Returns: dataset_attrs (dict): A dictionary with the defined dataset attributes.
date_vars
property
Return the list of date columns in the dataset.
Returns: date_cols (list): A list with the names of the date columns in the dataset.
id_vars
property
Return the list of ID columns in the dataset.
Returns: id_cols (list): A list with the names of the ID columns in the dataset.
isconstant
property
Return a list with the names of the columns that are constant throughout the dataset, i.e., that always take the same value.
A column is considered constant if and only if the whole column takes the same value. This definition also accounts for missing values, which improves how the distribution of missing values is replicated.
Returns:
Name | Type | Description |
---|---|---|
isconstant | list | A list of the columns that are constant. |
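One reading of this definition can be sketched as follows, under the assumption that a missing entry counts as a value of its own, so a column mixing a single value with missing entries is not constant. The helper name constant_columns and the use of None for missing values are hypothetical; how the library decides this internally is not stated here.

```python
from typing import Any, Optional


def constant_columns(columns: dict[str, list[Optional[Any]]]) -> list[str]:
    """Names of columns whose entries, missing ones included, are all identical."""
    constant = []
    for name, values in columns.items():
        # None stands in for a missing value and counts as a value of its own,
        # so a column mixing 1 and None is not reported as constant.
        if values and len(set(values)) == 1:
            constant.append(name)
    return constant


data = {
    "country": ["PT", "PT", "PT"],
    "flag": [1, None, 1],
    "empty": [None, None, None],
    "age": [31, 45, 31],
}
print(constant_columns(data))  # ['country', 'empty']
```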
longtext_vars
property
Return the list of longtext columns in the dataset.
Returns: longtext_cols (list): A list with the names of the longtext columns in the dataset.
ncols
property
Get the number of columns in the dataset and/or ConfigurationBuilder
Returns:
Name | Type | Description |
---|---|---|
ncols | int | The number of columns in the dataset. |
numerical_vars
property
Return the list of numerical columns in the dataset.
Returns: numerical_cols (list): A list with the names of the numerical columns in the dataset.
shape
property
Get the shape of the dataset that was fed into the Metadata.
Returns:
Name | Type | Description |
---|---|---|
shape | tuple | A tuple (nrows, ncols) with the shape of the dataset that was fed into the Metadata. It is only available when a dataset was provided (dataset is not None). |
string_vars
property
Return the list of string columns in the dataset.
Returns: string_cols (list): A list with the names of the string columns in the dataset.
summary
property
Get a comprehensive summary of the dataset's metadata.
This property provides a structured summary containing key dataset metrics, column details, computed statistics, and detected warnings. It is useful for profiling, data validation, and integration with other libraries.
Returns:
Name | Type | Description |
---|---|---|
summary | dict | A dictionary containing summary statistics about the dataset. It includes all the calculated metrics, such as the dataset type. |
target
property
writable
Get the target column in the dataset.
Returns: target (str): The target column in the dataset.
warnings
property
Get dataset warnings based on statistical and structural analyses.
This property returns a dictionary of warnings that highlight potential data quality issues, such as skewness, cardinality, constant values, correlations, imbalances, and constant-length features. These warnings are useful for profiling, preprocessing, and synthetic data generation.
Returns:
Name | Type | Description |
---|---|---|
warnings | dict | A dictionary of warnings that highlight potential issues with the dataset variables or columns. |
add_characteristic(column, characteristic)
Add a new characteristic to a column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | column name | required |
characteristic | ColumnCharacteristic | characteristic to add | required |
add_characteristics(characteristics)
Add characteristics to the specified columns.
The characteristics argument is a dictionary keyed by column name, whose values accept two syntaxes:
1. a characteristic
2. a list of characteristics
Example:
```python
characteristics = {
    'col1': 'phone',
    'col2': ['uuid', 'name']
}
metadata.add_characteristics(characteristics)
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
characteristics | dict[str, list[ColumnCharacteristic \| str] \| ColumnCharacteristic \| str] | characteristics to add | required |
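To make the two accepted syntaxes concrete, the sketch below expands both into a uniform mapping from column name to a list of characteristics. It is illustrative only: normalize_characteristics is a hypothetical helper, plain strings stand in for ColumnCharacteristic values, and nothing here is claimed about how the library processes the argument.

```python
from typing import Union

CharacteristicSpec = Union[str, list[str]]


def normalize_characteristics(spec: dict[str, CharacteristicSpec]) -> dict[str, list[str]]:
    """Expand both accepted syntaxes into a uniform column -> list mapping."""
    normalized = {}
    for column, value in spec.items():
        # A bare characteristic becomes a one-element list; a list is copied.
        normalized[column] = [value] if isinstance(value, str) else list(value)
    return normalized


normalized = normalize_characteristics({"col1": "phone", "col2": ["uuid", "name"]})
print(normalized)  # {'col1': ['phone'], 'col2': ['uuid', 'name']}
```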
compute_characteristics(dataset, columns=None, deferred=False)
Compute the dataset's characteristics.
The method returns the characteristics and updates the metadata instance's summary accordingly.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | dataset corresponding to the Metadata instance | required |
columns | dict \| None | columns dictionary | None |
deferred | bool | defer the computation if True, else compute now | False |
Returns:
Type | Description |
---|---|
dict \| Future | dict if deferred is False, Future otherwise |
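The deferred flag follows a common pattern: return the result directly, or return a handle that resolves to it later. The sketch below shows that pattern with the standard library. It is an assumption-laden illustration: the Future type used by the library is not specified here, so concurrent.futures.Future stands in for it, and the function compute and its toy result are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Union

# Single background worker used only when the computation is deferred.
_executor = ThreadPoolExecutor(max_workers=1)


def compute(values: list[int], deferred: bool = False) -> Union[dict, Future]:
    """Return the result now, or a Future resolving to it when deferred is True."""

    def work() -> dict:
        # Toy computation standing in for the real characteristics analysis.
        return {"count": len(values), "total": sum(values)}

    if deferred:
        return _executor.submit(work)
    return work()


now = compute([1, 2, 3])
later = compute([1, 2, 3], deferred=True)
print(now)             # {'count': 3, 'total': 6}
print(later.result())  # {'count': 3, 'total': 6}
_executor.shutdown()
```

With a deferred call, the caller decides when to block on result(), which allows several computations to be scheduled before any of them is awaited.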
compute_correlation(dataset, columns=None, deferred=False)
Compute the dataset's correlation matrix.
The method returns the correlation matrix and updates the metadata instance's summary accordingly.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | dataset corresponding to the Metadata instance | required |
columns | dict \| None | columns dictionary | None |
deferred | bool | defer the computation if True, else compute now | False |
Returns:
Type | Description |
---|---|
DataFrame \| Future | pandas DataFrame if deferred is False, Future otherwise |
get_characteristics()
get_possible_targets()
Identify valid target columns for predictive modeling or synthetic data generation.
This method evaluates the dataset and determines which columns are suitable as target variables. Columns are excluded from consideration if they fall into any of the following categories:
- Invalid data types (e.g., long text, string, date).
- Constant values (columns with only one unique value).
- ID-like columns (unique identifiers that do not hold predictive value).
- Columns with missing values (to ensure data integrity).
- Columns with defined characteristics.
Returns:
Name | Type | Description |
---|---|---|
targets | tuple | A tuple with the list of column names that are potential target variables and a dictionary with their details. |
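The exclusion rules listed above can be sketched as a simple filter. This is not the library's implementation: ColumnInfo, its field names, and possible_targets are hypothetical, and the sketch returns only the candidate names rather than the tuple with details described in the table.

```python
from dataclasses import dataclass


@dataclass
class ColumnInfo:
    # Hypothetical per-column record; the field names are illustrative only.
    name: str
    datatype: str
    n_unique: int
    n_missing: int = 0
    characteristics: tuple = ()


INVALID_TYPES = {"longtext", "string", "date"}


def possible_targets(columns: list[ColumnInfo]) -> list[str]:
    """Keep only the columns that pass every exclusion rule."""
    targets = []
    for col in columns:
        if col.datatype in INVALID_TYPES:  # invalid data type
            continue
        if col.datatype == "id":           # ID-like column
            continue
        if col.n_unique <= 1:              # constant column
            continue
        if col.n_missing > 0:              # has missing values
            continue
        if col.characteristics:            # has defined characteristics
            continue
        targets.append(col.name)
    return targets


columns = [
    ColumnInfo("age", "numerical", 50),
    ColumnInfo("notes", "longtext", 900),
    ColumnInfo("customer_id", "id", 1000),
    ColumnInfo("country", "categorical", 1),
    ColumnInfo("income", "numerical", 800, n_missing=12),
    ColumnInfo("phone", "categorical", 990, characteristics=("phone",)),
    ColumnInfo("churn", "categorical", 2),
]
print(possible_targets(columns))  # ['age', 'churn']
```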
load(path)
staticmethod
Load a Metadata object from a saved file.
This method restores a previously saved Metadata object from a pickle (.pkl) file. It allows users to reload metadata without needing to reprocess the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to load the metadata from. | required |
Returns:
Name | Type | Description |
---|---|---|
metadata | Metadata | A loaded Metadata object. |
Example Usage:
remove_characteristic(column, characteristic)
Remove a characteristic from a column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | column name | required |
characteristic | ColumnCharacteristic | characteristic to remove | required |
remove_characteristics(characteristics)
Remove characteristics from the specified columns.
The characteristics argument is a dictionary keyed by column name, whose values accept two syntaxes:
1. a characteristic
2. a list of characteristics
Example:
```python
characteristics = {
    'col1': 'phone',
    'col2': ['uuid', 'name']
}
metadata.remove_characteristics(characteristics)
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
characteristics | dict[str, list[ColumnCharacteristic \| str] \| ColumnCharacteristic \| str] | characteristics to remove | required |
save(path)
Save the Metadata object to a pickle file.
This method serializes the Metadata object and saves it as a pickle (.pkl) file at the specified path. The saved file can later be loaded to restore the metadata without reprocessing the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to save the metadata to. The file extension should be .pkl. | required |
Returns:
Name | Type | Description |
---|---|---|
None | | The metadata object is stored in the specified file location. |
Example Usage:
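A minimal sketch of the save and load round trip described above, using only the standard pickle module on a plain dictionary that stands in for a Metadata object. The helper names save_metadata and load_metadata are hypothetical; with the library itself, going by the signatures documented here, the calls would be metadata.save(path) and Metadata.load(path).

```python
import os
import pickle
import tempfile


def save_metadata(metadata: dict, path: str) -> None:
    """Serialize the object to a pickle file at the given path."""
    with open(path, "wb") as handle:
        pickle.dump(metadata, handle)


def load_metadata(path: str) -> dict:
    """Restore a previously saved object from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dictionary stands in for the computed metadata.
original = {"shape": (10000, 12), "schema": {"age": "int", "salary": "float"}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "metadata.pkl")
    save_metadata(original, path)
    restored = load_metadata(path)

print(restored == original)  # True
```

As with any pickle file, loading should be limited to files from a trusted source, since unpickling can execute arbitrary code.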
set_characteristics(characteristics)
set_dataset_attrs(sortby, entities=None)
set_dataset_type(dataset_type, dataset_attrs=None)
Update the dataset type and optionally set dataset attributes.
This method updates the dataset type and, if provided, initializes the dataset attributes (dataset_attrs). It is particularly useful when working with time-series datasets, where additional metadata is required.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_type | DatasetType \| str | new dataset type | required |
dataset_attrs | dict \| None | Dataset attributes for a TIMESERIES dataset. Defaults to None. | None |
update_datatypes(value, dataset=None)
Update the data types assigned during the Metadata automatic datatype inference.
Valid datatypes for the update are: "longtext", "categorical", "numerical", "date" and "id".
value (dict): A dictionary mapping each column name to the datatype to assign to it. Provide only the columns that need a datatype update.
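Before calling the method, the update dictionary can be checked against the valid datatypes listed above. The sketch below is a hypothetical pre-check, not part of the library: validate_datatype_update and its known_columns argument are assumptions, and whether the library performs similar validation is not stated here.

```python
VALID_DATATYPES = {"longtext", "categorical", "numerical", "date", "id"}


def validate_datatype_update(value: dict[str, str], known_columns: list[str]) -> dict[str, str]:
    """Check a column -> datatype mapping before using it as an update."""
    unknown = [name for name in value if name not in known_columns]
    if unknown:
        raise KeyError(f"Unknown columns: {unknown}")
    invalid = {name: dtype for name, dtype in value.items() if dtype not in VALID_DATATYPES}
    if invalid:
        raise ValueError(f"Invalid datatypes: {invalid}")
    return dict(value)


# Only the column that needs a new datatype is listed.
update = validate_datatype_update({"zip_code": "categorical"}, ["age", "zip_code"])
print(update)  # {'zip_code': 'categorical'}
```

The validated dictionary would then be passed as the value argument, for example metadata.update_datatypes(update).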