Dataset

The Dataset object in ydata-sdk is a high-performance data structure built on Dask, enabling efficient computation and parallel processing of large datasets. It not only accelerates operations but also maintains schema references and key dataset properties, ensuring seamless integration with downstream applications such as data profiling and synthetic data generation workflows.

The Dataset object seamlessly integrates with familiar data engines like pandas and numpy, allowing effortless conversion.

Key features

  • Optimized for Performance: Leverages Dask for parallel computing, enabling efficient handling of large datasets.
  • Schema Awareness: Stores and maintains metadata, data types, and structure for enhanced data integrity.
  • Seamless Integration: Easily converts to pandas DataFrames or numpy arrays for flexible data manipulation.
  • Scalability: Processes both small and massive datasets without being constrained by available memory.
  • Data Preprocessing Support: Provides built-in utilities for data transformation.

ydata.dataset.dataset.Dataset

The Dataset class provides the interface for handling data within YData's package.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | Union[pandas DataFrame, Dask DataFrame] | The data to be manipulated. | required |
| schema | Optional[Dict] | Mapping of column names to variable types. | None |
| sample | float | Fraction of the data to be sampled as the Dataset. | 0.2 |
| index | Optional[str] | Name of the column to be used as index, if any. This is an optional input, especially recommended for TimeSeries data. | None |
| divisions | Optional[list \| tuple] | Used by Dask, the underlying engine of the Dataset object, to improve performance during parallel computing. | None |
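A minimal construction sketch (the import path follows the module path above; the column names are illustrative):

```python
import pandas as pd
from ydata.dataset.dataset import Dataset  # import path taken from the module path above

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=1000, freq="h"),
    "sales": range(1000),
})

# Wrap the pandas DataFrame; an index column is especially
# recommended for time-series data
data = Dataset(df, index="date")
print(data.ncols)
```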
Properties

  • columns (list[str]): list of column names that are part of the Dataset schema
  • nrows (int): number of rows in the Dataset
  • ncols (int): number of columns
  • shape (tuple): tuple of (nrows, ncols)
  • memory_usage (int): number of bytes consumed by the underlying dataframe
  • nmissings (int): total number of missing values in the Dataset
  • infered_dtypes_count (Dict[str, Dict]): inferred data type counts per column
  • infered_dtypes (Dict[str, str]): inferred data type per column
  • dtypes (Dict[str, str]): mapping of data type per column, either provided or inferred
  • index (str): name of the index column

Magic Methods

columns property

Property that returns a list of column names.

Returns: columns (list[str]): A list with the Dataset column names.

divisions property

A property that returns the divisions set for the Dataset.

Returns: divisions (tuple): the divisions set for the Dataset.

index property

A property that returns the name of the index column.

Returns: index_name (str): name of the index column.

loc property

Label-location based indexer for selection. This method is inherited from Dask's original LocIndexer implementation.

```python
df.loc["b"]
df.loc["b":"d"]
```

memory_usage property

A property that returns the memory usage of the Dataset.

Returns: memory_usage (Dask Series): Memory usage of the Dataset.

ncols property

Property that returns the number of columns.

Returns: ncols (int): Number of columns.

nmissings property

Get the total number of missing values in the Dataset.

This property computes and returns the sum of missing values across all columns in the dataset, returning the total count as an integer.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| nmissings | int | The total number of missing values in the Dataset |

Notes:
- If there are no missing values, the returned value will be `0`.

nrows property

Property that returns the number of rows.

Returns: nrows (int): Number of rows.

schema property writable

Property that returns a dictionary with the schema of the dataset, following the structure {column_name: variable_type}.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| schema | dict | A dictionary with the schema of the dataset. |

apply(function, axis=1, raw=False, args=None, meta='__no_default__')

Parallelized version of apply.

Only supported on the rows axis. To guarantee results in the expected format, output metadata should be provided with the meta argument.

Arguments:

  • function (callable): Function to apply to each row.
  • axis (Union[int, str]): 1/'columns' applies the function to each row. 0/'index' (applying the function to each column) is not supported.
  • raw (bool): If False, the passed function operates on pandas Series objects; if True, on numpy arrays.
  • args (Optional[Tuple]): Positional arguments to pass to the function in addition to the array/series.
  • meta (Optional[Union[Dict, List[Tuple], Tuple, Dataset]]): A dictionary, list of tuples, tuple, or Dataset that matches the dtypes and column names of the output. This argument is optional, but providing it ensures that Dask uses the correct metadata instead of inferring it, which could otherwise lead to unexpected results.

Returns: df (Dataset): A Dataset object with the output of function.
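As a hedged illustration, given a Dataset instance data with a hypothetical numeric price column, meta pins the output schema up front:

```python
import pandas as pd

# The function returns a Series per row, so the output is tabular;
# meta describes its schema so Dask does not have to infer it
def with_tax(row):
    return pd.Series({"price": row["price"], "price_taxed": row["price"] * 1.2})

taxed = data.apply(
    with_tax,
    axis=1,
    meta={"price": "float", "price_taxed": "float"},
)
```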

astype(column, vartype, format=None)

Convert a column in the dataset to a specified data type.

This method changes the data type of a specified column in the dataset, ensuring that the conversion follows the defined VariableType mappings. It also updates the dataset's internal schema to reflect the new type.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| column | str | The name of the column in the dataset to be converted. | required |
| vartype | VariableType \| str | The target data type for the column. Can be a VariableType instance or a string representation of a type (e.g., "int", "float", "date"). If "date" is specified, the conversion ensures the column is treated as a date, and you are able to define the date format following Python's format codes. | required |
| format | Optional[str] | An optional format string used for date parsing when vartype="date" or vartype="datetime". If None, default parsing rules are applied. | None |
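A short sketch of a date conversion, assuming a Dataset instance data with a hypothetical string column signup_date:

```python
# Convert the column to a date, providing a strptime-style format;
# per the description above, the Dataset's internal schema is updated
data.astype("signup_date", "date", format="%Y-%m-%d")
print(data.schema["signup_date"])
```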

copy()

Copy a Dataset instance.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dataset | Dataset | A new Dataset instance with the same schema and index. |

drop_columns(columns, inplace=False)

Drops specified columns from a Dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| columns | str or list | Column labels to drop. | required |
| inplace | bool | If False, return a copy. Otherwise, drop in place and return None. | False |

head(n=5)

Return the first n rows of a dataset.

If the number of rows in the first partition is lower than n, Dask will not return the requested number of rows (see dask.dataframe.core.head and dask.dataframe.core.safe_head). To avoid this corner case, we retry using all partitions (npartitions=-1).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n | int | Number of rows to select from the top of the dataset. | 5 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dataset | pandas DataFrame | A pandas DataFrame containing the first n rows. |

infer_dtypes(schema=None)

Infer and assign data types to dataset columns.

This method determines the most representative variable type for each feature based on observed value distributions. If a schema is provided, it overrides the inferred types. Otherwise, the method analyzes the dataset and assigns data types accordingly.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| schema | Optional[dict] | A dictionary where keys are column names and values are the manually assigned data types. If None, the method automatically infers types. | None |

missings(compute=False)

Calculates the number of missing values in a Dataset.

query(query)

Filter the dataset using a query expression.

This method applies a Pandas-style query to filter the dataset based on the given condition. It returns a new Dataset containing only the rows that match the query.

For more information, check Dask's documentation on the Dask DataFrame query expression: https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.query.html

Args: query (str): The query expression to filter the dataset.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dataset | Dataset | The dataset resulting from the provided query expression. |
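A minimal usage sketch, assuming a Dataset instance data with a hypothetical age column:

```python
# Pandas-style expression; only matching rows are kept in the new Dataset
adults = data.query("age >= 18")
```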

reorder_columns(columns)

Defines the order of the underlying data based on the provided 'columns' list of column names.

Usage

```python
data.columns
# ['colA', 'colB', 'colC']
data.reorder_columns(['colB', 'colC']).columns
# ['colB', 'colC']
```

sample(size, strategy='random', **strategy_params)

Generate a sampled subset of the dataset.

This method returns a sample from the dataset using either random sampling or stratified sampling. The sample size can be defined as an absolute number of rows or as a fraction of the dataset.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| size | Union[float, int] | Size of the sampled subset: an absolute number of rows (int) or a fraction of the dataset (float). | required |
| strategy | str['random', 'stratified'] | Strategy used to generate the sampled subset. | 'random' |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dataset | Dataset | The sampled subset of the dataset. |
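Both size styles from the table above in a short sketch, given a Dataset instance data:

```python
# Fraction of the dataset (float): a random 10% sample
subset = data.sample(size=0.1)

# Absolute number of rows (int)
subset = data.sample(size=1000)
```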

select_columns(columns, copy=True)

Returns a Dataset containing only a subset with the specified columns. If columns is a single feature, returns a Dataset with a single column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| columns | str or list | Column labels to select. | required |
| copy | bool | If True, return a copy. Otherwise, select in place and return self. | True |

select_dtypes(include=None, exclude=None)

Return a subset of the dataset containing only specified data types.

This method filters the dataset to include or exclude specific data types, allowing users to focus on relevant columns based on their types.

Args:

  • include (Optional[str | list]): The variable types of the columns to be included in the resulting dataset.
  • exclude (Optional[str | list]): The variable types of the columns to be excluded from the resulting dataset.

Returns: dataset (Dataset): Subset of the dataset containing only columns with the specified variable types.
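A brief sketch, given a Dataset instance data and assuming variable types are referenced by their string names (e.g. "int", "string"), which is an assumption here:

```python
# Keep only integer columns
numeric = data.select_dtypes(include="int")

# Drop string columns instead
non_string = data.select_dtypes(exclude="string")
```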

shape(lazy_eval=True, delayed=False)

Returns dataset shape as a tuple (rows, columns).

Supports lazy evaluation of nrows; ncols is inexpensive and returned directly.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| lazy_eval | bool | If True, return the currently computed values for the nrows and ncols properties. Defaults to True. | True |
| delayed | bool | If True, compute delayed properties for nrows and ncols instead. This is recommended to optimize the DAG flow of Dask, the underlying computational engine of the Dataset. Defaults to False. | False |

sort_values(by, ignore_index=True, inplace=False)

Sort the dataset by one or more columns.

This method sorts the dataset based on the specified column(s), returning either a new sorted dataset or modifying the existing dataset in place.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| by | List[str] | A list with the name(s) of the column(s) to sort by. | required |
| ignore_index | bool | Whether to ignore the index. Defaults to True. | True |
| inplace | bool | Whether to sort the dataset in place. Defaults to False. | False |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dataset | Dataset | The sorted dataset, when inplace is set to False. |
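For instance, given a Dataset instance data with a hypothetical timestamp column:

```python
# Returns a new sorted Dataset, since inplace defaults to False
data_sorted = data.sort_values(by=["timestamp"])
```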

sorted_index(by)

Get the sorted index of the dataset based on specified columns.

This method computes the order of the dataset when sorted by the given column(s). It returns a pandas Series representing the index positions corresponding to the sorted dataset.

Args: by (List[str]): A list with the name(s) of the column(s) to sort by.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| index | pandas Series | A pandas Series containing the sorted index positions. |

tail(n=5)

Return the last n rows of a dataset.

If the number of rows in the last partition is lower than n, Dask will not return the requested number of rows. To avoid this corner case, we retry using all partitions (npartitions=-1).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n | int | Number of rows to select from the bottom of the dataset. | 5 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dataset | pandas DataFrame | A pandas DataFrame containing the last n rows. |

to_dask()

Converts the Dataset object to a Dask DataFrame.

Returns: dataset (dask.DataFrame): The data from the Dataset object as a Dask DataFrame.

to_numpy()

Converts the Dataset object to a numpy ndarray.

Returns: dataset (numpy ndarray): The data from the Dataset object as a numpy ndarray.

to_pandas()

Converts the Dataset object to a pandas DataFrame.

Returns: dataset (pandas.DataFrame): The data from the Dataset object as a pandas DataFrame.
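The three converters side by side, given a Dataset instance data; to_pandas() and to_numpy() materialize the data in memory, so to_dask() is usually preferable for very large datasets:

```python
ddf = data.to_dask()    # lazy Dask DataFrame, no materialization
pdf = data.to_pandas()  # in-memory pandas DataFrame
arr = data.to_numpy()   # in-memory numpy ndarray
```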

uniques(col, approx=True, delayed=False)

Compute the number of unique values in a column.

This method calculates the distinct count of values in a given column, either exactly or using an approximate method for improved performance on large datasets. The result is stored for future reference when an exact count is computed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str | The column name for which to compute the number of unique values. | required |
| approx | bool | If True, uses an approximate method to estimate the unique value count. If False, computes the exact count. Defaults to True. | True |
| delayed | bool | Whether to compute or delay the count. Defaults to False. | False |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| nuniques | int or Dask Scalar | The number of unique values in the column. |

update_types(dtypes)

Batch update data types for multiple columns in the dataset.

This method allows updating the data types of multiple columns at once by providing a list of dictionaries, where each dictionary specifies a column name and the target variable type.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dtypes | list | A list of dictionaries, where each dictionary must contain "column" (str), the name of the column to update, and "vartype" (VariableType \| str), the new data type for the column. | required |
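A minimal sketch of a batch update, given a Dataset instance data; the column names are illustrative:

```python
# Each entry pairs a column with its target variable type
data.update_types([
    {"column": "age", "vartype": "int"},
    {"column": "signup_date", "vartype": "date"},
])
```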

value_counts(col, compute=True)

Compute the frequency of unique values in a specified column.

This method returns the count of occurrences for each unique value in the given column. By default, it computes the result eagerly, but it can also return a lazy Dask Series for efficient computation on large datasets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| col | str | The name of the column in the Dataset whose values we want to count. | required |
| compute | bool | Whether to compute or delay the count. Defaults to True. | True |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| value_counts | pandas Series or Dask Series | A Series with the value counts: pandas when compute=True, Dask otherwise. |
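A short sketch, given a Dataset instance data with a hypothetical country column:

```python
# Eager count of each unique value
counts = data.value_counts("country")

# Lazy variant, useful for composing larger Dask graphs
lazy_counts = data.value_counts("country", compute=False)
```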