Dataset
The Dataset object in ydata-sdk is a high-performance data structure built on Dask, enabling efficient computation and parallel processing of large datasets. It not only accelerates operations but also maintains schema references and key dataset properties, ensuring seamless integration with downstream applications such as data profiling and synthetic data generation workflows.
The Dataset object seamlessly integrates with familiar data engines like pandas and numpy, allowing effortless conversion.
Key features
- Optimized for Performance: Leverages Dask for parallel computing, enabling efficient handling of large datasets.
- Schema Awareness: Stores and maintains metadata, data types, and structure for enhanced data integrity.
- Seamless Integration: Easily converts to pandas DataFrames or numpy arrays for flexible data manipulation.
- Scalability: Processes both small and massive datasets without memory limitations.
- Data Preprocessing Support: Provides built-in utilities for data transformation.
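As a minimal sketch of this workflow (assuming the import path shown in the reference below):

```python
import pandas as pd

from ydata.dataset.dataset import Dataset

# Wrap a small pandas DataFrame in a Dataset.
df = pd.DataFrame({"age": [25, 32, 47], "city": ["Porto", "Lisbon", "Braga"]})
data = Dataset(df)

# Convert back to familiar engines when needed.
pdf = data.to_pandas()   # pandas DataFrame
arr = data.to_numpy()    # numpy ndarray
```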
ydata.dataset.dataset.Dataset
The Dataset class provides the interface to handle data within YData's package.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Union[pandas DataFrame, Dask DataFrame] | The data to be manipulated. | required |
| schema | Optional[Dict] | Mapping of column names to variable types. | None |
| sample | float | Fraction of the data to be sampled as the Dataset. | 0.2 |
| index | Optional[str] | Name of the column to be used as index, if any. This is an optional input, especially recommended for time-series data. | None |
| divisions | Optional[list \| tuple] | Used by Dask, the underlying engine of the Dataset object, to enhance performance during parallel computing. It can be leveraged to optimize data processing efficiency and scalability. | None |
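A hedged sketch of constructing a time-series Dataset; the variable-type spellings in schema ("datetime", "float") are assumptions and should be checked against VariableType:

```python
import pandas as pd

from ydata.dataset.dataset import Dataset

df = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=4, freq="D"),
        "sales": [10.0, 12.5, 9.0, 14.2],
    }
)

# schema values are assumed spellings; consult VariableType for the exact names.
data = Dataset(
    df,
    schema={"date": "datetime", "sales": "float"},
    index="date",  # especially recommended for time-series data
)
```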
Properties
- columns (list[str]): list of column names that are part of the Dataset schema
- nrows (tuple): number of rows of the Dataset
- ncols (int): number of columns
- shape (tuple): tuple of (nrows, ncols)
- memory_usage (int): number of bytes consumed by the underlying dataframe
- nmissings (int): total number of missing values in the Dataset
- infered_dtypes_count (Dict[str, Dict]): inferred data type counts per column
- infered_dtypes (Dict[str, str]): inferred data type per column
- dtypes (Dict[str, str]): mapping of data type per column, either provided or inferred
- index (str): name of the index column
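For example, with the Dataset built above:

```python
print(data.columns)     # ['date', 'sales']
print(data.ncols)       # 2
print(data.nmissings)   # total number of missing values, 0 here
print(data.dtypes)      # mapping of column name to variable type
```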
columns
property
Property that returns a list of column names. Returns: columns (list[str]): A list with the Dataset column names.
divisions
property
A property that returns the number of divisions set for the Dataset. Returns: divisions (tuple): the number of divisions set for the Dataset.
index
property
"A property that returns the name of the index column Returns: index_name (str): index columns name
loc
property
Label location based indexer for selection. This method is inherited from Dask original LocIndexer implementation.
df.loc["b"] df.loc["b":"d"]
memory_usage
property
A property that returns the memory usage of the Dataset. Returns: memory_usage (Dask Series): Memory usage of the Dataset.
ncols
property
Property that returns the number of columns. Returns: ncols (int): Number of columns.
nmissings
property
Get the total number of missing values in the Dataset.
This property computes the sum of missing values across all columns in the dataset and returns the total count as an integer.
Returns:

| Name | Type | Description |
|---|---|---|
| nmissings | int | The total number of missing values in the Dataset |
Notes:
- If there are no missing values, the returned value will be `0`.
nrows
property
Property that returns the number of rows. Returns: nrows (int): number of rows.
schema
property
writable
Property that returns a dictionary with the schema of the dataset. The dictionary has the following structure: {column_name: variable_type}
Returns:

| Name | Type | Description |
|---|---|---|
| schema | dict | A dictionary with the schema of the dataset. |
apply(function, axis=1, raw=False, args=None, meta='__no_default__')
Parallelized version of apply.
Only supported on the rows axis. To guarantee results in the expected format, output metadata should be provided with the meta argument.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| function | callable | Function to apply to each row. | required |
| axis | Union[int, str] | 1/'columns': apply function to each row. 0/'index' (apply function to each column) is not supported. | 1 |
| raw | bool | Passed function operates on pandas Series objects (False) or numpy arrays (True). | False |
| args | Optional[Tuple] | Positional arguments to pass to function in addition to the array/series. | None |
| meta | Optional[Union[Dict, List[Tuple], Tuple, Dataset]] | A dictionary, list of tuples, tuple, or Dataset that matches the dtypes and column names of the output. Optional; providing it ensures Dask uses the correct metadata instead of inferring it, which may lead to unexpected results. | '__no_default__' |

Returns: df (Dataset): A Dataset with the output of the function.
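A hedged sketch of a row-wise transformation; the meta dtype strings follow pandas conventions and are assumptions here:

```python
import pandas as pd

def with_doubled_sales(row: pd.Series) -> pd.Series:
    # Hypothetical derived column, for illustration only.
    row["sales_x2"] = row["sales"] * 2
    return row

# meta declares the output's column names and dtypes so Dask does not infer them.
result = data.apply(
    with_doubled_sales,
    axis=1,
    meta={"date": "datetime64[ns]", "sales": "float", "sales_x2": "float"},
)
```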
astype(column, vartype, format=None)
Convert a column in the dataset to a specified data type.
This method changes the data type of a specified column in the dataset, ensuring that the conversion follows the defined VariableType mappings. It also updates the dataset's internal schema to reflect the new type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column | str | The name of the column in the dataset to be converted. | required |
| vartype | VariableType \| str | The target data type for the column. Can be a VariableType or its string representation. | required |
| format | Optional[str] | An optional format string used for date parsing when converting to a date or datetime type. | None |
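A sketch, assuming "datetime" is a valid variable-type string and that format takes strftime-style codes:

```python
# Parse a string column into a datetime type; the internal schema is updated.
data.astype("date", "datetime", format="%Y-%m-%d")
```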
copy()
Copy a Dataset instance.
Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A new Dataset instance with the same schema and index. |
drop_columns(columns, inplace=False)
Drop the specified column(s) from the dataset. If inplace is True, the columns are dropped from the current Dataset; otherwise a new Dataset without those columns is returned.
head(n=5)
Return the n first rows of a dataset.
If the number of rows in the first partition is lower than n, Dask will not return the requested number of rows (see dask.dataframe.core.head and dask.dataframe.core.safe_head). To avoid this corner case, we retry using all partitions minus one.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of rows to select from the top of the dataset. | 5 |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | pandas DataFrame | A pandas DataFrame containing the first n rows. |
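For example:

```python
first_rows = data.head(10)  # pandas DataFrame with (up to) the first 10 rows
```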
infer_dtypes(schema=None)
Infer and assign data types to dataset columns.
This method determines the most representative variable type for each feature based on observed value distributions. If a schema is provided, it overrides the inferred types. Otherwise, the method analyzes the dataset and assigns data types accordingly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| schema | Optional[dict] | A dictionary where keys are column names and values are the manually assigned data types. If None, the types are inferred automatically. | None |
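A sketch of overriding an inferred type; the "float" spelling is an assumed variable-type name, and passing a partial schema is assumed to be accepted:

```python
# Infer all column types, but pin "sales" manually.
data.infer_dtypes(schema={"sales": "float"})
```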
missings(compute=False)
Calculates the number of missing values in a Dataset.
query(query)
Filter the dataset using a query expression.
This method applies a pandas-style query to filter the dataset based on the given condition. It returns a new Dataset containing only the rows that match the query.
For more information, see Dask's documentation on the [Dask DataFrame query expression](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.query.html). Args: query (str): The query expression to filter the dataset.
Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The dataset resulting from the provided query expression. |
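For example:

```python
# Keep only the rows matching a pandas-style expression.
high_sales = data.query("sales > 10")
```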
reorder_columns(columns)
Defines the order of the columns in the underlying data based on the provided list of column names.
Usage:

```python
>>> data.columns
['colA', 'colB', 'colC']
>>> data.reorder_columns(['colB', 'colC']).columns
['colB', 'colC']
```
sample(size, strategy='random', **strategy_params)
Generate a sampled subset of the dataset.
This method returns a sample from the dataset using either random sampling or stratified sampling. The sample size can be defined as an absolute number of rows or as a fraction of the dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| size | Union[float, int] | Size of the sampled subset: an absolute number of rows (int) or a fraction of the dataset (float). | required |
| strategy | str['random', 'stratified'] | Strategy used to generate the sampled subset. | 'random' |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The sampled subset of the dataset. |
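For example:

```python
# 20% random sample (a float size is treated as a fraction).
small = data.sample(0.2)

# 1,000-row stratified sample; any strategy-specific options pass through
# **strategy_params (their names are not documented in this reference).
strat = data.sample(1000, strategy="stratified")
```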
select_columns(columns, copy=True)
Returns a Dataset containing only a subset with the specified columns. If columns is a single feature, returns a Dataset with a single column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| columns | str or list | Column labels to select. | required |
| copy | bool | If True, return a copy. Otherwise, select in place and return self. | True |
select_dtypes(include=None, exclude=None)
Return a subset of the dataset containing only specified data types.
This method filters the dataset to include or exclude specific data types, allowing users to focus on relevant columns based on their types. Args: include (Optional[str | list]): The variable types of the columns to be included in the resulting dataset. exclude (Optional[str | list]): The variable types of the columns to be excluded from the resulting dataset.
Returns: dataset (Dataset): Subset of the dataset containing only columns with the specified variable types.
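A brief sketch of both selectors; the "float" type name is an assumption:

```python
# Keep only the listed columns (a copy by default).
subset = data.select_columns(["sales"])

# Keep only columns of the given variable type(s).
numeric = data.select_dtypes(include="float")
```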
shape(lazy_eval=True, delayed=False)
Returns dataset shape as a tuple (rows, columns).
Supports lazy evaluation of nrows; ncols is inexpensive and returned directly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lazy_eval | bool | Returns the currently computed values for the nrows and ncols properties. Defaults to True. | True |
| delayed | bool | If True, returns delayed values for nrows and ncols instead. This is recommended to optimize the DAG flow of Dask, the underlying computational engine of the Dataset. Defaults to False. | False |
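A small sketch; the assumption is that the delayed variant unpacks the same way:

```python
nrows, ncols = data.shape()                   # eager tuple of (rows, columns)
nrows_d, ncols_d = data.shape(delayed=True)   # delayed values for Dask's DAG
```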
sort_values(by, ignore_index=True, inplace=False)
Sort the dataset by one or more columns.
This method sorts the dataset based on the specified column(s), returning either a new sorted dataset or modifying the existing dataset in place.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| by | List[str] | A list with the name(s) of the column(s) to sort by. | required |
| ignore_index | bool | Whether to ignore the index. Defaults to True. | True |
| inplace | bool | Whether to sort the dataset in place. Defaults to False. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The sorted dataset when inplace is set to False. |
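For example:

```python
# Return a new Dataset ordered by "date"; the original is unchanged.
ordered = data.sort_values(by=["date"])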
sorted_index(by)
Get the sorted index of the dataset based on specified columns.
This method computes the order of the dataset when sorted by the given column(s). It returns a pandas Series representing the index positions corresponding to the sorted dataset. Args: by (List[str]): A list with the name(s) of the column(s) to sort by.
Returns:

| Name | Type | Description |
|---|---|---|
| index | pandas Series | A pandas Series containing the sorted index positions. |
tail(n=5)
Return the n last rows of a dataset.
If the number of rows in the first partition is lower than n, Dask will not return the requested number of rows (see dask.dataframe.core.head and dask.dataframe.core.safe_head). To avoid this corner case, we retry using all partitions minus one.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of rows to select from the bottom of the dataset. | 5 |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | pandas DataFrame | A pandas DataFrame containing the last n rows. |
to_dask()
Converts the Dataset object to a Dask DataFrame. Returns: dataset (dask.DataFrame): The data from the Dataset object as a Dask DataFrame.
to_numpy()
Converts the Dataset object to a numpy ndarray. Returns: dataset (numpy.ndarray): The data from the Dataset object as a numpy ndarray.
to_pandas()
Converts the Dataset object to a pandas DataFrame. Returns: dataset (pandas.DataFrame): The data from the Dataset object as a pandas DataFrame.
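For example, dropping down to the Dask API for a lazy aggregation:

```python
ddf = data.to_dask()
# Lazy mean over a column, materialized only when requested.
mean_sales = ddf["sales"].mean().compute()
```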
uniques(col, approx=True, delayed=False)
Compute the number of unique values in a column.
This method calculates the distinct count of values in a given column, either exactly or using an approximate method for improved performance on large datasets. The result is stored for future reference when an exact count is computed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str | The column name for which to compute the number of unique values. | required |
| approx | bool | If True, computes an approximate distinct count for improved performance on large datasets; if False, computes the exact count. Defaults to True. | True |
| delayed | bool | Whether to compute or delay the count. Defaults to False. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| nuniques | int or Dask Scalar | The number of unique values in the column. |
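For example:

```python
# Fast approximate distinct count (the default).
approx_n = data.uniques("sales")

# Exact count; per the docs, the result is stored for future reference.
exact_n = data.uniques("sales", approx=False)
```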
update_types(dtypes)
Batch update data types for multiple columns in the dataset.
This method allows updating the data types of multiple columns at once by providing a list of dictionaries, where each dictionary specifies a column name and the target variable type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dtypes | list | A list of dictionaries, where each dictionary specifies a column name and the target variable type. | required |
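A sketch of a batch update; the dictionary keys ("column", "vartype") are assumptions mirroring the astype signature, since the exact keys are not shown in this reference:

```python
# Key names below are assumed, not confirmed by this reference.
data.update_types([
    {"column": "sales", "vartype": "int"},
    {"column": "date", "vartype": "datetime"},
])
```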
value_counts(col, compute=True)
Compute the frequency of unique values in a specified column.
This method returns the count of occurrences for each unique value in the given column. By default, it computes the result eagerly, but it can also return a lazy Dask Series for efficient computation on large datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str | The name of the column in the Dataset whose values should be counted. | required |
| compute | bool | Whether to compute or delay the count. Defaults to True. | True |

Returns:

| Name | Type | Description |
|---|---|---|
| value_counts | Series | A Series with the value counts: a computed pandas Series when compute is True, a lazy Dask Series otherwise. |
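For example, assuming a categorical column named "city" as in the first sketch above:

```python
# Eager: a computed Series of counts per unique value.
counts = data.value_counts("city")

# Lazy: a Dask Series, useful inside a larger computation graph.
lazy_counts = data.value_counts("city", compute=False)
```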