Dataset
The Dataset object in ydata-sdk is a high-performance data structure built on Dask, enabling efficient computation and parallel processing of large datasets. It not only accelerates operations but also maintains schema references and key dataset properties, ensuring seamless integration with downstream applications such as data profiling and synthetic data generation workflows.
The Dataset object seamlessly integrates with familiar data engines like pandas and numpy, allowing effortless conversion.
Key features
- Optimized for Performance: Leverages Dask for parallel computing, enabling efficient handling of large datasets.
- Schema Awareness: Stores and maintains metadata, data types, and structure for enhanced data integrity.
- Seamless Integration: Easily converts to pandas DataFrames or numpy arrays for flexible data manipulation.
- Scalability: Processes both small and massive datasets without memory limitations.
- Data Preprocessing Support: Provides built-in utilities for data transformation.
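As a minimal sketch of this workflow (assuming the import path shown in the reference below):

```python
import pandas as pd

from ydata.dataset.dataset import Dataset

# Wrap a small pandas DataFrame in a Dataset.
df = pd.DataFrame({"age": [25, 32, 47], "city": ["Porto", "Lisbon", "Braga"]})
data = Dataset(df)

# Convert back to familiar engines when needed.
pdf = data.to_pandas()   # pandas DataFrame
arr = data.to_numpy()    # numpy ndarray
```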
ydata.dataset.dataset.Dataset
The Dataset class provides the interface to handle data within YData's package.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | Union[pandas DataFrame, Dask DataFrame] | The data to be manipulated. | required |
| schema | Optional[Dict] | Mapping of column names to variable types. | None |
| sample | float | Fraction of the data to be sampled as the Dataset. | 0.2 |
| index | Optional[str] | Name of the column to be used as index, if any. This is an optional input, especially recommended for time-series data. | None |
| divisions | Optional[list \| tuple] | Used by Dask, the underlying engine of the Dataset object, to enhance performance during parallel computing. It can be leveraged to optimize data processing efficiency and scalability. | None |
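A hedged sketch of constructing a time-series Dataset; the variable-type spellings in schema ("datetime", "float") are assumptions and should be checked against VariableType:

```python
import pandas as pd

from ydata.dataset.dataset import Dataset

df = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=4, freq="D"),
        "sales": [10.0, 12.5, 9.0, 14.2],
    }
)

# schema values are assumed spellings; consult VariableType for the exact names.
data = Dataset(
    df,
    schema={"date": "datetime", "sales": "float"},
    index="date",  # especially recommended for time-series data
)
```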
Properties
- columns (list[str]): list of column names that are part of the Dataset schema
- nrows (tuple): number of rows of the Dataset
- ncols (int): number of columns
- shape (tuple): tuple of (nrows, ncols)
- memory_usage (int): number of bytes consumed by the underlying dataframe
- nmissings (int): total number of missing values in the Dataset
- infered_dtypes_count (Dict[str, Dict]): inferred data type counts per column
- infered_dtypes (Dict[str, str]): inferred data type per column
- dtypes (Dict[str, str]): mapping of data type per column, either provided or inferred
- index (str): name of the index column
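For example, with the Dataset built above:

```python
print(data.columns)     # ['date', 'sales']
print(data.ncols)       # 2
print(data.nmissings)   # total number of missing values, 0 here
print(data.dtypes)      # mapping of column name to variable type
```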
columns
property
Property that returns a list of column names. Returns: columns (list[str]): A list with the Dataset column names.
divisions
property
A property that returns the number of divisions set for the Dataset. Returns: divisions (tuple): the number of divisions set for the Dataset.
index
property
"A property that returns the name of the index column Returns: index_name (str): index columns name
loc
property
Label location based indexer for selection. This method is inherited from Dask original LocIndexer implementation.
df.loc["b"] df.loc["b":"d"]
memory_usage
property
A property that returns the memory usage of the Dataset. Returns: memory_usage (Dask Series): Memory usage of the Dataset.
ncols
property
Property that returns the number of columns. Returns: ncols (int): Number of columns.
nmissings
property
Get the total number of missing values in the Dataset.
This property computes the sum of missing values across all columns in the dataset and returns the total count as an integer.
Returns:

| Name | Type | Description |
|---|---|---|
| nmissings | int | The total number of missing values in the Dataset |
Notes:
- If there are no missing values, the returned value will be `0`.
nrows
property
Property that returns the number of rows. Returns: nrows (int): number of rows.
schema
property
writable
Property that returns a dictionary with the schema of the dataset. The dictionary has the following structure: {column_name: variable_type}
Returns:

| Name | Type | Description |
|---|---|---|
| schema | dict | A dictionary with the schema of the dataset. |
apply(function, axis=1, raw=False, args=None, meta='__no_default__')
Parallelized version of apply.
Only supported on the rows axis. To guarantee results in the expected format, output metadata should be provided with the meta argument.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| function | callable | Function to apply to each row. | required |
| axis | Union[int, str] | 1/'columns': apply function to each row. 0/'index' (apply function to each column) is not supported. | 1 |
| raw | bool | Passed function operates on pandas Series objects (False) or numpy arrays (True). | False |
| args | Optional[Tuple] | Positional arguments to pass to function in addition to the array/series. | None |
| meta | Optional[Union[Dict, List[Tuple], Tuple, Dataset]] | A dictionary, list of tuples, tuple, or Dataset that matches the dtypes and column names of the output. Optional; providing it ensures Dask uses the correct metadata instead of inferring it, which may lead to unexpected results. | '__no_default__' |

Returns: df (Dataset): A Dataset with the output of the function.
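A hedged sketch of a row-wise transformation; the meta dtype strings follow pandas conventions and are assumptions here:

```python
import pandas as pd

def with_doubled_sales(row: pd.Series) -> pd.Series:
    # Hypothetical derived column, for illustration only.
    row["sales_x2"] = row["sales"] * 2
    return row

# meta declares the output's column names and dtypes so Dask does not infer them.
result = data.apply(
    with_doubled_sales,
    axis=1,
    meta={"date": "datetime64[ns]", "sales": "float", "sales_x2": "float"},
)
```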
astype(column, vartype, format=None)
Convert a column in the dataset to a specified data type.
This method changes the data type of a specified column in the dataset, ensuring that the conversion follows the defined VariableType mappings. It also updates the dataset's internal schema to reflect the new type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column | str | The name of the column in the dataset to be converted. | required |
| vartype | VariableType \| str | The target data type for the column. Can be a VariableType or its string representation. | required |
| format | Optional[str] | An optional format string used for date parsing when converting to a date or datetime type. | None |
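A sketch, assuming "datetime" is a valid variable-type string and that format takes strftime-style codes:

```python
# Parse a string column into a datetime type; the internal schema is updated.
data.astype("date", "datetime", format="%Y-%m-%d")
```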
copy()
Copy a Dataset instance.
Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A new Dataset instance with the same schema and index. |
drop_columns(columns, inplace=False)
Drop the specified column(s) from the dataset. If inplace is True, the columns are dropped from the current Dataset; otherwise a new Dataset without those columns is returned.
head(n=5)
Return the n first rows of a dataset.
If the number of rows in the first partition is lower than n, Dask will not return the requested number of rows (see dask.dataframe.core.head and dask.dataframe.core.safe_head). To avoid this corner case, we retry using all partitions minus one.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of rows to select from the top of the dataset. | 5 |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | pandas DataFrame | A pandas DataFrame containing the first n rows. |
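For example:

```python
first_rows = data.head(10)  # pandas DataFrame with (up to) the first 10 rows
```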
infer_dtypes(schema=None)
Infer and assign data types to dataset columns.
This method determines the most representative variable type for each feature based on observed value distributions. If a schema is provided, it overrides the inferred types. Otherwise, the method analyzes the dataset and assigns data types accordingly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| schema | Optional[dict] | A dictionary where keys are column names and values are the manually assigned data types. If None, the types are inferred automatically. | None |
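A sketch of overriding an inferred type; the "float" spelling is an assumed variable-type name, and passing a partial schema is assumed to be accepted:

```python
# Infer all column types, but pin "sales" manually.
data.infer_dtypes(schema={"sales": "float"})
```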
missings(compute=False)
Calculates the number of missing values in a Dataset.
query(query)
Filter the dataset using a query expression.
This method applies a pandas-style query to filter the dataset based on the given condition. It returns a new Dataset containing only the rows that match the query.
For more information, see Dask's documentation on the [Dask DataFrame query expression](https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.query.html). Args: query (str): The query expression to filter the dataset.
Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The dataset resulting from the provided query expression. |
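For example:

```python
# Keep only the rows matching a pandas-style expression.
high_sales = data.query("sales > 10")
```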
reorder_columns(columns)
Defines the order of the columns in the underlying data based on the provided list of column names.
Usage:

```python
>>> data.columns
['colA', 'colB', 'colC']
>>> data.reorder_columns(['colB', 'colC']).columns
['colB', 'colC']
```
sample(size, strategy='random', **strategy_params)
Generate a sampled subset of the dataset.
This method returns a sample from the dataset using either random sampling or stratified sampling. The sample size can be defined as an absolute number of rows or as a fraction of the dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| size | Union[float, int] | Size of the sampled subset: an absolute number of rows (int) or a fraction of the dataset (float). | required |
| strategy | str['random', 'stratified'] | Strategy used to generate the sampled subset. | 'random' |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The sampled subset of the dataset. |
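For example:

```python
# 20% random sample (a float size is treated as a fraction).
small = data.sample(0.2)

# 1,000-row stratified sample; any strategy-specific options pass through
# **strategy_params (their names are not documented in this reference).
strat = data.sample(1000, strategy="stratified")
```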
select_columns(columns, copy=True)
Returns a Dataset containing only a subset with the specified columns. If columns is a single feature, returns a Dataset with a single column.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| columns | str or list | Column labels to select. | required |
| copy | bool | If True, return a copy. Otherwise, select in place and return self. | True |
select_dtypes(include=None, exclude=None)
Return a subset of the dataset containing only specified data types.
This method filters the dataset to include or exclude specific data types, allowing users to focus on relevant columns based on their types. Args: include (Optional[str | list]): The variable types of the columns to be included in the resulting dataset. exclude (Optional[str | list]): The variable types of the columns to be excluded from the resulting dataset.
Returns: dataset (Dataset): Subset of the dataset containing only columns with the specified variable types.
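A brief sketch of both selectors; the "float" type name is an assumption:

```python
# Keep only the listed columns (a copy by default).
subset = data.select_columns(["sales"])

# Keep only columns of the given variable type(s).
numeric = data.select_dtypes(include="float")
```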
shape(lazy_eval=True, delayed=False)
Returns dataset shape as a tuple (rows, columns).
Supports lazy evaluation of nrows; ncols is inexpensive and returned directly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| lazy_eval | bool | Returns the currently computed values for the nrows and ncols properties. Defaults to True. | True |
| delayed | bool | If True, returns delayed values for nrows and ncols instead. This is recommended to optimize the DAG flow of Dask, the underlying computational engine of the Dataset. Defaults to False. | False |
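A small sketch; the assumption is that the delayed variant unpacks the same way:

```python
nrows, ncols = data.shape()                   # eager tuple of (rows, columns)
nrows_d, ncols_d = data.shape(delayed=True)   # delayed values for Dask's DAG
```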
sort_values(by, ignore_index=True, inplace=False)
Sort the dataset by one or more columns.
This method sorts the dataset based on the specified column(s), returning either a new sorted dataset or modifying the existing dataset in place.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| by | List[str] | A list with the name(s) of the column(s) to sort by. | required |
| ignore_index | bool | Whether to ignore the index. Defaults to True. | True |
| inplace | bool | Whether to sort the dataset in place. Defaults to False. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The sorted dataset when inplace is set to False. |
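For example:

```python
# Return a new Dataset ordered by "date"; the original is unchanged.
ordered = data.sort_values(by=["date"])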
sorted_index(by)
Get the sorted index of the dataset based on specified columns.
This method computes the order of the dataset when sorted by the given column(s). It returns a pandas Series representing the index positions corresponding to the sorted dataset. Args: by (List[str]): A list with the name(s) of the column(s) to sort by.
Returns:

| Name | Type | Description |
|---|---|---|
| index | pandas Series | A pandas Series containing the sorted index positions. |
tail(n=5)
Return the n last rows of a dataset.
If the number of rows in the first partition is lower than n, Dask will not return the requested number of rows (see dask.dataframe.core.head and dask.dataframe.core.safe_head). To avoid this corner case, we retry using all partitions minus one.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of rows to select from the bottom of the dataset. | 5 |

Returns:

| Name | Type | Description |
|---|---|---|
| dataset | pandas DataFrame | A pandas DataFrame containing the last n rows. |
to_dask()
Converts the Dataset object to a Dask DataFrame. Returns: dataset (dask.DataFrame): The data from the Dataset object as a Dask DataFrame.
to_numpy()
Converts the Dataset object to a numpy ndarray. Returns: dataset (numpy.ndarray): The data from the Dataset object as a numpy ndarray.
to_pandas()
Converts the Dataset object to a pandas DataFrame. Returns: dataset (pandas.DataFrame): The data from the Dataset object as a pandas DataFrame.
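For example, dropping down to the Dask API for a lazy aggregation:

```python
ddf = data.to_dask()
# Lazy mean over a column, materialized only when requested.
mean_sales = ddf["sales"].mean().compute()
```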
uniques(col, approx=True, delayed=False)
Compute the number of unique values in a column.
This method calculates the distinct count of values in a given column, either exactly or using an approximate method for improved performance on large datasets. The result is stored for future reference when an exact count is computed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str | The column name for which to compute the number of unique values. | required |
| approx | bool | If True, computes an approximate distinct count for improved performance on large datasets; if False, computes the exact count. Defaults to True. | True |
| delayed | bool | Whether to compute or delay the count. Defaults to False. | False |

Returns:

| Name | Type | Description |
|---|---|---|
| nuniques | int or Dask Scalar | The number of unique values in the column. |
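For example:

```python
# Fast approximate distinct count (the default).
approx_n = data.uniques("sales")

# Exact count; per the docs, the result is stored for future reference.
exact_n = data.uniques("sales", approx=False)
```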
update_types(dtypes)
Batch update data types for multiple columns in the dataset.
This method allows updating the data types of multiple columns at once by providing a list of dictionaries, where each dictionary specifies a column name and the target variable type.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dtypes | list | A list of dictionaries, where each dictionary specifies a column name and the target variable type. | required |
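A sketch of a batch update; the dictionary keys ("column", "vartype") are assumptions mirroring the astype signature, since the exact keys are not shown in this reference:

```python
# Key names below are assumed, not confirmed by this reference.
data.update_types([
    {"column": "sales", "vartype": "int"},
    {"column": "date", "vartype": "datetime"},
])
```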
value_counts(col, compute=True)
Compute the frequency of unique values in a specified column.
This method returns the count of occurrences for each unique value in the given column. By default, it computes the result eagerly, but it can also return a lazy Dask Series for efficient computation on large datasets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | str | The name of the column in the Dataset whose values should be counted. | required |
| compute | bool | Whether to compute or delay the count. Defaults to True. | True |

Returns:

| Name | Type | Description |
|---|---|---|
| value_counts | Series | A Series with the value counts: a computed pandas Series when compute is True, a lazy Dask Series otherwise. |
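For example, assuming a categorical column named "city" as in the first sketch above:

```python
# Eager: a computed Series of counts per unique value.
counts = data.value_counts("city")

# Lazy: a Dask Series, useful inside a larger computation graph.
lazy_counts = data.value_counts("city", compute=False)
```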