Databricks & YData
This section provides a detailed guide about integrating YData SDK with Databricks. By combining Databricks and YData SDK, users gain a comprehensive AI solution. YData SDK enables access to previously siloed data, enhances understanding, and improves data quality. Meanwhile, Databricks provides the scalability needed to deliver robust AI capabilities.
Integration benefits
- Enhanced Data Accessibility: Seamlessly access and integrate previously siloed data.
- Improved Data Quality: Use YData SDK to enhance the quality of your data through data preparation and augmentation.
- Scalability: Leverage Databricks' robust infrastructure to scale data processing and AI workloads.
- Streamlined Workflows: Simplify data workflows, reducing manual effort and potential errors.
- Comprehensive Support: Benefit from extensive documentation and support for both platforms, ensuring smooth integration and operation.
YData SDK in Databricks Notebooks
The YData SDK provides a powerful set of tools for integrating and enhancing data within Databricks notebooks. This guide covers the installation, basic usage, and advanced features of the YData SDK, helping users maximize the potential of their data for AI and machine learning applications.
Prerequisites
Before using the YData SDK in Databricks notebooks, ensure the following prerequisites are met:
- Access to a Databricks workspace
- A valid YData account and API key
- Basic knowledge of Python and Databricks notebooks
Best Practices
- Data Security: Ensure API keys and sensitive data are securely managed.
- Efficient Coding: Use vectorized operations for data manipulation where possible.
- Resource Management: Monitor and manage the resources used by your Databricks cluster to optimize performance.
Installation
To install the YData SDK in a Databricks notebook, use the following command:
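# Install the YData SDK from PyPI (the package name is assumed to be ydata-sdk)
%pip install ydata-sdk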
Ensure the installation is successful before proceeding to the next steps.
Setting up the LICENSE KEY
First, set up your license key:
import os
# Add your YData token as part of your environment variables for authentication
os.environ['YDATA_LICENSE_KEY'] = '{add-your-token}'
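In line with the data security best practice above, you may prefer not to hard-code the token in the notebook. A minimal sketch using Databricks secrets is shown below; the scope name ydata and key name license-key are placeholders you would create yourself with the Databricks CLI or API.
import os
# Read the YData license key from a Databricks secret scope instead of hard-coding it
# (the scope and key names below are placeholders)
os.environ['YDATA_LICENSE_KEY'] = dbutils.secrets.get(scope="ydata", key="license-key")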
Synthetic data generation
This section explores one of the most powerful features of the YData SDK for enhancing and refining data within Databricks notebooks: generating synthetic data, either to augment existing datasets or to produce privacy-preserving data. By leveraging these advanced capabilities, users can significantly enhance the robustness and performance of their AI and machine learning models, unlocking the full potential of their data. Synthetic data makes it possible to create privacy-preserving datasets that maintain real-world value, enabling users to work with sensitive information securely while retaining the utility of real data.
Another key focus is on generating synthetic data to augment existing datasets. This technique, particularly through conditional synthetic data generation, allows users to create targeted, realistic datasets. By addressing data imbalances and enriching the training data, conditional synthetic data generation significantly enhances the robustness and performance of machine learning (ML) models, leading to more accurate and reliable outcomes.
# Read data from the catalog
df = spark.sql("SELECT * FROM ydata.default.credit_scoring_labeled")
# Display the dataframe
display(df)
After reading the data, we need to convert it to a pandas DataFrame in order to train our synthetic data generation model. For the augmentation use case we will be leveraging conditional synthetic data generation.
from ydata.synthesizers import RegularSynthesizer
# Convert Spark dataframe to pandas dataframe
pandas_df = df.toPandas()
pandas_df = pandas_df.drop('ID', axis=1)
# Train a synthetic data generator using ydata-sdk
synth = RegularSynthesizer(name='Synth credit scoring | Conditional')
synth.fit(pandas_df, condition_on='Label')
Now that we have a trained conditional synthetic data generator, we can generate samples while controlling the population behaviour through the column on which the generation was conditioned.
# Generate synthetic samples conditioned on the Label column
synthetic_sample = synth.sample(
    n_samples=len(pandas_df),
    condition_on={
        "Label": {
            "categories": [{
                "category": 1,
                "percentage": 0.7
            }]
        }
    }
)
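To sanity-check that the conditioning took effect, you can inspect the label distribution of the generated sample with a quick pandas check (the observed proportions will only approximately match the requested split):
# The share of category 1 should be close to the requested 70%
synthetic_sample['Label'].value_counts(normalize=True)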
After generating the synthetic data, we can combine it with our original dataset.
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Create a Spark dataframe from the synthetic dataframe
synthetic_df = spark.createDataFrame(synthetic_sample)
display(synthetic_df)
# Remove the ID column as it is not used
df = df.drop('ID')
# Concatenate the original dataframe with the synthetic dataframe
concatenated_df = df.union(synthetic_df)
# Display the concatenated dataframe
display(concatenated_df)
Afterwards, you can use your augmented dataset to train a machine learning model using MLflow.
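As an illustration, a minimal sketch of that last step is shown below. It assumes scikit-learn and the standard MLflow tracking API, and that all feature columns are numeric (categorical columns would need to be encoded first); the column names come from the example above.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bring the augmented Spark dataframe back to pandas for model training
augmented_pdf = concatenated_df.toPandas()
X = augmented_pdf.drop('Label', axis=1)
y = augmented_pdf['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    # Log the hold-out accuracy and the fitted model to MLflow
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "credit_scoring_model")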