Workflow 3

The goal of this workflow is to perform churn analysis using DS Lab Notebooks, apply explainable AI for model insights, and integrate the results into a pipeline and Data Sandbox to drive customer retention strategies.

The process begins with creating a robust DS Lab model within the TensorFlow environment. The trained model is then exported into the Churn Pipeline, enhancing its predictive capabilities. Transformed data is written into the Data Sandbox for secure and efficient storage. Finally, we build a compelling business story, leveraging insights from the explainable model to drive customer retention and growth strategies.

This workflow demonstrates the synergy of advanced technologies in enabling explainable AI, predictive modeling, and actionable business insights, paving the way for informed and data-driven decisions.

Key Steps in This Workflow

· Access DS Lab: From the platform home screen, select the DS Lab Plugin from the app menu.

· Explore Projects: On the DS Lab homepage, view the list of existing projects, which display details such as name, description, environment, resource allocation type, libraries, and action options (version control, sharing, editing, etc.).

· Create a New Project: Click the Create button to start a new project.

· Upload and Run Notebook: Upload the Churn Prediction Notebook, load and explore the data, perform data analysis, and visualize results.

Create a Project in DS Lab

Navigate to DS Lab from the Apps menu.

Click the Create+ button.

Enter a project name (e.g., DS LAB WORKFLOW 3) and a description.

Select the algorithm type from the dropdown: Regression, Forecasting, and Classification (other available options include NLP, etc.).

Choose an Environment:

  1. Python TensorFlow

  2. Python (used in this workflow)

  3. PyTorch

  4. PySpark

Allocate resources according to the dataset size (Medium in this workflow).

Set an Idle Shutdown limit: 1h.

Save the project.

Steps to Import a Notebook and Add Sandbox Data in DS Lab

Activate the Project

  1. Click Activate on the right side of the project.

  2. Once activated, click View (to the right) to open the project.

  3. Wait for the kernel to start.

Import a Notebook

  1. In the Repo section, click the three dots and select Import.

  2. Provide a name (e.g., Test Notebook) and a description.

  3. Choose the notebook file from your local system and upload it.

from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor

nb = NotebookExecutor()
data = nb.get_data('59801689926743119', '@SYS.USERID', 'True', {}, [])

data['Churn'] = data['Churn'].map({
    'No': 0,
    'Yes': 1
})

data.head(3)

[NOTE: The above code is auto-generated when you click the data in the Data section.]

[Execute each of the code snippets below in a separate cell.]

data.dtypes

Let's gain insight into the data type of each column in our dataset using the data.dtypes attribute.

The output above shows the data type associated with each column. These details provide essential information for further data processing and analysis.

data.select_dtypes('object').columns

Let's take it a step further and explore how to select specific columns with data type 'object' from our dataset. The table above shows the columns that have been selected based on their data type, which is 'object'. This technique allows us to narrow our analysis to the relevant columns for in-depth exploration.

data.select_dtypes('number').columns

Now, let's explore how to select columns with numerical data types from our dataset. The output above shows the columns that have been selected based on their numerical data types. This approach allows us to concentrate on numeric data for advanced quantitative analysis.

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

def preprocess_categorical(df_in, categorical_columns):
    # One-hot encode the categorical columns and return the dense matrix plus the fitted encoder
    df = df_in[categorical_columns].copy()
    ohe = OneHotEncoder()
    df_cat = ohe.fit_transform(df)
    df_cat = df_cat.todense()
    return df_cat, ohe

def preprocess_numerical(df_in, numerical_columns):
    # Scale the numerical columns to the [0, 1] range and return the fitted scaler
    df = df_in[numerical_columns].copy()
    scaler = MinMaxScaler()
    df_num = scaler.fit_transform(df)
    return df_num, scaler

def preprocess_data(df_in, categorical_columns, numerical_columns):
    # Combine the encoded categorical and scaled numerical features into a single matrix
    df_cat, ohe = preprocess_categorical(df_in, categorical_columns)
    df_num, scaler = preprocess_numerical(df_in, numerical_columns)
    X = np.concatenate((df_cat, df_num), axis=1)
    n_cat_out = df_cat.shape[1]
    n_num_out = df_num.shape[1]
    return X, ohe, scaler, n_cat_out, n_num_out

def invert_preprocessing(df_in, ohe, scaler, n_cat_out, n_num_out):
    # Recover the original categorical and numerical columns from the preprocessed matrix
    cat_inv = ohe.inverse_transform(df_in[:, :n_cat_out])
    num_inv = scaler.inverse_transform(df_in[:, -n_num_out:])
    cat_inv = pd.DataFrame(cat_inv, columns=ohe.feature_names_in_)
    num_inv = pd.DataFrame(num_inv, columns=scaler.feature_names_in_)
    df_out = pd.concat((cat_inv, num_inv), axis=1)
    return df_out

categorical_columns = ['gender', 'Partner', 'Dependents', 'PhoneService',
                       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                       'Contract', 'PaperlessBilling', 'PaymentMethod']

numerical_columns = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

X, ohe, scaler, n_cat_out, n_num_out = preprocess_data(data, categorical_columns, numerical_columns)
y = data['Churn']

This code preprocesses the data for machine learning by transforming the categorical variables into a binary representation (one-hot encoding) and scaling the numerical variables to a fixed range (min-max scaling). Make sure the DataFrame data contains the specified columns so that the code runs correctly.

invert_preprocessing(X, ohe, scaler, n_cat_out, n_num_out)

By calling invert_preprocessing with the appropriate inputs, one can obtain the original DataFrame with the categorical and numerical columns in their original format before any one-hot encoding or min-max scaling was applied. This function is useful for reverting the processed data back to its original state, especially when it is necessary to interpret or analyze the data in its original form after applying machine learning algorithms that required preprocessing. However, it's important to ensure that the original DataFrame data used in the preprocessing matches the structure and order of columns used during the preprocessing steps for accurate inversion.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = RandomForestClassifier()
model.fit(X_train, y_train)

preds_train = model.predict(X_train)
preds_test = model.predict(X_test)

print(classification_report(y_train, preds_train))
print(classification_report(y_test, preds_test))

This code performs data splitting, model training, prediction, and evaluation using a random forest classifier. The classification reports provide a detailed analysis of the model's performance on both the training and testing sets, giving insight into its accuracy as well as the precision, recall, and F1-score for each class of the target variable y.

# Make a ColumnTransformer object to use for data preprocessing in the pipeline
from sklearn.compose import ColumnTransformer

categorical_columns = ['gender', 'Partner', 'Dependents', 'PhoneService',
                       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                       'Contract', 'PaperlessBilling', 'PaymentMethod']

numerical_columns = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

ct = ColumnTransformer(
    [
        ('ohe', OneHotEncoder(), categorical_columns),
        ('scaler', MinMaxScaler(), numerical_columns)
    ],
    remainder='drop'
)

X_trans = ct.fit_transform(data)

By using this ColumnTransformer object ct, you can easily integrate it into a machine learning pipeline for efficient preprocessing of both categorical and numerical features. The pipeline would allow you to apply this preprocessing consistently on both the training and testing data without data leakage, making it easier to train and evaluate machine learning models effectively.
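
As an illustration (not part of the original notebook), here is a minimal sketch of how the ct transformer could be combined with the random forest model in a scikit-learn Pipeline, assuming the earlier imports and the data DataFrame are still available in the session; the variable names below are illustrative:

from sklearn.pipeline import Pipeline

# Chain the column transformer and the classifier so that preprocessing is
# fit on the training split only, which avoids data leakage.
clf_pipeline = Pipeline(steps=[
    ('preprocess', ct),
    ('model', RandomForestClassifier())
])

X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    data[categorical_columns + numerical_columns], data['Churn'], test_size=0.3
)

clf_pipeline.fit(X_train_raw, y_train_raw)
print(classification_report(y_test_raw, clf_pipeline.predict(X_test_raw)))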

Note: Run each cell to get its output.

Encoding the Target Variable

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Instantiate the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the series to label encode the values
encoded_data = label_encoder.fit_transform(y_train)

# Create a new Series with the encoded values
encoded_series = pd.Series(encoded_data)

encoded_series

It imports the required libraries, pandas for data manipulation and LabelEncoder from sklearn.preprocessing for label encoding. The LabelEncoder is instantiated, creating an instance of the label encoder. The fit_transform() method is used to both fit the label encoder to the y_train series and transform it to label encode the categorical values. The result is stored in the encoded_data variable as a NumPy array. Finally, a new pandas Series called encoded_series is created using pd.Series() to store the label-encoded values in a more manageable and structured format for further analysis or usage.
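
As a quick illustration (not part of the original notebook), the fitted encoder can also map the encoded values back to the original labels:

# Classes learned by the encoder, in the order of their integer codes
print(label_encoder.classes_)

# Convert the encoded values back to the original labels
decoded = label_encoder.inverse_transform(encoded_series)
print(pd.Series(decoded).head())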

Saving and Registering a Model in DS Lab

· In a new notebook cell, click the three dots and select Save Model.

· A code snippet will be generated automatically to save the designed model.

· Run the cell using the Run Cell icon.

· Once the code executes successfully, navigate to the Models tab (located in the same place as the Data section) and click All to view the models.

· From the model list, click the three dots next to your model and select Register.

· The model is now registered and can be used in the DS Lab component of the Data Pipeline.

Create a Sandbox Using the Churn Data

· Navigate to the Data Center module from the Apps menu.

· Click on the Sandbox option.

· Click the Create button.

· Give the sandbox a name and an appropriate description.

· Choose the CSV file containing the churn data from your local system.

· Once the configuration is done, click the Upload option.

· You have successfully created a sandbox using the churn data.

Create a Pipeline

Let's move on to the pipeline flow.

· First, locate and select the Data Pipeline Plugin from the app menu. This will take you to the pipeline home page.

· Next, click the Create button.

· Great! Now, let's specify the details for our end-to-end DS Lab pipeline. Enter a suitable name for your pipeline, such as Churn_Prediction_datapipeline. For the description, briefly describe the workflow, such as End-to-End DS Lab Workflow.

· Now, choose the resource allocation type based on your requirements. This feature allows you to deploy the pipeline with a high, medium, or low-end configuration, depending on the velocity and volume of data that the pipeline must handle.

· Once you're done with the configuration, click the Save button to save your pipeline.

Once the pipeline is saved in the pipeline list, you can add components to the canvas to create your pipeline workflow or dataflow.

To add a component, simply drag the required component from the Component Palette, located on the right side of the user interface, and drop it onto the canvas. You can configure each component to define your pipeline workflow. The Pipeline Editor displays the Component Palette, which contains components such as Reader, Writer, Transformation, Consumer, Producer, Machine Learning, and more. Use these components to design your pipeline according to your specific requirements.

Add and Configure the Sandbox Reader and Add a Kafka Event

  • Drag and drop the Sandbox Reader from the Reader section of the Components Palette (on the right side of the canvas).

  • In the Basic Information panel, set the Invocation Type to Realtime, then move to the Meta Information tab.

  • In Meta Information:

  • Select Network as the Storage Type.

  • Select CSV as the File Type.

  • Choose the desired Sandbox Name from the dropdown (use a sandbox you previously uploaded data to).

  • Select the specific Sandbox File you want to process.

  • Check both Header and Infer Schema to ensure headers are handled and the schema is automatically detected.

  • Once the configuration is complete, click Save.

Adding an Event:

  • Open the Event Panel located next to the Components Palette.

  • Click the ‘+’ icon to create a new event.

  • In the Event Creation page, fill in all mandatory fields.

  • Set the Partition value to 1.

  • Click Add Kafka Event to create the event.

  • From the Event Panel, drag and drop the newly created Event Component onto the canvas.

  • Connect the Event Component to the Sandbox Reader Component.

Congratulations! You've successfully created an event and connected it to the Sandbox Reader component. Your data processing and DS Lab pipeline is taking shape.

· Now, let's add the DS Lab component to the canvas for batch processing. Drag and drop the DS Lab component from the Machine Learning section of the component palette onto the canvas.

· Select 'Batch' as the invocation type and move to the Meta Information tab.

· From the execution type dropdown, choose 'Model Runner'.

· Now, select the 'churn prediction Project' from the project name dropdown. This project should have been previously created in the DS Lab plugin.

· Next, choose the 'churn preprocess' dp model from the model name dropdown. This model should have been registered in the DS Lab plugin.

· Once you've configured everything, don't forget to save the DS Lab component.

Now, let's create an event and connect it to the component.

· Click on the "Event" panel located next to the component section in the component palette. This action will open the event panel, which allows you to manage events.

· Inside the event panel, locate and click on the "+" icon. This will initiate the process of adding a new event.

· In the event creation Page, configure all the mandatory fields. Make sure to fill in all the required information accurately.

· Set the partition for the event. The partition is a setting that specifies how the event should be categorized or grouped. In this case, set the partition value to 1.

· Once all the information is filled and the partition is set, click on the "add kafka event" button to create the event. The event will now be added to the system.

· To display the event on the canvas, locate the event component in the event panel.

· Drag and drop the event component from the event panel onto the canvas area where you want to display the event. The event component will appear as a visual representation of the event.

Add Python Script Component

· Drag and drop the Python Script component after searching for the component name in the search bar of the component palette.

· Once the Python script component is added, select the Invocation type as "batch". This means that the script will be executed on the entire DataFrame as a batch operation.

· Go to the meta information section of the Python script component.

· Specify the component name as "DropColumn" to give it a meaningful name.

· Write the script for dropping the 'index' column in the Python script editor (a sketch of such a script is shown after this list).

· In this script, the function func takes a DataFrame df as input and returns a DataFrame df_out with the 'index' column dropped.

· After writing the script, select the function func from the "Start Function" dropdown. This tells the Python script component to use the defined function as the entry point for the script.

· For the input data to the Python script, select "Data frame" from the "In Event Data Type" dropdown, and specify df as the parameter name. This links the input DataFrame to the df parameter of the func function.

· Once all the settings are configured, save the component to apply the changes.
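
The DropColumn script itself is not included in this document. A minimal sketch of what it might look like, assuming the platform passes the incoming event data to the start function as a pandas DataFrame named df, is:

def func(df):
    # Drop the 'index' column; errors='ignore' keeps the script from failing
    # if the column is not present in a given batch.
    df_out = df.drop(columns=['index'], errors='ignore')
    return df_out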

Now, let's create an event and connect it to the component.

· Click on the "Event" panel located next to the component section in the component palette. This action will open the event panel, which allows you to manage events.

· Inside the event panel, locate and click on the "+" icon. This will initiate the process of adding a new event.

· In the event creation Page, configure all the mandatory fields. Make sure to fill in all the required information accurately.

· Set the partition for the event. The partition is a setting that specifies how the event should be categorized or grouped. In this case, set the partition value to 1.

· Once all the information is filled and the partition is set, click on the "add kafka event" button to create the event. The event will now be added to the system.

· To display the event on the canvas, locate the event component in the event panel.

· Drag and drop the event component from the event panel onto the canvas area where you want to display the event. The event component will appear as a visual representation of the event.

Now, let's add the DS Lab component to the canvas for executing the ML model.

· Drag and drop the DS Lab component from the Machine Learning section of the component palette onto the canvas.

· Select 'Batch' as the invocation type and move to the Meta Information tab.

· From the execution type dropdown, choose 'Model Runner'.

· Now, select the 'churn prediction Project' from the project name dropdown. This project should have been previously created in the DS Lab plugin.

· Next, choose the 'churn_model' ML model from the model name dropdown. This model should have been registered in the DS Lab plugin.

Now, let's create an event and connect it to the component.

· Click on the "Event" panel located next to the component section in the component palette. This action will open the event panel, which allows you to manage events.

· Inside the event panel, locate and click on the "+" icon. This will initiate the process of adding a new event.

· In the event creation Page, configure all the mandatory fields. Make sure to fill in all the required information accurately.

· Set the partition for the event. The partition is a setting that specifies how the event should be categorized or grouped. In this case, set the partition value to 1.

· Once all the information is filled and the partition is set, click on the "add kafka event" button to create the event. The event will now be added to the system.

· To display the event on the canvas, locate the event component in the event panel.

· Drag and drop the event component from the event panel onto the canvas area where you want to display the event. The event component will appear as a visual representation of the event.

Now, let's add the DS Lab component to the pipeline to run the enrichment script.

· Drag and drop the DS Lab component from the Machine Learning section of the component palette onto the canvas.

· Select 'Batch' as the invocation type and move to the Meta Information tab.

· From the execution type dropdown, choose 'Script Runner'.

· Now, select the 'churn prediction Project' from the project name dropdown. This project should have been previously created in the DS Lab plugin.

· Next, choose the 'churn_pred_util' script from the Script name dropdown. This script should have been saved in the DS Lab plugin.

· Select 'Data Frame' as the function type.

· Select 'func' from the start function dropdown.

· Let's pass the secret credentials in the "Input Arguments" section (a sketch of how these arguments reach the script is shown after this list):

· host: @ENV.DS_CH_HOST

· port: @ENV.DS_CH_TCP_PORT

· database: @ENV.DS_CH_DB_DEVELOPMENT

· user: @ENV.DS_CH_USER_DEVELOPMENT

· Save the component.
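
The churn_pred_util script itself is not shown in this document. As a rough, hypothetical sketch (the actual registered script may differ), the Input Arguments configured above would typically be supplied to the script's start function as keyword arguments alongside the incoming DataFrame, for example:

def func(df, host=None, port=None, database=None, user=None):
    # The values configured under "Input Arguments" (e.g. @ENV.DS_CH_HOST) are
    # assumed to arrive here as keyword arguments, so the script can use them
    # to connect to the development database for enrichment lookups.
    # The enrichment logic itself is omitted; as a placeholder the incoming
    # DataFrame is returned unchanged.
    enriched_df = df.copy()
    return enriched_df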

Now, let's create an event and connect it to the component.

· Click on the "Event" panel located next to the component section in the component palette. This action will open the event panel, which allows you to manage events.

· Inside the event panel, locate and click on the "+" icon. This will initiate the process of adding a new event.

· In the event creation Page, configure all the mandatory fields. Make sure to fill in all the required information accurately.

· Set the partition for the event. The partition is a setting that specifies how the event should be categorized or grouped. In this case, set the partition value to 1.

· Once all the information is filled and the partition is set, click on the "add kafka event" button to create the event. The event will now be added to the system.

· To display the event on the canvas, locate the event component in the event panel.

· Drag and drop the event component from the event panel onto the canvas area where you want to display the event. The event component will appear as a visual representation of the event.

Now it's time to select the appropriate Writer component. Let's add the Sandbox Writer component from the Writer section and configure it.

· Drag and drop the Sandbox Writer component from the writer section of the component palette onto the pipeline.

· "Select 'Realtime' as the invocation type and move to the Meta Information tab."

· "In the Meta Information tab, choose 'network' as the storage type."

· "Next, specify the sandbox file name where you want to write the data."

· "Select 'csv' as the file type for writing the data."

· "Now, choose 'Overwrite mode' from the save mode dropdown to ensure the data is overwritten each time the pipeline runs."

· "Once you've configured the component, click on the save button to save your settings. "Now, connect the output event of the DS Lab component to the input of the Sandbox Writer component."

Congratulations! You've successfully added and configured the Sandbox Writer component. Your end-to-end data processing and DS Lab pipeline is now complete, ready to process data in real time and write the results to the specified file. You can now run the pipeline to start the data processing and DS Lab tasks.

After configuring and setting up the pipeline flow, it's time to update and activate the pipeline.

· Locate the "Update Pipeline" icon in the toolbar and click on it.

· Now, click on the 'Activate' icon. This will start the execution of the pipeline and begin the data processing.

· After activating the pipeline, navigate to the Logs and Advanced Logs section. Look for the Log Panel and click on it to access the advanced logs for detailed information.

· Within the Log Panel, you'll see the pods associated with each component. Pods are the containerized execution environments for the pipeline's components.

· Check if the pods for each component have come up and are running. This indicates that the components are successfully deployed and ready to execute their tasks.

· To view the specific logs for each component, click on the corresponding pod or log entry. The logs will provide detailed information about the execution and any potential errors or issues encountered during the process.

In this workflow, we are building an end-to-end data processing and DS Lab pipeline to read churn data from the sandbox, apply a model created in the DS Lab plugin, and then write the processed data back to the sandbox.

In our end-to-end data processing and DS Lab pipeline for churn data analysis, we embark on a journey to extract valuable insights from the churn data residing in the sandbox. The first step involves seamlessly connecting to the sandbox and retrieving the churn data, which will serve as the foundation for our entire analysis. Leveraging the powerful features of DS Lab, we ensure a smooth and efficient data retrieval process.

With the churn data at our disposal, we move on to the next step, where we explore the predictive capabilities of the DS Lab model. This carefully crafted model is designed specifically for churn prediction and plays a vital role in helping us make well-informed decisions. We run the churn data through our model, witnessing its predictions, identifying potential churners, and gaining valuable insights into customer behavior.

However, our journey doesn't stop there. In the subsequent step, we delve into the realm of data processing and enrichment. Employing essential data processing techniques, we meticulously clean and prepare the churn data, ensuring its quality and reliability for further analysis. Additionally, we enrich the data with relevant information, harnessing the power of additional variables to enhance the predictive power of our model.

With the processed and enriched churn data now at its best, the final step involves writing it back to the sandbox. This secure storage ensures that the valuable data can be accessed for further exploration, analysis, or reporting purposes. Our end-to-end pipeline enables us to seamlessly progress from data retrieval to prediction and data enhancement, culminating in a comprehensive churn analysis to drive strategic decision-making and customer retention efforts.

Once your data is successfully written, let's move on to the next steps.

· Locate and select the Data Center plugin from the app menu.

· Inside the Data Center Plugin, navigate to the "Sandbox" section. This is where we'll create a sandbox to store our data.

· Now, let's proceed to create a datastore. A datastore is a repository from which you can access and manage your data.

· In the Business Story section, select the appropriate visualization type (bar graphs, line charts, pie charts, etc.) based on your data and what you want to represent.

After completing your workflow and ensuring that all your tasks are accomplished, it's essential to deactivate the pipeline and the DS Lab project to save resources and prevent any unnecessary usage.
