DS Model-to-Sandbox Pipeline

To create a project in DS Lab, train and register a model, and build a pipeline that writes model outputs to the Data Sandbox for analysis and reuse.

This guide outlines the process to create and operationalize a machine learning workflow within the BDB Platform’s DS Lab and Data Pipeline modules. The workflow demonstrates how to:

  • Create a project in DS Lab

  • Train, test, and register an ML model

  • Export scripts to the Data Pipeline

  • Build a data pipeline that reads, predicts, and writes results back into the Data Sandbox for analysis and reuse

This setup enables seamless collaboration between Data Science experimentation and Data Engineering pipelines, supporting scalable ML model deployment.

Architecture Overview

Stage | Component | Purpose
1 | DS Lab Project | Create a workspace for model training
2 | DS Lab Notebook | Prepare and upload data for model training
3 | DS Lab Model | Train, save, and register ML models
4 | DS Lab Script | Generate and export input data logic
5 | Data Pipeline | Automate model inference and output writing
6 | Sandbox Writer | Store model outputs for downstream use

Step 1: Create a Project in DS Lab

Purpose

To establish a dedicated workspace for model experimentation.

Procedure

  1. Navigate to DS Lab from the Apps Menu.

  2. Click the Create (+) button.

  3. Fill in the following fields:

    • Project Name: DS LAB WORKFLOW 1

    • Description: Machine learning workflow project for model training and deployment.

    • Algorithm Type: Select Regression and Classification from the dropdown (supports Regression, Classification, Forecasting, NLP, etc.).

  4. Choose the Environment:

    • Python TensorFlow (used in this workflow)

    • Alternative options: PyTorch, PySpark

  5. Resource Allocation: Medium (based on dataset size).

  6. Idle Shutdown: 1 hour.

  7. External Libraries: Add boto3 (a quick availability check is sketched after this list).

  8. Click Save.
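
After the project is activated and its kernel is running, you can verify that the external library added in step 7 is importable from a notebook cell. This is a minimal check, assuming the standard boto3 package:

# Quick sanity check that the boto3 library added under External Libraries is available
import boto3
print(boto3.__version__)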

Step 2: Create a Notebook and Upload a Dataset

Purpose

To add a dataset for model training within the DS Lab environment.

Procedure

  1. Activate the project:

    • Click Activate on the right-hand panel.

    • Click View to open the activated project.

  2. Wait for the kernel to start.

  3. Click Create Notebook → Enter:

    • Name: test_notebook

    • Description: Base notebook for training workflow.

    • Click Save.

Add Sandbox Data

  1. Click the Data icon on the left navigation pane.

  2. Click the ‘+’ icon next to the search bar.

  3. In the Add Data dialog:

    • Select Data Sandbox Files from the Data Source dropdown.

    • Click Upload and provide:

      • Sandbox Name: Customer Data

      • Description: Customer data for model training

      • Upload File: Select a CSV file from your local system.

    • Click Save.

    • A confirmation message “File is uploaded” will appear.

  4. Select the checkbox for the newly uploaded Sandbox file → Click Add.

  5. Select the dataset checkbox to auto-generate code for data retrieval (an illustrative sketch of such code appears after this list).

  6. Run the cell using the Run Cell icon.
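
The retrieval code for the selected dataset is generated by the platform itself; the snippet below is only an illustrative sketch of what such code typically amounts to, using a hypothetical file name for the uploaded CSV:

# Illustrative sketch only: the platform auto-generates the actual retrieval code
# for the selected Sandbox file. "customer_data.csv" is a hypothetical file name.
import pandas as pd

customer_df = pd.read_csv("customer_data.csv")
customer_df.head()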

Step 3: Upload an External Notebook and Train the Model

Purpose

To import and execute an external notebook for model training using scikit-learn.

Procedure

  1. Click the Workspace icon in the left navigation pane.

  2. Navigate to the Repo section → Click the 3 dots (⋮) → Select Import.

  3. Provide:

    • Notebook Name: Pima Diabetes Model

    • Description: Training model using the Pima Indians Diabetes dataset.

    • Upload the notebook file or provide its Git URL.

Execute the Following Cells in Order

1. Import Dependencies

import pandas
from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB

2. Load the Dataset

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
dataframe

3. Define Features (X) and Target (Y)

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

4. Split into Training and Test Sets

test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)

5. Train the Model

model = MultinomialNB()
model.fit(X_train, Y_train)

6. Save the Model

from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
# Save the trained estimator; the modelName 'model' is referenced later when
# registering the model and configuring the pipeline's Model Runner.
saved_model = nb.save_model(model=model, modelName='model', modelType='ml', X=None, y=None, estimator_type='')

7. Load and Predict

loaded_model = nb.load_saved_model('56031751531092038')  # ID of the model saved in the previous cell
X_test_copy = X_test.copy()
Y_pred = nb.predict(model=loaded_model, dataframe=X_test_copy, modeltype='ml')
Y_pred.head()

8. Evaluate Accuracy

from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred.predictions)

A trained Multinomial Naive Bayes model is saved and validated.
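
As an optional cross-check outside the platform wrappers, the same evaluation can be reproduced with plain scikit-learn on the estimator trained in cell 5, which is still in memory at this point. A minimal sketch:

# Optional cross-check using the fitted MultinomialNB estimator directly
from sklearn.metrics import accuracy_score, classification_report

y_pred_local = model.predict(X_test)           # predict class labels for the held-out set
print(accuracy_score(Y_test, y_pred_local))    # compare against the accuracy reported above
print(classification_report(Y_test, y_pred_local))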

Step 4: Register the Model

Purpose

To make the trained model available for pipelines and batch execution.

Procedure

  1. Go to the Models tab (left pane).

  2. Select the All option from the dropdown menu.

  3. Locate your trained model → Click the Register icon for the selected model.

  4. Confirmation: “Model successfully registered.”

Step 5: Create Input Data Script

Purpose

To generate a reusable input data script for feeding the model.

Procedure

  1. Go to the Workspace icon.

  2. In the Repo section → Click 3 dots (⋮) → Select Create Notebook.

  3. Name: sklearn_model_input

  4. Paste the following script:

def input_data():
    import pandas as pd
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    df = pd.read_csv(url, names=names)

    df_test = df.copy()
    df_test.drop('class', axis=1, inplace=True)
    return df_test

input_data()

  5. Run the cell using the Run Cell icon. (A quick check of the returned DataFrame is sketched below.)
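
Because the pipeline's Model Runner will feed this DataFrame to the registered model, it is worth confirming that the returned columns match the eight features the model was trained on. A minimal check, run in the same notebook:

# Sanity check: the returned DataFrame should contain exactly the eight
# training feature columns, with the 'class' target column removed.
df_check = input_data()
expected_cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
assert list(df_check.columns) == expected_cols, "Input columns do not match the training features"
print(df_check.shape)  # expected: (768, 8) for the Pima Indians Diabetes dataset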

Step 6: Export Notebook to Pipeline

Procedure

  1. Navigate to the 3 dots (⋮) of the notebook → Select Register.

  2. Select the target cell to export → Click Next.

  3. Choose Export as Script → Click Finish.

Step 7: Create a Data Pipeline

Purpose

To automate data ingestion, model execution, and result writing to the Sandbox.

Procedure

  1. Open Data Pipeline from the Apps Menu.

  2. Click Create Pipeline and configure:

    • Name: DS Lab Workflow 1

    • Description: End-to-end DS Lab and Sandbox workflow

    • Resource Allocation: Medium

  3. Click Save.

Step 8: Add DS Lab Runner (Input Script Execution)

Procedure

  1. From the Components Palette, drag and drop DSLab Runner (from the Machine Learning section).

  2. Configure:

    • Invocation Type: Real-Time

    • Execution Type: Script Runner

    • Function Input: DataFrame

    • Project Name: DS LAB WORKFLOW 1

    • Script Name: sklearn_model_input

    • Start Function: input_data

  3. Save the component.

Step 9: Add Kafka Event

  1. From the Event Panel, click + → Add Kafka Event.

  2. Configure the event and drag it onto the pipeline canvas, then connect the DSLab Runner (Script) component to it.

Step 10: Add DS Lab Runner (Model Execution)

Procedure

  1. From the Components Palette, drag and drop another DSLab Runner.

  2. Configure:

    • Invocation Type: Batch

    • Execution Type: Model Runner

    • Project Name: DS LAB WORKFLOW 1

    • Model Name: model

  3. Save the component.

  4. Add another Kafka Event and connect it to this component.

Step 11: Add Sandbox Writer

Purpose

To write model output back to the Sandbox for analysis and visualization.

Procedure

  1. From the Components Palette, drag and drop Sandbox Writer (under the Writer section).

  2. Configure:

    • Invocation Type: Batch

    • Storage Type: Network

    • File Type: CSV

    • Sandbox File: Customer

    • Save Mode: Overwrite

  3. Click Save.

Step 12: Update and Activate the Pipeline

Pipeline Flow

DSLab Runner (Script) → Event → DSLab Runner (Model) → Event → Sandbox Writer

Procedure

  1. Click the Update/Save icon in the toolbar.

  2. Click the Activate icon to deploy the pipeline.

  3. Monitor pods and component status.

Step 13: Monitor Execution and Verify Results

  • Use the Logs Panel to check real-time execution progress.

  • Verify component messages:

    • “Sandbox Writer successfully wrote data.”

    • “Model Runner completed successfully.”

  • Use the Preview Tab of events to inspect intermediate data.

  • Once processing completes, click Deactivate to stop the pipeline.

Outcome

You have successfully:

  • Created and trained a model in DS Lab.

  • Registered and exported model logic to Data Pipeline.

  • Built a pipeline that:

    • Executes model predictions

    • Writes processed data into the Data Sandbox

This reusable workflow bridges data science experimentation and operational analytics, ensuring reproducible ML pipelines within the BDB ecosystem.

Key Benefits

Feature | Value
DS Lab Integration | Unified workspace for model development
Auto Export | Seamless migration from DS Lab to Pipeline
Real-Time & Batch Support | Handles both live and scheduled workloads
Sandbox Writer | Simplified storage and reuse of model output

Summary

You have completed Data Science Lab Workflow 1, which demonstrates:

  • Model development in DS Lab

  • Automated model execution via Data Pipeline

  • Persistent data storage in the Sandbox for downstream analysis

This workflow operationalizes machine learning on the BDB Platform, enabling a complete data-to-deployment lifecycle with minimal code.