DS Model-to-Sandbox Pipeline

To create a project in DS Lab, train and register a model, and build a pipeline that writes model outputs to the Data Sandbox for analysis and reuse.

This guide outlines the process to create and operationalize a machine learning workflow within the BDB Platform’s DS Lab and Data Pipeline modules. The workflow demonstrates how to:

  • Create a project in DS Lab

  • Train, test, and register an ML model

  • Export scripts to the Data Pipeline

  • Build a data pipeline that reads, predicts, and writes results back into the Data Sandbox for analysis and reuse

This setup enables seamless collaboration between Data Science experimentation and Data Engineering pipelines, supporting scalable ML model deployment.

Architecture Overview

Stage | Component | Purpose
1 | DS Lab Project | Create a workspace for model training
2 | DS Lab Notebook | Prepare and upload data for model training
3 | DS Lab Model | Train, save, and register ML models
4 | DS Lab Script | Generate and export input data logic
5 | Data Pipeline | Automate model inference and output writing
6 | Sandbox Writer | Store model outputs for downstream use

Step 1: Create a Project in DS Lab

Purpose

To establish a dedicated workspace for model experimentation.

Procedure

  1. Navigate to DS Lab from the Apps Menu.

  2. Click the Create (+) button.

  3. Fill in the following fields:

    • Project Name: DS LAB WORKFLOW 1

    • Description: Machine learning workflow project for model training and deployment.

    • Algorithm Type: Select Regression and Classification from the dropdown (supports Regression, Classification, Forecasting, NLP, etc.).

  4. Choose the Environment:

    • Python TensorFlow (used in this workflow)

    • Alternative options: PyTorch, PySpark

  5. Resource Allocation: Medium (based on dataset size).

  6. Idle Shutdown: 1 hour.

  7. External Libraries: Add boto3 (a quick availability check is sketched after this list).

  8. Click Save.
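
After the project is activated and its kernel is running, you can verify that the external library added in step 7 is importable from a notebook cell. This is a minimal check, assuming the standard boto3 package:

# Quick sanity check that the boto3 library added under External Libraries is available
import boto3
print(boto3.__version__)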

Step 2: Create a Notebook and Upload a Dataset

Purpose

To add a dataset for model training within the DS Lab environment.

Procedure

  1. Activate the project:

    • Click Activate on the right-hand panel.

    • Click View to open the activated project.

  2. Wait for the kernel to start.

  3. Click Create Notebook → Enter:

    • Name: test_notebook

    • Description: Base notebook for training workflow.

    • Click Save.

Add Sandbox Data

  1. Click the Data icon on the left navigation pane.

  2. Click the ‘+’ icon next to the search bar.

  3. In the Add Data dialog:

    • Select Data Sandbox Files from the Data Source dropdown.

    • Click Upload and provide:

      • Sandbox Name: Customer Data

      • Description: Customer data for model training

      • Upload File: Select a CSV file from your local system.

    • Click Save.

    • A confirmation message “File is uploaded” will appear.

  4. Select the checkbox for the newly uploaded Sandbox file → Click Add.

  5. Select the dataset checkbox to auto-generate code for data retrieval (an illustrative sketch of such code appears after this list).

  6. Run the cell using the Run Cell icon.
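
The retrieval code for the selected dataset is generated by the platform itself; the snippet below is only an illustrative sketch of what such code typically amounts to, using a hypothetical file name for the uploaded CSV:

# Illustrative sketch only: the platform auto-generates the actual retrieval code
# for the selected Sandbox file. "customer_data.csv" is a hypothetical file name.
import pandas as pd

customer_df = pd.read_csv("customer_data.csv")
customer_df.head()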

Step 3: Upload an External Notebook and Train the Model

Purpose

To import and execute an external notebook for model training using scikit-learn.

Procedure

  1. Click the Workspace icon in the left navigation pane.

  2. Navigate to the Repo section → Click the 3 dots (⋮) → Select Import.

  3. Provide:

    • Notebook Name: Pima Diabetes Model

    • Description: Training model using the Pima Indians Diabetes dataset.

    • Upload the notebook file or provide its Git URL.

Execute the Following Cells in Order

1. Import Dependencies

import pandas
from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB

2. Load the Dataset

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
dataframe

3. Define Features (X) and Target (Y)

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

4. Split into Training and Test Sets

test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)

5. Train the Model

model = MultinomialNB()
model.fit(X_train, Y_train)

6. Save the Model

from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
# Save the trained estimator; the modelName 'model' is referenced later when
# registering the model and configuring the pipeline's Model Runner.
saved_model = nb.save_model(model=model, modelName='model', modelType='ml', X=None, y=None, estimator_type='')

7. Load and Predict

loaded_model = nb.load_saved_model('56031751531092038')  # ID of the model saved in the previous cell
X_test_copy = X_test.copy()
Y_pred = nb.predict(model=loaded_model, dataframe=X_test_copy, modeltype='ml')
Y_pred.head()

8. Evaluate Accuracy

from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred.predictions)

A trained Multinomial Naive Bayes model is saved and validated.
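
As an optional cross-check outside the platform wrappers, the same evaluation can be reproduced with plain scikit-learn on the estimator trained in cell 5, which is still in memory at this point. A minimal sketch:

# Optional cross-check using the fitted MultinomialNB estimator directly
from sklearn.metrics import accuracy_score, classification_report

y_pred_local = model.predict(X_test)           # predict class labels for the held-out set
print(accuracy_score(Y_test, y_pred_local))    # compare against the accuracy reported above
print(classification_report(Y_test, y_pred_local))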

Step 4: Register the Model

Purpose

To make the trained model available for pipelines and batch execution.

Procedure

  1. Go to the Models tab (left pane).

  2. Select the All option from the dropdown menu.

  3. Locate your trained model → Click the Register icon for the selected model.

  4. Confirmation: “Model successfully registered.”

Step 5: Create Input Data Script

Purpose

To generate a reusable input data script for feeding the model.

Procedure

  1. Go to the Workspace icon.

  2. In the Repo section → Click 3 dots (⋮) → Select Create Notebook.

  3. Name: sklearn_model_input

  4. Paste the following script:

def input_data():
    import pandas as pd
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    df = pd.read_csv(url, names=names)

    df_test = df.copy()
    df_test.drop('class', axis=1, inplace=True)
    return df_test

input_data()

  5. Run the cell using the Run Cell icon. (A quick check of the returned DataFrame is sketched below.)
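
Because the pipeline's Model Runner will feed this DataFrame to the registered model, it is worth confirming that the returned columns match the eight features the model was trained on. A minimal check, run in the same notebook:

# Sanity check: the returned DataFrame should contain exactly the eight
# training feature columns, with the 'class' target column removed.
df_check = input_data()
expected_cols = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
assert list(df_check.columns) == expected_cols, "Input columns do not match the training features"
print(df_check.shape)  # expected: (768, 8) for the Pima Indians Diabetes dataset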

Step 6: Export Notebook to Pipeline

Procedure

  1. Navigate to the 3 dots (⋮) of the notebook → Select Register.

  2. Select the target cell to export → Click Next.

  3. Choose Export as Script → Click Finish.

Step 7: Create a Data Pipeline

Purpose

To automate data ingestion, model execution, and result writing to the Sandbox.

Procedure

  1. Open Data Pipeline from the Apps Menu.

  2. Click Create Pipeline and configure:

    • Name: DS Lab Workflow 1

    • Description: End-to-end DS Lab and Sandbox workflow

    • Resource Allocation: Medium

  3. Click Save.

Step 8: Add DS Lab Runner (Input Script Execution)

Procedure

  1. From the Components Palette, drag and drop DSLab Runner (from the Machine Learning section).

  2. Configure:

    • Invocation Type: Real-Time

    • Execution Type: Script Runner

    • Function Input: DataFrame

    • Project Name: DS LAB WORKFLOW 1

    • Script Name: sklearn_model_input

    • Start Function: input_data

  3. Save the component.

Step 9: Add Kafka Event

  1. From the Event Panel, click + → Add Kafka Event.

  2. Configure the event and drag it onto the pipeline canvas, then connect the DSLab Runner (Script) component to it.

Step 10: Add DS Lab Runner (Model Execution)

Procedure

  1. From the Components Palette, drag and drop another DSLab Runner.

  2. Configure:

    • Invocation Type: Batch

    • Execution Type: Model Runner

    • Project Name: DS LAB WORKFLOW 1

    • Model Name: model

  3. Save the component.

  4. Add another Kafka Event and connect it to this component.

Step 11: Add Sandbox Writer

Purpose

To write model output back to the Sandbox for analysis and visualization.

Procedure

  1. From the Components Palette, drag and drop Sandbox Writer (under the Writer section).

  2. Configure:

    • Invocation Type: Batch

    • Storage Type: Network

    • File Type: CSV

    • Sandbox File: Customer

    • Save Mode: Overwrite

  3. Click Save.

Step 12: Update and Activate the Pipeline

Pipeline Flow

DSLab Runner (Script) → Event → DSLab Runner (Model) → Event → Sandbox Writer

Procedure

  1. Click the Update/Save icon in the toolbar.

  2. Click the Activate icon to deploy the pipeline.

  3. Monitor pods and component status.

Step 13: Monitor Execution and Verify Results

  • Use the Logs Panel to check real-time execution progress.

  • Verify component messages:

    • “Sandbox Writer successfully wrote data.”

    • “Model Runner completed successfully.”

  • Use the Preview Tab of events to inspect intermediate data.

  • Once processing completes, click Deactivate to stop the pipeline.

Outcome

You have successfully:

  • Created and trained a model in DS Lab.

  • Registered and exported model logic to Data Pipeline.

  • Built a pipeline that:

    • Executes model predictions

    • Writes processed data into the Data Sandbox

This reusable workflow bridges data science experimentation and operational analytics, ensuring reproducible ML pipelines within the BDB ecosystem.

Key Benefits

Feature | Value
DS Lab Integration | Unified workspace for model development
Auto Export | Seamless migration from DS Lab to Pipeline
Real-Time & Batch Support | Handles both live and scheduled workloads
Sandbox Writer | Simplified storage and reuse of model output

Summary

You have completed Data Science Lab Workflow 1, which demonstrates:

  • Model development in DS Lab

  • Automated model execution via Data Pipeline

  • Persistent data storage in the Sandbox for downstream analysis

This workflow operationalizes machine learning on the BDB Platform, enabling a complete data-to-deployment lifecycle with minimal code.