DS Model-to-Sandbox Pipeline
Create a project in DS Lab, train and register a model, and build a pipeline that writes model outputs to the Data Sandbox for analysis and reuse.
This guide outlines the process to create and operationalize a machine learning workflow within the BDB Platform’s DS Lab and Data Pipeline modules. The workflow demonstrates how to:
Create a project in DS Lab
Train, test, and register an ML model
Export scripts to the Data Pipeline
Build a data pipeline that reads, predicts, and writes results back into the Data Sandbox for analysis and reuse
This setup enables seamless collaboration between Data Science experimentation and Data Engineering pipelines, supporting scalable ML model deployment.
Architecture Overview
| Stage | Component | Purpose |
| --- | --- | --- |
| 1 | DS Lab Project | Create a workspace for model training |
| 2 | DS Lab Notebook | Prepare and upload data for model training |
| 3 | DS Lab Model | Train, save, and register ML models |
| 4 | DS Lab Script | Generate and export input data logic |
| 5 | Data Pipeline | Automate model inference and output writing |
| 6 | Sandbox Writer | Store model outputs for downstream use |
Prerequisites
Before starting, ensure:
Access to DS Lab and Data Pipeline modules on the BDB Platform
Permissions to create and manage projects, models, and pipelines
ClickHouse, Kafka, and Sandbox services configured
Familiarity with Python, scikit-learn, and BDB DS Lab APIs
Step 1: Create a Project in DS Lab
Purpose
To establish a dedicated workspace for model experimentation.
Procedure
Navigate to DS Lab from the Apps Menu.
Click the Create (+) button.
Fill in the following fields:
Project Name: DS LAB WORKFLOW 1
Description: Machine learning workflow project for model training and deployment.
Algorithm Type: Select Regression and Classification from the dropdown (supports Regression, Classification, Forecasting, NLP, etc.).
Choose the Environment:
Python TensorFlow (used in this workflow)
Alternative options: PyTorch, PySpark
Resource Allocation: Medium (based on dataset size).
Idle Shutdown: 1 hour.
External Libraries: Add boto3.
Click Save.
Step 2: Create a Notebook and Upload a Dataset
Purpose
To add a dataset for model training within the DS Lab environment.
Procedure
Activate the project:
Click Activate on the right-hand panel.
Click View to open the activated project.
Wait for the kernel to start.
Click Create Notebook → Enter:
Name: test_notebook
Description: Base notebook for training workflow.
Click Save.
Add Sandbox Data
Click the Data icon on the left navigation pane.
Click the ‘+’ icon next to the search bar.
In the Add Data dialog:
Select Data Sandbox Files from the Data Source dropdown.
Click Upload and provide:
Sandbox Name: Customer Data
Description: Customer data for model training
Upload File: Select a CSV file from your local system.
Click Save.
A confirmation message “File is uploaded” will appear.
Select the checkbox for the newly uploaded Sandbox file → Click Add.
Select the dataset checkbox to auto-generate code for data retrieval (a sketch of the equivalent logic follows this procedure).
Run the cell using the Run Cell icon.
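Note: the exact code that DS Lab generates varies by platform version. The following is a rough, hand-written equivalent, not the generated cell itself; the file path is illustrative and would be resolved by DS Lab for the uploaded Customer Data file.
import pandas as pd

# Illustrative only: DS Lab's generated cell resolves the real Sandbox location.
sandbox_csv_path = "/sandbox/customer_data.csv"  # assumed path, adjust to your environment

# Load the uploaded Sandbox file into a DataFrame for exploration and training.
customer_df = pd.read_csv(sandbox_csv_path)
print(customer_df.shape)
customer_df.head()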
Step 3: Upload an External Notebook and Train the Model
Purpose
To import and execute an external notebook for model training using scikit-learn.
Procedure
Click the Workspace icon in the left navigation pane.
Navigate to the Repo section → Click the 3 dots (⋮) → Select Import.
Provide:
Notebook Name: Pima Diabetes Model
Description: Training model using the Pima Indians Diabetes dataset.
Upload the notebook file or provide a Git URL.
Execute the Following Cells in Order
1. Import Dependencies
import pandas
from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
2. Load the Dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
dataframe
3. Define Features (X) and Target (Y)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
4. Split into Training and Test Sets
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
5. Train the Model
model = MultinomialNB()
model.fit(X_train, Y_train)
6. Save the Model
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
saved_model = nb.save_model(model=model, modelName='model', modelType='ml', X=None, y=None, estimator_type='')
7. Load and Predict
loaded_model = nb.load_saved_model('56031751531092038')  # replace with the ID of your own saved model
X_test_copy = X_test.copy()
Y_pred = nb.predict(model=loaded_model, dataframe=X_test_copy, modeltype='ml')
Y_pred.head()
8. Evaluate Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred.predictions)
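Optionally, the same predictions can be inspected in more detail than a single accuracy score. This extra cell is not part of the original workflow; it only applies standard scikit-learn metrics to the Y_pred DataFrame produced above.
from sklearn.metrics import confusion_matrix, classification_report

# Y_pred.predictions holds the predicted class for each test row (see cell 7).
print(confusion_matrix(Y_test, Y_pred.predictions))
print(classification_report(Y_test, Y_pred.predictions))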
A trained Multinomial Naive Bayes model is saved and validated.
Step 4: Register the Model
Purpose
To make the trained model available for pipelines and batch execution.
Procedure
Go to the Models tab (left pane).
Select the All option from the dropdown menu.
Locate your trained model → Click the Register icon for the selected model.
Confirmation: “Model successfully registered.”
Step 5: Create Input Data Script
Purpose
To generate a reusable input data script for feeding the model.
Procedure
Go to the Workspace icon.
In the Repo section → Click 3 dots (⋮) → Select Create Notebook.
Name: sklearn_model_input
Paste the following script:
def input_data():
    import pandas as pd
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    df = pd.read_csv(url, names=names)
    df_test = df.copy()
    df_test.drop('class', axis=1, inplace=True)
    return df_test

input_data()
Run the cell using the Run Cell icon.
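Optionally, before registering the script you can sanity-check its output in the same notebook. The cell below assumes the input_data function above has already been run; it only confirms the expected shape of the returned DataFrame.
# Optional sanity check: the model expects 8 feature columns and no 'class' column.
df_check = input_data()
assert df_check.shape[1] == 8, "expected 8 feature columns"
assert 'class' not in df_check.columns, "target column should have been dropped"
print(df_check.head())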
Step 6: Export Notebook to Pipeline
Procedure
Navigate to the 3 dots (⋮) of the notebook → Select Register.
Select the target cell to export → Click Next.
Choose Export as Script → Click Finish.
Step 7: Create a Data Pipeline
Purpose
To automate data ingestion, model execution, and result writing to the Sandbox.
Procedure
Open Data Pipeline from the Apps Menu.
Click Create Pipeline and configure:
Name: DS Lab Workflow 1
Description: End-to-end DS Lab and Sandbox workflow
Resource Allocation: Medium
Click Save.
Step 8: Add DS Lab Runner (Input Script Execution)
Procedure
From the Components Palette, drag and drop DSLab Runner (from the Machine Learning section).
Configure:
Invocation Type: Real-Time
Execution Type: Script Runner
Function Input: DataFrame
Project Name: DS LAB WORKFLOW 1
Script Name: sklearn_model_input
Start Function: input_data
Save the component.
Step 9: Add Kafka Event
From the Event Panel, click + → Add Kafka Event.
Configure and drag it to the pipeline canvas.
Step 10: Add DS Lab Runner (Model Execution)
Procedure
From the Components Palette, drag and drop another DSLab Runner.
Configure:
Invocation Type: Batch
Execution Type: Model Runner
Project Name: DS LAB WORKFLOW 1
Model Name: model
Save the component.
Add another Kafka Event and connect it to this component.
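Conceptually, the Model Runner repeats the load-and-predict sequence from Step 3 on the DataFrame arriving from the upstream Kafka Event. Nothing below needs to be run in the pipeline; it is only a notebook sketch of the equivalent logic, reusing the NotebookExecutor calls shown earlier and the input_data function from Step 5 as a stand-in for the event payload.
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor

nb = NotebookExecutor()
loaded_model = nb.load_saved_model('56031751531092038')  # use your own saved-model ID

# Stand-in for the DataFrame delivered by the upstream Kafka Event.
df_in = input_data()
predictions = nb.predict(model=loaded_model, dataframe=df_in, modeltype='ml')
predictions.head()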
Step 11: Add Sandbox Writer
Purpose
To write model output back to the Sandbox for analysis and visualization.
Procedure
From the Components Palette, drag and drop Sandbox Writer (under the Writer section).
Configure:
Invocation Type: Batch
Storage Type: Network
File Type: CSV
Sandbox File: Customer
Save Mode: Overwrite
Click Save.
Step 12: Update and Activate the Pipeline
Pipeline Flow
DSLab Runner (Script) → Event → DSLab Runner (Model) → Event → Sandbox Writer
Procedure
Click the Update/Save icon in the toolbar.
Click the Activate icon to deploy the pipeline.
Monitor pods and component status.
Step 13: Monitor Execution and Verify Results
Use the Logs Panel to check real-time execution progress.
Verify component messages:
“Sandbox Writer successfully wrote data.”
“Model Runner completed successfully.”
Use the Preview Tab of events to inspect intermediate data (a sketch for verifying the written output follows this procedure).
Once processing completes, click Deactivate to stop the pipeline.
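If you want to check the written file itself rather than only the component messages, load the Sandbox output in a DS Lab notebook (added the same way as in Step 2) or open a downloaded copy with pandas. The file name and checks below are assumptions; adjust them to the actual output schema of the Model Runner.
import pandas as pd

# Assumed local copy of the Sandbox Writer output (CSV, written in Overwrite mode).
output_df = pd.read_csv("customer_predictions.csv")  # illustrative file name

# Basic checks: rows were written and a prediction column is present.
print(output_df.shape)
print(output_df.columns.tolist())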
Outcome
You have successfully:
Created and trained a model in DS Lab.
Registered and exported model logic to Data Pipeline.
Built a pipeline that:
Executes model predictions
Writes processed data into the Data Sandbox
This reusable workflow bridges data science experimentation and operational analytics, ensuring reproducible ML pipelines within the BDB ecosystem.
Key Benefits
| Feature | Value |
| --- | --- |
| DS Lab Integration | Unified workspace for model development |
| Auto Export | Seamless migration from DS Lab to Pipeline |
| Real-Time & Batch Support | Handles both live and scheduled workloads |
| Sandbox Writer | Simplified storage and reuse of model output |
Summary
You have completed Data Science Lab Workflow 1, which demonstrates:
Model development in DS Lab
Automated model execution via Data Pipeline
Persistent data storage in the Sandbox for downstream analysis
This workflow operationalizes machine learning on the BDB Platform, enabling a complete data-to-deployment lifecycle with minimal code.