Workflow 1
In this workflow, you will create a project in DS Lab, train and register a model, and build a pipeline that writes model outputs to the Data Sandbox for analysis and reuse.
Create a Project in DS Lab
1. Navigate to DS Lab from the Apps menu.
2. Click on the Create+ button.
3. Enter a project name (DS LAB WORKFLOW 1) and a description.
4. Select the algorithm type from the drop-down (Regression, Classification, Forecasting, NLP, etc.). This workflow uses Regression and Classification.
5. Choose an Environment:
Python TensorFlow (used in this workflow)
PyTorch
PySpark
6. Allocate resources according to dataset size: Medium.
7. Set an Idle Shutdown limit: 1h.
8. Add external libraries: boto3.
9. Save the project.

Create a Notebook and Upload a Dataset
Activate the project by clicking on the Activate option located on the right.
Open the project by clicking “View”, which appears to the right once the project is activated.
Wait for the kernel to start.
Click on “create” to create a notebook, give the notebook an appropriate name (test notebook) and a description, and then click on “save”.
Click on the “Data” icon located to the left of the workspace to add the sandbox as data.
After clicking on the Data icon, click on the ‘+’ icon located to the right of the search bar.
An “add data” page will pop up; select the “data sandbox files” option from the data source drop-down menu.
Click on the “upload” option located to the right of the “add data” page.
Give the sandbox a name and an appropriate description, and choose the sandbox file from your local system, e.g., “Customer data”.
Click on “save” to upload the file. Once the file is uploaded, a pop-up will appear: “File is uploaded”.
Now check the checkbox of the newly uploaded sandbox and click on “add” to add the sandbox as data.
Once the sandbox is added, click on a cell and then click the checkbox of the sandbox shown under the Data section.
Code will automatically appear in the cell; this generated code lets you read the data from the sandbox file you uploaded (a rough illustration is given after the next step).
Run the cell using the “run cell” icon present on the top left corner of the cell.
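The exact snippet is produced automatically by DS Lab and may use platform-specific helpers, but it typically amounts to reading the uploaded file into a pandas DataFrame. A minimal sketch of the idea, assuming a hypothetical sandbox file path (the real auto-generated code may differ):
# Illustrative sketch only -- the actual code is auto-generated by DS Lab.
import pandas as pd
sandbox_path = "/sandbox/Customer data.csv"   # hypothetical path to the uploaded sandbox file
customer_df = pd.read_csv(sandbox_path)
customer_df.head()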

Upload an External Notebook and Create a Model
1. Now navigate to the workspace icon and click on it.
2. Navigate to the 3 dots of repo section and select “import” from the options.
3. Once you click on it, give the notebook a name and a description, and choose the Pima Indians Diabetes dataset.

[NOTE: Execute each of the code blocks given below in a separate cell.]
Run the cell using the “run cell” icon present on the top left corner of the cell.
#Sklearn train model
import pandas
from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
# Read CSV From Git and create dataframe
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
dataframe
1. Define features (X) and target (Y):
# Define Feature column (X) and Target Column (Y)
array = dataframe.values
# Method 1
X = array[:,0:8]
Y = array[:,8]
# Method 2
df_x = dataframe[['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']]
df_y = dataframe['class']
X = df_x.values
Y = df_y.values
2. Split into training and test sets:
# Define test size and random seed value
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
3. Train the model:
# Use Multinomial Naive Bayes Classifier
model = MultinomialNB()
model.fit(X_train, Y_train);
[NOTE: Once the model is trained, go to the Models section, located on the left of the screen (in the same place as the “Data” icon), and select “all” from the drop-down menu to the left of the ‘+’ icon. Click on a cell and then click the model’s checkbox; this automatically generates code with which you can view the model’s data. THE CODE BELOW IS THE AUTO-GENERATED CODE.]
4. Save the model using the DSLab library:
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
saved_model = nb.save_model(model=model, modelName='model', modelType='ml', X=None, y=None, estimator_type='')
5. Load the saved model:
loaded_model = nb.load_saved_model('56031751531092038')  # pass the ID of the model you saved above
6. Run predictions:
# Create copy of test data
X_test_copy = X_test.copy()
# Model Prediction
Y_pred = nb.predict(model=loaded_model, dataframe=X_test_copy, modeltype='ml')
Y_pred.head()
7. Check accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_pred.predictions)
Register the Model
1. Navigate to the Models tab.
2. Select “all” from the drop-down menu.
3. Locate your model, click the 3 dots of the model, and click Register.
4. The model is now available for pipelines.


Create an Input Data Script and Export to Pipeline
Input Data Notebook
1. Now navigate to the workspace icon and click on it.
2. Navigate to the 3 dots of the repo section and then click “create”.
3. Create a notebook named sklearn_model_input, give it an appropriate description, and click on save.
4. Now, inside a cell, enter the code given below.
def input_data():
    import pandas as pd
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    df = pd.read_csv(url, names=names)
    # Create test dataframe (features only, without the target column)
    df_test = df.copy()
    df_test.drop('class', axis=1, inplace=True)
    return df_test
input_data()
This code fetches the CSV data from the Git URL and creates the test dataframe.
Run the cell using the “run cell” icon present on the top left corner of the cell.
Export Notebook to Pipeline
Now navigate to the 3 dots of the notebook and click on Register.
Select the cell to export, and then click on next.
Select export as a script and then click on “Finish”.


Create a Data Pipeline
Open Data Pipeline from the Apps menu.
Create a pipeline with the name DS Lab Workflow 1, a description, and Resources Allocation set to Medium.

Add DSLab Runner
· Click the ‘+’ icon on the right side of the screen to open the component palette if it is not already visible on the canvas.
· From the component palette, drag and drop the DS Lab Runner component from the machine learning section.
· Component: DSLab Runner
· Invocation type: Real-Time
· Move to the Meta Information tab and configure it as follows: Execution type: Script Runner
· Function input: DataFrame
· Project name: DS LAB WORKFLOW 1
· Script name: sklearn_input_data
· Start function: input_data. Then save the component (a sketch of the expected start-function contract is given below).
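For reference, this configuration assumes that the start function exported from the sklearn_model_input notebook returns a pandas DataFrame, which the runner then emits to the following Kafka event. A minimal sketch of that contract, mirroring the script created in the previous section:
# Sketch of the start-function contract assumed above (mirrors the exported script).
def input_data():
    import pandas as pd
    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    # Return a feature-only DataFrame; the runner forwards it to the next event.
    return pd.read_csv(url, names=names).drop('class', axis=1)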
Add Kafka Event
Navigate to the event panel, situated right next to the component panel in the component palette. Click the ‘+’ icon to add a Kafka event, click the add event option in the pop-up screen, and then drag and drop the event onto the pipeline canvas.
Add DSLab Runner
· From the component palette, drag and drop the DS Lab Runner component from the machine learning section.
· Component: DSLab Runner
· Invocation type: Batch
· Move to the Meta Information tab and configure it as follows: Execution type: Model Runner
· Project name: DS LAB WORKFLOW 1
· Model name: model
· Then save the component (a conceptual sketch of what the model runner does is given below).
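In this configuration, the model runner applies the registered model to the DataFrame arriving from the upstream Kafka event, which is conceptually the same as the notebook prediction steps shown earlier. A rough sketch using the same DSLab notebook API (the pipeline performs this automatically; the sample input below is only a stand-in for the event data):
# Conceptual sketch only -- the DS Lab Runner performs this step inside the pipeline.
import pandas as pd
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor

# Stand-in for the feature DataFrame delivered by the upstream Kafka event.
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
incoming_df = pd.read_csv(url, names=names).drop('class', axis=1)

nb = NotebookExecutor()
loaded_model = nb.load_saved_model('56031751531092038')   # model ID used earlier in the notebook
predictions = nb.predict(model=loaded_model, dataframe=incoming_df, modeltype='ml')
predictions.head()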
Add Kafka Event
Navigate to the event panel, situated right next to the component panel. Click the ‘+’ icon to add a Kafka event, click the add event option in the pop-up screen, and then drag and drop the event onto the pipeline canvas.
Add Sandbox Writer
· Now, from the component palette, drag and drop the Sandbox Writer from the writer section onto the canvas.
· Component: Sandbox Writer
· Invocation type: Batch
· Configure the Meta Information as follows: Storage type: Network
· File type: CSV
· Sandbox file: customer; Save mode: overwrite (conceptually, this writes the incoming predictions as a CSV file into the Data Sandbox, as illustrated below).
· Save the component once it is configured.
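The effect of this writer is, conceptually, a plain CSV write of the incoming DataFrame into the Data Sandbox; the component itself manages the actual sandbox location and the overwrite behavior. A purely illustrative equivalent, with a hypothetical path and a stand-in DataFrame:
# Purely illustrative -- the Sandbox Writer component performs this write itself.
import pandas as pd
incoming_df = pd.DataFrame({'predictions': []})           # stand-in for the predictions from the upstream event
incoming_df.to_csv("/sandbox/customer.csv", index=False)  # hypothetical path; "overwrite" replaces the existing file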

Update and activate the pipeline.
Click the update/save icon in the toolbar after configuring all the components in the pipeline.
Click “Activate” icon located in the toolbar to activate the pipeline.
Once activated, the pipeline will begin executing and the pods will start deploying.
· Check the Component Logs to monitor the status of each component.
· After all pods are up and running, move to the Logs section to track pipeline execution in detail.
· Once the component is successfully executed, it sends the data to the event.
· You can see the data in the preview tab of each event located after the respective component.

· Once the Sandbox Writer has executed successfully, the Logs section will show “sandbox writer successfully written data”.
· Once the data is written successfully, the component will stop automatically.
Deactivate the pipeline by clicking the “deactivate” icon in the toolbar.
You have successfully completed data science lab workflow 1.