Data Science User Workflow
The Data Science Lab (DSLab) provides an end-to-end environment for developing machine learning workflows. A typical user journey involves creating a project, uploading datasets, building models, and exporting them for downstream use.
Step 1: Create a Project
Navigation path: DSLab > Create Project
Go to the Data Science Lab module.
Click Create Project.
Enter the required information:
Project Name
Description (optional)
Environment (Python or PySpark)
Resource configuration (CPU/Memory)
(Optional) Git repository settings for version control.
Click Save.
Step 2: Upload Data and Access from a Notebook
Navigation path: DSLab > Project > Workspace > Notebook > Data
Open the created project.
Navigate to the Workspace and go to the Datasets tab.
Click Upload Dataset.
Supported formats: CSV, JSON, Parquet, and more.
Provide a dataset name and description.
Navigate to the Repo folder inside the Workspace.
Create or open a Notebook stored in the Repo folder.
Import the uploaded data into the Notebook for analysis and preprocessing.
Example (Python Notebook in Repo folder):
import pandas as pd
# Load dataset
df = pd.read_csv('/workspace/datasets/customer_data.csv')
# Preview first few rows
df.head()
Step 3: Create a Model
Navigation path: DSLab > Project > Workspace > Models
In a Notebook from the Repo folder, prepare your dataset (cleaning, feature engineering, train/test split).
Build a model using Python libraries (e.g., scikit-learn, PyTorch, TensorFlow) or use AutoML for automated model building.
Model creation involves the following steps:
Read the DataFrame.
Split the data into training and test sets.
Create the model by executing the script in the code cell.
Example (scikit-learn in Repo Notebook):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Train-test split
X = df.drop("churn", axis=1)
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Evaluate
print("Accuracy:", model.score(X_test, y_test))
Save the trained model to the Models tab of the Workspace:
Navigate to the next code cell.
Assign a name to the model and set the model type to 'ml'.
Execute the code cell.
Once the cell finishes, the model appears under the Models tab.
Function Parameters
model - The variable holding the trained model.
modelName - The name the user assigns to the trained model.
modelType - The format in which the model is saved (for example, 'ml').
X - The array of input features (predictors) used to train the model. Each row represents a sample in the training set, and each column represents a feature.
Y - The corresponding target (also called the label or dependent variable) for each training sample. Y has the same number of rows as X.
estimator_type - The kind of estimator the model is (for example, a classifier or a regressor).
(Figure: Function parameters highlighted for a DS model)
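The save step above can be sketched as follows. This guide does not name the platform's save helper, so `save_model` below is a hypothetical stand-in that mirrors the documented parameters, implemented with `pickle` only to keep the sketch self-contained; the real DSLab function and storage location may differ.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the platform's save helper; the actual DSLab
# function name and storage backend are platform-specific.
def save_model(model, modelName, modelType, X, Y, estimator_type):
    # Persist the trained model locally as an illustration.
    path = f"{modelName}.{modelType}.pkl"
    with open(path, "wb") as f:
        pickle.dump({"model": model, "estimator_type": estimator_type}, f)
    return path

# Train a small model on synthetic data standing in for the project dataset.
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X, y)

# Save it using the parameters described above.
saved_path = save_model(clf, modelName="churn_model", modelType="ml",
                        X=X, Y=y, estimator_type="classifier")
```

After the cell runs, the saved artifact would appear under the Models tab in the actual platform.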
Step 4: Export the Model
Once the model is trained and saved:
Navigate to the Models tab inside the Workspace.
Select the trained model.
Click Register.
The model becomes available in the Data Pipeline module.
It can be integrated with jobs for batch or real-time inference.
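As an illustration of what a downstream batch-inference job does, the sketch below scores a batch of records with a trained model. The model and data here are synthetic stand-ins; in practice, the Data Pipeline module supplies the registered model and the incoming records.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: in a real pipeline job, the registered model and
# the incoming batch come from the platform, not from code like this.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Batch inference: attach a prediction column to each incoming record.
batch = pd.DataFrame(X[:10], columns=[f"f{i}" for i in range(4)])
batch["prediction"] = model.predict(batch[[f"f{i}" for i in range(4)]].values)
```

Real-time inference follows the same pattern, scoring one record (or a small batch) per request instead of a full dataset.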
Summary Workflow
Create Project → Workspace with Repo folder is created.
Upload Dataset → Store dataset in Workspace and access via Notebook in Repo.
Develop in Notebook (Repo folder) → Preprocess data, train ML models, and register them.
Export Model → Deploy models into pipelines for production workflows.