Data Science User Workflow

The Data Science Lab (DSLab) provides an end-to-end environment for developing machine learning workflows. A typical user journey involves creating a project, uploading datasets, building models, and exporting them for downstream use.

Step 1: Create a Project

Navigation path: DSLab > Create Project

  1. Go to the Data Science Lab module.

  2. Click Create Project.

  3. Enter the required information:

    • Project Name

    • Description (optional)

    • Environment (Python or PySpark)

    • Resource configuration (CPU/Memory)

    • (Optional) Git repository settings for version control.

  4. Click Save.

Step 2: Upload Data and Access from a Notebook

Navigation path: DSLab > Project > Workspace > Notebook > Data

  1. Open the created project.

  2. Navigate to the Workspace and go to the Datasets tab.

  3. Click Upload Dataset.

    • Supported formats: CSV, JSON, Parquet, and more.

    • Provide a dataset name and description.

  4. Navigate to the Repo folder inside the Workspace.

  5. Create or open a Notebook stored in the Repo folder.

  6. Import the uploaded data into the Notebook for analysis and preprocessing.

Example (Python Notebook in Repo folder):

import pandas as pd

# Load dataset
df = pd.read_csv('/workspace/datasets/customer_data.csv')

# Preview first few rows
df.head()
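
The same pattern applies to the other supported formats. A minimal sketch, assuming Parquet and JSON files stored under the same /workspace/datasets/ path (the file names are hypothetical):

# Load a Parquet dataset (requires pyarrow or fastparquet)
df_parquet = pd.read_parquet('/workspace/datasets/customer_data.parquet')

# Load a JSON dataset
df_json = pd.read_json('/workspace/datasets/customer_data.json')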

Step 3: Create a Model

Navigation path: DSLab > Project > Workspace > Models

  1. In a Notebook from the Repo folder, prepare your dataset (cleaning, feature engineering, train/test split); a brief cleaning sketch follows this list.

  2. Build a model using Python libraries (e.g., scikit-learn, PyTorch, TensorFlow) or use AutoML for automated model building.

  3. Model creation involves the following steps:

    1. Read the DataFrame.

    2. Define the training and test data.

    3. Create the model by executing the script in the code cell.
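
Before splitting, a typical cleaning and feature-engineering pass might look like the following sketch; it continues from the Step 2 Notebook, and the 'plan_type' column is hypothetical:

# Drop rows with missing values
df = df.dropna()

# One-hot encode a categorical column (hypothetical 'plan_type' column)
df = pd.get_dummies(df, columns=['plan_type'])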

Example (scikit-learn in Repo Notebook):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Train-test split
X = df.drop("churn", axis=1)
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Evaluate
print("Accuracy:", model.score(X_test, y_test))
  4. Save the trained model to the Models tab of the Workspace.

    1. Navigate to the next code cell.

    2. Assign a name to the model and set the model type to 'ml'.

    3. Execute the code cell.

    4. Once the cell finishes executing, the model appears under the Models tab.

  5. Function Parameters

    1. model - The variable holding the trained model.

    2. modelName - The name the user assigns to the trained model.

    3. modelType - The type under which the model is saved (for example, 'ml').

    4. X - The array of input features (predictors) used to train the model. Each row represents a sample in the training set, and each column represents a feature.

    5. Y - The array of target values corresponding to each sample in the training set, also called the target variable, dependent variable, or label. It has the same number of rows as X.

    6. estimator_type - The kind of estimator the model represents, such as a classifier or a regressor.

Figure: Function Parameters Highlighted for a DS Model
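
Putting these parameters together, a save call might look like the sketch below. The helper name save_model is a hypothetical stand-in, since this guide does not name the platform's actual save function; only the parameter shape is illustrated:

# Hypothetical stand-in for the platform's save helper; the real function
# name is platform-specific and not given in this guide.
def save_model(model, modelName, modelType, X, Y, estimator_type):
    ...

# Save the RandomForestClassifier trained above under the Models tab
save_model(
    model=model,                  # trained model variable
    modelName='churn_rf',         # user-chosen model name (hypothetical)
    modelType='ml',               # save type, as described above
    X=X_train,                    # training features
    Y=y_train,                    # training labels
    estimator_type='classifier',  # e.g., classifier or regressor
)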

Step 4: Export the Model

Once the model is trained and saved:

  1. Navigate to the Models tab inside the Workspace.

  2. Select the trained model.

  3. Click Register.

    • The model becomes available in the Data Pipeline module.

    • It can be integrated with jobs for batch or real-time inference, as sketched below.
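
As an illustration of downstream use, a batch-inference job applies the registered model to new records. A minimal sketch, assuming the model is available as the scikit-learn estimator trained in Step 3 and that incoming data matches the training schema; how a pipeline job loads a registered model is specific to the Data Pipeline module and is not shown, and the file name is hypothetical:

import pandas as pd

# New records to score; must contain the same feature columns used in training
new_data = pd.read_csv('/workspace/datasets/new_customers.csv')

# Batch inference with the registered model
new_data['churn_prediction'] = model.predict(new_data)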

Summary Workflow

  1. Create Project → Workspace with Repo folder is created.

  2. Upload Dataset → Store dataset in Workspace and access via Notebook in Repo.

  3. Develop in Notebook (Repo folder) → Preprocess data, train ML models, and register them.

  4. Export Model → Deploy models into pipelines for production workflows.