ML Model

This page provides a step-by-step process for training an ML model inside a Data Science notebook.

Machine Failure Prediction Using Random Forest and Logistic Regression

This notebook demonstrates the complete workflow of preparing machine data, balancing it, training multiple models, and evaluating them. We will use Random Forest and Logistic Regression, and handle imbalanced datasets using SMOTE. We will also standardize the features and explore correlations to understand the data better.

Step 1: Install Required Packages/Libraries

SMOTE generates synthetic examples for the minority class to balance the dataset.

Note: Once installed, restart the Kernel.

!pip install imbalanced-learn

Step 2: Upload and Load the data

We begin by loading the dataset into a pandas DataFrame to inspect the first few rows. This helps us understand the structure and types of data we are working with.

Sample Data: Download the sample data file and upload it to the project using the Sandbox data source.


To load your uploaded data into the notebook, use the following code:

from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
df = nb.get_data('84831755771014158', '@SYS.USERID', 'True', {}, [], sheet_name = '')
# The first parameter is the service ID of the dataset.
# @SYS.USERID refers to the user ID of the current user.
# If the Sandbox key is 'false', the source is treated as a dataset; if it is 'true', the file is a sandbox file.
# {} refers to the filters applied to the dataset.
# [] refers to the data preparations applied to the dataset.
# After [], users can specify the number of rows to load with a comma-separated argument (e.g., limit=10).

Note: After the load snippet is inserted into the code cell, adjust the DataFrame variable. We suggest replacing the variable name derived from the loaded file with a plain df, i.e. df = nb.get_data(...), as shown in the code cell above.

# Loading the dataset into a pandas DataFrame.
import pandas as pd

print("Loading the data...")
df.head()

Step 3: Check for missing values

Before proceeding, we need to identify any missing values in the dataset. Missing values can affect model performance and may need to be handled.

#  Check for missing values
# Checking if any column contains null values to handle missing data.
df.isnull().sum()

Note: This function checks your data columns and provides a count of missing values. It's recommended that you ensure there is no missing data before you proceed.
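
If the dataset does contain missing values, a minimal sketch for handling them (not part of the original notebook) is to drop or fill the affected rows:

# Option 1: drop rows that contain any missing value
df = df.dropna()

# Option 2: fill missing numeric values with the column median instead
# df = df.fillna(df.median(numeric_only=True))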

Step 4: Feature Engineering

We create a new feature, Temp Diff, that captures the difference between Process temperature and Air temperature.

  • Derived features like this can help models capture hidden patterns in the data.

# Create a new feature called 'Temp Diff' to capture the temperature difference.
df['Temp Diff'] = df['Process temperature [K]'] - df['Air temperature [K]']
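
To verify the derived column, you can preview it alongside the columns it was computed from:

# Preview the new feature next to its source columns
df[['Air temperature [K]', 'Process temperature [K]', 'Temp Diff']].head()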

Step 5: Remove unnecessary and leakage-causing features

Some features either leak information about the target or are irrelevant identifiers. We drop such columns to prevent the model from learning incorrect patterns.

  • Dropped features include UDI and Product ID (identifiers) as well as TWF, HDF, PWF, OSF, and RNF (individual failure-mode flags that would leak the target).

# Remove unnecessary and leakage-causing features from the data
df.drop(columns=['UDI','Product ID'],axis=1,inplace=True)
df.drop(columns=['TWF','HDF','PWF','OSF','RNF'],inplace=True)

Then print the first rows of the updated DataFrame to verify the changes.

print("Loading the new data. please wait!...")
df.head()

Step 6: Encode Categorical Variables

The Type column is categorical. Machine learning models require numeric inputs, so we convert it using Label Encoding, which maps each category to an integer (for example, 0, 1, 2, and so on).

# Encode categorical variables
# The 'Type' column is categorical. Convert it to numeric labels.
from sklearn.preprocessing import LabelEncoder

df['Type'] = LabelEncoder().fit_transform(df['Type'])

Step 7: Standardize the Dataset

Standardization scales features to have a mean of 0 and a standard deviation of 1. This process is crucial for machine learning models as it helps them train faster and prevents features with large magnitudes from dominating the learning process.

# Standardize the features
# Standardization brings all features to a similar scale.
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
data = pd.DataFrame(scale.fit_transform(df), columns=df.columns, index=df.index)
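
As a quick optional sanity check (not part of the original notebook), you can confirm that each column now has a mean close to 0 and a standard deviation close to 1:

# Each standardized feature should have mean ≈ 0 and standard deviation ≈ 1
print(data.describe().loc[['mean', 'std']])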

Step 8: Feature Correlation

Visualizing correlations helps to identify redundant features and understand the relationships between variables. A common way to do this is with a heatmap, which visually represents how strongly features are related to each other and to the target variable.

#  Visualize feature correlations
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation of Features")
plt.show()

Step 9: Select Features and Target

This step involves defining the feature matrix (X) and the target vector (y) for the machine failure prediction task.

  • Features (X): This includes the numerical columns that are likely to influence machine failure, such as temperature, rotational speed, torque, tool wear, and the previously derived feature, Temp Diff.

  • Target (y): The target variable is the Machine failure column, which indicates whether a failure occurred (1) or not (0).

X = df[['Air temperature [K]','Process temperature [K]','Type','Rotational speed [rpm]','Torque [Nm]','Tool wear [min]','Temp Diff']]
y = df['Machine failure']

Step 10: Clean Feature Column Names

This step involves cleaning column names by removing special characters like [, ], or < using regular expressions. This ensures compatibility with machine learning libraries, such as scikit-learn, and other Python functions that may not be able to process such characters.

# Remove special characters such as [, ], and < from the column names
X.columns = X.columns.astype(str).str.replace(r'[\[\]<>]', '', regex=True)

Step 11: Standardize Selected Features

This step involves standardizing the features in the feature matrix, X, to prepare them for model training. This process ensures that all features contribute equally to the model, preventing features with larger numerical values from disproportionately influencing the results.

# Standardizing the selected feature
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X_scaled = scale.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled,columns=X.columns)
print("..................................................\n.......................................\n...................................")
print("The standardized form of X features are given as*****.......*******\n.............................\n..................................\n",X_scaled_df.describe)
print("..................................................\n.......................................\n...................................")

Step 12: Plot Feature Distributions Before and After Scaling

Visualizing the distributions of features before and after standardization helps to clearly see its effects. Histograms of the data before scaling may show features with widely varying ranges and centers, making direct comparisons difficult. After scaling, however, the distributions are centered around a mean of 0 and have a comparable scale, which is ideal for many machine learning algorithms.

# plotting graphs before and after the standardizations
print("**********Plot before scaling**********")
num_cols = len(df.columns)
fig, axes = plt.subplots(1, num_cols, figsize=(5 * num_cols, 4))
if num_cols == 1:
    axes = [axes]

for i, col in enumerate(df.columns):
    axes[i].hist(df[col], bins=5, color='skyblue')
    axes[i].set_title(f'Before Scaling: {col}')
plt.tight_layout()
plt.show()

print("Applying Standardization...\n,this may take a while.please wait!")
scaler = StandardScaler()
scaled_array = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_array, columns=df.columns)

print("*******Plot after scaling********")
fig, axes = plt.subplots(1, num_cols, figsize=(5 * num_cols, 4))
if num_cols == 1:
    axes = [axes]

for i, col in enumerate(df_scaled.columns):
    axes[i].hist(df_scaled[col], bins=5, color='salmon')
    axes[i].set_title(f'After Scaling: {col}')
plt.tight_layout()
plt.show()

Step 13: Check Target Distribution

Before training a model, it is crucial to check the distribution of the target variable to see if the dataset is imbalanced. An imbalanced dataset has an unequal number of instances for each class, which can lead to a model that performs well on the majority class but poorly on the minority class. Understanding this distribution helps in choosing the right sampling techniques and evaluation metrics for the model.

# Check the class counts of the target variable
from collections import Counter
count = Counter(y)
print("Class counts of the target column:", count)

Step 14: Apply SMOTE to Balance Dataset

  • SMOTE oversamples the minority class to prevent model bias.

  • After this step, both classes have roughly equal representation.

# Balance the data so the model does not favor the majority class
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled,y_resampled = smote.fit_resample(X,y)
y_resampled.shape

Step 15: Check Class Distribution After Balancing

After applying SMOTE in Step 14, it's important to verify that the dataset is now balanced.

  • Counter(y_resampled) counts the number of instances for each class in the resampled target.

  • This ensures that both classes (machine failure = 0 and 1) have roughly equal representation, which is important for unbiased model training.

from collections import Counter
counts = Counter(y_resampled)
print("Now the balanced data is",counts)

Step 16: Train Multiple Models

We train and evaluate several models, including:

  • Logistic Regression

  • Random Forest

  • KNN

  • SVC

We compute the Accuracy and F1 Score for each.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score,accuracy_score

X_train,X_test,y_train,y_test = train_test_split(X_resampled,y_resampled,test_size=0.2,stratify=y_resampled,random_state=42)
models = {
    "LogisticRegression": LogisticRegression(C=(1.0),penalty="l2",solver="liblinear",max_iter=(500)),
    "Random Forest"     : RandomForestClassifier(),
    
    "KNN"               : KNeighborsClassifier(),
    "SVC"               : SVC()
}

results = {}

for name,model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test,y_pred)
    f1 = f1_score(y_test,y_pred)
    results[name] = {"Accuracy": acc, "F1 score": f1}

import pandas as pd
comparison_df = pd.DataFrame(results).T
print(comparison_df)
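
To rank the models, you can optionally sort the comparison table by F1 score (a convenience step, not part of the original notebook):

# Sort models from best to worst F1 score
print(comparison_df.sort_values("F1 score", ascending=False))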

Step 17: Clean Column Headers

Column names may contain spaces or special characters, which can cause issues in modeling pipelines. We use a regular expression to replace non-alphanumeric characters with underscores.

  • re.sub(r'\W+', '_', col) replaces all non-word characters with _.

  • strip('_') removes leading or trailing underscores.

# Clean Column Headers
# Some column names may contain spaces or special characters which can cause issues later.
# We replace non-alphanumeric characters with underscores.
import re

def clean_column_headers(df: pd.DataFrame) -> pd.DataFrame:
    df.columns = [re.sub(r'\W+', '_', col).strip('_') for col in df.columns]
    return df

df1 = clean_column_headers(df)
# print(df1.head())

Step 18: Train Random Forest Separately and Evaluate

This is a focused Random Forest model training with evaluation metrics.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Assuming your DataFrame is named df1
# Step 1: Select features and target
X = df1.drop(['Machine_failure', 'Process_temperature_K', 'Product_ID', 'Type'], axis=1, errors='ignore')
y = df1['Machine_failure']

# Step 2: (Optional) Handle categorical variables if needed
# e.g., X = pd.get_dummies(X) if categorical columns exist

# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees by default
rf_model.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = rf_model.predict(X_test)

# Step 6: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Step 19: Train Logistic Regression Separately and Evaluate

Same as above, but with Logistic Regression.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
 
# Assuming your DataFrame is named df1
# Step 1: Select features and target
X = df1.drop(['Machine_failure', 'Process_temperature_K', 'Product_ID', 'Type'], axis=1, errors='ignore')
y = df1['Machine_failure']             
 
# Step 2: (Optional) Handle categorical variables if needed (e.g., pd.get_dummies)
 
# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Step 4: Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
 
# Step 5: Make predictions
y_pred = model.predict(X_test)
 
# Step 6: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Step 20: Save Model for Explainer Dashboard

We save the trained model using NotebookExecutor().save_model() to enable Explainer Dashboard for model interpretability.

  • modelType='ml' indicates a machine learning model.

  • estimator_type='classification' specifies that it is a classification model.

  • X and y are the training data needed for explainability.

  • For custom Keras layers, use the native Keras save functions.

from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
saved_model = nb.save_model(model = model, modelName = 'model_prediction', modelType = 'ml', X = X_train, y = y_train, estimator_type='classification')
#X and y are training datasets to get explainer dashboard.
#estimator_type is to specify algorithm type i.e., classification and regression.
#Only 'ml' models with tabular data as input are supported in the Explainer Dashboard.
#Choose modelType = 'ml' for machine learning models, modelType = 'cv' for computer vision models and modelType = 'dp' for data transformation pickle files. 
#Provide 'column_headers' as a parameter if they have to be saved in the model.
#If using custom layer in keras, use native save functionality from keras.

Note:

  • You can save this model with a unique name each time.

  • The saved model will be accessible under the Models option in the left panel. Navigation Path: Workspace > Left Navigation panel > Models.

Step 21: Load the Saved Model

Load the previously saved model using its ID.

from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
loaded_model = nb.load_saved_model('84831758793183092')

Note:

  • When a saved model is loaded into a code cell, its model ID number is displayed.

  • You can use either the provided code or put a check mark in the corresponding checkbox for the saved model to load it in the next code cell.

Step 22: Predict on Test Set

Make predictions on the test set using the loaded model. Specify modeltype='ml' for a machine learning model.

df=nb.predict(model = loaded_model, dataframe = X, modeltype='ml') 
print(df.head())
 #Choose modeltype 'ml' for machine learning models and 'cv' for computer vision model 
 #ex: For machine learning model nb.predict(model = model, modeltype = 'ml', dataframe = df) 
 #ex: For computer vision keras model nb.predict(model = model, modeltype = 'cv', imgs = imgs, imgsize = (28, 28), dim = 1, class_names = class_names) 
 #and for pytorch model(model = model, modeltype = 'cv', imgs = imgs, class_names = class_names) 
 #Note: in case of an error during prediction with Keras models, the user may need to squeeze the image data

Note: You can use either the provided code or the Predict function in the code cell to make a prediction.