ML Model
This page provides a step-by-step process for training an ML model inside a Data Science notebook.
Machine Failure Prediction Using Random Forest and Logistic Regression
This notebook demonstrates the complete workflow of preparing machine data, balancing it, training multiple models, and evaluating them. We will use Random Forest and Logistic Regression, and handle imbalanced datasets using SMOTE. We will also standardize the features and explore correlations to understand the data better.
Step 1: Install Required Packages/Libraries
SMOTE generates synthetic examples for the minority class to balance the dataset.
Note: Once installed, restart the Kernel.
!pip install imbalanced-learn
Step 2: Upload and Load the data
We begin by loading the dataset into a pandas DataFrame to inspect the first few rows. This helps us understand the structure and types of data we are working with.
Sample Data: Download the following sample data and upload it to the project using the Sandbox data source.
To load your uploaded data into the notebook, run the following:
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
import pandas as pd

nb = NotebookExecutor()
print("Loading the data...")
# Load the dataset into a pandas DataFrame.
df = nb.get_data('84831755771014158', '@SYS.USERID', 'True', {}, [], sheet_name = '')
# The first parameter is the service ID of the dataset.
# @SYS.USERID refers to the user ID of the current user.
# If the sandbox key is 'true', the file is a sandbox file; if it is 'false', it is treated as a dataset.
# {} holds the filters applied to the dataset.
# [] holds the data preparations applied to the dataset.
# After [], you can limit the number of rows returned by adding an argument with a comma separator (e.g., limit=10).
df.head()
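Optionally, df.info() gives a quick summary of the column types and non-null counts, which complements the preview above.
# Optional: inspect column dtypes and non-null counts
df.info()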
Step 3: Check for missing values
Before proceeding, we need to identify any missing values in the dataset. Missing values can affect model performance and may need to be handled.
# Check for missing values
# Checking if any column contains null values to handle missing data.
df.isnull().sum()
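If any missing values are reported above, they should be handled before modeling. The cell below is only an illustrative sketch (it assumes median imputation of numeric columns is acceptable for this dataset); adapt it to your data.
# Illustrative handling of missing values (only needed if the check above reports any).
# Assumption: numeric columns can be imputed with their median; otherwise drop the rows.
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
# Alternatively, drop rows that contain missing values:
# df = df.dropna()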
Step 4: Feature Engineering
We create a new feature, Temp Diff, that captures the difference between Process temperature and Air temperature. Derived features like this can help models capture hidden patterns in the data.
# Create a new feature called 'Temp Diff' to capture the temperature difference.
df['Temp Diff'] = df['Process temperature [K]'] - df['Air temperature [K]']
Step 5: Remove unnecessary and leakage-causing features
Some features either leak information about the target or are irrelevant identifiers. We drop such columns to prevent the model from learning incorrect patterns.
Dropped features include UDI and Product ID (identifiers), and TWF, HDF, PWF, OSF, and RNF (redundant/derived features).
# Removing unnecessary and leakage causing features from data
df.drop(columns=['UDI', 'Product ID'], inplace=True)
df.drop(columns=['TWF', 'HDF', 'PWF', 'OSF', 'RNF'], inplace=True)
Also, preview the DataFrame again to verify that the columns were dropped.
print("Previewing the updated data...")
df.head()
Step 6: Encode Categorical Variables
The Type column is categorical. Machine learning models require numeric inputs, so we convert it using Label Encoding. For example: Type 'A' becomes 0, Type 'B' becomes 1, etc.
# Encode categorical variables
# The 'Type' column is categorical. Convert it to numeric labels.
from sklearn.preprocessing import LabelEncoder
df['Type'] = LabelEncoder().fit_transform(df['Type'])
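If you want to see exactly which category maps to which integer, a variant of the cell above keeps the fitted encoder so the mapping can be inspected (a small sketch; the variable name le is illustrative).
# Alternative to the one-liner above: keep the fitted encoder to inspect the learned mapping.
# (Run this instead of the cell above, not in addition to it.)
le = LabelEncoder()
df['Type'] = le.fit_transform(df['Type'])
print(dict(zip(le.classes_, range(len(le.classes_)))))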
Step 7: Standardize the Dataset
Standardization scales features to have a mean of 0 and a standard deviation of 1. This process is crucial for machine learning models as it helps them train faster and prevents features with large magnitudes from dominating the learning process.
# Standardize the features
# Standardization brings all features to a similar scale.
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
data = pd.DataFrame(scale.fit_transform(df), columns=df.columns, index=df.index)
Step 8: Feature Correlation
Visualizing correlations helps to identify redundant features and understand the relationships between variables. A common way to do this is with a heatmap, which visually represents how strongly features are related to each other and to the target variable.
# Visualize feature correlations
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation of Features")
plt.show()
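To focus on how each feature relates to the target, you can also print the correlations with Machine failure, sorted by absolute strength (a small sketch; it assumes the Machine failure column is still present in df at this point).
# Correlation of each feature with the target, strongest first
target_corr = df.corr()['Machine failure'].drop('Machine failure')
print(target_corr.abs().sort_values(ascending=False))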

Step 9: Select Features and Target
This step involves defining the feature matrix (X) and the target vector (y) for the machine failure prediction task.
Features (X): the numerical columns that are likely to influence machine failure, such as temperature, rotational speed, torque, tool wear, and the previously derived feature, Temp Diff.
Target (y): the Machine failure column, which indicates whether a failure occurred (1) or not (0).
X = df[['Air temperature [K]','Process temperature [K]','Type','Rotational speed [rpm]','Torque [Nm]','Tool wear [min]','Temp Diff']]
y = df['Machine failure']
Step 10: Clean Feature Column Names
This step cleans the column names by removing special characters such as [, ], <, and > using a regular expression. This ensures compatibility with machine learning libraries, such as scikit-learn, and with other Python functions that may not handle such characters.
# Remove special characters ([, ], <, >) from the feature column names
X.columns = X.columns.astype(str).str.replace(r'[\[\]<>]', '', regex=True)
Step 11: Standardize Selected Features
This step involves standardizing the features in the feature matrix, X, to prepare them for model training. This process ensures that all features contribute equally to the model, preventing features with larger numerical values from disproportionately influencing the results.
# Standardizing the selected features
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X_scaled = scale.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
print("The standardized X features are summarized below:")
print(X_scaled_df.describe())
Step 12: Plot Feature Distributions Before and After Scaling
Visualizing the distributions of features before and after standardization helps to clearly see its effects. Histograms of the data before scaling may show features with widely varying ranges and centers, making direct comparisons difficult. After scaling, however, the distributions are centered around a mean of 0 and have a comparable scale, which is ideal for many machine learning algorithms.
# Plotting feature distributions before and after standardization
print("********** Plot before scaling **********")
num_cols = len(df.columns)
fig, axes = plt.subplots(1, num_cols, figsize=(5 * num_cols, 4))
if num_cols == 1:
    axes = [axes]
for i, col in enumerate(df.columns):
    axes[i].hist(df[col], bins=5, color='skyblue')
    axes[i].set_title(f'Before Scaling: {col}')
plt.tight_layout()
plt.show()

print("Applying standardization... this may take a while. Please wait!")
scaler = StandardScaler()
scaled_array = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_array, columns=df.columns)

print("********** Plot after scaling **********")
fig, axes = plt.subplots(1, num_cols, figsize=(5 * num_cols, 4))
if num_cols == 1:
    axes = [axes]
for i, col in enumerate(df_scaled.columns):
    axes[i].hist(df_scaled[col], bins=5, color='salmon')
    axes[i].set_title(f'After Scaling: {col}')
plt.tight_layout()
plt.show()

Step 13: Check Target Distribution
Before training a model, it is crucial to check the distribution of the target variable to see if the dataset is imbalanced. An imbalanced dataset has an unequal number of instances for each class, which can lead to a model that performs well on the majority class but poorly on the minority class. Understanding this distribution helps in choosing the right sampling techniques and evaluation metrics for the model.
# Checking the class counts of the target variable
from collections import Counter
count = Counter(y)
print("Target class counts:", count)
Step 14: Apply SMOTE to Balance Dataset
SMOTE oversamples the minority class to prevent model bias.
After this step, both classes have roughly equal representation.
# Balancing the data so the model is not biased toward the majority class
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
y_resampled.shape
Step 15: Check Class Distribution After Balancing
After applying SMOTE in Step 14, it's important to verify that the dataset is now balanced. Counter(y_resampled) counts the number of instances for each class in the resampled target. This ensures that both classes (machine failure = 0 and 1) have roughly equal representation, which is important for unbiased model training.
from collections import Counter
counts = Counter(y_resampled)
print("Now the balanced data is",counts)
Step 16: Train Multiple Models
We train and evaluate several models, including:
Logistic Regression
Random Forest
KNN
SVC
We compute the Accuracy and F1 Score for each.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score,accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, stratify=y_resampled, random_state=42)
models = {
    "LogisticRegression": LogisticRegression(C=1.0, penalty="l2", solver="liblinear", max_iter=500),
    "Random Forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC()
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    results[name] = {"Accuracy": acc, "F1 Score": f1}
import pandas as pd
comparison_df = pd.DataFrame(results).T
print(comparison_df)
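To pick the best-performing model from the comparison table, you can sort it by F1 score (a short sketch built directly on comparison_df).
# Sort the comparison table by F1 Score and report the top model
best = comparison_df.sort_values('F1 Score', ascending=False)
print(best)
print("Best model by F1 Score:", best.index[0])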
Step 17: Clean Column Headers
Column names may contain spaces or special characters, which can cause issues in modeling pipelines. We use a regular expression to replace non-alphanumeric characters with underscores.
re.sub(r'\W+', '_', col) replaces all non-word characters with an underscore, and .strip('_') removes any leading or trailing underscores.
# Clean Column Headers
# Some column names may contain spaces or special characters which can cause issues later.
# We replace non-alphanumeric characters with underscores.
import re
def clean_column_headers(df: pd.DataFrame) -> pd.DataFrame:
    df.columns = [re.sub(r'\W+', '_', col).strip('_') for col in df.columns]
    return df
df1 = clean_column_headers(df)
# print(df1.head())
Step 18: Train Random Forest Separately and Evaluate
This step trains a Random Forest model on its own and evaluates it with accuracy, a confusion matrix, and a classification report.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Assuming your DataFrame is named df1
# Step 1: Select features and target
X = df1.drop(['Machine_failure', 'Process_temperature_K', 'Product_ID', 'Type'], axis=1, errors='ignore')
y = df1['Machine_failure']
# Step 2: (Optional) Handle categorical variables if needed
# e.g., X = pd.get_dummies(X) if categorical columns exist
# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42) # 100 trees by default
rf_model.fit(X_train, y_train)
# Step 5: Make predictions
y_pred = rf_model.predict(X_test)
# Step 6: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Step 19: Train Logistic Regression Separately and Evaluate
Same as above, but with Logistic Regression.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Assuming your DataFrame is named df1
# Step 1: Select features and target
X = df1.drop(['Machine_failure', 'Process_temperature_K', 'Product_ID', 'Type'], axis=1, errors='ignore')
y = df1['Machine_failure']
# Step 2: (Optional) Handle categorical variables if needed (e.g., pd.get_dummies)
# Step 3: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Step 5: Make predictions
y_pred = model.predict(X_test)
# Step 6: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
Step 20: Save Model for Explainer Dashboard
We save the trained model using NotebookExecutor().save_model() to enable the Explainer Dashboard for model interpretability.
modelType='ml' specifies a machine learning model.
estimator_type='classification' specifies that it is a classification model.
X and y are the training data needed for explainability.
For custom Keras layers, use the native Keras save functions.
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
saved_model = nb.save_model(model = model, modelName = 'model_prediction', modelType = 'ml', X = X_train, y = y_train, estimator_type='classification')
# X and y are the training datasets needed for the Explainer Dashboard.
# estimator_type specifies the algorithm type, i.e., classification or regression.
# Only 'ml' models with tabular data as input are supported in the Explainer Dashboard.
# Choose modelType = 'ml' for machine learning models, modelType = 'cv' for computer vision models, and modelType = 'dp' for data transformation pickle files.
# Provide 'column_headers' as a parameter if they have to be saved with the model.
# If using a custom layer in Keras, use the native save functionality from Keras.
Step 21: Load the Saved Model
Load the previously saved model using its ID.
from Notebook.DSNotebook.NotebookExecutor import NotebookExecutor
nb = NotebookExecutor()
loaded_model = nb.load_saved_model('84831758793183092')
Step 22: Predict on Test Set
Make predictions on the test set using the loaded model. Specify modeltype='ml'
for a machine learning model.
df=nb.predict(model = loaded_model, dataframe = X, modeltype='ml')
print(df.head())
# Choose modeltype = 'ml' for machine learning models and 'cv' for computer vision models.
# Example for a machine learning model: nb.predict(model = model, modeltype = 'ml', dataframe = df)
# Example for a computer vision Keras model: nb.predict(model = model, modeltype = 'cv', imgs = imgs, imgsize = (28, 28), dim = 1, class_names = class_names)
# Example for a PyTorch model: nb.predict(model = model, modeltype = 'cv', imgs = imgs, class_names = class_names)
# Note: in case of any error in prediction, use squeezed image data in Keras.