ML Model
This page provides a step-by-step process for training an ML model inside a Data Science notebook.
Machine Failure Prediction Using Random Forest and Logistic Regression
This notebook demonstrates the complete workflow of preparing machine data, balancing it, training multiple models, and evaluating them. We will use Random Forest and Logistic Regression, and handle imbalanced datasets using SMOTE. We will also standardize the features and explore correlations to understand the data better.
Step 1: Installing Required Packages/Libraries
SMOTE generates synthetic examples for the minority class to balance the dataset.
Note: Once installed, restart the Kernel.
!pip install imbalanced-learn

Step 2: Upload and Load the Data
We begin by loading the dataset into a pandas DataFrame to inspect the first few rows. This helps us understand the structure and types of data we are working with.
Sample Data: Download the following sample data and upload it using the Sandbox data source to the project.
To load your uploaded data, read the file into a pandas DataFrame.
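A minimal loading sketch is shown below. The filename machine_data.csv and the column names are assumptions based on the features this guide mentions (temperatures, rotational speed, torque, tool wear, failure flags); in the notebook, point read_csv at the file you uploaded via the Sandbox data source. For a self-contained demo, a tiny sample file is written first.

```python
import pandas as pd

# For illustration only: write a tiny two-row sample file. In the notebook,
# skip this and point read_csv at your uploaded file instead.
sample = (
    "UDI,Product ID,Type,Air temperature [K],Process temperature [K],"
    "Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,"
    "TWF,HDF,PWF,OSF,RNF\n"
    "1,M14860,A,298.1,308.6,1551,42.8,0,0,0,0,0,0,0\n"
    "2,L47181,B,298.2,308.7,1408,46.3,3,0,0,0,0,0,0\n"
)
with open("machine_data.csv", "w") as f:
    f.write(sample)

# Load the data and inspect its structure and column types
df = pd.read_csv("machine_data.csv")
print(df.head())
print(df.dtypes)
```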
Step 3: Check for missing values
Before proceeding, we need to identify any missing values in the dataset. Missing values can affect model performance and may need to be handled.
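A short sketch of the missing-value check (the small DataFrame here is illustrative, with assumed column names):

```python
import pandas as pd

# Illustrative frame with one missing reading
df = pd.DataFrame({
    "Air temperature [K]": [298.1, 298.2, None],
    "Process temperature [K]": [308.6, 308.7, 308.5],
})

# Count missing values per column; nonzero counts need handling
print(df.isnull().sum())

# One simple strategy: drop rows with any missing value
df = df.dropna()
```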
Step 4: Feature Engineering
We create a new feature TempDiff that captures the difference between Process temperature and Air temperature.
Derived features like this can help models capture hidden patterns in the data.
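The derived feature can be computed as below (column names assumed from the dataset's temperature readings):

```python
import pandas as pd

df = pd.DataFrame({
    "Air temperature [K]": [298.1, 298.2],
    "Process temperature [K]": [308.6, 308.7],
})

# TempDiff: process temperature minus ambient air temperature
df["TempDiff"] = df["Process temperature [K]"] - df["Air temperature [K]"]
print(df["TempDiff"])
```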
Step 5: Remove unnecessary and leakage-causing features
Some features either leak information about the target or are irrelevant identifiers. We drop such columns to prevent the model from learning incorrect patterns.
Dropped features include
UDI, Product ID (identifiers) and TWF, HDF, PWF, OSF, RNF (redundant/derived features).
Also, print the resulting DataFrame to verify that the columns were dropped.
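A sketch of dropping the identifier and leakage-prone columns named above (the small DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "UDI": [1, 2], "Product ID": ["M1", "M2"], "Type": ["A", "B"],
    "Torque [Nm]": [42.8, 46.3], "Machine failure": [0, 1],
    "TWF": [0, 1], "HDF": [0, 0], "PWF": [0, 0], "OSF": [0, 0], "RNF": [0, 0],
})

# Identifiers and per-mode failure flags are removed to prevent target leakage
drop_cols = ["UDI", "Product ID", "TWF", "HDF", "PWF", "OSF", "RNF"]
df = df.drop(columns=drop_cols)

# Verify the remaining columns
print(df.head())
```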
Step 6: Encode Categorical Variables
The Type column is categorical. Machine learning models require numeric inputs, so we convert it using Label Encoding. For example: Type 'A' becomes 0, Type 'B' becomes 1, etc.
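The encoding step can be sketched with scikit-learn's LabelEncoder:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Type": ["A", "B", "A", "C"]})

# Classes are assigned integers in sorted order: 'A'->0, 'B'->1, 'C'->2
le = LabelEncoder()
df["Type"] = le.fit_transform(df["Type"])
print(df["Type"].tolist())  # [0, 1, 0, 2]
```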
Step 7: Standardize the Dataset
Standardization scales features to have a mean of 0 and a standard deviation of 1. This process is crucial for machine learning models as it helps them train faster and prevents features with large magnitudes from dominating the learning process.
Step 8: Feature Correlation
Visualizing correlations helps to identify redundant features and understand the relationships between variables. A common way to do this is with a heatmap, which visually represents how strongly features are related to each other and to the target variable.
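A minimal heatmap sketch, assuming seaborn is available in the notebook environment (the tiny DataFrame stands in for the real data):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Torque [Nm]": [42.8, 46.3, 39.5, 49.4],
    "Rotational speed [rpm]": [1551, 1408, 1632, 1380],
    "Machine failure": [0, 0, 0, 1],
})

# Pairwise Pearson correlations between numeric columns
corr = df.corr(numeric_only=True)

# Annotated heatmap: strong positive/negative pairs stand out by color
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation")
plt.tight_layout()
```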

Step 9: Select Features and Target
This step involves defining the feature matrix (X) and the target vector (y) for the machine failure prediction task.
Features (X): This includes the numerical columns that are likely to influence machine failure, such as temperature, rotational speed, torque, tool wear, and the previously derived feature TempDiff.
Target (y): The target variable is the Machine failure column, which indicates whether a failure occurred (1) or not (0).
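The selection can be sketched as follows (column names assumed from the features listed above):

```python
import pandas as pd

df = pd.DataFrame({
    "Air temperature [K]": [298.1, 298.2],
    "Process temperature [K]": [308.6, 308.7],
    "Rotational speed [rpm]": [1551, 1408],
    "Torque [Nm]": [42.8, 46.3],
    "Tool wear [min]": [0, 3],
    "TempDiff": [10.5, 10.5],
    "Machine failure": [0, 1],
})

# Feature matrix: every influencing column, including the derived TempDiff
feature_cols = ["Air temperature [K]", "Process temperature [K]",
                "Rotational speed [rpm]", "Torque [Nm]",
                "Tool wear [min]", "TempDiff"]
X = df[feature_cols]

# Target vector: 1 = failure occurred, 0 = no failure
y = df["Machine failure"]
```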
Step 10: Clean Feature Column Names
This step involves cleaning column names by removing special characters like [, ], or < using regular expressions. This ensures compatibility with machine learning libraries, such as scikit-learn, and with other Python functions that may not be able to process such characters.
Step 11: Standardize Selected Features
This step involves standardizing the features in the feature matrix, X, to prepare them for model training. This process ensures that all features contribute equally to the model, preventing features with larger numerical values from disproportionately influencing the results.
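A sketch of standardizing the feature matrix with scikit-learn's StandardScaler (the small X here is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "Torque [Nm]": [42.8, 46.3, 39.5],
    "Rotational speed [rpm]": [1551, 1408, 1632],
})

# Fit on the data, then transform; keep the result as a DataFrame
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Each column now has mean ~0 and standard deviation ~1
print(X_scaled.mean().round(6).tolist())
```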
Step 12: Plot Feature Distributions Before and After Scaling
Visualizing the distributions of features before and after standardization helps to clearly see its effects. Histograms of the data before scaling may show features with widely varying ranges and centers, making direct comparisons difficult. After scaling, however, the distributions are centered around a mean of 0 and have a comparable scale, which is ideal for many machine learning algorithms.
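One way to sketch the before/after comparison, using synthetic values standing in for a single feature (the loc/scale values are arbitrary, chosen to resemble a rotational-speed column):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one raw feature column
rng = np.random.default_rng(0)
raw = rng.normal(loc=1500, scale=120, size=500)
scaled = StandardScaler().fit_transform(raw.reshape(-1, 1)).ravel()

# Side-by-side histograms: same shape, but the scaled axis is centered on 0
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(raw, bins=30)
axes[0].set_title("Before scaling")
axes[1].hist(scaled, bins=30)
axes[1].set_title("After scaling")
plt.tight_layout()
```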

Step 13: Check Target Distribution
Before training a model, it is crucial to check the distribution of the target variable to see if the dataset is imbalanced. An imbalanced dataset has an unequal number of instances for each class, which can lead to a model that performs well on the majority class but poorly on the minority class. Understanding this distribution helps in choosing the right sampling techniques and evaluation metrics for the model.
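The class distribution can be checked with a simple count (the 95/5 split below is illustrative):

```python
import pandas as pd
from collections import Counter

# Illustrative imbalanced target: 95 non-failures, 5 failures
y = pd.Series([0] * 95 + [1] * 5, name="Machine failure")

# Absolute counts and class proportions reveal the imbalance
print(Counter(y))
print(y.value_counts(normalize=True))
```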
Step 14: Apply SMOTE to Balance Dataset
SMOTE oversamples the minority class to prevent model bias.
After this step, both classes have roughly equal representation.
Step 15: Check Class Distribution After Balancing
After applying SMOTE in Step 14, it's important to verify that the dataset is now balanced.
Counter(y_resampled) counts the number of instances for each class in the resampled target. This confirms that both classes (machine failure = 0 and 1) have roughly equal representation, which is important for unbiased model training.
Step 16: Train Multiple Models
We train and evaluate several models, including:
Logistic Regression
Random Forest
KNN
SVC
We compute the Accuracy and F1 Score for each.
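The comparison loop can be sketched as below, with synthetic data standing in for the resampled training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the balanced dataset
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
}

# Fit each model and record Accuracy and F1 on the held-out set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    print(f"{name}: accuracy={results[name][0]:.3f}, f1={results[name][1]:.3f}")
```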
Step 17: Clean Column Headers
Column names may contain spaces or special characters, which can cause issues in modeling pipelines. We use a regular expression to replace non-alphanumeric characters with underscores.
re.sub(r'\W+', '_', col) replaces all non-word characters with _, and strip('_') removes leading or trailing underscores.
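Applied to the assumed column names of this dataset, the cleanup looks like this:

```python
import re
import pandas as pd

df = pd.DataFrame(columns=["Air temperature [K]", "Tool wear [min]"])

# Replace runs of non-word characters with '_' and trim stray underscores
df.columns = [re.sub(r"\W+", "_", col).strip("_") for col in df.columns]
print(df.columns.tolist())  # ['Air_temperature_K', 'Tool_wear_min']
```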
Step 18: Train Random Forest Separately and Evaluate
This is a focused Random Forest model training with evaluation metrics.
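A sketch of the focused Random Forest run, again on synthetic stand-in data, reporting a confusion matrix and per-class metrics:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic stand-in for the prepared dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the forest and evaluate on the held-out split
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```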
Step 19: Train Logistic Regression Separately and Evaluate
Same as above, but with Logistic Regression.
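The Logistic Regression variant follows the same pattern:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the prepared dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter raised so the solver converges on standardized features
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
pred = lr.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print("f1:", f1_score(y_test, pred))
```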
Step 20: Save Model for Explainer Dashboard
We save the trained model using NotebookExecutor().save_model() to enable Explainer Dashboard for model interpretability.
modelType='ml': a machine learning model
estimator_type='classification': specifies that it is a classification model
X and y: training data needed for explainability
For custom Keras layers, use native Keras save functions.
Step 21: Load the Saved Model
Load the previously saved model using its ID.
Step 22: Predict on Test Set
Make predictions on the test set using the loaded model. Specify modeltype='ml' for a machine learning model.