ML Model

This page provides a step-by-step process for training an ML model inside a Data Science notebook.

Machine Failure Prediction Using Random Forest and Logistic Regression

This notebook demonstrates the complete workflow of preparing machine data, balancing it, training multiple models, and evaluating them. We will use Random Forest and Logistic Regression, and handle imbalanced datasets using SMOTE. We will also standardize the features and explore correlations to understand the data better.

Step 1: Install Required Packages/Libraries

SMOTE generates synthetic examples for the minority class to balance the dataset.

Note: Once installed, restart the Kernel.

!pip install imbalanced-learn

Step 2: Upload and Load the data

We begin by loading the dataset into a pandas DataFrame to inspect the first few rows. This helps us understand the structure and types of data we are working with.
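
A minimal sketch of this step, assuming the data is read directly as a CSV (the filename here is hypothetical; use the name of the file you uploaded):

import pandas as pd

# Load the uploaded dataset (filename is illustrative)
df = pd.read_csv("machine_failure_data.csv")
df.head()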

Sample Data: Download the following sample data and upload it to the project using the Sandbox data source.

To load your uploaded data, you can either use the provided code or select the uploaded file directly in the code cell.

Note: After loading the data file into the code cell, adjust the generated code. We suggest removing the name of the loaded file and keeping just the DataFrame assignment, "df=nb.get_data".

Step 3: Check for missing values

Before proceeding, we need to identify any missing values in the dataset. Missing values can affect model performance and may need to be handled.

Note: This function checks your data columns and provides a count of missing values. It's recommended that you ensure there is no missing data before you proceed.
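
A check along these lines is typical for this step:

# Count missing values in each column
df.isnull().sum()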

Step 4: Feature Engineering

We create a new feature TempDiff that captures the difference between Process temperature and Air temperature.

  • Derived features like this can help models capture hidden patterns in the data.
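
A minimal sketch of the feature, assuming AI4I-style column names (the bracketed unit suffixes are assumptions, consistent with the special characters cleaned up in Step 10):

# Derived feature: process temperature minus air temperature
df["TempDiff"] = df["Process temperature [K]"] - df["Air temperature [K]"]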

Step 5: Remove unnecessary and leakage-causing features

Some features either leak information about the target or are irrelevant identifiers. We drop such columns to prevent the model from learning incorrect patterns.

  • Dropped features include UDI and Product ID (identifiers), along with TWF, HDF, PWF, OSF, and RNF (failure-mode flags that leak the target).

Then print the resulting DataFrame to verify the remaining columns.
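
A sketch of the drop, using the column names listed above:

# Remove identifiers and the failure-mode flag columns
df = df.drop(columns=["UDI", "Product ID", "TWF", "HDF", "PWF", "OSF", "RNF"])
print(df.head())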

Step 6: Encode Categorical Variables

The Type column is categorical. Machine learning models require numeric inputs, so we convert it using Label Encoding. For example: Type 'A' becomes 0, Type 'B' becomes 1, etc.
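
A minimal sketch using scikit-learn's LabelEncoder:

from sklearn.preprocessing import LabelEncoder

# Convert the categorical Type column into integer codes
le = LabelEncoder()
df["Type"] = le.fit_transform(df["Type"])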

Step 7: Standardize the Dataset

Standardization scales features to have a mean of 0 and a standard deviation of 1. This process is crucial for machine learning models as it helps them train faster and prevents features with large magnitudes from dominating the learning process.
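
A toy demonstration of what StandardScaler does (the values here are arbitrary):

import numpy as np
from sklearn.preprocessing import StandardScaler

# StandardScaler applies z = (x - mean) / std to each column
scaler = StandardScaler()
sample = np.array([[300.0], [310.0], [305.0]])
print(scaler.fit_transform(sample))  # the column now has mean 0, std 1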

Step 8: Feature Correlation

Visualizing correlations helps to identify redundant features and understand the relationships between variables. A common way to do this is with a heatmap, which visually represents how strongly features are related to each other and to the target variable.
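
A common sketch for this, assuming seaborn is available (the library choice and figure size are assumptions):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()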

Step 9: Select Features and Target

This step involves defining the feature matrix (X) and the target vector (y) for the machine failure prediction task.

  • Features (X): This includes the numerical columns that are likely to influence machine failure, such as temperature, rotational speed, torque, tool wear, and the previously derived feature, TempDiff.

  • Target (y): The target variable is the Machine failure column, which indicates whether a failure occurred (1) or not (0).
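
A sketch of the selection, with column names assumed to follow AI4I-style naming:

# Feature matrix and target vector (column names are assumptions)
feature_cols = [
    "Air temperature [K]", "Process temperature [K]",
    "Rotational speed [rpm]", "Torque [Nm]",
    "Tool wear [min]", "TempDiff",
]
X = df[feature_cols]
y = df["Machine failure"]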

Step 10: Clean Feature Column Names

This step involves cleaning column names by removing special characters like [, ], or < using regular expressions. This ensures compatibility with machine learning libraries such as scikit-learn and other Python functions that may not be able to process such characters.
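
A minimal sketch of the cleanup:

import re

# Strip characters such as [, ], and < from the feature column names
X.columns = [re.sub(r"[\[\]<>]", "", col).strip() for col in X.columns]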

Step 11: Standardize Selected Features

This step involves standardizing the features in the feature matrix, X, to prepare them for model training. This process ensures that all features contribute equally to the model, preventing features with larger numerical values from disproportionately influencing the results.
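
A sketch of the scaling step:

from sklearn.preprocessing import StandardScaler

# Fit on the features and transform them to mean 0, std 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)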

Step 12: Plot Feature Distributions Before and After Scaling

Visualizing the distributions of features before and after standardization helps to clearly see its effects. Histograms of the data before scaling may show features with widely varying ranges and centers, making direct comparisons difficult. After scaling, however, the distributions are centered around a mean of 0 and have a comparable scale, which is ideal for many machine learning algorithms.
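
A sketch that compares one representative feature before and after scaling (the bin count and figure size are assumptions):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(X.iloc[:, 0], bins=30)
axes[0].set_title(f"Before scaling: {X.columns[0]}")
axes[1].hist(X_scaled[:, 0], bins=30)
axes[1].set_title(f"After scaling: {X.columns[0]}")
plt.tight_layout()
plt.show()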

Step 13: Check Target Distribution

Before training a model, it is crucial to check the distribution of the target variable to see if the dataset is imbalanced. An imbalanced dataset has an unequal number of instances for each class, which can lead to a model that performs well on the majority class but poorly on the minority class. Understanding this distribution helps in choosing the right sampling techniques and evaluation metrics for the model.
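
A quick way to inspect the distribution:

from collections import Counter

# Count instances of each class (0 = no failure, 1 = failure)
print(Counter(y))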

Step 14: Apply SMOTE to Balance Dataset

  • SMOTE oversamples the minority class to prevent model bias.

  • After this step, both classes have roughly equal representation.
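
A minimal sketch of the resampling (the random_state is an assumption, included for reproducibility):

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)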

Step 15: Check Class Distribution After Balancing

After applying SMOTE in Step 14, it's important to verify that the dataset is now balanced.

  • Counter(y_resampled) counts the number of instances for each class in the resampled target.

  • This ensures that both classes (machine failure = 0 and 1) have roughly equal representation, which is important for unbiased model training.
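
For example:

from collections import Counter

print(Counter(y_resampled))  # both classes should now be roughly equal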

Step 16: Train Multiple Models

We train and evaluate several models, including:

  • Logistic Regression

  • Random Forest

  • KNN

  • SVC

We compute the Accuracy and F1 Score for each.
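
A sketch of the training loop (the split ratio, random_state, and max_iter values are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

# Hold out a test set from the balanced data
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
}

# Fit each model and report Accuracy and F1 on the test set
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"F1={f1_score(y_test, preds):.3f}")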

Step 17: Clean Column Headers

Column names may contain spaces or special characters, which can cause issues in modeling pipelines. We use a regular expression to replace non-alphanumeric characters with underscores.

  • re.sub(r'\W+', '_', col) replaces all non-word characters with _.

  • strip('_') removes leading or trailing underscores.
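
Putting the two operations together:

import re

# Replace non-word characters with underscores, then trim stray underscores
df.columns = [re.sub(r"\W+", "_", col).strip("_") for col in df.columns]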

Step 18: Train Random Forest Separately and Evaluate

This step trains a Random Forest model on its own and evaluates it with standard metrics.
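
A sketch of this step, reusing the train/test split from Step 16 (the random_state is an assumption):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))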

Step 19: Train Logistic Regression Separately and Evaluate

Same as above, but with Logistic Regression.
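
A matching sketch for Logistic Regression (max_iter is an assumption to help convergence):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print(classification_report(y_test, lr.predict(X_test)))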

Step 20: Save Model for Explainer Dashboard

We save the trained model using NotebookExecutor().save_model() to enable Explainer Dashboard for model interpretability.

  • modelType='ml' specifies that it is a machine learning model.

  • estimator_type='classification' specifies that it is a classification model.

  • X and y are the training data needed for explainability.

  • For custom Keras layers, use native Keras save functions.
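
A hypothetical sketch of the call, based only on the parameters described above; the import path and exact signature of NotebookExecutor depend on your platform SDK:

# NotebookExecutor is provided by the platform; import it from your platform SDK.
# Only modelType, estimator_type, X, and y are documented above; passing the
# trained estimator as the first argument is an assumption.
executor = NotebookExecutor()
executor.save_model(
    rf,                               # trained model from Step 18
    modelType="ml",                   # machine learning model
    estimator_type="classification",  # classification task
    X=X_train,
    y=y_train,
)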

Note:

  • You can save this model with a unique name each time.

  • The saved model will be accessible under the Models option in the left panel. Navigation Path: Workspace > Left Navigation panel > Models.

Step 21: Load the Saved Model

Load the previously saved model using its ID.

Note:

  • When loading a saved model into a code cell, its model ID number is displayed.

  • You can use either the provided code or put a check mark in the corresponding checkbox for the saved model to load it in the next code cell.

Step 22: Predict on Test Set

Make predictions on the test set using the loaded model. Specify modeltype='ml' for a machine learning model.

Note: You can use either the provided code or the Predict function in the code cell to make a prediction.