Classification using Data Driven Approaches#

Objectives#

In this exercise, we will work with the dataset provided here. The aim is to perform fault diagnosis on a CSTR based on the measured signals.

This dataset contains a set of simulations of a well-known benchmark in the fault diagnosis community for chemical processes: the Continuous Stirred Tank Reactor (CSTR). This system carries out an exothermic reaction A -> B and has a feedback loop for controlling the reactor’s temperature. The 7 measured variables are:

  • Concentration of A in the inlet flow,

  • Temperature of the inlet flow,

  • Temperature of the inlet coolant flow,

  • Coolant flow-rate,

  • Concentration of B in the outlet flow,

  • Temperature of the outlet flow,

  • Temperature of the outlet coolant flow.

The goal is to predict a set of 12 faults from these 7 variables, measured throughout 200 minutes, at a 1 minute rate. Additional information on the faults can be found in [Montesuma, 2021, Montesuma et al., 2022].

Exploratory Data Analysis#

The first step is to perform exploratory data analysis (EDA). This includes:

  • loading the data,

  • converting it to a suitable format,

  • inspecting the dataset (size, content),

  • cleaning the dataset.

# import necessary packages (hint: numpy, scikit-learn, matplotlib, etc.)
import numpy as np

print(np.zeros(5))
[0. 0. 0. 0. 0.]
# load the dataset (hint: numpy)

# convert to pandas dataframe (much nicer visuals)
# inspect the dataset (hint: show the top 10 rows)
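A sketch of these steps. The filename in the commented `np.load` line is a placeholder for the provided file; a small fabricated array with the same layout stands in for the real data here:

```python
import numpy as np
import pandas as pd

# in the exercise the array would come from the provided file, e.g.:
# data = np.load('cstr_dataset.npy')   # placeholder filename
# here we fabricate a small array with the same layout for illustration:
# 7 signals x 200 minutes = 1400 features, plus 4 label columns
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 1404))

# wrap it in a DataFrame for nicer inspection
df = pd.DataFrame(data)

# basic inspection: shape and the top 10 rows
print(df.shape)                # (10, 1404)
print(df.head(10))

# cleaning: check for missing values before modelling
print(df.isna().sum().sum())
```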

The last 4 columns contain the following data:

  • Column 1400 refers to the fault.

  • Column 1401 refers to the domain level.

  • Column 1402 refers to the noise level.

  • Column 1403 refers to the reaction order.

# extract the data e.g. fault_label=df.iloc[:,-4]
# investigate the uniqueness of the domains
# investigate the uniqueness of the faults
# investigate the uniqueness of the parameter noise
# investigate the uniqueness of the reaction order
# investigate how these unique values are associated in each domain, is there some exclusivity?

#for domain in np.unique(domain_label):
#    domain_parameters = np.where(domain_label == domain)[0]
#    print(f'The domain {domain} has a noise level of {np.unique(parameter_noise[domain_parameters]).item()} and a reaction order of {np.unique(reaction_order[domain_parameters]).item()}')
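A sketch of the extraction, using a small synthetic frame in place of the real data. The ties between noise level, reaction order and domain are fabricated here so that each domain carries exactly one value of each parameter:

```python
import numpy as np
import pandas as pd

# synthetic frame mimicking the layout: 1400 features + 4 label columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 1404)))
df[1400] = rng.integers(0, 13, size=50)   # fault label
df[1401] = rng.integers(0, 3, size=50)    # domain level
df[1402] = df[1401] * 0.1                 # noise level (tied to domain, fabricated)
df[1403] = df[1401] + 1                   # reaction order (tied to domain, fabricated)

# extract the label columns
fault_label     = df.iloc[:, -4].to_numpy()
domain_label    = df.iloc[:, -3].to_numpy()
parameter_noise = df.iloc[:, -2].to_numpy()
reaction_order  = df.iloc[:, -1].to_numpy()

# inspect the unique values per column and per domain
print(np.unique(fault_label))
print(np.unique(domain_label))
for domain in np.unique(domain_label):
    idx = np.where(domain_label == domain)[0]
    print(domain, np.unique(parameter_noise[idx]), np.unique(reaction_order[idx]))
```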

Data preprocessing#

# split the data into features and target
# split the data into training, validation and test
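One common way to do both splits, sketched on synthetic stand-in data (the 60/20/20 proportions are an example, not prescribed by the exercise):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 1400 features + fault label
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1400))
y = rng.integers(0, 13, size=100)

# first carve out the test set, then split the rest into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)  # 60/20/20 split
```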

Model evaluation#

# create a function for evaluating various classification models

#def predictions(classifier, X_train, y_train, X_test, y_test, model_name):
#    classifier.fit(X_train, y_train)
#    print(classifier.score(X_test, y_test))

#    cm = confusion_matrix(y_test, classifier.predict(X_test))
#    sns.heatmap(cm, annot=True, cmap='viridis')
#    plt.title(f'Confusion matrix for {model_name}')
#    plt.show()

With the current data, train and validate the following models:

  • SVC

  • DecisionTree

  • RF

  • XGB

# SVC
#print('SVC')
#predictions(SVC(random_state=2207, probability=True), X_train, y_train, X_val, y_val, model_name='SVC')
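A runnable version of such a helper, shown on synthetic data and returning the accuracy and confusion matrix instead of plotting. Note that scikit-learn's convention is `confusion_matrix(y_true, y_pred)`:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# synthetic stand-in data with 3 classes
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 20)); y_train = rng.integers(0, 3, size=80)
X_val   = rng.normal(size=(20, 20)); y_val   = rng.integers(0, 3, size=20)

def predictions(classifier, X_train, y_train, X_test, y_test, model_name):
    classifier.fit(X_train, y_train)
    acc = classifier.score(X_test, y_test)
    print(f'{model_name} accuracy: {acc:.3f}')
    # true labels first, predictions second
    cm = confusion_matrix(y_test, classifier.predict(X_test))
    return acc, cm

acc, cm = predictions(SVC(random_state=2207, probability=True),
                      X_train, y_train, X_val, y_val, model_name='SVC')
```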

Improving Performance#

Inspect the data again. Have you noticed anything? Are all variables on the same scale? If not, then scale them.

# scale the data (hint: StandardScaler())
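A sketch of the scaling step on synthetic data. The key point is that the scaler is fit on the training set only and then applied to the validation and test sets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# synthetic data with a non-unit scale
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 10))
X_val   = rng.normal(loc=5.0, scale=3.0, size=(20, 10))

# fit on the training set only, then transform everything
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s   = scaler.transform(X_val)

print(X_train_s.mean(axis=0).round(2))  # ~0 per feature
print(X_train_s.std(axis=0).round(2))   # ~1 per feature
```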

Retrain the models. How does the accuracy compare to before?

Dimensionality Reduction#

A total of 1400 features are used. Is that really necessary? Try to use PCA to reduce the number of features. How much reduction is acceptable (see the scree plot)?

# Perform PCA (e.g. with a variance threshold of 0.956)
# find the threshold and select the data
# Scale the data (again)
# retrain the models
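A sketch on synthetic, standardized data. Passing a fraction to PCA's `n_components` keeps just enough components to reach that explained-variance ratio; the cumulative `explained_variance_ratio_` is also what a scree plot displays:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data, standardized first
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(200, 50)))

# keep enough components to explain at least 95.6% of the variance
pca = PCA(n_components=0.956)
X_reduced = pca.fit_transform(X)

print(pca.n_components_, X_reduced.shape)
# a scree plot would show np.cumsum(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```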

Pushing performance even further (extra)#

The models used so far relied on the default hyperparameters. Investigate the documentation of one model and create a grid search to find the best combination of hyperparameters.
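A minimal sketch with `GridSearchCV` on synthetic data. The grid below over `C` and `kernel` for an SVC is only an example, not a recommended search space:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic binary-classification stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = rng.integers(0, 2, size=120)

# example grid over two SVC hyperparameters
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```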

Notice:#

Thanks to XXX and XXXX for contributing to the development of the solution.

[Mon21]

Eduardo Fernandes Montesuma. Cross-Domain Fault Diagnosis through Optimal Transport. Bachelor's Thesis, Universidade Federal do Ceará, 2021.

[MMCM22]

Eduardo Fernandes Montesuma, Michela Mulas, Francesco Corona, and Fred Maurice Ngole Mboula. Cross-domain fault diagnosis through optimal transport for a cstr process. IFAC-PapersOnLine, 55(7):946–951, 2022.