Regression using Data-Driven Approaches#

Context#

In this notebook, we will consider a first-principles Monod kinetic model for cell growth.

\[ \mu = \mu_{max} \cdot \dfrac{S}{K_S + S} \]

The balances are described as follows for the biomass:

\[ \dfrac{dX}{dt} = \mu \cdot X \]

And the substrate:

\[ \dfrac{dS}{dt} = -\dfrac{1}{Y_{XS}} \cdot \mu \cdot X \]

We will use this model to “generate” a dataset. We will assume the true kinetics are unknown.
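As a quick sanity check on the kinetics, the Monod rate expression can be evaluated directly; a minimal sketch (the parameter values here are illustrative and match those defined below):

```python
# Monod growth rate: mu = mu_max * S / (Ks + S)
def monod_mu(S, mu_max=0.5, Ks=0.2):
    return mu_max * S / (Ks + S)

# at S = Ks the rate is exactly half of mu_max (hence "half-saturation constant")
print(monod_mu(0.2))   # 0.25
# at high substrate concentrations the rate saturates towards mu_max
print(monod_mu(100.0))
```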

Data generation#

To generate the data we first need to implement the model.

# implement the model
def monod_model(y, t, mu_max, Ks, Yxs, S0):
    # S0 is unused here, but kept to match the call signature used below
    X, S = y
    # define mu (Monod kinetics)
    mu = mu_max * S / (Ks + S)
    # define the derivatives
    dXdt = mu * X
    dSdt = -1 / Yxs * mu * X
    return [dXdt, dSdt]

For generating the data, we will use the following parameters:

| Parameter | Value | Unit | Info |
|-----------|-------|------|------|
| \(\mu_{max}\) | 0.5 | | maximum growth rate |
| \(K_S\) | 0.2 | | half-saturation constant |
| \(Y_{XS}\) | 0.4 | | yield coefficient |

We will simulate a range of initial conditions where \(X_0 \in [0.1, 0.3]\) and \(S_0 \in [1.0, 5.0]\). The simulation time is \(t \in [0, 10]\) (don't generate too many time steps, e.g. 20).

import numpy as np
# define the ranges of X0 and S0
X0_range = (0.1, 0.3)
S0_range = (1.0, 5.0)
# define the time range
t = np.linspace(0, 10, 20)

To perform data generation, we sample uniformly from the ranges of \(X_0\) and \(S_0\) and solve the model. To add some stochasticity, we add white noise characterised by the distribution \(N(\mu=0, \sigma=0.01)\).

import numpy as np
from scipy.integrate import odeint

# model parameters
mu_max, Ks, Yxs = 0.5, 0.2, 0.4

# define sample size
n_sample = 100
# initialize list for data
data = []
# make a loop
rng = np.random.default_rng(42)

for _ in range(n_sample):
    # randomly initialize X0 and S0
    X0 = rng.uniform(0.1, 0.3)
    S0 = rng.uniform(1.0, 5.0)

    # solve the ode
    sol = odeint(monod_model, [X0, S0], t, args=(mu_max, Ks, Yxs, S0))

    # add white noise N(0, 0.01)
    X_noisy = sol[:, 0] + rng.normal(0, 0.01, size=len(t))
    S_noisy = sol[:, 1] + rng.normal(0, 0.01, size=len(t))

    # store the results
    for i in range(len(t)):
        data.append({
            't': t[i],
            'X0': X0,
            'S0': S0,
            'X': max(0, X_noisy[i]),  # ensure non-negative
            'S': max(0, S_noisy[i]),
            'dXdt': monod_model([X_noisy[i], S_noisy[i]], t[i],
                                mu_max, Ks, Yxs, S0)[0]
        })

Preprocessing#

The result should be made more presentable and converted into an easier format for preprocessing. Follow the instructions below.

import pandas as pd

# convert data into a dataframe
df = pd.DataFrame(data)

# prepare the features and target columns
feature_columns = ['t', 'X0', 'S0', 'X', 'S']
X = df[feature_columns]
y = df['dXdt']

The ML-related preprocessing should include:

  • splitting

  • scaling

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

For this part, select an ML model of your own choosing in addition to a multilayer perceptron (MLP).

from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
# define the models (for MLP, start out with the scikit-learn implementation)
models = {'Random Forest': RandomForestRegressor(random_state=42),
          'MLP': MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=42)}
# train the models
for name, model in models.items():
    model.fit(X_train_scaled, y_train)

Model evaluation#

With the trained models, we can perform predictions. Calculate various metrics of interest and make the following plots:

  • parity plot

  • error plot

  • prediction plot

# make parity plot

# make error distribution plots

# make prediction plot (growth trajectory)
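A minimal sketch of the three plots using matplotlib. The arrays `y_test`, `y_pred`, and the trajectory here are synthetic placeholders standing in for the results of your own trained models:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# placeholder data standing in for true and predicted dX/dt values
rng = np.random.default_rng(0)
y_test = rng.uniform(0, 0.2, 50)
y_pred = y_test + rng.normal(0, 0.01, 50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# parity plot: predictions vs. ground truth
axes[0].scatter(y_test, y_pred, s=10)
lims = [y_test.min(), y_test.max()]
axes[0].plot(lims, lims, "k--")  # ideal y = x line
axes[0].set(xlabel="true dX/dt", ylabel="predicted dX/dt", title="Parity")

# error distribution plot
axes[1].hist(y_pred - y_test, bins=15)
axes[1].set(xlabel="error", title="Error distribution")

# prediction plot: predicted growth trajectory over time
t_plot = np.linspace(0, 10, 20)
axes[2].plot(t_plot, 0.2 * np.exp(0.3 * t_plot), label="placeholder trajectory")
axes[2].set(xlabel="t", ylabel="X", title="Growth trajectory")
axes[2].legend()

fig.tight_layout()
fig.savefig("evaluation_plots.png")
```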

Process Optimization#

Even though we do not have a mechanistic model, provided the accuracy of the ML-based model is sufficient, we can perform process optimization. Here, the objective is to determine the necessary initial conditions in order to obtain a target biomass of 1.0.

A naive optimization can be done through grid search, i.e. perform equidistant sampling of the initial conditions and evaluate the final biomass concentration. Select the conditions providing the closest result.

# Grid search over initial conditions

# perform prediction for each instance

# evaluate the closeness to the target

# select the best initial condition
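The steps above can be sketched as follows. Note that `predict_final_biomass` is a hypothetical placeholder for running your trained surrogate forward to \(t = 10\); replace it with the real prediction step (e.g. integrating the model's predicted \(dX/dt\)):

```python
import numpy as np

def predict_final_biomass(X0, S0):
    # hypothetical stand-in for the ML-based prediction of X at t = 10;
    # it assumes full substrate conversion with Yxs = 0.4
    return X0 + 0.4 * S0

target = 1.0

# grid search over initial conditions (equidistant sampling)
X0_grid = np.linspace(0.1, 0.3, 10)
S0_grid = np.linspace(1.0, 5.0, 10)

best = None
for X0 in X0_grid:
    for S0 in S0_grid:
        # perform prediction for each instance and
        # evaluate the closeness to the target
        error = abs(predict_final_biomass(X0, S0) - target)
        if best is None or error < best[0]:
            best = (error, X0, S0)

# select the best initial condition
error, X0_best, S0_best = best
print(f"best initial condition: X0={X0_best:.2f}, S0={S0_best:.2f}, |X_final - target|={error:.3f}")
```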