QSAR

Adding a Model Algorithm

This tutorial is a direct follow-up to the example in Creating an Extension. We will add a new machine learning algorithm for QSAR modelling to the qsarextra extension created there. We do not need to define any REST API endpoints or handle web requests to make the implementation visible in the REST API: the GenUI framework already defines suitable endpoints and calls your code when required, so we can focus on implementing the algorithm itself.

To add new QSAR model implementations, all we have to do is create a new subpackage in our qsarextra application. This package is called genuimodels, and its presence tells the genuisetup command to look for special modules and classes inside it. For new machine learning algorithms, we need to create a module called algorithms.py. The genuisetup command looks for a module with this name and searches it for implementations of the genui.models.genuimodels.bases.Algorithm abstract class. A minimal algorithms.py would look something like this:

"""
algorithms.py in src/genui/qsar/extensions/qsarextra/genuimodels/

"""

from pandas import DataFrame, Series
from genui.models.genuimodels.bases import Algorithm

class MyAlgorithm(Algorithm):
    name = "MyAlgorithmName"

    @property
    def model(self):
        pass

    def fit(self, X: DataFrame, y: Series):
        pass

    def predict(self, X: DataFrame) -> Series:
        pass

Implement these three methods and you are done. No more work needed. The algorithm should now show up among the others in the REST API (URL: /api/qsar/algorithms/) after you run the genuisetup command.

What happens after you run genuisetup is that the new algorithm will be registered and an entry will be created in the database representing this class. If the user then selects this as an option while defining a QSAR model with the API (or in the GUI), the GenUI framework knows it needs to use this class to construct the model. It prepares all the data (depending on what descriptors were chosen by the user) and after initialization the Algorithm.fit method is called. Similarly, for predictions the Algorithm.predict method is used.
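The flow described above can be pictured with a small stand-alone sketch. It does not use GenUI at all: SketchAlgorithm, MeanRegressor, REGISTRY and train_and_predict are hypothetical stand-ins invented for this illustration, and the real framework discovers classes through genuisetup and records them in the database rather than in a dictionary.

```python
from abc import ABC, abstractmethod

# Hypothetical stand-in for the framework's algorithm registry.
# GenUI itself records algorithms in the database via genuisetup.
REGISTRY = {}


class SketchAlgorithm(ABC):
    """Toy base class mimicking the fit/predict contract (not the GenUI class)."""

    name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name:  # register every named subclass
            REGISTRY[cls.name] = cls

    def __init__(self, params=None):
        self.params = params or {}
        self._model = None

    @property
    def model(self):
        return self._model

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...


class MeanRegressor(SketchAlgorithm):
    """Predicts the mean of the training targets for every sample."""

    name = "MeanRegressor"

    def fit(self, X, y):
        self._model = sum(y) / len(y)

    def predict(self, X):
        if self._model is None:
            raise RuntimeError("You have to fit the model first.")
        return [self._model for _ in X]


def train_and_predict(algorithm_name, X_train, y_train, X_new):
    """Roughly what happens once the user picks an algorithm by name."""
    algorithm = REGISTRY[algorithm_name]()  # look up and instantiate the class
    algorithm.fit(X_train, y_train)         # training step
    return algorithm.predict(X_new)         # prediction step


predictions = train_and_predict(
    "MeanRegressor", [[0.1], [0.2], [0.3]], [1.0, 2.0, 3.0], [[0.5], [0.7]]
)
```

The point of the sketch is only the division of labour: the implementer supplies name, model, fit and predict, and everything else (lookup, data preparation, calling order) belongs to the framework.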

Let's see what a real-world example would look like. We will implement a new algorithm based on the implementation of Support Vector Machines (SVMs) in scikit-learn. We could include SVMs in qsarextra by defining the following class:

"""
algorithms.py in src/genui/qsar/extensions/qsarextra/genuimodels/

"""

from pandas import DataFrame, Series
from sklearn.svm import SVR, SVC

from genui.models.genuimodels.bases import Algorithm
from genui.models.models import ModelParameter


class SVM(Algorithm):
    name = "SVM"
    parameters = {
        "C" : {
            "type" : ModelParameter.FLOAT,
            "defaultValue" : 1.0
        },
        "kernel" : {
            "type" : ModelParameter.STRING,
            "defaultValue" : 'rbf'
        }
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs) # call base class constructor
        self.alg = SVR if self.mode.name == self.REGRESSION else SVC # based on prediction mode, get the correct scikit-learn class

    @property
    def model(self):
        """
        You define this property so that it returns the final fitted model.
        It can be any object so it is ok if we just return the SVC/SVR instance
        directly.

        This object is used mainly for serialization to disk and you can
        implement methods that do the job. GenUI uses *joblib* by default,
        which can handle scikit-learn instances just fine so there
        is no need to customize anything here.

        Returns
        -------
        object
            An instance representing the fitted model.
        """

        return self._model # None by default

    def fit(self, X: DataFrame, y: Series):
        """
        This method takes the data matrix and fits the model.
        The input will be a `DataFrame` and `Series`.
        Data will usually be raw without any transformations
        or normalizations applied so you might want to do them
        here as well.

        Parameters
        ----------
        X : DataFrame
            The data matrix to fit by the model. Samples as rows, variables as columns.
        y : Series
            The ground truth value for each sample. Should be the same length as rows of X.
        """

        # we also want probabilities for classification (see the 'predict' method)
        # so we add the 'probability' parameter when needed
        if self.alg is SVC:
            self._model = self.alg(probability=True, **self.params)
        else:
            self._model = self.alg(**self.params)

        self._model.fit(X, y)
        if self.callback:
            self.callback(self)

    def predict(self, X: DataFrame) -> Series:
        """
        A method used for predictions. You get
        a matrix of samples (you should again transform
        and normalize as needed) and it is expected that
        your model returns the predictions as a `Series`.

        Parameters
        ----------
        X : DataFrame
            The samples.

        Returns
        -------
        predictions : Series
            The predictions.

        """

        is_regression = self.mode.name == self.REGRESSION
        if self.model is not None:  # explicit check; do not rely on the estimator's truthiness
            if is_regression:
                return self.model.predict(X)
            else:
                return self.model.predict_proba(X)[:,1]
        else:
            raise Exception("You have to fit the model first.")

For more information on other useful attributes and methods, see the genui.models.genuimodels.bases.Algorithm reference.
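The fit docstring above notes that data usually arrives without any normalization. One way to handle that, sketched below without GenUI or scikit-learn, is to learn the scaling statistics during fitting and reuse exactly the same statistics at prediction time. ColumnScaler is a hypothetical helper written for this illustration; in an actual extension, a scikit-learn Pipeline combining StandardScaler with the estimator would likely serve the same purpose.

```python
from statistics import mean, pstdev


class ColumnScaler:
    """Hypothetical z-score scaler: statistics come from the training data only."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Guard against constant columns, whose standard deviation is zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
new = [[2.0, 10.0], [4.0, 12.0]]

scaler = ColumnScaler().fit(train)   # would happen inside Algorithm.fit
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)   # would happen inside Algorithm.predict
```

Whatever object holds the scaling statistics needs to be saved together with the fitted estimator, otherwise a model loaded from disk would see differently scaled inputs than the ones it was trained on.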

Writing Tests

It is always good practice to validate newly implemented features with unit tests. The GenUI framework defines a few classes that make writing tests easier. In order to test our SVM models, we could define the following test case in the qsarextra.tests module:

"""
tests.py in src/genui/qsar/extensions/qsarextra/

"""

from rest_framework.test import APITestCase

from genui.models.models import AlgorithmMode, Algorithm
from genui.qsar.tests import QSARModelInit


class QSARExtraTestCase(QSARModelInit, APITestCase):

    def test_my_SVC(self):
        self.createTestQSARModel(
            mode = AlgorithmMode.objects.get(name="classification"),
            algorithm = Algorithm.objects.get(name="SVM"),
            parameters={
                "C" : 1.5,
                "kernel" : 'poly'
            }
        )

    def test_my_SVR(self):
        self.createTestQSARModel(
            mode = AlgorithmMode.objects.get(name="regression"),
            algorithm = Algorithm.objects.get(name="SVM"),
            parameters={
                "C" : 1.5,
                "kernel" : 'poly'
            }
        )

The createTestQSARModel method of QSARModelInit defines a basic unit test that trains a given QSAR model using the REST API. It automatically sets up a project and imports some test compounds and bioactivities from the ChEMBL database for training. The resulting model is returned from the method as the appropriate Django model.

Note

You can run all tests for GenUI with python manage.py test. However, you will need to set the settings module to genui.settings.test (for example through Django's --settings option or the DJANGO_SETTINGS_MODULE environment variable). This configuration is the same as genui.settings.debug, except that all Celery tasks are run synchronously in a single thread and media files created during the tests are saved into a separate directory.

Adding New Molecular Descriptors

In QSAR modelling, the choice of molecular descriptors is an important decision, so you will likely want to implement your own descriptor calculations. Doing so is easy and is again done by defining a special class. This time we need to implement the __call__ method of the DescriptorCalculator abstract class defined in the genui.qsar package.

Let's say we would like the qsarextra extension to provide a new set of chemical descriptors. We have to create a new module under genui.qsar.extensions.qsarextra.genuimodels, but this time we will name it descriptors.py. In this file, we can define the descriptor calculators. For example, we could include the 2D descriptors provided by the RDKit library like so:

"""
descriptors.py in src/genui/qsar/extensions/qsarextra/genuimodels

"""

from pandas import DataFrame

from genui.qsar.genuimodels.bases import DescriptorCalculator

from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator
from rdkit.Chem import Descriptors, MolFromSmiles

class RDKitDescriptorsCalculator(DescriptorCalculator):
    group_name = 'RDKit_2D'

    def __call__(self, smiles) -> DataFrame:
        """
        Calculates 2D RDKit descriptors.

        Parameters
        ----------
        smiles : list
            A list of SMILES strings.

        Returns
        -------
        descriptors : DataFrame
            The matrix of calculated descriptors as `DataFrame`.
        """

        desc_list = [x[0] for x in Descriptors.descList]
        calc = MolecularDescriptorCalculator(desc_list)
        ret = []
        for smile in smiles:
            mol = MolFromSmiles(smile)
            descs = calc.CalcDescriptors(mol)
            ret.append(descs)

        return DataFrame(ret, columns=desc_list)

Note that you also have to give the new group of descriptors a name using the DescriptorCalculator.group_name class attribute. This is the name under which this descriptor group appears in the REST API.
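One practical detail in the loop above: RDKit's MolFromSmiles returns None when a SMILES string cannot be parsed, and the calculator would then receive None instead of a molecule. The sketch below shows one possible guard that keeps one output row per input. It runs without RDKit: toy_parse and toy_describe are hypothetical stand-ins (not real chemistry), and whether GenUI prefers a row of missing values or some other handling of failed parses is an assumption not covered by this tutorial.

```python
def describe_all(smiles, parse, describe, n_descriptors):
    """Return one descriptor row per input; failed parses yield a row of None."""
    rows = []
    for smi in smiles:
        mol = parse(smi)
        if mol is None:
            # Keep the row so that the output stays aligned with the input list.
            rows.append([None] * n_descriptors)
        else:
            rows.append(list(describe(mol)))
    return rows


# Toy stand-ins so the sketch runs without RDKit (hypothetical, not real chemistry).
def toy_parse(smi):
    """Pretend parser: rejects empty strings and unbalanced parentheses."""
    return smi if smi and smi.count("(") == smi.count(")") else None


def toy_describe(mol):
    """Pretend descriptors: string length and number of 'C' characters."""
    return (len(mol), mol.count("C"))


rows = describe_all(["CCO", "C(C", "c1ccccc1"], toy_parse, toy_describe, 2)
```

In the real calculator, parse would correspond to MolFromSmiles and describe to calc.CalcDescriptors.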

Adding Performance Metrics

GenUI already has a small collection of performance metrics for both classification and regression tasks. However, it is very easy to implement custom metrics. The process is similar to what we have seen so far. You just need to create a new metrics.py module file in the qsarextra.genuimodels package and create subclasses of ValidationMetric inside.

We could again exploit scikit-learn to provide a simple implementation of the F1 score:

"""
metrics.py in src/genui/qsar/extensions/qsarextra/genuimodels/
"""

from sklearn import metrics

from genui.models.genuimodels.bases import ValidationMetric, Algorithm


class F1(ValidationMetric):
    """
    Implementation of the F1 score for classification accuracy.
    """

    name = "F1"
    description = "Compute the F1 score, also known as balanced F-score or F-measure."
    modes = [Algorithm.CLASSIFICATION]

    def __call__(self, true_vals, predicted_vals):
        """
        Implementation of the validation metric calculation.

        Note: Predicted values (`predicted_vals`) for classification models
        should be probabilities with which an item belongs
        to the class noted in `true_vals`. For regression, `predicted_vals`
        are simply the predicted values.

        Parameters
        ----------
        true_vals
            True prediction values from data.
        predicted_vals
            Predicted values from the model.

        Returns
        -------
        score : float
            A single number representing the model score according to this metric.

        """

        return metrics.f1_score(true_vals, self.probasToClasses(predicted_vals))

All you have to do is implement the __call__ method and give your new metric a name, description and a list of modes you want this metric to be available for.
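To make the computation concrete, here is a plain-Python sketch of what the metric amounts to. It assumes that probasToClasses turns probabilities into class labels with a 0.5 cutoff; that threshold and the helper names probas_to_classes and f1 are assumptions made for this illustration, not taken from the GenUI source.

```python
def probas_to_classes(probas, threshold=0.5):
    """Assumed behaviour: label 1 when the probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probas]


def f1(true_vals, predicted_classes):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(true_vals, predicted_classes))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # without true positives the score is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


true_vals = [1, 0, 1, 1, 0]
predicted_probas = [0.9, 0.6, 0.4, 0.8, 0.1]

score = f1(true_vals, probas_to_classes(predicted_probas))
```

With these inputs there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.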