Compounds

The GenUI backend server already provides a few useful extensions that allow creation of compound sets from various sources (see genui.compounds.extensions). A compound set is an important data structure in GenUI. It is a way of grouping and organizing compounds for the purpose of QSAR modelling or training molecular generators in a GenUI project. Similarly, there are also activity sets which group biological activities of compounds.

In this tutorial, we will be showing the implementation of a simple extension that will allow us to upload compounds and their activity data in JSON via the REST API. We will call it jsonimport.

Creating the Extension

The process of creating an extension is no different from the approach we already outlined before:

cd src/
mkdir genui/compounds/extensions/jsonimport
python manage.py startapp jsonimport genui/compounds/extensions/jsonimport

Also, do not forget to add your package to the __init__.py of genui.compounds.extensions:

"""
__init__.py in src/genui/compounds/extensions/

"""

__all__ = ('chembl', 'generated', 'sdf', 'csvimports', 'jsonimport')

Setting URL Prefix

In order to be able to upload JSON data to the application, the appropriate REST API endpoint will need to be set up. We will also likely want endpoints to do other things with the data after upload (querying/updating/deleting/…).

Luckily, the genui.compounds package and the Django REST Framework make this job a little easier. You only need to define a URL prefix for the endpoints and attach a customized viewset. You can do all this in the urls.py module of your extension app (create this file if it does not exist):

"""
urls.py in src/genui/compounds/extensions/jsonimport/

"""
from django.urls import path, include
from rest_framework import routers
from . import views

router = routers.DefaultRouter()
router.register(r'sets/json', views.JSONSetViewSet, basename='jsonSet')

urlpatterns = [
    path('', include(router.urls)),
]

Here, we are using a router, which sets up the endpoints automatically for us. All we need to do is implement the controllers in the JSONSetViewSet (see Implementing a Compound Set Viewset). We also provide a basename (jsonSet) for the routes, which allows us to easily get the URL of the appropriate endpoint with the reverse function in Django.

The last step is to append the router URLs to the urlpatterns variable, which is a special variable Django uses to pick up URL definitions. The patterns you have defined here are automatically appended under /compounds/ by the genui.compounds application (see Setting Up the Extension). Therefore, we will find our endpoints under /compounds/sets/json/ alongside the endpoints of other extensions, such as those that import SDF or CSV files.
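To make the resulting URLs concrete, the following sketch mimics in plain Python how the pieces combine. It does not call Django or the Django REST Framework; the helper name endpoint_paths is hypothetical, and the -list and -detail suffixes follow the standard route naming used by Django REST Framework routers.

```python
def endpoint_paths(parent_prefix, router_prefix, basename):
    """Return a mapping of route name -> URL path template (illustration only)."""
    base = f"/{parent_prefix.strip('/')}/{router_prefix.strip('/')}/"
    return {
        f"{basename}-list": base,              # the collection of compound sets
        f"{basename}-detail": base + "{pk}/",  # a single compound set
    }


# prefix added by genui.compounds, prefix registered on the router, basename
paths = endpoint_paths("compounds", "sets/json", "jsonSet")
for name, path in paths.items():
    print(f"{name:15} {path}")
```

The names on the left are the ones you would pass to reverse; the paths on the right are where the endpoints should appear under the assumptions above.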

Implementing a Compound Set Viewset

Let's now look at how we can implement the JSONSetViewSet, which will power our endpoints. We will define it in the views.py module of our extension:

"""
views.py in src/genui/compounds/extensions/jsonimport/

Viewsets of the jsonimport package.
"""


from genui.compounds.extensions.jsonimport.initializer import JSONSetInitializer
from genui.compounds.extensions.jsonimport.models import JSONMolSet
from genui.compounds.extensions.jsonimport.serializers import JSONMolSetSerializer
from genui.compounds.views import BaseMolSetViewSet


class JSONSetViewSet(BaseMolSetViewSet):
    queryset = JSONMolSet.objects.all() # a Django model queryset defining this compound set
    serializer_class = JSONMolSetSerializer # JSON object serializer
    initializer_class = JSONSetInitializer # JSON compound set initializer

    def get_initializer_additional_arguments(self, validated_data):
        """
        This method can be used to pass extra arguments
        to the compound set initializer implemented
        by *initializer_class*.

        Parameters
        ----------
        validated_data
            validated data according to the *serializer_class*
        Returns
        -------
        parameters : dict
            keyword parameters for the __init__ method of the *initializer_class*
        """

        return {
            # get the molecules from the validated JSON data
            "molecules" : validated_data["molecules"],
        }

As you can see, we can implement a viewset with only a few lines of code. The BaseMolSetViewSet class from GenUI already handles quite a lot for us. We just need to tell it a few important things:

1. Specify the database query: In order to save and query the uploaded compounds, a Django model needs to be created that maps a Python class to a database table. The queryset attribute specifies a Django queryset that will be used to get the model instances for this viewset. We will cover this in more detail later: Defining Django Models.

2. Define a serializer class: Serializers are objects that can map Django models to JSON objects and vice versa. We will need to define a serializer for the JSONMolSet Django model in Defining Serializers and specify it here.

3. Define an initializer class: An initializer is a concept coming from the GenUI framework. An object of this class handles the creation of a compound set from the uploaded compounds. We will show how to implement it in our case later in this tutorial: Defining Compound Set Initializer.
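The interplay between the viewset and the initializer can be pictured with a toy model. The sketch below is not the BaseMolSetViewSet implementation; ToyViewSet and ToyInitializer are hypothetical stand-ins written in plain Python, and the real framework runs the population step asynchronously rather than inline. It only illustrates how the dictionary returned by get_initializer_additional_arguments could end up as keyword arguments of the initializer.

```python
class ToyInitializer:
    """Hypothetical stand-in for an initializer class."""

    def __init__(self, instance, molecules=()):
        self.instance = instance
        self.molecules = molecules

    def populateInstance(self):
        # pretend to "save" the molecules on the instance
        self.instance["molecules"] = [m["smiles"] for m in self.molecules]
        return len(self.instance["molecules"])


class ToyViewSet:
    """Hypothetical stand-in for a viewset, not the GenUI base class."""

    initializer_class = ToyInitializer

    def get_initializer_additional_arguments(self, validated_data):
        return {"molecules": validated_data["molecules"]}

    def create(self, validated_data):
        # a plain dict stands in for the model instance made by the serializer
        instance = {"name": validated_data["name"]}
        extra = self.get_initializer_additional_arguments(validated_data)
        initializer = self.initializer_class(instance, **extra)
        return instance, initializer.populateInstance()


instance, count = ToyViewSet().create(
    {"name": "demo", "molecules": [{"smiles": "CCO"}, {"smiles": "c1ccccc1"}]}
)
print(instance, count)
```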

Defining Django Models

The GenUI framework already has defined data structures for storage of chemical data. All compounds are saved as defined by the Molecule Django model class. This model is polymorphic so you can subclass it and add your own database fields. You can do it with the MolSet model just as well. For the purpose of this tutorial, these two classes are all we will need:

"""
models.py in src/genui/compounds/extensions/jsonimport/
"""

from django.db import models
from genui.compounds.models import Molecule, MolSet


class JSONMolecule(Molecule):
    name = models.CharField(blank=True, null=False, max_length=1024)

class JSONMolSet(MolSet):
    pass

In this simple case we did not really change the implementation of MolSet in JSONMolSet. However, it is still a good idea to create a separate model since we might want to extend it in the future and it is also an easy way to track data uploaded with our extension.

In the case of the Molecule model, we only add one field for the name of the compound in JSONMolecule.

Defining Serializers

Now that we have our models, it is time to tell Django how it should translate them to the JSON format. That is the purpose of serializers and GenUI has the MolSetSerializer, MoleculeSerializer and ActivitySerializer classes already implemented. All we need is to customize them to fit our models:

"""
serializers.py in src/genui/compounds/extensions/jsonimport

"""
from rest_framework import serializers

from genui.compounds.extensions.jsonimport.models import JSONMolecule, JSONMolSet
from genui.compounds.models import Activity
from genui.compounds.serializers import MolSetSerializer, MoleculeSerializer, ActivitySerializer

class JSONActivitySerializer(ActivitySerializer):
    """
    A simplified serializer for activity. We only
    use three fields from the parent:

        1. the numerical value of the activity
        2. the type of the activity (e.g. IC50)
        3. the units of activity for this value (e.g. nM)
    """

    class Meta:
        model = Activity
        fields = ('value', 'type', 'units')

class JSONMolSerializer(MoleculeSerializer):
    """
    A simplified serializer for compounds.

    Note that we do not need to specify the name field explicitly.
    The framework picks it up automatically from the *JSONMolecule* model.

    We also serialize the activities as a list of *JSONActivitySerializer*
    instances.
    """

    smiles = serializers.CharField(required=True)
    activities = JSONActivitySerializer(many=True)

    class Meta:
        model = JSONMolecule
        fields = ('id', 'name', 'smiles', 'activities')

class JSONMolSetSerializer(MolSetSerializer):
    """
    A compound set needs to have a few fields specified
    for successful creation. So in this case we take
    them from the *MolSetSerializer* explicitly and
    also add a list of molecules as specified by
    *JSONMolSerializer*.
    """

    molecules = JSONMolSerializer(many=True)

    class Meta:
        model = JSONMolSet
        fields = MolSetSerializer.Meta.fields + ('molecules',)
        read_only_fields = ('created', 'updated')

    def create(self, validated_data):
        """
        Create an instance of JSONMolSet from the validated data.

        Parameters
        ----------
        validated_data : dict
            Validated and parsed data from the JSON object obtained via POST.

        Returns
        -------
            model_instance : JSONMolSet
        """

        ModelClass = self.Meta.model
        return ModelClass.objects.create(
            name=validated_data["name"]
            , description=validated_data["description"]
            , project=validated_data["project"]
        )

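This should allow the application to validate and parse data such as the following (written here as a Python dictionary, which is why it contains a comment and None rather than the JSON null):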

{
    "name": "Test JSON Molecule Set",
    "description": "My molecule set for testing...",
    "project": 1, # id of the project to attach this compound set to
    "molecules" : [
        {
            "name": "Vismodegib",
            "smiles": "CS(=O)(=O)C1=CC(=C(C=C1)C(=O)NC2=CC(=C(C=C2)Cl)C3=CC=CC=N3)Cl",
            "activities": []
        },
        {
            "name": "Captopril",
            "smiles": "C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O",
            "activities": [
                {
                    "value": 20.0,
                    "type": {
                        "value": "IC50"
                    },
                    "units": {
                        "value": "nM"
                    }
                },
                {
                    "value": 7.7,
                    "type": {
                        "value": "pIC50"
                    },
                    "units": None
                }
            ]
        },
        {
            "name": "Nimesulide",
            "smiles": "CS(=O)(=O)Nc1ccc([N+](=O)[O-])cc1Oc1ccccc1",
            "activities": [
                {
                    "value": 11826.0,
                    "type": {
                        "value": "Ki"
                    },
                    "units": {
                        "value": "nM"
                    }
                },
                {
                    "value": 4.93,
                    "type": {
                        "value": "pKi"
                    },
                    "units": None
                }
            ]
        },
    ]
}

You can see the descriptions and implementations of the MolSetSerializer, MoleculeSerializer and ActivitySerializer classes to get a better idea of what other fields they define and what purpose they serve. Take a look at the genui.compounds.serializers package for more info about other serializers as well.
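The example payload above uses Python literals. Assuming you assemble the data as a Python dictionary on the client side, the standard json module turns it into the JSON text sent in the request body, and None becomes null along the way. The sketch below uses a shortened payload and only shows the conversion; it does not contact the server.

```python
import json

payload = {
    "name": "Test JSON Molecule Set",
    "description": "My molecule set for testing...",
    "project": 1,  # id of an existing project (assumed here)
    "molecules": [
        {
            "name": "Captopril",
            "smiles": "C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O",
            "activities": [
                {"value": 7.7, "type": {"value": "pIC50"}, "units": None},
            ],
        },
    ],
}

body = json.dumps(payload)   # Python dict -> JSON text
print(body)
decoded = json.loads(body)   # JSON text -> Python dict again
```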

Defining Compound Set Initializer

Looking at the create method of JSONMolSetSerializer above, we can finally see how a compound set is initialized. However, we do not yet see how we can add the compounds we have uploaded to it. Populating a compound set with new compounds is the responsibility of a compound set initializer.

An initializer is any class derived from the MolSetInitializer abstract base class. In particular, we need to implement the MolSetInitializer.populateInstance and MolSetInitializer.updateInstance methods. In our case, we will also have to change the __init__ method because we are also passing the molecules from our POST request via the BaseMolSetViewSet.get_initializer_additional_arguments of our viewset (see Implementing a Compound Set Viewset). An example implementation of the initializer in our simple example could look like this:

"""
initializer.py in src/genui/compounds/extensions/jsonimport/

"""
from genui.compounds.extensions.jsonimport.models import JSONMolecule
from genui.compounds.initializers.base import MolSetInitializer
from genui.compounds.models import ActivitySet, Activity, ActivityTypes, ActivityUnits


class JSONSetInitializer(MolSetInitializer):
    """
    Our initializer. It takes a set of molecules
    as it was parsed from the JSON POST request.
    """

    def __init__(self, *args, molecules=tuple(), **kwargs):
        """

        Parameters
        ----------
        args
            positional arguments
        molecules
            as parsed from the JSON request and supplied by *get_initializer_additional_arguments* of the viewset
        kwargs
            any additional keyword arguments we do not care about
        """

        super().__init__(*args, **kwargs) # arguments that we do not want are passed to the base class
        self.molecules = molecules # save the data to be parsed

    def populateInstance(self) -> int:
        """
        Called when a new compound set is created.

        Returns
        -------
        count : int
            number of unique molecules found in the data
        """

        activity_set = None
        for idx, mol_data in enumerate(self.molecules):
            # note current progress
            progress = 100 * idx / len(self.molecules)
            msg = f"Saving molecule: {mol_data['smiles']}"
            self.progress_recorder.set_progress(progress, 100, description=msg)
            print(msg, f"({progress})")

            # create model instance
            mol_instance = self.addMoleculeFromSMILES(
                mol_data['smiles'],
                JSONMolecule,
                {
                    "name" : mol_data['name']
                }
            )

            # attach activities
            if mol_data['activities']:
                if not activity_set:
                    activity_set = ActivitySet.objects.create(
                        name=f"{self.instance.name} - imported activities",
                        description="Activities, which were imported with the data.",
                        project=self.instance.project,
                        molecules=self.instance
                    )

                for activity in mol_data['activities']:
                    type_ = ActivityTypes.objects.get_or_create(value=activity['type']['value'])[0]
                    units = None
                    if activity['units'] and activity['units']['value']:
                        units = ActivityUnits.objects.get_or_create(value=activity['units']['value'])[0]
                    Activity.objects.create(
                        value=activity['value'],
                        units=units,
                        type=type_,
                        source=activity_set,
                        molecule=mol_instance
                    )


        return self.unique_mols

    def updateInstance(self) -> int:
        """
        Called when a compound set is updated.

        For simplicity, we just remove all original data and populate again.
        Returns
        -------
        count : int
            number of compounds changed
        """

        self.instance.activities.all().delete()
        self.instance.molecules.clear()
        return self.populateInstance()

Compared to the implementations we have seen so far, there is quite a lot going on here, but there is no magic involved. Let's focus on the MolSetInitializer.populateInstance method because it showcases the most important features.

We begin with looping over the compounds found in the data:

activity_set = None
for idx, mol_data in enumerate(self.molecules):
    # note current progress
    progress = 100 * idx / len(self.molecules)
    msg = f"Saving molecule: {mol_data['smiles']}"
    self.progress_recorder.set_progress(progress, 100, description=msg)
    print(msg, f"({progress})")

We use the progress_recorder attribute to record our progress. This is important because compounds are imported asynchronously inside a Celery task, and the progress recorder propagates task progress data to the GenUI REST API services that report on the status and progress of tasks.
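To see which values the loop reports, the progress computation can be run on its own. In the sketch below, ListRecorder is a hypothetical stand-in that merely stores the calls it receives; it only copies the set_progress call shape used in the listing and has nothing to do with how progress is actually propagated.

```python
class ListRecorder:
    """Hypothetical recorder that collects progress updates in a list."""

    def __init__(self):
        self.calls = []

    def set_progress(self, current, total, description=""):
        self.calls.append((current, total, description))


recorder = ListRecorder()
molecules = [{"smiles": "CCO"}, {"smiles": "CCN"}, {"smiles": "CCC"}, {"smiles": "CCCl"}]

for idx, mol_data in enumerate(molecules):
    # same formula as in populateInstance: percentage of molecules already handled
    progress = 100 * idx / len(molecules)
    msg = f"Saving molecule: {mol_data['smiles']}"
    recorder.set_progress(progress, 100, description=msg)

for current, total, description in recorder.calls:
    print(f"{current:5.1f}/{total} {description}")
```

Because progress is reported before each molecule is saved, the last value recorded inside the loop stays below 100.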

Next, we create a JSONMolecule instance from the SMILES string provided in the JSON data:

# create model instance
mol_instance = self.addMoleculeFromSMILES(
    mol_data['smiles'],
    JSONMolecule,
    {
        "name" : mol_data['name']
    }
)

It is important to do so using the addMoleculeFromSMILES method. This method standardizes the structure of the compound using the ChEMBL Structure Pipeline and saves it into the database. By calling this method you ensure that there is consistency in the way structures are stored. All you have to do is specify the Django model class to use and any extra arguments that should be passed to its constructor.

Finally, we attach the activities to the created compounds if any are found in the data:

# attach activities
if mol_data['activities']:
    if not activity_set:
        activity_set = ActivitySet.objects.create(
            name=f"{self.instance.name} - imported activities",
            description="Activities, which were imported with the data.",
            project=self.instance.project,
            molecules=self.instance
        )

    for activity in mol_data['activities']:
        type_ = ActivityTypes.objects.get_or_create(value=activity['type']['value'])[0]
        units = None
        if activity['units'] and activity['units']['value']:
            units = ActivityUnits.objects.get_or_create(value=activity['units']['value'])[0]
        Activity.objects.create(
            value=activity['value'],
            units=units,
            type=type_,
            source=activity_set,
            molecule=mol_instance
        )

Note that we have to create the activity set first. This will ensure that we can easily distinguish the imported activities from the activities calculated from a QSAR model, for example. After that it is all just a matter of creating the Activity instances and supplying the correct data to the create method.
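The handling of optional units and the reuse of existing unit entries can be tried out without a database. The sketch below imitates the behaviour with a dictionary; unit_value and the dict-based get_or_create are hypothetical helpers that only mirror the (object, created) return shape of Django's get_or_create. Unlike the listing above, unit_value also tolerates a missing "units" key.

```python
def unit_value(activity):
    """Return the unit string of an activity entry, or None if unspecified."""
    units = activity.get("units")
    if units and units.get("value"):
        return units["value"]
    return None


def get_or_create(registry, value):
    """Dict-based imitation of get_or_create: returns (object, created)."""
    if value in registry:
        return registry[value], False
    registry[value] = {"value": value}
    return registry[value], True


unit_registry = {}
activities = [
    {"value": 20.0, "type": {"value": "IC50"}, "units": {"value": "nM"}},
    {"value": 7.7, "type": {"value": "pIC50"}, "units": None},
    {"value": 11826.0, "type": {"value": "Ki"}, "units": {"value": "nM"}},
]

resolved = []
for activity in activities:
    value = unit_value(activity)
    # only look up or create a unit entry when a unit was actually given
    units = get_or_create(unit_registry, value)[0] if value else None
    resolved.append(units)

print(resolved)
```

Both nM activities end up pointing at the same entry, while the unitless pIC50 value keeps None.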

Setting Up the Extension

Now we have almost everything in place to put our extension to use. The only thing left to do is to create genuisetup.py:

"""
genuisetup.py in src/genui/compounds/extensions/jsonimport/

Created by: Martin Sicho
On: 1/12/21, 9:49 AM
"""

PARENT = 'genui.compounds'

def setup(*args, **kwargs):
    from . import models
    from genui.utils.init import createGroup
    createGroup(
        "GenUI_Users",
        [
            models.JSONMolecule,
            models.JSONMolSet
        ]
    )

The PARENT attribute tells GenUI that this extension is meant to be a submodule of the genui.compounds application and, thus, all URLs defined in the extension will be prefixed with /compounds/.

The createGroup function manages user permissions. Every extension that defines new Django models should specify permissions for the “GenUI_Users” group. This determines what API methods will be available to users. Calling the createGroup function like this gives all “GenUI_Users” read and write permissions for our newly defined model classes.

Finally, we can migrate the database and run the genuisetup command to make sure the correct permissions are applied:

python manage.py makemigrations
python manage.py migrate
python manage.py genuisetup

If you are running the server locally, you should now be able to see the appropriate REST API endpoints documented at http://localhost:{your_port}/api/.

Unit Testing

As with other types of extensions, it is a good idea to test this one. You can use the following unit test as a template:

"""
tests.py in src/genui/compounds/extensions/jsonimport

"""
import json

from django.urls import reverse
from rest_framework.test import APITestCase

from genui.projects.tests import ProjectMixIn


class JSONMolSetTestCase(ProjectMixIn, APITestCase):
    """
    We use the 'ProjectMixIn' class to automatically get
    a project instance initialized before our tests.
    It will become available from 'self.project'
    """

    def test_json_upload(self):
        post_data = {
            "name": "Test JSON Molecule Set",
            "description": "My molecule set for testing...",
            "project": self.project.id, # get id of our test project
            "molecules" : [
                {
                    "name": "Vismodegib",
                    "smiles": "CS(=O)(=O)C1=CC(=C(C=C1)C(=O)NC2=CC(=C(C=C2)Cl)C3=CC=CC=N3)Cl",
                    "activities": [] # our serializer should allow empty activities
                },
                {
                    "name": "Captopril",
                    "smiles": "C[C@H](CS)C(=O)N1CCC[C@H]1C(=O)O",
                    "activities": [
                        {
                            "value": 20.0,
                            "type": {
                                "value": "IC50"
                            },
                            "units": {
                                "value": "nM"
                            }
                        },
                        {
                            "value": 7.7,
                            "type": {
                                "value": "pIC50"
                            },
                            "units": None # the serializer should also allow unspecified units
                        }
                    ]
                },
                {
                    "name": "Nimesulide",
                    "smiles": "CS(=O)(=O)Nc1ccc([N+](=O)[O-])cc1Oc1ccccc1",
                    "activities": [
                        {
                            "value": 11826.0,
                            "type": {
                                "value": "Ki"
                            },
                            "units": {
                                "value": "nM"
                            }
                        },
                        {
                            "value": 4.93,
                            "type": {
                                "value": "pKi"
                            },
                            "units": None
                        }
                    ]
                },
            ]
        }

        # create new compound set instance
        response = self.client.post(reverse('jsonSet-list'), post_data, format='json')
        self.assertEqual(response.status_code, 201)
        set_id = response.data['id']
        act_set_id = response.data['activities'][0]

        # use the detail view to fetch the created instance
        response = self.client.get(reverse('jsonSet-detail', args=[set_id]))
        print(json.dumps(response.data, indent=4))
        self.assertEqual(response.status_code, 200)

        # get summary of the uploaded activity data
        summary_url = reverse('activitySet-summary', args=[act_set_id])
        response = self.client.get(summary_url)
        print(json.dumps(response.data, indent=4))
        self.assertEqual(response.status_code, 200)

Note the use of jsonSet-{identifier} to denote the proper view in our viewset (see Setting URL Prefix). This naming convention is a feature of the DefaultRouter.