A Review of Biomedical Datasets Relating to Drug Discovery: A Knowledge Graph Perspective
Stephen Bonner1, Ian P Barrett1, Cheng Ye1, Rowan Swiers1, Ola Engkvist2, Andreas Bender3, William Hamilton4,5

1 Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
2 Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
3 Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
4 School of Computer Science, McGill University, Montreal, Canada
5 Mila - Quebec AI Institute, Montreal, Canada

arXiv:2102.10062v2 [cs.AI] 26 Feb 2021

Abstract

Drug discovery and development is an extremely complex process, with high attrition contributing to the costs of delivering new medicines to patients. Recently, various machine learning approaches have been proposed and investigated to help improve the effectiveness and speed of multiple stages of the drug discovery pipeline. Among these techniques, it is especially those using Knowledge Graphs that are proving to have considerable promise across a range of tasks, including drug repurposing, drug toxicity prediction and target gene-disease prioritisation. In such a knowledge graph-based representation of drug discovery domains, crucial elements including genes, diseases and drugs are represented as entities or vertices, whilst relationships or edges between them indicate some level of interaction. For example, an edge between a disease and drug entity might represent a successful clinical trial, or an edge between two drug entities could indicate a potentially harmful interaction. In order to construct high-quality and ultimately informative knowledge graphs however, suitable data and information is of course required.

In this review, we detail publicly available primary data sources containing information suitable for use in constructing various drug discovery focused knowledge graphs. We aim to help guide machine learning and knowledge graph practitioners who are interested in applying new techniques to the drug discovery field, but who may be unfamiliar with the relevant data sources. The chosen datasets are selected via strict criteria, categorised according to the primary area of biological information contained within and are considered based upon what type of information could be extracted from them in order to help build a knowledge graph. To help motivate the study, a series of case studies of successful applications of knowledge graphs in drug discovery is presented. We also detail the existing pre-constructed knowledge graphs that have been made available for public access which could serve as potential machine learning benchmarks, as well as starting points for further task-specific graph composition enrichments. Additionally, throughout the review, we raise the numerous and unique challenges and issues associated with the domain and its datasets – for example, the inherent uncertainty within the data, its constantly evolving nature and the various forms of bias in many sources. Overall we hope this review will help motivate more machine learning researchers to explore combining knowledge graphs and machine learning to help solve key and emerging questions in the drug discovery domain.

Preprint. Under review.
1 Introduction

The process of discovering new drugs is an exceptionally complex one, requiring knowledge from numerous biological and chemical domains; however, it is a vital task in saving, extending or improving the quality of human life. The overall drug discovery pipeline requires key understanding in various subtasks. For example, drugs are primarily developed in response to some disease or condition negatively affecting patients. This implicitly requires that the mechanisms in the body by which the disease is caused are well understood so that a drug can be used to treat it – a process known as target discovery [78]. However, due to the complexities involved, the process of developing a new drug and bringing it to market is expensive and has a high chance of failure [96]. Hence, researchers are increasingly looking for new ways via which the drug discovery process can be undertaken, both in a more cost effective manner and with a higher probability of success.

Graphs (also commonly known as networks within the biological domain; in this review we use the term graph interchangeably with network and without loss of generality) have long been used in the life sciences as they are well suited to the complex interconnected systems often studied in the domain [6]. Homogeneous graphs have, for example, been used extensively to study protein-protein interaction networks [100], where each vertex in the graph represents a protein, and edges capture interactions between them. However, recently Knowledge Graphs (KGs) have begun to be utilised to model various aspects of the drug discovery domain. Knowledge graphs are heterogeneous data representations (discussed in more detail in Section 2.2), where both the vertices and edges can be of multiple different types, allowing for more complex and nuanced relationships to be captured [54]. In the context of drug discovery, the vertices (commonly known as entities) in a knowledge graph could represent key elements such as genes, diseases or drugs – with the edge types capturing different categories of interaction between them. As an example of where having distinct edge types could be crucial, an edge between a drug and disease entity could indicate that the drug has been proven clinically successful in treating the disease. Conversely, an edge between the same two entities could mean the drug was assessed but ultimately proved unsuccessful in alleviating the disease, thus failing the clinical trial. This crucial distinction in the precise meaning of the relationship between the two entities would not be captured in the simple binary option offered by homogeneous graphs, whereas a knowledge graph representation would preserve this important difference and enable that knowledge to be used to inform better predictions. As a topical concrete application, knowledge graphs have been utilised to address various tasks in helping to combat the COVID-19 pandemic [35, 60, 145, 112, 142, 32, 55, 21, 9]. Additionally, considering the domain as a knowledge graph has the potential to enable recent advances in graph-specific machine learning models to be used to address some key tasks [41].

However, constructing a suitable and informative knowledge graph requires that the correct primary data is captured in the process. An interesting aspect of the drug discovery domain, and perhaps in contrast to others, is that there is a wealth of well curated, publicly available data sources, many of which can be represented as, or used to construct elements of, knowledge graphs [113]. Many of these are maintained by government and international level agencies and are regularly updated with new results [113].
Indeed, one could argue that there is sometimes too much data available, rather than too little, and researchers working in drug discovery must instead consider other issues when looking to use these data resources, particularly for graph-based machine learning tasks. Such issues include assessing how reliable the underlying information is, how best to integrate disparate and heterogeneous resources, how to deal with the uncertainty inherent in the domain, how best to translate key drug discovery objectives into machine learning training objectives, and how to model and express data that is often quantitative and contextual in nature. Despite these complications, an increasing level of interest in the area suggests that knowledge graphs could play a crucial role in enabling machine learning based approaches for drug discovery [41, 53, 156].

1.1 Review Scope & Aim

In this work, we present a review of the publicly available data sources which contain information pertinent to key areas within the drug discovery domain. The primary aim of the review is to serve as a guide for knowledge graph and machine learning practitioners who are new to the field of drug discovery and who therefore may be unfamiliar with major relevant data resources suitable
specifically for use in biomedical knowledge graphs. We review these data sources, highlight their associated shortcomings and challenges, categorise them based upon their primary area of biological information and consider how amenable they are for use in a graph representation by detailing what type of information could be extracted from them (relational versus entity features). Thus, the review is conducted through the lens of knowledge graphs, particularly considering data suitability for use with modern graph-specific neural models [47]. We also aim to introduce the drug discovery domain for the reader, discuss the numerous unique challenges it poses, highlight key subtasks contained within and illustrate how knowledge graphs could potentially be used to address them. Additionally, we aim to introduce relevant ontologies and illustrate their use in knowledge graphs, as well as detail the range of existing pre-constructed knowledge graph resources suitable for drug discovery.

Our hope is that this review will serve as motivation for researchers and enable greater, easier and more effective use of graph-based machine learning techniques in the drug discovery domain by signposting key resources in the field and highlighting some of the primary challenges. We aim to help foster a multi-disciplinary and collaborative outlook that we believe will be critical in considering graph composition and construction in concert with analytical approaches and clarity of purpose. We think the review will also be useful for researchers in the drug discovery domain who are interested in the potential insights to be gained by applying graph-based methods to relevant datasets.

1.2 Dataset Selection Criteria

For the purpose of this review, we use the following criteria when choosing datasets for inclusion:

• Publicly Accessible - The dataset should be available for use within the public domain. Whilst many high quality commercial datasets exist, we choose to focus only on those datasets which are publicly accessible to some degree.
• High Quality - The dataset should contain information of the highest biological quality. This will primarily be assessed through its popularity within the drug discovery literature.
• Actively Maintained - The dataset, and the means to access it, should still be actively maintained.
• Non-Replicated Data - It is common for databases in the drug discovery field to contain partial, or even complete, copies of other datasets. We aim to detail mostly primary dataset sources.

The remainder of the paper is structured as follows: in Section 2 we introduce the required background knowledge, Section 3 details competing reviews, Section 4 introduces the relevant ontologies, in Section 5 the primary datasets are reviewed, in Section 6 existing drug discovery knowledge graphs are detailed, Section 7 presents some application case studies and the final conclusions are presented in Section 8.

2 Background

In this section, we introduce some key background concepts including knowledge graphs, graph-based machine learning and the field of drug discovery.

2.1 An Introduction To Drug Discovery and Development

Drug discovery and development is a complex and highly multi-disciplinary process [129], driven by the need to address a disease or other medical condition affecting patients for which no suitable treatment is currently available, or where current treatments are insufficient [58].
Whilst a full review of the area is beyond the scope of this work, interested readers are referred to relevant reviews [139, 29, 96]; here we instead give a high-level overview of key concepts for machine learning practitioners, providing context for the subsequent discussion of relevant knowledge graph resources. This section will make use of many of the biological terms and concepts defined in Table 1.
Drug discovery involves searching for causally implicated molecular functions, biological and physiological processes underlying disease, and designing drugs that can modify, halt or revert them. There are currently three main routes to drug discovery – selecting a molecular target(s) to design a drug against (targeted drug discovery), designing a high-throughput experiment to act as a surrogate for a disease process and then screening molecules to find ones that affect the outcome (phenotypic drug discovery), or using an existing drug developed for another disease (drug re-positioning). In targeted drug discovery, once a drug target has been identified, the process of finding suitable drug compounds can begin via an iterative drug screening process. Selected possible candidate drugs are then tested through a series of experiments in preclinical models (both in-vitro – study in cells or artificial systems outside the body, and in-vivo – study in a whole organism), and then clinical trials (drug development) to measure efficacy (beneficial modification of disease process) and toxicity (undesirable biological effects).

Cells – Important for preclinical research and data generation, e.g. studying drug responses, gene and protein expression, morphological responses via imaging.
Genes – Functional units of DNA, encoding RNA and ultimately proteins. Variants of a gene's DNA sequence may be associated with disease(s).
Gene Sequence – Gene sequences are transcribed into an intermediate molecule called RNA, which is in turn "read" to produce protein molecules.
Proteins – Key functional units of a cell that can play structural or signalling roles, or catalyse reactions, and interact with other proteins.
Biological Processes & Pathways – Molecules and cells function together to perform biological processes which can be conceptualised at different scales, from intracellular (e.g. signalling transmission via "pathways" between molecules) to intercellular processes, and ultimately physiological processes at the tissue and body scale.
Diseases – A condition resulting from aberrant biological/physiological processes. Different diseases may share symptoms and underlying aberrant processes, and can often be categorised into subtypes based on clinical and/or molecular features.
Targets – A drug target is a molecule(s) whose modulation (by a drug) we hypothesise will alter the course of the disease.
Compounds – Small molecules generated and studied as part of drug discovery are sometimes termed "compounds", with an accompanying chemical structure representation.
In-Vitro – Studies and experiments taking place outside the body, either in cells or in cell-free, highly defined systems.
In-Vivo – Studies and experiments taking place in a physiological context (e.g. animal or human study).

Table 1: Definitions of key terms used within the scope of drug discovery.

Drugs can be different types of molecules, some of which are more established than others. Many are small chemicals (sometimes termed compounds) or antibodies (a type of protein) [10]. Various newer types of drugs, often collectively termed "drug modalities", are also being explored [10].
These different types of drugs have particular advantages and disadvantages, but together open up a wider set of potential drug targets compared to historical approaches. Biomedical science has researched such processes at different scales, using technologies that probe the abundance and sequence variation in DNA, RNA and proteins, and for studying specific biological functions via experimentation. For example, studies of genetic variation associated with disease are used to provide support to hypotheses for new drug targets [99, 68]. Databases have been constructed to collate and disseminate such data and information [113]. Ontologies and dictionaries have also been developed to model relevant concepts, such as disease and biological process descriptions, which are discussed more in Section 4.

2.1.1 Subtasks Within Drug Discovery

The field is increasingly looking towards computational [129] and machine learning approaches to help in various tasks within the drug discovery process [136]. It can be helpful to consider partitioning the drug discovery process into smaller subtasks which can be modelled through the use of machine learning. Some of the most common subtasks are:

• Disease Target Identification - Which molecular entities (genes and proteins) are implicated in causing or maintaining disease, and could we develop new drugs to target them? Also known as Target Identification and Gene-Disease Prioritisation.
• Drug Target Interaction - Given a drug with unknown interactions, what proteins may it interact with in a cell? Also known as Target Binding and Target Activity.
• Drug Combinations - What are the beneficial or toxic consequences of more than one drug being present and interacting with the biological system?
• Drug Toxicity Predictions - What toxicities may be produced by a drug, and in turn which of those are elicited by modulating the intended target of the drug, and which arise from other properties of the drug? Also known as Toxicity Prediction.

2.2 Knowledge Graphs

There is currently no strict and commonly agreed upon definition of a knowledge graph in the literature [54]. Whilst we do not aim to give a definitive definition here, we instead define knowledge graphs as they will be used throughout the remainder of this work. We first start by defining homogeneous graphs, before expanding the definition for heterogeneous graphs.

A homogeneous graph can be defined as G = (V, E), where V is a set of vertices and E is a set of edges. The elements in E are pairs (u, v) of unique vertices u, v ∈ V. An example graph is illustrated in Figure 1a, which demonstrates that homogeneous graphs can contain a mix of directed and undirected edges. It is common for these graphs to have a set of features associated with the vertices, typically represented as a matrix X ∈ ℝ^{|V|×f}, where f is the number of features for a certain vertex. The graphs frequently used as benchmarks in the graph-based machine learning field – Cora, Citeseer and PubMed – fall into this definition of homogeneous graph [56].

Heterogeneous graphs, or knowledge graphs as they are often referred to, are graphs which contain multiple distinct types of both vertices and edges, and can be defined as G = (V, E, R, Ψ) [151]. Such graphs now have a set of relations R, and each edge is defined by its relation type r ∈ R – meaning that edges are now represented as triples (u, r, v) ∈ E [73].
The vertices in knowledge graphs are often known as entities, with the first entity in the triple called the head entity, connected via a relation to the tail entity. In a drug discovery context, multiple relations are crucial, as an edge could indicate, for example, whether a drug up or down regulates a certain gene. Two vertices can also now be linked by more than one edge type, or even multiple edges of the same type. Again this is important in the drug discovery domain, as multiple edges of the same relation can indicate evidence from multiple sources. Additionally, each vertex in a heterogeneous graph also belongs to a certain type from the set Ψ, meaning that our original set of vertices can be divided into subsets V_i ⊂ V, where i ∈ Ψ and V_i ∩ V_j = ∅ for all i, j ∈ Ψ with i ≠ j [46]. Given the drug discovery focus, these types could indicate whether a vertex represents a gene, protein or drug. Further, these vertex types can constrain the types of relations placed between them: (u, r_1, v) ∈ E → u ∈ V_i, v ∈ V_j where i, j ∈ Ψ [46]. An edge relation type of 'expressed-as' makes sense between genes and proteins, but not between genes and drugs, for example. Finally, heterogeneous graphs also commonly contain vertex-level features, with each type often having its own set of features.
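To make the preceding definitions concrete, the following minimal sketch (in Python, using the networkx library) builds a small heterogeneous graph in which each entity carries a type from Ψ and each edge carries a relation from R. All entity and relation names are invented purely for illustration; they do not come from any of the resources reviewed later.

```python
import networkx as nx

# A MultiDiGraph allows two entities to be connected by several directed
# edges, possibly of different (or repeated) relation types.
kg = nx.MultiDiGraph()

# Entities (vertices) carry a type, e.g. gene, protein, drug or disease.
kg.add_node("GeneA", entity_type="gene")
kg.add_node("ProteinA", entity_type="protein")
kg.add_node("DrugX", entity_type="drug")
kg.add_node("DiseaseZ", entity_type="disease")

# Edges are (head, relation, tail) triples; the relation is stored on the edge.
kg.add_edge("GeneA", "ProteinA", relation="expressed-as")
kg.add_edge("DrugX", "GeneA", relation="down-regulates")
kg.add_edge("DrugX", "DiseaseZ", relation="treats")
kg.add_edge("DrugX", "DiseaseZ", relation="failed-clinical-trial")  # parallel edge, distinct meaning

# Recover the triples (u, r, v) from the graph.
triples = [(u, data["relation"], v) for u, v, data in kg.edges(data=True)]
print(triples)
```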
Figure 1: A Homogeneous and Heterogeneous Graph. (a) A Homogeneous Graph. (b) A Heterogeneous Graph.

A heterogeneous graph is presented in Figure 1b and contains some key differences from its homogeneous counterpart: there are three types of vertex (v1, v2 and v3) and these are linked through a mix of directed and undirected edges of three relation types (e1, e2 and e3).

2.3 Knowledge Graph-based Machine Learning

A growing number of methods have been presented in the literature combining graphs and machine learning [148, 47]. These range from approaches which attempt to learn low-dimensional representations of vertices within the graph for use with various down-stream predictive tasks (many of which combine random walks on graphs with models from Natural Language Processing (NLP) [107, 44]), to custom graph-specific neural-based models for end-to-end learning using the raw graph as input [69, 48, 137]. Until recently, the majority of these graph-specific neural models were focused upon homogeneous graphs, however some methods have been created to process graphs with multiple edge types [116] and even fully heterogeneous graphs [57].

A somewhat parallel stream of work has focused on learning embeddings on knowledge graphs specifically [143]. Often these approaches are not graph structure specific, instead learning embeddings by optimising the distance between entities after translation via relations [11] or by exploiting various notions of similarity [102]. Primarily these approaches are trained to perform knowledge graph completion – the task of ranking true triples within the graph above negative ones [120] (conceptually this is very similar to link prediction in homogeneous graphs [86]). The resulting models can then be used to propose likely tail entities given a certain head and relation combination. For example, in the context of drug discovery, given a certain drug entity and the relation type of down-regulates, a model is trained to propose the most likely gene entity – thus completing the triple.
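As an illustration of the translational embedding idea mentioned above, the following minimal numpy sketch scores and ranks candidate tail entities for a given head and relation in the style of TransE [11]. The embeddings here are random rather than learned, and the entity and relation names are invented, so it only demonstrates the scoring and ranking mechanics, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32

entities = ["DrugX", "GeneA", "GeneB", "GeneC"]
relations = ["down-regulates"]
ent_emb = {e: rng.normal(size=dim) for e in entities}
rel_emb = {r: rng.normal(size=dim) for r in relations}

def score(head, relation, tail):
    # Translational scoring: a triple is plausible if head + relation lies
    # close to tail in the embedding space (higher, i.e. less negative, is better).
    return -np.linalg.norm(ent_emb[head] + rel_emb[relation] - ent_emb[tail])

# Knowledge graph completion: given a head and relation, rank candidate tails.
candidates = ["GeneA", "GeneB", "GeneC"]
ranked = sorted(candidates, key=lambda t: score("DrugX", "down-regulates", t), reverse=True)
print(ranked)
```

In practice the embeddings are learned by optimising a ranking loss that scores observed triples above randomly corrupted (negative) ones.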
2.4 Knowledge Graph Use in Drug Discovery

The study of knowledge graphs in the biomedical sciences, and particularly drug discovery, brings challenges and opportunities. Opportunities, because biomedical information inherently contains many relationships which can be exploited for new knowledge. Unfortunately, there are also many challenges that arise when constructing a knowledge graph suitable for use in drug discovery tasks. Some of the most interesting challenges are detailed below:

• Graph Composition - Strategies are needed to define how to convert data into information for modelling in a graph (e.g., instantiating a node or edge versus a feature on those entities), and what scale and composition of graph(s) may be optimal for a given task. In addition, which type of analytical approach to use – reasoning-based, network/graph theoretical, machine learning, or hybrid approaches.
• Heterogeneous & Uncertain - In biomedical graphs the data types are heterogeneous and have differing levels of confidence (e.g. well characterised and curated findings versus NLP-derived assertions), and much of the data will be dependent on both time and the dose of drug used, as well as the genetic background in the study. Overall, this means edges are much less certain, and thus less trustworthy, than in other domains.
• Evolving Data - The underlying data sources integrated and used in suitable knowledge graphs are also often changing over time as the field develops, requiring attention to versioning and other reproducible research practices. As an example of this, the evolution of the frequently used STRING dataset is demonstrated in Figure 2.
• Bias - There are various biases evident in different data sources; for example, negative data remain under-represented in some sources, including the primary scientific literature, and some areas have been studied more than others, introducing ascertainment bias in the graphs [103].
• Fair Evaluation - Several works have shown promise in applying machine learning techniques on a knowledge graph of drug discovery data. However, ensuring a fair data split is used for evaluation is perhaps more complicated than in other domains, as it is easy for biologically meaningful data to leak across splits. Thus, care should be taken to construct more biologically meaningful data splits, as well as considering if replicated knowledge has been incorporated in the graph (one simple form of leakage control is sketched after Figure 2).

Ultimately though, we feel there is now an interesting opportunity to experiment at the intersection of various research fields spanning graph theoretic and other network analysis approaches for molecule networks [5, 26], machine learning approaches [156], and quantitative systems pharmacology [123].

Figure 2: The evolution of the STRING database over major versions, showing the increase in Organisms, Interactions and Proteins (content count on a logarithmic scale against STRING database version). This data has been collected from https://string-db.org/cgi/access.pl?sessionId=dbw44gRWU7Xo&footer_active_subpage=archive
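Referring back to the Fair Evaluation challenge above, the following is a minimal, hypothetical sketch of one simple form of leakage control: assigning each unordered entity pair to a single split, so that an inverse or duplicate edge between the same two entities cannot appear in both the training and test sets. The triples are invented for illustration, and real evaluations will typically need additional, biologically informed splitting strategies on top of this.

```python
import random

def pair_aware_split(triples, test_fraction=0.2, seed=42):
    # Collect the unordered entity pairs present in the graph and assign each
    # pair, rather than each individual triple, to a single split.
    pairs = sorted({tuple(sorted((h, t))) for h, _, t in triples})
    rng = random.Random(seed)
    test_pairs = set(rng.sample(pairs, int(len(pairs) * test_fraction)))
    train = [tr for tr in triples if tuple(sorted((tr[0], tr[2]))) not in test_pairs]
    test = [tr for tr in triples if tuple(sorted((tr[0], tr[2]))) in test_pairs]
    return train, test

# Toy triples: the inverse edge between drug_A and gene_X must land in the same
# split as the original edge, otherwise the test fact leaks into training.
triples = [
    ("drug_A", "down-regulates", "gene_X"),
    ("gene_X", "up-regulated-by", "drug_A"),
    ("drug_B", "treats", "disease_Y"),
    ("gene_X", "associated-with", "disease_Y"),
]
train, test = pair_aware_split(triples, test_fraction=0.5)
print("train:", train)
print("test:", test)
```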
Figure 3: A simplified hierarchical view of a drug discovery knowledge graph schema, spanning entity types such as symptoms, diseases, anatomy, pathways, cellular and biological processes, genes/proteins, molecular functions, side effects, and drugs/pharmacological agents.

3 Competing Studies

There have thus far been several other studies undertaken in the literature that address some of the same topics as this present work. It is interesting to note that many of these studies were published within the last few years, perhaps highlighting the growing interest in the field. This section will give an overview of these related studies and highlight how our own work both complements and differs from these.

As a topic of great recent interest within the community, the area of drug repurposing has been directly addressed in several reviews – some of which detail suitable available datasets [127, 83, 154, 87]. Recent research has detailed over 100 relevant drug repurposing databases, as well as appropriate methods [127]. The authors group the datasets by the primary domain that they detail: Chemical, Biomolecular, Drug–target interactions and Disease – with many of these categories being further subdivided into more specific areas. One interesting aspect of this study is that a set of recommended datasets is provided covering each domain. In [83], a review of drug repurposing from the view of machine learning has been presented, covering many of the available methods and a more focused selection of over 20 datasets. The datasets covered in this review must be in the public domain and are split into two primary categories: drug-centric and disease-centric. Relevantly for this review, in [154], knowledge graph specific approaches for drug repurposing are explored. A brief overview of the available datasets is presented, with the authors then choosing 6 to form the basis of the knowledge graph used in their experimental evaluation. In [87], the authors review and then partition the available drug database resources into four major categories based upon the type of information contained within: raw data, target-based, area specific and drug design. The various datasets are then further classified based on whether the information is curated or whether the dataset is an integration of existing resources. Compared to our own work, many of these reviews
are confined to considering just a limited area of the domain, and all but one do not consider how the datasets could be integrated into a knowledge graph.

With a broader focus than the drug repurposing work previously covered, the area of drug-target interactions has also been reviewed [4]. That review primarily focuses upon the various machine learning methods for predicting drug-target interactions, however over 20 potential data sources are also presented. The datasets are classified as containing information regarding drug-target interactions, drugs or targets alone, and binding affinity. A prior review conducted with a similar focus also covered many of the same methods and databases [23]. Conversely, machine learning based methodologies for predicting drug-drug interactions have been detailed, as well as a comparative experimental evaluation conducted [19]. As part of this process, the authors construct a drug-drug interaction knowledge graph from a subset of the Bio2RDF [7] resource, consisting of three of the constituent datasets. A more general review of 13 drug related databases has also been presented [155], covering a broad range of databases detailing drugs, drug-target interactions and other drug related information. One interesting aspect of this survey is that it categorises the studied datasets based upon the tasks in the literature they have been used for, as well as detailing any studies which made use of them.

Perhaps one of the most closely aligned studies to our own work is presented in [94], where both datasets and approaches for biological knowledge graph embeddings are reviewed. Although the review focuses primarily upon the evaluation of different methodologies, 16 relevant databases are also discussed, identifying which topic they primarily contain information on. However, as the work is experimentally driven, only a limited dataset discussion is undertaken and the work is not geared towards graph-specific neural models. A comprehensive survey of the wider biomedical area and knowledge graph-based applications within it has also been presented [17]. Within that study, 13 datasets which meet the authors' criteria to be defined as knowledge graphs are identified and detailed – although due to the wider scope of the study, not all are directly related to the target discovery domain. Finally, a recent study presents a detailed overview of the application of graph-based machine learning in the drug discovery domain, focusing upon relevant methods and approaches [41]. The review is wide ranging, covering more than just knowledge graph based applications and, unlike our work, makes no mention of suitable public datasets. We do however feel that it strongly complements our own review and serves as a method-focused counterpart to our dataset overview.

Within the available reviews in the area, although much high-quality work has been performed, there are some definite gaps which our own review aims to address. One clear issue is that many of the reviews are focused upon a specific area, with drug repurposing being well represented, and thus they do not give a clear view of the target discovery landscape. Another trend in the current reviews is for their primary focus to be upon the experimental evaluation of methodologies, with datasets given comparatively less attention. Additionally, most reviews consider the resources solely as databases, rather than focusing upon how they could relate to, or indeed form part of, a knowledge graph.
This lack of knowledge graph focus also means that many studies do not consider what type of information the databases could be used for, such as structural relations or entity features. Finally, many of the reviews have been written from a biological point of view, which may perhaps make them less accessible for machine learning practitioners who may be new to the domain, but who are interested in experimenting with relevant datasets.

4 Biological Ontologies

4.1 Introduction to Biological Ontologies

An ontology is a set of controlled terms that defines and categorises objects in a specific subject area, and also the properties and relationships between the ontological terms. Modern biomedical ontologies are usually human constructed representations of a domain, capturing the key entities and relationships and distilling the knowledge into a concise machine readable format [38]. There is a need for consistency when discussing concepts like diseases and protein functions which can be interpreted in multiple ways. Therefore, many biomedical ontologies have been created to categorise and classify biomedical concepts such as genes, proteins, biological processes and diseases. Most ontologies have a Directed Acyclic Graph (DAG) structure, with the nodes representing the ontological terms and the edges representing the relations between them. This induces a hierarchical structure on the terms. The terms may also have properties that provide descriptions or cross references to terms in other ontologies.
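Because most biomedical ontologies are distributed in machine readable formats such as OBO (discussed in Section 4.2), their DAG structure can be loaded and traversed programmatically. The sketch below is a minimal example using the third-party Python packages obonet and networkx; the Gene Ontology URL and the example term identifier are assumptions made for illustration and may change between releases.

```python
import networkx as nx
import obonet  # third-party package: pip install obonet

# Load an OBO-format ontology as a networkx MultiDiGraph; the basic release of
# the Gene Ontology is assumed here as an example.
url = "http://purl.obolibrary.org/obo/go/go-basic.obo"
ontology = obonet.read_obo(url)

print(len(ontology), "terms loaded")
print(nx.is_directed_acyclic_graph(ontology))  # ontology releases such as go-basic are designed to be DAGs

# Node data holds each term's name, definition and cross-references.
term = "GO:0006915"  # example identifier (assumed): 'apoptotic process'
print(ontology.nodes[term].get("name"))

# Edges point from a term to its more general parents, keyed by relation type
# (e.g. 'is_a'), so parents are reached by following outgoing edges.
parents = [v for _, v, key in ontology.out_edges(term, keys=True) if key == "is_a"]
print(parents)
```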
4.2 Ontology Representations

Most biomedical ontologies are expressed in a knowledge representation language such as the OBO language created by the Open Biological and Biomedical Ontologies (OBO) Foundry, Resource Description Framework Schema (RDFS) or the Web Ontology Language (OWL) [2]. OBO is a biologically oriented ontology language and is expressive enough to define the required terms, relationships and properties of an ontology. There exist free browser based tools to create, view and manipulate ontologies defined in OBO, and there are also tools to check their completeness and logical consistency. It is possible to have a lossless transition from the OBO language to OWL, which is a family of knowledge representation languages for creating ontologies. OWL was designed for the web but is also used for creating biomedical ontologies. RDFS is another ontology language designed for the web and used for biomedical ontologies.

4.3 Ontology Matching and Merging

Ontologies provide such value in providing interoperability for biological data that there has been a proliferation of ontologies. However, this in itself causes an issue, especially if a different ontology is used for the same biomedical entity. If database A labels diseases using ontology X and database B labels diseases using ontology Y, it can be hard to know the relation between two different disease entities in the databases. This type of scenario often occurs during the creation of biomedical knowledge graphs, which are databases of relationships between different biological entities and often use data from multiple sources. Some resources exist to match together ontological terms, e.g. OxO [63] and DODO [40]. However, the mappings provided by these resources differ and, between any two distinct ontologies, there is no guarantee of a direct mapping, or any mapping at all, between their ontological terms, even if the subject matter is the same.

Merging two different ontologies into a single ontology is an active area of research, and there is demand for ontology merging as more and more databases are integrated together. However, simply mapping ontological terms to other ontological terms can create logical inconsistencies in the newly created ontology, violating its DAG structure. Therefore, merging ontologies often involves substantial manual intervention and is a time consuming and error prone process. The Open Biological and Biomedical Ontologies (OBO) Foundry [130] was set up to provide rules and advice for ontologies to make them easier to merge and match. Their recommendations include a standard set of relations between ontological terms.

4.4 Ontology Overviews

In the remainder of this section we will detail the major ontologies which are relevant for use in drug discovery tasks. These ontologies are detailed in Table 2.
Ontology Name | Entities Covered | Classes | Average # of children | Classes with no definition | Number of Properties | Max Depth | License
Monarch Disease Ontology (MonDO) | Diseases | 24K | 5 | 8K | 25 | 16 | Creative Commons
Experimental Factor Ontology (EFO) | Diseases | 28K | 6 | 7K | 66 | 20 | Apache 2.0
Orphanet Rare Disease Ontology (ORDO) | Rare Diseases | 15K | 17 | 8.5K | 24 | 11 | Creative Commons
Medical Subject Headings (MeSH) | Medical Terms | 300K | 4 | 270K | 38 | 15 | UMLS License
Human Phenotype Ontology (HPO) | Disease Phenotype | 19K | 3 | 6.5K | 0 | 16 | HPO License
Disease Ontology (DO) | Diseases | 19K | 4 | 8K | 89 | 33 | Creative Commons
Drug Target Ontology (DTO) | Drug Targets | 10K | 4 | 3K | 43 | 11 | Creative Commons
Gene Ontology (GO) | Genes | 44K | - | - | 11 | - | Creative Commons

Table 2: An overview of Ontologies suitable for use in drug discovery.

EFO
The Experimental Factor Ontology (EFO) was created by the European Bioinformatics Institute (EBI) to provide a systematic description of experimental variables available in databases, such as disease, anatomy, cell type, cell lines, chemical compounds and assays [85]. However, it has been widely used outside of the EBI. For example, the Open Targets Platform (detailed more in Section 5.1) uses EFO to provide the description, phenotypes, cross-references, synonyms, ontology and classification for annotating disease entities.

MeSH

One of the most commonly used and largest biomedical ontologies is the Medical Subject Headings Thesaurus, or MeSH [79]. It was designed for indexing articles in the MEDLINE/PubMed database. Each article in PubMed has MeSH terms attached that specify which biological entities the article is describing. The ontology has been translated into many different languages and there are useful tools provided by MeSH. It is possible to generate MeSH terms from text automatically using APIs, although these are not guaranteed to be correct. MeSH has around 300,000 different classes, although these are not all disease specific. The relationships between the ontological terms are not very complex and there is not very good cross-referencing or interlinking of terms. This can cause difficulties when working with other ontologies.

Disease Ontology

The Human Disease Ontology (DO) is an ontology designed to link different datasets through disease concepts [118]. The resource is community driven and aims to have a rich hierarchy allowing the study of different diseases from the ontological connections. DO has terms linked to well established ontologies such as MeSH, SNOMED and UMLS. It is a member of the OBO community of interoperable ontologies.

Mondo Disease Ontology

The Mondo Disease Ontology was semi-automatically created [97]. It aims to harmonise disease definitions between generalised ontologies such as MeSH and specific deep ontologies such as Orphanet (an ontology focussing on rare diseases). Mondo merges these ontologies using algorithms such as Bayesian OWL ontology merging, together with curated equivalence relations and user feedback. One of its advantages is that the axioms linking to other resources are checked algorithmically. This prevents inconsistencies and logical loops created from linking terms to different resources, which can otherwise easily occur. It has around 20K different disease classes.

HPO

The Human Phenotype Ontology (HPO) provides an ontology to describe the phenotypes (the observable traits) of disease [114]. The terms do not contain the actual disease entities such as "Flu"; instead there are symptoms such as a "runny nose" and "sore throat". The ontology was originally developed for rare and typically genetic diseases but is now also used for common diseases. HPO includes terms to describe the speed of onset of the disease and how often symptoms occur, as well as inheritance characteristics. HPO can be used to diagnose diseases and is used by clinicians as well as researchers. HPO is developed by Monarch, who also develop Mondo, so these two ontologies work well together.

GO
Gene Ontology (GO) is an ontology for gene functions [27]. GO consists of three disjoint sets of ontological terms: 1) the molecular function ontology, describing the activity of the gene product at the molecular level; 2) the cellular component ontology, specifying the location where the gene product performs its function; and 3) biological processes, the functional processes the gene is involved in. GO is used to create GO annotations, which consist of a gene product, a GO term, references and evidence. These annotations are stored in a database and provide a convenient reference for information about genes. There are multiple products that allow searching of the GO annotations database. The GO database claims to be the world's largest source of information on the functions of genes, and GO terms are used in a wide range of bioinformatics applications. GO is well maintained, is updated weekly, and was one of the original members of OBO.

DTO

As we have previously mentioned, a drug target is a protein or molecule within the body that is associated with disease and would be suitable to develop a drug against. The Drug Target Ontology (DTO) has terms to describe information about drug targets [77]. The ontology has terms to describe the type of protein, how well studied the protein is and what types of drugs may be used against the target. It is also used to describe the level of development of drugs for a protein. It uses the Human Disease Ontology to link diseases to the proteins and is designed to be modular and easily extensible.

5 Primary Domain-Specific Dataset Overviews

As has been highlighted throughout this review, unlike some other domains, the drug discovery area actually has a wealth of publicly available information, much of which has dedicated teams tasked with maintaining and updating the resources. Many of these are national or international level bodies, for example the US based National Center for Biotechnology Information (NCBI) or the European Bioinformatics Institute (EBI). Additionally, the pan-European ELIXIR body, an organisation dedicated to detailing best practices for life science datasets and enabling stable funding for them, maintains a list of core data resources which includes many of the resources covered in this review [37]. In this section we introduce some of the key, primary resources covering the crucial entities of interest in drug discovery – genes, diseases and drugs – as well as sources capturing the relationships between them via interactions, pathways and processes. The list of resources covered here is not designed to be exhaustive; instead we signpost some of the most popular and trusted ones, suggest how they could be integrated into a knowledge graph and discuss the origins of the data.

5.1 Integrated Drug Discovery Resources

This section outlines data sources which are tailored specifically for the drug discovery field. Typically these resources combine two or more entity specific data sources and add additional knowledge useful for the domain. These resources can also be useful as a reference point for some best practices with regards to data handling and integration for the field.

Open Targets: Open Targets is a resource which collects various disparate data sources together, covering the key entities for target discovery including genes, drugs and diseases [71, 18].
Open Targets was established in 2016 as a collaboration between academia and industry with the goal of enabling better science through the integration of public resources, primarily relating to genes and diseases, for the task of target prediction [71]. Data for Open Targets has been taken from 20 resources including UniProt [3], Reactome [62] and ChEMBL [89]. As of January 2021, Open Targets contains data on 14K diseases and 27K targets. The resource is updated multiple times a year, with five main releases in both 2020 and 2019. Each release also contains detailed version information for the constituent datasets. Access to the data is enabled via a web-based REST API and a Python client, as well as directly for download via JSON and CSV files. Recently, Open Targets has been expanded with the addition of a Genetics portal [42] for studying genetic variants and their relation to disease.
As Open Targets is specifically designed to integrate data around linking potential target genes/proteins to diseases, each potential link is provided with annotated associative evidence scores for a variety of evidence classes including genetic, drug and text mining. This information is aggregated into a final association score, indicating how associated a certain target-disease pair is overall. Thus far, Open Targets does not provide any of its information in a format amenable for use in knowledge graphs, nor has its information been integrated into any of the existing drug discovery suitable graph resources. However, it is a prime resource, with clear scope to enrich a knowledge graph with pertinent target discovery information. For example, it could be utilised to provide links between target gene entities in a knowledge graph and the relevant disease entities. Further, the various associative scores contained within Open Targets could be used to weight these edges and provide some notion of trust based on the type of association.

Pharos: In a similar vein to Open Targets, the Pharos resource provides data integrations around the drug discovery domain, with a particular focus on the druggable genome [101]. Pharos is actually the front-end access point, with the underlying data resource being the Target Central Resource Database (TCRD), which was launched in 2014 as part of a National Institutes of Health (NIH) program. TCRD contains data on the key entities in the drug discovery domain, including genes and diseases, with the data being integrated from a large number of other public data sources such as ChEMBL [89], STRING [125], DisGeNET [109] and UniProt [3]. TCRD also implements a number of ontologies including the Disease Ontology [119] and the Gene Ontology [28]. Access to the data through Pharos is provided programmatically through the use of a GraphQL API endpoint, a graph-like query language for data retrieval [50]. Both Pharos and TCRD are open-sourced and are updated regularly, with multiple yearly releases keeping up to date with changes in the constituent dataset sources, whilst also keeping version information.

As with Open Targets, the information contained within Pharos could be used to provide links between proteins and diseases. However, it also contains detailed protein-protein interaction data which could be added as relationships between protein entities; these could be weighted by the various confidence metrics with which the relationships are annotated. Additionally, Pharos contains various information types which could be used to add features onto entities. For example, for a given protein entity, Pharos contains structural and expression information which could be transformed into a generic and task agnostic set of features.
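As a brief illustration of the kind of programmatic access described above, the sketch below posts a GraphQL query to Pharos using the standard requests library. Both the endpoint URL and the query fields are assumptions made purely for illustration; the published Pharos GraphQL schema should be consulted for the actual types and field names.

```python
import requests

# Assumed Pharos GraphQL endpoint; check the Pharos documentation for the
# current address and schema before relying on this.
PHAROS_GRAPHQL = "https://pharos-api.ncats.io/graphql"

# Hypothetical query: retrieve a handful of targets matching a free-text term,
# with their symbol and target development level (field names are assumptions).
query = """
{
  targets(filter: { term: "kinase" }) {
    targets(top: 5) {
      name
      sym
      tdl
    }
  }
}
"""

response = requests.post(PHAROS_GRAPHQL, json={"query": query}, timeout=30)
response.raise_for_status()
print(response.json())
```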
5.2 Protein and Gene

Genes and Proteins are the key entities related to target discovery and as such there are numerous rich public resources related to them. We briefly review the most pertinent of these here, however interested readers are encouraged to refer to more detailed protein-specific dataset reviews [22]. The datasets are summarised in Table 3.

Dataset | Summary | First Released | Update Frequency | Updated < 1 Year Ago | Curation Method | Primary Domain
UniProtKB | Primary protein resource used in the domain. Can be mined for protein-protein interactions and potentially protein features. | 2003 | 8 Weeks | ✓ | Expert & Automated | Proteins
Ensembl | One of the primary sources for gene data. Gene-gene and gene-disease relationships can be extracted, as well as many gene-based features. | 1999 | 3 Months | ✓ | Automated | Genes
Entrez Gene | Another primary gene data resource. Used in existing KGs for gene entity annotations. | 2003 | Daily | ✓ | Expert & Automated | Genes

Table 3: Primary data sources relating to Genes and Proteins.

UniProt

UniProt is a collection of protein sequence and functional information started in its current form in 2003, providing three core databases: UniProtKB, UniParc and UniRef [3]. UniProtKB is the primary protein resource and thus will be focused on here. Overall, UniProt is classified as an ELIXIR core resource and is frequently integrated into other datasets.
The whole of UniProt is updated on an eight week cycle and access is provided via REST, Python, Java and SPARQL API endpoints. UniProt cross-references with many resources in the domain including the Gene Ontology [28], IntAct [52] and STRING [125]. The UniProtKB database itself comprises two different resources: Swiss-Prot and TrEMBL [3]. Swiss-Prot contains the expert annotated and curated protein information, whilst TrEMBL stores the automatically extracted information. Thus, as of UniProt version 2020_06, TrEMBL contains a far greater volume of entities, at 195M versus the 563K entities in Swiss-Prot. The information contained within UniProtKB could be used to add protein-protein interaction relations into a knowledge graph. Additionally, numerous protein structural and sequence based features could be extracted to enrich the relevant entities.

Ensembl

Ensembl is primarily a data resource for genetics from the EBI, covering many different species, which was founded in 1999 and is considered an ELIXIR Core resource [150]. It provides detailed information on gene variants, transcripts and position in the overall genome. This data is extracted via an automated annotation process which considers only experimental evidence. The information in Ensembl is updated on approximately a three monthly schedule, with over 100 versions having been released to date. Access is provided via a REST endpoint as well as MySQL data dumps. The data in Ensembl could be used to provide gene-protein links, as well as gene-disease links through the integration with HPO.

Entrez Gene

Entrez Gene is the database of the NCBI which provides gene-specific information and was initially launched in 2003 [84]. Gene can be viewed as an integrated resource of gene information, incorporating information from numerous relevant resources. As such, it contains a mix of curated and automatically extracted information, which is updated as frequently as daily and made available for direct download [14]. Owing to its status as an integrator of relevant resources, Gene provides the GeneID system, a unique integer associated with each catalogued gene. The GeneID can be useful as a translation service between other resources and is used by the Hetionet knowledge graph [53] as the primary ID for its gene entities. Entrez Gene could potentially be used to enrich a knowledge graph with gene level features; for example, it contains detailed tissue expression data.

5.3 Interactions, Pathways and Biological Processes

In this section, we detail the resources specialising in the linking of the entities discussed thus far through interactions, processes and pathways (a more focused review specifically detailing protein-protein interactions can be found in [88]). The interaction resources are presented in Table 4, whilst the processes and pathways resources are detailed in Table 5.

Dataset | Summary | First Released | Update Frequency | Updated < 1 Year Ago | Curation Method | Primary Domain
STRING | One of the most commonly used sources for physical and functional protein-protein interactions in existing KGs. | 2003 | Monthly | ✓ | Expert & Automated | Protein/Gene Interactions
BioGRID | Contains interactions between gene, protein and chemical entities which could be included directly in a KG. | 2003 | Monthly | ✓ | Expert | Biological Interactions
IntAct | Contains molecular reactions between gene, protein and chemical entities. Uses UniProt for identifiers. | 2003 | Monthly | ✓ | Expert | Molecular Interactions

Table 4: Primary data sources relating to interactions.
5.3.1 Interaction Resources

STRING

The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) is another ELIXIR core resource containing detailed protein information, with a particular focus on capturing protein-protein interaction networks [125]. The resource was started prior to 2003 and has typically been updated bi-annually, with STRING version 11.0b containing over 24M proteins from over 5K different organisms. STRING integrates with other resources such as UniProt [3], Ensembl [150] and the Gene Ontology [28] to allow for easier joining of entities. Access is provided via a REST API as well as a series of flat files, with the primary protein-protein interaction networks being provided as an edge-list – making the data extremely amenable for inclusion in a knowledge graph. The interactions in STRING are taken from a range of sources, including curated ones taken directly from experimental data and those which are mined from the literature using NLP techniques.

BioGRID

The Biological General Repository for Interaction Datasets (BioGRID) is a resource maintained by a range of international universities which specialises in collecting information regarding the interactions between biological entities including proteins, genes and chemicals [124]. The resource was started in 2003 and is updated monthly with curated information from the literature, with version 4.2 containing over 1.9M interactions. The data is provided for download directly as CSV files, via a REST API and as a Cytoscape plugin. The data available via BioGRID could be directly used in a graph as edges between entities, with protein-protein and gene-protein interactions being clear candidates. This process is simplified as BioGRID uses Entrez Gene IDs to represent the entities.

IntAct & MINT

IntAct is a database of molecular interactions maintained by the EBI and considered a core resource by ELIXIR [52]. IntAct was first launched in 2003 and has been updated on a monthly cycle, with the version released in January 2021 containing over 1.1M interactions. The data is made available for download via flat file formats from the IntAct website. Like the other interaction datasets highlighted in this section, IntAct data can be interpreted as edges between entities describing protein-protein interactions, where the protein entities are represented using UniProt IDs, allowing for cross-resource linking. IntAct is closely linked with the Molecular Interaction database (MINT), another core resource providing various interaction types between proteins [76].
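Since STRING distributes its protein-protein interactions as flat edge-list files with a combined confidence score, turning them into a graph is straightforward. The sketch below is a minimal example using pandas and networkx; the file name follows the pattern used for the human (taxon 9606) links file in STRING v11.0, and the score threshold of 700 is simply a commonly used "high confidence" cut-off – both should be checked against the release actually being used.

```python
import pandas as pd
import networkx as nx

# STRING protein links files are space-separated with columns
# protein1, protein2 and combined_score (0-1000).
links = pd.read_csv("9606.protein.links.v11.0.txt.gz", sep=" ")

# Keep only high-confidence interactions (assumed threshold of 700).
high_conf = links[links["combined_score"] >= 700]

graph = nx.from_pandas_edgelist(
    high_conf,
    source="protein1",
    target="protein2",
    edge_attr="combined_score",
)
print(graph.number_of_nodes(), "proteins,", graph.number_of_edges(), "interactions")
```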
5.3.2 Pathway Resources

Dataset | Summary | First Released | Update Frequency | Updated < 1 Year Ago | Curation Method | Primary Domain
Reactome | A core resource for pathways and reactions. Amenable to graph representation and already included in several KGs. | 2003 | > Annually | ✓ | Expert | Pathways
Omnipath | An integrator of pathway resources that could be included in a KG via its RDF version. | 2016 | > Annually | ✓ | Expert | Pathways
Wikipathways | A crowdsourced collection of pathway resources. Also provided in graph amenable formats. | 2008 | Monthly | ✓ | Expert | Pathways
Pathway Commons | A collection of many resources, including the others discussed in this table. | 2010 | Biannually | ✓ | Expert & Automated | Pathways

Table 5: Primary data sources relating to pathways and processes.

Reactome

Reactome is a large and detailed resource comprising biological reactions and pathways collected across multiple species, including humans and several model organisms [62]. The project was started in 2003 and has been updated multiple times a year since then, with release version 75 containing over 2.4K human pathways, 13K reactions and 10K proteins. Reactome has been granted ELIXIR core resource status and integrates closely with other databases like IntAct, UniProt and ChEMBL. The data contained within Reactome is curated and peer reviewed by domain experts, with links provided to the originating literature. The resource is very amenable to graph based representation and is provided for download in a variety of formats, including a Neo4j database dump. Resources from Reactome have also already been included in existing knowledge graph resources like Hetionet.

Omnipath

Omnipath is a comparatively new resource for biological signalling pathway information with a focus on humans and rodents [133]. The Omnipath web-resource was first made publicly available in 2016 and has been updated regularly since, although with an ad hoc pattern. Omnipath integrates over 100 literature curated data resources containing information on signalling pathways, including many datasets covered in this present review such as IntAct, Reactome and ChEMBL. Access to Omnipath is provided via a web-based REST API, a Cytoscape plugin and flat files, as well as through libraries for the R and Python programming languages. Owing to Omnipath's status as an integrator of other resources, it provides a wealth of different interaction types which could be utilised as edges in a knowledge graph.

Wikipathways

The Wikipathways project explores the use of crowdsourcing for community curation of pathway and interaction resources [121]. The project began in 2008 and has been updated on a monthly schedule since then. Owing to its crowdsourced nature, domain scientists can add new, and edit existing, information to ensure better overall quality, and it has been designed to complement existing resources such as Reactome. Access is provided via a REST endpoint, as well as clients in various programming languages such as R, Python and Java. Additionally, Wikipathways has a semantic web portal, enabling access through a SPARQL endpoint to an RDF version of the resource for knowledge graph compatibility [138].

Pathway Commons

The Pathway Commons project was started in 2010 with the aim of collecting and allowing easy access to biological pathway and interaction databases [20]. Pathway Commons aims to complement, rather than compete with, other pathway resources and offers no additional curation on top of the constituent resources. As of version 12, 20 different resources have been included in Pathway Commons, including Wikipathways, BioGRID and Reactome. The main resource is updated with data from the source datasets on an approximately biannual schedule [115]. Access is provided via a REST API, as well as a SPARQL endpoint which, through the returned RDF triples, means that the data could easily be incorporated into a knowledge graph.

5.4 Diseases

We now detail the resources whose primary focus is providing information about diseases. These resources are detailed in Table 6.

KEGG DISEASE