2023 KnowledgeGraphEmbeddingsintheBi

From GM-RKB

Subject Headings: KG Embedding, BioKG Knowledge Graph.

Notes

Cited By

Quotes

Abstract

Knowledge graphs are powerful tools for representing and organising complex biomedical data. Several knowledge graph embedding algorithms have been proposed to learn from and complete knowledge graphs. However, a recent study demonstrates the limited efficacy of these embedding algorithms when applied to biomedical knowledge graphs, raising the question of whether knowledge graph embeddings have limitations in biomedical settings. This study aims to apply state-of-the-art knowledge graph embedding models in the context of a recent biomedical knowledge graph, BioKG, and evaluate their performance and potential downstream uses. We achieve a three-fold improvement in terms of performance based on the HITS@10 score over previous work on the same biomedical knowledge graph. Additionally, we provide interpretable predictions through a rule-based method. We demonstrate that knowledge graph embedding models are applicable in practice by evaluating the best-performing model on four tasks that represent real-life polypharmacy situations. Results suggest that knowledge learnt from large biomedical knowledge graphs can be transferred to such downstream use cases. Our code is available at https://github.com/aryopg/biokge.

1. Introduction

Knowledge Graphs (KGs) are increasingly utilised for knowledge representation in the biomedical domain. Recent studies show that KGs can be utilised to aid drug repurposing research [18] and to predict the side effects of drug combinations [31, 6]. To maximise the utility of KGs in this domain, comprehensive coverage of entities and links is essential. A novel biomedical KG, called BioKG [23], has been developed to be the first in the domain that attempts to agglomerate a wide range of entity and link types. However, accurately predicting links between entities in KGs can be challenging.

Knowledge Graph Embeddings (KGEs) offer a solution by representing KGs in a low-dimensional space [9]. However, the potential utility of KGEs in the biomedical field remains underexplored. A recent study reports the limited success of KGEs for a biomedical KG [3], raising the question of whether KGE methods have reached their maximum efficacy in this domain.

In this study, we show that it is possible to learn to accurately predict links in BioKG. By taking into account recent studies outlining the best practices for KGE algorithms [17], previously determined limitations can be overcome. Furthermore, the pretrained KGE models are also transferable to four downstream polypharmacy tasks, suggesting that a transfer learning paradigm where KGE models trained on large KGs are adapted for solving downstream tasks is feasible. In addition, we investigate the efficacy of a rule-based model, called Anytime Bottom-Up Rule Learning [AnyBURL; 13], for BioKG. Such a rule-based model offers some degree of interpretability, which is important in the biomedical domain.

In summary, this study presents the following contributions:

  • A comprehensive evaluation of KGEs for BioKG using recent training best practices, which reveals significant HITS@10 and mean reciprocal rank (MRR) improvement compared to previous results [3]. The best-performing KGE model (ComplEx) reaches 0.793 HITS@10, compared to 0.286 in [3].
  • An investigation of the interpretability of a rule-based model for BioKG. AnyBURL achieves a competitive HITS@10 score of 0.677 while providing interpretable rules.
  • An investigation of applying KGE models in real-world tasks. The best-performing pretrained KGE model can easily be adapted to four downstream polypharmacy tasks in a transfer learning paradigm.

Table 1. An overview of KGE models, with the domain they embed in ($d$ corresponds to the embedding size) and their scoring function. Here, $\ast$ denotes the convolution operation, $\mathrm{Re}(x)$ is the real part of $x \in \mathbb{C}$, $\langle x, y, z \rangle = \sum_i x_i y_i z_i$ denotes the tri-linear dot product, $g(x)$ is a nonlinear function, $\mathrm{vec}(x)$ is the flattening operator, $w$ denotes the convolutional filter, and $W$ denotes a linear transformation matrix.
Model	Domain	Scoring function $f(e_s, r_p, e_o)$
TransE [4]	$\mathbb{R}^d$	$-\lVert e_s + r_p - e_o \rVert$
TransH [25]	$\mathbb{R}^d$	$-\lVert (e_s - w_p^\top e_s w_p) + r_p - (e_o - w_p^\top e_o w_p) \rVert$
RotatE [19]	$\mathbb{C}^d$	$-\lVert e_s \circ r_p - e_o \rVert$
DistMult [28]	$\mathbb{R}^d$	$\langle e_s, r_p, e_o \rangle$
ComplEx [22]	$\mathbb{C}^d$	$\mathrm{Re}(\langle e_s, r_p, e_o \rangle)$
ConvE [8]	$\mathbb{R}^d$	$g(\mathrm{vec}(g([e_s, r_p] \ast w)) W) e_o$


Fig. 1. An extract of BioKG [23]. The nodes represent entities in the KG, edges between them are links. The variety in identifier structure shows that BioKG is a combination of multiple smaller KGs. In this extract, the centre node (DB00860) represents the drug prednisolone, which targets the Glucocorticoid receptor (P04150). This receptor is associated with disorders related to or resulting from the use of cocaine (D019970), indicated by the Protein-Disease-Association relation (PDiA). Hence, prednisolone is connected to said disorders through the Drug-Disease-Association relation (DrDiA). The right-most node (D000544) represents Alzheimer’s disease, a genetic disorder (GeDi). One possible application that uses the information in the KG would be to train a model to predict missing links. Such a model could consider information from, for example, the Drug-Drug Interaction (DDI) and Protein-Protein Interaction (PPI) relations starting from prednisolone to predict that prednisolone could also be used to treat Alzheimer’s disease, as indicated by the dashed DrDiA relation between them.


2. Background

2.1. Link Prediction in Knowledge Graphs

KGs are a knowledge representation formalism in which knowledge about the world is modelled as relationships between entities [10]. A KG can be represented as a set of subject-predicate-object triples, where each (s, p, o) triple represents a relationship of type p between the subject s and the object o. Link Prediction (LP) is the task of identifying missing triples, i.e. triples encoding true facts that are missing from the KG. Consider, for example, the extract of BioKG presented in Figure 1. It contains information about prednisolone (DB00860), a drug that targets the Glucocorticoid receptor. An LP model that is trained on BioKG could be used to fill in blanks in triples such as (DB00860, DrDiA, _), effectively predicting other disorders that prednisolone could treat. Similarly, such a model could be used to fill in blanks in triples such as (_, DrDiA, D000544), where D000544 is Alzheimer’s disease, predicting drugs that could be repurposed to treat Alzheimer’s. Recent work does this using a different biomedical KG, finding prednisolone likely to be associated with Alzheimer’s disease [14]. Moreover, early-stage investigations have confirmed that high doses of prednisolone can result in some delay of cognitive decline [16].

Before considering such real-life applications, an LP model should be sufficiently evaluated using adequate baselines to avoid wasting resources in failed pharmaceutical trials. An LP model’s generalisation capabilities are evaluated using rank-based metrics. To do so, the KG is partitioned into training, validation, and test triples. For each test triple (s, p, o), trained models are used to predict the subject or the object, i.e. fill in blanks in (_, p, o) or (s, p, _), respectively. The resulting candidate triples are then ranked by the score the model assigns to them. Subsequently, triples aside from (s, p, o) that exist in the training, validation, or test sets are filtered out such that other triples that are known to be true do not influence the ranking. The resulting ranking is ultimately used to calculate metrics such as the MRR or the average HITS@k (see Appendix A).
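The filtered ranking protocol can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the candidate scores, and the entity identifiers `X0001` and `X0002` are hypothetical, and ties are resolved optimistically (an assumption, since the tie-breaking policy is not stated here). MRR and HITS@k follow their standard definitions: the mean of reciprocal ranks, and the fraction of ranks that are at most k.

```python
def filtered_rank(scores, true_entity, known_true):
    """Rank of `true_entity` among candidate entities for one query.

    scores:     dict mapping candidate entity -> model score (higher = more plausible)
    known_true: other entities that also complete the query to a triple found
                in the training, validation, or test set; these are skipped so
                that they cannot push the true entity down the ranking
    """
    target = scores[true_entity]
    rank = 1
    for entity, score in scores.items():
        if entity == true_entity or entity in known_true:
            continue  # filtered setting: ignore other known-true answers
        if score > target:  # optimistic tie handling (assumption)
            rank += 1
    return rank


def mrr(ranks):
    """Mean reciprocal rank over a list of ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of ranks that are at most k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical scores for the query (DB00860, DrDiA, _).
scores = {"D019970": 0.9, "X0001": 0.8, "D000544": 0.7, "X0002": 0.1}

# D019970 is already a known answer, so it is filtered out of the ranking.
rank = filtered_rank(scores, "D000544", known_true={"D019970"})
```

In this toy query the test entity D000544 is outscored by two candidates, but one of them is a known true answer, so the filtered rank is 2 rather than 3.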

2.2. Knowledge Graph Embeddings

A prevalent class of LP models comes in the form of KGEs, which represent entities and relations as low-dimensional vectors and use a scoring function to indicate the plausibility of a triple. Many models and training paradigms for embedding knowledge graphs have been proposed [24, 9]. Models usually differ in how entity and relation representations are used to compute the likelihood of a link in the KG. Examples are translational models such as TransE [4], TransH [25], and RotatE [19], factorisation models such as DistMult [28] and ComplEx [22], and neural-network models such as ConvE [8]. Table 1 provides a summary of the domains in which these KGEs embed and the scoring functions they use.
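To make the scoring functions concrete, the following is a minimal pure-Python sketch of three of them (TransE, DistMult, and ComplEx), operating on plain lists of numbers. The function names are hypothetical and the TransE norm is assumed to be Euclidean. Note also that the ComplEx sketch conjugates the object embedding, as in the standard formulation of that model; this is an assumption relative to the notation in Table 1, which does not show a conjugate.

```python
import math


def transe_score(e_s, r_p, e_o):
    """TransE: negative distance between the translated subject and the object."""
    # Euclidean norm is an assumption; other norms are also used in practice.
    return -math.sqrt(sum((s + r - o) ** 2 for s, r, o in zip(e_s, r_p, e_o)))


def distmult_score(e_s, r_p, e_o):
    """DistMult: tri-linear dot product of three real-valued vectors."""
    return sum(s * r * o for s, r, o in zip(e_s, r_p, e_o))


def complex_score(e_s, r_p, e_o):
    """ComplEx: real part of the tri-linear product of complex-valued vectors.

    The object embedding is conjugated (standard ComplEx formulation,
    assumed here), which lets the model score asymmetric relations.
    """
    return sum(s * r * o.conjugate() for s, r, o in zip(e_s, r_p, e_o)).real


# Toy two-dimensional embeddings (made-up values, not learnt from BioKG).
perfect_translation = transe_score([1.0, 2.0], [0.5, -1.0], [1.5, 1.0])
trilinear = distmult_score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
```

In all three cases a higher score indicates a more plausible triple, so a TransE score of zero (the object equals the translated subject exactly) is the best attainable.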

The paradigms wherein these models are usually trained can vary in several ways, with free variables such as the loss function, regularisation, initialisation, and data augmentation strategies [17]. Furthermore, KGs generally do not explicitly contain negative triples. However, for a KGE to be trained in a way that allows it to differentiate between true and false triples, negative triples need to be generated explicitly at training time. Approaches to generating negative samples typically corrupt a selection of triples at random, possibly filtering out corrupted triples that already exist in the KG [4, 12, 17]. Investigating the appropriate settings of these hyperparameters is important. While certain combinations of settings have been found to often significantly outperform others, the KG itself still dictates which settings are best [17].
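Negative sampling by random corruption, with filtering of corrupted triples that already exist in the KG, can be sketched as follows. This is one simple variant under stated assumptions, not the procedure used in the study: the function name, the 50/50 choice between corrupting the subject and the object, and the `max_tries` cap are all hypothetical. The two example triples are the ones described in the caption of Figure 1.

```python
import random


def corrupt_triple(triple, entities, known_triples, rng, max_tries=100):
    """Return a negative triple by replacing the subject or the object.

    Candidates that already exist in `known_triples` are rejected, so the
    returned triple is never a known true fact. Returns None if no valid
    corruption is found within `max_tries` attempts.
    """
    s, p, o = triple
    for _ in range(max_tries):
        replacement = rng.choice(entities)
        if rng.random() < 0.5:
            candidate = (replacement, p, o)  # corrupt the subject
        else:
            candidate = (s, p, replacement)  # corrupt the object
        if candidate not in known_triples:
            return candidate
    return None


# A tiny KG built from the triples described in the caption of Figure 1.
kg = {
    ("DB00860", "DrDiA", "D019970"),
    ("P04150", "PDiA", "D019970"),
}
entities = ["DB00860", "P04150", "D019970", "D000544"]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
negative = corrupt_triple(("DB00860", "DrDiA", "D019970"), entities, kg, rng)
```

Because the KG is incomplete, a corrupted triple that is absent from the KG may still be a true fact; filtering only removes the negatives that are known to be wrong.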

...

References

Aryo Pradipta Gema, Dominik Grabarczyk, Wolf De Wulf, Piyush Borole, Javier Antonio Alfaro, Pasquale Minervini, Antonio Vergari, and Ajitha Rajan. (2023). "Knowledge Graph Embeddings in the Biomedical Domain: Are They Useful? A Look at Link Prediction, Rule Learning, and Downstream Polypharmacy Tasks." doi:10.48550/arXiv.2305.19979