Addressing questions (i)-(iv), we hypothesise that there are feasible ways to express implicit and explicit database content by formal-ontological means and to combine this content with pre-existing domain ontologies. Regarding question (v), previous work has shown how the content of tables in scientific publications can be interpreted on formal grounds [10].
Question (vi) has been addressed in [11], which introduced the reasoning capabilities of querying highly axiomatised bio-ontologies. Question (vii) needs to be addressed after answering questions (i)-(iv), but is beyond the scope of the present paper. We will demonstrate how entities referenced by a typical extract from a biomedical database can be interpreted under several ontological viewpoints. The resulting OWL models are then tested under three aspects: database content retrieval, information completeness, and DL complexity and decidability. Regarding the last aspect, in order to answer DL queries there should be theoretical guarantees that the machine operates at a reasonable time cost (complexity) and always finishes its task (decidability).
This section describes the ontology engineering principles we subscribed to, as well as the data we gathered to exemplify our approach. Firstly, we believe that ontology structure and content should be driven by the underlying reality, rather than by specific application needs. We subscribe to the principles of the OBO Foundry [4], and emphasise the use of a principled upper-level ontology, here BioTopLite2 (BTL2) [12], which offers a set of high-level classes, together with constraining axioms, using a small number of core relations.
BTL2 regards all instances of its classes as implicitly time-indexed, thus solving the ambiguity problem of using binary relations for the cases where BFO2 [13] requires ternary ones, which are not expressible in OWL [15]. OWL2 supports classes, binary relations (object properties), and individuals, together with related axioms and assertions, for which we will use the OWL2 Manchester Syntax [18]. Real-world entities are often described in terms of dispositions. Saying that all animals are organisms is a universal statement; stating that all humans are able to develop diabetes mellitus type 2 is a dispositional statement.
Several works [12, 19-21] have suggested including dispositions in biomedical ontologies. Large parts of biomedical database content seem to be dispositional: in biochemistry, a statement that a protein A participates in a process B probably does not mean that all instances of A constantly participate in a process of type B, but rather that all instances of A have the disposition to participate in such a process.
Biomedical observations yield statistical results, on the basis of which participants of an experiment are ascribed certain capabilities. Finally, database content as such needs ontological scrutiny. Database content is ontologically best characterised as information content. This requires a strict distinction between (i) the database content proper and (ii) the entities in the world referenced by the former. Such information content entities do not necessarily denote particulars.
For the analysis reported in this paper, we selected a typical biological database example (cf. Table 1), generated by joining data from UniProt [1] and Ensembl [23] by standard database querying (Additional file 1). This was performed in order to retrieve all records related to the metabolism of homocysteine and other sulphurated amino acids, like methionine and cysteine (see [24] for more information regarding the homocysteine metabolic pathway). All sample data were retrieved on January 22nd. Data from the NCBI Taxonomy were incorporated at the end of the retrieval process, adding the taxonomy identifiers of the organisms from which data are recorded in UniProt and Ensembl.
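The join described above can be illustrated with a minimal sketch. The field names (`acc`, `uniprot_acc`, `gene_id`), the placeholder identifiers and the helper `join_records` are assumptions made for illustration only; they are not taken from Additional file 1, and the actual queries used by the authors may differ.

```python
def join_records(uniprot_rows, ensembl_rows, taxon_ids):
    """Inner-join UniProt rows with Ensembl rows on a shared accession,
    then attach the taxonomy identifier of the organism (hypothetical schema)."""
    # Index the Ensembl rows by the UniProt accession they map to.
    ensembl_by_acc = {}
    for row in ensembl_rows:
        ensembl_by_acc.setdefault(row["uniprot_acc"], []).append(row)

    joined = []
    for u in uniprot_rows:
        for e in ensembl_by_acc.get(u["acc"], []):
            joined.append({
                "acc": u["acc"],
                "protein": u["protein"],
                "organism": u["organism"],
                "ensembl_gene": e["gene_id"],
                # Added last, mirroring the order described in the text.
                "taxon_id": taxon_ids.get(u["organism"]),
            })
    return joined


# Placeholder data; accessions and gene identifiers are invented.
uniprot_rows = [
    {"acc": "ACC_A", "protein": "protein A", "organism": "Homo sapiens"},
    {"acc": "ACC_B", "protein": "protein B", "organism": "Homo sapiens"},
]
ensembl_rows = [{"uniprot_acc": "ACC_A", "gene_id": "GENE_A"}]
taxon_ids = {"Homo sapiens": 9606}

joined = join_records(uniprot_rows, ensembl_rows, taxon_ids)
print(joined)
```

Only records present in both sources survive the join, which is why `ACC_B` does not appear in the result.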
The OWL models were created according to the data organisation presented in Table 1, based on a sample record (Table 2). The four OWL models uniformly represent all information entities (database content) as individuals. The models differ, however, in the way the referents of this information are interpreted. In the following, names of individuals are set in bold face with lower-case initials, in contrast to class names, which are set in italics with a leading upper-case character.
Symbols that include white space are enclosed in single quotes.
In order to test the fitness of these models, four competency questions (CQs) were formulated in natural language and then reformulated as DL queries (cf. Table 3) in order to emulate typical query operations over ontologies and databases performed by biomedical researchers. Q1 aims at retrieving biological processes in which certain proteins participate; Q2 retrieves the cellular component(s) a given organism includes, together with the proteins found in them.
Q3 retrieves proteins recorded as participants in biological processes in a given organism. Finally, Q4 retrieves organisms able to exhibit a specific phenotype. Table 1 represents the typical structure of the data analyzed in this work.
It is categorized and organized according to the structure of its source: this structure was imported from UniProt and expanded with mappings to Ensembl via identifiers. Even if all terms from the database are understood, there are still numerous open questions regarding the precise meaning of such a database record. We fill this gap by eliciting the necessary implicit knowledge from a domain expert familiar with the process of database population, performing an in-depth ontological analysis along the lines of Gangemi et al.
This analysis begins with the formal categorization of relations and basic classes under a suitable upper-level ontology. How are the structural elements of a database related? Which knowledge is missing that is required for correctly understanding these relations? Which expressiveness is required to axiomatise the content in a logic-based language in a way that appropriately represents all implicit and explicit content? Which additional entities need to be included in the ontology? Which compromises and simplifications may be needed?
Which propositions are categorical, and which ones are dispositional? When it comes to an ontology-based representation of database content as exemplified in Table 1, we face three interpretation challenges: (i) the data points and column headers, (ii) the relation between the data points and the column headers, and (iii) the relations among the columns.
Task (i) is facilitated by the fact that many of the content terms are already represented in biomedical ontologies like GO. Besides, the natural language terms used as field labels can easily be aligned to content from other ontologies. In our case, most field labels could be aligned with BTL2. Task (ii) will normally be accounted for by the subclass or instantiation relation: the content terms denote classes or instances of the class denoted by the field label.
Task (iii) requires reference to the implicit knowledge a scientist is likely to have. In the following, we investigate four different approaches for representing the meaning of the content and structure of biological databases: representation as individuals (IND); representation as defined, maximally fine-grained subclasses, seen as referents of the information entities in the database (SUBC); representation by means of dispositions (DISP); and a hybrid of SUBC and DISP (HYBR). The first representation is motivated by the fact that a database entry is about a concrete experiment, in which individual entities in space and time are described.
This view is agnostic with respect to whether the observed phenomena are manifestations of natural laws or not. We are aware that only collections of molecules, and never single molecules and activities thereof, are observed [22]. However, assuming that the observation of the behaviour of collective individuals allows us to deduce what happens at the level of individuals (as done when describing chemical reactions or biochemical pathways with symbols denoting single molecular entities), we here populate the ABox with single, non-collective sample entities and the relations among them.
Index numbers are assigned arbitrarily. In the following we describe our interpretation approach. For instance, individual protein molecules in individual organisms are active in processes. We also introduce instances for protein molecules that participate in process instances within an organism: protein molecules participate, within a particular organism, in process instances.
Whenever the database fields for processes, molecules, or cell components have more than one entry, the database unfortunately leaves open which processes involve which molecules and where they are located. Ideally, this information might be retrieved from other sources. Otherwise, a relation between individual processes and the molecules participating in them can be expressed by referring to an appropriate process individual bp and an appropriate individual molecule m. An analogous strategy is possible to express the participation of cell components in processes.
There are organisms with specific phenotypes in which there is a protein of a certain type which is, however, dysfunctional. Dysfunctionalities can be represented as qualities, here also expressed as an individual d. For these data to be interpreted in a DL context, ABox entities in this scenario are to be understood as arbitrary individuals that participate in a specific experiment. For the sake of simplicity, new terms for individuals are created for each assertion that can be derived from the database. Another simplifying assumption of this approach is that all database terms are non-empty.
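The IND reading described above can be sketched as a small generator of ABox assertions. This is a minimal illustration under stated assumptions: the relation names `is included in` and `is participant in`, the record layout and the naming of individuals (`org1`, `prot1`, `bp1_1`) are hypothetical choices for this sketch, not the authors' actual encoding.

```python
# Hypothetical relation names, used here only for illustration.
INCLUDED_IN = "is included in"
PARTICIPANT_IN = "is participant in"


def abox_assertions(record, index):
    """Create fresh individuals for one database record and return
    (subject, predicate, object) assertions about them."""
    org = f"org{index}"    # an individual organism
    prot = f"prot{index}"  # an individual protein molecule
    triples = [
        (org, "Type", record["organism"]),
        (prot, "Type", record["protein"]),
        (prot, INCLUDED_IN, org),
    ]
    # One fresh process individual per process class listed in the record.
    for k, process_class in enumerate(record["processes"], start=1):
        bp = f"bp{index}_{k}"
        triples.append((bp, "Type", process_class))
        triples.append((prot, PARTICIPANT_IN, bp))
    return triples


records = [
    {"organism": "Org1", "protein": "Prot1", "processes": ["BProc1", "BProc2"]},
]
assertions = [
    triple
    for i, record in enumerate(records, start=1)
    for triple in abox_assertions(record, i)
]
for triple in assertions:
    print(triple)
```

Because every record yields new individuals, the ABox grows linearly with the number of records and field entries, and each individual carries the existential claim discussed in the text.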
Each information-content individual in the database needs to represent an existing individual involved in the experiment. This is, of course, problematic if the data is wrong due to curation errors, or if the biological processes recorded did not really happen. The second approach interprets database terms as referring to maximally fine-grained defined classes. The naming of these new subclasses follows strict naming criteria as exemplified below.
This is important for extracting the original class names from the subclass names, because only the former are of interest for querying. For instance, the database represents a protein class Prot 1 that is connected with an organism class Org 1 and a bioprocess class BProc 1. We leave open whether these defined classes are empty. In a way, defined classes are nothing more than logical artefacts.
For this reason, the creation of such defined OWL classes has a modest ontological engagement. Nevertheless, these defined classes can serve as the referents of the data instances [27]. In order to fully incorporate the idea that database entries are individuals that refer to classes by means of annotations, we create a description logic formula for each database entity.
Bearing this representation in mind, querying can be limited to the expression in parentheses. In the following, the modelling patterns are given for proteins, organisms, small molecules, biological processes and phenotypes. Here, the index variable i denotes a record in which one field holds a single entry; the other fields may be multiply filled.
Proteins: We introduce classes for dysfunctional proteins as well as for organism-specific proteins and their combination. Specifically, subclasses are created to represent the possible links among classes denoted by annotations within a record. In addition, subclasses are introduced for phenotypes, processes, cell components and molecules.
Organisms: Classes are introduced for organisms with proteins in general, and for organisms with organism-specific proteins in particular. The latter are also specialized by phenotypes, processes and molecules.
Small molecules: We introduce classes for small molecules contained in organisms, and further specify these classes by stating the type of the proteins with which these small molecules interact.
Processes: Subclasses are introduced for the participating proteins which are included in a certain type of organism.
Phenotypes: Subclasses are introduced for associated dysfunctional proteins and their respective organisms.
The querying strategy for this representation model is to check whether specific subclasses are retrieved or not.
A disadvantage of the SUBC interpretation is that it requires the introduction of classes that are not to be found in the ontologies used for annotation, such as GO or PRO, and that these classes are retrieved by the above query. For querying purposes, their superclasses must be identified. This requires some post-processing of the results, as explained below. Thus, subclasses for all types of entities referred to in a database are created, which is highly prolific, because every possible association of entries in table fields must be combined into a new defined class.
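The combinatorial growth and the post-processing step mentioned above can be sketched as follows. The separator `__`, the record layout and both helper functions are assumptions made for this illustration; the paper's actual naming criteria are not reproduced in this text.

```python
from itertools import product

SEP = "__"  # hypothetical separator between original class names


def defined_subclass_names(record):
    """Return one constructed class name per combination of entries
    across the fields of a record."""
    fields = [
        [record["protein"]],
        [record["organism"]],
        record["components"],
        record["processes"],
    ]
    return [SEP.join(combo) for combo in product(*fields)]


def original_class_names(constructed_names, wanted):
    """Post-processing: recover the source-ontology class names of
    interest from the constructed names, preserving first-seen order."""
    found = []
    for name in constructed_names:
        for part in name.split(SEP):
            if part in wanted and part not in found:
                found.append(part)
    return found


record = {
    "protein": "Prot1",
    "organism": "Org1",
    "components": ["Comp1", "Comp2"],
    "processes": ["BProc1", "BProc2", "BProc3"],
}
names = defined_subclass_names(record)
processes = original_class_names(names, set(record["processes"]))
print(len(names))
print(processes)
```

With two cell components and three processes, a single record already yields six defined classes, which illustrates why the text calls the approach prolific. A strict, unambiguous separator is what makes the recovery of the original names a simple string operation.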
In the representational patterns IND and SUBC, database entries were seen as observations about individuals, either represented as existing ABox entities or as specific, potentially empty, subclasses. Whereas IND makes strong existential claims, stating that the content of a field is interpreted as representing an actually existing biological individual, the ontological engagement of SUBC is more modest, as it allows empty classes although non-denoting database entries are rather the exception than the norm.
In contrast, the DISP pattern goes a step further, assuming that the database content has been created to give insights into scientific regularities in the sense that all members of a class have a disposition to behave in a certain way, thus exhibiting a law of nature. To ascribe a disposition for a certain process P to an object m does not imply that m actually and at all times participates in an instance of P.
It implies only that the physical structure of m allows m to participate in processes of the type P.
The proposed modelling pattern in DL follows [29]. The bearers of dispositions are independent continuants [19, 20]. Thus, possible bearers of dispositions are, in our case, organisms, proteins, small molecules and cell components. Dispositions are, then, ascribed to organism-specific proteins within certain cellular components. We introduce dispositions to perform biological processes that have certain kinds of molecules as output.
Here is the general pattern. However, this restriction is rather weak due to the disjunction, which may leave room for several classes to be added. As a rule, dispositions have realisation conditions. The realisation of the disposition of a protein to participate in a given biological process depends, among other things, on the chemical environment within the organism and the cell component. Our interpretation of the example is that the ability to exhibit a certain pathological phenotype is attributed to organisms in virtue of having a dysfunctional protein.
Again, the table does not tell us which kind of dysfunction affects which kind of process, resulting in which phenotype. Formally, we could characterize a class of small molecules as bearing dispositions in the following way: Mol 1 or Mol 2 or … or Mol k. As we said, dispositions could theoretically also be ascribed to cell components, as these are also independent continuants. However, according to the shared background assumptions of biologists, cellular components are not participants in, but only the locations of, the biomolecular processes under scrutiny.
That an entity bears a disposition of being the arena in which a process might take place would require the extension of either the notion of disposition or the notion of participation. Therefore, we refrain from ascribing dispositions to cell components.
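The dispositional reading for proteins discussed in this section can be sketched as a helper that renders a class-level axiom in a Manchester-style notation. The axiom shape (a bearer class related to a disposition whose realisations are all of a given process type) is one common way of writing such statements; the class name `Disposition` and the relation names `is bearer of` and `has realization` are assumptions for this sketch and may not match the pattern of [29].

```python
def quote(name):
    """Enclose symbols that contain white space in single quotes."""
    return f"'{name}'" if " " in name else name


def disposition_axiom(bearer, process,
                      bearer_of="is bearer of",
                      realization="has realization"):
    """Render 'all instances of bearer have a disposition that is only
    realised in processes of type process' as a Manchester-style string.
    Relation and class names are hypothetical."""
    return (
        f"{quote(bearer)} SubClassOf {quote(bearer_of)} some "
        f"(Disposition and {quote(realization)} only {quote(process)})"
    )


axiom = disposition_axiom("Prot1", "BProc1")
print(axiom)
# Prot1 SubClassOf 'is bearer of' some (Disposition and 'has realization' only BProc1)
```

Note that such an axiom does not state that any instance of `Prot1` actually participates in a `BProc1` process; it only ascribes the disposition.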
The patterns above rely on general class inclusions (GCIs), i.e. subclass axioms with complex class expressions on the left-hand side. However, this strategy does not support retrieval purposes, as DL queries only retrieve simple names of classes or individuals, but not complex expressions. To avoid complex class expressions on the left-hand side of GCIs, a feasible approach that supports DL queries on dispositions would require equivalence axioms that introduce a named class for each such expression.
This corresponds to the modelling pattern SUBC and leads to a hybrid approach in which subclass definitions are still needed. The hybrid representation may be preferred as being more parsimonious, which, however, has to be traded off against the increase in DL expressiveness.
We created four DL queries (Q1-Q4; cf. Table 3) to evaluate (i) database content retrieval, using ontologies as query vocabulary; (ii) information completeness; and (iii) DL complexity and decidability. Q1 aims at retrieving biological processes in which certain proteins participate; Q2 aims at retrieving the cellular component(s) a given organism includes, together with the proteins found in them. Q3 aims at retrieving proteins recorded as participants in biological processes in a given organism.
Finally, Q4 aims at retrieving organisms able to exhibit a specific phenotype. For SUBC, query results have to be mapped back to the classes of the source ontologies; this is easily achieved by extracting the original class names from the constructed names of each retrieved class. Results from Q1-Q4 are displayed in Table 4. Variations of these queries can exploit the axiomatic content of the linked ontologies, such as subclass axioms or role restrictions. Expressed as DL queries, these variations would require no or only minor syntactic changes: in Q1, a query could target a number of biological processes by a common ancestor process, or a phase of a certain process provided by GO;
in Q1 and Q3, processes could be clustered by querying for metabolite characteristics; in Q4, phenotypes could be queried through how they are characterised, for instance by certain body locations. Users should choose an interpretation approach that accounts for their respective requirements and fits the computational resources available. With IND, the whole semantic expressivity belongs to the ontology the individuals are imported into; there is no guarantee that this ontology is expressive enough to support reasoning and querying, whereas the patterns provided by SUBC and HYBR come with axioms that fulfil this task.
However, limitations may arise for these approaches due to the nontrivial use of dispositions and to scalability problems, because the reasoning complexity increases with higher expressivity. In these respects, SUBC might be the most parsimonious solution, as it may be less problematic for scaling when applying reasoning and performing queries, at the expense of simulating relations to avoid the complexity that comes with the use of dispositions. Recently, ontology-aided interpretation of databases has emerged as a research topic in the biomedical domain.
In these works, authors suggest a deeper use of ontologies to support interpretation, which goes beyond what is currently done with functional annotations. The representation pattern IND is completely based on single individuals (ABox entities) present in the underlying experimental assays, the results of which are referred to by the database content.
This approach, similarly to ontology population [32], refrains from raising any ontological claim apart from asserting the existence of individuals and relations among them. The ABox entities can then be retrieved by DL queries, but the performance problems of large ABoxes with expressive TBoxes are known [47] and may therefore hamper scalability.
In addition, the assertion of existence is an estimation, because data may exhibit errors, especially when not manually curated. Previously, OWL models have been created in which OWL axioms and assertions were automatically generated from database schemes [33].
These models, however, primarily represent data (information entities) and not the reality denoted by the data. Our approach, in contrast, aims at representing the latter. In addition, relations extracted from databases are semantically idiosyncratic and shallow.
For instance, database integration following the Ontology-Based Data Access (OBDA) [34] approach relies on a limited set of ontological relations that are provided by ontologies. In OBDA, integration relies on connecting information present in databases with ontologies, without discussing which interpretation of the data is more appropriate. In practice, OBDA enables the user to retrieve individuals from a database virtually.
Such interpretation issues may not be so relevant for daily database usage. Reasoning is crucial for validating content interpreted according to the semantics provided by ontologies, which frequently employ OWL. Ceusters and Smith [40] describe an approach called Referent Tracking, which is mainly devoted to the identification of individuals from Electronic Health Records (EHR).
Referent tracking is based on the generation of triples in order to record how individuals are related to each other within a specific context. This approach is similar to our IND strategy, but equally affected by the problems of non-referring representational units [41]. The domain upper-level ontology BTL2 had been created with the purpose of enforcing temporal contexts for continuant individuals [15]. The inability to represent non-denoting database information was addressed by the SUBC modelling patterns, which created a defined subclass for each putative referent.
Our approach for this modelling is agnostic to whether such classes are instantiated or empty, as their only rationale is to act as referents of information entities in the database. Therefore, this representation can in a way be considered ontologically neutral in the sense that we only describe potentially instantiated classes without being committed to the actual existence of any instances. On many occasions, researchers already use ontology terms in biological databases to express relations among classes, such as that in certain types of organisms, certain biological processes are performed by or with the aid of certain proteins.
In such cases, the SUBC modelling is more natural and will reflect the observed reality. However, one has to deal with a problem that so often appears in the area of knowledge representation, known as the frame problem.
When one ascribes a certain logical property to a class, it means that all members should possess it. But in biology, there are always exceptions and variations that arguably falsify universal statements about classes. The usefulness of a SUBC approach has been proven in practice in the realms of knowledge representation applications; nevertheless, proposals to accommodate exceptions [ 42 ], modal [ 43 ], and even probabilistic, fuzzy solutions [ 44 ] have appeared both in KR and DL [ 45 , 46 ].
This is possible by introducing dispositions. The DISP approach may be considered ontologically problematic, as it is quite promiscuous in ascribing dispositions at the class level. What is observed in an experiment is the outcome of a particular process, which might be a collective process. From the observation of the outcome, it is inferred that a particular process happened, which supports the assumption that the participating particulars had the disposition to participate in such a process.
The problem lies in the extrapolation from the observation of a single case to all members of a certain class — such inductive inferences are notoriously difficult. They may be quite safe when describing the behaviour of small molecules: knowing that one particular molecule has a certain disposition, we can quite safely assume that other molecules of the same kind share this disposition, as we can think of no intrinsic property that could make a difference here. However, on the biological level, systems are much more complex.
If a gene defect in a certain individual organism increases the risk for, e.g., […]. We would, that is, not be justified in ascribing an increased diabetes risk to the latter population, though we would be justified in ascribing them a certain tendency to do so [19]. The fact that the class inclusion axioms proposed in DISP to introduce conditions are not suitable for DL querying brings the second and the third modelling approaches closer together, in the sense that the latter also benefits from fully defined subclasses. Therefore, the combination of these two modelling styles (HYBR) proved to yield the best retrieval results with all four competency questions.