GlyTouCan Linked Data

The following is an example of how the tools we created were used to maintain the linked data currently residing in GlyTouCan. A comprehensive list of these tools is available below.

Handling RDF with Care

The quadstore chosen to manage the linked data was OpenLink’s Virtuoso. The open-source version 7.2 was compiled and set up to run in all of our environments (please refer to the infrastructure section). One of the first steps was to copy the glyspace database into the quadstore as RDF. To accomplish this, a batch program, gs2virt, was written. It uses both JDBC and Jena to read the PostgreSQL database and transfer the data into Virtuoso. Once this batch is complete, base GlycoRDF data with GlycoCT sequences is available from the Virtuoso endpoint. The GlycoCT strings are stored in the GlycoSequence class, for example:

<http://rdf.glycoinfo.org/glycan/G64632PP/glycoct>
        a                              glycan:glycosequence ;
        glycan:has_sequence            "RES\n1b:x-dglc-HEX-1:5\n2s:n-acetyl\n3b:b-dglc-HEX-1:5\n4s:n-acetyl\n5b:b-dman-HEX-1:5\n6b:a-dman-HEX-1:5\n7b:a-dman-HEX-1:5\n8b:a-dman-HEX-1:5\n9b:a-dman-HEX-1:5\n10b:a-dman-HEX-1:5\nLIN\n1:1d(2+1)2n\n2:1o(4+1)3d\n3:3d(2+1)4n\n4:3o(4+1)5d\n5:5o(3+1)6d\n6:6o(2+1)7d\n7:5o(6+1)8d\n8:8o(3+1)9d\n9:8o(6+1)10d\nUND\nUND1:100.0:100.0\nParentIDs:7|9\nSubtreeLinkageID1:o(2+1)d\nRES\n11b:a-dman-HEX-1:5"^^xsd:string ;
        glycan:in_carbohydrate_format  glycan:carbohydrate_format_glycoct .
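As a rough illustration of what such a record involves, the sketch below builds the same Turtle shape from an accession number and a sequence string. It is written in Python purely for illustration; gs2virt itself uses Jena, and the function names `turtle_literal` and `glycosequence_turtle` are hypothetical. The part worth noting is the escaping: GlycoCT is a multi-line format, so line breaks must be written as `\n` inside the literal.

```python
def turtle_literal(text: str) -> str:
    """Escape a string for use as a double-quoted Turtle literal."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return f'"{escaped}"^^xsd:string'


def glycosequence_turtle(accession: str, sequence: str, fmt: str = "glycoct") -> str:
    """Render one GlycoSequence resource in the layout shown above.

    The subject URI pattern is copied from the GlycoCT example; prefix
    declarations (glycan:, xsd:) are assumed to be emitted elsewhere.
    """
    subject = f"<http://rdf.glycoinfo.org/glycan/{accession}/{fmt}>"
    return (
        f"{subject}\n"
        f"        a                              glycan:glycosequence ;\n"
        f"        glycan:has_sequence            {turtle_literal(sequence)} ;\n"
        f"        glycan:in_carbohydrate_format  glycan:carbohydrate_format_{fmt} .\n"
    )
```

Calling `glycosequence_turtle("G64632PP", sequence)` with the raw multi-line GlycoCT text yields a block of the same form as the example above.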
        

However, functionality such as a motif search (a substructure search) requires the structural components of each sequence to be extracted into RDF. Another issue is the variety of structures within the repository: ambiguous linkages or residues and repeated sequences are not handled by all glycan structure RDF formats. The fastest method that could handle these types was the wurcsRDF format. Thus the end goal was to have a wurcsRDF dataset generated for each registered sequence. How and why the wurcsRDF format was generated is beyond the scope of this article.

In order to extract wurcsRDF, all of the GlycoCT sequences first need to be converted into WURCS format. The glytoucan batch program was written to retrieve data using SPARQL, manipulate the data, and then export it back into the quadstore. By applying the proper SPARQL to retrieve the sequences and converting them with the GlycoCTtoWURCS library, we were able to generate GlycoSequence instances again, this time specifying WURCS as the format:

<http://www.glycoinfo.org/rdf/glycan/G64632PP/wurcs>
        a                              glycan:glycosequence ;
        glycan:has_sequence            "WURCS=2.0/4,9,8/[x2122h-1x_1-5_2*NCC/3=O][12122h-1b_1-5_2*NCC/3=O][11122h-1b_1-5][21122h-1a_1-5]/1-2-3-4-4-4-4-4-4/a4-b1_b4-c1_c3-d1_c6-f1_d2-e1_f3-g1_f6-h1_i1-e2|g2"^^xsd:string ;
        glycan:in_carbohydrate_format  glycan:carbohydrate_format_wurcs .

This was very straightforward thanks to the GlycoRDF ontology definition.

Once WURCS conversion is complete, a wurcsRDF TTL file can quickly be generated by passing each WURCS sequence into the wurcsToRDF class of the wurcsframework library. How this is generated, and the substructure query necessary to retrieve the relevant structures, will be explained in a separate article. After the wurcsRDF TTL file is loaded, a more complex batch process is required to find motifs:

Retrieve the list of motifs (those Saccharides with a Motif definition):

PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan:  <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?Saccharide ?PrimaryId ?Sequence
WHERE {
?Saccharide glytoucan:has_primary_id ?PrimaryId .
?Saccharide a glycan:glycan_motif .
?Saccharide glycan:has_glycosequence ?GlycoSequence .
?GlycoSequence glycan:has_sequence ?Sequence .
?GlycoSequence glycan:in_carbohydrate_format glycan:carbohydrate_format_wurcs
}
ORDER BY ?PrimaryId

This query can be executed on the GlyTouCan endpoint.

For each motif structure:
- execute a substructure query against the entire list of structures in the wurcsRDF, once again using wurcsframework, this time via its wurcs2sparql method.
- retrieve the list of matching structures.
- insert the has_motif predicate between the motif and each matching structure.
Continue until all motifs have been processed.
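The loop above can be sketched as follows. This is a minimal outline in Python, not the glytoucan batch code: the substructure search is passed in as a hypothetical callable `find_containing`, standing in for the query produced by wurcsframework, and the write step is expressed as a SPARQL 1.1 `INSERT DATA` statement. The triple direction (structure `has_motif` motif) is an assumption taken from the query shown in the next listing.

```python
from typing import Callable, Iterable, List, Tuple

GLYCAN = "http://purl.jp/bio/12/glyco/glycan#"


def has_motif_update(structure_uris: Iterable[str], motif_uri: str) -> str:
    """Build an INSERT DATA statement linking each structure to one motif."""
    triples = "\n".join(
        f"  <{uri}> glycan:has_motif <{motif_uri}> ." for uri in structure_uris
    )
    return f"PREFIX glycan: <{GLYCAN}>\nINSERT DATA {{\n{triples}\n}}"


def annotate_motifs(
    motifs: Iterable[Tuple[str, str]],
    find_containing: Callable[[str], Iterable[str]],
) -> List[str]:
    """Return one update statement per motif that matched any structure.

    `motifs` holds (motif URI, WURCS sequence) pairs, as returned by the
    motif listing query. `find_containing` is a hypothetical stand-in for
    the substructure search and returns the URIs of matching structures.
    """
    updates = []
    for motif_uri, wurcs in motifs:
        matches = sorted(find_containing(wurcs))  # sorted for stable output
        if matches:  # skip motifs with no hits; empty INSERT DATA is pointless
            updates.append(has_motif_update(matches, motif_uri))
    return updates
```

Each returned string could then be sent to the quadstore's update endpoint; how that is done in the actual batch is not covered here.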

With this, the has_motif predicate is now available, so a standard query can be used on any structure to retrieve the motifs within it:

PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan:  <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?Saccharide ?Motif
WHERE {
?Saccharide a glycan:saccharide .
?Saccharide glytoucan:has_primary_id "G99973YH" .
?Saccharide glycan:has_motif ?Motif .
}
ORDER BY ?Motif

The results of this query can be viewed on the GlyTouCan endpoint.
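To run the same query from a program, the standard SPARQL protocol can be used: the query text is sent as a `query` parameter, and the results are requested as JSON. The sketch below uses only the Python standard library. The endpoint URL is deliberately left as a parameter because it is not given in this article, and the helper names are hypothetical.

```python
from typing import List
from urllib.parse import urlencode
from urllib.request import Request

QUERY_TEMPLATE = """PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX glytoucan:  <http://www.glytoucan.org/glyco/owl/glytoucan#>
SELECT DISTINCT ?Saccharide ?Motif
WHERE {{
?Saccharide a glycan:saccharide .
?Saccharide glytoucan:has_primary_id "{primary_id}" .
?Saccharide glycan:has_motif ?Motif .
}}
ORDER BY ?Motif"""


def motif_query(primary_id: str) -> str:
    """Fill the accession number into the query shown above."""
    if not primary_id.isalnum():  # guard against query injection
        raise ValueError(f"unexpected accession number: {primary_id!r}")
    return QUERY_TEMPLATE.format(primary_id=primary_id)


def sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def motif_uris(results: dict) -> List[str]:
    """Extract the ?Motif values from a parsed SPARQL JSON result document."""
    return [row["Motif"]["value"] for row in results["results"]["bindings"]]
```

A request built this way can be sent with `urllib.request.urlopen`, and the response body parsed with `json.load` before being passed to `motif_uris`.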

List of Tools

The following is a complete listing of the tools used or created for the site.

OpenLink Virtuoso

OpenLink’s Virtuoso is available with both commercial and open-source licenses. It provides not only a triple/quad store for RDF but also a variety of other services.

homepage
github
serves the glytoucan endpoint

gs2virt

gs2virt was created to convert data from a database into RDF. It first queries the existing RDF data to determine which data remains to be inserted. It then executes an SQL SELECT statement to retrieve that data from the database using JDBC. The data is then added to Apache Jena Model classes to be converted into RDF and stored in Virtuoso.
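The "check what is already loaded, then select the rest" step can be sketched as below. This is only an illustration of the idea in Python: `sqlite3` stands in for PostgreSQL over JDBC, the `glycan` table and its columns are hypothetical, and the accession values are made-up placeholders. The set of already-loaded accessions is assumed to come from a prior SPARQL query against the quadstore.

```python
import sqlite3
from typing import List, Set, Tuple


def demo_database() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `glycan` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE glycan (accession TEXT PRIMARY KEY, sequence TEXT)")
    conn.executemany(
        "INSERT INTO glycan VALUES (?, ?)",
        [("ACC1", "sequence-1"), ("ACC2", "sequence-2"), ("ACC3", "sequence-3")],
    )
    return conn


def pending_rows(conn: sqlite3.Connection, loaded: Set[str]) -> List[Tuple[str, str]]:
    """Return the (accession, sequence) rows that are not yet in the RDF store.

    `loaded` is the set of accession numbers already present as RDF.
    """
    cursor = conn.execute("SELECT accession, sequence FROM glycan ORDER BY accession")
    return [(acc, seq) for acc, seq in cursor if acc not in loaded]
```

Filtering in application code keeps the sketch short; for a large table, pushing the comparison into the SQL statement would likely be preferable.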

MolecularFramework

The MolecularFramework was a library developed within the EuroCarbDB project. It provides methods to parse, convert, and make calculations on various types of glycan sequence formats.

GlycoCTtoWURCS

An extension of the MolecularFramework was created in order to convert from GlycoCT to WURCS. As explained above, this library was used to generate the GlycoSequence class with a WURCS format.

source code

wurcsframework

The wurcsframework library was used heavily to enrich the data based on structural information. It was used to implement calculations such as mass and, as explained above, motif relationships.

source code

glytoucan batch

Using the logic held within the above libraries, the data already stored as RDF could be enriched if there were a method to modify it easily. Thus a new project was created to enable ETL functionality using SPARQL. The underlying framework is Spring Batch, the Spring Framework’s subproject for batch processing. The libraries were then linked within the glytoucan batch framework to transform and enrich the RDF.
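The read, process, write pattern behind this kind of ETL step can be sketched in a few lines. The outline below is in Python for brevity and is not the glytoucan batch code; it loosely mirrors the chunk-oriented model of Spring Batch, in which items are read, transformed one at a time, and written in groups. The name `run_step` and its parameters are hypothetical. In this setting the reader would be a SPARQL SELECT, the processor a library call such as a format conversion, and the writer a SPARQL insert.

```python
from itertools import islice
from typing import Callable, Iterable, List, Optional, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def run_step(
    reader: Iterable[In],
    processor: Callable[[In], Optional[Out]],
    writer: Callable[[List[Out]], None],
    chunk_size: int = 100,
) -> int:
    """Run one ETL step and return the number of items written.

    A processor result of None filters the item out, so inputs that cannot
    be converted are simply skipped rather than written.
    """
    items = iter(reader)
    written = 0
    while True:
        chunk = list(islice(items, chunk_size))  # read up to chunk_size items
        if not chunk:
            return written
        processed = [out for out in map(processor, chunk) if out is not None]
        if processed:
            writer(processed)  # one write call per chunk
            written += len(processed)
```

Writing per chunk rather than per item keeps the number of round trips to the quadstore low while bounding how much work is lost if a single write fails.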

The SPARQL created was based upon the glycoinformatics-specific RDF ontologies such as GlycoRDF and Glytoucan RDF.

Future Work

In preparation for the next release, one of the major items proposed is to have user-editable content. As the data within GlyTouCan grows, the maintenance required will also escalate. Therefore a community-driven framework for data maintenance would improve content quality and give direction to new functionality requirements.

Adding relationships between structures also adds value to the dataset. Enrichment with sub/super structure or isomer relationships is a straightforward process; however, performance analysis will be necessary as the processes could be highly resource intensive. Separately, a great deal of new work is being done based upon the WURCS sequence format, and a transition to using it as the main format for the entire repository is also under discussion.

Finally, the post-registration processes involve many steps in generating the relationships between structures. Currently these processes are run at a batched interval in order to process newly submitted structures. Ideally this should be a real-time process that runs immediately after the sequence is registered; however, the complexity involved and potential resource bottlenecks require more evaluation of how this is to be implemented. This is a high-priority item that will be covered within the first few months after the first release.