3. The Data of rPredictor

This section documents the main part of rPredictor’s backend that stores the data - the rDB database.

3.1. General information

The rDB is architected as relational (or object-relational) database. It is implemented in Postgresql 9.5. However, the principles are the same for most of enterprise RDBMS (ie. Oracle or SQL Server) so it could be maintained using some other system as well. Of course it has to be compatible with other parts of the rPredictor system.

3.2. Database schema

The database can be divided into 4 logical parts. This does not affect the database schema on a syntactic level, but it is important for understanding its semantics. The parts are:

rDB logical parts  
Primary structures  
Secondary structures  
Annotations  
Miscellaneous  

All database tables belong to the schema called rpredict.

Complete database diagram:

3.2.1. Naming conventions

  • Names of entities are in plural.
  • Strong entities have self-describing names.
  • Weak entities representing many-to-many relations have dash separated two words names. Both of words are names of tables in the relation.

3.2.2. Primary structures

3.2.2.1. Tables overview

Table name  
rpredict.Accessions  
rpredict.Qualities  
rpredict.Sequences  
rpredict.Taxonomy  

3.2.2.2. Tables details

3.2.2.2.1. rpredict.Accessions

Contains sequence identifiers.

Column name Type Note
id INTEGER Internal identifier
accession VARACHAR(255) Biological identifier
start INTEGER Sequence region start
stop INTEGER Sequence region stop
state INTEGER Allows to classify items
dataset VARCHAR(50) Source dataset
insertion TIMESTAMP Timestamp when sequence has been imported
fk_taxonomy INTEGER Reference to taxonomy
fk_annotation INTEGER Reference to sequence’s annotation
3.2.2.2.2. rpredict.Qualities

Contains information about sequence quality.

Column name Type Note
id INTEGER Internal identifier
fk_accession INTEGER Reference to accession
annotation_source VARCHAR(255) The source(s) where ENA took this data from
sequence_quality NUMERIC(20,7) SILVA quality indicator (combination of ambig./homopol./vec.contam.)
ambiguities NUMERIC(20,7) % of ambiguously sequenced nucleotides
homopolymers NUMERIC(20,7) % of homopolymers longer than 4 nt.
vector_contamination NUMERIC(20,7) % of sequence contaminated by vector
alignment_quality NUMERIC(20,7) Conservancy indicator
base_pair_score INTEGER SILVA-specific: possible base pairs in common with SILVA ref. structure
aligned_bases INTEGER Number of seq. bases aligned to within rRNA gene.
pintail_quality INTEGER Confidence that sequence is not anomalous
state INTEGER Allows to classify items
insertion TIMESTAMP Timestamp when quality has been imported
3.2.2.2.3. rpredict.Sequences

Contains strings of rRNA nucleotide sequences.

Column name Type Note
id INTEGER Internal identifier
fk_accession INTEGER Reference to accession
sequence TEXT String of nucleotide sequence
3.2.2.2.4. rpredict.Taxonomy

Contains taxonomy tree. It is implemented as a tree using self-reference to parent node. It is possible there is no one root only. More formally said, it is a forest.

Column name Type Note
id INTEGER Internal identifier
level INTEGER Number of edges from a root
fk_parent INTEGER Self-reference to parent node
name VARCHAR(255) Name of current node
path_name VARCHAR(4000) Contains names of all of node’s ancestors recursively from the root. By default separated by a semicolon.
is_specie BOOLEAN Flag if the node is a leaf
subtreesize INTEGER Number of leafs in the subtree

3.2.3. Annotations

3.2.3.1. Tables overview

Table name  
rpredict.Annotations  
rpredict.Annotation_keywords  
rpredict.Applicants  
rpredict.Applicants_references  
rpredict.Authors  
rpredict.Authors_references  
rpredict.Features  
rpredict.Keywords  
rpredict.Qualifiers  
rpredict.References  
rpredict.Xrefs  

3.2.3.2. Tables details

3.2.3.2.1. rpredict.Annotations

Contains information about sequence annotation.

Column name Type Note
id INTEGER Internal identifier
molecule_type VARCHAR(4000) Sequence source molecule type (rRNA, mRNA, DNA, etc.)
topology VARCHAR(4000) Whether the source is linear, or circular RNA
data_class VARCHAR(4000) ENA data class annotation
version INTEGER Version of the record in ENA database
first_public TIMESTAMP Date when it was first published
last_updated TIMESTAMP Date of the last update
last_updated_release VARCHAR(255) Identifier of the last release within this record has been updated
description TEXT Often contains 16S/23S/... info
comment TEXT Author’s comment
taxonomic_division VARCHAR(4000) ENA taxonomic division annotation
3.2.3.2.2. rpredict.Annotation_keywords

Weak entity connecting annotations and keywords.

Column name Type Note
fk_annotation INTEGER Reference to annotation
fk_keyword INTEGER Reference to keyword
3.2.3.2.3. rpredict.Applicants

Contains list of patent applicants mentioned in references.

Column name Type Note
id INTEGER Internal identifier
name VARCHAR(4000) Name of the applicant
3.2.3.2.4. rpredict.Applicants_references

Weak entity connection applicants and references.

Column name Type Note
fk_applicant INTEGER Reference to applicant
fk_reference INTEGER Reference to annotation reference
3.2.3.2.5. rpredict.Authors

Contains list of authors mentioned in annotation references.

Column name Type Note
id INTEGER Internal identifier
name VARCHAR(4000) Name of the author
3.2.3.2.6. rpredict.Authors_references

Weak entity connection authors and references.

Column name Type Note
fk_author INTEGER Reference to author
fk_reference INTEGER Reference to annotation reference
3.2.3.2.7. rpredict.Features

Unions some properties of a region of a nucleotide sequence (part of a gene).

Column name Type Note
id INTEGER Internal identifier
name VARCHAR(4000)  
location VARCHAR(255) The sequence region
fk_annotation INTEGER Reference to annotation
3.2.3.2.8. rpredict.Keywords

Contains keywords mentioned in annotation references.

Column name Type Note
id INTEGER Internal identifier
keyword VARCHAR(255) Keyword
3.2.3.2.9. rpredict.Qualifiers

Contains key-value pairs with sequence’s feature’s information.

Column name Type Note
id INTEGER Internal identifier
key VARCHAR(255) Key
value TEXT Value
fk_feature INTEGER Reference to sequence feature
3.2.3.2.10. rpredict.References

Contains information about references (ie. an article, paper,...).

Column name Type Note
id INTEGER Internal identifier
title VARCHAR(4000)  
consortium VARCHAR(4000)  
submission_date TIMESTAMP  
journal VARCHAR(4000)  
year INTEGER  
volume VARCHAR(4000)  
issue VARCHAR(4000)  
first_page VARCHAR(255)  
last_page VARCHAR(255)  
comment TEXT  
reference_location TEXT  
type VARCHAR(255) Contains: patent, thesis, unpublished, article, submission, book
number INTEGER  
location VARCHAR(4000)  
fk_annotation INTEGER Reference to annotation
3.2.3.2.11. rpredict.Xrefs

Contains data about references (links) to external databases. Reference can be stored for annotaion (reference to a DB) or annotation reference (journal, webpage) as well.

Column name Type Note
id INTEGER Internal identifier
db VARCHAR(255) Database name
db_id VARCHAR(255) Id in destination database
secondary_id VARCHAR(255) Secondary id in destination DB
fk_annotation INTEGER Reference to annotation
fk_reference INTEGER Reference to annotation reference

3.2.4. Secondary structures

3.2.4.1. Tables overview

Table name  
rpredict.Predictions  
rpredict.Structures  
rpredict.Toverhangs  
rpredict.Toverhangs_positions  
rpredict.Foverhangs  
rpredict.Foverhangs_positions  
rpredict.Helices  
rpredict.Helices_positions  
rpredict.Bulges  
rpredict.Bulges_positions  
rpredict.Loops  
rpredict.Loops_positions  
rpredict.Junctions  
rpredict.Junctions_positions  
rpredict.Hairpins  
rpredict.Hairpins_positions  

3.2.4.2. Tables details

3.2.4.2.1. rpredict.Predictions

Contains information about secondary structure prediction.

Column name Type Note
id INTEGER Internal identifier
fk_accession INTEGER Reference to accessions
algorithm VARCHAR(255) Algorithm used for prediction
template VARCHAR(255) Template used for prediction
similarity NUMERIC(20,7) Sequence similarity to the template
z_score NUMERIC(20,7) Quality of RNAdistance score with respect to a random sequence
3.2.4.2.2. rpredict.Structures

Contains rRNA secondary structures as strings in dot-paren format.

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
structure TEXT Secondary structure in dot-paren string
3.2.4.2.3. rpredict.Foverhangs

Contains instances of 5-overhangs.

Note

All of following tables are containing info about structural features. Informally said, structural feature is a substring of structure. These tables have the same “interface”. No inheritance is maintained so tables are defined in very similar way. The pattern is two tables per structural feature. The first table’s name likes structural feature’s name and represents one instance of the feature in given structure. The other table, with the suffix _positions constains information where in structure this instance of structural feature appears.*

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
3.2.4.2.4. rpredict.Foverhangs_positions

Contains positions of 5-overhangs.

Column name Type Note
id INTEGER Internal identifier
fk_foverhang INTEGER Reference to foverhangs
start INTEGER Start position
stop INTEGER Stop position
3.2.4.2.5. rpredict.Toverhangs

Contains instances of 3-overhangs.

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
3.2.4.2.6. rpredict.Toverhangs_positions

Contains positions of 3-overhangs.

Column name Type Note
id INTEGER Internal identifier
fk_toverhang INTEGER Reference to toverhangs
start INTEGER Start position
stop INTEGER Stop position
3.2.4.2.7. rpredict.Helices

Contains instances of helices.

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
3.2.4.2.8. rpredict.Helices_positions

Contains positions of helices.

Column name Type Note
id INTEGER Internal identifier
fk_helice INTEGER Reference to helices
start INTEGER Start position
stop INTEGER Stop position
3.2.4.2.9. rpredict.Bulges

Contains instances of bulges.

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
3.2.4.2.10. rpredict.Bulges_positions

Contains positions of bulges.

Column name Type Note
id INTEGER Internal identifier
fk_bulge INTEGER Reference to bulges
start INTEGER Start position
stop INTEGER Stop position
3.2.4.2.11. rpredict.Loops

Contains instances of loops.

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
3.2.4.2.12. rpredict.Loops_positions

Contains positions of loops.

Column name Type Note
id INTEGER Internal identifier
fk_loop INTEGER Reference to loops
start INTEGER Start position
stop INTEGER Stop position
3.2.4.2.13. rpredict.Junctions

Contains instances of junctions.

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
3.2.4.2.14. rpredict.Junctions_positions

Contains positions of junctions.

Column name Type Note
id INTEGER Internal identifier
fk_junction INTEGER Reference to junctions
start INTEGER Start position
stop INTEGER Stop position
3.2.4.2.15. rpredict.Hairpins

Contains instances of hairpins.

Column name Type Note
id INTEGER Internal identifier
fk_prediction INTEGER Reference to predictions
3.2.4.2.16. rpredict.Hairpins_positions

Contains positions of hairpins.

Column name Type Note
id INTEGER Internal identifier
fk_hairpin INTEGER Reference to hairpins
start INTEGER Start position
stop INTEGER Stop position

3.2.5. Miscellaneous

This part is somewhat removed from the bionformatical bag of tricks. At this time it contains a single table used as storage for rPredictor’s RSS channel.

3.2.5.1. Tables overview

Table name  
rpredict.rss  

3.2.5.2. Tables details

3.2.5.2.1. rpredict.rss

This table is an underlying storage for rPredictor’s RSS channel.

Column name Type Note
id INTEGER Internal identifier
title VARCHAR(50) Title of the new
description VARCHAR(255) The new’s desription
link VARCHAR(100) Hyperlink to the new
published TIMESTAMP When the new has been published

3.3. Functions

There are some database functions prepared. All of them together with usage are described in the ETL process documentation.

List of the database functions:

  • create_rdb_structures
  • prepareIntermediateStructure
  • importFASTA
  • transformIntermediateData
  • importTaxonomy
  • importQuality
  • importAnnotations
  • importFeatures
  • importQualifiers
  • importReferences
  • importAuthors
  • importApplicants
  • importXrefs
  • importKeywords
  • importStructures
  • importStructuralFeature
  • importStructuralFeatures
  • setTaxonomyAmounts
  • cleanup

3.4. Views

Currently, there is one view - rpredictor.vw_acc_ann_tax. This view is simple left join of table rpredictor.accessions with tables rpredictor.annotations and rpredictor.taxonomy. This results in shorter application code when performing search.

3.5. The subversion repository

The database branch in rPredictor’s repository contains folders named by the purpose of scripts it each contains. The name’s prefix is a number - the order according to the number is also one of correct order of execution it’s contents.

folder description
01_FCEPROC database functions
02_INDICES some index definitions
03_VIEWS definiton of views
09_DEPLOY script with correct execution order

Table Of Contents

Previous topic

2. rPredictor setup

Next topic

4. The ETL layer of rPredictor

This Page