2. rPredictor data and database

This chapter gives a brief idea of what kinds of data are available in rPredictor. It is a short, high-level overview only; for a detailed description of what rPredictor’s dataset offers, see rPredictor record detail.

To talk about the rPredictor data, we first need to clear up some terms.

  • The rPredictor dataset is the set of all retrievable information about rRNA contained in rPredictor.
  • The rDB database is the Postgres database that physically stores the dataset.
  • Some tools (Sequence search) need to transform the dataset into a special format. These transformed, tool-specific databases are also a part of rData.
  • Additionally, the CP-predict2 algorithm uses its own infrastructure, which is a part of rData but not a part of the dataset (the small package of data inside is not directly retrievable and doesn’t include the majority of descriptors that the rPredictor dataset defines).

Here, we will talk only about the contents of the rPredictor dataset, leaving the technical aspects aside. The rDB database is described in the technical part of the documentation: The Data of rPredictor. The dataset representations used by other tools are described in the individual tools’ setup instructions.

2.1. Overview of available information

Note

The individual fields available from the database are described in full in rPredictor record detail.

For an overview of available export formats, see Exporting results.

The dataset generally contains information of the following types:

  • Record information: information that identifies the database record: accession number, start position, stop position, organism name, date of publication, etc. A full description is here: General information.
  • Primary structure: information about the sequence itself - sequence quality, description, taxonomic information, etc. A full description is here: Primary structure.
  • Secondary structure: fields pertaining to the predicted secondary structure of the sequence - the predicted structure in dot-paren notation, a summary of structural features and a visualization of the structure. The tool which was used for predicting the secondary structure is also given. The visualization is just a thumbnail; if you click it, the full-size image will be generated on the fly in a new tab/window. A full description is here: Secondary structure.
  • References: references to scientific literature pertinent to the rPredictor record. A full description is here: References.
  • Features: various other fields defined in the data sources. This section cannot be relied upon to be present everywhere in the same form. A full description is here: Specimen.
  • Xrefs: cross-references to other data sources from which the rPredictor record was assembled. A full description is here: Xrefs.
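
The dot-paren notation mentioned under Secondary structure encodes each unpaired nucleotide as a dot and each base pair as a matching pair of parentheses. As a minimal sketch of how such a string can be read, the following Python function (the name parse_dot_paren is hypothetical and not part of rPredictor) turns it into a list of base pairs. It assumes plain notation with only ".", "(" and ")"; any extended symbols that an exported structure might contain are not handled.

```python
def parse_dot_paren(structure: str) -> list[tuple[int, int]]:
    """Return the base pairs of a dot-paren string as 0-based (i, j) tuples.

    Hypothetical helper for illustration; assumes only '.', '(' and ')'.
    """
    open_positions = []  # stack of positions of '(' not yet closed
    pairs = []
    for pos, char in enumerate(structure):
        if char == "(":
            open_positions.append(pos)
        elif char == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            # The most recently opened '(' pairs with this ')'.
            pairs.append((open_positions.pop(), pos))
        elif char != ".":
            raise ValueError(f"unexpected character {char!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


print(parse_dot_paren("((..((...))..))"))
# prints [(0, 14), (1, 13), (4, 10), (5, 9)]
```

Each tuple gives the positions of two paired nucleotides, so the example structure contains two stacked pairs on the outside and two on the inside, enclosing a three-nucleotide loop.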

2.2. Sources of the rPredictor dataset

There are four external sources of data that are combined in the dataset, together with secondary structure data predicted in-house during ETL (Extraction-Transformation-Load, the process that assembles the dataset into rDB; a detailed description of the process is in section The ETL layer of rPredictor). The external sources are SILVA, Rfam, ENA (European Nucleotide Archive) and the Taxonomy (NCBI).

2.2.1. SILVA

The SILVA database provides the core of rDB for ribosomal RNA: primary structures (nucleotide sequences), their unique identification using accession numbers, and sequence quality measures. In this area, SILVA is well-curated and has a comprehensive quality control system. It also provides taxonomic information for the sequences, but its quality is insufficient for our needs.

The current publication for SILVA is the 2013 article The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. This article describes in detail the meaning of various quality indicators.

2.2.2. Rfam

Rfam provides primary structures for RNA other than ribosomal RNA, together with their accession numbers. Rfam does not provide quality measures as ENA does; however, for each family it provides so-called ‘seeds’ (curated subsets of representative sequences) and a consensus sequence. Thus, we use sequence similarity as a quality measure.
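
The text above does not specify how the similarity is computed. Purely as an illustration of the idea, the sketch below scores a sequence by its percent identity to a family consensus, assuming that both strings are already aligned to the same length and that "-" marks a gap. The function name percent_identity and the scoring rule are assumptions made for this example, not rPredictor's actual measure.

```python
def percent_identity(aligned_seq: str, consensus: str) -> float:
    """Percent of non-gap alignment columns in which both sequences agree.

    Illustrative assumption only: inputs are pre-aligned, '-' is a gap,
    and comparison is case-insensitive.
    """
    if len(aligned_seq) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_seq.upper(), consensus.upper()):
        if a == "-" or b == "-":
            continue  # columns with a gap in either sequence are ignored
        compared += 1
        if a == b:
            matches += 1
    # Avoid division by zero when every column contains a gap.
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGA", "ACGU"))  # prints 75.0
```

A higher value means the sequence is closer to the family consensus; where to draw a cut-off is a separate decision that this sketch leaves open.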

The current publication for Rfam is the 2014 article Rfam 12.0: updates to the RNA families database.

2.2.3. ENA (European Nucleotide Archive)

ENA provides rPredictor with a wealth of additional annotation about each sequence, such as references to scientific literature, classification by source molecule type, and the method of obtaining the sequence. The structure of ENA records is much more complicated than in SILVA (which is relatively flat), because ENA itself integrates data from various sources. (For the purposes of rData, we use the ENA REST API to retrieve only the records of interest.)
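
Retrieving a single record by accession number, rather than downloading the whole archive, might look roughly like the sketch below. The base URL and the format names are assumptions based on the public ENA Browser API and may not match the endpoint used by the rPredictor ETL; the function names and the accession in the example are hypothetical as well.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the public ENA Browser API, not necessarily
# the endpoint that rPredictor's ETL calls.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# Assumption: record formats accepted by this sketch.
SUPPORTED_FORMATS = {"embl", "fasta", "xml"}


def ena_record_url(accession: str, fmt: str = "embl") -> str:
    """Build the URL for one ENA record in the requested format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{ENA_BROWSER_API}/{fmt}/{quote(accession, safe='')}"


def fetch_ena_record(accession: str, fmt: str = "embl", timeout: float = 30.0) -> str:
    """Download one record as text (requires network access)."""
    with urlopen(ena_record_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network; "AB000001" is a placeholder accession.
print(ena_record_url("AB000001"))
```

Keeping URL construction separate from the download makes it easy to check which records would be requested before any network traffic occurs.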

2.2.4. Taxonomy (NCBI)

As its name suggests, the Taxonomy database provides the taxonomic classification for all sequences in our dataset, because the taxonomic information in the primary databases is often discontinuous and inconsistent.

2.2.5. Predicted secondary structures

The fifth source from which the dataset is built is an in-house secondary structure prediction method. For the current release of rPredictor, we use the second version of our custom rRNA secondary structure prediction algorithm to create the predictions. (See CP-predict: a two-phase algorithm for rRNA structure prediction.)

In addition to the predicted structure, a list of structural features is computed for each predicted structure. Structural features describe certain basic “building blocks”: secondary structure motifs of several nucleotides each. (They are described in detail in the section Structural features.)
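
To give a feel for what computing such a feature can involve, the sketch below locates one common motif, the hairpin loop, in a dot-paren string: a run of unpaired nucleotides enclosed directly by one base pair. This is only an illustration under the assumption of plain dot-paren input; the function name hairpin_loops is hypothetical, and the features rPredictor actually computes are those defined in the section Structural features.

```python
import re

# A hairpin loop in dot-paren notation: '(' followed by one or more
# unpaired positions ('.') and then directly by ')'.
HAIRPIN_PATTERN = re.compile(r"\((\.+)\)")


def hairpin_loops(structure: str) -> list[tuple[int, int]]:
    """Return (start, length) of the unpaired run in every hairpin loop.

    Positions are 0-based. Hypothetical helper for illustration only.
    """
    return [
        (match.start(1), len(match.group(1)))
        for match in HAIRPIN_PATTERN.finditer(structure)
    ]


print(hairpin_loops("(((...)))..((....))"))
# prints [(3, 3), (13, 4)]
```

The example structure has two hairpins: one with a three-nucleotide loop starting at position 3 and one with a four-nucleotide loop starting at position 13.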