11. rPredictor glossary

This is a list of terminology related to rPredictor. It explains how certain terms are used in the context of rPredictor, not what they mean in general, because, sadly, not all terminology is used consistently across bioinformatics sites.

11.1. Biology

11.1.1. Nucleotide

A nucleotide is a basic building block of nucleic acids such as RNA. The basic nucleotides are adenine, cytosine, guanine, thymine and uracil (denoted A, C, G, T, U). Of those, T occurs only in DNA molecules and is replaced by U during transcription (when the messenger RNA molecule “copies” the information from DNA), so RNA molecules are composed of A, C, G and U.

11.1.2. Residue

A residue is a more general term than nucleotide: nucleic acid residues are nucleotides, while protein residues are amino acids. The scheme used to describe a structure in PDB, which is the de facto standard, is model -> chain -> residue -> atom, from the highest to the lowest level of description.

11.1.3. Sequence

A sequence is the string that assigns a nucleotide to each position of a molecule. It looks like ‘AAUGUUGACCGUGGACAG...’. Sequences are most often represented in the FASTA format, although many other formats exist.

11.1.4. Primary structure

Synonymous with sequence.

11.1.5. Secondary structure

The description of a nucleic acid molecule on the level of base pairs. For each position in the molecule's sequence, the secondary structure says whether the nucleotide at that position is paired or not. If it is paired, the secondary structure also gives the position of the nucleotide to which it is paired. This includes pseudoknots, non-canonical base pairs and anything else that can be described in terms of base pairs, although some websites (notably the Comparative rRNA Web) refer to these base pairings as “tertiary interactions”.

The secondary structure is typically represented either in a dot-paren format, or as a list of base pairs (optionally with some additional information). The two most common base pair list formats are called *.bpseq and *.ct.
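To illustrate how the two representations relate, here is a minimal Python sketch that converts a dot-paren string into a list of base pairs. It is not rPredictor code: the function names are hypothetical, only plain `(`, `)` and `.` symbols are handled (so pseudoknots are out of scope), and the use of `0` for "unpaired" in the partner list is an assumption borrowed from the convention commonly seen in base pair list files.

```python
def dot_paren_to_pairs(structure):
    """Return 1-based (i, j) base pairs with i < j from a dot-paren string."""
    open_positions = []  # stack of positions of still-unmatched '('
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


def pairs_to_partner_list(length, pairs):
    """For each position, give the paired position, or 0 if unpaired (assumed convention)."""
    partners = [0] * length
    for i, j in pairs:
        partners[i - 1] = j
        partners[j - 1] = i
    return partners


if __name__ == "__main__":
    pairs = dot_paren_to_pairs("((..))")
    print(pairs)
    print(pairs_to_partner_list(6, pairs))
```

The stack works because, without pseudoknots, every closing parenthesis pairs with the most recently opened one.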

11.2. (Bio)Informatics

11.2.1. Guide tree

When building a multiple sequence alignment, the first step in many algorithms is to determine the order in which sequences are aligned to each other. This ordering is encoded by the guide tree: a tree graph where edge lengths represent how different the sequences are from each other. The closer the sequences are to each other, the sooner they are aligned. The guide tree can also be used as an estimate of sequence similarities.

If you want to see an example, the ClustalW2 multiple sequence alignment algorithm will generate a guide tree in .dnd format.
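The idea that the closest sequences are aligned first can be sketched with a small average-linkage clustering in Python. This is only an illustration under stated assumptions: `build_guide_tree` is a hypothetical name, the pairwise distances are made up, branch lengths are left out for brevity, and real alignment programs may build their guide trees with a different method.

```python
from itertools import combinations


def build_guide_tree(names, dist):
    """Greedy average-linkage clustering.

    names: list of sequence names.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (nested parenthesised tree string, list of merges in order).
    """
    # Map each cluster label to the tuple of sequences it contains.
    clusters = {name: (name,) for name in names}
    merge_order = []

    def average_distance(a, b):
        pair_list = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(dist[frozenset(p)] for p in pair_list) / len(pair_list)

    while len(clusters) > 1:
        # Pick the two clusters that are closest on average and join them.
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merge_order.append((a, b))
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)

    return next(iter(clusters)), merge_order


if __name__ == "__main__":
    # Made-up distances for three hypothetical sequences.
    distances = {
        frozenset({"s1", "s2"}): 0.1,
        frozenset({"s1", "s3"}): 0.5,
        frozenset({"s2", "s3"}): 0.6,
    }
    tree, order = build_guide_tree(["s1", "s2", "s3"], distances)
    print(tree)
    print(order)
```

Here `s1` and `s2` are merged first because they are the most similar pair, and `s3` joins last.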

11.3. rPredictor-specific

11.3.1. Reference structure

A reference structure is a secondary structure derived from an experimentally verified 3-D rRNA structure. See the question “How do you get reference structures?” in the Frequently Asked Questions.

11.3.2. Region

A region of a molecule is a set of adjacent residues. However, adjacency of residues is a non-trivial term: usually, it means residues connected to each other by the sugar-phosphate backbone, but the backbone can sometimes be broken and we still use the term “region”, even if it includes the break. In rPredictor, for all intents and purposes, adjacency is defined by whichever numbering scheme is chosen: residues numbered X and X+1 are adjacent.

Regions include their bounds: a region 5-7 will include residues 5, 6 and 7.
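The inclusive bounds can be made concrete with a small Python sketch. The `Region` class is hypothetical and is not part of rPredictor; it only assumes that residues are identified by consecutive integers in the chosen numbering scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A run of adjacent residues; both bounds are included."""
    start: int
    end: int

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("region start must not exceed region end")

    def residues(self):
        # range() excludes its upper limit, hence the + 1.
        return list(range(self.start, self.end + 1))

    def __contains__(self, residue_number):
        return self.start <= residue_number <= self.end

    def __len__(self):
        return self.end - self.start + 1


if __name__ == "__main__":
    region = Region(5, 7)
    print(region.residues())
    print(7 in region, 8 in region)
```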

11.3.3. Structural feature

A structural feature is defined as a subset of the structure that fulfills some constraints on the base pairs it contains (and typically is a maximal such subset). For instance, an internal loop is a set of two intervals in the structure such that the 5’-end of one is paired with the 3’-end of the other and vice versa and no other residues are paired. The structural features are defined to represent some common elements of secondary structures, such as helices, hairpin loops, etc. For definitions and examples of structural features, see Structural features.

11.3.4. Dot-paren file string

A dot-paren file string is the content of a dot-paren file. The dot-paren file format is a way of storing secondary structure information. It looks like this:

>FASTA header of a sequence
AAACGCUAGCAGGAGUGCUUUGCACCGGAGAUCUCUGGAUAAGCACGGCGCGCAUCUCAGGAC
...(.((...))).((((((....((((((..))))))..))))))(.(..)).(((..))).

The first line is a FASTA header, the second line is the sequence and the third line is a dot-paren representation of the secondary structure of that sequence.
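As a rough illustration of the three-line layout described above, the following Python sketch splits a dot-paren file string into its parts and checks that the sequence and the structure have the same length. The function name `parse_dot_paren_string` and the short example molecule are hypothetical; how rPredictor itself validates such input is not specified here.

```python
def parse_dot_paren_string(text):
    """Split a dot-paren file string into (header, sequence, structure)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-empty lines, found {len(lines)}")
    header, sequence, structure = lines
    if not header.startswith(">"):
        raise ValueError("the first line must be a FASTA header starting with '>'")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # Drop the leading '>' so only the header text is returned.
    return header[1:].strip(), sequence, structure


if __name__ == "__main__":
    example = ">hypothetical example\nGGGAAACCC\n(((...)))\n"
    print(parse_dot_paren_string(example))
```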