2 | METHODS
Introducing a new category into CASP requires planning data workflows,
designing formats and technical parameters for new types of models, and
incorporating those into the existing CASP infrastructure. Subsections
2.1-2.4 below describe the implementation details for four new CASP15
categories.
2.1 | RNA structure prediction. Prediction of RNA
structure from nucleotide sequence is a challenging task as RNA molecules,
like proteins, can fold into a wide variety of 3D shapes. Several
research groups have been actively working in this area, and in 2010
Eric Westhof pioneered a CASP-like RNA-Puzzles challenge9 to track the state of
the art in RNA structure prediction and provide a forum for
discussing methodological advances. Over the course of 12 years
(2010-2021) there were 22 evaluated RNA-Puzzles challenges, which
attracted the attention of around 10 returning participants10. In 2022, on the
initiative of Rhiju Das, Eric Westhof and the CASP organizers,
RNA-Puzzles joined forces with CASP, and RNA structure prediction became
a prediction category in CASP15. This helped to expand the target and
predictor base of the RNA-modeling experiment (12 targets and 25 research
groups in CASP15); to stimulate development of new RNA prediction methods
through the exchange of ideas and techniques with the protein prediction
community, where deep learning methods have recently had a significant
impact on modeling accuracy11,12;
to increase the visibility of the field; and to use CASP’s standardized platform
for managing predictions and for evaluating and comparing different
prediction methods.
To incorporate RNA prediction into CASP, we adhered as closely as
possible to the requirements and recommendations of the RNA-Puzzles
experiments 9.
2.1.1 | RNA prediction format (https://predictioncenter.org/casp15/index.cgi?page=format#TS).
Similarly to protein structure prediction, a CASP RNA submission file
starts with the CASP header including format specification code, target
identifier, author identifier, and description of methods used for
modeling. The file can include up to five RNA 3D models, each
enclosed between the MODEL and END keywords. Models are formatted according to
the established standards of the RNA-Puzzles community9:
- 3D coordinates are provided for the complete list of atoms for all
nucleotides from the target FASTA file.
- only natural nucleotides (A, C, G, U) are allowed.
- if present, modified monomeric units are transformed into unmodified
ones by discarding atypical atoms.
- only atoms from the following sets are allowed: (C2, C4, C6, C8, N1, N2, N3, N4,
N6, N7, N9, O2, O4, O6) for nucleobases, and (C1’, C2’, C3’, C4’, C5’,
O2’, O3’, O4’, O5’, OP1, OP2, P) for the sugar-phosphate backbone.
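The residue and atom rules above can be checked mechanically. The following is a minimal sketch, not the RNA-Puzzles or CASP checker: the function name `find_format_violations` is hypothetical, the atom sets are transcribed from the list above (with ASCII primes, as used in PDB files), and the standard fixed-column PDB ATOM layout is assumed. Only chains with numeric IDs are inspected, on the assumption that these are the RNA chains.

```python
# Atom and residue names transcribed from the rules above, using ASCII primes.
# Assumption: confirm these sets against the official CASP15 format page
# (for example, whether base atom C5 is expected) before relying on them.
BASE_ATOMS = {"C2", "C4", "C6", "C8", "N1", "N2", "N3", "N4",
              "N6", "N7", "N9", "O2", "O4", "O6"}
BACKBONE_ATOMS = {"C1'", "C2'", "C3'", "C4'", "C5'", "O2'",
                  "O3'", "O4'", "O5'", "OP1", "OP2", "P"}
ALLOWED_RESIDUES = {"A", "C", "G", "U"}


def find_format_violations(pdb_lines):
    """Return human-readable messages for ATOM records that break the rules."""
    allowed_atoms = BASE_ATOMS | BACKBONE_ATOMS
    problems = []
    for number, line in enumerate(pdb_lines, start=1):
        # Skip non-ATOM records and lines too short to hold the fixed columns.
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        # Assumption: RNA chains carry numeric chain IDs; other chains are skipped.
        if not line[21].isdigit():
            continue
        atom_name = line[12:16].strip()
        residue_name = line[17:20].strip()
        if residue_name not in ALLOWED_RESIDUES:
            problems.append(f"line {number}: residue '{residue_name}' is not A, C, G or U")
        if atom_name not in allowed_atoms:
            problems.append(f"line {number}: atom '{atom_name}' is not in the allowed sets")
    return problems
```

An empty return value means that no violations of these two rules were found; it does not establish that the atom list of each nucleotide is complete.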
In the case of protein-RNA complexes, protein chains are designated with
letters (A, B, C, …) and RNA chains with numbers (0, 1, 2, …).
An example of RNA prediction is provided in Example 3 on the CASP15
format page (https://predictioncenter.org/casp15/index.cgi?page=format).
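The limit of five models per file, each delimited by MODEL and END, can be illustrated with a short parser. This is a sketch under stated assumptions: `split_models` is a hypothetical helper, everything before the first MODEL line is treated as header and ignored, and the keywords are assumed to be the first token on their line.

```python
MAX_MODELS = 5  # upper limit on models per submission file, as stated in the text


def split_models(submission_text):
    """Return one list of lines per model found between MODEL and END."""
    models = []
    current = None  # None while we are outside a MODEL ... END block
    for line in submission_text.splitlines():
        tokens = line.split()
        keyword = tokens[0] if tokens else ""
        if keyword == "MODEL":
            if current is not None:
                raise ValueError("MODEL found before the previous model was closed by END")
            current = []
        elif keyword == "END":
            if current is None:
                raise ValueError("END found without a preceding MODEL")
            models.append(current)
            current = None
        elif current is not None:
            current.append(line)
        # Lines outside any model (the header) are ignored in this sketch.
    if current is not None:
        raise ValueError("last MODEL is not closed by END")
    if len(models) > MAX_MODELS:
        raise ValueError(f"{len(models)} models found, at most {MAX_MODELS} allowed")
    return models
```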
2.1.2 | Preparation of targets and model templates. The CASP organizers prepare a FASTA file with the sequence of the target
RNA. The file begins with a header containing the target ID (e.g.,
>R1117) and chain IDs (i.e., numbers from 0 to 9) of all
strands in the target structure. The body of the file includes the nucleic
acid sequence(s). In addition, the organizers generate a 3D structure
template using the RNA-Puzzles formatting tool13. The template is a
PDB file containing all the required ATOM records with zeroed coordinate
values. The information on targets is communicated to participating
groups via the CASP web portal (e.g., https://predictioncenter.org/casp15/target.cgi?id=30&view=all).
Prior to submission, predictors can verify compatibility of their models
with the provided templates by running the RNA-Puzzles tool that checks
the number and ordering of residues and atoms in the submission13. If a prediction
file does not comply with the requirements, error messages are reported
to a log file. Non-compliant files can be reformatted with the
rna_pdb_toolsx.py script available from the rna-tools toolbox13,14.
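The kind of compatibility check described above, comparing the number and ordering of residues and atoms in a model against the template, might look as follows. This is an illustrative sketch only and does not reproduce the RNA-Puzzles tool: the function names `atom_keys` and `compare_with_template` are hypothetical, and the standard fixed-column PDB ATOM layout is assumed.

```python
def atom_keys(pdb_lines):
    """Return (chain, residue number, residue name, atom name) per ATOM record, in file order."""
    keys = []
    for line in pdb_lines:
        if line.startswith("ATOM"):
            keys.append((
                line[21],                # chain identifier
                int(line[22:26]),        # residue sequence number
                line[17:20].strip(),     # residue name
                line[12:16].strip(),     # atom name
            ))
    return keys


def compare_with_template(model_lines, template_lines):
    """Return error messages; an empty list means the model matches the template."""
    model = atom_keys(model_lines)
    template = atom_keys(template_lines)
    messages = []
    if len(model) != len(template):
        messages.append(
            f"atom count differs: model has {len(model)}, template has {len(template)}"
        )
    # Walk both files in parallel and report only the first mismatch.
    for position, (found, expected) in enumerate(zip(model, template), start=1):
        if found != expected:
            messages.append(f"ATOM record {position}: expected {expected}, found {found}")
            break
    return messages
```

Coordinates are deliberately ignored, since the template carries zeroed values and only the identity and order of the records are compared.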
2.1.3 | Setting the acceptance system. At the target
release time, each target is assigned a prediction time window, which is
typically 3 days for servers and 3 weeks for expert groups. RNA
structure models are accepted within the specified prediction window via
email or the dedicated CASP prediction submission web form. The CASP
submission system automatically checks submissions for compliance with
the deadlines and format requirements and provides feedback to
predictors. The prediction format is checked with the same tools used to
generate model templates (section 2.1.2). If a prediction is rejected,
an error message is sent to the submitter, and they have until the
target deadline to fix the reported issue(s) and resubmit. Accepted
predictions are stored in the CASP system and eventually evaluated after
the target structure becomes available.
The same submission rules apply to other prediction categories discussed
further in this paper.
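As an illustration of the deadline part of the acceptance check, the sketch below tests whether a submission falls inside a target's prediction window. It is not the CASP acceptance code: the function name `is_within_window` and the group labels `"server"` and `"expert"` are hypothetical, and the window lengths are the typical values quoted above rather than per-target settings.

```python
from datetime import datetime, timedelta, timezone

# Typical window lengths quoted in the text; real windows are assigned per target.
TYPICAL_WINDOWS = {
    "server": timedelta(days=3),
    "expert": timedelta(weeks=3),
}


def is_within_window(release_time, submission_time, group_type):
    """Return True if the submission arrives between release and deadline (inclusive)."""
    deadline = release_time + TYPICAL_WINDOWS[group_type]
    return release_time <= submission_time <= deadline


# Usage with a made-up release time (not an actual CASP15 date).
release = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
late_for_server = release + timedelta(days=4)
print(is_within_window(release, late_for_server, "server"))  # False
print(is_within_window(release, late_for_server, "expert"))  # True
```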
2.1.4 | RNA evaluation measures. Predictions in the
RNA category are assessed by checking their geometric plausibility and
comparing them with target structures. Given the early stage of RNA
modeling, when alternative target structures were available we
reported the best score per model. Evaluation measures include
Clashscore15, Root Mean Square Deviation (RMSD)16, Local Distance
Difference Test (lDDT)17, Template Modeling score (TM-score)18,
and Global Distance Test - Total Score (GDT-TS)19. These measures are commonly
used in protein-CASP evaluation and are also adopted here for
RNA evaluation. However, none of these measures is suitable for
assessing RNA-specific features, such as canonical (G-C, A-U, G-U),
non-canonical, and stacking interactions between the nucleobases that
contribute to RNA folding and stabilization. Correct prediction of
canonical interactions alone is usually insufficient to obtain a good model of
an RNA molecule (example in Figure 1), while prediction of non-canonical
interactions is very valuable but hard to achieve due to high
computational demands. We therefore additionally consider an RNA-specific measure,
Interaction Network Fidelity (INF)13,20,
which evaluates different types of RNA interactions in models.
Calculation of these measures requires prior determination of RNA
interactions from the atomic coordinates. This is done using 2D
structure annotators such as RNAView21, MC-Annotate22, ClaRNA23 or FR3D24, which provide base
pairs and their classification25. Given two sets of
interactions, one for the model and another for the target, we identify
true positives (correctly predicted base pairs), false positives
(predicted base pairs that are absent from the target), and false negatives
(target base pairs that are missing from the model), and then calculate the INF score as the Matthews
correlation coefficient26. The score ranges
from 0.0 to 1.0, with higher scores indicating better prediction of
base-base interactions. The INF score is determined for all interactions
(INF_all), and separately for canonical (Watson-Crick, INF_WC),
non-canonical (non-Watson-Crick, INF_nWC), and stacking (INF_stacking)
interactions.
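The INF calculation can be sketched in a few lines once the interactions have been extracted by an annotator. The sketch assumes the form of the Matthews correlation coefficient commonly used in the INF literature, in which true negatives are not counted and the coefficient is approximated by the geometric mean of precision (PPV) and sensitivity (STY); this form is an assumption here and is not spelled out in the text. A model interaction absent from the target counts as a false positive, and a target interaction missing from the model counts as a false negative. The function name `inf_score` and the tuple representation of an interaction are hypothetical.

```python
from math import sqrt


def inf_score(model_interactions, target_interactions):
    """Approximate INF as sqrt(PPV * STY) from two collections of interactions.

    Each interaction is any hashable description, for example
    (residue_i, residue_j, interaction_type).
    """
    model = set(model_interactions)
    target = set(target_interactions)
    true_positives = len(model & target)    # present in both model and target
    false_positives = len(model - target)   # predicted but absent from the target
    false_negatives = len(target - model)   # in the target but missing from the model
    if true_positives == 0:
        # Assumption: return 0.0 when nothing is shared, including empty inputs.
        return 0.0
    ppv = true_positives / (true_positives + false_positives)
    sty = true_positives / (true_positives + false_negatives)
    return sqrt(ppv * sty)
```

Restricting both input sets to one interaction type (canonical, non-canonical, or stacking) before calling the function gives the per-type variants, while passing all interactions gives the overall score.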