3.3.1 Analysis of chameleon sequences
To understand the differences between the networks on the CASP14
dataset, we analyzed their predictions for chameleon sequences
(ChSeqs) – amino-acid sequences known to adopt
different 3-class SS (H, E, C) in different, unrelated proteins. This
analysis is considered one of the most rigorous tests for SS predictors
because the conformations of ChSeqs depend on non-local protein-specific
interactions 54,55. We searched CASP14 for all
4-element ChSeqs listed in the database from 56 and
created a CASP14-ChSeqs set containing 3202 such 4-element sequences and
their associated SS (for the first element in the sequence) according to
DSSP. In Supplementary Figure S2, we compared the numbers and types of
mistakes made for CASP14-ChSeqs by all the networks. The largest number
of mistakes and largest differences between networks were observed for
the loop class. ProteinUnetLM mistook helix for coil (H → C) less than
half as often as ProteinUnet2, reaching a level similar to AlphaFold2.
The biggest disadvantage of AlphaFold2 was the overprediction of helices
instead of coils (with the highest number of C → H errors out of all
networks), which is in line with the conclusions from 30. MSA-based networks (ProteinUnet2 and SPOT-1D) made
over 80 mistakes more than their LM-based counterparts, which confirms
the higher predictive power of LM features for challenging chameleon
sequences. ProteinUnetLM achieved the third-best result
after AlphaFold2 and SPOT-1D-LM, beating NetSurfP-3.0.
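The CASP14-ChSeqs construction described above can be sketched in a few lines. The function name and the toy protein records below are illustrative assumptions, not the actual pipeline code; `ss` stands for the DSSP-derived 3-class string, and the label recorded for each 4-mer is that of its first residue, as in the text:

```python
from collections import defaultdict

def find_chameleon_4mers(proteins):
    """Collect 4-residue subsequences observed with different 3-class
    SS (H, E, C) across proteins. `proteins` maps a protein ID to a
    (sequence, ss) pair; the SS label recorded for each 4-mer is that
    of its first residue."""
    labels = defaultdict(set)
    for seq, ss in proteins.values():
        for i in range(len(seq) - 3):
            labels[seq[i:i + 4]].add(ss[i])
    # A chameleon 4-mer adopts more than one SS class across proteins.
    return {kmer: cls for kmer, cls in labels.items() if len(cls) > 1}

# Toy example with invented sequences (not real CASP14 targets):
proteins = {
    "T1001": ("MKVLAEKQ", "CHHHHHHC"),
    "T1002": ("GAVLAEKW", "CCEEEECC"),
}
# VLAE and LAEK start as H in T1001 but as E in T1002.
print(find_chameleon_4mers(proteins))
```

On real data this scan would run over all CASP14 target sequences with their DSSP assignments, keeping only the 4-mers present in the reference ChSeqs database.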
Running time comparison
For a comparison of running times of LM-based models, we used a laptop
with an Nvidia RTX 2080 Max-Q GPU and an Intel i7-10750H CPU. The
prediction time covers the time needed for feature
generation and the inference time of the networks for SS prediction
only (i.e., excluding the regression-based networks that generate the other
outputs of SPOT-1D-LM), using batch size 1. We do not take into account
the time needed for program initialization, data loading, or saving the
results on disk. We were unable to measure the inference time of
NetSurfP-3.0 on the same computer, as the model is accessible only for
online end-to-end prediction. However, based on the
information from article 26, we assumed that the inference
time of NetSurfP-3.0 is 5.3x shorter than for SPOT-1D-LM; this
assumption is marked with an asterisk in Table 4, which presents the times.
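A minimal sketch of this timing protocol, assuming hypothetical `embed` and `predict` callables standing in for the pLM feature extractor and the SS network (both already loaded into memory, so program start-up, data loading, and saving results are excluded):

```python
import time

def time_prediction(sequences, embed, predict):
    """Sum feature-generation and SS-inference time per sequence
    (batch size 1), as described in the text. `embed` and `predict`
    are placeholders for the pLM and the SS network."""
    total = 0.0
    for seq in sequences:
        start = time.perf_counter()
        features = embed(seq)   # pLM feature generation
        predict(features)       # SS network inference only
        total += time.perf_counter() - start
    return total, total / len(sequences)
```

For example, a 38 s total over the 250 TEST2018 sequences corresponds to 152 ms per sequence.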
The feature calculation times for ProtTransT5-XL-U50 (ProteinUnetLM) and
ESM-1b (NetSurfP-3.0) on GPU are similar, with ESM-1b being 1.5x faster
on CPU. SPOT-1D-LM requires both sets of features, so its feature
calculation time is their sum, around 2x longer. ProteinUnetLM has nearly 3x shorter
inference time on CPU (3 s) than on GPU (8 s). This is because
ProteinUnetLM is so lightweight that loading features from pLMs (1024 x
704 values) into GPU and retrieving the result takes longer than simply
running the model on the CPU. Consequently, the
optimal approach is to generate features on the GPU and to run inference
on the CPU, which makes the inference time around 7x shorter than for
SPOT-1D-LM on GPU and around 66x shorter than for SPOT-1D-LM on CPU.
This yields a prediction time of 38 s (152 ms per sequence), which is on
par with the estimated prediction time of NetSurfP-3.0 on GPU and 2.4
times shorter than SPOT-1D-LM on GPU. Additionally, ProteinUnetLM can be
used effectively without a GPU, with a prediction time shorter than 3 s per sequence.
It is worth adding that, if necessary, ProteinUnetLM can be sped up
further without losing much accuracy by training without AA on input
(Supplementary Table S1).
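The per-sequence figures quoted above follow directly from the reported totals; a quick arithmetic check (values taken from the text and Table 4, illustrative only):

```python
# Sanity check of the quoted running-time figures.
n_seq = 250                 # sequences in the TEST2018 set
prediction_total_s = 38.0   # GPU feature generation + CPU inference
inference_cpu_s = 3.0       # ProteinUnetLM inference on CPU
inference_gpu_s = 8.0       # ProteinUnetLM inference on GPU

print(1000 * prediction_total_s / n_seq)            # 152.0 ms per sequence
print(round(inference_gpu_s / inference_cpu_s, 2))  # 2.67, i.e. "nearly 3x"
```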
Table 4. The comparison
of running times for ProteinUnetLM, SPOT-1D-LM, and NetSurfP-3.0 on the
TEST2018 set with 250 sequences.