Additionally, we use a segment overlap score for 8 classes (SOV8) as
defined by the SOV_refine algorithm 45. The SOV score
was designed specifically to compare two sequences of protein secondary
structures in which the continuity of segments carries important meaning.
It rewards classifiers that consistently predict segments of the same
structure without breaking them with incorrect predictions. It takes
values in the range [0, 1], where 1 denotes a perfect prediction, and it
can be calculated only at the sequence level.
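To make the segment-based idea concrete, here is a simplified sketch of a segment overlap score (not the full SOV_refine definition from 45, which additionally extends each overlap with an allowance term):

```python
def segments(states):
    """Split a per-residue state string into (state, start, end) runs."""
    segs, start = [], 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            segs.append((states[start], start, i))
            start = i
    return segs

def simple_sov(reference, prediction):
    """Simplified segment overlap: every pair of overlapping same-state
    segments contributes len(s1) * minov / maxov, with len(s1) counted in
    the normalizer (also for reference segments with no overlap at all).
    The full SOV_refine additionally extends minov with an allowance term."""
    score, norm = 0.0, 0
    for state, s1, e1 in segments(reference):
        pairs = [(s2, e2) for st, s2, e2 in segments(prediction)
                 if st == state and s1 < e2 and s2 < e1]
        if not pairs:
            norm += e1 - s1
        for s2, e2 in pairs:
            minov = min(e1, e2) - max(s1, s2)  # residues shared by both segments
            maxov = max(e1, e2) - min(s1, s2)  # total extent of the pair
            score += (e1 - s1) * minov / maxov
            norm += e1 - s1
    return score / norm

# A fragmented prediction is penalized far more than its residue accuracy:
# simple_sov("HHHHHHCC", "HHHHHHCC")  -> 1.0
# simple_sov("HHHHHHCC", "HHCHHHCC")  -> 0.5, although 7 of 8 residues match
```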
We report Q8 accuracy at the residue level only for compatibility with
previous literature; we consider it highly inappropriate for such an
imbalanced problem 31, so we do not perform any statistical tests on Q8.
To avoid any potential bias towards the MCC metric, which was optimized
during training, we decided not to assess it during testing. The
implementations of all the mentioned metrics are available in our
computational capsule on CodeOcean.
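For reference, residue-level Q8 is simply the fraction of residues assigned the correct one of the eight states:

```python
def q8_accuracy(reference, prediction):
    """Residue-level Q8: fraction of residues with the correct 8-state
    label; unlike SOV8, it is insensitive to segment continuity."""
    assert len(reference) == len(prediction)
    return sum(r == p for r, p in zip(reference, prediction)) / len(reference)

# q8_accuracy("HHHHHHCC", "HHCHHHCC")  -> 0.875
```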
For fair experimental classifier evaluation 46, we
apply paired statistical comparisons between ProteinUnetLM and other
networks using two-sided paired-sample permutation tests for the
difference in mean classifier performance (perm.paired.loc function from
the wPerm R package with 10,000 replications). We chose the permutation
approach because it makes no assumptions about the sampling distribution
or the sample size. The tests were performed at the sequence level, that
is, we first calculated metric values for each sequence separately and
then ran statistical tests on them. We selected a significance level of
0.05 but applied Bonferroni correction for multiple hypothesis testing
(MHT) when ProteinUnetLM was compared with more than one other classifier
on the same dataset. This means that the significance level is divided by
the number of comparisons, i.e., the significance level for TEST2016 is
0.025 (2 comparisons), for TEST2018 it is 0.01 (5 comparisons), and so
on. To quantify the effect size and its direction, as proposed previously
in 31, we use Cohen's d effect size for paired samples, calculated as the
mean difference divided by the standard deviation of the differences 47.
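The protocol can be sketched in Python as follows (our illustrative re-implementation with NumPy rather than the wPerm R package; function and variable names are ours):

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided paired-sample permutation test for a difference in means.
    a, b: per-sequence metric values of two classifiers, in the same order.
    Under the null hypothesis the sign of each paired difference is
    exchangeable, so we flip signs at random and recompute the mean."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = diff.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null_means = (signs * diff).mean(axis=1)
    # Two-sided p-value: share of permuted means at least as extreme.
    return np.mean(np.abs(null_means) >= abs(observed))

def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean difference divided by the
    standard deviation of the differences."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return diff.mean() / diff.std(ddof=1)

# Example with a Bonferroni-corrected threshold for TEST2016 (2 comparisons):
# p = paired_permutation_test(sov8_proteinunetlm, sov8_competitor)
# significant = p < 0.05 / 2
```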
Comparison with LSTM-based networks
We compare ProteinUnetLM with the three latest networks for protein
secondary structure prediction based on features from protein LMs:
NetSurfP-3.0 26, ProtT5Sec 23, and
SPOT-1D-LM 27. SPOT-1D-LM and NetSurfP-3.0 are hybrids
of convolutional feature extractors and bidirectional recurrent neural
networks (BRNN) with long short-term memory (LSTM) units. SPOT-1D-LM
uses a ResNet convolutional encoder 48, and NetSurfP-3.0 uses two
convolutional layers with very large kernels (129 and 257) and paddings
(64 and 128), followed by dropout (0.5) and ReLU activations. In fact,
SPOT-1D-LM, unlike ProteinUnetLM and NetSurfP-3.0, is an ensemble of
three separate networks, a BRNN, a ResNet, and a ResNet-BRNN hybrid,
which increases the complexity and time of both training and prediction.
Interestingly, the authors of NetSurfP-3.0 showed that replacing their
downstream architecture with a transformer-based encoder resulted in
suboptimal performance 26.
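The convolutional front end described for NetSurfP-3.0 could be sketched in PyTorch roughly as follows (a minimal sketch based only on the description above; the class name, channel count, and input embedding size of 1024 are our illustrative assumptions):

```python
import torch
import torch.nn as nn

class WideKernelFrontEnd(nn.Module):
    """Two 1D convolutions with very large kernels (129 and 257) and
    'same' paddings (64 and 128), each followed by dropout (0.5) and
    ReLU, as described for NetSurfP-3.0. Channel sizes are illustrative."""

    def __init__(self, in_channels=1024, hidden=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=129, padding=64),
            nn.Dropout(0.5),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=257, padding=128),
            nn.Dropout(0.5),
            nn.ReLU(),
        )

    def forward(self, x):  # x: (batch, channels, sequence_length)
        # padding = (kernel_size - 1) // 2 keeps the sequence length intact
        return self.block(x)

# features = WideKernelFrontEnd()(torch.randn(1, 1024, 300))  # -> (1, 32, 300)
```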
The main purpose of LSTM networks is to learn both short and distant
dependencies within sequences 49. Distant dependencies cannot be
captured by convolutional layers because of their limited receptive
field, but this limitation can be overcome with an attention mechanism
50. The positive impact of the attention mechanism on the results of
ProteinUnetLM can be observed in Supplementary Table S1. Additionally,
long skip connections in the U-Net architecture, besides stabilizing
gradient updates in deep architectures, prevent the loss of fine-grained
details of the input sequence 51. Moreover, the training time of LSTM
networks is several times longer than for fully-convolutional networks
36,52, mainly because RNNs are harder to parallelize and take less
advantage of GPU processing 53.
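For intuition, an additive attention gate on a 1D U-Net skip connection, in the style used by Attention U-Net architectures, can be sketched as follows (a generic sketch, not the exact ProteinUnetLM implementation; names and channel sizes are illustrative):

```python
import torch
import torch.nn as nn

class AttentionGate1D(nn.Module):
    """Generic additive attention gate for a 1D U-Net skip connection:
    the decoder signal g produces per-position weights in [0, 1] that
    gate the encoder features x, so fine-grained details pass through
    the long skip connection only where they are relevant."""

    def __init__(self, channels_x, channels_g, channels_inter):
        super().__init__()
        self.wx = nn.Conv1d(channels_x, channels_inter, kernel_size=1)
        self.wg = nn.Conv1d(channels_g, channels_inter, kernel_size=1)
        self.psi = nn.Conv1d(channels_inter, 1, kernel_size=1)

    def forward(self, x, g):  # x, g: (batch, channels, length), equal length
        alpha = torch.sigmoid(self.psi(torch.relu(self.wx(x) + self.wg(g))))
        return x * alpha

# gated_skip = AttentionGate1D(64, 128, 32)(encoder_features, decoder_signal)
```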
ProtT5Sec was introduced as a simple classification backbone based on
ProtTrans features 23. The authors tested four
different classifiers: logistic regression, fully-connected network,
fully-convolutional network (CNN), and BRNN-LSTM. They concluded that
the two-layer CNN (32 filters of size 7) provided the best performance
while being computationally less expensive than the LSTM, which reached
similar results. In our paper, we build on this conclusion and hypothesize
that LSTM networks are not necessary to achieve state-of-the-art results
in protein secondary structure prediction and can be effectively
replaced by the proposed Attention U-Net architecture when using
features from pLMs.
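A head of this kind (two convolutional layers, 32 filters of size 7) can be sketched as follows; the input embedding size of 1024 and the 8-class output are our assumptions for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerCnnHead(nn.Module):
    """Minimal two-layer CNN classification head over pLM features in the
    spirit of ProtT5Sec: 32 filters of size 7 per layer, mapping
    per-residue embeddings to 8-state secondary structure logits."""

    def __init__(self, in_channels=1024, n_classes=8):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels, 32, kernel_size=7, padding=3)
        self.conv2 = nn.Conv1d(32, n_classes, kernel_size=7, padding=3)

    def forward(self, x):  # x: (batch, embedding_dim, sequence_length)
        return self.conv2(F.relu(self.conv1(x)))  # per-residue class logits
```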