Additionally, we use the segment overlap score for 8 classes (SOV8) as defined by the SOV_refine algorithm 45. The SOV score was designed specifically to compare two sequences of protein secondary structures in which the continuity of segments carries important meaning. It rewards classifiers that consistently predict segments of the same structure without breaking them with incorrect predictions. It takes values in the range [0, 1], where 1 denotes a perfect prediction, and it can be calculated only at the sequence level.
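To make the segment-overlap idea concrete, the sketch below scores one pair of 8-state sequences. It follows the classic SOV formulation for illustration only; SOV_refine 45 modifies this definition, and our actual implementation is available in the CodeOcean capsule. All function names here are ours.

```python
from itertools import groupby

def segments(labels):
    """Split a label sequence into (class, start, end) segments."""
    segs, pos = [], 0
    for cls, run in groupby(labels):
        length = len(list(run))
        segs.append((cls, pos, pos + length))
        pos += length
    return segs

def sov_sketch(ref, pred):
    """Simplified segment-overlap score in [0, 1] (classic SOV-style;
    SOV_refine 45 uses a modified definition)."""
    pred_segs = segments(pred)
    total, score = 0, 0.0
    for cls, s1, e1 in segments(ref):
        len1 = e1 - s1
        overlaps = [(s2, e2) for c2, s2, e2 in pred_segs
                    if c2 == cls and s2 < e1 and e2 > s1]
        if not overlaps:
            total += len1          # unmatched reference segment
            continue
        for s2, e2 in overlaps:
            minov = min(e1, e2) - max(s1, s2)   # actual overlap
            maxov = max(e1, e2) - min(s1, s2)   # total extent of the pair
            delta = min(maxov - minov, minov, len1 // 2, (e2 - s2) // 2)
            total += len1
            score += (minov + delta) / maxov * len1
    return score / total if total else 1.0
```

For example, sov_sketch("HHHEEECCC", "HHHEECCCC") returns 1.0: the single error is only a one-residue boundary shift, which the segment-level view forgives because both segments remain intact.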
We report Q8 accuracy at the residue level only for compatibility with previous literature; we consider it highly inappropriate for such an imbalanced problem 31, so we do not perform any statistical tests on Q8. To avoid any potential bias towards the MCC metric, which was optimized during training, we decided not to assess it during testing. The implementations of all the mentioned metrics are available in our computational capsule on CodeOcean.
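For reference, Q8 reduces to per-residue accuracy. Computed per sequence, as we do for all metrics before statistical testing, a minimal sketch looks as follows (our released implementation may differ in details):

```python
import numpy as np

def q8(true_labels, pred_labels):
    """Q8: fraction of residues assigned the correct one of 8 states."""
    t = np.asarray(list(true_labels))
    p = np.asarray(list(pred_labels))
    return float(np.mean(t == p))

q8("HHHEEECCC", "HHHEECCCC")  # 8 of 9 residues correct -> 0.889
```

Note the contrast with the segment-overlap sketch above, which scores this same pair as 1.0: Q8 penalizes the one-residue boundary shift that SOV forgives, which illustrates why we prefer SOV8 for segment-structured labels.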
For a fair experimental classifier evaluation 46, we apply paired statistical comparisons between ProteinUnetLM and other networks using two-sided paired-sample permutation tests for the difference in mean classifier performance (perm.paired.loc function from the wPerm R package with 10,000 replications). We chose the permutation approach because it makes no assumptions about the sampling distribution or the sample size. The tests were performed at the sequence level, that is, we first calculated metric values for each sequence separately and then ran the statistical tests on them. We selected a significance level of 0.05 but apply the Bonferroni correction for multiple hypothesis testing (MHT) when ProteinUnetLM is compared with more than one other classifier on the same dataset. This means that the significance level is divided by the number of comparisons, i.e., it is 0.025 for TEST2016 (2 comparisons), 0.01 for TEST2018 (5 comparisons), and so on. To quantify the effect size and its direction, as proposed previously in 31, we use Cohen's d for paired samples, calculated as the mean difference divided by the standard deviation of the differences 47.
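The tests themselves were run with the wPerm R package as stated above; purely as an illustration, an equivalent paired permutation test and paired Cohen's d can be sketched in Python as follows (perm.paired.loc's exact resampling scheme may differ from this sign-flipping variant):

```python
import numpy as np

def paired_permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sided paired-sample permutation test for a difference in means:
    randomly flip the sign of each per-sequence difference and compare."""
    rng = np.random.default_rng(seed)
    d = np.asarray(x) - np.asarray(y)        # per-sequence differences
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_means = np.abs((signs * d).mean(axis=1))
    return float(np.mean(perm_means >= observed))  # permutation p-value

def cohens_d_paired(x, y):
    """Cohen's d for paired samples: mean difference over SD of differences."""
    d = np.asarray(x) - np.asarray(y)
    return float(d.mean() / d.std(ddof=1))
```

Here each element of x and y would be a per-sequence metric value (e.g., SOV8) for the two compared classifiers on the same test set.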

Comparison with LSTM-based networks

We compare ProteinUnetLM with the three latest networks for protein secondary structure prediction based on features from protein LMs: NetSurfP-3.0 26, ProtT5Sec 23, and SPOT-1D-LM 27. SPOT-1D-LM and NetSurfP-3.0 are hybrids of convolutional feature extractors and bidirectional recurrent neural networks (BRNN) with long short-term memory (LSTM) units. SPOT-1D-LM uses a ResNet convolutional encoder 48, and NetSurfP-3.0 uses two convolutional layers with very large kernels (129 and 257) and paddings (64 and 128), followed by 0.5 dropouts and ReLU activations. In fact, SPOT-1D-LM, unlike ProteinUnetLM and NetSurfP-3.0, is an ensemble of three separate networks (BRNN, ResNet, and a ResNet-BRNN hybrid), which increases the complexity and time of both training and prediction. Interestingly, the authors of NetSurfP-3.0 showed that replacing their downstream architecture with a transformer-based encoder resulted in suboptimal performance 26.
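For concreteness, the NetSurfP-3.0 convolutional front end described above can be sketched in PyTorch as follows; the channel widths, input embedding size, and exact dropout/activation ordering are our assumptions, and the published implementation may differ:

```python
import torch.nn as nn

class LargeKernelConvFrontEnd(nn.Module):
    """Two 1D convolutions with very large kernels (129 and 257) and
    paddings (64 and 128) that preserve sequence length, each followed
    by 0.5 dropout and ReLU, as described for NetSurfP-3.0."""

    def __init__(self, in_channels=1024, hidden=32):  # placeholder widths
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=129, padding=64),
            nn.Dropout(0.5),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=257, padding=128),
            nn.Dropout(0.5),
            nn.ReLU(),
        )

    def forward(self, x):          # x: (batch, channels, seq_len)
        return self.block(x)       # same seq_len thanks to the paddings
```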
The main purpose of LSTM networks is to learn both short and distant dependencies within sequences 49. Distant dependencies cannot be captured by convolutional layers because of their limited receptive field, but this limitation can be overcome with an attention mechanism 50. The positive impact of the attention mechanism on the results of ProteinUnetLM can be observed in Supplementary Table S1. Additionally, long skip connections in the U-Net architecture, besides stabilizing gradient updates in deep architectures, prevent the loss of fine-grained details of the input sequence 51. Moreover, the training time of LSTM networks is several times longer than that of fully-convolutional networks 36,52, mainly because RNNs are harder to parallelize and take less advantage of GPU processing 53.
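To make the mechanism concrete, below is a minimal PyTorch sketch of an attention gate applied to a long skip connection, in the spirit of Attention U-Net; channel sizes are illustrative, and this is not the exact ProteinUnetLM configuration:

```python
import torch
import torch.nn as nn

class AttentionGate1D(nn.Module):
    """Weight encoder features on a long skip connection using a gating
    signal from the decoder, suppressing irrelevant positions while
    preserving fine-grained details."""

    def __init__(self, skip_channels, gate_channels, inter_channels):
        super().__init__()
        self.w_x = nn.Conv1d(skip_channels, inter_channels, kernel_size=1)
        self.w_g = nn.Conv1d(gate_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv1d(inter_channels, 1, kernel_size=1)

    def forward(self, skip, gate):
        # skip: encoder features carried over the long skip connection
        # gate: decoder features (assumed already resampled to the same
        # sequence length as skip for this simplified sketch)
        attn = torch.sigmoid(self.psi(torch.relu(self.w_x(skip) + self.w_g(gate))))
        return skip * attn   # per-position weighting of the skip features
```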
ProtT5Sec was introduced as a simple classification backbone based on ProtTrans features 23. The authors tested four different classifiers: logistic regression, a fully-connected network, a fully-convolutional network (CNN), and a BRNN-LSTM. They concluded that a two-layer CNN (32 filters of size 7) provided the best performance while being computationally less expensive than the LSTM, which reached similar results. In our paper, we build on this conclusion and hypothesize that LSTM networks are not necessary to achieve state-of-the-art results in protein secondary structure prediction and can be effectively replaced by the proposed Attention U-Net architecture when using features from pLMs.
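For reference, the two-layer CNN classifier from 23 described above can be sketched as follows; the padding, activation placement, and embedding size are our assumptions rather than the published configuration:

```python
import torch.nn as nn

class CnnHead(nn.Module):
    """Two-layer CNN head over pLM embeddings: a first 1D convolution
    with 32 filters of size 7, then a second size-7 convolution
    projecting to the 8 secondary-structure classes."""

    def __init__(self, embed_dim=1024, n_classes=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv1d(embed_dim, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(32, n_classes, kernel_size=7, padding=3),
        )

    def forward(self, x):       # x: (batch, embed_dim, seq_len)
        return self.head(x)     # per-residue class logits
```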