Introduction

Proteins are macromolecules built from amino acids (AAs), and the kind and order of the AAs (known as the primary structure) are determined by the DNA sequence. Most proteins are built from 20 different AAs, which differ only in their organic side chains (R groups). The backbone is the same for all AAs; it consists of an amino group, a central carbon atom, and a carboxyl group. From its linear sequence of amino acids, a protein folds rapidly into secondary (local) arrangements and then into a tertiary (three-dimensional) structure. The regular structural elements created by hydrogen bonds between donors in the amino groups and acceptors in the carboxyl groups of the backbone define the secondary structure (SS) of a protein. These structures stabilize the protein. Different types of secondary structures are known, of which the alpha-helix, beta-sheet, and loop are the three most important. Several SS assignment methods exist, among which the most commonly used is the algorithm of the Dictionary of Protein Secondary Structure (DSSP)1. DSSP proposed 8 classes of protein secondary structure, which provide more information about the actual 3D formation of the protein: three helix states, the 3₁₀-helix (G), alpha-helix (H), and pi-helix (I); two beta-sheet states, the beta-bridge (B) and beta-strand (E); and three loop (or irregular) classes, the high-curvature loop (S), beta-turn (T), and random coil (C).
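To make the class definitions concrete, the following minimal Python sketch lists the eight DSSP codes together with one commonly used reduction to the three-state alphabet (helix/sheet/coil). The exact 8-to-3 grouping varies between studies, so this mapping and the function name are illustrative, not a prescription of the DSSP tool itself:

```python
# The eight DSSP secondary-structure classes (code -> description).
DSSP8 = {
    "G": "3_10-helix",
    "H": "alpha-helix",
    "I": "pi-helix",
    "B": "beta-bridge",
    "E": "beta-strand",
    "S": "high-curvature loop",
    "T": "beta-turn",
    "C": "random coil",
}

# One common (but not universal) reduction to the 3-state alphabet.
SS8_TO_SS3 = {"G": "H", "H": "H", "I": "H",   # helices
              "B": "E", "E": "E",             # sheets
              "S": "C", "T": "C", "C": "C"}   # loops / irregular

def to_ss3(ss8_string: str) -> str:
    """Map an 8-state secondary-structure string to 3 states."""
    return "".join(SS8_TO_SS3[c] for c in ss8_string)

print(to_ss3("CHHHHTTEEEES"))  # -> "CHHHHCCEEEEC"
```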
SS prediction plays an important role in protein tertiary structure prediction as well as in the characterization of general protein structure and function. Because the SS provides the first step toward native or tertiary structure prediction, it is utilized in many protein folding algorithms2–5 and in a variety of bioinformatics areas, including proteome and gene annotation6–9, the determination of protein flexibility10, the subcloning of protein fragments for expression, and the assessment of evolutionary trends among organisms2. Therefore, protein SS prediction remains an active area of research and an integral part of protein analysis.
Three generations of methods and algorithms for secondary structure prediction are described in the literature11. The first generation, represented by the Chou-Fasman method12, exploited the statistical propensities of individual residues for particular secondary structure classes. These methods usually achieved a prediction accuracy below 60%. The second generation of methods, developed in the 1980s, used more advanced statistical methods, machine learning techniques, and information about neighboring residues, usually through a sliding-window approach. These methods include, among others, GOR13 and Lim14. Their accuracy, as assessed by the Q3 measure, remained below 65%15. The third generation of methods appeared in the 1990s. They used neural networks and additional features based on multiple sequence alignment (MSA) profiles, e.g., position-specific scoring matrices (PSSMs)16 or hidden Markov model profiles from the iterative protein sequence search tool HHblits17. The Q3 accuracy of these methods exceeded 80% for models such as PSIPRED18. Given the increasing number of known protein sequences and more efficient neural network architectures, the latest methods of this generation, based on bidirectional recurrent neural networks (BRNNs) with long short-term memory (LSTM) cells, can predict the secondary structure with over 70% accuracy on the 8-class problem, e.g., NetSurfP-2.0 (Q8 = 71.43% on CASP12)19 and SPOT-1D (Q8 = 73.67% on CASP12)20.
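The Q3 and Q8 figures quoted above are per-residue accuracies: the fraction of residues whose predicted class matches the assigned class, over a 3- or 8-letter alphabet respectively. A minimal sketch (the function name is ours):

```python
def q_accuracy(predicted: str, observed: str) -> float:
    """Per-residue accuracy in percent (Q3 for 3-state strings, Q8 for 8-state)."""
    assert len(predicted) == len(observed)
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

print(q_accuracy("HHHEEC", "HHHECC"))  # 5 of 6 residues correct -> 83.33
```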
The fourth, recently emerging, generation of methods uses protein language models (pLMs) inspired by advancements in the natural language processing (NLP) field21. Secondary structure predictors of this latest generation use embeddings from models like SeqVec22 or transformer-based networks like ProtTrans23, ESM24, or BERT25 that learn the grammar of the "language of life". In machine learning, an embedding encodes categorical inputs (such as sequences of amino acids) as highly informative numerical vectors that can be used as inputs to neural networks. LM-based classifiers can achieve SS prediction performance close to or better than the previous generation of methods, e.g., NetSurfP-3.026 and ProtT5Sec23, which are comparable to NetSurfP-2.0, or SPOT-1D-LM27, which improves over SPOT-1D. Most importantly, the sequence embeddings can be generated in a fraction of the time required for MSA-based features23. Additionally, the recent success of AlphaFold228 proved that NLP-inspired mechanisms like attention and transformers may be extremely useful in protein tertiary structure prediction: it predicted protein structures at near-X-ray resolution in the latest Critical Assessment of protein Structure Prediction (CASP14)29. However, there is still room for improving AlphaFold2 predictions in terms of SS. The newest results demonstrate that its accuracy decreases for longer loop regions and that it tends to slightly overpredict helices and beta-sheets30.
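As an illustration of the embedding idea only (not the pipeline of any of the cited pLMs), the sketch below maps amino acid tokens to learnable numerical vectors with a plain PyTorch embedding layer. Real pLM embeddings differ in that they are context-dependent outputs of large pretrained transformers, typically with 1024 or more dimensions per residue; the alphabet, dimensions, and example sequence here are our own assumptions:

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Toy embedding: each of the 20 amino acids gets a learnable 16-dim vector.
embedding = nn.Embedding(num_embeddings=20, embedding_dim=16)

sequence = "MKTAYIAKQR"
indices = torch.tensor([AA_TO_INDEX[aa] for aa in sequence])
vectors = embedding(indices)  # per-residue numerical features
print(vectors.shape)          # torch.Size([10, 16])
```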
In our study, we present how our proposed ProteinUnetLM model, based on the Attention U-Net architecture and LM-based features, improves over its previous MSA-based version31 and over its closest LM-based competitors (SPOT-1D-LM, NetSurfP-3.0, and ProtT5Sec): (i) its prediction performance measured by the sequence-level adjusted geometric mean (AGM) is better than that of all other LM-based networks, while being comparable in the segment overlap metric (SOV8) and Q8 accuracy; (ii) it provides the best results for the rare structures G, B, and S; (iii) its prediction time is comparable to that of the fastest competitor, NetSurfP-3.0. These results support our hypothesis that LSTMs are not needed to achieve state-of-the-art performance, as our fully convolutional Attention U-Net architecture is at least as accurate and at least as fast as any LSTM-based competitor. We especially focus on the issue of class imbalance in the SS8 prediction problem; besides proposing appropriate metrics and statistical methodology, we extended the loss function of the network with a term based on the Matthews correlation coefficient (MCC), which improved the performance for rare structures. In comparison with secondary structures parsed from the tertiary structure predictions of AlphaFold228 on the CASP14 dataset, ProteinUnetLM achieved better AGM for 10 out of 30 sequences and better precision for the helix (H) structure, proving its potential for making a significant step forward in the domain.
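To illustrate how an MCC term can enter a loss function at all, the following is a minimal sketch under our own assumptions, not the exact formulation used in ProteinUnetLM: a differentiable "soft" one-vs-rest MCC is computed per class from predicted probabilities, and 1 − MCC is minimized so that the network is rewarded for balanced performance on rare classes:

```python
import torch

def soft_mcc_loss(probs: torch.Tensor, targets: torch.Tensor,
                  eps: float = 1e-7) -> torch.Tensor:
    """Differentiable one-vs-rest MCC loss averaged over classes.

    probs:   (N, C) class probabilities (e.g., softmax outputs).
    targets: (N, C) one-hot ground-truth labels.
    """
    targets = targets.float()
    # Soft confusion-matrix counts per class, shape (C,).
    tp = (probs * targets).sum(dim=0)
    fp = (probs * (1 - targets)).sum(dim=0)
    fn = ((1 - probs) * targets).sum(dim=0)
    tn = ((1 - probs) * (1 - targets)).sum(dim=0)
    numerator = tp * tn - fp * fn
    denominator = torch.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = numerator / (denominator + eps)
    return 1.0 - mcc.mean()  # minimizing this maximizes average per-class MCC
```

In practice, such a term would typically be added to a standard cross-entropy loss with a weighting factor; the averaging scheme and weighting shown here are illustrative choices.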