Introduction
Proteins are macromolecules built from amino acids (AAs); the kind and
order of the AAs (known as the primary structure) are determined by the
DNA sequence. Most proteins are built from 20 different AAs, which
differ only in their organic side chains (R groups). The backbone is the
same for all AAs: it consists of an amino group, a central carbon atom,
and a carboxyl group. From a linear sequence of amino acids, a protein
folds rapidly into secondary (local) arrangements, and then into a
tertiary (three-dimensional) structure. The regular structural elements
created by hydrogen bonds between hydrogen donors in the amino group and
hydrogen acceptors in the carboxyl group of the backbone define the
secondary structure (SS) of a protein; these structures stabilize the
protein. Several types of secondary structure are known, of which the
alpha-helix, beta-sheet, and loop are the three most important. There
are different SS assignment methods, among which the most commonly used
is the algorithm of the Dictionary of Protein Secondary Structure
(DSSP) [1]. DSSP proposed 8 classes of protein secondary structure that
provide more information about the actual 3D formation of the protein.
There are three helix states: the 3₁₀-helix (G), alpha-helix (H), and
pi-helix (I); two beta-sheet states: the beta-bridge (B) and beta-strand
(E); and three loop (or irregular) classes: the high-curvature loop (S),
beta-turn (T), and random coil (C).
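For downstream evaluation, the 8-state DSSP alphabet (SS8) is often reduced to the coarser 3-state one (SS3). A minimal sketch of one widely used mapping convention (published conventions differ slightly, e.g., in the treatment of G, I, and B):

```python
# One common reduction of the 8-state DSSP alphabet to 3 states
# (helix/strand/coil); illustrative rather than canonical, since
# published mappings differ in how they treat G, I, and B.
DSSP8_TO_SS3 = {
    "G": "H",  # 3_10-helix          -> helix
    "H": "H",  # alpha-helix         -> helix
    "I": "H",  # pi-helix            -> helix
    "B": "E",  # beta-bridge         -> strand
    "E": "E",  # beta-strand         -> strand
    "S": "C",  # high-curvature loop -> coil
    "T": "C",  # beta-turn           -> coil
    "C": "C",  # random coil         -> coil
}

def ss8_to_ss3(ss8: str) -> str:
    """Map an 8-state secondary structure string to its 3-state form."""
    return "".join(DSSP8_TO_SS3[s] for s in ss8)

print(ss8_to_ss3("CCHHHHTTEEEES"))  # -> CCHHHHCCEEEEC
```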
SS prediction plays an important role in protein tertiary structure
prediction as well as in the characterization of general protein
structure and function. Because the SS provides the first step toward
native or tertiary structure prediction, it is utilized in many protein
folding algorithms [2–5] and in a variety of bioinformatics areas,
including proteome and gene annotation [6–9], the determination of
protein flexibility [10], the subcloning of protein fragments for
expression, and the assessment of evolutionary trends among
organisms [2]. Therefore, protein SS prediction remains an active area
of research and an integral part of protein analysis.
Three generations of methods and algorithms for secondary structure
prediction are described in the literature [11]. The first generation,
represented by the Chou-Fasman method [12], exploited the statistical
propensities of residues toward particular secondary structure classes.
These methods usually achieved a prediction accuracy of less than 60%.
The second generation of methods was developed in the 1980s. They used
advanced statistical methods, machine learning techniques, and
information about neighboring residues, usually through a sliding-window
approach; these methods include, among others, GOR [13] and Lim [14].
Their accuracy in predicting secondary structure, as assessed by the Q3
measure, was less than 65% [15].
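To illustrate the sliding-window idea, a predictor classifies each residue from a fixed-size window of its sequence neighborhood; a minimal sketch (the window size and padding symbol are arbitrary choices of ours):

```python
def sequence_windows(sequence: str, half_width: int = 8, pad: str = "X"):
    """Yield a fixed-size window of neighboring residues for each position.

    A window-based predictor classifies the central residue of each
    window; termini are padded so every residue has a full window
    (here with a dummy 'X' symbol).
    """
    padded = pad * half_width + sequence + pad * half_width
    width = 2 * half_width + 1
    for i in range(len(sequence)):
        yield padded[i : i + width]

for window in sequence_windows("MKTAYIAKQR", half_width=2):
    print(window)  # "XXMKT", "XMKTA", "MKTAY", ...
```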
The third generation of methods appeared in the 1990s. They used neural
networks and additional features based on multiple sequence alignment
(MSA) profiles, e.g., position-specific scoring matrices (PSSM) [16] or
profiles from HHblits, an iterative protein sequence search tool based
on hidden Markov models [17]. The Q3 accuracy of these methods exceeded
80% for models such as PSIPRED [18].
Given the increasing number of known protein sequences and more
efficient neural network architectures, the latest methods can predict
the secondary structure with over 70% accuracy even for the 8-class
problem, e.g., NetSurfP-2.0 (Q8 = 71.43% for CASP12) [19] and
SPOT-1D (Q8 = 73.67% for CASP12) [20], both based on long short-term
memory (LSTM) bidirectional recurrent neural networks (BRNN).
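The Q3 and Q8 scores quoted throughout are simply per-residue accuracies over the 3- and 8-state alphabets; a minimal sketch (the function name is ours):

```python
def q_accuracy(predicted: str, observed: str) -> float:
    """Per-residue accuracy: Q3 over a 3-state alphabet, Q8 over 8 states.

    Returns the percentage of residues whose predicted state matches
    the observed (DSSP-assigned) one.
    """
    assert len(predicted) == len(observed)
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

print(q_accuracy("HHHHCCEE", "HHHCCCEE"))  # 87.5 (7 of 8 residues correct)
```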
The fourth, recently emerging generation of methods uses protein
language models (pLMs) inspired by advances in the field of natural
language processing (NLP) [21]. Secondary structure predictors of this
latest generation use embeddings from models like SeqVec [22] or
transformer-based networks like ProtTrans [23], ESM [24], or BERT [25]
that learn the grammar of the "language of life". In machine learning,
an embedding encodes categorical data (here, sequences of amino acids)
as highly informative numerical vectors that can be used as inputs to
neural networks. LM-based classifiers are able to achieve SS prediction
performance close to or better than the previous generation of methods,
e.g., NetSurfP-3.0 [26] and ProtT5Sec [23], which are comparable to
NetSurfP-2.0, or SPOT-1D-LM [27], which improves over SPOT-1D. Most
importantly, the sequence embeddings can be generated in a fraction of
the time required by MSA-based features [23].
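To illustrate, per-residue embeddings can be extracted from a pretrained pLM in a few lines. A sketch assuming the Hugging Face transformers package and the public Rostlab ProtT5 encoder checkpoint (other pLMs follow a similar pattern):

```python
# Sketch: per-residue embeddings from a protein language model.
# Assumes the `transformers` and `torch` packages and the public
# Rostlab/prot_t5_xl_half_uniref50-enc checkpoint (ProtT5 encoder).
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name).eval()

sequence = "MKTAYIAKQR"  # toy amino-acid sequence
# ProtT5 expects space-separated residues; rare AAs are mapped to X.
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, L+1, 1024)

embeddings = hidden[0, : len(sequence)]  # drop the trailing special token
print(embeddings.shape)  # torch.Size([10, 1024])
```

Such a per-residue matrix can then be fed directly to a downstream SS classifier, replacing the MSA-derived profiles of the previous generation.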
Additionally, the recent success of AlphaFold2 [28] proved that
NLP-inspired mechanisms like attention and transformers may be extremely
useful in protein tertiary structure prediction. It predicted protein
structures at near-X-ray resolution in the latest Critical Assessment of
protein Structure Prediction (CASP14) [29]. However, there is still room
for improving AlphaFold2 predictions in terms of SS. The newest results
demonstrate that its accuracy decreases for longer loop regions and that
it has a tendency to slightly overpredict helices and beta-sheets [30].
In our study, we present how our proposed ProteinUnetLM model, based on
the Attention U-Net architecture and LM-based features, improves over
its previous MSA-based version [31] and over its closest LM-based
competitors (SPOT-1D-LM, NetSurfP-3.0, and ProtT5Sec): (i) its
prediction performance measured by the sequence-level adjusted geometric
mean (AGM) is better than that of all other LM-based networks, while
being comparable in the segment overlap metric (SOV8) and Q8 accuracy;
(ii) it provides the best results for the rare structures G, B, and S;
(iii) its prediction time is comparable to that of the fastest
competitor, NetSurfP-3.0. These results support our hypothesis that
LSTMs are not needed to achieve state-of-the-art performance, as our
fully convolutional Attention U-Net architecture is at least as accurate
and at least as fast as any LSTM-based competitor. We especially focus
on the issue of imbalance in the SS8 prediction problem: besides
proposing proper metrics and statistical methodology, we extended the
loss function of the network with the Matthews correlation coefficient
(MCC), which improved the performance for rare structures.
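One common way to use MCC inside a loss function is to compute the multiclass MCC (Gorodkin's R_K) from soft confusion-matrix counts, so it stays differentiable; a minimal PyTorch sketch (an illustrative formulation, not necessarily the exact one used in ProteinUnetLM, where such a term would typically be combined with a cross-entropy loss):

```python
import torch

def soft_mcc_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
                  eps: float = 1e-7) -> torch.Tensor:
    """Differentiable multiclass MCC loss from soft confusion counts.

    y_pred: (N, C) predicted class probabilities (softmax outputs)
    y_true: (N, C) one-hot ground-truth labels
    Uses the multiclass MCC with soft counts, so the quantity remains
    differentiable; returns 1 - MCC so minimizing maximizes MCC.
    """
    s = y_true.sum()                    # total number of residues
    c = (y_true * y_pred).sum()         # soft count of correct predictions
    p = y_pred.sum(dim=0)               # soft per-class predicted counts
    t = y_true.sum(dim=0)               # per-class true counts
    numerator = c * s - (p * t).sum()
    denominator = torch.sqrt((s ** 2 - (p * p).sum()) *
                             (s ** 2 - (t * t).sum()))
    return 1.0 - numerator / (denominator + eps)
```

Unlike plain accuracy, MCC accounts for all entries of the confusion matrix, which is why optimizing it helps on strongly imbalanced classes such as G, B, and S.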
In comparison with the secondary structures parsed from the tertiary
structure predictions of AlphaFold2 [28] on the CASP14 dataset,
ProteinUnetLM achieved better AGM for 10 out of 30 sequences and better
precision for the helix (H) structure, demonstrating its potential to
make a significant step forward in the domain.