Materials and Methods
Datasets
For a fair comparison with the previous MSA-based models ProteinUnet2 and
SPOT-1D, we use the same training, validation, and test sets. The
training set TR10029 contains 10,029 sequences, the validation set
VAL983 contains 983 sequences, and there are two test sets: TEST2016
with 1213 sequences and TEST2018 with 250 sequences. See [20] for
details about these datasets.
For comparison with LM-based models, we use four additional test sets
introduced in SPOT-1D-LM [27]. The largest, TEST2020, contains 547
sequences deposited between 2018 and 2020, from which sequences with
remote homology to proteins released before 2018 were removed. Two
subsets were extracted from TEST2020 to assess the performance of the
algorithms in specific cases: TEST2020-HQ, with 121 sequences having
high-resolution structures (< 2.5 Å), and Neff1-2020, with 46 sequences
having no homologs (Neff = 1). The original TEST2020 list contained 671
sequences, but we could not find some of the sequence codes in the
Protein Data Bank (PDB) or ProteinNet [32]. We therefore provide our
lists of sequences and their DSSP-generated SS8 labels for TEST2020 as
Supplementary File 2. The smallest dataset, CASP12-FM, contains 22
free-modeling sequences from CASP12, i.e., targets without any known
structural templates.
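For reference, SS8 labels of this kind can be obtained from PDB structures with DSSP, for instance through Biopython's wrapper. The snippet below is a minimal sketch, assuming a locally installed mkdssp executable; the file name 1abc.pdb is a hypothetical placeholder, and this is not necessarily the exact pipeline used to build our lists.

```python
# Minimal sketch: extracting 8-state (SS8) labels from a PDB file with DSSP
# via Biopython. Assumes the `mkdssp` executable is installed and on PATH;
# the file name `1abc.pdb` is a placeholder.
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

parser = PDBParser(QUIET=True)
structure = parser.get_structure("1abc", "1abc.pdb")
model = structure[0]            # DSSP operates on a single model

dssp = DSSP(model, "1abc.pdb")  # runs mkdssp under the hood

# Each DSSP record is a tuple (index, amino acid, SS8 code, rel. ASA, phi,
# psi, ...); DSSP marks loops/irregular regions with '-', which is often
# remapped to 'C' (coil).
ss8 = "".join(dssp[key][2] for key in dssp.keys())
print(ss8)  # e.g. '-HHHHHHT-EEEE-...'
```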
Finally, we compare all the networks on the newest CASP14 dataset of 30
proteins (the same as used in [31]) for which the PDB targets were
available on the official CASP14 challenge page
(prediction-center.org/download_area/CASP14/targets).
The ratios of each SS8 class in all of the mentioned datasets are given
in Supplementary Figure S1, which gives an overview of how imbalanced
the SS8 prediction problem is.
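As an illustration of how this imbalance can be quantified, the short sketch below counts per-class frequencies over a set of SS8 label strings; the label strings shown are hypothetical stand-ins, not our data.

```python
# Sketch: measuring SS8 class imbalance as per-class frequencies.
# The label strings below are hypothetical stand-ins for DSSP outputs.
from collections import Counter

ss8_labels = ["CHHHHHHTCEEEECC", "CEEEETTEEEECCSC"]  # placeholder data

counts = Counter("".join(ss8_labels))
total = sum(counts.values())
for ss_class in "HGIEBTSC":  # the eight DSSP states ('-' mapped to 'C')
    print(f"{ss_class}: {counts[ss_class] / total:.3f}")
```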
Attention U-Net for secondary structure prediction
U-Net is a state-of-the-art architecture in image segmentation tasks
[33–35], and we previously introduced it into the domain of protein
secondary structure prediction with the ProteinUnet model [36]. The
Attention U-Net architecture of ProteinUnetLM presented in this paper
adapts our previous ProteinUnet2 model [31] to features from pLMs.
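For readers unfamiliar with Attention U-Net, the sketch below shows the core idea of an additive attention gate applied to 1D sequence features, in the spirit of the original Attention U-Net formulation. It is a generic PyTorch illustration under our own assumptions (arbitrary layer sizes, gating signal already upsampled to the skip length), not the ProteinUnetLM implementation.

```python
# Sketch of an additive attention gate for a 1D (sequence) U-Net, in the
# spirit of Attention U-Net. Generic illustration only; layer sizes are
# arbitrary and this is not the ProteinUnetLM code.
import torch
import torch.nn as nn

class AttentionGate1D(nn.Module):
    """Re-weights skip-connection features x using the decoder gating signal g."""
    def __init__(self, channels_x, channels_g, channels_inter):
        super().__init__()
        self.theta_x = nn.Conv1d(channels_x, channels_inter, kernel_size=1)
        self.phi_g = nn.Conv1d(channels_g, channels_inter, kernel_size=1)
        self.psi = nn.Conv1d(channels_inter, 1, kernel_size=1)

    def forward(self, x, g):
        # x: (batch, channels_x, length) skip features from the encoder
        # g: (batch, channels_g, length) gating signal from the decoder
        attn = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        return x * attn  # attended skip features, same shape as x

# Hypothetical usage on per-residue features of a length-700 sequence:
x = torch.randn(2, 128, 700)   # encoder skip features
g = torch.randn(2, 256, 700)   # upsampled decoder features
gate = AttentionGate1D(channels_x=128, channels_g=256, channels_inter=64)
print(gate(x, g).shape)        # torch.Size([2, 128, 700])
```

The gate learns, per residue, how much of each skip connection to pass to the decoder, which is the mechanism distinguishing Attention U-Net from a plain U-Net.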