Materials and Methods

Datasets

For a fair comparison with the previous MSA-based models ProteinUnet2 and SPOT-1D, we use the same training, validation, and test sets. The training set TR10029 contains 10,029 sequences, the validation set VAL983 contains 983 sequences, and there are two test sets: TEST2016 with 1,213 sequences and TEST2018 with 250. See 20 for details about these datasets.
For comparison with LM-based models, we use four additional test sets introduced in SPOT-1D-LM 27. The largest, TEST2020, contains 547 sequences deposited between 2018 and 2020, from which remote homologs of proteins released before 2018 were removed. Two subsets were extracted from TEST2020 to assess the performance of the algorithms in specific cases: TEST2020-HQ, with 121 sequences having high-resolution structures (< 2.5 Å), and Neff1-2020, with 46 sequences having no homologs (Neff = 1). The original TEST2020 list contained 671 sequences, but we could not find some of the sequence codes in the Protein Data Bank (PDB) or ProteinNet 32. We therefore attach our lists of TEST2020 sequences and their DSSP-generated SS8 labels as Supplementary File 2. The smallest dataset, CASP12-FM, contains 22 free-modeling sequences from CASP12, i.e., sequences without any known structural templates.
Finally, we compare all the networks on the newest CASP14 dataset of 30 proteins (the same set as used in 31), for which the PDB targets were available on the official CASP14 challenge page (prediction-center.org/download_area/CASP14/targets).
The ratios of the SS8 classes in all the mentioned datasets are given in Supplementary Figure S1, which shows how imbalanced the SS8 prediction problem is.
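As an illustration of how such class ratios can be tallied, the following sketch counts SS8 labels over DSSP-style annotation strings; the example strings and the helper function are hypothetical and only demonstrate the computation, not our actual pipeline:

```python
from collections import Counter

# The eight DSSP secondary structure classes:
# H (alpha-helix), G (3-10 helix), I (pi-helix), E (strand),
# B (bridge), T (turn), S (bend), C (coil)
SS8_CLASSES = "HGIEBTSC"

def ss8_ratios(dssp_strings):
    """Return the fraction of residues assigned to each SS8 class."""
    counts = Counter()
    for s in dssp_strings:
        counts.update(s)
    total = sum(counts.values())
    return {c: counts.get(c, 0) / total for c in SS8_CLASSES}

# Toy example with two short annotation strings
ratios = ss8_ratios(["HHHHCCEEEETTC", "CCHHHHHHSSC"])
```

Even in this toy example, helices dominate while rare classes such as I (pi-helix) may be entirely absent, mirroring the imbalance seen in the real datasets.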

Attention U-Net for secondary structure prediction

U-Net is a state-of-the-art architecture for image segmentation tasks 33–35, and we previously introduced it into the domain of protein secondary structure prediction with our ProteinUnet model 36. The Attention U-Net architecture of ProteinUnetLM presented in this paper adapts our previous ProteinUnet2 model 31 to the features generated by pLMs.
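For readers unfamiliar with attention gating, the sketch below shows the additive attention gate of the original Attention U-Net formulation, written in 1D form for sequence data: encoder skip-connection features are reweighted by a per-position mask computed from the decoder's gating signal. The weight matrices stand in for learned 1x1 convolutions and are drawn at random here; this is a minimal illustration, not the exact ProteinUnetLM implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, skip_ch, gate_ch, inter_ch = 100, 64, 128, 32  # illustrative sizes

skip = rng.standard_normal((skip_ch, L))   # encoder skip features
gate = rng.standard_normal((gate_ch, L))   # decoder gating signal (same length)

# Hypothetical learned 1x1-convolution weights (random for demonstration)
W_theta = rng.standard_normal((inter_ch, skip_ch)) * 0.1
W_phi = rng.standard_normal((inter_ch, gate_ch)) * 0.1
W_psi = rng.standard_normal((1, inter_ch)) * 0.1

# Additive attention: sigmoid(psi(ReLU(theta(skip) + phi(gate))))
hidden = np.maximum(W_theta @ skip + W_phi @ gate, 0.0)
mask = 1.0 / (1.0 + np.exp(-(W_psi @ hidden)))   # shape (1, L), values in (0, 1)

gated_skip = skip * mask                         # reweighted skip features
```

The mask suppresses positions in the skip connection that the coarser decoder signal deems irrelevant, which is the mechanism that distinguishes Attention U-Net from the plain U-Net.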