Comparison with LM-based classifiers

We compared our network with the three latest networks of similar utility based on features from pLMs: NetSurfP-3.0 26, ProtT5Sec 23, and SPOT-1D-LM 27. SPOT-1D-LM uses features from both the ProtTransT5-XL-U50 23 and ESM-1b 39 LMs, NetSurfP-3.0 uses only ESM-1b with 1280 features, and ProtT5Sec uses only ProtTransT5-XL-U50 with 1024 features. We ran SPOT-1D-LM from its source code (https://github.com/jas-preet/SPOT-1D-LM) and used web interfaces to run NetSurfP-3.0 (https://dtu.biolib.com/NetSurfP-3/) and ProtT5Sec (https://api.bioembeddings.com/). Note that these networks were trained on different but partially overlapping datasets. ProteinUnetLM was trained on 10029 sequences (TR10029 dataset) and validated on 983 sequences (VAL983 dataset); NetSurfP-3.0 and ProtT5Sec were trained on 10337 and validated on 500 sequences; and SPOT-1D-LM was trained on 38913 sequences (including most of the sequences from TR10029 and TEST2016) and validated on 100 sequences. To ensure no overlap between the training and test sets, we used only the test sets from SPOT-1D-LM for the comparisons in this section. We also attempted to train the ProteinUnetLM model on the larger datasets from SPOT-1D-LM, but, surprisingly, the results were suboptimal (Supplementary Table S1), so we kept the model based on the TR10029 dataset.
The comparison of ProteinUnetLM with these three networks on 5 different test sets is presented in Table 2. First, ProteinUnetLM was statistically significantly better than NetSurfP-3.0 on all test sets in the macro-AGM and SOV8 metrics, with relatively large effect sizes (d > 0.3). ProteinUnetLM also had much better residue-level metrics, except for macro-AGM on TEST2018, for which NetSurfP-3.0 correctly predicted the rarest structure “I” (Supplementary Table S3). The main advantage of ProteinUnetLM over the SPOT-1D-LM network was a better macro-AGM on all test sets, statistically significant (with a small effect size, d ≈ 0.1) for the three largest sets: TEST2018, TEST2020, and TEST2020-HQ. This stems from the fact that ProteinUnetLM achieves much better results for the rare structures B, G, and S without losing much accuracy for the frequent ones. For the same reason, SPOT-1D-LM had a better Q8 on most of the test sets (excluding CASP12-FM), but, as mentioned in Section 2.4, this metric is not appropriate for assessing SS8 prediction.
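To make the effect sizes quoted above more concrete, the following minimal sketch computes Cohen's d from two lists of per-sequence scores. It assumes the common pooled-standard-deviation form of d; the exact variant used for the reported values is not stated in this section, so it may differ. The function name `cohens_d` and the score lists are hypothetical and purely illustrative, not data from this study.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(scores_a, scores_b):
    """Cohen's d using a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(scores_a), len(scores_b)
    # Sample variances (n - 1 in the denominator) of each group.
    var_a, var_b = stdev(scores_a) ** 2, stdev(scores_b) ** 2
    # Pooled SD weights each variance by its degrees of freedom.
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean(scores_a) - mean(scores_b)) / pooled_sd


# Hypothetical per-sequence SOV8 scores for two predictors (illustrative only).
model_a = [0.80, 0.75, 0.90, 0.70, 0.85]
model_b = [0.70, 0.65, 0.80, 0.60, 0.75]

d = cohens_d(model_a, model_b)
print(f"Cohen's d = {d:.2f}")
```

A positive d means the first predictor has the higher mean score; the magnitude expresses the difference in units of the pooled standard deviation, which is why it complements the p-value when test sets differ greatly in size.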
Table 2. The comparison of macro-AGM and Q8 at the residue level, and SOV8 at the sequence level, on 5 test sets for ProteinUnetLM vs NetSurfP-3.0, ProtT5Sec, and SPOT-1D-LM. The best results for each dataset are boldfaced. Green shading of sequence-level scores denotes that ProteinUnetLM has a statistically significantly better mean; standard deviations (SD), p-values, and Cohen’s effect size (d) are given below each score.