MCC has already been evaluated as one of the most reliable, universal, and informative metrics for machine learning and bioinformatics problems40–42. We incorporated MCC into the training loss to address the class imbalance inherent in protein SS prediction and to improve the results on rare structures. The ablation study in Supplementary Table S1 suggests that this goal was achieved, as the metrics on TEST2018 improved.
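For illustration, one way to include MCC in a training objective is to compute soft (probability-based) confusion-matrix counts per class and add an MCC-derived term to the cross-entropy loss. The sketch below shows such a formulation in TensorFlow; the function names, the one-vs-rest averaging over the 8 SS classes, and the unweighted combination with cross-entropy are assumptions made for this example, not the exact loss used in ProteinUnetLM.

```python
import tensorflow as tf

def soft_mcc_loss(y_true, y_pred, eps=1e-7):
    """Illustrative differentiable MCC-based loss (one-vs-rest, averaged
    over classes); not necessarily the exact ProteinUnetLM formulation.

    y_true: one-hot labels, shape (batch, length, n_classes)
    y_pred: softmax probabilities, same shape
    """
    y_true = tf.cast(y_true, y_pred.dtype)
    # Soft confusion-matrix counts per class, pooled over batch and sequence.
    tp = tf.reduce_sum(y_true * y_pred, axis=[0, 1])
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=[0, 1])
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=[0, 1])
    tn = tf.reduce_sum((1.0 - y_true) * (1.0 - y_pred), axis=[0, 1])
    numerator = tp * tn - fp * fn
    denominator = tf.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    mcc = numerator / denominator           # per-class soft MCC in [-1, 1]
    return 1.0 - tf.reduce_mean(mcc)        # lower is better

def combined_loss(y_true, y_pred):
    # A common pattern: add the MCC term to categorical cross-entropy
    # (the equal weighting here is a hypothetical choice).
    cce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    return tf.reduce_mean(cce) + soft_mcc_loss(y_true, y_pred)
```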
We used the Adam optimizer43 with a batch size of 8 and an initial learning rate of 0.001. The learning rate was reduced by a factor of 0.1 when the validation loss did not improve for 4 epochs. Training was stopped when the validation loss did not improve for 6 epochs, and the checkpoint with the lowest validation loss across all epochs was selected as the final ProteinUnetLM model.
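In Keras terms, this schedule corresponds to a ReduceLROnPlateau callback combined with early stopping and best-checkpoint selection. The snippet below is a minimal sketch; `model`, `x_train`/`y_train`, `x_val`/`y_val`, the checkpoint file name, and the maximum epoch count are hypothetical placeholders, and `combined_loss` refers to the illustrative loss sketched above.

```python
import tensorflow as tf

# Placeholder model and data; the real ProteinUnetLM architecture and
# datasets are defined elsewhere in the training code.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=combined_loss)

callbacks = [
    # Reduce LR by a factor of 0.1 after 4 epochs without val-loss improvement.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.1, patience=4),
    # Stop after 6 epochs without improvement.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=6),
    # Keep only the checkpoint with the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint(filepath="proteinunetlm_best.h5",
                                       monitor="val_loss",
                                       save_best_only=True),
]

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          batch_size=8,
          epochs=1000,           # upper bound; early stopping ends training
          callbacks=callbacks)
```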
ProteinUnetLM was implemented in an environment containing Python 3.8 with TensorFlow 2.9, accelerated by CUDA 11.7 and cuDNN 8. The inference code and trained models are available on the CodeOcean platform (https://codeocean.com/capsule/7112101), ensuring high reproducibility of the results. An easy-to-use web interface is accessible on Biolib (https://biolib.com/SUT/ProteinUnetLM/). The training code can be run in a Google Colab notebook (https://colab.research.google.com/drive/1Onh6xlg-a-_QDy2EL_t9XmKa8T3VLVEv).

Metrics and statistical testing

Following the reasoning from the ProteinUnet2 paper, we use the Adjusted Geometric Mean (AGM) as the primary metric for assessing prediction performance. It is well suited to imbalanced bioinformatics problems, performs better than the F-score in such settings, and has no parameters (such as the beta in the F-score)44. It is given by Equation 4, where GM is the geometric mean (Equation 5) and Nn is the proportion of negative samples. It takes values in the range [0, 1], where 1 denotes a perfect prediction. The metric can be calculated both at the residue and at the sequence level. By the residue level, we mean calculating the metric once over all residues in all sequences in the dataset; by the sequence level, we mean calculating the metric separately for each sequence in the dataset and averaging the resulting scores. To aggregate the metric across the 8 classes, we use macro averaging: we calculate the AGM score separately for each class and average the results to obtain the macro-AGM score.
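As a reference for how these scores can be computed, the sketch below follows the commonly used AGM formulation (the geometric mean of sensitivity and specificity, adjusted by the proportion of negative samples Nn); the exact expressions used in this work are its Equations 4 and 5, which are not reproduced here. The function names and the eight-class range are illustrative assumptions.

```python
import numpy as np

def agm(y_true, y_pred, positive_class):
    """Adjusted Geometric Mean for one class (one-vs-rest), following the
    usual AGM definition; see Equations 4 and 5 of the paper for the
    formulation actually used."""
    pos = (y_true == positive_class)
    pred_pos = (y_pred == positive_class)
    tp = np.sum(pos & pred_pos)
    fn = np.sum(pos & ~pred_pos)
    tn = np.sum(~pos & ~pred_pos)
    fp = np.sum(~pos & pred_pos)

    sens = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # sensitivity (TPR)
    spec = tn / (tn + fp) if (tn + fp) > 0 else 0.0   # specificity (TNR)
    gm = np.sqrt(sens * spec)                         # geometric mean
    nn = (tn + fp) / len(y_true)                      # proportion of negatives
    if sens == 0.0:
        return 0.0
    return (gm + spec * nn) / (1.0 + nn)

def macro_agm(y_true, y_pred, classes=range(8)):
    """Residue-level macro-AGM: AGM per SS class, averaged over classes."""
    return float(np.mean([agm(y_true, y_pred, c) for c in classes]))
```

A sequence-level score applies the same per-class calculation to each sequence separately and then averages the per-sequence results, as described above.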