The network learns higher-level features in the convolutional contracting paths, concatenates them, and passes them to attention gates that learn to filter out irrelevant features 37,38. Finally, the filtered features are passed to the convolutional expanding path that learns to predict the sequence of 8-class secondary structures; the output layer with softmax activation is connected to the last up-block (Figure 1). As in ProteinUnet2, taking into account that the receptive field of our network spans 710 residues 36, we limited the input sequence length to 704. We also decided not to predict 3-class secondary structures (SS3), as they can be easily derived from 8-class predictions (SS8), and we did not notice any advantage of including an SS3 output in network training in our previous works. Other hyperparameters were the same as in the ProteinUnet2 paper to enable direct comparisons between the architectures. Specifically, each block contains 2 convolutions with 1D kernels of length 7 and ReLU activations, with dropout at a rate of 0.1 between the convolutional layers. Overall, the model has 2,501,260 trainable parameters. It is worth noting that ProteinUnetLM is a single model, not an ensemble of 10 models as in previous versions of ProteinUnet. Ensembling provides slightly better performance in some metrics, as presented in Supplementary Table S1, but we decided to sacrifice it to improve inference time.
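To illustrate the block structure described above, the following minimal Python/Keras sketch shows one convolutional block with two 1D convolutions of kernel length 7, ReLU activations, and dropout at a rate of 0.1 between the convolutional layers; the filter count and the "same" padding are illustrative assumptions, not values taken from the released implementation:

from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 1D convolutions (kernel length 7, ReLU) with dropout 0.1 in between,
    # as used in every block of the network (filter count is an assumption).
    x = layers.Conv1D(filters, kernel_size=7, padding="same", activation="relu")(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Conv1D(filters, kernel_size=7, padding="same", activation="relu")(x)
    return x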
ProteinUnetLM takes a sequence of feature vectors \(X=\left(x_{1},\ x_{2},\ x_{3},\ldots,\ x_{N}\right)\) as input, where \(x_{i}\) is the feature vector corresponding to the \(i\)-th residue, and it returns a vector \(Y=\left(y_{1},\ y_{2},\ y_{3},\ldots,\ y_{N}\right)\) as output, where \(y_{i}\) is the vector of 8 probabilities of the \(i\)-th residue being in one of the SS8 states. Our model is fed with 1024 features from ProtTrans T5-XL-U50 23. Each feature is standardized to zero mean and unit standard deviation on the training data. Using features from the ESM-1b model 39 instead of ProtTrans resulted in suboptimal performance, as presented in Supplementary Table S1. Additionally, we use a one-hot encoded sequence of amino acids as the second input to keep a close comparison with ProteinUnet2. However, Supplementary Table S1 suggests that it has a minor impact on the results.
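A hedged sketch of the model's input/output interface is given below, under the assumptions that sequences are padded to 704 residues and that the one-hot amino-acid encoding uses 21 channels (both choices are illustrative, not confirmed by the text):

from tensorflow.keras import Input, layers

MAX_LEN = 704
lm_features = Input(shape=(MAX_LEN, 1024), name="prottrans_t5_features")  # standardized LM features
one_hot_seq = Input(shape=(MAX_LEN, 21), name="one_hot_sequence")         # second input
# ... contracting paths, attention gates, and expanding path omitted ...
# The SS8 output is a softmax layer attached to the last up-block, e.g. (kernel size assumed):
# ss8_probs = layers.Conv1D(8, kernel_size=1, activation="softmax")(last_up_block)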

Training procedures and improved loss function

We trained a single ProteinUnetLM model using TR10029 as the training set and VAL983 as the validation set. The model was trained to simultaneously minimize the categorical cross-entropy (CCE, Equation 1) and maximize the Matthews correlation coefficient (MCC, Equation 2) by defining the loss function as the difference between the average CCE and the average MCC across the training batch (Equation 3).
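The combined loss can be sketched in Python/TensorFlow as below. The exact formulas are given in Equations 1-3; here the MCC is computed in a soft (differentiable) one-vs-rest form from the predicted probabilities and averaged over the 8 classes, which is an assumption about the implementation rather than the authors' exact code:

import tensorflow as tf

def cce_minus_mcc_loss(y_true, y_pred, eps=1e-7):
    # y_true and y_pred are assumed to have shape (batch, length, 8).
    # Average categorical cross-entropy over all residues in the batch
    cce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))

    # Soft confusion-matrix entries per class (one-vs-rest), summed over the batch
    tp = tf.reduce_sum(y_true * y_pred, axis=[0, 1])
    fp = tf.reduce_sum((1.0 - y_true) * y_pred, axis=[0, 1])
    fn = tf.reduce_sum(y_true * (1.0 - y_pred), axis=[0, 1])
    tn = tf.reduce_sum((1.0 - y_true) * (1.0 - y_pred), axis=[0, 1])

    # Soft per-class MCC, averaged across the 8 SS8 classes (soft form is an assumption)
    mcc = tf.reduce_mean((tp * tn - fp * fn) /
                         (tf.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps))

    # Loss to minimize: average CCE minus average MCC
    return cce - mcc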