The
network learns higher-level features in convolutional contractive paths,
concatenates them, and passes them to the attention gates that learn to
filter out irrelevant features 37,38. Finally, the
filtered features are passed to the convolutional expanding path, which
learns to predict the sequence of 8-class secondary structures; the
output layer with softmax activation is connected to the last up-block
(Figure 1). As in ProteinUnet2, since the receptive field of our network
spans 710 residues 36, we limited the input sequence length to 704. We
also refrained from predicting 3-class secondary structures (SS3), as
they can be easily derived from 8-class predictions (SS8), and we did
not observe any advantage of including an SS3 output in network training
in our previous works. Other hyperparameters were the same as in the
ProteinUnet2 paper, to enable direct comparisons between the
architectures. Specifically, each block contains 2 convolutions with 1D
kernels of length 7 and ReLU activations, with dropout at a rate of 0.1
between the convolutional layers. Overall, the model has 2,501,260
trainable parameters. It is worth noting that ProteinUnetLM is a single
model, not an ensemble of 10 models as in previous versions of
ProteinUnet. Ensembling provides slightly better performance in some
metrics, as presented in Supplementary Table S1, but we decided to forgo
it to improve the inference time.
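For illustration, a minimal sketch of one such convolutional block is shown below, assuming a Keras-style functional API similar to the original ProteinUnet implementations; the filter count, "same" padding, and the `conv_block` name are our assumptions, while the kernel length of 7, ReLU activations, and 0.1 dropout rate follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x: tf.Tensor, filters: int = 64) -> tf.Tensor:
    # Two 1D convolutions with kernels of length 7 and ReLU activations,
    # separated by dropout with a rate of 0.1, as described in the text.
    # The filter count and "same" padding are illustrative assumptions.
    x = layers.Conv1D(filters, kernel_size=7, padding="same", activation="relu")(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Conv1D(filters, kernel_size=7, padding="same", activation="relu")(x)
    return x
```

Blocks of this form would be stacked in the contractive and expanding paths, with the last up-block feeding a per-residue softmax over the 8 SS8 classes.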
ProteinUnetLM takes a sequence of feature vectors
\(X=\left(x_{1},\ x_{2},\ x_{3},\ldots,\ x_{N}\right)\) as input, where
\(x_{i}\) is the feature vector corresponding to the \(i\)-th residue,
and it returns a vector
\(Y=\left(y_{1},\ y_{2},\ y_{3},\ldots,\ y_{N}\right)\) as output,
where \(y_{i}\) is the vector of 8 probabilities of the \(i\)-th residue
being in one of the SS8 states. Our model is fed with 1024 features from
ProtTrans T5-XL-U50 23. Each feature is standardized to have a mean of 0
and an SD of 1 on the training data.
Using features from the ESM-1b
model 39 instead of ProtTrans resulted in suboptimal performance, as
presented in Supplementary Table S1. Additionally, we use a one-hot
encoded sequence of amino acids as the second input to allow a close
comparison with ProteinUnet2; however, Supplementary Table S1 suggests
that it has only a minor impact on the results.
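As a sketch of the input preparation described above, the snippet below standardizes the 1024 ProtTrans features with training-set statistics and one-hot encodes the amino-acid sequence; the file name, variable names, and alphabet ordering are hypothetical and not taken from the authors' code.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet ordering

# Per-feature statistics are computed on the training set only, so that
# every split is scaled with respect to the training data.
# train_embeddings: (num_training_residues, 1024) ProtTrans T5 features.
train_embeddings = np.load("train_prottrans_embeddings.npy")  # hypothetical file
train_mean = train_embeddings.mean(axis=0)
train_sd = train_embeddings.std(axis=0)

def standardize(embeddings: np.ndarray) -> np.ndarray:
    # Standardize a (sequence_length, 1024) embedding matrix feature-wise
    # to a mean of 0 and an SD of 1 on the training data.
    return (embeddings - train_mean) / (train_sd + 1e-8)

def one_hot(sequence: str) -> np.ndarray:
    # One-hot encode an amino-acid sequence into a (sequence_length, 20) matrix,
    # used as the second network input.
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(sequence):
        if aa in AMINO_ACIDS:
            encoding[i, AMINO_ACIDS.index(aa)] = 1.0
    return encoding
```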
Training procedures and improved loss
function
We trained a single ProteinUnetLM model using TR10029 as the training set
and VAL983 as the validation set. The model was trained to simultaneously
minimize the categorical cross-entropy (CCE, Equation 1) and maximize
the Matthews correlation coefficient (MCC, Equation 2) by defining the
loss function as the difference between the average CCE and the average
MCC across the training batch (Equation 3).
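A minimal sketch of such a loss is given below, assuming a TensorFlow/Keras setup and a differentiable "soft" multi-class MCC computed from predicted probabilities; the exact formulation used by the authors is given by Equations 1-3, so the `soft_mcc` construction here is an illustrative assumption.

```python
import tensorflow as tf

def soft_mcc(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    # Differentiable multi-class MCC (Gorodkin's R_K) computed from a "soft"
    # confusion matrix built with predicted probabilities instead of hard labels.
    # y_true: one-hot targets, y_pred: softmax outputs; both (batch, length, 8).
    # Masking of padded positions is omitted for brevity.
    yt = tf.reshape(tf.cast(y_true, tf.float32), (-1, 8))
    yp = tf.reshape(tf.cast(y_pred, tf.float32), (-1, 8))
    conf = tf.matmul(yt, yp, transpose_a=True)   # rows: true class, cols: predicted
    t = tf.reduce_sum(conf, axis=1)              # residues truly in each class
    p = tf.reduce_sum(conf, axis=0)              # residues predicted as each class
    c = tf.linalg.trace(conf)                    # correctly predicted (soft count)
    s = tf.reduce_sum(conf)                      # total number of residues
    numerator = c * s - tf.tensordot(t, p, axes=1)
    denominator = tf.sqrt((s * s - tf.tensordot(p, p, axes=1)) *
                          (s * s - tf.tensordot(t, t, axes=1)))
    return numerator / (denominator + tf.keras.backend.epsilon())

def cce_minus_mcc_loss(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    # Batch loss = average categorical cross-entropy minus MCC (cf. Equation 3).
    cce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    return cce - soft_mcc(y_true, y_pred)
```

A model could then be compiled with, for example, `model.compile(optimizer="adam", loss=cce_minus_mcc_loss)`; the optimizer shown here is only an example and is not taken from this excerpt.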