3 Methods
3.1 PLC validation criteria
PLC were obtained from the PDB, release 2023-03-15. The PDB Chemical
Component Dictionary41 was downloaded on 2023-03-17. X-ray validation
information was extracted
from the XML files provided by the PDB. Additional information including
the entry ID to polymer entity ID mapping, release date and polymer
composition for each entry as well as the canonical one-letter code
sequence for each entity in the dataset was retrieved with the
GraphQL-based API of the RCSB PDB Web Services42 on 2023-03-28. The 37
entries marked as obsolete in the API results were discarded.
Ligands were defined as any non-polymer entity. A PLC was defined as a
PDB entry with at least one polymer and one non-polymer entity (ion or
small molecule). PDB entries for which the “polymer composition” was
one of “DNA”, “RNA”, “DNA/RNA”, “NA-hybrid”, “other type pair”,
“NA/oligosaccharide” or “other type composition”, as well as any
remaining entry containing DNA or RNA polymers, were excluded.
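The PLC definition and exclusion rules above can be sketched as a simple filter. This is a minimal illustration, not the code used in this work: the record layout (`polymer_composition`, `polymer_types`, `n_nonpolymer_entities`) and the function name are hypothetical, and only the list of excluded compositions is taken from the text.

```python
# Polymer compositions excluded in the text above.
EXCLUDED_COMPOSITIONS = {
    "DNA", "RNA", "DNA/RNA", "NA-hybrid", "other type pair",
    "NA/oligosaccharide", "other type composition",
}
# Hypothetical labels for nucleic acid polymer entity types.
NUCLEIC_ACID_TYPES = {"DNA", "RNA"}


def is_plc(entry):
    """Return True if a (hypothetical) entry record qualifies as a PLC.

    `entry` is assumed to be a dict with the keys
    "polymer_composition" (str), "polymer_types" (list of str) and
    "n_nonpolymer_entities" (int).
    """
    if entry["polymer_composition"] in EXCLUDED_COMPOSITIONS:
        return False
    # Drop any remaining entry that still contains a DNA or RNA polymer.
    if any(t in NUCLEIC_ACID_TYPES for t in entry["polymer_types"]):
        return False
    # A PLC needs at least one polymer and one non-polymer entity.
    return len(entry["polymer_types"]) >= 1 and entry["n_nonpolymer_entities"] >= 1
```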
Binding pockets were defined as the set of amino acid residues in the
reference structure with at least one heavy atom within a 6 Å radius of
any heavy ligand atom.
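As an illustration of this pocket definition, the following sketch selects residues by heavy-atom distance. It is not the implementation used in this work: the atom tuples, the function name, the treatment of "H" and "D" as non-heavy elements, and the inclusive cutoff (distance of at most 6 Å) are assumptions made for the example.

```python
import math

# Elements treated as hydrogens (not heavy atoms); an assumption of this sketch.
HYDROGENS = {"H", "D"}


def binding_pocket(protein_atoms, ligand_atoms, cutoff=6.0):
    """Return the set of residue ids with a heavy atom within `cutoff` Å
    of any heavy ligand atom.

    protein_atoms: iterable of (residue_id, element, (x, y, z))
    ligand_atoms:  iterable of (element, (x, y, z))
    """
    ligand_heavy = [xyz for element, xyz in ligand_atoms if element not in HYDROGENS]
    pocket = set()
    for residue_id, element, xyz in protein_atoms:
        # Skip hydrogens and residues that are already part of the pocket.
        if element in HYDROGENS or residue_id in pocket:
            continue
        if any(math.dist(xyz, lig_xyz) <= cutoff for lig_xyz in ligand_heavy):
            pocket.add(residue_id)
    return pocket
```

For real structures, a spatial index (for example a k-d tree) would avoid the all-against-all distance loop; the brute-force version is kept here for readability.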
The filtering thresholds for the Iridium criteria were extracted from
the original manuscript18. The suggestion to filter out PLC in which
atoms from crystal packing are within 6 Å of any ligand atom was not
applied, because this information could not easily be extracted from
the PDB validation report.
3.2 PLC clustering and novelty assessment
For PLC clustering, the set of PLC described in section 3.1 was used.
PLC were grouped together based on the cluster identifier of all the
unique polymer entities and the chemical component 3-letter code of the
ligands (i.e. identical ligands) they contained. Polymer entity cluster
identifiers were obtained by performing sequence-based clustering of all
polymer entities in the dataset with the cluster module from the MMseqs2
software (version 13.45111)43. Six
different sequence-based clustering patterns were obtained as a result
of clustering with minimum sequence identity thresholds of 100%, 95%,
90%, 70%, 50% and 30%, respectively. For the sequence alignment, a
coverage threshold of 90% (-c 0.9) of both the query and target
sequences was used (--cov-mode 0). The sensitivity of the prefiltering
was set to 8.0 (-s 8.0). Clustering was performed with the connected
component algorithm (--cluster-mode 1) and with the --cluster-reassign
option, which reassigns cluster members to other clusters if they no
longer fulfill the clustering criteria after each iteration.
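The clustering settings described above could be assembled into MMseqs2 command lines as in the following sketch. The database and directory names are placeholders, the input is assumed to be a sequence database already created with `mmseqs createdb`, and the use of `--min-seq-id` for the identity threshold is an assumption, since the text does not name that flag. The commands are only built here, not executed.

```python
# Minimum sequence identity thresholds from the text, in percent.
IDENTITY_THRESHOLDS_PCT = [100, 95, 90, 70, 50, 30]


def mmseqs_cluster_command(seq_db, out_db, tmp_dir, min_seq_id):
    """Build an `mmseqs cluster` argument list with the settings from the text."""
    return [
        "mmseqs", "cluster", seq_db, out_db, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # assumed flag for the identity threshold
        "-c", "0.9",                      # 90% coverage ...
        "--cov-mode", "0",                # ... of both query and target
        "-s", "8.0",                      # prefilter sensitivity
        "--cluster-mode", "1",            # connected component clustering
        "--cluster-reassign",             # reassign members violating the criteria
    ]


# One command per threshold; "entitiesDB", "clu_<pct>" and "tmp" are placeholders.
commands = [
    mmseqs_cluster_command("entitiesDB", f"clu_{pct}", "tmp", pct / 100)
    for pct in IDENTITY_THRESHOLDS_PCT
]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a system where MMseqs2 is installed.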
Each PLC entry in the dataset was subsequently given an identifying
string consisting of the cluster identifiers of its entities and the
3-letter codes of the unique ligands present in the PLC.
The assessment of the novelty of a given PLC with respect to a different
set of PLC, at a given minimum sequence identity threshold, was
performed by comparing its PLC identifier to the set of all PLC
identifiers of the other set.
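A minimal sketch of the identifier construction and the novelty check follows. The exact string format is not specified in the text, so the separators and the sorting used here are hypothetical choices that only serve to make the identifier independent of entity and ligand order. Treating a PLC as novel when its identifier is absent from the other set is likewise an interpretation of the comparison described above.

```python
def plc_identifier(entity_cluster_ids, ligand_codes):
    """Combine polymer entity cluster ids and ligand 3-letter codes
    into an order-independent identifier string (hypothetical format)."""
    clusters = "_".join(sorted(set(entity_cluster_ids)))
    ligands = "_".join(sorted(set(ligand_codes)))
    return f"{clusters}__{ligands}"


def is_novel(plc_id, reference_ids):
    """A PLC is considered novel if its identifier does not occur
    in the set of identifiers of the reference PLC set."""
    return plc_id not in reference_ids


# Example with made-up cluster ids, at a single sequence identity threshold.
reference = {plc_identifier(["c3", "c12"], ["ATP", "ZN"])}
query_same = plc_identifier(["c12", "c3", "c12"], ["ZN", "ATP"])
query_new = plc_identifier(["c12", "c3"], ["ADP"])
```

Because the cluster identifiers depend on the sequence identity threshold, the comparison has to be repeated with the identifiers computed at each threshold of interest.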
3.3 Benchmarking state-of-the-art docking tools
A Nextflow36 pipeline (Nextflow version 20.10.0) was developed to run and assess five state-of-the-art
PLC prediction tools. It is available at https://github.com/PickyBinders/PickyBinder.
3.3.1 Benchmark dataset
The 363 PLC in the PDBBind time-split test set that were not used as
training data by TANKBind and DiffDock were used as a test set to
demonstrate the automated benchmarking workflow16. To
compare docking on experimental and predicted structures, AlphaFold39
v2.3.0 was used to predict models for 256 monomeric proteins in this set, using
the canonical one-letter code sequence with default parameters and
relaxation. Results are presented for the best relaxed model (according to
average pLDDT) of each protein.
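The model selection step can be illustrated as follows. This is a sketch only: the input is assumed to be a mapping from a (hypothetical) model name to its per-residue pLDDT values, which in practice would first have to be read from the AlphaFold output.

```python
def best_model_by_plddt(models):
    """Return the name of the model with the highest average pLDDT.

    models: dict mapping model name -> list of per-residue pLDDT values
    """
    def mean_plddt(name):
        scores = models[name]
        return sum(scores) / len(scores)

    return max(models, key=mean_plddt)
```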
3.3.2 Molecule preparation
Each ligand was prepared starting from its SMILES string. Ligands were
first standardized by neutralizing the charges, and their protonation
states were then adjusted for pH 7 using protonation rules. Explicit
hydrogen atoms were then added. The 3D conformation was generated using
the ETKDG method from RDKit44 and stored in SDF format. For docking
tools related to the AutoDock family, the Python package Meeko (v0.4.0)
was used to generate the PDBQT input files45.
3.3.3 PLC prediction tools
The predictions were run with the default parameters of each tool
unless stated otherwise below.
(1) AutoDock Vina30,31 version 1.2.3: docking was performed with exhaustiveness set to 64 within a Conda37 environment containing the required Python bindings. Meeko v0.4.0 was used to convert the PDBQT output file into an SDF file for use by the evaluation tools.
(2) SMINA32 was run within a Conda environment (v2020.12.10, conda-forge:b08c07c, based on AutoDock Vina 1.1.2) with exhaustiveness set to 64.
(3) GNINA33 was run using a Singularity image downloaded from https://hub.docker.com/r/nmaus/gnina (digest: 7087cbf4dafd, gnina v1.0.2 (master:0cb5eb8, built Sep 29 2022)) with exhaustiveness set to 64.
(4) TANKBind35: input preparation and inference were run according to the code provided at https://github.com/luwei0917/TankBind using a Singularity image for the dependencies downloaded from https://hub.docker.com/r/qizhipei/tankbind_py38.
(5) DiffDock34 inference was run using --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise within a Conda environment built according to the setup guide (master:2c7d438, built Mar 13 2023).
Each tool except DiffDock allows the definition of a pocket center
and grid size that restrict the search space for ligand conformations.
To assess predictions for different pockets, P2Rank40 (v2.4)
was used to predict and rank multiple binding pockets, with default
parameters for experimental structures and the -c alphafold option for
AlphaFold-predicted models. The box in which AutoDock Vina, GNINA and
SMINA search for binding poses was constructed around each predicted
P2Rank pocket center. The size of the search box was the diameter of
the ligand conformer generated by RDKit with an additional 10 Å on all 6
sides of the search box. Thus, for each tool, (p+1)*n predicted ligand
poses were obtained as outputs, where p is the number of pockets
predicted by P2Rank and n is the number of poses returned by the tool.
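The box size and the expected number of poses can be illustrated with a short sketch. Several points are assumptions not stated in the text: the ligand diameter is taken as the largest pairwise distance between atoms of the conformer, the box is treated as a cube, and the additional run in (p+1) is interpreted as one docking run without a predefined pocket (blind docking). All function names are hypothetical.

```python
import math
from itertools import combinations


def ligand_diameter(coords):
    """Largest pairwise atom distance (Å) of a conformer (assumed definition)."""
    return max((math.dist(a, b) for a, b in combinations(coords, 2)), default=0.0)


def search_box_edge(coords, padding=10.0):
    """Edge length of a cubic box: ligand diameter plus `padding` Å
    on each of the two opposite sides along every axis."""
    return ligand_diameter(coords) + 2 * padding


def n_expected_poses(n_pockets, poses_per_run):
    """(p + 1) * n: one run per predicted pocket plus one additional run
    (assumed here to be blind docking), with n poses returned per run."""
    return (n_pockets + 1) * poses_per_run
```

For example, a ligand with a diameter of 5 Å would be docked in a box with 25 Å edges, and a tool returning 9 poses per run on a protein with 3 predicted pockets would yield 36 poses.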
3.3.4 Scoring
BiSyRMSD (shortened to RMSD throughout this manuscript) and lDDT-PLI
scores were calculated with OpenStructure46 version 2.4.0 with
default parameters. The methods are identical to those described in the
CASP15 CASP-PLI assessment paper9. Every
ligand was scored separately, and a summary CSV file containing scores
for each ligand pose, pocket, and blind docking was generated.