Standard bag-of-features representations contain, at minimum, a bag of nuclear charges and a bag of two-body interactions, as in BOB; further bags holding additional information, such as angles and torsions, are added in BAT. The BATTY representation follows this approach, with the modification that minimal atom typing (i.e., sp, sp2, sp3 hybridization) is used to sort the bags. Unlike other bag-of-features representations, BATTY performed better after the bags of nuclear charges were removed and the nonbonding interactions were excluded from the two-body interactions bag, leaving a bag of simple bonds. Because relative conformer energies are strongly dominated by nonbonded interactions, this finding is surprising, although separating bonding from two-body nonbonded interactions may facilitate ML training. A recent example, BAND-NN, similarly separated the bonding and nonbonding information, as in classical force fields, and found an improvement in performance.\cite{Laghuvarapu_2019}
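To make the idea of a bag of simple bonds concrete, the following is a minimal Python sketch, not the BATTY implementation. It assumes that each bonded pair contributes the BOB-style Coulomb term $Z_i Z_j / r_{ij}$, that bags are keyed by the sorted pair of atom-type labels, and that each bag is sorted in descending order and zero-padded to a fixed length. The function name `bag_of_bonds`, the label format (e.g., `C.sp3`), and the `bag_sizes` argument are hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import dist

# Nuclear charges for the few elements used in this illustration.
NUCLEAR_CHARGE = {"H": 1, "C": 6, "N": 7, "O": 8}


def bag_of_bonds(atom_types, coords, bonds, bag_sizes):
    """Build a fixed-length feature vector from bonded pairs only.

    atom_types: labels such as "C.sp3" or "H" (hypothetical format:
                element, optionally followed by "." and a hybridization).
    coords:     Cartesian coordinates, one (x, y, z) tuple per atom.
    bonds:      (i, j) index pairs for bonded atoms; nonbonded pairs
                are simply never listed.
    bag_sizes:  fixed length of each bag, keyed by "typeA-typeB".
    """
    bags = defaultdict(list)
    for i, j in bonds:
        # Sort the two labels so that A-B and B-A share one bag.
        key = "-".join(sorted((atom_types[i], atom_types[j])))
        z_i = NUCLEAR_CHARGE[atom_types[i].split(".")[0]]
        z_j = NUCLEAR_CHARGE[atom_types[j].split(".")[0]]
        # Assumed entry: BOB-style Coulomb term Z_i * Z_j / r_ij.
        bags[key].append(z_i * z_j / dist(coords[i], coords[j]))

    vector = []
    for key in sorted(bag_sizes):
        entries = sorted(bags.get(key, []), reverse=True)
        if len(entries) > bag_sizes[key]:
            raise ValueError(f"bag {key!r} is too small for this molecule")
        # Zero-pad so every molecule yields a vector of the same length.
        vector.extend(entries + [0.0] * (bag_sizes[key] - len(entries)))
    return vector
```

Because only the pairs listed in `bonds` are visited, no bag of nuclear charges and no nonbonded two-body terms enter the vector, which mirrors the modification described above.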
ML workflows commonly normalize the data to improve model training.\cite{szegedy2015,NIPS2017_6698} In this work, we used physically motivated normalization techniques for the bag-of-features representations. Four molecular properties (the number of atoms, the number of bonds, the number of electrons, and the molecular mass) were chosen for normalizing the atomization energy. BATTY improved across Spearman, $R^2$, and MARE when normalized by the number of atoms (BATTY/n) or by the number of bonds (BATTY/b). The other bag-of-features representations showed a slight improvement in $R^2$, but no improvement in MARE, when normalized by the number of atoms. Normalizing the atomization energy therefore provides minor improvements for bag-of-features methods, but not enough to compete with the ANI-1 and ANI-2 methods.
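The normalization described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used in this work: the class `MoleculeSizes`, the functions `normalize_energy` and `restore_energy`, and the `by` argument are hypothetical names, and the energy value in the usage note is invented for demonstration. The model would be trained on the normalized target, and its predictions multiplied by the same size descriptor to recover an atomization energy.

```python
from dataclasses import dataclass


@dataclass
class MoleculeSizes:
    """The four size descriptors considered for normalization."""
    n_atoms: int
    n_bonds: int
    n_electrons: int
    mass: float  # molecular mass in g/mol


def normalize_energy(energy, sizes, by="n_atoms"):
    """Divide an atomization energy by the chosen size descriptor."""
    return energy / getattr(sizes, by)


def restore_energy(normalized, sizes, by="n_atoms"):
    """Undo the normalization on a predicted (normalized) energy."""
    return normalized * getattr(sizes, by)
```

For example, with methane described as `MoleculeSizes(n_atoms=5, n_bonds=4, n_electrons=10, mass=16.043)` and an invented atomization energy of -420.0, `by="n_atoms"` corresponds to the BATTY/n variant and `by="n_bonds"` to BATTY/b.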
ML methods, despite being trained on density functional and coupled-cluster energies, are still not as accurate as conventional quantum methods for predicting conformer energies. At present, the ANI family is comparable in accuracy to the semiempirical GFN methods on this task.