Abstract
Machine learning and artificial intelligence are gaining prominence through applications such as image analysis, language processing, and automation. Machine learning is also making profound changes in chemistry. From revisiting decades-old analytical techniques for the purpose of creating better calibration curves, to assisting and accelerating traditional in silico simulations, to automating entire scientific workflows, to deducing the underlying physics of unexplained chemical phenomena, machine learning and artificial intelligence are reshaping chemistry, accelerating scientific discovery, and yielding new insights. This review provides an overview of machine learning and artificial intelligence from a chemist’s perspective and focuses on a number of examples of the use of these approaches in computational chemistry and in the laboratory.
Introduction
Over the past two decades, machine learning and artificial intelligence have grown in utility and applicability. Increased computational power at lower hardware costs combined with algorithm development have allowed for the widespread implementation of technologies like facial recognition and natural language processing [1], [2], [3]. At the same time, “Big Data”, coupled with increasingly sophisticated algorithms to mine it, has created transformations in diverse fields ranging from manufacturing, technology, and healthcare, to the natural sciences, especially in biotechnology and pharmaceuticals [4], [5], [6], [7], [8], [9], [10], [11].
Machine learning is also rapidly finding new uses in chemistry, with the purpose of gaining as much insight into a chemical system or process as possible while reducing the required time, computational resources, and physical materials. Many experimental and theoretical studies necessitate the collection of repetitive information. Machine learning, with its capacity for quickly learning systematic patterns in data, can often be used to capture these underlying patterns, yielding more chemical insight with less data and fewer experiments than in traditional studies, and even creating new insights that may not have been possible otherwise [12]. Figure 1 shows this rapid growth in the hybridization of chemistry and machine learning.

Google Scholar articles by publication year highlighting the explosive growth of chemistry combined with machine learning, artificial intelligence, and deep learning, respectively.
Although the terms artificial intelligence and machine learning are used somewhat interchangeably, there are subtle differences between them. Artificial intelligence is focused on teaching computers to learn information and perform tasks in similar ways to humans, whereas machine learning generally refers to the algorithms that underpin artificial intelligence. “Deep learning” uses multi-layer artificial neural networks. Figure 2 shows how artificial intelligence, machine learning, and deep learning interface with one another.

Venn diagram showing the relationship between artificial intelligence, machine learning, and deep learning. Artificial intelligence utilizes machine learning algorithms to learn to solve problems in ways more similar to how humans do than traditional computer science algorithms. Deep learning is a subset of machine learning that uses multilayered neural networks.
This review focuses on a number of accomplishments of machine learning in computational chemistry and analytical chemistry. Topics addressed include efforts to represent chemical systems for machine learning algorithms, progress toward mechanistic insights derived from machine learning and applied to natural systems, and novel ways in which machine learning is being used to extract more information from conventional experiments. Finally, a brief synopsis of the landscape for machine learning in chemistry is provided.
The paradigm of machine learning differs from that of traditional scientific inquiry. Instead of deducing mechanisms to relate properties of interest to underlying principles, a machine learning algorithm utilizes mathematical forms to determine empirical relationships [13]. Many chemists already utilize machine learning in their daily tasks. Spectroscopists, for instance, frequently use Partial Least Squares (PLS) to build regression curves and extract concentrations from spectroscopy data [14]. Rather than simply characterizing peak shapes and heights, PLS allows entire sections of the spectra to be used, by projecting both X (spectra) and Y (concentration) variables into a new latent space based on correlation and then reducing the dimensionality of this space by removing uncorrelated regions [15]. PLS often does a better job than basic linear regression both at determining which spectral regions correlate most strongly with concentration changes and at discarding regions of the spectra that contain noise. Classification machine learning models, which are trained to assign data to predetermined classes based on similar attributes, are also frequently used in spectroscopy, for example to identify unknown components of mixtures. Partial Least Squares Discriminant Analysis (PLS-DA) is a variant of conventional PLS often used for these types of tasks [16, 17]. New data points are projected onto the PLS-created latent space, and their identities are determined by where they are positioned. Like PLS, most types of machine learning models have versions for both classification and regression, and the model used is selected based on the task at hand.
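As a minimal illustration of this kind of workflow, the sketch below fits a PLS regression model with the open-source scikit-learn library. The spectra and concentrations here are randomly generated placeholders, not real data; a real application would substitute measured spectra and reference concentrations.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Placeholder data: 50 spectra with 500 wavelength channels each,
# plus the known analyte concentration for each spectrum.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))        # spectra (rows = samples)
y = rng.uniform(0.1, 1.0, size=50)    # concentrations

# Project both X and y into a small latent space and regress.
pls = PLSRegression(n_components=5)
pls.fit(X, y)

# Predict concentrations for new spectra from the same instrument.
X_new = rng.normal(size=(3, 500))
print(pls.predict(X_new))
```

The number of latent variables (n_components) is the hyperparameter that would normally be tuned, for instance by cross-validation.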
PLS is a linear method, best suited to cases in which observed variables (i.e., spectra) correlate linearly with response variables (i.e., concentration). Other model types exist, however, that are able to map responses in increasingly complex fashions. Many of these models have hyperparameters, values that control the model rather than being learned directly from the data, which can be adjusted to optimize how the model learns. Because of the number of models available and the ways they can be altered, there is a nearly infinite number of combinations of models and hyperparameters that can be used to derive a relationship between any set of data and properties of interest [18]. However, as model complexity increases, the risk of overfitting also rises. Overfitting occurs when unrelated information or instrumental error in a dataset is made to appear to correlate with a property of interest but has no predictive power on new data. To help alleviate this problem, simpler models with fewer parameters are attempted first, as they are less prone to overfitting noise in the data from which they are learning. If results are not satisfactory, increasingly parameterized and complex models are used. Table 1 shows some of the most commonly used types of machine learning models. Neural networks generally represent the most complex type of model: they can be adjusted to any level of accuracy, but they are also the most prone to overfitting. Outliers in a dataset can be created by mistake or mismeasurement, but they may also be measured correctly yet be so unlike the rest of the data being used for modeling that they impart an improper bias on the entire model [19]. Regardless of the model type selected, outliers need to be removed before modeling; otherwise they will improperly influence the model and lead to poor performance [19]. Simply seeing how well a model represents its underlying data is not sufficient, as predictions made for new data outside the training set can be substantially worse. Data sets should therefore be separated into training, validation, and testing sets, in which the model is trained on the training subset of the data and then optimized on the validation set [20]. After completion, the model is checked against the testing set to estimate how the model will perform on new data and to ensure that prediction quality remains similar across all sets [20].
Machine learning model types, roughly sorted in order of ascending complexity.
Model type | Hyperparameters | Description |
---|---|---|
Multiple Linear Regression [25] | None | Simplest multivariate model, linear regression for multivariate data |
Principal Component Analysis [26, 27] | Number of components | Projects X variables onto a latent space based on correlation and orthogonality; effective for dimension reduction |
Partial Least Squares [28] | Number of latent variables | Similar to PCA but also projects the Y (response) variables |
Random Forests (RF) [29, 30] | Number of trees, maximum features in trees, maximum depth of trees, splitting criteria | Ensemble method that utilizes collections of decision trees |
Support Vector Machines (SVM) [31] | Kernel, gamma | Robust, non-probabilistic method that can also be used in linear and nonlinear forms |
Kernel Ridge Regression (KRR) [32, 33] | Kernel, alpha, gamma | Similar to SVM with different loss function that can improve speed of training |
Gaussian Process Regression (GPR) [34] | Kernel, alpha | Bayesian, probabilistic algorithm that works well on small datasets |
Neural Networks (NN) [35, 36] | Neuron type, number of layers, structure of layers, etc. | Tunable to arbitrary accuracy; requires care to avoid overfitting |
Obtaining the best accuracy in a model with the least risk of overfitting new data is one of the chief objectives of machine learning model selection, and many statistical tests have been developed for this purpose. The Computational Chemistry section below provides examples of recent work toward making machine learning models for chemistry applications more robust. Many tests are embedded directly in the software packages used for machine learning. (For more detailed guidance on model selection, parameterization, and verification, References [21], [22], [23] are recommended. For guidance on the application of statistics in machine learning specifically aimed at chemistry, see Reference [24].)
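A minimal sketch of the train/validation/test workflow described above, using scikit-learn; the dataset and model here are placeholders chosen only to show the mechanics of the splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Placeholder data: 200 samples described by 20 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# First split off a test set that is never touched during model building.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
print("validation R^2:", model.score(X_val, y_val))   # used to compare and tune candidate models
print("test R^2:", model.score(X_test, y_test))       # final check on unseen data
```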
Machine learning tasks can generally be divided into supervised and unsupervised learning. Supervised learning occurs when data points are categorized or labeled by an outside source and the model is tasked with learning how the labels correlate with the underlying data. Returning to the above spectroscopy examples, both regression and classification are examples of supervised learning, because in each case known, “labeled” values are used to train their respective models, which can then be used to label or interpolate new data where true values are not known. In unsupervised learning, the model itself determines the organization of the data. Unsupervised learning is used less frequently in chemical applications, and is mostly used for the exploration of chemical systems in which large amounts of data are available but little is known about mechanisms or relationships. Some examples of unsupervised learning include organizing unsorted chemical data, searching chemical space for molecules with desired properties, interpreting imaging results, and extracting patterns from data on chemical reactions [37], [38], [39], [40], [41], [42].
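For contrast with the supervised examples above, a minimal sketch of an unsupervised task: clustering unlabeled samples with k-means. The data are synthetic placeholders standing in for, say, descriptor vectors or spectra with no labels attached.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled measurements drawn from two distinct groups.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(30, 10)),
               rng.normal(5, 1, size=(30, 10))])

# No labels are supplied; the algorithm organizes the data on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```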
Unlike spectroscopy, many types of chemical experiments do not use data in a form easily accessible to machine learning. Often, the goal of an experiment is to correlate a property of interest to an underlying chemical structure. Descriptors, also called Molecular Representations, are used to transform chemical structures into computer-readable vectors, and in many machine learning tasks, the choice of descriptors used to represent molecules of interest is as important as the machine learning model itself [20, 43]. Descriptors can be very simple, such as molecular weights and atom types, or take on much more complex forms, even utilizing the same information as quantum mechanical calculations [44, 45]. Because of their importance, different types of descriptors will be discussed later in this review.
Since the 2000s, a large number of open-source programs have been developed for the explicit purpose of easily implementing machine learning. Although not made specifically for the sciences, many of these programs can be useful for chemists, though a coding background can be beneficial. Examples of regularly used programs include Scikit-Learn, TensorFlow, Keras, Apache Spark, R, and many others [46], [47], [48], [49], [50]. For the most part, machine learning and computational chemistry calculations still occur in their own separate software packages except for prototypical projects. Although the effort is still in its infancy, there are several recent attempts to make machine learning more accessible to researchers with less programming background. One of the first programs on this front, the MLatom package, has been developed for use by computational chemists [51]. Like many traditional ab initio programs, MLatom is written in Fortran, although it is also accessible via a Python interface. It allows for easy implementation of common machine learning algorithms, including many of the techniques discussed later in this paper. Another program, ChemML, serves as a machine learning workflow tailored to chemistry, with interfaces between molecular descriptor generators, general machine learning packages, and data sources [52]. The Open Chemistry project serves as an open-source framework for machine learning and chemistry, and it includes connections to ChemML as well as interactivity with online Jupyter notebooks and several common computational chemistry programs [53]. Another platform, OpenChem, focuses specifically on deep learning for drug discovery and molecular modeling [54]. In the materials realm, ML4Chem operates as a pipeline for developing and deploying machine learning models [55]. In the future, programs that integrate machine learning and chemistry will likely become increasingly common. Attesting to increased interest in machine learning from the physical sciences, there are now university-level courses for integrating chemistry and machine learning [56]. For example, one course, aimed primarily at undergraduate chemists, uses properties of wines as its primary data set and teaches basic Python, statistics, data visualization, and classification and regression approaches [56].
Depending on the demands of the project, analyzing chemical data via machine learning can be done on a regular laptop or may require a high performance computer (HPC). For simpler models with datasets containing hundreds to tens of thousands of molecules, higher-end laptop and desktop computers may suffice, whereas machine learning models that require more data or have more complex parameters to learn often require significant computational resources. Many programs allow deployment across a range of computer systems. TensorFlow, for instance, can be employed on both personal computers and large HPCs [57]. Approaches for HPC-based platforms are becoming more common; recent examples tailored specifically for combining chemistry and machine learning include CASTELO, for cloud-based drug-lead optimization, Stream-AI-MD, which is optimized for molecular dynamics simulations on an HPC by streaming simulation data as it is created, and Cloud 3D-QSAR, which generates molecular descriptors online in a systematic manner for further screening [58], [59], [60].
Computational chemistry
Although machine learning has been used in conjunction with chemistry for decades, its use in computational chemistry has spread substantially over the last few years [61]. Early machine learning algorithms have been used for decades in the analytical chemistry field, especially in calibration and multivariate analysis, such as via chemometrics [62], [63], [64]. Other techniques, such as Quantitative Structure Activity Relationships (QSARs), have become standard practice in pharmacology and materials research [65]. Yet somewhat ironically, machine learning has been a relative latecomer to computational chemistry. One of the reasons for the relatively late entrance of machine learning into the computational chemist’s toolbox is the sheer complexity of chemical space and the amount of relevant data that must be acquired in order to build models. Recently, however, new methodologies have been developed that are far more transferable between chemical systems, even with limited data to use in model building [66], [67], [68]. These new algorithms harness machine learning’s innate capacity to quickly make predictions while using new data paradigms to make those predictions more broadly applicable.
Unlike traditional quantum-based methods, in which calculations are run independently, machine learning has the benefit of being able to learn patterns from one dataset and transfer them to another. It assumes that much of the information collected in a series of calculations is repetitive and can be recognized as a pattern that can be inferred at much lower computational cost in terms of computer time, memory, and disk space, allowing a machine learning algorithm to analyze large datasets far more quickly and efficiently than traditional computational chemistry methods. Figure 3 illustrates the use of machine learning on a dataset already generated by other methods.

Illustration of data already generated by conventional techniques (a), analyzed by machine learning (b), with results that can then be applied to systems similar to those the machine learning model was trained on (c).
However, machine learning creates unique challenges. Most machine learning models are empirical, deriving relationships between the features of a system and its properties of interest based solely upon mathematical models where the relationship is known and then transferring that same relationship to make predictions where the property is unknown [13]. As such, these models are intrinsically demanding of data that has been curated and verified. Additionally, machine learning models cannot be applied to data that differs greatly from the data on which they have been trained. Hence, computational chemistry and machine learning generally have opposite limitations: traditional computational chemistry methods are more universal, yet become extremely time-consuming as molecule size and complexity increase, whereas machine learning methods are generally very fast, yet lack transferability to systems unlike the ones they have previously encountered [67].
Vast opportunities remain for machine learning to provide new insights and computational savings in computational chemistry. There is much interest in the development of new techniques that require far less data for model training and that move away from system-specific descriptors in favor of algorithms that are more broadly applicable across chemical systems [67].
QSARs
In the past, Quantitative Structure-Activity Relationships (QSARs) and Quantitative Structure-Property Relationships (QSPRs) have been used to represent chemical systems in an efficient manner [69]. QSARs and QSPRs are sets of generally discrete descriptors that contain chemical information about a system, often analogous to how chemists interpret molecules as collections of atoms, bonds, and other quantifiable properties [69]. Thousands of QSAR descriptors are known to exist, with varied target applications [44]. Among the most common are spatial descriptors, which include bonds, angles, and atomic identities. Additionally, dozens of QSARs specifically for computational chemistry and quantum chemistry have been developed [70, 71]. These quantum QSARs include properties commonly calculated in computational chemistry, such as electronic energies, dipole moments, electron densities, HOMO/LUMO energies, etc. [69, 70] Electron correlation energy has also been used as a quantum descriptor [72].
QSARs, although powerful, are generally limited to systems where large amounts of very similar data are available [65]. QSAR research is still active, both in discovering new applications for QSARs and the development of new descriptors themselves [73], [74], [75]. Although accurate, most systems modelled this way only contain several dozen molecules and are application specific. It is unlikely, for example, that a model trained on one class of pesticides would perform well on a different class of pesticides. Most QSAR models therefore do not have the versatility of general-purpose calculations. Additionally, with thousands of available descriptors, including descriptors that often overlap, QSAR models are often prone to overfitting and are difficult to optimize [76].
Molecular representations
To be used in machine learning, chemical species must be represented in a mathematical form that is computer-readable. These molecular representations, or descriptors, generally take the form of a vector of numbers. Because one of the greatest benefits of using machine learning is the low computational cost of making predictions on new data, descriptors need to be generated at far greater speeds than the methods they are trying to emulate. Capturing the chemically relevant portions of a system in a compact, efficient descriptor is one of the most important tasks for a chemist wishing to implement machine learning, and an area of highly active research. Recent open-source programs that generate common descriptors from molecular inputs for use in machine learning models include Mordred, RDKit, DScribe, and MolML [77], [78], [79], [80].
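As a simple illustration of descriptor generation, the sketch below uses RDKit, one of the packages cited above, to build a small descriptor vector from a SMILES string. The particular descriptors chosen are arbitrary examples, not a recommended set.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Build a molecule from a SMILES string (aspirin as an example).
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# A few simple, computer-readable descriptors for this molecule.
descriptor_vector = [
    Descriptors.MolWt(mol),           # molecular weight
    Descriptors.TPSA(mol),            # topological polar surface area
    Descriptors.NumHDonors(mol),      # hydrogen-bond donors
    Descriptors.NumHAcceptors(mol),   # hydrogen-bond acceptors
]
print(descriptor_vector)
```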
In quantum mechanics, a tremendous benefit of the Schrödinger equation is its universal applicability. Given a suitable wavefunction and fundamental physical properties, it is able to represent a chemical system without any further information [69]. However, solving the Schrödinger equation exactly is computationally impossible for all but one-electron systems, and approximations must be used for larger systems, often aiming to reach the best level of accuracy possible for a given amount of computational effort [81]. As machine learning gains greater utilization in quantum mechanical calculations, descriptors are being developed that are transferable between differing chemical systems and properties. Instead of quantifying chemical systems by using traditional metrics like atoms and bonds, these next-generation descriptors take on a more versatile form.
A brief synopsis of a number of the most common wave-function-like descriptors is presented here, as many of these descriptors are actively being improved and used in novel applications.
Coulomb Matrix
The Coulomb Matrix (CM) is a descriptor used to represent chemical species that requires the same information about a chemical system as the Hamiltonian: atomic coordinates and nuclear charge on atoms [82]. Bonds are not explicitly defined, making the CM a very compact and transferable representation of a chemical wavefunction. To generate a CM, each atom in a molecule is compared to itself and all other atoms. When compared to itself, the atom’s CM contribution is defined as
$$M_{II} = \tfrac{1}{2} Z_I^{2.4}$$

where $Z_I$ is the nuclear charge of atom $I$. When atom $I$ is compared to a different atom $J$, the corresponding off-diagonal element is

$$M_{IJ} = \frac{Z_I Z_J}{\left|\mathbf{R}_I - \mathbf{R}_J\right|}$$

where $\mathbf{R}_I$ and $\mathbf{R}_J$ are the atomic coordinates, so that the off-diagonal elements encode the Coulomb repulsion between pairs of nuclei.
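A minimal NumPy sketch of these two formulas, using water as an example (the coordinates are illustrative and approximate):

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Build a Coulomb matrix from nuclear charges Z and an (n_atoms, 3)
    array of Cartesian coordinates R."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4                          # self term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])  # pairwise Coulomb repulsion
    return M

# Example: water (O followed by two H atoms), coordinates in angstroms.
Z = np.array([8, 1, 1])
R = np.array([[ 0.000, 0.000, 0.000],
              [ 0.757, 0.586, 0.000],
              [-0.757, 0.586, 0.000]])
print(coulomb_matrix(Z, R))
```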
Faber–Christensen–Huang–Lilienfeld fingerprints (FCHL)
The Faber–Christensen–Huang–Lilienfeld (FCHL) representation is similar to the Coulomb Matrix but also includes the groups and periods of the atoms in a molecule and bond angles in addition to bond distances [86]. The descriptor is able to accurately predict a wide variety of electronic properties and is far more robust at predicting other regions of chemical space than any of its predecessors. FCHL has also been revised into the FCHL19 descriptor, in which the descriptor is discretized and optimized via Monte Carlo, yielding a set of transferable hyperparameters that allow the descriptor to be used on diverse chemical datasets without first requiring computationally intensive re-optimization [87].
Persistence images
An image-based molecular representation, based on persistent homology, has been adapted for chemical applications [88]. The descriptor takes the form of a computer-readable pixel image called a persistence image, which is used to represent the 3D topological features of a system on a 2D plane. The persistence images consist of a series of persistence diagrams, which capture the connectivity and proximity of atoms in molecules. Pixel intensity represents the multiplicity of each topological feature. To make molecules of similar structure but different atomic makeup distinguishable, atomic identity is introduced indirectly, by representing the difference in electronegativity between atomic components as a smeared Gaussian kernel. To demonstrate the effectiveness of persistence images, a high-throughput study was conducted to screen and identify organic molecules that interact with CO2 [88]. 133,000 organic molecules from the GDB-9 database were selected for the study, and several machine learning models were used, including Random Forest, Gaussian Process Regression, and Kernel Ridge Regression. 220 molecules that showed strong interactions with CO2 and weak interactions with N2 were used for model training. Models using persistence images performed better than comparative models using other descriptors, and from this training set of only 220 molecules, 44 novel molecules with highly exothermic interaction energies toward CO2 were identified from among the more than 133,000 molecules in the GDB-9 database.
Amons
In active learning, a machine learning algorithm itself determines which data points should be used to train it, by automatically selecting the points that would most improve its accuracy [89]. It is useful in instances where data is difficult or expensive to obtain, because it allows for smaller training set sizes [90]. Active learning has been shown in the past to be effective at modeling local chemical environments and accounting for outlier molecules [91]. In recent years, Gaussian process regression has become one of the most commonly employed strategies for utilizing active learning. Gaussian processes work by comparing distributions of functions at fixed points of data. When model confidence at a given point is too low, the model can be recalculated after the introduction of reference data taken at that point. Hence, large training sets do not need to be used, because the model selectively samples data and recommends more data where its uncertainty is high, so that new training data are generated “on the fly” only as they are needed. Being probabilistic and not requiring a predetermined functional form, Gaussian process regression works well at capturing the form of sparse and uneven datasets, as chemical data often are. “Amons,” or atom-in-molecule-based fragments, are a series of molecular fragments built to represent local chemical environments in target molecules of interest. Amons are similar to conventional molecular fingerprints, but are derived from molecular structural data by the machine learning model as they are needed, accelerating the learning rate of the model. Recently, Huang and Lilienfeld used amons in an active learning and Gaussian process-based experiment dubbed Amon Machine Learning (AML) [92]. In this project, a dictionary of ∼25,000 amons was built using ∼110,000 organic molecules from the QM9 database [93]. For a given target molecule to be computed, a subset of the amon dictionary is chosen to represent it. The number of selected amons is kept manageable by including only the most relevant amons for the task at hand, selected “on the fly” based on similarity, starting with the smallest fragments and increasing in size. Once the training set of amons has been optimized, the energy and other properties of the target molecule are estimated based on the calculated properties of the amons themselves. AML is applicable to a wide variety of properties including energies, forces, polarizabilities, charges, and NMR shifts, and may in the near future allow for accuracy beyond what can be achieved by the Hartree–Fock approximation for large molecules.
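A minimal, generic sketch of the active-learning loop described above, using Gaussian process regression from scikit-learn to choose the next training point where predictive uncertainty is highest. The oracle function here is a stand-in for an expensive reference calculation and is not tied to any specific study.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def oracle(x):
    """Stand-in for an expensive reference calculation (e.g., an ab initio energy)."""
    return np.sin(3 * x) + 0.5 * x

# Candidate pool and a tiny initial training set.
X_pool = np.linspace(0, 5, 200).reshape(-1, 1)
X_train = X_pool[[0, -1]]
y_train = oracle(X_train).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
for _ in range(10):
    gpr.fit(X_train, y_train)
    # Predict over the pool and locate the point of highest uncertainty.
    mean, std = gpr.predict(X_pool, return_std=True)
    idx = int(np.argmax(std))
    # Query the "expensive" oracle only at that point and add it to the training set.
    X_train = np.vstack([X_train, X_pool[idx]])
    y_train = np.append(y_train, oracle(X_pool[idx]).ravel())

print("training set size after active learning:", len(X_train))
```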
Partial charge descriptors
Work has also been done on the development of descriptors to estimate partial charges [94]. Partial charges, although a pseudo-physical property, are still useful as inputs in many simulations, including molecular dynamics, free energy estimations, and molecular docking, but are often too costly to estimate with traditional theoretical methods, especially for high-throughput screening. The Atom-Path-Descriptor (APD) approach enumerates interatomic paths up to a user-selected maximum path length. Duplicate paths are deleted and the resulting vectors are organized to make the representation rotation invariant. Then, Random Forest or XGBoost models are used to create regression models that estimate partial charges for use in further simulations. When tested on a mixture of two datasets, APD results show that predicted partial charges were on average ∼40 % more accurate than the techniques currently in common use to estimate partial charges for high-throughput screenings.
Partial charges can also be used as an efficient approximation of the wavefunction [95]. In tight-binding forms of DFT (DFTB), electrons are assumed to be localized around their respective atoms, yielding a pseudo-physical atomic charge. Rather than using the traditional self-consistent field (SCF) procedure, self-consistent charges (SCCs) can be used to estimate the wavefunction and other properties at lower computational cost. One process that seeks to further increase computational speed, dubbed ML-DFTB, uses machine learning to estimate the atomic charges. An environmental descriptor is used that incorporates atomic fingerprints, in which atoms and their neighbors are envisioned as crystal unit cells. Then, KRR is used to learn the partial charges. From a speed perspective, the ML-DFTB model estimated charges two times faster for small molecules and up to 10 times faster for large molecules, and allowed for parallel calculations on different pieces of a molecule. DFTB has also been combined with deep tensor neural networks (DTNN), coined DFTB-NNrep [96]. DFTB-NNrep has shown itself to be substantially more accurate than DFTB or DTNN alone for multibody calculations and may eventually allow chemical accuracy, or errors within 1 kcal mol−1 relative to experiment, for molecules with tens of thousands of atoms.
Combinations of descriptors
Machine learning has also been used in the development of descriptors themselves, both by creating novel combinations of conventional descriptors and by helping to deduce structural features. Winter et al. [97] used neural networks to create a translation between SMILES and InChI representations. These two representations, both of which have longstanding use in chemistry, are semantically equivalent and are often used as chemical descriptors. In this new approach, they are combined via a neural network process in which the complementary information of both representations is compressed into a single new, low-dimensional descriptor. Results showed that the new descriptor performed similarly to or better than many recently developed molecular fingerprints and graph-convolution models. Additionally, the descriptors are continuous and can be translated back into molecular structures, giving researchers new ways to navigate chemical space.
Direct wavefunction analysis
Appropriately accounting for electron correlation – the interactions between electrons – is one of the fundamental challenges in computational quantum chemistry. It is also one of the most computationally expensive problems. Since machine learning is able to quickly and efficiently deduce complex patterns, it is not surprising that attempts have been made to directly emulate the quantum mechanical wavefunction. Because quantum mechanical operators are used to gain insight into the properties of a chemical system, adapting them for machine learning is an intuitive way to build generalized machine learning models [98, 99]. Christensen et al. sought to utilize derivatives of energy with respect to nuclear positions or external electric fields in chemical systems in a kernel matrix machine learning model [99]. The model was used to successfully predict atomic forces and dipole moments using much less training data than traditional machine learning techniques require.
One study has shown that the Schrödinger equation can be deduced for a single particle by using autoencoder neural networks, a type of neural network that learns a representation of its input data. Wang et al. used a single quantum particle’s movement in one-dimensional space to demonstrate that an autoencoder could learn the form of both the Schrödinger equation and the quantum wave function [100]. The potential and density were discretized in space and then treated as a sequence. After training, the resultant model was shown to be mathematically equivalent to the Schrödinger equation and wavefunction. Attempts have also been made to deduce the Schrödinger equation for multiple electrons without the need for atomic orbitals. Han et al. used variational Monte Carlo to create trial wave functions for several small atoms and molecules [101]. Coined the Deep WaveFunction method, or DeepWF, the method uses a deep neural network to represent the trial wavefunction, which is then optimized variationally by Monte Carlo simulation. For the smallest systems, sub-kcal mol−1 accuracy was achieved for ground-state energies; however, the accuracy diminished as the number of electrons in a system increased. Future work will need to include a better accounting of the electron correlation for larger systems.
Machine learning-based treatments of the wavefunction continue to advance. Hermann et al. developed PauliNet, a neural network representation of the wavefunction that is trained via quantum Monte Carlo [102]. Although it scales as only N^4 with system size, it is able to capture 97–99.9 % of the correlation energy for molecules with up to 30 electrons. Schütt et al. developed SchNOrb [103], which uses a local basis of atomic orbitals and is an extension of SchNet [104, 105]. Like the traditional wavefunction, all ground state properties can be derived from it. As such, the wavefunction serves as an interface between quantum mechanics and machine learning, as analysis can be performed on the machine learning-generated wavefunction in the same way it could be if generated by ab initio methods. Further improvements will be needed to tackle larger systems, yet the work serves as a proof of concept for the direct machine learning modeling of electronic structures, and may lead to a pathway for inverse chemical design, in which materials are designed starting with specific functionalities and then assigned corresponding molecular structures.
“Gaussian Moments” are conceptually similar to the Gaussian-type atomic orbitals used in many conventional computational chemistry calculations [106]. They are approximations formed from linear combinations of functions that represent atomic orbitals, and are atom-centered and combined to describe whole molecules and periodic systems. Gaussian moments rely solely on atomic coordinates instead of electronic positions, are invariant toward geometric translations, and can be used in many types of common machine learning models. For demonstration purposes, the researchers used feed-forward neural networks with Gaussian moments to evaluate three benchmark sets. Models using Gaussian moments obtained energy and force predictions with accuracy comparable to that of far more complex neural networks using other descriptors, giving them an edge in efficiency.
Neural networks have also been used to variationally determine the optimal form of the ground state wavefunction. Yang et al. developed a Boltzmann machine-based network that required no reference data or prior knowledge of the wavefunction, but instead learns to minimize the variational energy itself, building on the earlier neural networks of Carleo and Troyer [107, 108]. Yang et al. extended that approach by including higher-order Boltzmann machine models and by utilizing the coefficients of the complete active space configuration interaction (CAS-CI) computational chemistry approach, corresponding to orbital occupancies, which become the molecular descriptors for the machine learning routine. In this manner, the neural networks are able to solve for the static and multireference electron correlation relevant to many-electron wavefunctions of molecular systems. The approach is still in the prototyping stage, and future work will include expanding its applicability to larger active spaces.
Transfer learning
Transfer learning is a broad category of machine learning in which previously trained models are used as starting points for new models. By utilizing an existing model as a starting point, training time can be significantly reduced and the amount of new data needed can be pared down by simply tuning the existing model with a different dataset [109]. Unlike traditional machine learning, where models are built from data from the same feature space to which they will be applied, transfer learning models are used with two distinct but related sets of data [109]. The first set of data, the “source,” is generally larger than the second and contains the information necessary to derive a relationship with the properties of interest for a model. Then, the model is tuned and adjusted to be applied to the target dataset [109]. Transfer learning has found many applications in image analysis and language processing. In computational chemistry, it shows promise in situations where large, general, or low-accuracy datasets exist in conjunction with small, targeted datasets. Figure 4 illustrates how transfer learning works.

Illustration of transfer learning. First, a dataset with a large amount of data (a) is used to train a machine learning model (b). Then, the model is fine-tuned using a different and generally smaller set of data (c), which can give higher-accuracy results (d) than if the model were trained on the different data set alone.
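A minimal sketch of the fine-tuning workflow depicted in Figure 4, using Keras; the datasets, network architecture, and layer-freezing choice are placeholders rather than those of any specific study.

```python
import numpy as np
import tensorflow as tf

# Placeholder "source" data: many samples with 32 descriptor features.
rng = np.random.default_rng(3)
X_source, y_source = rng.normal(size=(10000, 32)), rng.normal(size=10000)
# Placeholder "target" data: a much smaller, related dataset.
X_target, y_target = rng.normal(size=(200, 32)), rng.normal(size=200)

# (a)-(b): train a simple network on the large source dataset.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_source, y_source, epochs=5, verbose=0)

# (c)-(d): freeze the early layers and fine-tune only the last one on the target data.
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(optimizer="adam", loss="mse")   # recompile after changing trainability
model.fit(X_target, y_target, epochs=20, verbose=0)
```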
As a relatively new entrant for predicting chemical data, transfer learning continues to be researched and refined for use by the computational chemist. Iovanac and Savoie [110] discovered that transfer learning can be used to simultaneously predict multiple energetic properties between chemical systems, and that higher-correlated properties need less training data in the space they are being transferred to. Additionally, they found that linear predictors are more transferable across chemical systems, more translatable, and do not decrease accuracy. Amabilino et al. [111] discovered that transfer learning could be used for generating text-based Simplified Molecular Input Line Entry System (SMILES) representations for automated drug design. A general model was trained using a recurrent neural network on a broad dataset in order to learn how the SMILES methodology works, and then fine-tuned on a smaller dataset that would be more similar to targets of interest. Results showed that good SMILES representations were generated when the model was optimized between giving unique representations while still maintaining similarities to the class of molecules in the transfer learning dataset.
Benchmarking computational methods against experimental data is common practice, but a new transfer learning approach seeks to directly integrate experimental results into a quantum mechanics-based machine learning model [112]. As a starting point, a deep neural network, ElemNet [113], was trained on DFT data alone, using ∼341,000 DFT-calculated formation energies from the Open Quantum Materials Database (OQMD) [114]. Then, two smaller DFT-generated datasets, which contained more specifically targeted molecules, were trained independently from scratch, as well as one dataset containing ∼1960 experimental values. To test the efficacy of transfer learning, the smaller DFT dataset models and the experimental dataset model were then retrained and fine-tuned using the initial parameters derived from the OQMD-based model. Results showed that this novel leveraging of DFT and experimental data yielded better accuracy, nearly halving the average error of formation energy predictions relative to the independent experimental model and decreasing errors for both smaller independent DFT models as well. The patterns learned from the OQMD dataset and transferred to the new models appear to be the cause of this sizable increase in accuracy.
Integrated methods
During the past few years, machine learning methods have been blended into computational chemistry calculations in novel ways. These approaches tend to be more targeted than the broad, more qualitative studies used for high-throughput screening, and rather than utilizing large sets of chemical data generated elsewhere, they seek to integrate themselves more directly into a computational chemistry workflow. They are more analogous to traditional computational chemistry methods, with one or more pieces augmented with machine learning, than they are to “Big Data” approaches, as shown in Figure 5.

Illustration of machine learning integrated directly into a calculation process. Machine learning is used directly in the acquisition of data (a), possibly invisibly to the end user. Results (b) are a hybrid of traditional methods for data collection and machine learning.
Δ-ML is a hybrid approach that combines strengths from traditional quantum methods and machine learning, and it may serve as a stepping stone to future, high-accuracy machine learning models for quantum mechanics [115]. Δ-ML operates on the principle that often only a small portion of the overall energy of a system is actually relevant for a chemical property of interest. This portion – the electron correlation energy – can be the most computationally expensive to describe, whereas the rest of the system’s energy (the Hartree–Fock energy) can be easily quantified by a computationally inexpensive, commonly used calculation (i.e., a restricted Hartree–Fock (RHF) calculation). Because Δ-ML is intuitive to grasp and implement, it may be a starting point for future research into integrating machine learning into existing workflows. Although similar techniques have been used previously, Δ-ML was first formalized by Ramakrishnan et al., who calculated atomization enthalpies, free energies, and electron correlation for 7000 organic molecules from the GDB dataset, along with the thermodynamic properties of 16,000 isomers of C7H10O2, with results close to those provided by density functional methods, but at much lower computational cost [115].
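A schematic sketch of the Δ-ML idea using kernel ridge regression: the model is trained on the difference between a cheap baseline and an expensive target level of theory, and a prediction adds the learned correction back onto a new baseline calculation. The descriptors and "energies" below are synthetic placeholders, not output from any quantum chemistry code.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Placeholder descriptors and energies at two levels of theory.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 30))           # descriptor vectors for 500 molecules
w = rng.normal(size=30)
E_cheap = X @ w                          # stand-in for baseline (e.g., HF or DFT) energies
E_target = E_cheap + np.sin(X[:, 0])     # stand-in for target (e.g., CCSD(T)) energies

# Train only on the difference between the two levels of theory.
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
model.fit(X, E_target - E_cheap)

# Prediction for a new molecule = cheap baseline calculation + learned correction.
X_new = rng.normal(size=(1, 30))
E_cheap_new = X_new @ w                  # stand-in for a new baseline calculation
print(E_cheap_new + model.predict(X_new))
```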
Progress in Δ-ML types of approaches continues to be made. Recently, Nandi et al. used a Δ-ML approach to scale up DFT calculations of the potential energy surfaces of five molecules to the accuracy of the far more demanding CCSD(T) level of theory [116]. The differences between the two levels of theory were fitted against molecular structure by using the permutationally invariant polynomial (PIP) method, a mostly linear least-squares approach for fitting electronic energies of mid-sized chemical systems that has seen increasing prominence over the last decade and a half [117]. This fitting approach required only about 6 % of the CPU time needed to calculate each conformation at the CCSD(T) level of theory, and energy fitting errors relative to CCSD(T) were generally within a few hundred cm−1. The authors suggest that their method could be extended to larger molecular systems as well.
Δ-ML approaches have also been used to augment semi-empirical methods. In one recent study, low-cost DFTB-SK was used as a semi-empirical baseline for calculating thermodynamic properties of medium-sized organic molecules [118]. KRR was used to improve the DFTB-SK-generated molecular energies by using energies derived from more advanced ab initio methods, and combined with a sampling scheme for Hamiltonian and reservoir replica exchange, the approach, dubbed MORESIM, was able to traverse free energy landscapes at accuracy levels computationally unfeasible with traditional methods. Similarly, OrbNet starts with semi-empirical electronic structure calculations and scales them up to DFT levels of accuracy via a neural network [119]. OrbNet uses symmetry-adapted atomic orbitals as molecular descriptors, and is able to obtain its DFT-like accuracy at three orders of magnitude lower computational cost. Recently, a scaled-up version with better algorithms and more training data has been introduced, called OrbNet Denali [120].
An extension of Δ-ML, Hierarchical Machine Learning (hML) combines synchronous Δ-ML models calculated at increasingly higher levels of theory to run fast calculations of potential energy surfaces (PES) [121]. In this methodology, multiple levels of Δ-ML models are used, each correcting the baseline set by the previous level. Training set sizes for each level of the hierarchy are optimized to reduce the number of reference calculations needed. As a test, the potential energy surface of CH3Cl was calculated using hML at eight different levels. Computational cost was reduced by a factor of roughly 100, with accuracy on the order of 1 cm−1.
In the spirit of Δ-ML, Δ-DFT adds corrections to a DFT-calculated baseline energy to improve its accuracy, using the DFT-calculated density as a descriptor [122]. This technique uses a KRR model to directly correlate the electron density calculated from DFT to properties from much higher-order calculations, generally CCSD(T). The error in DFT itself is directly modeled as a correction, so that CCSD(T)-quality energies can be estimated from DFT-level calculations alone. Differences between the two methods in optimized molecular geometries are also corrected implicitly in the model. Importantly, the Δ-DFT approach was shown to be more effective than using machine learning to predict DFT or CCSD(T) energies separately, since the machine learning model learns the innate error within DFT. Model predictions showed that sub-kcal mol−1 accuracy was obtainable on test data for approximately the computational cost of DFT plus a few dozen CCSD(T) calculations.
A similar experiment was also performed by Dick and Fernandez-Serra, who developed the NeuralXC functional in order to increase the accuracy of low-cost DFT calculations [123]. The density calculated via DFT was used as a descriptor and was projected onto atom-centered basis functions. As in Δ-ML, a low-cost baseline was calculated, in this case using the relatively inexpensive DFT functional PBE. Then, corrections were added to the baseline calculation by using the NeuralXC potential to estimate the difference between the PBE energy and high-accuracy CCSD(T). The technique was shown to be highly transferable among systems and very accurate for a molecular dynamics simulation of liquid water. NeuralXC scales linearly with system size and presents a new, computationally efficient route for accurately simulating large systems.
Electron density itself has also been generated using machine learning. Cuevas-Zuviría and Pacios recently developed the anisotropic analytical model of density neural network (A2MDNet), which predicts electron densities of molecules of biological interest [124]. A2MDNet, which scales linearly, is meant to allow for the quick estimation of the electron densities of large macromolecules, which in turn can be used to predict properties not present in the training set. Work is being done to increase the accuracy of the method and to expand its applicability to biopolymers.
Molecular Orbital Basis Machine Learning (MOB-ML) interjects machine learning directly into the self-consistent field (SCF) procedure [67]. It utilizes elements of the Fock matrix F, Coulomb matrix J, or exchange matrix K. Because these elements are based on molecular orbitals, they are thought to be transferable between systems where the atoms differ. In the MOB-ML procedure, the Hartree–Fock energy of a chemical system is calculated conventionally. The on-diagonal and off-diagonal elements of the Fock matrix are then modelled against the correlation energy recovered at higher levels of theory using Gaussian process regression, allowing the electron correlation energy to be estimated from Hartree–Fock molecular orbitals at the computational expense of HF [125]. MOB-ML appears to require far fewer training calculations than Δ-ML to reach similar accuracy [125].
The domain on which MOB-ML has been trained is constantly being expanded and optimized. Recent efforts have included expanding it to cover energies for organic molecules and molecules with transition metals, as well as non-covalent interactions and transition state energies [126]. Active learning is also being used to quickly and efficiently extend MOB-ML, for instance to describe protein backbone-backbone interactions with sub-kcal accuracy [126]. As MOB-ML appears to be both size consistent and rooted in physical theory, its potential applications are manifold. Some future applications the authors suggest include applying it to open-shell systems and electronically excited states. In another recent extension, MOB-ML was expanded to include analytical nuclear gradients [127]. It was shown that the addition of the gradients allowed a MOB-ML-based model to obtain a mean absolute error of only 1.64 kcal/(mol·Å) for the ISO17 dataset when trained on only 100 reference energies, a third of the error of the next best method, which required several hundred thousand training molecules.
Conceptually similar to MOB-ML, the deep post-Hartree–Fock (DeePHF) approach predicts energy differences between low-cost Hartree–Fock and more accurate CCSD(T) calculations from the ground-state electronic orbitals, essentially capturing the correlation energy [128]. Active learning, in which training sets are iteratively constructed to best represent the area of space being modeled, was used in order to reduce the number of reference computations required. Accuracy on the order of ∼kcal mol−1 was obtainable for many organic molecules at a computational cost far less than that of a traditional Hartree–Fock calculation.
Machine learning methods are increasingly being used to calculate complex electronic properties. Tuan-Anh and Zaleśny used neural networks and Random Forest to predict molecular polarizabilities and hyperpolarizabilities of organic compounds [129]. Over 50,000 molecules were studied using this process, and the machine learning models’ predictions showed consistency and accuracy across diverse groups of molecules.
Even chemical machine learning models themselves can be optimized via machine learning approaches. Many machine learning models and some molecular representations have tunable hyperparameters that are adjusted to find the optimum balance between variance and bias. Traditionally, an exhaustive grid search is used to optimize these parameters, by iterating through a combinatorial range of values for each parameter. Stuke, Rinke, and Todorović instead used Bayesian optimization to determine the hyperparameters of kernel ridge regression models applied to two different chemical descriptors, the Coulomb matrix and MBTR, to predict molecular orbital energies of organic molecules [130]. The KRR models themselves each contained three hyperparameters; Coulomb matrices have no hyperparameters to tune, but MBTR representations contain 14. Prior work had shown that some hyperparameters do not have much impact on final model performance, so the number of hyperparameters to optimize was narrowed down to two for the Coulomb matrix models and four for the MBTR models, respectively. For the optimization of the simple Coulomb matrix model, results from Bayesian optimization and grid search were similar, showing that Bayesian optimization did not introduce additional error. For the optimization of the more complex MBTR model, however, Bayesian optimization was significantly quicker than a grid search and yielded final-model errors similar to or smaller than those found by the conventional approach.
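For reference, a minimal sketch of the conventional exhaustive grid search mentioned above, tuning the two KRR hyperparameters with scikit-learn on placeholder data. A Bayesian optimizer would instead propose each new hyperparameter setting based on the results of previous evaluations rather than enumerating the full grid.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Placeholder descriptor matrix and target property (e.g., orbital energies).
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 40))
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=300)

# Exhaustive grid search over the two KRR hyperparameters with 5-fold cross-validation.
grid = {"alpha": np.logspace(-6, 0, 7), "gamma": np.logspace(-4, 0, 5)}
search = GridSearchCV(KernelRidge(kernel="rbf"), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```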
Using machine learning to accelerate the optimization of potential energy surfaces has become common practice, yet usually different configurations are determined separately and independently. A new approach aims to optimize multiple geometric configurations simultaneously by sharing information between configurations. Yang et al. [131] used a neural network system called SingleNN to optimize metal slabs and nanoparticles, with and without adsorbates. Unlike most other neural networks, SingleNN uses a single neural network to calculate the contributions of different elements, which allows new elements to be fit by changing only the weights in its output layer [68]. The researchers optimized systems of increasing complexity, using an active learning approach while retaining the results from prior optimizations. Overall, the need for 50–98 % of the high-cost reference data was eliminated compared to traditional optimization techniques, depending on the system of interest.
Another approach to the optimization of a potential energy surface involves utilizing reinforcement learning to assist a traditional PES minimization procedure [132]. Ahuja, Green, and Li augmented the quasi-Newton optimization method Broyden–Fletcher–Goldfarb–Shanno (BFGS) with a force field-trained reinforcement learning model that produces a corrective term for each optimization step [132]. The result was a 30 % reduction in the number of optimization steps required in cases where the initial geometry estimates were poor, allowing the BFGS optimizer to perform more like more sophisticated optimizers without requiring costly higher-order derivatives of the PES.
Machine learning-augmented calculations are being trained to emulate increasingly complex systems. Richings and Habershon have developed a potential energy curve approach that can model nonadiabatic chemical dynamics with KRR [133]. This system works “on the fly” by using variance in the KRR model to determine where data is insufficient in the PES and running new calculations at these points before proceeding, minimizing the number of ab initio calculations required. Once the model is optimized, it is able to interpolate across the PES, reducing the need for expensive calculations.
Nonlinear machine learning models have been used to predict and extrapolate exchange spin coupling. Bahlke et al. compared linear ridge regression with Gaussian process regression for the prediction of Heisenberg exchange spin coupling constants for 257 doubly bridged dicopper complexes [134]. Interestingly, although the two types of regression performed roughly the same when applied to molecules similar to those they were trained on, Gaussian process regression was shown to also work for extrapolating the properties of dicopper complexes outside the systems that were modeled, and it did so with a relatively simple descriptor dating back to the 1950s, whereas more recent, complex descriptors were not able to accurately capture the chemical information necessary for extrapolation.
Just as recent work has been done to make machine learning models and descriptors more transferable across systems, there has also been research to test the boundaries of the applicability of machine learning methodologies to different types of chemical problems, and to alleviate errors that occur when an existing methodology is applied to a different problem. DeePMD-kit is a software package written to assist with deep learning of potential energies and force fields for molecular dynamics, with applications including finite and extended systems as well as metallic and chemically bonded systems [135]. Yue et al. extended it to include cluster and vapor phase properties, and discovered that a local representation methodology did not adequately account for long-range interactions [136]. However, treating long-range electrostatic interactions in the same way as short-range interactions interfered with the model’s ability to calculate the short-range interactions. To combat this cross-contamination, the model was decomposed into a local non-electrostatic component and a pairwise electrostatic component that extends to infinite distances. This split of terms not only gave better overall predictive performance and transferability by including long-range interactions, but also improved the model’s ability to assess short-range contributions. The authors speculate that this new understanding of short- and long-range effects could be applied to other machine learning methods and even ab initio systems. DeePMD-kit continues to increase the capabilities of molecular dynamics in terms of system sizes, timescales, and accuracy. Recently, DeePMD-kit was used to simulate a molecular dynamics trajectory of over 100 million atoms at ab initio levels of accuracy, covering over one nanosecond per day of computational time on a high performance computer [137]. This impressive feat claimed the Association for Computing Machinery’s Gordon Bell Prize, an annual award given in recognition of innovations in high performance computing for applications in science, engineering, and data analytics. An increasing number of studies that combine high-performance computing with AI and MD have recently been recognized in the Gordon Bell competition.
Machine learning potentials and force fields
Machine learning potentials (MLPs), derived from atomic positions and/or forces, are increasingly being used as ways to achieve accuracy approaching ab initio methods with low computational costs [138]. Similar to force fields, they can be used in molecular dynamics simulations and potential energy surface calculations, but can approach ab initio accuracy at the cost of force field approaches [139, 140].
Open-source software is being developed for the use of MLPs in chemistry, bridging the space between the artificial intelligence programmer and the computational chemist. In PiNN, several new and commonly used artificial neural network potentials trained for potential energy surfaces and physicochemical properties are packaged into an easily distributable Python library built upon TensorFlow [141]. It includes the well-known Behler–Parrinello neural network, PiNet (a new graph convolutional neural network), and PiNNBoard (a visualizer), and it is able to interface with other molecular software.
MLPs have been used to augment conventional force fields. Guo et al. optimized the parameters of the reactive force field ReaxFF via neural network operations in an approach called Intelligent-ReaxFF (I-ReaxFF) [142, 143]. It strikes a balance between full machine learning methods and classical force field methods: it does not require the heavy computational resources of full neural networks, yet it can exploit the quick and effective gradient-based optimizers used by neural networks together with GPU acceleration. It also integrates the direct use of DFT-calculated molecular dynamics trajectories. Venturi et al. used a Bayesian approach to estimate uncertainties in ab initio PES calculations [144]. The methodology is able to correlate PES errors with quantities of interest and, promisingly, can also identify high-error regions of a PES for further refinement.
Other recent work has focused on replacing force fields with machine learning in simulations involving non-equilibrium structures. Iype et al. predicted atomization energies and optimized molecular structures of six small molecules using such an approach [145]. Data were generated using DFT on non-equilibrium structures of the six molecules, and two kernel ridge regression models were trained using the Bag of Bonds and Many Body Tensor Representations to reproduce individual geometries. Then, Metropolis Monte Carlo was used on both the DFT-based data and the machine learning model outputs to predict annealing energies and find optimized structures. The simulation based on the machine learning model was found to be comparable to running the simulation entirely with data obtained from electronic structure methods, yet at much higher speed and without the losses in accuracy inherent in force field methods. The machine learning models achieved energies within 7 kcal mol−1 of those resulting from DFT only-based structures for simulated annealing, showing that machine learning has the potential to benefit simulations where calculating entire datasets using electronic structure methods would be too time-consuming.
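As a point of reference for how such a surrogate-driven simulation can work, the following is a minimal sketch of a Metropolis Monte Carlo annealing loop driven by a machine-learned energy function; the surrogate_energy placeholder and all parameters are invented for illustration and do not reproduce the published workflow.

```python
# Minimal sketch of a Metropolis Monte Carlo / simulated annealing loop driven
# by a machine-learned energy surrogate instead of repeated DFT calls.
# surrogate_energy is a placeholder for, e.g., a trained kernel ridge model's
# predict() on a Bag of Bonds or Many Body Tensor representation.
import numpy as np

rng = np.random.default_rng(0)

def surrogate_energy(coords):
    """Placeholder ML energy; a harmonic well standing in for a trained model."""
    return float(np.sum(coords**2))

coords = rng.normal(size=(5, 3))           # toy 5-atom geometry
energy = surrogate_energy(coords)
kT = 2.0                                    # starting "temperature" (arbitrary units)

for step in range(5000):
    trial = coords + rng.normal(scale=0.05, size=coords.shape)  # random displacement
    e_trial = surrogate_energy(trial)
    # Metropolis criterion: always accept downhill moves, accept uphill
    # moves with probability exp(-dE / kT)
    if e_trial < energy or rng.random() < np.exp(-(e_trial - energy) / kT):
        coords, energy = trial, e_trial
    kT *= 0.999                             # annealing schedule: slowly cool
```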
Modifying earlier discovered symmetry functions [146], Smith et al. created an MLP with the ability to span both configurational and conformational space, approaching DFT predictions but at a computational cost similar to that of force field-based methods [147]. Called ANAKIN-ME (Accurate NeurAl networK engINe for Molecular Energies) or ANI-1, the neural network potential was trained on a dataset of small organic molecules containing up to eight heavy atoms (carbon, nitrogen, and oxygen) plus hydrogen. Results showed the potential to be highly transferable, successfully handling molecules with up to 24 atoms and yielding predictions close to their DFT-generated counterparts. Further refinements were made by using active learning to sample chemical space for the selection of training molecules for a new potential, called ANI-1x, which is also more conformationally diverse than its predecessor [148, 149].
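For context, the sketch below computes a Behler–Parrinello-style radial symmetry function of the general kind used as input features by ANI-type potentials; the η, Rs, and Rc values and the toy geometry are illustrative choices and are not the published ANI parameters.

```python
# Minimal sketch of a Behler-Parrinello-style radial symmetry function:
# G_i = sum_{j != i} exp(-eta (r_ij - r_s)^2) * f_c(r_ij),
# where f_c is a smooth cosine cutoff. Parameters here are illustrative only.
import numpy as np

def cutoff(r, r_c):
    """Smooth cosine cutoff that goes to zero at r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_symmetry(coords, i, eta=4.0, r_s=1.0, r_c=5.0):
    """Radial symmetry function for atom i given Cartesian coordinates."""
    diffs = np.delete(coords, i, axis=0) - coords[i]
    r_ij = np.linalg.norm(diffs, axis=1)
    return float(np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c)))

# Toy water-like geometry; in practice one feature per atom per (eta, r_s) pair
coords = np.array([[0.00, 0.00, 0.00],
                   [0.96, 0.00, 0.00],
                   [-0.24, 0.93, 0.00]])
features = [radial_symmetry(coords, i) for i in range(len(coords))]
```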
One of the limitations of the ANI-1 and ANI-1x potentials was the accuracy of the DFT-based data used to train them. In another experiment, Smith et al. used transfer learning to approach CCSD(T)/CBS-like accuracy for reaction chemistry in a potential dubbed ANI-1ccx [149, 150]. To accomplish this, a 10 % subset of the five million molecular conformations in the ANI-1x dataset was selected via active learning and calculated using DLPNO-CCSD(T) [151]. Transfer learning was then used to refine the DFT-trained potential with these very high-accuracy calculations, and as a control, another potential was constructed using only the CCSD(T)*/CBS subset of data. The neural network potential derived using transfer learning outperformed the potential trained solely on the lower-accuracy DFT data, as well as the potential trained only on the much smaller DLPNO-CCSD(T) dataset. Additionally, errors for ANI-1ccx predictions against several benchmarks were lower than those obtained solely using DFT, and the calculations were done a billion times faster.
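The transfer-learning strategy itself can be summarized in a few lines; the sketch below pretrains a toy energy network on abundant lower-level data and then fine-tunes only the final layer on scarce higher-accuracy labels. The architecture, data, and hyperparameters are placeholders and do not correspond to the actual ANI-1ccx implementation.

```python
# Minimal transfer-learning sketch: pretrain on abundant DFT-quality labels,
# then freeze the feature layers and fine-tune only the head on a small set
# of coupled-cluster-quality labels. Toy data and architecture for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(64, 128), nn.CELU(),   # "feature" layers
    nn.Linear(128, 64), nn.CELU(),
    nn.Linear(64, 1),                # energy head
)

def train(model, x, y, epochs=200, lr=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

# Stage 1: large lower-level dataset (random placeholders here)
x_dft, y_dft = torch.randn(5000, 64), torch.randn(5000, 1)
train(model, x_dft, y_dft)

# Stage 2: freeze feature layers, retrain only the head on scarce high-level data
for layer in list(model)[:-1]:
    for p in layer.parameters():
        p.requires_grad = False
x_cc, y_cc = torch.randn(200, 64), torch.randn(200, 1)
train(model, x_cc, y_cc, epochs=500, lr=1e-4)
```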
The functionality and applicability of ANI-type potentials continue to be expanded. Recent efforts include extending the potential to sulfur and halogens (dubbed ANI-2x) [152], comparing it against conventional force fields [153], extending it for druglike molecule discovery (Schrödinger-ANI) [154], and repackaging the ANI engine itself into a portable, lightweight version called TorchANI, which runs on PyTorch, an open-sourced library of machine learning functions built in Python [155, 156]. Built to be more versatile and modifiable than its C++-based predecessor, TorchANI should serve as an easy-to-use platform for the further development of these promising neural network potentials.
SchNet is a deep tensor neural network that models atomistic systems [104, 105]. SchNet is somewhat unusual in that it has been used to predict force fields and potential energy surfaces of materials as well as forces on individual organic molecules, offering a way to scale from atomic properties toward bulk properties while still retaining the accuracy of traditional methods. The SchNet architecture was tested on a variety of systems, including potential energy surfaces, small organic molecules, and a fullerene. It has also been extended to include multiple electronic states, in a variant termed SchNarc [157, 158].
An “on-the-fly” approach calculates force fields for molecular dynamics via machine learning. In the generalized energy-based fragmentation (GEBF) approach, the ground-state energies of large molecular systems are estimated as a function of small electrostatically embedded subsystems, which can be calculated via virtually any ab initio method with any quantum chemistry software [159]. Cheng et al. combined the GEBF approach with Gaussian processes to decrease the amount of ab initio data required to construct the GEBF subsystem force fields of a series of large alkanes for running molecular dynamics simulations [160]. Separate Gaussian process models were developed for each subsystem of a target molecule, which were then combined for the entire molecule. Predictions for the alkanes were very similar to those achieved by ab initio molecular dynamics methods, though at a computational cost several orders of magnitude lower.
MLPs are getting better at achieving ab initio-like accuracy, but their use is limited to the same or similar chemical systems on which they were trained. They are unlikely to result in accurate predictions for molecules that are qualitatively different than those in the training set (i.e., a training set of organic species is unlikely to be transferable for transition metals). As the number of potentials being built increases, new benchmarks and guidelines are being created to help users select the best MLP for a given task with the optimum tradeoff between accuracy and computational requirements [139, 161, 162].
Molecular dynamics (HPC + AI + MD)
In molecular dynamics simulations, the motions and interactions of atoms and molecules are simulated for a fixed amount of time in order to determine how a system evolves. Because many potential trajectories in the system need to be calculated, molecular dynamics simulations require large numbers of calculations, which unsurprisingly are good targets for augmentation by machine learning. Molecular dynamics simulations are increasingly large, yielding vast amounts of data that are generally analyzed only after a simulation is completed. Machine learning in the future will likely be used to gain insights from systems as they are being simulated instead of afterward [163]. Efforts are underway to transform traditional molecular representations into metadata representations of these simulated systems for further analysis [163].
Similar to the Δ-ML approach, the Δ-NetFF technique uses a neural network to capture the force differences between classical force fields and DFT [164]. For a trial experiment, a subset of data was generated using both the force fields and DFT, and Δ-NetFF was used to estimate the differences between them. Then, molecular dynamics simulations were performed using the forces generated by Δ-NetFF, yielding an accuracy close to DFT at little additional cost over the force fields.
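A minimal sketch of the general Δ-ML idea follows: a model is trained only on the residual between a cheap baseline and a high-level reference, so predictions cost little more than the baseline. The descriptors and energy functions here are synthetic placeholders, not the Δ-NetFF model itself.

```python
# Minimal Δ-ML sketch: learn only the difference between a cheap baseline
# (e.g. a classical force field) and a high-level reference (e.g. DFT).
# Features and energies below are toy placeholders for molecular descriptors.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                 # descriptors for 500 structures
e_ff = X @ rng.normal(size=20)                 # cheap force-field energies (toy)
e_dft = e_ff + 0.3 * np.sin(X[:, 0] * 3.0)     # reference with a nonlinear correction

delta_model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
delta_model.fit(X, e_dft - e_ff)               # learn only the residual

def predict_energy(x_new, e_ff_new):
    """High-level estimate = cheap baseline + learned correction."""
    return e_ff_new + delta_model.predict(x_new)
```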
In another recent example, the interactions between six pairs of radical species in the gas phase were simulated using variational transition state theory, with an artificial neural network supplying the underlying potential [165]. Initially, only potential energies were used to train the neural networks, which reduced the number of calculations needed to run the simulations by a factor of four. Adding forces to the model further reduced the need for reference calculations by about an order of magnitude, but also required increased computational cost to train the model.
Machine learning models have been used on entire molecular dynamics simulations to determine if they could accurately deduce the underlying physical principles of the system being simulated. In one such experiment, the decomposition of 1,2-dioxetane was modeled via ab initio molecular dynamics, and the trajectories were then fed into two Bayesian neural networks, one of which used only molecular geometry and one of which also included trajectory information [166]. With a limited amount of data, the neural networks were able to successfully predict dissociation times, and excitingly, the most important features in the models were shown to correlate with the normal modes of vibration understood to be responsible for dissociation.
Some ML approaches to MD do not require time-based trajectories at all. Lin et al. [167] devised a neural-network based technique to develop a reactive PES by comparing the error between two separate neural network models at different points on the PES, and then adding points to the PES where it is unlikely to be accurate. This approach was able to successfully emulate several classically constructed PESes with minimal data requirements and without the need for molecular dynamics simulations.
The interactions of water clusters have been studied using active learning and artificial neural networks. The study of Loeffler et al. [168] is particularly interesting in that it uses sparse training data and includes failed configurations in its subsequent iterations. In this approach, Nested Ensemble Monte Carlo was used to query potential energy surfaces for configurations of water clusters. Instead of starting with a large number of structures, training data were generated “on the fly”, starting with only one to five data points. Stochastic algorithms were used to test the neural network’s predictions on the energetics of various configurations, and the areas of poor performance were recalculated and included in the next training iteration until the model attained reasonable convergence. Upon final training, the model was applied to 100,000 configurations ranging from 1 to 200 molecules. The mean absolute error of the energy was within 2 meV per molecule for each configuration, and the mean absolute error of the forces was within 40 meV/Å. Only 426 structures were needed, compared to the thousand or more commonly used to train artificial neural networks.
“Rare events,” or configurations that are not common but can be important in the simulation of a reaction, represent a unique challenge for machine learning applications in molecular dynamics, because models generally perform poorly when applied to structures unlike those they were trained on. Vandermause et al. [169] developed a Bayesian force field approach dubbed fast learning of atomistic rare events (FLARE) for automatically training force fields for molecular dynamics simulations that include crystal melts and diffusive events in bulk metal systems. Systems were broken down into small clusters of 2–3 atoms as a simplifying approximation, and Gaussian process regression was used to select data and train the force field at the same time. Because the model uses only small clusters of particles as descriptors, fewer building blocks are needed in training and the data generalize better, including toward rare events. Results showed that FLARE was able to successfully model rare diffusion and reaction events in these dynamical systems, a task difficult for conventional ab initio methods.
Instead of adding points to the training data as the active learning process proceeds, the complexity of the model used to regress the PES can be expanded to improve accuracy. Dai and Krems used Gaussian process regression with composite kernels, in which accuracy could be improved for a fixed number of points by increasing kernel complexity, selected by maximizing the Bayesian information criterion, and by varying the training data distribution [170]. For this experiment, the six-dimensional PES of the H3O+ molecule was constructed. Accuracy was further improved by varying the training point distribution while keeping the total number of points fixed, essentially optimizing the training points and the composite kernel at the same time. The RMSE for interpolation was 65.8 cm−1 using only 500 ab initio energy points between 0 and 21,000 cm−1. Additionally, the technique also works for extrapolation to higher-energy structures than those it was trained on. From 1500 energy points below 10,000 cm−1, the PES could be extrapolated up to 21,000 cm−1 while maintaining an RMSE below 200 cm−1.
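The following sketch illustrates composite-kernel selection by the Bayesian information criterion in a generic way; the toy one-dimensional target, the candidate kernels, and the sign convention used for the BIC (higher is better here) are illustrative choices and do not reproduce the authors’ six-dimensional PES workflow.

```python
# Minimal sketch of composite-kernel selection with the Bayesian information
# criterion: candidate kernels of increasing complexity are fit to the same
# fixed training set and ranked by BIC. The 1D target is a toy stand-in for
# ab initio PES energies.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic

rng = np.random.default_rng(2)
X = rng.uniform(0.5, 4.0, size=(60, 1))
y = np.sin(3 * X).ravel() + 0.5 * X.ravel() ** 2      # toy "PES" values

candidates = {
    "RBF": RBF(),
    "RBF + RQ": RBF() + RationalQuadratic(),
    "RBF * RQ + RBF": RBF() * RationalQuadratic() + RBF(),
}

for name, kernel in candidates.items():
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    log_l = gp.log_marginal_likelihood_value_
    k = gp.kernel_.theta.size                          # number of hyperparameters
    bic = log_l - 0.5 * k * np.log(len(X))             # higher is better here
    print(f"{name:15s}  BIC = {bic:8.2f}")
```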
High throughput screening
A long-standing practical use for theoretical chemistry has been high-throughput screening, in which dozens to millions of molecules are quickly scanned for target properties at a fraction of the cost of synthesizing them. Only the most promising candidates are used for further, higher-cost analysis, which could include calculations at ab initio levels of theory or experimental synthesis. However, since scans are quick and often qualitative, valuable information can be missed, especially for complex target properties. Hence, new work is being done to better search through chemical space and to identify promising targets at higher levels of accuracy via machine learning. Figure 6 illustrates how machine learning can be used to generate new molecular candidates for high throughput screening studies.

Illustration of machine learning used for molecular generation. A small class of molecules (a) is used to train a machine learning model (b) that learns the patterns and connectivity of the molecules and generates similar molecules (c) that can then be further screened for desirable properties.
Genetic algorithms are a class of search and optimization techniques inspired by Darwinian evolution [171]. In a genetic algorithm, a region being searched is randomly encoded into “chromosomes” to create “populations” that are then evaluated using fitness functions. The best chromosomes are carried over to the next generation, where they undergo crossover and mutation, and the process repeats until a stopping criterion is reached. In this manner, large regions of data space can be searched efficiently, with optimum or near-optimum regions quickly discovered [171].
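A minimal, generic genetic algorithm loop is sketched below; the bit-string encoding and toy fitness function are purely illustrative and are not tied to any particular chemical representation.

```python
# Minimal genetic algorithm sketch: bit-string "chromosomes" are scored by a
# fitness function, the fittest are kept, and crossover plus mutation produce
# the next generation. The fitness target is a toy, not a chemical property.
import numpy as np

rng = np.random.default_rng(3)
n_pop, n_genes, n_generations = 40, 32, 100

def fitness(chromosome):
    """Toy fitness: reward chromosomes with alternating bits."""
    target = np.tile([0, 1], n_genes // 2)
    return np.sum(chromosome == target)

population = rng.integers(0, 2, size=(n_pop, n_genes))

for _ in range(n_generations):
    scores = np.array([fitness(c) for c in population])
    parents = population[np.argsort(scores)[-n_pop // 2:]]   # keep the fittest half
    children = []
    while len(children) < n_pop - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_genes)                        # single-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_genes) < 0.02                     # 2 % mutation rate
        children.append(np.where(flip, 1 - child, child))
    population = np.vstack([parents, children])

best = population[np.argmax([fitness(c) for c in population])]
```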
In situations where an abundance of diverse data exists, genetic algorithms have been shown to be effective at sorting chemical space to select data prior to its use in training a model. Browning et al. [172] determined that small organic molecules could be grouped using genetic algorithms before further analysis by machine learning was applied. Sorted Coulomb matrices were used as descriptors, although the researchers postulated that a variety of representations would also have worked. The models built using data pre-selected by the genetic algorithm showed up to 75 % lower error when applied to new chemical species than models built on training sets of the same size selected at random.
Imbalanced datasets, where the available data do not uniformly represent a system of interest, can pose major challenges to machine learning [173]. Even a single data point can significantly bias a model if it is different from most of the other data in a set [19]. Statisticians have developed techniques to successfully remove outliers; however, discarding data further exacerbates data shortages in systems that are already information-sparse. One approach used to compensate for imbalanced datasets is to cluster and classify chemical species before regression, and then build separate machine learning models for each discovered grouping [174]. In this manner, models are not biased by outliers that are not chemically relevant to the portion of the system being analyzed. Additionally, different types of machine learning models and different descriptors can be used for each derived portion of chemical space, yielding potentially greater prediction accuracy.
To test this methodology, Haghighatlari, Shih, and Hachmann used data from the Harvard Clean Energy Project, which by its nature is unbalanced [174]. To create a reference point, data were sampled at random to create a universal model. Then, descriptors were compared for physical features to see how the data naturally segregated themselves. The final groupings were regressed separately, and new data were assigned a grouping based on similarity and regressed by the respective model. Results were promising, with errors for each grouping consistently below those of the unclustered model, and even lower errors possible through ensemble methods in which the separation and regression of data were combined. However, model accuracy differed depending on which region of chemical space was modeled.
Machine learning is continually used to help sift through chemical space to find promising target molecules for problems at hand. Doan et al. [175] used Bayesian optimization, which is designed to quickly find the global minimum or maximum of functions that are difficult to evaluate via statistically driven sampling, to find targets with desirable oxidation properties with a minimal number of ab initio computations. In this experiment, DFT was used to estimate the oxidation potentials of homobenzylic ether (HBE) molecules. First, a small subset of HBEs was randomly chosen as a training set, and their oxidation potentials were generated via DFT. A Gaussian process regression model was used to estimate the oxidation potentials of the remaining molecules, additional HBEs found to be beneficial for modeling oxidation potentials were added to the training set, and the process was repeated. The researchers estimate that this method was five times more computationally efficient than creating training sets at random.
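The following sketch shows the core Bayesian optimization loop with a Gaussian process surrogate and an expected-improvement acquisition function; the expensive_property function is a toy stand-in for a DFT calculation, and nothing here reproduces the authors’ specific HBE workflow.

```python
# Minimal Bayesian optimization sketch: a Gaussian process surrogate plus an
# expected-improvement acquisition function choose which candidate to evaluate
# next. expensive_property stands in for a DFT-computed property.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_property(x):
    """Toy objective standing in for a DFT calculation on one candidate."""
    return float(-(x - 0.65) ** 2 + 0.1 * np.sin(15 * x))

candidates = np.linspace(0.0, 1.0, 500).reshape(-1, 1)   # "library" of molecules
rng = np.random.default_rng(4)
idx = list(rng.choice(len(candidates), size=5, replace=False))
y = [expensive_property(candidates[i, 0]) for i in idx]

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(candidates[idx], y)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = max(y)
    # Expected improvement over the current best observation
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[idx] = 0.0                                        # skip evaluated points
    nxt = int(np.argmax(ei))
    idx.append(nxt)
    y.append(expensive_property(candidates[nxt, 0]))
```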
Another approach combines molecular generation, screening, and docking to identify and characterize molecules with target properties of interest. Starting with benzene as an input molecule, Xu, Wauchope, and Frank harnessed generative artificial intelligence and molecular docking to construct a library of docking candidates for the cyclin-dependent kinase 2 (CDK2) protein and the active site of the SARS-CoV-2 virus [176]. Potential candidates were generated via the junction-tree variational autoencoder (JTVAE) approach, in which a tree-structured scaffold is generated over molecular substructures, which are used in a graph message passing neural network for molecular generation or optimization [177]. In each iteration of the docking process, 20 molecules similar to each other were selected for a docking simulation. The compound with the highest predicted binding affinity was used as the starting point for the next iteration, in which another 20 molecules were selected, until optimal molecules were found. For CDK2, the candidates discovered via this approach were very similar to molecules known to be inhibitors, and for the SARS-CoV-2 active site, six existing drugs were identified as likely docking targets.
In a similar experiment, Antono et al. used active learning with a combination of DFT and molecular dynamics calculations to discover novel hole-conducting organic semiconductors, which are desirable for applications in printed electronics, solar cells, and image sensors [178]. After only 35 cycles, the model found a novel molecule with a larger DFT-predicted hole mobility than any molecule used for training, and the structure of promising candidates could be generalized by the number and type of aromatic rings. In the course of the experiment, only about 200 DFT calculations had to be performed across a chemical space of over a million potential molecules.
There is also research being done toward obtaining near ab initio quality data for molecules of interest without the need for any traditional energy-based calculations, allowing for very fast initial screening of myriads of molecules. One such approach correlates molecular structure to molecular stoichiometry and bonding information [179]. Dubbed Graph-To-Structure (G2S), this methodology bypasses energy optimization and performs as well as or better than most empirical methods. In G2S, kernel ridge regression is used to generate structure and energy from only the bond and stoichiometry information of constitutional isomers from the QM9 dataset. The model had mean absolute errors of less than 0.2 Å for out-of-sample compound structures, and the generated geometries can be used as accurate starting points for conventional ab initio calculations. A similar approach uses SMILES descriptors as inputs for machine learning to model ground-state properties without the need for atomic coordinates [180]. A set of 130k organic molecules, also from the QM9 dataset, was used, and a feed-forward neural network was built to correlate the molecular properties directly to their SMILES strings. Results showed nine times better accuracy than when the same type of model was trained using a sorted Coulomb matrix, and performance on predicting atomization energies approached chemical accuracy. This approach is yet another way to estimate the properties of novel molecules without full geometry optimization.
Finally, unique ways of integrating low-cost approximations of molecular properties into higher accuracy machine learning workflows are also being developed. In one recent approach, a low-cost semi-empirical Crude Estimation of Property (CEP) is used as a descriptor in the machine learning estimation of the same property at a higher level of theory. Properties tested included band gaps, lattice thermal conductivities, and elastic properties of zeolites, and datasets were kept small at ∼100 molecules. Results showed that the simple inclusion of these low-cost descriptors produced more accurate machine learning models without sacrificing generalizability or applicability [181].
Outliers can wreck machine learning models, but researchers scanning through chemical space often want to discover molecules with novel properties. This presents a dilemma, in which machine learning models need high explorative power to identify novel molecules in high-throughput screening. New statistical techniques are being used to better estimate machine learning performance scores for finding these molecules. Specifically, Xiong et al. [182] introduced the k-fold-m-step forward cross-validation (kmFCV) method, in which property values are first sorted and then divided into subsets for cross-validation. It separates training and testing samples better than the k-fold forward cross-validation on which it is based, avoiding overestimation of performance caused by similar samples. Formation energies, superconducting critical temperatures, and band gaps were modeled for inorganic compounds. The authors note that better ways of improving the explorative power of machine learning for the discovery of new materials are needed, with kmFCV serving as a good first step.
Different types of neural networks have been used to explore chemical space and create promising target molecules. Grebner, Plowright, and Hessler used reinforcement learning with different input chemical spaces and different rewards to characterize how changing these parameters would affect the generated molecules [183]. For their calculations, three separate datasets were used to represent three different chemical spaces: ChEMBL24, which includes biologically active compounds; Enamine REAL space, which includes drug-like compounds; and the Sanofi compound collection, which also includes biologically active compounds. They observed that by switching between 2D and 3D representations of molecules, models could be tuned between lead optimization, in which structures similar to the original query are derived, and lead generation, in which diverse sets of potential molecules unlike those used in training are produced. Additionally, the researchers were able to generate structurally accurate functional groups that were not present in the original training set. By altering the choice of molecules, their representations, and the hyperparameters of the models’ scoring functions, they were able to fine-tune the newly generated molecules to be either broadly applicable or constrained within certain regions of chemical space, showing reinforcement learning to be a viable approach to de novo chemical design.
DFT
Density Functional Theory (DFT) is the most commonly used traditional computational chemistry approach, with an estimated 30,000 scientific papers mentioning it in a year [184]. In DFT, the electronic energy of a system is derived via a functional of the ground state electron density [185]. The ground state density and energy can be determined by minimizing the electronic energy over all density values [186].
In its earliest inception, all energy components in a DFT calculation were determined directly from the electron density, but results were generally poor [187]. Current Kohn–Sham DFT methods calculate the electron density from orbitals, while the exchange-correlation functional, which must be approximated, accounts for the remaining many-body effects.
Much work has been done to improve the speed and accuracy of functional calculations via machine learning. Kohn–Sham equations have been bypassed entirely by training a neural network on millions of DFT-generated densities around grid points [193]. The resulting model allowed for far faster machine learning calculations that approached DFT accuracy and scaled linearly with system size. Orbital-free DFT was also re-attempted using both the kinetic energy functional and its derivative on one-dimensional systems [194, 195]. Both KRR and convolutional neural networks were used, and results showed that adding the derivative greatly improved performance, which may someday lead toward an accurate, broadly applicable orbital-free functional. In another experiment, the Kohn–Sham equations themselves were solved directly in a neural network via differentiable programming [196]. This approach allows prior physical knowledge of the functional to be learned and transferred across systems implicitly by using the mathematical form of the Kohn–Sham equations, instead of simply emulating their results through explicit model parameterization. Although the approach was very successful in predicting the one-dimensional H2 dissociation curve, much work remains to be done to make it suitable for larger systems.
The selection and optimization of DFT functionals and basis sets has long been one of the greatest challenges of the DFT method. Unlike variational methods, DFT results do not automatically converge with greater complexity. Often, a functional that performs well in one application or for one region of chemical space will perform poorly in another one. Machine learning techniques are being used to aid in the process of functional selection and to parameterize the functionals themselves.
Early work in DFT functional selection tended to focus on QSPR models for narrow applications [197]. Current research aims to make functional selection more widely applicable. Vargas-Hernández recently used Bayesian optimization to select and calibrate hybrid density functional models as an alternative to traditional grid searches [198]. Many DFT models contain free parameters that are empirically deduced. In this experiment, the free parameters were optimized by minimizing the root-mean-square error in atomization energies via Bayesian optimization. Additionally, the selection of the exchange-correlation functionals themselves could be jointly determined with the free parameters by comparing across functionals. Results showed that better accuracy was obtainable than with traditional parameterization metrics and with grid searches, and with fewer calculations required.
In game theory, multiple players with different objectives interact in hopes of finding an optimum strategy for the game. Another new approach uses techniques from this field of mathematics to assist with functional and basis set selection for a given problem of interest. McAnanama-Brereton and Waller used game theory to optimize the interactions between accuracy, complexity, and similarity for density functional and basis set selection [199]. Accuracy was measured as mean absolute percent deviation from a benchmark reference, complexity was a function of the molecular system being tested, the functional complexity, and the basis set’s number of basis functions, and similarity was derived from system similarity between molecular systems of interest and benchmark results. Results showed promise; the selected functional and basis set would not always be the most accurate or the fastest, but would be an optimized tradeoff between time and accuracy, and importantly, the “game” presents an unbiased way for researchers to objectively select functionals in the future.
The exchange-correlation functional itself has also been built via machine learning. Nagai, Akashi, and Sugino used a flexible feed-forward neural network to map density distributions to energies [200]. Additionally, a novel nonlocal descriptor was added to improve accuracy without the computational cost generally associated with orbital dependency. The resulting functional was broadly applicable to a diverse set of molecules with only minimal training. Another recent example, DeepMind 21 (DM21), uses a neural network trained on fictitious systems with fractional charges and spins [201]. The resulting functional is able to accurately describe electronic properties of chemical systems far different from those it was trained on and is particularly adept at estimating barrier heights despite not having been trained on transition states. On the GMTKN55 benchmark, DM21 outperforms all hybrid functionals and approaches the performance of double hybrid functionals, which take much longer to compute. Additionally, DM21 estimated atomization energies with errors under 5 kcal mol−1 on the mindless benchmark subset (MB16-43) of GMTKN55, which consists of randomly generated atomic geometries that give errors of 15–30 kcal mol−1 with otherwise accurate functionals. The authors believe that DM21 can also be systematically improved toward an exact universal functional.
Mechanistic ML
Machine learning models are empirically based, being derived from mathematical relationships that model a mechanism regardless of the mechanism’s underlying reality. One of the temptations of “Big Data” is to simply agglomerate massive amounts of scientific data and hope that the algorithms sort it out, revealing correlations that may or may not actually exist [202]. Increasingly, however, there is interest in applying machine learning algorithms in such a way that they empirically deduce the underlying physical principles of the systems they are applied to, known as interpretable ML [203, 204]. The promise of these approaches is manifold: a machine learning model with a mechanism that mimics nature would be more stable and less likely to give incorrect results; similar models could be used to deduce not only relationships but also mechanisms in systems that are not well understood; and a machine learning model that emulates the underlying mechanisms of fundamental physical laws could be far more universal and applicable to systems different from those it was tested on, alleviating the need for extensive reference data.
Unlike the traditional deductive techniques usually employed by machine learning to solve computational chemistry problems, end-to-end (E2E) machine learning is inferential, utilizing direct relationships between inputs and results [205]. Whereas traditional techniques to measure chemical properties would generally include a chain of processes, one of which might be accelerated by machine learning, the E2E method simply establishes a direct relationship between the input representation and the results. A new descriptor of this type was recently developed, called bag of clusters (BoC), which can be viewed as an extension of Bag of Bonds but which does not use pair distances [205]. Under the BoC approach, each bag represents a cluster of atoms in a fixed shape, and the descriptor becomes the number of occurrences of each cluster. Unlike most other descriptors, BoC vectors support mathematical operations such as adding two molecules or taking their difference. A deep convolutional neural network was used to learn the outcomes for three types of problems: determining coordination numbers of atoms in molecules, identifying types of missing atoms, and predicting reaction directions. Results showed that E2E learning was able to achieve 99, 98, and 93 % accuracy at these tasks, respectively.
Getting neural network descriptions of molecules to align more closely with the real underlying physics of a system of interest is increasingly being used to improve modeling performance. In a recent approach, molecular symmetry was incorporated and real space was combined with momentum space to create a new artificial neural network, called SY-GNN (symmetrical graph neural network) [206]. SY-GNN can predict multiple numerical and orbital properties simultaneously and has significantly lower computational requirements than other neural network-based systems. Additionally, prediction errors across 12 common molecular properties for symmetric molecules were frequently lower for SY-GNN than for other common neural network potentials, showing that correlating momentum space with real space and modeling them simultaneously gives substantial advantages in both time and accuracy compared to the conventional approach of considering them separately.
Workflow
Recent efforts have focused not only on using machine learning to make faster and more accurate predictions on chemical species, but also on interfacing with the computational chemistry workflow itself. Commonly used molecular descriptors can be used with standard machine learning algorithms to predict the computation times of species of interest at different levels of theory. These types of machine learning models do not replicate the underlying chemistry at all; rather, they seek to augment the human intuition required to select and run complex calculations. The insights, in turn, allow for better allocation of computers and queues, quicker calculation turnarounds, and fewer resources used. Figure 7 illustrates this approach.

Illustration of integrating machine learning directly into the computational chemistry workflow. Machine learning is combined with the configurations of calculations (a) to generate computational data (b) for a desired result (c).
In a recent experiment, several common types of quantum mechanical calculations were used for computational wall time estimation [207]. After proving the concept with a “toy system”, three large datasets of existing data, spanning five levels of theory and four basis sets, were collected and split into seven tasks, including single point energy timings, geometry optimization timings, and transition state search timings. KRR was used to build machine learning models to estimate the timings, and Bag of Bonds and FCHL were used as descriptors. Results showed that estimation error systematically decreased as training set size increased, and that wasted CPU time could be reduced by 10–90 %.
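A minimal sketch of this kind of wall-time model follows, assuming only simple size-based features rather than Bag of Bonds or FCHL descriptors; the timing data are synthetic, and the sketch is meant only to illustrate how a kernel ridge regressor could be fit to logged timings.

```python
# Minimal sketch of wall-time estimation with kernel ridge regression.
# Features (heavy-atom count, basis-set size) and timings are toy placeholders
# standing in for real logged calculations and richer molecular descriptors.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n_heavy = rng.integers(1, 20, size=400)                 # heavy atoms per molecule
n_basis = n_heavy * rng.integers(10, 25, size=400)      # basis functions (toy)
X = np.column_stack([n_heavy, n_basis]).astype(float)
wall_time = 0.002 * n_basis ** 2.6 * rng.lognormal(0, 0.2, size=400)  # toy timings

X_tr, X_te, t_tr, t_te = train_test_split(X, np.log(wall_time), random_state=0)
model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=1e-5).fit(X_tr, t_tr)
rel_err = np.abs(np.exp(model.predict(X_te)) - np.exp(t_te)) / np.exp(t_te)
print(f"median relative error: {np.median(rel_err):.2f}")
```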
One approach to gaining increased accuracy in traditional quantum mechanical methods involves using multireference wavefunctions specifically in chemically relevant regions, saving computational power by reserving the highest-order calculations only for the atoms and orbitals where the highest levels of accuracy are required. A common method used for this task, Complete Active Space Self-Consistent Field (CASSCF), performs a full configuration interaction calculation in an active space defined by the user. Yet selecting the proper active space requires experimentation and expertise, limiting the method’s appeal. However, machine learning is now being used to assist in active space selection [208]. For this study, 23 main group diatomic molecules were modeled with differing CASSCF active spaces. Molecular descriptors were generated from input and output parameters used in the CASSCF calculations, including the number of active electrons and orbitals, internuclear distances, occupation numbers, and molecular orbital coefficients. A supervised classification algorithm, XGBoost, was used to classify the data in order to make active space recommendations, which were compared against experimental dissociative and spectroscopic values, using an automatic labeling technique developed for this project to determine whether a data point was good. Results showed that predictions were substantially better than guessing, yet highly sensitive to differing chemistries between systems. Future work will include measuring and accounting for dissimilarities between systems and generating more diverse data.
Similar methodologies can be used to give insights into chemical systems before running costly computations. In another recent study, a semi-supervised model was built to predict the multi-reference character of a chemical system’s wavefunction [209]. Using far fewer calculations than traditional methods used to detect multireference character, the model was also able to correctly assess systems unlike its training data. Its future applications may include high-throughput screening for strong correlation in target molecules.
Machine learning is also finding uses in the development of new basis sets that go beyond the Born–Oppenheimer approximation [210]. The Nuclear-Electronic Orbital (NEO) framework includes select protons in quantum chemistry calculations. A new set of Gaussian-type basis functions for this framework, dubbed PB4, PB5, and PB6, was recently developed and optimized against NEO-CSD energies, proton densities, and proton vibrational excitation energies at different tradeoffs between efficiency and accuracy. Rather than using traditional methods involving the gradient or Hessian for basis set optimization, Gaussian process regression was used to find the best parameters, drastically reducing the overall computational cost. The researchers believe that their technique for optimizing basis set parameters can be used for future nuclear basis set development.
Transition metals
Transition metals have long posed some of the greatest difficulties in computational chemistry methods, as their unpaired electrons are both a source of opportunity for novel discoveries and a source of complexity [211, 212]. Due to their unusual properties and the long calculation times often required to obtain high levels of accuracy, obtaining large enough datasets of transition metals to use as training sets for machine learning can seem daunting.
Recent approaches by Kulik et al. [211] seek to characterize properties of open-shell transition metal compounds via machine learning. Due to the high computational costs associated with generating reference data for machine learning models, clever designs had to be used to make the study feasible. Approximately 1350 transition metal structures were used, and DFT was used to generate the reference data on which the machine learning model was trained and tested. Although not as accurate as higher levels of theory*, DFT is efficient. Since the uncertainty in a traditional computational chemistry method and the uncertainty in a machine learning model are essentially independent, a machine learning model trained on less accurate data should behave similarly when applied to data generated at a higher level of theory [211]. To further reduce the computational power needed to create the dataset, all descriptors were developed solely from the molecular graphs of the molecules. From about 160 descriptors, 25–40 were selected using KRR. Properties measured included ΔEH–L, bond lengths, and frontier orbital energies. Errors varied by property and compound but were usually in the single-digit to sub-kcal mol−1 range. Future work will include better size extensivity and descriptors that are more able to distinguish between similar ligands. Janet et al., in collaboration with Kulik, also recently outlined the steps that have to be taken to effectively explore transition metal and inorganic chemical space, including the need to fully automate the simulation of new compounds, knowledge of a method’s prediction sensitivity and accuracy, very fast property prediction, maps for rapidly traversing chemical space in relation to how compounds are situated relative to each other, and the means to gain deeper insights from greater amounts of high-throughput results [213].
(* It must be noted here that a great challenge in computational chemistry is the question of accuracy versus efficiency. Machine learning models and methods such as DFT provide efficiency. But ab initio methods have been vital to accuracy. And, in domains such as the transition metals and heavy elements in particular, a great deal of caution is warranted in trying to achieve quantitatively accurate thermochemical and spectroscopic properties, for example. Much further discussion on these issues can be found in References [214], [215], [216]).
Statistics in benchmarking
As is true in any paradigm, potential pitfalls need to be avoided. Just as a computational chemistry calculation that finishes successfully does not testify to its veracity, a machine learning model may give good results for a given dataset without having properly learned it [217]. Models can improperly map relevant features in a statistically biased way, or conversely, irrelevant features can be improperly mapped as correlating to a property of interest when they in fact do not. Researchers may unwittingly fine-tune their methods (or select results, in the case of computational chemistry) to most closely match the reference data of their sample set instead of underlying physical realities, benefiting from a fortuitous cancellation of errors that would not occur when the same method is used on a different dataset. When applied to new data outside the sample set, a method fitted this way will give disappointing results.
Statisticians have developed multiple tests and metrics to avoid overfitting models, including cross-validation, training and testing sets, validation, t-tests and f-tests, among many others. Recently, work in the traditional computational chemistry discipline has emphasized the need to apply these tests to traditional benchmark sets as well during the construction of new methods [218], [219], [220]. Additionally, new tests are being developed to better quantify the effects of incomplete benchmark datasets and to compensate for their deficiencies.
Conventional metrics, such as root mean square error (RMSE) and mean absolute deviation (MAD), suffer when datasets contain non-normal error distributions. When large amounts of data in an error distribution are skewed, there is a greater risk of unknown data containing larger prediction errors than either MAD or RMSE would predict. It has even been suggested that these two metrics are inadequate by themselves for determining which of a plethora of methods gives the most accuracy, or whether chemical accuracy can be expected [221]; however, the approaches can still provide useful insight. The alternative is to use probabilistic tests that assign the likelihood of a new sample falling outside of an error range instead of giving an overall estimate of the accuracy relative to a benchmarking set. These statistical tests are therefore more informative about how a model or method might behave in the real world. Several recent metrics developed by Pernot and associates include the systematic improvement probability (SIP), Q95, Pη, and the application of the Gini coefficient to benchmarking computational chemistry methods [222], [223], [224], [225]. SIP is the probability that the absolute errors of one method are smaller than those of another method being compared to it, giving insight into the risks incurred by changing methods. The Q95 metric is the value that the true absolute error has a 5 % chance of exceeding. Pη is the probability that the absolute errors exceed a chosen threshold η. The Gini coefficient, often used to represent wealth inequality, can also be used as a single metric to gauge the magnitude of outliers in a dataset; it correlates with the bias and shape of a distribution and represents them as a single comparable number. Pernot gives examples of situations in which these metrics lead to different conclusions about method performance than conventional metrics would, and shows that the method with the lowest error does not always correspond to the lowest risk on unseen data [224, 225].
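These probabilistic metrics are straightforward to compute from arrays of benchmark errors; the sketch below evaluates Q95, Pη, and SIP for two synthetic, paired error distributions and is intended only to illustrate the definitions given above.

```python
# Minimal sketch of the probabilistic benchmarking metrics discussed above,
# computed from arrays of signed errors for two hypothetical methods.
# The error distributions are synthetic; only the formulas are illustrated.
import numpy as np

rng = np.random.default_rng(6)
err_a = rng.normal(0.0, 1.0, size=2000)          # method A errors (toy, kcal/mol)
err_b = rng.standard_t(df=3, size=2000) * 0.8    # method B: heavier-tailed errors

q95_a = np.percentile(np.abs(err_a), 95)         # 5 % chance |error| exceeds this
q95_b = np.percentile(np.abs(err_b), 95)

eta = 1.0                                         # chosen accuracy threshold
p_eta_a = np.mean(np.abs(err_a) > eta)            # P_eta: fraction of |errors| > eta
p_eta_b = np.mean(np.abs(err_b) > eta)

# SIP: probability that method B's absolute error is smaller than method A's
# on the same systems (the arrays are treated as paired element-wise).
sip_b_over_a = np.mean(np.abs(err_b) < np.abs(err_a))

print(q95_a, q95_b, p_eta_a, p_eta_b, sip_b_over_a)
```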
Uncertainty estimations themselves have their own uncertainties attached to them. Work has been done to better estimate uncertainty in estimation errors in machine learning models specifically targeted toward chemistry. Musil et al. [226] developed a technique to measure the accuracy of an uncertainty measurement via resampling and testing with new inputs. In this technique, the training set is broken into smaller subsets via bootstrapping or subsampling. The subsets are then modelled independently to create submodels, and new inputs are applied to each submodel to measure their respective errors. The variation in the submodels’ error from the unseen samples can then become an error estimation in its own right. The authors posit that this unique way of combining cross-validation with the introduction of new data to add another dimension in error analysis should be widely applicable across many types of machine learning and chemical systems, and gave examples with NMR chemical shieldings and molecular and materials formation energies.
Analytical chemistry
Machine learning methods or their predecessors have long been used in the analytical chemistry fields. Often, the objectives are very similar to those of computational chemistry: to extract as much information from a chemical system as possible in a timely, inexpensive manner. Additionally, instrumental results often require mathematical transforms to make them accessible to human experimenters; complex IR spectra, for example, often contain highly correlated, extraneous information that can be reduced for use in calibration curves. Traditionally, simple linear models were used to extract information out of varying types of spectra. Increasingly, however, more advanced machine learning methods are being adopted into the analytical chemist’s toolbox to better predict expected results, to gain higher sensitivity, and to discover new insights.
Signal enhancement
The limit of detection (LOD) is a common metric used to set a lower threshold above which a value is considered meaningful, generally taken as three times the instrument noise [227]. Yet signal can still exist below this threshold, which traditional methods are unable to elucidate. Cho et al. [228] applied machine learning to this problem with gas sensors, which were used to search for hydrogen in six different metals via an artificial neural network. H2 was discovered to be present below the LOD in four of the metals thought to be inert, leading to further investigations of its potential sources. Importantly, the approach can be generalized to many different types of sensors and applications.
Traditional techniques for estimating analyte concentrations from spectroscopic data often involve linear univariate and multivariate methods, yet new approaches seek to supplant these classical chemometrics techniques with artificial neural networks (ANNs). One common ANN is the multilayer perceptron (MLP); however, statistical metrics and uncertainties for its predictions are far less developed than for traditional approaches, limiting its applicability. Work is being performed on this front so that new forms of ML-based models can be compared directly to old models, and so that MLP-based calibrations can be used for processes in which regulatory requirements dictate the characterization of errors. Chiappini et al. [229] recently developed an estimate of sensitivity for MLP-based calibrations and applied it to a mixture of simulated data and experimental data from a fluorescence calibration. The MLP-based model was shown to be up to 30 times as sensitive as a conventional univariate method, allowing for a direct comparison between the two for the first time.
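As a simple illustration of an MLP-based calibration (not the authors’ model), the sketch below maps simulated fluorescence-like spectra back to analyte concentrations; the signal model, noise level, and network size are all invented for demonstration.

```python
# Minimal sketch of a multilayer perceptron calibration: simulated spectra are
# regressed against known analyte concentrations, then used to predict the
# concentration of an unknown sample. All data here are synthetic.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
conc = rng.uniform(0.1, 10.0, size=300)                       # known concentrations
wavelengths = np.linspace(400, 600, 50)
# Simulated spectra: one Gaussian band whose height tracks concentration
spectra = conc[:, None] * np.exp(-((wavelengths - 500) / 30) ** 2)
spectra += rng.normal(scale=0.05, size=spectra.shape)         # instrument noise

scaler = StandardScaler().fit(spectra)
mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
mlp.fit(scaler.transform(spectra), conc)

# Predict the concentration of a new "unknown" sample from its spectrum
unknown = 4.2 * np.exp(-((wavelengths - 500) / 30) ** 2)
print(mlp.predict(scaler.transform(unknown.reshape(1, -1))))
```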
Synthesis workflows
Designing and synthesizing chemicals is one of the foundational tasks of chemistry. In the last several years, there has been increased interest in the optimization of flow chemistry, in which chemical reactions are designed, performed, analyzed, optimized, extracted, and adjusted in the same workflow, driving down costs and increasing efficiency [230, 231]. Recent work has focused on utilizing machine learning to automatically optimize these workflows. Figure 8 shows how machine learning can be used to systematically optimize synthesis workflows by adjusting conditions on the fly as new data are collected.

Illustrative machine learning schematic for designing experiments and collecting experimental data. Initially, experimental inputs are set (a). The machine learning algorithm (b) determines if further optimization is possible and, if so, updates the experimental conditions (c). New data are collected (d) and stored (e) for further machine learning analysis. Once the machine learning algorithm has adequately optimized the system, the loop is broken and final results are tabulated (f).
Clayton et al. [232] developed a workflow in which a multistep system is optimized for the Sonogashira and Claisen–Schmidt reactions. The workflow used the Thompson sampling efficient multi-objective (TSEMO) algorithm, in which Gaussian Process models are used to optimize acquisition functions to determine further evaluation points [233]. A controller was also written to allow the adjustment of pump flow rates and reactor temperatures based on real-time measurements and the TSEMO algorithm’s recommendations. Results showed that the Sonogashira reaction was optimized in only 13 h, and that the Claisen–Schmidt reaction could be optimized to three objectives in 65 h instead of the weeks it would normally take to optimize the objectives separately. Overall, machine learning was able to substantially reduce the time and materials required to optimize a multi-step flow chemistry process.
High-throughput experimentation takes high-throughput screening and puts it into the real world of the lab bench. As in its in silico counterpart, it requires the generation and analysis of large numbers of candidates in the search for those with desirable properties [234]. But it also involves the real-life synthesis and analysis of the compounds it selects, via an active feedback loop that incorporates real-time feedback from completed experiments before selecting the next iteration. One recent system uses computer vision techniques to automatically analyze the solubility of synthesized compounds by collecting images of the solution via a webcam [235]. The collected information is used to automatically adjust the system as the experiment is carried out, eliminating further exploration of areas of chemical space as soon as they are ruled out. Additionally, data collected in each iteration can be systematically stored and processed for further use, and reaction mechanisms can be probed in greater detail by continual, systematic measurements.
IR
Recent work in the IR field has focused on including anharmonic corrections in theoretical IR intensity predictions at low computational cost [236]. Rather than running molecular dynamics simulations as is traditionally done, only static calculations are required. First, a single, high-cost geometry optimization is performed on a target species, similar to traditional ab initio methods, and harmonic frequencies are obtained from this calculation. Then, single-point energy and force calculations are performed at fixed distances away from the equilibrium geometry to train an artificial neural network (ANN). Further geometries are generated for use in calculating cubic and quartic derivatives; rather than using conventional single point calculations, however, their energies are estimated using the ANN. The cubic and quartic derivatives are calculated from the ANN energies, and anharmonic frequencies are computed by traditional means. This hybrid approach, which utilizes both DFT and machine learning calculations, is called B2PLYP/ANN. Results showed that the R2 correlations between B2PLYP/ANN and the full ab initio approach were 0.9996 and 0.934 for fundamental frequencies and anharmonic corrections, respectively. Importantly, computation time scales linearly with system size. Errors relative to experiment were found to be similar between B2PLYP/ANN and pure B2PLYP.
Elucidating protein secondary structures from IR spectra is often a computationally expensive task, requiring a large number of quantum chemistry calculations of possible configurations to find the structure that matches the spectrum. Ye et al. [237] recently used a machine learning approach to map the amide I region of IR spectra to protein structures. The technique works for identifying secondary protein structures, measuring variations with temperature, and characterizing protein folding. Two molecules were used for model training, N-methylacetamide (NMA) and N-acetyl-glycine-N′-methylamide (GLDP). For NMA, ab initio molecular dynamics trajectories were used to find relevant conformations, while conformations of GLDP were selected by rotating around the Ramachandran angles. The Hessian of each conformation was then calculated via DFT (B3LYP/cc-pVDZ). The Coulomb matrix was used as a descriptor, and multi-layer perceptron neural networks were used to correlate structure to vibrational frequencies, transition dipole moments, and neighboring coupling constants, respectively. The models were then applied to 12 proteins. Results showed high correlation between predicted and experimental spectra, exceeding traditional mapping methods in accuracy and running roughly four orders of magnitude faster than full DFT calculations. The authors postulate that their method could be extended to other spectral techniques, including UV and Raman.
Sometimes, traditional types of machine learning modeling can be applied to new types of problems to yield novel insights. Da Silva et al. [238] applied linear techniques for the μ-FTIR classification and quantification of microplastics. The techniques, based on decades-old algorithms, were able to effectively sort and characterize these newly relevant environmental toxins efficiently and automatically.
NMR
Machine learning has been used to improve NMR predictions [239]. Traditionally, linear regression models are used to correlate experimental NMR shifts with DFT-generated isotropic shielding constants. This approach requires large amounts of data, however, and does not directly account for many other factors that contribute to NMR shifts. A new technique, dubbed DFT + ML, takes a molecule’s DFT-calculated isotropic shielding constant, which encodes information such as electronic state, solvent, and isomerization, and adds a chemical environment descriptor obtained directly from the molecular structure of the species of interest. The combined vector, containing both the isotropic shielding constant and the chemical environment descriptor, is used as input to a deep neural network whose output is the NMR chemical shift. Results were significantly better than those obtained using the isotropic shielding constant alone or simple linear regression: errors were often reduced by an order of magnitude, and error distributions were Gaussian and less prone to major outliers than with prior methods. Additionally, the deep neural network proved amenable to transfer learning and could easily be extended to systems unlike the ones it was trained on.
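The essential idea, concatenating a calculated shielding constant with a structure-derived descriptor and regressing the shift with a neural network, can be sketched as follows; all quantities below are synthetic stand-ins rather than the descriptors or data of ref. [239].

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 1000
sigma_iso = rng.uniform(20.0, 200.0, size=n)          # "DFT" isotropic shieldings
env = rng.normal(size=(n, 8))                         # structure-based environment descriptor
# Synthetic "experimental" shifts: roughly linear in shielding, perturbed by the environment.
delta = 190.0 - 0.95 * sigma_iso + env @ rng.normal(scale=1.5, size=8) \
        + rng.normal(scale=0.3, size=n)

X = np.column_stack([sigma_iso, env])                 # combined input vector
X_tr, X_te, y_tr, y_te = train_test_split(X, delta, random_state=0)

dnn = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(128, 128, 64), max_iter=5000, random_state=0))
dnn.fit(X_tr, y_tr)
err = np.abs(dnn.predict(X_te) - y_te).mean()
print(f"mean absolute error on held-out shifts: {err:.2f} ppm")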
Another approach uses Δ-ML to upscale DFT calculations of NMR chemical shifts. Unzueta, Greenwell, and Beran created an ensemble of neural networks that takes a PBE0/6-31G baseline and corrects it toward the far more accurate PBE0/6-311+G(2d,p) level [240]. The Δ-ML errors were smaller than the DFT errors relative to experiment, at a computational cost 1–2 orders of magnitude lower than that of the full calculations.
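A minimal sketch of the Δ-ML idea is given below: a model is trained on the difference between a cheap baseline property and an expensive reference, and the learned correction is then added to new baseline values. The descriptors and property values are synthetic and are not those used in ref. [240].

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 800
descriptor = rng.normal(size=(n, 10))                           # placeholder atomic descriptors
cheap = descriptor @ rng.normal(size=10) + rng.normal(scale=0.5, size=n)          # small-basis value
expensive = cheap + 0.3 * np.tanh(descriptor[:, 0]) + 0.1 * descriptor[:, 1]**2   # large-basis value

X = np.column_stack([descriptor, cheap])
target_delta = expensive - cheap                                # Δ-ML target: the correction only
X_tr, X_te, d_tr, d_te = train_test_split(X, target_delta, random_state=0)
cheap_te, ref_te = X_te[:, -1], X_te[:, -1] + d_te

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0))
model.fit(X_tr, d_tr)
corrected = cheap_te + model.predict(X_te)                      # baseline plus learned correction
print("MAE, baseline vs reference:      ", np.abs(cheap_te - ref_te).mean())
print("MAE, Δ-ML corrected vs reference:", np.abs(corrected - ref_te).mean())

Because the correction is typically much smoother and smaller in magnitude than the property itself, learning the difference usually requires far less training data than learning the full high-level value directly.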
Mass spectrometry
As in other analytical chemistry workflows, machine learning is being used to speed up data collection in mass spectrometry and to increase the insights gained from the collected data. One recent approach, dubbed MealTime-MS, seeks to make mass spectrometry more sensitive to low-abundance proteins [241]. Rather than compiling protein information after an experiment is complete, as is traditionally done, MealTime-MS identifies proteins in real time; as each protein is confidently identified, the algorithm directs attention elsewhere, toward less common proteins. Simulated results showed that MealTime-MS was able to identify over 92 % of the proteins in a HEK293 cell lysate dataset using only two-thirds of the data.
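A highly simplified sketch of real-time identification with exclusion is shown below; the simulated spectrum stream, confidence threshold, and abundances are hypothetical and are meant only to convey how excluding confidently identified proteins frees acquisition effort for rarer species, not to reproduce the MealTime-MS algorithm [241].

import numpy as np

rng = np.random.default_rng(7)
abundances = np.array([0.5, 0.3, 0.15, 0.05])       # four proteins with very unequal abundance
confidence_needed = 5                               # hypothetical identification threshold
counts, excluded = np.zeros(4, dtype=int), set()

informative_spectra = 0
for _ in range(400):                                # simulated stream of incoming spectra
    protein = rng.choice(4, p=abundances)           # which protein this spectrum matches
    if protein in excluded:
        continue                                    # already identified: skip in real time
    informative_spectra += 1
    counts[protein] += 1
    if counts[protein] >= confidence_needed:
        excluded.add(protein)                       # confident identification reached, exclude it

print("proteins identified:", sorted(excluded))
print("spectra actually processed:", informative_spectra)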
Gas chromatography
Multimodal learning is a form of machine learning in which modes of differing yet related information are combined as inputs to a model, similar to how the human mind combines multiple senses to draw conclusions about a person’s surroundings [242]. Gas chromatographic retention indices are usually estimated by machine learning with separate models for different stationary phases, or with the stationary phase not accounted for at all; Matyushin and Buryak instead used a multimodal machine learning approach that incorporates both the molecular structure and the stationary phase at the same time [243]. Four types of model structures were tested on several benchmark databases, along with a fifth, “stacking model” that used the outputs of the other four models as its input to make the final retention index prediction. The stacking model achieved the most accurate retention index predictions yet reported for machine learning techniques. The combination of multiple modes – information about the molecules of interest and separate information about the stationary phases – is what allowed the model to reach this accuracy.
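A generic stacking ensemble in this spirit can be sketched as follows, with several base regressors that see both a molecular descriptor and an encoded stationary phase and a final model trained on their outputs; the features, targets, and base learners are illustrative choices rather than the architectures of ref. [243].

import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 600
mol = rng.normal(size=(n, 12))                     # placeholder molecular descriptors
phase = rng.integers(0, 3, size=(n, 1))            # encoded stationary phase (three types)
X = np.hstack([mol, phase])
ri = 800 + 150 * mol[:, 0] + 40 * phase[:, 0] + 20 * mol[:, 1] * phase[:, 0] \
     + rng.normal(scale=10, size=n)                # synthetic retention indices

X_tr, X_te, y_tr, y_te = train_test_split(X, ri, random_state=0)
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
                ("gbr", GradientBoostingRegressor(random_state=0)),
                ("mlp", make_pipeline(StandardScaler(),
                                      MLPRegressor(hidden_layer_sizes=(64,), max_iter=5000,
                                                   random_state=0)))],
    final_estimator=Ridge())                        # the "stacking model" trained on base outputs
stack.fit(X_tr, y_tr)
print("held-out R^2:", stack.score(X_te, y_te))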
Electron microscopy
Image analysis is one of the most visible aspects of modern machine learning and artificial intelligence. There is increasing interest in using machine learning to gain more insight from image-based experimental observations of chemical systems, especially in the electron microscopy fields. Scanning probe microscopy (SPM), for instance, requires the precise placement of a probe and extensive trial and error to obtain high-quality images. One recent approach seeks to automate this process. Krull et al. [244] developed DeepSPM, a convolutional neural network that selects scanning regions and parameters, assesses acquired images, saves the good ones, and adjusts scanning parameters in real time based on the immediate results. This novel application of machine learning is designed to be generalizable across SPM applications and to allow faster collection and analysis of data than is possible by human operators. Other techniques aid in processing data that have already been collected. Williamson et al. [245] developed a supervised machine learning model that clusters single-molecule localization microscopy data via a neural network. Muto and Shiga [246] recently reviewed several machine learning techniques for extracting quantitative information from microscopic images. Often, multiple components of a system studied via electron microscopy are mixed together, and characterizing these mixtures and isolating the contribution of each component can offer a wealth of new chemical data, much as peaks are deconvoluted in traditional spectroscopy. Even video of microscopy events has been subjected to machine learning to gain deeper insights. Yao et al. [247] applied a convolutional neural network called U-Net to videos from liquid-phase transmission electron microscopy (TEM). The network was able to effectively identify the boundaries of nanoparticles and should be readily transferable to the analysis of other types of materials. Gaussian processes have also been applied in piezoresponse force microscopy (PFM), a technique closely related to atomic force microscopy that is used to study piezoelectric materials. In a recent experiment, Kelley et al. [248] interpolated between the spirals of a CuInP2S6 van der Waals crystal measured by PFM. They found that the entire sample space could be quickly scanned using broad spirals and then reconstructed via machine learning interpolation with only about 5 % error. In the near future, machine learning is likely to be so integrated into the workflow and analysis of many types of nanomaterial imaging that it will be available to end users at the push of a button [249].
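The Gaussian-process reconstruction of a sparsely sampled scan can be sketched as follows, with a synthetic two-dimensional response map standing in for the PFM measurement; the kernel and sampling fraction are illustrative assumptions, not the settings of ref. [248].

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 40)
XX, YY = np.meshgrid(x, x)
truth = np.sin(4 * np.pi * XX) * np.cos(3 * np.pi * YY)     # synthetic response map
grid = np.column_stack([XX.ravel(), YY.ravel()])

idx = rng.choice(grid.shape[0], size=200, replace=False)    # measure ~12 % of pixel positions
gp = GaussianProcessRegressor(kernel=RBF(0.1) + WhiteKernel(1e-3), normalize_y=True)
gp.fit(grid[idx], truth.ravel()[idx])

recon = gp.predict(grid)                                    # interpolate the unmeasured pixels
rel_err = np.abs(recon - truth.ravel()).mean() / np.abs(truth).mean()
print(f"mean relative reconstruction error: {100 * rel_err:.1f} %")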
Conclusion and future outlook
Over the past decade, machine learning has become increasingly common in the chemical realm. Data scarcity is being overcome by new forms of machine learning and by descriptors that are far more generalizable between chemical systems, easing transferability and data limitations. Machine learning is becoming another pillar of computational chemistry, occupying a regime between ab initio and semiempirical methods [250]. It is also likely to contribute more significantly in the future to solving small but computationally demanding parts of calculations, where the computational savings are greatest. Hierarchies of machine learning methods and their accuracy-versus-time tradeoffs are beginning to be constructed for common tasks within chemistry, along with direct comparisons of their performance against traditional methods [251]. The day may soon come when machine learning alone is used to tackle chemical problems, with no need to compare it directly against conventional methods, though much work is needed to reach that stage. As machine learning in the physical sciences becomes more common, new ways of sharing discoveries are being developed; Machine Learning: Science and Technology claims to be the first scientific journal dedicated specifically to machine learning developments in the physical sciences [252]. Eventually, machine learning models may approach chemical accuracy for full systems. In the analytical chemist’s world, modern machine learning techniques are being used to predict experimental results more accurately and to gain deeper insight from data collected by traditional experiments, and they may someday be an integral part of a broader shift toward laboratory automation [253]. Both the theoretical underpinnings and the pragmatic first steps are being developed to bring big data and machine learning-based automation into the laboratory [234, 235]. As the pace of scientific research continues to accelerate, machine learning will likely become increasingly integrated into chemical discovery and analysis.
Article note:
A special collection of invited papers by recipients of IUPAC Distinguished Women in Chemistry and Chemical Engineering Awards.
References
[1] Y. LeCun, Y. Bengio, G. Hinton. Nature 521, 436 (2015), https://doi.org/10.1038/nature14539.Search in Google Scholar PubMed
[2] J. Hirschberg, C. D. Manning. Science (1979) 349, 261 (2015), https://doi.org/10.1126/science.aaa8685.Search in Google Scholar PubMed
[3] I. Adjabi, A. Ouahabi, A. Benzaoui, A. Taleb-Ahmed. Electronics (Switzerland) 9, 1188 (2020), https://doi.org/10.3390/electronics9081188.Search in Google Scholar
[4] Q. P. He, J. Wang. Processes 8, 951 (2020), https://doi.org/10.3390/pr8080951.Search in Google Scholar
[5] M. Wiener, C. Saunders, M. Marabelli. J. Inf. Technol. 35, 66 (2020), https://doi.org/10.1177/0268396219896811.Search in Google Scholar
[6] Y. Hajjaji, W. Boulila, I. R. Farah, I. Romdhani, A. Hussain. Comput. Sci. Rev. 39, 100318 (2021), https://doi.org/10.1016/j.cosrev.2020.100318.Search in Google Scholar
[7] P. Galetsi, K. Katsaliaki. J. Oper. Res. Soc. 1, 1511 (2020), https://doi.org/10.1080/01605682.2019.1630328.Search in Google Scholar
[8] M. Mallappallil, J. Sabu, A. Gruessner, M. Salifu. SAGE Open Med. 8, 1 (2020), https://doi.org/10.1177/2050312120934839.Search in Google Scholar PubMed PubMed Central
[9] J. Waring, C. Lindvall, R. Umeton. Artif. Intell. Med. 104, 101822 (2020), https://doi.org/10.1016/j.artmed.2020.101822.Search in Google Scholar PubMed
[10] A. Antonakoudis, R. Barbosa, P. Kotidis, C. Kontoravdi. Comput. Struct. Biotechnol. J. 18, 3287 (2020), https://doi.org/10.1016/j.csbj.2020.10.011.Search in Google Scholar PubMed PubMed Central
[11] K. M. Jablonka, D. Ongari, S. M. Moosavi, B. Smit. Chem. Rev. 120, 8066 (2020), https://doi.org/10.1021/acs.chemrev.0c00004.Search in Google Scholar PubMed PubMed Central
[12] C. W. Jones, W. Lawal, X. Xu. JACS Au 2, 541 (2022), doi:https://doi.org/10.1021/jacsau.2c00142.Search in Google Scholar PubMed PubMed Central
[13] Y.-C. Lo, S. E. Rensi, W. Torng, R. B. Altman. Drug Discov. Today 23, 1538 (2018), https://doi.org/10.1016/j.drudis.2018.05.010.Search in Google Scholar PubMed PubMed Central
[14] T. Mehmood, B. Ahmed. J. Chemom. 30, 4 (2016), https://doi.org/10.1002/cem.2762.Search in Google Scholar
[15] T. Mehmood, K. H. Liland, L. Snipen, S. Sæbø. Chemometr. Intell. Lab. Syst. 118, 62 (2012), https://doi.org/10.1016/j.chemolab.2012.07.010.Search in Google Scholar
[16] R. G. Brereton, G. R. Lloyd. J. Chemom. 28, 213 (2014), https://doi.org/10.1002/cem.2609.Search in Google Scholar
[17] L. C. Lee, C. Y. Liong, A. A. Jemain. Analyst 143, 3526 (2018).10.1039/C8AN00599KSearch in Google Scholar
[18] M. Rupp. Int. J. Quant. Chem. 115, 1058 (2015), https://doi.org/10.1002/qua.24954.Search in Google Scholar
[19] P. J. Rousseeuw, M. Debruyne, S. Engelen, M. Hubert. Crit. Rev. Anal. Chem. 36, 221 (2006), https://doi.org/10.1080/10408340600969403.Search in Google Scholar
[20] N. Artrith, K. T. Butler, F. X. Coudert, S. Han, O. Isayev, A. Jain, A. Walsh. Nat. Chem. 13, 505 (2021), https://doi.org/10.1038/s41557-021-00716-z.Search in Google Scholar PubMed
[21] J. Alzubi, A. Nayyar, A. Kumar. J. Phys. Conf. Ser. 1142, 012012 (2018), doi:https://doi.org/10.1088/1742-6596/1142/1/012012.Search in Google Scholar
[22] J. Ding, V. Tarokh, Y. Yang. IEEE Signal Process. Mag. 35, 16 (2018), https://doi.org/10.1109/msp.2018.2867638.Search in Google Scholar
[23] S. Arlot, A. Celisse. Stat. Surv. 4, 40 (2010), https://doi.org/10.1214/09-ss054.Search in Google Scholar
[24] G. Vishwakarma, A. Sonpal, J. Hachmann. Rev. Spec. Iss. Mach. Learn. Mol. Mater. 3, 146 (2020).10.1016/j.trechm.2020.12.004Search in Google Scholar
[25] D. L. Hahs-Vaughn, R. G. Lomax, D. L. Hahs-Vaughn, R. G. Lomax. Multiple linear regression. In Statistical Concepts, pp. 527–599 (2020).10.4324/9780429277825-8Search in Google Scholar
[26] S. Wold, K. Esbensen, P. Geladi, Chemom. Intell. Lab. 2, 37 (1987).10.1016/0169-7439(87)80084-9Search in Google Scholar
[27] H. Abdi, L. J. Williams. Wiley Interdiscip. Rev. Comput. Stat. 2, 433 (2010), https://doi.org/10.1002/wics.101.Search in Google Scholar
[28] S. Wold, M. Sjöström, L. Eriksson. Chemometr. Intell. Lab. Syst. 58, 109 (2001), https://doi.org/10.1016/s0169-7439(01)00155-1.Search in Google Scholar
[29] A.-L. Boulesteix, S. Janitza, J. Kruppa, I. R. König. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2, 493 (2012), https://doi.org/10.1002/widm.1072.Search in Google Scholar
[30] V. Svetnik, A. Liaw, C. Tong, J. Christopher Culberson, R. P. Sheridan, B. P. Feuston. J. Chem. Inf. Comput. Sci. 43, 1947 (2003), https://doi.org/10.1021/ci034160g.Search in Google Scholar PubMed
[31] R. G. Brereton, G. R. Lloyd. Analyst 135, 230 (2010), https://doi.org/10.1039/b918972f.Search in Google Scholar PubMed
[32] P. Exterkate, P. J. F. Groenen, C. Heij, D. van Dijk. Int. J. Forecast. 32, 736 (2016), https://doi.org/10.1016/j.ijforecast.2015.11.017.Search in Google Scholar
[33] K. Vu, J. C. Snyder, L. Li, M. Rupp, B. F. Chen, T. Khelif, K.-R. Müller, K. Burke. Int. J. Quant. Chem. 115, 1115 (2015), https://doi.org/10.1002/qua.24939.Search in Google Scholar
[34] J. Quiñonero-Candela, C. E. Rasmussen. J. Mach. Learn. Res. 6, 1939 (2005).Search in Google Scholar
[35] A. Patil, M. Rane. Smart Innov. Syst. Technol. 195, 21 (2021).10.1007/978-981-15-7078-0_3Search in Google Scholar
[36] P. Sharma, A. Singh, 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 1 (2017).10.1109/ICCCNT.2017.8204117Search in Google Scholar
[37] A. Glielmo, B. E. Husic, A. Rodriguez, C. Clementi, F. Noé, A. Laio. Chem. Rev. 121, 9722 (2021).10.1021/acs.chemrev.0c01195Search in Google Scholar PubMed PubMed Central
[38] L. Gu, X. Zhang, K. Li, G. Jia. J. Phys.: Conf. Ser. 1684, 012072 (2020).10.1088/1742-6596/1684/1/012072Search in Google Scholar
[39] T. Murta, R. T. Steven, C. J. Nikula, S. A. Thomas, L. B. Zeiger, A. Dexter, E. A. Elia, B. Yan, A. D. Campbell, R. J. A. Goodwin, Z. Takáts, O. J. Sansom, J. Bunch. Anal. Chem. 93, 2309 (2021), https://doi.org/10.1021/acs.analchem.0c04179.Search in Google Scholar PubMed
[40] G. Zhou, W. Chu, O. v. Prezhdo. ACS Energy Lett. 5, 1930 (2020), https://doi.org/10.1021/acsenergylett.0c00899.Search in Google Scholar
[41] S. v. Kalinin, O. Dyck, A. Ghosh, Y. Liu, R. Proksch, B. G. Sumpter, M. Ziatdinov. ArXiv (2021), https://doi.org/10.48550/arXiv.2010.09196.Search in Google Scholar
[42] J. Sarker, S. Broderick, A. F. M. A. U. Bhuiyan, Z. Feng, H. Zhao, B. Mazumder. Appl. Phys. Lett. 116 (2020), https://doi.org/10.1063/5.0002049.Search in Google Scholar
[43] G. W. A. Milne. J. Chem. Inf. Comput. Sci. 37, 639 (1997).10.1021/ci960165kSearch in Google Scholar PubMed
[44] R. Todeschini, V. Consonni. in Handbook of Molecular Descriptors, John Wiley & Sons, Hoboken, NJ (2008).Search in Google Scholar
[45] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, O. A. von Lilienfeld, K.-R. Müller. Adv. Neural Inf. Process. Syst. 1, 440 (2012).Search in Google Scholar
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg. J. Mach. Learn. Res. 12, 2825 (2011).Search in Google Scholar
[47] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard. OSDI’16, 265 (2016).Search in Google Scholar
[48] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin. Commun. ACM 59, 56 (2016), https://doi.org/10.1145/2934664.Search in Google Scholar
[49] R. Ihaka, R. Gentleman. J. Comput. Graph Stat. 5, 299 (1996), https://doi.org/10.2307/1390807.Search in Google Scholar
[50] M. Kuhn. R Foundation for Statistical Computing 1 (2015). at <https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf>.Search in Google Scholar
[51] P. O. Dral. J. Comput. Chem. 40, 2339 (2019), https://doi.org/10.1002/jcc.26004.Search in Google Scholar PubMed
[52] M. Haghighatlari, G. Vishwakarma, D. Altarawy, R. Subramanian, B. Urala Kota, A. Sonpal, S. Setlur, J. Hachmann. WIREs Comput. Mol. Sci. 10, e1458 (2020).10.1002/wcms.1458Search in Google Scholar
[53] M. D. Hanwell, C. Harris, A. Genova, M. Haghighatlari, M. el Khatib, P. Avery, J. Hachmann, W. A. de Jong. Int. J. Quant. Chem. 121, e26472 (2020).10.1002/qua.26472Search in Google Scholar
[54] M. Korshunova, B. Ginsburg, A. Tropsha, O. Isayev. J. Chem. Inf. Model. 61, 7 (2021), https://doi.org/10.1021/acs.jcim.0c00971.Search in Google Scholar PubMed
[55] M. el Khatib, W. A. de Jong. ChemRxiv (2020), https://doi.org/10.48550/arXiv.2003.13388.Search in Google Scholar
[56] D. Lafuente, B. Cohen, G. Fiorini, A. García, M. Bringas, E. Morzan, D. Onna. J. Chem. Educ. 98, 2892 (2021).10.1021/acs.jchemed.1c00142Search in Google Scholar
[57] T. Kurth, M. Smorkalov, P. Mendygral, S. Sridharan, A. Mathuriya. Concurr. Comput. 31, e4989 (2019), doi:https://doi.org/10.1002/cpe.4989.Search in Google Scholar
[58] C. C. Yang, G. Domeniconi, L. Zhang, G. Cong. Proceedings – 2020 IEEE International Conference on Big Data, Big Data 2020 5861 (2020).10.1109/BigData50022.2020.9378387Search in Google Scholar
[59] A. Brace, M. Salim, V. Subbiah, H. Ma, M. Emani, A. Trifa, A. R. Clyde, C. Adams, T. Uram, H. Yoo, A. Hock, J. Liu, V. Vishwanath, A. Ramanathan. PASC’21 6, 1 (2021).10.1145/3468267.3470578Search in Google Scholar
[60] Y.-L. Wang, F. Wang, X.-X. Shi, C.-Y. Jia, F.-X. Wu, G.-F. Hao, G.-F. Yang. Brief Bioinform. 22, bbaa276 (2021), https://doi.org/10.1093/bib/bbaa276.Search in Google Scholar PubMed
[61] G. B. Goh, N. O. Hodas, A. Vishnu. J. Comput. Chem. 38, 1291 (2017), https://doi.org/10.1002/jcc.24764.Search in Google Scholar PubMed
[62] R. G. Brereton. J. Chemom. 28, 749 (2014), https://doi.org/10.1002/cem.2633.Search in Google Scholar
[63] S. D. Brown. J. Chemom. 31, e2856 (2017), https://doi.org/10.1002/cem.2856.Search in Google Scholar
[64] L. Eriksson, J. Trygg, S. Wold. J. Chemom. 28, 332 (2014), https://doi.org/10.1002/cem.2581.Search in Google Scholar
[65] C. Nantasenamat, C. Isarankura-Na-Ayudhya, T. Naenna, V. Prachayasittikul. EXCLI J 8, 74 (2009).Search in Google Scholar
[66] M. Rupp, R. Ramakrishnan, O. A. von Lilienfeld J. Phys. Chem. Lett. 6, 3309 (2015), https://doi.org/10.1021/acs.jpclett.5b01456.Search in Google Scholar
[67] L. Cheng, M. Welborn, A. S. Christensen, L. Cheng, T. F. Miller. J. Chem. Theor. Comput. 14, 4772 (2018).10.1021/acs.jctc.8b00636Search in Google Scholar PubMed
[68] M. Liu, J. R. Kitchin. J. Phys. Chem. C 4, 17811 (2020), https://doi.org/10.1021/acs.jpcc.0c04225.Search in Google Scholar
[69] B. K. Alsberg, N. Marchand-Geneste, R. D. King. Chemometr. Intell. Lab. Syst. 54, 75 (2000).10.1016/S0169-7439(00)00101-5Search in Google Scholar
[70] M. Karelson, V. S. Lobanov, A. R. Katritzky. Chem. Rev. 96, 1027 (1996), https://doi.org/10.1021/cr950202r.Search in Google Scholar PubMed
[71] P. Thanikaivelan, V. Subramanian, J. R. Rao, B. U. Nair. Chem. Phys. Lett. 323, 59 (2000).10.1016/S0009-2614(00)00488-7Search in Google Scholar
[72] Vikas, Reenu, Chayawan. J. Mol. Graph. Model. 42, 7 (2013), https://doi.org/10.1016/j.jmgm.2013.02.005.Search in Google Scholar PubMed
[73] Z. Cheng, Q. Chen, F. W. Pontius, X. Gao, Y. Tan, Y. Ma, Z. Shen. Chemosphere 240, 124928 (2020), https://doi.org/10.1016/j.chemosphere.2019.124928.Search in Google Scholar PubMed
[74] J. J. Villaverde, B. Sevilla-Morán, C. López-Goti, J. L. Alonso-Prados, P. Sandín-España. SAR QSAR Environ. Res. 31, 49 (2020).10.1080/1062936X.2019.1692368Search in Google Scholar PubMed
[75] S. Ghosh, P. K. Ojha, E. Carnesecchi, A. Lombardo, K. Roy, E. Benfenati. Ecotoxicol. Environ. Saf. 190, 110067 (2020), https://doi.org/10.1016/j.ecoenv.2019.110067.Search in Google Scholar PubMed
[76] F. Ghasemi, A. Mehridehnavi, A. Pérez-Garrido, H. Pérez-Sánchez. Drug Discov. Today 23, 1784 (2018).10.1016/j.drudis.2018.06.016Search in Google Scholar PubMed
[77] H. Moriwaki, Y. S. Tian, N. Kawashita, T. Takagi. J. Cheminf. 10, 1 (2018), https://doi.org/10.1186/s13321-018-0258-y.Search in Google Scholar PubMed PubMed Central
[78] G. Landrum. Release 1, 1 (2013).10.1111/phin.12010Search in Google Scholar
[79] L. Himanen, M. O. J. Jäger, E. v. Morooka, F. Federici Canova, Y. S. Ranawat, D. Z. Gao, P. Rinke, A. S. Foster. Comput. Phys. Commun. 247, 106949 (2020), https://doi.org/10.1016/j.cpc.2019.106949.Search in Google Scholar
[80] C. R. Collins, G. J. Gordon, O. A. von Lilienfeld, D. J. Yaron. J. Chem. Phys. 8, 241718 (2018), https://doi.org/10.1063/1.5020441.Search in Google Scholar PubMed
[81] D. P. Tew, W. Klopper, T. Helgaker. J. Comput. Chem. 28, 1307 (2007), https://doi.org/10.1002/jcc.20581.Search in Google Scholar PubMed
[82] M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld. Phys. Rev. Lett. 108, 058301 (2012), https://doi.org/10.1103/physrevlett.108.058301.Search in Google Scholar PubMed
[83] K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, A. Tkatchenko. J. Phys. Chem. Lett. 6, 2326 (2015), https://doi.org/10.1021/acs.jpclett.5b00831.Search in Google Scholar PubMed PubMed Central
[84] K. Hansen, G. Montavon, F. Biegler, S. Fazli, M. Rupp, M. Scheffler, O. A. von Lilienfeld, A. Tkatchenko, K.-R. Müller. J. Chem. Theor. Comput. 9, 3404 (2013), https://doi.org/10.1021/ct400195d.Search in Google Scholar PubMed
[85] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld. New J. Phys. 15, 95003 (2013), https://doi.org/10.1088/1367-2630/15/9/095003.Search in Google Scholar
[86] F. A. Faber, A. S. Christensen, B. Huang, O. A. von Lilienfeld. J. Chem. Phys. 8, 241717 (2018), https://doi.org/10.1063/1.5020710.Search in Google Scholar PubMed
[87] A. S. Christensen, L. A. Bratholm, F. A. Faber, O. Anatole Von Lilienfeld. J. Chem. Phys. 2, 044107 (2020), https://doi.org/10.1063/1.5126701.Search in Google Scholar PubMed
[88] J. Townsend, C. P. Micucci, J. H. Hymel, V. Maroulas, K. D. Vogiatzis. Nat. Commun. 11, 1 (2020), https://doi.org/10.1038/s41467-020-17035-5.Search in Google Scholar PubMed PubMed Central
[89] D. A. Cohn, Z. Ghahramani, M. I. Jordan. J. Artif. Intell. Res. 4, 129 (1996), https://doi.org/10.1613/jair.295.Search in Google Scholar
[90] J. Kremer, K. Steenstrup Pedersen, C. Igel. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 4, 313 (2014), https://doi.org/10.1002/widm.1132.Search in Google Scholar
[91] K. Gubaev, E. v. Podryabinkin, A. v. Shapeev. J. Chem. Phys. 148, 241727 (2018), https://doi.org/10.1063/1.5005095.Search in Google Scholar PubMed
[92] B. Huang, O. A. von Lilienfeld. Nat. Chem. 12, 945 (2020), https://doi.org/10.1038/s41557-020-0527-z.Search in Google Scholar PubMed
[93] R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld. Sci. Data 1, 140022 (2014), https://doi.org/10.1038/sdata.2014.22.Search in Google Scholar PubMed PubMed Central
[94] J. Wang, D. Cao, C. Tang, X. Chen, H. Sun, T. Hou. Bioinformatics 36, 4721 (2020).10.1093/bioinformatics/btaa566Search in Google Scholar PubMed
[95] M. Babaei, Y. T. Azar, A. Sadeghi. Phys. Rev. B 101, 115132 (2020), https://doi.org/10.1103/physrevb.101.115132.Search in Google Scholar
[96] M. Stöhr, L. Medrano Sandonas, A. Tkatchenko. J. Phys. Chem. Lett. 1, 6835 (2020), https://doi.org/10.1021/acs.jpclett.0c01307.Search in Google Scholar PubMed
[97] R. Winter, F. Montanari, F. Noé, D. A. Clevert. Chem. Sci. 10, 1692 (2019), https://doi.org/10.1039/c8sc04175j.Search in Google Scholar PubMed PubMed Central
[98] G. Birkhoff, J. von Neumann. Ann. Math. 37, 823 (1936).10.2307/1968621Search in Google Scholar
[99] A. S. Christensen, F. A. Faber, O. A. von Lilienfeld. J. Chem. Phys. 150, 064105 (2019), doi:https://doi.org/10.1063/1.5053562.Search in Google Scholar PubMed
[100] C. Wang, H. Zhai, Y. Z. You. Sci. Bull. (Beijing) 64, 1228 (2019), https://doi.org/10.1016/j.scib.2019.07.014.Search in Google Scholar PubMed
[101] J. Han, L. Zhang, W. E. J. Comput. Phys. 399, 108929 (2019), https://doi.org/10.1016/j.jcp.2019.108929.Search in Google Scholar
[102] J. Hermann, Z. Schätzle, F. Noé. Nat. Chem. 12, 891 (2020), https://doi.org/10.1038/s41557-020-0544-y.Search in Google Scholar PubMed
[103] K. T. Schütt, M. Gastegger, A. Tkatchenko, K. R. Müller, R. J. Maurer. Nat. Commun. 10, 1 (2019), https://doi.org/10.1038/s41467-019-12875-2.Search in Google Scholar PubMed PubMed Central
[104] K. T. Schütt, H. E. Sauceda, P. J. Kindermans, A. Tkatchenko, K. R. Müller. J. Chem. Phys. 8, 241722 (2018), https://doi.org/10.1063/1.5019779.Search in Google Scholar PubMed
[105] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, A. Tkatchenko. Nat. Commun. 8, 13890 (2017), https://doi.org/10.1038/ncomms13890.Search in Google Scholar PubMed PubMed Central
[106] V. Zaverkin, J. Kästner. J. Chem. Theor. Comput. 16, 5410 (2020), https://doi.org/10.1021/acs.jctc.0c00347.Search in Google Scholar PubMed
[107] P. J. Yang, M. Sugiyama, K. Tsuda, T. Yanai. J. Chem. Theor. Comput. 16, 3513 (2020), https://doi.org/10.1021/acs.jctc.9b01132.Search in Google Scholar PubMed
[108] G. Carleo, M. Troyer. Science (1979) 355, 602 (2017), https://doi.org/10.1126/science.aag2302.Search in Google Scholar PubMed
[109] S. J. Pan, Q. Yang. IEEE Trans. Knowl. Data Eng. 22, 1345 (2009), https://doi.org/10.1109/tkde.2009.191.Search in Google Scholar
[110] N. C. Iovanac, B. M. Savoie. J. Phys. Chem. A 4, 3679 (2020), https://doi.org/10.1021/acs.jpca.0c00042.Search in Google Scholar PubMed
[111] S. Amabilino, P. Pogány, S. D. Pickett, D. V. S. Green. J. Chem. Inf. Model. 60, 5699 (2020), https://doi.org/10.1021/acs.jcim.0c00343.Search in Google Scholar PubMed
[112] A. Agrawal. Nat. Commun. 10, 5316 (2019), https://doi.org/10.1038/s41467-019-13626-z.Search in Google Scholar PubMed PubMed Central
[113] D. Jha, L. Ward, A. Paul, W. Liao, A. Choudhary, C. Wolverton, A. Agrawal. Sci. Rep. 8, 17593 (2018), https://doi.org/10.1038/s41598-018-35934-y.Search in Google Scholar PubMed PubMed Central
[114] J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, C. Wolverton. JOM 65, 1501 (2013), https://doi.org/10.1007/s11837-013-0755-4.Search in Google Scholar
[115] R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld. J. Chem. Theor. Comput. 11, 2087 (2015), https://doi.org/10.1021/acs.jctc.5b00099.Search in Google Scholar PubMed
[116] A. Nandi, C. Qu, P. L. Houston, R. Conte, J. M. Bowman. J. Chem. Phys. 4, 051102 (2021), https://doi.org/10.1063/5.0038301.Search in Google Scholar PubMed
[117] C. Qu, Q. Yu, J. M. Bowman. Annu. Rev. Phys. Chem. 69, 151 (2018), https://doi.org/10.1146/annurev-physchem-050317-021139.Search in Google Scholar PubMed
[118] R. Fabregat, A. Fabrizio, B. Meyer, D. Hollas, C. Corminboeuf. J. Chem. Theor. Comput. 16, 3084 (2020), https://doi.org/10.1021/acs.jctc.0c00100.Search in Google Scholar PubMed PubMed Central
[119] Z. Qiao, M. Welborn, A. Anandkumar, F. R. Manby, T. F. Miller. J. Chem. Phys. 3, 124111 (2020), https://doi.org/10.1063/5.0021955.Search in Google Scholar PubMed
[120] A. S. Christensen, S. K. Sirumalla, Z. Qiao, M. B. O’Connor, D. G. A. Smith, F. Ding, P. J. Bygrave, A. Anandkumar, M. Welborn, F. R. Manby, T. F. Miller. J. Chem. Phys. 155 (2021), https://doi.org/10.1063/5.0061990.Search in Google Scholar PubMed
[121] P. O. Dral, A. Owens, A. Dral, G. Csányi. J. Chem. Phys. 152, 204110 (2020), https://doi.org/10.1063/5.0006498.Search in Google Scholar PubMed
[122] M. Bogojeski, L. Vogt-Maranto, M. E. Tuckerman, K. R. Müller, K. Burke. Nat. Commun. 11, 5223 (2020), doi:https://doi.org/10.1038/s41467-020-19093-1.Search in Google Scholar PubMed PubMed Central
[123] S. Dick, M. Fernandez-Serra. Nat. Commun. 11, 3509 (2020), doi:https://doi.org/10.1038/s41467-020-17265-7.Search in Google Scholar PubMed PubMed Central
[124] B. Cuevas-Zuviría, L. F. Pacios. J. Chem. Inf. Model. 60, 3831 (2020), https://doi.org/10.1021/acs.jcim.0c00197.Search in Google Scholar PubMed
[125] L. Cheng, M. Welborn, A. S. Christensen, T. F. Miller. J. Chem. Phys. 150, 134103 (2019).10.1063/1.5088393Search in Google Scholar PubMed
[126] T. Husch, J. Sun, L. Cheng, S. J. R. Lee, T. F. Miller. J. Chem. Phys. 154, 064108 (2021), https://doi.org/10.1063/5.0032362.Search in Google Scholar PubMed
[127] S. J. R. Lee, T. Husch, F. Ding, T. F. Miller. J. Chem. Phys. 154, 124120 (2021).Search in Google Scholar
[128] Y. Chen, L. Zhang, H. Wang, W. E. J. Phys. Chem. A 4, 7155 (2020), https://doi.org/10.1021/acs.jpca.0c03886.Search in Google Scholar PubMed
[129] T. Tuan-Anh, R. Zaleśny. ACS Omega 5, 5318 (2020), https://doi.org/10.1021/acsomega.9b04339.Search in Google Scholar PubMed PubMed Central
[130] A. Stuke, P. Rinke, M. Todorović. Mach. Learn. Sci. Technol. 2, 035022 (2021), https://doi.org/10.1088/2632-2153/abee59.Search in Google Scholar
[131] Y. Yang, O. A. Jimenez-Negron, J. R. Kitchin. J. Chem. Phys. 154, 234704 (2021), https://doi.org/10.1063/5.0049665.Search in Google Scholar PubMed
[132] K. Ahuja, W. H. Green, Y. P. Li. J. Chem. Theor. Comput. 17, 818 (2021), https://doi.org/10.1021/acs.jctc.0c00971.Search in Google Scholar PubMed
[133] G. W. Richings, S. Habershon. J. Phys. Chem. A 4, 9299 (2020), https://doi.org/10.1021/acs.jpca.0c06125.Search in Google Scholar PubMed
[134] M. P. Bahlke, N. Mogos, J. Proppe, C. Herrmann. J. Phys. Chem. A 4, 8708 (2020), https://doi.org/10.1021/acs.jpca.0c05983.Search in Google Scholar PubMed
[135] H. Wang, C. Mathematics, L. Zhang, J. Han, C. Mathematics, C. Mathematics. Comput. Phys. Commun. 228, 178 (2018).Search in Google Scholar
[136] S. Yue, M. C. Muniz, M. F. Calegari Andrade, L. Zhang, R. Car, A. Z. Panagiotopoulos. J. Chem. Phys. 154, 034111 (2021), https://doi.org/10.1063/5.0031215.Search in Google Scholar PubMed
[137] W. Jia, H. Wang, M. Chen, D. Lu, L. Lin, R. Car, W. E, L. Zhang. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 1 (2020). at <http://arxiv.org/abs/2005.00223>.10.1109/SC41405.2020.00009Search in Google Scholar
[138] T. Zubatiuk, O. Isayev. Acc. Chem. Res. 54, 1575 (2021), https://doi.org/10.1021/acs.accounts.0c00868.Search in Google Scholar PubMed
[139] M. Pinheiro, F. Ge, N. Ferré, P. O. Dral, M. Barbatti. Chem. Sci. 12, 14396 (2021), https://doi.org/10.1039/d1sc03564a.Search in Google Scholar PubMed PubMed Central
[140] J. Behler, G. Csányi. Eur. Phys. J. B 94, 142 (2021), doi:https://doi.org/10.1140/epjb/s10051-021-00156-1.Search in Google Scholar
[141] Y. Shao, M. Hellström, P. D. Mitev, L. Knijff, C. Zhang. J. Chem. Inf. Model. 60, 1184 (2020), https://doi.org/10.1021/acs.jcim.9b00994.Search in Google Scholar PubMed
[142] T. P. Senftle, S. Hong, M. M. Islam, S. B. Kylasa, Y. Zheng, Y. K. Shin, C. Junkermeier, R. Engel-Herbert, M. J. Janik, H. M. Aktulga, T. Verstraelen, A. Grama, A. C. T. van Duin. NPJ Comput. Mater. 2 (2016), https://doi.org/10.1038/npjcompumats.2015.11.Search in Google Scholar
[143] F. Guo, Y. S. Wen, S. Q. Feng, X. D. Li, H. S. Li, S. X. Cui, Z. R. Zhang, H. Q. Hu, G. Q. Zhang, X. L. Cheng. Comput. Mater. Sci. 172, 109393 (2020), https://doi.org/10.1016/j.commatsci.2019.109393.Search in Google Scholar
[144] S. Venturi, R. L. Jaffe, M. Panesi. J. Phys. Chem. A 4, 5129 (2020), https://doi.org/10.1021/acs.jpca.0c02395.Search in Google Scholar PubMed
[145] E. Iype, S. Urolagin. J. Chem. Phys. 150, 024307 (2019), doi:https://doi.org/10.1063/1.5054968.Search in Google Scholar PubMed
[146] J. Behler, M. Parrinello. Phys. Rev. Lett. 98, 146401 (2007), https://doi.org/10.1103/physrevlett.98.146401.Search in Google Scholar PubMed
[147] J. S. Smith, O. Isayev, A. E. Roitberg. Chem. Sci. 8, 3192 (2017), https://doi.org/10.1039/c6sc05720a.Search in Google Scholar PubMed PubMed Central
[148] J. S. Smith, B. Nebgen, N. Lubbers, O. Isayev, A. E. Roitberg. J. Chem. Phys. 8, 241733 (2018), https://doi.org/10.1063/1.5023802.Search in Google Scholar PubMed
[149] J. S. Smith, R. Zubatyuk, B. Nebgen, N. Lubbers, K. Barros, A. E. Roitberg, O. Isayev, S. Tretiak. Sci. Data 7, 1 (2020), https://doi.org/10.1038/s41597-020-0473-z.Search in Google Scholar PubMed PubMed Central
[150] J. S. Smith, B. T. Nebgen, R. Zubatyuk, N. Lubbers, C. Devereux, K. Barros, S. Tretiak, O. Isayev, A. E. Roitberg. Nat. Commun. 10, 1 (2019), https://doi.org/10.1038/s41467-019-10827-4.Search in Google Scholar PubMed PubMed Central
[151] C. Riplinger, P. Pinski, U. Becker, E. F. Valeev, F. Neese. J. Chem. Phys. 4, 024109 (2016), https://doi.org/10.1063/1.4939030.Search in Google Scholar PubMed
[152] C. Devereux, J. S. Smith, K. K. Davis, K. Barros, R. Zubatyuk, O. Isayev, A. E. Roitberg. J. Chem. Theor. Comput. 16, 4192 (2020).10.1021/acs.jctc.0c00121Search in Google Scholar PubMed
[153] S. L. J. Lahey, T. N. Thien Phuc, C. N. Rowley. J. Chem. Inf. Model. 60, 6258 (2020), https://doi.org/10.1021/acs.jcim.0c00904.Search in Google Scholar PubMed
[154] J. M. Stevenson, L. D. Jacobson, Y. Zhao, C. Wu, J. Maple, K. Leswing, E. Harder, R. Abel. ChemRxiv 1 (2019), https://doi.org/10.48550/arXiv.1912.05079.Search in Google Scholar
[155] X. Gao, F. Ramezanghorbani, O. Isayev, J. S. Smith, A. E. Roitberg. J. Chem. Inf. Model. 60, 3408 (2020), https://doi.org/10.1021/acs.jcim.0c00451.Search in Google Scholar PubMed
[156] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala. ArXiv (2019), https://doi.org/10.48550/arXiv.1912.01703.Search in Google Scholar
[157] M. Richter, P. Marquetand, J. González-Vázquez, I. Sola, L. González. J. Chem. Theor. Comput. 7, 1253 (2011).10.1021/ct1007394Search in Google Scholar PubMed
[158] J. Westermayr, M. Gastegger, P. Marquetand. J. Phys. Chem. Lett. 1, 3828 (2020), https://doi.org/10.1021/acs.jpclett.0c00527.Search in Google Scholar PubMed PubMed Central
[159] S. Li, W. Li, J. Ma. Acc. Chem. Res. 47, 2712 (2014), https://doi.org/10.1021/ar500038z.Search in Google Scholar PubMed
[160] Z. Cheng, D. Zhao, J. Ma, W. Li, S. Li. J. Phys. Chem. A 4, 5007 (2020), https://doi.org/10.1021/acs.jpca.0c04526.Search in Google Scholar PubMed
[161] G. Schmitz, I. H. Godtliebsen, O. Christiansen. J. Chem. Phys. 150, 244113 (2019), doi:https://doi.org/10.1063/1.5100141.Search in Google Scholar PubMed
[162] Y. Zuo, C. Chen, X. Li, Z. Deng, Y. Chen, J. Behler, G. Csányi, A. v. Shapeev, A. P. Thompson, M. A. Wood, S. P. Ong. J. Phys. Chem. A 4, 731 (2020), https://doi.org/10.1021/acs.jpca.9b08723.Search in Google Scholar PubMed
[163] M. Taufer, T. Estrada, T. Johnston. Phil. Trans. R. Soc. A 378, 20190063 (2020), https://doi.org/10.1098/rsta.2019.0063.Search in Google Scholar PubMed PubMed Central
[164] P. Pattnaik, S. Raghunathan, T. Kalluri, P. Bhimalapuram, C. v. Jawahar, U. D. Priyakumar. J. Phys. Chem. A 4, 6954 (2020), https://doi.org/10.1021/acs.jpca.0c03926.Search in Google Scholar PubMed
[165] X. Chen, C. F. Goldsmith. J. Phys. Chem. A 4, 1038 (2020), https://doi.org/10.1021/acs.jpca.9b11507.Search in Google Scholar PubMed
[166] F. Häse, I. Fdez. Galván, A. Aspuru-Guzik, R. Lindh, M. Vacher. Chem. Sci. 10, 2298 (2019), https://doi.org/10.1039/c8sc04516j.Search in Google Scholar PubMed PubMed Central
[167] Q. Lin, Y. Zhang, B. Zhao, B. Jiang. J. Chem. Phys. 152, 154104 (2020), https://doi.org/10.1063/5.0004944.Search in Google Scholar PubMed
[168] T. D. Loeffler, T. K. Patra, H. Chan, M. Cherukara, S. K. R. S. Sankaranarayanan. J. Phys. Chem. C 4, 4907 (2020), https://doi.org/10.1021/acs.jpcc.0c00047.Search in Google Scholar
[169] J. Vandermause, S. B. Torrisi, S. Batzner, Y. Xie, L. Sun, A. M. Kolpak, B. Kozinsky. NPJ Comput. Mater. 6, 1 (2020), https://doi.org/10.1038/s41524-020-0283-z.Search in Google Scholar
[170] J. Dai, R. v. Krems. J. Chem. Theor. Comput. 16, 1386 (2020), https://doi.org/10.1021/acs.jctc.9b00700.Search in Google Scholar PubMed
[171] U. Maulik, S. Bandyopadhyay. Pattern Recogn. 33, 1455 (2000), https://doi.org/10.1016/s0031-3203(99)00137-5.Search in Google Scholar
[172] N. J. Browning, R. Ramakrishnan, O. A. von Lilienfeld, U. Roethlisberger. J. Phys. Chem. Lett. 8, 1351 (2017), https://doi.org/10.1021/acs.jpclett.7b00038.Search in Google Scholar PubMed
[173] F. Provost. AAAI Technical Report WS-00-05 3 (2000).Search in Google Scholar
[174] M. Haghighatlari, C.-Y. Shih, J. Hachmann. ChemRxiv 1 (2019), https://doi.org/10.26434/chemrxiv.8796947.v2.10.26434/chemrxiv.8796947.v2Search in Google Scholar
[175] H. A. Doan, G. Agarwal, H. Qian, M. J. Counihan, J. Rodríguez-López, J. S. Moore, R. S. Assary. Chem. Mater. 2, 6338 (2020), https://doi.org/10.1021/acs.chemmater.0c00768.Search in Google Scholar
[176] Z. Xu, O. Wauchope, A. T. Frank. J. Chem. Inf. Model. 61, 5589 (2021).10.1021/acs.jcim.1c00746Search in Google Scholar PubMed
[177] W. Jin, R. Barzilay, T. Jaakkola. RSC Drug Discov. Ser. 2021, 228 (2021).10.1039/9781788016841-00228Search in Google Scholar
[178] E. Antono, N. N. Matsuzawa, J. Ling, J. E. Saal, H. Arai, M. Sasago, E. Fujii. J. Phys. Chem. A 4, 8330 (2020), https://doi.org/10.1021/acs.jpca.0c05769.Search in Google Scholar PubMed
[179] D. Lemm, G. F. von Rudorff, O. A. von Lilienfeld. Nat Commun 12, 4468 (2021).10.1038/s41467-021-24525-7Search in Google Scholar PubMed PubMed Central
[180] G. A. Pinheiro, J. Mucelini, M. D. Soares, R. C. Prati, J. L. F. da Silva, M. G. Quiles. J. Phys. Chem. A 4, 9854 (2020), https://doi.org/10.1021/acs.jpca.0c05969.Search in Google Scholar PubMed
[181] Y. Zhang, C. Ling. NPJ Comput. Mater. 4, 25 (2018), doi:https://doi.org/10.1038/s41524-018-0081-z.Search in Google Scholar
[182] Z. Xiong, Y. Cui, Z. Liu, Y. Zhao, M. Hu, J. Hu. Comput. Mater. Sci. 171, 109203 (2020), https://doi.org/10.1016/j.commatsci.2019.109203.Search in Google Scholar
[183] C. Grebner, H. Matter, A. T. Plowright, G. Hessler. J. Med. Chem. 63, 8809 (2020), https://doi.org/10.1021/acs.jmedchem.9b02044.Search in Google Scholar PubMed
[184] F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, K. R. Müller. Nat. Commun. 8, 872 (2017), doi:https://doi.org/10.1038/s41467-017-00839-3.Search in Google Scholar PubMed PubMed Central
[185] J. P. Perdew. Phys. Rev. B 33, 8822 (1986), https://doi.org/10.1103/physrevb.33.8822.Search in Google Scholar PubMed
[186] W. Kohn. Rev. Mod. Phys. 71, 1253 (1999), https://doi.org/10.1103/revmodphys.71.1253.Search in Google Scholar
[187] Y. A. Wang, E. A. Carter. in Theoretical methods in condensed phase chemistry, pp. 117–184, Springer, New York, NY (2002).Search in Google Scholar
[188] W. Kohn, L. J. Sham. Phys. Rev. 140, A1133 (1965), https://doi.org/10.1103/physrev.140.a1133.Search in Google Scholar
[189] A. D. Becke. J. Chem. Phys. 140, 18A301 (2014), doi:https://doi.org/10.1063/1.4869598.Search in Google Scholar PubMed
[190] U. von Barth. Phys. Scripta T T109, 9 (2004), https://doi.org/10.1238/physica.topical.109a00009.Search in Google Scholar
[191] V. N. Staroverov. Density-functional approximations for exchange and correlation, in A Matter of Density: Exploring the Electron Density Concept in the Chemical, Biological, and Materials Sciences, pp. 125–156, Wiley Online Library, Hoboken, NJ (2012).10.1002/9781118431740.ch6Search in Google Scholar
[192] S. F. Sousa, P. A. Fernandes, M. J. Ramos. J. Phys. Chem. A 1, 10439 (2007), https://doi.org/10.1021/jp0734474.Search in Google Scholar PubMed
[193] A. Chandrasekaran, D. Kamal, R. Batra, C. Kim, L. Chen, R. Ramprasad. NPJ Comput. Mater. 5, 22 (2019), doi:https://doi.org/10.1038/s41524-019-0162-7.Search in Google Scholar
[194] J. C. Snyder, M. Rupp, K. Hansen, K. R. Müller, K. Burke. Phys. Rev. Lett. 108, 253002 (2012), https://doi.org/10.1103/physrevlett.108.253002.Search in Google Scholar PubMed
[195] R. Meyer, M. Weichselbaum, A. W. Hauser. J. Chem. Theor. Comput. 16, 5685 (2020), https://doi.org/10.1021/acs.jctc.0c00580.Search in Google Scholar PubMed PubMed Central
[196] L. Li, S. Hoyer, R. Pederson, R. Sun, E. D. Cubuk, P. Riley, K. Burke. Phys. Rev. Lett. 126, 036401 (2021), https://doi.org/10.1103/physrevlett.126.036401.Search in Google Scholar
[197] V. Venkatraman, S. Abburu, B. K. Alsberg. Chemometr. Intell. Lab. Syst. 142, 87 (2015), https://doi.org/10.1016/j.chemolab.2015.01.013.Search in Google Scholar
[198] R. A. Vargas-Hernández. J. Phys. Chem. A 4, 4053 (2020), https://doi.org/10.1021/acs.jpca.0c01375.Search in Google Scholar PubMed
[199] S. McAnanama-Brereton, M. P. Waller. J. Chem. Inf. Model. 58, 61 (2018), https://doi.org/10.1021/acs.jcim.7b00542.Search in Google Scholar PubMed
[200] R. Nagai, R. Akashi, O. Sugino. NPJ Comput. Mater. 6, 43 (2020), doi:https://doi.org/10.1038/s41524-020-0310-0.Search in Google Scholar
[201] J. Kirkpatrick, B. Mcmorrow, D. H. P. Turban, A. L. Gaunt, J. S. Spencer, A. G. D. G. Matthews, A. Obika, L. Thiry, M. Fortunato, D. Pfau, L. R. Castellanos, S. Petersen, A. W. R. Nelson, P. Kohli, P. Mori-Sánchez, D. Hassabis, A. J. Cohen. Science (1979) 374, 1385 (2021), https://doi.org/10.1126/science.abj6511.Search in Google Scholar PubMed
[202] P. v. Coveney, E. R. Dougherty, R. R. Highfeld. Phil. Trans. Math. Phys. Eng. Sci. 374, 20160153 (2016).10.1098/rsta.2016.0153Search in Google Scholar PubMed PubMed Central
[203] R. Dybowski. New J. Chem. 4, 20914 (2020), https://doi.org/10.1039/d0nj02592e.Search in Google Scholar
[204] F. Emmert-Streib, O. Yli-Harja, M. Dehmer. WIREs Data Min. Knowl. Discovery 10, e1368 (2020).10.1002/widm.1368Search in Google Scholar
[205] X. Liu, T. Zhang, T. Yang, X. Liu, X. Song, Y. Yang, N. Li, G. Rignanese, Y. Li, X. Wen. J. Phys. Chem. A 124, 8866 (2020), https://doi.org/10.1021/acs.jpca.0c06319.Search in Google Scholar PubMed
[206] S. Ye, J. Liang, R. Liu, X. Zhu. J. Phys. Chem. A 4, 6945 (2020), https://doi.org/10.1021/acs.jpca.0c03201.Search in Google Scholar PubMed
[207] S. Heinen, M. Schwilk, G. F. von Rudorff, O. A. von Lilienfeld. Mach. Learn. Sci. Technol. 1, 025002 (2020).10.1088/2632-2153/ab6ac4Search in Google Scholar
[208] W. S. Jeong, S. J. Stoneburner, D. King, R. Li, A. Walker, R. Lindh, L. Gagliardi. J. Chem. Theor. Comput. 16, 2389 (2020), https://doi.org/10.1021/acs.jctc.9b01297.Search in Google Scholar PubMed
[209] C. Duan, F. Liu, A. Nandy, H. J. Kulik. J. Phys. Chem. Lett. 1, 6640 (2020), https://doi.org/10.1021/acs.jpclett.0c02018.Search in Google Scholar PubMed
[210] Q. Yu, F. Pavošević, S. Hammes-Schiffer. J. Chem. Phys. 2, 244123 (2020), https://doi.org/10.1063/5.0009233.Search in Google Scholar PubMed
[211] H. J. Kulik. Wiley Interdiscip. Rev. Comput. Mol. Sci. 10, 1 (2020), https://doi.org/10.1002/wcms.1439.Search in Google Scholar
[212] J. P. Janet, H. J. Kulik. J. Phys. Chem. A 1, 8939 (2017), https://doi.org/10.1021/acs.jpca.7b08750.Search in Google Scholar PubMed
[213] J. P. Janet, F. Liu, A. Nandy, C. Duan, T. Yang, S. Lin, H. J. Kulik. Inorg. Chem. 58, 10592 (2019), https://doi.org/10.1021/acs.inorgchem.9b00109.Search in Google Scholar PubMed
[214] N. J. Deyonker, T. R. Cundari, A. K. Wilson. J. Chem. Phys. 124, 114104 (2006).10.1063/1.2173988Search in Google Scholar PubMed
[215] W. Jiang, N. J. Deyonker, J. J. Determan, A. K. Wilson. J. Phys. Chem. A 6, 870 (2012), https://doi.org/10.1021/jp205710e.Search in Google Scholar PubMed
[216] L. E. Aebersold, A. K. Wilson. J. Phys. Chem. A 5, 7029 (2021), https://doi.org/10.1021/acs.jpca.1c06155.Search in Google Scholar PubMed
[217] T. Dietterich. ACM Comput. Surv. CSUR 27, 326 (1995), https://doi.org/10.1145/212094.212114.Search in Google Scholar
[218] T. Dietterich, M. J. Willatt, M. A. Langovoy, M. Ceriotti, A. Nicholls. J. Comput. Aided Mol. Des. 28, 887 (2014).Search in Google Scholar
[219] A. Nicholls. J. Comput. Aided Mol. Des. 28, 887 (2014), https://doi.org/10.1007/s10822-014-9753-z.Search in Google Scholar PubMed PubMed Central
[220] A. Nicholls. J. Comput. Aided Mol. Des. 30, 103 (2016), https://doi.org/10.1007/s10822-016-9904-5.Search in Google Scholar PubMed PubMed Central
[221] P. Pernot, B. Huang, A. Savin. Mach. Learn.: Sci. Technol. 1, 035011 (2020).10.1088/2632-2153/aba184Search in Google Scholar
[222] P. Pernot, A. Savin. J. Chem. Phys. 8, 241707 (2018), https://doi.org/10.1063/1.5016248.Search in Google Scholar PubMed
[223] P. Pernot, A. Savin. J. Chem. Phys. 152, 164108 (2020), https://doi.org/10.1063/5.0006202.Search in Google Scholar PubMed
[224] P. Pernot, A. Savin. J. Chem. Phys. 152, 164109 (2020), https://doi.org/10.1063/5.0006204.Search in Google Scholar PubMed
[225] P. Pernot, A. Savin. Theor. Chem. Acc. 140, 1 (2021), https://doi.org/10.1007/s00214-021-02725-0.Search in Google Scholar
[226] F. Musil, M. J. Willatt, M. A. Langovoy, M. Ceriotti. J. Chem. Theor. Comput. 15, 906 (2019), https://doi.org/10.1021/acs.jctc.8b00959.Search in Google Scholar PubMed
[227] A. Shrivastava, V. Gupta. Chronicles Young Sci. 2, 21 (2011), https://doi.org/10.4103/2229-5186.79345.Search in Google Scholar
[228] S. Y. Cho, Y. Lee, S. Lee, H. Kang, J. Kim, J. Choi, J. Ryu, H. Joo, H. T. Jung, J. Kim. Anal. Chem. 92, 6529 (2020), https://doi.org/10.1021/acs.analchem.0c00137.Search in Google Scholar PubMed
[229] F. A. Chiappini, F. Allegrini, H. C. Goicoechea, A. C. Olivieri. Anal. Chem. 92, 12265 (2020), https://doi.org/10.1021/acs.analchem.0c01863.Search in Google Scholar PubMed
[230] S. Caron, N. M. Thomson. J. Org. Chem. 80, 2943 (2015), doi:https://doi.org/10.1021/jo502879m.Search in Google Scholar PubMed
[231] P. Sagmeister, J. D. Williams, C. A. Hone, C. O. Kappe. React. Chem. Eng. 4, 1571 (2019), https://doi.org/10.1039/c9re00087a.Search in Google Scholar
[232] A. D. Clayton, A. M. Schweidtmann, G. Clemens, J. A. Manson, C. J. Taylor, C. G. Niño, T. W. Chamberlain, N. Kapur, A. J. Blacker, A. A. Lapkin, R. A. Bourne. Chem. Eng. J. 4, 123340 (2020), https://doi.org/10.1016/j.cej.2019.123340.Search in Google Scholar
[233] E. Bradford, A. M. Schweidtmann, A. Lapkin. J. Global Optim. 1, 407 (2018), https://doi.org/10.1007/s10898-018-0609-2.Search in Google Scholar
[234] N. S. Eyke, B. A. Koscher, K. F. Jensen. Trends Chem. 3, 120 (2021), https://doi.org/10.1016/j.trechm.2020.12.001.Search in Google Scholar
[235] Y. Shi, P. L. Prieto, T. Zepel, S. Grunert, J. E. Hein. Acc. Chem. Res. 54, 546 (2021), https://doi.org/10.1021/acs.accounts.0c00736.Search in Google Scholar PubMed
[236] J. Lam, S. Abdul-Al, A. R. Allouche. J. Chem. Theor. Comput. 16, 1681 (2020).10.1021/acs.jctc.9b00964Search in Google Scholar PubMed
[237] S. Ye, K. Zhong, J. Zhang, W. Hu, J. D. Hirst, G. Zhang, S. Mukamel, J. Jiang. J. Am. Chem. Soc. 142, 19071 (2020), https://doi.org/10.1021/jacs.0c06530.Search in Google Scholar PubMed
[238] V. H. da Silva, F. Murphy, J. M. Amigo, C. Stedmon, J. Strand. Anal. Chem. 92, 13724 (2020), https://doi.org/10.1021/acs.analchem.0c01324.Search in Google Scholar PubMed
[239] P. Gao, J. Zhang, Q. Peng, J. Zhang, V. A. Glezakou. J. Chem. Inf. Model. 60, 3746 (2020), https://doi.org/10.1021/acs.jcim.0c00388.Search in Google Scholar PubMed
[240] P. A. Unzueta, C. S. Greenwell, G. J. O. Beran. J. Chem. Theor. Comput. 17, 826 (2021), https://doi.org/10.1021/acs.jctc.0c00979.Search in Google Scholar PubMed
[241] A. R. Pelletier, Y. E. Chung, Z. Ning, N. Wong, D. Figeys, M. Lavallée-Adam. J. Am. Soc. Mass Spectrom. 31, 1459 (2020), https://doi.org/10.1021/jasms.0c00064.Search in Google Scholar PubMed
[242] T. Baltrušaitis, C. Ahuja, L.-P. Morency. ArXiv 1 (2017), https://doi.org/10.48550/arXiv.1705.09406.Search in Google Scholar
[243] D. D. Matyushin, A. K. Buryak. IEEE Access 8, 223140 (2020), https://doi.org/10.1109/access.2020.3045047.Search in Google Scholar
[244] A. Krull, P. Hirsch, C. Rother, A. Schiffrin, C. Krull. Commun. Phys. 3, 1 (2020), https://doi.org/10.1038/s42005-020-0317-3.Search in Google Scholar
[245] D. J. Williamson, G. L. Burn, S. Simoncelli, J. Griffié, R. Peters, D. M. Davis, D. M. Owen. Nat. Commun. 11, 1 (2020), https://doi.org/10.1038/s41467-020-15293-x.Search in Google Scholar PubMed PubMed Central
[246] S. Muto, M. Shiga. Microscopy 69, 110 (2020), https://doi.org/10.1093/jmicro/dfz036.Search in Google Scholar PubMed PubMed Central
[247] L. Yao, Z. Ou, B. Luo, C. Xu, Q. Chen. ACS Cent. Sci. 6, 1421 (2020), https://doi.org/10.1021/acscentsci.0c00430.Search in Google Scholar PubMed PubMed Central
[248] K. P. Kelley, M. Ziatdinov, L. Collins, M. A. Susner, R. K. Vasudevan, N. Balke, S. v. Kalinin, S. Jesse. Small 16, 2002878 (2020), https://doi.org/10.1002/smll.202002878.Search in Google Scholar PubMed
[249] O. M. Gordon, P. J. Moriarty. Mach. Learn. Sci. Technol. 1, 023001 (2020), https://doi.org/10.1088/2632-2153/ab7d2f.Search in Google Scholar
[250] O. A. von Lilienfeld. Angew. Chem. Int. Ed. 7, 4164 (2018), https://doi.org/10.1002/anie.201709686.Search in Google Scholar PubMed
[251] D. Folmsbee, G. Hutchison. Int. J. Quant. Chem. 121, e26381 (2020).10.1002/qua.26381Search in Google Scholar
[252] O. A. von Lilienfeld. Mach. Learn. Sci. Technol. 1, 010201 (2020), https://doi.org/10.1088/2632-2153/ab6d5d.Search in Google Scholar
[253] A. G. Godfrey, S. G. Michael, G. S. Sittampalam, G. Zahoránszky-Köhalmi. Front Robot AI 7, 24 (2020), https://doi.org/10.3389/frobt.2020.00024.Search in Google Scholar PubMed PubMed Central
© 2022 IUPAC & De Gruyter. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. For more information, please visit: http://creativecommons.org/licenses/by-nc-nd/4.0/