Development of Data-driven Atomic Features as Basis Sets for Molecular Machine Learning
- Abstract
- A central idea in chemistry is that molecular properties can be explained from a more microscopic viewpoint, typically at the atomic or electronic level. Accordingly, electronic structure theories are built on molecular orbital theory, which generally begins with a linear combination of atomic orbitals. Likewise, typical machine learning (ML) models for predicting molecular properties are built on atomic features. However, although some molecular properties are well explained by the properties of their constituent atoms, there is no clear evidence that tabulated atomic properties correlate well with all molecular properties. Here, we present a data-driven feature generation method for predicting molecular properties that cannot be easily decomposed into atomic properties.
Recently, ML has become a popular computational tool for establishing structure-property relationships. To predict molecular properties with ML, molecules must be represented as input variables that retain their structural characteristics. To build such representations, graph neural networks (GNNs), which treat a molecular structure as a molecular graph, have been widely adopted because of their remarkable performance. GNNs can process both 2D and 3D molecular representations, which can be given as atomic coordinates, interatomic distances, or strings such as the simplified molecular-input line-entry system (SMILES). By combining the graph structure with atomic feature matrices, advanced GNN models have achieved state-of-the-art performance. Nevertheless, these methods rely on conventional atomic features such as electronegativity, atomic weight, and van der Waals radius. Herein, we propose a method for generating fully data-driven atomic features that can be used as atomic properties when predicting molecular properties. In specific circumstances, the new data-driven atomic features showed considerable prediction performance even with less information than classical atomic properties, and when they are used together with conventional atomic properties, a performance improvement can be expected in several cases. Because the generation method is based on a metaheuristic algorithm, it can be applied to various fields of chemistry. This research is expected to help those who want to create node features for GNNs and improve model performance in various fields.
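The idea of learning atomic features from data with a metaheuristic can be illustrated with a small sketch. This is not the thesis's algorithm, which the abstract does not specify. It is a minimal stand-in that assumes one scalar feature per element, sum pooling over atoms in place of a GNN, and a simple (1+1) hill-climbing search. The molecule list, the `HIDDEN` contribution table, the targets derived from it, and all function names are hypothetical and synthetic.

```python
import random

# Toy molecules as bags of element symbols (bond structure is ignored here).
MOLECULES = {
    "methane": ["C"] + ["H"] * 4,
    "water": ["O"] + ["H"] * 2,
    "methanol": ["C", "O"] + ["H"] * 4,
    "carbon_dioxide": ["C", "O", "O"],
    "ethanol": ["C", "C", "O"] + ["H"] * 6,
}

# Synthetic per-element contributions, used only to generate toy targets.
# These are NOT real chemical data.
HIDDEN = {"C": 1.0, "H": -0.5, "O": 2.0}
TARGETS = {name: sum(HIDDEN[a] for a in atoms) for name, atoms in MOLECULES.items()}


def predict(features, atoms):
    """Sum-pool one scalar feature per atom into a molecular prediction."""
    return sum(features[a] for a in atoms)


def loss(features):
    """Mean squared error of the predictions over the toy data set."""
    errors = [(predict(features, atoms) - TARGETS[name]) ** 2
              for name, atoms in MOLECULES.items()]
    return sum(errors) / len(errors)


def search_features(steps=5000, sigma=0.1, seed=0):
    """(1+1) hill climbing: perturb one element's feature and keep the
    candidate only if the loss does not increase."""
    rng = random.Random(seed)
    features = {element: 0.0 for element in HIDDEN}
    best = loss(features)
    for _ in range(steps):
        candidate = dict(features)
        element = rng.choice(sorted(candidate))
        candidate[element] += rng.gauss(0.0, sigma)
        candidate_loss = loss(candidate)
        if candidate_loss <= best:
            features, best = candidate, candidate_loss
    return features, best


if __name__ == "__main__":
    learned, final_loss = search_features()
    print(learned, final_loss)
```

In this sketch the feature table starts from zeros rather than from tabulated properties such as electronegativity, so whatever values it ends with are driven only by the target data. A fuller treatment would presumably use multi-dimensional features per element and evaluate each candidate by training or scoring a GNN, which is much more costly than the closed-form loss assumed here.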
- Author(s)
- Dabean Han
- Issued Date
- 2024
- Type
- Thesis
- URI
- https://scholar.gist.ac.kr/handle/local/19129
Items in Repository are protected by copyright, with all rights reserved, unless otherwise indicated.