Overview of graph models and their prospects in molecular informatics

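  • Summary: Molecular informatics, a frontier field at the intersection of chemistry and artificial intelligence, is rapidly driving technical innovation in areas such as drug design and functional materials development. Molecular representation learning, its core foundation, encodes molecular structures into numerical vectors that preserve their topological and physicochemical properties, providing efficient feature representations for downstream tasks such as molecular property prediction and molecular generation. Compared with rule-based and character-sequence-based representations, graph models make full use of the natural graph structure of molecules (atoms as nodes, chemical bonds as edges) and capture molecular topology and complex interactions accurately; they have become the mainstream technique in the field. This paper systematically reviews the latest research progress and applications of graph models in molecular informatics. It first traces the development of molecular representation methods and explains the basic concepts and distinctive advantages of graph models. Next, focusing on the two core tasks of molecular property prediction and molecular generation, it surveys commonly used datasets, evaluation metrics, and the characteristics and research status of the various graph-discriminative and graph-generative models. Using material property prediction and crystal generation as examples, it also discusses the strengths, weaknesses, applicable scenarios, and technical challenges of different deep graph models in practical applications. Finally, it discusses the potential of emerging trends such as large-scale pre-training, explainability methods, and multimodal learning in molecular informatics, and outlines future research directions. This review aims to help chemistry researchers quickly locate cutting-edge techniques and applicable methods, and to clarify technical routes for artificial intelligence researchers, so as to promote more efficient algorithm design and its practical application in molecular informatics.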

     

    Abstract: The rapid growth of molecular data and advances in deep learning have driven significant progress in molecular informatics, an emerging field that integrates chemistry, computational science, and artificial intelligence (AI) and uses data-driven methods to decode the relationships between molecular structures and their properties, thereby supporting drug design and material discovery. Molecular representation learning (MRL), a fundamental part of molecular informatics, encodes molecular structures and properties into numerical vectors that serve as efficient representations for downstream tasks. High-quality molecular representations are critical for accurate property prediction, optimization, and generation. However, traditional rule-based MRL methods rely on handcrafted features that are time-consuming to design and depend on expert knowledge. Sequence-based MRL methods, which operate on string encodings such as the simplified molecular input line entry system (SMILES), often place bonded atoms at distant positions in the string, leading to suboptimal representations that fail to fully capture spatial and topological information. In contrast, because molecules naturally form graphs with atoms as nodes and bonds as edges, graph-based models can exploit this structure directly. Owing to the strength of graph models in representing complex structures, learning cross-scale features, and handling constrained optimization, graph-based MRL methods have achieved significant advances in molecular property prediction and molecular generation. In this review, we first introduce the evolution of molecular representation methods, focusing on 2D and 3D molecular graph representations. We then classify graph models into discriminative and generative categories and discuss their concepts and applications.
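To illustrate the point about sequence encodings, the sketch below parses a deliberately tiny SMILES subset into an atom/bond graph and measures how far apart bonded atoms sit in the string. The function names `smiles_to_graph` and `bond_string_gaps` are hypothetical, and the parser assumes single-letter element symbols, implicit single bonds, parenthesised branches, and single-digit ring closures only; it is not a substitute for a real cheminformatics toolkit.

```python
def smiles_to_graph(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds).

    Assumed subset (illustrative only): single-letter element symbols,
    implicit single bonds, '(' ')' branches, single-digit ring closures.
    """
    atoms = []          # (element symbol, character position in the string)
    bonds = []          # (atom index, atom index) pairs
    prev = None         # atom that the next atom will bond to
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring-closure digit -> atom index that opened it

    for pos, ch in enumerate(smiles):
        if ch.isalpha():
            atoms.append((ch, pos))
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the digit closes the ring.
                bonds.append((open_rings.pop(ch), prev))
            else:
                open_rings[ch] = prev
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return atoms, bonds


def bond_string_gaps(atoms, bonds):
    """Map each bond to the distance, in characters, between its two atoms."""
    return {(i, j): abs(atoms[i][1] - atoms[j][1]) for i, j in bonds}


# Cyclohexane: six carbons in a ring.
atoms, bonds = smiles_to_graph("C1CCCCC1")
gaps = bond_string_gaps(atoms, bonds)
print(bonds)
print(gaps)
```

For cyclohexane written as `C1CCCCC1`, the ring-closure bond joins the first and last atoms, which are six characters apart in the string even though they are direct neighbors in the graph.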
Graph-discriminative models encode topological structures and node/edge features to capture nonlinear structure-property relationships for classification and regression tasks. Graph-generative models learn molecular distributions in order to optimize existing structures or design novel compounds with desired properties. Next, we review commonly used datasets, evaluation metrics, and research progress in molecular property prediction and molecular generation. Molecular property prediction estimates physical and chemical properties from a molecule's internal structural information, helping researchers quickly identify suitable candidates among a large number of potential compounds. We briefly present three categories of property prediction methods (2D graph-based, 3D graph-based, and domain knowledge-integrated approaches) and introduce a recent representative method for each. Furthermore, we review research on graph neural network models for material property prediction and their corresponding application scenarios. The goal of molecular generation is to learn latent distributions from limited datasets and to generate novel structures with specific chemical functions through sampling and decoding. We introduce widely used frameworks for molecular generation, including variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, and diffusion models, which have demonstrated strong capabilities in capturing complex molecular features and optimizing chemical properties while preserving chemical validity. In addition, using crystal material generation as an example, we introduce and compare different deep generative models for material discovery, highlighting their application scenarios, strengths, and limitations.
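As a rough illustration of how a graph-discriminative model can encode topology together with node features, the sketch below runs one round of sum-aggregation message passing over a small molecular graph and pools the result into a graph-level vector. This is a generic, weight-free toy and not any specific model covered by the review; the helper names (`one_hot_features`, `build_adjacency`, `message_passing_step`, `sum_readout`) are hypothetical, and a real graph neural network would add learned transformations and nonlinearities at each step.

```python
def one_hot_features(elements, vocabulary):
    """One-hot encode each atom's element symbol against a fixed vocabulary."""
    return [[1.0 if e == v else 0.0 for v in vocabulary] for e in elements]


def build_adjacency(num_atoms, bonds):
    """Turn an undirected bond list into per-atom neighbor lists."""
    adjacency = [[] for _ in range(num_atoms)]
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    return adjacency


def message_passing_step(h, adjacency):
    """Update each atom as h_v + sum of neighbor features (no learned weights)."""
    new_h = []
    for v, h_v in enumerate(h):
        updated = list(h_v)
        for u in adjacency[v]:
            for k, x in enumerate(h[u]):
                updated[k] += x
        new_h.append(updated)
    return new_h


def sum_readout(h):
    """Pool atom features into one graph-level vector by summation."""
    return [sum(column) for column in zip(*h)]


# Heavy-atom graph of ethanol: C-C-O (hydrogens omitted for brevity).
elements = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

h0 = one_hot_features(elements, vocabulary=["C", "O"])
adjacency = build_adjacency(len(elements), bonds)
h1 = message_passing_step(h0, adjacency)
graph_vector = sum_readout(h1)

print(h1)            # per-atom features after one round
print(graph_vector)  # graph-level vector that a predictor could consume
```

After one round, each atom's vector counts the element types in its immediate neighborhood (itself included), so the middle carbon already "sees" both a carbon and an oxygen neighbor. Stacking more rounds widens that neighborhood, which is the basic mechanism by which such models relate structure to a property label in classification or regression.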
Finally, we discuss future research directions for graph models in molecular informatics from the perspectives of large-scale pre-training, explainable AI, and multimodal learning strategies. This review aims to assist molecular informatics researchers in identifying cutting-edge studies and applicable methods, while clarifying the technical pathways for AI researchers to promote more efficient algorithm design and implementation.

     
