Basic Molecular Representation for Machine Learning
From SMILES to Word Embedding and Graph Embedding
This post walks through code that implements the conceptual framework shown in Figure 1, including:
reading, drawing, and analyzing a molecule,
generating a molecular fingerprint from a SMILES string,
generating a one-hot encoding from a SMILES string,
generating a word embedding from a SMILES string, and
generating a molecular representation as a graph.
Reading, Drawing, and Analyzing a Molecule
RDKit is an open-source library for cheminformatics. Figure 2 shows the code for reading the SMILES string of caffeine and drawing its molecular structure. In a SMILES string, C stands for carbon, N for nitrogen, and O for oxygen. A molecule can be drawn without carbon labels, as shown in Figure 3, or with them, as shown in Figure 4.
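For readers without the figures at hand, the following is a minimal sketch of reading and drawing caffeine with RDKit. It is not the original code from Figure 2: the SMILES string is one commonly used form for caffeine, and the helper name `save_drawing` and the file name `caffeine.png` are arbitrary choices made for this example.

```python
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

# One valid SMILES form of caffeine (assumed; the figure may use another form).
CAFFEINE_SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

# MolFromSmiles returns None when the string cannot be parsed.
mol = Chem.MolFromSmiles(CAFFEINE_SMILES)
if mol is None:
    raise ValueError(f"Could not parse SMILES: {CAFFEINE_SMILES}")

formula = CalcMolFormula(mol)        # molecular formula, hydrogens included
num_heavy_atoms = mol.GetNumAtoms()  # hydrogens are implicit, so not counted


def save_drawing(molecule, path="caffeine.png", size=(300, 300)):
    """Render a 2D depiction of the molecule and write it to an image file."""
    image = Draw.MolToImage(molecule, size=size)  # returns a PIL image
    image.save(path)
    return path
```

Calling `save_drawing(mol)` writes the depiction to disk. The check for `None` matters in practice because an invalid SMILES string does not raise an exception on its own.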
Figure 5 shows the code for displaying the atoms and bonds in the molecule of caffeine.
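As a rough illustration of what such a listing involves (a sketch, not the code from Figure 5), the atoms and bonds of a molecule can be walked with `GetAtoms()` and `GetBonds()`. The choice of columns below is an assumption made for this example.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine

# One row per atom: index, element symbol, atomic number, aromatic flag, ring flag.
atom_rows = [
    (a.GetIdx(), a.GetSymbol(), a.GetAtomicNum(), a.GetIsAromatic(), a.IsInRing())
    for a in mol.GetAtoms()
]

# One row per bond: begin atom index, end atom index, bond type, ring flag.
bond_rows = [
    (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()), b.IsInRing())
    for b in mol.GetBonds()
]

for row in atom_rows:
    print(row)
for row in bond_rows:
    print(row)
```

Keeping the aromatic flag and the ring flag in separate columns makes it easy to see where the two differ in molecules other than caffeine.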
Figure 6 and Figure 7 show the details of atoms and bonds in the molecule of caffeine, respectively. Notes:
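In RDKit, “aromatic” is narrower than “in a ring”: an atom or bond is aromatic only if it belongs to an aromatic ring system. RDKit perceives every ring atom and ring bond of caffeine as aromatic, so in the following tables the two notions happen to coincide. GetIsAromatic in Figure 6 reports whether an atom is aromatic, and GetBondType in Figure 7 reports the bond type (for example SINGLE, DOUBLE, or AROMATIC). Ring membership in general can be checked with IsInRing on an atom or a bond.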
Figure 6 and Figure 7 could be regarded as a simple atom attribute matrix and a simple bond attribute matrix in the conceptual framework shown in Figure 1.
The list of bonds in Figure 7 can serve as the graph form of a molecule: it is an edge list, which carries the same information as an adjacency matrix.
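The link between a bond list and an adjacency matrix can be sketched as follows. This is an illustrative example rather than code from the post; it fills the matrix by hand from the bond list and then compares it with the matrix that RDKit's `GetAdjacencyMatrix` returns.

```python
import numpy as np
from rdkit import Chem

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine
n = mol.GetNumAtoms()

# Edge list: one (begin, end) pair of atom indices per bond.
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]

# Fill a symmetric 0/1 adjacency matrix from the edge list.
adjacency = np.zeros((n, n), dtype=int)
for i, j in edges:
    adjacency[i, j] = 1
    adjacency[j, i] = 1

# RDKit can build the same matrix directly.
rdkit_adjacency = Chem.GetAdjacencyMatrix(mol)
```

Each bond contributes two entries to the matrix, one in each direction, because chemical bonds are undirected edges.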
Generating Molecular Fingerprint from a SMILES String
RDKit supports several fingerprint functions, whose outputs can be used to calculate molecular similarity or as inputs to downstream machine learning models. Figure 8 shows the code for computing the RDKit fingerprint and the Morgan fingerprint, and Figure 9 shows the results of these fingerprint functions.
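A minimal sketch of both uses, similarity and model input, might look like the following. It is not the code from Figure 8. The Morgan radius of 2, the 2048-bit length, and theobromine as the comparison molecule are assumptions chosen for illustration.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

caffeine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
theobromine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)NC(=O)N2C")

# RDKit (path-based) fingerprint; the default length is 2048 bits.
rdk_fp = Chem.RDKFingerprint(caffeine)

# Morgan (circular) fingerprint; radius and size are assumed settings.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
morgan_caffeine = morgan_gen.GetFingerprint(caffeine)
morgan_theobromine = morgan_gen.GetFingerprint(theobromine)

# Use 1: Tanimoto similarity between two bit vectors.
self_similarity = DataStructs.TanimotoSimilarity(morgan_caffeine, morgan_caffeine)
similarity = DataStructs.TanimotoSimilarity(morgan_caffeine, morgan_theobromine)

# Use 2: convert the bit vector to a NumPy array for a downstream model.
features = np.zeros((morgan_caffeine.GetNumBits(),), dtype=np.uint8)
DataStructs.ConvertToNumpyArray(morgan_caffeine, features)
```

The two fingerprints encode different things: the RDKit fingerprint hashes paths through the molecule, while the Morgan fingerprint hashes circular neighborhoods around each atom, so they are not interchangeable inputs to the same trained model.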
Generating One-Hot Encoding from a SMILES String
If SMILES strings are treated as text in a natural language, probably the simplest way to represent them is one-hot encoding at the character level. Figure 10 shows the code for generating a character-level one-hot encoding of a SMILES string.
Note that one-hot encoding can also be used at the atom level or in the atom/bond attribute matrix.
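Character-level one-hot encoding can be sketched in a few lines. This is an illustrative version, not the code from Figure 10; the function name `one_hot_encode` and the choice to build the vocabulary from the sorted set of characters are assumptions.

```python
import numpy as np


def one_hot_encode(smiles, vocabulary=None):
    """Return (matrix, vocabulary) with one row per character of `smiles`."""
    if vocabulary is None:
        # Assumed convention: the vocabulary is the sorted set of characters seen.
        vocabulary = sorted(set(smiles))
    index = {ch: i for i, ch in enumerate(vocabulary)}
    matrix = np.zeros((len(smiles), len(vocabulary)), dtype=np.int8)
    for position, ch in enumerate(smiles):
        matrix[position, index[ch]] = 1
    return matrix, vocabulary


smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"  # caffeine
encoded, vocab = one_hot_encode(smiles)

# Decoding picks the character whose column holds the 1 in each row.
decoded = "".join(vocab[i] for i in encoded.argmax(axis=1))
```

For a whole dataset, the vocabulary would be built once from all strings and passed in, and the matrices would be padded to a common length. A purely character-level scheme also splits two-letter element symbols such as Cl and Br into two tokens.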
Generating Word Embedding from a SMILES String
In the context of language modeling, a more sophisticated way to generate a molecular representation is to apply word embedding to the substructures of a molecule. The code in Figure 11 uses mol2vec and word2vec to generate word embeddings for all the molecules in the HIV dataset. The dataset contains 41,127 molecules (Figure 12), and each molecule is encoded as a 300-dimensional vector (Figure 13). The code is taken from “Simple ML In Chemistry Research: RDkit & mol2vec”, which explains in detail a solution for predicting HIV activity.
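The idea of treating substructures as words can be sketched with RDKit alone. The example below is not the code from Figure 11 and does not call mol2vec. It only approximates the two steps involved: turning a molecule into a "sentence" of Morgan substructure identifiers, then summing one vector per word. The class `StandInEmbedding` is a hypothetical placeholder that returns random vectors where a trained Word2Vec model would be used, so the resulting numbers carry no chemical meaning.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def mol_to_sentence(mol, radius=1):
    """List Morgan substructure identifiers as words, ordered by atom index, then radius."""
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius)
    extra = rdFingerprintGenerator.AdditionalOutput()
    extra.AllocateBitInfoMap()
    generator.GetSparseCountFingerprint(mol, additionalOutput=extra)
    # The bit info map has the form {identifier: ((atom index, radius), ...)}.
    located = [
        (atom_idx, r, str(identifier))
        for identifier, hits in extra.GetBitInfoMap().items()
        for atom_idx, r in hits
    ]
    return [word for _, _, word in sorted(located)]


class StandInEmbedding:
    """Hypothetical stand-in for a trained Word2Vec model: repeatable random vectors."""

    def __init__(self, dim=300, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.table = {}

    def __getitem__(self, word):
        if word not in self.table:
            self.table[word] = self.rng.normal(size=self.dim)
        return self.table[word]


def sentence_to_vector(sentence, embedding):
    """Pool a sentence into one vector by summing its word vectors."""
    return np.sum([embedding[word] for word in sentence], axis=0)


mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")  # caffeine
sentence = mol_to_sentence(mol, radius=1)
embedding = StandInEmbedding(dim=300)
vector = sentence_to_vector(sentence, embedding)
```

In a real pipeline the lookup table would be a Word2Vec model trained on a large corpus of such sentences, which is what gives similar substructures similar vectors.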
Generating Molecular Representation in Graph
Manipulating molecules, atoms, and bonds in RDKit provides the foundation for generating the graph form of a molecular representation. Figure 5, Figure 6, and Figure 7 above have already shown the ingredients for caffeine: the node (atom) attribute matrix, the edge (bond) attribute matrix, and the bond list that plays the role of an adjacency matrix. Going one step further, converting an RDKit molecule into a NetworkX graph (NetworkX is an open-source library for network analysis) makes it possible to apply both traditional graph algorithms and modern graph models to the study of molecular structure and properties. Figure 14 shows the code for converting a molecule in RDKit into a graph in NetworkX. Figure 15 shows the molecular graphs drawn by RDKit and NetworkX.
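A conversion of this kind can be sketched as follows. This is an illustrative version, not the code from Figure 14; the function name `mol_to_nx` and the attribute names stored on nodes and edges are assumptions.

```python
import networkx as nx
from rdkit import Chem


def mol_to_nx(mol):
    """Convert an RDKit molecule into an undirected NetworkX graph."""
    graph = nx.Graph()
    for atom in mol.GetAtoms():
        # Node attributes: a small subset of the atom attribute matrix.
        graph.add_node(
            atom.GetIdx(),
            symbol=atom.GetSymbol(),
            atomic_num=atom.GetAtomicNum(),
            is_aromatic=atom.GetIsAromatic(),
        )
    for bond in mol.GetBonds():
        # Edge attributes: a small subset of the bond attribute matrix.
        graph.add_edge(
            bond.GetBeginAtomIdx(),
            bond.GetEndAtomIdx(),
            bond_type=str(bond.GetBondType()),
        )
    return graph


caffeine_graph = mol_to_nx(Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))

# Traditional graph algorithms now apply directly to the molecule.
ring_count = len(nx.cycle_basis(caffeine_graph))  # number of independent rings
degrees = dict(caffeine_graph.degree())           # heavy-atom neighbors per atom
```

Once the molecule is a NetworkX graph, questions such as how many rings it has or which atoms are most connected become one-line graph queries.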
One important research area in graph networks is graph embedding. Generally speaking, graph embedding covers three topics: node-level embedding (which encodes the nodes of a graph as vectors), edge-level embedding (which encodes the edges of a graph as vectors), and graph-level embedding (which encodes a whole graph as a vector). In this post, graph embedding means graph-level embedding: finding one vector per molecule that can be used as the input to downstream models. Figure 16 shows the code for converting molecules in RDKit to graphs in NetworkX and generating their graph embeddings with Graph2Vec from KarateClub. Graph2Vec is a graph embedding algorithm, and KarateClub is a package that provides unsupervised machine learning models for graph data. Figure 17 shows the Graph2Vec embeddings for the molecules in the HIV dataset. KarateClub also includes several other graph embedding algorithms.
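To give a feel for what happens inside a graph-level embedding, the sketch below computes Weisfeiler-Lehman subtree labels, the kind of "words" that Graph2Vec extracts from each graph before learning one vector per graph with Doc2Vec. It is not the code from Figure 16 and does not call KarateClub. The function names, the use of element symbols as starting labels, and the MD5 hashing are assumptions made for illustration.

```python
import hashlib
from collections import Counter

import networkx as nx
from rdkit import Chem


def smiles_to_nx(smiles):
    """Build a NetworkX graph whose nodes carry the element symbol as a label."""
    mol = Chem.MolFromSmiles(smiles)
    graph = nx.Graph()
    for atom in mol.GetAtoms():
        graph.add_node(atom.GetIdx(), label=atom.GetSymbol())
    for bond in mol.GetBonds():
        graph.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())
    return graph


def wl_words(graph, iterations=2):
    """Collect node labels from every Weisfeiler-Lehman relabeling round."""
    labels = {node: str(data["label"]) for node, data in graph.nodes(data=True)}
    words = list(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for node in graph.nodes():
            # New label: own label plus the sorted labels of the neighbors.
            neighbours = sorted(labels[m] for m in graph.neighbors(node))
            signature = labels[node] + "|" + ",".join(neighbours)
            new_labels[node] = hashlib.md5(signature.encode()).hexdigest()
        labels = new_labels
        words.extend(labels.values())
    return words


caffeine_a = Counter(wl_words(smiles_to_nx("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")))
caffeine_b = Counter(wl_words(smiles_to_nx("Cn1c(=O)c2c(ncn2C)n(C)c1=O")))
theobromine = Counter(wl_words(smiles_to_nx("CN1C=NC2=C1C(=O)NC(=O)N2C")))
```

Two different SMILES strings for caffeine yield the same bag of words, because the labels depend only on the graph structure and not on atom ordering. With KarateClub installed, the whole pipeline, including the Doc2Vec step, reduces to fitting a `Graph2Vec` model on a list of graphs and calling `get_embedding()`.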
Conclusions
This post has described several molecular representations, including string-based and graph-based formats and variants such as word embedding and graph embedding. Combined with different machine learning algorithms, including deep learning models and graph neural networks, these representations can serve as baselines for approaching molecular machine learning problems.
Thanks for reading. If you have any comments, please feel free to drop me a note.