## Molecular Graphs in Chemoinformatics

Chemoinformatics is a research field aimed at using computational methods derived from information theory to predict and analyze properties of molecules. Quite often, it is true that a molecule’s properties highly depend on its structure. Thus, a key principle known as the similarity principle is that two molecules that have very similar structure ought to have similar properties. This idea is often utilized in the design of statistical models that map the structures of a set of given molecules to their corresponding properties. The model then predicts properties of new candidate molecules with unknown properties by making statistical comparisons between the structures of these candidate molecules and molecules whose properties are already known.

In these statistical models, molecules are generally represented using graph structures. The nodes of the graphs correspond to atoms, whereas the edges correspond to bonds between atoms. One similarity metric used to compare the graph structures of molecules is known as the edit distance. The key idea behind this approach is that as the similarity between two molecules increases, the number of changes required to convert the graph of one molecule into that of the other should decrease.  A basic algorithm for this metric aims to determine the fewest number of operations required to edit the graph of one molecule so that it matches that of another molecule being used as a comparison. There are a few edit operations considered in this approach:

• Removing a node
• Inserting a node
• Removing and edge
• Adding an edge

A score can be assigned to each edit operation to determine a quantitative score of the similarity between two molecules. For instance, you might assign a score of 10 points for adding or removing a node and a score of 3 points for adding or removing an edge. After the number of operations has been determined, the scoring metric is simply:

Similarity = 10 * (# of times added or removed node) + 3 * (# of times added or removed edge)

In this scheme, a lower score corresponds to higher similarity.

Sources: