Determine if a substructure is present in a chemical structure using SMILES string (substructure fingerprint)

Question 1

A SMILES (Simplified molecular-input line-entry system) string is a string that represents a chemical structure using ASCII characters. For example, water (\$H_2O\$) can be written in SMILES as H-O-H.

However, for simplicity, the single bonds (-) and hydrogen atoms (H) are frequently omitted. Thus, a molecule with only single bonds like n-pentane (\$CH_3CH_2CH_2CH_2CH_3\$) can be represented as simply CCCCC, and ethanol (\$CH_3CH_2OH\$) as CCO or OCC (which atom you start from does not matter).
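To illustrate how the omitted hydrogens can be recovered, here is a minimal Python sketch. It assumes the usual valences (C 4, N 3, O 2, F 1) and only handles unbranched chains with single bonds, such as CCCCC or CCO; the names `VALENCE` and `implicit_hydrogens` are made up for this illustration.

```python
# Usual valences of the atoms allowed in this challenge (an assumption of
# this sketch, not something the SMILES string itself states).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}


def implicit_hydrogens(chain):
    """Return the number of implied H atoms for each atom of an unbranched,
    single-bonded SMILES chain such as 'CCO'."""
    counts = []
    for i, atom in enumerate(chain):
        # An atom in a plain chain has a neighbour on each side that exists.
        neighbours = (i > 0) + (i < len(chain) - 1)
        counts.append(VALENCE[atom] - neighbours)
    return counts


print(implicit_hydrogens("CCCCC"))  # n-pentane
print(implicit_hydrogens("CCO"))    # ethanol
```

For n-pentane the counts add up to 12 hydrogens, and for ethanol to 6, which agrees with the formulas given above.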


In SMILES, double bonds are represented with = and triple bonds with #. So ethene can be represented as C=C, and hydrogen cyanide (HCN) can be represented as C#N or N#C.

SMILES uses parentheses to represent branching. For example, bromochlorodifluoromethane can be represented as FC(Br)(Cl)F, BrC(F)(F)Cl, C(F)(Cl)(F)Br, etc.

For rings, the atoms that close a ring are numbered. Take cyclohexane: first strip the H atoms and start from any C. Going round the ring, we get CCCCCC. Since the first and last C are bonded, we write C1CCCCC1.
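The notation described so far (atoms, the bond symbols -, = and #, parentheses for branches, and digits for ring closures) can be split into tokens with a single regular expression. The following Python sketch is one possible way to do it; `SMILES_TOKEN` and `tokenize` are hypothetical names, and Br and Cl are included only because they appear in the branching example above.

```python
import re

# Two-letter atoms must be tried before single-letter ones,
# otherwise "Cl" would be read as C followed by an unknown "l".
SMILES_TOKEN = re.compile(r"Br|Cl|[A-Z]|[-=#()]|\d")


def tokenize(smiles):
    """Split a SMILES string (restricted to the subset described above)
    into atom, bond, parenthesis and ring-digit tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips characters it cannot match, so check coverage.
    if "".join(tokens) != smiles:
        raise ValueError("unsupported character in " + repr(smiles))
    return tokens


print(tokenize("FC(Br)(Cl)F"))
print(tokenize("C1CCCCC1"))
```

Stereochemistry markers, aromatic lowercase atoms and bracket atoms are deliberately not handled, so such input raises a `ValueError`.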

Use this tool: https://pubchem.ncbi.nlm.nih.gov/edit3/index.html to try drawing your own structures and convert them to SMILES, or vice versa.

Task

Your program shall receive two SMILES strings. The first one is a molecule, the second is a substructure (a portion of a molecule). The program should return true if the substructure is found in the molecule and false if not. For simplicity, only the SMILES features explained above will be used (no need to consider stereochemistry like cis-trans, or aromaticity), and the only atoms will be:

O
C
N
F

Also, the substructure does not contain H.

Examples

CCCC C
true
CCCC CC
true
CCCC F
false
C1CCCCC1 CC
true
C1CCCCC1 C=C
false
COC(C1)CCCC1C#N C(C)(C)C // substructure is a C connected to 3 other Cs
true
COC(C1)CCCC1C#N COC1CC(CCC1)C#N // SMILES strings representing the same molecule
true
OC(CC1)CCC1CC(N)C(O)=O CCCCO
true
OC(CC1)CCC1CC(N)C(O)=O NCCO
true
OC(CC1)CCC1CC(N)C(O)=O COC
false
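For readers who want to check examples like these by hand, here is an ungolfed Python sketch of one possible approach: parse each string into a graph, then search for an injective mapping of pattern atoms onto molecule atoms by backtracking. It is not a submitted answer, and the names `parse` and `contains` are made up for this illustration. It assumes that elements and bond orders must match exactly and that extra bonds in the molecule are allowed, which is how the expected results above read.

```python
import re

TOKEN = re.compile(r"Br|Cl|[A-Z]|[-=#()]|\d")
ORDER = {"-": 1, "=": 2, "#": 3}


def parse(smiles):
    """Return (atoms, bonds): a list of element symbols and a dict that
    maps frozenset({i, j}) to the bond order between atoms i and j."""
    atoms, bonds = [], {}
    prev, order = None, 1      # index of the last atom, pending bond order
    stack, rings = [], {}      # open branch points, open ring closures
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            stack.append(prev)             # remember where the branch starts
        elif tok == ")":
            prev = stack.pop()             # return to the branch point
        elif tok in ORDER:
            order = ORDER[tok]
        elif tok.isdigit():
            if tok in rings:               # second occurrence closes the ring
                other, opened = rings.pop(tok)
                bonds[frozenset((prev, other))] = max(order, opened)
            else:                          # first occurrence opens it
                rings[tok] = (prev, order)
            order = 1
        else:                              # an atom symbol
            atoms.append(tok)
            if prev is not None:
                bonds[frozenset((prev, len(atoms) - 1))] = order
            prev, order = len(atoms) - 1, 1
    return atoms, bonds


def contains(molecule, pattern):
    """True if the pattern graph can be embedded in the molecule graph."""
    m_atoms, m_bonds = parse(molecule)
    p_atoms, p_bonds = parse(pattern)

    def extend(mapping):
        k = len(mapping)                   # next pattern atom to place
        if k == len(p_atoms):
            return True
        for cand in range(len(m_atoms)):
            if cand in mapping or m_atoms[cand] != p_atoms[k]:
                continue
            # Every pattern bond from atom k to an already placed atom must
            # exist in the molecule with the same bond order.
            if all(m_bonds.get(frozenset((cand, mapping[j])))
                   == p_bonds[frozenset((k, j))]
                   for j in range(k) if frozenset((k, j)) in p_bonds):
                if extend(mapping + [cand]):
                    return True
        return False

    return extend([])


for mol, sub in [("C1CCCCC1", "CC"), ("C1CCCCC1", "C=C"),
                 ("OC(CC1)CCC1CC(N)C(O)=O", "NCCO")]:
    print(mol, sub, contains(mol, sub))
```

The search is exponential in the worst case, which is acceptable for molecules of this size. Because hydrogens are implicit, a pattern atom with fewer neighbours can match a molecule atom with more, which is why NCCO is found but COC is not.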

Shortest code wins. Refrain from using external libraries.

Question 2

Could you add a few more complex test cases like the last one? The first five test cases can be solved using a single contains builtin.

Question 3

As for that last test case, COC(C1)CCCC1C#N can't be pasted into the tool you've linked (it automatically changes to COC1CC(CCC1)C#N). Also, would COC(C1)CCCCC1#N with CCCCCC be truthy, since it does contain a substructure of six consecutive C atoms (the entire ring (C1)CCCCC1, plus the additional branch to a C in COC)?

Question 4

@KevinCruijssen I will add more examples. The tool converts the SMILES string to a canonical SMILES string (it uses a standard algorithm that ensures a unique output). COC(C1)CCCCC1#N with CCCCCC will be truthy. COC(C1)CCCCC1#N with COC1CC(CCC1)C#N will be truthy.

Question 5

Mathematica, 31 bytes

MoleculeContainsQ@@Molecule/@#&

Takes input as a list of 2 strings (the source and the pattern molecules).

As you could guess, this checks if the first molecule (parsed via Molecule) contains the second one by using the MoleculeContainsQ function.

This doesn't seem to work in the online interpreter on TIO; I'm not sure what I am doing wrong. It works on my local machine, though. Of course, this is not using an external library: it's completely built-in functionality!

Question 6

I'm looking for answers that do some parsing, but as your answer uses built-in functionality, I guess you've found a loophole! +1

the default. · Accepted Answer · 2020-06-12 06:19:24Z
