SMILES™, SMARTS™

Codenames: smiles, smarts=smiles:s

Atoms:
- Atoms are represented by their atomic symbols.
- Isotopic specifications are indicated by a mass number preceding the atomic symbol.
- Any atom except hydrogen is represented by '*'.
- '[Z]' symbols are imported as R-group attachment points. The attachment orders are ascending with atom indexes. Exporting R-group attachment points is possible in ChemAxon Extended SMILES (CXSMILES).
- Since radicals are not stored in SMILES format, they are calculated during SMILES import for atoms that tend to have radicals. This happens when no implicit hydrogens can be added because of the SMILES definition, so the valence of the atom has to be corrected. E.g. for the SMILES string [Cl] no hydrogens are allowed (because of the brackets), thus chlorine is imported with a monovalent radical. No radicals are added to metals; instead, a valence property is set if their valence differs from the usual one. E.g. [AsH2] is imported with valence property 2 and two implicit hydrogen atoms, since its usual valence would be 3.
  No radicals are added to the following atoms:
  - Helium(He), Lithium(Li), Neon(Ne) and Sodium(Na)
  - All atoms above Chlorine(Cl) except Bromine(Br) and Iodine(I).
  Radicals are stored in ChemAxon Extended SMILES (CXSMILES) format. For cases when the radical would be lost in SMILES, please use CXSMILES.
Bonds:
- Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively.
- Single and aromatic bonds may be omitted.
- Branches are specified by enclosing them in parentheses. The implicit connection to a parenthesized expression (a branch) is to the left.
- Cyclic structures are represented by breaking one single (or aromatic) bond in each ring; the missing bond is denoted by connection placeholder numbers.
Disconnected structures:
- Disconnected compounds are written as individual structures separated by a period.
Isomeric specification:
- Configuration around double bonds is specified by "directional bonds": / and \.
- Configuration around tetrahedral centers may be indicated by a simplified chiral specification (parity) @ or @@.
Unique SMILES.
The name "unique" can sometimes be misleading when dealing with compounds with stereo centres.
Daylight's SMILES specification (3.1.SMILES Specification Rules) defines generic, unique, isomeric and absolute SMILES as:
1. generic SMILES: representing a molecule (there can be many different representations)
2. unique SMILES: generated from generic SMILES by a certain algorithm [1]
3. isomeric SMILES: string with information about isotopism, configuration around double bonds and chirality
4. absolute SMILES: unique SMILES with isomeric information - in Marvin during graph canonicalization the isomeric information is also considered as an atom invariant
The name canonical SMILES is used for absolute or unique SMILES depending on whether the string contains isomeric information or not (both strings are "canonicalized", i.e. the atom/bond order is unambiguous).
Marvin always generates canonical SMILES with isomerism info if it can be determined from the input file. The molecule graph is always canonicalized using the algorithm in article [1], but it is not guaranteed to give absolute SMILES for all isomeric structures. The unique SMILES generation (option u) currently uses an approximation to make the SMILES string as absolute (unique for isomeric structures) as possible. In this case any aromatic compound is aromatized before SMILES export. For correct exact (perfect) structure searching, the MolSearch and JChemSearch classes of JChem Base or the jc_equals SQL operator of the JChem Cartridge are suggested.
The initial ranks of atoms for the canonicalization are calculated using the following atom invariants:
1. number of connections
2. sum of non-H bond orders (single=1, double=2, triple=3, aromatic=1.5, any=0)
3. atomic number (list=110, any atom=112)
4. sign of charge: 0 for nonnegative, 1 for negative charge
5. formal charge
6. number of attached hydrogens
7. isotope mass number
See ref. [1] for details.
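The initial ranking step can be illustrated with a minimal Python sketch. It is not Marvin's implementation: the `Atom` record, the function names, and the choice to compare the invariants lexicographically in the listed order are assumptions made for illustration, and the "number of connections" is taken here to mean heavy-atom neighbours.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """Hypothetical atom record holding the data needed for the invariants."""
    connections: int       # 1. number of connections (assumed: heavy-atom neighbours)
    bond_order_sum: float  # 2. sum of non-H bond orders (aromatic = 1.5, any = 0)
    atomic_number: int     # 3. atomic number (list = 110, any atom = 112)
    formal_charge: int     # source for invariants 4 and 5
    h_count: int           # 6. number of attached hydrogens
    isotope: int           # 7. isotope mass number (0 = unspecified, an assumption)


def invariant(atom):
    """Build the invariant tuple in the order listed above."""
    sign = 1 if atom.formal_charge < 0 else 0  # 4. sign of charge
    return (atom.connections, atom.bond_order_sum, atom.atomic_number,
            sign, atom.formal_charge, atom.h_count, atom.isotope)


def initial_ranks(atoms):
    """Assign rank 1..k to atoms; atoms with equal invariants share a rank."""
    keys = [invariant(a) for a in atoms]
    rank_of = {key: rank for rank, key in enumerate(sorted(set(keys)), start=1)}
    return [rank_of[key] for key in keys]


# Ethanol, CCO: methyl carbon, methylene carbon, hydroxyl oxygen.
ethanol = [Atom(1, 1.0, 6, 0, 3, 0), Atom(2, 2.0, 6, 0, 2, 0), Atom(1, 1.0, 8, 0, 1, 0)]
print(initial_ranks(ethanol))  # [1, 3, 2]
```

These are only starting ranks; the algorithm in ref. [1] then refines them iteratively and breaks ties, which this sketch does not attempt.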
With option u it is possible to include chirality in the graph invariants. This option must be used with care, since for molecules with numerous chirality centres the canonicalization can be very CPU demanding [2].
The SMILES canonicalization algorithm is not generic; it depends on the software package, so comparing SMILES strings is most useful within a single software package.
Stereochemistry
- Parity is a general type of chirality specification based on the local chirality.
- Cis-trans isomerism
  The default stereoisomers in small rings (size < 8) are cis, which are not written explicitly.
  See import option c to override this feature.
Reactions
- syntax: reactant(s)>agent(s)>product(s), where
       reactants = reactant1 . reactant2.....
       agents = agent1.agent2 . ....
       products = product1.product2 . ...
  
  Agents are molecular structures that do not take part in the chemical reaction, but are added to the reaction equation for informative purpose only.
  All of the above sections are optional. For example:
  - a reaction with no agents: reactant(s)>>product(s)
  - a reaction with no agents and no products (mainly used in reaction search): reactant(s)>>
  - a reaction with no agents and no reactants (mainly used in reaction search): >>product(s)
- atom maps
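As an illustration of the reaction syntax above, the following Python sketch splits a reaction SMILES string into its three sections and then into components. The function name is hypothetical, and the sketch deliberately ignores atom maps and SMARTS component-level grouping; it only relies on '>' and '.' acting as separators.

```python
def split_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into lists of component SMILES.

    Empty sections (e.g. 'CCO>>') yield empty lists.
    """
    sections = reaction.split(">")
    if len(sections) != 3:
        raise ValueError("expected exactly two '>' separators: %r" % reaction)

    def components(section):
        # Components are separated by periods; surrounding spaces are ignored.
        return [part.strip() for part in section.split(".") if part.strip()]

    reactants, agents, products = (components(s) for s in sections)
    return {"reactants": reactants, "agents": agents, "products": products}


print(split_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
print(split_reaction_smiles("CCO>>"))
```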
Unsupported SMILES features:
- Branch specified if there is no atom to the left.
- General chiral specification: Allene like, Square-planar, Trigonal-bipyramidal, Octahedral.

SMARTS

Marvin imports and exports SMARTS strings with the following features:

SMARTS features interpreted during import/export as full-functional (editable) query features:
- atom lists like [C,N,P] and 'NOT' lists like [!#6!#7!#15]
- any bond: ~
- ring bond: C@C
- hydrogen count: H0, H1, H2, H3, H4
- valence: v0, v1, ..., v8
- connectivity: X0, X1, X2, X3, X4
- in ring: R
  ring count: R0, R1, ..., R6
- size of smallest ring: r3, r4, r5, r..
- number of ring bonds: x2, x3, x4
  at least one ring bond: x
- aromatic and aliphatic atoms: a, A
- aliphatic, aromatic atom query properties
- single_or_double, single_or_aromatic, double_or_aromatic bonds (used in Marvin)
- directional or unspecified bonds: C\C=C/?C
- chiral or unspecified atoms: C[C@?H](Cl)Br
- component level grouping: (C).(O) (C.O)
A subset of SMARTS features are imported as SMARTS atoms/bonds. These atoms/bonds have limited editing support in the Marvin GUI, but can be exported and evaluated (e.g. JChem structure searching handles them correctly):
- implicit hydrogen count: h2, h3, h..
- degree: D2, D3, D..
- more difficult logical expressions in atom or bond expressions: &,;!
  (Simpler cases, like atom lists, not lists, "and"-expressions are handled by the above features.)
- recursive SMARTS: [$(CCC)]
A subset of features are exported as SMARTS atoms/bonds.
- MDL Substitution Count query atom property s<n> is converted to degree Dn. In case of s* the non-H neighbours are counted and exported as degree D<number>.
- MDL Unsaturated Atom query atom property u is converted to recursive SMARTS: $([*,#1]=,#,:[*,#1]) is appended after the SMARTS atom.

In case of SMARTS:

Implicit H atoms are not written inside brackets. Eg: [C:1]
Query H atoms are written inside brackets without using the low precedence "and" operator ';'. Eg: [CH3]

Implicit bond types: The default bond types for import and export strongly depend on the atoms connected by the bond.

Aromatic bonds are not written explicitly if neither atom is aliphatic and they are in a ring.
Eg: c1ccccc1 But: c:c, c:[c;a], [#6]:c
Single bonds are not written explicitly if at least one atom is not aromatic.
Eg: CC, C[c;a], Cc, C[C;A], [#6]C But: [#6]-[c;a], c1ccc(cc1)-c2ccccc2
Single_or_aromatic bonds are not written explicitly if both atoms of the bond are aromatic and any of them is not in the same ring.
Eg: [#6]cc, [#6][c;a], [#6][#6]

Smiles/Smarts with additional information

Information stored after the SMILES string, separated from it by a space or tab character, is treated as a molecule field. (According to the SMILES definition it can be ignored or used as a comment.) More molecule fields can be stored after the first one, but they should be separated by tab characters (to allow spaces in the data fields). The newline and tab characters are escaped during export. By default the first additional item is the molecule name. However, the molecule name is never considered a field; it is a special property of the molecule. After that, further information can be stored as fields in field_1, field_2, etc. The default behavior can be overridden by import option f, e.g. import option "fid,flogP" imports the first field as "id" and the next one as "logP".

Examples:
Smiles file containing the following line (note the separator characters are tabs):
CC ethane   1   1.35
By default it is imported as an ethane molecule with molecule name: ethane, with data field_1: 1 and data field_2: 1.35.
With import option "fname,fid,flogP" it is imported as an ethane molecule with molecule name: ethane, with id: 1 and logP: 1.35.
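The column mapping described above can be mimicked with a short Python sketch. This is not the Marvin importer; `parse_smiles_line` and its `field_names` parameter are hypothetical names, and the sketch does not unescape tab or newline characters.

```python
import re


def parse_smiles_line(line, field_names=None):
    """Split one line of a SMILES file into structure, name and data fields.

    The SMILES string ends at the first space or tab; the remaining columns
    are tab-separated. Without field_names, the first column is the molecule
    name and the rest become field_1, field_2, ...
    """
    parts = re.split(r"[ \t]", line.rstrip("\n"), maxsplit=1)
    smiles = parts[0]
    columns = parts[1].split("\t") if len(parts) > 1 else []
    if field_names is None:
        field_names = ["name"] + ["field_%d" % i for i in range(1, len(columns))]
    record = {"smiles": smiles, "name": None, "fields": {}}
    for key, value in zip(field_names, columns):
        if key == "name":
            record["name"] = value  # the name is a property, not a data field
        else:
            record["fields"][key] = value
    return record


line = "CC\tethane\t1\t1.35"
print(parse_smiles_line(line))
print(parse_smiles_line(line, ["name", "id", "logP"]))
```

The second call corresponds to the import option "fname,fid,flogP" with the leading f of each entry already removed.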

Smiles/Smarts files with header

As the SMILES format does not support saving additional information stored in the molecule, Chemaxon adds a header line to the smiles file if the export of this additional information is requested by the T option.
The header line starts with the '#' character followed by the file format string "SMILES" or "SMARTS" and the field names separated by tab characters. The lines following the header contain the smiles/smarts string and the field data separated by tab characters.

Examples:
Smiles file (1.smi) containing the following line (note the separator characters are tabs):
CC ethane   1   1.35
Exported to smiles format (molconvert smiles 1.smi):
CC
Exported to smiles format with export option T* (molconvert smiles:T\* 1.smi) results in:
#SMILES name    field_1 field_2
CC      ethane  1       1.35
With import option "fname,fid,flogP" and export option T* (molconvert smiles:T\* "1.smi{fname,fid,flogP}") results in:
#SMILES name    id      logP
CC      ethane  1       1.35

Import options

f
{fFIELD1,fFIELD2,...}
Import data fields from a multi-column file. The fields should be separated by a tab character. The first column contains the SMILES/SMARTS strings, the second may contain the molecule name or the data field called FIELD1, and the following columns contain the other fields.
Example:
molconvert sdf "foo.smi{fname,fID}" 
reads the smiles string, the name and the ID from the foo.smi file and converts them to sdf format.
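To make the brace syntax concrete, the following Python sketch splits an input specification such as "foo.smi{fname,fID}" into the file name and the list of field names. The helper name and the regular expression are assumptions for illustration; how other import options are combined inside the braces is not covered here.

```python
import re

# File name, optionally followed by one {...} block of comma-separated options.
_SPEC = re.compile(r"^(?P<path>[^{}]+)(?:\{(?P<options>[^{}]*)\})?$")


def parse_input_spec(spec):
    """Return (path, field_names) for a spec like 'foo.smi{fname,fID}'."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError("unrecognized input specification: %r" % spec)
    options = match.group("options") or ""
    # Each f option is the letter 'f' followed by the field name.
    fields = [opt[1:] for opt in options.split(",")
              if opt.startswith("f") and len(opt) > 1]
    return match.group("path"), fields


print(parse_input_spec("foo.smi{fname,fID}"))
```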
d
Import with Daylight compatibility for query H.
In Daylight SMARTS, H is only considered an H atom when the atom expression has the syntax [<mass>H<charge><map>] (mass, charge and map are optional). Otherwise it is considered a query H count.
Examples: [!H!#6] without the d option is imported as an atom which is not H and not C. However, with the d option it is imported as an atom which does not have exactly one H attached, and which is not C.
Use "H1" or "#1" or "#1A" instead of "H" to avoid the ambiguous meaning of H. "H1" always means query H count, "#1" always means H atom, and "#1A" means aliphatic H atom.

c
Ignore fixing of double bond stereo information in small rings, and also ignore fixing of aromatic bonds to aliphatic if necessary.
Double bonds in small rings (ring size < 8) are imported automatically with CIS stereo information. If the c option is set, the double bond stereo information is not changed to CIS during the import.
By default, the bond between two aromatic atoms is aromatic. But this is not true e.g. in the case of biphenyl, where the bond connecting the two aromatic rings is single. If biphenyl is represented with the SMILES string "c1ccc(cc1)c1ccccc1", then it is necessary to set the bond between the two rings to single. If the molecule is exported by Chemaxon tools, the single bond between two aromatic atoms is always written explicitly to avoid any confusion, so fixing aromatic bonds to aliphatic can be avoided.

Z
Import compressed smiles. The compressed format must be specified explicitly, as it is not recognized by the importer automatically.

After importing SMILES, calling the MoleculeGraph.clearCachedInfo method is recommended in order to remove cached information that increases the molecule size.

Export options

Export options can be specified in the format string. The format descriptor and the options are separated by a colon.

... Basic options for aromatization and H atom adding/removal.

0 Do not include chirality (parity) and double bond stereo (cis/trans) information.
Examples: "smiles:0" (not stereo), "smiles:a0" (aromatic, not stereo)

q Obsolete option.
Atom equivalences are checked by default using graph invariants at double bonds.
Example: molconvert smiles -s "C/C=C(/C)C" results in CC=C(C)C

ri Smiles export rigorousness (i is one of the following values):

Export the most information from the molecule to SMILES or SMARTS format. Don't check anything.
5 Atoms, bonds and the molecule are checked for SMILES, SMARTS compatibility (default).
7 In addition to the checks of value 5, double bonds in chains of alternating single and double bonds are checked for correct export.
Example: Let the m.mrv file contain the molecule CC=CC=CC=CC where the two outer double bonds are in TRANS configuration but the middle one has no CIS/TRANS information (crossed double bond, or double bond with wiggly bond).
molconvert smiles:r7 m.mrv throws an exception: "Nonstereo double bond between active CIS TRANS stereo bonds. Not possible to export it correctly to SMILES"
molconvert smiles m.mrv results in C\C=C\C=C\C=C\C (which is incorrect in the sense that the middle bond is given TRANS configuration).

s Write query smarts. (See query Smarts for details.)

u Write unique smiles (considering chirality info also [2]). Note: Use this option if you want unique smiles export.

h Convert explicit H atoms to query hydrogen count.

Tf1:f2:... Export f1, f2 ... SDF fields. The fields are separated by tab character.
If '-' is given before the T option like '-Tf1:f2:...' then no header line is written.
'*' character is used to export all fields (and name also) in the molecules.
'name' field is used to export molecule name (if no 'name' field in the molecule exists).
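When format strings are assembled programmatically, the T option syntax above can be generated with a small helper. The Python sketch below is only an illustration: `export_format` is a hypothetical function, and combining T with other export options is left out.

```python
def export_format(base, fields, header=True):
    """Build a format string such as 'smiles:Tid:logP' or 'smiles:-Tid:logP'.

    fields is the list of field names to export ('*' selects all fields);
    header=False prefixes the T option with '-' to suppress the header line.
    """
    if not fields:
        return base
    prefix = "T" if header else "-T"
    # Format descriptor and options are separated by a colon, as are the fields.
    return "%s:%s%s" % (base, prefix, ":".join(fields))


print(export_format("smiles", ["*"]))                          # smiles:T*
print(export_format("smiles", ["id", "logP"], header=False))   # smiles:-Tid:logP
```

On a command line the '*' usually has to be protected from the shell, as in the smiles:T\* examples above.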

t Export terminal atom with single_or_aromatic bond.
Examples: instead of [#6]-c1ccccc1 export the molecule to [#6]c1ccccc1
instead of [#6]-[#6] export the molecule to [#6][#6]

n Export molecule name (the first line of an MDL molfile).

Z Use compressed format, and compress the SMILES string. Note that the compressed format is not recognized by the import, so it should be specified explicitly.

BOM Write the UTF-8 byte order mark (BOM), if the given or the system's encoding is UTF-8.

Reference

[1]	SMILES 2. Algorithm for Generation of Unique SMILES Notation; D. Weininger, A. Weininger, J. L. Weininger; J. Chem. Inf. Comput. Sci. 1989, 29, 97-101
[2]	A New Effective Algorithm for the Unambiguous Identification of the Stereochemical Characteristics of Compounds During Their Registration in Databases; T. Cieplak and J.L. Wisniewski; Molecules 2001, 6, 915-926

™: SMILES, SMARTS, and SMIRKS are trademarks of Daylight Chemical Information Systems.
