Codename: cxsmiles,cxsmarts
SMILES_String |<feature1>,<feature2>,...|
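As a rough illustration of the layout above, the following Python sketch separates the SMILES part from the extended feature block. The function name split_cxsmiles is hypothetical, and the sketch assumes that the extended part starts with a space followed by "|" and ends with a closing "|".

```python
def split_cxsmiles(line):
    """Split 'SMILES |features|' into (smiles, feature_block).

    Assumption: the extended part is introduced by ' |' and closed by '|'.
    """
    line = line.strip()
    start = line.find(" |")
    if start == -1 or not line.endswith("|"):
        return line, ""  # plain SMILES without an extended part
    # Drop the leading ' |' and the trailing '|'
    return line[:start], line[start + 2:-1]


smiles, features = split_cxsmiles("CCCC |Sg:gen:0,1,2:|")
print(smiles)    # CCCC
print(features)  # Sg:gen:0,1,2:
```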
In extended smiles export the following additional features are exported:
The relative stereoconfiguration is stored as "r".
If a reaction contains components with both absolute and relative stereo, the indexes of the fragments with relative configuration are written.
The absolute stereoconfiguration is the default, which is not marked.
(The absolute stereoconfiguration is also known as the "Chiral flag" in MDL molfiles.)
Example: "r:2,4,5"
The following stereochemical group types are stored:
Atom labels / aliases are written between "$" characters
each label is separated by ";" characters.
Atom values are written after "$_AV:" separated by
semicolon characters and closed with "$" tag.
Special characters are escaped.
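A minimal Python sketch of reading the label field, assuming one (possibly empty) label per atom in atom order and ignoring escaped characters, whose escaping scheme is not described here. The function name parse_atom_labels is hypothetical.

```python
def parse_atom_labels(field):
    """Return the list of atom labels from a '$...$' field.

    Assumptions: one (possibly empty) label per atom, in atom order;
    escaped special characters are not handled in this sketch.
    """
    if len(field) < 2 or not (field.startswith("$") and field.endswith("$")):
        raise ValueError("label field must be enclosed in '$' characters")
    if field.startswith("$_AV:"):
        raise ValueError("this is an atom value field, not a label field")
    return field[1:-1].split(";")


labels = parse_atom_labels("$;;;;pseudo_p$")
# Keep only the atoms that actually carry a label
print({i: label for i, label in enumerate(labels) if label})  # {4: 'pseudo_p'}
```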
Atom indexes relating to wiggly bonds are written after "w:"
followed by a dot character and the wiggly bond index.
The wiggly bonds are separated by commas.
If atomic coordinates are also exported, then UP bonds are written
after "wU:" and DOWN bonds are written after "wD:", in the same way as
wiggly bonds.
Bond indexes of the double bonds in SSSR are written.
The bond stereo information is generated as follows:
the double bond has the representation a1-a2=a3-a4, where
Atom indexes with local ODD parity are written after "@:", while atom indexes with local EVEN parity are written after "@@:" characters separated by commas.
For each ligand connected to non-bridgehead atoms of bicyclo-alkanes, if they are in a syn/anti or endo/exo position (ligand is not in the plane of the bridge to which it is connected), their relative position in the ring system is stored by their position in relation to the bridges to which they are not connected. Bridges are identified by the indexes of the contained atoms: higher bridge is the one with the highest atom index, the other is the lower bridge. The ligand's position can be:
Examples:
[H][C@]12CCC[C@]([H])(CC(C)C1)C2(S)Cl |r,TLB:13:11:2.4.3:7.10.8,THB:12:11:2.4.3:7.10.8,9:8:11:2.4.3|
[H][C@]12COC[C@]([H])(C[C@@H](CCCC)C1)[C@@]2(Cl)C1CCCC1 |r,TLB:15:14:2.4.3:7.13.8,THB:16:14:2.4.3:7.13.8,9:8:14:2.4.3|
Atom indexes with
The indexes of the atoms having bond connected lone electron pairs are written after "LP:".
The indexes of the atoms followed by a colon character and the number of
explicit lone electron pairs are written after
"lp:".
(See live example.)
Example: "LP:1,lp:0:1,2:2"
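The example above can be read with a small Python sketch. It assumes that a comma-separated token starting with letters and a colon opens a new field, while any other token continues the current field. This holds for simple index lists such as "LP:" and "lp:" but not for fields containing "$" labels or parenthesized coordinates. Both function names are hypothetical.

```python
import re

# A token such as 'lp:0:1' opens a new field named 'lp' (assumption, see above)
FIELD_START = re.compile(r"^([A-Za-z]+):(.*)$")


def group_fields(block):
    """Group comma-separated tokens under the field key that precedes them."""
    fields = {}
    current = None
    for token in block.split(","):
        match = FIELD_START.match(token)
        if match:
            current = match.group(1)
            fields[current] = [match.group(2)]
        elif current is not None:
            fields[current].append(token)
        else:
            raise ValueError("token without a field key: %r" % token)
    return fields


def parse_lone_pairs(block):
    """Return (atoms with bond-connected pairs, {atom: explicit pair count})."""
    fields = group_fields(block)
    bond_connected = [int(i) for i in fields.get("LP", [])]
    explicit = {}
    for entry in fields.get("lp", []):
        atom, count = entry.split(":")
        explicit[int(atom)] = int(count)
    return bond_connected, explicit


print(parse_lone_pairs("LP:1,lp:0:1,2:2"))  # ([1], {0: 1, 2: 2})
```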
The atom index of the first atom in the coordinate bond is written after
"C:" followed by a dot character and the coordinate bond index.
The coordinate bonds are separated by commas.
In the smiles part of cxsmiles the atom-to-atom coordinate bonds are
represented by single bonds, which are corrected according to the
C information at the extended part.
Hydrogen bonds are exported in the same format after "H:".
The multicenter atom indexes are written after "m:", each followed by a colon character and the indexes of the atoms which form the given SGroup, separated by ".". The SGroups are separated by commas.
Example: "m:0:7.6.5.4.3,2:12.11.10.9.8,C:0.0,2.1"
The link node atom indexes are written after "LN:", each followed by a colon character, then the minimum repetitions, the maximum repetitions, and the first and second outer atom indexes of the node, separated by ".". If the link node has only two connections, then the first and second outer atom indexes are obvious, so they are omitted. The link nodes are separated by commas.
Example: "LN:1:1.5.3.0,6:1.2.7.5,9:1.10.10.8"
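The link node example can be decoded along these lines. This is a sketch only: the function name parse_link_nodes and the dictionary keys are made up for illustration, and the "LN:" field is assumed to be already isolated from the rest of the extended part.

```python
def parse_link_nodes(field):
    """Decode 'LN:atom:min.max[.outer1.outer2],...' into a list of dicts."""
    if not field.startswith("LN:"):
        raise ValueError("not a link node field")
    nodes = []
    for entry in field[3:].split(","):
        atom, rest = entry.split(":")
        numbers = [int(n) for n in rest.split(".")]
        nodes.append({
            "atom": int(atom),
            "min": numbers[0],
            "max": numbers[1],
            # Outer atoms are omitted when the node has only two connections
            "outer": tuple(numbers[2:4]) if len(numbers) >= 4 else None,
        })
    return nodes


for node in parse_link_nodes("LN:1:1.5.3.0,6:1.2.7.5,9:1.10.10.8"):
    print(node)
```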
The atomic coordinates are written between parentheses.
Each atomic coordinate triplet (x, y, z) is separated by semicolon, and the
x y z coordinates are separated by commas. Zero coordinates are omitted.
Note: The CIS/TRANS information is redundant in this case. It is specified
in the SMILES string and also in the atomic coordinates. The
atomic coordinates have priority over the SMILES string.
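A short Python sketch of decoding the coordinate block, treating every omitted value as zero. The function name parse_coordinates is hypothetical, and the input is assumed to be the parenthesized block only.

```python
def parse_coordinates(field):
    """Turn '(x,y,z;x,y,z;...)' into a list of (x, y, z) float tuples."""
    field = field.strip()
    if not (field.startswith("(") and field.endswith(")")):
        raise ValueError("coordinates must be enclosed in parentheses")
    atoms = []
    for triplet in field[1:-1].split(";"):
        parts = triplet.split(",")
        parts += [""] * (3 - len(parts))  # pad triplets with missing values
        # An empty string means the zero coordinate was omitted
        atoms.append(tuple(float(p) if p else 0.0 for p in parts[:3]))
    return atoms


print(parse_coordinates("(-4.62,1.05,;-3.29,.28,)"))
# [(-4.62, 1.05, 0.0), (-3.29, 0.28, 0.0)]
```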
Atom indexes in the data sgroup are written after
"SgD:", followed by the
field name, data value, query operator, unit, tag
and, if necessary, coordinates in parentheses, all separated by
colon characters. Field values with special characters are escaped. If atomic coordinates are exported (with the
c option), (-1) is used in the coordinate field
for data sgroups attached to atoms.
Example: "SgD:3,2,1,0:name:data:like:unit:t:(-1)"
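The field order described above can be made explicit with a small Python sketch. The names parse_data_sgroup and SGD_FIELDS are hypothetical. The sketch assumes the "SgD:" field has already been isolated and contains no escaped colons.

```python
# Field order after the atom index list, as described in the text
SGD_FIELDS = ("field_name", "data", "operator", "unit", "tag", "coordinates")


def parse_data_sgroup(field):
    """Decode one 'SgD:atoms:name:data:operator:unit:tag:coords' field."""
    parts = field.split(":")
    if parts[0] != "SgD" or len(parts) < 2:
        raise ValueError("not a data sgroup field")
    atoms = [int(i) for i in parts[1].split(",") if i]
    values = parts[2:]
    values += [""] * (len(SGD_FIELDS) - len(values))  # pad missing fields
    record = dict(zip(SGD_FIELDS, values))
    record["atoms"] = atoms
    return record


print(parse_data_sgroup("SgD:3,2,1,0:name:data:like:unit:t:(-1)"))
```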
The R-group attachment point is written explicitly as ANY atom into the SMILES
string.
The order of the attachment point is written as an alias string
(see above) after "_AP", separated by semicolon characters.
Before version 5.4 only two attachment point types were supported, and the
attachment point was not exported to the SMILES string explicitly.
Instead, in the extended part the atom indexes of the attachment points were written
after "AP_x:", where x denoted
attachment type 1, 2 or 3 for attachment points 1, 2 or both.
Example: "C[C@H](N*)C(*)=O |$;;;_AP1;;_AP2;$|", before version 5.4: "C[C@H]([NH])[C]=O |AP_1:2,AP_2:3|"
S-group attachment point information is not handled by cxsmiles or cxsmarts.
Rgroup information can be exported to extended cxsmiles/cxsmarts. Rgroups in the molecule are exported as ANY atoms in the SMILES part and are described in the alias part as "_Rn". Rgroup descriptions (molecules) are also enumerated in the extended part after "RG" followed by a colon character.
Ligand order information can be exported to extended cxsmiles/cxsmarts after "LO" followed by a colon character.
Pseudo atoms can be exported to extended cxsmiles/cxsmarts.
They are described in the alias part as "pseudo_p",
where pseudo is the value of the pseudo atom.
Example:
CCCC* |$;;;;pseudo_p$|
Special atoms AH, QH, M, MH, X, XH and Pol are exported to cxsmiles/cxsmarts as pseudo atoms, i.e. AH_p, QH_p, M_p, MH_p, X_p, XH_p and Pol_p, respectively. Special atoms Q and star are exported as Q_e and star_e, respectively. Special atom A can be handled by SMILES export, therefore it is not written to the alias part of the cxsmiles/cxsmarts.
Examples:
*C(*)CC(*)CC(*)* |$;;Pol_p;;;Q_e;;;star_e;M_p$|
*C(*)CC(*)CC(*)* |$Q_e;;AH_p;;;X_p;;;QH_p;XH_p$|
Ring bond count (rb), substitution count (s) and unsaturated atom (u) are exported in the following form:
rb:atomIndex1:value,atomIndex2:value
s:atomIndex1:value,atomIndex2:value
u:atomIndex1,atomIndex2,atomIndex3
Examples: "rb:1:2,2:*,4:2", "u:3,4,5"
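A sketch of reading these three query fields in Python. The function name parse_query_feature is hypothetical. Count values are kept as strings because, as the example shows, a value may be "*" rather than a number.

```python
def parse_query_feature(field):
    """Decode an isolated 'rb:', 's:' or 'u:' field.

    Returns (key, {atom: value}) for rb and s, and (key, [atoms]) for u.
    """
    key, _, body = field.partition(":")
    if key in ("rb", "s"):
        values = {}
        for entry in body.split(","):
            atom, value = entry.split(":")
            values[int(atom)] = value  # kept as text: '*' is a valid value
        return key, values
    if key == "u":
        return key, [int(i) for i in body.split(",")]
    raise ValueError("unsupported field key: %r" % key)


print(parse_query_feature("rb:1:2,2:*,4:2"))  # ('rb', {1: '2', 2: '*', 4: '2'})
print(parse_query_feature("u:3,4,5"))         # ('u', [3, 4, 5])
```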
Each Sgroup is exported after "Sg:" in fields separated by a colon. Fields are:
Keyword | Sgroup Type |
n | SRU |
mon | monomer |
mer | mer |
co | copolymer |
xl | crosslink |
mod | modification |
mix | mixture |
f | formulation |
any | anypolymer |
gen | generic |
c | component |
grf | graft |
alt | alternating copolymer |
ran | random copolymer |
blk | block copolymer |
A colon is needed after the last non-empty field.
If one needs to retain not only the chemically relevant information, but the whole structure (as drawn), then the c export option
should be used.
Examples:
CCCC |Sg:gen:0,1,2:|
CCCC |Sg:n:0,1,2:3-6:eu|
*CC(*)C(*)N* |$star_e;;;star_e;;star_e;;star_e$,Sg:n:6,1,2,4::hh,f:6,0,:4,2,|
C1=CC=CC=C1 |c:0,2,4,(-4.62,1.05,;-3.29,.28,;-3.29,-1.27,;-4.62,-2.04,;-5.95,-1.27,;-5.95,.28,),Sg:mon:0,5,4,3,2,1:::::(d,s,-7.03,2.12,-2.21,2.12,-2.21,-3.11,-7.03,-3.11,)|
Parent-child relationship of the sgroups is described with the "SgH" tag.
The structure of the SgH tag is the following:
SgH:parentSgroupIndex1:childSgroupIndex1.childSgroupIndex2,parentSgroupIndex2:childSgroupIndex1
The indices of the sgroups come from the order in which they are written in the cxsmiles string, i.e. the first sgroup has index 0, the second has index 1, and so on. This includes data sgroups and polymer sgroups as well.
Examples:
CC(N)C=O |Sg:gen:0::,Sg:mon:1,2,4,0,3::,SgH:1:0| // A monomer sgroup contains all 5 atoms, and it contains the generic sgroup with 1 atom.
C1CCCCC1 |SgD:0,1,2,3,4,5:f:34::::,Sg:mon:0,1,2,3,4,5::,SgH:1:0| // A monomer sgroup contains all the atoms, and it contains the datasgroup too.
C.C |SgD:1::::::,SgD:0,1::::::,SgD:0::::::,SgD:0::::::,Sg:gen:0::,Sg:gen:1::,Sg:gen:1::,SgH:5:6,6:0,2:4.3| // A more difficult example with multiple sgroup relations.
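The parent-child structure of the SgH tag maps naturally onto a dictionary. The Python sketch below uses a hypothetical function name, parse_sgroup_hierarchy, and assumes the "SgH:" field has already been isolated from the rest of the extended part.

```python
def parse_sgroup_hierarchy(field):
    """Decode 'SgH:parent:child.child,parent:child' into {parent: [children]}."""
    if not field.startswith("SgH:"):
        raise ValueError("not an SgH field")
    children = {}
    for entry in field[4:].split(","):
        parent, _, kids = entry.partition(":")
        children[int(parent)] = [int(k) for k in kids.split(".")]
    return children


print(parse_sgroup_hierarchy("SgH:5:6,6:0,2:4.3"))  # {5: [6], 6: [0], 2: [4, 3]}
```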
Atom properties are exported to CXSMILES and CXSMARTS after the keyword 'atomProp' at the extended part. Every property is exported separately with the following rule:
The properties are separated with colons. The end of the atom property block is marked with a comma. If the atom has a non-string property, an exception is thrown. Example:
CNC |atomprop:0.key1.value1:0.key2.value2:1.key3.value3| // Atom 0 has two properties and atom 1 has one.
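A Python sketch of decoding the atom property block from the example. The function name parse_atom_properties is hypothetical. The sketch assumes each entry has the form index.key.value, that keys contain no dots, and that no colons appear inside keys or values. The keyword is matched case-insensitively because the text and the example spell it differently.

```python
def parse_atom_properties(field):
    """Decode 'atomprop:idx.key.value:idx.key.value' into {idx: {key: value}}."""
    keyword, _, body = field.partition(":")
    if keyword.lower() != "atomprop":
        raise ValueError("not an atom property field")
    properties = {}
    for entry in body.split(":"):
        # Split on the first two dots only, so a value may itself contain dots
        index, key, value = entry.split(".", 2)
        properties.setdefault(int(index), {})[key] = value
    return properties


print(parse_atom_properties("atomprop:0.key1.value1:0.key2.value2:1.key3.value3"))
# {0: {'key1': 'value1', 'key2': 'value2'}, 1: {'key3': 'value3'}}
```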
R-logic is exported along with the R-group information. It is indicated by the LOG tag, which includes the list of R-logics for the R-groups. The list items are separated by dots. One item consists of the R-logic properties separated by semicolons: the identifier of another rgroup which is after the 'then' part of the R-logic (e.g. 'if R1>0 then R2'), the restH property ('H' if set, empty if not) and the R-logic range. If there is no R-logic specified for an R-group, then it is not included in the list. Example:
[*]C1CCCCC1[*] |$_R1;;;;;;;_R2$,RG:_R1={CCC},_R2={N},LOG={_R1:;;>0._R2:_R1;H;0,1}|
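The LOG entry of this example can be decoded as sketched below in Python. The function name parse_rlogic and the dictionary keys are hypothetical. The sketch assumes that each item has the form rgroup:then;restH;range and that the range itself contains no dot or semicolon.

```python
def parse_rlogic(field):
    """Decode 'LOG={_R1:;;>0._R2:_R1;H;0,1}' into a dict keyed by rgroup."""
    if not (field.startswith("LOG={") and field.endswith("}")):
        raise ValueError("not a LOG field")
    logic = {}
    for item in field[5:-1].split("."):
        rgroup, _, props = item.partition(":")
        then_group, rest_h, occurrence = props.split(";")
        logic[rgroup] = {
            "then": then_group or None,  # rgroup named in the 'then' part
            "rest_h": rest_h == "H",     # 'H' if restH is set
            "range": occurrence,         # kept as text, e.g. '>0' or '0,1'
        }
    return logic


print(parse_rlogic("LOG={_R1:;;>0._R2:_R1;H;0,1}"))
```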
s
Fix chiral flag from cxsmiles input.
By default, the molecule's absolute stereoconfiguration (relative or absolute chirality, i.e. the chiral flag) is specified in the extended part of the cxsmiles string. If it is missing, it is assumed to be absolute (see Molecule absolute stereoconfiguration above). With the 's' option, the importer instead attempts to determine the molecule's absolute stereoconfiguration.
Example: molconvert cxsmiles -s 'C[C@H]1CC[C@@H](C)CC1{cxsmiles}' results in C[C@H]1CC[C@@H](C)CC1
But: molconvert cxsmiles -s 'C[C@H]1CC[C@@H](C)CC1{cxsmiles:s}' results in C[C@H]1CC[C@@H](C)CC1 |r|
See also SMILES import options.
Export options can be specified in the format string. The format descriptor
and the options are separated by a colon.
All options have default values (see below).
Using the "+" or "-" sign the default export values
can be changed to "true" or "false" respectively. If the option is given without "+" or "-" modifier then the
default values are not used and only the specific feature is exported.
Examples:
"cxsmiles:" writes all default features
(absolute stereoconfiguration, enhanced
stereo features, atom labels, wiggly bond indexes, ring stereo bond info and
reaction fragment level grouping),
"cxsmiles:lc" writes the atom labels and the atomic coordinates only,
"cxsmiles:+c" writes all default features and the atomic coordinates,
"cxsmiles:-le" writes absolute stereoconfiguration, enhanced
stereo features, ring stereo bond info and
reaction fragment level grouping but not atom labels and
wiggly bond indexes.
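The "+", "-" and bare option rules can be modelled with a small Python sketch. This is an illustration under assumptions, not the exporter's actual logic: the name resolve_export_options is hypothetical, a sign is assumed to apply to every option letter that follows it, only single-letter options are handled (BOM and the coordinate precision are left out), and the default values are the ones given in the option descriptions of this section.

```python
# Default values of the single-letter export options, as listed in this section
DEFAULTS = {
    "u": False, "e": True, "l": True, "w": True, "d": True, "f": True,
    "p": True, "R": True, "L": True, "m": True, "N": True, "c": False,
    "D": True, "q": True, "P": True, "b": True, "B": True, "A": True,
}


def resolve_export_options(format_string):
    """Return the set of option letters enabled by e.g. 'cxsmiles:+c'."""
    _, _, options = format_string.partition(":")
    if not options or options[0] in "+-":
        enabled = dict(DEFAULTS)                   # modify the defaults
    else:
        enabled = dict.fromkeys(DEFAULTS, False)   # only the listed features
    value = True
    for char in options:
        if char in "+-":
            value = char == "+"  # assumed to apply to all following letters
        elif char in enabled:
            enabled[char] = value
        else:
            raise ValueError("option not covered by this sketch: %r" % char)
    return {name for name, on in enabled.items() if on}


print(sorted(resolve_export_options("cxsmiles:lc")))  # ['c', 'l']
```

Calling resolve_export_options("cxsmiles:+c") returns every option that is on by default plus "c", which matches the second example above.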
u Write unique cxsmiles output. (Includes unique smiles string.) Enhanced stereo information is also stored in unique format. Default value: false.
e Write relative stereo configuration and enhanced stereo features. Default value: true.
l Write atom labels / aliases / values. Default value: true.
w Write wiggly and, in case of atomic coordinate export, also UP and DOWN bond indexes. Default value: true.
d Write CIS, TRANS ring bond indexes. Default value: true.
f Reaction fragment level grouping. Default value: true.
p Write local parities. Default value: true.
R Write radical numbers. Default value: true.
L Write lone electron pairs. Default value: true.
m Write multicenter SGroups and coordinate bonds. Default value: true.
N Write link nodes. Default value: true.
c[p] Write atomic coordinates. p can optionally specify the coordinate precision. If p is not specified, the default value 2 is used. Default value: false.
D Write Data Sgroup information. Default value: true.
BOM Write the UTF-8 byte order mark (BOM), if the given or the system's encoding is UTF-8. Default value: false.
q Write MDL query features. Default value: true.
P Write polymer Sgroups. Default value: true.
b Write local bicyclo-alkane stereo information. Default value: true.
B Write Hydrogen bonds. Default value: true.
A Write atom properties. Default value: true.
See also SMILES export options and basic export options.