This document describes ChemAxon's Chemical Terms Language, which is used to formulate chemical expressions in general. Its current uses include chemical rules for reaction processing, search filters, and chemical calculations and chemical filtering in JChem Cartridge. The Evaluator command line tool and the Evaluator API are also available for general purpose expression evaluation.
The Chemical Terms Evaluator is designed to evaluate mathematical expressions on molecules using built-in chemical and general purpose functions. It is also possible to extend this built-in set of calculations by a user-defined configuration.
The heart of the evaluator mechanism is the JEP Java Expression Parser, equipped with chemical plugin calculations, chemical substructure search and some additional chemical and general purpose functions. User defined functions can also be added to this function set.
Here are some simple examples showing how some well-known chemical rules can be formulated for a given input molecule read from a molecule context:
The following filters are used in drug discovery and drug development to narrow down the scope of molecules. They provide an estimate of the solubility and permeability of orally active compounds based on their physical and chemical properties. The examined properties are given as chemical terms.
(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10)
(mass() <= 450) && (logD("7.4") >= -4) && (logD("7.4") <= 4) && (ringCount() <= 4) && (rotatableBondCount() <= 10) && (donorCount() <= 5) && (acceptorCount() <= 8)
(mass() <= 500) + (logP() <= 5) + (donorCount() <= 5) + (acceptorCount() <= 10) + (rotatableBondCount() <= 10) + (PSA() <= 200) + (fusedAromaticRingCount() <= 5) >= 6
Note that summing up the 7 subresults above counts how many of them are satisfied. Requiring this sum to be at least 6 means that we do not require all of the subconditions to be satisfied; instead, we allow at most one of them to fail.
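The counting idea in the note above can be sketched in Python, where booleans also sum as 0 and 1. The property values below are made-up numbers for illustration, not output of any ChemAxon calculation, and the dictionary keys simply mirror the function names used in the expression.

```python
# Hypothetical property values for a single molecule (illustrative numbers only,
# not computed by any ChemAxon plugin).
props = {
    "mass": 430.0,
    "logP": 5.4,  # the only sub-condition that fails in this example
    "donorCount": 2,
    "acceptorCount": 6,
    "rotatableBondCount": 7,
    "PSA": 95.0,
    "fusedAromaticRingCount": 1,
}

# The seven sub-conditions, each evaluating to True or False.
conditions = [
    props["mass"] <= 500,
    props["logP"] <= 5,
    props["donorCount"] <= 5,
    props["acceptorCount"] <= 10,
    props["rotatableBondCount"] <= 10,
    props["PSA"] <= 200,
    props["fusedAromaticRingCount"] <= 5,
]

# Summing booleans counts the satisfied conditions (True counts as 1).
satisfied = sum(conditions)
passes = satisfied >= 6
print(satisfied, passes)  # 6 True
```

With one failing sub-condition the sum is 6, so the molecule still passes; a second failure would bring the sum to 5 and reject it.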
(mass() >= 160) && (mass() <= 480) && (atomCount() >= 20) && (atomCount() <= 70) && (logP() >= -0.4) && (logP() <= 5.6) && (refractivity() >= 40) && (refractivity() <= 130)
refmol = "actives.sdf"; dissimilarity("ChemicalFingerprint", refmol) - dissimilarity("PharmacophoreFingerprint", refmol) > 0.6
Note that molecule constants can be defined by a molecule file path or a SMILES string. Multiple expressions are separated by ';' characters. Whitespace characters can be added freely for readability, since they are ignored by the evaluation process.
A set of working examples is also available.
The Chemical Terms Evaluator parses and evaluates expressions that are built from the following language elements:
arithmetic operators: addition (+), subtraction (-), multiplication (*) and division (/),
boolean operators: AND (&&), OR (||), NOT (!),
matching conditions, which evaluate to true if a matching is found, false otherwise.
A set of short reference tables provides a summary of the available functions / calculations and the use of matching conditions.
Expression strings consist of an arbitrary number of initial assignments followed by a last subexpression that provides the evaluation result. An assignment sets a variable to the evaluation result of a subexpression. This variable can later be used to reference this result. The assignment syntax is:
<identifier> = <subexpression>;
Note the ending ';' character. Examples for assignments:
x = 2; y = x + 8; z = f(x,y) + g(x,y);
where f and g are predefined functions.
An expression is an optional sequence of assignments followed by a subexpression providing the evaluation result:
<identifier1> = <subexpression1>; <identifier2> = <subexpression2>; ... <identifierN> = <subexpressionN>; <result subexpression>
where N can also be zero, in which case the expression coincides with the result subexpression.
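The structure described above (zero or more assignments followed by a result subexpression) can be illustrated with a small Python sketch. This is not how the Evaluator or JEP parses expressions; `split_expression` is a hypothetical helper that only splits on ';' characters outside quoted strings and does not handle escaped quotes.

```python
def split_expression(text):
    """Split an expression string into (assignments, result subexpression).

    Illustrative only: ';' separates parts unless it occurs inside a
    single- or double-quoted string. Escaped quotes are not handled.
    """
    parts, current, quote = [], [], None
    for ch in text:
        if quote:
            current.append(ch)
            if ch == quote:  # closing quote of the current string
                quote = None
        elif ch in "\"'":
            quote = ch  # opening quote
            current.append(ch)
        elif ch == ";":
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    *assignments, result = parts
    return assignments, result


print(split_expression("a = f(2,3); b = g(4,5); x = a + b; x*x"))
print(split_expression("3+2"))  # N = 0: the expression is the result subexpression
```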
Here is an example with assignments:
a = f(2,3); b = g(4,5); x = a + b; x*x
Here is the same without assignments:
(f(2,3) + g(4,5))*(f(2,3) + g(4,5))
Assignments increase efficiency if the same evaluation result is used more than once, since inline repetition of a subexpression results in multiple evaluations. Assignments can also be used to increase readability. However, in most cases, when the expression is simple, assignments are not needed. Note that whitespace characters (new-line, tab, space) are skipped when parsing the expression string, so they can be used freely to increase readability.
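The efficiency argument can be made concrete with a Python analogy that counts function calls. The functions `f` and `g` below are arbitrary stand-ins for the predefined functions of the example above; their bodies are assumptions chosen only so that the code runs.

```python
calls = {"f": 0, "g": 0}


def f(a, b):
    """Hypothetical stand-in for a predefined function f."""
    calls["f"] += 1
    return a + b


def g(a, b):
    """Hypothetical stand-in for a predefined function g."""
    calls["g"] += 1
    return a * b


# With assignments: each function is evaluated once.
a = f(2, 3)
b = g(4, 5)
x = a + b
with_assignments = x * x
calls_with = dict(calls)

# Without assignments: the repeated subexpression is evaluated twice.
calls["f"] = calls["g"] = 0
without_assignments = (f(2, 3) + g(4, 5)) * (f(2, 3) + g(4, 5))
calls_without = dict(calls)

print(with_assignments == without_assignments)  # True
print(calls_with, calls_without)
```

Both forms give the same result, but the inline form calls each function twice.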
The following examples demonstrate the expression syntax with very simple subexpressions. Examples with chemical meaning are shown later for matching conditions, chemical calculations and chemical and general purpose functions.
Examples:
3+2
x = 2; y = 3; x + y
x = 2; y = 3; z = 8*(x + y); t = 6*x*y; z + t
x = (3 + 4)*8 + 16; y = 3*x; z = x + 20; 5*(y + 8) + 4*z
It is sometimes easier to refer to molecules by name rather than by explicit SMARTS strings or molecule file paths. For example, you may want to write nitro or carboxyl as the query in a match function. Frequently used queries are predefined in the built-in functional groups file (chemaxon/marvin/templates/functionalgroups.cxsmi within MarvinBeans-templates.jar).
You can also define your favourite query SMARTS in the marvin/config/marvin/templates/functionalgroups.cxsmi file and in the $HOME\chemaxon\marvin\templates\functionalgroups.cxsmi (Windows) or $HOME/.chemaxon/marvin/templates/functionalgroups.cxsmi (UNIX / Linux) file, where marvin is the Marvin installation directory and $HOME is your user home directory.
However, there are some limitations when choosing molecule names. Molecule names should be composed of letters, digits and the '_' character. This means that molecule names cannot contain special characters such as '=' or '-', with the exception of '_'. Molecule name definitions in the functionalgroups.cxsmi file can contain whitespace characters (space, tab), but when names are referenced from a Chemical Terms expression the whitespace characters should be replaced with a single '_' character (e.g. secondary amine should be referred to as secondary_amine in Chemical Terms expressions).
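The naming rule can be sketched as two small Python helpers. Both function names are hypothetical and are not part of any ChemAxon API. The sketch assumes that "letters" means ASCII letters and that a run of consecutive whitespace characters collapses into one '_'.

```python
import re


def to_chemical_terms_name(group_name: str) -> str:
    """Replace each run of whitespace in a group name with a single '_'."""
    return re.sub(r"\s+", "_", group_name.strip())


def is_valid_chemical_terms_name(name: str) -> bool:
    """Accept only names built from letters, digits and '_' (ASCII assumed)."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


print(to_chemical_terms_name("secondary amine"))        # secondary_amine
print(is_valid_chemical_terms_name("secondary_amine"))  # True
print(is_valid_chemical_terms_name("C=O"))              # False
```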
Note: from Marvin 5.4, the mols.smarts configuration file is not used by Chemical Terms. It is replaced by the functionalgroups.cxsmi file.
You can define molecule sets and other constants in the user-defined initial script $HOME/chemaxon/MARVIN_MAJOR_VERSION/jep.script (Windows) or $HOME/.chemaxon/MARVIN_MAJOR_VERSION/jep.script (UNIX / Linux), where $HOME is your user home directory and MARVIN_MAJOR_VERSION is the major version of Marvin (e.g. "5.1"). This script is run right after the molecule sets are read, and the constants defined here can be used later in your chemical expressions. Any valid chemical terms assignment is allowed here, and the terminating ';' characters may be omitted as long as you write each assignment on a separate line.
Typically, you will define a molecule set by
x = {acid_halide, alcohol, "[#6]CC[#8]"}
y = {alkene, amide, imide, imine}
z = {alkene, amide, amine, alcohol, isocyanate}
all = x + y + z (union of x, y, z)
join = y * z (join of y and z)
C = (x + y) * z (join of the union of x and y with z)
D = z - alcohol (all elements of z except alcohol)
E = (x + y) - z (union of x and y without the elements of z)
where + means set-union, * means set-join and - means exclusion.
Predefined molecules and molecule sets are most useful in query definitions of the match function:
match(amide) will test whether the input molecule matches an amide group,
match(reactant(0), {amide,amine}) will test whether the first reactant in a reaction context matches an amide or an amine,
match(2, {metalloid,alcohol}, 1) will check whether atom 2 of the input molecule matches either a metalloid or an alcohol carbon - the last parameter 1 denotes the query atom map which picks the carbon from the alcohol definition; match(ratom(2), {metalloid,alcohol}, 1) is the same in a reaction context, checking the target reactant atom which corresponds to the reactant atom with map 2 in the reaction equation.
When evaluating an expression, the Evaluator substitutes data reference symbols by the corresponding data items. All data items belong to exactly one of the following data groups:
number and string constants (e.g. 5, 7.4, "acidic", "mols/amine.mol", "NCC(N)C1=CC(=CC=C1)C(O)=O")
predefined molecule constants (e.g. nitro, hydrazide, carboxyl)
The type of the input data depends on the expression evaluation environment, which currently is one of the following:
The evaluation environment provides a specific input context for accessing its input data. The input context consists of a set of accessor functions that can be used in the expression strings to access the input data. The following input contexts correspond to the evaluation environments described above:
Molecule context:
mol(): refers to the current input molecule
Atom context:
mol(): refers to the current input molecule
atom(): refers to the current input atom index in the input molecule
Search context (jcsearch and search queries):
mol(), target(): both refer to the search target molecule
query(): refers to the search query molecule
m(int i): refers to the query atom index with atom map i
hit(), h(): both refer to the search hit array
hit(int i), h(int i): both refer to the i-th element of the search hit array, that is, the target atom index matching the query atom with atom index i
hm(int i): refers to the target atom index matching the query atom with atom map i (shorthand for h(m(i)))
Reaction context:
reactant(int i): refers to the i-th reactant (0-based indexing)
product(int i): refers to the i-th product (0-based indexing)
ratom(int m): refers to the reactant atom corresponding to reactant atom map m according to the reaction equation
patom(int m): refers to the product atom corresponding to product atom map m according to the reaction equation
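The relationship between the search accessors, in particular hm(i) being shorthand for h(m(i)), can be illustrated with a toy Python model. The atom map table and the hit array below are invented data for illustration, not the result of an actual search.

```python
# Hypothetical search result for a query with four atoms.
atom_map_to_query_index = {1: 0, 2: 3}  # atom map -> query atom index
hit = [5, 6, 7, 9]                      # query atom index -> matching target atom index


def m(i):
    """Query atom index carrying atom map i."""
    return atom_map_to_query_index[i]


def h(i):
    """Target atom index matching the query atom with atom index i."""
    return hit[i]


def hm(i):
    """Target atom index matching the query atom with atom map i."""
    return h(m(i))


print(hm(1))  # 5
print(hm(2))  # 9
```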
Note that the default input molecule is the molecule returned by mol() if this function exists in the context.
The built-in configuration XML can be extended by user-defined functions and plugin calculations. The configuration syntax is described in the Evaluator Manual.
The examples below are divided into sections according to the input context applied, which corresponds to the different applications that can make use of ChemAxon's chemical expressions. These examples use the built-in configuration XML, the referenced functions and plugin calculations are listed in the short reference tables.
Plugin references provide access to ChemAxon's calculator plugins. These calculations equip our expressions with chemical meaning.
The major microspecies at pH 7.4 of the input molecule:
microspecies("7.4")
The partial charges on atoms 0, 2 and 3 (0-based) of the input molecule:
charge(0, 2, 3)
The same in the major microspecies at pH 7.4:
charge(0, 2, 3, "7.4")
Checks whether the partial charge on atom 0 in the input molecule is greater than this charge value in the physiological microspecies at pH 7.4:
charge(0) > charge(0, "7.4")
The pKa value on atom 9 (0-based) of the input molecule:
pka(9)
The acidic pKa value on atom 9 (0-based) of the input molecule:
pka("acidic", 9)
Note that if the pKa type "acidic" or "basic" is omitted (as in the previous example), then the more significant value is returned, while specifically the "acidic" (or "basic") pKa value is returned if the type is specified.
pka("acidic", "1")
Note the difference in the last two examples: in pKa calculation a number denotes the atom index while a number in quotation marks denotes the strength order: 9 in the previous example refers to atom 9 while "1" in the above example refers to the strongest acidic pKa value ("2" refers to the second strongest value, etc.).
logp()
The logD value at pH=7.4 of the input molecule:
logd("7.4")
Note that in logD calculation the pH value should be enclosed in quotation marks.
logd("7.4") - logd("3.8") > 0.5
mass()
acceptorCount()
The acceptor count at pH 7.4:
acceptorCount("7.4")
acceptorCount("7.4") - acceptorCount() > 1
There are different types of functions provided by ChemAxon (for example, the isQuery function).
The minimum of the partial charges on atoms 7, 8 and 9 (0-based) of the input molecule:
min(charge(7), charge(8), charge(9))
The hydrogen count on atom 2 (0-based) of the input molecule:
hcount(2)
The valence of atom 2 of the input molecule:
valence(2)
filter("charge() > 0")
count(filter("charge() > 0"))
charge(filter("charge() > 0"))
sortAsc(charge(filter("charge() > 0")))
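The four expressions above build on each other: select atoms, count them, look up their charges, and sort the result. A plain Python analogy, using an invented list of partial charges indexed by atom (not values computed by the charge plugin), could look like this:

```python
# Hypothetical partial charges; the list index plays the role of the atom index.
charges = [-0.31, 0.12, 0.05, -0.08, 0.27]

# Analogy of filter("charge() > 0"): indices of atoms with positive charge.
positive_atoms = [i for i, q in enumerate(charges) if q > 0]

# Analogy of count(filter(...)).
positive_count = len(positive_atoms)

# Analogy of charge(filter(...)): the charges on the selected atoms.
positive_charges = [charges[i] for i in positive_atoms]

# Analogy of sortAsc(charge(filter(...))).
sorted_charges = sorted(positive_charges)

print(positive_atoms)  # [1, 2, 4]
print(positive_count)  # 3
print(sorted_charges)  # [0.05, 0.12, 0.27]
```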
The atoms with a partial charge of at least 0.4 in the major microspecies at pH=7.4:
filter("charge('7.4') >= 0.4")
charge(filter("charge('7.4') >= 0.4"))
min(pka(filter("match('[!#6!#1;H1]')"), "acidic"))
Checks whether the minimum acidic pKa value on hetero atoms with a single hydrogen is less than 0.75:
min(pka(filter("match('[!#6!#1;H1]')"), "acidic")) < 0.75
maxAtom("pka('basic')", 2)
Note that expression strings can be enclosed in either double or single quotes; in nested strings the two can be alternated. However, some UNIX shells interpret single quotes, which makes single quotes hard to use in command line input. File input solves this problem, or else the inner single quotes can be replaced by escaped double quotes:
maxAtom("pka(\"basic\")", 2)
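If the expression has to be passed on a POSIX shell command line, the quoting can be generated rather than written by hand. The sketch below uses Python's standard `shlex` module and deliberately shows no Evaluator command or options, since those are not given here; it only demonstrates that the quoted string survives shell-style tokenization unchanged.

```python
import shlex

# The expression with nested single quotes, as written in a file.
expression = "maxAtom(\"pka('basic')\", 2)"

# Variant 1: let shlex produce a shell-safe form of the expression as is.
quoted = shlex.quote(expression)

# Variant 2: replace the inner single quotes by escaped double quotes,
# after which the whole expression can be wrapped in single quotes.
escaped = expression.replace("'", '\\"')
wrapped = "'" + escaped + "'"

print(escaped)  # maxAtom("pka(\"basic\")", 2)

# Both forms are read back by a POSIX-style tokenizer as one argument.
print(shlex.split(quoted) == [expression])  # True
print(shlex.split(wrapped) == [escaped])    # True
```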
maxValue("pka('basic')", 2)
x = maxAtom("pka('basic')", 2); charge(x[0]) > charge(x[1])
Note that in the current version the above expression cannot be evaluated if there are fewer than two basic pKa values in the input molecule.
sortDesc(pka("basic", filter("charge() > 0")))
Note that in the current version NaN values (meaning that there is no valid pKa for the given atom) are put at the end of the array after sorting.
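The described ordering (descending values, NaN entries last) can be mimicked in Python as follows. `sort_desc_nan_last` is a hypothetical helper for illustration, not the implementation of sortDesc, and the pKa values are invented.

```python
import math


def sort_desc_nan_last(values):
    """Sort numbers in descending order and append NaN entries at the end."""
    numbers = sorted((v for v in values if not math.isnan(v)), reverse=True)
    nans = [v for v in values if math.isnan(v)]
    return numbers + nans


# Hypothetical basic pKa values; NaN marks an atom without a valid pKa.
result = sort_desc_nan_last([8.2, float("nan"), 10.1, 9.0])
print(result)  # [10.1, 9.0, 8.2, nan]
```

With this ordering, a comparison such as x[0] - x[1] > 1.5 uses the two largest valid values whenever at least two exist.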
x = sortDesc(pka("basic", filter("charge() > 0"))); x[0] - x[1] > 1.5
eval("hcount()")
sum(eval("hcount()"))
refmol = "c1ccccc1"; dissimilarity("PF", refmol)
Note: the dissimilarity function is not available in Marvin; it can be used only if the JChem software package is installed.
refmol = "c1ccccc1"; dissimilarity("PF:Euclidean", refmol)
The partial charges on the two atoms among atoms 1, 6, 8 (0-based atom indices) having the first and second biggest hydrogen counts (molecule context):
x = array(1, 6, 8); y = maxAtom(x, "hcount()", 2); charge(y)
Checks whether atom 6 (0-based atom index) has the first or second smallest partial charge among atoms 1, 6, 8, 10, 12 (molecule context):
x = array(1, 6, 8, 10, 12); y = minAtom(x, "charge()", 2); in(6, y)
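The logic of the last expression (restrict to a set of atoms, pick the two with the smallest charge, test membership) can be followed step by step in a Python analogy. The partial charges are invented for illustration.

```python
# Hypothetical partial charges keyed by 0-based atom index.
charge = {1: 0.10, 6: -0.25, 8: 0.02, 10: -0.40, 12: 0.15}

# x = array(1, 6, 8, 10, 12)
candidates = [1, 6, 8, 10, 12]

# y = minAtom(x, "charge()", 2): the two candidate atoms with the smallest charge.
two_smallest = sorted(candidates, key=lambda atom: charge[atom])[:2]

# in(6, y): is atom 6 among them?
result = 6 in two_smallest

print(two_smallest)  # [10, 6]
print(result)        # True
```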
There are three options to reference substructure search from our expressions: the match function returns a true / false answer, while the matchCount and disjointMatchCount functions return the number of search hits.
Note: the match, matchCount and disjointMatchCount functions are not available in Marvin; they can be used only if the JChem software package is installed.
match("C1CCOCC1")
Substructure matching with the target atom being atom 2 (0-based) of the input molecule and the query atom set being all query atoms:
match(2, "C1CCOCC1")
Substructure matching with the target atom being atom 2 (0-based) of the input molecule, and the query atom set being both query carbon atoms attached to the oxygen:
match(2, "C1C[C:1]O[C:2]C1", 1, 2)
match(2, "mols/query.mol", 1, 2)
The same with the query nitro given as a predefined molecule constant:
match(2, nitro, 1, 2)
matchCount("C=O") + matchCount("CO")
Checks whether the input molecule matches "S" and contains at least 6 "C=O" and "CO" groups altogether:
match("S") && (matchCount("C=O") + matchCount("CO") >= 6)
Note: Reactor is part of JChem software package, it is not available in Marvin.
Plugin references provide access to ChemAxon's calculator plugins. These calculations equip our expressions with chemical meaning.
The major microspecies at pH 7.4 of the first reactant:
microspecies(reactant(0), "7.4")
The partial charge on the reactant atom matching map 1 in the reaction equation:
charge(ratom(1))
The same in the major microspecies at pH 7.4:
charge(ratom(1), "7.4")
The partial charge on the atom with index 2 in the first reactant:
charge(reactant(0), 2)
Note: Evaluation of this expression will result in an error if there is no atom with index 2 in the first reactant. In reaction context, referring to atoms by atom index (instead of atom map) is recommended only if the atom indices are returned by a Chemical Terms expression (see this example).
Checks whether the partial charge on the reactant atom matching map 1 is greater than this charge value in the physiological microspecies at pH 7.4:
charge(ratom(1)) > charge(ratom(1), "7.4")
The pKa value on the reactant atom matching map 3 in the reaction equation:
pka(ratom(3))
pka(ratom(3), "acidic")
Note that if the pKa type "acidic" or "basic" is omitted (as in the previous example), then the more significant value is returned, while specifically the "acidic" (or "basic") pKa value is returned if the type is specified.
pka(reactant(0), "acidic", "1")
logp(product(0))
The logD value at pH=7.4 of the first product:
logd(product(0), "7.4")
Note that in logD calculation the pH value should be enclosed in quotation marks.
logd(product(0), "7.4") - logd(product(0), "3.8") > 0.5
mass(product(1))
acceptorCount(reactant(1))
The acceptor count of the second reactant at pH 7.4:
acceptorCount(reactant(1), "7.4")
acceptorCount(reactant(1), "7.4") - acceptorCount(reactant(1)) > 1
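There are different types of functions provided by ChemAxon (for example, the isQuery function).
The minimum of the partial charges on the reactant atoms matching maps 2, 3 and 4:
min(charge(ratom(2)), charge(ratom(3)), charge(ratom(4)))
The hydrogen count on the product atom matching map 2:
hcount(patom(2))
The valence of the reactant atom matching map 2:
valence(ratom(2))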
filter(reactant(0), "charge() > 0")
count(filter(reactant(0), "charge() > 0"))
charge(reactant(0), filter(reactant(0), "charge() > 0"))
sortAsc(charge(reactant(0), filter(reactant(0), "charge() > 0")))
The atoms of the first product with a partial charge of at least 0.4 in the major microspecies at pH=7.4:
filter(product(0), "charge('7.4') >= 0.4")
charge(product(0), filter(product(0), "charge('7.4') >= 0.4"))
min(pka(reactant(0), filter(reactant(0), "match('[!#6!#1;H1]')"), "acidic"))
Checks whether the minimum acidic pKa value on hetero atoms with a single hydrogen is less than 0.75 in the first reactant:
min(pka(reactant(0), filter(reactant(0), "match('[!#6!#1;H1]')"), "acidic")) < 0.75
min(pKa(reactant(0), filter(reactant(0), "aliphaticAtom()"), "acidic"))
Checks whether the bond between the reactant atom matching map 1 and the reactant atom matching map 2 is a single or double bond:
(bondType(reactant(0), bond(ratom(1), ratom(2))) == 1 || bondType(reactant(0), bond(ratom(1), ratom(2))) == 2)
Note that the bond(ratom(1), ratom(2)) subexpression returns an <atomIndex1>-<atomIndex2> string, so in reaction context the molecule parameter must also be passed to the bondType() function (see this note). In the example, the reactant atoms matching maps 1 and 2 are atoms of the first reactant (reactant(0)).
maxAtom(product(0), "pka('basic')", 2)
Note that expression strings can be enclosed in either double or single quotes; in nested strings the two can be alternated. However, some UNIX shells interpret single quotes, which makes single quotes hard to use in command line input. File input solves this problem, or else the inner single quotes can be replaced by escaped double quotes:
maxAtom(product(0), "pka(\"basic\")", 2)
maxValue(product(0), "pka('basic')", 2)
x = maxAtom(product(1), "pka('basic')", 2); charge(x[0]) > charge(x[1])
Note that in the current version the above expression cannot be evaluated if there are fewer than two basic pKa values in the molecule.
sortDesc(pka("basic", reactant(0), filter(reactant(0), "charge() > 0")))
Note that in the current version NaN values (meaning that there is no valid pKa for the given atom) are put at the end of the array after sorting.
x = sortDesc(pka("basic", reactant(0), filter(reactant(0), "charge() > 0"))); x[0] - x[1] > 1.5
eval(product(0), "hcount()")
sum(eval(product(0), "hcount()"))
dissimilarity("PF", reactant(0), product(0))
dissimilarity("PF:Euclidean", reactant(0), product(0))
There are three options to reference substructure search from our expressions: the match function returns a true / false answer, while the matchCount and disjointMatchCount functions return the number of search hits.
match(reactant(0), "C1CCOCC1")
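Substructure matching with the target atom being the product atom matching map 2 and the query atom set being all query atoms:
match(patom(2), "C1CCOCC1")
Substructure matching with the target atom being the product atom matching map 2, and the query atom set being both query carbon atoms attached to the oxygen:
match(patom(2), "C1C[C:1]O[C:2]C1", 1, 2)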
match(patom(2), "mols/query.mol", 1, 2)
matchCount(product(1), "C=O") + matchCount(product(1), "CO")
Checks whether the second product matches "S" and contains at least 6 "C=O" and "CO" groups altogether:
match(product(1), "S") && (matchCount(product(1), "C=O") + matchCount(product(1), "CO") >= 6)
Plugin references provide access to ChemAxon's calculator plugins. These calculations equip our expressions with chemical meaning.
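The partial charge on the target atom matching map 1 should be positive:
charge(hm(1)) > 0
The same in the major microspecies at pH 7.4:
charge(hm(1), "7.4") > 0
The partial charge on the target atom matching map 1 should be greater than this charge value in the physiological microspecies at pH 7.4:
charge(hm(1)) > charge(hm(1), "7.4")
The basic pKa value on the target atom matching map 3 should be greater than 8.0:
pka(hm(3), "basic") > 8.0
The strongest acidic pKa value of the target should be less than 0.5:
pka("acidic", "1") < 0.5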
Note that by default the expression refers to the target. Write query() to refer to the query:
pka(query(), "acidic", "1") < 0.5
logp(query()) > logp()
The logD value at pH=7.4 of the target should be less than the logD value at pH=3.4:
logd("7.4") < logd("3.4")
Note that in logD calculation the pH value should be enclosed in quotation marks.
logd("7.4") - logd("3.8") > 0.5
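The mass of the target should be greater than 500 and the partial charge on the target atom matching map 1 should be positive:
(mass() > 500) && (charge(hm(1)) > 0)
There are different types of functions provided by ChemAxon:
The minimum of the partial charges on the target atoms matching maps 2, 3 and 4 should be negative, that is, there should be at least one negative among these charge values:
min(charge(hm(2)), charge(hm(3)), charge(hm(4))) < 0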
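The equivalence used in this example (the minimum is negative exactly when at least one value is negative) can be checked with a short Python sketch. The charge values are invented and keyed by atom map for illustration only.

```python
def min_is_negative(values):
    """Mirror of min(...) < 0 in the expression above."""
    return min(values) < 0


def any_is_negative(values):
    """The equivalent reading: at least one value is negative."""
    return any(v < 0 for v in values)


# Hypothetical partial charges on the target atoms matching maps 2, 3 and 4.
mixed = {2: 0.12, 3: -0.05, 4: 0.30}
all_positive = {2: 0.12, 3: 0.05, 4: 0.30}

print(min_is_negative(mixed.values()), any_is_negative(mixed.values()))                # True True
print(min_is_negative(all_positive.values()), any_is_negative(all_positive.values()))  # False False
```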
The hydrogen count on the target atom matching map 1 and on the target atom matching map 2 should be at least 1:
(hcount(hm(1)) >= 1) && (hcount(hm(2)) >= 1)
The valence of the target atom matching map 1 or map 2 should be at least 1:
(valence(hm(1)) >= 1) || (valence(hm(2)) >= 1)
count(filter("charge() > 0")) >= count(filter(query(), "charge() > 0"))
The minimum acidic pKa value on hetero atoms with a single hydrogen should be less than 0.75, that is, there should be at least one hetero atom with a single hydrogen with acidic pKa less than 0.75:
min(pka(filter("match('[!#6!#1;H1]')"), "acidic")) < 0.75
x = maxAtom("pka('basic')", 2); charge(x[0]) > charge(x[1])
Note that in the current version the above expression cannot be evaluated if there are fewer than two basic pKa values in the molecule.
x = max(pka("basic", query(), filter(query(), "charge() > 0"))); y = max(pka("basic", filter("charge() > 0"))); (x - y > 1.5) || (y - x > 1.5)
dissimilarity("PF", target(), query()) < 0.6
Note that target() can be omitted, as in the above examples:
dissimilarity("PF", query()) < 0.6
dissimilarity("PF:Euclidean", target(), query()) < 0.6