Sequence import and export
Codename: peptide, rna, dna
Peptides can be entered using one-letter or three-letter amino acid abbreviations.
A text file containing sequences should contain only one type of sequence (only
one-letter or only three-letter sequences, but not both). Each sequence must occupy
exactly one continuous line in the text file, without spaces.
Abbreviations used:
Ala | Arg | Asn | Asp | Asx | Cys | Gln |
Glu | Glx | Gly | His | Ile | Leu | Lys |
Met | Phe | Pro | Pyl | Sec | Ser | Thr |
Trp | Tyr | Val | Xaa | Xle |
A | R | N | D | B | C | Q | E |
Z | G | H | I | L | K | M | F |
P | O | U | S | T | W | Y | V |
X | J |
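The two lists above are given in the same order, so each three-letter code can be paired with the one-letter code at the same position. The following is a minimal sketch of that pairing at the text level only; it does not involve molconvert, the function names are hypothetical, and it uses Trp as the three-letter code that corresponds to W.

```python
# Three-letter and one-letter codes, paired by their position in the lists above.
THREE = ("Ala Arg Asn Asp Asx Cys Gln Glu Glx Gly His Ile Leu Lys "
         "Met Phe Pro Pyl Sec Ser Thr Trp Tyr Val Xaa Xle").split()
ONE = "ARNDBCQEZGHILKMFPOUSTWYVXJ"

ONE_TO_THREE = dict(zip(ONE, THREE))
THREE_TO_ONE = dict(zip(THREE, ONE))


def one_to_three(seq: str) -> str:
    """Rewrite a one-letter sequence using three-letter codes."""
    return "".join(ONE_TO_THREE[c] for c in seq)


def three_to_one(seq: str) -> str:
    """Rewrite a three-letter sequence using one-letter codes."""
    if len(seq) % 3:
        raise ValueError("length is not a multiple of three")
    return "".join(THREE_TO_ONE[seq[i:i + 3]] for i in range(0, len(seq), 3))
```

An unknown code raises KeyError in this sketch, which is enough to flag a typo in a hand-written sequence.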
Valid files are like:
ProProProAlaLeuProProLysLysArg
AlaProThrMetProProProLeuProPro
but these are incorrect:
PPPALPPKKR
AlaProThrMetProProProLeuProPro
(one-letter and three-letter sequences mixed in one file)
ProProProAlaLeuProProLysLysArg
AlaProThrMetPPPLPP
(one-letter and three-letter codes mixed in one line)
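The file rules above (one notation per file, one unbroken sequence per line) can be checked before handing a file to an importer. This is a rough sketch, not part of Marvin: the function names are hypothetical, and it assumes that codes are case-sensitive and that blank lines are not allowed.

```python
ONE_LETTER = set("ARNDBCQEZGHILKMFPOUSTWYVXJ")
THREE_LETTER = set(
    "Ala Arg Asn Asp Asx Cys Gln Glu Glx Gly His Ile Leu Lys "
    "Met Phe Pro Pyl Sec Ser Thr Trp Tyr Val Xaa Xle".split()
)


def classify_line(line: str):
    """Return 'one', 'three', or None for a single sequence line."""
    if not line:
        return None
    if all(c in ONE_LETTER for c in line):
        return "one"
    chunks = [line[i:i + 3] for i in range(0, len(line), 3)]
    if len(line) % 3 == 0 and all(chunk in THREE_LETTER for chunk in chunks):
        return "three"
    return None


def check_sequence_file(text: str) -> str:
    """Return the notation used by every line, or raise ValueError."""
    kinds = set()
    for number, line in enumerate(text.splitlines(), start=1):
        kind = classify_line(line)
        if kind is None:
            raise ValueError(f"line {number} is not a valid sequence: {line!r}")
        kinds.add(kind)
    if not kinds:
        raise ValueError("no sequences found")
    if len(kinds) > 1:
        raise ValueError("file mixes one-letter and three-letter sequences")
    return kinds.pop()
```

Because one-letter codes are all upper case and three-letter codes contain lower-case letters, a line can never satisfy both tests at once under these assumptions.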
--peptide <string>
    The string is a valid one or three letter sequence. Example:
        molconvert --peptide FFKMLL mol -o peptide.mol
    will convert a one-letter sequence to a molfile

peptide:3
    Using this option the output will be a three-letter sequence. Examples:
        echo "[H]NCC(=O)NC(C)C(=O)NCC(O)=O" | molconvert peptide:3
    will convert SMILES representation to a three-letter sequence
        molconvert --peptide GAG peptide:3
    will convert one-letter sequence to a three-letter sequence

peptide:1
    One-letter peptide sequence option. Example:
        echo "[H]NCC(=O)NC(C)C(=O)NCC(O)=O" | molconvert peptide:1
    will convert the SMILES string to a one-letter sequence
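When these conversions are driven from a script, it can help to assemble the command line in one place. The sketch below only builds the argument list, using nothing beyond the options shown in the examples above; the helper name peptide_command is hypothetical, and actually running the result (for instance with subprocess.run) assumes that molconvert is installed and on the PATH.

```python
import shlex


def peptide_command(sequence, output_format, output_file=None):
    """Build a molconvert argument list for a peptide sequence given as a string."""
    args = ["molconvert", "--peptide", sequence, output_format]
    if output_file is not None:
        # -o names the output file, as in the molfile example above
        args += ["-o", output_file]
    return args


# Show the commands as they would be typed in a shell.
print(shlex.join(peptide_command("FFKMLL", "mol", "peptide.mol")))
print(shlex.join(peptide_command("GAG", "peptide:3")))
```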
In addition to the standard amino acids that are recognized by default, it is
possible to define custom amino acids with non-standard side chains or with
alternative protonation states. The usual format of the dictionary
file is:
Ala A [CX4H3][C@HX4H1]([NX3])C=O 3 4
Arg R [N;X3][C@@H]([CH2][CH2][CH2][N;H1X3][C;X3]([N;H2X3])=N)C=O 1 10
Asn N [#7;X3][C@@H]([CH2]C([N;H2X3])=O)[C;X3]=O 1 7
Asp D [NX3][C@@HH1]([CH2]C([OX2H1])=O)C=O 1 7
...
where the columns are, in order:
- long (three-letter code) abbreviation
- short (one-letter code) abbreviation
- SMARTS representation of the amino acid fragment
- the position of the backbone N among the atoms of the SMARTS string,
counting from 1 (the third atom for Ala in the first line of the example)
- the position of the backbone C next to the acyl oxygen, counted the same
way (the fourth atom for Ala in the first line of the example)
The columns should be separated by tab characters.
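To make the column layout concrete, here is a minimal sketch that splits one dictionary line on tab characters and converts the two atom positions to integers. The AminoAcidEntry type and the parse_entry function are hypothetical names, not Marvin's own reader, and the sketch covers only the five-column form described above.

```python
from typing import NamedTuple


class AminoAcidEntry(NamedTuple):
    long_name: str   # three-letter code, e.g. "Ala"
    short_name: str  # one-letter code, e.g. "A"
    smarts: str      # SMARTS of the amino acid fragment
    n_index: int     # 1-based position of the backbone N in the SMARTS
    c_index: int     # 1-based position of the backbone C next to the acyl oxygen


def parse_entry(line: str) -> AminoAcidEntry:
    """Parse one tab-separated dictionary line with exactly five columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated columns, got {len(fields)}")
    long_name, short_name, smarts, n_index, c_index = fields
    return AminoAcidEntry(long_name, short_name, smarts, int(n_index), int(c_index))


entry = parse_entry("Ala\tA\t[CX4H3][C@HX4H1]([NX3])C=O\t3\t4")
print(entry.long_name, entry.n_index, entry.c_index)
```

Splitting strictly on tabs, rather than on any whitespace, follows the requirement that columns are tab-separated and reports a line typed with spaces as malformed.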
A custom amino acid abbreviation is expected to start with X, followed by
further characters between parentheses. It is advised to use this string for
both the short and the long name of the custom amino acid. Valid lines are:
X(Hcy) X(Hcy) [SX2H1][CH2][CH2][C@HH1]([NX3])C=O 5 6
X(1-foo) X(1-foo) [SX2H1][CH2][C@HH1]([NX3])C=O 4 5
X(b) X(b) [CH3][CH2][CH2][CH2][CH2][C@HH1]([NX3])C=O 7 8
...
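The naming convention for custom abbreviations can be checked with a short pattern. This is only a sketch: the text above does not say which characters are allowed inside the parentheses, so the pattern assumes any non-empty run of characters other than whitespace and parentheses, which covers the three example names. The function name is hypothetical.

```python
import re

# "X", then a non-empty name in parentheses (allowed characters are an assumption).
CUSTOM_NAME = re.compile(r"X\([^()\s]+\)")


def is_custom_name(name: str) -> bool:
    """Return True if name looks like a custom abbreviation such as X(Hcy)."""
    return CUSTOM_NAME.fullmatch(name) is not None


for name in ("X(Hcy)", "X(1-foo)", "X(b)", "Hcy", "X()"):
    print(name, is_custom_name(name))
```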
Since Marvin 6.2 it's possible to define a molecule name in the dictionary.
The name can be defined in the first column in the file using the molName= prefix:
molName=L-Alanine Ala A [CX4H3][C@HX4H1]([NX3])C=O 3 4
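A reader of the dictionary therefore has to allow for an optional leading column. The sketch below separates it from the remaining columns; split_mol_name is a hypothetical helper, and it assumes the columns are tab-separated as in the rest of the file.

```python
PREFIX = "molName="


def split_mol_name(line: str):
    """Return (molecule name or None, remaining columns) for one dictionary line."""
    fields = line.rstrip("\n").split("\t")
    if fields[0].startswith(PREFIX):
        # Drop the prefix and hand back the other columns unchanged.
        return fields[0][len(PREFIX):], fields[1:]
    return None, fields


name, rest = split_mol_name("molName=L-Alanine\tAla\tA\t[CX4H3][C@HX4H1]([NX3])C=O\t3\t4")
print(name, rest)
```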
Note that the SMARTS strings representing amino acid fragments specify the
hydrogen counts and sometimes the connection counts to avoid ambiguity. For
example, if only the string C[C@H](N)C=O were used for alanine, it would also
match many other amino acids, because they contain alanine as a substructure.
Users can store their custom amino acids in the
custom_aminoacids.dict file in the
.chemaxon directory (UNIX) or in the user's
chemaxon directory (MS Windows).
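As a rough sketch, the location of that file can be computed as follows. It assumes that the .chemaxon or chemaxon directory sits directly inside the user's home directory, which the text above does not state explicitly; custom_dictionary_path is a hypothetical helper, and its two parameters exist only so the result can be checked for either platform.

```python
import platform
from pathlib import Path


def custom_dictionary_path(home=None, system=None) -> Path:
    """Guess where custom_aminoacids.dict lives for the given home and OS."""
    home = Path(home) if home is not None else Path.home()
    system = system or platform.system()
    # Assumption: "chemaxon" on MS Windows, ".chemaxon" everywhere else.
    folder = "chemaxon" if system == "Windows" else ".chemaxon"
    return home / folder / "custom_aminoacids.dict"


print(custom_dictionary_path())
```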
DNA/RNA sequences can be entered using one-letter nucleic acid abbreviations.
Each sequence must occupy exactly one continuous line in the text file, without spaces.
Abbreviations used:
Valid files are like:
A-C-G-T-A-C-G-T |
A-C-C-C-C-G-T-G-G-G-T |
dA-dC-dG-dT-dA-dC-dG-dT |
dA-dC-dC-dC-dC-dG-dT-dG-dG-dG-dT |
but these are incorrect: