This is the README file for ChemGenome version 2.1. If you encounter any problem or error while downloading or running the program, please contact us at scfbio@scfbio-iitd.res.in

About ChemGenome
ChemGenome is a physico-chemical method that accepts a DNA sequence in FASTA format and predicts genes based on the hydrogen bonding energy, stacking energy and a protein-nucleic acid interaction parameter of each trinucleotide (codon).
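
To illustrate the trinucleotide view described above, the following shell sketch splits a DNA string into codons for a chosen reading frame and builds the reverse complement, which is what the complementary reading frames are read from. The function names codons_in_frame and reverse_complement are hypothetical helpers, not part of the ChemGenome package, and the energy parameters themselves are not reproduced here.

```shell
# Hypothetical helpers for illustration only; not part of the ChemGenome package.

# Print the complete codons of sequence $1 read in frame $2 (1, 2 or 3).
codons_in_frame() {
    printf '%s\n' "$1" | cut -c "$2"- | fold -w 3 | awk 'length($0) == 3'
}

# Print the reverse complement of an upper-case DNA sequence $1.
reverse_complement() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}
```

Calling codons_in_frame on the output of reverse_complement gives the codons of a complementary frame. Incomplete trailing codons are dropped.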

ChemGenome is an ab initio method. It has been tested on 372 prokaryotic genomes, comprising 356208 genes and an equal number of frame-shifted genes (non-genes). Averaged over this set, the sensitivity, specificity and correlation coefficient are 97.5%, 97.20% and 94.25% respectively. The software can be downloaded from the following link.
	www.scfbio-iitd.res.in/chemgenome/chemgenomenew.jsp
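
For readers who want to compute comparable figures on their own test set, here is a minimal sketch of the three measures. This README does not spell out the formulas, so the sketch assumes the common definitions: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), and the Matthews form of the correlation coefficient. Some gene-prediction papers define specificity as TP/(TP+FP) instead. The function name gene_metrics is hypothetical.

```shell
# Hypothetical helper; assumes standard definitions that this README does not state.
# Arguments: TP FN TN FP (counts of true/false positives and negatives).
# Prints sensitivity, specificity and correlation coefficient as percentages.
gene_metrics() {
    awk -v tpos="$1" -v fneg="$2" -v tneg="$3" -v fpos="$4" 'BEGIN {
        sens = tpos / (tpos + fneg)            # sensitivity
        spec = tneg / (tneg + fpos)            # specificity (assumed TN-based form)
        denom = sqrt((tpos + fpos) * (tpos + fneg) * (tneg + fpos) * (tneg + fneg))
        cc = (tpos * tneg - fpos * fneg) / denom
        printf "%.2f %.2f %.2f\n", 100 * sens, 100 * spec, 100 * cc
    }'
}
```

The sketch does not guard against empty classes (a zero denominator); it is meant for a balanced set of genes and non-genes like the one described above.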

Installing and Running ChemGenome
ChemGenome has been written and compiled in a Linux environment. The following instructions apply to Linux systems.

Installation
To install ChemGenome, download the Chemgenome2.1.tar file from the website. The size of the compressed file is 2.8MB.

Copy the tar file to your current directory and extract it with this command:
 	$ tar -xvf Chemgenome2.1.tar 

The ChemGenome2.1 package contains five items: ChemGenome2.1.sh, Protein_Score.exe, Chemgenome2.0, a data directory and readme.txt.

To run ChemGenome2.1 properly, copy the data directory into your current directory before running ChemGenome2.1. After ChemGenome2.1 finishes, all result files are copied into the current directory.
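
Before the first run, it can help to confirm that the five items listed above are present. The following is a minimal sketch of such a check; the function name check_chemgenome_files is hypothetical and is not shipped with ChemGenome.

```shell
# Hypothetical pre-flight check; not shipped with ChemGenome.
# $1 is the directory to check (defaults to the current directory).
check_chemgenome_files() {
    dir=${1:-.}
    status=0
    for f in ChemGenome2.1.sh Protein_Score.exe Chemgenome2.0 readme.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f" >&2
            status=1
        fi
    done
    # The data directory must also be present before running.
    if [ ! -d "$dir/data" ]; then
        echo "missing directory: $dir/data" >&2
        status=1
    fi
    return "$status"
}
```

The function returns 0 only when all four files and the data directory are found, and prints the missing names to standard error otherwise.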

Running
ChemGenome2.1 is run with two arguments. The first is the chromosome file in FASTA format. The second is 1 or 2, depending on the organism (1 for a sequence from a prokaryotic organism, 2 for a sequence from a eukaryotic organism or of unknown origin).
	$ sh ChemGenome2.1.sh <genome_file_name> <1 or 2> 
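
The two arguments can be checked before the script is called. The sketch below is one way to do that; the function name validate_chemgenome_args and its messages are hypothetical, and the FASTA test only looks at the first line of the file.

```shell
# Hypothetical argument check; the function name and messages are illustrative.
# $1 = genome file, $2 = organism code (1 or 2).
validate_chemgenome_args() {
    if [ ! -r "$1" ]; then
        echo "cannot read genome file: $1" >&2
        return 1
    fi
    # A FASTA file is expected to begin with a '>' header line.
    case "$(head -n 1 "$1")" in
        '>'*) ;;
        *) echo "first line of $1 is not a FASTA header" >&2; return 1 ;;
    esac
    case "$2" in
        1|2) ;;
        *) echo "organism code must be 1 or 2" >&2; return 1 ;;
    esac
    return 0
}
```

A possible use, with genome.fasta as a placeholder file name:
	$ validate_chemgenome_args genome.fasta 1 && sh ChemGenome2.1.sh genome.fasta 1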

For advanced features, the user can modify the ChemGenome2.1.sh file.

In ChemGenome2.1.sh, the first program executed is Chemgenome2.0, called with the following parameters:
        $ ./Chemgenome2.0 <genome_file_name> <orf_length> <method> <Start Codon (ATG OR|AND CTG OR|AND GTG OR|AND TTG) > 

Arguments
ORF length: For a small genome, you can specify a lower threshold value to find smaller genes. For a large genome, you can specify a higher threshold value to weed out false positives.
Start Codon: Specifies the start codon(s) used to find genes (ATG, CTG, GTG and/or TTG).
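
If ChemGenome2.1.sh is edited to pass custom values, the two arguments above can be sanity-checked first. The helpers below are a sketch only; their names are hypothetical, and Chemgenome2.0 itself may accept or reject other values.

```shell
# Hypothetical checks; Chemgenome2.0 itself may accept or reject other values.

# Succeeds if $1 is a positive integer (a plausible ORF length threshold).
is_valid_orf_length() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Succeeds if $1 is one of the four start codons named in the usage line.
is_valid_start_codon() {
    case "$1" in
        ATG|CTG|GTG|TTG) return 0 ;;
        *) return 1 ;;
    esac
}
```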

Method: 
DNA Space: The method takes the complete genome sequence of a prokaryotic species, or a part of it, in FASTA format as its input file. It searches for genes based on the physico-chemical properties of double-helical deoxyribonucleic acid (DNA).

Protein Space: The method takes the result generated from DNA space as its input file and works as a filter, based on stereochemical properties of protein sequences, to reduce false positives.

Swissprot Space: The method takes the results generated from protein space as its input file and calculates the standard deviation of a query nucleotide sequence (a predicted gene sequence) from the Swiss-Prot proteins, based on the frequency of occurrence of amino acids. A threshold standard deviation is chosen to keep false positives at a minimum and precision at a maximum.
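
The idea behind the Swissprot space filter can be sketched as follows. This is only one plausible reading of the description above, not the calculation used by ChemGenome: it compares the amino acid frequencies of a translated query with a reference frequency table and reports the root-mean-square deviation. The function name aa_frequency_sd and the reference file format are assumptions, and no Swiss-Prot reference values are supplied here.

```shell
# Illustrative sketch only; the exact ChemGenome calculation is not given in this README.
# $1 = protein sequence (one-letter codes)
# $2 = reference file with lines "<residue> <frequency>", supplied by the caller
# Prints sqrt(mean((f_query - f_ref)^2)) over the residues listed in the reference file.
aa_frequency_sd() {
    awk -v seq="$1" '
        BEGIN {
            # Count each residue of the query sequence.
            len = length(seq)
            for (i = 1; i <= len; i++) count[substr(seq, i, 1)]++
        }
        NF >= 2 {
            fq = (len > 0) ? count[$1] / len : 0
            d = fq - $2
            sumsq += d * d
            n++
        }
        END {
            if (n == 0) exit 1
            printf "%.4f\n", sqrt(sumsq / n)
        }
    ' "$2"
}
```

A predicted gene would then be kept or rejected by comparing the printed value with a chosen threshold. The threshold value itself is not stated in this README.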

The output of Chemgenome2.0 is further passed through protein-based filters to produce the final output files:
1. 1main_orfs.txt - Position of genes predicted in 1st main reading frame
2. 2main_orfs.txt - Position of genes predicted in 2nd main reading frame
3. 3main_orfs.txt - Position of genes predicted in 3rd main reading frame
4. 1complementary_orfs.txt - Position of genes predicted in 1st complementary reading frame
5. 2complementary_orfs.txt - Position of genes predicted in 2nd complementary reading frame
6. 3complementary_orfs.txt - Position of genes predicted in 3rd complementary reading frame
7. Gene_sequences.txt - Gene Sequences of the predicted genes along with position.
8. Protein_sequences.txt - Protein sequences of the predicted genes along with position.
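
After a run, the six position files listed above can be summarized with a short helper. The sketch below counts the non-empty lines in each file; whether one line corresponds to exactly one predicted gene is an assumption, since the file layout is not described here. The function name summarize_orf_files is hypothetical.

```shell
# Hypothetical summary helper; assumes the six position files sit in directory $1
# (default: current directory). Whether one line equals one gene is an assumption.
summarize_orf_files() {
    dir=${1:-.}
    for frame in 1main 2main 3main 1complementary 2complementary 3complementary; do
        file="$dir/${frame}_orfs.txt"
        if [ -f "$file" ]; then
            # Count non-empty lines only.
            n=$(awk 'NF { c++ } END { print c + 0 }' "$file")
            printf '%s %s\n' "${frame}_orfs.txt" "$n"
        else
            printf '%s missing\n' "${frame}_orfs.txt"
        fi
    done
}
```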

Speed
The time taken by the program depends on the genome size and the speed of the system on which it is run. It usually takes 1-2 minutes for a 1MB genome.

References  
[1] "Molecular Dynamics Based Physicochemical Model for Gene Prediction in Prokaryotic Genomes", Singhal P, Jayaram B, Dixit SB and Beveridge DL. Manuscript under revision.
[2] "A Physico-Chemical model for analyzing DNA sequences", Dutta S, Singhal P, Agrawal P, Tomer R, Kritee, Khurana E and Jayaram B, J. Chem. Inf. Mod., 46(1), 78-85, 2006.
[3] "Beyond the Wobble: The rule of conjugates", Jayaram B, J. Mol. Evol., 45, 704-705, 1997.
 

