Genome analysis entails the prediction of genes in uncharacterized genomic sequences. The 21st century has seen the announcement of the draft version of the human genome sequence. Model organisms have been sequenced in both the plant and animal kingdoms.
However, genome annotation has not kept pace with genome sequencing. Experimental genome annotation is slow and labor-intensive, so there is a strong demand for computational tools for gene prediction.
Computational gene prediction is relatively simple for prokaryotes, where each gene is a continuous stretch of DNA that is transcribed into the corresponding mRNA and then translated into protein. The process is more complex for eukaryotic cells, where the coding DNA sequence is interrupted by non-coding intervening sequences called introns.
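To illustrate why the prokaryotic case is the simpler one, here is a minimal sketch of a naive open reading frame (ORF) scan. It assumes a gene is simply an ATG start codon followed in frame by a stop codon, looks only at the forward strand, and ignores everything a real gene finder would model (promoters, ribosome binding sites, codon usage, the reverse strand). The function name `find_orfs` and the `min_codons` threshold are illustrative choices, not part of any established tool.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs.

    Indices are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop (the ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == START:
                # Walk codon by codon until an in-frame stop codon appears.
                j = i + 3
                while j + 3 <= len(seq) and seq[j:j + 3] not in STOPS:
                    j += 3
                if j + 3 <= len(seq):  # a stop codon was found
                    if (j - i) // 3 >= min_codons:
                        orfs.append((i, j + 3))
                    i = j + 3
                    continue
            i += 3
    return sorted(orfs)

if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # made-up sequence for illustration
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])
```

A scan like this breaks down on eukaryotic DNA, because an intron can introduce stop codons or shift the reading frame in the middle of a gene.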
Some of the questions which biologists want to answer today are:
Given a DNA sequence, which parts of it code for a protein and which parts are junk DNA?
How can the junk DNA be classified as introns, untranslated regions, transposons, dead genes (pseudogenes), regulatory elements, etc.?
How can a newly sequenced genome be divided into genes (coding regions) and non-coding regions?
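The last question can be pictured as a labelling problem. The sketch below assumes the coding intervals are already known (in practice, finding them is the hard part) and only shows the bookkeeping of splitting a sequence into coding and non-coding spans. The function name `partition_regions` and the coordinates are hypothetical.

```python
def partition_regions(seq_length, coding_intervals):
    """Split [0, seq_length) into labelled coding and non-coding spans.

    coding_intervals: non-overlapping half-open (start, end) pairs.
    """
    regions = []
    cursor = 0
    for start, end in sorted(coding_intervals):
        if start > cursor:
            # Gap between the previous gene and this one.
            regions.append((cursor, start, "non-coding"))
        regions.append((start, end, "coding"))
        cursor = end
    if cursor < seq_length:
        regions.append((cursor, seq_length, "non-coding"))
    return regions

if __name__ == "__main__":
    # Hypothetical 100 bp sequence with two predicted genes.
    for region in partition_regions(100, [(10, 30), (50, 80)]):
        print(region)
```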
The importance of genome analysis can be understood by comparing the human and chimpanzee genomes. The two genomes differ by an average of just 2%, i.e. about 160 enzymes. A complete analysis of the two genomes would give strong insight into the mechanisms responsible for the differences between the species.
The table below lists the estimated genome sizes and gene counts of selected species.
Species | Genome size (Mb) | Number of genes |
Mycoplasma genitalium | 0.58 | 500 |
Streptococcus pneumoniae | 2.2 | 2,300 |
Escherichia coli | 4.6 | 4,400 |
Saccharomyces cerevisiae | 12 | 5,800 |
Caenorhabditis elegans | 97 | 19,000 |
Arabidopsis thaliana | 125 | 25,500 |
Drosophila melanogaster | 180 | 13,700 |
Oryza sativa | 466 | 45,000-55,000 |
Mus musculus | 2,500 | 29,000 |
Homo sapiens | 3,300 | 27,000 |
Arabidopsis and humans have roughly the same number of genes, even though the Arabidopsis genome is around 26 times smaller than the human genome (125 Mb versus 3,300 Mb). How is that?
The human genome contains a lot of junk DNA, particularly transposons and other mobile genetic elements. This inflates the size of the human genome, even though it holds only about 27,000 genes.
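The comparison can be made concrete with a little arithmetic on the table's figures. The sketch below computes the genome size ratio and the gene density (genes per Mb) for the two species. The helper `gene_density` is an illustrative name, and the inputs are the table's estimates, not exact measurements.

```python
# Genome size (Mb) and gene count, taken from the table above.
genomes = {
    "Arabidopsis thaliana": (125, 25_500),
    "Homo sapiens": (3300, 27_000),
}

def gene_density(size_mb, n_genes):
    """Average number of genes per megabase."""
    return n_genes / size_mb

size_ratio = genomes["Homo sapiens"][0] / genomes["Arabidopsis thaliana"][0]
densities = {name: gene_density(*values) for name, values in genomes.items()}

for name, density in densities.items():
    print(f"{name}: {density:.1f} genes per Mb")
print(f"Human genome is {size_ratio:.1f} times larger")
```

On these estimates Arabidopsis packs about 204 genes into each megabase, against about 8 for humans, which is what a large non-coding fraction in the human genome would look like.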
However, the number of protein products in humans is significantly higher than the number of genes. Many of the sequenced human genes have alternative splice products. In addition, several other processes (e.g. signal transduction) proceed via further protein modifications, such as glycosylation. Therefore, the number of human protein products could far exceed the number of genes.
Why do plants have such bulky genomes when they are not as complex as some of the higher eukaryotes?
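A rough sketch shows how quickly alternative splicing can multiply the products of a single gene. It assumes a deliberately simplified model in which every optional ("cassette") exon is included or skipped independently, so a gene with n optional exons yields up to 2^n transcripts. Real splicing is more constrained, and the exon names and the function `enumerate_isoforms` are hypothetical.

```python
from itertools import product

def enumerate_isoforms(exons):
    """List every exon combination under the simplified model.

    exons: list of (name, is_constitutive) pairs in genomic order.
    Constitutive exons are always kept; the others may be skipped.
    """
    choices = [(True,) if constitutive else (True, False)
               for _, constitutive in exons]
    isoforms = []
    for pattern in product(*choices):
        kept = tuple(name for (name, _), keep in zip(exons, pattern) if keep)
        isoforms.append(kept)
    return isoforms

if __name__ == "__main__":
    # Hypothetical gene: E1 and E4 are constitutive, E2 and E3 are optional.
    gene = [("E1", True), ("E2", False), ("E3", False), ("E4", True)]
    for isoform in enumerate_isoforms(gene):
        print("-".join(isoform))
```

Under this model, two optional exons already give four transcripts from one gene, before any protein modification is counted.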
This is mostly due to two factors: the ability of plants to duplicate their genomes in order to reproduce (a process known as polyploidization) and the susceptibility of plants to mobile genetic elements.
Polyploidization allows plants to form hybrids more easily when pollen and ova from different species come together. The result of such a hybridization event is a plant whose genome is the sum of the two parent genomes (as opposed to half of one parent’s genome and half of the other parent’s genome, as in normal sexual reproduction).
Also, in plants it is fairly common to observe insertions of transposable elements in intergenic regions. This also explains the differences in genome size among plant species.
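The difference between the two outcomes comes down to simple arithmetic. The sketch below is only a toy model of it: the genome sizes are made-up values, and the function name `offspring_genome_mb` is hypothetical.

```python
def offspring_genome_mb(parent_a_mb, parent_b_mb, polyploid_hybrid=False):
    """Toy model of offspring genome size in Mb.

    Normal sexual reproduction: half of each parent's genome.
    Polyploid hybrid: both parent genomes are retained in full.
    """
    if polyploid_hybrid:
        return parent_a_mb + parent_b_mb
    return parent_a_mb / 2 + parent_b_mb / 2

if __name__ == "__main__":
    # Hypothetical parent genome sizes, chosen only for illustration.
    a_mb, b_mb = 400, 600
    print("normal cross:", offspring_genome_mb(a_mb, b_mb))
    print("polyploid hybrid:", offspring_genome_mb(a_mb, b_mb, polyploid_hybrid=True))
```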