
Processing pipeline for EST analysis


Processing pipeline for analysis of A. assama ESTs

An EST processing pipeline was developed to analyse the ESTs generated from different tissues of the muga silkworm. Sequence chromatograms were analysed using Phred quality scores, and a Phred score cut-off of ≥15 was applied to extract quality sequences from the chromatograms. The quality sequences were then screened for vector sequences, which were detected with the 'Cross Match' program and removed.
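
For reference, a Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a cut-off of 15 means roughly a 3% chance of error per base. The sketch below shows one way such a cut-off could be applied, by keeping the longest run of bases at or above the threshold. The function names, the longest-run strategy and the example read are assumptions made for illustration; the pipeline's actual extraction rule is not described here.

```python
def phred_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def longest_quality_region(qualities, cutoff=15):
    """Return (start, end) of the longest run with every score >= cutoff.

    end is exclusive; (0, 0) is returned when no base passes the cut-off.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(qualities):
        if q >= cutoff:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # a low-quality base ends the current run
    return best_start, best_end


# Hypothetical read and per-base scores, for illustration only.
sequence = "ACGTACGTAC"
scores = [8, 12, 20, 30, 35, 14, 40, 41, 9, 7]
start, end = longest_quality_region(scores)
trimmed = sequence[start:end]
```

Keeping a single contiguous region is only one possible design; windowed averages are another common choice when isolated low-scoring bases should be tolerated.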

Trimming vector sequences

As each EST is usually sequenced from the SK(-) vector-directed primer T3, each sequencing trace will include a small piece of SK(-) vector DNA sequence. At the vector-insert junction there may also be sequence corresponding to linkers or adapters used in the construction of the cDNA library. In the case of a short insert, the derived sequence may extend through the poly(A) tail into the vector on the other side. These sequences should be trimmed from ESTs before further analysis. cDNA library construction can also inadvertently result in the cloning of contaminant DNA, particularly DNA from E. coli and its bacteriophages, which gets into the reactions through contamination of recombinant enzymes. These sequences should also be removed. The 'Cross Match' program was used to find any vector, linker/adapter and contaminant sequences. The identified vector, linker/adapter and contaminant sequences were then removed from the ESTs using a trimming tool developed in-house.
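
The in-house trimming tool is not described here. As a rough sketch of the idea, assume the screening step has already masked every matched vector, linker/adapter or contaminant region with 'X' characters, as vector-screening tools commonly do. One plausible trimming rule is then to keep the longest unmasked stretch and discard reads whose remaining insert is too short. The function name, the 100 bp minimum and the example read are all hypothetical.

```python
import re


def longest_unmasked_segment(masked_read, min_length=100):
    """Return the longest stretch free of 'X' masking, or None if too short."""
    segments = re.findall(r"[^Xx]+", masked_read)
    if not segments:
        return None  # the whole read was masked
    best = max(segments, key=len)
    return best if len(best) >= min_length else None


# Hypothetical masked read: vector at the 5' end, the insert, then vector again.
read = "XXXXXXXX" + "ACGT" * 30 + "XXXX" + "GGCC"
insert = longest_unmasked_segment(read)
```

Here the 120-base insert is kept and the short fragment after the second masked region is dropped.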

Distribution of read length of ESTs derived from different tissues of A. assama

One of the many ways to assess the quality of ESTs is by their read lengths. If the average read length is between 450 and 500 bp, the EST quality is considered good. We analysed the lengths of the ESTs and calculated the mean length for each of the 10 tissue libraries. The distribution of lengths was plotted against the number of ESTs. The average length was 436, 496, 488, 463, 495, 468, 448, 433, 499 and 496 bp in the ESTs derived from embryo, brain, testis, ovary, midgut, fatbody, middle silkgland, posterior silkgland, epidermis and compound eye, respectively. By these figures, the embryo (436 bp), middle silkgland (448 bp) and posterior silkgland (433 bp) libraries averaged below 450 bp; all other libraries had good average read lengths. The majority of ESTs were between 400 and 600 bp long.
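
The length analysis can be sketched as follows. The per-library means are the figures reported above; the 100 bp bin width, the helper name and the variable names are assumptions chosen for illustration, not details of the original analysis.

```python
from collections import Counter


def length_histogram(lengths, bin_width=100):
    """Count reads per length bin; keys are the lower edge of each bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Mean read lengths (bp) per library, as reported in the text.
mean_length = {
    "embryo": 436, "brain": 496, "testis": 488, "ovary": 463,
    "midgut": 495, "fatbody": 468, "middle silkgland": 448,
    "posterior silkgland": 433, "epidermis": 499, "compound eye": 496,
}

THRESHOLD = 450  # lower bound of the 450-500 bp range used as the quality guide

# Libraries whose mean read length falls below the guide value.
below = sorted(name for name, bp in mean_length.items() if bp < THRESHOLD)
print(below)
```

Given a list of individual read lengths, `length_histogram` returns the counts needed to plot the distribution of lengths against the number of ESTs.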

Clustering and assembly of ESTs into putative genes

Many ESTs may derive from the same gene. It is therefore advisable to group the sequences on the basis of sequence similarity into clusters which can be used to derive consensus sequences. A cluster is defined as a unique set of sequences which share common sequence similarity. A cluster containing only one sequence is termed a singleton. Clustering both reduces the level of redundancy and increases the overall quality of the derived sequence.
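
The grouping step can be illustrated with a minimal sketch. It assumes that pairs of sufficiently similar reads have already been identified by a similarity search, and it groups reads by transitive closure over those pairs using a union-find structure. This is a generic illustration of similarity-based clustering, not the implementation of any particular package, and the read identifiers are made up.

```python
def cluster_by_similarity(read_ids, similar_pairs):
    """Group reads into clusters by transitive closure over similar pairs."""
    parent = {r: r for r in read_ids}

    def find(r):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for r in read_ids:
        groups.setdefault(find(r), []).append(r)
    multi_member = [g for g in groups.values() if len(g) > 1]
    singletons = [g[0] for g in groups.values() if len(g) == 1]
    return multi_member, singletons


# Hypothetical read identifiers and pairwise hits, for illustration only.
reads = ["est1", "est2", "est3", "est4", "est5"]
pairs = [("est1", "est2"), ("est2", "est3")]
clusters, singletons = cluster_by_similarity(reads, pairs)
```

In this toy case five reads collapse into one three-member cluster and two singletons, that is, three non-redundant sequences.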

Clustering was carried out using TGI Clustering tools (TGICL), a software system for fast clustering of large EST datasets developed by The Institute for Genomic Research (TIGR). This package automates the clustering and assembly of large EST/mRNA datasets. Clustering is performed by a slightly modified version of NCBI's megablast, and the resulting clusters are then assembled using the CAP3 assembly program. TGICL starts with a large multi-FASTA file and outputs the assembly files as produced by CAP3. The results obtained by clustering and assembly of the ESTs are summarised in Table 2.

Table 2: Total number of ESTs generated in each EST library, number of quality sequences obtained after trimming of contaminants and number of non-redundant sequences (contigs and singletons) obtained after clustering and assembly.

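| Library name | No. of raw sequences | No. of quality ESTs obtained | No. of contigs | No. of ESTs in contigs | No. of singletons | Total non-redundant sequences | % reduction in sequence data |
|---|---|---|---|---|---|---|---|
| aem | 12,500 | 11,502 | 458 | 10,688 | 921 | 1,379 | 88.00 |
| abr | 6,000 | 5,299 | 536 | 3,411 | 1,941 | 2,477 | 53.26 |
| ats | 5,000 | 4,235 | 497 | 1,701 | 2,572 | 3,069 | 27.53 |
| aov | 4,500 | 3,891 | 461 | 2,076 | 1,846 | 2,303 | 40.81 |
| amg | 3,000 | 2,439 | 116 | 2,286 | 185 | 301 | 87.66 |
| afb | 3,000 | 2,318 | 231 | 1,884 | 460 | 691 | 70.19 |
| apsg | 3,000 | 2,543 | 125 | 2,227 | 316 | 441 | 82.66 |
| amsg | 1,500 | 1,031 | 65 | 905 | 150 | 215 | 79.15 |
| aep | 1,500 | 1,386 | 115 | 1,040 | 359 | 474 | 65.80 |
| ace | 1,500 | 1,078 | 118 | 628 | 461 | 579 | 46.29 |
| Total | 41,500 | 35,722 | 2,260 | 30,204 | 5,937 | 8,197 | 77.05 |
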

In total, clustering and assembly reduced the initial EST data by 77%. In other words, we obtained 8,197 non-redundant putative gene transcripts from 35,722 ESTs. The extent of reduction after redundancy removal varied among the tissue libraries. The highest number of unique transcripts was obtained from testis ESTs and the lowest from middle silkgland ESTs. Even though 11,502 quality ESTs were obtained from embryo, these reduced to a small number, 1,379 unique sequences (Table 2), after removal of redundancy.
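
The percentages in the last column of Table 2 are consistent with computing the reduction relative to the number of quality ESTs, that is, (1 - non-redundant sequences / quality ESTs) × 100. The short sketch below reproduces a few of the table's values under that assumption; the function and variable names are illustrative.

```python
def percent_reduction(quality_ests, non_redundant):
    """Percentage by which clustering and assembly reduced the sequence count."""
    return round((1 - non_redundant / quality_ests) * 100, 2)


# (quality ESTs, total non-redundant sequences) taken from Table 2.
table2 = {
    "abr": (5299, 2477),
    "ats": (4235, 3069),
    "amg": (2439, 301),
    "Total": (35722, 8197),
}

for library, (quality, unique) in table2.items():
    print(library, percent_reduction(quality, unique))
```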

Annotation of non-redundant putative transcripts using BLAST search

The unique putative gene sequences obtained by clustering and assembly were annotated by running BLAST against the NCBI non-redundant (nr) protein database. The BLAST output was then parsed to classify the putative gene transcripts into different functional classes. This functional annotation revealed the presence of several genes involved in silk production, circadian rhythm, sex determination and immune response, as well as several novel tissue-specific genes.
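
The parsing step is not detailed here. As a hedged sketch, suppose the BLAST results were saved in the standard 12-column tab-separated layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score). The best hit per transcript could then be selected as shown below. The E-value cut-off of 1e-5, the function name and all identifiers in the example are assumptions, not values taken from the pipeline.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the top-scoring hit."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is too weak to use for annotation
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits in the 12-column tabular layout (identifiers are made up).
example = (
    "contig1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180\n"
    "contig1\tprotB\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-20\t95\n"
    "contig2\tprotC\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25\n"
)
hits = best_hits(example)
unannotated = {"contig1", "contig2"} - set(hits)
```

Transcripts left without a hit under the cut-off, such as `contig2` in this toy input, are candidates for the novel genes mentioned above.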



Copyright © 2008 All Rights Reserved, CDFD, Hyderabad, India