About This Document
All annotations throughout this document refer to the
IMAGEne Build Procedure Diagram,
a flowchart that graphically describes the IMAGEne Build process.
Note that each annotation number corresponds to a numbered
circle in the Build Procedure Diagram.
What is clustering?
Clustering is the process of finding small groups (or clusters) of similar
sequences. The IMAGEne build process does this by taking all the IMAGE
clone sequence data and first assembling it against Known Genes (which we
sometimes call the "KG set" or just KG), defined as the mRNA sequences from
the NCBI Reference Sequence project (http://www.ncbi.nlm.nih.gov/RefSeq/).
The sequences that do not cluster with a Known Gene (plus TIGR ESTs in the
case of the human portion of IMAGEne) are then used to build the Candidate
Gene clusters (the CG set). Once those are complete we are left with only
Singletons, which are high quality clones for which we can find no
homologous clones from the same species.
Collecting Data
A separate nightly process synchronizes our Oracle database with the data
we want from the database at NCBI. It automatically fetches the data
through FTP every night, parses the records, and compares them with what we
already have, either creating new records or updating existing ones [1].
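The create-or-update step can be sketched as follows. This is only a
minimal illustration in Python, assuming records are keyed by accession
number and held in dictionaries; the function name sync_records and the
record layout are hypothetical, and the real process works against an
Oracle database rather than in memory.

```python
def sync_records(local, fetched):
    """Bring `local` in line with `fetched`; both map accession -> record.

    Returns the accessions that were created and those that were updated.
    The in-memory dictionaries are a stand-in for the real database.
    """
    created, updated = [], []
    for accession, record in fetched.items():
        if accession not in local:
            local[accession] = record          # new record
            created.append(accession)
        elif local[accession] != record:
            local[accession] = record          # existing record has changed
            updated.append(accession)
    return created, updated
```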
A key component of IMAGEne's cluster quality is that all the ESTs are
masked for repeats using the University of Washington Repeat Masker. This
is done on every sequence as it is encountered for the first time by our
systems. Needless to say, this is computationally very expensive.
Preparing for the Known Gene Build
Our first step is to pull all the IMAGE sequences for the target species
from our database and weed out all the undesirable sequences, which are:
- All sequences that don't have a region of at least 50 base pairs of
  contiguous nonmasked sequence. Put another way, a sequence must contain
  at least one uninterrupted run of 50 or more As, Cs, Ts and Gs to be
  included in the build.
- All sequences reported as a problem in our problem database, where the
  problem report originated from a nameable source (i.e. not from the
  general public).
- All sequences that the submitter has tagged as overall poor quality in
  the original GenBank record.
Note that the CG build also uses these criteria, although they are applied
in a different portion of the build, as explained in the CG section.
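The three criteria can be sketched as a simple filter. This is a minimal
Python illustration, not the build's actual code. It assumes masked bases
appear as anything other than uppercase A, C, G or T (for example N or X),
and that the problem reports and poor-quality tags are available as sets of
accession numbers; all function and parameter names are hypothetical. The
apply_quality_criterion switch reflects that the CG build defers the third
criterion.

```python
import re

MIN_UNMASKED_RUN = 50  # base pairs, from the first criterion
# Assumption: unmasked bases are uppercase A/C/G/T; anything else is masked.
UNMASKED_RUN = re.compile(r"[ACGT]{%d,}" % MIN_UNMASKED_RUN)


def has_unmasked_region(sequence):
    """True if the sequence has a contiguous unmasked run of 50+ bases."""
    return UNMASKED_RUN.search(sequence) is not None


def keep_for_build(accession, sequence, problem_accessions,
                   low_quality_accessions, apply_quality_criterion=True):
    """Apply the undesirable-sequence criteria to one masked sequence."""
    if not has_unmasked_region(sequence):
        return False                      # criterion 1
    if accession in problem_accessions:
        return False                      # criterion 2
    if apply_quality_criterion and accession in low_quality_accessions:
        return False                      # criterion 3 (deferred in the CG build)
    return True
```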
The Known Gene Build
For the KG build, sequences meeting any of the three undesirable sequence
criteria are eliminated from the data set, and from the remaining sequences
we build a custom Megablast database [2] (the same type of database that
BLAST accepts). Then we create a FASTA file of all reference sequences and
iterate over each one, megablasting it against the sequence database
(containing both ESTs and Full Insert sequences) to find homology between
them. Regardless of how similar they are, all sequences for which there
were matches are assembled and compared against the reference sequence
using the FASTA program [3]. At this point, we consider all sequences with
an OPT score of 1000 or more to be a sufficient match, and with those we
create alignment data using SIM [4].
Each cluster is given a cluster ID that is the accession number of the
associated reference sequence. This three-tiered process (Megablast, FASTA,
SIM) is considered one job, which is run using a program called the Sun
Grid Engine (SGE). SGE is powerful software that lets us use the compute
power of many machines instead of just one.
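The second tier, keeping only the Megablast candidates whose FASTA OPT
score reaches 1000, can be sketched as below. This is a minimal Python
illustration that assumes the OPT scores have already been parsed into a
dictionary; the function name and the result layout are hypothetical.

```python
OPT_THRESHOLD = 1000  # minimum FASTA OPT score for a sufficient match


def select_cluster_members(reference_accession, opt_scores):
    """Build one KG cluster from candidate scores against one reference.

    opt_scores maps each candidate accession to its OPT score against the
    reference sequence. The cluster ID is the reference accession number.
    """
    members = sorted(acc for acc, opt in opt_scores.items()
                     if opt >= OPT_THRESHOLD)
    return {"cluster_id": reference_accession, "members": members}
```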
Now that we know which sequences are homologous to which reference
sequences, we place this data in our database; it will be used after the
IMAGEne build is complete to drive our web interface. Furthermore, to
extract the desired data set for the CG build, we generate what are called
"knockout lists," which are lists of GenBank accession numbers to be
removed from the CG build. There are four types, the first three of which
are generated after the KG build:
- KG Knockout List: contains the GenBank accession numbers of all KG
  cluster members [5], i.e. all sequences clustering to the reference
  sequences.
- Removed from GenBank: contains all accession numbers in our database
  that have, for one reason or another, been removed from GenBank.
- TIGR Knockout List: contains all TIGR accession numbers whose sequences
  are homologous to reference sequences (for the human build only).
- Ribosomal Knockout List: contains all Ribosomal Cluster sequence
  accession numbers (generated after the previous CG build and explained
  below).
All of the accession numbers on these lists are eliminated from the CG data
set before we begin to form L1 clusters, described below.
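Applying the knockout lists amounts to a set difference. A minimal Python
sketch, assuming each list is available as a collection of accession
numbers; the function name and the dictionary of named lists are
hypothetical.

```python
def apply_knockout_lists(candidate_accessions, knockout_lists):
    """Remove every accession that appears on any knockout list.

    knockout_lists maps a list name (e.g. "KG", "removed", "TIGR",
    "ribosomal") to an iterable of accession numbers. Order of the
    surviving candidates is preserved.
    """
    knocked_out = set().union(*knockout_lists.values())
    return [acc for acc in candidate_accessions if acc not in knocked_out]
```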
The Candidate Gene Build
As described in the previous section, all available knockout lists are
applied [6] before the CG build begins. The identification and removal of
ribosomal sequences by this method is elaborated on later.
The Candidate Gene build is intended to cluster all homologous sequences
together into related groups. Sequences will tend to cluster this way if
they are from the same gene. Thus, this data supplements the KG set by
providing data that could represent genes for which no reference sequence
(a definitive source) is available. The CG build is a bit more complex than
the Known Gene build, and takes place afterwards because it relies on
results from the KG build to get started. The CG build has five stages:
- Data Gathering
- Level 1 (L1) clustering
- L2 clustering
- Additional weeding of undesirable sequences
- Cluster Assembly
Data Gathering
As described in "Preparing for the Known Gene Build," data is extracted
for the CG build. At this stage, only the first and second criteria for
weeding out undesirable sequences are applied. The third criterion is
applied after L2 clustering.
L1 Clustering
This is effectively an N x (N-1) Megablast, meaning that every sequence
is megablasted against every other sequence except itself in order to form
what we call L1 clusters. An L1 cluster [7] is just a list of sequences
that the sequence in question is similar to. Similarity is defined to be
95% identity and a minimum length of 50 base pairs. In the earlier builds
of IMAGEne, BLAST was used instead of Megablast. Because of the extreme
computational power needed to blast millions of sequences against each
other (the original human NxN blast for IMAGEne 3.0 was done on 1200 CPUs
of the IBM supercomputer ASCI Blue), we structured this algorithm to reuse
a previous build's NxN blasts (e.g. 3.1 used the L1 clusters derived in
3.0). So, a subsequent build need only blast the new sequences obtained
since the build date of the previous build.
This is accomplished by taking the N sequences of the previous build,
adding M new sequences and recreating the blast database (see diagram). We
then blast all the new sequences against the combined database. Finally, we
add the L1 clusters from the previous build to the newly obtained L1
clusters and continue to L2 clustering.
By definition, if there are N+M sequences in total in the CG build, then
there will be N+M L1 clusters at this point. Said another way, for every
sequence in an IMAGEne build there is an L1 cluster.
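The incremental update can be sketched as follows. This is a minimal
Python illustration under several assumptions: hits have already been
parsed into (query, subject, percent identity, alignment length) tuples,
"95% identity" is read as "at least 95%", and a hit between a new sequence
and an old one is also recorded on the old sequence's L1 cluster. The text
does not state that last point, and all names here are hypothetical.

```python
MIN_IDENTITY = 95.0   # percent
MIN_LENGTH = 50       # base pairs


def is_similar(percent_identity, alignment_length):
    """Similarity test for one hit (threshold reading is an assumption)."""
    return percent_identity >= MIN_IDENTITY and alignment_length >= MIN_LENGTH


def update_l1_clusters(previous_l1, new_accessions, hits):
    """Merge a previous build's L1 clusters with hits for M new sequences.

    previous_l1: accession -> set of similar accessions (N entries).
    new_accessions: the M accessions added since the previous build.
    hits: (query, subject, percent_identity, alignment_length) tuples from
          blasting the new sequences against the combined database.
    Returns a new mapping with N+M entries; previous_l1 is not modified.
    """
    l1 = {acc: set(members) for acc, members in previous_l1.items()}
    for acc in new_accessions:
        l1.setdefault(acc, set())
    for query, subject, identity, length in hits:
        if query == subject or not is_similar(identity, length):
            continue
        l1.setdefault(query, set()).add(subject)
        if subject in l1:
            l1[subject].add(query)   # assumed: similarity is recorded both ways
    return l1
```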
L2 Clustering
Now we bring together L1 clusters to get the maximum possible
representation of the gene from which the clones of each cluster
originated. We pull together any number of L1 clusters into an L2 cluster
[8] if either two of the sequences in two L1 clusters came from the same
clone, or the L1 clusters share a particular sequence, as determined by its
unique accession number. Further, there must be at least two occurrences in
total of either of these criteria to pull two L1 clusters together (e.g.
two L1s could merge because they share two sequences, or because they share
a sequence and two other sequences came from the same clone).
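One way to read the merge rule is sketched below in Python with a simple
union-find structure. This is an interpretation, not the build's actual
code: it counts each shared accession as one occurrence and each clone with
a non-shared sequence in both clusters as one occurrence, and it assumes
merging is transitive. The names link_evidence, build_l2_clusters and
clone_of are hypothetical, and the all-pairs comparison is only practical
for small inputs.

```python
from itertools import combinations


def link_evidence(cluster_a, cluster_b, clone_of):
    """Count occurrences linking two L1 clusters (sets of accessions)."""
    shared = cluster_a & cluster_b
    clones_a = {clone_of[s] for s in cluster_a - shared if s in clone_of}
    clones_b = {clone_of[s] for s in cluster_b - shared if s in clone_of}
    return len(shared) + len(clones_a & clones_b)


def build_l2_clusters(l1_clusters, clone_of, min_evidence=2):
    """Merge L1 clusters into L2 clusters; returns sorted member lists.

    l1_clusters: cluster id -> set of accessions in that L1 cluster.
    clone_of: accession -> clone identifier (missing entries are ignored).
    """
    parent = {cid: cid for cid in l1_clusters}

    def find(cid):
        # Follow parent links to the root, compressing the path as we go.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    for a, b in combinations(l1_clusters, 2):
        if link_evidence(l1_clusters[a], l1_clusters[b], clone_of) >= min_evidence:
            parent[find(a)] = find(b)

    l2 = {}
    for cid, members in l1_clusters.items():
        l2.setdefault(find(cid), set()).update(members)
    return sorted(sorted(members) for members in l2.values())
```

With clusters {s1, s2, s3} and {s2, s3, s4}, the two shared sequences are
enough to merge them; a third cluster {s4, s5} joins only if, in addition
to the shared s4, s5 comes from the same clone as a non-shared sequence of
the other cluster.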
Additional weeding of undesirable sequences
At this point we have thousands of L2 clusters, for which we build FASTA
files for later assembly. We further weed from these clusters the sequences
meeting the third criterion for undesirable sequences (referred to as low
quality sequences) [9]. This is not done earlier because some sequences
labeled low quality may help pull together two L1 clusters, giving us a
better chance of discovering a more representative sequence for each gene.
We also remove, via knockout list, all sequences that form a "Ribosomal
Cluster," which is extremely large due to the high level of homology among
ribosomal DNA. The Ribosomal Cluster stands out for elimination because it
is very large compared to the rest of the data set (e.g. of the 960926
sequences going into the mouse build of IMAGEne 4.0, an unusually large
117520 sequences were in the same cluster). At this stage, the Ribosomal
Cluster is thrown out, and the list of its members is retained and
reapplied at an earlier stage of subsequent builds (see above), thus
reducing the amount of megablasting needed in those builds.
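In the mouse example, the Ribosomal Cluster holds 117520 of 960926
sequences, or about 12% of the data set. The text does not say how the
cluster is singled out beyond its size, so the Python sketch below simply
flags any cluster above a fraction of the total; the threshold value and
all names are hypothetical.

```python
def split_oversized_clusters(clusters, total_sequences, max_fraction=0.05):
    """Separate unusually large clusters from the rest.

    clusters: list of collections of accession numbers.
    max_fraction: hypothetical cutoff; a cluster holding more than this
                  share of all sequences is treated as the oversized one.
    Returns (kept_clusters, knockout_accessions).
    """
    kept, knockout = [], set()
    for members in clusters:
        if len(members) / total_sequences > max_fraction:
            knockout.update(members)   # retained as a knockout list
        else:
            kept.append(members)
    return kept, sorted(knockout)
```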
Cluster Assembly
With FASTA files containing all high quality sequences of each cluster,
we generate a consensus sequence. We have chosen CAP [10] to do this. CAP
is another computationally expensive operation, taking hours on many CPUs.
The output from CAP is then parsed and placed into our database.
Singletons
At this point any sequence in the CG set that does not match some other
sequence in the CG set is considered a Singleton [11] and is tracked by our
database. Since the KG, CG and Singleton sets of data are all weeded for
poor quality, it follows that every clone in the IMAGE collection falls
into one of four categories: 1) matches KG, 2) matches CG, 3) Singleton,
4) annotated poor quality.
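Picking out Singletons can be sketched as below. This minimal Python
illustration assumes that "does not match some other sequence" corresponds
to an L1 cluster with no members other than the sequence itself; the
function name and input layout are hypothetical.

```python
def find_singletons(l1_clusters):
    """Return accessions whose L1 cluster contains no other sequence.

    l1_clusters: accession -> collection of similar accessions.
    """
    return sorted(acc for acc, similar in l1_clusters.items()
                  if not (set(similar) - {acc}))
```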
Further Information
Please visit the IMAGEne links page for more information about the tools and algorithms that IMAGEne uses, including FASTA, Megablast, SIM and more.