About This Document

All annotations throughout this document refer to the IMAGEne Build Procedure Diagram, a flowchart that graphically describes the IMAGEne Build process. Note that each annotation number corresponds to a numbered circle in the Build Procedure Diagram.

What is clustering?

Clustering is the process of finding small groups (or clusters) of similar sequences.  The IMAGEne build process does this by taking all the IMAGE clone sequence data and first assembling it against Known Genes (which we sometimes call the "KG set" or just KG), defined here as the mRNA sequences from NCBI's Reference Sequence project (http://www.ncbi.nlm.nih.gov/RefSeq/).  The sequences left unclustered (plus TIGR ESTs in the case of the human portion of IMAGEne) are used to build the Candidate Gene clusters (the CG set).  Once those are complete we are left with only Singletons, which are high quality clones for which we can find no homologous clones from the same species.

Collecting Data

A separate process runs nightly to synchronize our Oracle database with the data we want from the database at NCBI.  It automatically fetches the data through FTP, parses the records and compares them with what we already have, either creating new records or updating existing ones 1.  A key contributor to the cluster quality of IMAGEne is that all the ESTs are masked for repeats using the University of Washington Repeat Masker.  This is done on every sequence as it is encountered for the first time by our systems.  Needless to say, this is computationally very expensive.
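The create-or-update step of the nightly synchronization can be sketched as follows. This is only an illustration: a plain dictionary stands in for the Oracle database, and the function name, record layout and accession numbers are hypothetical rather than taken from the actual system.

```python
def sync_records(local, fetched):
    """Merge freshly fetched records into the local store.

    local   -- dict mapping accession -> record dict (stand-in for the database)
    fetched -- iterable of record dicts, each with an 'accession' key
    Returns (created, updated): lists of the accession numbers affected.
    """
    created, updated = [], []
    for record in fetched:
        acc = record["accession"]
        if acc not in local:
            # Never seen before: create a new record.
            local[acc] = dict(record)
            created.append(acc)
        elif local[acc] != record:
            # Known accession whose data changed: update in place.
            local[acc] = dict(record)
            updated.append(acc)
    return created, updated


# Placeholder accession numbers, for illustration only.
local_db = {"AA000001": {"accession": "AA000001", "sequence": "ACGT"}}
nightly = [
    {"accession": "AA000001", "sequence": "ACGTT"},
    {"accession": "AA000002", "sequence": "GGCC"},
]
print(sync_records(local_db, nightly))
```

In this sketch the newly created accessions are the natural place to trigger repeat masking, since masking is done the first time a sequence is encountered.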

Preparing for the Known Gene Build

Our first step is to pull all the IMAGE sequences for the target species from our database and weed out all the undesirable sequences, which are:
  1. All sequences that lack a region of at least 50 contiguous base pairs of nonmasked sequence.  Put another way, a sequence must contain at least one uninterrupted run of 50 or more As, Cs, Ts and Gs to be included in the build.
  2. All sequences reported as a problem in our problem database, where the problem report originated from a nameable source (i.e. not from the general public).
  3. All sequences that the submitter has tagged as overall poor quality in the original GenBank record.
Note that the CG build also uses these criteria, although they are applied in a different portion of the build, as explained in the CG section.
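The three criteria above can be sketched as a simple filter. This is a minimal illustration, assuming that masked bases are written as N or X (so that any run of A, C, G and T is nonmasked) and that the problem and poor quality reports are available as sets of accession numbers. The function and parameter names are hypothetical.

```python
import re

MIN_UNMASKED_RUN = 50  # criterion 1: minimum contiguous nonmasked bases

# Assumption: masked bases appear as N or X, so any run of A/C/G/T is nonmasked.
_UNMASKED_RUN = re.compile(r"[ACGT]{%d,}" % MIN_UNMASKED_RUN)


def has_unmasked_run(masked_sequence):
    """Return True if the sequence has at least 50 contiguous A/C/G/T bases."""
    return _UNMASKED_RUN.search(masked_sequence) is not None


def is_undesirable(accession, masked_sequence,
                   problem_accessions, poor_quality_accessions):
    """Apply the three weeding criteria; True means the sequence is dropped."""
    return (
        not has_unmasked_run(masked_sequence)        # criterion 1
        or accession in problem_accessions           # criterion 2
        or accession in poor_quality_accessions      # criterion 3
    )
```

Keeping the three tests separate makes it easy to apply only the first two during CG data gathering and the third later, as the CG section describes.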

The Known Gene Build

For the KG build, sequences meeting any of the three undesirable sequence criteria are eliminated from the data set, and from the remaining sequences we build a custom Megablast database (the same type of database that BLAST accepts).  Then we create a FASTA file of all reference sequences and iterate over each one, megablasting it against the sequence database (containing both ESTs and Full Insert sequences) to find homology between them.  Regardless of how similar they are, all sequences for which there were matches are assembled and compared against the reference sequence using the FASTA program 3.  At this point, we consider all sequences with an OPT score of 1000 or more to be a sufficient match, and with those we create alignment data using SIM 4.  Each cluster is given a cluster ID, which is the accession number of the associated reference sequence.  This three-tiered process is considered one job, which is run using a program called the Sun Grid Engine (SGE).  The SGE is powerful software that allows us to utilize the compute power of many machines instead of just one.
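The OPT score cutoff can be illustrated with a short sketch. It assumes the FASTA results have already been parsed into a dictionary of OPT scores per reference sequence; the data layout and function name are hypothetical, and reference sequences with no qualifying members are simply left out of the result here.

```python
OPT_THRESHOLD = 1000  # minimum OPT score treated as a sufficient match


def build_kg_clusters(fasta_results, threshold=OPT_THRESHOLD):
    """Group sequences into KG clusters keyed by reference accession.

    fasta_results -- dict: reference accession -> {sequence accession: OPT score}
    Returns dict: reference accession (the cluster ID) -> sorted member list.
    """
    clusters = {}
    for ref_accession, opt_scores in fasta_results.items():
        members = sorted(
            acc for acc, opt in opt_scores.items() if opt >= threshold
        )
        if members:
            clusters[ref_accession] = members
    return clusters
```

The members returned for each cluster are the sequences that would go on to the alignment step.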

Now that we know which sequences are homologous to which reference sequences, we place this data in our database; it will be used after the IMAGEne build is complete to drive our web interface. Furthermore, to extract the desired data set for the CG build, we generate what are called "knockout lists," which are lists of GenBank accession numbers to be removed from the CG build. There are four types, the first three of which are generated after the KG build:

  1. KG Knockout List: contains GenBank accession numbers of all KG cluster members 5, i.e. all sequences clustering to the reference sequences.
  2. Removed from GenBank: contains all accession numbers in our database that have, for one reason or another, been removed from GenBank.
  3. TIGR Knockout List: contains all TIGR accession numbers whose sequences are homologous to reference sequences (for the human build only).
  4. Ribosomal Knockout List: contains all Ribosomal Cluster sequence accession numbers (generated after the previous CG build and explained below).
All accession numbers on these lists are eliminated from the CG data set before we begin to form L1 clusters, described below.
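Applying the knockout lists amounts to a set difference. The sketch below assumes each list is available as a collection of accession numbers; the function name and the accession values in the usage example are hypothetical.

```python
def apply_knockout_lists(candidates, *knockout_lists):
    """Return the candidate accessions that appear on none of the knockout lists.

    candidates     -- iterable of accession numbers considered for the CG build
    knockout_lists -- any number of iterables of accession numbers to remove
    The original order of the candidates is preserved.
    """
    knocked_out = set().union(*knockout_lists)
    return [acc for acc in candidates if acc not in knocked_out]


# Usage with placeholder accession numbers.
kg_list = ["s1", "s2"]
removed_from_genbank = ["s5"]
ribosomal_list = ["s7"]
print(apply_knockout_lists(["s1", "s3", "s5", "s6", "s7"],
                           kg_list, removed_from_genbank, ribosomal_list))
```

Because the lists are passed as separate arguments, a build that has no TIGR or Ribosomal list yet can simply omit it.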

The Candidate Gene Build

As described in the previous section, all available knockout lists are applied 6 before the CG build begins.  Later we will elaborate on the identification and removal of ribosomal sequences using this method.

The Candidate Gene build is intended to cluster all homologous sequences together into related groups.  Sequences tend to cluster together if they come from the same gene.  Thus, this data supplements the KG set by providing data that could represent genes for which no reference sequence (a definitive source) is available.  The CG build is a bit more complex than the Known Gene build, and takes place afterwards because it relies on results from the KG build to get started.  The CG build has five stages:

Data Gathering

As described in "Preparing for the Known Gene Build," data is extracted for the CG build.  At this stage, only the first and second criteria for weeding out undesirable sequences are applied.  The third criterion is applied after L2 clustering.

L1 Clustering

This is effectively an N x (N-1) Megablast, meaning that every sequence is megablasted against every other sequence except itself in order to form what we call L1 clusters.  An L1 cluster 7 is just a list of the sequences that the sequence in question is similar to.  Similarity is defined as 95% identity and a minimum length of 50 base pairs.  In earlier builds of IMAGEne, BLAST was used instead of Megablast.  Because of the extreme computational power needed to blast millions of sequences against each other (the original human NxN blast for IMAGEne 3.0 was done on 1200 CPUs of the IBM supercomputer ASCI Blue), we structured this algorithm to reuse a previous build's NxN blasts (e.g. 3.1 used the L1 clusters derived in 3.0).  A subsequent build therefore only needs to blast the new sequences obtained since the build date of the previous build.

This feat is accomplished by taking the N sequences of the previous build, adding M new sequences and recreating the blast database (see diagram).  We then blast all the new sequences against the combined database.  Finally, we add the L1 clusters from the previous build to the newly obtained L1 clusters and continue to L2 clustering.

By definition, if there are N+M sequences total in the CG build then there will be N+M L1 clusters at this point.  Said another way, for every sequence in an IMAGEne build there is an L1 cluster.

Diagram of N X N build vs. M X N build
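The L1 step and its incremental shortcut can be sketched as follows. This is an illustration only: it assumes the Megablast output has been parsed into (query, subject, percent identity, alignment length) tuples, it reads "95% identity" as "at least 95%", and all function names are hypothetical. The last function counts ordered comparisons to show how much work reusing the previous build saves.

```python
MIN_IDENTITY = 95.0  # percent identity (assumed to mean "at least 95%")
MIN_LENGTH = 50      # minimum alignment length in base pairs


def l1_from_hits(queries, hits):
    """Build one L1 cluster per query from (query, subject, identity, length) hits."""
    clusters = {acc: set() for acc in queries}
    for query, subject, identity, length in hits:
        if (query in clusters and query != subject
                and identity >= MIN_IDENTITY and length >= MIN_LENGTH):
            clusters[query].add(subject)
    return clusters


def incremental_l1(previous_l1, new_accessions, new_hits):
    """Reuse the previous build's L1 clusters; add clusters for new sequences only."""
    combined = {acc: set(members) for acc, members in previous_l1.items()}
    combined.update(l1_from_hits(new_accessions, new_hits))
    return combined


def comparisons(n_old, m_new):
    """Return (full, incremental) counts of ordered sequence comparisons."""
    total = n_old + m_new
    full = total * (total - 1)          # every sequence against every other
    incremental = m_new * (total - 1)   # only the new sequences as queries
    return full, incremental
```

With N previous sequences and M new ones, the result of `incremental_l1` always holds N+M clusters, one per sequence, matching the statement above.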

L2 Clustering

Now we bring together all L1 clusters to get the maximum possible representation of the gene from which the clones of each cluster originated.  We pull two or more L1 clusters together into an L2 cluster 8 if either of two criteria holds: two sequences, one in each L1 cluster, came from the same clone, or the L1 clusters share a particular sequence, as determined by its unique accession number.  Further, there must be at least two occurrences in total of these criteria to pull two L1 clusters together (e.g. two L1s could merge because they share two sequences, or because they share one sequence and two other sequences came from the same clone).

Diagram of two L1 clusters merging into one L2 cluster
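One way to read the merge rule is as an evidence count between two L1 clusters: one point for every shared sequence, plus one point for every clone that contributes a different sequence to each cluster. The sketch below follows that reading. It is an interpretation rather than the actual implementation, and the function names, the `clone_of` mapping and the all-pairs comparison (a real build would need an index to avoid comparing every pair) are assumptions made for illustration.

```python
from itertools import combinations


def merge_evidence(a, b, clone_of):
    """Count merge evidence between two L1 clusters given as sets of accessions."""
    shared = a & b
    # Clones behind the sequences found in only one of the two clusters.
    clones_a = {clone_of[s] for s in a - shared if s in clone_of}
    clones_b = {clone_of[s] for s in b - shared if s in clone_of}
    return len(shared) + len(clones_a & clones_b)


def build_l2_clusters(l1_clusters, clone_of, min_evidence=2):
    """Merge L1 clusters with enough evidence into L2 clusters (union-find)."""
    seeds = list(l1_clusters)
    parent = {s: s for s in seeds}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for x, y in combinations(seeds, 2):
        if merge_evidence(l1_clusters[x], l1_clusters[y], clone_of) >= min_evidence:
            parent[find(x)] = find(y)

    groups = {}
    for s in seeds:
        groups.setdefault(find(s), set()).update(l1_clusters[s])
    return sorted(sorted(group) for group in groups.values())
```

Because merging is transitive in this sketch, two L1 clusters with no direct evidence between them can still end up in the same L2 cluster through a third.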

Additional weeding of undesirable sequences

At this point we have thousands of L2 clusters, for which we build FASTA files for later assembly, and we further weed from these clusters the sequences meeting the third criterion for undesirable sequences (referred to as low quality sequences) 9.  This is not done earlier because some sequences labeled low quality may help pull together two L1 clusters, giving us a better chance of discovering a more representative sequence for each gene.  We also remove, via knockout list, all sequences that form the "Ribosomal Cluster," which is extremely large due to the high level of homology among ribosomal DNA.  The Ribosomal Cluster stands out for elimination because it is very large compared to the rest of the data set (e.g. of the 960926 sequences going into the mouse build of IMAGEne 4.0, an unusually large 117520 sequences were in the same cluster).  At this stage, the Ribosomal Cluster is thrown out, and the list of its members is retained and reapplied at an earlier stage of subsequent builds (see above), thus reducing the amount of megablasting needed in those builds.
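To put the mouse example in numbers, 117520 of 960926 sequences is roughly 12% of the whole data set in a single cluster. A size check along these lines could flag such a cluster. Note that the 5% cutoff and the function name below are purely hypothetical; the text above does not state how the Ribosomal Cluster is actually identified, only that it stands out by size.

```python
def oversized_clusters(cluster_sizes, total_sequences, max_fraction=0.05):
    """Return {cluster id: fraction of data set} for unusually large clusters.

    cluster_sizes   -- dict mapping cluster id -> number of member sequences
    total_sequences -- number of sequences that went into the build
    max_fraction    -- hypothetical cutoff, chosen here only for illustration
    """
    flagged = {}
    for cluster_id, size in cluster_sizes.items():
        fraction = size / total_sequences
        if fraction > max_fraction:
            flagged[cluster_id] = fraction
    return flagged


# Figures from the mouse build of IMAGEne 4.0; the cluster ids are placeholders.
sizes = {"largest": 117520, "typical": 40}
for cluster_id, fraction in oversized_clusters(sizes, 960926).items():
    print(cluster_id, "holds", format(fraction, ".1%"), "of all sequences")
```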

Cluster Assembly

With FASTA files containing all high quality sequences of each cluster, we generate a consensus sequence.  We have chosen CAP 10 to do this.  Running CAP is another computationally expensive operation, taking hours on many CPUs.  The output from CAP is then parsed and placed into our database.
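Preparing the per-cluster input can be sketched as below. This only shows how the FASTA text for one cluster might be produced with low quality members left out; it does not run CAP, and the function name, arguments and 60-column line width are assumptions made for illustration.

```python
def cluster_fasta(members, sequences, low_quality=frozenset(), width=60):
    """Return FASTA text for the high quality members of one cluster.

    members     -- iterable of accession numbers in the cluster
    sequences   -- dict mapping accession -> sequence string
    low_quality -- accessions tagged poor quality, which are skipped
    width       -- line width for the sequence body (60 is a common convention)
    """
    lines = []
    for acc in sorted(members):
        if acc in low_quality:
            continue
        seq = sequences[acc]
        lines.append(">" + acc)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""
```

The returned text can be written to one file per cluster, which keeps each assembly job independent and easy to distribute across machines.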

Singletons

At this point any sequence in the CG set that does not match some other sequence in the CG set is considered a Singleton 11 and is tracked by our database.  Since the KG, CG and Singleton data sets are all weeded for poor quality, it follows that every clone in the IMAGE collection falls into one of four categories: 1) matches KG, 2) matches CG, 3) Singleton, 4) annotated poor quality.
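The four categories can be expressed as a small classifier. The order of the checks below is an assumption based on the build order described in this document (quality weeding, then KG, then CG), and the function and argument names are hypothetical.

```python
def classify(accession, poor_quality, kg_members, cg_members):
    """Assign a sequence to one of the four categories named above.

    poor_quality -- set of accessions annotated as poor quality
    kg_members   -- set of accessions that clustered to a reference sequence
    cg_members   -- set of accessions that fell into a Candidate Gene cluster
    """
    if accession in poor_quality:
        return "annotated poor quality"
    if accession in kg_members:
        return "matches KG"
    if accession in cg_members:
        return "matches CG"
    # High quality, but no match in either set.
    return "Singleton"
```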

Further Information

Please visit the IMAGEne links page for more information about the tools and algorithms that IMAGEne uses, including FASTA, Megablast, SIM and more.




© Copyright 1997 All Rights Reserved
LLNL Disclaimer
UCRL-MI-119848
Web page maintained by
imagene@image.llnl.gov