FAQ

What is IMAGE?
What is IMAGEne?
How do you come up with the clusters?
What is the "master array"?
When will the "master array" be made available?
How current is the clustering data, and when will it be updated?
How can I download all the data?
How can I get a listing of the top choices per gene cluster?
How can I get all the Candidate Gene sequences?
How do I cite this data?
How can I look up IMAGEne gene clusters on my web site?
Why can a clone from a known gene cluster also appear in a candidate gene cluster?
What is the cluster ID about?
How does IMAGEne handle alternative splices?
Some known genes have a specified coding sequence that does not start with ATG. Why do these genes occasionally have full-coding clones associated with them?
Why are there so many dashes ('-') before sequence begins for this cluster?

What is IMAGE?

IMAGE stands for Integrated Molecular Analysis of Genomes and their Expression. The I.M.A.G.E. Consortium was founded in 1993 to accelerate gene discovery through the use of arrayed cDNA libraries, and to aid in the accumulation of sequence, map, and expression information for all genes. One of the initial goals of the Consortium was to create a non-redundant set of unique genes representing the complete set of human transcripts, and to provide this resource to the research community as a basis for the analysis of the human genome. Recently, the Consortium has begun to focus on the genomes of model organisms, such as mouse and zebrafish, to complement the work being done with human clones. In addition, the I.M.A.G.E. Consortium is part of the NCI Cancer Genome Anatomy Project, in which cDNA clones derived from tumor libraries will be used to study gene expression patterns in tumors, and of the NIH Mammalian Gene Collection, which focuses on obtaining full-length cDNA clones.

All clones are available from any of our authorized distributors, and all sequence obtained from the clones is submitted immediately to GenBank.

More information is available through our web page at http://image.llnl.gov

What is IMAGEne?

IMAGEne is a software package for clustering IMAGE clones/ESTs to known genes, and to each other. It is a useful tool to aid in the re-array of IMAGE clones for public distribution. The publicly accessible web interface that you have seen by now serves many purposes.

How do you come up with the clusters?

We created a document expressly to address this. It can be found here.

What is the "master array"?

One of the goals of the IMAGE Consortium is to make a "master array" containing one representative cDNA from each transcriptional unit in the human genome. This complete rearray of cDNAs will become the foundation for further study, including the isolation and sequencing of full-length cDNAs, mapping, gene expression, functional analysis, and molecular analysis of tumor cells in conjunction with the NCI Cancer Genome Anatomy Project.

When will the "master array" be made available?

A variety of rearrays of IMAGE cDNAs are available now through our authorized distributors. Please contact them for the most current descriptions.

How current is the clustering data, and when will it be updated?

The IMAGEne database is rebuilt with each major release of GenBank, which is currently every two months. For details of each release see our release notes.

How can I download all the data?

Email us at imagene@image.llnl.gov to discuss your interest.

How can I get all the Candidate Gene cluster consensus sequences?

Consensus sequences can be obtained via our anonymous FTP site at image.llnl.gov. The data is located in the /image/imagene directory. Each file name includes the IMAGEne version number, so IMAGEne 3.0 data can be found in the file consensus_seq_3.0.fasta.
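As a rough illustration of the naming scheme, the sketch below builds the file name for a given version and fetches it with Python's standard ftplib module. The host, directory, and file name pattern come from the answer above; the function names are hypothetical, and the download assumes the server is reachable and accepts anonymous logins.

```python
from ftplib import FTP

HOST = "image.llnl.gov"        # from the FAQ answer
DIRECTORY = "/image/imagene"   # from the FAQ answer


def consensus_filename(version):
    """Build the consensus file name for an IMAGEne version such as '3.0'."""
    return f"consensus_seq_{version}.fasta"


def download_consensus(version, dest=None):
    """Fetch the consensus FASTA file over anonymous FTP (hypothetical helper)."""
    name = consensus_filename(version)
    dest = dest or name
    with FTP(HOST) as ftp:
        ftp.login()            # no arguments means an anonymous login
        ftp.cwd(DIRECTORY)
        with open(dest, "wb") as out:
            ftp.retrbinary(f"RETR {name}", out.write)
    return dest


print(consensus_filename("3.0"))  # consensus_seq_3.0.fasta
```

Calling download_consensus("3.0") would write the file to the current directory under the same name.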

How can I get a listing of the top choices per gene cluster?

Master and candidate_gold listings can be obtained via our anonymous ftp site at image.llnl.gov. These listings are located within the /image directory and are appended with the IMAGEne version number they were derived from.

A master listing contains the "best" clone for each known gene. A candidate_gold listing is a subset of the master list, containing only full-coding clones.

How do I cite this data?

Use of an IMAGE clone can be referenced as: Lennon, G.G., Auffray, C., Polymeropoulos, M., Soares, M.B. The I.M.A.G.E. Consortium: An Integrated Molecular Analysis of Genomes and their Expression. Genomics 33:151-152 [1996].

Use of the IMAGEne tool can be referenced as: Cariaso, M., Folta, P., Lennon, G., Wagner, M., Kuczmarski, T.; IMAGEne I: The clustering of ESTs corresponding to known genes. Bioinformatics [Volume 15, Number 11 pp 965-973].

How can I look up IMAGEne gene clusters on my web site?

Please see the Linking to Imagene web page.

Why can a clone from a known gene cluster also appear in a candidate gene cluster?

There are both biological and computational reasons for this. Two examples are alternative splicing and low-quality sequence.

An EST may not be included in a known gene cluster even if another EST from the same clone is included. Some reasons for this are sequencing errors or incorrect EST to clone ID associations in GenBank. Beginning with release 3.3, we have made efforts to improve the accuracy of the clusters by noting any problems found/reported with IMAGE clones in our problem database, and excluding those clones from future IMAGEne builds.

What is the IMAGE cluster ID about?

All Cluster IDs not in GenBank format belong to the Candidate Gene clusters and appear in the form CXXXXXX-YY, where X and Y are digits (e.g., C001496-01). The 'C' in the ID stands for cluster, the X's are the cluster number, and the Y's are the contig number within that cluster. So C001496-01 is cluster 1496, contig 1. The candidate gene cluster IDs that we generate have reserved ranges to accommodate multiple species: IDs less than C100000 are for human, and IDs from C100000 up to (but not including) C200000 are for mouse.
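The ID layout described above can be unpacked with a few lines of code. This is only a sketch: the function and type names are hypothetical, and labeling anything at or above C200000 as "unknown" is an assumption, since this FAQ only describes the human and mouse ranges.

```python
import re
from typing import NamedTuple


class ClusterId(NamedTuple):
    cluster: int   # the six X digits
    contig: int    # the two Y digits
    species: str   # derived from the reserved ranges


# 'C', six digits, a dash, two digits
_ID_PATTERN = re.compile(r"^C(\d{6})-(\d{2})$")


def parse_cluster_id(text):
    """Split a candidate gene cluster ID such as 'C001496-01' into its parts."""
    match = _ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a candidate gene cluster ID: {text!r}")
    cluster = int(match.group(1))
    contig = int(match.group(2))
    if cluster < 100000:
        species = "human"
    elif cluster < 200000:
        species = "mouse"
    else:
        species = "unknown"  # assumption: no range is documented here
    return ClusterId(cluster, contig, species)


print(parse_cluster_id("C001496-01"))
```

For the example ID from the text, this yields cluster 1496, contig 1, species "human".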

The difference between a cluster and a contig comes about because of the methods by which we form clusters. A candidate gene cluster is formed by first blasting all ESTs against each other (level 1 clustering) in order to determine similarity. We then merge the resulting sets of similar ESTs if they have more than one clone in common (level 2 clustering). Since clone sequences are ESTs and usually don't represent the full insert of the clone, you can have gaps in the consensus sequence of a cluster. Each gap splits a cluster into contigs, or groups of contiguous sequence. The most common cause for multiple contigs is that the 5' ends of clones cluster together and the 3' ends of clones cluster together, but the two groups of sequences do not overlap with each other.

So in order to see all of the sequence of a candidate gene cluster you must view all of its contigs and realize there are unknown gap lengths between contigs.
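The level 2 merge rule (join two sets when they share more than one clone) can be illustrated with a small sketch. This is not the IMAGEne implementation: it assumes each level 1 result has already been reduced to a set of clone IDs, and the function name and the simple repeated pairwise merge are choices made for this example only.

```python
def merge_level2(groups, min_shared=2):
    """Merge sets of clone IDs that have at least `min_shared` clones in common.

    "More than one clone in common" corresponds to min_shared=2.
    Merging repeats until no pair of sets qualifies any longer.
    """
    merged = [set(group) for group in groups]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if len(merged[i] & merged[j]) >= min_shared:
                    merged[i] |= merged[j]   # absorb the second set
                    del merged[j]
                    changed = True
                    break
            if changed:
                break                        # restart the scan after a merge
    return merged


# Hypothetical level 1 results, one set of clone IDs per group
level1 = [{"a", "b", "c"}, {"b", "c", "d"}, {"d", "e"}]
print(merge_level2(level1))
```

Here the first two sets share two clones (b and c) and are merged, while the last set shares only one clone (d) with the merged set and stays separate.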

How does IMAGEne handle alternative splices?

The IMAGEne algorithms currently make no serious attempt to address alternative splices in the EST cluster for a known gene. This may change in the future. Right now the clustering results for alternatively spliced genes depend on the number of sequences in the cluster and the length of the alternatively spliced section.

Some known genes have a specified coding sequence that does not start with ATG. Why do these genes occasionally have full-coding clones associated with them?

In about 13% of the known genes from GenBank, the specified coding sequence does not start with ATG. (This may indicate an incomplete coding sequence, among other situations.) The current IMAGEne algorithm marks a clone as full-coding if it is homologous to both the 5' and 3' ends of the coding sequence as specified in GenBank, whether or not that coding sequence starts with ATG.
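A minimal sketch of this rule follows. It assumes, purely for illustration, that a clone's homology to a known gene is available as a list of aligned (start, end) intervals in the gene's coordinates; the function names and that representation are hypothetical and are not taken from IMAGEne itself.

```python
def covers(hits, position):
    """True if any aligned interval (inclusive coordinates) contains `position`."""
    return any(start <= position <= end for start, end in hits)


def is_full_coding(hits, cds_start, cds_end):
    """Flag a clone as full-coding when its alignments reach both CDS ends.

    Note that nothing here inspects the first codon, so a coding sequence
    that does not begin with ATG can still receive full-coding clones.
    """
    return covers(hits, cds_start) and covers(hits, cds_end)


# Hypothetical clone with a 5' EST and a 3' EST aligned to the same gene
print(is_full_coding([(10, 450), (1200, 1700)], cds_start=50, cds_end=1600))  # True
print(is_full_coding([(100, 450)], cds_start=50, cds_end=1600))               # False
```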

Why are there so many dashes ('-') before sequence begins for this cluster?

The number of extra dashes ('-') placed before and after each sequence in the Java display is determined by the alignment of the other sequences in the cluster. At least one sequence in the cluster begins at the leftmost edge, and at least one ends at the rightmost edge.

Because clusters can become very large, only the top sequences (currently 500), ordered by relevance, are displayed. This was done in an effort to speed up the HTML and Java display and to avoid bogging down the user's system. If the sequence(s) that determine the leftmost and rightmost edges rank below this cutoff, they will not be displayed, producing the empty space that you are seeing.
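The effect can be reproduced with a toy example. The sketch below is not the actual display code: it assumes each sequence is given as a (start column, bases) pair, ordered by relevance, and the function name is hypothetical.

```python
def render_rows(rows, limit):
    """Pad each displayed sequence with dashes to the full cluster width.

    rows: list of (start_column, sequence) tuples, most relevant first.
    """
    # The edges come from every sequence in the cluster,
    # including those that fall below the display limit.
    left = min(start for start, _ in rows)
    right = max(start + len(seq) for start, seq in rows)
    lines = []
    for start, seq in rows[:limit]:
        lead = "-" * (start - left)
        trail = "-" * (right - start - len(seq))
        lines.append(lead + seq + trail)
    return lines


# The third sequence spans the whole cluster but ranks below the limit.
cluster = [(4, "ACGT"), (5, "GG"), (0, "TTTTTTTTTTTT")]
for line in render_rows(cluster, limit=2):
    print(line)
```

With a limit of 2, both printed rows start with dashes, because the only sequence that reaches the left edge is the one that was cut off.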





© Copyright 1997 All Rights Reserved
LLNL Disclaimer
UCRL-MI-119848
Web page maintained by
imagene@image.llnl.gov