Fall 2007 Gen/ComS/BCB 596 Project 4

Finding genes in genomic DNA sequences

Due: Tuesday, November 6 at 11 am.

In this project, your job is to recognize, isolate, and suggest functions for the genes contained in the two genomic DNA sequences ~cs596/pub/proj4data/MBK21.seq and ~cs596/pub/proj4data/MGH6.seq. Note that these two sequences have nothing to do with the easy and moderate data sets you worked on in project 3. To give everyone a common starting point for project 4, regardless of how your project 3 assemblies turned out, we have obtained these two new BAC sequences for you.

Grading: (100pt)

Compose your GenBank report for each sequence using sequin. You can use the program already installed in the class account, or download and install a copy on your own computer. You are actually encouraged to do the latter, so that you can practice downloading and installing software. Your reports should contain predicted gene structures, including their exon locations and compositions (cf. the file ~cs596/pub/script/arabidopsis for an example), and their predicted (often called putative) functions. A function of "unknown" is acceptable for some genes; sometimes the honest thing to say is that you do not know what a gene does. However, it is unlikely that no function can be guessed for any of the genes in these two BAC sequences. Therefore, reports that contain only unknown genes will be considered incomplete, and penalties will be applied.
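
As a rough illustration of how exon locations relate to the sequence itself, the following Python sketch joins a set of exons into a spliced sequence and tabulates its base composition. It assumes that "composition" means base composition and that exons are given as 1-based inclusive (start, end) pairs on the forward strand; the function names and the toy sequence are hypothetical and are not part of sequin or of the class scripts.

```python
from collections import Counter

def spliced_sequence(genomic, exons):
    """Join exons given as 1-based inclusive (start, end) pairs (forward strand)."""
    # Sort by start so the exons are concatenated in genomic order.
    return "".join(genomic[start - 1:end] for start, end in sorted(exons))

def base_composition(seq):
    """Return the fraction of A, C, G and T in seq (empty dict for empty input)."""
    total = len(seq)
    if total == 0:
        return {}
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) / total for base in "ACGT"}

# Toy example with made-up data: two exons of three bases each.
genomic = "AAACCCGGGTTT"
exons = [(1, 3), (7, 9)]
mrna = spliced_sequence(genomic, exons)
composition = base_composition(mrna)
```

Genes on the reverse strand would additionally need to be reverse-complemented, which this sketch does not handle.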

 

After you have finished your project, use the submit program to submit your results for grading. Do NOT follow the GenBank submission guideline in sequin all the way through and actually submit your report to GenBank, although you should become familiar with that procedure for the day when you do have data to submit to GenBank.

Methods:

There are several steps involved in annotating raw DNA. Most of the details have been covered in class. Here is just a summary:

  1. Determine gene structures.
    The goal here is to determine the structure of the genes in a genomic DNA sequence. There are two different strategies for this step: one based on pre-trained machine-learning programs that recognize plausible gene structures (like Genscan), the other based on comparison against known ESTs, proteins and other information previously stored in public databases (like the AAT tools). You should try both kinds of methods and combine their results in sequin. You can edit the results there based on your best judgment.
  2. Predict gene functions.
    After you have identified a possible gene in the genomic DNA, sometimes called an Open Reading Frame (ORF), your next step is to guess its function based on bioinformatics tools and database searches. NOTE: no software-based prediction can be 100% correct, so true confirmation always has to come from the wet lab. That being said, it is common practice to determine gene functions computationally first, then leave the confirmation research to the interested users of the data. Your most valuable resources for this step are likely to be GenBank, Swiss-Prot and other large biological data repositories on the Internet.
  3. Write up GenBank reports.
    Finally, to let the world share your findings, you need to submit them to GenBank. Of course we are just doing a fictional student project and you are NOT going to submit your data to GenBank. However, you should still summarize your data rigorously using the sequin GenBank report generation tool. A sample of a completed GenBank report can be seen in ~cs596/pub/scripts/arabidopsis.
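
To make the notion of an ORF in step 2 concrete, here is a minimal Python sketch that scans all six reading frames for stretches running from an ATG to the next in-frame stop codon. It is only an illustration under simplifying assumptions: it ignores introns, so it cannot recover spliced gene structures and is no substitute for Genscan or the AAT tools. The function names and the min_codons threshold are hypothetical choices made for this example.

```python
# Naive six-frame ORF scan. Illustration only: eukaryotic genes contain
# introns, which this scan does not model.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=100):
    """Return (strand, start, end) tuples for ATG...stop stretches.

    Coordinates are 1-based, inclusive, given on the forward strand,
    and include the stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG since the last stop
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # exclusive end, stop codon included
                    if (end - start) // 3 - 1 >= min_codons:
                        if strand == "+":
                            orfs.append((strand, start + 1, end))
                        else:
                            # Map reverse-strand positions back to the forward strand.
                            orfs.append((strand, n - end + 1, n - start))
                    start = None
    return orfs

# Toy example with a made-up sequence and a very low threshold.
toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "GG"
toy_orfs = find_orfs(toy, min_codons=3)
```

On real BAC-sized sequences, candidate regions found this way would still need to be compared with the Genscan and AAT predictions before anything is entered in sequin.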

Resources:

The following is a non-exhaustive list of tools and resources that may be helpful to you for doing this project:

  1. Genscan. This is available in your class account. Just type "genscan".
  2. FGENESH. This is similar to Genscan and can be tested at "http://www.softberry.com/berry.phtml".
  3. Sequin. This has been installed in your class account as well. Type "sequin".
  4. GenBank. If you haven't heard about this place, you are likely in trouble. Point your browser to "www.ncbi.nlm.nih.gov".
  5. Swiss-Prot protein database. Point your browser to "http://www.expasy.ch/". There are other goodies on this site.
  6. AAT tools for gene finding. Go to "genome.cs.mtu.edu".
  7. PFam protein domain search site, at "pfam.wustl.edu".
Last modified July 13, 2007. All rights reserved.