BCB 444/544X Lab 9 - Revised 10/23/05
Gene Prediction
Objectives
Introduction
Drena will provide this in lab!
Exercises
Required questions are in red.
Turn in the answers to the questions by emailing them to terrible@iastate.edu by noon on Monday.
Exercise 1 Genome Browsers
We will first use some genome browsers to see what kinds of information are available to help us in predicting genes.
Go to the UCSC Genome Browser and take a look at what is available. The input boxes at the top of the page allow you to choose which region of which genome you want to look at. For our purposes, we can accept the values that are already there (which should display a region of human chromosome 7). Click on the submit button to get started.
Take a look at everything this server allows you to do. The top of the page has navigation controls that let you move upstream and downstream, zoom in or out, or jump directly to a position. Below that, there is a picture of the entire chromosome with a red line showing the region you are currently looking at. The main box contains a graphical representation of a huge amount of information, including sequence markers, known genes, ESTs, conservation, SNPs, etc. The rest of the page lists the available tracks.
Since we are working on gene prediction, we should see what gene prediction tracks are available and display them. Click on the hide all button to clear all selected tracks. Then scroll down the page and select some of the gene prediction tracks. Play around with the different display options (dense, squish, pack, full) to see which options you like best. Notice the differences between the gene predictions made by the different methods.
Next, go to PlantGDB and browse the Arabidopsis genome. This site was created and is maintained by Volker Brendel and his group here at Iowa State. The browser at this site shows less information, but displays it very clearly. Everything is color coded so that you can tell at a glance what evidence there is for a gene in the region. The color key on the left side of the page shows what the different color boxes mean. The labels are all self-explanatory except for UCA, which stands for User Contributed Annotation. One of the features of PlantGDB is that users can look at the EST, mRNA, cDNA, and outside evidence and contribute annotations to the genome project.
Exercise 2 Gene Prediction
The human uroporphyrinogen decarboxylase gene (URO-D, U30787) is used in this exercise. An SP1 transcription factor binding site, a TATA box, and 10 exons on the forward strand have been annotated in the 4514 bp sequence.
1. Go to the EBI database and download the URO-D sequence, both in FASTA format (for use in the exercises below) and in the default format.
http://www.ebi.ac.uk/cgi-bin/emblfetch
(you can download the sequence in several formats from here)
2. Use GeneID http://www1.imim.es/geneid.html to predict splice sites and START and STOP codons in the sequence.
Identify the real sites among the predictions.
Do they tend to show higher scores?
3. Now, use GeneID to predict all possible exons.
Compare the exon predictions with the real exons.
Why is the initial exon not included in the final gene assembly?
4. The initial exon is not detected by ab initio methods or homology searches.
(What does ab initio mean, in the context of gene prediction methods?)
Explain this observation.
5. Use GENSCAN http://genes.mit.edu/GENSCAN.html & FGENESH http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfind with parameters from other species (try a plant, a non-vertebrate animal, and a yeast) to predict genes in the URO-D sequence.
Discuss the results.
Now, repeat these predictions using the appropriate parameters (i.e., those for human).
How much improvement do you observe?
6. Locate the region in the Drosophila genome that encodes the URO-D gene and use GeneID, GENSCAN and FGENESH with human parameters to make the predictions.
Compare with the predictions using the Drosophila parameters.
What differences can be noted?
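To see why the predictors in this exercise report so many candidate sites, it can help to list every place where the bare signals occur. The following is a minimal sketch, not the method used by GeneID, GENSCAN or FGENESH: it only finds exact matches to ATG, the three stop codons, and the GT/AG dinucleotides found at the ends of most introns, with no scoring at all. The function name find_signals and the toy sequence are made up for illustration; substitute the URO-D sequence you downloaded in FASTA format.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_signals(seq):
    """Return 0-based forward-strand positions of unscored candidate signals."""
    seq = seq.upper()
    signals = {"start": [], "stop": [], "donor": [], "acceptor": []}
    for i in range(len(seq) - 1):
        dinuc = seq[i:i + 2]
        if dinuc == "GT":
            signals["donor"].append(i)       # introns typically begin with GT
        elif dinuc == "AG":
            signals["acceptor"].append(i)    # introns typically end with AG
        codon = seq[i:i + 3]
        if codon == "ATG":
            signals["start"].append(i)       # candidate START codon
        elif codon in STOP_CODONS:
            signals["stop"].append(i)        # candidate STOP codon (any frame)
    return signals

# Hypothetical toy sequence, NOT part of URO-D.
toy = "CCATGGTAAGTCTTTCAGGATTGA"
for name, positions in find_signals(toy).items():
    print(name, positions)
```

Even this 24 bp toy sequence produces several candidates of each type, which is why the real programs attach a score to each site and why the scores of the annotated sites are worth comparing with the rest.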
Exercise 3 Promoter Prediction
The promoter region of the human obese gene (leptin, U43589) includes 3 regulatory elements that have been annotated: an SP1 site, a cEBP box, and a TATA box. The sequence can be downloaded from the EBI database.
1. Go to the TRANSFAC database http://www.generegulation.com/pub/databases.html#transfac and obtain the matrix representing the TATA box. You may be presented with several potential TATA motifs.
Find the one motif that is bound by TATA binding factor (TBP or TBF) and save the header information for this motif.
Carefully read the comments of the record.
How many sites were used to build this matrix?
2. Repeat the above process for SP1 and cEBP.
How many sites were aligned to build their matrices?
Is there any relationship between the quality of the predictions and the number of collected binding sites?
3. Access the program MATCH, which can be used to scan sequences for potential transcription factor binding sites.
http://www.gene-regulation.com/pub/programs.html#match
If you haven't done so before, you will need to register for access to this program, but registration is free.
http://www.gene-regulation.com/register
Scan the promoter sequence using the full collection of vertebrate matrices.
Identify the real binding sites in the output.
To do this, you need the coordinates of the real binding sites. They are provided below in GFF format as positions in the sequence, in which the transcription start site (TSS) is at position 1000.
Coordinates of the 3 annotated elements (in GFF format):
U43589 SP1 904 909 SP1 # GGGCGG
U43589 CEBP 947 956 cEBP # GTTGCGCAAG
U43589 TATA 972 977 TATA # TATAAG
4. Repeat #3 above, using the program MATINSPECTOR
http://www.genomatix.de/cgi-bin/matinspector/matinspector.pl
The link provided for MatInspector in the original version of this lab didn't work - you probably Googled it and discovered that you must also register for free access to this software. It may take a while to get the password back (via email). If you do not get a response from MatInspector (or don't feel like waiting for one), just choose another promoter prediction program from your textbook (or from that excellent optional review Drena recommended in lecture - see PPTs from Friday) and try it out. Answer Question 5 based on the program you chose instead of MatInspector.
5. Which program do you like better and why?
6. Use BLAST2SEQ at NCBI to align the human and mouse promoters (U43589 & U36238) and obtain a graphical output. Set a very restrictive mismatch penalty (-5) and a neutral gap extension penalty (0) to recover short, highly conserved stretches of genomic sequence.
Compare the alignment blocks with the annotations.
Are the real binding sites conserved in these promoters?
If you actually try to set both the gap initiation and gap extension penalties to 0, you should get an error message. Play around with the mismatch and gap penalty settings and examine the results.
Do any of these settings allow you to detect the "real" TF binding sites noted in 3.3 for the human promoter?
7. Now, repeat #6 using the promoter region of URO-D homologs from as many species as possible.
Search for conserved elements, using MEME and MAST. http://meme.sdsc.edu/meme/intro.html
Try this with CLUSTAL if you like.
Can conserved elements be identified across all the promoter sequences?
Do they correspond to the known binding motifs?
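When checking program output against the annotated leptin promoter elements from step 3, a small script can take over the coordinate bookkeeping. The sketch below parses the three records given in step 3 and tests whether a predicted hit overlaps an annotated site. The helper names (parse_sites, relative_to_tss, overlaps) are hypothetical, and any hit coordinates you pass in must come from your own MATCH or MatInspector output. The offset from the TSS is computed as a simple subtraction (position minus 1000); whether the TSS itself is counted as +1 is a convention this lab does not specify.

```python
TSS = 1000  # position of the transcription start site, as stated in step 3

# The three annotated elements, copied from step 3.
GFF_TEXT = """\
U43589 SP1 904 909 SP1 # GGGCGG
U43589 CEBP 947 956 cEBP # GTTGCGCAAG
U43589 TATA 972 977 TATA # TATAAG
"""

def parse_sites(text):
    """Parse the simplified records into dicts with name, start and end."""
    sites = []
    for line in text.splitlines():
        record = line.split("#")[0].split()  # drop the trailing comment
        if len(record) < 4:
            continue                         # skip blank or malformed lines
        sites.append({"name": record[1],
                      "start": int(record[2]),
                      "end": int(record[3])})
    return sites

def relative_to_tss(pos, tss=TSS):
    """Offset of a sequence position from the TSS (assumed: plain subtraction)."""
    return pos - tss

def overlaps(hit_start, hit_end, site):
    """True if a predicted hit shares at least one base with an annotated site."""
    return hit_start <= site["end"] and site["start"] <= hit_end

sites = parse_sites(GFF_TEXT)
for site in sites:
    print(site["name"], relative_to_tss(site["start"]), relative_to_tss(site["end"]))
```

Under this convention the TATA box lies at about -28 to -23, which matches the usual position of a TATA box a few tens of bases upstream of the TSS.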