Fall 2010 Gen/ComS/BCB 596 Project 1

Extract useful data from GenBank files using Perl

Due: Thursday, September 30 at 11.59 pm.

The Story

I received the following email from a previous student, and it gave me an ideal for our Perl project:

Dear Dr. Chou,

My name is Jane Dole and I am a recent BCB graduate. I am now a postdoc at Wonderful University Medical Center. I took a position doing comparative genomics with strains of Helicobacter pylori. Unfortunately I spent 96% of my time at ISU doing bench work so comparative genomics is very new to me even though I took your genomics data processing class several years ago. In your class we examined the use of Perl to parse GenBank files. I am currently trying to use BioPerl to extract data from a GenBank file, however, I am having difficulties extracting the gene coordinates that appear after CDS. Do you have any suggestions on how to pull this data? I need to create a file that looks like the following:

 

GI number for gene1/5’coordinate/3’coordinate/translation/function

GI number for gene2/5’coordinate/3’coordinate/translation/function

GI number for gene3/5’coordinate/3’coordinate/translation/function

Etc.

 

Thank you for your help,

Jane Dole

This is a real world example from a current postdoc that illustrates the kind of jobs many of you will face in the future. There is no existing solution that exactly produce the output as desired by the student. It is not hard to write a small Perl program to perform the job (my solution is just 40 lines of Perl code, including headers.) Therefore, can you solve this problem? No, we do not need to use BioPerl for this simple problem. In the following, we define the problem to be solved and provide some hints about how the Perl program may be constructed.

 

The Project

Download the Helicobacter pylori P12 strain plasmid genome (smaller for testing) or complete genome (bigger for actual computation) to your computer where you can execute Perl. These two are standard GenBank genome files containing among other information, the coding sequence information identified by the CDS tag. For example, the following is a CDS record example:

     CDS             complement(8..1171)
                     /locus_tag="HPP12_p01"
                     /codon_start=1
                     /transl_table=11
                     /product="MccC"
                     /protein_id="ACJ08728.1"
                     /db_xref="GI:210133738"
                     /translation="MNKALKVNILSYYCATMLLVIAQSLPHAILTPLLLNKGLSLSQI
                     MIIQATYSLCVLLSEYPSGVLADLINRKMLFIVSKLVLVCSFFCVLWLDSFVSMVLAW
                     GIYGLSNALDSGTIDATLITNIKDQKEDLSPFIAKSRQISFIGLIIGSCSGSFLYLKY
                     GISMYYISVLLTLLCISVIILFFKDDSSHKLEKKQQLQILKQHVRESFRELKDKPLLK
                     NLILLSLILQIFFQSHFQLWQAYFLDKQITENYLFIFYISFQIIGILVHFCNANSYNR
                     YMTITMIFLTGLASALLLSMDILLFVSAYIISVALFTYIDFIVSYQFSLYVSKEKISS
                     LTSLKSSVSRVISVIILGITSLELLIFSPTIVIAIHFIVTLILTCVFLYKVKI"

You can see that it contains several subfields including the coordinate of the gene encoding region on the genome (0..1171), whether it's on the reverse strand or not (complement), the Gene Index number (GI) on GenBank, the function of the gene (MccC; sometimes it may be just a hypothetical gene), and the protein sequence this gene encodes (translation). A bacteria CDS is much easier to decipher since it does not usually involve multiple exons and introns. For example, the CDS from a eukaryote genome might look like the following:

     CDS             complement(join(9536..9578,9660..9741,9833..9953,
10043..10108,10203..10310,10533..10598,10697..10725,
10816..10957))

You can see that it can be much more difficult to parse. Luckily, for this project, you only need to consider the simpler bacteria CDS fields. You will need to extract the GI number, start and stop coordinates on the gene reading direction, its protein translations and its function. Note that if the gene is on the reverse strand (i.e., with the complement keyword), then your start coordinate would be larger than your stop coordinate. A sample output after processing the smaller plasmid GenBank file is provided for you to compare with your results.

 

The Solution Strategy

In the following we provided some suggestions on how this Perl program might be constructed. This is a skeleton of a Perl solution. It contains the steps involved in the computation but not the actual Perl statements that implement these steps. In another word, this solution contains the algorithms but not the specific Perl code. Your job is to try to create actual Perl statements that can carry out these steps:

while (<>) {  # read each input line
chomp; # remove \n
if (/it is a CDS line?/) {
if (it is complement?) {
save start and stop for reverse strand
} else {
save start and stop for forward strand
}
} elsif (/it is a product line?/) {
save the gene function text
} elsif (/it is the Gene Index line?) {
save the Gene Index number
} elsif (/it is the protein translation line?/) {
save the first line of protein sequence
if (this is also the last protein line) {
output everything we got for this CDS
} else {
set a flag to note that we are collecting more protein lines
}
} elsif (we are collecting more protein lines?) {
concatenate the next protein line to our collection
if (this is the last protein line?) {
reset the flag of protein collection
output everything we got for this CDS
}
}
}

You will need to declare some variables that collect the GI number, start, stop, protein sequence and the function. Pay attention to the input data and see if you can find a particular string or pattern that allows you to identify the various if conditions set in the above example. Note in particular that the double quotes are NOT part of the data you should collect, but they serve as good markers of the text you indeed should extract and output. Feel free to use the Perl Programming Discussion area on WebCT to ask questions if you need more helps or are having problem implementing a particular feature of the program.

 

The Scoring Scheme

This project calls for five output per each CDS record. Here is the tentative scoring scheme for the project:

  • 3 points for being able to scan the input file and generate some sort of output without any Perl errors. Note that you should not be satisfied with just testing the two sample GenBank files we provided. Take a look at the other bacteria genomes and check if your program also work on a few other genomes since that is what we will be doing to test the general applicability of your program.
  • 8 points for the successful output of the GI index number, the start, the stop and the function text. That's 2 points for each data successfully collected and output. Note that these are all one liners in the input files.
  • 4 points for the successful output of the protein sequences. This is harder because it spans multiple lines in the input files.
  • The total of this project will be 15 points of the semester score for 596 students. For 690H students, you are welcome to try this project although you don't need to submit anything.

It may be a good time now to review the project submission guidelines and to go into your class account to check that you can run the submit program. For any questions about this project, please post them to WebCT discussions.

Last modified August 23, 2010. All rights reserved.