Fall 2007 Gen/ComS/BCB 596 Project 1
Extract useful data from data files
Due: Thursday, September 27 at 11 am.
In the first project we are trying to extract some useful data from two files downloaded from GenBank, lactose and arabidopsis. Both files are available at ~cs596/pub/scripts. You can use any Unix utility, scripting language or programming language to implement this project. However, using Vect coupled with simple Perl code may be your best choice. You must create your solutions as individual executable command files like job1, job2, ... such that the TA only needs to type
job1 lactose
to get the desired output specified in job 1 below. In other words, you need to wrap all functionality of your implementation into script files named job1, job2, etc. and enable the execution bits of the scripts so the TA can run them directly as commands. Although the TA will not have to compose complex commands for you to run your jobs, he may look at your scripts to make sure they are reasonable, e.g., that you didn't hardwire the output into those scripts using a cat command.
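As an illustration of the command-file requirement, here is a minimal sketch of what a file named job1 could look like if you chose Python. The function name process and its pass-through body are placeholders and not part of the assignment; any language with a proper interpreter line works the same way.

```python
#!/usr/bin/env python3
"""Hypothetical skeleton for a command file such as job1."""
import sys


def process(lines):
    """Placeholder: return the output lines for this job.

    Here it only strips line endings; the real job logic would go here.
    """
    return [line.rstrip("\n") for line in lines]


def main(argv):
    """Read the data file named on the command line and print the result."""
    with open(argv[1]) as handle:
        for out_line in process(handle):
            print(out_line)
    return 0


# Run only when invoked as "job1 DATAFILE" (exactly one argument).
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

After saving the file as job1, running chmod +x job1 sets the execution bit so that typing job1 lactose runs it directly.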
Specific descriptions about each job are given below.
Job 1: (10pt)
Our first job is to collect all keywords in the lactose file. This may be useful, for example, to build a gene ontology (GO) database. If you have looked at that file carefully, you know that a keyword list always starts with the KEYWORDS tag, may span multiple lines in the file, and always ends just before the SOURCE tag. The keywords in a list are separated by semicolons (;), and some keywords are multi-word phrases such as "Outer membrane". Your job is to extract all keywords, convert them to lowercase strings, and print each keyword (or keyword phrase) on a line by itself, like the following:
...
milk
glycosidase
hydrolase
transmembrane protein
lectin
calcium
glycosyltransferase
hexosyltransferase
mammary gland
milk
calcium
glycosyltransferase
hexosyltransferase
...
Hint: Perl has an lc function, and your chosen language probably has something similar, to convert uppercase letters to lowercase.
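One possible way to approach job 1 is sketched below in Python. It assumes that the KEYWORDS and SOURCE tags start at the beginning of a line and that each keyword list ends with a period, as is usual in GenBank flat files; check both against the lactose file. The function names are illustrative only.

```python
def split_keywords(field):
    """Split one KEYWORDS field into lowercased keywords."""
    text = " ".join(field.split())   # collapse line breaks and extra spaces
    text = text.rstrip(".")          # assumption: the list ends with a period
    for keyword in text.split(";"):
        keyword = keyword.strip().lower()
        if keyword:                  # skip empty entries such as a bare "."
            yield keyword


def extract_keywords(lines):
    """Yield every keyword found between a KEYWORDS tag and the next SOURCE tag."""
    buffer = []          # text fragments of the current KEYWORDS field
    in_keywords = False
    for line in lines:
        if line.startswith("KEYWORDS"):
            in_keywords = True
            buffer = [line[len("KEYWORDS"):]]
        elif line.startswith("SOURCE"):
            if in_keywords:
                yield from split_keywords(" ".join(buffer))
            in_keywords = False
        elif in_keywords:
            buffer.append(line)      # continuation line of the keyword list
```

The generator accepts any iterable of lines, so an open file handle can be passed directly and each yielded keyword printed on its own line.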
Job 2: (10pt)
After you have successfully extracted the keywords from the lactose file, in job 2 you need to sort the list and remove duplicate entries, so that each keyword appears only once in the final output. Note that if you haven't converted all keywords to lowercase, "milk" and "Milk" will be listed as two separate keywords, which is incorrect.
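The sort-and-deduplicate step can be sketched as follows, assuming the keywords are already available as a list of strings. The function name unique_sorted is hypothetical. On the command line, piping the job 1 output through sort -u has the same effect.

```python
def unique_sorted(keywords):
    """Return the distinct keywords in ascending order.

    Lowercasing here guards against mixed-case duplicates such as
    "milk" and "Milk"; blank entries are dropped.
    """
    return sorted({kw.strip().lower() for kw in keywords if kw.strip()})
```

Collecting the keywords in a set removes duplicates before sorting, so the order in which they appear in the input does not matter.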
Job 3: (30pt)
In the harder job 3, we wish to extract gene models from the arabidopsis file, based on information already contained in its CDS tags. Our extracted output will be a FASTA format file that contains full-length gene models from the arabidopsis file. Each FASTA record consists of a >gene_name header line followed by a series of DNA data lines. Each gene model is made up of exons in uppercase letters and introns in lowercase letters, joined alternately. You need to recognize the coordinates of the exons and convert those regions to uppercase. Also, if a gene is on the complement strand of the arabidopsis BAC sequence, you need to reverse-complement the gene sequence so that the gene model is always written out in the 5'-to-3' translation direction. Your output should be something like the following:
>At2g18200
ATGCAATTGCTAAGAACCCTAACCACAAGAACAAGGAGCCGTCGCAGTGGATACGAGTGT
GTAACCAAGCATTCCAACTTTAGCTTACTCGGAGCAAAGCTAAGGAGCTCACGACCGTTC
CTTACTATGCTCCATATCGATAGGCTTGGTGGAGACTTTCCTGCGATTTTGGAAAAGCTT
CCACGCCAAAAACCAAATAAGACAGTGGTGACAAGCAAATTGAGCCATCCAATCTTCACT
CATGTTATTTACATATATATGTTATTTATAAAAATTTACATCGATTCAGTTAGCCTGATA
AAGTAA
>At2g18210
ATGCAAATGCTAAGAAACTTAAGCACGAGGACGAGGAGTCGTCGCGGCGGATATGAGCGT
GTAAGCGATGATTCCACCTTCAGCCTACTTGGAGCAAAGCTAAGGAGGTCAACGAGCGTT
CCATACTATGCTCCATCGATAAGGCTTGGTGGAGATTTTCCTGTGATTTTGGAAAAGCTT
CCACGCCAAAAACCAACTAAAACAGTGGTGACAAGCAAATTAAGCCATCCAATCTTCAGT
TTATTTGATGGTTATCGCCGCCATAACAAGAAGAAAGCGACGGCTAAACCGGAGTTCTCT
AGATACCATGAATACCTTAAAGAAAGTGGAATGTGGGATTTGAGATCTAATAGTCCGGTC
ATCTACTTTAAGTAG
>At2g18220
ATGGGTGCCAAAGgttcgatttttatccagtttttggatttatattgtaattagAGCTTA
AGGGATTCGAGATTGATAAACACTTTAAATCAAATGTAGATGATAAAAAGCGTGTGAAGA
AGTTGAAATCTAAGAAACTAGAAGCTGAGGAAGAGCTCAATAATGTTCAAGAAATCGATG
CACATGATATAGTAATGGAGCAGAAGAGCGATAAGAAGCGTGGGAAGAAGGTGAAATCTA
AGAAAGCAGAAGCTGAGGAGCATGAAGAAGAGCTTAAGAGGCTTCAAGAAAAGgtgagtc
atctatgaaacatgtttggttgcgatattgtaatgatatattagtggaaagttgagtttt
ttttttaactttacggatttttttgttccgcagGATCCTGATTTTTTTCAGTATATGAAA
GAGCATGATGCAGAGCTTCTAAAGTTTGATGCTACTGAAATTGAGgtgagttttgttagt
...
There is a related example of this job on the Vect DNA to Protein Tutorial that may get you started. However, in that tutorial the exons were spliced together and protein translations were made, while in this job the requirement is different. It is difficult to do this work completely in Vect even with Simple User Rules. You can, however, use Vect to extract relevant data and use another standalone Perl program to generate the final output. Note that the coordinates listed in the arabidopsis file start from 1, so to take substrings with zero-based indexing you need to subtract one from the start coordinate.
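The coordinate handling and reverse-complement step could look roughly like the Python sketch below. It assumes that the BAC sequence has already been read into a single string, and that each CDS location is a string such as join(2..4,8..10) or complement(join(2..4,8..10)) containing only simple start..end ranges. All function names are hypothetical, and how the gene name is obtained from the file is left open.

```python
import re

# Complement table that preserves case, so introns stay lowercase.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_location(location):
    """Return (exons, is_complement) for a CDS location string.

    exons is a sorted list of 1-based, inclusive (start, end) pairs.
    """
    is_complement = location.strip().startswith("complement")
    exons = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    return sorted(exons), is_complement


def gene_model(sequence, location):
    """Build one gene model: exons uppercase, introns lowercase, 5'-to-3'."""
    exons, is_complement = parse_location(location)
    gene_start, gene_end = exons[0][0], exons[-1][1]
    # File coordinates are 1-based and inclusive; Python slices are
    # 0-based and end-exclusive, hence start - 1 and an unchanged end.
    model = list(sequence[gene_start - 1:gene_end].lower())
    for start, end in exons:
        lo, hi = start - gene_start, end - gene_start + 1
        model[lo:hi] = [base.upper() for base in model[lo:hi]]
    text = "".join(model)
    if is_complement:
        text = text.translate(COMPLEMENT)[::-1]   # complement, then reverse
    return text


def format_fasta(name, model, width=60):
    """Return a FASTA record with the sequence wrapped at width columns."""
    lines = [model[i:i + width] for i in range(0, len(model), width)]
    return ">" + name + "\n" + "\n".join(lines)
```

The width of 60 matches the line length in the sample output above. Working on the slice from the first exon start to the last exon end keeps the introns between exons while excluding flanking sequence.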
Job 4: (20pt)
Once you can get the individual gene models out of the arabidopsis file in job 3, in the last job we want to gather codon usage statistics over the exon regions of each gene model. This is helpful in phylogenetic analysis based on codon usage patterns. Your output should be similar to the following. Note that the codon list should also be sorted in alphabetical order:
AAA ==> 266
AAC ==> 132
AAG ==> 288
AAT ==> 127
...
TTT ==> 161
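A minimal sketch of the counting step is given below, assuming the gene models from job 3 are available as mixed-case strings (uppercase exons, lowercase introns) and that the counts are pooled over all gene models, which is one reading of the sample output. The function names are hypothetical.

```python
from collections import Counter


def codon_counts(gene_models):
    """Count codons over the exon (uppercase) bases of each gene model."""
    counts = Counter()
    for model in gene_models:
        # Keep only the exon bases; dropping introns splices the exons.
        coding = "".join(base for base in model if base.isupper())
        # Step through complete codons; a trailing partial codon is ignored.
        for i in range(0, len(coding) - len(coding) % 3, 3):
            counts[coding[i:i + 3]] += 1
    return counts


def format_counts(counts):
    """Return lines such as 'AAA ==> 266', sorted alphabetically by codon."""
    return [f"{codon} ==> {n}" for codon, n in sorted(counts.items())]
```

Codons are read from the spliced exons of each gene model separately, so a codon that spans an intron is still counted correctly, and reading frames do not carry over from one gene to the next.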