Fall 2007 Gen/ComS/BCB 596 Project 2
Determine DNA bases from trace files, sequence quality control and vector trimming
Due: Thursday, October 4 at 11 am.
The jobs of this project are to analyze the trace files stored in ~cs596/pub/data/easy and ~cs596/pub/data/moderate, and to generate DNA sequences and quality values from them. You will then analyze the resulting sequence and quality data to produce clean range information for each sequence, which the next project (fragment assembly) will use. You will need to read parts of the software documentation referenced in the jobs below, or you may wish to copy down the command line examples given in class.
Warning! Do not copy the trace files to your own directories. You will run out of your disk quota that way! Read the trace files in their own directories within the public class account, and save only the output to your own directories.
Although the jobs of this project are straightforward, they are computationally intensive. To avoid disturbing other users of the computer science department lab (and drawing the system administrator's attention :-), you must follow these operating guidelines:
- To spread out job loads, run your jobs only at the hours given by the following simple calculation. Take the last four digits of your university ID number modulo 24. If the resulting hour does not suit you, you can add 6, 12 or 18 hours to it. For example, if the last four digits of your ID are 2234, then 2234 modulo 24 is 2, so you can run your jobs at 2am, 8am, 2pm and 8pm. This should give you plenty of opportunities to do your projects.
- If possible, avoid running your jobs at other hours. Even if the machine you log in to seems to have no other users, your jobs still slow down the file server, which is shared by everybody using any machine in the CS lab at any time!
- Before you start your jobs, check the current load on the machine you are logged in to by typing the "w" command. If the load average on that machine is more than 1.00, do not run your jobs there; find another machine with a lighter load instead. If you ssh into pyrite.cs.iastate.edu, it automatically assigns you to the least loaded machine, so you just need to log out and log back in to switch to a less busy machine.
- Finally, always prefix your command with "nice" when you run your jobs. For example, if the command you intend to run is
phred -id here -sa there -qa where
you should actually run it as
nice phred -id here -sa there -qa where
That will give the other users running smaller interactive jobs like editors or compilers on the same machine a faster response, without slowing down your job too much.
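The hour calculation in the first guideline can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names allowed_hours and as_clock are made up for this example.

```python
def allowed_hours(id_last4: int) -> list:
    """Return the four permitted start hours (0-23) for an ID suffix.

    The base hour is the last four ID digits modulo 24; the other
    three slots are that hour shifted by 6, 12 and 18 hours.
    """
    base = id_last4 % 24
    return sorted((base + shift) % 24 for shift in (0, 6, 12, 18))


def as_clock(hour: int) -> str:
    """Format a 0-23 hour as a 12-hour clock label such as '2pm'."""
    suffix = "am" if hour < 12 else "pm"
    return f"{(hour % 12) or 12}{suffix}"


if __name__ == "__main__":
    hours = allowed_hours(2234)          # the example ID from the text
    print(hours)                         # [2, 8, 14, 20]
    print([as_clock(h) for h in hours])  # ['2am', '8am', '2pm', '8pm']
```

Note that the four slots wrap around midnight, so a base hour of 23 also allows 5am, 11am and 5pm.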
The above guidelines will likely be needed for the other computationally intensive projects in this course, so become familiar with them now!
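If you prefer to script the pre-flight check, the load and "nice" guidelines might look like the sketch below. It assumes the load average that matters is the 1-minute figure (the "w" command prints 1-, 5- and 15-minute values, and the guideline does not say which one to use), and the helper names load_ok and niced are hypothetical. The phred arguments are the placeholder ones from the example above, and the script only prints the command instead of running it.

```python
import os
import shlex

LOAD_LIMIT = 1.00  # threshold from the guidelines


def load_ok(one_minute_load: float, limit: float = LOAD_LIMIT) -> bool:
    """True when the load is not above the limit (assumption: 1-minute load)."""
    return one_minute_load <= limit


def niced(command: str) -> list:
    """Split a command line into arguments and prefix it with 'nice'."""
    return ["nice"] + shlex.split(command)


if __name__ == "__main__":
    load1, _, _ = os.getloadavg()  # same numbers that "w" reports (Unix only)
    cmd = niced("phred -id here -sa there -qa where")
    if load_ok(load1):
        print("Load is", load1, "- OK to run:", " ".join(cmd))
    else:
        print("Load is", load1, "- log out and try another machine.")
```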
Job 1: (10pt)
Obtain phred base-calls and quality values from the trace files stored in ~cs596/pub/data/easy using phred. Read the phred documentation in ~cs596/pub/doc to determine the appropriate command line parameters to use. Store the results as FASTA flat files in your own directory. We will need those files in later projects, so keep them after this project.
Job 2: (10pt)
Obtain the original ABI base-calls from the same trace files stored in ~cs596/pub/data/easy. You can either use phred to do this, or write your own simple program to extract bases visible in each trace file. We will not need those files after this project, so you can delete them after you submit this project.
- Note1: phred quality data for ABI base-calls are all zero; so omit them to save disk space!
- Note2: make sure you really do get the ABI base-calls if you use phred; if the sequences you obtain in this job are exactly the same as the ones you obtained in job 1, something is wrong!
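One way to run the sanity check in Note2 is to compare the two sets of FASTA sequences read by read. The sketch below is a minimal example, not a required tool; the function names are made up, and the short sequences in the demo are invented placeholders rather than real base-calls.

```python
def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}.

    The name is taken to be the first word after '>' on each header line.
    """
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks).upper() for n, chunks in parts.items()}


def identical_reads(phred_seqs, abi_seqs):
    """Names present in both sets whose sequences match exactly."""
    shared = sorted(set(phred_seqs) & set(abi_seqs))
    return [n for n in shared if phred_seqs[n] == abi_seqs[n]]


if __name__ == "__main__":
    # Invented toy data: two reads, one of which differs between callers.
    phred = read_fasta([">ATNMA01TR", "ACGTNACGT", ">ATNMA02TR", "GGCCTTAA"])
    abi = read_fasta([">ATNMA01TR", "ACGTAACGT", ">ATNMA02TR", "GGCCTTAA"])
    same = identical_reads(phred, abi)
    print(len(same), "of", len(phred), "reads are identical:", same)
```

A few identical reads are not necessarily alarming; the warning sign is every read in the data set matching exactly.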
Job 3: (20pt)
Repeat jobs 1 and 2 above for the additional trace files stored in ~cs596/pub/data/moderate. Do not mix the data you obtain from "moderate" with the data you obtained from "easy"; save them in different directories.
Submission for grading jobs 1-3:
For both the "easy" and "moderate" data sets, submit only the sorted FASTA headers of both the phred and the ABI sequences, as in the example below. Do not submit the sequences themselves or their quality values; that will only waste our disk space! We do not need your DNA or quality data at this stage to grade your project. Your assembled genomic sequence will later tell us more about the quality of your work.
>ATNMA01TR 970 0 970 ABI
>ATNMA02TR 1002 0 1002 ABI
>ATNMA03TF 854 0 854 ABI
>ATNMA03TR 867 0 867 ABI
>ATNMA04TR 1045 0 1045 ABI
>ATNMA05TF 858 0 858 ABI
>ATNMA06TF 869 0 869 ABI
>ATNMA06TR 988 0 988 ABI
>ATNMA07TF 873 0 873 ABI
>ATNMA07TR 1046 0 1046 ABI
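Pulling the header lines out of a FASTA file and sorting them takes very little code. The following Python sketch shows one way to do it; the function name sorted_headers is made up, the sort is a plain string sort, and the four-base sequence lines in the demo are placeholders that do not match the lengths in the headers.

```python
def sorted_headers(lines):
    """Return the FASTA header lines (those starting with '>'), sorted."""
    return sorted(line.rstrip("\n") for line in lines if line.startswith(">"))


if __name__ == "__main__":
    # Headers copied from the sample above, deliberately out of order.
    fasta = [
        ">ATNMA03TF 854 0 854 ABI\n", "ACGT\n",
        ">ATNMA01TR 970 0 970 ABI\n", "GGCC\n",
        ">ATNMA02TR 1002 0 1002 ABI\n", "TTAA\n",
    ]
    for header in sorted_headers(fasta):
        print(header)
```

With a real file you would pass an open file object instead of the list, for example sorted_headers(open(path)) where path is your own output file.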
Job 4: (20pt)
Using the phred base-calls and quality values obtained from the trace files in ~cs596/pub/data/easy, use lucy to select the useful data range of each sequence for DNA fragment assembly. Read the lucy documentation to determine the appropriate command line parameters to use.
Don't forget that you need to trim the vector sequence and remove vector contaminants from the data set as well using the PUC19 and PUC19splice files stored in the ~cs596/pub/data directory. Again, read the lucy document to determine how to use these two files to trim vector fragments.
Note that the term "trimming" here simply means marking the beginning and end of the good quality range along a sequence. We do not mean physically removing the regions outside this range. Each input sequence to the subsequent fragment assembly process should have the same length as before, except that its header line now marks its good quality region.
The version of phred we installed in the class account tends to call quality values lower than expected. To cope with this situation, you may wish to modify lucy's cutoff criteria to increase the useful data range. However, blindly lowering the cutoff values will result in bad data getting into the genomic data processing pipeline, which can make your assembly effort harder later.
Finally, remember to generate the debug information file when running lucy; you will need it for the project submission. See below.
Job 5: (20pt)
Repeat job 4 for the data set obtained from the trace files in ~cs596/pub/data/moderate. Avoid mixing the data from this job with the data from the previous job by saving them into different files.
Submission for grading jobs 4 and 5:
Keep the data you obtained in your own directory for future use. For jobs 4 and 5 above, submit the CLR and CLV values obtained from each data set, as in the example below. Do not submit the sequences or quality values themselves. Note that you need to filter the debug file to extract only the CLR and CLV values; we are not interested in the other CL? values. You may use the Perl/Vect skills you learned in project 1 to do this.
ATMRA24TF CLR 12 684 CLV 0 0
ATMRA24TR CLR 40 535 CLV 40 0
ATMRA25TF CLR 9 673 CLV 0 0
ATMRA25TR CLR 0 0 CLV 0 0
ATMRA26TF CLR 48 688 CLV 48 0
ATMRA26TR CLR 0 0 CLV 0 0
ATMRA27TF CLR 21 665 CLV 19 0
ATMRA27TR CLR 41 626 CLV 37 0
ATMRA30TF CLR 24 686 CLV 24 0
ATMRA30TR CLR 40 598 CLV 0 0
ATMRB80TFB CLR 51 681 CLV 51 0
ATMRB84TFB CLR 34 666 CLV 30 0
ATMRA31TF CLR 1 419 CLV 0 0
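The filtering step can be sketched as follows. This is written in Python rather than Perl and rests on an assumption about the debug file layout: each line is taken to hold a sequence name followed by groups of "TAG left right" (check the lucy documentation and your own debug file before relying on this). The function name extract_clr_clv is made up, and the CLX and CLY tags in the demo are invented stand-ins for the other CL? fields.

```python
def extract_clr_clv(line, wanted=("CLR", "CLV")):
    """Keep only the wanted 'TAG left right' groups from one debug line.

    Assumed layout: name TAG left right TAG left right ...
    Returns the reduced line, or None if a wanted tag is missing.
    """
    fields = line.split()
    if not fields:
        return None
    name, rest = fields[0], fields[1:]
    found = {}
    for i, token in enumerate(rest):
        # A wanted tag must be followed by two more fields (left, right).
        if token in wanted and i + 2 < len(rest):
            found[token] = (rest[i + 1], rest[i + 2])
    if any(tag not in found for tag in wanted):
        return None
    groups = [f"{tag} {found[tag][0]} {found[tag][1]}" for tag in wanted]
    return " ".join([name] + groups)


if __name__ == "__main__":
    # Hypothetical debug line; CLX and CLY are placeholder tags.
    demo = "ATMRA24TF CLR 12 684 CLX 1 700 CLY 0 0 CLV 0 0"
    print(extract_clr_clv(demo))
```

Applying the function to every line of the debug file and dropping the None results gives output in the format shown above.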