Fall 2007 Gen/ComS/BCB 596 Project 3
DNA fragment assembly and closure of genome
Due: Thursday, October 25 at 11 am.
This can be considered the most important project of the semester. Your job is to assemble the fragment sequences obtained from the trace files in ~cs596/pub/data/easy and ~cs596/pub/data/moderate to produce two full-length BAC sequences. Assuming you did the previous project correctly, you should already know how to do base calling, quality assessment, vector trimming and contaminant removal. If you are not familiar with these topics, review them now, since you may need to repeat them several times in this project to finish the closure attempt. Note that, unlike in the previous project, you do NOT need to submit your intermediate results when you redo this work. We only ask for your completed BAC sequences in this project.
Again, to avoid disturbing other users of the computer science lab (and drawing the administrator's attention :), the following guidelines must be followed. If people don't follow these time allocation rules, your project may not run as smoothly as it could.
- To spread out the load, you should run your jobs only at the hours calculated from the last four digits of your university ID number modulo 24. If the hour obtained is not a preferred hour, you can add 6, 12 or 18 hours to it. For example, if the last four digits of your ID are 2234, modulo 24 yields 2, so you should run your jobs at 2 am, 8 am, 2 pm and 8 pm. Avoid running your jobs at other hours. Although the machine you log in to may not seem busy, other users may be doing their projects on other machines that share the same file server, so network traffic can still slow down all machines when people run many data processing jobs at the same time.
- Before you start your work, check the current job load on the machine you log in to by typing the "w" command. If the load average on that machine is more than 1.00 per CPU, do not run your jobs there. Find another machine with a lighter load average to run your jobs. Remember, a quick check saves you time.
- Finally, always prefix your jobs with the "nice" command. For example, if the actual command you intend to run is
phred -id here -sa there -qa place
you should run it as
nice phred -id here -sa there -qa place
This will let other users running interactive jobs get a faster response from the CPU, without slowing your project too much.
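The two numeric rules above (the hour schedule and the load limit) can be checked with a few lines of Python. This is only a sketch: the function names `allowed_hours` and `machine_is_free` are made up for illustration, and it assumes your ID is given as a string ending in four digits and that you read the load average and CPU count yourself (for example from the "w" output).

```python
def allowed_hours(university_id):
    """Return the four permitted start hours (0-23) for a university ID.

    Assumption: the ID is a string whose last four characters are digits.
    """
    last_four = int(university_id[-4:])
    base_hour = last_four % 24
    # The base hour plus 6, 12 or 18 hours, wrapped around the 24-hour clock.
    return sorted((base_hour + offset) % 24 for offset in (0, 6, 12, 18))


def machine_is_free(load_average, cpu_count):
    """True when the load average is at most 1.00 per CPU."""
    return load_average / cpu_count <= 1.00


if __name__ == "__main__":
    print(allowed_hours("2234"))   # [2, 8, 14, 20], i.e. 2 am, 8 am, 2 pm, 8 pm
    print(machine_is_free(load_average=3.2, cpu_count=2))   # False: 1.6 per CPU
```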
Grading: (200pt! Yes, this project is very expensive!)
Although the steps needed to assemble the full-length sequences for easy and moderate vary from person to person depending on the parameters chosen, the results you submit for grading are supposed to be the same: just submit your assembled full-length BAC sequences for easy and moderate, or whatever pieces of them you believe are the best you can get before the due date.
After you have finished your project, use the submit program to submit your results. If you have more than one contig for a BAC (say, if you cannot close moderate), name your sequences in an understandable way so the TA knows that they belong to the same BAC, such as moderate1.seq, moderate2.seq, etc. To make sure there is no misunderstanding, you should always submit a README file to explain your results. If multiple contigs are submitted for the same BAC, they must all be in the same direction (i.e., from the same strand of the BAC DNA). If they are inconsistent, the shorter ones will be considered incorrect sequences and will be subject to the penalty in the following formula. Therefore, you should double-check that all contigs point in the same direction before you submit.
The total score for each BAC will be 100 points. Your score for each BAC will be calculated based on the following formula:
Total = (percent of correct sequences) - (percent of incorrect sequences)
With the formula above, you can see that it may be a good idea to submit only sequences you are sure are correct, rather than forcing yourself to combine all contigs into one sequence. For example, suppose you have three contigs A, B and C, where A and C are correct contigs occupying 40% of the total sequence length each, but B is an incorrectly assembled contig that's about 20% of the total length. If you just submit A and C as two separate contigs, you get 40+40=80 points. If you submit the B contig as well, you will get a lower score of 40+40-20=60 points. Worse yet, if you insist on submitting ACB together as a full-length sequence, and you get their order wrong, you will end up with a score of 40-20-40=-20 (i.e., zero, since we don't give negative scores). Therefore, don't feel obliged to submit a full-length BAC sequence. Just try your best, then submit only what you believe is correct.
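The arithmetic in the A/B/C example can be reproduced with a short Python sketch. The function name `bac_score` is hypothetical and this is not the TA's grading program; it only applies the formula above to percentages you supply and clamps negative totals to zero, as described.

```python
def bac_score(correct_percents, incorrect_percents):
    """Score for one BAC: percent correct minus percent incorrect.

    Negative totals are reported as zero, as stated in the assignment.
    """
    raw = sum(correct_percents) - sum(incorrect_percents)
    return max(0, raw)


if __name__ == "__main__":
    print(bac_score([40, 40], []))      # A and C only: 80
    print(bac_score([40, 40], [20]))    # A, C and the bad contig B: 60
    print(bac_score([40], [20, 40]))    # ACB joined in the wrong order: 0 (raw -20)
```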
The TA will use the sgrasta program to check your answers against our best known answers. That program tolerates a few errors and still provides the longest alignment with the known answers. Therefore, you do not need to worry that a few wrong bases in the middle of your sequences will cut your scores in half. :-)
Methods:
You have the option of using either the TIGR suite of software tools to assemble your BACs (lucy, run_TA, asm_to_group.pl, grouper, etc.) or the U. of Washington suite of tools (phredPhrap, consed, etc.). In fact, you probably want to use both suites to cross-check your assembly results and make sure they are correct.
The detailed steps of DNA fragment assembly are too complex to be repeated here. We have covered enough detail in class, and will continue to address your questions if you raise them. You are more than welcome to ask questions about this project in class. We can bring your problems up online and try to solve them in class.
Here we provide just a brief outline of the (possibly repetitive) steps you may need to follow to assemble the fragments:
- base calling (see project 2),
- contaminant removal and quality assessment (see project 2),
- fragment assembly (see the many documents in ~cs596/pub/doc; also check your class notes),
- contig grouping (this is for TIGR tools only),
- inspect results and do simulated PCR to obtain hidden chromatogram files for gap closure (see below),
- go back to the first step with any new chromatogram files you obtained, until you are satisfied with the results,
- possibly edit your results manually (if you use consed), then submit your answers.
Remember, you can submit multiple times if you wish, until the deadline, with no penalty.
Simulated PCR:
As we explained in class, in a real-world scenario, a genome can seldom be closed in the first random assembly phase of a shotgun sequencing project. Almost always, technicians have to use their partial assembly results to infer new PCR targets in the clone library, determine the correct primer pairs for these targets, and "fish out" or PCR out additional sequence data to close their gaps. Since we cannot afford a real PCR step in our simple class project, we have developed a simulated PCR program that mimics this closure step by retrieving chromatogram files that people at TIGR already obtained with their actual PCR efforts.
To use the pcr program, type your command like the following:
pcr easy AAAAAAAAAAAAAAAAAAAA TTTTTTTTTTTTTTTTTTTT GGGGGGGGGGGGGGGGGGGG ...
The first argument to the pcr program must be either easy or moderate, for the two different BACs. The rest of the arguments are the PCR bait sequences that you inferred from your current partial assembly results. You will most likely be looking at the ends of your current contigs to infer these. A bait sequence has to be at least 20 bp long to be considered a valid PCR primer, but with a longer bait a single mismatched base can prevent some trace files from being sent back to you. So the optimal bait size is probably just 20 bp.
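As an illustration of taking baits from contig ends, the following Python sketch cuts the first and last 20 bases from a contig sequence and formats a pcr command line in the form shown above. The helper names `end_baits` and `pcr_command` are hypothetical and not part of any course tool, and the sketch assumes the contig is already available as a plain string of bases. Whether a given bait should be used as-is or on the opposite strand is something you still have to decide yourself (see Hints).

```python
BAIT_LENGTH = 20  # minimum valid bait length stated in the assignment


def end_baits(contig, length=BAIT_LENGTH):
    """Return the first and last `length` bases of a contig, uppercased.

    Whitespace and line breaks inside the sequence are removed first.
    """
    sequence = "".join(contig.split()).upper()
    if len(sequence) < length:
        raise ValueError("contig is shorter than the bait length")
    return sequence[:length], sequence[-length:]


def pcr_command(bac, baits):
    """Build the command line: 'pcr', the BAC name, then the baits."""
    if bac not in ("easy", "moderate"):
        raise ValueError("first argument must be 'easy' or 'moderate'")
    for bait in baits:
        if len(bait) < BAIT_LENGTH:
            raise ValueError("bait shorter than 20 bp: " + bait)
    return " ".join(["pcr", bac] + list(baits))


if __name__ == "__main__":
    # Made-up contig used purely for demonstration.
    contig = "A" * 20 + "CCCC" + "G" * 20
    left, right = end_baits(contig)
    print(pcr_command("easy", [left, right]))
```

Printing the command rather than running it lets you review the baits first, which matters because the number of PCR attempts per BAC is limited.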
In order to prevent people from exhaustively searching for all possible baits of PCR targets (all 4^20 of them!), we have set a limit of 200 PCR attempts for each BAC group. This limit will be enforced by the pcr program. 200 PCR attempts are more than enough for the closure if you know what you are looking for before you call the pcr program.
If you run into any difficulty executing the pcr program, let the Instructor know.
Hints:
We offer the following hints to make your life easier when doing this project. If we have additional information later, we will post it to WebCT:
- The first suggestion is to ask your questions in class. Since we anticipate that many people will have similar questions about this difficult project, we would like everybody to share the answers we provide. Please do not send emails to the Instructor with questions that are of general interest to the other students. Instead, ask them in class or on WebCT.
- Although the pcr program will tell you how many chromatogram files are still hidden from you, you do NOT need to get them all to close your BACs. Some of the hidden chromatogram files were produced to double-check the quality of certain areas of the BAC and have nothing to do with closure. We hide them all because we don't know which ones were originally generated at TIGR for closure and which ones for quality checks.
- The moderate BAC group has a nasty repeat region which may cause some uncertainty for the assembly program. You can manually decide which answers to take based on the average clone length data summarized by the check_coverage program.
- If you think you have an alternative solution in which you use your partially assembled contigs as seeds to fish out the full-length BAC sequences from GenBank, forget about it. The chromatogram files we provided for your projects have been specially *cooked* so that they assemble to results different from the GenBank sequences submitted by TIGR. If you submit GenBank sequences as answers, we will consider it cheating, because some parts of the TIGR-submitted GenBank sequences cannot be produced from the trace files you were given.
- If you believe you should get some PCR results from the pcr program but you didn't, don't forget to check the direction of your baits and/or to reverse-complement them before trying again. Sometimes the PCR direction should be on the opposite strand because you read the grouper output in the wrong direction. Also, you may want to try other PCR baits if your chosen ones do not hit any hidden files. Sometimes the real PCR reaction as carried out at TIGR starts much earlier or later than your chosen spots on the contigs. Don't forget, there are other real-world constraints on picking PCR primers which we do not enforce in the pcr program in order to reduce the complexity of the project, but which are observed in the TIGR PCR designs. We have explained this in class.
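Reverse-complementing a bait, as suggested in the last hint, takes only a few lines of Python. This is a minimal sketch: the function name `reverse_complement` is just an illustrative choice, and it assumes the bait contains only the bases A, C, G and T (in either case).

```python
# Translation table mapping each base to its complement, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(bait):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return bait.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAAAAAAAAAAAAAAAAAAC"))   # GTTTTTTTTTTTTTTTTTTT
```

Applying the function twice returns the original bait, which is a quick sanity check before you spend a PCR attempt on it.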