Fall 2010 Gen/ComS/BCB 596 Project 3

DNA fragment assembly and closure of genome

Due: Tuesday, November 2 at 11:59 pm

This can be considered the most important project of the semester. Your job is to assemble the fragment sequences obtained from the trace files in ~cs596/pub/data/easy and ~cs596/pub/data/moderate to produce two full-length BAC sequences. Assuming you did the previous project correctly, you should already know how to do base calling, quality assessment, vector trimming and contaminant removal. If you are not familiar with these topics, review them now, since you may need to repeat them several times in this project to finish the closure attempt. Note that, unlike in the previous project, you do NOT need to submit your intermediate results when you redo this work. We ask only for your completed BAC sequences in this project.

 

Again, to avoid disturbing other users of the computer science lab (and drawing the administrator's attention :), we recommend the following guidelines. If most people don't follow these time allocation guidelines, your project may not go as smoothly as it could.

  1. To spread out the load, we recommend that you run your jobs at the hour given by the last four digits of your university ID number modulo 24. If that hour is not convenient, you can add 6, 12 or 18 hours to it. For example, if the last four digits of your ID are 2234, modulo 24 yields 2, so you may run your jobs at 2 am, 8 am, 2 pm and 8 pm. Avoid running your jobs at other hours. Although the machine you log in to may not seem busy, other users may be doing their projects on other CS machines that share the same file server, so network traffic can still slow down all machines when people run many data processing jobs at the same time.
  2. Before you start your work, check the current load on the machine you are logged in to by typing the "w" command. If the load average on that machine is more than 1.00 per CPU, do not run your jobs there; find another machine with a lighter load average. Checking first saves you time. If you always log in to pyrite.cs.iastate.edu, it will automatically assign your login session to the most lightly loaded machine at the time of login, alleviating some of the load imbalance problems we have seen before.
  3. Finally, always prefix your jobs with the "nice" command. For example, if the actual command you intend to run is

    phred -id here -sa there -qa place

    you should run it as

    nice phred -id here -sa there -qa place

    This will let other users running interactive jobs get a faster response from the CPU, without slowing your project too much.
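
The hour calculation in guideline 1 can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names `recommended_hours` and `format_hour` are made up for this example and are not part of any course tool.

```python
def recommended_hours(university_id: str) -> list[int]:
    """Return the four suggested start hours (0-23) for a university ID."""
    last_four = int(university_id[-4:])
    base = last_four % 24
    # The base hour plus 6, 12 and 18 hours, wrapped onto a 24-hour clock.
    return sorted((base + offset) % 24 for offset in (0, 6, 12, 18))


def format_hour(hour: int) -> str:
    """Format a 24-hour value as a 12-hour clock label, e.g. 14 -> '2 pm'."""
    suffix = "am" if hour < 12 else "pm"
    return f"{hour % 12 or 12} {suffix}"


if __name__ == "__main__":
    hours = recommended_hours("2234")
    print(", ".join(format_hour(h) for h in hours))  # 2 am, 8 am, 2 pm, 8 pm
```

For the ID 2234 used in the guideline, this reproduces the same four hours listed there.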

Grading: (20 pt. Yes, this project is the most expensive project!)

Although the steps needed to assemble the full-length BAC sequences for easy and moderate vary from person to person, because they depend on the parameters chosen for each assembly run, the results you submit for grading should be the same: just submit your assembled full-length BAC sequences for easy and moderate, or whichever pieces of them you believe are the best contigs you can get before the due date.

 

After you have finished your project, use the submit program to submit your results. If you have more than one contig for a BAC (say, because you cannot make closure on moderate), name your sequences in an understandable way so the TA knows that they belong to the same BAC, such as moderate1.seq, moderate2.seq, etc. To make sure there is no misunderstanding, you should also submit a README file that explains your results to the TA. If multiple contigs are submitted for the same BAC, they must all be in the same direction, i.e., from the same strand of the BAC DNA. If they are inconsistent, the shorter ones will be considered incorrect sequences and will be subject to the penalty in the formula below, so double-check that all contigs for the same BAC point in the same direction before you submit them.

 

The total score for each BAC will be 10 points. Your score for each BAC will be calculated based on the following formula:

Total = ( (percent of correct sequences) - (percent of incorrect sequences) ) / 10

With the formula above, you can see that it may be a good idea to submit only sequences you are sure are correct, rather than forcing yourself to combine all contigs into one BAC sequence. For example, assume A and C are correct contigs that each occupy 40% of the total sequence length, but B is an incorrectly assembled contig that is about 20% of the total length. If you submit just A and C as two separate contigs, you get 4+4=8 points. If you submit the B contig as well, you get a lower score of 4+4-2=6 points. Worse yet, if you insist on submitting ACB together as a full-length sequence and you get their order wrong, you will end up with a score of 4-2-4=-2 (i.e., zero, since we don't give negative scores). Therefore, don't feel obliged to submit a full-length BAC sequence. Just try your best, then submit only what you believe is a correctly oriented set of contigs for each BAC.
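
As a quick check of the arithmetic, the scoring formula and the three scenarios above can be written out in Python. This is a sketch of the formula as stated in this handout, not the TA's actual grading script; the function name `bac_score` is hypothetical.

```python
def bac_score(correct_percent: float, incorrect_percent: float) -> float:
    """Score one BAC out of 10 points; negative totals are clamped to zero."""
    total = (correct_percent - incorrect_percent) / 10
    return max(0.0, total)


if __name__ == "__main__":
    print(bac_score(80, 0))   # A and C submitted as separate contigs -> 8.0
    print(bac_score(80, 20))  # A, B and C submitted separately -> 6.0
    print(bac_score(40, 60))  # ACB joined in the wrong order -> 0.0
```

The last case shows why a single misjoined sequence is costly: everything outside the longest correct stretch counts against you.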

 

The TA will use the sgrasta program to check your answers against our best known answers. That program tolerates a few errors and still provides the longest alignment with the known answers. Therefore, you do not need to worry that a few wrong bases in the middle of your sequences will cut your scores in half. :-)

Methods:

You have the option of using either the TIGR suite of software tools (lucy, run_TA, asm_to_group.pl, grouper, etc.) or the U. of Washington suite of tools (phredPhrap, consed, etc.) to assemble your BACs. In fact, you probably want to use both suites to cross-check your assembly results and make sure they are correct.

 

The detailed steps of DNA fragment assembly are too complex to be repeated here. We have covered them in enough detail in class, and will continue to address your questions if you raise them in class or on WebCT. You are more than welcome to ask questions about this project in class; we can bring your problems up on the projector screen and try to solve them together.

 

Here we provide just a brief outline of the (possibly repetitive) steps you may need to follow in order to assemble the fragments:

  1. base calling (see project 2),
  2. contaminant removal and quality assessment (see project 2),
  3. fragment assembly (see the many documents in ~cs596/pub/doc; also check your class notes),
  4. contig grouping (this is for TIGR tools only),
  5. inspect the results and do simulated PCR to obtain hidden chromatogram files for gap closure (see below),
  6. go back to step 1 with any new chromatogram files you obtained, and repeat until you are satisfied with the results,
  7. possibly edit your results manually (if you use consed), then submit your answers.

Remember, you can submit multiple times until the deadline, with no penalty. If you reuse a file name, the older file will be overwritten by the newer one. If you use different file names, all the files will accumulate for the TA to view.

Simulated PCR:

As we explained in class, in a real-world scenario, a genome can seldom be closed in the first random assembly phase of a shotgun sequencing project. Almost always, technicians have to use their partial assembly results to infer new PCR targets in the clone library, determine the correct primer pairs for these targets, and "fish out" (PCR out) additional sequence data to close their gaps. Since we cannot afford a real PCR step in our simple class project, we have developed a simulated PCR program that mimics this closure step by retrieving chromatogram files that people at TIGR already obtained through their actual PCR efforts.

 

To use the pcr program, type a command like the following:

pcr easy AAAAAAAAAAAAAAAAAAAA TTTTTTTTTTTTTTTTTTTT GGGGGGGGGGGGGGGGGGGG ...

The first argument to the pcr program must be either easy or moderate, selecting one of the two BACs. The remaining arguments are the PCR bait sequences that you inferred from your current partial assembly results. You will most likely look at the ends of your current contigs to infer these. A bait sequence has to be at least 20 bp long to be considered a valid PCR primer for the purposes of our project (so you can't run an exhaustive primer enumeration search using Perl :). On the other hand, a single mismatched base pair in a longer bait can prevent some trace files from being sent back to you by the pcr program. So the optimal bait size is probably just 20 bp!
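
One way to collect candidate baits is to cut 20 bp from each end of a contig and keep both orientations. The Python sketch below does only that. The helper names are hypothetical, the handout does not say which orientation the pcr program expects, and dropping baits that contain ambiguous bases such as N is an assumption made here, not a rule of the pcr program.

```python
BAIT_LENGTH = 20  # minimum bait length stated in this handout
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def end_baits(contig: str, length: int = BAIT_LENGTH) -> dict[str, str]:
    """Return candidate baits cut from both contig ends, in both orientations."""
    seq = "".join(contig.split()).upper()  # drop line breaks and spaces
    if len(seq) < length:
        raise ValueError(f"contig is shorter than {length} bp")
    head, tail = seq[:length], seq[-length:]
    candidates = {
        "head": head,
        "head_rc": reverse_complement(head),
        "tail": tail,
        "tail_rc": reverse_complement(tail),
    }
    # Assumption: baits with ambiguous bases (e.g. N) are not worth an attempt.
    return {
        name: bait
        for name, bait in candidates.items()
        if set(bait) <= set("ACGT")
    }


if __name__ == "__main__":
    # Made-up contig, for illustration only.
    contig = "ACGTTGCAAGGCTTAACCGGATCGATCGTAGCTAGGCTAACGGTTAACCGTACGATCGGATCC"
    for name, bait in end_baits(contig).items():
        print(name, bait)
```

Which of these candidates to pass to pcr depends on how your contigs are oriented, so inspect your assembly output before spending an attempt.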

 

In order to prevent people from exhaustively searching for all possible baits of PCR targets (all 4^20 of them!), we have set a limit of 200 PCR attempts for each BAC group. This limit will be enforced by the pcr program. 200 PCR attempts are more than enough for the closure if you know what you are looking for before you call the pcr program.

 

If you run into any difficulty executing the pcr program, let the Instructor know.

Hints:

We offer the following hints to make your life easier when doing this project. If we have additional information later, we will post it to the WebCT discussions on the assembly problem:

  1. The first suggestion is to ask your questions in class. Since we anticipate that many people will have similar questions about this somewhat difficult project, we would like everybody to share the answers we provide. Please do not email the Instructor with questions that are of general interest to the other students. Instead, ask them in class or on WebCT.
  2. Although the pcr program will tell you how many chromatogram files are still hidden from you, you do NOT need to get them all to close your BACs. Some of the hidden chromatogram files were produced to double-check the quality of certain assembled areas of the BAC, because of TIGR's high quality standard, and thus have nothing to do with closure. We hide them all because we don't know which ones were generated for closure and which ones for quality checks back at TIGR.
  3. The moderate BAC group has a nasty repeat region which may cause some uncertainty for the assembly program. You can decide manually which answers to take based on the average clone length data summarized by the check_coverage program.
  4. If you think you have an alternative solution in which you use your partially assembled contigs as seeds to fish the full-length BAC sequences out of GenBank, forget about it. The chromatogram files we provided for your projects have been specially *cooked* so that they assemble to results different from the sequences TIGR submitted to GenBank. If you submit GenBank sequences as answers, we will consider it cheating, because some parts of the TIGR-submitted GenBank sequences cannot be produced from the trace files you were given.
  5. If you believe you should get some PCR results from the pcr program but you didn't, don't forget to check the direction of your baits and/or reverse-complement them before trying again. If you read the grouper output in the wrong direction, the PCR direction may need to be on the opposite strand. You may also want to try other PCR baits if your chosen ones do not hit any hidden files. Never repeat the search with the same unsuccessful baits! Sometimes the actual PCR reaction as carried out at TIGR starts much earlier or later than your chosen spots on the contigs. Don't forget that there are other real-world constraints on picking PCR primers, which we do not enforce in the pcr program in order to reduce the complexity of the project. TIGR's PCR design observed these primer design restrictions, which limited their primer choices. We have explained this in class.
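
Since attempts are limited to 200 per BAC and repeating a failed bait gains nothing, it can help to keep a simple record of what you have already tried. The Python sketch below is one possible bookkeeping aid, not part of the pcr program. The class name `BaitLog` is hypothetical, and it assumes that each invocation of pcr counts as one attempt, which the handout does not spell out.

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 200  # per-BAC limit stated in this handout


@dataclass
class BaitLog:
    """In-memory record of the baits already sent to pcr for one BAC."""

    bac: str
    attempts: int = 0
    tried: set[str] = field(default_factory=set)

    def untried(self, baits: list[str]) -> list[str]:
        """Return the baits not yet tried, in order, without duplicates."""
        fresh: list[str] = []
        for bait in baits:
            bait = bait.upper()
            if bait not in self.tried and bait not in fresh:
                fresh.append(bait)
        return fresh

    def record(self, baits: list[str]) -> int:
        """Record one pcr run and return the number of attempts left."""
        if self.attempts >= MAX_ATTEMPTS:
            raise RuntimeError(f"attempt limit reached for {self.bac}")
        # Assumption: one pcr invocation counts as one attempt.
        self.attempts += 1
        self.tried.update(bait.upper() for bait in baits)
        return MAX_ATTEMPTS - self.attempts


if __name__ == "__main__":
    log = BaitLog("moderate")
    baits = log.untried(["A" * 20, "a" * 20, "C" * 20])
    print(baits, log.record(baits))
```

The log lives only in memory; to keep it across login sessions you would need to save the tried baits to a file yourself. Note that the reverse complement of a failed bait is a different bait, so it is still reported as untried.
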
Last modified August 23, 2010. All rights reserved.