Friday, July 03, 2009

What to propose to NIH?

How about this?

Goal: To fully characterize all of the biases and sequence specificities of transformational recombination in H. influenzae.

Specific questions to answer:
  1. Is DNA binding a distinct step that precedes the initiation of DNA uptake? If so, does it have the same sequence specificity as DNA uptake? Does it have any topological specificity?
  2. What is the complete sequence specificity of DNA uptake? How absolute is the requirement for a good match to the USS consensus at uptake (are non-USS DNAs taken up at lower frequency or not at all)? This will be investigated with plasmids or short fragments containing 12% degenerate USSs, and with ones containing completely random sequences (we could create these or just use fragments of unrelated DNA). Does uptake have any topological specificity?
  3. Does the translocation step impose any sequence specificity? How strict is the requirement for a pre-existing free end (does circular DNA sometimes get cut or nicked in the periplasm, or transported intact into the cytoplasm?
  4. Is there any sequence specificity to the DNA degradation that accompanies uptake and translocation?
  5. What proteins interact directly with DNA during binding, uptake and translocation?
  6. What recombination biases affect indels? (How efficiently are different indels transfered by recombination?
  7. What are the recombination biases along the full length of the chromosome?
  8. How does mismatch repair affect the outcome of transformational recombination? How much of the recombination bias found by #7 is due to mismatch repair?

Thursday, July 02, 2009

Proposal priorities

I've gone through our 2007 CIHR proposal, incorporating my polishing of the text and notes about things the reviewers liked or didn't like. I also added notes about what we could reasonably accomplish before Sept. 15; here is an updated version of the ideas I posted a few weeks ago.

Specific Aims:
  1. "What is the H. influenzae uptake specificity? A pool of USSs that have been intensively but randomly mutagenized and then selected for the ability to be taken up by competent cells will be sequenced to fully specify the uptake bias." I've been getting info about designing and ordering the degenerate oligos. The post-doc had already set up a spreadsheet that does the calculations I wanted to think about, and he and I agreed that we should start with 12% degeneracy rather than 9%. This will reduce the fraction of oligos that are strongly preferred, giving more sensitivity for detecting weaker effects. I've had replies from two custom-oligo companies, and the good news is that our degenerate oligo pool will be easier to get and much cheaper than I had expected. For the proposal we'll only need to do some conventional sequencing, and as our USSs will already be in plasmids we think we'll just sequence each plasmid insert separately. This will be wasteful but not very expensive if we do the DNA reactions and cleanups ourselves, and probably cheaper when we consider the time/money we'll save by not having to troubleshoot a more 'efficient' sequencing strategy.
  2. "What forces act on DNA during uptake? Laser-tweezer analysis of USS-dependent uptake by wild type and mutant cells will reveal the forces acting on the DNA at both the outer and inner membranes." My physicist collaborator is keen to have me back to get this working. The only think I could aim to get done before September 15 is to attach some chromosomal DNA to the styrene beads and show that Bacillus subtilis will bind to it and pull on it in the tweezers apparatus, and that H. influenzae doesn't stick nonspecifically to beads with no DNA on them (a concern raised by a reviewer). The biotin-linked chromosomal DNA I'll use for this will also be used in other preliminary experiments.
  3. "Does the USS polarize the direction of uptake? Using magnetic beads to block uptake of either end of a small DNA fragment will show whether DNA uptake is symmetric around the asymmetric USS." Our 1 micron paramagnetic streptavidin beads are on their way, and I'll use these to check for non-specific binding of cells to beads and for specific binding of competent cells to beads with DNA on them (using the same DNA prepared above). I found the source of the 50 micron beads and will order them tomorrow; maybe I can use them in the same way.
  4. "Does the USS increase DNA flexibility? Cyclization of short USS-containing fragments will reveal whether the USS causes DNA to be intrinsically bent or flexible, and whether ethylation or nicking can replace parts of the USS." I'm going to try the nicking protocol tomorrow.
  5. "Which proteins interact with incoming DNA? Cross-linking proteins to DNA tagged with magnetic beads, followed by HPLC-MS, will be used to isolate and identify proteins that directly contact DNA on the cell surface." We'll have the DNA on the big magnetic beads (see 3 above) and can use this to try out the formaldehyde cross-linking. It would be good to show that we can distinguish between non-specific proteins or peptides (independent of both DNA and competence) and peptides cross-linked only when cells are competent and the beads have DNA on them. One of the reviewers really liked this part, but thought we should be more ambitions and propose more diverse approaches to this problem, including in vitro cross-linking to purified secretin.
  6. "Which proteins determine USS specificity? Heterologous complementation with homologs from the related Actinobacillus pleuropneumoniae (which recognizes a variant USS) will identify the proteins responsible for this specificity." It would be good to have results of some complementation experiments with single gene plasmids, as the whole-operon plasmids seemed to cause growth problems.
The RA will be back on Monday, and she, I and the post-doc will sit down together to work out who can try to do what. Then I'll start thinking about integating these experiments and the post-doc's experiments into the NIH proposal.

Does EthBr turn restriction enzymes into nickases?

I'm looking for a way to create a nick at a specific point in a closed circular plasmid, to see if increased DNA flexibility facilitates DNA uptake. New England Biolabs now sells specific nickases, but their site preferences aren't very convenient. I also have an old paper (Shortle et al. 1982 PNAS 79:1588) describing use of ethidium bromide to convert restriction enzymes into nickases. So I think I'll just do a simple test. Here's the relevant paragraph from that paper:
Restriction Enzyme Nicking Reactions. Covalently closed circular pBR322 DNA was nicked with restriction endonucleases HindIII, Cla I, or BamHI by incubating 10 pkg of plasmid DNA in a 100-Al solution of 20 mM Tris HCl, pH 7.8, 7 mM MgCl2, 7 mM 2-mercaptoethanol*, gelatin* (100 Ag/ml) and a concentration of ethidium bromide (50 ug/ml, 75 ug/ml, or 100 ug/ml, respectively) determined by titration to give an optimal level of nicking. An amount of restriction endonuclease was added sufficient to convert 50-90% of the input DNA to an open circular form on incubation at room temperature for 2-4 hr. The nicking reaction with the EcoRI enzyme consisted of 100 mM Tris'HCl, pH 7.6, 50 mM NaCL, 5 mM MgCl2, gelatin* (100 ug/ml), ethidium bromide (150 ug/ml). Reactions were stopped by addition of excess EDTA followed by phenol extraction and ethanol precipitation.
I have lots of supercoiled plasmid that has one EcoRI or BamHI or HindIII site, and a big bottle of 1 mg/ml EthBr. So I'll just set up a series of digests with different concentrations of EthBr and restriction enzyme and the standard digestion buffers*, and run a gel to see what I get.

*In the old days we used to add mercaptoethanol and gelatin to stabilize our restriction enzymes, but I won't bother.

Email to potential suppliers of a degenerate USS oligonucleotide

Dear oligonucleotide expert,

I'm writing to inquire about ordering a large highly degenerate oligonucleotide.

We're hoping to obtain an oligonucleotide that will be about 50 nt long and 12% degenerate at every position (so this will really be a population of degenerate nucleotides). Below I've listed specifications.

Would you be able to synthesize such an oligonucleotide for us? If so, could you let me know the estimated cost and time frame, and any other issues you think we should consider?

Thanks very much,

Rosie Redfield

Sequence: Not yet finalized, but probably similar to: AAAGTGCGGTTAATTTTTACAGTATTTTTACTGGTACCATATGAATTCCA

Degeneracy: Every position should be degenerate, with 4% each of the three other bases. For example, every position specified as an A should be 88% A and 4% each of T, G and C.

Scale: Because synthesizing this oligo will probably require a dedicated run, we would like to obtain as much as can easily be prepared in a single run.
(Calculation: 1 micromole of a 50 nt fragment is 6.02 x 10^17 molecules, with mass = 15 mg.)
Modifications: None.

Purification: For our experiments it will important to minimize the fraction of oligos that are not full length. What purification methods do you recommend for oligos of this length, and what contamination level should we expect?

Synthesis and sequencing of degenerate USS

One of our planned experiments (for our grant proposals) will thoroughly characterize the sequence bias of the H. influenzae DNA uptake system. We'll do this by having the cells take up plasmids from a pool containing many degenerate versions of the canonical preferred sequence, the USS, and then comparing the USS sequences of plasmids that were taken up with the sequences in the input plasmid pool.

What are the steps (the challenges)?
  1. synthesizing the degenerate USS
  2. putting the degenerate USS in a plasmid vector
  3. having cells selectively take up plasmids from the pool
  4. reisolating the plasmids that have been taken up
  5. sequencing representative USS from the two plasmid pools (input and taken-up)
  6. interpreting the sequences (characterizing the sequence diversity)
What I've thought of so far:

Steps 1 and 2. We'll obtain the degenerate USS as a pool of degenerate oligos, each with a 20 nt adapter on their 3' end. We can then make double-stranded versions using a sequence complementary to the adapter as a primer, using Klenow polymerase or Taq. Taq will leave A overhangs on both 3' ends and we'll use these to ligate the USSs into pTOPO or another TA-cloning vector. We'll order the oligos from a supplier, specifying the 30 nt USS consensus and flanking adapter sequence, and that each position should have 3% of each of the other nucleotides. They will probably need to be synthesized as a special batch, using specially prepared input nucleotide mixes (e.g. for A positions, 91% A and 3% each of G, T and C).

The 9% degeneracy was chosen to give an average of about 3 non-consensus bases in each USS, but I haven't done the math to predict the expected distribution. This is essential because we don't want the mix to consist mostly of USS variants that are just as good as the consensus. OK, the USS is 30 nt long, with 7 positions that show no consensus (bases at these positions probably don't influence uptake at all). So 21 positions that matter, with some probably mattering much more than others. That means the average USS will have only 2 non-consensus bases at positions that matter, and many will have the perfect consensus. I suspect this is too low - I'll work it out with the post-doc today.

Step 3. The cells need to be as competent as possible (to maximize uptake and recovery), but they also need to be given the DNA at a concentration that maximizes their selectivity. So they shouldn't be given so much DNA that the uptake system is saturated. 200 ng/ml of DNA is saturating for standard preps of competent cells, so we'll give these cells 100 ng/ml. We usually have ~ 10^9 cells/ml, so if they each take up 10 plasmids we'll have a maximum of ~10^10 plasmids to reisolate and sequence (assuming no losses). Depending on other constraints, the whole experiment could also be done with short USS-containing fragments rather than plasmids.

Step 4. Reisolating the plasmids that have been taken up may be easiest if we use a rec-2 mutant, because these cells can't take DNA past the first stage of uptake. The post-doc is getting ready to do some preliminary tests of DNA recovery, using radiolabeled DNAs. If the plasmids remain un-nicked they can be isolated using a miniprep kit. If we can't reisolate them away from the chromosomal DNA we can instead PCR-amplify the USS sequences, using flanking plasmid and adapter sequences for the primers and the minimum number of PCR cycles to maintain diversity (to reduce amplification artefacts). We'd better think carefully about much diversity our recovered DNA will have.

Step 5. The post-doc and I had a long discussion about the sequencing yesterday. For thorough sequencing of the input and taken-up pools we'd probably use one or two Illumina lanes - I need to be coached on the details of how this works. But for preliminary characterization (for the grant proposals) we could get by with much less sequencing - even sequencing just 100 of the recovered USSs should be enough to demonstrate that things are working. If the recovered USSs are in plasmids, we can just transform these into E. coli and do conventional Sanger/capillary sequencing of some inserts - this wold be inefficient but not overly expensive. If the recovered DNA is in short linear fragments (from PCR?), I was originally suggesting that we'd ligate these end-to-end into blocks of 5-20 (depending on size) and clone these for conventional sequencing. But the post-doc pointed out that big problems can arise when sequencing repeats, so cloning and sequencing them singly might still be faster and cheaper than troubleshooting these problems.

Step 6. Analyzing 100 preliminary sequences for the grant proposal data will be no big deal, but analyzing many thousands from the Illumina runs will require a more sophisticated approach which I haven't thought through yet.

Wednesday, July 01, 2009

First uptake experiment

Didn't work.

The plasmid DNA minipreps didn't contain any plasmid at all (except for the positive control that I'd seeded with a known amount of plasmid). But this might just be because my frozen competent cells were old and not very competent at all. Their transformation frequency was only about 2.5 x 10^-4, so even if the competent cells had taken up supercoiled plasmid I probably wouldn't have been able to detect it in my minipreps.

I may repeat the experiment one time, using fresh competent wildtype and rec-2 cells. (The rec-2 mutant cells can't transport DNA out of the periplasm, so may give higher sensitivity.) Or I may decide that this experiment shouldn't be high on my priority list because it isn't something the reviewers of our previous proposal were concerned about.

Saturday, June 27, 2009

Lab decoration

To celebrate getting back into the lab, I spent a little while
improving the decor by my bench.

Starting experiments again

I'm (finally) starting an experiment. This is the test of whether cells can take up intact closed circular plasmid DNA. The original experiment (Kahn et al 1983 - pdf here - compare lanes A and G of Fig. 3) used plasmids that had been cut, labeled with 32-P, and religated; the religation trapped some supercoils, allowing the gels to show that supercoiling was preserved in the DNA that had been taken up. Our new 32-P won't arrive until sometime next week, so I'm going to do a practice run using cold DNA. If I'm lucky, the cells will take up enough that I can see it in a gel.

The post-doc showed me the stock of the plasmid I'll use (pUSS-1, left by a previous post-doc). So this afternoon I'm just running a quick gel to check the state of this DNA. And pouring some plates to streak out the wildtype and rec-2 cells I'll be using. If the DNA is supercoiled I'll try it out tomorrow on some wildtype competent cells I have frozen.

We've also ordered the tester-kit of four kinds of streptavidin-coated magnetic beads with different surface properties. They're much bigger than I wanted, but I figure I can at least use them to find out whether H. influenzae cells stick to any of (or all of?) these beads in the absence of DNA. If they do, the planned experiments may not work.

The new post-doc's plan

Our new post-doc is planning a project that fits wonderfully with our research questions. He's going to incubate competent cells with fragments of chromosomal DNA from a different H. influenzae strain (about 2% sequence divergence), and reisolate the DNA that has passed through different stages of uptake and recombination. The pools of DNA (each potentially about 10^10 different molecules, taken up by about 10^9 different cells) will then be intensively sequenced with the latest ultra-high-throughput technology. By simply comparing how many times each donor sequence appears in each pool, we will learn an enormous amount.

The biggest thing we want to understand is the bias of the DNA uptake machinery towards the uptake signal sequence (USS) motif. Sites matching this motif are very common in the H. influenzae genome (~2000 sites), and we have characterized them in great detail, but we know very little about their role in uptake. This is a big problem, because we think that their role in uptake is the ultimate cause of their accumulation in the genome. We know that cells prefer sequences perfectly matching the core consensus (AAGTGCGGT) 10-100-fold over unrelated sequences. Our previous attempts to find out more about the effects of mismatches and of changes to the flanking-sequence- consensus haven't gotten very far, partly because we were testing each variant USS separately, and partly because the results were not very consistent (high variation between experiments). This new project will not only measure the relative uptake of all the >2000 sequences closely matching the consensus, it will measure the uptake of all the other sequences in the genome too. The resolution can be manipulated by changing the sizes of the chromosomal DNA fragments used as input (100 bp, 1 kb, 10 kb), and if more diversity is wanted we can repeat the analysis with mixtures of DNAs from other strains and closely related species. We will represent this full uptake-bias as a position-weight matrix for the 30+ bp of the extended USS, in the same way as we now represent the consensus of the USS in the genome. Comparing the two matrices will tell us the extent to which the sequence motifs found in the genome correspond are produced by the bias of the uptake machinery.

The project will answer a lot of other questions, both mechanistic and evolutionary. WHY the uptake machinery is biased is our biggest research question. We think that the bias arises for mechanistic reasons; i.e. the sequences favoured by the uptake machinery are those that are easiest to take up. One issue that the post-doc's experiments will resolve is whether the direction of DNA uptake is set by the orientation of the USS. This will tell us quite a bit about the uptake mechanism. The uptake analysis will also let us define a practical USS - the set of sequences that are consistently taken up much better than other sequences. At present most analysis has considered only sequences perfectly matching the core consensus of the motif. This is an easy place to draw a line between USS and not-a-USS but it is very unlikely to correspond with reality.

We may also be able to find out whether a distinct DNA-binding step precedes the initiation of uptake? DNA uptake is conventionally treated as being initiated after DNA is specifically bound to the cell surface. But we don't know if DNA does stick to competent cells in a way that it doesn't to non-competent cells, or if it does, whether the binding is sequence-specific. In principle I think that any protein-catalyzed step can be modeled as having separate binding and catalysis stages, but as a first step we just want to find out whether whatever protein or proteins that initially bind the DNA also play direct roles in its uptake. The alternative would be a protein that just binds DNA and holds it close to the cell surface, increasing its local concentration so it's more likely to encounter the uptake machinery. If we did find such a protein we'd want to find out whether it had any sequence specificity. The post-doc's experiments won't be able to tell us whether the protein exists, but he may be able to look for a specific DNA fraction that has bound to competent cells but not started uptake.

The final part of his experiments will sequence DNA from a pool of 10^9 cells that have recombined fragments of donor DNA into their chromosomes. This will be the biggest sequencing challenge, as only about 1% of each genome will be recombined donor sequences. But it will give a big payoff for understanding both the mechanistic factors limiting recombination and the evolutionary consequences of recombination between diverging lineages.

For example, are USS sufficiently close together that they cause recombination to be high and even across the whole genome? Or does recombination reveal substantial disparities in coverage that correlate with differences in USS spacing? The answer is likely to depend on the lengths of the DNA fragments the cells are given; fragments shorten than the average USS spacing will likely give large disparities that will be evened out when the fragments are large enough that they usually contain at least one good USS.

Thursday, June 25, 2009

tiny magnetic beads?

I was about to order a tester-kit of streptavidin-coated magnetic beads from Invitrogen (4 different bead surface types, to find the surface that gives the least background), when I realized that these weren't the tiny 50 nm beads I had in mind but 1 µ beads that are as big as H. influenzae cells. I don't want to use great big beads if I can get smaller ones (yes, I realize that 1 µ is hardly 'great big').

Now I can't find the supplier that had the 50 nm streptavidin-coated beads at all. The best I can find is a European company, Ademtech, that has 100 and 200 nm beads. I've sent them an email asking if they have a Canadian supplier, but I'm not optimistic. Maybe I should just try the tester kit of big beads, and then scale down...

Invitrogen charges more than $500 for a stand that holds 16 microfuge tubes against rare-earth magnets, to separate the beads from the liquid they're suspended in. The 6-tube one is only about $150. But, I can buy a sampler of 50 NdFeB magnets of assorted sizes (1/8 inch to 1/2 inch, rods and discs) from Lee Valley Tools for $16.50.

Wednesday, June 24, 2009

Real USS spacings are not random


The co-author only needed about 20 minutes to write the little perl script I'd asked for, and here are the results I generated with it. The blue bars show the spacing distribution of the real H. influenzae uptake sequence cores, and the red bars show the spacing distribution of the same number of random positions in a sequence of the same size (based on 10 replicates).

So the statistical advisor was correct; the real distributions have far fewer close spacings than expected for a random sequence. This is despite the the presence of two forces expected to increase close spacings - USS pairs serving as transcription terminators, and general clustering of USS in non-coding sequences. As hypothesized in yesterday's post, the discrepancy between real and random is stronger when we eliminate the terminator effect by only considering spacings between USS in the same orientation.

And the previous mathematical analysis was right, the real spacing distribution is more even than the random one - not only are there fewer close USS than expected, there are fewer far-apart USS too. (This is more obvious in the upper graph.)

Now I need to modify the perl script that locates uptake sequences so it will find the Neisseria DUS, and then do the same analysis for the N. meningitidis genome. The previously published analysis of these spacings simply compared the mean spacing of DUS with the mean length of apparent recombination tracts, concluding that the similar lengths was consistent with the hypothesis that DUS arise or spread by recombination.

Tuesday, June 23, 2009

Spacing of uptake sequences

I've been looking more at the spacings of the uptake sequences that arise in our model genomes and that are present in the real H. influenzae genome. First, in the panel on the left, is what we see in our simulated genomes. I only have data for simulations done with relatively small recombination fragments because the others take a very long time to run.

The notable thing (initially I was surprised) is that almost all of the uptake sequences in these genomes are at least one fragment length apart. That is, if the fragments that recombined with the genome were 25 bp (top histogram), none of the resulting evolved uptake sequences are in the 0-10 and 10-20 bins. If the fragments were 500 bp (bottom histogram, not to the same scale), none are in any bin smaller than 500 bp.

For the 100 bp fragments I've plotted both the spacings between perfect US (middle green histogram) and the spacings between pooled perfect and singly mismatched US (blue histogram below the green one). When the mismatched US are included the distribution is tighter, but there are still very few closer than 500 bp.

This result makes sense to me, because the uptake bias only considers the one best-matched US in a fragment. So a fragment that contained 2 well-matched USs is no better off than a fragment with one, and selection will be very weak on USs that are within a fragment's length of another US.

I talked this result over with my statistics advisor, and we agreed that there's not much point in worrying about whether the distributions are more even than expected for random spacing, because these are so obviously not random spacing.

After getting these results I decided I should take my own look at the spacings of the real uptake sequences in the H. influenzae genome, and these are shown in the graphs in the lower panel. I first checked the spacings of all the uptake sequences that are in the same orientation; the upper graph shows the pooled data for spacings between 'forward' USS and spacings between 'reverse' USS. Let's consider this graph after the one below it.

The lower graph shows the spacings between all of the USSs, regardless of which orientation each is in. This is the data that has been previously analyzed and found to be non-random (in a mathematical analysis that's too sophisticated for me to follow).

We don't see the same abrupt drop-off at close spacings that's in the simulated genomes, but the statistics advisor thought the drop-off at close spacings might be significant. Rather than making a theoretical prediction, he suggested just generating randomly spaced positions (same number of occurrences as the real USS, and in the same length of sequence as the real genome) and comparing their distributions to the real ones in my graph. I'm asking my co-author to write me a little perl script to do this (yes, I probably could write it myself, but she can write it faster). I'll use it to generate 10 random pseudo-genomes worth of spacings, and compare their distributions to my real genome.

There are good reasons to expect the real distribution to not be random. First, uptake sequences are preferentially found outside of genes. In H. influenzae 30% of the USS are in the 10% of the genome that's non-coding, and this would be expected to cause some clustering, increasing the frequencies of close spacings. Second, USS often occur in closely spaced, oppositely oriented pairs that act as transcriptional terminators. This is also expected to increase the frequencies of close spacings. I can't easily correct for the first effect, but I can correct for the second one by only considering USS that are in the same orientation, and that's what I've done in the upper graph.

There we see a much stronger tendency to avoid close spacings. The numbers aren't really comparable because I'm only considering half as many USS in each orientation, but the overall shapes of the two distributions suggest fewer close spacings in the same-orientations graph.

So, two things to do: First, predict the spacings of random positions, using the perl script that isn't written yet. Second, I need to get better data for the simulated evolved genomes, because these were initiated with fake genomes consisting of pure strings of uptake sequences, and the uptake sequences in the resulting evolved genomes had not originated de novo but instead had been preserved from some of the originating ones. I need to redo this with evolved genomes that began as random sequences, but these are taking longer to simulate.

Added later: And a third thing to do: Reread the paper by Treangen et al., who investigated the spacings of the Neisseria DUSs and compared them to the average lengths of recombination tracts.

Monday, June 22, 2009

Uptake of closed circular DNA

H. influenzae is reported to efficiently take up closed circular DNA as well as linear DNA. The DNA can be re-isolated intact after the cells have been treated with DNase I, so it is thought to be in the periplasm. I've based much of my thinking about the mechanism of DNA uptake on this result, but I've never tested it for myself. Now's the time, because the ideas in the DNA uptake proposal depend on this being true.

How to do it? How was the original experiment done? (First step is to find the paper...) Would they have used radioactively labeled DNA, and if so how did they label it without compromising its supercoiled state? (Prep the plasmid from E. coli grown in the presence of massive amounts of 32-P? Yikes!) Or maybe they did a Southern blot to detect it? Find the paper...

Basic plan: Incubate competent cells (wild type or rec-2 mutant) with supercoiled plasmid containing a USS - pUSS1 isolated from E. coli using a kit should be fine, except for the labeling question. Treat cells with DNase I and wash them thoroughly to get rid of external DNA. Lyse cells and prep DNA using a plasmid-prep kit. Run in a gel. If the recovery is really high I might be able to see the unlabeled plasmid, especially if I use one of those ethidium alternatives that's much more sensitive. How high could recovery be? Say I have 10^9 cells, each takes up 10 plasmid molecules. That's only 10^10 plasmids; if they're 2kb long that's 2x10^10kb of DNA, which is about 0.02 micrograms. 20 ng can easily be seen in a gel, especially with a fancy dye, but I'd be out of luck if uptake is less or recovery is poor (for example, if the DNA is nicked the kit won't work).

Ideas from lab meeting

At today's lab meeting we discussed the feasibility of the six Specific Aims from the 2007 CIHR proposal on DNA uptake, and considered what kinds of preliminary data we might generate. I'll discuss each separately here.
  • "What is the H. influenzae uptake specificity? A pool of USSs that have been intensively but randomly mutagenized and then selected for the ability to be taken up by competent cells will be sequenced to fully specify the uptake bias."
The new-post-doc is beginning experiments to show that he can reisolate DNA that has been specifically bound by competent cells. He also suggested a better way to make the pool of randomly mutated USS. Rather than using the degenerate 30-bp USS oligos as primers with a mutagenesis kit, he suggests instead creating 50- to 60-bp oligos with 20 bp at the 3' end outside of the USS. He'll then use Klenow polymerase to fill in the second strand of these oligos, with a primer complementary to the 3' end, and clone these to make our diverse library of degenerate USSs. Sequencing of some of these will provide preliminary data that our input is correct.
  • What forces act on DNA during uptake? Laser-tweezer analysis of USS-dependent uptake by wild type and mutant cells will reveal the forces acting on the DNA at both the outer and inner membranes.
This work was begun by a physics grad student from the lab with the tweezers (across town). I subsequently revised the plan to make things simpler, more efficient, and better controlled. I have a stock of the styrene beads somewhere (with streptavidin), and I've already checked that our standard MAP7 DNA has a suitable length distribution. So I need to put biotin ends on the DNA. I have biotinylated nucleotides but don't remember how I was going to attach these to the DNA ends. I could cut the DNA with a rare-cutting restriction enzyme (SmaI?) and then fill-in with the nucleotide (better read my old notes). SmaI sites are sufficiently rare that most molecules in the MAP7 DNA prep would have biotin only on one end. Then I need to bind the DNA to the beads. That's easy, but how do I check how much DNA is on the beads? Measure DNA concentration by Absorbance at 260? (Does styrene absorb UV?) Then I need to arrange with my physicist collaborator to try them out, using Bacillus subtilis first because its cells are nice and big and it doesn't care about the sequence of the DNA it takes up.
  • Does the USS polarize the direction of uptake? Using magnetic beads to block uptake of either end of a small DNA fragment will show whether DNA uptake is symmetric around the asymmetric USS.
This time I need to lay out clearly how answering this question would help distinguish between different models of uptake. We tried this experiment unsuccessfully about 5 years ago with much bigger streptavidin-sepharose beads (we later realized that the beads were way bigger than the cells, and porous to boot). I need to do more careful thinking of how this would be done and what we might measure.
  • Does the USS increase DNA flexibility? Cyclization of short USS-containing fragments will reveal whether the USS causes DNA to be intrinsically bent or flexible, and whether ethylation or nicking can replace parts of the USS.
The RA pointed out that a former post-doc tried unsuccessfully to get this cyclization assay working. The problems are all nicely recorded in her blog, as well as being documented in her lab book. Hmmm, what to do? I found a 2009 paper about DNA kinking that used nicking and 2 nt gaps as controls; they didn't see significant kinking with their nicked DNAs. I was going to test whether ethidium causes restriction enzymes to nick circles instead of cutting both strands, but the post-doc and RA tell me that we can now buy 'nickase' derivatives of restriction enzymes instead.
  • Which proteins interact with incoming DNA? Cross-linking proteins to DNA tagged with magnetic beads, followed by HPLC-MS, will be used to isolate and identify proteins that directly contact DNA on the cell surface.
It would be good to at least show that the cross-linking works... The RA says that she and a former post-doc tried using formaldehyde to cross-link DNA and proteins and got nothing but gunk. So I think this will take a lot of optimization. Start by getting the magnetic beads on the DNA. Start by ordering some magnetic beads.
  • Which proteins determine USS specificity? Heterologous complementation with homologs from the related Actinobacillus pleuropneumoniae (which recognizes a variant USS) will identify the proteins responsible for this specificity.
I made the plasmids for this work several years ago, but they were not well-tolerated by the host cells. I should pull them out and see if I can get some results.

Before any of this, I or the post-doc should carefully test whether H. influenzae can indeed take up closed circular DNA intact. I've been citing an old report of this for years, but it's critical to my model of DNA uptake so I really need to test it myself. I'll write a separate post about this.

Babes

Sunday, June 21, 2009

Preliminary experiments for our next grant proposals

It's my turn to do lab meeting tomorrow afternoon, and I'm going to talk about the preliminary data needed for the CIHR and NIH grant proposals I'm planning to submit in September. In preparation I've been rereading the CIHR proposal that didn't get funded two years ago, and the reviewers' critiques of it.

Basically it's a lovely proposal with one massive flaw. The experiments are a fine mix of safe and risky, and the proposal is beautifully written and illustrated. BUT, there's no preliminary data to show that the experiments we propose are likely to work. (The reviewers also grumbled that I was being sneaky in submitting proposals to two different committees in the same competition. Such greed is, of course, un-Canadian, but that won't be an issue this time.)

So now it's time to think more strategically (i.e. about strategy) than I usually do. Which of the 6 lines of investigation I propose are easiest to generate preliminary data for? Which are the most severely weakened by their lack of preliminary data?

Here are the Specific Aims from that proposal, with brief notes about preliminary data in italics:
  1. "What is the H. influenzae uptake specificity? A pool of USSs that have been intensively but randomly mutagenized and then selected for the ability to be taken up by competent cells will be sequenced to fully specify the uptake bias." The new-post-doc is about to start doing preliminary experiments for this and related work, so we should at least be able to show that we can reisolate DNA that has been specifically bound by competent cells.
  2. "What forces act on DNA during uptake? Laser-tweezer analysis of USS-dependent uptake by wild type and mutant cells will reveal the forces acting on the DNA at both the outer and inner membranes." This work was begun by a physics grad student from the lab with the tweezers (across town). We have all the components ready, and I'm keen to work on this myself.
  3. "Does the USS polarize the direction of uptake? Using magnetic beads to block uptake of either end of a small DNA fragment will show whether DNA uptake is symmetric around the asymmetric USS." We've tried this unsuccessfully in the past with much bigger beads. The magnetic beads may also be valuable for the post-doc's DNA-reisolation work - I wonder what form preliminary data should take.
  4. "Does the USS increase DNA flexibility? Cyclization of short USS-containing fragments will reveal whether the USS causes DNA to be intrinsically bent or flexible, and whether ethylation or nicking can replace parts of the USS." We should at least try this cyclization assay - it shouldn't be a big deal.
  5. "Which proteins interact with incoming DNA? Cross-linking proteins to DNA tagged with magnetic beads, followed by HPLC-MS, will be used to isolate and identify proteins that directly contact DNA on the cell surface." It would be good to at least show that the cross-linking works...
  6. "Which proteins determine USS specificity? Heterologous complementation with homologs from the related Actinobacillus pleuropneumoniae (which recognizes a variant USS) will identify the proteins responsible for this specificity." I made the plasmids for this work several years ago, but they were not well-tolerated by the host cells. I should pull them out and see if I can get some results.

Friday, June 19, 2009

For the Discussion?

We need to remember that the uptake bias specified by the position-weight matrix the model uses isn't the effective bias in the run. Instead, the effective bias changes within each cycle and between cycles. In each cycle it typically starts with even the perfect-US fragments having only a 10% chance of recombining, and decreases gradually to a point where the disparity between perfect and less-than-perfect US is substantially reduced. How substantially depends on how many US the genome has at that cycle.

Thursday, June 18, 2009

The Lewontin approach to confusion

(I should be thinking about DNA uptake projects but instead I'm still trying to understand why so few US accumulate in our simulations.)

I'm (at last) trying the approach I learned from Dick Lewontin when I was a beginning post-doc. When you're trying to understand a confusing situation where multiple factors are influencing the outcome, consider the extreme cases rather than trying to think realistically. Create imaginary super-simple situations where most of the variables have been pushed to their extremes (i.e. always in effect or never in effect). This lets you see the effect of each component in isolation much more clearly than if you just held the other variables constant at intermediate values. Then you can go back and apply the new insights to a more complex but more realistic situation.

So I've created versions of our model where the input DNA consists of just end-to-end uptake sequences, where the genome doesn't mutate (the fragments being recombined do have mutations), where the probability of recombination is effectively 1.0 for fragments with perfect uptake sequences and almost zero for fragments with less-than-perfect ones, and where this probability doesn't change during the run. The genome is small (2 kb) to speed up the run and the fragments are only 25 bp, so recombination doesn't bring in a lot of background mutation.

These runs reach equilibria with 45-50 uptake sequences in their 2 kb genomes. If they had one uptake sequence every 25 bp they would have 80 in the genome. Maybe this is as good as it can get - it's certainly way better than what we had previously.

Eliminating the background mutation in the genome makes a surprisingly large difference; a run with it present had only 16 perfect uptake sequences. I wonder if, with this eliminated, the amount of the genome recombined per generation now has much less effect on the final equilibrium? In the runs I've just done, fragments equivalent to 100% of the genome are recombined, replacing about 63% of the genome in each cycle. So I've now run one with half this recombination - it reaches the same equilibrium.

I had previously thought that the reason our runs reached equilibria that were independent of mutation rate was some complex effect of the weird bias-reduction component of our simulations. But in these new 'extreme' runs I had the bias reduction turned off, and I got the same equilibria with mutation rates of 0.01 and 0.001. I also tested a more-extreme versions of our position-weight matrix, where a singly mismatched uptake sequence was 1000-fold less likely to recombine rather than the usual 100-fold less, and this gave the same equilibrium as the usual matrix. So I tried using a weaker matrix, with only a 10-fold bias for each position. This gave about 30 uptake sequences in the 2 kb, a much less dramatic reduction than I had expected. Not surprisingly there were also more mismatched uptake sequences at the equilibrium.

This is excellent progress for two reasons. First, I now have a much better grasp of what factors determine the outcomes of our more realistic simulations. Second, I now know how to generate evolved genomes with a high density of uptake sequences, which we can analyze to determine how the uptake sequences are spaced around the chromosome.

Wednesday, June 17, 2009

The one big problem

Now that we have more data from our model and I've summarized it, we're left with one big problem: Why don't more uptake sequences accumulate in our simulated evolved genomes?

In all the generic simulations the bias favouring perfect matches to the 10 bp consensus US is very strong - they're favoured 100-fold over sequences with one mismatch to the consensus. If nothing else is limiting I would expect the final evolved genomes to have about one of these perfect US in each uptake-sized segment. That is, if the fragments taken up in the run were 1000 bp, I would expect the evolved genome to have about one perfect US per 1000 bp. If the simulated fragments were 100 bp, I'd expect about one perfect US per 100 bp. But my impression is that we get a lot less than this. The outcome appears to be independent of mutation rate, so the explanation shouldn't be that accumulation is limited by mutational decay of uptake sequences.

The other variable I could change is how quickly the bias is reduced when fewer than the desired number of fragments have recombined. At present it's 0.75. Increasing it would make the early cycles go slower, but with such a high mutation rate this isn't a big problem. I'll try a couple of runs using 0.9 and 0.95.

I wrote 'my impression is' above because I haven't really gone back and redone/rechecked the simulations that should have given the most accumulation. These would be ones where the amount of the genome replaced by recombination each generation is almost 100%, or where the genome mutation rate is set to zero. I'll look carefully at what's already done and queue up some new runs today.

This is a really important issue. We need to understand what is limiting US accumulation in the model, and I don't think we do yet.

Update: I've checked the runs my co-author did before she left. With 100 bp fragments, replacing 50% of the genome in each cycle, she always got less than one perfect US per kb. A run that started with 200 perfects seeded in a 200 kb synthetic sequence ended with 33, not 2000. A run that started with a random sequence ended with 18 perfects. These may not have been the final equilibria, but , in the last half of both runs the numbers of perfects were wandering around between 10 and 100, never more. In these runs the mutation rates for the genomes were very low (the rates for the fragments were 0.001), so the problem isn't background mutational decay of perfect US (the numbers of singly mismatched US at equilibrium were also low, about 100-150).

While I was traveling I did some runs with 500% of the genome replaced - in practice this means that most parts of the genome would have replaced several times before the cycle ended, and only a few bits would have not been replaced at all. These runs had less sensitivity because the genomes were only 20 kb, but they did give higher scores, with about 30-40 perfect US in the 20 kb genome. With this much recombination, the mutation rate of the genome itself becomes irrelevant as the almost the whole mutated genome is replaced by recombination in each cycle.

Should I repeat these runs, and do some with fragments only 50 bp long too? The only reason would be to increase my confidence that these are real equilibrium outcomes. I had terminated my runs when the scores seemed stable, but maybe they would creep up if left long enough (maybe using a seeded genome isn't a very good test of the true approach to equilibrium). The results were the same with mutation rates of 0.001 and 0.01, so I could use a rate of 0.01 and set it to run for 10,000 cycles, which is more than 10 times longer than my other runs went for.

More US-variation progress

The manuscript about variation in uptake sequences is still a mess, but today I'm going to send it off to my co-author anyway because my brain is tired of dealing with it. The Introduction isn't bad, and the Results is at least complete in that it describes all the analyses and their results (even though a couple of the model analyses aren't actually done yet). But the Discussion is still a shambles. It's not that we don't have important ideas to put in the Discussion, but I'm too saturated with details to think about them.

The draft figures are done. Some of them are far from polished, but they show what the data is. A couple of others are sketches because the data is incomplete. Lots of new runs to generate the missing data have already finished on the new server, and other (longer) runs are ongoing.

One analyses I'm working on is the spatial distribution of uptake sequences around the genome. Years ago a paper reported that USS are spaced more evenly than would be expected for randomly positioned sequences and suggested this might mean they had an intracellular role in chromosome maintenance. I'm wondering if this even spacing might instead by a consequence of the recombination events they arise by - if so we would expect the uptake sequences that arise in our simulations to also be relatively evenly spaced. A recent paper on Neisseria DUS found spacing to have roughly the same properties as the average lengths of recombination tracts (but this was a very weak constraint).

To do this analysis we need to know the locations of our uptake sequences in our evolved simulated genomes, but our model doesn't report this. On Monday I tried using the Gibbs motif sampler to find these. It easily found the perfect occurrences but needed some nudging to find most of the singly and doubly mismatched ones. This will be useful for making logos that show the evolved motifs, but not very suitable for analyzing spacing. My co-author has now modified a little perl script she had written so it lists the positions of the uptake sequences. The version she sent was for the USS 9 bp core, so I modified it to find the 10 bp generic US that the simulations had been generating. So now I have my first list of spacings.

The USS spacing analysis was done by Sam Karlin (noted Stanford mathematician interested in genomics), using some fancy math he'd invented. I emailed my local statistics expert about the best way to analyze the spacings, and he recommended that I use simulations (rather than probability theory) to find out what a random distribution should look like, and then compare our spacings to this.

To do this I think I need to write a little script that generates a specified number of random positions in a string of specified length (e.g. get 243 positions in a 200,000 sequence, to match the data I had for an evolved genome) and calculates the variance of the spacing distribution. The script needs to do this many times (10,000, suggested the expert!), and generate some summary that I can use to decide whether the spacings of the 243 uptake sequences in my evolved genome is significantly different from random spacing. The expert is off at the evolution meetings right now - I'll consult with him when he gets back.

In the interim I need to get back to thinking about DNA uptake, to plan the preliminary experiments and analyses we want to get done for our upcoming proposals.

Sunday, June 14, 2009

US-variation progress

Progress on both the simulation and manuscript-writing fronts:

The minor snags with using the new server were easily resolved, and yesterday I queue'd up all of the runs needed for the data on mutation-rate effects. Some of them I queue'd up twice because I made mistakes in the settings the first time - one result of this is that we'll have five rather than four replicates of the key runs. And all but two of the runs had finished by this morning - the two still running were doing 50,000 cycles rather than 10,000.

The new writing plan is to generate a draft manuscript (very crude in spots) and draft figures (some in final form, a few just sketches because I don't have the final simulation data yet), and send the whole thing off to my co-author. She can then revise the draft while I generate and analyze the rest of the data and create the semi-final figures. Our goal is to have the semi-final version of the whole thing ready when we meet up at the Gordon Conference on Microbial Population Biology (end of July), and to do the final revisions together then.

Friday, June 12, 2009

Getting organized

Yesterday I wrote an outline for the simulation section of our US-variation manuscript, based what I could remember about the new simulations I ran while I was traveling, and sent it to my co-author. Then I started going through all these simulations to figure out exactly what I had actually done and learned.

I printed out some blank table sheets and filled in a line for each run, listing the key settings and the results (US score, had the genome reached equilibrium). There were about 45 runs, but I'd been sensible enough to keep the related ones together and delete all the false starts so it only took me a couple of hours to sort them all out. Now (once I hear back from my co-author), I need to decide where the gaps are and what replicates are needed, and start queueing the runs on the new server.