Illumina Data Reduction Primer

March 13th, 2013

Here are some notes on how I analyze Illumina data. I would appreciate any comments on how to improve, supplement or extend these methods.

I’m currently using an Amazon Machine Image to analyze PacBio data, and hope to write about this experience soon. Suffice it to say that the high error rate and variable read lengths make the process much more challenging than was the case for Illumina paired reads.

-- Eric Kofoid

The Amazing History of DR397

March 29th, 2012

Drew Reams studies unselected duplication formation, and recently found an eel in the DNA!

Recall that duplication formation is a fundamental small-effect driver of evolutionary adaptation, and TIDs may be the primum mobile of most or all duplications. Drew’s “recalcitrant” duplication strains — those with joints not easily determined by multiplex PCR — are a class which should include TIDs. To analyze them, he sequences their genomes by Illumina technology.

We think TID formation often begins with a snap-back of a 3′ end at a short inverted repeat, forming a stem-loop structure which then initiates DNA synthesis using self as template.
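The snap-back step can be illustrated with a toy sketch. This is only a string model under stated assumptions: the strand is written 5′ to 3′, the stem must pair perfectly, and the function names, stem length and minimum loop size are hypothetical choices made for the example, not measured values.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def snapback_site(strand, stem=6, min_loop=3):
    """Index where the upstream arm begins if the 3' end can fold back
    and pair with it, else None. stem and min_loop are assumed values."""
    tail = strand[-stem:]
    # The upstream arm must end at least min_loop bases before the tail.
    limit = len(strand) - stem - min_loop
    idx = strand.rfind(revcomp(tail), 0, limit)
    return None if idx == -1 else idx

def self_templated_extension(strand, stem=6, min_loop=3):
    """Extend the folded 3' end using the strand itself as template."""
    idx = snapback_site(strand, stem, min_loop)
    if idx is None:
        return strand
    # Synthesis copies the bases upstream of the arm, in reverse order.
    return strand + revcomp(strand[:idx])

# Invented sequence: flank + arm + loop + inverted copy of the arm.
strand = "TTGACC" + "AGCTAG" + "CCCC" + "CTAGCT"
product = self_templated_extension(strand)
```

In this toy case the product ends with the reverse complement of the upstream flank, so the molecule now carries an inverted copy of it, which is the raw material for a TID.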

A subsequent template switch restores the fork and finishes the TID. See the excellent article in our Encyclopedia.

The extensive secondary structures at such symmetric TID joints are toxic and rarely observed. Instead, remodelling asymmetric deletions are selected spontaneously, yielding “SJ” (“short junction”) joints. David Leach has shown that sbcCD backgrounds tolerate such structures by not cutting them with the endonuclease product.

Drew wondered what would happen to duplication frequencies in such backgrounds, which may allow the cell to survive the two initial steps of TID formation and increase the yield of duplications. To enhance stem-loop persistence, he also made the cells recQ to prevent stem melting.

He observed a 2-fold increase in duplications in both the chromosome and wild type F’128. The latter contains two IS3 elements and nearly all spontaneous duplications happen by recombination between them. When at least one was removed, the duplication frequency increased by an order of magnitude compared to sbcCD+ recQ+ cells. Something definitely happens in the absence of sbcCD and recQ when IS recombination is blocked.

One sequenced duplication was in F’128 of strain DR397. It had a lacZ read depth 3-fold greater than the chromosome, with the remaining plasmid DNA about 9-fold greater. In addition, there was a large deletion extending from traI up to lacZ, accounting for ~20% of the plasmid.
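The fold values quoted here are ratios of mean read depth. A minimal sketch of that arithmetic follows, assuming per-base depths are already available as plain lists. The numbers are invented to reproduce the 3-fold and 9-fold ratios and are not the DR397 data; the function name is hypothetical.

```python
from statistics import mean

def fold_depth(region_depths, reference_depths):
    """Mean read depth of a region relative to a reference region."""
    ref = mean(reference_depths)
    if ref == 0:
        raise ValueError("reference region has zero coverage")
    return mean(region_depths) / ref

# Hypothetical per-base depths, for illustration only.
chromosome = [40, 42, 38, 40]
lacZ = [120, 118, 122, 120]
plasmid_rest = [360, 358, 362, 360]

lacz_fold = fold_depth(lacZ, chromosome)             # lacZ vs chromosome
plasmid_fold = fold_depth(plasmid_rest, chromosome)  # rest of plasmid vs chromosome
```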

When we inspected the anomalous read-pair data, we discovered two symmetrical TID joints. We were able to confirm these by showing that reads right at the edge of the deletion window fully contained these joints.
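One way to flag anomalous pairs is sketched below. It assumes a standard paired-end library in which mates normally map to opposite strands and face each other, so pairs whose mates land on the same strand are the ones expected across an inversion joint. The record layout, labels and insert-size cutoff are hypothetical and do not correspond to any particular aligner's output.

```python
from typing import NamedTuple

class Pair(NamedTuple):
    pos1: int      # leftmost mapped coordinate of mate 1
    strand1: str   # '+' or '-'
    pos2: int      # leftmost mapped coordinate of mate 2
    strand2: str   # '+' or '-'

def classify(pair, max_insert=600):
    """Label a mapped read pair; max_insert is an assumed cutoff."""
    if pair.strand1 == pair.strand2:
        return "same-strand"   # candidate inversion (TID) joint
    # Order the mates by coordinate to see whether they face each other.
    (lpos, lstrand), (rpos, _) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    if lstrand == "-":
        return "everted"       # mates point away from each other
    if rpos - lpos > max_insert:
        return "too-far"       # candidate deletion
    return "normal"

# Invented pairs, for illustration only.
pairs = [Pair(100, "+", 350, "-"), Pair(100, "+", 350, "+"),
         Pair(100, "-", 350, "+"), Pair(100, "+", 5000, "-")]
labels = [classify(p) for p in pairs]
```

Same-strand pairs that pile up at one coordinate mark a candidate joint, which can then be confirmed with reads that span it.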

Drew’s suspicions were vindicated. Removing activities which cut stem-loop structures or prevent them from folding encourages secondary structure formation and reduces counterselection of the long symmetrical products of snap-back and strand switching.

But, what’s the meaning of the enormous deletion? There are a couple of models which come to mind.

Model 1: A snapback at traI followed by another at lac, 5′ resection and ligation of ends will produce a product that matches our observations. It would look something like the following: 

Model 2: A snapback at traI and strand switch at lac will restore the replication fork and lead to a TID. Eventual remodelling by recombination of the flanking repeats will yield our observed product: 

Other related models are possible, with the order of the snapbacks and strand switches reversed, with double strand switches instead of snapbacks, etc. Nevertheless, we favor the second model above, as it only involves two steps to get the essential intermediate, while many generations can pass before selection of the final product.

An interesting aspect of the TID model is the inevitability of the remodeling event. When the origin is itself in the TID, counterselection on a large number of unnecessary genes leads easily to their deletion by recombination. The resulting fitness increase will lead to expansion in the population of the symmetric inversion and eventual extinction of the TID.

-- Eric Kofoid

Sophie’s Mystery

March 28th, 2012

We’ve been developing methods to identify cells carrying TIDs (“Tandem Inverted Duplications”). Sophie Maisnier-Patin uses P22 transduction to deliver various tools which will introduce a drug resistance cassette only if they are able to recombine into such inverted structures.

Sophie asks whether a kanR MudJ element can nucleate TIDs by virtue of its terminal inverted repeat. This repeat is 104 bp long and can form a perfectly paired 48 bp stem with an 8 base loop (2 × 48 + 8 = 104). The hypothesis is something like the following:

The tool Sophie used in this experiment contained two oppositely facing MudJ elements:

She transduced this into a recipient containing MudJ and a deletion which should prevent recombination with cob and his:

The resulting strain, SMP1666, should have a lac+ camR kanR trp- phenotype. Unfortunately, no TIDs were found. Instead, Illumina and subsequent PCR analyses showed that one end of the tool inserted by a strand annealing mechanism, as the his:cob deletion was not quite large enough. The strain was unstable for chloramphenicol resistance, as expected from the flanking trpB::MudJ direct repeats. Sophie showed by transduction to trp+ that the SMP1666 parent, as expected, contained a MudJ-bisected trpB locus, and that trp+ transductants, whether camR or camS, were invariably lac-, another prediction of the hypothesis. The corrected model became:

The Mystery

A large fraction of the trp+ lac- transductants retained kanamycin resistance! How could this be, when elimination of the last MudJ in the chromosome should, by definition, also remove the remaining kanR and lac loci? Clearly, something was wrong. There had to be another MudJ with an impaired promoter incapable of driving lacZ. We wrestled for weeks with elaborate models. All of them suffered from the need for at least two concerted events, implying a low rearrangement frequency, when the opposite had been observed.

We remembered some interesting results from two other Sophian strains, SMP1750 & TT26263, in which the donor DNA had recombined with the recipient through the MudJ inverted repeat locus itself, even though the element orientations were divergent. Although the inverted-repeat region is short, when folded it is prone to attack by SbcCD. This would provide many 3′ ends which could then anneal and rescue the cell. A corollary is that the MudJ stem-loop structure is recombinationally potent.

In light of this, we have refined our model:

The head-to-head Muds in the middle would have no promoter at all for lac, but the internal constitutive kanR promoter would still be active. The trpB gene would, of course, have been restored exactly as expected in earlier models.

-- Eric Kofoid

Hastings on TIDs

November 3rd, 2011

An article [1] by P.J. Hastings and colleagues corroborates our observations that a substantial fraction of amplifications occur as TIDs.


1. Lin D, Gibson IB, Moore JM, Thornton PC, Leal SM, Hastings PJ. 2011. Global chromosomal structural instability in a subpopulation of starving Escherichia coli cells. PLoS Genet 7: e1002223

-- Eric Kofoid

Dearth of Wikipedia Science Articles

October 17th, 2011

I recently submitted a Wikipedia article, an interesting and illuminating experience. In the process, I discovered that many important scientific topics are missing or poorly represented in this massive compendium. Examples of some of the more egregious no-shows: the his, eut, and pdu operons, linear transformation, F’128, and printing (the genetic technique).

Creating and editing Wikipedia articles is relatively easy and explained here. Registering as an editor takes a few minutes, and getting up to speed only a couple of hours. Once you’ve done this, writing an entry is not much harder than writing any other essay. You should give it a try!

-- Administrator

Lenski, Creationism & Rational Thought

November 19th, 2010

Richard Lenski published a pivotal paper in 2008 which rigorously demonstrated evolution in a laboratory setting [1]. Recently, a “controversy” over Lenski’s research backfired, severely discrediting his critics, a group of creationists and intelligent designers whose intellectual and logical abilities are markedly unscientific. The debate has been mirrored here, the mirroring being necessary as the original site has been subjected to frequent revisionist censoring. The critical rebuttal was this reply by Lenski to Andrew Schlafly. An article in the New Scientist blog discusses the brouhaha. Another good discussion can be found at RationalWiki.


1. Z. D. Blount, C. Z. Borland & R. E. Lenski (2008) “Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli” PNAS 105 7899-7906.

-- Eric Kofoid

CoGe

November 12th, 2010

There is an excellent comparative genomics site at UC Berkeley called CoGe. Check it out!

-- Eric Kofoid

Naked MudJ Revealed

November 8th, 2010

Using Solexa methodology, I recently sequenced strain SMP1666 (from Sophie Maisnier-Patin), containing 3 copies of MudJ. One had lost the infamous trpB-proximal stem-loop structure [1,2], and the other two were pristine (one hesitates to say “wild type”, as the lollypop structure was itself an artifact of the original construction out of a MudI background [3]). The complete MudJ sequence can be found here. The big surprise is the cynX & cynS genes between lac and the Tn5 drug-resistance cassette.


1. J. Zieg & R. Kolter (1989) “The right end of MudI(Ap,lac)” Arch. Microbiol. 153 1-6
2. W. Metcalf, P. Steed & B. Wanner (1990) “Identification of Phosphate Starvation-Inducible Genes in Escherichia coli K-12 by DNA Sequence Analysis of psi::lacZ(Mu dl) Transcriptional Fusions” J. Bacteriol. 172 3191-3200
3. B. Castilho, P. Olfson & M. Casadaban (1984) “Plasmid insertion mutagenesis and lac gene fusion with mini-mu bacteriophage transposons” J. Bacteriol. 158 488-495

-- Eric Kofoid

A Favorite Rothian Story

September 28th, 2010

An extravagantly wealthy king had twin sons. One was very pessimistic and the other unusually optimistic. The king was worried that both would suffer in the long run from these defects. He decided to apply the well-known therapy of confronting extremes.

He led his young pessimist to a large room in the palace, filled with wonderful toys from floor to ceiling. The child immediately burst into tears. “What is the matter, my son?”, exclaimed the king. “I know that these toys will all eventually break, and my sadness will be all the greater as there are so many!”, exclaimed the little prince through his tears.

Puzzled by this, the king fetched his other son, the eternal optimist, and led him to a large room in the royal stables, a room filled from floor to ceiling with manure. When the child saw this, he immediately yelled a cry of glee, ran to the pile and began to dig furiously through it with his bare hands. “My son!”, said the king, “Have you gone mad? Why are you doing this?”, to which the joyous prince replied, “Oh, father! Thank you so much! With all this horse shit, there’s sure to be a pony in there somewhere!”

-- Eric Kofoid

The Sad Truth about Illumina Data Clustering

September 28th, 2010

This article is a continuation of Illumina Data Clustering, and is a perfect example of why we, as scientists, should resist the hubris of premature expectations.

The standard Illumina protocol for library preparation requires 18 cycles of PCR after adaptor ligation to enrich for fragments with doubly modified ends. Incomplete products from a previous round can snap back during this step (“megaprimer snap-back”), creating artifactual templates which will then amplify along with the others. This is a first order process and should be fairly common when the 3′ end of the elongating strand just happens to fall on a complementary REP site. A less common artifact could occur by a second order process involving megaprimer extension and reannealing in trans to a complementary REP site.

I found a group of closely spaced NlaIV restriction sites which, when digested, would destroy megaprimer formation by the snap-back route. If clusters arose from preexisting TIDs or by the rare megaprimer extension event, NlaIV digestion would have no effect on cluster amplification.
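Locating such sites in a candidate region is easy to sketch. The example below assumes the NlaIV recognition sequence is GGNNCC with a blunt cut between the two Ns (worth checking against the supplier's data sheet); the function names and the test sequence are made up for illustration.

```python
import re

# Assumed NlaIV site: GGNNCC, cut in the middle (GGN^NCC).
# The lookahead lets overlapping sites be reported as well.
NLAIV = re.compile(r"(?=(GG[ACGT]{2}CC))")

def nlaiv_cuts(seq):
    """0-based index of the first base after each blunt cut."""
    return [m.start() + 3 for m in NLAIV.finditer(seq.upper())]

def digest(seq):
    """Fragments produced by cutting at every site found."""
    bounds = [0] + nlaiv_cuts(seq) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

# Invented sequence with two sites (GGATCC and GGCGCC).
region = "ATGGATCCTTAAGGCGCCAT"
cuts = nlaiv_cuts(region)
fragments = digest(region)
```

A template cut between the REP site and the primer can no longer give rise to a snap-back megaprimer product, while a preexisting TID joint lying outside the cut sites would still amplify.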

I did the experiment, and found that cutting the template DNA with NlaIV prevented amplification. I am forced to conclude that the beautiful clustering of REP-mediated TID joints found in our data is strictly man-made by megaprimer snap-back!

Models for PCR Detection of REP-Mediated Clusters

-- Eric Kofoid