Archive for the ‘Inverted Duplication’ Category

The Amazing History of DR397

Thursday, March 29th, 2012

Drew Reams studies unselected duplication formation, and recently found an eel in the DNA!

Recall that duplication formation is a fundamental small-effect driver of evolutionary adaptation, and TIDs may be the primum mobile of most or all duplications. Drew’s “recalcitrant” duplication strains — those with joints not easily determined by multiplex PCR — are a class which should include TIDs. To analyze them, he sequences their genomes by Illumina technology.

We think TID formation often begins with a snap-back of a 3′ end at a short inverted repeat, forming a stem-loop structure which then initiates DNA synthesis using self as template.

A subsequent template switch restores the fork and finishes the TID. See the excellent article in our Encyclopedia.

The extensive secondary structures at such symmetric TID joints are toxic and rarely observed. Instead, remodelling asymmetric deletions are selected spontaneously, yielding “SJ” (“short junction”) joints. David Leach has shown that sbcCD backgrounds tolerate such structures by not cutting them with the endonuclease product.

Drew wondered what would happen to duplication frequencies in such backgrounds, which may allow the cell to survive the two initial steps of TID formation and increase the yield of duplications. To enhance stem-loop persistance, he also made the cells recQ to prevent stem melting.

He observed a 2-fold increase in duplications in both the chromosome and wild type F’128. The latter contains two IS3 elements and nearly all spontaneous duplications happen by recombination between them. If at least one were removed, the duplication frequency increased by an order of magnitude compared to sbcCD+ recQ+ cells. Something definitely happens in the absence of sbcCD and recQ when IS recombination is blocked

One sequenced duplication was  in F’128 of strain DR397. It had a lacZ read depth 3-fold greater than the chromosome with remaining plasmid DNA about 9-fold greater. In addition, there was a large deletion extending from traI up to lacZ accounting for ~20% of the plasmid.

When we inspected the anomalous read-pair data, we discovered two symmetrical TID joins. We were able to confirm these by showing that reads right at the edge of the deletion window fully contained these joints.

Drew’s suspicions were vindicated. Removing activities which cut stem loop structures or prevent them from folding encourages secondary structure formation and reduces counterselection of the long symmetrical products of snap back and strand switching.

But, what’s the meaning of the enormous deletion? There are a couple of models which come to mind.

Model 1: A snapback at traI followed by another at lac, 5′ resection and ligation of ends will produce a product that matches our observations. It would look something like the following: 

Model 2: A snapback at traI and strand switch at lac will restore the replication fork and lead to a TID. Eventual remodelling by recombination of the flanking repeats will yield our observed product: 

Other related models are possible, with the order of the snapbacks and strand switches reversed, with double strand switches instead of snapbacks, etc. Nevertheless, we favor the second model above, as it only involves two steps to get the essential intermediate, while many generations can pass before selection of the final product.

An interesting aspect of the TID model is the inevitability of the remodeling event. When the origin is itself in the TID, counterselection on a large number of unnecessary genes leads easily to their deletion by recombination. The resulting fitness increase will lead to expansion in the population of the symmetric inversion and eventual extinction of the TID.

-- Eric Kofoid

Hastings on TIDs

Thursday, November 3rd, 2011

An article1 by P.J. Hastings and colleagues corroborates our observations that a substantial fraction of amplifications occur as TIDs.

1. Lin D, Gibson IB, Moore JM, Thornton PC, Leal SM, Hastings PJ. 2011. Global chromosomal structural instability in a subpopulation of starving Escherichia coli cells. PLoS Genet 7: e1002223

-- Eric Kofoid

The Sad Truth about Illumina Data Clustering

Tuesday, September 28th, 2010

This article is a continuation of Illumina Data Clustering, and is a perfect example of why we, as scientists, should resist the hubris of premature expectations.

The standard Illumina protocol for library preparation requires 18 cycles of PCR after adaptor ligation to enrich for fragments with doubly modified ends. Incomplete products from a previous round can snap-back during this step (“megaprimer snap-back”), creating artifactual templates which will then amplify along with the others. This is a first order process and should be fairly common when the 3′ of the elongating strand just happens to fall on a complementary REP site. A less common artifact could occur by a second order process involving megaprimer extension and reannealling in trans to a complementary REP site.

I found a group of closely spaced NlaIV restriction sites which would destroy megraprimer formation by the snap-back route when digested. If clusters arose from preexisting TIDs or by the rare megaprimer extension event, NlaIV digestion would have no effect on cluster amplification.

I did the experiment, and found that cutting the template DNA with NlaIV prevented amplification. I am forced to conclude that the beautiful clustering of REP-mediated TID joints found in our data is strictly man made by megaprimer snap-back!

Models for PCR Detection of REP-Mediated Clusters

-- Eric Kofoid

Possible REP Snap-Back in TT24815 Chromosome

Thursday, March 25th, 2010

A read-depth analysis of the TT24815 chromosome is noisy though relatively flat. It has a few interesting spikes, including one near 336000, about 3x higher than the average read depth elsewhere.


The nearest rrn locus is rrnH, about 41 kb upstream of the spike, so I doubt that rrn’s are involved in the amplification.

I scanned the immediate vicinity and found a REP sequence at position 336899. The feature lies in the intercistronic space between the large (4 kb) rhsD gene and rhsE.

(In the following figures, add 336898 to “stem-loop” addresses and 300000 to plotfold addresses to get correct chromosomal positions.)


I looked for clusters of anomalous read-pairs in the vicinity. There are 23 read-pairs containing convergent joins in the rhsD gene (STM0291), 3 containing divergent joins a little upstream of these, 3 containing deletions in rhsD, and 8 containing tandem duplication join points, also in rhsD.


These read pairs are isolated members of the collection of sheared DNA. No single cell carries all of them and most contain none. Many may be uninheritable cast-offs discarded by error correction machinery.

Nevertheless, it appears that the REP could be nucleating snap-backs, which then get fixed in a variety of ways. Some of these may be full-fledged TIDs, some may be remodeled into tandem duplication arrays. My guess is that enough cells are carrying various forms of amplifications arising from this REP that they account for the spike.

-- Eric Kofoid

Illumina Data Clustering

Thursday, March 25th, 2010


Illumina data provides a wealth of information about DNA rearrangements which typify the genotype of a strain under study. Transient or low-level rearrangements can also be identified and have been discussed in the article, Finding Strange Structures in Solexa. In this article, I  confine myself to alterations consistent with red and blue read pairs.

Groups of related DNA structures are found, defined by reads which are physically close to each other and define a single type of rearrangement, but which are not so frequent as to have risen to fixation in the genotype. I would like to suggest that members of such groups, or “clusters”, may have originated by a mechanism common to the ensemble as a whole, and that these mechanisms may be of evolutionary importance.

Some Definitions

The following phrases are from the Roth Encyclopedia. Clicking on them will bring up the definition in a separate window.

Anomalous pairs

Anomalous types


Convergent joint


Divergent joint

Mated pair

Paired-end (PE) reads


Read pair

Reference sequence


Here are examples of blue and red clusters in the F’128(FC40) molecule of LT2 strain TT24815:


Here are data from the central group of the previous illustrations remapped as diff graphs:


From the first set of graphs, I can see that both ends of the DNA fragments have been randomly sheared, although the right ends are less variable than the left. This combined with the slopes in the diff graphs allow me to infer that both clusters are probably characterized by several different join points.


Clustering in Illumina data indicates that certain types of rearrangements occur more often in the subject genome than statistically expected. They may be counterselected but have a high enough formation rate that they are in steady state with the dominant genotype, or are selectively neutral and undergo a drunkard’s walk in frequency with time. Clearly, they are not under positive selection, as they are present at levels much smaller than one per genome equivalent of  DNA.

What is the nature of the rearrangements? They could represent the joints of inverted segments. We tend to disregard this possibility, as inversions require a concerted ballet of simultaneous low-probability events. They could represent transient “one-off” uninheritable creations, such as snap-back extensions which simply die unproductively after forming at a high rate owing to specific features of the DNA neighborhood. Again, we suspect this is not the case, as the structures would be lethal — selection would likely modulate the nucleating features over time to decrease such suicidal tendencies.

We favor the notion that these clusters represent the joints of unselected aTIDs, which we think form relatively often, are only moderately upsetting to the cell, and — if truly deleterious — are easily collapsed. If this idea is correct, then read-pair clusters are the very stuff of evolution, as they represent preexisting sites at which rapid amplification could occur should selection ever come into play.

-- Eric Kofoid

Solexa “Diff” Clarified

Thursday, February 11th, 2010


Certain readers have had problems with my recent posting regarding Strange Structures, specifically with the meaning of Diff. I humbly present this brief clarification.

What is Diff?

We are interested in Solexa read pairs containing an inverted duplication. When the join is convergent, we call it “red”, and when divergent, “blue”. This colorful nomenclature was developed by our collaborator Yong Lu, and has served us well.  Diff is a statistic used when analyzing such linked pairs, and is the difference in master addresses of the two reads defining the pair.  Master address referes to the location of a read in the master reference sequence.

Consider the following diagram of a convergent join contained within a red read pair:

Typical Convergent Join, Near Middle of Read-Pair

In this case, Diff equals the position of “d” minus the position of “u” in the reference sequence. For instance, let “d” be the first nucleotide of lacIZ in F’128 (FC40) at position 147161, and “u” , the first nucleotide of prpE at 136636. In this example,

Diff = 147161 – 136636 = 10525 (represented by the green bar).

Fine! Why do we need it?

There are two reasons.

First, we like to imagine that a snap-back or strand-switch followed by remodeling gave rise to these inverted junctions, in which case Diff is an estimate of the lower limit of DNA involved in the initiating palindrome, which certainly had to have started somewhere to the right of “f” and ended somewhere to the left of “u”. This means it couldn’t have been any smaller than the distance from “u” to “f”, which is only a few nucleotides larger than the distance from “u” to “d”. We prefer defining Diff in terms of “d” rather than “f”, as read lengths are variable (from 25 to 36 bp) but we always know where the first nucleotide is.

Second, Diff is useful for sorting through data from a strain containing a high frequency join. Let’s say that nearly every cell in the strain analyzed has the join indicated above. Then, we will find in our data a large number of “red” read pairs which contain the junction.

Basic analysis.

We can make a spread sheet in which each line corresponds to one of these sequences, with the first cell occupied by the address of the smaller read, the second by the address of the larger read, and the third, Diff, calculated by subtracting the first cell from the second. Why “smaller” and “larger”, instead of “left” and “right”? Because we don’t know anything about the true history of these DNA molecules. The model shown above is just one possibility. If you replace all the primed letters with unprimed and vice versa, you will get an equally probable model. So, when tabulating Solexa paired reads, we take the smaller as “left” by convention.


We sort the lines according to the value of the first cell. This means that the first line will correspond to that piece of sequenced DNA with the earliest left read in the reference sequence. We can diagram its realtionship to the reference sequence,


Likewise, the last row will derive from DNA with the latest left read,


As you can see, the value of Diff (green bars) increases as the joint in the DNA is found further to the left. It is not difficult to show for a set of read pairs containing the same inverted join that plotting Diff on the Y axis against the left address on X yields a line with slope -2. See Strange Structures for more detail.


A dramatic example of this analysis is the read pairs crossing the convergent join point of array EK568 in strain TT25790, shown here,


We can use Diff to hunt for other, less common join points subject to two conditions:

  1. There must be several read pairs containing the joint with sufficient spread in the Diff values to allow a significant least-squares straight line to be drawn through the data.
  2. It must be possible to cluster these read pairs by appropriate sorting to exclude contamination by the mass of data deriving from other transient structures. I have found several ways of sorting the data which allow me to screen the data with a moving window.

This process has yielded a number of transient or low-level inverted join points in every strain I’ve examined.

-- Eric Kofoid

Kofoid’s Notes for Analytical Genetics, 2009

Wednesday, October 21st, 2009

Here are my notes for a 15′ talk at Analytical Genetics, 2009 (Asilomar). The actual talk had to be pared to 5′ for mysterious reasons!

-- Eric Kofoid

Finding Strange Structures in Solexa

Wednesday, August 5th, 2009

The point is…

‘Strange structures’ are DNA features not characteristic of the dominant type in a population of molecules, but which nevertheless occur at some finite and presumably observable level. Because Solexa looks at all entities in the population, whether or not they are stably replicated or fixed, we should find evidence of strange structures by appropriate analysis of the data. As I mentioned in a previous posting, ‘the anomalous reads’ are the interesting stuff. I will show here how I process them to pull out features of interest.

Some Definitions and a Brief Intro

Solexa determines the sequences at the ends of actual DNA chunks prepared by chopping and sizing. We refer to these tracts as “reads”. Two reads bounding a chunk are called a “read-pair”. Our project identifies 35 nucleotide reads tethered to form read-pairs of around 300 bp. Most map in a predictable fashion to the genome of Salmonella enterica typhimurium, strain LT2, the so-called “reference sequence”. A typical Solexa project results in 70-fold coverage of the genome, including any stably inherited episomes such as F plasmids. “Anomalous” read-pairs are, by definition, nonconformant and represent only a small fraction of total reads. They comprise strain-specific inherited modifications which differ intrinsically from the reference wild type, such as selected amplifications of lacZ, newly arisen low-frequency rearrangements not yet fixed in the main population (“mutant structures”), or transient alterations incompatible with stable inheritance (“ephemeral structures”). They may also derive from PCR or sample preparation artifacts, although we expect these to be very infrequent.

Anomalous Databases

Yong Lu has generated four databases of anomalous read-pairs compatible with certain rearrangements of interest: “red” and “blue” pairs which span, respectively, convergent and divergent joins of inverted duplications; “green” pairs, crossing tandem duplication joins; and, “black” pairs which enclose deletions. For the rest of this discussion, I will concentrate on my analysis of the red and blue data for the F’128(FC40) plasmid of strain TT25790. This plasmid contains the selected lac amplification EK568.

Red and Blue

The red data set contains 5247 read-pairs, of which 625 cross the join point of the amplified array. The window within which we can locate such joint spanning pairs would comprise about 2(350 – 35) = 630 bp. Twice the number of read pairs is the number of reads in this region. Therefore, the read density is (2×625)/630 = 2.0, close to the read density for the chromosome (15716836/4857432 = 3.2). The point is, even though they are anomalous compared to the reference sequence, red read-pairs which cross the EK568 array edge are typical of strain TT25790, as the rearrangement is present in nearly all cells.

What about the remaining red anomalous pairs? There are 5247 – 625 = 4622 of them for a plasmid of size 231427 bp. As expected, the coverage for these is much smaller, (2×4622) / 231427 = 0.004, as these reads are truly anomalous, in the sense of being unexpected even taking into account the EK568 plasmid rearrangement.

Similarly, there are 638 read-pairs in the blue database, of which only 255 cross the amplification boundary. This predicts a blue coverage of (2×255)/630 = 0.8-fold, less than half what we expect. I am not what causes this discrepancy. Yong is looking into it, in case there is a technical glitch of some type. Nevertheless, this is 25x greater than the truly anomalous background and still close to the base-line chromosomal read density.

Data Manipulation

Typically, I sort the data according to the address of the first (or “left”) read, where “address” is the position where the first nucleotide of the block would fall in the reference sequence. How far apart should we expect two consecutive left red reads in F’128(FC40) to be after sorting? Around 231427 / 4622 = 50, assuming that all the reads left after subtracting out the EK568 joints are independent.

However, when we scan the sorted data, we discover clusters of read-pairs whose consecutive occurrences are much closer than expectation. Such bunching is what we would expect if DNA abnormalities form dynamically at a rate high enough that several instances can be trapped and processed during a normal Solexa run. The entities we expect to exhibit such clustering behavior in the red and blue data are those mentioned earlier in this article, namely unfixed mutant or ephemeral structures such as inversions, amplified TIDs, or products of snap-back or strand-switched replication.

Data Analysis

Imagine a pot of DNA, sheared and sized so that every molecule and was about 350 bp long with two broken blunt ends. If such an entity contained a convergent join point, it would be eligible for membership in the red dataset and would be identified by two 35 bp sequences belonging to a single read-pair. Each of the two reads would be characterized by an address corresponding to the position of that nucleotide tract in the reference sequence. When data is entered into the red database, the read with the smaller reference address is placed in a column to the left of the second read. We refer to these, respectively, as the “left” and “right” reads. I find it useful to define the parameter “diff” for each read-pair, equal to the difference between the addresses of the two reads:

diff = address(right) – address(left)

Now, consider the case in which the same join point occurs sufficiently frequently that there are several instances of it in the red set. Because random breakage created the molecules whose end sequences form the data, we expect the position of the junction in any given read-pair to be randomly positioned between the two ends which include it. After sorting the database on the left reads, we can locate the cluster containing this joint, as the rows defining the cluster should have left and right read addresses each much closer to other members of its column than would be expected for reads in the same number of contiguous rows chosen at random from the database.

The Red Set…

Given these conditions and the absence of random polluting read-pairs (similar left addresses but unrelated right addresses), we can derive a useful relationship between diff and the address, x, of the left read:


x0 and R0 are the addresses of the reads in the leftmost possible read-pair. It is clear that

diff = R0 – x0 – 2Δx

However, Δx is just equal to x – x0, which, on substitution, gives

diff = R0 + x0 – 2x

This allows us to analyze clusters graphically by plotting the diff value of each member against its left address. If the cluster derives from doubly sheared molecules sharing the same convergent join point, then a straight line with slope of -2 will fit the data.

This is exactly what is observed in array EK568 (TT25790), where the red points are the corresponding read-pairs crossing the join point:


The least-squares best fit through the red data is

diff = y = -1.939x + 306535,

which yields an x intercept at y=0

xi = 306535/1.939 = 158089.

From the diff vs x relationship, we know that xi = (R0+x0)/2. We can calculate R0 from the known join point coordinates, [158087,158295], as

R0 = 158295-35 = 158260.

This allows us to estimate

x0 = 2(158089)-158260 = 157918.

If we examine the table of sorted Solexa data, we find that the earliest read-pair has left and right addresses, respectively, of 157884 and 158255, differing from prediction by 34 and 5, indicating that it is nearly identical to the leftmost possible read-pair, which one might expect, given the extensive coverage over this particular join point.

Of course, we also observe some very intriguing light and dark green data, lying on lines of slopes equal to -1! These are fascinating in their own right, and will be discussed in a later posting. Suffice it to say that they probably derive from singly sheared entities, and have an intimate albeit fleeting relationship to the red joint.

The Blue Set…

The same analysis can be done for the blue data set, which contains the read-pairs which cross the major divergent join point for array EK568:


In this case, “leftmost” has a subtly altered meaning. Because these are divergent joints, the value of x increases from right to left. This means that in the sorted data, the leftmost reads will be found at the bottom of the data set, opposite that for the red set.

As can be seen from the figure,

Δx = x0-x,

because of the directionality of x just discussed. Therefore, substitution into

diff = R0 – x0 + 2Δx

yields the same relationship found in the red analysis,

diff = R0 + x0 – 2x.

As we saw above for red data, most blue data plot as a straight line with slope ≈-2, as seen in this graph of the second join point from the same array, EK568:


Because of the divergent nature of the joint, features such as the “joint edge” are flipped with respect those on the red graph. Using the same methods as before (but estimating R0 as the join point coordinate itself, 132098), we can derive x0 as 134094. These values differ from the earliest read pair by 124 and -21, respectively. Although the former value is a little larger than I’d like, it’s still not bad considering that the blue data has greater scatter than that of the red.

The blue “-1″ outliers are more suspect than those in the red data. The Bayesian filter used was my two eyes! Nevertheless, they do define a line of slope -1, and many of the blue dots may, in fact, belong to this hypothetical cluster.

Average Read-Pair Length…

As can be seen in the figures above, a lower bound for L can be estimated by defining the first diff in the sorted data as

difffirst = R0-x0

and the last diff as

difflast = R0-x0-2(L-70) = difffirst+140-2L [red data]
difflast = R0-x0+2(L-70) = difffirst-140+2L [blue data]

which rearrange to

L = (140 + difffirst-difflast)/2 [red data]
L = (140 + difflast-difffirst)/2 [blue data]

Applying these equations and using the first and last read-pairs to approximate difffirst and difflast gives L≈243 and 208 for the red and blue data, respectively. These are shorter than the 300 originally estimated, but these calculations yield only lower bounds. The first or last rows of data may be significantly displaced from the theoretical boundary read-pair.


Anyone who has read this far will want to know what other strange structures are found aside from the join points already discussed, which aren’t really that strange as they are found in almost every plasmid. My next post discusses this and the interesting phenomenon of ‘singly sheared’ molecules.

-- Eric Kofoid

Amplification Levels & Copy Number from Solexa

Wednesday, July 15th, 2009

We can calculate levels of amplification as well as plasmid ploidy in a straightforward fashion from Solexa data. Consider our analysis of TT25790, which contains Elisabeth’s array EK568, fully characterized at both join points.

When we look at the “read density” map for reads crossing plasmid F’128(FC40) in this strain, we see that the frequency of reads increases abruptly at reference position ~132098 and returns to baseline abruptly at ~158297. If we go to the raw read data, we find 504973 reads in this interval of 26200 bp, for an average read density of 504973/26200 = 19.3 reads/nucleotide.

We can calculate in a similar manner the read density for the remaining, unamplified region of the plasmid, a circle of size 231427 bp. The total reads for the entire plasmid are 1569700, and for the unamplified region, 1569700-504973 = 1064727. Thus, the average unamplified read density will be 1064727/(231427-26200) = 5.2 reads/nucleotide.

Thus, the EK568 array is amplified with respect to the remainder of the plasmid by a factor of 19.3/5.2 = 3.7. This seems unusually low, given the fact that the samples were grown in minimal lactose medium. Even though the strain is rec, we expect this number to differ from preparation to preparation, as rec-independent recombination by mechanisms such as annealing, snap-back extension, and strand switching appears to be fairly frequent in F plasmid derivatives, perhaps the consequence of continuous rolling circle generation of long single-stranded DNA ends.

Returning to the raw data, we can ask what the read density across the 4068 bp lacIZ fusion gene itself is. We find 88003 reads across the gene, for a density of 88003/4068 = 21.6, which yields an amplification level of 21.6/5.2 = 4.2, still lower than expected.

Now, let’s look at the copy number of the plasmid itself. The chromosome contains 4857432 bp, across which we gathered 15716836 reads for an average density of 15716836/4857432 = 3.2 reads/nucleotide. We know that the unamplified region of the plasmid has a density of 5.2. Therefore, the ploidy of this plasmid with respect to the chromosome is 5.2/3.2 = 1.6, a trifle smaller than our working estimate of 2. Bear in mind, though, that this sample came from an overnight culture in stationary state. Under these conditions, we expect the copy number of F to be at its lowest.

We can use the read densities of the lacIZ gene and the chromosome to determine the copy number of the fusion per chromosome, equal to 21.5/3.2 = 6.8. If the activity of the mutant gene is 2% of wild-type and we assume strict additivity of gene expression, we calculate about 2×6.8 = 13.6 % final wild-type activity.

This may be enough to allow significant growth, but why wasn’t a greater growth rate selected by simple continued amplification? The answer may simply be that rec-independent amplification is slow compared to the relatively brief time to grow to stationary state with an already appreciable amount of lac activity.

Finally, why is the amplification level of lacIZ greater than the average amplification level of the array itself? The answer lies in the word “average”. Remember that this is a TID array in which the elements can be of different sizes. If many of the elements containing lacIZ are smaller than average, the actual level of the gene would be correspondingly greater, as observed.

Many thanks to Yong Lu for helping me collate this data!

-- Eric Kofoid