Archive for the ‘Evolution’ Category

Lenski, Creationism & Rational Thought

Friday, November 19th, 2010

Richard Lenski published a pivotal paper in 2008 which rigorously demonstrated evolution in a laboratory setting [1]. Recently, a “controversy” over Lenski’s research backfired, severely discrediting his critics, a group of creationists and intelligent-design proponents whose reasoning is markedly unscientific. The debate has been mirrored here; the mirroring is necessary because the original site has been subjected to frequent revisionist censoring. The critical rebuttal was this reply by Lenski to Andrew Schlafly. An article in the New Scientist blog discusses the brouhaha. Another good discussion can be found at RationalWiki.

1. Z. D. Blount, C. Z. Borland & R. E. Lenski (2008) “Historical contingency and the evolution of a key innovation in an experimental population of Escherichia coli.” PNAS 105:7899–7906.

-- Eric Kofoid

Solexa “Diff” Clarified

Thursday, February 11th, 2010


Certain readers have had problems with my recent posting regarding Strange Structures, specifically with the meaning of Diff. I humbly present this brief clarification.

What is Diff?

We are interested in Solexa read pairs containing an inverted duplication. When the join is convergent, we call it “red”, and when divergent, “blue”. This colorful nomenclature was developed by our collaborator Yong Lu, and has served us well.  Diff is a statistic used when analyzing such linked pairs: it is the difference between the master addresses of the two reads defining the pair.  The master address refers to the location of a read in the master reference sequence.

Consider the following diagram of a convergent join contained within a red read pair:

Typical Convergent Join, Near Middle of Read-Pair

In this case, Diff equals the position of “d” minus the position of “u” in the reference sequence. For instance, let “d” be the first nucleotide of lacIZ in F’128 (FC40) at position 147161, and “u” , the first nucleotide of prpE at 136636. In this example,

Diff = 147161 – 136636 = 10525 (represented by the green bar).

Fine! Why do we need it?

There are two reasons.

First, we like to imagine that a snap-back or strand-switch followed by remodeling gave rise to these inverted junctions, in which case Diff is an estimate of the lower limit of DNA involved in the initiating palindrome, which certainly had to have started somewhere to the right of “f” and ended somewhere to the left of “u”. This means it couldn’t have been any smaller than the distance from “u” to “f”, which is only a few nucleotides larger than the distance from “u” to “d”. We prefer defining Diff in terms of “d” rather than “f”, as read lengths are variable (from 25 to 36 bp) but we always know where the first nucleotide is.

Second, Diff is useful for sorting through data from a strain containing a high frequency join. Let’s say that nearly every cell in the strain analyzed has the join indicated above. Then, we will find in our data a large number of “red” read pairs which contain the junction.

Basic analysis.

We can make a spreadsheet in which each line corresponds to one of these sequences, with the first cell occupied by the address of the smaller read, the second by the address of the larger read, and the third by Diff, calculated by subtracting the first cell from the second. Why “smaller” and “larger”, instead of “left” and “right”? Because we don’t know anything about the true history of these DNA molecules. The model shown above is just one possibility. If you replace all the primed letters with unprimed ones and vice versa, you will get an equally probable model. So, when tabulating Solexa paired reads, we take the smaller as “left” by convention.
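As a concrete sketch, the tabulation might look like this in Python. The address pairs are invented for illustration (only the first echoes the lacIZ/prpE example above), not actual Solexa data:

```python
# Build the three-column table described above. Each input tuple is one
# red read pair; the addresses are made-up examples, not real data.
raw_pairs = [(147161, 136636), (139876, 140120), (147100, 136700)]

table = []
for a, b in raw_pairs:
    # By convention, the read with the smaller address is "left".
    left, right = min(a, b), max(a, b)
    table.append((left, right, right - left))  # third cell is Diff

# Sort the rows on the first cell (the left address).
table.sort(key=lambda row: row[0])
for row in table:
    print(row)  # (left, right, Diff)
```

The first printed row, (136636, 147161, 10525), reproduces the Diff worked out for the lacIZ/prpE pair above.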


We sort the lines according to the value of the first cell. This means that the first line will correspond to the piece of sequenced DNA with the earliest left read in the reference sequence. We can diagram its relationship to the reference sequence,


Likewise, the last row will derive from DNA with the latest left read,


As you can see, the value of Diff (green bars) increases as the joint in the DNA is found further to the left. It is not difficult to show for a set of read pairs containing the same inverted join that plotting Diff on the Y axis against the left address on X yields a line with slope -2. See Strange Structures for more detail.
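The slope of -2 can be checked with a toy simulation. The coordinates below are invented, standing in for a single convergent join sampled by randomly sheared fragments:

```python
# Hypothetical leftmost read pair for one convergent join.
x0, R0 = 1000, 1300

# Slide the sheared window rightward in steps; each position yields one
# read pair. Diff drops by 2 for every 1 bp the left address advances.
points = []
for dx in range(0, 120, 7):
    left = x0 + dx
    diff = (R0 - x0) - 2 * dx
    points.append((left, diff))

# Least-squares slope, computed by hand to keep the sketch dependency-free.
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
slope = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
print(round(slope, 6))  # -> -2.0
```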


A dramatic example of this analysis is the read pairs crossing the convergent join point of array EK568 in strain TT25790, shown here,


We can use Diff to hunt for other, less common join points subject to two conditions:

  1. There must be several read pairs containing the joint with sufficient spread in the Diff values to allow a significant least-squares straight line to be drawn through the data.
  2. It must be possible to cluster these read pairs by appropriate sorting to exclude contamination by the mass of data deriving from other transient structures. I have found several sorting schemes that allow me to screen the data with a moving window.
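One simple version of such a moving-window screen, sketched in Python; the left addresses and window parameters are invented for the example:

```python
# Sorted left addresses of anomalous read pairs (invented for illustration).
lefts = [120, 5000, 10030, 10041, 10055, 10066, 20000, 31000]

window = 4      # read pairs per window
max_span = 100  # flag windows far tighter than random spacing would allow

clusters = []
for i in range(len(lefts) - window + 1):
    span = lefts[i + window - 1] - lefts[i]
    if span <= max_span:
        clusters.append(lefts[i:i + window])

print(clusters)  # -> [[10030, 10041, 10055, 10066]]
```

Real data would also need the second condition above (checking the right-read addresses and the Diff spread within each flagged window) before a cluster is accepted.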

This process has yielded a number of transient or low-level inverted join points in every strain I’ve examined.

-- Eric Kofoid

Finding Strange Structures in Solexa

Wednesday, August 5th, 2009

The point is…

‘Strange structures’ are DNA features not characteristic of the dominant type in a population of molecules, but which nevertheless occur at some finite and presumably observable level. Because Solexa looks at all entities in the population, whether or not they are stably replicated or fixed, we should find evidence of strange structures by appropriate analysis of the data. As I mentioned in a previous posting, ‘the anomalous reads’ are the interesting stuff. I will show here how I process them to pull out features of interest.

Some Definitions and a Brief Intro

Solexa determines the sequences at the ends of actual DNA chunks prepared by chopping and sizing. We refer to these tracts as “reads”. Two reads bounding a chunk are called a “read-pair”. Our project identifies 35 nucleotide reads tethered to form read-pairs of around 300 bp. Most map in a predictable fashion to the genome of Salmonella enterica typhimurium, strain LT2, the so-called “reference sequence”. A typical Solexa project results in 70-fold coverage of the genome, including any stably inherited episomes such as F plasmids. “Anomalous” read-pairs are, by definition, nonconformant and represent only a small fraction of total reads. They comprise strain-specific inherited modifications which differ intrinsically from the reference wild type, such as selected amplifications of lacZ, newly arisen low-frequency rearrangements not yet fixed in the main population (“mutant structures”), or transient alterations incompatible with stable inheritance (“ephemeral structures”). They may also derive from PCR or sample preparation artifacts, although we expect these to be very infrequent.

Anomalous Databases

Yong Lu has generated four databases of anomalous read-pairs compatible with certain rearrangements of interest: “red” and “blue” pairs which span, respectively, convergent and divergent joins of inverted duplications; “green” pairs, crossing tandem duplication joins; and, “black” pairs which enclose deletions. For the rest of this discussion, I will concentrate on my analysis of the red and blue data for the F’128(FC40) plasmid of strain TT25790. This plasmid contains the selected lac amplification EK568.

Red and Blue

The red data set contains 5247 read-pairs, of which 625 cross the join point of the amplified array. The window within which we can locate such joint-spanning pairs comprises about 2(350 – 35) = 630 bp. Each read pair contributes two reads to this region, so the read density is (2×625)/630 = 2.0, close to the read density for the chromosome (15716836/4857432 = 3.2). The point is, even though they are anomalous compared to the reference sequence, red read-pairs which cross the EK568 array edge are typical of strain TT25790, as the rearrangement is present in nearly all cells.

What about the remaining red anomalous pairs? There are 5247 – 625 = 4622 of them for a plasmid of size 231427 bp. As expected, the coverage for these is much smaller, (2×4622)/231427 = 0.04, as these reads are truly anomalous, in the sense of being unexpected even after taking the EK568 plasmid rearrangement into account.

Similarly, there are 638 read-pairs in the blue database, of which only 255 cross the amplification boundary. This predicts a blue coverage of (2×255)/630 = 0.8-fold, less than half of what we expect. I am not sure what causes this discrepancy. Yong is looking into it, in case there is a technical glitch of some type. Nevertheless, this is still far greater than the truly anomalous background and close to the baseline chromosomal read density.

Data Manipulation

Typically, I sort the data according to the address of the first (or “left”) read, where “address” is the position where the first nucleotide of the block would fall in the reference sequence. How far apart should we expect two consecutive left red reads in F’128(FC40) to be after sorting? Around 231427 / 4622 = 50, assuming that all the reads left after subtracting out the EK568 joints are independent.

However, when we scan the sorted data, we discover clusters of read-pairs whose consecutive occurrences are much closer than expected. Such bunching is what we would expect if DNA abnormalities form dynamically at a rate high enough that several instances can be trapped and processed during a normal Solexa run. The entities we expect to exhibit such clustering behavior in the red and blue data are those mentioned earlier in this article, namely unfixed mutant or ephemeral structures such as inversions, amplified TIDs, or products of snap-back or strand-switched replication.

Data Analysis

Imagine a pot of DNA, sheared and sized so that every molecule was about 350 bp long with two broken blunt ends. If such an entity contained a convergent join point, it would be eligible for membership in the red dataset and would be identified by two 35 bp sequences belonging to a single read-pair. Each of the two reads would be characterized by an address corresponding to the position of that nucleotide tract in the reference sequence. When data is entered into the red database, the read with the smaller reference address is placed in a column to the left of the second read. We refer to these, respectively, as the “left” and “right” reads. I find it useful to define the parameter “diff” for each read-pair, equal to the difference between the addresses of the two reads:

diff = address(right) – address(left)

Now, consider the case in which the same join point occurs sufficiently frequently that there are several instances of it in the red set. Because random breakage created the molecules whose end sequences form the data, we expect the position of the junction in any given read-pair to be randomly positioned between the two ends which include it. After sorting the database on the left reads, we can locate the cluster containing this joint, as the rows defining the cluster should have left and right read addresses each much closer to other members of its column than would be expected for reads in the same number of contiguous rows chosen at random from the database.

The Red Set…

Given these conditions and the absence of random polluting read-pairs (similar left addresses but unrelated right addresses), we can derive a useful relationship between diff and the address, x, of the left read:


x0 and R0 are the addresses of the reads in the leftmost possible read-pair. It is clear that

diff = R0 – x0 – 2Δx

However, Δx is just equal to x – x0, which, on substitution, gives

diff = R0 + x0 – 2x

This allows us to analyze clusters graphically by plotting the diff value of each member against its left address. If the cluster derives from doubly sheared molecules sharing the same convergent join point, then a straight line with slope of -2 will fit the data.

This is exactly what is observed in array EK568 (TT25790), where the red points are the corresponding read-pairs crossing the join point:


The least-squares best fit through the red data is

diff = y = -1.939x + 306535,

which yields an x intercept at y=0

xi = 306535/1.939 = 158089.

From the diff vs x relationship, we know that xi = (R0+x0)/2. We can calculate R0 from the known join point coordinates, [158087,158295], as

R0 = 158295-35 = 158260.

This allows us to estimate

x0 = 2(158089)-158260 = 157918.

If we examine the table of sorted Solexa data, we find that the earliest read-pair has left and right addresses, respectively, of 157884 and 158255, differing from prediction by 34 and 5, indicating that it is nearly identical to the leftmost possible read-pair, which one might expect, given the extensive coverage over this particular join point.
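This arithmetic can be retraced directly from the fit coefficients; every number below is quoted in the text:

```python
# Fit through the red cluster: diff = -1.939*x + 306535.
slope, intercept = -1.939, 306535
read_len = 35

xi = intercept / -slope   # x intercept of the fitted line
R0 = 158295 - read_len    # from the known join coordinates [158087, 158295]
x0 = 2 * round(xi) - R0   # from xi = (R0 + x0)/2

print(round(xi), R0, x0)  # -> 158089 158260 157918

# Offsets of the earliest observed read pair (157884, 158255) from the
# predicted leftmost possible pair:
print(x0 - 157884, R0 - 158255)  # -> 34 5
```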

Of course, we also observe some very intriguing light and dark green data, lying on lines of slopes equal to -1! These are fascinating in their own right, and will be discussed in a later posting. Suffice it to say that they probably derive from singly sheared entities, and have an intimate albeit fleeting relationship to the red joint.

The Blue Set…

The same analysis can be done for the blue data set, which contains the read-pairs which cross the major divergent join point for array EK568:


In this case, “leftmost” has a subtly altered meaning. Because these are divergent joints, the value of x increases from right to left. This means that in the sorted data, the leftmost reads will be found at the bottom of the data set, opposite that for the red set.

As can be seen from the figure,

Δx = x0-x,

because of the directionality of x just discussed. Therefore, substitution into

diff = R0 – x0 + 2Δx

yields the same relationship found in the red analysis,

diff = R0 + x0 – 2x.

As we saw above for red data, most blue data plot as a straight line with slope ≈-2, as seen in this graph of the second join point from the same array, EK568:


Because of the divergent nature of the joint, features such as the “joint edge” are flipped with respect to those on the red graph. Using the same methods as before (but estimating R0 as the join point coordinate itself, 132098), we can derive x0 as 134094. These values differ from the earliest read pair by 124 and -21, respectively. Although the former value is a little larger than I’d like, it’s still not bad considering that the blue data show greater scatter than the red.

The blue “-1” outliers are more suspect than those in the red data. The Bayesian filter used was my two eyes! Nevertheless, they do define a line of slope -1, and many of the blue dots may, in fact, belong to this hypothetical cluster.

Average Read-Pair Length…

As can be seen in the figures above, a lower bound for L can be estimated by defining the first diff in the sorted data as

diff_first = R0 – x0

and the last diff as

diff_last = R0 – x0 – 2(L – 70) = diff_first + 140 – 2L [red data]
diff_last = R0 – x0 + 2(L – 70) = diff_first – 140 + 2L [blue data]

which rearrange to

L = (140 + diff_first – diff_last)/2 [red data]
L = (140 + diff_last – diff_first)/2 [blue data]

Applying these equations and using the first and last read-pairs to approximate diff_first and diff_last gives L ≈ 243 and 208 for the red and blue data, respectively. These are shorter than the 300 bp originally estimated, but these calculations yield only lower bounds. The first or last rows of data may be significantly displaced from the theoretical boundary read-pair.
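The two rearranged formulas translate directly to code. The diff values below are hypothetical, chosen only to show that plausible inputs reproduce estimates of the magnitude quoted above (the actual table values are not reproduced here):

```python
# Lower-bound estimates of the mean read-pair length L. The constant 140
# is twice the combined read length of a pair (2 * 70 bp).
def L_red(diff_first, diff_last):
    return (140 + diff_first - diff_last) / 2

def L_blue(diff_first, diff_last):
    return (140 + diff_last - diff_first) / 2

# Hypothetical first/last diffs, not the actual sorted-table values:
print(L_red(371, 25))    # -> 243.0
print(L_blue(100, 376))  # -> 208.0
```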


Anyone who has read this far will want to know what other strange structures are found aside from the join points already discussed, which aren’t really that strange as they are found in almost every plasmid. My next post discusses this and the interesting phenomenon of ‘singly sheared’ molecules.

-- Eric Kofoid

Amplification Levels & Copy Number from Solexa

Wednesday, July 15th, 2009

We can calculate levels of amplification as well as plasmid ploidy in a straightforward fashion from Solexa data. Consider our analysis of TT25790, which contains Elisabeth’s array EK568, fully characterized at both join points.

When we look at the “read density” map for reads crossing plasmid F’128(FC40) in this strain, we see that the frequency of reads increases abruptly at reference position ~132098 and returns to baseline abruptly at ~158297. If we go to the raw read data, we find 504973 reads in this interval of 26200 bp, for an average read density of 504973/26200 = 19.3 reads/nucleotide.

We can calculate in a similar manner the read density for the remaining, unamplified region of the plasmid, a circle of size 231427 bp. The total reads for the entire plasmid are 1569700, and for the unamplified region, 1569700-504973 = 1064727. Thus, the average unamplified read density will be 1064727/(231427-26200) = 5.2 reads/nucleotide.

Thus, the EK568 array is amplified with respect to the remainder of the plasmid by a factor of 19.3/5.2 = 3.7. This seems unusually low, given the fact that the samples were grown in minimal lactose medium. Even though the strain is rec, we expect this number to differ from preparation to preparation, as rec-independent recombination by mechanisms such as annealing, snap-back extension, and strand switching appears to be fairly frequent in F plasmid derivatives, perhaps the consequence of continuous rolling circle generation of long single-stranded DNA ends.

Returning to the raw data, we can ask what the read density across the 4068 bp lacIZ fusion gene itself is. We find 88003 reads across the gene, for a density of 88003/4068 = 21.6, which yields an amplification level of 21.6/5.2 = 4.2, still lower than expected.

Now, let’s look at the copy number of the plasmid itself. The chromosome contains 4857432 bp, across which we gathered 15716836 reads for an average density of 15716836/4857432 = 3.2 reads/nucleotide. We know that the unamplified region of the plasmid has a density of 5.2. Therefore, the ploidy of this plasmid with respect to the chromosome is 5.2/3.2 = 1.6, a trifle smaller than our working estimate of 2. Bear in mind, though, that this sample came from an overnight culture in stationary state. Under these conditions, we expect the copy number of F to be at its lowest.

We can use the read densities of the lacIZ gene and the chromosome to determine the copy number of the fusion per chromosome, equal to 21.6/3.2 = 6.8. If the activity of the mutant gene is 2% of wild-type and we assume strict additivity of gene expression, we calculate about 2×6.8 = 13.6% final wild-type activity.

This may be enough to allow significant growth, but why wasn’t a greater growth rate selected by simple continued amplification? The answer may simply be that rec-independent amplification is slow compared to the relatively brief time to grow to stationary state with an already appreciable amount of lac activity.

Finally, why is the amplification level of lacIZ greater than the average amplification level of the array itself? The answer lies in the word “average”. Remember that this is a TID array in which the elements can be of different sizes. If many of the elements containing lacIZ are smaller than average, the actual level of the gene would be correspondingly greater, as observed.
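All of the ratios in this post follow from the raw counts already quoted; a short script retracing them (the results are only as precise as the inputs):

```python
# Raw counts and lengths quoted in the text.
plasmid_len, amp_len, gene_len, chrom_len = 231427, 26200, 4068, 4857432
reads_amp, reads_plasmid = 504973, 1569700
reads_gene, reads_chrom = 88003, 15716836

d_amp = reads_amp / amp_len                                      # ~19.3
d_base = (reads_plasmid - reads_amp) / (plasmid_len - amp_len)   # ~5.2
d_gene = reads_gene / gene_len                                   # ~21.6
d_chrom = reads_chrom / chrom_len                                # ~3.2

print(round(d_amp / d_base, 1))    # array amplification -> 3.7
print(round(d_base / d_chrom, 1))  # plasmid ploidy -> 1.6
# At full precision this is ~6.7; the quoted 6.8 comes from the rounded
# intermediate densities.
print(round(d_gene / d_chrom, 1))  # lacIZ copies per chromosome
```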

Many thanks to Yong Lu for helping me collate this data!

-- Eric Kofoid

Why is natural selection hard to beat and when do you need to beat it?

Monday, March 9th, 2009

[This is a stub entry I'm making for John under his name. He should re-edit it with his own words. -- Eric]

Here’s a brief review I just wrote with Dan.

Why is natural selection hard to beat and when do you need to beat it?
John R. Roth and Dan I. Andersson

Bacterial genetics defeats natural selection — it uses positive selection to detect large-phenotype mutants without influencing their frequency.  Metazoans maintain organism integrity by defeating natural selection on somatic cell growth.  Bacterial genetics relies on selection strong enough to prevent growth of both the parent and common slightly-improved mutants.  When selective stringency is reduced, frequent small-effect mutations allow growth and initiate a cascade of successive improvements.  This rapid response rests on the unexpectedly high formation rate of small-effect mutations (particularly duplications and amplifications). Duplications form at a rate 10^4 times that of null mutations.  The high frequency of small-effect mutations reflects features of replication, repair and coding that minimize the costs of mutation.
The striking effect of small-effect mutations is seen in a system designed by John Cairns to test the effect of growth limitation on mutation rate.  A leaky E. coli mutant (lac) is plated on lactose medium.  Revertant (Lac+) colonies appear over 6 days above a lawn of (10^8) non-growing parent cells. These colonies have been attributed to stress-induced mutagenesis of the non-growing parent. This conclusion ignores natural selection, assuming that only large-effect mutants appear – as is true for lab genetic selections.  However, selection is not stringent in the Cairns system — small increases in lac enzymes allow growth.  Common cells with a lac duplication (and 2x the mutant enzyme level) initiate slow-growing colonies, in which selection drives a multi-step adaptation process – higher amplification, reversion to lac+ and loss of mutant lac alleles.  The high yield of revertant colonies under selection does not reflect mutagenesis, but rather the high spontaneous rate of gene duplication (10^-5), amplification (10^-2/step) and the selective addition of mutation targets (more cells with more mutant lac copies/cell).
Metazoan somatic cells may escape natural selection by the same mechanism.  Metazoans reduce the basal level of unexpressed genes 1000-fold (compared to bacteria) by their epi-genetic modification of DNA and histones – making it impossible for small-effect mutations to provide growth.

-- John Roth

The origin of mutants under selection: Interactions of mutation, growth and selection

Monday, March 9th, 2009

[This is a stub entry I'm making for John under his name. He should re-edit it with his own words. -- Eric]

Here’s the abstract to a new article.

The origin of mutants under selection: Interactions of mutation, growth and selection

Dan I Andersson, Diarmaid Hughes and John R Roth

In microbial genetics, positive selection detects rare cells with an altered growth phenotype (mutants or recombinants).  The frequency of mutants signals the rate of mutant formation – an increased frequency suggests a higher mutation rate.  Increases in mutant frequency are never attributed to growth under selection.  The converse is true in natural populations, where changes in phenotype frequency reflect selection, genetic drift or founder effects, but never changes in mutation rate.  The apparent conflict is resolved because restrictive rules allow laboratory selection to detect mutants without influencing their frequency.  With these rules, mutant frequency can reliably reflect mutation rates. When the rules are not followed, selection rather than mutation rate dictates mutant frequency – as in natural populations.  In several laboratory genetic systems, non-growing stressed populations show an increase in mutant frequency that has been attributed to stress-induced mutagenesis (adaptive mutation).  Since the mutant frequency is used to infer mutation rate (standard lab practice), the rules must be obeyed.  A breakdown of the rules in these systems may have allowed selection to cause frequency increases that were attributed to mutagenesis.  These systems have sparked interest in interactions between mutation and selection. This has led to a better understanding of how mutants arise, and how very frequent, small-effect mutations, such as duplications and amplifications, can contribute to mutant appearance by increasing gene dosage and mutational target size.

-- John Roth