Possible REP Snap-Back in TT24815 Chromosome

March 25th, 2010

A read-depth analysis of the TT24815 chromosome is noisy though relatively flat. It has a few interesting spikes, including one near 336000, about 3x higher than the average read depth elsewhere.

tt24815_readdepth

The nearest rrn locus is rrnH, about 41 kb upstream of the spike, so I doubt that rrn’s are involved in the amplification.

I scanned the immediate vicinity and found a REP sequence at position 336899. The feature lies in the intercistronic space between the large (4 kb) rhsD gene and rhsE.

(In the following figures, add 336898 to “stem-loop” addresses and 300000 to plotfold addresses to get correct chromosomal positions.)

stem_loop

I looked for clusters of anomalous read-pairs in the vicinity. There are 23 read-pairs containing convergent joins in the rhsD gene (STM0291), 3 containing divergent joins a little upstream of these, 3 containing deletions in rhsD, and 8 containing tandem duplication join points, also in rhsD.

rep_snap_back

These read pairs are isolated members of the collection of sheared DNA. No single cell carries all of them and most contain none. Many may be uninheritable cast-offs discarded by error correction machinery.

Nevertheless, it appears that the REP could be nucleating snap-backs, which then get fixed in a variety of ways. Some of these may be full-fledged TIDs, some may be remodeled into tandem duplication arrays. My guess is that enough cells are carrying various forms of amplifications arising from this REP that they account for the spike.

-- Eric Kofoid

Illumina Data Clustering

March 25th, 2010

Introduction

Illumina data provides a wealth of information about DNA rearrangements which typify the genotype of a strain under study. Transient or low-level rearrangements can also be identified and have been discussed in the article, Finding Strange Structures in Solexa. In this article, I  confine myself to alterations consistent with red and blue read pairs.

Groups of related DNA structures are found, defined by reads which are physically close to each other and define a single type of rearrangement, but which are not so frequent as to have risen to fixation in the genotype. I would like to suggest that members of such groups, or “clusters”, may have originated by a mechanism common to the ensemble as a whole, and that these mechanisms may be of evolutionary importance.

Some Definitions

The following phrases are from the Roth Encyclopedia. Clicking on them will bring up the definition in a separate window.

Anomalous pairs

Anomalous types

Clusters

Convergent joint

Diff

Divergent joint

Mated pair

Paired-end (PE) reads

Read

Read pair

Reference sequence

Examples

Here are examples of blue and red clusters in the F’128(FC40) molecule of LT2 strain TT24815:

blue&red_clusters

Here are data from the central group of the previous illustrations remapped as diff graphs:

blue&red_diffs

From the first set of graphs, I can see that both ends of the DNA fragments have been randomly sheared, although the right ends are less variable than the left. This combined with the slopes in the diff graphs allow me to infer that both clusters are probably characterized by several different join points.

Implications

Clustering in Illumina data indicates that certain types of rearrangements occur more often in the subject genome than statistically expected. They may be counterselected but have a high enough formation rate that they are in steady state with the dominant genotype, or are selectively neutral and undergo a drunkard’s walk in frequency with time. Clearly, they are not under positive selection, as they are present at levels much smaller than one per genome equivalent of  DNA.

What is the nature of the rearrangements? They could represent the joints of inverted segments. We tend to disregard this possibility, as inversions require a concerted ballet of simultaneous low-probability events. They could represent transient “one-off” uninheritable creations, such as snap-back extensions which simply die unproductively after forming at a high rate owing to specific features of the DNA neighborhood. Again, we suspect this is not the case, as the structures would be lethal — selection would likely modulate the nucleating features over time to decrease such suicidal tendencies.

We favor the notion that these clusters represent the joints of unselected aTIDs, which we think form relatively often, are only moderately upsetting to the cell, and — if truly deleterious — are easily collapsed. If this idea is correct, then read-pair clusters are the very stuff of evolution, as they represent preexisting sites at which rapid amplification could occur should selection ever come into play.

-- Eric Kofoid

Illumina/Solexa

March 25th, 2010

“Solexa” has been our word of choice to describe the rapid throughput sequencing strategy generating pairs of linked reads. You all know this, and the methodology has been described exhaustively in this blog.

However, “Solexa” was actually the name of a company which was acquired by Illumina, Inc. on 11/1/2006. References to this technology in both the literature and on-line have gradually been migrating to “Illumina” as the approved adjective.

After consultation with our collaborators, we have made an executive decision to switch to the current term du jour.

How sad — I really liked “Solexa”!

-- Eric Kofoid

Solexa “Diff” Clarified

February 11th, 2010


Apologia

Certain readers have had problems with my recent posting regarding Strange Structures, specifically with the meaning of Diff. I humbly present this brief clarification.

What is Diff?

We are interested in Solexa read pairs containing an inverted duplication. When the join is convergent, we call it “red”, and when divergent, “blue”. This colorful nomenclature was developed by our collaborator Yong Lu, and has served us well.  Diff is a statistic used when analyzing such linked pairs, and is the difference in master addresses of the two reads defining the pair.  Master address referes to the location of a read in the master reference sequence.

Consider the following diagram of a convergent join contained within a red read pair:

Typical Convergent Join, Near Middle of Read-Pair

In this case, Diff equals the position of “d” minus the position of “u” in the reference sequence. For instance, let “d” be the first nucleotide of lacIZ in F’128 (FC40) at position 147161, and “u” , the first nucleotide of prpE at 136636. In this example,

Diff = 147161 – 136636 = 10525 (represented by the green bar).

Fine! Why do we need it?

There are two reasons.

First, we like to imagine that a snap-back or strand-switch followed by remodeling gave rise to these inverted junctions, in which case Diff is an estimate of the lower limit of DNA involved in the initiating palindrome, which certainly had to have started somewhere to the right of “f” and ended somewhere to the left of “u”. This means it couldn’t have been any smaller than the distance from “u” to “f”, which is only a few nucleotides larger than the distance from “u” to “d”. We prefer defining Diff in terms of “d” rather than “f”, as read lengths are variable (from 25 to 36 bp) but we always know where the first nucleotide is.

Second, Diff is useful for sorting through data from a strain containing a high frequency join. Let’s say that nearly every cell in the strain analyzed has the join indicated above. Then, we will find in our data a large number of “red” read pairs which contain the junction.

Basic analysis.

We can make a spread sheet in which each line corresponds to one of these sequences, with the first cell occupied by the address of the smaller read, the second by the address of the larger read, and the third, Diff, calculated by subtracting the first cell from the second. Why “smaller” and “larger”, instead of “left” and “right”? Because we don’t know anything about the true history of these DNA molecules. The model shown above is just one possibility. If you replace all the primed letters with unprimed and vice versa, you will get an equally probable model. So, when tabulating Solexa paired reads, we take the smaller as “left” by convention.

diff_spreadsheet

We sort the lines according to the value of the first cell. This means that the first line will correspond to that piece of sequenced DNA with the earliest left read in the reference sequence. We can diagram its realtionship to the reference sequence,

right_join

Likewise, the last row will derive from DNA with the latest left read,

left_join

As you can see, the value of Diff (green bars) increases as the joint in the DNA is found further to the left. It is not difficult to show for a set of read pairs containing the same inverted join that plotting Diff on the Y axis against the left address on X yields a line with slope -2. See Strange Structures for more detail.

Example.

A dramatic example of this analysis is the read pairs crossing the convergent join point of array EK568 in strain TT25790, shown here,

red_diffs

We can use Diff to hunt for other, less common join points subject to two conditions:

  1. There must be several read pairs containing the joint with sufficient spread in the Diff values to allow a significant least-squares straight line to be drawn through the data.
  2. It must be possible to cluster these read pairs by appropriate sorting to exclude contamination by the mass of data deriving from other transient structures. I have found several ways of sorting the data which allow me to screen the data with a moving window.

This process has yielded a number of transient or low-level inverted join points in every strain I’ve examined.

-- Eric Kofoid

Kofoid’s Notes for Analytical Genetics, 2009

October 21st, 2009

Here are my notes for a 15′ talk at Analytical Genetics, 2009 (Asilomar). The actual talk had to be pared to 5′ for mysterious reasons!

-- Eric Kofoid

Blog Renamed

September 25th, 2009

The Rothlab blog has been renamed “The RADIO” (Rothlab Aggregator of Data, Ideas & Opinions).

Certain members of the laboratory found the previous title unpalatable because of its whimisical, irreverant, sarcastic, mocking and extraordinarily unprofessional nature. So offended were these tender and gentle souls that they would not even use the site! We hope their sensibilities are now relieved.

As an added benefit, we can now use the catchy phrase, “Why don’t you put it on the RADIO?” This alone would seem to make the change entirely worthwhile.

Enjoy and, by all means, post often and engage each other, otherwise we learn nothing!

-- Eric Kofoid

Calculating Doubling Times

September 24th, 2009

Folks in the lab have recently been calculating doubling times of strains. I found doing the calculations one at a time tedious, so I put together a spreadsheet to automatically handle the data from the Biotek plate reader in batch.

I probably spent a totally unreasonable amount of time putting this together, but hopefully someone will find it useful. It is particularly optimized to work with the raw data as it is spit out into Excel by the KC4 program. It is set up to work with the scales of data that I get out (48 hour growth, final OD ~0.8), but you should be able to easily modify the settings to work better for your data.

To start, you simply copy the raw data from your Excel worksheet to the worksheet in the calculator helpfully named “Raw Data”.  There is already a data set pasted in to demonstrate some of the functions, so simply delete it before you add your own data.

There’s a lot going on in there, so if you need something and can’t find it, then track me down, because it is probably hidden somewhere. Alternatively, you can start unhiding rows and try to disentangle the works yourself…

————-

An additional note on calculating doubling times: I would assert that they should be calculated using corrected OD values (OD values where the initial un-inoculated reading is subtracted from all subsequent readings). Anyone want to challenge this?

Good luck!

Get DoublingTimeCalculator!

DTC

-- Doug

Doubling-Time Calculator

September 23rd, 2009

Folks in the lab have recently been calculating doubling times of strains.  I found doing the calculations one at a time tedious, so I put together a spreadsheet to automatically handle the data from the Biotek plate reader in batch.

I probably spent a totally unreasonable amount of time putting this together, but hopefully someone will find it useful.  It is particularly optimized to work with the raw data as it is spit out into Excel by the KC4 program.  It is optimized to work with the scales of data that I get out (48 hour growth, final OD ~0.8), but you should be able to easily modify the settings to work better for your data.

To start, you simply copy the raw data from your Excel worksheet to the worksheet in the calculator helpfully named “Raw Data”.

There’s a lot going on in there, so if you need something and can’t find it, then track me down, because it is probably hidden somewhere in there.  Alternatively, you can start unhiding rows and try to disentangle the works yourself…

Good luck!

Doug

Get Doubling Time Calculator
DTC

Finding Strange Structures in Solexa

August 5th, 2009

The point is…

‘Strange structures’ are DNA features not characteristic of the dominant type in a population of molecules, but which nevertheless occur at some finite and presumably observable level. Because Solexa looks at all entities in the population, whether or not they are stably replicated or fixed, we should find evidence of strange structures by appropriate analysis of the data. As I mentioned in a previous posting, ‘the anomalous reads’ are the interesting stuff. I will show here how I process them to pull out features of interest.

Some Definitions and a Brief Intro

Solexa determines the sequences at the ends of actual DNA chunks prepared by chopping and sizing. We refer to these tracts as “reads”. Two reads bounding a chunk are called a “read-pair”. Our project identifies 35 nucleotide reads tethered to form read-pairs of around 300 bp. Most map in a predictable fashion to the genome of Salmonella enterica typhimurium, strain LT2, the so-called “reference sequence”. A typical Solexa project results in 70-fold coverage of the genome, including any stably inherited episomes such as F plasmids. “Anomalous” read-pairs are, by definition, nonconformant and represent only a small fraction of total reads. They comprise strain-specific inherited modifications which differ intrinsically from the reference wild type, such as selected amplifications of lacZ, newly arisen low-frequency rearrangements not yet fixed in the main population (“mutant structures”), or transient alterations incompatible with stable inheritance (“ephemeral structures”). They may also derive from PCR or sample preparation artifacts, although we expect these to be very infrequent.

Anomalous Databases

Yong Lu has generated four databases of anomalous read-pairs compatible with certain rearrangements of interest: “red” and “blue” pairs which span, respectively, convergent and divergent joins of inverted duplications; “green” pairs, crossing tandem duplication joins; and, “black” pairs which enclose deletions. For the rest of this discussion, I will concentrate on my analysis of the red and blue data for the F’128(FC40) plasmid of strain TT25790. This plasmid contains the selected lac amplification EK568.

Red and Blue

The red data set contains 5247 read-pairs, of which 625 cross the join point of the amplified array. The window within which we can locate such joint spanning pairs would comprise about 2(350 – 35) = 630 bp. Twice the number of read pairs is the number of reads in this region. Therefore, the read density is (2×625)/630 = 2.0, close to the read density for the chromosome (15716836/4857432 = 3.2). The point is, even though they are anomalous compared to the reference sequence, red read-pairs which cross the EK568 array edge are typical of strain TT25790, as the rearrangement is present in nearly all cells.

What about the remaining red anomalous pairs? There are 5247 – 625 = 4622 of them for a plasmid of size 231427 bp. As expected, the coverage for these is much smaller, (2×4622) / 231427 = 0.004, as these reads are truly anomalous, in the sense of being unexpected even taking into account the EK568 plasmid rearrangement.

Similarly, there are 638 read-pairs in the blue database, of which only 255 cross the amplification boundary. This predicts a blue coverage of (2×255)/630 = 0.8-fold, less than half what we expect. I am not what causes this discrepancy. Yong is looking into it, in case there is a technical glitch of some type. Nevertheless, this is 25x greater than the truly anomalous background and still close to the base-line chromosomal read density.

Data Manipulation

Typically, I sort the data according to the address of the first (or “left”) read, where “address” is the position where the first nucleotide of the block would fall in the reference sequence. How far apart should we expect two consecutive left red reads in F’128(FC40) to be after sorting? Around 231427 / 4622 = 50, assuming that all the reads left after subtracting out the EK568 joints are independent.

However, when we scan the sorted data, we discover clusters of read-pairs whose consecutive occurrences are much closer than expectation. Such bunching is what we would expect if DNA abnormalities form dynamically at a rate high enough that several instances can be trapped and processed during a normal Solexa run. The entities we expect to exhibit such clustering behavior in the red and blue data are those mentioned earlier in this article, namely unfixed mutant or ephemeral structures such as inversions, amplified TIDs, or products of snap-back or strand-switched replication.

Data Analysis

Imagine a pot of DNA, sheared and sized so that every molecule and was about 350 bp long with two broken blunt ends. If such an entity contained a convergent join point, it would be eligible for membership in the red dataset and would be identified by two 35 bp sequences belonging to a single read-pair. Each of the two reads would be characterized by an address corresponding to the position of that nucleotide tract in the reference sequence. When data is entered into the red database, the read with the smaller reference address is placed in a column to the left of the second read. We refer to these, respectively, as the “left” and “right” reads. I find it useful to define the parameter “diff” for each read-pair, equal to the difference between the addresses of the two reads:

diff = address(right) – address(left)

Now, consider the case in which the same join point occurs sufficiently frequently that there are several instances of it in the red set. Because random breakage created the molecules whose end sequences form the data, we expect the position of the junction in any given read-pair to be randomly positioned between the two ends which include it. After sorting the database on the left reads, we can locate the cluster containing this joint, as the rows defining the cluster should have left and right read addresses each much closer to other members of its column than would be expected for reads in the same number of contiguous rows chosen at random from the database.

The Red Set…

Given these conditions and the absence of random polluting read-pairs (similar left addresses but unrelated right addresses), we can derive a useful relationship between diff and the address, x, of the left read:

Solexa_data_analysis_red

x0 and R0 are the addresses of the reads in the leftmost possible read-pair. It is clear that

diff = R0 – x0 – 2Δx

However, Δx is just equal to x – x0, which, on substitution, gives

diff = R0 + x0 – 2x

This allows us to analyze clusters graphically by plotting the diff value of each member against its left address. If the cluster derives from doubly sheared molecules sharing the same convergent join point, then a straight line with slope of -2 will fit the data.

This is exactly what is observed in array EK568 (TT25790), where the red points are the corresponding read-pairs crossing the join point:

ek586_convergent_joint

The least-squares best fit through the red data is

diff = y = -1.939x + 306535,

which yields an x intercept at y=0

xi = 306535/1.939 = 158089.

From the diff vs x relationship, we know that xi = (R0+x0)/2. We can calculate R0 from the known join point coordinates, [158087,158295], as

R0 = 158295-35 = 158260.

This allows us to estimate

x0 = 2(158089)-158260 = 157918.

If we examine the table of sorted Solexa data, we find that the earliest read-pair has left and right addresses, respectively, of 157884 and 158255, differing from prediction by 34 and 5, indicating that it is nearly identical to the leftmost possible read-pair, which one might expect, given the extensive coverage over this particular join point.

Of course, we also observe some very intriguing light and dark green data, lying on lines of slopes equal to -1! These are fascinating in their own right, and will be discussed in a later posting. Suffice it to say that they probably derive from singly sheared entities, and have an intimate albeit fleeting relationship to the red joint.

The Blue Set…

The same analysis can be done for the blue data set, which contains the read-pairs which cross the major divergent join point for array EK568:

Solexa_data_analysis_blue

In this case, “leftmost” has a subtly altered meaning. Because these are divergent joints, the value of x increases from right to left. This means that in the sorted data, the leftmost reads will be found at the bottom of the data set, opposite that for the red set.

As can be seen from the figure,

Δx = x0-x,

because of the directionality of x just discussed. Therefore, substitution into

diff = R0 – x0 + 2Δx

yields the same relationship found in the red analysis,

diff = R0 + x0 – 2x.

As we saw above for red data, most blue data plot as a straight line with slope ≈-2, as seen in this graph of the second join point from the same array, EK568:

ek586_divergent_joint

Because of the divergent nature of the joint, features such as the “joint edge” are flipped with respect those on the red graph. Using the same methods as before (but estimating R0 as the join point coordinate itself, 132098), we can derive x0 as 134094. These values differ from the earliest read pair by 124 and -21, respectively. Although the former value is a little larger than I’d like, it’s still not bad considering that the blue data has greater scatter than that of the red.

The blue “-1″ outliers are more suspect than those in the red data. The Bayesian filter used was my two eyes! Nevertheless, they do define a line of slope -1, and many of the blue dots may, in fact, belong to this hypothetical cluster.

Average Read-Pair Length…

As can be seen in the figures above, a lower bound for L can be estimated by defining the first diff in the sorted data as

difffirst = R0-x0

and the last diff as

difflast = R0-x0-2(L-70) = difffirst+140-2L [red data]
difflast = R0-x0+2(L-70) = difffirst-140+2L [blue data]

which rearrange to

L = (140 + difffirst-difflast)/2 [red data]
L = (140 + difflast-difffirst)/2 [blue data]

Applying these equations and using the first and last read-pairs to approximate difffirst and difflast gives L≈243 and 208 for the red and blue data, respectively. These are shorter than the 300 originally estimated, but these calculations yield only lower bounds. The first or last rows of data may be significantly displaced from the theoretical boundary read-pair.

Next…

Anyone who has read this far will want to know what other strange structures are found aside from the join points already discussed, which aren’t really that strange as they are found in almost every plasmid. My next post discusses this and the interesting phenomenon of ‘singly sheared’ molecules.

-- Eric Kofoid