Solexa “Diff” Clarified

Thursday, February 11th, 2010


Certain readers have had problems with my recent posting regarding Strange Structures, specifically with the meaning of Diff. I humbly present this brief clarification.

What is Diff?

We are interested in Solexa read pairs containing an inverted duplication. When the join is convergent, we call it “red”, and when divergent, “blue”. This colorful nomenclature was developed by our collaborator Yong Lu, and has served us well. Diff is a statistic used when analyzing such linked pairs: it is the difference between the master addresses of the two reads defining the pair. The master address of a read is its location in the master reference sequence.

Consider the following diagram of a convergent join contained within a red read pair:

Typical Convergent Join, Near Middle of Read-Pair

In this case, Diff equals the position of “d” minus the position of “u” in the reference sequence. For instance, let “d” be the first nucleotide of lacIZ in F’128 (FC40) at position 147161, and “u” the first nucleotide of prpE at 136636. In this example,

Diff = 147161 – 136636 = 10525 (represented by the green bar).
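
The same arithmetic can be written as a minimal Python sketch. The function name `diff` and the variable names are illustrative choices, not part of any existing tool; the two addresses are the ones quoted above.

```python
def diff(d_address: int, u_address: int) -> int:
    """Difference in master addresses of the two reads defining a pair."""
    return d_address - u_address


# Addresses taken from the example in the text.
lacIZ_start = 147161  # "d": first nucleotide of lacIZ in F'128 (FC40)
prpE_start = 136636   # "u": first nucleotide of prpE

example_diff = diff(lacIZ_start, prpE_start)
print(example_diff)  # 10525
```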

Fine! Why do we need it?

There are two reasons.

First, we like to imagine that a snap-back or strand-switch followed by remodeling gave rise to these inverted junctions, in which case Diff estimates the lower limit of the DNA involved in the initiating palindrome. That palindrome must have started somewhere to the right of “f” and ended somewhere to the left of “u”, so it could not have been any smaller than the distance from “u” to “f”, which is only a few nucleotides larger than the distance from “u” to “d”. We prefer defining Diff in terms of “d” rather than “f”, as read lengths are variable (from 25 to 36 bp) but we always know where the first nucleotide is.

Second, Diff is useful for sorting through data from a strain containing a high-frequency join. Let’s say that nearly every cell in the strain analyzed has the join indicated above. Then we will find in our data a large number of “red” read pairs which contain the junction.

Basic analysis.

We can make a spreadsheet in which each line corresponds to one of these read pairs, with the first cell holding the smaller of the two read addresses, the second cell the larger, and the third cell Diff, calculated by subtracting the first cell from the second. Why “smaller” and “larger”, instead of “left” and “right”? Because we don’t know anything about the true history of these DNA molecules. The model shown above is just one possibility. If you replace all the primed letters with unprimed and vice versa, you will get an equally probable model. So, when tabulating Solexa paired reads, we take the smaller as “left” by convention.
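
The spreadsheet layout can be sketched in Python as follows. This is only an illustration: the `Row` type, the `tabulate` helper, and every address except the lacIZ/prpE pair from the earlier example are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Row(NamedTuple):
    left: int   # smaller master address ("left" by convention)
    right: int  # larger master address
    diff: int   # right - left


def tabulate(pairs: Iterable[Tuple[int, int]]) -> List[Row]:
    """Build one row per read pair, whatever order the two addresses arrive in."""
    rows = []
    for a, b in pairs:
        left, right = min(a, b), max(a, b)
        rows.append(Row(left, right, right - left))
    return rows


# First pair: the example from the text. The other two are made-up addresses.
pairs = [(147161, 136636), (136700, 147097), (147225, 136572)]
table = tabulate(pairs)
for row in table:
    print(row.left, row.right, row.diff)
```

Because the smaller address always goes in the first cell, Diff is never negative in this layout.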


We sort the lines according to the value of the first cell. This means that the first line will correspond to that piece of sequenced DNA with the earliest left read in the reference sequence. We can diagram its relationship to the reference sequence,


Likewise, the last row will derive from DNA with the latest left read,


As you can see, the value of Diff (green bars) increases as the joint in the DNA is found further to the left. It is not difficult to show for a set of read pairs containing the same inverted join that plotting Diff on the Y axis against the left address on X yields a line with slope -2. See Strange Structures for more detail.
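
A small numerical check of the slope claim, under one simplifying assumption that is mine rather than taken from Strange Structures: every sequenced fragment crossing the join has the same length, so that moving the left read one nucleotide to the right moves the right read one nucleotide to the left, and left + right stays constant. The constant `ADDRESS_SUM` and the range of left addresses are hypothetical.

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Assumption: left + right is the same for every pair spanning this join.
ADDRESS_SUM = 136636 + 147161

lefts = list(range(136500, 136700, 10))   # hypothetical left addresses
rights = [ADDRESS_SUM - a for a in lefts]
diffs = [b - a for a, b in zip(lefts, rights)]

slope, intercept = least_squares(lefts, diffs)
print(slope, intercept)  # slope comes out at -2 under this assumption
```

Under this assumption Diff = (left + right) - 2 × left, so the slope of -2 falls out directly and the intercept equals the constant sum. With real data, variation in fragment length would scatter the points around that line.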


A dramatic example of this analysis is provided by the read pairs crossing the convergent join point of array EK568 in strain TT25790, shown here,


We can use Diff to hunt for other, less common join points subject to two conditions:

  1. There must be several read pairs containing the joint with sufficient spread in the Diff values to allow a significant least-squares straight line to be drawn through the data.
  2. It must be possible to cluster these read pairs by appropriate sorting to exclude contamination by the mass of data deriving from other transient structures. I have found several ways of sorting the data which allow me to screen the data with a moving window.
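
To make the idea of a moving-window screen concrete, here is one possible sketch. It is not the sorting or screening procedure actually used for the results described here, which is not spelled out in this post. It assumes rows of (left address, Diff) sorted by left address, fits a least-squares line inside each window, and keeps windows whose slope is close to -2 with a high r². The function names, window size, thresholds, and all data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit; returns (slope, intercept, r_squared) or None if degenerate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # no spread in addresses or in Diff: no meaningful line
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, (sxy * sxy) / (sxx * syy)


def screen(rows, window=5, slope_tol=0.1, min_r2=0.98):
    """Slide a window over rows sorted by left address; report windows near slope -2."""
    rows = sorted(rows)
    hits = []
    for start in range(len(rows) - window + 1):
        chunk = rows[start:start + window]
        fit = fit_line([r[0] for r in chunk], [r[1] for r in chunk])
        if fit is None:
            continue
        slope, intercept, r2 = fit
        if abs(slope + 2.0) <= slope_tol and r2 >= min_r2:
            hits.append((start, slope, intercept, r2))
    return hits


# Made-up data: six pairs sharing one join, plus unrelated background pairs.
join_pairs = [(1000 + 7 * i, 5000 - 7 * i) for i in range(6)]
background = [(200, 900), (600, 9000), (1500, 1900),
              (2500, 8300), (3000, 3400), (4000, 4100)]
rows = [(left, right - left) for left, right in join_pairs + background]

hits = screen(rows)
for start, slope, intercept, r2 in hits:
    print(start, slope, r2)
```

In this toy data only the windows made up entirely of pairs from the shared join pass both thresholds. Real data would need a sort that keeps unrelated pairs out of the window, which is the point of the second condition above.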

This process has yielded a number of transient or low-level inverted join points in every strain I’ve examined.

-- Eric Kofoid