**The point is…**

‘Strange structures’ are DNA features not characteristic of the dominant type in a population of molecules, but which nevertheless occur at some finite and presumably observable level. Because Solexa looks at all entities in the population, whether or not they are stably replicated or fixed, we should find evidence of strange structures by appropriate analysis of the data. As I mentioned in a previous posting, ‘the anomalous reads’ are the interesting stuff. I will show here how I process them to pull out features of interest.

**Some Definitions and a Brief Intro**

Solexa determines the sequences at the ends of actual DNA chunks prepared by chopping and sizing. We refer to these tracts as “reads”. Two reads bounding a chunk are called a “read-pair”. Our project identifies 35 nucleotide reads tethered to form read-pairs of around 300 bp. Most map in a predictable fashion to the genome of *Salmonella enterica typhimurium*, strain LT2, the so-called “reference sequence”. A typical Solexa project results in 70-fold coverage of the genome, including any stably inherited episomes such as F plasmids. “Anomalous” read-pairs are, by definition, nonconformant and represent only a small fraction of total reads. They comprise strain-specific inherited modifications which differ intrinsically from the reference wild type, such as selected amplifications of *lacZ*, newly arisen low-frequency rearrangements not yet fixed in the main population (“mutant structures”), or transient alterations incompatible with stable inheritance (“ephemeral structures”). They may also derive from PCR or sample preparation artifacts, although we expect these to be very infrequent.

**Anomalous Databases**

Yong Lu has generated four databases of anomalous read-pairs compatible with certain rearrangements of interest: “red” and “blue” pairs which span, respectively, convergent and divergent joins of inverted duplications; “green” pairs, crossing tandem duplication joins; and, “black” pairs which enclose deletions. For the rest of this discussion, I will concentrate on my analysis of the red and blue data for the F’_{128}(FC40) plasmid of strain TT25790. This plasmid contains the selected *lac* amplification EK568.

**Red and Blue**

The red data set contains 5247 read-pairs, of which 625 cross the join point of the amplified array. The window within which we can locate such joint spanning pairs would comprise about 2(350 – 35) = 630 bp. Twice the number of read pairs is the number of reads in this region. Therefore, the read density is (2×625)/630 = 2.0, close to the read density for the chromosome (15716836/4857432 = 3.2). The point is, even though they are *anomalous* compared to the reference sequence, red read-pairs which cross the EK568 array edge are *typical* of strain TT25790, as the rearrangement is present in nearly all cells.

What about the remaining red anomalous pairs? There are 5247 – 625 = 4622 of them for a plasmid of size 231427 bp. As expected, the coverage for these is much smaller, (2×4622) / 231427 = 0.004, as these reads are truly anomalous, in the sense of being unexpected even taking into account the EK568 plasmid rearrangement.

Similarly, there are 638 read-pairs in the blue database, of which only 255 cross the amplification boundary. This predicts a blue coverage of (2×255)/630 = 0.8-fold, less than half what we expect. I am not what causes this discrepancy. Yong is looking into it, in case there is a technical glitch of some type. Nevertheless, this is 25x greater than the truly anomalous background and still close to the base-line chromosomal read density.

**Data Manipulation**

Typically, I sort the data according to the address of the first (or “left”) read, where “address” is the position where the first nucleotide of the block would fall in the reference sequence. How far apart should we expect two consecutive left red reads in F’_{128}(FC40) to be after sorting? Around 231427 / 4622 = 50, assuming that all the reads left after subtracting out the EK568 joints are independent.

However, when we scan the sorted data, we discover clusters of read-pairs whose consecutive occurrences are much closer than expectation. Such bunching is what we would expect if DNA abnormalities form dynamically at a rate high enough that several instances can be trapped and processed during a normal Solexa run. The entities we expect to exhibit such clustering behavior in the red and blue data are those mentioned earlier in this article, namely unfixed mutant or ephemeral structures such as inversions, amplified TIDs, or products of snap-back or strand-switched replication.

**Data Analysis**

Imagine a pot of DNA, sheared and sized so that every molecule and was about 350 bp long with two broken blunt ends. If such an entity contained a convergent join point, it would be eligible for membership in the red dataset and would be identified by two 35 bp sequences belonging to a single read-pair. Each of the two reads would be characterized by an address corresponding to the position of that nucleotide tract in the reference sequence. When data is entered into the red database, the read with the smaller reference address is placed in a column to the left of the second read. We refer to these, respectively, as the “left” and “right” reads. I find it useful to define the parameter “diff” for each read-pair, equal to the difference between the addresses of the two reads:

diff = address(right) – address(left)

Now, consider the case in which the same join point occurs sufficiently frequently that there are several instances of it in the red set. Because random breakage created the molecules whose end sequences form the data, we expect the position of the junction in any given read-pair to be randomly positioned between the two ends which include it. After sorting the database on the left reads, we can locate the cluster containing this joint, as the rows defining the cluster should have left and right read addresses each much closer to other members of its column than would be expected for reads in the same number of contiguous rows chosen at random from the database.

**The Red Set…**

Given these conditions and the absence of random polluting read-pairs (similar left addresses but unrelated right addresses), we can derive a useful relationship between diff and the address, x, of the left read:

x_{0} and R_{0} are the addresses of the reads in the leftmost possible read-pair. It is clear that

diff = R_{0} – x_{0} – 2Δx

However, Δx is just equal to x – x_{0}, which, on substitution, gives

diff = R_{0} + x_{0} – 2x

This allows us to analyze clusters graphically by plotting the diff value of each member against its left address. If the cluster derives from doubly sheared molecules sharing the same convergent join point, then a straight line with slope of -2 will fit the data.

This is exactly what is observed in array EK568 (TT25790), where the red points are the corresponding read-pairs crossing the join point:

The least-squares best fit through the red data is

diff = y = -1.939x + 306535,

which yields an x intercept at y=0

x_{i} = 306535/1.939 = 158089.

From the diff vs x relationship, we know that x_{i} = (R_{0}+x_{0})/2. We can calculate R_{0} from the known join point coordinates, [158087,158295], as

R_{0} = 158295-35 = 158260.

This allows us to estimate

x_{0} = 2(158089)-158260 = 157918.

If we examine the table of sorted Solexa data, we find that the earliest read-pair has left and right addresses, respectively, of 157884 and 158255, differing from prediction by 34 and 5, indicating that it is nearly identical to the leftmost possible read-pair, which one might expect, given the extensive coverage over this particular join point.

Of course, we also observe some very intriguing light and dark green data, lying on lines of slopes equal to -1! These are fascinating in their own right, and will be discussed in a later posting. Suffice it to say that they probably derive from *singly* sheared entities, and have an intimate albeit fleeting relationship to the red joint.

**The Blue Set…**

The same analysis can be done for the blue data set, which contains the read-pairs which cross the major divergent join point for array EK568:

In this case, “leftmost” has a subtly altered meaning. Because these are *divergent* joints, the value of x increases from right to left. This means that in the sorted data, the leftmost reads will be found at the bottom of the data set, opposite that for the red set.

As can be seen from the figure,

Δx = x_{0}-x,

because of the directionality of x just discussed. Therefore, substitution into

diff = R_{0} – x_{0} + 2Δx

yields the same relationship found in the red analysis,

diff = R_{0} + x_{0} – 2x.

As we saw above for red data, most blue data plot as a straight line with slope ≈-2, as seen in this graph of the second join point from the same array, EK568:

Because of the divergent nature of the joint, features such as the “joint edge” are flipped with respect those on the red graph. Using the same methods as before (but estimating R_{0} as the join point coordinate itself, 132098), we can derive x_{0} as 134094. These values differ from the earliest read pair by 124 and -21, respectively. Although the former value is a little larger than I’d like, it’s still not bad considering that the blue data has greater scatter than that of the red.

The blue “-1″ outliers are more suspect than those in the red data. The Bayesian filter used was my two eyes! Nevertheless, they do define a line of slope -1, and many of the blue dots may, in fact, belong to this hypothetical cluster.

**Average Read-Pair Length…**

As can be seen in the figures above, a lower bound for L can be estimated by defining the first diff in the sorted data as

diff_{first} = R_{0}-x_{0}

and the last diff as

diff_{last} = R_{0}-x_{0}-2(L-70) = diff_{first}+140-2L [red data]

diff_{last} = R_{0}-x_{0}+2(L-70) = diff_{first}-140+2L [blue data]

which rearrange to

L = (140 + diff_{first}-diff_{last})/2 [red data]

L = (140 + diff_{last}-diff_{first})/2 [blue data]

Applying these equations and using the first and last read-pairs to approximate diff_{first} and diff_{last} gives L≈243 and 208 for the red and blue data, respectively. These are shorter than the 300 originally estimated, but these calculations yield only lower bounds. The first or last rows of data may be significantly displaced from the theoretical boundary read-pair.

**Next…**

Anyone who has read this far will want to know what *other* strange structures are found aside from the join points already discussed, which aren’t really that strange as they are found in almost every plasmid. My next post discusses this and the interesting phenomenon of ‘singly sheared’ molecules.

*-- Eric Kofoid*