Archive for March, 2010

Possible REP Snap-Back in TT24815 Chromosome

Thursday, March 25th, 2010

A read-depth analysis of the TT24815 chromosome is noisy though relatively flat. It has a few interesting spikes, including one near 336000, about 3x higher than the average read depth elsewhere.
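One way to flag a spike like this is to average depth over fixed windows and compare each window to the median. The following is only a minimal sketch, not the pipeline used for the analysis above: the function name `find_depth_spikes`, the 1 kb window, and the 3x threshold are all assumptions chosen for illustration.

```python
from statistics import median

def find_depth_spikes(depths, window=1000, fold=3.0):
    """Return (start, mean_depth) for each window whose mean depth is
    at least `fold` times the median of all window means.

    `depths` is a list of per-base read depths (hypothetical input format).
    """
    means = []
    for start in range(0, len(depths) - window + 1, window):
        chunk = depths[start:start + window]
        means.append((start, sum(chunk) / window))
    if not means:
        return []
    baseline = median(m for _, m in means)
    if baseline == 0:
        return []  # no usable baseline to compare against
    return [(s, m) for s, m in means if m >= fold * baseline]

# Synthetic example: flat depth of 10 with one 1 kb region at depth 30.
depths = [10] * 5000 + [30] * 1000 + [10] * 4000
spikes = find_depth_spikes(depths)
```

Using the median rather than the mean as the baseline keeps a few tall spikes from raising the threshold they are measured against.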


The nearest rrn locus is rrnH, about 41 kb upstream of the spike, so I doubt that rrn loci are involved in the amplification.

I scanned the immediate vicinity and found a REP sequence at position 336899. The feature lies in the intercistronic space between the large (4 kb) rhsD gene and rhsE.

(In the following figures, add 336898 to “stem-loop” addresses and 300000 to plotfold addresses to get correct chromosomal positions.)
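The offset arithmetic can be written out as a small helper. This is just an illustrative sketch; the function name `to_chromosome` and the source labels are made up here, and only the two offsets come from the note above.

```python
# Offsets stated in the note above; everything else is hypothetical naming.
STEM_LOOP_OFFSET = 336898
PLOTFOLD_OFFSET = 300000

def to_chromosome(position, source):
    """Convert a figure-local address to a chromosomal position."""
    offsets = {"stem-loop": STEM_LOOP_OFFSET, "plotfold": PLOTFOLD_OFFSET}
    return position + offsets[source]

first_stem_loop_base = to_chromosome(1, "stem-loop")
```

If the stem-loop addresses are 1-based, address 1 maps to 336899, the REP position given above.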


I looked for clusters of anomalous read-pairs in the vicinity. There are 23 read-pairs containing convergent joins in the rhsD gene (STM0291), 3 containing divergent joins a little upstream of these, 3 containing deletions in rhsD, and 8 containing tandem duplication join points, also in rhsD.


These read pairs are isolated members of the collection of sheared DNA. No single cell carries all of them and most contain none. Many may be uninheritable cast-offs discarded by error correction machinery.

Nevertheless, it appears that the REP could be nucleating snap-backs, which then get fixed in a variety of ways. Some of these may be full-fledged TIDs; others may be remodeled into tandem duplication arrays. My guess is that enough cells carry various forms of amplification arising from this REP to account for the spike.

-- Eric Kofoid

Illumina Data Clustering

Thursday, March 25th, 2010


Illumina data provide a wealth of information about the DNA rearrangements that typify the genotype of a strain under study. Transient or low-level rearrangements can also be identified and have been discussed in the article, Finding Strange Structures in Solexa. In this article, I confine myself to alterations consistent with red and blue read pairs.

Groups of related DNA structures can be found. Each is defined by reads that map physically close to one another and indicate a single type of rearrangement, yet are not frequent enough to have risen to fixation in the genotype. I would like to suggest that members of such groups, or “clusters”, may have originated by a mechanism common to the ensemble as a whole, and that these mechanisms may be of evolutionary importance.
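The grouping idea can be sketched as a simple gap-based clustering: anomalous read pairs of the same type are chained together whenever neighboring positions lie within some maximum distance. This is an assumption-laden illustration rather than the procedure actually used here; the function name `cluster_read_pairs`, the `(kind, position)` input format, and the 500 bp gap are all hypothetical.

```python
from collections import defaultdict

def cluster_read_pairs(pairs, max_gap=500):
    """Group (kind, position) read pairs into clusters.

    Pairs of the same kind are chained into one cluster while consecutive
    sorted positions are no more than `max_gap` apart.
    Returns a list of (kind, [positions]) tuples.
    """
    by_kind = defaultdict(list)
    for kind, pos in pairs:
        by_kind[kind].append(pos)

    clusters = []
    for kind in sorted(by_kind):
        positions = sorted(by_kind[kind])
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= max_gap:
                current.append(pos)
            else:
                clusters.append((kind, current))
                current = [pos]
        clusters.append((kind, current))
    return clusters

# Toy input: the labels reuse the red/blue colors mentioned above.
pairs = [("red", 100), ("red", 300), ("blue", 250), ("red", 5000), ("blue", 400)]
clusters = cluster_read_pairs(pairs)
```

Singletons come back as one-member clusters, so a minimum-size filter would be needed to separate genuine clusters from isolated pairs.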

Some Definitions

The following phrases are from the Roth Encyclopedia. Clicking on them will bring up the definition in a separate window.

Anomalous pairs

Anomalous types


Convergent joint


Divergent joint

Mated pair

Paired-end (PE) reads


Read pair

Reference sequence


Here are examples of blue and red clusters in the F’128(FC40) molecule of LT2 strain TT24815:


Here are data from the central group of the previous illustrations remapped as diff graphs:


From the first set of graphs, I can see that both ends of the DNA fragments have been randomly sheared, although the right ends are less variable than the left. This, combined with the slopes in the diff graphs, allows me to infer that both clusters are probably characterized by several different join points.


Clustering in Illumina data indicates that certain types of rearrangements occur more often in the subject genome than statistically expected. They may be counterselected but form at a high enough rate to remain in steady state with the dominant genotype, or they may be selectively neutral and undergo a drunkard’s walk in frequency over time. Clearly, they are not under positive selection, as they are present at levels far below one per genome equivalent of DNA.
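The post does not say which statistical model lies behind “statistically expected”. One common choice, assumed here purely for illustration, is to treat anomalous pairs as Poisson events scattered uniformly along the genome and to ask how likely it is to see at least k of them in a single window. The function name `poisson_tail` and the numbers in the example are hypothetical.

```python
import math

def poisson_tail(k, expected):
    """Return P(X >= k) for X ~ Poisson(expected)."""
    if k <= 0:
        return 1.0
    below = sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(k)
    )
    return max(0.0, 1.0 - below)

# Hypothetical numbers: 0.5 anomalous pairs expected per window by chance,
# 10 observed in one window.
p_value = poisson_tail(10, 0.5)
```

A very small tail probability would suggest that the window holds more anomalous pairs than uniform scatter predicts, although a real analysis would also need to correct for the number of windows tested.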

What is the nature of the rearrangements? They could represent the joints of inverted segments. We tend to disregard this possibility, as inversions require a concerted ballet of simultaneous low-probability events. They could represent transient “one-off” uninheritable creations, such as snap-back extensions which simply die unproductively after forming at a high rate owing to specific features of the DNA neighborhood. Again, we suspect this is not the case, as the structures would be lethal — selection would likely modulate the nucleating features over time to decrease such suicidal tendencies.

We favor the notion that these clusters represent the joints of unselected aTIDs, which we think form relatively often, are only moderately upsetting to the cell, and — if truly deleterious — are easily collapsed. If this idea is correct, then read-pair clusters are the very stuff of evolution, as they represent preexisting sites at which rapid amplification could occur should selection ever come into play.

-- Eric Kofoid


Thursday, March 25th, 2010

“Solexa” has been our word of choice to describe the rapid throughput sequencing strategy generating pairs of linked reads. You all know this, and the methodology has been described exhaustively in this blog.

However, “Solexa” was actually the name of a company which was acquired by Illumina, Inc. on 11/1/2006. References to this technology, both in the literature and online, have gradually been migrating to “Illumina” as the approved adjective.

After consultation with our collaborators, we have made an executive decision to switch to the term du jour.

How sad — I really liked “Solexa”!

-- Eric Kofoid