Illumina Data Clustering


Illumina data provide a wealth of information about the DNA rearrangements that typify the genotype of a strain under study. Transient or low-level rearrangements can also be identified; they are discussed in the article Finding Strange Structures in Solexa. In this article, I confine myself to alterations consistent with red and blue read pairs.

Illumina data reveal groups of related DNA structures. Each group is defined by reads that lie physically close to each other and indicate a single type of rearrangement, yet are not so frequent as to have risen to fixation in the genotype. I would like to suggest that members of such groups, or “clusters”, may have originated by a mechanism common to the ensemble as a whole, and that these mechanisms may be of evolutionary importance.
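To make the idea of a cluster concrete, here is a minimal sketch of how such groups might be pulled out of a list of anomalous read pairs. It is not the procedure used to produce the figures in this article. The ReadPair record, the cluster_pairs function, and the max_gap and min_size parameters are all hypothetical names chosen for illustration, and the rule used (same anomaly type, with both ends within max_gap bases of the previous pair) is only one plausible reading of "physically close".

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical record for one anomalous read pair."""
    left: int   # reference coordinate of the left read
    right: int  # reference coordinate of the right read
    kind: str   # anomaly label, e.g. "red" or "blue"


def cluster_pairs(pairs, max_gap=500, min_size=2):
    """Group pairs of the same kind whose ends lie close together.

    Pairs are sorted by left coordinate within each kind. A pair joins
    the current cluster if both of its ends are within max_gap bases of
    the previous pair's ends; otherwise it starts a new cluster.
    Clusters with fewer than min_size members are dropped.
    """
    by_kind = defaultdict(list)
    for pair in pairs:
        by_kind[pair.kind].append(pair)

    clusters = []
    for group in by_kind.values():
        group.sort(key=lambda p: p.left)
        current = [group[0]]
        for pair in group[1:]:
            prev = current[-1]
            close_left = pair.left - prev.left <= max_gap
            close_right = abs(pair.right - prev.right) <= max_gap
            if close_left and close_right:
                current.append(pair)
            else:
                clusters.append(current)
                current = [pair]
        clusters.append(current)

    return [c for c in clusters if len(c) >= min_size]
```

The thresholds are arbitrary here. In practice max_gap would presumably be chosen with the library's fragment-size distribution in mind, and a real analysis would start from aligned reads rather than hand-built records.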

Some Definitions

The following phrases are from the Roth Encyclopedia. Clicking on them will bring up the definition in a separate window.

Anomalous pairs

Anomalous types


Convergent joint


Divergent joint

Mated pair

Paired-end (PE) reads


Read pair

Reference sequence


Here are examples of blue and red clusters in the F’128(FC40) molecule of LT2 strain TT24815:


Here are data from the central group of the previous illustrations remapped as diff graphs:


From the first set of graphs, I can see that both ends of the DNA fragments have been randomly sheared, although the right ends are less variable than the left. This, combined with the slopes in the diff graphs, allows me to infer that both clusters are probably characterized by several different join points.


Clustering in Illumina data indicates that certain types of rearrangements occur more often in the subject genome than statistically expected. They may be counterselected but have a high enough formation rate that they are in steady state with the dominant genotype, or they may be selectively neutral and undergo a drunkard’s walk in frequency with time. Clearly, they are not under positive selection, as they are present at levels much lower than one per genome equivalent of DNA.
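One way to put a number on "more often than statistically expected" is sketched below. It assumes a deliberately simple null model that is not taken from this article: anomalous pairs of a given type fall uniformly at random along the molecule, so the count in any window is roughly Poisson. The function names and arguments are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def cluster_p_value(n_in_window, window_bp, total_anomalous, molecule_bp):
    """Chance of seeing at least n_in_window anomalous pairs in a window
    of window_bp bases if total_anomalous pairs were scattered uniformly
    over a molecule of molecule_bp bases (assumed null model)."""
    expected = total_anomalous * window_bp / molecule_bp
    return poisson_tail(n_in_window, expected)
```

A small value suggests that the window holds more anomalous pairs than uniform scatter would produce. It says nothing about why they are there, and a real test would also need to correct for the number of windows examined and for uneven coverage.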

What is the nature of the rearrangements? They could represent the joints of inverted segments. We tend to disregard this possibility, as inversions require a concerted ballet of simultaneous low-probability events. They could represent transient “one-off” uninheritable creations, such as snap-back extensions which simply die unproductively after forming at a high rate owing to specific features of the DNA neighborhood. Again, we suspect this is not the case, as the structures would be lethal — selection would likely modulate the nucleating features over time to decrease such suicidal tendencies.

We favor the notion that these clusters represent the joints of unselected aTIDs, which we think form relatively often, are only moderately upsetting to the cell, and — if truly deleterious — are easily collapsed. If this idea is correct, then read-pair clusters are the very stuff of evolution, as they represent preexisting sites at which rapid amplification could occur should selection ever come into play.

-- Eric Kofoid


One Response to “Illumina Data Clustering”

  1. Rafael Camacho says:

    Thank you for the information; it is relevant to me.
    Regards, Rafael
