Archive for the ‘Illumina’ Category

Illumina Data Reduction Primer

Wednesday, March 13th, 2013

Here are some notes on how I analyze Illumina data. I would appreciate any comments on how to improve, supplement or extend these methods.

I’m currently using an Amazon Machine Image to analyze PacBio data, and hope to write about this experience soon. Suffice it to say that the high error rate and variable read lengths make the process much more challenging than was the case for Illumina paired reads.

-- Eric Kofoid

The Sad Truth about Illumina Data Clustering

Tuesday, September 28th, 2010

This article is a continuation of Illumina Data Clustering, and is a perfect example of why we, as scientists, should resist the hubris of premature expectations.

The standard Illumina protocol for library preparation requires 18 cycles of PCR after adaptor ligation to enrich for fragments with doubly modified ends. Incomplete products from a previous round can snap back during this step (“megaprimer snap-back”), creating artifactual templates which will then amplify along with the others. This is a first-order process and should be fairly common when the 3′ end of the elongating strand just happens to fall on a complementary REP site. A less common artifact could occur by a second-order process involving megaprimer extension and reannealing in trans to a complementary REP site.

I found a group of closely spaced NlaIV restriction sites whose digestion would prevent megaprimer formation by the snap-back route. If clusters arose from preexisting TIDs or by the rare megaprimer extension event, NlaIV digestion would have no effect on cluster amplification.
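A search for such sites can be sketched in a few lines of Python. This is only an illustration, not the script used for the experiment: it assumes the NlaIV recognition sequence is GGNNCC, and the function name and toy sequence are invented for the example.

```python
import re

# Assumed NlaIV recognition site: GGNNCC (N = any base). The pattern is its own
# reverse complement, so scanning one strand locates sites on both strands.
# The lookahead lets overlapping sites be reported individually.
NLAIV_SITE = re.compile(r"(?=(GG[ACGT]{2}CC))")

def find_nlaiv_sites(seq):
    """Return 0-based start positions of every GGNNCC match, overlaps included."""
    return [m.start() for m in NLAIV_SITE.finditer(seq.upper())]

# Hypothetical toy sequence, not taken from the TT24815 chromosome.
toy = "ATGGCACCTTGGGGCCCCAT"
sites = find_nlaiv_sites(toy)
print(sites)
```

Positions that fall close together in the output are candidates for a group of closely spaced sites like the one described above.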

I did the experiment, and found that cutting the template DNA with NlaIV prevented amplification. I am forced to conclude that the beautiful clustering of REP-mediated TID joints found in our data is strictly man-made by megaprimer snap-back!

Models for PCR Detection of REP-Mediated Clusters

-- Eric Kofoid

Possible REP Snap-Back in TT24815 Chromosome

Thursday, March 25th, 2010

A read-depth analysis of the TT24815 chromosome is noisy though relatively flat. It has a few interesting spikes, including one near 336000, about 3x higher than the average read depth elsewhere.
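For readers who want to look for similar spikes in their own data, here is a minimal sketch. It assumes the per-base read depth is already available as a plain list of numbers; the window size, the 3x threshold, and the use of the median as a baseline (so that the spike itself does not inflate the reference level) are my choices for the example, not a description of how the plot above was made.

```python
from statistics import median

def find_depth_spikes(depth, window=1000, fold=3.0):
    """Return (window_start, mean_depth) for windows at or above fold x baseline.

    depth  : list of per-base read depths (assumed input format)
    window : window size in bases (arbitrary choice for this sketch)
    fold   : how many times the baseline a window must reach to be reported
    """
    baseline = median(depth)
    spikes = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean = sum(chunk) / len(chunk)
        if mean >= fold * baseline:
            spikes.append((start, mean))
    return spikes

# Hypothetical toy profile: flat depth of 10 with one 1 kb region at 35.
toy_depth = [10] * 9000 + [35] * 1000
print(find_depth_spikes(toy_depth))
```

With noisy real data, a larger window smooths the profile at the cost of positional resolution.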


The nearest rrn locus is rrnH, about 41 kb upstream of the spike, so I doubt that rrn’s are involved in the amplification.

I scanned the immediate vicinity and found a REP sequence at position 336899. The feature lies in the intercistronic space between the large (4 kb) rhsD gene and rhsE.

(In the following figures, add 336898 to “stem-loop” addresses and 300000 to plotfold addresses to get correct chromosomal positions.)
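The conversion is simple arithmetic, sketched here so the offsets are stated in one place. The two offsets come from the note above; the function name and the source labels are invented for the example.

```python
# Offsets quoted in the text for converting figure addresses
# to chromosomal positions.
OFFSETS = {
    "stem-loop": 336898,
    "plotfold": 300000,
}

def to_chromosome(position, source):
    """Convert a figure address to a chromosomal position.

    source must be "stem-loop" or "plotfold" (labels chosen for this sketch).
    """
    return position + OFFSETS[source]

# Address 1 in the stem-loop figure corresponds to the REP position 336899.
print(to_chromosome(1, "stem-loop"))
```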


I looked for clusters of anomalous read-pairs in the vicinity. There are 23 read-pairs containing convergent joins in the rhsD gene (STM0291), 3 containing divergent joins a little upstream of these, 3 containing deletions in rhsD, and 8 containing tandem duplication join points, also in rhsD.


These read pairs are isolated members of the collection of sheared DNA. No single cell carries all of them and most contain none. Many may be uninheritable cast-offs discarded by error correction machinery.

Nevertheless, it appears that the REP could be nucleating snap-backs, which then get fixed in a variety of ways. Some may be full-fledged TIDs; others may be remodeled into tandem duplication arrays. My guess is that enough cells are carrying various forms of amplifications arising from this REP that they account for the spike.

-- Eric Kofoid

Illumina Data Clustering

Thursday, March 25th, 2010


Illumina data provides a wealth of information about DNA rearrangements which typify the genotype of a strain under study. Transient or low-level rearrangements can also be identified and have been discussed in the article, Finding Strange Structures in Solexa. In this article, I confine myself to alterations consistent with red and blue read pairs.

Groups of related DNA structures are found, defined by reads which are physically close to each other and define a single type of rearrangement, but which are not so frequent as to have risen to fixation in the genotype. I would like to suggest that members of such groups, or “clusters”, may have originated by a mechanism common to the ensemble as a whole, and that these mechanisms may be of evolutionary importance.
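The grouping idea in the previous paragraph can be sketched roughly as follows. This is not the code used to produce the figures in this article. It assumes each anomalous read pair has already been reduced to a (type, position) tuple, treats "red" and "blue" as opaque type labels, and uses a maximum gap and minimum cluster size that are arbitrary values chosen for the example.

```python
from collections import defaultdict

def cluster_pairs(pairs, max_gap=500, min_size=3):
    """Group read pairs of the same type that lie close together.

    pairs    : iterable of (kind, position) tuples (assumed input format)
    max_gap  : largest distance between neighbors in one cluster (assumption)
    min_size : smallest group reported as a cluster (assumption)
    Returns a list of (kind, [positions]) sorted by kind, then position.
    """
    by_kind = defaultdict(list)
    for kind, pos in pairs:
        by_kind[kind].append(pos)

    clusters = []
    for kind in sorted(by_kind):
        positions = sorted(by_kind[kind])
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= max_gap:
                current.append(pos)          # still within the same cluster
            else:
                if len(current) >= min_size:
                    clusters.append((kind, current))
                current = [pos]              # start a new candidate cluster
        if len(current) >= min_size:
            clusters.append((kind, current))
    return clusters

# Hypothetical toy data, not real read-pair positions.
toy = [("red", 100), ("red", 250), ("red", 400), ("blue", 120),
       ("red", 5000), ("blue", 9000), ("blue", 9100), ("blue", 9300)]
print(cluster_pairs(toy))
```

Isolated pairs, such as the single red pair at 5000 in the toy data, are dropped because they do not reach the minimum size.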

Some Definitions

The following phrases are from the Roth Encyclopedia. Clicking on them will bring up the definition in a separate window.

Anomalous pairs

Anomalous types


Convergent joint


Divergent joint

Mated pair

Paired-end (PE) reads


Read pair

Reference sequence


Here are examples of blue and red clusters in the F’128(FC40) molecule of LT2 strain TT24815:


Here are data from the central group of the previous illustrations remapped as diff graphs:


From the first set of graphs, I can see that both ends of the DNA fragments have been randomly sheared, although the right ends are less variable than the left. This, combined with the slopes in the diff graphs, allows me to infer that both clusters are probably characterized by several different join points.


Clustering in Illumina data indicates that certain types of rearrangements occur more often in the subject genome than statistically expected. They may be counterselected but have a high enough formation rate that they are in steady state with the dominant genotype, or they may be selectively neutral and undergo a drunkard’s walk in frequency with time. Clearly, they are not under positive selection, as they are present at levels much lower than one per genome equivalent of DNA.
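One rough way to put a number on "more often than statistically expected" is a Poisson tail probability. The sketch below assumes that, absent any clustering mechanism, anomalous pairs of a given type would be scattered uniformly along the molecule; that null model, the function names, and the example figures are all assumptions made for illustration, not an analysis performed on these data.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)

def cluster_p_value(n_in_window, window, n_total, genome_length):
    """Chance of seeing at least n_in_window pairs in one window by luck alone.

    Assumes n_total pairs placed uniformly over genome_length bases.
    """
    expected = n_total * window / genome_length
    return poisson_tail(n_in_window, expected)

# Hypothetical numbers: 8 pairs in a 1 kb window, 200 pairs of that type
# in total, on a 230 kb molecule.
print(cluster_p_value(8, 1000, 200, 230000))
```

A small value suggests the window holds more pairs than uniform scatter would explain, though scanning many windows calls for a multiple-testing correction.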

What is the nature of the rearrangements? They could represent the joints of inverted segments. We tend to disregard this possibility, as inversions require a concerted ballet of simultaneous low-probability events. They could represent transient “one-off” uninheritable creations, such as snap-back extensions which simply die unproductively after forming at a high rate owing to specific features of the DNA neighborhood. Again, we suspect this is not the case, as the structures would be lethal — selection would likely modulate the nucleating features over time to decrease such suicidal tendencies.

We favor the notion that these clusters represent the joints of unselected aTIDs, which we think form relatively often, are only moderately upsetting to the cell, and — if truly deleterious — are easily collapsed. If this idea is correct, then read-pair clusters are the very stuff of evolution, as they represent preexisting sites at which rapid amplification could occur should selection ever come into play.

-- Eric Kofoid


Thursday, March 25th, 2010

“Solexa” has been our word of choice to describe the rapid throughput sequencing strategy generating pairs of linked reads. You all know this, and the methodology has been described exhaustively in this blog.

However, “Solexa” was actually the name of a company which was acquired by Illumina, Inc. on 11/1/2006. References to this technology in both the literature and on-line have gradually been migrating to “Illumina” as the approved adjective.

After consultation with our collaborators, we have made an executive decision to switch to the term du jour.

How sad — I really liked “Solexa”!

-- Eric Kofoid