Mar 1, 2010
Molecular Ecology Resources
The presence of heterozygous indels in a DNA sequence usually results in the sequence being discarded. If the sequence trace is of high enough quality, however, it contains enough information to reconstruct the two constituent sequences with very little ambiguity. Existing solutions rely on comparison with a known reference sequence, but such a reference is often unavailable for nonmodel organisms or novel DNA regions. I present a program that determines the sizes and positions of heterozygous indels in a DNA sequence and reconstructs the two constituent haploid sequences. No external data, such as a reference sequence or other prior knowledge, are required. Simulation suggests an accuracy of >99% from a single read, and the remaining errors can be eliminated by including a second sequencing read, such as one obtained with a reverse primer. Diploid sequences can be fully reconstructed across any number of heterozygous indels, with two overlapping sequencing reads almost always sufficient to infer the entire DNA sequence. This eliminates the need for costly and laborious cloning and allows the use of data that would otherwise be discarded. With no more laboratory work than is needed to produce two normal sequencing reads, two aligned haploid sequences can be produced quickly and accurately, with extensive phasing information.
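To illustrate why a mixed trace retains enough information to locate an indel without a reference, the following is a minimal Python sketch of the general principle. It is not the algorithm of the program described here. It assumes an idealised trace in which every position reports exactly the set of bases present in the two haplotypes, and all names (`mix_traces`, `first_ambiguous`, `shift_compatibility`) and the example sequences are hypothetical. Downstream of a heterozygous indel of length d, the base shown by one haplotype at position i reappears in the other haplotype at position i + d, so the true indel size is a shift at which every downstream position shares a base with the position d further along.

```python
def mix_traces(hap_a, hap_b):
    """Superimpose two haploid sequences position by position.

    Each position becomes the set of bases a diploid read would show.
    Idealised assumption: no noise, both alleles always visible.
    """
    n = min(len(hap_a), len(hap_b))
    return [frozenset((hap_a[i], hap_b[i])) for i in range(n)]


def first_ambiguous(mixed):
    """Return the index of the first two-base position, or None."""
    for i, bases in enumerate(mixed):
        if len(bases) > 1:
            return i
    return None


def shift_compatibility(mixed, shift):
    """Fraction of downstream positions consistent with an indel of this size.

    Position i is consistent if it shares at least one base with
    position i + shift. Only positions from the first ambiguous one
    onwards are examined, since that is where the haplotypes are out
    of register.
    """
    start = first_ambiguous(mixed)
    if start is None:
        return 0.0
    positions = range(start, len(mixed) - shift)
    if len(positions) == 0:
        return 0.0
    hits = sum(1 for i in positions if mixed[i] & mixed[i + shift])
    return hits / len(positions)


# Hypothetical example: haplotype A carries a 3-base insertion.
prefix = "ACGTTGCA"
insert = "GAT"
suffix = "CCTAGGTACCATGCAAGT"
hap_a = prefix + insert + suffix
hap_b = prefix + suffix

mixed = mix_traces(hap_a, hap_b)
scores = {d: shift_compatibility(mixed, d) for d in range(1, 6)}
for d, score in scores.items():
    print(d, round(score, 2))
```

In this idealised setting the true indel size always scores 1.0, while other sizes reach that score only by chance. A real trace would also require handling of noise, unequal peak heights, and several indels in one read, none of which this sketch attempts.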