Joint Biostatistics Seminar
Qin Zhang, Assistant Professor
School of Informatics, Computing & Engineering
Friday, April 5, 2019, 1-2pm, HITS1110
Similarity Joins on Sequence Data
Detecting similar pairs from a collection of sequences under edit distance, which is often called edit similarity joins, is a fundamental problem in databases and data mining, and has important applications in bioinformatics such as genome sequence assembly. However, existing algorithms cannot scale on data sets where the lengths of the sequences are long and the distance threshold is large, which appear frequently at the era of big data. For example, the recent Single-Molecule Real-Time (SMRT) sequencing technique produces sequencing reads of lengths up to 100,000 bps, with error rates in the range of 12% ~ 18% (which means that we need to set the distance threshold to be something like 20% of the length of the read). Our experimental results showed that all previous algorithms for edit similarity joins failed to handle such data within a reasonable time budget.
In this talk we present our recent advance on the design of efficient algorithms for edit similarity joins. Our algorithms make use of novel ideas from theoretical computer science such as metric embedding and locality sensitive hashing, and outperform all previous algorithms by orders of magnitude on long sequences and large distance thresholds. We will also discuss how to apply these ideas to the problem of detecting overlapping pairs of error-prone reads produced by SMRT.