You are here

Efficient String Graph Construction Algorithm

Download pdf | Full Screen View

Date Issued:
2019
Abstract/Description:
In the field of genome assembly research where assemblers are dominated by de Bruijn graph-based approaches, string graph-based assembly approach is getting more attention because of its ability to losslessly retain information from sequence data. Despite the advantages provided by a string graph in repeat detection and in maintaining read coherence, the high computational cost for constructing a string graph hinders its usability for genome assembly. Even though different algorithms have been proposed over the last decade for string graph construction, efficiency is still a challenge due to the demand for processing a large amount of sequence data generated by NGS technologies. Therefore, in this thesis, we provide a novel, linear time and alphabet-size-independent algorithm SOF which uses the property of irreducible edges and transitive edges to efficiently construct string graph from an overlap graph. Experimental results show that SOF is at least 2 times faster than the string graph construction algorithm provided in SGA, one of the most popular string graph-based assembler, while maintaining almost the same memory footprint as SGA. Moreover, the availability of SOF as a subprogram in the SGA assembly pipeline will give user facilities to access the preprocessing and postprocessing steps for genome assembly provided in SGA.
Title: Efficient String Graph Construction Algorithm.
50 views
21 downloads
Name(s): Morshed, S.M. Iqbal, Author
Yooseph, Shibu, Committee Chair
Zhang, Shaojie, Committee Member
Valliyil Thankachan, Sharma, Committee Member
University of Central Florida, Degree Grantor
Type of Resource: text
Date Issued: 2019
Publisher: University of Central Florida
Language(s): English
Abstract/Description: In the field of genome assembly research where assemblers are dominated by de Bruijn graph-based approaches, string graph-based assembly approach is getting more attention because of its ability to losslessly retain information from sequence data. Despite the advantages provided by a string graph in repeat detection and in maintaining read coherence, the high computational cost for constructing a string graph hinders its usability for genome assembly. Even though different algorithms have been proposed over the last decade for string graph construction, efficiency is still a challenge due to the demand for processing a large amount of sequence data generated by NGS technologies. Therefore, in this thesis, we provide a novel, linear time and alphabet-size-independent algorithm SOF which uses the property of irreducible edges and transitive edges to efficiently construct string graph from an overlap graph. Experimental results show that SOF is at least 2 times faster than the string graph construction algorithm provided in SGA, one of the most popular string graph-based assembler, while maintaining almost the same memory footprint as SGA. Moreover, the availability of SOF as a subprogram in the SGA assembly pipeline will give user facilities to access the preprocessing and postprocessing steps for genome assembly provided in SGA.
Identifier: CFE0007504 (IID), ucf:52635 (fedora)
Note(s): 2019-05-01
M.S.
Engineering and Computer Science, Computer Science
Masters
This record was generated from author submitted information.
Subject(s): bioinformatics -- string graph -- sequence assembly
Persistent Link to This Record: http://purl.flvc.org/ucf/fd/CFE0007504
Restrictions on Access: public 2019-05-15
Host Institution: UCF

In Collections