A Next-Generation Sequence Clustering Method for E. coli through Proteomics-Genomics Data Mapping
Mikang Sim, Ho-Sik Seok, Jaebum Kim*
a Department of Animal Biotechnology, Konkuk University, Seoul 143-701, Korea
b UBITA Center for Biotechnology Research (CBRU), Konkuk University, Seoul 143-701, Korea
The recent availability of diverse ‘omics’ data presents new challenges and opportunities for developing novel approaches to the assembly of next-generation sequences. To improve the quality of assembled sequences, we developed a next-generation sequence clustering method that exploits the interdependency between genomics and proteomics data, which has not been well utilized in this field so far. Given a set of next-generation read sequences and a set of protein sequences, our method clusters the reads by mapping them to the protein sequences. As a preliminary study, we selected Escherichia coli (E. coli) as our target species and simulated next-generation reads of E. coli, then evaluated our method by analyzing the actual adjacency of the clustered reads in the E. coli genome. We found that (i) a read base matching (RBM) ratio, which represents the proportion of bases in a read that are mapped to a protein sequence, higher than 50–70% is a useful criterion for effective read clustering, and (ii) a higher RBM ratio does not always lead to better cluster quality in the case of E. coli. These preliminary results demonstrate that the integrative approach is simple yet has great potential for clustering reads that are adjacent in a genome.
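The clustering criterion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hit tuple layout, the function names `rbm_ratio` and `cluster_reads`, the half-open alignment coordinates, and the `>=` comparison against the threshold are all assumptions made for illustration. The sketch assumes each read has already been mapped to at most one protein by some external aligner.

```python
from collections import defaultdict


def rbm_ratio(read_length, aligned_start, aligned_end):
    """Fraction of bases in a read covered by its protein alignment.

    Coordinates are assumed to be 0-based and half-open on the read.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # Clip the aligned interval to the read boundaries before measuring it.
    covered = max(0, min(aligned_end, read_length) - max(aligned_start, 0))
    return covered / read_length


def cluster_reads(hits, min_rbm=0.5):
    """Group reads by the protein they map to, keeping only reads whose
    RBM ratio reaches min_rbm.

    hits: iterable of (read_id, protein_id, read_length, start, end)
          tuples (a hypothetical layout, not a format from the paper).
    """
    clusters = defaultdict(list)
    for read_id, protein_id, read_length, start, end in hits:
        if rbm_ratio(read_length, start, end) >= min_rbm:
            clusters[protein_id].append(read_id)
    return dict(clusters)


# Toy input: three 100-base reads mapped to two hypothetical proteins.
example_hits = [
    ("r1", "pA", 100, 0, 90),   # RBM ratio 0.9
    ("r2", "pA", 100, 0, 40),   # RBM ratio 0.4, filtered out at 0.5
    ("r3", "pB", 100, 20, 80),  # RBM ratio 0.6
]
example_clusters = cluster_reads(example_hits, min_rbm=0.5)
```

Exposing `min_rbm` as a parameter makes it easy to compare thresholds across the 50–70% range, which matters because a higher RBM ratio was not found to guarantee better clusters.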