TitleScalable kernel methods and algorithms for general sequence analysis
NameKuksa, Pavel (author), Pavlovic, Vladimir (chair), Kukikowski, Casimir (internal member), Schliep, Alexander (internal member), Leslie, Christina (outside member), Rutgers University, Graduate School - New Brunswick,
Sequential machine theory
DescriptionAnalysis of large-scale sequential data has become an important task in machine learning and pattern recognition, inspired in part by numerous scientific and technological applications such as the document and text classification or the analysis of biological sequences. However, current computational methods for sequence comparison still lack accuracy and scalability necessary for reliable analysis of large datasets. To this end, we develop a new framework (efficient algorithms and methods) that solve sequence matching, comparison, classification, and pattern extraction problems in linear time, with increased accuracy, improving over the prior art. In particular, we propose novel ways of modeling sequences under complex transformations (such as multiple insertions, deletions, mutations) and present a new family of similarity measures (kernels), the spatial string kernels (SSK). SSKs can be computed very efficiently and perform better than the best available methods on a variety of distinct classification tasks. We also present new algorithms for approximate (e.g., with mismatches) string comparison that improve currently known time complexity bounds for such tasks and show order-of-magnitude running time improvements. We then propose novel linear time algorithms for representative pattern extraction in sequence data sets that exploit developed computational framework. In an extensive set of experiments on many challenging classification problems, such as detecting homology (evolutionary similarity) of remotely related proteins, categorizing texts, and performing classification of music samples, our algorithms and similarity measures display state-of-the-art classification performance and run significantly faster than existing methods.
NoteIncludes bibliographical references
Noteby Pavel Kuksa
CollectionGraduate School - New Brunswick Electronic Theses and Dissertations
Organization NameRutgers, The State University of New Jersey
RightsThe author owns the copyright to this work.