Search of Patterns in Genomic Sequences

The amount of available genomic data is huge, and it keeps growing thanks to Next Generation Sequencing (NGS) technology. Reading and sequencing the human genome is becoming a relatively fast and inexpensive process. The corresponding problem of querying and analyzing these data has not received equal attention, and it poses an important open challenge to the data management community. In the current state of the art, NGS data are managed in physical formats and standards that are strongly influenced by the data production process of sequencing machines and that offer no high-level view to be queried or analyzed. Most of the work in this area is carried out manually by biologists; it can take an enormous amount of time and still produce poor results. In addition, biologists' observations and analyses are usually limited to the specific portions of the genome that are the target of each biological or medical experiment. These are severe limitations to the discovery of new interesting biological phenomena (e.g., new promising regions of the genome, or new correlations between biological features), which cannot be achieved at the scale of a small experiment. The GenData2020 research project was born to address this challenge by enabling efficient and effective querying and analysis of genomic data. GenData2020, a framework for NGS data storage, management, query, and analysis, is currently under design and development. The two main elements of this system are a Genometric Data Model (GDM), which encodes experiment results in a format that takes into account the organization of the genome, and specifically its separation into genomic regions, and a Genometric Query Language (GMQL), which uses genomic regions as its main data abstraction for extracting regions of interest from experiments and for computing their properties, with high-level operations for manipulating regions and for measuring their distances.
Our main contribution to the project is the design and implementation of a pattern-search algorithm that gives biologists the ability, once they have identified an interesting genomic pattern, to look for similar occurrences in the data, thus facilitating access to and use of genomic data. We also developed a stand-alone desktop application, available for download, that allows biologists to use our algorithm in combination with IGB (Integrated Genome Browser), a visualization tool for exploring genomic data sets.