High-performance data processing technology through a new database partitioning method
DGIST developed a new graph-based database partitioning method, and its system implementation ran 4.2 times faster on average than Apache Spark SQL
DGIST has developed a core technology that supports fast and efficient large-scale data analysis and could have a major impact on the field in the near future.
DGIST announced on May 21st that Professor Min-Soo Kim’s team in the Department of Information and Communication Engineering developed a data management and processing technique for relational databases called ‘GPT (Graph-based Partitioning Table) technology.’ GPT technology delivers query performance more than four times faster on average than the widely used Spark SQL system and can be applied to various areas that require fast join processing.
Relational databases are widely used in various fields. As a relational database grows, multiple machines are used to store the data, with each node managing a part of it. Each part is called a “partition” and is generated by splitting the input data into a number of individual partitions. ‘Apache Spark SQL’ is a widely used parallel query processing system for relational data. Although a number of query processing technologies have been developed, they require expensive network communication among machines to process large-scale data.
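The cost described above can be illustrated with a minimal Python sketch. It is not taken from the DGIST work: the `orders` table, its column names, the four-node cluster, and the simple modulo partitioning rule are all assumptions made for illustration. The sketch counts how many rows would have to travel to another node before a join on `customer_id` can run.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Assign each row to a node by its key value (simple modulo rule)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key] % num_nodes].append(row)
    return parts


def rows_to_shuffle(parts, join_key, num_nodes):
    """Count rows stored on a node other than the one their join key maps to."""
    return sum(
        1
        for node, rows in parts.items()
        for row in rows
        if row[join_key] % num_nodes != node
    )


# Hypothetical table: 20 orders spread over 5 customers.
orders = [{"order_id": i, "customer_id": i % 5} for i in range(20)]

by_order = partition(orders, "order_id", 4)        # partitioned on a non-join column
by_customer = partition(orders, "customer_id", 4)  # partitioned on the join column

moved_when_misaligned = rows_to_shuffle(by_order, "customer_id", 4)
moved_when_aligned = rows_to_shuffle(by_customer, "customer_id", 4)
```

In this toy setup, rows partitioned on the join column need no movement, while rows partitioned on another column mostly do. That movement is the network communication a partitioning method tries to avoid.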
To overcome this performance issue, Professor Min-Soo Kim’s team studied a more efficient method for managing and processing large-scale relational databases in parallel and distributed environments. The team developed GPT technology, a database partitioning method for relational databases that can eliminate expensive network communication among machines during query processing, thereby resolving a critical issue in database partitioning and in parallel and distributed query processing.
GPT technology takes a graph-theoretic view to model co-partitioning relationships among relational tables. Each table to be partitioned is modeled as a vertex, each co-partitioning relationship (or join predicate) between two tables is represented as an edge, and some tables are replicated across machines. To decide which tables to partition, GPT technology exploits the concept of a hub vertex so that the tables adjacent to the same hub table are co-partitioned. As a result, query processing over co-partitioned tables does not require network communication.
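The graph view can be sketched in a few lines of Python. This is a toy illustration, not the team’s algorithm. The table names and join columns are hypothetical. Choosing the hub as the table with the most join edges is an assumed heuristic, and so is replicating every table that is not adjacent to the hub.

```python
from collections import defaultdict

# Hypothetical schema: each edge is (table_a, table_b, join_column).
join_edges = [
    ("sales", "customer", "customer_id"),
    ("returns", "customer", "customer_id"),
    ("web_clicks", "customer", "customer_id"),
    ("sales", "item", "item_id"),
]


def build_graph(edges):
    """Tables become vertices; join predicates become undirected edges."""
    graph = defaultdict(dict)
    for a, b, col in edges:
        graph[a][b] = col
        graph[b][a] = col
    return graph


def pick_hub(graph):
    """Assumed heuristic: the table with the most join edges is the hub."""
    return max(sorted(graph), key=lambda table: len(graph[table]))


def plan(graph):
    """Co-partition the hub's neighbors; toy rule: replicate everything else."""
    hub = pick_hub(graph)
    layout = {}
    for table in graph:
        if table in graph[hub]:
            layout[table] = ("partition", graph[hub][table])
        else:
            # Includes the hub itself in this simplified sketch.
            layout[table] = ("replicate", None)
    return hub, layout


def join_is_local(layout, a, b):
    """A join needs no network transfer if one side is replicated
    or both sides are partitioned on the same column."""
    if "replicate" in (layout[a][0], layout[b][0]):
        return True
    return layout[a][1] == layout[b][1]


graph = build_graph(join_edges)
hub, layout = plan(graph)
```

Here `customer` becomes the hub, so `sales`, `returns`, and `web_clicks` are all partitioned on `customer_id`, and `join_is_local(layout, "sales", "returns")` is true: matching rows already sit on the same node.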
The GPT technology developed by Professor Min-Soo Kim’s team runs 4.2 times faster on average than Apache Spark SQL on the TPC-DS database and queries, an industry-standard benchmark. In addition, GPT technology is not only of theoretical interest; it can be used as an optimization technique for large-scale data processing in real-world settings.
Professor Min-Soo Kim of the DGIST Department of Information and Communication Engineering explained, “As there has been huge interest in fast and efficient large-scale data processing since the 2010s, we have focused on studying this issue. We expect that the technology for processing relational data that we developed in this research will be very useful in the future as data becomes larger and more complex.”
This research was co-conducted by Ph.D. candidate Yoon-Min Nam in the Department of Information and Communication Engineering as the first author and was published in the April issue of ‘Information Sciences,’ a world-renowned international journal.