Institutional Repository
Technical University of Crete
EN  |  EL



My Space

Cross-platform query optimization on Apache Beam

Stamatakis Georgios

Full record

Year 2019
Type of Item Diploma Work
Bibliographic Citation Georgios Stamatakis, "Cross-platform query optimization on Apache Beam", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
Appears in Collections


We live in an era where the rapid growth of data streams both in complexity and velocity introduce new challenges on a daily basis. These streams can be infinite, emit data at high-speeds and can be generated by non-stationary distributions, thus a modern approach is required when performing complex queries over such data. Since modern Big Data stream processing frameworks are evolving rapidly high-level languages and abstractions are necessary in order to provide support for increasingly complex queries. Query processing however, requires a high-level language be translated into a set of low-level data manipulation operations and on top of that producing an optimal plan for the given query can be extremely difficult as it requires minimizing costs based on available statistics. The objective of query optimization is to maximize (or minimize) metrics like throughput and latency which are vital to the performance of a stream processing system but in order to do so effectively a framework needs to track a wide variety of performance metrics of components inside and outside its cluster. Such metrics can include the health and utilization of the stream pipeline operators, network speed and performance, hardware statistics of each cluster node and finally information about incoming data streams like skew, throughput and tuple size.In this thesis we propose a high-level real-time query optimization platform that gathers various statistics from modern stream processing frameworks and based on the requested queries each one is assigned on the most suitable framework according to various strategies. Such strategies aim to combine the metrics and statistics collected by our platform, both past and present, in order to correctly match queries to available frameworks. Our proposed platform consists of three main modules, the first of which is responsible for gathering available metrics from resource managers and available frameworks. The second module consists of a data ingestion Kafka Streams application that allow us to perform distributed sampling over incoming data streams and provide us with information related to incoming data streams, like skew and throughput. Finally, based on the available metrics each query is assigned to a framework running a pipeline designed in Apache Beam, a framework allowing us to write high-level code that can be executed in a variety of Big Data frameworks with ease.Experimental results prove that the gathered statistics improve query assignments as well as overall performance metrics like throughput and latency. Furthermore, increasing the parallelism level of frameworks yields better results as higher rates of data can be ingested by the stream processing frameworks. Finally, experimental evaluation shows that different strategies may work better under certain conditions.

Available Files