Stefanos Kalogerakis, "Migrating state between jobs in Apache Spark", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2020
https://doi.org/10.26233/heallink.tuc.87951
Data is now generated at an unprecedented rate and affects every aspect of everyday life. As the volume grows, more and more organizations adopt techniques to process that data in real time and adapt their business strategy accordingly. One critical challenge is ensuring fault tolerance and high availability for that data. In some cases, the heterogeneous systems responsible for data processing must interrupt their operation to update their infrastructure; in others, system failures occur. Migration techniques that prevent data loss are therefore becoming increasingly important. In this thesis, we propose a state migration algorithm implemented on Apache Spark’s Structured Streaming API. This API offers a fast, scalable solution for processing complex workloads and ensures fault tolerance through its checkpointing mechanism. The algorithm migrates state between different jobs and covers scenarios in which users wish to split, merge, or remotely deploy workflows in each job without data loss. Users thereby gain full control over workflow operators and can influence their execution at will. To demonstrate that our implementation works, we used the RapidMiner Studio workflow designer to present complete and detailed test cases for the scenarios mentioned above.